Expected goals models have been developed in a number of sports to better predict future performance. In sports like hockey and soccer, where goals are inherently random and scarce, expected goals models have proved particularly useful for predicting future scoring. This is because they take into account shot attempts, which are better predictors of team and player performance than goal totals alone.
A notable example is Brian Macdonald’s expected goals model dating back to 2012, which used shot differentials (Corsi, Fenwick) and other variables like faceoffs, zone starts and hits. Important developments have been made since then with regard to the predictive value of those variables, particularly those pertaining to shot quality.
Shot quality has been the subject of spirited debate despite evidence suggesting that it plays an important role in predicting goals. The evidence shows that shot characteristics like distance and angle can significantly influence the probability of a certain shot resulting in a goal. Previous attempts to account for shot quality in an expected goals model format have been conducted by Alan Ryder, see here and here.
In Part I, an updated expected goals (xG) model will be presented that accounts for shot quality and a number of other variables. Part II will deal with testing the performance of xG against previous models like score-adjusted Corsi and goals percentage.
All data is from even-strength situations
The model also takes into consideration shooter talent, which we know varies significantly from player to player. Accounting for shooting talent makes intuitive sense, as we expect that shots attempted by Brad Marchand on average have a higher likelihood of resulting in goals than shots taken by, say, Tanner Glass. To this end, a “Shot Multiplier”*** was developed to approximate a player’s effect on each shot’s probability of resulting in a goal. The Shot Multiplier was determined by following these steps:
- Regressed Shots: the number of shots at which 5-on-5 Sh% begins to stabilize was determined for forwards and defensemen using Kuder-Richardson Formula 21 (K-R 21). Sh% stabilized at approximately 375 shots for forwards and 275 shots for defensemen. For each forward, 375 shots were added to the player’s season shot total; similarly, 275 shots were added to each defenseman’s season shot total. The resulting padded shot total will be designated as regressed shots, or rShots.
- Regressed Goals: a player’s regressed goals (rGoals) were calculated by adding (added shots * league average Sh%) to the player’s season goal total. Note: the added shots are 375 or 275 depending on whether the player is a forward or defenseman, respectively. Similarly, forwards and defensemen had different league average Sh% values.
- Regressed Sh%: was calculated by dividing a player’s rGoals by rShots.
- Shot Multiplier: was computed by dividing a player’s regressed Sh% (rSh%) by the league average Sh%.
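The four steps above can be sketched in Python. The league-average Sh% values and the example player totals below are hypothetical placeholders, not figures from the article:

```python
# Sketch of the Shot Multiplier calculation described above.
# The league-average Sh% values and example totals are assumptions
# for illustration only.

REGRESS_SHOTS = {"F": 375, "D": 275}          # shots added, by position
LEAGUE_AVG_SH_PCT = {"F": 0.075, "D": 0.035}  # assumed league-average 5v5 Sh%

def shot_multiplier(goals, shots, position):
    """Regress a player's Sh% toward the league average, then express
    it as a multiple of league-average shooting."""
    pad_shots = REGRESS_SHOTS[position]
    league_sh = LEAGUE_AVG_SH_PCT[position]
    r_shots = shots + pad_shots              # regressed shots (rShots)
    r_goals = goals + pad_shots * league_sh  # regressed goals (rGoals)
    r_sh_pct = r_goals / r_shots             # regressed Sh% (rSh%)
    return r_sh_pct / league_sh              # Shot Multiplier

# A 40-goal, 300-shot forward regresses to a multiplier above 1.0
print(round(shot_multiplier(40, 300, "F"), 3))
```

Padding every player with the same number of league-average shots means low-sample players land near a multiplier of 1.0, while only sustained high (or low) shooting moves a player meaningfully away from it.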
Data from the 2012-13 lockout-shortened season was excluded
To test how this expected goals model performs against previous models like score-adjusted Corsi and goals %, year-to-year correlations were performed using the methods described by Jlikens here, with some changes. The first test uses an in/out sample to determine how past results predict future results at the team level. Given the relative complexity of the test, the methods are detailed below in stepwise fashion:
- Select a random sample of X number of games in a season, which will make up Group A. Group B will be the remaining number of games in the season
- Choose a metric of interest (e.g. CF%) and calculate it for Groups A and B
- Calculate the correlation between Group A and Group B for each Team-Season (using 30 teams * 7 seasons = 210 Team-Seasons)
- Repeat steps 1) to 3) 1000 times
- Using the Fisher-Z transformation, convert all correlations obtained into Z values
- Average the Z values and convert the result back into an R value
- Repeat steps 1) to 6) for every 10-game interval (10, 20, 30, …, 70)
- Repeat steps 1) to 7) for each metric
It should be noted that the 1000 repetitions in step 4) serve to smooth out any correlation quirks that can arise when working with random samples. In step 7), intervals of 10 games were chosen arbitrarily to save time, as the size of the interval has no significant bearing on the final conclusions.
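The resampling procedure above can be sketched as follows. The toy data here is simulated, not real team results, and the skill/noise parameters are arbitrary assumptions:

```python
# Minimal sketch of the in/out-sample test: split each team-season's games
# into Groups A and B, correlate the two groups, repeat, and average the
# correlations via the Fisher-Z transform. Data below is simulated.
import numpy as np

rng = np.random.default_rng(0)

def split_half_r(metric_by_game, n_games_in_a, n_repeats=1000):
    """metric_by_game: array of shape (team_seasons, games_per_season).
    Returns the Fisher-Z-averaged correlation between a random Group A
    of n_games_in_a games and the remaining Group B games."""
    n_ts, n_games = metric_by_game.shape
    zs = []
    for _ in range(n_repeats):
        idx = rng.permutation(n_games)
        a = metric_by_game[:, idx[:n_games_in_a]].mean(axis=1)  # Group A
        b = metric_by_game[:, idx[n_games_in_a:]].mean(axis=1)  # Group B
        r = np.corrcoef(a, b)[0, 1]
        zs.append(np.arctanh(r))      # Fisher-Z transform
    return np.tanh(np.mean(zs))       # mean Z converted back to an R value

# Toy data: 210 team-seasons x 82 games of a per-game metric (e.g. CF%)
team_skill = rng.normal(0.5, 0.03, size=(210, 1))
games = team_skill + rng.normal(0, 0.1, size=(210, 82))
print(round(split_half_r(games, 40, n_repeats=50), 3))
```

Averaging in Z-space rather than averaging the raw correlations is the standard way to combine correlation coefficients, since r is bounded and its sampling distribution is skewed.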
The second test performed is a modified version of Emmanuel Perry’s previous work found here. This test determined how past results predict end of season statistics (or final results). It follows the same steps as the first test, except that Group B consists of end of year statistics.
The same tests were carried out at the player level. The analysis was restricted to players who dressed for at least 80 games in a given season, yielding a sample of approximately 1000 player-seasons. Given the large sample of player-games, the resampling in step 4) was only repeated 25 times for on-ice stats and 100 times for shooting stats.
Finally, root mean squared error (RMSE) values were computed for each statistic.
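For reference, the RMSE computation is straightforward; the paired values below are placeholders rather than results from the article:

```python
# Illustrative root mean squared error between a metric's predictions
# and out-of-sample outcomes. The example numbers are placeholders.
import math

def rmse(predicted, actual):
    """Root mean squared error over paired predictions and outcomes."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

print(rmse([0.52, 0.48, 0.55], [0.50, 0.49, 0.53]))
```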
Expected Goals at the Team Level
At the team level, xG has the same predictive power at the 20-game mark as score-adjusted Corsi (CF%) and Goals For (GF%) but proves to be a far superior predictor of future goals past that mark (Figure 1). Note that xG also outperformed CF% and GF% with regard to root mean squared error (RMSE).
Expected Goals at the Player Level
xG also predicts future goals better than (score-adjusted) CF% and GF% at the player level. A comparison of Figures 1 and 2 shows that xG’s advantage emerges sooner at the player level than at the team level. As early as the 10-game mark, xG outperforms previous models:
Expected goals and individual performance
xG also better estimates future individual scoring. As seen in Figures 3-4 and Tables 5-6 below, individual xG per 60 minutes (ixG/60) outperforms iCF% and Sh% across the board:
As described above, xG is significantly more predictive of future goal scoring than previous models. In addition to being predictive, xG also appears to have superior descriptive power, as it explains more of the variance in GF% than score-adjusted Corsi at the team level. Note that removing the two Buffalo outliers from the data presented in Figures 4 and 5 below did not significantly affect the correlation values.
Conclusion and Future Directions
Expected Goals (xG) significantly outperforms score-adjusted Corsi (CF%) and Goals For (GF%) in predicting future goals at the team and player levels. xG is also descriptive, which makes it a superior tool for evaluating a team’s or player’s past and current offensive performance. All data is posted in the spreadsheet below.
An obvious future direction would be splitting forwards and defensemen for analysis at the player level. Presumably, variables included in this xG model can vary in their descriptive and predictive value when testing for defensemen and forwards separately.
Lastly, future work will also include looking at special teams, as one would expect that the significance of predictor variables would differ from even-strength situations.
Please let me know if you have any thoughts, questions, concerns or suggestions. You can comment below, reach me via email at DTMAboutHeart@gmail.com, or via Twitter @DTMAboutHeart.
** Score state was a variable that was accounted for in the model but was (mistakenly) not included in the original write-up. After accounting for all these variables, it was found that a shot attempted by a trailing team still has a lower likelihood of resulting in a goal than a shot taken by a leading team.
***The shot multiplier in Part I was adjusted using a historical weighted average instead of the in-season data. Thus, a 2016 shot multiplier for example would be based on the average of the regressed goals (rGoals) and regressed shots (rShots) of 2014 and 2015. This adjustment improved the model’s performance against score-adjusted Corsi and goals % in predicting future scoring, as seen below:
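The historical weighting in the footnote can be sketched as follows. The league-average Sh% and the prior-season totals are hypothetical placeholders, not values from the article:

```python
# Sketch of the historically weighted Shot Multiplier: a season's
# multiplier is built from the average regressed goals (rGoals) and
# regressed shots (rShots) of the two preceding seasons. All numbers
# here are illustrative assumptions.
from statistics import mean

LEAGUE_AVG_SH_PCT = 0.075  # assumed league-average Sh% for a forward

def historical_multiplier(prior_r_goals, prior_r_shots):
    """Average the prior seasons' rGoals and rShots, then express the
    resulting Sh% as a multiple of the league average."""
    r_sh_pct = mean(prior_r_goals) / mean(prior_r_shots)
    return r_sh_pct / LEAGUE_AVG_SH_PCT

# e.g. a 2016 multiplier from hypothetical 2014 and 2015 regressed totals
print(round(historical_multiplier([55.0, 60.0], [650, 700]), 3))
```

Using only prior seasons keeps the multiplier out-of-sample: the shots being predicted in a given season never inform that season’s own multiplier.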