Extras, Blending and Seasonal Adjustment

(Photo by Rich Graessle/Icon Sportswire)

This is Part 4 of a 5-part series detailing my WAR model. Part 1 of the series can be found here, Part 2 can be found here, and Part 3 can be found here.


Now that we have covered the overall player models here and here, we will explore how to blend the two together to achieve maximum out-of-sample predictive power. We will also touch on what I have coined the “extras” section, made up of penalties and faceoffs. Faceoff ability is a fairly standard and well-accepted player skill, even though it is overvalued by many hockey “traditionalists.” Penalties are an aspect of player analysis that typically goes unaccounted for in most current analysis. Finally, we will implement a yearly adjustment most commonly used in baseball WAR.


Both of these “extra” models are derived using the exact methodology originally developed by Sam Ventura and Andrew C. Thomas at War-On-Ice. A player’s faceoff value is determined via a Generalized Binomial Linear Model that takes into account the skill level of the two players engaged in the faceoff. The net value of a faceoff win is set at 0.013 goals, as found by Michael Schuckers. A player’s penalty value is determined via a Poisson Mixed Effects Model that is position sensitive and gives us the value of a player’s ability to draw and (not to) take penalties. The position adjustment helps account for the fact that defensemen tend to take more penalties and draw fewer relative to forwards. The net value of a single penalty is set at 0.17 goals. This comes from the assumption that each penalty is worth 1.8 minutes of power-play time, during which GF60 goes from 2.5 to 6.5 and GA60 goes from 2.5 to 0.73: ((6.5 - 2.5) + (2.5 - 0.73)) * 1.8 / 60 = 0.17.
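As a quick check on the arithmetic above, here is a minimal Python sketch that reproduces the 0.17 goals-per-penalty figure and converts faceoff wins into goals. The constant and function names are my own for illustration, and the `wins_above_expected` input is a hypothetical stand-in for whatever the faceoff model outputs.

```python
# Rates quoted in the text (goals per 60 minutes).
EV_GF60 = 2.5           # goals for per 60 at even strength
EV_GA60 = 2.5           # goals against per 60 at even strength
PP_GF60 = 6.5           # goals for per 60 on the power play
PP_GA60 = 0.73          # goals against per 60 on the power play
PENALTY_MINUTES = 1.8   # assumed minutes of power-play time per penalty
FACEOFF_WIN_VALUE = 0.013  # goals per net faceoff win, per the text


def penalty_goal_value():
    """Net goal swing from one penalty: extra offense plus fewer goals against."""
    extra_offense = PP_GF60 - EV_GF60
    fewer_against = EV_GA60 - PP_GA60
    return (extra_offense + fewer_against) * PENALTY_MINUTES / 60


def faceoff_goals(wins_above_expected):
    """Convert net faceoff wins (hypothetical model output) into goals."""
    return wins_above_expected * FACEOFF_WIN_VALUE


print(round(penalty_goal_value(), 2))   # 0.17
print(round(faceoff_goals(100), 2))     # 1.3
```

Under these numbers, a player would need roughly 13 net faceoff wins to match the value of drawing a single penalty.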

Blending Expected Goals Plus-Minus and Box Plus-Minus

Earlier we explored how to derive player value via two separate methods (found here and here). In order to achieve the most accurate player values, we will blend these two models together in a way that maximizes their out-of-sample predictive power. The weight and importance of each model was determined using repeated 10-fold cross-validation. The target variable in this exercise was single-season Goals Plus-Minus (determined the same way as the XPM model, but with actual goals instead of expected goals). Since GPM at the special teams level was too noisy from season to season, the target variable for those regressions was 9-year GPM (2007-2016). In determining the special teams weights for XPM and BPM, forwards and defensemen were looked at together instead of separately. The table below gives an estimate of the relative value of XPM vs. BPM used in the final model. The exact weights are determined via the cross-validation; these estimates hopefully provide an easier interpretation.


As was hinted in the BPM write-up, the defensive version of the stat has basically zero value when compared to the information provided by XPM. Offensively, however, BPM is much more useful, especially on the power play. BPM’s overwhelming dominance on the power play is probably due more to the increased noise introduced to XPM when using the limited play-by-play data available for special teams, but XPM does still provide some value.
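The blending step described in this section can be sketched roughly as follows. This is not the exact procedure used for the model: it assumes a single blend weight `w` applied as `w * XPM + (1 - w) * BPM`, fits that weight by least squares on each training split, and scores it on the held-out fold. The function names and the synthetic data are purely illustrative.

```python
import random


def fit_weight(xpm, bpm, gpm):
    """Least-squares weight w so that w*xpm + (1-w)*bpm approximates gpm."""
    num = sum((x - b) * (g - b) for x, b, g in zip(xpm, bpm, gpm))
    den = sum((x - b) ** 2 for x, b in zip(xpm, bpm))
    return num / den if den else 0.5


def repeated_kfold_blend(xpm, bpm, gpm, k=10, repeats=5, seed=0):
    """Return (mean fitted weight, mean out-of-sample MSE) over repeated k-fold CV."""
    rng = random.Random(seed)
    n = len(gpm)
    weights, errors = [], []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)                       # new random partition each repeat
        folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
        for test in folds:
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            w = fit_weight([xpm[i] for i in train],
                           [bpm[i] for i in train],
                           [gpm[i] for i in train])
            mse = sum((w * xpm[i] + (1 - w) * bpm[i] - gpm[i]) ** 2
                      for i in test) / len(test)
            weights.append(w)
            errors.append(mse)
    return sum(weights) / len(weights), sum(errors) / len(errors)


# Synthetic players: both metrics are noisy views of the same "true" value.
rng = random.Random(42)
skill = [rng.gauss(0, 1) for _ in range(200)]
xpm = [s + rng.gauss(0, 0.5) for s in skill]
bpm = [s + rng.gauss(0, 1.0) for s in skill]
gpm = [s + rng.gauss(0, 0.8) for s in skill]

weight, mse = repeated_kfold_blend(xpm, bpm, gpm)
print(f"XPM weight: {weight:.2f}, BPM weight: {1 - weight:.2f}, MSE: {mse:.3f}")
```

A weight near 1 would correspond to the defensive result described above, where BPM adds almost nothing beyond XPM; a weight near 0 would correspond to BPM dominating, as on the power play.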

Replacement Level

“Replacement is defined very specifically for my purposes: it’s the talent level for which you would pay the minimum salary on the open market, or for which you can obtain at minimal cost in a trade.” - Tom Tango

Replacement level is a hot-button issue with some people, but it is also a useful baseline for quick analysis. I chose to baseline my statistic to “above replacement” instead of “above average” because it provides a quick way to decipher if a player is fit for an NHL roster spot. Another positive is that you will not end up with odd-looking cumulative results due to a higher baseline. In Dave Cameron’s words, “If the baseline is set to average, then the results of the metric will actually favor inferior talents whose lack of skills convince their manager to keep them chained to the bench in lieu of better players who actually take the field on a semi-regular basis.” For example, Player A has a -0.20 GAA (Goals Above Average) per 60 and gets 600 minutes of playing time, while Player B has a -0.15 GAA per 60 and gets 1200 minutes of playing time. Even though Player B is the better player, they actually end up with a worse total GAA (-3 vs. -2). While this issue can still exist when we baseline to replacement level, it will be much less common since more players fall around average than around replacement level. Dave Cameron explains it best:

“That’s where the other side of the replacement level coin comes in. Since we only want to start assigning negative value to players when they’re actually performing worse than the alternative option, we have to know what the alternative might actually do if given the chance, but we also have to factor in the cost to acquire that alternative…Realistically, this is the kind of player that a Major League team can get for something close to the league minimum, without surrendering any talent in trade in order to acquire them…The replacement level baseline also becomes useful in calculating worth in terms of dollars. If you begin with a performance baseline that equates to a cost above the league minimum, you are essentially just creating an extra step for yourself. For instance, if you use Wins Above Average, turning that into a dollar valuation requires you to then calculate the value of an average player, since they’re clearly not freely available and require real assets in order to sign or trade for. And, in order to find the value of that +0 WAA player, you need a new baseline to establish their value, which just leads you back to a number that approximates replacement level. Even if you don’t want to call it replacement level, measuring the performance expected from a league minimum salary is necessary for any kind of financial valuation tool. Of course, not everyone cares about financial valuations of players, so I’d consider that a secondary reason for why replacement level is a necessary baseline. The playing time issue is a larger one, as everyone should want a single value metric to account for the fact that, in general, better players receive more playing time, and we shouldn’t want to punish those players for playing more than the guys who sit on the bench and watch.”
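The Player A / Player B comparison from earlier in this section can be worked through in a few lines of Python. The replacement-level rate of -0.30 goals per 60 used below is a made-up number chosen only to illustrate the effect; it is not the replacement level used in the model.

```python
def cumulative_value(rate_per_60, minutes, baseline_per_60=0.0):
    """Total goals relative to a baseline, given a per-60 rate and ice time."""
    return (rate_per_60 - baseline_per_60) * minutes / 60


# Baseline = average: the better player (B) ends up with the worse total.
a_above_avg = cumulative_value(-0.20, 600)
b_above_avg = cumulative_value(-0.15, 1200)
print(round(a_above_avg, 2), round(b_above_avg, 2))   # -2.0 -3.0

# Baseline = replacement (hypothetical -0.30 per 60): the ordering flips.
REPLACEMENT_PER_60 = -0.30  # illustrative assumption, not from the model
a_above_repl = cumulative_value(-0.20, 600, REPLACEMENT_PER_60)
b_above_repl = cumulative_value(-0.15, 1200, REPLACEMENT_PER_60)
print(round(a_above_repl, 2), round(b_above_repl, 2))  # 1.0 3.0
```

With the lower baseline both players sit above it, so extra ice time adds value instead of subtracting it, and Player B correctly comes out ahead.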

Putting It All Together

The final adjustment here is a technique that requires use of a replacement level baseline and some theory. The theory comes from baseball and is as simple as the fact that there are only so many wins in a season that we can distribute to players. The formula to determine the number of wins available is as follows: (Teams in the league) * (Games per team in a season) * (.500 – Replacement Level Team Win%) = Wins Above Replacement in the league as a whole. There have been a few estimates made of what replacement level is, mostly falling around the .300 mark, which is what we will use here. This leaves us with 30 * 82 * (.500 – .300) = 492 Wins Above Replacement available in the league (adjust accordingly for lockout-shortened seasons). All of the metrics up to this point, however, have come in terms of goals, which now need to be converted to wins, the value of which was determined using this method. That leaves us with these conversion rates:


Now it’s time to decide how to divide the available wins among the necessary categories. The distribution of these wins was made as follows. It is split 50/50 between offense and defense (246 each). Within each of those categories we distribute the wins between even-strength and special teams situations based on the average distribution of goals that come from those situations. This distribution changes slightly from season to season, but on average works out to ~70% of goals occurring during even-strength, meaning we give 172 WAR to even-strength and 74 to special teams. The last adjustment is to account for goaltending. It has been shown that on the penalty kill goalies are basically at the mercy of their circumstance, so we apply their contributions only to even-strength defense. We divide even-strength defense evenly between players and goalies, giving them 86 WAR each. Using these distributions we can then readjust the numbers based on playing time so that we do not exceed or fall short of the allotted amount. Up to this point we have 7 components:

  1. Even-Strength Offense
  2. Even-Strength Defense
  3. Power Play Offense
  4. Short Handed Defense
  5. Drawing Penalties
  6. Taking Penalties
  7. Faceoffs

Since these don’t cleanly fit with how WAR is distributed, they will be broken down as follows:

  • Even-Strength Offense (172 WAR) = Even-Strength Offense (1)
  • Even-Strength Defense (86 WAR) = Even-Strength Defense (2)
  • Power Play Offense (74 WAR) = Power Play Offense (3) + Drawing Penalties (5)
  • Short Handed Defense (74 WAR) = Short Handed Defense (4) + Taking Penalties (6)
  • Faceoffs (7) are evenly distributed among the above categories
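The win budget above can be reproduced with a short Python sketch. The league total and the 50/50, ~70/30, and skater/goalie splits come straight from the text; the `readjust` step is only one plausible reading of "readjust the numbers based on playing time" (it spreads any surplus or shortfall in proportion to minutes played), and the player names and values are invented. The even distribution of faceoff value across categories is not modeled here.

```python
TEAMS, GAMES = 30, 82
REPLACEMENT_WIN_PCT = 0.300
EV_GOAL_SHARE = 0.70   # approximate share of goals scored at even strength


def league_war(teams=TEAMS, games=GAMES, repl=REPLACEMENT_WIN_PCT):
    """Wins Above Replacement available across the whole league."""
    return teams * games * (0.500 - repl)


def allocate(total_war):
    """Split the league total into the category budgets described in the text."""
    offense = defense = total_war / 2
    ev_offense = round(offense * EV_GOAL_SHARE)
    pp_offense = round(offense) - ev_offense
    ev_defense = round(defense * EV_GOAL_SHARE)
    sh_defense = round(defense) - ev_defense
    return {
        "ev_offense": ev_offense,
        "ev_defense_skaters": ev_defense / 2,   # half of EV defense to skaters
        "ev_defense_goalies": ev_defense / 2,   # other half to goalies
        "pp_offense": pp_offense,
        "sh_defense": sh_defense,
    }


def readjust(raw, minutes, budget):
    """Assumed method: spread the gap to the budget in proportion to ice time."""
    gap = budget - sum(raw.values())
    total_minutes = sum(minutes.values())
    return {p: raw[p] + gap * minutes[p] / total_minutes for p in raw}


total = round(league_war())
budgets = allocate(total)
print(total)     # 492
print(budgets)

# Hypothetical raw even-strength offense values (in wins) and minutes played.
raw = {"Player A": 2.0, "Player B": 1.0, "Player C": 0.5}
minutes = {"Player A": 1400, "Player B": 1000, "Player C": 600}
adjusted = readjust(raw, minutes, budget=4.0)
print({p: round(v, 2) for p, v in adjusted.items()})
```

After `readjust`, the adjusted values sum to the budget passed in, which is the property the seasonal adjustment is meant to guarantee for each category.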

After the readjustments were made and testing conducted, it was decided that Short Handed Defense (4) would be dropped from the WAR formula. The decision was based on its extremely minor impact and essentially zero correlation from one season to the next. This topic is worth further exploration, but it fits with previous research noted here. Short Handed Defense in the model will now be completely accounted for in a player’s ability to not take penalties: the best way for players to help their team’s penalty kill is to not put the team in that situation in the first place.


In total this model contains six components, with three of those each being broken down into two subcomponents. So while WAR tends to be seen by many as a simple one-number metric, that is hardly the case. It can, and sometimes needs to, be decomposed into its many parts in order to be applied effectively. A player’s overall WAR will be hurt by their coach’s decision to not put them on the power play, even though that is out of the player’s control, but that doesn’t mean they could not succeed on the power play if given the chance. Repeatability and predictive power will be discussed in the final post of this series, along with a release of all WAR data from 2008 – 2016.

3 thoughts on “Extras, Blending and Seasonal Adjustment”

  1. Well done! Just one note: I’m beginning to wonder whether using WOWY data to account for quality of teammates in a WAR metric would be appropriate. If teammates can have a synergistic relationship with one another, whether positively or negatively, then changing teammates (or teams, even) could influence their WAR significantly. Also, as recent research has found about playing style from entry-zone data (https://hockey-graphs.com/2016/07/06/neutral-zone-playing-styles/), I’m thinking that is something that needs to be examined.

    Again, thank you for doing all this. I cannot wait for your next posts.

    Side note: How much do you think a coach’s system plays a role in PK defense? I’m thinking it has less to do with the players and more to do with the system in place, but I could be wrong.
