A Defense of WAR from a WAR-Skeptic


Note: This was originally intended to be a tweet thread that grew far too long and unmanageable, so you’re getting a poorly written post instead. Apologies in advance.

Recently, David Johnson, owner of the awesome puckalytics.com, has been on a bit of a warpath (pun intended) against the use of WAR/GAR. Most of David’s arguments can be found here and here, but there are some other comments in this thread.

I consider myself a bit of a WAR skeptic. I think Dawson’s work is great, but there are limitations/issues with it. A good summary of some of my concerns can be found in another ill-advised and long tweet thread.

That said, I still think it’s extremely useful as a first pass to start discussion. WAR can be broken down into 5 useful components to see where a player’s impact derives from.

For example, it’s clear in the Scheifele/Perreault example that the metric likes Perreault’s defence a lot more than Scheifele’s. David’s argument tends to focus on “How can Scheifele be the same as Perreault when he has almost 2x the points?”
But the answer to that question is pretty evident – GAR views Scheifele’s offense as being much more valuable than Perreault’s. Where Perreault gets his advantage is in his own end, and a bit on the PP. Is this fair?
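
To make that concrete, here’s a minimal sketch of what reading a component breakdown looks like. The component names and every number below are invented for illustration (they are not Dawson’s actual outputs), but they show how two players can land on similar totals by very different routes.

```python
# A minimal sketch of reading a component-based GAR breakdown.
# Component names and ALL numbers are invented for illustration;
# they are not Dawson's actual model outputs.

scheifele = {"ev_offense": 9.0, "ev_defense": -1.5, "pp_offense": 2.0,
             "sh_defense": 0.0, "penalties": 0.5}
perreault = {"ev_offense": 4.5, "ev_defense": 2.5, "pp_offense": 3.0,
             "sh_defense": 0.0, "penalties": 0.0}

def total_gar(components):
    """Total GAR is just the sum of the component values."""
    return sum(components.values())

for name, comps in (("Scheifele", scheifele), ("Perreault", perreault)):
    print(f"{name}: total = {total_gar(comps):.1f}, components = {comps}")
```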

I’ve already stated that I think there are issues with how GAR measures defence, but it is worth noting that Perreault’s CA60 Rel TM is 3.9 shot attempts per 60 better than Scheifele’s.
Should that produce such a wide difference in GAR? Probably not, but it’s also not completely unreasonable.

On the PP, the Jets’ shot rate is a bit higher with Perreault and their goal rate is much higher, so maybe that’s not so far off either. On the Scheifele vs. Perreault issue, then, there appears to be some statistical backing for the comparison if you dig into the details. (As an aside, I’m team Scheifele > Perreault, but I’d read the numbers as suggesting Perreault has more value than some might think.)

Moving on – next up is “Saad & Foligno don’t deserve to be ranked near the top of the league”. This is a bit of a curious argument to me. Columbus is currently 4th in the league, so I’d imagine at least some of their players should be good. Whether they are “long run Crosby-good” is irrelevant – GAR is measuring, to a degree, their contributions this year.

And this year, Columbus has been a very good team, presumably powered by a few very good player-seasons. Foligno and Saad are 1-2 in EV TOI, and both play significant minutes on a very good power play. It’s not at all unreasonable to think two of the top contributors to one of the best teams would be among the top 30 F in the NHL.

Third point: David lists off a series of players whose ordering he disagrees with. This, I think, is a bit lazy. Yes, there are going to be players whose rankings are clearly wrong, but that’s true of whatever metric you use. Again, the point of most analytics is to challenge conventional thinking. It’s not designed to replace common sense, but rather to provide a means to challenge the assumptions that go into our traditional rankings.

As an example, Curtis McElhinney is playing out of his mind this year and his save % is ahead of Cam Talbot’s. No one would take him over Talbot. But no one is going to discount save percentage because of this small inconsistency – there’s nuance involved.

Next point: the stats David uses to evaluate players. He provides a list of 17 stats he’d use for player evaluation. First, all of these stats/ideas are already built into WAR (with the exception of Sv% Rel).

Second, that is probably too many stats for one person to combine in a reasonable, consistent manner.

The only difference between David’s method and Dawson’s is the aggregation. Dawson uses an algorithm, David does it manually. Personally, I’d lean towards using the algorithm. I’ve contradicted myself within tweet threads before and don’t trust my brain to handle all that info the same way each time.
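
As a sketch of what consistent, algorithmic aggregation buys you: fix the weights once, and every player is combined the same way every time. The stats and weights below are hypothetical (not Dawson’s actual ones), and a real model would use far more inputs, but the principle is the point.

```python
import statistics

# Hypothetical per-player stats; in practice this would be David's full list of 17.
players = {
    "Player A": {"p60": 2.1, "cf60_rel": 3.2, "ca60_rel": -1.5},
    "Player B": {"p60": 1.6, "cf60_rel": 1.0, "ca60_rel": -4.0},
}

# Fixed, explicit weights (made up for this sketch). ca60_rel gets a negative
# weight because fewer shot attempts against is better. The algorithm applies
# these identically to every player, every time; a human juggling 17 stats
# by eye can't guarantee that consistency.
WEIGHTS = {"p60": 0.5, "cf60_rel": 0.3, "ca60_rel": -0.2}

def z_scores(values):
    """Standardize a stat across the player pool."""
    mu, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mu) / sd for v in values]

standardized = {s: z_scores([p[s] for p in players.values()]) for s in WEIGHTS}
for i, name in enumerate(players):
    score = sum(WEIGHTS[s] * standardized[s][i] for s in WEIGHTS)
    print(f"{name}: combined score = {score:+.2f}")
```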

Dawson has provided evidence that the algorithm he’s designed is better at identifying talented players than many stats we have today. That’s much more difficult to provide for a personal, subjective evaluation of individual players. The algorithm may not be perfect, but in a lot of cases it appears to be pretty good.

Lastly, David’s remaining critiques today dealt with the “missing” inputs into WAR. He claims that GAR doesn’t appropriately account for on-ice shooting percentage – but Dawson’s model does! His expected goals model explicitly takes the shooter into account, and BPM includes goals and assists.

He claims that GAR doesn’t include on-ice save percentage – and it doesn’t, because players have no significant impact on it. The first article he offers as proof is a brilliant exercise in binning to falsely show correlation, and the second shows that Sv% Rel varies with TOI, which is correlation rather than causation. (As an aside, I do believe that players can impact shot quality against to a degree, but what happens after that is full of randomness.)
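
For anyone unfamiliar with the binning problem, here’s a quick self-contained sketch (synthetic data, not the article’s): take a relationship that is almost pure noise at the individual level, average it into bins, and the correlation between bin means looks impressive.

```python
import random
import statistics

random.seed(1)

def corr(xs, ys):
    """Pearson correlation."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# y has almost no relationship to x at the individual level.
xs = [random.gauss(0, 1) for _ in range(2000)]
ys = [0.05 * x + random.gauss(0, 1) for x in xs]
print(f"raw correlation:    {corr(xs, ys):+.2f}")  # near zero

# Sort by x, average into 20 bins, and correlate the bin means instead:
# the noise averages out within each bin and the correlation inflates.
pairs = sorted(zip(xs, ys))
size = len(pairs) // 20
bx = [statistics.mean(x for x, _ in pairs[i:i + size]) for i in range(0, len(pairs), size)]
by = [statistics.mean(y for _, y in pairs[i:i + size]) for i in range(0, len(pairs), size)]
print(f"binned correlation: {corr(bx, by):+.2f}")  # much larger
```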

Finally, David claims that there’s an over-emphasis on shot metrics over goal metrics. I find this point the most confusing.

Goals are a result, and there is some value in that result. But we know that the factors in expected goals models explain some of those results. If you have two shots by equally talented shooters from the same spot and one goes in while the other doesn’t, why credit one more than the other?
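
Here’s a toy version of that logic, with made-up probabilities standing in for a real xG model: identical shots get identical process credit, regardless of which one happened to go in.

```python
# Toy illustration (the probabilities are made up, not from any real xG model):
# two identical shots from the same spot, one goes in, one doesn't.

def toy_xg(distance_ft, shot_type):
    """A made-up probability table standing in for a real expected-goals model."""
    base = {"wrist": 0.08, "slap": 0.06, "tip": 0.12}[shot_type]
    return min(1.0, base * 30 / max(distance_ft, 1))

shot_a = {"distance_ft": 20, "shot_type": "wrist", "goal": True}
shot_b = {"distance_ft": 20, "shot_type": "wrist", "goal": False}

for label, shot in (("A (goal)", shot_a), ("B (no goal)", shot_b)):
    xg = toy_xg(shot["distance_ft"], shot["shot_type"])
    # Same inputs, same process credit, regardless of the bounce.
    print(f"shot {label}: actual goals = {int(shot['goal'])}, xG credit = {xg:.2f}")
```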

GAR is meant to be descriptive, but not purely descriptive – there’s some value in crediting the process over the result. I view the Corsi-Goal scale as a continuum: goals are one extreme, Corsi is the other, and Dawson’s xG sits between them. I think xG does a pretty good job of managing the balance between descriptive and predictive. There are factors it misses (pre-shot movement, screens, etc.), but it’s better than either extreme on its own.
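
One way to picture that continuum is a single weight sliding from pure Corsi to pure goals, with xG-style blends in between. This is my own toy framing of the idea, not how Dawson’s model is actually constructed:

```python
# My own toy framing of the Corsi-Goal continuum (not how Dawson's model works):
# alpha = 0 credits pure shot volume, alpha = 1 credits pure goals,
# and values in between behave roughly like an xG-style blend.

LEAGUE_CONVERSION = 0.05  # rough share of all shot attempts that become goals

def blended_credit(corsi_for, goals_for, alpha):
    shot_value = corsi_for * LEAGUE_CONVERSION  # every attempt at league-average value
    return (1 - alpha) * shot_value + alpha * goals_for

# A hot-shooting season: 300 attempts, 25 goals (vs. ~15 at league average).
for alpha in (0.0, 0.5, 1.0):
    print(f"alpha = {alpha}: credited with {blended_credit(300, 25, alpha):.1f} goals")
```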

I appreciate that David is trying to push the analytics community forward, but I don’t find the arguments he’s made convincing. It is very important to know (even at a high level) the details of how a metric is calculated when making a critique, because that allows you to make useful suggestions for where the model may be going wrong.

Again, none of this is to say that GAR/WAR is the best metric we have available today or should be the only thing we use, but there are clearly strong arguments for using it to start a conversation, or as a sanity check on subjective evaluations. And the more productive discussions we have using models like GAR, the more clarity we’ll get on where its strengths and weaknesses as a metric lie, and the more avenues for other research we’ll open up.

3 thoughts on “A Defense of WAR from a WAR-Skeptic”

  1. I don’t disagree with much of what you write, but let me address a few points before I have to get to work.

    ----------
    Whether they are “long run Crosby-good” is irrelevant – GAR is measuring, to a degree, their contributions this year.
    ----------

    Fair, however looking at this year’s GAR I don’t think you can in any way draw the conclusions the article made:

    “Unheralded Brandon Saad, Nick Foligno bring star power to Blue Jackets”
    (Unheralded? Sure. Star power? Not convinced)

    “Foligno and Saad might be two of the league’s most unheralded star forwards, depending on what numbers you’re looking at.”

    (Sure, by cherry-picking numbers you can say almost anything about anyone; I still wouldn’t call them “star forwards”)

    “In Foligno and Saad, the Jackets might not have a traditional “superstar,” but they might have two guys who bring the same value…”

    (Same value as a superstar? This is a bold statement that I think needs a lot more support than one season GAR and certainly a lot more discussion than was in the article.)

    “But even though the numbers are big fans of their work, it might still be a while before the rest of the league catches on.”

    (Again, if you are trying to suggest GAR knows something the NHL GMs, coaches, scouts, etc. don’t know, you’d better have a ton of evidence to support your case.)

    My problem is actually less with using GAR as a starting point (as Dom suggests) and more with the fact that the article barely moved off the starting line. Even looking at GAR for Foligno/Saad in previous seasons, to see if this season is an anomaly or reflective of true value, would be a good start. There is no depth in the article, and the tone (certainly the headline and the closing paragraphs) was that these guys are unheralded star players, not good players having unusually star-like seasons. Instead, conclusions (or suggestions of conclusions) about the players are made almost solely on one season’s GAR.

    ----------
    As an example, Curtis McElhinney is playing out of his mind this year and his save % is ahead of Cam Talbot’s. No one would take him over Talbot. But no one is going to discount save percentage because of this small inconsistency – there’s nuance involved.
    ----------

    Well, we discount save percentage over small sample sizes all the time, as we should. We know it takes a track record before we have confidence in a goalie’s save percentage, so if a goalie has a handful of good games or even a good season, we largely discount it if it is unusual.

    Similarly, if a player’s GAR was 5, 7, 4, 6, 5, 13 over a six-year span, we shouldn’t start calling that player a star because he has a 13 GAR in one season. I have been fairly consistent over the years in saying that if a player has more than a one-year track record, we should be looking at those years when we evaluate them.

    ----------
    The only difference between David’s method and Dawson’s is the aggregation. Dawson uses an algorithm, David does it manually. Personally, I’d lean towards using the algorithm. I’ve contradicted myself within tweet threads before and don’t trust my brain to handle all that info the same way each time.
    ----------

    This is fair, and if you want to use GAR to help aggregate statistics in a consistent way, have at it. I am not saying never use all-encompassing stats, but use them with caution. Understand the reliability of them and of their component metrics.

    The second tweet in the series that started this whole mess was:

    “I get why people love WAR/GAR stats but if you rate Perreault > Scheifele there are flaws, or at least great uncertainties, in the metric.”

    and it finished with

    “We ought to be discouraging this behavior within our community. Use WAR/GAR stats if you like but understand/acknowledge the limitations.”

    I have been fairly consistent in my critiques of hockey analytics over the years, including that there are too many claims about players made with too much (perceived) confidence and very little discussion of the underlying uncertainties. My critique of Dom’s article was no different, and I find it unfortunate it exploded into the mess it did. I apologize for any part I may have played in it getting out of hand; however, I won’t back down from critiquing the use of limited and unreliable statistics to make bold claims. The analytics community would bash Steve Simmons or Pierre McGuire for doing that, so we ought not to do it ourselves.
