Visualizing Goaltender Statistics Through Beeswarm Plots

A picture is worth a thousand words. Yes, it’s a cliché, but when it comes to visualizing data, an individual can tell a story via the choices they make when presenting their data. One of the most common visualizations is a plot showcasing the frequency and distribution of an event. Data like this are often presented in a histogram or box-and-whisker plot. However, a limitation of both of these types of plots is that neither shows the viewer where each data point falls. A beeswarm plot, on the other hand, lets the viewer see where each individual point falls across a range. Points are jittered sideways to maintain a minimum distance between them and minimize overlap.
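To make the idea of keeping a minimum distance between points concrete, here is a minimal sketch of one way to lay out a swarm. It is written in Python for illustration and is not the algorithm used by ggbeeswarm: the function name `beeswarm_offsets`, the `min_dist` parameter, and the greedy strategy are all assumptions. Unlike a random jitter, it is deterministic: each point is pushed sideways by the smallest step that keeps it clear of the points already placed.

```python
def beeswarm_offsets(values, min_dist=1.0):
    """Return a sideways offset per value so no two points sit closer than min_dist.

    Hypothetical greedy layout for illustration only.
    """
    placed = []                      # (value, offset) pairs already positioned
    offsets = [0.0] * len(values)
    # Place points in sorted order so that neighbours in value are handled together.
    for i in sorted(range(len(values)), key=lambda k: values[k]):
        v = values[i]
        step = 0
        chosen = None
        while chosen is None:
            # Try offsets 0, +d, -d, +2d, -2d, ... until one is free.
            candidates = [0.0] if step == 0 else [step * min_dist, -step * min_dist]
            for off in candidates:
                clear = all(
                    (v - pv) ** 2 + (off - po) ** 2 >= min_dist ** 2
                    for pv, po in placed
                )
                if clear:
                    chosen = off
                    break
            step += 1
        placed.append((v, chosen))
        offsets[i] = chosen
    return offsets


# Three identical values are spread sideways; the isolated value stays centred.
print(beeswarm_offsets([5.0, 5.0, 5.0, 9.0]))  # [0.0, 1.0, -1.0, 0.0]
```

In a plot, the value would go on one axis and the returned offset on the other, so that stacked points fan out around the centre line.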

Inspired by the wonderful graphs from Namita Nandakumar and Emmanuel Perry, I thought I would attempt to visualize how goaltenders have fared in goals saved above average over the course of their careers.

Methods

I collected data from 1955 to 2019 using Hockey-Reference’s Goalie Statistics page. I manually added a season label to each player’s row, keyed to the year the season started: 1955 represents 1955-1956, 1956 represents 1956-1957, and so on.

Goaltenders are notoriously difficult to evaluate, especially when we look to evaluate goaltenders of the past. I selected goals saved above average as the metric to use for comparison given that we have data going back to 1955 and it at least provides some context for how a goaltender stopped the puck relative to the quantity of shots faced. However, it does not provide us with any information on the quality of shots faced. For that additional context, we could look to metrics like goals saved above expected, which can be found on evolving-hockey.com as far back as 2007-2008.
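For readers unfamiliar with the metric, goals saved above average compares the goals a league-average goaltender would be expected to allow on the same number of shots with the goals actually allowed. The sketch below is a Python illustration of that arithmetic; the function name and the sample numbers are hypothetical and are not taken from the dataset.

```python
def goals_saved_above_average(shots_against, goals_against, league_save_pct):
    """Goals a league-average goalie would allow on these shots, minus goals allowed."""
    expected_goals_against = shots_against * (1.0 - league_save_pct)
    return expected_goals_against - goals_against


# Hypothetical season: 2000 shots faced, 150 goals allowed, league save % of .910.
# A league-average goalie would allow about 180 goals on that workload.
gsaa = goals_saved_above_average(2000, 150, 0.910)
print(round(gsaa, 1))  # 30.0
```

A positive value means the goaltender allowed fewer goals than average on the same shot volume; a negative value means more.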

I selected four of the most prominent goaltenders to start my comparison. Using R’s ggbeeswarm package, I created beeswarm plots to visualize how each goaltender compared to their contemporaries. Shown below are the plots for Dominik Hasek, Martin Brodeur, Ken Dryden, and Patrick Roy.

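[Figure: beeswarm plot for Dominik Hasek]

[Figure: beeswarm plot for Ken Dryden]

[Figure: beeswarm plot for Patrick Roy]

[Figure: beeswarm plot for Martin Brodeur]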

Bottom line – Dominik Hasek was pretty good.

If you’re interested in viewing the code for this or seeing the data, I’ve made both available at the link below.

Code + Data

Hacking the NHL Play-by-Play App in Shiny

Recently, I created a web application for interactively visualizing shot data for all games in the 2017-2018 season. In this article, I will walk through the month-long process of building the National Hockey League Play-by-Play App from scratch, giving a behind-the-scenes look.

What started this project was this #rstats Shiny contest tweet. Shiny is an R package built by RStudio for creating interactive web applications. It allows R programmers to create web applications without having to code exclusively in HTML, CSS or JavaScript. I had looked at several sports visualizations (e.g. Ryo’s Visualize the World Cup) and wanted to create something similar in hockey. This announcement provided the motivation for me to start.

I started by sharpening my Shiny skills with DataCamp’s Shiny Course. I found Chapter 2 (Inputs, outputs, and rendering functions) and Chapter 3 (Reactive Programming) particularly helpful in reminding myself of the essence of Shiny. They are great visual learning resources, and I highly recommend this course to Shiny beginners.

Next, I focused on the structure of my application. The organization of the end product is instrumental to its usability, so I wanted to get it right. I looked at the Shiny Application Layout Guide and decided to go with the Grid Layout, which contains a plot at the top and the parameters of the plot at the bottom in a three-column format. This organization focuses users on the animation at the top, while the secondary features, the parameters controlling the plot, sit at the bottom.

Now, to the actual animation. I relied on Ryo’s World Cup animations, which were rendered in gganimate, an R package for animations. Unfortunately, unlike Ryo’s dataset, my dataset didn’t contain coordinate data with the location of each player over time. Rather, my Play-by-Play, Real Time Scoring System dataset only had shot locations:

Figure 1: Snapshot of raw shot data by Corsica Hockey

If the NHL had tracked real-time coordinate data like the NFL, I could have created a fluid animation like this:

Figure 2: Tyreek Hill’s TD reception during Week 1 of 17/18 season. Video here. Source: http://bit.ly/nfl-bigdata

So, here is a hack I came up with. First, I “normalized” the shot locations so that all shots taken by the home team were shown on the right and shots taken by the away team were shown on the left. Then, after every shot location row, I inserted a row with (x,y) coordinates (82, 0) or (-82, 0) to mark the location of the net. Next, I created a column called event_index that groups each pair of rows (1 row for shot location, 1 row for net location). I then created a column called event_frame that enumerates all the rows. Last, I used the group aesthetic on event_index and added transition_components(time = event_frame) to render the animation.
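The data preparation described above can be sketched as follows. This is a Python illustration of the row layout only, not the app’s R code. Several details are assumptions: the input field names (`x`, `y`, `is_home`), flipping both coordinates to move a shot to the other side of the rink, and pairing home shots with the net at (82, 0) and away shots with the net at (-82, 0).

```python
HOME_NET = (82, 0)    # net on the right-hand side, as given in the article
AWAY_NET = (-82, 0)   # net on the left-hand side


def normalize_shot(x, y, is_home):
    """Flip coordinates so home shots have x >= 0 and away shots have x <= 0."""
    wrong_side = (is_home and x < 0) or (not is_home and x > 0)
    return (-x, -y) if wrong_side else (x, y)


def build_animation_rows(shots):
    """Expand each shot into a (shot location, net location) pair of rows."""
    rows = []
    for event_index, shot in enumerate(shots, start=1):
        x, y = normalize_shot(shot["x"], shot["y"], shot["is_home"])
        net_x, net_y = HOME_NET if shot["is_home"] else AWAY_NET
        # First row of the pair: where the shot was taken.
        rows.append({"event_index": event_index, "x": x, "y": y})
        # Second row of the pair: the net, sharing the same event_index.
        rows.append({"event_index": event_index, "x": net_x, "y": net_y})
    # event_frame simply enumerates every row in order.
    for event_frame, row in enumerate(rows, start=1):
        row["event_frame"] = event_frame
    return rows


# Two hypothetical shots: one by the home team, one by the away team.
shots = [
    {"x": -60, "y": 10, "is_home": True},
    {"x": 45, "y": -5, "is_home": False},
]
for row in build_animation_rows(shots):
    print(row)
```

Grouping on event_index ties each shot to its net row, and stepping through event_frame moves the point from the shot location to the net, which is what produces the impression of a shot travelling toward goal.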

Figure 3: Data processed for animation

This was all great, but I realized that the gganimate package doesn’t work well with Shiny. There is no function designed to render gganimate animations on Shiny. In other words, there was no natural way to put my animations on my end product, which was a huge concern.

This StackOverflow answer was super helpful in coming up with another hack. It recommended saving the animation as a .gif file and returning the file as a list along with the dimensions of the animation. There is one drawback to this method though: the animation looks stretched out if I increase the width too much, and it moves downward if I increase the height too much. As a result, what I currently have is the best compromise I could find between high image resolution and good placement.

The animation happens on an NHL ice rink created by War On Ice. I added “reactive” team logos in Shiny to clearly indicate which side is home and which is away. Also, in the app, users need to input the official game ID in order to navigate between games. To facilitate this process, I included a datatable of all the game IDs, game dates, home teams, and away teams next to the animation. That way, the user can find the desired game by searching through game dates or teams, locate the right game ID, and render the right animation.

Figure 4: Animation of 2017-10-04 Regular Season Game between the Toronto Maple Leafs and the Winnipeg Jets

Now, the other visualizations. I took a long, hard look at the dataset and thought about which columns to make use of. I thought the shot distance was pretty interesting, so I created a histogram of the shot distance. This illustrates the number of shots a team took at a certain distance from the net. To help the user interpret the distances, I labelled the location of the faceoff circles, blue line, and the red line. Furthermore, expected goal probability is a frequently occurring metric in hockey analytics discussions. I thought it would be interesting to see its change throughout the game. As a result, I animated expected goal probabilities for each team. This plot generated the most buzz.
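As a rough illustration of what the shot distance histogram counts, the sketch below bins shots by their straight-line distance from the net. It is a Python approximation, not the app’s code: it assumes shots have already been normalized so that each team attacks the net on its own side at x = ±82, and the 10-foot bin width and sample coordinates are hypothetical. The dataset may well provide shot distance directly, in which case only the binning step applies.

```python
import math
from collections import Counter

NET_X = 82  # net x-coordinate used in the animation, mirrored for the other side


def shot_distance(x, y):
    """Straight-line distance from a normalized shot location to the attacked net."""
    return math.hypot(NET_X - abs(x), y)


def distance_histogram(shots, bin_width=10):
    """Count shots per distance bin, keyed by the lower edge of each bin."""
    counts = Counter()
    for x, y in shots:
        bin_start = int(shot_distance(x, y) // bin_width) * bin_width
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical normalized shot locations as (x, y) pairs.
shots = [(79, 4), (70, 5), (-70, 5), (57, 20)]
print(distance_histogram(shots))  # {0: 1, 10: 2, 30: 1}
```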

Figure 5: Animation of Expected Goal Probability during 2017-10-04 Regular Season Game between the Toronto Maple Leafs and the Winnipeg Jets

Last, I wanted to include a summary of the game by showing the boxscore. However, I ran into too many roadblocks with HTML/CSS, so I decided to simply show the official nhl.com recap.

Some neat features I added to the app include a short tour using the rintrojs package. When the user presses the Help button in the top right corner, Shiny gives a short tour explaining what each of the parameters does. Also, the “Share” button allows users to easily share the app with a custom message I included, and the “Code” button redirects users to the GitHub repo.

Figure 6: Illustration of the rintrojs package

The final product is available here: NHL Play-by-Play App

The Epilogue to Quantifying Differences between the Regular Season and the Playoffs

Introduction

After several months of learning the concept of survival analysis and applying it to hockey, I published my article, “Quantifying Differences between the Regular Season and Playoffs using Survival Analysis”. Among the readers, one noteworthy individual in the sports analytics community commented on Twitter:



His tweet motivated this brief analysis to answer the first question: “Can I repeat my previous analysis for regular season by period?”. First, I only look at regular season data and change the treatment variable from whether the game is played during the regular season or playoffs to whether it was played in Period X vs Period Y. Then, I approach Tom’s question in a different way by keeping the treatment variable as regular season vs playoffs, but filter for the 1st, 2nd, and 3rd periods. This further shows the discrepancy in change in rates of events by period.


Wins Above Replacement: Replacement Level, Decisions, Results, and Final Remarks (Part 3)

In part 1 of this series we covered the history of WAR, discussed our philosophy, and laid out the goals of our WAR model. In part 2 we explained our entire modeling process. In part 3, we’re going to cover the theory of replacement level and the win conversion calculation and discuss decisions we made while constructing the model. Finally, we’ll explore some of the results and cover potential additions/improvements. 


NL Ice Data: A Swiss Hockey Analytics Website

In the last 10 years, I have been impressed by the development of the hockey analytics community in North America as well as the tools made available to the public in the hope of increasing the general hockey knowledge.

Unfortunately, in Switzerland, the Swiss Ice Hockey Federation (SIHF) does not provide the same level of information as is available in North America and keeps part of its proprietary data to itself. As such, fans and journalists, except on very rare occasions, don’t have access to the same kind of in-depth research and analysis that exists for the NHL or some other European leagues. Plus/minus is still THE hockey statistic for some journalists or analysts.

The first part of my project with the Hockey-Graphs Mentorship program was to create a platform entirely dedicated to Swiss hockey statistics, called NL Ice Data. The main goal was to exploit the available data as much as possible and to give fans access to additional statistics the SIHF doesn’t necessarily provide:

  • GF/GA: for players, RelGF%, GF/60, …;
  • time on ice deployment and evolution;
  • linemates information;
  • aggregated shot tracker maps per player, goalie and team;
  • and many others.

Current features include the same core of statistics for players, goalkeepers and teams: statistics, fouls, shootouts and shot tracker maps. Easy to use, the website provides interactive tables and charts so that fans can engage more with data. Additional features, charts and metrics will be added as the project progresses.

By slowly integrating further metrics and concepts after the website’s launch (xG or Game Score for example), the modest goal is to build overall knowledge amongst fans. A secondary goal is to have a platform ready to publish more *advanced* statistics (including at the player level) as soon as the League publishes more of its proprietary data.

Wins Above Replacement: The Process (Part 2)

In part 1, we covered WAR in hockey and baseball, discussed each field’s prior philosophies, and cemented the goals for our own WAR model. This part will be devoted to the process – how we assign value to players over multiple components to sum to a total value for any given player. We’ll cover the two main modeling aspects and how we adjust for overall team performance. Given our affinity for baseball’s philosophy and the overall influence it’s had on us, let’s first go back to baseball and look at how they do it, briefly.


Wins Above Replacement: History, Philosophy, and Objectives (Part 1)

Wins Above Replacement (WAR) is a metric created and developed by the sabermetric community in baseball over the last 30 years – there’s even room to date it back as far as 1982, when a system that resembled the method first appeared in Bill James’ Abstract from that year (per Baseball Prospectus and Tom Tango). The four major public models/systems in baseball define WAR as such:

  • “Wins Above Replacement (WAR) is an attempt by the sabermetric baseball community to summarize a player’s total contributions to their team in one statistic.” FanGraphs
  • “Wins Above Replacement Player [WARP] is Prospectus’ attempt at capturing a players’ total value.” Baseball Prospectus
  • “The idea behind the WAR framework is that we want to know how much better a player is than a player that would typically be available to replace that player.” Baseball-Reference
  • “Wins Above Replacement (WAR) … aggregates the contributions of a player in each facet of the game: hitting, pitching, baserunning, and fielding.” openWAR


Penalty Goals: An Expanded Approach to Measuring Penalties in the NHL

Intro

Penalty differential figures are a rather ambiguous concept in hockey. It seems only recently that the majority of analysts and fans have stopped touting a player’s total penalty minutes as a positive aspect of a player’s game. Let’s get one thing clear: taking penalties is a bad thing and drawing penalties is a good thing. When a penalty is taken or drawn, the change in strength state (5v5 to 5v4 for instance) directly impacts the rate of goal scoring for a given player’s team (goals for and goals against). We can measure this change by determining league average scoring rates at each strength state and can then determine the net goals that are lost/gained from a penalty that was taken/drawn. This was first shown in the penalty component of the WAR model from WAR-On-Ice (WOI) here. A.C. Thomas explains it:


Reviving Regularized Adjusted Plus-Minus for Hockey

Introduction

In this piece we will cover Adjusted Plus-Minus (APM) / Regularized Adjusted Plus-Minus (RAPM) as a method for evaluating skaters in the NHL. Some of you may be familiar with this process – both of these methods were developed for evaluating players in the NBA and have since been modified to do the same for skaters in the NHL. We first need to acknowledge the work of Brian Macdonald. He proposed how the NBA RAPM models could be applied for skater evaluation in hockey in three papers on the subject: paper 1, paper 2, and paper 3. We highly encourage you to read these papers as they were instrumental in our own development of the RAPM method.

While the APM/RAPM method is established in the NBA and to a much lesser extent the NHL, we feel (especially for hockey) revisiting the history, process, and implementation of the RAPM technique is overdue. This method has become the go-to public framework for evaluating a given player’s value within the NBA. There are multiple versions of the framework, which we can collectively call “regression analysis”, but APM was the original method developed. The goal of this type of analysis (APM/RAPM) is to isolate a given player’s contribution while on the ice independent of all factors that we can account for. Put simply, this allows us to better measure the individual performance of a given player in an environment where many factors can impact their raw results. We will start with the history of the technique, move on to a demonstration of how linear regression works for this purpose, and finally cover how we apply this to measuring skater performance in the NHL.
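To give a feel for how a regression can isolate an individual player’s contribution, here is a minimal Python sketch of ridge (regularized) regression on made-up data. Everything in it is hypothetical: the three-player design matrix, the outcome values, and the helper names `solve_linear_system` and `ridge_fit`. It is not the model described in this series, which accounts for many more factors. Each row is a stretch of play, each column marks whether a player was on the ice, and the outcome is some rate observed during that stretch.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for the coefficient vector beta."""
    p = len(X[0])
    xtx = [
        [sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0) for j in range(p)]
        for i in range(p)
    ]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical data: 4 stretches of play, 3 players (1 = on the ice).
X = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]]
y = [3, 1, 0, 2]  # generated from "true" player effects of 2, 1 and -1

print(ridge_fit(X, y, lam=0.0))  # plain least squares: close to [2, 1, -1]
print(ridge_fit(X, y, lam=5.0))  # regularized: every estimate is pulled toward zero
```

With no penalty, the fit recovers the effects used to build the toy data. Adding the penalty shrinks the estimates toward zero, which is the "regularized" part of RAPM and helps when players share most of their ice time and the data alone cannot separate them cleanly.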
