What did you think the first time you watched hockey? Did you know the difference between a forward and a defensive skater? Could you tell the difference just by watching? It’s likely that some outside factor (a friend, the play-by-play announcer, a graphic on the broadcast) alerted you to the fact that NHL teams use more than one type of skater.
But, say that outside variable never intervened, and you were left to your own devices. How long would it take for you to develop the idea of “forwards” and “defensive skaters”? Would you come up with your own classifications? Would you differentiate them at all?
These are some of the questions I have been thinking of lately. I wanted to find a way to classify skaters purely through the use of statistics, without any heuristic or observational method. Cluster analysis is one way to do this.
K-Means is an algorithm that partitions a data set into a chosen number of clusters. In its basic form, it assigns each data point to the cluster with the nearest center, recomputes each center as the mean of the points assigned to it, and repeats until the assignments stop changing. The result is a locally optimal grouping, which can reveal a structure or pattern within the data that isn’t otherwise apparent.
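To make the assign-and-update loop concrete, here is a minimal sketch of it written by hand. The data are made up for illustration (they are not the War On Ice data), and the names `pts`, `centers`, and `assignment` are my own. Note that R's built-in `kmeans()` uses the Hartigan-Wong variant by default, so this shows the idea rather than what `kmeans()` runs internally.

```r
# Toy data, made up for illustration: two well-separated groups of 20 points
set.seed(1)
pts <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2))

k <- 2
# Arbitrary starting centers: one point taken from each half of the data
centers <- pts[c(1, 21), , drop = FALSE]

for (iter in 1:20) {
  # Assignment step: squared distance from every point to every center
  d2 <- sapply(1:k, function(j) rowSums(sweep(pts, 2, centers[j, ])^2))
  assignment <- max.col(-d2, ties.method = "first")

  # Update step: each center becomes the mean of its assigned points
  new_centers <- t(sapply(1:k, function(j) {
    colMeans(pts[assignment == j, , drop = FALSE])
  }))

  # Stop once the centers no longer move
  if (all(abs(new_centers - centers) < 1e-12)) break
  centers <- new_centers
}

table(assignment)
round(centers, 2)
```

With groups this far apart the loop settles almost immediately. On real data the result can depend on the starting centers, which is why running several random starts is a common precaution.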
In this post, I’ll walk through the process of using K-Means in R on a basic data set, discuss the results, and identify further uses for this type of analysis.
I’m using 5v5 score-adjusted data from War On Ice in this analysis. This data is at the player-season level, with primary assists, goals, TOI per game, and Position as the required fields. You can download the data here.
#Set your working directory
#Required packages ("ggplot2", "flexclust")
#Read War On Ice skater data into R
woi.skater.clustering <- read.csv("woi.skater.clustering.csv")
names(woi.skater.clustering)
  [1] "X.2" "X.1" "X" "Name" "pos" "Team" "Gm"
  [8] "season" "Age" "Salary" "AAV" "G" "A" "P"
 [15] "G60" "A60" "P60" "PenD" "CF." "PDO" "PSh."
 [22] "ZSO.Rel" "TOI.Gm" "iHSC" "HSCF.Rel" "HSCF." "HSCF.off" "HSC..."
 [29] "HSCF" "HSCA" "HSCF60" "HSCA60" "HSCP60" "SCF.Rel" "SCF."
 [36] "SCF.off" "SC..." "SCF" "SCA" "SCF60" "SCA60" "SCP60"
 [43] "iSC" "CF.Rel" "CF.off" "C..." "CF" "CA" "CF60"
 [50] "CA60" "CP60" "OCOn." "BK" "AB" "iCF" "FF.Rel"
 [57] "FF." "FF.off" "F..." "FF" "FA" "FF60" "FA60"
 [64] "FP60" "OFOn." "MS" "iFF" "SF.Rel" "SF." "SF.off"
 [71] "S..." "SF60" "SA60" "SF" "SA" "iSF" "OCAOn."
 [78] "OFAOn." "GF.Rel" "GF.off" "GF60" "GA60" "GF" "GA"
 [85] "G..." "GF." "PFenSh." "OSh." "OFenSh." "OSv." "OFenSv."
 [92] "FO.." "FO_W" "FO_L" "ZSO." "ZSO" "ZSN" "ZSD"
 [99] "HIT" "HIT." "A1" "A2" "SH" "GV" "TK"
[106] "PN" "PN." "PenD60" "TOI" "TOIoff" "TOI." "TOIT."
[113] "TOIC." "CorT." "CorC." "tCF60" "tCA60" "cCF60" "cCA60"
The Position data from War On Ice is messy, so we’ll simplify the data to “Forwards” and “Defense”.
#Identify unique values in the "pos" column
unique(woi.skater.clustering$pos)
 [1] RL C L LR R D RC CR CL LC RD DR DL
Levels: C CL CR D DL DR L LC LR R RC RD RL
#Classify all rows in "pos" not equal to "D" as "Forward", else "Defensemen"
woi.skater.clustering$new_pos <- ifelse(woi.skater.clustering$pos == "D",
                                        "Defensemen", "Forward")
#Identify unique values in the "new_pos" column
unique(woi.skater.clustering$new_pos)
[1] "Forward" "Defensemen"
War On Ice doesn’t have Primary Points Per 60 in their data, so let’s create that and add it to the data frame.
#Create new column "PP60" for primary points per 60
woi.skater.clustering$PP60 <- ((woi.skater.clustering$G + woi.skater.clustering$A1) /
                               woi.skater.clustering$TOI) * 60
#Inspect new variable "PP60"
str(woi.skater.clustering$PP60)
 num [1:5117] 1.09 1.365 0.953 1.139 1.406 ...
Let’s see how the data looks when colored by the position as given on War On Ice.
#Load ggplot2 for graphing
library(ggplot2)
#Graph original data frame and color by position
orig_df <- ggplot(woi.skater.clustering, aes(TOI.Gm, PP60, color = new_pos))
orig_df_graph <- orig_df +
  geom_point(alpha=I(.75)) +
  theme_bw() +
  xlab("TOI Per Game") +
  ylab("Primary Points Per 60") +
  scale_color_discrete(name="Position")
orig_df_graph
Looks like there are clearly two groups of skaters, but there’s an area where defensemen and forwards overlap. Let’s see what K-Means can tell us about this data set.
We don’t need all the variables in the War On Ice data, so let’s create a separate data frame for the data we’ll use in the cluster analysis.
#Isolate the variables used for cluster analysis
woi_c <- subset(woi.skater.clustering, select=(c("PP60", "TOI.Gm")))
str(woi_c)
'data.frame': 5117 obs. of 2 variables:
 $ PP60  : num 1.09 1.365 0.953 1.139 1.406 ...
 $ TOI.Gm: num 10.4 10.8 13.1 13.5 13.2 ...
head(woi_c)
      PP60 TOI.Gm
1 1.090413  10.41
2 1.364924  10.85
3 0.952638  13.12
4 1.139277  13.54
5 1.405841  13.22
6 1.100473   8.14
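Since PP60 and TOI.Gm are measured in different units (points per 60 minutes versus minutes per game), we need to standardize them to one scale so that neither variable dominates the distance calculation.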
#Coerce all variables to the same scale
woi_c <- as.data.frame(scale(woi_c))
#View data frame again
head(woi_c)
          PP60      TOI.Gm
1  0.230159623 -1.07375361
2  0.699772975 -0.89671733
3 -0.005537132  0.01662890
4  0.313751279  0.18561807
5  0.769770788  0.05686441
6  0.247369161 -1.98709984
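For anyone curious what `scale()` is doing, the sketch below applies the same z-score transformation by hand: subtract each column's mean and divide by its standard deviation. It uses only the six rows printed by `head(woi_c)` before scaling, so the numbers will not match the scaled output above, which is based on the mean and standard deviation of all 5,117 rows. The names `demo`, `manual`, and `scaled` are mine.

```r
# The six rows shown by head(woi_c) before scaling
demo <- data.frame(
  PP60   = c(1.090413, 1.364924, 0.952638, 1.139277, 1.405841, 1.100473),
  TOI.Gm = c(10.41, 10.85, 13.12, 13.54, 13.22, 8.14)
)

# Z-score by hand: (value - column mean) / column standard deviation
manual <- as.data.frame(lapply(demo, function(x) (x - mean(x)) / sd(x)))

# The same thing using scale()
scaled <- as.data.frame(scale(demo))

# The two approaches should agree
all.equal(manual, scaled, check.attributes = FALSE)

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```

Once both columns are on this scale, a one-unit difference means "one standard deviation" for either variable, so ice time no longer outweighs scoring rate simply because its raw numbers are larger.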
Let’s use a within-groups sum of squares (WSS) plot to determine the optimal number of clusters to use.
wssplot <- function(woi_c, nc=15, seed=12345){
  wss <- (nrow(woi_c)-1)*sum(apply(woi_c,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(woi_c, centers=i)$withinss)
  }
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")
}
wssplot(woi_c, nc=6)
#The point where the line forms an "elbow" indicates the optimal number of clusters
This graph shows, for each number of clusters, the within-groups sum of squares: the squared distance between each point and the center of its assigned cluster, summed over all points. We want the number of clusters beyond which adding another yields only a small further reduction. The “elbow” of the graph is at two, so we’ll use that.
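As a rough check on what the plot is measuring, here is a small sketch that computes the within-groups sum of squares by hand and compares it with the value `kmeans()` reports. The data are made up for illustration, and `toy`, `manual_wss`, `total_ss`, and `wss_by_k` are names I have chosen for the example.

```r
# Toy data, made up for illustration: two groups of 30 points
set.seed(42)
toy <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(60, mean = 5, sd = 0.5), ncol = 2))

km <- kmeans(toy, centers = 2, nstart = 10)

# WSS by hand: squared distance from each point to its own cluster center
manual_wss <- sum((toy - km$centers[km$cluster, ])^2)
all.equal(manual_wss, km$tot.withinss)

# With one cluster, WSS is the total sum of squares around the overall mean,
# which is what (n - 1) * sum of column variances computes in wssplot()
total_ss <- sum(scale(toy, scale = FALSE)^2)
all.equal(total_ss, (nrow(toy) - 1) * sum(apply(toy, 2, var)))

# WSS for one to four clusters; look for where the drop levels off
wss_by_k <- c(total_ss,
              sapply(2:4, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss))
round(wss_by_k, 2)
```

Because the toy data really do contain two groups, almost all of the reduction should come from going from one cluster to two, which is the shape an elbow at two describes.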
#Perform cluster analysis
#Set seed so the results will be reproducible
set.seed(12345)
#Use two clusters since we expect two position types
woi_kmeans <- kmeans(woi_c, 2, nstart=25)
woi_kmeans$cluster <- as.factor(woi_kmeans$cluster)
#Add cluster column to the data frame
woi_c$cluster <- woi_kmeans$cluster
#Graph new data frame and color by cluster assignment
#Refer to the column by name inside aes() rather than with woi_c$cluster
woi_cluster_graph <- ggplot(woi_c, aes(TOI.Gm, PP60, color = cluster))
woi_cluster_graph <- woi_cluster_graph +
  geom_point(alpha=I(.75)) +
  theme_bw() +
  xlab("TOI Per Game") +
  ylab("Primary Points Per 60") +
  scale_color_discrete(name="Cluster")
woi_cluster_graph
K-Means draws a firm line between the two clusters it identifies. How well does the algorithm match skaters to their actual position?
#Let's see how the clusters match up against the actual position of the players
table(woi_c$cluster, woi.skater.clustering$new_pos)
    Defensemen Forward
  1        120    3115
  2       1773     109
#Load flexclust to determine accuracy of clusters
library(flexclust)
randIndex(table(woi_c$cluster, woi.skater.clustering$new_pos))
      ARI
0.8281581
#The adjusted Rand index (ARI) is 1 for perfect agreement; values near 0 are what chance alone would produce
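If you don't have flexclust installed, the adjusted Rand index can be worked out directly from the table above using the standard pair-counting formula. The sketch below retypes the four counts from that output (the names `tab`, `agreement`, and `ari` are mine) and also computes the simple share of skaters whose cluster lines up with their listed position, treating cluster 1 as the forward cluster and cluster 2 as the defense cluster.

```r
# Counts copied from the table above (rows: cluster, columns: position)
tab <- matrix(c(120, 1773, 3115, 109), nrow = 2,
              dimnames = list(cluster = c("1", "2"),
                              position = c("Defensemen", "Forward")))

n <- sum(tab)

# Share of skaters whose cluster matches their listed position
agreement <- (tab["1", "Forward"] + tab["2", "Defensemen"]) / n

# Adjusted Rand index: count pairs of skaters grouped together
sum_cells <- sum(choose(tab, 2))           # together in both groupings
sum_rows  <- sum(choose(rowSums(tab), 2))  # together in the same cluster
sum_cols  <- sum(choose(colSums(tab), 2))  # together in the same position
expected  <- sum_rows * sum_cols / choose(n, 2)  # agreement expected by chance
max_index <- (sum_rows + sum_cols) / 2

ari <- (sum_cells - expected) / (max_index - expected)

round(agreement, 4)
round(ari, 7)
```

The hand-computed `ari` should reproduce the 0.828 reported by `randIndex()`. It is lower than the raw agreement share because the ARI subtracts the agreement two random groupings of these sizes would reach by chance.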
The structure the K-Means algorithm detected indicates that, given PP60 and TOI.Gm, there are two types of skaters. From its perspective, it cannot tell the difference between a forward who accumulates few primary points in sparse minutes and a defenseman who does the same.
While this does not tell us anything groundbreaking, a few important questions do emerge. First and most simply, is there really a difference between a low-end forward and a low-end defenseman? It is probably not a coincidence that skaters such as Deryk Engelland can switch between being a 3rd-pair defenseman and a 4th-line forward fairly seamlessly.
In addition, using machine learning methods helps us to question why things are the way they are. Should some current forwards be converted to defense, and vice versa (looking at you, Dustin Byfuglien and Brent Burns)?
There are many possible uses for this type of analysis in hockey:
- Drill further down into each position to determine if there is anything besides ice time separating the various tiers of forwards and defense.
- Use shot danger save % bins, AAV, and TOI to statistically determine clusters of goalies, e.g. premier starters, workhorse starters, and backups.
- Use the data from Ryan Stimson’s passing project to determine which players are passers and which are shooters. Additionally, the data could be split by zone to see if there is a distinct group of players that lead breakouts from the defensive zone.
- We could also use the analysis at the team level to distinguish teams in terms of shots (Corsi, Fenwick, Scoring Chances), as well as shooting percentage (for and against).
These are the articles I used for reference in this post:
- https://rstudio-pubs-static.s3.amazonaws.com/33876_1d7794d9a86647ca90c4f182df93f0e8.html
- https://baseballwithr.wordpress.com/2015/02/22/pitch-classification-with-k-means-clustering/
- http://www.rdatamining.com/examples/kmeans-clustering
- http://www.r-bloggers.com/k-means-clustering-from-r-in-action/
- http://rischanlab.github.io/Kmeans.html
- http://sherrytowers.com/2013/10/24/k-means-clustering/
Follow me on Twitter @Null_HHockey
Overall, your latter conclusions would be an interesting line of inquiry to pursue. But I’d hesitate to say, with the above data, that they might be so interchangeable. I think more of an argument could be made for switching back a forward who had been pushed out of the defense corps early in their career than for thinking a defenseman could become a forward. Most bottom-pair defenders are possibly better off with play in front of them. Still intriguing.
Hi Conor, this is great stuff. I’m looking to do something similar for rugby – this is a great guide, as both games are pretty fluid and dynamic. Is there any chance you could post/share the data file? The War on Ice site is now defunct, and I can’t seem to find the correct data on their Github repo. I wanted to try and reproduce your work here, as a learning exercise, hoping it will guide me in what I’m trying to do. Thanks!
Hi Rob, if you reach out to me on Twitter (@Null_HHockey) with an email address, I’ll send you the data and the code. Thanks for reading!
Hi Connor, I followed you on twitter (nipper68). If you follow me back, I can DM you my email.
Hi Rob,
Really intrigued by your analysis, great work! I recently learned about cluster analysis and want to apply it to hockey data for fun. Do you mind sharing the data file? As Rob pointed out, the War On Ice website is now defunct and I’m having trouble finding a dataset! I followed you on Twitter; my Twitter handle is @sameerdesai1
Thanks, keep up the good work.