# Using Cluster Analysis To Identify Player Position

What did you think the first time you watched hockey? Did you know the difference between a forward and a defensive skater? Could you tell the difference just by watching? It’s likely that some outside factor (a friend, the play-by-play announcer, a graphic on the broadcast) alerted you to the fact that NHL teams use more than one type of skater.

But, say that outside variable never intervened, and you were left to your own devices. How long would it take for you to develop the idea of “forwards” and “defensive skaters”? Would you come up with your own classifications? Would you differentiate them at all?

These are some of the questions I have been thinking of lately. I wanted to find a way to classify skaters purely through the use of statistics, without any heuristic or observational method. Cluster analysis is one way to do this.

K-Means is an algorithm that assigns each data point in a set to the cluster with the nearest center, recalculates each center as the average of the points assigned to it, and repeats these two steps until the assignments stop changing. It is an unsupervised method: it groups the data without being told the answer, which makes it useful for discovering a structure or pattern that isn’t otherwise apparent.
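To make the assign-and-update loop concrete, here is a minimal sketch of K-Means in base R. The function name `simple_kmeans` and the toy data are assumptions made up for illustration; the analysis in this post relies on R's built-in `kmeans()` instead.

```r
# Minimal sketch of the K-Means loop in base R.
# simple_kmeans and the toy data are hypothetical, for illustration only.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k randomly chosen rows as the initial cluster centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every center
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Assign each point to its nearest center
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop once no point changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Move each center to the mean of the points assigned to it
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

# Toy data: two groups of 20 points each
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

Because the starting centers are random, a loop like this can settle on different answers from run to run, which is why it is common to try several random starts and keep the best result.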

In this post, I’ll walk through the process of using K-Means in R on a basic data set, discuss the results, and identify further uses for this type of analysis.

I’m using 5v5 score-adjusted data from War On Ice in this analysis. This data is at the player-season level, with primary assists, goals, TOI per game, and Position as the required fields. You can download the data here.

```
#Set your working directory
#Required packages ("ggplot2", "flexclust")
#Read War On Ice skater data into R

names(woi.skater.clustering)
[1] "X.2" "X.1" "X" "Name" "pos" "Team" "Gm"
[8] "season" "Age" "Salary" "AAV" "G" "A" "P"
[15] "G60" "A60" "P60" "PenD" "CF." "PDO" "PSh."
[22] "ZSO.Rel" "TOI.Gm" "iHSC" "HSCF.Rel" "HSCF." "HSCF.off" "HSC..."
[29] "HSCF" "HSCA" "HSCF60" "HSCA60" "HSCP60" "SCF.Rel" "SCF."
[36] "SCF.off" "SC..." "SCF" "SCA" "SCF60" "SCA60" "SCP60"
[43] "iSC" "CF.Rel" "CF.off" "C..." "CF" "CA" "CF60"
[50] "CA60" "CP60" "OCOn." "BK" "AB" "iCF" "FF.Rel"
[57] "FF." "FF.off" "F..." "FF" "FA" "FF60" "FA60"
[64] "FP60" "OFOn." "MS" "iFF" "SF.Rel" "SF." "SF.off"
[71] "S..." "SF60" "SA60" "SF" "SA" "iSF" "OCAOn."
[78] "OFAOn." "GF.Rel" "GF.off" "GF60" "GA60" "GF" "GA"
[85] "G..." "GF." "PFenSh." "OSh." "OFenSh." "OSv." "OFenSv."
[92] "FO.." "FO_W" "FO_L" "ZSO." "ZSO" "ZSN" "ZSD"
[99] "HIT" "HIT." "A1" "A2" "SH" "GV" "TK"
[106] "PN" "PN." "PenD60" "TOI" "TOIoff" "TOI." "TOIT."
[113] "TOIC." "CorT." "CorC." "tCF60" "tCA60" "cCF60" "cCA60"
```

The Position data from War On Ice is messy, so we’ll simplify the data to “Forwards” and “Defense”.

```
#Identify unique values in the "pos" column
unique(woi.skater.clustering$pos)
[1] RL C L LR R D RC CR CL LC RD DR DL
Levels: C CL CR D DL DR L LC LR R RC RD RL

#Classify rows where "pos" is exactly "D" as "Defensemen" and every other row as "Forward"
woi.skater.clustering$new_pos <- ifelse(woi.skater.clustering$pos == "D", "Defensemen", "Forward")

#Identify unique values in the "new_pos" column
unique(woi.skater.clustering$new_pos)
[1] "Forward" "Defensemen"
```
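To see how that rule treats each position code, the sketch below applies the same `ifelse()` test to the thirteen codes printed above. The vector name `pos_codes` is made up for this illustration. One thing it makes visible: mixed listings such as "RD" or "DL" are not an exact match for "D", so they end up labeled as forwards.

```r
# Position codes copied from the unique() output above
pos_codes <- c("RL", "C", "L", "LR", "R", "D", "RC", "CR", "CL", "LC", "RD", "DR", "DL")

# Same rule as in the post: only an exact "D" counts as defense
new_pos <- ifelse(pos_codes == "D", "Defensemen", "Forward")

# Show which codes fall into each group
split(pos_codes, new_pos)
```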

War On Ice doesn’t have Primary Points Per 60 in their data, so let’s create that and add it to the data frame.

```
#Create new column "PP60" for primary points per 60
woi.skater.clustering$PP60 <- ((woi.skater.clustering$G + woi.skater.clustering$A1) / woi.skater.clustering$TOI) * 60

#Inspect new variable "PP60"
str(woi.skater.clustering$PP60)
num [1:5117] 1.09 1.365 0.953 1.139 1.406 ...
```
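As a quick sanity check on the formula, here is the same calculation for a single made-up player-season. The numbers are hypothetical and chosen only to make the arithmetic easy to follow; the sketch assumes TOI is recorded in minutes, as the formula above implies.

```r
# Hypothetical player-season: 10 goals, 8 primary assists, 900 minutes of ice time
goals <- 10
primary_assists <- 8
toi_minutes <- 900

# Primary points per 60 minutes = (G + A1) / TOI * 60
pp60 <- (goals + primary_assists) / toi_minutes * 60
pp60
```

Eighteen primary points in 900 minutes works out to 1.2 primary points per 60 minutes.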

Let’s see how the data looks when colored by the position as given on War On Ice.

```
#Load ggplot2 for graphing
library(ggplot2)
#Graph original data frame and color by position
orig_df <- ggplot(woi.skater.clustering, aes(TOI.Gm, PP60, color = new_pos))
orig_df_graph <- orig_df + geom_point(alpha = 0.75) +
  theme_bw() +
  xlab("TOI Per Game") +
  ylab("Primary Points Per 60") +
  scale_color_discrete(name = "Position")
orig_df_graph
```

Looks like there are clearly two groups of skaters, but there’s an area where defensemen and forwards overlap. Let’s see what K-Means can tell us about this data set.

We don’t need all the variables in the War On Ice data, so let’s create a separate data frame for the data we’ll use in the cluster analysis.

```
#Isolate the variables used for cluster analysis
woi_c <- subset(woi.skater.clustering, select = c("PP60", "TOI.Gm"))

str(woi_c)
'data.frame': 5117 obs. of 2 variables:
 $ PP60  : num 1.09 1.365 0.953 1.139 1.406 ...
 $ TOI.Gm: num 10.4 10.8 13.1 13.5 13.2 ...

#View the first six rows
head(woi_c)
      PP60 TOI.Gm
1 1.090413  10.41
2 1.364924  10.85
3 0.952638  13.12
4 1.139277  13.54
5 1.405841  13.22
6 1.100473   8.14
```

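PP60 and TOI.Gm are measured in different units (points per 60 minutes versus minutes per game), and TOI.Gm spans a much wider numeric range. K-Means groups points by distance, so without rescaling, the variable with the larger range would dominate the result. We need to standardize both variables to one scale.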

```
#Coerce all variables to the same scale
woi_c <- as.data.frame(scale(woi_c))

#View data frame again
head(woi_c)
          PP60      TOI.Gm
1  0.230159623 -1.07375361
2  0.699772975 -0.89671733
3 -0.005537132  0.01662890
4  0.313751279  0.18561807
5  0.769770788  0.05686441
6  0.247369161 -1.98709984
```
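For reference, `scale()` with its default arguments subtracts each column's mean and divides by its standard deviation. The sketch below checks that by hand on the six raw rows printed earlier. The `toy` data frame is just those rows retyped, so its z-scores differ from the ones above, which were computed across all 5,117 rows.

```r
# The six raw rows shown earlier, retyped for illustration
toy <- data.frame(
  PP60   = c(1.090413, 1.364924, 0.952638, 1.139277, 1.405841, 1.100473),
  TOI.Gm = c(10.41, 10.85, 13.12, 13.54, 13.22, 8.14)
)

# Manual z-score: (value - column mean) / column standard deviation
manual <- as.data.frame(lapply(toy, function(col) (col - mean(col)) / sd(col)))

# scale() with default arguments should give the same numbers
auto <- as.data.frame(scale(toy))
all.equal(manual, auto)

# After standardizing, each column has a standard deviation of 1
sapply(auto, sd)
```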

Let’s use a within-cluster sum of squares (WSS) plot to choose the number of clusters to use.

```
#Plot the within-cluster sum of squares (WSS) for 1 to nc clusters
wssplot <- function(data, nc=15, seed=12345){
  #With a single cluster, WSS is the total sum of squares
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)
  }
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")
}
wssplot(woi_c, nc=6)
#The point where the line forms an "elbow" suggests a reasonable number of clusters
```

This graph shows, for each number of clusters, the sum of the squared distances between each point and the center of its assigned cluster. That sum always shrinks as clusters are added, so we want the number of clusters after which adding more yields only small reductions. The “elbow” of the graph is at two, so we’ll use that.
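The quantity on the y-axis can be computed by hand, which helps show what the plot is measuring. The sketch below, on made-up two-dimensional data, sums the squared distances from each point to its cluster center and compares the result with the `tot.withinss` value that `kmeans()` reports. The toy data and variable names are assumptions for illustration.

```r
set.seed(12345)
# Made-up data: two loose groups of 50 points each
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2))

fit <- kmeans(toy, centers = 2, nstart = 10)

# Squared distance from each point to the center of its assigned cluster
assigned_centers <- fit$centers[fit$cluster, , drop = FALSE]
wss_by_hand <- sum((toy - assigned_centers)^2)

# Should agree with the total that kmeans() reports
all.equal(wss_by_hand, fit$tot.withinss)

# With a single cluster, WSS is the total sum of squares around the overall mean
total_ss <- sum(scale(toy, scale = FALSE)^2)
all.equal(total_ss, (nrow(toy) - 1) * sum(apply(toy, 2, var)))
```

The last comparison is the same expression used for the one-cluster starting value in the plotting function.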

```
#Perform cluster analysis
#Set seed so the results will be reproducible
set.seed(12345)

#Use two clusters since we expect two position types
woi_kmeans <- kmeans(woi_c, 2, nstart=25)
woi_kmeans$cluster <- as.factor(woi_kmeans$cluster)

#Add cluster column to the data frame
woi_c$cluster <- woi_kmeans$cluster
```
```
#Graph new data frame and color by cluster assignment
woi_cluster_graph <- ggplot(woi_c, aes(TOI.Gm, PP60, color = cluster))
woi_cluster_graph <- woi_cluster_graph + geom_point(alpha = 0.75) +
  theme_bw() +
  xlab("TOI Per Game") +
  ylab("Primary Points Per 60") +
  scale_color_discrete(name = "Cluster")
woi_cluster_graph
```

K-Means draws a firm line between the two clusters it identifies. How well does the algorithm match skaters to their actual position?

```
#Let's see how the clusters match up against the actual position of the players
table(woi_c$cluster, woi.skater.clustering$new_pos)
    Defensemen Forward
  1        120    3115
  2       1773     109

#Load flexclust to determine accuracy of clusters
library(flexclust)
randIndex(table(woi_c$cluster, woi.skater.clustering$new_pos))
      ARI
0.8281581

#The adjusted Rand index (ARI) is 1 for perfect agreement and close to 0 when
#agreement is no better than chance; negative values mean worse than chance
```
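For readers who want to see what `randIndex()` is measuring, the adjusted Rand index can also be computed directly from the contingency table. The sketch below uses the standard pair-counting formula on the counts printed above. The helper name `adjusted_rand` is made up for this illustration and is not part of flexclust.

```r
# Hypothetical helper: adjusted Rand index from a contingency table
adjusted_rand <- function(tab) {
  pairs <- function(n) n * (n - 1) / 2          # number of pairs among n items
  sum_cells <- sum(pairs(tab))                  # pairs kept together by both groupings
  sum_rows  <- sum(pairs(rowSums(tab)))         # pairs kept together by cluster
  sum_cols  <- sum(pairs(colSums(tab)))         # pairs kept together by position
  expected  <- sum_rows * sum_cols / pairs(sum(tab))  # agreement expected by chance
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Counts copied from the table above (rows = cluster, columns = position)
tab <- matrix(c(120, 1773, 3115, 109), nrow = 2,
              dimnames = list(cluster = c("1", "2"),
                              position = c("Defensemen", "Forward")))

adjusted_rand(tab)
```

The result should land close to the 0.828 reported by `randIndex()`.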

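The structure the K-Means algorithm detected indicates that, given PP60 and TOI.Gm, there are two types of skaters. The two clusters line up closely with position, but not perfectly: 229 of the 5,117 player-seasons (about 4.5%) land in the cluster dominated by the other position. From the algorithm’s perspective, there is no difference between a forward who accumulates few primary points in sparse minutes and a defenseman who does the same.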

While this does not tell us anything groundbreaking, a few important questions do emerge. First and most simply, is there really a difference between a low-end forward and a low-end defenseman? It is probably not a coincidence that skaters such as Deryk Engelland can switch between being a third-pair defenseman and a fourth-line forward fairly seamlessly.

In addition, using machine learning methods helps us to question why things are the way they are. Should some current forwards be converted to defense, and vice versa (looking at you, Dustin Byfuglien and Brent Burns)?

There are many possible uses for this type of analysis in hockey:

• Drill further down into each position to determine if there is anything besides ice time separating the various tiers of forwards and defense.
• Use shot danger save % bins, AAV, and TOI to statistically determine clusters of goalies, e.g. premier starters, workhorse starters, and backups.
• Use the data from Ryan Stimson’s passing project to determine which players are passers and which are shooters. Additionally, the data could be split by zone to see if there is a distinct group of players that lead breakouts from the defensive zone.
• We could also use the analysis at the team level to distinguish teams in terms of shots (Corsi, Fenwick, Scoring Chances), as well as shooting percentage (for and opposed).

These are the articles I used for reference in this post: