Analytics, so hot right now. But how do you get started? People from all sorts of backgrounds and levels of expertise have contributed valuable work to hockey analytics, but the journey can feel daunting.
In this post, I want to lay out my personal advice for what knowledge and skills are needed and how to get them. Your mileage will vary, but I think much of this will be useful to anyone who is interested in starting to do their own analytics research or writing.
In my mind, there are three fundamental steps to starting your own hockey research project:
- A question or area you want to study based on your knowledge of past work and your personal interests.
- The data that will enable your work.
- The data science technical skill set to actually conduct the analysis.
I’ll cover each of these in its own section.
These three steps need not be sequential; in fact, they almost certainly won’t be. Hockey analysis is a constant process of growth in all of these areas, and they’ll generate feedback loops that reinforce one another. Similarly, this article does not need to be read in its totality or in the order it’s presented. It is intended as a resource you can turn to whenever needed.
Part 1: Domain Expertise
1A: Picking Your Topic
Any type of data science work requires some degree of domain expertise, and hockey is no different. To get started, you need to know enough about hockey to formulate questions and research what prior work has been done on the topic.
“Figure out what to write about” may not seem like a challenge to many, but it is worth addressing. Plenty of fans think and talk about hockey all the time, but it takes additional work to develop that into a rigorous area of study.
That said, general hockey discussions remain my best source of ideas. Reading articles, joining Twitter discussions, and listening to hockey podcasts have all provided inspiration for topics. And of course, watching the game itself will prompt ideas.
Sometimes the subject can quickly be formulated as a specific question. For example, “Should teams pull their goalie on the power play?” is clear and concise; the challenges come from designing an analysis that answers it. Other topics are broader. You may be wondering what the best neutral zone strategy is, or you may just come across an interesting dataset and want to find the most important insights within it. These are all valid, but some will require additional work to add structure.
Whatever you pick, it is not permanent. Every serious analyst I know has a long list of areas they would like to look into and has other projects they’ve abandoned midway. Save all your ideas and review them, making modifications along the way, so that it’s easy to pick something up when you’re ready to work.
There’s no area of hockey that has been picked dry, but I think some are especially open to new research:
- NHL entry draft
- Special teams
- Neutral zone play
- Cap-based analysis
These are all of interest to teams and fans, and they sit in a nice middle ground where some early work has been done but there is a lot of opportunity for growth. Building highly sophisticated machine learning tools like WAR or xG models can be fun (and you should go for them if you want to), but they are far from the only subjects available to study.
1B: Conducting Background Research
You’ll also want to take the time to read up on past analytics work. This has multiple purposes. The most obvious one is to see how others have studied topics similar to yours, and what they have found. There’s plenty of value in replicating or expanding that work, but it’s important to be in conversation with those studies rather than ignoring them. Past work will also inspire additional ideas and simply teach you more about hockey. Finally, reading studies is a great way to learn new methods and communication styles. As you read, note how the author presents his or her arguments.
Previous research is spread out across many sites, but the best central repository is undoubtedly metahockey.com. Meta Hockey compiles some of the best work from throughout the internet. It’s a great place to find work on a particular topic or to just browse sections.
Many other sources are worth your time as well; if you see a site frequently linked from Meta Hockey, check out its full archives. Obviously, Hockey Graphs is one such site, as are Outnumbered, NHL Numbers, Arctic Ice Hockey, The Athletic, and more. There are also quite a few academic papers that cover sports analytics and are readily available online. These papers tend to be more mathematically rigorous and are an underutilized resource.
Read as much as you can, but make it a boost to your process rather than a hurdle. It’s important that you don’t get paralyzed waiting to read everything (an impossible task) before you jump in yourself. Keep reading, but keep doing your own work at the same time.
Part 2: Data Gathering
Once you know what you want to do, you’ll need to find the data that helps you do it. The fundamental data source currently available is the collection of play-by-play stat sheets produced by the NHL. Many analysts have coded their own scrapers to collect this data from the NHL’s website after (or while) the games are played. You can do this yourself, but thankfully, I have never had to, because sites such as Corsica and Natural Stat Trick make it available in a simpler form. Evolving-Hockey also has a tool for subscribers to search through play-by-play files, which can be very helpful.
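As a rough illustration of what working with play-by-play data can look like once you have it as a CSV, here is a minimal Python sketch that counts shot attempts per team. The column names (`game_id`, `period`, `event_team`, `event_type`) and the event codes are assumptions invented for this example; real exports from any of the sites above will have their own layout, so treat this as a starting point rather than a recipe.

```python
import csv
import io
from collections import Counter

# Hypothetical play-by-play extract; real files will have different columns.
# In this made-up sample, event_team is the team that took the attempt.
SAMPLE_PBP = """game_id,period,event_team,event_type
2001,1,TOR,SHOT
2001,1,MTL,MISS
2001,1,TOR,GOAL
2001,2,MTL,FACEOFF
2001,2,MTL,BLOCK
2001,3,TOR,HIT
"""

# Event codes treated as shot attempts in this sketch (an assumption).
SHOT_ATTEMPT_TYPES = {"SHOT", "MISS", "BLOCK", "GOAL"}


def count_shot_attempts(csv_text):
    """Return a Counter of shot attempts per team from play-by-play CSV text."""
    attempts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["event_type"] in SHOT_ATTEMPT_TYPES:
            attempts[row["event_team"]] += 1
    return attempts


attempts = count_shot_attempts(SAMPLE_PBP)
for team, n in sorted(attempts.items()):
    print(team, n)
```

With a real file, you would pass in the file’s contents (or open the file directly) instead of the inline sample, and adjust the column names and event codes to match.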
There are also several “repository” sites that have a wider but somewhat shallower collection of data. Sites like Hockey Reference and Hockey DB don’t have the same advanced statistics as a site like Corsica, but they do provide categories I haven’t found elsewhere. If you want stats on coaches, drafts, historic standings, or other non-gameplay stats, I have found these sites to be invaluable.
Finally, some data has been painstakingly collected through manual tracking. The most prominent such effort is Corey Sznajder’s All Three Zones project, which has generated thousands of data points for zone entries, zone exits, pre-shot passes, and more. I’d argue that this is the most underutilized resource in hockey analytics today, and would strongly encourage you to check it out. Others have also done manual tracking for particular projects, such as special teams analysis or CHL research. And of course, if none of these sources has what you’re looking for, a committed enough person could start tracking data themselves.
Part 3: Technical Skills
At last, we’ve reached the part of this guide that covers what you were probably expecting when you first clicked on it. Sports analytics is famously associated with spreadsheets. In truth, some analysis can be done in a spreadsheet program like Microsoft Excel, but the majority is better done with a programming language such as R or Python. These tools tend to be more powerful and make it easier to develop and test a reproducible piece of research. Consequently, the two are the most common tools used in blog posts today.
Which is better: R or Python? It depends, but mostly doesn’t matter. For 90% of work, you can use either one. I personally learned R first, so I stick with that. I find it to be the better tool for quick summary statistics and other basic exploration of datasets. It’s also a bit stronger at super advanced statistics, and I find that its data analysis libraries (especially the tidyverse, see more below) are a bit more intuitive to people who are new to coding. That said, I have recently been learning Python to augment what I can do in R; I find it better at things involving the web, like data scraping, as well as moving work into ongoing production, such as maintaining a live database that updates automatically.
It’s also worth noting that while these two languages are most common for public sports analytics work, plenty of others are used in other situations. In a work environment where a company has databases set up, SQL is often more common than either. In addition, more specialized work often utilizes a variety of other tools or languages, such as Tableau or Scala. And to the horror of some data scientists, I think it’s fine to do a first project in Excel and worry about the coding later. You can also move back and forth between Excel and R/Python, using each for the parts you are comfortable with. For all of these, focus on the one or (at most) two that will be most useful for your personal goals.
So, once you’ve picked out your language, how do you learn it? A quick Google search will show the many books, courses, and blog posts available, and most of them are totally fine. What I want to do here is to first talk about a general strategy for learning, and then only briefly offer suggestions for particular resources.
In my opinion the best learning plan is whatever will get you successfully doing what you want in the shortest amount of time.
I hope to write more about this in the future, but in short: have as limited a scope as possible to avoid getting overwhelmed, find the best sources to learn that particular area, and then structure your practice so you get feedback and then success as quickly as possible.
In practice, this means a few things. First, target your learning to exactly what is most useful. My biggest issue with most “learn to code” materials is how much theoretical background they make you cover before you are empowered to actually do what you want. You should absolutely learn about data structure types and why they differ, but invest the time into that only after you’ve done some of the data manipulation and plotting that got you interested in data science in the first place.
For hockey analytics, I suggest you learn the very, very basics of whichever language you choose (how to turn it on, how to load a dataset), and then focus on data exploration and visualization. The most crucial things you need are the abilities to work with data until it’s in the format you want, and to create visuals for both your own understanding and eventual publication. In particular, this means focusing your learning on the best libraries for those tasks and temporarily ignoring the many other things that can be done in Python or R. (A library is a set of functions within a programming language. There are several that I’ll mention below as relevant for data science work.)
- If you’re learning R, focus on:
  - dplyr for manipulating data
  - tidyr for cleaning and organizing data
  - ggplot2 for creating charts
- If you’re learning Python, focus on:
  - Pandas and NumPy for cleaning and manipulating data
  - Matplotlib for creating charts (Seaborn and Altair are also options here)
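To make the “manipulate, then visualize” workflow a bit more concrete, here is a small Python sketch using Pandas and Matplotlib. The players, teams, and numbers are made up for illustration, and the column names are my own assumptions rather than the layout of any real dataset. The same steps map closely onto dplyr and ggplot2 if you are working in R.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up skater data for illustration only; not real statistics.
skaters = pd.DataFrame(
    {
        "player": ["A", "B", "C", "D"],
        "team": ["TOR", "TOR", "MTL", "MTL"],
        "shots": [10, 6, 8, 4],
        "goals": [2, 0, 1, 1],
    }
)

# Manipulate: total shots and goals per team, then derive shooting percentage.
team_summary = (
    skaters.groupby("team", as_index=False)[["shots", "goals"]]
    .sum()
    .assign(shooting_pct=lambda df: 100 * df["goals"] / df["shots"])
)
print(team_summary)

# Visualize: a simple bar chart of shooting percentage by team.
fig, ax = plt.subplots()
ax.bar(team_summary["team"], team_summary["shooting_pct"])
ax.set_xlabel("Team")
ax.set_ylabel("Shooting % (made-up data)")
ax.set_title("Shooting percentage by team")
# fig.savefig("shooting_pct.png")  # uncomment to write the chart to disk
```

Grouping, summarizing, deriving a new column, and plotting the result covers a surprising share of day-to-day analysis, which is why these libraries are worth learning first.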
How should you go about learning these? I suggest a mix of formal instruction methods to learn best practices and opportunities to practice yourself so you can remember it and experience the errors that arise. Ideally, try to combine some sort of textbook or documentation with an online course that has interactive exercises, and then apply what you learned to your own hockey data as soon as possible. This may seem like overkill, but it is the best way to comprehensively digest the skills you need.
As for specific guides, I’ve personally had very positive experiences with the R for Data Science textbook (free online!), especially if you can pair it with a course or interactive tool, like Dataquest or Swirl. Manny Perry has also created an Intro to R course based on hockey data, which can be a great way to learn things using actual examples from the world of analytics. And Evan Oppenheimer wrote a series of posts explaining R for hockey analytics from the very basics of installation to some of the most essential functions. While I don’t have personal experience with them, I’ve heard good things about Automate the Boring Stuff (also free!) and Learn Python the Hard Way, two Python textbooks. Finally, there’s a larger list of resources here. Again, the most important thing is to organize your studies and pick chapters to focus on what is most relevant to you and your data science goals.
As you try to apply what you’ve learned to hockey data, you’ll inevitably encounter problems and get stuck. That’s okay! Computers are notoriously picky, and every analyst spends a ton of time getting things to work just right. The important thing is to learn how to troubleshoot well, as solving errors is its own useful skill. When something breaks, break your code into smaller parts to see exactly where the issue lies. Often, doing a Google search for your exact error message will lead to Stack Overflow threads discussing that exact problem. You may also find YouTube tutorials helpful for specific topics. Get a sense for where things go wrong and how to implement solutions you find elsewhere.
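The advice above about isolating where an error occurs can be sketched in Python as follows. The idea is to split one long calculation into small named steps and print the intermediate results, so that a bad value (here, a blank ice-time field) shows up at the step that produced it. The rows and field names are hypothetical, and this is just one way to structure things.

```python
# Hypothetical rows, as they might come out of a CSV reader (all values are strings).
rows = [
    {"player": "A", "goals": "2", "toi_minutes": "60"},
    {"player": "B", "goals": "1", "toi_minutes": ""},  # missing ice time
    {"player": "C", "goals": "0", "toi_minutes": "45"},
]


def parse_row(row):
    """Step 1: convert text fields to numbers, using None for blank ice time."""
    toi = float(row["toi_minutes"]) if row["toi_minutes"] else None
    return {"player": row["player"], "goals": int(row["goals"]), "toi_minutes": toi}


def goals_per_60(parsed):
    """Step 2: goals per 60 minutes, or None when ice time is missing or zero."""
    if not parsed["toi_minutes"]:
        return None
    return 60 * parsed["goals"] / parsed["toi_minutes"]


# Run each step separately and inspect its output before moving on.
parsed_rows = [parse_row(r) for r in rows]
print(parsed_rows)

rates = {p["player"]: goals_per_60(p) for p in parsed_rows}
print(rates)
```

If the blank field had instead been passed straight to `float()`, Python would raise a `ValueError`; running the parsing step on its own makes it obvious which row and which field caused it.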
One final note: I have not addressed either statistics or machine learning. The former is essential to most analysis, and the latter is increasingly popular. For statistics, I don’t have any particular recommendations, but an understanding of the basics will make your work more rigorous; online material is helpful, as is seeing the methods and pitfalls of past hockey work. For machine learning, don’t feel obligated to look into it, but if you want to explore the area, I’m particularly fond of Stanford’s free online course and accompanying textbook.
Whew! There’s a lot of material above, but don’t let it overwhelm you. Everyone is on their own journey to achieve their own goals at their own pace. My biggest fear is that this guide has made things seem too daunting. These are my suggestions, but none of it is required, and there’s nobody who will force you to do them. Take what helps and ignore the rest.
There are also plenty of ways in which people will disagree with some of the points I’ve outlined. That’s okay! Everyone’s experience is different. If you’re reading this and have additional suggestions, criticism, or other feedback, please do leave it as a comment on this post. Tweets are fleeting, and I sincerely hope that comments on this post make it a stronger overall resource.
The most important thing is to go try stuff. Do it now!
Don’t get paralyzed. Preparation is worthwhile, but doing work is always better than not doing it. The best way to learn more and create results is to get started; you’ll be surprised by how much opportunity there is for any smart person willing to put in the time, regardless of their prior background.
One of the best parts about doing analytics in sports is the large community. Blog posts, Twitter, and in-person conferences are great ways to share your work, brainstorm ideas, and get help when you are stuck. In particular, I’ve encouraged people to tweet with the hashtag #HockeyHelper if they’re stuck on any hockey analytics issue; others in the hockey stats community and I keep an eye out for that hashtag to help out.
[Special thank you to the Hockey Graphs team for offering feedback on this post]