How to Debug Data Science Code

Think of everyone who has a talent you admire. Athletes, writers, anyone. If you were to ask each of them for the secret to their success, how many of them would be able to give the true answer? I’m not saying that they would deliberately lie. Rather, it’s just genuinely very hard to objectively assess oneself and turn natural implicit behaviors into explicit lessons that can be described to others.

Implicit lessons can be a barrier to people learning new skills: it’s much harder for students to learn something if their instructor doesn’t realize it’s something that needs to be taught. The best teachers are able to put themselves into the shoes of their students and convey the most important pieces of information.

One area of data science that is too often left implicit is troubleshooting. Everyone who writes code will get error messages. This is frustrating and can halt progress until solved. Yet most resources devoted to teaching new data scientists don’t discuss what to do, as if they’re expected to study enough to code everything correctly the first time and never encounter an unexpected error. You can find articles about common mistakes that data scientists make, but what about when you inevitably make an uncommon one? There are very few resources around how to debug broken code. (This one is quite nice, and these two are worth a read as well.) 

That’s what I’m hoping to partially remedy with this article. It’s far from the single canonical process for debugging, but I hope that it helps people get unstuck while they learn. The key points I want to convey are:

  • Every data scientist hits error messages regularly, and doing so as a new programmer is not a sign of failure
  • Isolate the issue by finding the smallest piece of code that creates the problem
  • The exact language of an error message can be extremely helpful, even if it doesn’t make sense
  • The internet is (only in this particular instance) your friend, and certain resources are especially helpful for solving problems

Shit Happens

First, an important high-level point before we dig into debugging tactics: if you’re writing code, you’re going to make mistakes no matter how skilled or experienced you are. Hitting errors is an inevitable part of the process.

This might seem obvious, especially to experienced coders, but many “learn to code” resources can mask it. When you’re copying lines from a textbook or working through an interactive course that can give you hints, you’re in an artificially easy coding environment. That’s great for getting started, but working on your own projects will be a different experience. 

It can be jarring to not know what to use or to have errors block your progress. It can make you feel like you’re failing or you didn’t learn the material well enough. But hitting errors is not a sign of poor coding ability. It is inevitable, no matter how long you’ve been programming. I hit them every single day. I make typos, or forget that I changed the input data, or apply the wrong syntax, or misunderstand how a new function is supposed to be used. And then I have various tricks and processes to fix these errors. So get frustrated, sure, but not discouraged.

Hitting errors is especially common precisely because data scientists are often trying out new things. While there are probably some people who rely on the same few techniques they know especially well, the field attracts people who want to be constantly learning new subjects or applying new packages. To do so, they have to get good at reading the documentation for new tools and figuring out how to apply those tools to their problems. This requires a certain mindset of experimentation and a willingness to overcome obstacles when they inevitably arise.

Since this is a hockey blog: sometimes you have to take a hit to make a play.

It can be painful to try something new and have it fail, but it’s often the right decision. And that hit isn’t going to go away no matter how good you get.

I’m Afraid I Can’t Do That

Often, you’ll first find out that your code has a bug because your computer will tell you. You’ll run some code, and instead of getting the output you want, you’ll see some red text saying that an error occurred.

It may not feel like it, but that text is your friend. That message is the most direct information you’re going to get about why the error is occurring. So be sure to take a look at it closely and see how much of it you can understand.

Sometimes you made such a common error that the message knows how to fix it, like this:

That’s straightforward enough.

Sometimes the error doesn’t explain the solution but makes the problem fairly clear, like this:

The explanation isn’t perfect, but you can get a general idea: you’re trying to smoosh two things together and R isn’t letting you because they’re different sizes. Now you know to check the two objects and see why they are different sizes. Are they supposed to be? If no, figure out which one is the wrong size and fix it. If yes, think about why you need to combine them and whether there is a better way (in this case, perhaps a join).
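Here is a rough sketch of that check-then-join idea. It is in Python rather than R, and the two small tables and their column names are made up for illustration. It assumes the sizes are supposed to differ, so it looks up matching values instead of gluing the tables side by side.

```python
# Hypothetical tables: one row per player, one row per team.
players = [
    {"player": "A", "team": "BOS"},
    {"player": "B", "team": "NYR"},
    {"player": "C", "team": "BOS"},
]
teams = [
    {"team": "BOS", "city": "Boston"},
    {"team": "NYR", "city": "New York"},
]

# Step 1: confirm the sizes really differ, then decide whether they should.
print(len(players), len(teams))  # 3 2

# Step 2: the sizes are supposed to differ here, so build a lookup keyed
# on the shared column and attach the matching city to each player (a join).
city_by_team = {t["team"]: t["city"] for t in teams}
joined = [{**p, "city": city_by_team[p["team"]]} for p in players]

for row in joined:
    print(row)
```

If a player had a team missing from the lookup, this version would raise a KeyError, which is itself a useful signal that one of the inputs is not what you expected.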

Sometimes the error will make no sense at all. That’s okay too! It’s still a term that you can search online, which we’ll discuss later in this article.

Each of these errors also provides another useful piece of information: where exactly in your code the error occurred. This starts a key area of investigation: isolating the bug’s root cause.

Isolation Plays

A key part of debugging is figuring out precisely what step caused the error and what is happening in that step. The error message will likely name the line that caused the error, which makes that piece easy.

One piece of advice: if the piece of code referenced in the error looks okay, check if the problem is just before or just after that piece. For example, this piece of SQL code:

from roster
limit 10

Might produce this error:

ParsingException: line 4:1: mismatched input 'from'. Expecting: '*', <expression>

The mistake isn’t actually on line 4, but on line 3. The extra comma on line 3 means that the computer gets confused once it reaches line 4.
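You can try this yourself with a small sketch like the one below. It uses Python’s built-in sqlite3 module, which is a different SQL engine from the one that produced the error above, so the wording of the message will differ. The column names are hypothetical. The point is the same: the error names the word from, while the fix is the comma before it.

```python
import sqlite3

# An in-memory database with a hypothetical roster table.
conn = sqlite3.connect(":memory:")
conn.execute("create table roster (player_name text, team text)")

# The trailing comma after "team" is the real mistake.
broken_query = """
select
  player_name,
  team,
from roster
limit 10
"""
fixed_query = broken_query.replace("team,", "team")

try:
    conn.execute(broken_query)
    error_message = None
except sqlite3.OperationalError as err:
    error_message = str(err)

# The message points at "from", not at the comma that caused the problem.
print(error_message)

# With the comma removed, the query runs (the table is empty, so no rows).
rows = conn.execute(fixed_query).fetchall()
print(rows)
```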

Now that you’ve gotten the broken line, dig into it further. If it’s especially complicated, break it down into pieces. Try to find the smallest possible step of code that recreates the error. Next, examine the inputs going into this piece. Typically, that’s going to be one or more data objects and one or more functions. Do the data objects look like what you expect, or did they get messed up in a previous step? Compare how you’ve written the function to examples in the documentation: are you setting it up correctly and using the right arguments? Often, you’ll find that an input has changed in a way you didn’t expect or you need to change the format of the function inputs to match what is expected.

If you’ve done this and are still hitting the error, you’ve made progress anyway. You’ve likely done most of the work towards creating a reproducible example, or reprex. This is essentially a super simple example that anyone else can copy and see the same error that you do. This can help you better understand what you’re dealing with and makes it easier to get help from someone else.
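As a loose illustration of shrinking a problem down, here is a sketch in Python. The data, the total_goals function, and the fails helper are all hypothetical. It keeps dropping rows as long as the error still shows up, and what is left at the end is a tiny input you could paste into a question.

```python
# Hypothetical data: one goal count was left as text instead of a number.
rows = [
    {"player": "A", "goals": 3},
    {"player": "B", "goals": 5},
    {"player": "C", "goals": "2"},
    {"player": "D", "goals": 1},
]


def total_goals(rows):
    """The step that is failing on the full dataset."""
    return sum(row["goals"] for row in rows)


def fails(rows):
    """Return True if total_goals raises a TypeError on these rows."""
    try:
        total_goals(rows)
    except TypeError:
        return True
    return False


# Try removing each row in turn; keep the removal if the error persists.
minimal = list(rows)
for row in list(minimal):
    candidate = [r for r in minimal if r is not row]
    if fails(candidate):
        minimal = candidate

# What remains is the smallest input found that still reproduces the error.
print(minimal)
```

In practice you would usually do this by hand rather than in a loop, but the logic is the same: remove something, rerun, and see whether the error is still there.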

Finally, official debugger tools can be helpful for isolating the key issue. This is particularly true if you are building your own functions and have “nested” behavior where functions are using other functions and you don’t know which layer is causing the problem. To be honest, while using a debugger is a best practice, I personally rarely use one. It seems that many other data scientists do not use them either. They’re much more common in other types of software engineering, so feel free to give them a shot but don’t feel beholden to them.
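If you want a taste of what a debugger tells you without starting an interactive session, here is a minimal sketch in Python using the standard traceback module. The three nested functions are hypothetical. It catches the error and lists which function called which, so you can see the layer where things actually broke.

```python
import traceback


def clean(value):
    # Innermost layer: converts text to a number.
    return float(value)


def summarize(values):
    # Middle layer: calls clean() on every value.
    return sum(clean(v) for v in values)


def report(values):
    # Outer layer: the function you actually called.
    return f"total: {summarize(values)}"


try:
    report(["1.5", "n/a"])
except ValueError as err:
    # Each frame is one layer of the call stack, outermost first.
    frames = traceback.extract_tb(err.__traceback__)
    call_chain = [frame.name for frame in frames]
    print(" -> ".join(call_chain))
    print("deepest layer:", call_chain[-1])
    print("message:", err)
```

For the interactive version, Python ships with pdb: putting breakpoint() inside a function pauses execution there, and pdb.post_mortem() lets you poke around after an exception.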

Ask Jeeves

So, you know where the error is occurring and what error the computer is giving, but you still don’t know how to fix the damn thing. There are still many options, and a lot of them are online resources. It sounds silly, but a key skill for a data scientist is knowing what to Google, finding the result that best applies to the problem at hand, and figuring out how to apply the online solution to your use case. 

What to Google? Start with the error message. If that doesn’t produce helpful results, try searching for your programming tool and the keywords describing the problem, like “R datasets different sizes won’t combine”. Including the name of the function or package can also help. If that doesn’t work, try a higher level description of what you want to do, like “R get info from one table for each row in another table”. This last approach is probably least likely to solve your error directly, but could show you a different approach to accomplishing what you want.

There are a few different types of results that tend to be most helpful:

Official documentation

Most packages and functions in R, Python, SQL, and other programming languages will have a website describing how they work and how to use them. On these sites, look at the description of the function to make sure it does what you want. Check the arguments that are used as the inputs of the function – are you using yours correctly? Are there other ones available that you are not using but need to add? These docs usually include simple examples of how to use the code correctly, which you can compare to your own code. 
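In Python, you can also pull up a function’s arguments and description without leaving your session. The sketch below uses the standard inspect module, with statistics.quantiles as a stand-in for whatever function is giving you trouble.

```python
import inspect
import statistics

# A stand-in for the function you are trying to use correctly.
func = statistics.quantiles

# The signature lists every argument the function accepts, with defaults.
signature = inspect.signature(func)
param_names = list(signature.parameters)
print(signature)
print(param_names)

# The docstring usually starts with a one-line description of the function.
doc = inspect.getdoc(func)
first_doc_line = doc.splitlines()[0]
print(first_doc_line)
```

Calling help(func) prints the same information in one go. Compare the argument names and defaults to how you are calling the function in your own code.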

Stack Overflow

Stack Overflow is a forum where people post questions and answers about programming. It’s almost certain someone has made the same mistake you did, which means it’s almost certain that there is a Stack Overflow post solving it. In my experience, these are almost always near the top of my search results, though the very top one may not be the one that most directly applies to my code.

Check a couple of different Stack Overflow posts. Read the question and see if the asker is going through a similar process as you. Does it seem like they want a similar output as you and are hitting the same problem? As you read, get familiar with the example data that they’re sharing. It will be important to understand which keywords are “the fake names made up for this example” and which keywords are “the crucial commands used in the code’s functions”.

Finally, check a couple of the top-rated solutions. Read through them to see if you understand how they work. Do they seem like they’d apply to your problem? If so, copy them over piece by piece into your code.

Blog Posts

Lots of people run great blogs talking about R, Python, and other data science skills. It’s quite possible that somebody has put together a tutorial that walks through using the precise tool that you’re trying out. I find these to be especially helpful when I’m experimenting with a package or practice area that’s new to me and I want a comprehensive overview of the most important pieces. My only hesitation is that these can get outdated fairly quickly. So definitely try them out, but be willing to move on.

Carry On My Wayward Code

This has been a very detailed list of steps to take when debugging code, but it’s still not complete. You’ll inevitably develop your own habits and tricks to solve problems. (I hope you share them in the comments of this post!) In closing, I just want to emphasize that debugging code is a skill that will improve with practice. You’ll start remembering the solution to common errors, you’ll be more confident in identifying exactly what is going wrong, and you’ll be quicker to figure out which possible solution is most likely to apply. So, if this is something you want to get good at, keep pushing.
