# Testing the Quality of a Binary Classifier: ROC Curves

Let’s suppose that we have a methodology for classifying players into Hall-of-Fame status. This methodology can be of any type: it can be a random forest that uses proximity matrices or it can be a simple measure that uses a threshold, such as Kidd Score. Either way, the result is the same: a certain number of players are classified as Hall-of-Fame players and the rest are not. One goal of predicting Hall-of-Fame players is to find methods that adequately separate Hall-of-Fame players from non-Hall-of-Fame players and then apply those methods to up-and-coming players. While this requires some level of predictive analytics, such as forecasting, the thought process is that if a team discovers indicators for finding the next Magic Johnson, then it would like to execute a plan to acquire such a player.

This means that our methodology needs to perform well in classifying Hall-of-Fame players. So how do we go about looking at the performance of such a methodology? In this article, we investigate a common machine learning technique and see how it applies to a given algorithm for classifying Hall-of-Fame players. This, in turn, allows us to understand the quality of our metric.

Here, we will use the Kidd Score, as this classification metric is simple to understand and translates easily into an illustration. We also provide the Python code, because, you know… learning is fun!

## Crash Course: Kidd Score

Kidd Score is an analytic developed by Sixers Science (2017) that multiplies assists per 75 possessions by rebounds per 75 possessions. This cross-product statistic identifies the contributions of NBA players on off-ball possessions. A high score indicates that a player is either successful in both assists and rebounds or wildly successful in one of those categories.
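As a minimal sketch of that definition, assuming the statistic is exactly the product described above: the function names `kidd_score` and `per100_to_per75` are hypothetical, and the conversion assumes that per-100-possession inputs rescale linearly to per-75. Any further normalization applied by the original authors is not described here, so values from this sketch may not sit on the same scale as the scores quoted in this article.

```python
def per100_to_per75(per100: float) -> float:
    """Rescale a per-100-possession rate to per-75 possessions (assumed linear)."""
    return per100 * 0.75


def kidd_score(ast_per75: float, reb_per75: float) -> float:
    """Product of assists per 75 possessions and rebounds per 75 possessions."""
    return ast_per75 * reb_per75


# Made-up stat line: 8.0 assists and 10.0 rebounds per 100 possessions.
example = kidd_score(per100_to_per75(8.0), per100_to_per75(10.0))
print(example)
```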

As noted in a recent Sixer Science podcast (#9), the analytic performs relatively well in discriminating Hall-of-Fame players. Looking at the players with Kidd Scores above 7.0, many Hall-of-Fame players make that list.

Here, instead of looking at a “Top N” list and saying that holds weight, let’s actually identify how well this classifies players. Here are the Python steps for generating Kidd Scores from per-100-possessions data obtained from Basketball Reference.

### Finding Current Players Not Eligible for HoF

This chunk of code opens the file, checks the year, and identifies players who have played within the last five years. These players cannot be in the Hall-of-Fame, so we eliminate them from the system. Similarly, we hold a CurrentKidds list of these players to be sure to remove their seasons previous to 2013 as well.

This allows us to look only at players who have labels, as non-eligible players cannot have a HoF label.
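A minimal sketch of this filtering step, not the original listing: the `SeasonRow` record, the `drop_ineligible` function, and the in-memory rows are assumptions for illustration, since the layout of the original file is not shown here. The 2013 cutoff follows the text above.

```python
from typing import Iterable, List, NamedTuple


class SeasonRow(NamedTuple):
    """One player-season (hypothetical layout)."""
    player: str
    year: int    # year in which the season ended
    kidd: float  # Kidd Score for that season


def drop_ineligible(rows: Iterable[SeasonRow], cutoff_year: int = 2013) -> List[SeasonRow]:
    """Remove every season of any player who appeared in cutoff_year or later."""
    rows = list(rows)
    # Plays the role of the CurrentKidds list: players too recent to be eligible.
    current_players = {row.player for row in rows if row.year >= cutoff_year}
    return [row for row in rows if row.player not in current_players]


# Made-up rows for illustration only.
rows = [
    SeasonRow("Player A", 1995, 6.2),
    SeasonRow("Player B", 2011, 4.0),
    SeasonRow("Player B", 2014, 5.1),  # makes every Player B season ineligible
    SeasonRow("Player C", 2012, 3.3),
]
eligible = drop_ineligible(rows)
print([row.player for row in eligible])
```

Collecting the recent players first and filtering second is what removes a current player's older seasons too, which is the point of keeping a separate list.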

### Split HoF Players and non-HoF players

This chunk of Python code operates in the same loop and picks apart Hall-of-Fame players from non-Hall-of-Fame players, storing the Kidd Scores of each group. Notice that I personally scaled the Kidd Scores by percentage of games played. This was used to clean the data, as certain players, like Danuel House, post amazing Kidd Scores due to seeing only minutes of action and getting 1-2 assists and rebounds. So I got that noise out of here.
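A minimal sketch of the split and the games-played scaling, under stated assumptions: the names `scaled_kidd` and `split_by_label` are hypothetical, the rows are made up, and an 82-game season is assumed as the denominator. The adjusted Jokic figures quoted later in this article are consistent with this kind of scaling if he played 73 and 80 games in those two seasons, which is an assumption not stated in the text.

```python
from collections import defaultdict

SEASON_GAMES = 82  # assumption: length of a full regular season


def scaled_kidd(raw_kidd, games_played, season_games=SEASON_GAMES):
    """Scale a raw Kidd Score by the fraction of the season actually played."""
    return raw_kidd * games_played / season_games


def split_by_label(rows, hof_names):
    """Split (player, raw_kidd, games_played) rows into HoF and non-HoF dicts."""
    kidd_hof, kidd_no_hof = defaultdict(list), defaultdict(list)
    for player, raw_kidd, games_played in rows:
        bucket = kidd_hof if player in hof_names else kidd_no_hof
        bucket[player].append(scaled_kidd(raw_kidd, games_played))
    return dict(kidd_hof), dict(kidd_no_hof)


# Made-up rows: a 3-game season with a gaudy raw score is scaled down heavily.
rows = [("Player A", 8.0, 82), ("Player B", 12.0, 3), ("Player A", 6.0, 41)]
kidd_hof, kidd_no_hof = split_by_label(rows, hof_names={"Player A"})
print(kidd_hof, kidd_no_hof)
```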

## Classification: Sensitivity and Specificity

In every binary classification process, we become interested immediately in sensitivity and specificity. The reason for this care comes from the hypothesis test that we propose for classification. In a hypothesis test, we have what is called type I error, which is the “bad error” that arises when the data truly follow a set of assumptions but, upon observing the data, we reject those assumptions. For shame. Similarly, from hypothesis tests we have type II error, which arises when the assumptions are indeed not true, but the data look enough like they follow the assumptions that we fail to reject them. This error isn’t as bad, but we’d like to control it as well.

Tests of hypotheses relate to machine learning classification through the error process. Sensitivity is the proportion of the group of interest that is classified correctly. This term is also known as recall or the true-positive rate.

Here, our hypothesis test states a relationship that identifies a piece of data as a Hall-of-Fame player. In the case of Kidd Score, this is a value above a score of K. In the Sixer Science podcast, they use the value of K = 7, which shows some really good names!

Therefore, sensitivity identifies the percentage of Hall-of-Fame players correctly classified. This effectively mirrors the type I error of the test: each Hall-of-Famer we fail to classify as one is a type I error, so sensitivity is one minus the type I error rate. We have Hall-of-Famers. We built an analytic that classifies Hall-of-Famers. And sensitivity identifies how well we predicted Hall-of-Famers.

Similarly, specificity is the proportion of the group of non-interest that is classified correctly. In the Kidd Score context, this means we ensure that players who are not in the Hall-of-Fame are not accidentally classified as being in the Hall-of-Fame. Whenever this error occurs, we have what is called a false positive. Specificity is therefore the percentage of non-Hall-of-Fame players correctly classified, while the false-positive rate is the percentage of non-Hall-of-Fame players incorrectly classified as being Hall-of-Fame players. Think of this as akin to type II error: the methodology is built around the Hall-of-Fame players, and it is the class the assumptions aren’t geared towards that is being incorrectly classified.

Therefore, in a binary classification process, we are interested in maximizing sensitivity while minimizing the false-positive rate. To balance the two, we change the threshold score until we find a happy medium. This value is then taken to be the optimal threshold for our classifier.
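To make these quantities concrete, here is a minimal sketch for a single threshold K. The function names and the scores are made up for illustration; a player is classified as Hall-of-Fame when the score is strictly above K, matching the "above a score of K" wording used earlier.

```python
def sensitivity(hof_scores, k):
    """Share of Hall-of-Fame scores classified as Hall-of-Fame (score > k)."""
    return sum(score > k for score in hof_scores) / len(hof_scores)


def false_positive_rate(non_hof_scores, k):
    """Share of non-Hall-of-Fame scores wrongly classified as Hall-of-Fame."""
    return sum(score > k for score in non_hof_scores) / len(non_hof_scores)


def specificity(non_hof_scores, k):
    """Share of non-Hall-of-Fame scores correctly left out (1 - FPR)."""
    return 1.0 - false_positive_rate(non_hof_scores, k)


# Made-up Kidd Scores for illustration only.
hof = [8.0, 6.5, 7.2, 3.1]
non_hof = [2.0, 7.5, 1.1, 4.0, 0.5]
K = 7.0
print(sensitivity(hof, K), false_positive_rate(non_hof, K), specificity(non_hof, K))
```

With these made-up numbers, two of four Hall-of-Fame scores and one of five non-Hall-of-Fame scores exceed K = 7.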

Note that we didn’t state how well it classifies, nor did we mention that it is a good classifier. We merely stated, for the given methodology, it is the best we are going to get. And this leads us to a Receiver Operating Characteristic (ROC) Curve.

## Receiver Operating Characteristic (ROC) Curve

A ROC curve is a graphical tool that allows a data scientist to look at the quality of their classification procedure. It can also be used as a tool to help compare competing classification models. In this case, we will perform two classification procedures and compare them using ROC Curves.

Graphing a ROC curve is simple. The x-axis of a ROC curve is the false-positive rate. This value ranges between zero and one. A value of zero means that no players outside of the Hall-of-Fame are classified as being in the Hall-of-Fame. The y-axis of a ROC curve is the sensitivity. This value also ranges between zero and one. A value of zero means that every Hall-of-Fame player is incorrectly classified. This is a disaster.

### Building a ROC Curve

To build a ROC curve, we start by looking at the threshold that discriminates the players. For Kidd Score, this is the score K. Here, we set a linear space of possible thresholds and walk through the thresholds. For example, if we consider a score of zero, we see that every player is classified as a Hall-of-Fame player! This is bad news. Here, our sensitivity is a perfect 100%, but our false-positive rate is an atrocious 100%. Therefore a score of zero gives us the upper-right-most point in the ROC curve: (1.0, 1.0).

Correspondingly, suppose we have a value of K = 15, a score no one surpasses. Then every player is seen as not being in the Hall-of-Fame. This results in a sensitivity of 0% and a false-positive rate of 0%. This is the lower left-hand corner of the plot.

Continuing in this process, we obtain a sequence of points drawing the remainder of the ROC curve. Now, we consider two competing methodologies using Kidd Score. The first methodology states that any player with a single-season Kidd Score over K is in the Hall-of-Fame. The second methodology states that any player with a career average Kidd Score over K is in the Hall-of-Fame. Let’s break these two methodologies down.

### Raw Seasons

The first method is to consider whether a player has a single season above a score of K. For example, if we consider Nikola Jokic of the Denver Nuggets, Jokic posted a single-season Kidd Score of 9.1217 last season (adjusted to 8.1205 when considering games played), which followed his rookie season of 7.0795 (adjusted down to 6.9068 when considering games played). With K = 7, this indicates that Jokic spent one season as a Hall-of-Fame player and the other not. If Jokic were eligible and not in the Hall-of-Fame, this would be a 1/2 contribution to the false-positive rate.

Here is the Python code for raw seasons:
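A minimal sketch of the threshold walk for raw seasons might look like the following; it is a reconstruction under assumptions rather than the original listing. The function name `roc_points`, the season scores, and the grid of 301 thresholds between 0 and 15 are all made up for illustration. Every season is treated as its own data point, so one player can contribute to both the correct and incorrect counts.

```python
import numpy as np


def roc_points(pos_scores, neg_scores, thresholds):
    """Return (fpr, tpr) arrays, one entry per threshold K (classify if score > K)."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    fpr, tpr = [], []
    for k in thresholds:
        tpr.append(np.mean(pos > k))  # sensitivity at this K
        fpr.append(np.mean(neg > k))  # false-positive rate at this K
    return np.array(fpr), np.array(tpr)


# Made-up season-level Kidd Scores (already scaled by games played).
hof_seasons = [8.1, 6.9, 5.2, 9.4, 3.3]
non_hof_seasons = [1.2, 4.8, 2.5, 0.7, 6.1, 3.0]

thresholds = np.linspace(0.0, 15.0, 301)  # linear space of candidate K values
fpr_raw, tpr_raw = roc_points(hof_seasons, non_hof_seasons, thresholds)
print(fpr_raw[0], tpr_raw[0], fpr_raw[-1], tpr_raw[-1])
```

Because the comparison is strict, a season with a score of exactly zero would not be classified as Hall-of-Fame at K = 0; starting the grid slightly below the smallest score avoids that edge case.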

Before we draw the plot, let’s look at the ROC curve for average seasons.

### Average Seasons

In the average season case, we simply build the dictionaries, KiddHOF for Hall-of-Fame players and KiddNoHOF for non-Hall-of-Fame players. The keys are the player names and the values are arrays of Kidd Scores over their career. In this case, we again throw out players who are not eligible for the Hall-of-Fame. Before we build the ROC curve, we must take the average Kidd Score for each player. This is given by:
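As a minimal sketch of the averaging step, assuming dictionaries shaped as described above (player name mapped to a list of season scores): the function name `career_averages` and the sample data are hypothetical.

```python
from statistics import mean


def career_averages(kidd_by_player):
    """Map each player to the mean of their season Kidd Scores."""
    # Players with no recorded seasons are skipped to avoid averaging an empty list.
    return {
        player: mean(scores)
        for player, scores in kidd_by_player.items()
        if scores
    }


# Made-up dictionaries in the shape of KiddHOF and KiddNoHOF.
KiddHOF = {"Player A": [8.0, 6.0], "Player B": [5.0]}
KiddNoHOF = {"Player C": [2.0, 3.0, 4.0], "Player D": []}

avg_hof = career_averages(KiddHOF)
avg_no_hof = career_averages(KiddNoHOF)
print(avg_hof, avg_no_hof)
```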

Then we perform the exact same K walk in building the sensitivity and false-positive rate for the average Kidd Scores. This is performed by:
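A minimal sketch of the same walk applied to career averages, again as an illustration rather than the original listing. Here each player contributes exactly one point. The name `roc_walk` and the sample averages are assumptions, and the loop over K is replaced with NumPy broadcasting, which computes the same rates.

```python
import numpy as np


def roc_walk(avg_hof, avg_no_hof, k_values):
    """Return (fpr, tpr) for dicts of career-average scores, one entry per K."""
    pos = np.fromiter(avg_hof.values(), dtype=float)
    neg = np.fromiter(avg_no_hof.values(), dtype=float)
    k = np.asarray(k_values, dtype=float)[:, None]  # column of thresholds
    tpr = (pos[None, :] > k).mean(axis=1)  # share of HoF averages above each K
    fpr = (neg[None, :] > k).mean(axis=1)  # share of non-HoF averages above each K
    return fpr, tpr


# Made-up career averages for illustration only.
avg_hof = {"Player A": 7.0, "Player B": 5.0, "Player C": 3.0}
avg_no_hof = {"Player D": 1.0, "Player E": 4.0}

fpr_avg, tpr_avg = roc_walk(avg_hof, avg_no_hof, [0.0, 4.5, 15.0])
print(fpr_avg, tpr_avg)
```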

Now we are ready to plot!

### Plotting and Interpreting the ROC Curves

Plotting is simple in Python. We just use matplotlib.pyplot with the following code:
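A minimal plotting sketch, assuming each curve is available as a pair of false-positive-rate and sensitivity sequences. The helper name `plot_roc` and the labels are hypothetical. With Matplotlib's default color cycle, the first two curves come out blue and orange and the diagonal comes out green.

```python
import matplotlib.pyplot as plt


def plot_roc(curves, ax=None):
    """Plot ROC curves from a dict mapping label -> (fpr, tpr)."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 6))
    for label, (fpr, tpr) in curves.items():
        ax.plot(fpr, tpr, label=label)
    # Diagonal reference line: a classifier no better than random chance.
    ax.plot([0, 1], [0, 1], linestyle="--", label="Random chance")
    ax.set_xlabel("False-positive rate")
    ax.set_ylabel("Sensitivity")
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.legend(loc="lower right")
    return ax
```

Calling `plot_roc({"Raw seasons": (fpr_raw, tpr_raw), "Career average": (fpr_avg, tpr_avg)})` followed by `plt.show()` would display the figure, where the four array names stand for whatever the threshold walks produced.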

This code yields the ROC curve of interest:

ROC curve showing the ability of raw season Kidd Score (blue) and career average Kidd Score (orange) in predicting Hall-of-Fame status of NBA Players. Green is random chance.

Here, we see some good things! First off, there is predictive power in both using raw seasons and using career averages. To be clear, these plots were much uglier (and worse) without the game smoothing. That said, the career average tends to be better than the raw season breakdown.

To understand what optimal value of K should be used, we look for the point on each curve that is closest to the upper left corner. This corner indicates that we have zero false positives while obtaining perfect recall. This is the Holy Grail spot on the ROC curve.
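One way to sketch this search is to measure the Euclidean distance from each ROC point to the corner (0, 1) and keep the threshold with the smallest distance. The function name `closest_to_corner` and the sample points are made up, and Euclidean distance is an assumption about what "closest" means here.

```python
import numpy as np


def closest_to_corner(fpr, tpr, thresholds):
    """Return (K, fpr, tpr) for the ROC point nearest the upper-left corner (0, 1)."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    distance = np.hypot(fpr, 1.0 - tpr)  # distance from (fpr, tpr) to (0, 1)
    best = int(np.argmin(distance))      # first index wins if there is a tie
    return thresholds[best], fpr[best], tpr[best]


# Made-up ROC points for illustration only.
thresholds = [0.0, 3.0, 6.0, 15.0]
fpr = [1.0, 0.4, 0.1, 0.0]
tpr = [1.0, 0.9, 0.5, 0.0]

best_k, best_fpr, best_tpr = closest_to_corner(fpr, tpr, thresholds)
print(best_k, best_fpr, best_tpr)
```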

In order for a competing methodology to perform “better” than these methodologies, the corresponding ROC curve must dominate the above ROC curve. Domination here means that, at every false-positive rate, the dominating ROC curve lies at or above the other algorithm’s ROC curve.
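A rough sketch of such a domination check follows. Since two curves rarely share the same false-positive rates, this version linearly interpolates both onto a common grid, which is an assumption of the sketch rather than part of the definition. The names `upper_envelope` and `dominates`, the grid size, and the sample curves are hypothetical, and each curve is expected to include the (0, 0) and (1, 1) endpoints.

```python
import numpy as np


def upper_envelope(fpr, tpr):
    """Sort points by false-positive rate, keeping the best sensitivity at each one."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    xs = np.unique(fpr)  # sorted distinct false-positive rates
    ys = np.array([tpr[fpr == x].max() for x in xs])
    return xs, ys


def dominates(curve_a, curve_b, grid_size=101, tol=1e-12):
    """True if curve_a is at or above curve_b at every grid false-positive rate."""
    grid = np.linspace(0.0, 1.0, grid_size)
    tpr_a = np.interp(grid, *upper_envelope(*curve_a))
    tpr_b = np.interp(grid, *upper_envelope(*curve_b))
    return bool(np.all(tpr_a >= tpr_b - tol))


# Made-up curves, each given as (fpr, tpr).
curve_avg = ([0.0, 0.2, 1.0], [0.0, 0.8, 1.0])
curve_raw = ([0.0, 0.5, 1.0], [0.0, 0.6, 1.0])
print(dominates(curve_avg, curve_raw), dominates(curve_raw, curve_avg))
```

Two curves that cross will fail the check in both directions, in which case neither dominates the other.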

Using this definition, we see that the average Kidd Score approach is effectively better than the raw season Kidd Score.

### Word of Caution!

Note that we used a hard threshold for separation. In most cases, the classifier is an algorithm that must be fit to labeled data. In those cases, we must use a cross-validation method to construct the ROC curve. Since Kidd Score does not require this process, we are fine continuing in this manner, as there is no learned decision boundary!

### Back to Our Regularly Scheduled Program

Finally, we take a look at how well Kidd Score discriminates Hall-of-Fame players. While it is significantly better than random chance, there is a lot of room for improvement. For instance, the optimal thresholding score in the raw season case is a Kidd Score of 4.7605. This value manages to correctly classify 70.17% of Hall-of-Famers. However, it also incorrectly classifies 34.01% of non-Hall-of-Famers.

Similarly, looking at the career average scores, the optimal thresholding Kidd Score is 4.1761. Using this cut-off, 85.19% of Hall-of-Fame players are correctly classified! However, this comes at the cost of incorrectly classifying 30.30% of players that are not in the Hall-of-Fame.

What this shows is that Kidd Score is good at predicting Hall-of-Fame players, but its type II error still misses the mark a bit.

### Recency Bias?

That said, this may be a function of recency bias: in order to be in the Hall-of-Fame, a player must be out of the league for five or more years. After that, they must be voted in. For instance, the namesake player of the statistic, Jason Kidd, has not yet been eligible for the Hall-of-Fame, as his last NBA season was in 2013. He becomes eligible starting this season.

Similarly, players take time to be voted in; Mitch Richmond, for example, waited roughly 12 years to be inducted. Also, in recent times, players with less distinguished careers statistics-wise have been included due to their impact on the game: Drazen Petrovic and Sarunas Marciulionis are two examples. This is not to deny that they were capable players worthy of being in the Hall-of-Fame; however, their respective statistics are quite a bit lower than those of most Hall-of-Fame players. This in turn indicates that a pure statistics-based approach (statistics meaning accumulated values in particular playing categories) may not be the best way to approach this classification process.

In fact, if we look at the seasons in which Hall-of-Fame players played, we indeed see an incredible drop-off.

Number of Hall-of-Fame players playing in each NBA Season. There is a flat line of zero from 2013 through today because those players are not eligible for the Hall-of-Fame.

So how would you construct a classifier? In the linked post at the top, I used random forests, but left out a couple of key variables, such as international impact. That would have separated players like Drazen Petrovic from players like Alexander Volkov.

Either way, in order for your method to be better, you must dominate the ROC curves above. If you don’t… you lose.
