# Deep Dive on Regularized Adjusted Plus Minus II: Basic Application to 2017 NBA Data with R

In our previous post, we introduced the theory behind Regularized Adjusted Plus Minus (RAPM) through an illustrative example. In this post, we walk through a vanilla-flavored methodology for building a RAPM model for NBA data: the data necessary, the required data manipulation process, and a methodology for determining the required hyper-parameters. All of this is done, by request of the readers, in R. Apologies ahead of time if my code is spaghetti; I rarely program in R. I will also provide screen caps of the code so it fits and is readable when you zoom in. If you’d like a copy of the code, feel free to message me.

## In Case You Missed It: Why RAPM

In our previous article, we asked how to identify player contribution with respect to scoring in a game. We provided a relatively simple example and walked through the most fundamental statistic: plus minus.

Identifying that the statistic really isn’t all that helpful for understanding player contribution, we set out to model how players interact using a basic regression model. This yielded Adjusted Plus Minus (APM). In this model, however, multicollinearity crushes us, and we either have to get creative with inputs or delete players in a strategic manner. The multicollinearity rears its ugly head in the form of over-inflated variances, so the resulting estimates aren’t statistically different from one another.

Instead, we control the degree of multicollinearity by introducing a little bias. This method is called ridge regression and yields the Regularized Adjusted Plus Minus (RAPM) model. We are able to control the impact of multicollinearity in an attempt to get stable and comparable numbers between players. We give up a degree of interpretability in the model, but are still able to perform prediction and can compare the relative contribution of players.

First, we have to define an observation of data. By asking the question, “who contributes the most to building a positive point differential for a particular team?” we are effectively asking which players are on the court and how we quantify positive point differential. We can either pose a possession model and treat every possession as an observation, or pose a stint model and treat every stint as an observation. The goal is to select an interpretation and proceed accordingly.

In the classic sense of RAPM, analysts tend to use stints as observations. A stint is a period of time during which the same set of ten players is on the court. Anytime a substitution is made, the stint ends and a new stint begins. For each stint, the number of possessions is counted. If this possession count is zero, then the stint has done nothing on the court and is not counted as an observation.

If a possession is logged, then we obtain a viable stint and start recording the point differential. Recall that a possession is ended by a converted last free throw, made field goal, defensive rebound, turnover, or end of period. For this exercise, we ignore end of period possessions.

Before we even begin, we start with the players. We need to know who is available before we begin. To this end, we can simply walk through every game and build a dictionary of players.

Once we know how many players we anticipate in all the games, we can initialize the data frame to pull in each stint.
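A minimal sketch of both steps, assuming each game’s play-by-play is a data frame whose lineup columns name the players on the court (the real data has ten such columns, a1–a5 and h1–h5; two per side are used here so the example stays small):

```r
# Hypothetical play-by-play rows; column names a1/a2 (away) and h1/h2 (home)
# stand in for the ten lineup columns in the real data.
pbp <- data.frame(
  a1 = c("Curry", "Curry"), a2 = c("Thompson", "Livingston"),
  h1 = c("Lillard", "Lillard"), h2 = c("McCollum", "McCollum"),
  stringsAsFactors = FALSE
)
lineup_cols <- c("a1", "a2", "h1", "h2")

# The "dictionary" of players: every distinct name seen in any lineup column,
# accumulated over all game files.
players <- sort(unique(unlist(pbp[lineup_cols], use.names = FALSE)))

# Initialize the stint data frame: one indicator column per player, plus the
# point differential and possession count for each stint.
stints <- data.frame(matrix(numeric(0), nrow = 0, ncol = length(players) + 2))
names(stints) <- c(players, "pts_diff", "possessions")
```

In the full version, the `players` vector is grown by looping this over all 1,230 game files before the stint frame is allocated.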

Now we march through every file, build stints and count possessions. Many folks like to do this using time. Instead, I partition on the type of play to obtain a smaller matrix and transition through the possession ending, and non-possession ending scoring situations.

There is a list of columns in the data indicating the away players on the court and the home players on the court. I use these to identify stints. Anytime this ten-some changes, I check to see whether there were any possessions and then form the observation.
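The stint-break check can be sketched as below, again assuming lineup columns like a1–a5 and h1–h5. Since order within a lineup does not matter, we compare the sets of names rather than the columns position by position:

```r
# Returns TRUE if the two play-by-play rows list the same on-court players.
same_lineup <- function(prev_row, curr_row, lineup_cols) {
  setequal(unlist(prev_row[lineup_cols]), unlist(curr_row[lineup_cols]))
}

prev <- list(a1 = "Curry", a2 = "Thompson")
curr <- list(a1 = "Thompson", a2 = "Curry")  # same players, order swapped
same_lineup(prev, curr, c("a1", "a2"))       # TRUE: still the same stint
```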

I initialize the stint using the first action of the game.

Then we walk through all shot/defensive rebound/turnover/free throw situations. First things first: as the next action occurs, I check the current ten-man rotation and see if it has changed.

If it hasn’t, we go right into checking possessions and counting points.

We see the possession-ending situations. If points changed for a team, we add them to the stint. In all cases, we increment possession counts.
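A hypothetical accumulator for one stint illustrates the bookkeeping: any change in the margin is added, and the possession count is bumped on possession-ending events. The field names here are mine, not from the actual data:

```r
# Update a stint given one event's home/away points and whether it ended a possession.
update_stint <- function(stint, event) {
  stint$pts_diff <- stint$pts_diff + (event$home_pts - event$away_pts)
  if (event$possession_ending) {
    stint$possessions <- stint$possessions + 1
  }
  stint
}

stint <- list(pts_diff = 0, possessions = 0)
stint <- update_stint(stint, list(home_pts = 2, away_pts = 0, possession_ending = TRUE))
stint <- update_stint(stint, list(home_pts = 0, away_pts = 3, possession_ending = TRUE))
# stint now records a -1 differential over 2 possessions
```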

### Free Throws are Miserable. Not For Possessions But Stints.

We have to manage free throws. In order for a possession to end, the shooter must convert the final free throw of the bunch. In this case we check how many free throws a shooter gets and compare to which number of free throw they are on.

Here we keep track of time as substitutions can be made between free throws. If this is the case, we cannot write the stint observation until all free throws are concluded. This is due to free throws being awarded to the stint prior to changing.

Unfortunately, if a team substitutes and the last free throw is missed with an offensive rebound, we double count the possession. This is the nature of the trade-off: we either divide by a slightly larger number or throw away points. We could get tricky and count the possession as one-half possession for each stint, but I didn’t do that here.

Paying close attention to the free throw block of code, there are really two sub-blocks. The first block checks to see if the free throw is the last free throw; if it isn’t, we just check the score. The second block handles the case where the free throw is indeed the final free throw and determines whether it is possession ending.
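The two sub-blocks can be sketched as follows, with made-up field names: `ft_number` is which free throw this is, `ft_total` how many were awarded, and points are credited from the home shooter’s perspective:

```r
# Handle a single free throw event for the current stint (home shooter assumed).
handle_free_throw <- function(stint, ft) {
  if (ft$ft_number < ft$ft_total) {
    # Block 1: not the last free throw -- only the score can change.
    if (ft$made) stint$pts_diff <- stint$pts_diff + 1
  } else {
    # Block 2: the final free throw. A make both scores and ends the
    # possession; a miss leaves the possession to the ensuing rebound event.
    if (ft$made) {
      stint$pts_diff <- stint$pts_diff + 1
      stint$possessions <- stint$possessions + 1
    }
  }
  stint
}

stint <- list(pts_diff = 0, possessions = 0)
stint <- handle_free_throw(stint, list(ft_number = 1, ft_total = 2, made = TRUE))
stint <- handle_free_throw(stint, list(ft_number = 2, ft_total = 2, made = TRUE))
# both makes score, but only the second ends the possession
```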

When substitutions are made, we have to check free throws first. This is the only situation where a previous stint gets credit for points when those players are no longer on the court. To ensure the stints are indeed correctly called out, we simply check the times at which free throws are taken. Since time cannot advance during a free throw, the times match and we know to add the points to the previous stint.

We perform the same check when free throws occur immediately after a substitution. If the time matches, then the free throws are added to the prior stint. This handles time outs called after fouls.

### Save Off the Stint

Now that the free throw situation is over, we can save off the stint.

Before we move on in the file, we check whether this new line-up has any non-free-throw activity, and start populating the new stint.
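Writing out a finished stint can be sketched as below, under one common sign convention (home players on the court get +1, away players −1, and the response is the home-minus-away differential scaled to 100 possessions):

```r
# Append one finished stint as a row: player indicator columns plus response y.
save_stint <- function(stints, players, home_five, away_five, pts_diff, poss) {
  row <- setNames(as.list(rep(0, length(players) + 1)), c(players, "y"))
  row[home_five] <- 1
  row[away_five] <- -1
  row$y <- 100 * pts_diff / poss  # point differential per 100 possessions
  rbind(stints, as.data.frame(row))
}

players <- c("Curry", "Thompson", "Lillard", "McCollum")
stints <- save_stint(NULL, players,
                     home_five = c("Curry", "Thompson"),
                     away_five = c("Lillard", "McCollum"),
                     pts_diff = 4, poss = 8)
stints$y  # 50: a +4 margin over 8 possessions, per 100 possessions
```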

### Voila, It’s Done!

After iterating through all 1230 NBA games, we obtain a stint matrix where the response is point differential per 100 possessions. This results in slightly over 34,000 stints for the 2017 NBA season. Now we are able to proceed with APM and RAPM.

Recall that if we performed straight linear regression, we would have an uninvertible matrix. To remedy this, we trawl through nba.stats.com and locate all players who played fewer than x minutes. In this case, we eliminate all players who played fewer than 125 minutes, and then again with fewer than 250 minutes.
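The filter itself is simple; here is a sketch with a made-up minutes lookup standing in for the scraped totals:

```r
# Hypothetical player -> total minutes lookup, scraped separately.
minutes <- c(Curry = 2638, Lillard = 2694, DeepBench = 90)

# Keep only players at or above the minutes cutoff (125 here, then 250).
keep <- names(minutes)[minutes >= 125]
# Columns for dropped players are then removed from the stint matrix before fitting.
```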

Sorry, I cut off the names of all the players. It’s a long list.

Now we can run the APM model, which is just a regression.
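APM is ordinary least squares of the per-100 differential on the player indicator columns. A toy sketch with two “players” over four stints (one common choice drops the intercept, as below):

```r
# Toy stint design: p1, p2 are player indicators; y is differential per 100 possessions.
df <- data.frame(p1 = c(1, 1, -1, 0),
                 p2 = c(0, -1, 1, 1),
                 y  = c(5, 2, -3, 4))

apm <- lm(y ~ . - 1, data = df)  # plain regression, no intercept
coef(apm)                        # each coefficient is that player's APM
summary(apm)                     # estimates, standard errors, p-values
```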

The summary yields:

We see that the variance is still too large for all the players, as indicated by the associated p-values. This is no different from the example in our last post. Despite this, let’s take a look at the top 20 players:

In case you can’t zoom in, that’s (in order):

1. Damian Lillard
2. Cody Zeller
3. George Hill
4. Stephen Curry
5. Beno Udrih
7. JJ Redick
8. Klay Thompson
9. Lou Williams
10. Darren Collison
11. Harrison Barnes
12. James Harden
13. Paul Millsap
14. CJ McCollum
15. Karl-Anthony Towns
16. Kyle Lowry
17. LeBron James
18. Devin Booker
19. Nikola Jokic
20. Dirk Nowitzki

Now, mind you, while this list is not surprising, there are some questionable additions such as Beno Udrih and Cody Zeller. Take that back: I thought Zeller was questionable. One team exec told me “that’s pretty reasonable.”

Despite this, the r-squared (no relation to Squared 2020) is an abysmal 0.0245. Compare this to Rosenbaum’s equally abysmal 0.15 in his studies. Sure, we get answers, but the multicollinearity is so high that further action must be taken.

So let’s move to ridge regression. In this case, we need to identify a proper lambda. To do this, we turn to scree plots! A scree plot is a cross-validation tool that identifies the predictive performance of a model. Cross-validation, in turn, is a method for splitting our data into a test set and a train set. We build the regression model on the training set and apply it to the test set. By walking over all possible combinations of test and train sets, we obtain a jackknife measure of error associated with the model. We perform this for every lambda in ridge regression and obtain a function of lambda and jackknifed error. Plotting this gives the scree plot.

Ideally a scree plot will have a convex shape, so we can find a minimum. In many cases, this never happens. Therefore we look for areas that flatten out. This is entirely subjective.
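One way to produce such a plot is with `MASS::lm.ridge` (MASS ships with R), sweeping a lambda grid; its GCV error is a stand-in for the jackknifed error described above, and the toy data here replace the stint matrix:

```r
library(MASS)  # for lm.ridge

set.seed(1)
X <- matrix(rnorm(200 * 10), 200, 10)      # toy stand-in for the stint design
y <- drop(X %*% rnorm(10) + rnorm(200))

lambdas <- 10^seq(-2, 4, length.out = 50)  # the lambda grid to sweep
fit <- lm.ridge(y ~ X - 1, lambda = lambdas)

# The "scree plot": cross-validation error as a function of lambda.
plot(lambdas, fit$GCV, log = "x", type = "l",
     xlab = "lambda", ylab = "GCV error")

best <- lambdas[which.min(fit$GCV)]  # or eyeball where the curve flattens out
```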

Taking a look at the jackknife error, we don’t see much variation. Hence we will take roughly lambda = 1000. This equates to column ten of the RAPM output.

The ordering here is:

1. Damian Lillard
2. Stephen Curry
3. JJ Redick
4. Klay Thompson
5. George Hill
7. Paul Millsap
8. Lou Williams
9. Cody Zeller
10. Harrison Barnes
11. LeBron James
12. Darren Collison
13. Kyle Lowry
14. CJ McCollum
15. Nikola Jokic
16. Karl-Anthony Towns
17. Devin Booker
18. Kawhi Leonard
19. Isaiah Thomas
20. James Harden

Here we see names we expect, with a couple exceptions: Darren Collison and Lou Williams. I’ll leave Cody Zeller alone this time. While these players contributed significantly to their teams’ successes (Sacramento’s big-time run towards 8th in January before the Boogie trade, and Lou’s carrying of the Lakers before shipping to Houston), they cannot be adequately viewed as top-20 talent in the league. Therefore we can perform further analysis.

We only looked at point differential. What happens if we change point differential to rewards? Here, we replace point differential per 100 possessions with positive activity per possession. And instead of stints, we use possessions. Further, we apply a ridge regression. What do we obtain here?

This is a similar list, with similar attributes; such as Solomon Hill climbing up the ranks. We could use these ranking lists and then perform a rank-aggregation analysis such as Kemeny-Young to obtain a listing that combines both positive activity on the court and point differential.

## Next Steps…

We see that RAPM is a great improvement on plus-minus models, but many of its assumptions fail to be met. Ignoring this, we still obtain understandable, somewhat non-interpretable values that adequately measure player contribution.

Advanced models build off this and investigate multi-linear models, where player statistics are incorporated. Some use different definitions of responses, like we summarized above. Some use different input schemes. Whichever the methodology, the process is almost identical: build the data set, identify characteristics of the data to perform model assessment, and test the model.

That said, what do you have in mind for a model?

## 3 thoughts on “Deep Dive on Regularized Adjusted Plus Minus II: Basic Application to 2017 NBA Data with R”

1. Gptp20 says:

Would it be possible to share the data you have for this example?

2. Hey Justin,

I just want to say that your blog posts are fantastic (though I’ll admit I only understand half of everything). I thought I’d take a shot and ask you a question: what would be a good way of going about finding the effect of a player’s RAPM on the spread of a game?