Regularized Adjusted Plus-Minus Part III: What Had Really Happened Was…

Over the previous couple of seasons, I have written extensively about how Regularized Adjusted Plus-Minus (RAPM) is constructed, what the assumptions really mean, and how we interpret the results. If you're curious for a refresher, feel free to remind yourself here. There's an example in there that clearly breaks down how various forms of adjusted plus-minus work. We will rehash some of the key points in an effort to understand the pitfalls of RAPM. And in this article, we are going to turn the crank and show explicitly why we need to be very careful when using this analytic.

Get ready for some math on this fine Christmas Eve!

Bayesian-Bayesian Process

Warning: You'll be working on a master's degree in Statistics here…

The RAPM process is simply a linear regression model with a penalty placed on the square of the coefficients. The idea is simple: adjusted plus-minus is a poor tool with bloated variances, thanks to a player design matrix that is effectively non-invertible. This near non-invertibility inflates the coefficients for each player and gives us a false representation of how players actually play. Again, this was clearly shown in the previous example.

To combat this bloat, we place a penalty weight on the square of each coefficient. This forces inflated coefficients to fall toward zero, while hopefully allowing true coefficients to stay relatively the same. It's almost a magical process, except that all we did was place a prior distribution on the multiple linear regression model (the adjusted plus-minus model) and let the design matrix take care of the rest!

Over the previous years, I've been asked over 100 times for this proof, and have even written it up on the walls of three… three… NBA front offices in an effort to identify what is really going on with RAPM. Here's the process…

Bayesian Proof

Start: Posterior Distribution of Player Weights ∝ Likelihood of Stint Ratings × Prior of Player Weights (centered at zero)
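To make that start concrete, here is a sketch of the derivation in my own notation (y is the vector of stint ratings, X is the stint design matrix, σ² is the stint-rating variance, and τ² is the prior variance; these symbols are mine and may differ from the original write-up):

```latex
y \mid \beta \sim \mathcal{N}(X\beta,\, \sigma^2 I), \qquad \beta \sim \mathcal{N}(0,\, \tau^2 I)

p(\beta \mid y) \;\propto\; \exp\!\left(-\tfrac{1}{2\sigma^2}\|y - X\beta\|^2\right)\exp\!\left(-\tfrac{1}{2\tau^2}\|\beta\|^2\right)

\beta \mid y \sim \mathcal{N}\!\left((X^\top X + \lambda I)^{-1} X^\top y,\;\; \sigma^2 (X^\top X + \lambda I)^{-1}\right), \qquad \lambda = \sigma^2/\tau^2
```

The middle line collapses into the last line by completing the square in β, which is the refresher below.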

We immediately see that the mean of the posterior distribution is exactly the ridge regression solution. Coding this up directly or applying sklearn.linear_model.Ridge, we will obtain the exact same coefficients. The key takeaways are these: we must assume that the offensive rating for a given stint follows a Gaussian distribution with identical variance across all stints, and the variance of the posterior distribution comes along for free!
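As a quick sanity check of that claim, here is a toy sketch comparing the closed-form posterior mean to sklearn's Ridge. The data below is random filler; only the algebraic equivalence matters.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy stand-in for a stint matrix (rows = stints, columns = players) and ratings.
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 0.0, 1.0], size=(6, 4))
y = rng.normal(100.0, 15.0, size=6)

lam = 2000.0  # the ridge penalty (lambda)

# Closed-form posterior mean: (X'X + lambda * I)^(-1) X'y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# sklearn's Ridge with alpha = lambda and no intercept returns the same coefficients.
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(beta_closed, beta_sklearn))  # True
```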

In case you are worried about that last step, we applied a completing-the-square argument. For a quick refresher, here's that process:

Complete the Square

Completing the Square: We identify what V and R are in an effort to write an equation as a weighted square with weight Q.
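The exact figure isn't reproduced here, but a generic version of the identity, with my own placement of Q, V, and R, looks like this:

```latex
\beta^\top Q \beta \;-\; 2 V^\top \beta \;+\; R
\;=\;
(\beta - Q^{-1}V)^\top Q\,(\beta - Q^{-1}V) \;+\; R \;-\; V^\top Q^{-1} V
```

Applied to the posterior above, Q = XtX/σ² + I/τ² and V = XtY/σ², so the weighted square is centered at (XtX + λ·I)⁻¹ XtY, which is exactly the ridge solution.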

Application to this Single Year

What Do the Components Mean???

There are several different ways to compute RAPM; there is no single true answer. Some folks will try to select the penalty so that predictive error is minimized, but no unique minimum theoretically exists thanks to the elbow-like shape of the error curve. Similarly, whatever scheme is enforced, the weight becomes stale over time and must be recomputed; and once it is recomputed, previous results become obsolete when compared to current results. But, for the sake of argument, we will treat this as a mere nuisance and ignore its existence.

Another way to end up with different RAPM scores is through the construction of the design matrix. This is the stint matrix obtained from possessions. Some old-school RAPM creators will not separate offense and defense and will compute RAPM directly. Some new-school RAPM creators will separate offense and defense and compute O-RAPM and D-RAPM. The difference between the models is fairly strong, but the results are similar. It's fairly intuitive to suggest that some players are more offensively inclined and some players are more defensively inclined. What's striking is that almost all RAPM creators sum the two values to obtain RAPM. When they do this, they are assuming that the player has played an equal number of possessions on both offense and defense. Oops.

Regardless, we can look at the components of the RAPM process. Above, we see that there are XtX, XtY, and lambda. The matrix X is the design matrix with N rows of stints and 2p+1 columns: a constant plus offensive and defensive indicators for each of the p players. In our set-up, we place a 1.0 in the first column (the constant) and a 1.0 if a player is on offense. We place a -1.0 if a player is on defense. The first half of the player columns are the offensive entries while the second half are the defensive entries. Of course, we will do what most people do and erroneously add the two RAPM values. Sure…. why not?
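As a concrete illustration of that set-up, here is a minimal sketch of how one row of the stint matrix might be assembled. The function name, player-id indexing, and toy lineup are my own conventions, not any published RAPM code.

```python
import numpy as np

def build_stint_row(offense_ids, defense_ids, n_players):
    """One row of the O/D stint matrix described above: column 0 is the
    constant, columns 1..p get +1.0 for offensive players on the court,
    and columns p+1..2p get -1.0 for defensive players on the court.
    (Player ids 0..n_players-1 are my own indexing convention.)"""
    row = np.zeros(2 * n_players + 1)
    row[0] = 1.0                          # constant column
    for pid in offense_ids:
        row[1 + pid] = 1.0                # offensive half of the matrix
    for pid in defense_ids:
        row[1 + n_players + pid] = -1.0   # defensive half of the matrix
    return row

# Toy usage: a 10-player pool with players 0-4 on offense and players 5-9 on defense.
print(build_stint_row([0, 1, 2, 3, 4], [5, 6, 7, 8, 9], n_players=10))
```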

The value XtX is the adjacency matrix. The diagonal is the number of stints (the rows of X) in which a player has appeared on offense or defense. The off-diagonal components identify the number of interactions between pairs of players. Teammates are on offense or defense together and have positive values. Opponents have negative values, as they are offense-defense pairings.

The value XtY is the ADDITIVE RATING across all stints. I emphasize additive ratings as we are adding ratings regardless of the number of possessions. As an example, suppose a lineup has played two stints with ratings of 200 and 100. Because each stint counts equally, the effective rating baked into XtY is 150. In truth, the rating is really 109.09, as the two ratings are derived from one stint with 2 points over 1 possession and another stint with 10 points over 10 possessions. As a flaw with RAPM, this is a commonly accepted atrocity whenever RAPM is computed. For this season, it happens A LOT.

Edit Note: If we introduce a diagonal matrix, W, with the number of possessions along the diagonal, we can rectify this additive rating problem. However, introducing this weighting may have unintended effects.
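For the curious, a possession-weighted version of the ridge solution, (XtWX + λ·I)⁻¹ XtWy, is one way to encode that edit note. The sketch below uses my own function name, and whether this weighting is appropriate is exactly the unintended-effects question raised above.

```python
import numpy as np

def weighted_rapm(X, y, possessions, lam):
    """Possession-weighted ridge: (X'WX + lambda * I)^(-1) X'Wy, where W is a
    diagonal matrix holding each stint's possession count. One possible fix for
    the additive-rating issue; not necessarily the right one."""
    W = np.diag(np.asarray(possessions, dtype=float))
    A = X.T @ W @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ W @ y)

# The example from above: ratings of 200 (2 points over 1 possession) and
# 100 (10 points over 10 possessions) for a single lone "player" column.
X = np.array([[1.0], [1.0]])
y = np.array([200.0, 100.0])
print(weighted_rapm(X, y, possessions=[1, 10], lam=0.0))  # ~109.09, the pooled rating
```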

Finally, the value of lambda controls the betas. This value is really the variance of the stint ratings divided by the variance of the prior distribution. For most RAPM calculations, particularly on ESPN, RAPM was being produced with a lambda value of 2000! This means the variance of the stint ratings is assumed to be 2000 times greater than the variance of the prior distribution. No rhyme or reason other than "it passed an eye test" across a broad range of cross-validation error. In fact, for a single season, selecting any value between 500 and 5000 is perfectly acceptable; hence 2000 is a subjective selection.

Edit Note: There was an argument brought forth by Joe Sill, the 2010 winner of the Sloan paper competition for RAPM, and he explicitly indicates that he boiled lambda down to ~2222 based on a cross-validation error and point differential per 100 possessions argument. It's fairly lengthy, and fair. However, for the above set-up for O/D-RAPM, a range of 500 to 5000 is still too broad to state that the same argument holds.

And in that presentation of RAPM on ESPN, they still threw out players using a minimum-minutes threshold. But as lambda goes to zero, we obtain adjusted plus-minus. And as lambda goes to infinity, we obtain all zeros for everyone. This means everyone must find a sweet spot for lambda to fall into.
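For reference, here is roughly what a cross-validation scan over that band looks like in code. The data is synthetic and the grid simply echoes the 500 to 5000 range above; sklearn's RidgeCV is my choice of tool, not necessarily what ESPN or anyone else used.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in for a stint matrix and its ratings.
rng = np.random.default_rng(1)
X = rng.choice([-1.0, 0.0, 1.0], size=(500, 40))
y = rng.normal(110.0, 20.0, size=500)

# Scan a broad grid of penalties. In practice the cross-validation error curve
# is elbow-like and nearly flat across this band, which is why a wide range of
# lambdas all look acceptable and the final pick is largely subjective.
lambdas = np.linspace(500.0, 5000.0, 46)
model = RidgeCV(alphas=lambdas, fit_intercept=False).fit(X, y)
print(model.alpha_)  # the "winning" lambda; hardly a decisive choice
```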

Possessions are Key…

For our results, we compute stints as consecutive possessions played by a group of ten players. The way that I compute possessions is quite different from many other RAPM creators. For instance, the definition of a possession is not uniform across the league. While my method of counting possessions matches end-of-game totals, distributing such possessions raises eyebrows. For instance, if a substitution is made during a free throw and an offensive rebound occurs with a putback score, then instead of double counting the possession, I say the second unit has given up 2 points over zero possessions: they yielded a score on an empty possession. Other folks will double count possessions. Others will count half-possessions. Either way you slice that possession, you induce an implicit bias in the direction of one unit or the other. My implicit bias is that you should be penalized for not securing the rebound despite being placed in the best position, given the rules of the league.

In a similar manner, I tack technical fouls onto current possessions. Many folks treat technical fouls as new possessions. Therefore, in games where three technical fouls occur on one possession, I count it as one possession, while another person will count it as four possessions: the original plus three possessions worth at most one point each. And if we look at the computation of XtY above, you see this will have grave effects on the resulting distributions. In fact, the biggest discrepancy year after year is that Kevin Durant benefits from technical fouls like no other. It happens again this year, as the technical foul treatment I impose drops his defensive RAPM by upwards of a point per 100 possessions. It's crazy to see how minor possession definitions dramatically affect RAPM. But if you've been following along, we see exactly why.

Now, with these caveats out of the way, let’s look at a set-up…

Golden State versus Milwaukee

Through December 24th, the Golden State Warriors and the Milwaukee Bucks have already played both of their games this season. The starting units of the second game (Kevin Durant, Stephen Curry, Klay Thompson, Andre Iguodala, and Kevon Looney versus Giannis Antetokounmpo, Malcolm Brogdon, Eric Bledsoe, Khris Middleton, and Brook Lopez) played a whopping two stints against each other. In fact, this five-some for the Warriors has played in a mind-blowing 27 stints over 34 games. That's a starting lineup with less than one stint played together per game.

Regardless, we have the same problem indicated above. One stint is short while the other is a starters' stint. The ratings? 66.67 for one and 100.00 for the other. Therefore, the unweighted stint rating is, of course, 83.33, when in reality the possession-weighted value is much closer to 100.
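The possession counts for those two stints aren't listed here, so as a purely hypothetical illustration, suppose the short stint was 2 points over 3 possessions (66.67) and the starters' stint was 20 points over 20 possessions (100.00):

```latex
\underbrace{\tfrac{1}{2}(66.67 + 100.00)}_{\text{unweighted average}} = 83.33,
\qquad
\underbrace{100 \times \frac{2 + 20}{3 + 20}}_{\text{possession-weighted}} \approx 95.7
```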

Excruciatingly, we must assume that these two values are enough to satisfy a Gaussian assumption and that the ratings do indeed form a Gaussian distribution with the same variance as all other stints. For grins, here's the global distribution of all offensive ratings for this NBA season:

[Figure: Histogram of all offensive ratings for the 2018-19 NBA Season through December 24th.]

Crap…

Now the Top 50…

Given how RAPM is clearly a Gaussian re-imagining of a decidedly non-Gaussian process, we can still compute the Top-50 RAPM players through December 24th.

[Figure: Top 50 Single-Season RAPM NBA Players.]


If we compare this list to Ryan Davis' Single Season Performers, we find there are some similarities. Of course, we are different thanks to possession counting, small samples, extreme confounding, and the whole "PCA is rotationally invariant" thing… but the RAPM results are effectively the same.

Wait… effectively?

Why Doesn’t Anyone Report on Standard Deviations?

In our given profession of NBA analytics, if someone doesn't report the standard deviation associated with their analytic result, they are either lazy or being malicious. As a statistician, I can tell you that no one cares about the expected value; they care about the error associated with that expected value. It's typically coined as bias and variation. We do the same in GPS… we don't report the error estimate, but rather the center of the error ellipse, which is not guaranteed to be the same.

In the computation process above, remember we obtained the variance “for free.” So let’s tack these on…
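Here is a minimal sketch of that "for free" computation. Per the derivation above, the posterior covariance is σ²(XtX + λ·I)⁻¹; plugging in a residual-based estimate of σ² is my own choice here.

```python
import numpy as np

def rapm_with_std(X, y, lam):
    """Ridge point estimates plus their posterior standard deviations.
    The covariance follows the derivation above, sigma^2 (X'X + lambda*I)^(-1),
    with sigma^2 estimated from the residuals (my plug-in choice)."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    beta = np.linalg.solve(A, X.T @ y)
    resid = y - X @ beta
    sigma2 = resid @ resid / len(y)   # plug-in estimate of the stint-rating variance
    std = np.sqrt(sigma2 * np.diag(np.linalg.inv(A)))
    return beta, std
```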

[Figure: Let's do this RAPM thing again but produce the Standard Deviations. O-Var is the standard deviation for the offensive RAPM value.]

And there we have it… the standard deviations are approximately 2 to 2.5 points per 100 possessions for each player. So let’s see what this means for Fred VanVleet with respect to Gary Clark.

Since we are working with a Gaussian distribution, we can compute the test for comparison… we obtain a test statistic of approximately 0.05, which has a ridiculously high p-value. This indicates the difference between first and fiftieth is not discernible. That's right… being the top in RAPM is effectively meaningless from a statistical standpoint. And that's the rub: RAPM is not an effective tool for significantly measuring the impact of a player. It's just a tool to rank guys and hope no one notices all the pitfalls along the way.
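For the comparison itself, a simple Gaussian test on the difference of two estimates looks like the sketch below. It treats the two posteriors as independent (their correlation is ignored), and the inputs are illustrative stand-ins chosen to land near the 0.05 statistic quoted above, not the actual table values.

```python
import numpy as np
from scipy.stats import norm

def rapm_z_test(est_a, sd_a, est_b, sd_b):
    """Two-sided Gaussian test for the difference between two RAPM estimates,
    treating the two posteriors as independent."""
    z = (est_a - est_b) / np.sqrt(sd_a**2 + sd_b**2)
    return z, 2.0 * norm.sf(abs(z))

# Illustrative stand-in values: a gap of ~0.17 points per 100 possessions
# with ~2.4-point standard deviations on each estimate.
print(rapm_z_test(3.50, 2.4, 3.33, 2.4))  # z ~ 0.05, p ~ 0.96
```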

And it's for this primary reason that three-year RAPM became popular. In this case, the error variation comes down a bit, but the same problems exist. In fact, the tails will start to separate, but the middle of the pack still looks the same. For one team (over the years), I showed that players ranked between 100 and 300 were nearly identical.

So armed with this knowledge, what would you do to minimize the impact of the assumption fallacies and the associated standard deviations? Regardless… now you know!


Disclaimer: Our discussion of RAPM over the previous year has been focused on offensive and defensive versions of the original model developed by Joe Sill. In the work developed by Jerry Engelmann, he focuses more on single-possession stints, whereas Joe focuses on net-differential stints. Due to this, Jerry does not require weights while Joe does. Similarly, Jerry is able to produce O/D-ratings while Joe, explicitly, does not.

In the work presented here, we focused on unweighted stints with O/D-ratings. By pushing in weights, we rectify a couple of the addition problems but do not see much improvement in the confidence bounds. This is a function of the majority of stints lasting three or fewer possessions, causing us to lose the Gaussianity assumption.

This write-up is framed to avoid directly critiquing the work of Joe and Jerry, instead alluding to potential issues that arise when the models are tinkered with… such as possession counting, using single seasons, or partitioning stints. That is to say nothing of the biasedness of the results and the lack of interpretability of the coefficients; they are indeed not points per 100 possessions for the respective player, but rather a biased estimate.

One takeaway I'd like to point out is that this methodology is a massive step forward from adjusted plus-minus and is an important basis for further modeling, such as RPM and PIPM… and even some models I have developed directly for teams that are still in use today. However, understanding that coefficient confidence bounds are much more important than the estimates is key here, especially if you are trying to use RAPM to help make a decision.