Over the previous years, linear models have been introduced across the NBA as the “new” and “innovative” way to determine the “value” of a player on the court during the course of a game. Every methodology is an increment on the past: Plus-Minus to Adjusted Plus-Minus to Regularized Adjusted Plus-Minus to Box Plus-Minus to Real Plus-Minus to Player Impact Plus-Minus. Each level attempts to leverage play-by-play data as either a linear model (APM), a Bayesian hierarchical model (RAPM), a conditional model (BPM and RPM), or a Bayesian hierarchical-conditional model (PIPM). The latter models (BPM, RPM, and PIPM) attempt to incorporate player actions in an effort to normalize the wonky results that pop out of models such as APM and RAPM. For instance, good luck trying to push Danny Green as one of the best players in the entire league, worthy of a max contract. These models typically suffer from **swamping** and **sparsity**, a high-correlation, high-noise environment that does not allow most linear models to identify proper signal above the noise floor when determining “points per 100 possessions” for a particular set of lineups.

To combat swamping, RAPM employs a regularization parameter that mechanically serves as a Bayesian filter on the parameter, mimicking Principal Component Analysis. While this drives down noise and dramatically improves analysis over APM, the model introduces slight biases; and due to the sparsity of the sampling frame, a rotational invariance problem emerges. A simple way to see this is to swap columns in the lineup matrix and randomize the seed in the L-BFGS optimizer. [**Side note:** If you use Newton-Raphson optimization for RAPM, you may need to take a moment to study local vs. global optimizers.] In mid-season, single-season RAPM, we will actually see Danny Green and Fred VanVleet swap places with near-identical values.
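For reference, here is the standard ridge regression form that RAPM is usually taken to mean. The symbols are assumptions for illustration and do not appear in the original analysis: $X$ is the lineup (stint) matrix, $y$ the observed scoring margin per stint, $\beta$ the player coefficients, and $\lambda$ the regularization parameter.

```latex
\hat{\beta}_{\text{RAPM}}
  = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2
  = (X^\top X + \lambda I)^{-1} X^\top y

% Bayesian reading: this is the MAP estimate under
%   y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2 I), \qquad
%   \beta \sim \mathcal{N}(0, \tau^2 I), \qquad \lambda = \sigma^2 / \tau^2

% PCA connection: with the SVD X = U D V^\top,
X \hat{\beta}_{\text{RAPM}}
  = \sum_{j} u_j \, \frac{d_j^2}{d_j^2 + \lambda} \, u_j^\top y
```

The last line is where the PCA analogy comes from: directions with small singular values $d_j$ (the noisy, highly collinear lineup combinations) are shrunk the most, while dominant directions pass through nearly untouched.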

To combat swamping, BPM, RPM, and PIPM employ “additional” variables to help drive up the signal on high performing players. This better captures the true nature of the lineups as certain players have higher **usages** than others and other players are more **efficient** than their counterparts. While this is great, the linear models still fail to capture the lineup strategy within the game.

To this end, we propose **action networks**.

## The Game is a Novel. The Possession its Chapter.

A piece of advice I obtained from an Eastern Conference coach back in 2014 was that **we think of every game as a story that unfolds with multiple subplots.** When that coach broke down the philosophy, he was aiming for the thought of **hopefully the other team is a shit writer and we can figure out the ending by chapter three.** What the story really made me think of was that we can **employ natural language processing to break down possessions**. At the time, we had **SportVU** and **Synergy** data available. But before we delve deep into the data, let’s take a look at a classic story.

### Beautiful Ball Movement: San Antonio Spurs

Since most of this methodology was completed back in 2014, the presentation at the time, “Breaking Down NBA Plays Using Spatio-Temporal Constructions,” contained several clips of the 2014 NBA Finals between the San Antonio Spurs and the Miami Heat. One of the clips focused on **writing a story using actions**.

Let’s break out the story. The **first action** is to set a **dual pin-down** on **Tony Parker**. This action occurs as **two passes are made** by the ensuing **screeners: Manu Ginobili** and **Tiago Splitter**. The **second action** is a **rub** by **Matt Bonner** to bring Ginobili to the strong side of the court. The **third action** is a **hand-off** from **Kawhi Leonard** to Ginobili, leading immediately into a **fourth action**: a **pick-and-roll** from Bonner onto Ginobili.

As the Heat go into **BLUE**, Bonner pops instead of rolling as Ginobili is double-teamed, leading into **two swing passes** back to Parker, followed by the **fifth action:** a **pick**. The ensuing drive **(sixth action)** creates a double team, forcing Parker to pass back out to Bonner. This allows for a **seventh action / second drive** by Bonner as the defender, **Rashard Lewis**, is slow on the closeout. As **Chris Andersen** steps out, Bonner dumps the pass off to Splitter, who makes the reverse layup.

In the linear model space, this is registered as “1” where we see the offense and “1” where we see the defense (or some multiple; it doesn’t really matter). In the box score models we identify that Bonner gets an assist and Splitter gets a made field goal. But that’s about where it ends. **Tony Parker gets no love for the drive that opened the lane**. The staggered pin-down gets ignored even though it forces the Heat to guard the actual play with **Andersen**, **Lewis**, and a **confused Norris Cole** instead of **LeBron James**.

### Turning the Action into Features: Possession to Vec

Armed with SportVU and Synergy, we could develop methods for building tracking features. We’ve already done some of this for identifying passes. From here, we can start to tackle different types of actions: **handoffs, screens, cuts, passes, types of shots, dribbles, etc.**

For instance, we can model a basic pick and roll play as follows:

Here, we have the classic elbow screen with roll. Of course, the defense is not from the School of Thibodeau and BLUE is not employed. So we can build the features **Ball Screen, Pick and Roll, No BLUE, Drive Left,** and append on other actions as necessary, such as number of dribbles, switches by defense.

Therefore, a possession story unfolds as counting the number of **plot devices** that define the story. **(Side Note:** We do this because I’m lazy and I don’t want to worry about **exchangeability issues)**

In my 2014 analysis, I had broken out 152 different types of actions. These actions would encapsulate a possession based on what took place during the course of the possession. Therefore we would have a 152-long count vector counting the different actions that occurred within the possession, and this would effectively serve as our **possession to vector quantity**. A requirement for becoming an action was that **it had to be performed a minimum of 5 times per game** (on average) and **it had to contribute to the network**.
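As a rough illustration of the count vector, here is a minimal sketch. The seven-item vocabulary and the event tagging are hypothetical stand-ins; the actual 152 action types are not listed in this post, and the Spurs possession is tagged loosely from the description earlier.

```python
from collections import Counter

# Hypothetical action vocabulary; the 2014 analysis used 152 action types,
# which are not enumerated in the post.
ACTIONS = ["pin_down", "rub", "handoff", "pick_and_roll", "pass", "drive", "turnover"]


def possession_to_vec(events, vocabulary=ACTIONS):
    """Return a count vector with one slot per action in `vocabulary`."""
    counts = Counter(events)
    unknown = set(counts) - set(vocabulary)
    if unknown:
        raise ValueError(f"actions not in vocabulary: {sorted(unknown)}")
    return [counts[action] for action in vocabulary]


# Loose, simplified tagging of the Spurs possession described earlier.
spurs_possession = [
    "pass", "pass", "pin_down", "pin_down", "rub", "handoff",
    "pick_and_roll", "pass", "pass", "pick_and_roll", "drive",
    "pass", "drive", "pass",
]

vec = possession_to_vec(spurs_possession)
print(dict(zip(ACTIONS, vec)))
```

Because the vector only counts actions, the order in which they happened is thrown away, which is exactly the shortcut the side note about exchangeability is pointing at.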

### The Action Network

Finally, we arrive at the action network. This is exactly as you should expect: **a Multilayer Perceptron Neural Network** (fancy hierarchical regression) that attempts to learn the value of each action relative to the number of points scored within a possession. This will give weight to each action **and we can use these weights as priors for every player in the league**. The weights would then be viewed as “contribution to the points scored within a possession.” This process gets a little weird as values can **and will** go beyond 2-3 points. There are **negative actions** such as turnovers; for instance, in the 2013-2014 season, a turnover was worth **-5.66 points**. Meanwhile, **positive actions** such as a backcourt pass contribute **+0.83 points**. The sum of the moving parts then identifies the value of the possession.

And to make this more menacing, when predicting the number of points scored, we enforced penalties on **non-integer values** to ensure we obtain whole points. The result? Using **10-fold cross validation**, we were able to recover **72.83% of possessions correctly** for the 2013-14 season. For the 2018-19 season, it’s at **68.27%**.
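To make the shape of such a network concrete, here is a minimal forward-pass sketch. It treats whole-point prediction as a softmax classification over point outcomes, which is only one way to force integer predictions and is an assumption here; the three-action input, the layer sizes, and every weight in `TOY_PARAMS` are hand-picked for illustration and are not learned values from the real model.

```python
import math

POINT_CLASSES = [0, 1, 2, 3]  # whole-point outcomes; 4-point plays ignored in this toy


def relu(values):
    return [max(0.0, v) for v in values]


def affine(weights, bias, x):
    """Compute weights @ x + bias for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_points(x, params):
    """Action counts -> hidden layer -> class probabilities -> whole points."""
    w1, b1, w2, b2 = params
    hidden = relu(affine(w1, b1, x))
    probs = softmax(affine(w2, b2, hidden))
    best = max(range(len(probs)), key=probs.__getitem__)
    return POINT_CLASSES[best], probs


# Hypothetical, hand-set weights. Input order: [drive, pass, turnover].
TOY_PARAMS = (
    [[1.0, 0.5, 0.0], [0.0, 0.0, 1.0]],                      # hidden: "good offense", "turnover"
    [0.0, 0.0],
    [[-1.0, 4.0], [0.0, -2.0], [1.0, -2.0], [0.2, -2.0]],    # one row per point class
    [1.0, 0.0, 0.0, 0.0],
)

points, probs = predict_points([1, 2, 0], TOY_PARAMS)
print(points, [round(p, 3) for p in probs])
```

In a trained version, the weights would come from fitting against observed points per possession, and the 10-fold cross-validation figure would be the share of held-out possessions whose predicted class matches the actual points scored.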

## Hierarchy!

So, if you’ve followed this site, you’ve seen a series of explanatory and critical articles about RAPM. This was my fix to RAPM several years ago. By using RAPM and SportVU, we could now place the action network coefficients as a **Gaussian** **prior** and start to better understand players based on their actions. The nice part of this is that many of the swamping errors went away. Unfortunately, there were still issues with lower level players due to low action counts; something that not even employing RAPM could help solve.

Using the Gaussian prior framework, we have to be careful with how we use “RAPM.” To this point, we’ve been using it interchangeably with **ridge regression on lineup data**. So to be clearer we must stick to the same units of interest. In this case it must be **points per possession** instead of **rating**. That’s a big difference as we introduce potentially much more error, despite predictive results improving over the ~60% in RAPM.

Also, due to the action network, **players are nearly zero-sum**. This means that players who gain points force opponents to absorb negative points. We use the term “nearly” because actions can occur with no defenders; for instance, we can fall victim to **meaningless passes in unopposed transition.**

Therefore we view weighted actions as a **Bayesian filter** into regularized linear regression.
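One way to write this down, as a sketch under assumed notation rather than the exact estimator used: let $X$ be the lineup matrix, $y$ the points per possession, $\lambda$ the ridge penalty, and $\mu$ the vector of per-player prior means built from the action weights. How $\mu$ is assembled from those weights is not spelled out here, so it is left abstract.

```latex
% Ridge regression shrinking toward a non-zero prior mean \mu
\hat{\beta}
  = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta - \mu \rVert_2^2
  = (X^\top X + \lambda I)^{-1} \left( X^\top y + \lambda \mu \right)

% Equivalent form: start at the prior and correct it with the residuals
\hat{\beta}
  = \mu + (X^\top X + \lambda I)^{-1} X^\top \left( y - X\mu \right)
```

The second form shows the “filter” reading: a player with few possessions barely moves off the action-based prior $\mu$, while a player with a lot of lineup data is pulled toward whatever the residuals $y - X\mu$ say.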

## But Tracking Data Only Goes So Far…

Unfortunately, SportVU only goes so far back. Similarly, SportVU (and its successor Second Spectrum) doesn’t cover all games in a season. Due to this, we perform **imputation**.

### In-Season Imputation

Within a season where we have tracking data, we can impute actions based on **play-by-play data**. For instance, using 80+ games we can build a regression model to predict the number of different types of actions that occurred. It’s not pretty (one season was as low as ~40% predictive capability) but we can’t afford to throw out data. **(Side Note:** This is a great place for research to occur… estimating the actions given only play-by-play and Synergy.**)**

The good news is that **trends within season stay true**. Teams play similar styles almost all season. It’s rare for a team to switch from being a mid-range dominant team to a three-point-bombing bunch. Also, shooting trends remain stable over the course of the year, but can become volatile from year-to-year.

### Out-of-Season Imputation

For years prior to the 2011/2012 SportVU data, we have to impute **everything**. And this is where things may get a little sketchy. The great news is that the three-point revolution and advanced offenses barely existed prior to 2011. While threes have been ever increasing, the actions to unlock large quantities of three-point attempts have been a thing of the recent past; say only 3-4 years.

Therefore, using the first two seasons of tracking, we are able to build a regression model to **guess** the number of actions in each case. And, unfortunately, this only goes back to the beginning of play-by-play: in 1997.

The good news, however, is that in the current NBA there are no more players left from the **pre-1997 years.** So those players have no bearing on us now.

With that, let’s get to some results!

## 2017-18 NBA Season

Without jumping headfirst into this season, let’s take a look at previous seasons and see how well this analytic performs. Recall that our goal is to identify **point contributions over the course of a game**. Therefore, we normalize the values to **points per 100 possessions** contributed by that player.

Here, we see that we are missing two All-NBA Second Team players: **Joel Embiid** and **DeMar DeRozan**. DeRozan came in at 22nd on the list while Embiid landed at 37th. Similarly, **Paul George** (All-NBA Third Team) floundered down to 54th on the list. All-Defensive players **Robert Covington** and **Draymond Green** landed at 21st and 24th, respectively. 23rd was **Goran Dragic**.

Another notable item from this analysis is that **Ben Simmons** and **Donovan Mitchell** both rate high and were the two rookies in consideration for Rookie of the Year, an award which was ultimately won by Simmons.

## 1997-98 NBA Season

Using the Out-Of-Season imputation method, we were able to identify scores for the first year of play-by-play. In this case we obtain similar results as above. Here, we see the noise of the out-of-season sampling creep in as **the entire ALL-NBA 3RD TEAM is left off the top 20**.

We do, however, come up with a hotly contested **top spot** in the league, which ultimately goes to **Gary Payton.** These were indeed the top three players for MVP voting that season. We once again nail the Rookie of the Year with **Tim Duncan**.

Now for this year…

## 2018-19 NBA Season

Finally, running the numbers today, we obtain this season’s “Top 20” points contributed players. To recall, we use quotes because these numbers are **distributed**, meaning that there is variation attached. While the order may possibly pass the “eye test,” we have to recognize that errors could be as large as 5-6 points. So this is merely a guide to a potential ordering.

This suggests that **Giannis Antetokounmpo** “should” be the MVP of the season. However, with such a tight score as compared to the 1997-98 season, we could make the identical argument for **James Harden**, the current MVP of the league.

For Rookie of the Year, **Luka Doncic** rests at 39th overall, with **Trae Young** actually tumbling down to 71st in the league. This would suggest Doncic “should” be the ROY this year.

There we have it, a brief introduction into tracking-based story-telling through hierarchical Bayesian models. This model has been in practice since 2014 and has several areas for improvement. What areas would you improve? Unfortunately, we have to wait to see the outcomes for the awards. But until then, sound off in the comments below!

Using a neural ordinary differential equation would probably lead to less std. error as it’s less likely to get caught in local minima when solving the loss function

What are the defensive actions that go into the prior? The article only mentions offensive ones. Of the 152 actions, how many are defensive?

An example was given above: BLUE use, switch, etc. Also types of closeouts such as high-velo close, low-velo close, run by, no contest, etc.

There’s many used.

Hi there, I had a few quick initial questions.

1. How do you determine the value of an action from the neural network? The network’s target can’t be the ground truth value of the action, since that’s unknown, so it sounds like you extract the action value from the neural network’s weights. This seems very non-trivial. Could you please explain more?

2. Could you provide more details on the prior for players? Did you use action values together with action distributions of players to determine the parameters of the gaussian for each individual? Or is your prior for the value of an action conditioned on the player performing the action, in which case player-conditional action values are updated through the Bayesian update, and individual player values are determined by their player-conditional action values and action distribution?

3. Do you have uncertainty estimates? Are your results MAP values?

4. How do you determine an action contributed to the network?

Thanks!

In reply to my own comment, particularly my first question, I’m now wondering if you determine action values by feeding the action network a vector that is 1 at the given action and 0 everywhere else?

Hello! Here’s some answers without getting too deep.

1. The target of the action network is the number of points scored on a possession. The loss function is multinomial softmax, which enforces that all actions contribute linearly (through the softmax function) to learn the number of points on the possession. The resulting coefficients of the MLP are the “action weights.” Since softmax is used, this becomes nonlinear in players; linear in action space. To extract the action, simply leverage the dot-products at each layer of the backpropagation method used to learn through each stage of the softmax mixing.

2. The resulting coefficients for the actions are then used as “Gaussian weights.” This is an incorrect assumption (but useful enough to leverage the PCA equivalency contained in ridge regression). So the prior is then a Gaussian distribution centered on the types of actions within a possession. It’s effectively a count vector. If we proceed using a “conditional model” we can condition on actions taken in a possession and this gives an “expected score” before we know the players. Then given the players (ridge part) we obtain how those actions have “helped” or “hurt” the players.

In a fully Bayesian model, we eliminate the conditional on plays and then run a Gibbs sampler. An issue that will arise is that we may simulate non-traditional moves within a game. But that’s where I stopped with this model (and why EPV-like methods can improve the simulation process).

3. I do not have the uncertainty estimates handy, but they are relatively easy to compute. Last time I looked, they were approximately .5-.7 “points” per possession. Linearly stretched to 100 possessions, that would be ~5-7 “points.” I may be wrong on this, but I assume they are MAP estimates. The reasoning is that Gaussian prior + Gaussian likelihood = Gaussian posterior through conjugacy. In this case, posterior mean = MAP estimator. However, since I’m effectively using an Empirical Bayes-type approach, I may not see that hold… not entirely 100% certain.

4. Through the back-propagation algorithm, I maintain a confusion matrix and can perform a forward-backward step-wise insertion method. Using a chi-square goodness of fit, I obtain a rough idea of whether a variable is “helpful.” It’s not perfect, and I guess this is where it becomes more “analytics” than “statistics,” but at least you know how I was attempting to understand (within a neural network) how helpful my features were in the points modeling process.

Hope this is helpful and thanks for going through the post!

Thanks so much!

Thanks again for the reply. I just wanted to follow up, if you don’t mind.

1. By softmax loss do you mean cross entropy / negative log likelihood on the softmax output of the action network? Just want to be sure. Anyways, I don’t understand why using a softmax makes the action network linear in the actions. Do you know of a paper or some other reference that I can look at with a similar claim?

2. I might be missing something here. It sounds like your prior is for a distribution over the counts of actions in a possession. I don’t see where this is used in the model pipeline. Continuing, you predict the expected outcome of a possession from the actions that occurred and your actions values. What exactly is the objective of player model and where is the expected outcome used?

3. Thanks!

4. So, you update the confusion matrix at regular intervals during back-propagation, and then assess what actions should be added or removed? Neat idea! My worry is: could actions that are unimportant for the action network be important for the player model? Let’s just suppose the actions “screen” and “pocket pass” were in your initial set and they were perfectly correlated. Then the action network only needs one or the other, so if “screen” is already in the network when you add “pocket pass”, the goodness of fit shouldn’t change and “pocket pass” wouldn’t be added. I still don’t fully understand your player model, but if it uses the actions players take to estimate player impact (or prior) on PPP, then the impact (or prior) of players who make pocket passes will be underestimated since the model is blind to that action.

Okay, it seems I can’t edit my previous reply, but I wanted to provide details on what I’m not understanding.

On the softmax being linear with respect to the actions, what confuses me is that the softmax isn’t linear with respect to the logits, which are already a non-linear function of the actions if you include non-linear activations. I thought the action network was approximating p( point outcome | action vector ) without any dependence on the players, but maybe not, since you mention it’s nonlinear in the players.

On the player model and its prior, I guess what would help most is understanding the distribution you are trying to model and how you represented it. I’m guessing here, but I think your hierarchical player model is approximating the distribution

p( point outcome | players )

= integral_{action vector} p( point outcome, action vector | players)

= integral_{action vector} p( point outcome | action vector, players ) * p( action vector | players ).

I understand how this provides a natural hierarchy. And it makes sense to have a prior over the action vectors in this case. If this is in fact the learning task, then my only question would be how do you tease out the impact of individual players? Is there another regression on top of this model?

There are indeed two regressions.

First is Point Value | Actions

Second is Point Value | Players [following the derivation you have above]; where the “action_vector” is now treated using the counting-measure-at-possession-level application learned off the neural network.

By the way, softmax is indeed linear in actions. It’s called a linear model, after all. Work through the equations and it will pop out immediately. It’s the move from the linear weights to a Gaussian prior that makes this non-linear in players. This is because we’ve effectively introduced a natural logarithm.

Hi, thanks for the clarifications. I still don’t understand why the softmax is linear in the actions, though. If the softmax were a linear map in its input it would satisfy softmax(x+y) = softmax(x) + softmax(y) but it doesn’t. I don’t think this changes when we consider the softmax inputs are non-linear functions of the actions. When you said the actions contribute linearly, did you mean something other than the action network was a linear map from the actions to probabilities (via the softmax)? Maybe I’m just misunderstanding what you meant by that.

That’s the mathematical definition for a function to be an additive linear operator; not the definition of linearity within a statistical model.

A good reference: https://www.amazon.com/Foundations-Linear-Generalized-Probability-Statistics/dp/1118730038

To close that loop, consider the basic example: logistic regression. That’s a one-layer multi-layer perceptron with softmax loss. It’s a (if not the) canonical example of a linear model.

Okay, I now understand what you mean, thanks. I am familiar with GLMs, but I was thinking about the mathematical definition because I think that’s what’s meaningful here. I’ll have to dwell on that for a bit.

Hi, quick question – is the tracking data used here for recent seasons publicly available? I’ve seen it floating around for older seasons, but nothing for the last couple.

No, it’s not publicly available. I have mine through partnerships. To test out the ideas and develop your own models, you can always build for older seasons!
