The **usage** of an NBA player is the share of a team's chances that the player takes while on the court. A **chance** is a possession event that can result in a **scoring possession**. The higher the usage for a particular player, the more likely that player is the primary option for the team while on the court. We can split chances into **scoring chances**, which are potential possession-ending field goal attempts or free throw attempts, and **lost chances**, which are turnovers. We say potential because a field goal attempt is a chance, but a missed attempt that returns to play as an offensive rebound leads to another chance even though the possession does not end. Teams can therefore have multiple chances per possession.

## Usage Example: Washington Wizards

Let’s start with a simple exercise with the **Washington Wizards**. Through 30 November, the Wizards have played **2013 possessions** over the course of 21 games. Those 2013 possessions yielded **2318 chances** for the Wizards.

### Estimated Chances: NBA Stats

If we were to statistically calculate the number of chances for the Wizards, we would expect

**Number of Chances = FGA + 0.44*FTA + TOV**

The value of 0.44 is an archaic one that no longer necessarily holds true; however, NBA stats will yield the Wizards’ number of chances as **1793 + 0.44*519 + 303** = **2324.36**, which is not far off from the truth. Now, if we consider **John Wall**, we find that Wall is estimated to have completed **255 + 0.44*112 + 49** = **353.28 chances** while participating in an estimated **977 + 0.44*298 + 162** = **1270.12 team chances**. This comes out to a **27.815% Usage Rate**. NBA stats says:

As we see, we have correctly picked up John Wall’s usage rate. The only challenging task in the exercise above is counting the team statistics while Wall was in the game. This was performed manually using a Python script. If such a tool exists on the web, please feel free to link it in the comments! (**Note: **I assumed Basketball-Reference would have such a tool. I was unable to find it.)
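The estimated-chance arithmetic above is easy to reproduce. A minimal sketch in Python, using the box-score totals quoted above (the function name is mine, not an NBA stats API):

```python
def chances(fga, fta, tov, ft_weight=0.44):
    """Estimated number of chances: FGA + 0.44*FTA + TOV."""
    return fga + ft_weight * fta + tov

# Wall's own chances, and the team's chances while Wall was on the court
wall_chances = chances(fga=255, fta=112, tov=49)    # 353.28
team_chances = chances(fga=977, fta=298, tov=162)   # 1270.12

usage = 100 * wall_chances / team_chances
print(round(usage, 3))  # 27.815
```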

### Exact Counting: Actual Usage Rate

If we count every possession and chance, we can see how well NBA stats estimated Wall’s usage. As we saw with the Wizards’ totals, since the number of free throws is small and the actual rate of free throws terminating possessions is **0.427**, the estimated number of possessions is only off by 6 possessions.

For John Wall, we find that he actually obtained **354 actual chances** out of the **1267 chances** the team had while he was on the court. This results in a **27.94% Usage Rate**. Not too shabby for the possession estimation process. Provided the number of free throws stays down, estimation will never pose a problem. However, this is John Wall we are talking about; and he can get to the rim and finish on fouls.

## Estimating / Predicting Usage

Once we obtain the entire team’s usage stats, we can start looking at how teams prioritize chances. For instance, if we consider the starting line-up of **Bradley Beal**, **John Wall**, **Marcin Gortat**, **Otto Porter Jr.**, and **Markieff Morris**, we find that their combined usage is **28.7349 + 27.9400 + 18.7114 + 15.0822 + 21.0602** = **111.5287%**. Well, that certainly is not feasible.

In this case, we can assume a uniform distribution on usage to help us estimate usages for rotations. What this means is, if a player maintains a 25% usage rate overall, they apply the equivalent usage rate when aggregated with other players. Illustratively, this means **John Wall** moves from **27.94%** to **27.94 / 111.5287** = **25.05%**. This indicates that a quarter of the rotation’s chances are expected to go through Wall’s hands. Unfortunately, this does not translate well… thanks **Chris McCullough**.
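Under that uniform assumption, the rescaling is one line of arithmetic. A quick sketch, using the starter usage rates quoted above:

```python
starters = {
    "Beal": 28.7349, "Wall": 27.9400, "Gortat": 18.7114,
    "Porter": 15.0822, "Morris": 21.0602,
}
total = sum(starters.values())  # 111.5287: the infeasible combined usage
# Rescale so the five-man unit's usages sum to 100%
scaled = {name: 100 * u / total for name, u in starters.items()}
print(round(scaled["Wall"], 2))  # 25.05
```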

What we want to do here is **build a model** that predicts usage for a player. In this sense, we can construct a **13-variable indicator feature vector** that identifies the rotation. For example, the starting unit would be labeled

**(1,1,1,0,1,0,1,0,0,0,0,0,0).**

The labeling comes from the order in the table above. We can then attach a **multinomial response** variable for each chance recorded. This is effectively a **categorical value** that identifies the player who took that particular chance.

In total, there are **1,287 possible rotation combinations** for the Wizards to employ from their 13 roster players. Over the course of their first 20 games, only **124 rotations** have been put onto the court. Having seen fewer than 10% of the combinations, we are left with an excessively difficult regression task. In fact, regression is not the way to go, because inference over the players is effectively inference over the 1,287 possible rotations, most of which we have not observed.
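To make the encoding concrete, here is a sketch of the indicator vector and the combination count (the roster-slot indices for the starting unit are read off the vector above; `encode_rotation` is a hypothetical helper, not from any library):

```python
from math import comb

# Five-man rotations from a 13-man roster
print(comb(13, 5))  # 1287

def encode_rotation(on_court, n=13):
    """Indicator feature vector: 1 if roster slot i is on the court."""
    return tuple(1 if i in on_court else 0 for i in range(n))

# The starting unit occupies roster slots 0, 1, 2, 4, and 6
print(encode_rotation({0, 1, 2, 4, 6}))
# (1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0)
```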

In light of this, we can apply a popular technique such as **Regularized Least Squares** (which is also called **Ridge Regression**) if we believe our data is Gaussian (it’s not). Or we can look into **Bayesian modeling**. In doing this, I could build a massive-scale post; but at approximately 800 words in, I have bigger fish to fry.

## Team Usage

What we want to do now is compute **Team Usage**. This percentage accurately reflects the usage of a player **over the course of the season** as opposed to when they are specifically on the court. In this case, we can understand the role of a player with respect to the overall team’s performance.

In this case, we find that a player like **Markieff Morris** truly has a usage rate of **6.3417** instead of 21.0602. Why does this happen? Morris has only played in 14 out of Washington’s 21 games. What’s more important here is that we can now start to look at how usage is distributed across the team.

### Implied Team Usage

For instance, **Morris** was inactive/suspended for the first 7 games of the season. Similarly, **Wall** was out for 5 games. Due to this, their Team Usage rates are low. Despite this, their usage rates are still relatively high with respect to when they are on the court. So let’s do the math:

**Morris: **14 / 21 games played, 6.3417 Team Usage implies **9.52% Implied Team Usage**.

**Wall: **16/21 games played, 15.2718 Team Usage implies **20.04% Implied Team Usage**.
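The implied numbers follow from scaling Team Usage by the fraction of games played. A sketch (the hundredth-place difference from the 9.52% quoted above is just rounding of the inputs):

```python
def implied_team_usage(team_usage, games_played, team_games=21):
    """Scale season-long Team Usage up to the games a player actually played."""
    return team_usage * team_games / games_played

print(round(implied_team_usage(6.3417, 14), 2))   # 9.51 (~9.52 before input rounding)
print(round(implied_team_usage(15.2718, 16), 2))  # 20.04
```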

This indicates that Wall is a priority player when it comes to chances. In fact, he is almost identical to **Bradley Beal**. Whereas, Markieff Morris is not the priority option. If we take this one step further, we can look into **implied usage**.

### Implied Usage

Implied usage corrects implied team usage to account for minutes played. By adjusting for minutes played, we should obtain numbers closer to usage. Let’s do that math on this one:

**Morris: **22.6 MPG, 9.52% implied team usage implies **20.22% Implied Usage**.

**Wall: **34.4 MPG, 20.04% implied team usage implies **27.96% Implied Usage**.
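The minutes adjustment works the same way, assuming a 48-minute regulation game:

```python
def implied_usage(itu, mpg, game_minutes=48):
    """Scale implied team usage up from minutes played to a full game."""
    return itu * game_minutes / mpg

print(round(implied_usage(9.52, 22.6), 2))   # 20.22  (Morris)
print(round(implied_usage(20.04, 34.4), 2))  # 27.96  (Wall)
```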

If we go back and look at the usage rates, we find that these implied values are on par with the player usage values. The one discrepancy we see, **Morris being off by roughly 1 percent**, is merely due to rounding, small numbers, and the counting process.

### What We Can Do With Team Usage

The reason we focus on team usage is to see the impact of a player over the course of the season. That, and we obtain a stochastic relationship to a player’s efficiency.

## Efficiency

A player’s **efficiency** is a simple score that counts the number of positive actions with the basketball and demerits for negative actions on offense. The formula is given as

**{PTS + REB + AST + STL + BLK – (FGA – FGM) – (FTA – FTM) – TOV} /GP**

This formula is divided by games played, but instead let’s look at **cumulative efficiency**. This quantity will show us how much a player accumulates over the course of the season. Let’s take a look at John Wall:

**325 + 54 + 147 + 17 + 18 – (255 – 111) – (112 – 84) – 49 = 340 **
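As a check, here is the efficiency formula in code with Wall’s totals plugged in:

```python
def efficiency(pts, reb, ast, stl, blk, fga, fgm, fta, ftm, tov):
    """Box-score efficiency: positives minus missed shots and turnovers."""
    return pts + reb + ast + stl + blk - (fga - fgm) - (fta - ftm) - tov

wall = efficiency(pts=325, reb=54, ast=147, stl=17, blk=18,
                  fga=255, fgm=111, fta=112, ftm=84, tov=49)
print(wall, round(wall / 16, 2))  # 340 21.25
```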

**Note: **Basketball Reference is off by 1 FGA (they have 256) and 1 TOV (they have 48) in case you leverage Justin’s site.

If we divide out by the 16 games played, Wall has an efficiency of 21.25, which is respectable. However, a per-game average hides the games Wall missed. Therefore, we should look at cumulative efficiency. And since we are doing that, we need to look at Team Usage instead of usage in conjunction with cumulative efficiency.

**Note: **Efficiency is a **box score statistic**. In this case it suffers from a lack of pace and possession incorporation. Due to this, teams with a higher tempo may potentially yield higher efficiencies. This is a primary reason that John Hollinger developed the **Player Efficiency Rating**.

## 2017 NBA Season

Let’s take a step back for a moment and apply this methodology to the previous NBA season. If we plot the Team Usage against the Cumulative Efficiency, we get a picture of familiar things.

Ideally, players would be as far north as possible on this graph. These players have high efficiencies, which means these players **score lots of points** or **get lots of rebounds** or even **steal the ball or assist a ton**.

Similarly, if a player is efficient in **scoring**, then a team wants that player’s usage to skyrocket to the right of this graph. It is alright if a player obtains rebounds for the bulk of their efficiency; however, they may not be scoring much if their usage is low.

From the display, we see that **James Harden** and **Russell Westbrook** are indeed the top players in the league when it comes to scoring, rebounding, assisting, and stealing. On top of that, they are used in heavy rotation, **both accounting for over 25% of their team’s chances on offense**. We see two high scoring bigs creep up into the right corner as well with **Karl-Anthony Towns** and **Anthony Davis**.

### Clustering Appears

It is very difficult to have a high efficiency due to scoring but a low usage rate. Consider a player who scores 20 points a game. Typically 20 points a game comes off of roughly 8 field goals and 4 free throws. For the sake of argument, assume that the player is low on all other totals and takes no free throws. Then this player has an efficiency starting at 20 just from the points alone. If this player shoots less than 100% from the field, then the efficiency drops below 20. Therefore, a perfect shooter with 20 points per game requires 10 field goal attempts. As a team typically takes around 85 field goals a game, this player’s usage is already treading around 12%. And this requires perfection.

If the player dithers away from perfect field goal shooting, their usage goes up. However, their efficiency goes down. We start to sway towards volume shooters.
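The trade-off described in the two paragraphs above can be tabulated directly. A sketch for a hypothetical 20-point scorer who takes only two-point shots and contributes nothing else:

```python
team_fga = 85  # rough team field goal attempts per game, as above

def scoring_only(ppg, fg_pct):
    """Efficiency and usage share for a two-point-only scorer."""
    fgm = ppg / 2          # makes needed
    fga = fgm / fg_pct     # attempts needed at this shooting percentage
    eff = ppg - (fga - fgm)  # points minus missed shots
    return eff, 100 * fga / team_fga

for pct in (1.0, 0.7, 0.5):
    eff, usage = scoring_only(20, pct)
    print(pct, round(eff, 1), round(usage, 1))
# at 100%: efficiency 20.0 on ~11.8% usage
# at  50%: efficiency 10.0 on ~23.5% usage
```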

What this starts to show is how players cluster. If we take a look at the display again, we labeled **DeAndre Jordan** and **Rudy Gobert**. These bigs were rough and tumble rebounders, shot blockers, and had high field goal percentages. However, they were low points double-double machines. **Hassan Whiteside** is also creeping in that area. This region is the superstar centers.

The small region bowing out at roughly 20% usage but with low efficiencies contains players who shoot the ball frequently, have lower field goal percentages, but more importantly, **are prone to turnovers**. The two worst culprits are **Devin Booker** and **Andrew Wiggins**, who make up those two markers. There is a third dot that has slightly more usage than Booker and Wiggins, but higher efficiency. This player looks to be part of the turnover machine crowd, but he is borderline. This player is **DeMar DeRozan**.

## 2018 NBA Season

If we take a look at the current season, we see a similar trend.

As we are a quarter of the way through the season, we find ourselves with roughly 700 as the maximum for cumulative efficiency; on pace for roughly 2800, which is near where we topped out in 2017.

We again see familiar faces: **James Harden, LeBron James, Giannis Antetokounmpo,** and **Anthony Davis**. We have lost **Russell Westbrook** and **Karl-Anthony Towns** so far this season. What is impressive about this is LeBron James’ push to the top of this display. Once again, in his 15th year, here’s another analytic that showcases an MVP argument for James.

So let’s start identifying the **average line** for players when it comes to efficiency over usage. By finding this line, we can start figuring out which players are detrimental when usage increases and which players are producing an edge given their usage rates. What this line will not identify is whose usage should be increased, as rebounds are not considered a part of chances.

## Linear Regression

If we apply the naive tactic of fitting a regression line, we do terrible.

First off, we see that the regression line is tilted. This isn’t really a problem as that may be what the data is actually suggesting. In fact, the **R-Squared is 92.44%**. Despite this, are we really sure that the line is that good of a fit?

### Leverage: Looking for Influential Observations

The particular reason that a regression line can have a high R-squared but look tilted is a phenomenon called **leverage**. Leverage is the amount that a regression line **tilts** due to particular data points whose values sit far from the rest of the data (think outliers).

If you’re familiar with linear regression, then you know of the **hat matrix**. The hat matrix is the quantity **H = X(X’X)^(-1)X’**. It is purely based on the explanatory variables; in this case: usage.

Since there are 459 players in the league, the hat matrix will be a 459 by 459 matrix. The diagonal elements of the hat matrix, **h_ii**, indicate the amount of leverage for each data point.
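As a sketch of the mechanics (synthetic usage values standing in for the 459 real ones), the leverages come straight off the diagonal of the hat matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
usage = rng.uniform(5, 35, size=20)                # stand-in usage column
X = np.column_stack([np.ones_like(usage), usage])  # intercept + usage

# Hat matrix H = X (X'X)^(-1) X'; its diagonal h_ii is each point's leverage
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Leverages always sum to the number of fitted parameters (2 here)
print(round(leverage.sum(), 6))  # 2.0
```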

From the leverage plot alone, we cannot ascertain whether any players are truly influential. Instead, we focus on a statistical test such as **Cook’s D-Distance**.

### Cook’s D Distance

Cook’s D-Distance is a metric that attempts to capture the amount of influence exerted by each data point in a linear regression. To calculate it for a data point **i**, we compute the linear regression with the **i**-th data point removed from the model. Performing this for every data point, we obtain **N** models (459 for the 459 players in this case). We compute the squared error between the fitted values of the full model and each leave-one-out model, and divide by a scaled mean squared error. The formula is given by
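In its standard form (writing **ŷ_j(i)** for the fitted value of data point **j** from the model with point **i** removed, **p** for the number of regression parameters, and **s²** for the mean squared error of the full model), the distance is

**D_i = Σ_j ( ŷ_j − ŷ_j(i) )² / (p · s²)**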

Applying this, we find Cook’s D-Distance for every point.

The general rule of thumb is that anything over **4/n** = **0.0087 **is an influential point. The dotted reference line identifies this rule. Here, we find two very particular values that are causing problems. These two players?

**LeBron James** and **Anthony Davis**.

Therefore, these two players are going to be tilting the line. Despite this, the variance plays a crucial role here too.
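The leave-one-out recipe described above is straightforward to compute by hand (statsmodels’ `OLSInfluence` also exposes Cook’s distance directly); a sketch on synthetic data with one planted outlier:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.uniform(0, 30, n)
y = 20 * x + rng.normal(0, 40, n)
y[0] += 400                       # plant one influential outlier

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta
p = X.shape[1]
s2 = ((y - fitted) ** 2).sum() / (n - p)   # mean squared error

# Leave-one-out Cook's D: refit without point i, compare all fitted values
D = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    D[i] = ((fitted - X @ b_i) ** 2).sum() / (p * s2)

flagged = np.nonzero(D > 4 / n)[0]
print(0 in flagged)  # True: the planted outlier crosses the 4/n threshold
```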

### Heteroscedasticity: Residual Plots

**Heteroscedasticity** is when the variance is non-constant for a model. If the variance is constant, then we are said to have **homoscedastic** errors. For the homoscedastic case, we would see an equal width band of data running along the regression line. If not, we have heteroscedasticity.

If we take a look at the display again, we see that as the usage goes up, the band of data about the regression line gets larger. We quickly see that we have heteroscedasticity. Despite having this, we don’t simply declare our regression terrible. We need to understand the errors in our model. To do this, we look at a **residual plot**.

Here, we see that the residuals (errors associated with the fitted regression line) explode as we get larger usages (and inherently larger fitted efficiencies). If we take a look at the histogram, we have that the errors actually look Gaussian!

They are centered just below zero (thanks LeBron and Anthony) but have a fairly symmetric shape. Here’s the reason we have a strong fitting line. So what does this really tell us?

**While the regression line seemingly fits exceptionally well, and we are able to sort players so that high-efficiency, high-usage players are identifiable, we are unable to compare players within the cluster due to LeBron James and Anthony Davis, along with a variance that grows as usage grows.**

So let’s try a **nonparametric attack.**

## Local Linear Regression

**Local Linear Regression** is a nonparametric regression method that abandons the Gaussian assumption and performs a nearest-neighbor weighting using a smoothing kernel. For **LOESS** regression, the tri-cube function is the smoothing kernel.
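A minimal hand-rolled sketch of the idea (in practice, statsmodels’ `lowess` or R’s `loess` do this for you, with more care taken at the boundaries):

```python
import numpy as np

def tricube(u):
    """Tri-cube kernel: (1 - |u|^3)^3 on [-1, 1], zero outside."""
    u = np.clip(np.abs(u), 0, 1)
    return (1 - u ** 3) ** 3

def loess_point(x0, x, y, frac=0.5):
    """Local linear fit at x0 over the nearest frac of the data,
    weighted by the tri-cube kernel."""
    k = max(2, int(np.ceil(frac * len(x))))
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]              # nearest-neighbor window
    w = tricube(d[idx] / d[idx].max())   # scale distances to the window
    X = np.column_stack([np.ones(k), x[idx]])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[idx])
    return beta[0] + beta[1] * x0

x = np.linspace(0, 10, 50)
y = np.sin(x)
print(round(loess_point(5.0, x, y, frac=0.3), 2))  # near sin(5.0)
```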

Let’s see how this works:

Due to the nearest-neighbor technique of local linear regression, we are able to better approximate the mean values of cumulative efficiency amongst the players without having LeBron James or Anthony Davis pulling the line. Given this, we find a counter-intuitive thing: **there are far more players below the red line than above the red line**. If you read the above plots this way, congratulations! If not (and Twitter was 10-for-10 in thinking the red line was **too low** in the upper right), you are not alone.

However, the purple line (LOESS regression) manages to capture this and avoid the overfitting given to us by James and Davis. Now, the challenge is identifying the right neighborhood to smooth over. If we choose too small, we overfit:

The way we find the optimal fit is to use **cross-validation**. By performing a cross-validation, we obtain the following fit.

Now we are able to begin discerning between players with respect to their usage and efficiency. Similarly, we can see how the expected value moves, thanks to LOESS regression, and start to find when players’ efficiencies drastically change as their usage either increases or decreases. This gives us extra insight, analytically, into possible fatigue or learning strategy within the game.