The end goal for all basketball analytics is to gain wins. In order to gain wins, a team must score more points than its opponent. It may seem like a completely obvious, and yet vague, statement, but this is the reality: how does this “newfound” intelligence gain wins? In sports, if this intelligence is proprietary, we call it an edge. Therefore it’s important to understand how an analytic ties back into points scored and, ultimately, wins. In this article, we will focus on building up analytics from the ground up. The goal here is not to be pedantic about all analytics, but rather to gain insight into why many of these analytics eventually revolve around points scored.
The Basic Points Scored Models
To start, let’s first identify how points are scored. In a single game, only free throws, two-point field goals, and three-point field goals can score points. Therefore the basic points model is given by
PTS = FTM + 2 × 2PM + 3 × 3PM
On offense, our goal is to increase all three categories. Conversely, on defense, our goal is to reduce all three categories. Using this formula for points scored leads to uninspiring models. Lots of information is left on the cutting room floor. However, we identify the Madden Equation
Win = {Points scored on offense > Points allowed on defense}
For the uninitiated, this equation is read as a win equals a situation where an offense scores more than a defense. In other words, “The team who scores more than their opponent usually wins.”
Ultimately, we become interested in increasing our number of FTM, 2PM, 3PM while decreasing our opponents’ FTM, 2PM, and 3PM. Or do we?
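As a minimal sketch of the basic points model and the Madden Equation, the snippet below scores one game. The class and function names are illustrative assumptions, the Bucks line is the first game listed later in this article, and the opponent line is invented purely for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringLine:
    """Made shots for one team in one game (hypothetical container)."""
    ftm: int   # free throws made
    fg2m: int  # two-point field goals made
    fg3m: int  # three-point field goals made


def points(line: ScoringLine) -> int:
    """Basic points model: PTS = FTM + 2*2PM + 3*3PM."""
    return line.ftm + 2 * line.fg2m + 3 * line.fg3m


def is_win(offense: ScoringLine, defense: ScoringLine) -> bool:
    """Madden Equation: a win is the event that we outscore the opponent."""
    return points(offense) > points(defense)


# First Bucks game from the counting-vector list: FTM=15, 2PM=28, 3PM=14.
bucks = ScoringLine(ftm=15, fg2m=28, fg3m=14)
# Hypothetical opponent line, invented for illustration only.
opponent = ScoringLine(ftm=20, fg2m=30, fg3m=10)

print(points(bucks), points(opponent), is_win(bucks, opponent))
```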
A Goldsberry Points Scored Model
Nearly a decade ago, Kirk Goldsberry (then of Grantland fame) wrote about the frequency and efficiency of teams and players. It wasn’t a new idea, but it was one of the first times it was explicitly put into practice. While phrases such as “limiting opponents’ attempts” have been around for decades, the points model could be explicitly written as
The trick here is that we multiplied by one and applied the commutative law of multiplication. Voilà! Frequency and Efficiency.
Now we can interpret the points model as gaining or limiting attempts (frequency) while increasing or decreasing field goal or free throw percentages (efficiency).
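Written out, one form consistent with the multiply-by-one description is sketched below. The attempt symbols FTA, 2PA, and 3PA and the percentage shorthand are assumed notation, chosen to match the abbreviations used elsewhere in this article.

```latex
\begin{aligned}
\text{PTS} &= \text{FTM} + 2\cdot\text{2PM} + 3\cdot\text{3PM} \\
           &= \text{FTA}\cdot\frac{\text{FTM}}{\text{FTA}}
            + 2\cdot\text{2PA}\cdot\frac{\text{2PM}}{\text{2PA}}
            + 3\cdot\text{3PA}\cdot\frac{\text{3PM}}{\text{3PA}} \\
           &= \underbrace{\text{FTA}}_{\text{frequency}}\cdot\underbrace{\text{FT\%}}_{\text{efficiency}}
            + 2\cdot\text{2PA}\cdot\text{2P\%}
            + 3\cdot\text{3PA}\cdot\text{3P\%}
\end{aligned}
```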
While this is a much more helpful “model” when it comes to understanding points, it’s still light on intelligence. Thankfully, there are many ways to branch from here. Let’s start with a little bit of Four Factors.
An Oliver Points Scored Model
In 2002, Dean Oliver introduced the idea of the Four Factors to the world. Within the four factors, Effective Field Goal Percentage and Free Throw Rate were identified as key components for understanding team success. We can obtain the relationship of these to points scored by again introducing the multiply by one trick:
In this case, we obtain points as a multiplier to FGA. There is no marked difference between this model and the Goldsberry model above, other than the follow along analysis, which we will eventually get to below. Before that, another few representations of the model.
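A sketch of how the multiply-by-one trick produces this form follows. It assumes the usual definitions FGM = 2PM + 3PM, FGA = 2PA + 3PA, and eFG% = (FGM + 0.5 · 3PM) / FGA, and it takes free throw rate as FTM/FGA. When free throw rate is instead defined as FTA/FGA, the last term becomes FT% · FTA/FGA.

```latex
\begin{aligned}
\text{PTS} &= \text{FTM} + 2\cdot\text{2PM} + 3\cdot\text{3PM} \\
           &= \text{FTM} + 2\left(\text{FGM} + 0.5\cdot\text{3PM}\right) \\
           &= \text{FTM} + 2\cdot\text{FGA}\cdot\text{eFG\%} \\
           &= \text{FGA}\left(2\cdot\text{eFG\%} + \frac{\text{FTM}}{\text{FGA}}\right)
\end{aligned}
```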
A Zonal Model
The zonal model builds off of the Goldsberry model above by introducing spatial components into the attempts. Using this model, we determine ahead of time which locations, called bins, we would like to aggregate attempts into. The traditional model used seven regions: 2 corner three slots, rim, paint, non-paint 2-pt FGA, and above-the-break left / right 3PA. As of this morning, NBA stats uses a total of fifteen spatial locations:
Writing out the model for this becomes tedious. Instead, we use summation notation to condense the points scored model. We label S_2 as the set of zones within the 2-point range and S_3 as the set of zones within the 3-point range. The notation “i in S_2” simply means we are taking bin “i” from S_2. Understanding this, we have the zonal model of
From this model, we can start to ask where players are taking (and making) their field goal attempts. Using this block model, we can discretize field goal locations. It’s simple to understand and helps to quickly tease out locations.
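One way to write the zonal model described above, extending the frequency and efficiency form zone by zone, is sketched here. The per-zone symbols FGA_i and FG%_i are assumed notation for attempts and field goal percentage in bin i.

```latex
\text{PTS} = \text{FTA}\cdot\text{FT\%}
           + 2\sum_{i \in S_2} \text{FGA}_i\cdot\text{FG\%}_i
           + 3\sum_{i \in S_3} \text{FGA}_i\cdot\text{FG\%}_i
```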
The Counting Model
Building off of the spatial model, if we apply the counting measure instead of building a discrete binning structure, we obtain the traditional shot chart. The chart itself is a “well, duh,” but the mathematics behind it is much more complicated and allows us to do some fairly wild analysis. Instead of making this a PhD level analysis, we will use the term counting measure to illustrate field goal attempts. In reality, we are using the Lebesgue measure and this allows us to immediately develop random process models, which are leveraged near-explicitly in Expected Point Values (EPV) and what I’ve called my “Left-Of-Boom” Model from 2015.
The idea here is that every field goal attempt is part of a spatial point process that occurs with some random structure. In doing this, we divide the court into spatial regions again, this time using D_2 for the region of two point attempts and D_3 as the region of three point attempts. The model is then written as
Using the two integrals allows us to plot attempts such as these for Trae Young, Giannis Antetokounmpo, and Ben Simmons.
Immediately, we see that Trae Young is primarily an above-the-break shooter with a tendency to kick to the corners. Similarly, Giannis Antetokounmpo is a left-wing three point shooter, but will primarily attack the rim. Finally, Ben Simmons is not a three point shooter. He gets a ton of assists out there, but he’s primarily a paint player.
From here, we would then model points as a process model. The benefit of this is that we can get surgical with our analysis. The drawback is that we need lots of data, which we rarely have.
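To make the idea of counting attempts over space concrete, here is a rough sketch that counts attempts and makes on a coarse grid. It is only a binned stand-in for the point-process view, not a process model. The shot coordinates, outcomes, and grid are all hypothetical and are not real tracking data.

```python
import numpy as np

# Hypothetical shot locations in feet: x across the court (0 = rim line),
# y measured out from the baseline. Invented for illustration only.
shots_xy = np.array([
    [0.5, 1.0], [-1.0, 2.0], [22.0, 1.5], [-22.0, 2.5],
    [0.0, 25.0], [8.0, 23.5], [-3.0, 1.5], [10.0, 12.0],
])
made = np.array([1, 1, 0, 1, 0, 1, 0, 0])  # 1 = make, 0 = miss (hypothetical)

# An arbitrary grid: five 10-foot columns by three 10-foot rows.
x_edges = np.linspace(-25.0, 25.0, 6)
y_edges = np.linspace(0.0, 30.0, 4)

# Counting attempts per cell; weighting by `made` counts makes per cell.
attempts, _, _ = np.histogram2d(
    shots_xy[:, 0], shots_xy[:, 1], bins=[x_edges, y_edges]
)
makes, _, _ = np.histogram2d(
    shots_xy[:, 0], shots_xy[:, 1], bins=[x_edges, y_edges], weights=made
)

# Field goal percentage per cell; cells with no attempts are left as NaN.
with np.errstate(invalid="ignore", divide="ignore"):
    fg_pct = np.where(attempts > 0, makes / attempts, np.nan)

print(attempts)
print(fg_pct)
```

With only a handful of shots, most cells are empty, which is a small-scale picture of the data problem noted above.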
A Ratings Model
While we’ve primarily focused on counting the number of attempts, we also would like to look at Per Possession Models. The easiest way to limit points for a team on defense is to reduce the number of possessions. However, looking at the Madden Model above, limiting possessions, while indeed reducing points, does not necessarily translate to wins, as the number of offensive possessions decreases as well. Instead we may be more interested in points per possession, or (effectively) ratings; that is, points per 100 possessions. Since games tend to hover near the 100 possession mark, we will focus on a ratings model for scoring.
To build a rating, we simply count the number of possessions and divide the total number of points scored by that count. To make the number more palatable, we then multiply the result by 100, obtaining what we call a rating. Applying this to the points model, we obtain
This can be used in the Goldsberry Frequency-Efficiency model as well:
And immediately, we focus on the likelihood of getting at least one 2PA or 3PA within a single possession. More importantly, we begin to introduce, through possession counting, non-scoring factors such as turnovers and missed FGA/FTA with defensive rebounds. These are situations with potentially zero points scored on a possession. In fact, this leads us to the last model we will touch on: the zero point model.
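Written out, the two rating forms described in this section can be sketched as follows. The symbol POSS for the possession count is assumed notation. The second line shows the Goldsberry-style split, where each frequency becomes attempts per possession.

```latex
\begin{aligned}
\text{Rating} &= 100\cdot\frac{\text{FTM} + 2\cdot\text{2PM} + 3\cdot\text{3PM}}{\text{POSS}} \\
              &= 100\left(\frac{\text{FTA}}{\text{POSS}}\cdot\text{FT\%}
               + 2\cdot\frac{\text{2PA}}{\text{POSS}}\cdot\text{2P\%}
               + 3\cdot\frac{\text{3PA}}{\text{POSS}}\cdot\text{3P\%}\right)
\end{aligned}
```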
A Zero Point Model
The zero point model includes zero point possessions. It’s implicitly defined in all the above models, but we intentionally left it out to build up the thought process in modeling points scored. In this model, we simply introduce the zero point:
Notice we put an X in the model. This is why we left out the zero points in all the above models. The challenge that arises is placing the appropriate features into X. Since X is flattened by zero, interpretation of the feature is lost, which is why a ratings model becomes much more favorable. Initially, we can think of the obvious: turnovers. Along the same lines, defensive rebounds would lead to zero points for the action.
It is important to note that points can still be scored within a possession that contains a turnover or defensive rebound. Despite this, all of the models above partition these instances into points scored prior to a defensive rebound or turnover.
While these two values are obvious, we are missing much much more when it comes to analysis of points. And it’s here where we actually put action into the term MODEL.
Finally Putting the MODEL of “Model” to Work
The word model simply means a description of a system. In the above, we described points as a function of field goals, turnovers, and defensive rebounds. In reality, the system of the game of basketball is much more. And its impact on points scored is important. For instance, what’s the value of Andre Drummond and Steven Adams to their respective teams? They are big on offensive rebounds.
Similarly, what’s the impact of a drive from De’Aaron Fox? From Trae Young? From Lonzo Ball? If we run the Hammer to Patty Mills, how likely is he to score? More importantly, how likely will that shot attempt be available? It’s these questions that begin to drive the components of the models above.
Treating outcomes as observations from a probabilistic model, we can begin to statistically model points scored. Let’s go back to that counting model. Let’s treat the Milwaukee Bucks as an opponent. The Bucks still lead the league as of this morning in percentage of FGA at the rim with 35.2% of attempts within three feet. They are second in the league in percentage of FGA beyond the arc with 41.8%. And the Bucks are very effective at the rim (70% – 3rd in the league) despite being pedestrian from beyond the arc (35% – 15th in the league).
Now, we are going to start at a basic level. Anything more would be too complicated for this setting. Let’s just assume that all attempts are generated by some distribution and nothing more. No assists, drives, screens, etc. At the novice level, we’d say FTM, FGM, and TOV all follow a Poisson distribution, but that’s if we’re completely disregarding the game of basketball. Instead, let’s understand the system.
Look at the Counting Vector
The counting vector is a vector that contains the number of instances of a component of interest. For our most basic scoring model, these might be FTM, 2PM, 3PM, and TOV. Then we look at the distribution of each over the course of a game. For our opponent, the Milwaukee Bucks, some examples (FTM, 2PM, 3PM, TOV) are:
- 15, 28, 14, 21
- 13, 27, 17, 17
- 17, 28, 17, 14
- 28, 28, 13, 11
- 8, 30, 19, 17
- 21, 31, 10, 21
- 15, 26, 19, 17
- 22, 32, 9, 14
- 30, 24, 22, 12
- 7, 24, 16, 11
These are the numbers from the Bucks’ first 10 games. For all games this season, we end up with a correlation structure that’s not excessively significant, but also not surprising.
Here we see that whenever the Bucks make fewer 3PM, they get more 2PM (correlation -.34) and that turnovers are positively correlated with FTM and 3PM (correlations .03 and .18, respectively). For the 3PM and 2PM relationship, we have a significant enough correlation (p-value < .0005, effect size > 3.0) to suggest that there is an inverse relationship between the two components. Comparing 3PM to turnovers, we see a similar relationship: a p-value of .03 and a significant effect size of over 3.0. This indicates the Bucks turn the ball over more often when 3-point attempts are not falling.
Similarly, there is a weak relationship between 2PM and turnovers. Here we obtain a p-value of .03 but an effect size of 0.17, which indicates it’s a weak but potentially real positive relationship. This would indicate that the team is more likely to turn the ball over when attacking inside the 3-point line.
Finally, all other relationships are weak-to-nonexistent.
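The correlation structure discussed above can be computed along these lines. This sketch uses only the ten games listed earlier, so its values will differ from the full-season figures quoted in the text. The variable names are illustrative.

```python
import numpy as np

# Counting vectors (FTM, 2PM, 3PM, TOV) for the ten games listed above.
games = np.array([
    [15, 28, 14, 21],
    [13, 27, 17, 17],
    [17, 28, 17, 14],
    [28, 28, 13, 11],
    [8, 30, 19, 17],
    [21, 31, 10, 21],
    [15, 26, 19, 17],
    [22, 32, 9, 14],
    [30, 24, 22, 12],
    [7, 24, 16, 11],
])
labels = ["FTM", "2PM", "3PM", "TOV"]

# rowvar=False treats each column as a variable and each row as a game.
corr = np.corrcoef(games, rowvar=False)

# Print the upper triangle: one line per pair of components.
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        print(f"{labels[i]} vs {labels[j]}: {corr[i, j]:+.2f}")
```

Ten games is a very small sample, so these pairwise values should be read as a demonstration of the calculation rather than as evidence about the Bucks.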
What this exercise emphasizes is that there is a correlative relationship between the different mechanical parts of the basic model. This is a big deal, as we cannot assume independence. Therefore the model for even the basic case is:
And here we don’t specify the error distribution. Instead, we identify that the error is unbiased with some covariance artifact. Therefore, we may be interested in expanding the model to include interactions.
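To see why the covariance matters even in the basic model, one standard identity is the variance of a weighted sum. It is offered here as a sketch under the basic points model, not as a restatement of the model above.

```latex
\begin{aligned}
\operatorname{Var}(\text{PTS}) &= \operatorname{Var}(\text{FTM}) + 4\operatorname{Var}(\text{2PM}) + 9\operatorname{Var}(\text{3PM}) \\
 &\quad + 4\operatorname{Cov}(\text{FTM},\text{2PM})
        + 6\operatorname{Cov}(\text{FTM},\text{3PM})
        + 12\operatorname{Cov}(\text{2PM},\text{3PM})
\end{aligned}
```

Under independence the covariance terms vanish. With an inverse relationship between 2PM and 3PM, the last term is negative, so an independence assumption would overstate the game-to-game spread in points.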
Or we try a different approach.
Conditional or Hierarchical Modeling
A popular method in basketball analytics is to develop a conditional or hierarchical model. These models assume that quantities FGM or FGA are sub-targets that are responses of other basketball characteristics such as passes, drives, “openness” for attempt, defensive pressure. The most common example is the Shot Quality model. In this model, we typically (implicitly or explicitly) model FG% based on minutes played, distance to closest defender, shot location, number of dribbles taken, etc. In this case, we can write the distribution of points scored as
We can also begin to derive more complicated models, such as EPV, through the counting process model. In these cases, we can surgically build a hierarchical model that takes the spatial, temporal, and mechanical components of the game and develop it into a sophisticated model that ultimately goes back to quantifying the Madden Model.
The resulting coefficients of these hierarchical components help us then identify the contribution a player makes within the scoring model. Want to improve scoring when Lonzo Ball is on the court? We can now measure the impact of pick-and-roll offenses that lead to drives; with understanding of personnel on the court.
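As a rough illustration of the Shot Quality idea described above, the sketch below treats make probability as a logistic function of two shot characteristics and converts it to expected points. The coefficients are hypothetical placeholders, not fitted values, and the choice of features is an assumption made for brevity.

```python
import math

# Hypothetical coefficients, invented for illustration; a real model would
# estimate these from shot-level data and include many more features.
INTERCEPT = 0.4
COEF_SHOT_DISTANCE = -0.06      # per foot from the rim
COEF_DEFENDER_DISTANCE = 0.10   # per foot of separation from closest defender


def make_probability(shot_distance_ft: float, defender_distance_ft: float) -> float:
    """Logistic sub-model: P(make | shot distance, defender distance)."""
    z = (
        INTERCEPT
        + COEF_SHOT_DISTANCE * shot_distance_ft
        + COEF_DEFENDER_DISTANCE * defender_distance_ft
    )
    return 1.0 / (1.0 + math.exp(-z))


def expected_points(
    shot_value: int, shot_distance_ft: float, defender_distance_ft: float
) -> float:
    """Expected points for one attempt: shot value times make probability."""
    return shot_value * make_probability(shot_distance_ft, defender_distance_ft)


# Compare a contested and an open three-point attempt from 24 feet.
print(expected_points(3, 24.0, 2.0))
print(expected_points(3, 24.0, 6.0))
```

The hierarchy is visible in the two functions: the make probability is the sub-target, and points are a response built on top of it.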
But be cautious when performing a hierarchical analysis. Small samples will begin to creep in, and borrowing strength will ultimately become important. And it’s here where the edge gets to be gained.
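One common way to borrow strength is to shrink a small-sample percentage toward a group average. The sketch below uses the posterior mean of a beta-binomial model as an assumed approach. The prior mean and prior weight are hypothetical tuning values, not estimates from data.

```python
def shrunk_fg_pct(
    makes: int, attempts: int, prior_mean: float, prior_attempts: float
) -> float:
    """Posterior-mean FG% under a beta prior worth `prior_attempts` pseudo-shots.

    prior_mean and prior_attempts are hypothetical inputs: the group average
    to shrink toward and how many attempts that average should count for.
    """
    return (makes + prior_mean * prior_attempts) / (attempts + prior_attempts)


# A 4-for-5 sample barely moves off the prior; 400-for-1000 mostly keeps
# its own rate.
small_sample = shrunk_fg_pct(makes=4, attempts=5, prior_mean=0.35, prior_attempts=50)
large_sample = shrunk_fg_pct(makes=400, attempts=1000, prior_mean=0.35, prior_attempts=50)

print(round(small_sample, 3), round(large_sample, 3))
```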