On December 31st, James Harden dropped yet another 40+ point triple-double on the Memphis Grizzlies in a 113 – 101 victory in Houston. It wasn’t the 43 points, 10 rebounds, and 13 assists that were the most impressive stats of the night, but rather the fact that Harden scored 43 points despite making only 8 field goals. If we were to have a discussion on efficiency 20 years ago, let’s call it the Allen Iverson era, we would have called Harden’s performance highly efficient as he had taken only 19 field goal attempts, good for a whopping 2.26 points per field goal attempt; a dream compared to the night the New York Times billed Allen Iverson’s 41 points on 36 field goal attempts as dominant. However, with the power of hindsight, we know how to better measure scoring efficiency thanks to true shooting percentage.
True shooting percentage (TS%) isn’t a new concept by any means. It’s been around for roughly 15 years, and maybe more to some savvy analysts, and has been discussed quite frequently over the years. Four years ago, Justin Willard of Nylon Calculus gave a nice introduction to TS%. For the uninitiated, the formula is given as
TS% = PTS / (2 x (FGA + 0.44 x FTA))
As a quick refresher, the idea of true shooting percentage is simple. We take half of the number of points scored by a player and divide it by the number of possessions that result in a chance at scoring. Some folks call this scoring possessions. Some folks call this scoring chances. Some folks call this true shooting attempts. It has many names and causes a little confusion between analysts from time to time. I personally prefer just calling it a scoring attempt.
Slight Detour on the “44”
As the analytic was developed several years ago, limited access to play-by-play data made it fairly difficult for analysts to correctly count the number of scoring attempts. Through some detailed analysis performed during the same era, a value of .44 was used to help approximate the number of possessions when using box score stats. The idea is that if we know the number of free throws, along with defensive rebounds, field goal attempts, and so forth, and we assume that made field goals terminate possessions (rendering made “And-1” free throws non-possession-ending), then roughly 44% of free throws potentially terminate possessions. If every free throw came from a two-shot foul, that figure would be 50%; the remaining six percentage points account for And-1’s, fouls on missed three-point attempts, and technical, flagrant, clear path, or “Away from the Play” fouls. And the analysis was fairly spot on!
Two years ago, Matt Femrite of Nylon Calculus spent time showing that the coefficient of .44 had become outdated, suggesting that possessions were now being overestimated. As possessions were being over-estimated, it was theorized that the value of .44 was too high, indicating that a higher percentage of free throws were of the three-point variety, And-1’s and the other various types listed above. We took a look at the same phenomenon from an estimation standpoint and found Matt’s work to be corroborated with a proposed .436 value for the 2016-17 NBA season.
While our focus was on possession counting, and inherently its impact on ratings (showing that ratings were now being underestimated), there was a domino effect on TS%: it is now being underestimated as well. Therefore a field goal today is worth less than a field goal yesterday because players are more effective at drawing fouls on higher-valued scoring attempts, particularly And-1’s and three-point attempts.
This led us into digging deep into the distributions of free throws for each team. We found that by counting possessions explicitly, the changes in true shooting percentages would actually shuffle players around from the theoretical answers to an updated tactical answer. Great, but honestly, overkill and unnecessary unless we really wanted to squeeze out the extra point of edge in a game. And we can find 2-3 points somewhere else… just by inducing a rotation. But that’s another story for another time.
If we compare the analytically-derived value of .436 to .44, the coefficient drops by .004, so for every free throw a player attempts, the denominator of TS% (which carries a factor of two) theoretically shifts by .008. The article above shows that the .436 is actually not uniform at all (Nuggets and Jazz are the example) and instead we can use the offset from .436 as a proxy for understanding a player’s ability to attack and finish and a team’s scheme for attacking the rim and drawing fouls beyond the arc. You can read about that above in the last link. Regardless, unless a player takes a significant number of free throws relative to their field goal attempts, the change in TS% is minuscule.
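To get a feel for how small that shift is, here is a minimal Python sketch that evaluates the box-score formula with both coefficients. The stat line (25 points, 18 FGA, 6 FTA) is hypothetical and chosen only for illustration, and the function name ts_pct is our own.

```python
def ts_pct(points, fga, fta, coef=0.44):
    """Box-score true shooting percentage: PTS / (2 * (FGA + coef * FTA))."""
    return points / (2.0 * (fga + coef * fta))


# Hypothetical stat line, not taken from any real game.
points, fga, fta = 25, 18, 6

old = ts_pct(points, fga, fta, coef=0.44)
new = ts_pct(points, fga, fta, coef=0.436)

# The denominator shrinks by 2 * 0.004 = 0.008 per free throw attempt.
denominator_shift = 2.0 * (0.44 - 0.436) * fta

print(f"TS% with 0.44:     {old:.4f}")
print(f"TS% with 0.436:    {new:.4f}")
print(f"Change in TS%:     {new - old:.4f}")
print(f"Denominator shift: {denominator_shift:.3f}")
```

Raising fta relative to fga in the sketch is the quickest way to see when the coefficient starts to matter.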
TS% is Not a Small Sample Analytic
The reason for our reminiscing about the work performed on the “.44” in TS% is due to the fact that this is a regressed value, meaning that we are looking at a distributional effect of free throws on possessions and scoring attempts. Because of this distributional effect, small sample values are going to be relatively meaningless. Let’s consider this example:
Example A: Player A cuts through the lane and hammers home a dunk. What’s their TS%? This is simple: 2 points go in the numerator of TS%, while the denominator sees 1 FGA added to 0.44 x 0 FTA, which is 1. Since we cut points in half and divide, we obtain a true shooting percentage of one.
Example B: Player B cuts through the lane and gets decapitated by a wild-armed center. Fortunately, Player B survives and hits both free throws. What’s their TS%? This is simple as well: 2 points go in the numerator of TS%, while the denominator sees 0 FGA added to 0.44 x 2 FTA, which is 0.88. Since we cut points in half and divide, we obtain a true shooting percentage of 1.14.
This shows the fundamental flaw in applying a large-sample estimator to small samples. Realistically, Example B has one scoring attempt, not .88, so the real true shooting percentage is one. Therefore, we should read TS% alongside other stats, particularly scoring attempts. A savvy analyst today already does this.
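The two examples can be reproduced in a few lines of Python. This is only a sketch: the function names ts_pct and ts_pct_counted are our own, and the second function assumes the scoring attempts have already been counted by hand.

```python
def ts_pct(points, fga, fta, coef=0.44):
    """Box-score estimate: PTS / (2 * (FGA + coef * FTA))."""
    return points / (2.0 * (fga + coef * fta))


def ts_pct_counted(points, scoring_attempts):
    """TS% when the scoring attempts are counted explicitly."""
    return points / (2.0 * scoring_attempts)


# Example A: one made dunk, no free throws.
example_a = ts_pct(points=2, fga=1, fta=0)

# Example B: fouled on the drive, both free throws made.
example_b = ts_pct(points=2, fga=0, fta=2)

# Example B again, but with its single scoring attempt counted directly.
example_b_counted = ts_pct_counted(points=2, scoring_attempts=1)

print(f"Example A (box score): {example_a:.2f}")
print(f"Example B (box score): {example_b:.2f}")
print(f"Example B (counted):   {example_b_counted:.2f}")
```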
So now that you’ve made it this far, it’s time to tell you that this post is neither about the deficiencies of TS% nor how the “.44” is over-valued. There actually aren’t many deficiencies with TS%, as the grain of salt about small samples and fluctuation across teams and players has been well documented. Rather, today’s post is about the distributional aspect of TS% and how we can begin using it to model effects of the game.
Let’s begin with a reader-requested team: the New York Knicks.
New York Knicks and their Scoring Abilities
As an example, let’s consider the 2018-19 New York Knicks. Through their first 40 games, the Knicks have settled into a 10-30 record. Some of this is due to their league-worst .528 TS% and their second-to-last .543 opponent effective field goal percentage. While their other offensive stats are in the top half of the league, there are some deficiencies in defensive rebounding, and the team is middling in opponent TOV% and opponent FTr. Tim Hardaway Jr. is the main catalyst of the offense while Kristaps Porzingis rehabs from the torn ACL he suffered late in the 2017-18 season.
By extracting the different types of free throws, we can now write a scoring attempt as:
FGA + (FTA – A1A – TA – 3A – APA – FA)/2 + 3A/3
This will accurately count the number of scoring attempts generated by the shooter. But, as indicated in the previous section, the updated true shooting percentage, TS%^, rarely budges by more than a percent for anyone. What’s more important here is that we have broken up the components of true shooting percentage into semi-independent, measurable count processes. Wahoo!
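As a rough sketch, the scoring-attempt count can be written as a small Python function. The reading of the abbreviations is our assumption based on the free throw types listed earlier: A1A for And-1 free throws, TA for technical free throws, 3A for free throws from three-shot fouls, APA for away-from-the-play free throws, and FA for flagrant free throws. The function and argument names are hypothetical.

```python
def scoring_attempts(fga, fta, a1a=0, ta=0, three_a=0, apa=0, fa=0):
    """Scoring attempts: FGA + (FTA - A1A - TA - 3A - APA - FA)/2 + 3A/3.

    Assumed meanings (not spelled out in the formula itself):
    a1a = And-1 FTA, ta = technical FTA, three_a = FTA from three-shot fouls,
    apa = away-from-the-play FTA, fa = flagrant FTA.
    """
    # Whatever is left after removing the special cases is treated as
    # ordinary two-shot trips, each worth one scoring attempt.
    two_shot_fta = fta - a1a - ta - three_a - apa - fa
    return fga + two_shot_fta / 2.0 + three_a / 3.0


# Each of these single trips should count as exactly one scoring attempt.
trips = {
    "two-shot foul": scoring_attempts(fga=0, fta=2),
    "made And-1": scoring_attempts(fga=1, fta=1, a1a=1),
    "three-shot foul": scoring_attempts(fga=0, fta=3, three_a=3),
}

for label, attempts in trips.items():
    print(f"{label}: {attempts:.1f} scoring attempt(s)")
```

A lone technical free throw, scoring_attempts(fga=0, fta=1, ta=1), counts as zero scoring attempts under this formula.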
Our ultimate goal is to build a model that identifies the variability of true shooting percentage, as well as provide a guideline for building a regression model to identify the impact of on-court actions that affect TS%. We could be naive and suppose a Gaussian model, but we would have to admit we are ignoring that the Central Limit Theorem fails here, a result that follows from another Nylon Calculus post about the stability of the three point attempt, this time by Darryl Blackport from 4 years ago.
Therefore, we need to identify the counting process associated with the components of the model. And, unfortunately, Poisson ain’t it. In fact, I use the term semi-independent as a surrogate for the fact that we assume independence of the terms even though there may be an argument that the terms are indeed not independent. The term measurable does not mean Lebesgue measure (if you don’t know what that means, it’s cool… we won’t talk about it here anyways), but rather that we can measure the counts using the counting measure. Yes, that’s a joke… but yes… that’s a true mathematical statement too.
All we are saying is that FGA and non-FGA FTA independently occur and that we can count them.
Why Not Poisson?
As we mentioned before, the counting process above is not Poisson. A Poisson distribution describes this process: for a given period of time, if items arrive at random, independent times with a mean rate of arrival (L), how many items will arrive before the period of time ends? The collection of observations from such an experiment forms a Poisson distribution.
This sounds very much like how field goal attempts occur! We have a series of minutes played in a game and we suppose that all field goal attempts are independent. Therefore, the number of field goal attempts that arrive within the time window must follow a Poisson distribution! If you’re an analyst who gave out the “Can you model the 3-PT% of every player in the league?” exercise to potential new hires, you’ve probably been inundated with this exact response. Unfortunately, while it’s a good first try (and you’ll even do well predicting some), you’re failing assumptions and (more specifically) the data science associated with the problem at hand.
The Data Science at Hand
If we were to look into developing a paper for Sloan, we would immediately look into the game theoretic events associated with the types of shots taken. That is: how likely are we to attack the rim given the current situation, with the offensive capabilities handling the possession and the defensive abilities in movement? This type of analysis requires the aid of tracking data. Instead, we stay on task with play-by-play data and ask: how do I model my response of the number of FGA?
Let’s take a look at Tim Hardaway Jr. once again. Over the course of 37 games, Hardaway Jr. has averaged 16.7 FGA, 7.7 3PA, and 5.3 FTA per game. Respectively, the variances for each are 18.3 FGA^2, 6.4 3PA^2, and 14.5 FTA^2. We use the square-notation to indicate the units for the variances. In these cases, none of the variances are the same as the means. However, there are only 37 samples. If we were to fit a Poisson distribution, we would actually obtain a relatively good fit.
We see that Hardaway Jr.’s distribution of FGA doesn’t necessarily satisfy a Poisson distribution, and indeed appears to be over-dispersed. Despite this, with the smaller sample size, are we able to do better? To better understand over- and under-dispersion, we can look at the Conway-Maxwell Poisson model.
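Since a Poisson distribution has variance equal to its mean, a quick way to eyeball these numbers is the variance-to-mean ratio (often called the index of dispersion). The sketch below uses only the per-game means and variances quoted above; the dictionary layout and function name are our own, and with 37 games these ratios are noisy estimates rather than firm conclusions.

```python
# (mean, variance) per game, as quoted above for Hardaway Jr.
stats = {
    "FGA": (16.7, 18.3),
    "3PA": (7.7, 6.4),
    "FTA": (5.3, 14.5),
}


def dispersion_index(mean, variance):
    """Variance-to-mean ratio; equals 1 for a Poisson distribution."""
    return variance / mean


for name, (mean, variance) in stats.items():
    ratio = dispersion_index(mean, variance)
    if ratio > 1:
        label = "over-dispersed relative to Poisson"
    elif ratio < 1:
        label = "under-dispersed relative to Poisson"
    else:
        label = "equi-dispersed"
    print(f"{name}: variance/mean = {ratio:.2f} ({label})")
```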
Conway-Maxwell Poisson Model
The Conway-Maxwell Poisson model is a generalized form of the Poisson model that allows us to estimate over- and under-dispersion through a new parameter, nu. The generalization is in the same vein as the generalization of the Gamma Distribution to obtain the Rayleigh or Weibull distributions. Here, the probability mass function of the Conway-Maxwell distribution is given by
P(X = x) = lambda^x / ((x!)^nu x Z(lambda, nu)), for x = 0, 1, 2, …
where the normalizing constant Z(lambda, nu) is the sum of lambda^j / (j!)^nu over j = 0, 1, 2, …
This model looks very similar to the Poisson model, except that the normalizing constant isn’t a pretty exponential, e^(-lambda). This is where we gain some added flexibility.
In the Poisson model, the value lambda represents the expected number of arrivals over a given period of time. In the Conway-Maxwell distribution, lambda no longer represents this value. Instead it turns into a location-type parameter, which helps “center” the distribution. Similarly, the parameter nu is a scale-type parameter, which helps “smooth” the distribution to give the distribution its shape. These interact together in a non-linear way, meaning a simple adjustment in lambda does not just shift the distribution left or right by that amount despite primarily controlling left and right movement; hence the “type” added. In fact, the expected value cannot be given in closed form other than the infinite sum:
E[X] = (sum over x of x x lambda^x / (x!)^nu) / (sum over j of lambda^j / (j!)^nu), with both sums running over 0, 1, 2, …
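Numerically, both the mass function and the mean are easy to approximate by truncating the infinite sums. The following Python sketch does that with the standard library only; the truncation at 200 terms and the function names cmp_pmf and cmp_mean are our assumptions, and the cutoff should be raised if lambda is large or nu is close to zero.

```python
import math


def cmp_pmf(lam, nu, terms=200):
    """Approximate [P(X=0), ..., P(X=terms-1)] for a Conway-Maxwell Poisson.

    Works on the log scale (lgamma(x + 1) = log(x!)) to avoid overflow,
    then normalizes by the truncated sum in place of the infinite one.
    """
    log_w = [x * math.log(lam) - nu * math.lgamma(x + 1) for x in range(terms)]
    shift = max(log_w)
    weights = [math.exp(v - shift) for v in log_w]
    total = sum(weights)
    return [w / total for w in weights]


def cmp_mean(lam, nu, terms=200):
    """Approximate E[X] by the truncated sum of x * P(X = x)."""
    return sum(x * p for x, p in enumerate(cmp_pmf(lam, nu, terms)))


# With nu = 1 the model collapses to a Poisson, so the mean equals lambda.
print(f"CMP(3.0, 1.0) mean: {cmp_mean(3.0, 1.0):.4f}")

# With nu below 1 the same lambda produces a much larger mean,
# which is why lambda is only a location-type parameter.
print(f"CMP(3.0, 0.5) mean: {cmp_mean(3.0, 0.5):.4f}")
```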
While we can compute the mean numerically, it is still a chore to estimate the two parameters given a data sample. The way we perform this task is to write out the log-likelihood of the distribution, take the partial derivatives and set them equal to zero. This leads us to solving the following equations for lambda and nu:
E[X] = (1/n) x (x_1 + x_2 + … + x_n)
E[log(X!)] = (1/n) x (log(x_1!) + log(x_2!) + … + log(x_n!))
Don’t let those equations fool you, lambda and nu are tucked in the expected values; just use the expectation formula above with the appropriate values for each equation. Given these equations, we do not have a closed-form solution. Therefore, we must apply Newton-Raphson optimization. And once we do that, we can estimate Tim Hardaway Jr.’s FGA using our flexible distribution.
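For readers who want to try this, here is a minimal Newton-Raphson sketch in plain Python. It is not the code used for this post: it solves for log(lambda) and nu, truncates the sums at 200 terms, uses the covariances of X and log(X!) as the Jacobian, and adds a step-halving safeguard, all of which are our own choices. The sample at the bottom is made up purely to exercise the function.

```python
import math


def cmp_summary(log_lam, nu, terms=200):
    """log Z, E[X], E[log X!], Var(X), Cov(X, log X!), Var(log X!) for a truncated CMP."""
    xs = range(terms)
    lf = [math.lgamma(x + 1) for x in xs]  # log(x!)
    log_w = [x * log_lam - nu * l for x, l in zip(xs, lf)]
    shift = max(log_w)
    w = [math.exp(v - shift) for v in log_w]
    z = sum(w)
    p = [v / z for v in w]
    ex = sum(x * q for x, q in zip(xs, p))
    el = sum(l * q for l, q in zip(lf, p))
    vxx = sum((x - ex) ** 2 * q for x, q in zip(xs, p))
    vxl = sum((x - ex) * (l - el) * q for x, l, q in zip(xs, lf, p))
    vll = sum((l - el) ** 2 * q for l, q in zip(lf, p))
    return shift + math.log(z), ex, el, vxx, vxl, vll


def fit_cmp(data, tol=1e-8, max_iter=200):
    """Solve E[X] = mean(x) and E[log X!] = mean(log x!) by Newton-Raphson."""
    n = len(data)
    xbar = sum(data) / n
    lbar = sum(math.lgamma(x + 1) for x in data) / n
    log_lam, nu = math.log(xbar), 1.0  # start from the Poisson fit
    for _ in range(max_iter):
        log_z, ex, el, vxx, vxl, vll = cmp_summary(log_lam, nu)
        g1, g2 = ex - xbar, el - lbar  # residuals of the two equations
        if max(abs(g1), abs(g2)) < tol:
            break
        # Jacobian of (g1, g2) in (log_lam, nu) is [[vxx, -vxl], [vxl, -vll]].
        det = vxl * vxl - vxx * vll
        d_log_lam = (g1 * vll - vxl * g2) / det
        d_nu = (vxl * g1 - vxx * g2) / det
        # Halve the step until the average log-likelihood does not decrease.
        ll_old = log_lam * xbar - nu * lbar - log_z
        step = 1.0
        while step > 1e-8:
            cand = (log_lam + step * d_log_lam, nu + step * d_nu)
            ll_new = cand[0] * xbar - cand[1] * lbar - cmp_summary(*cand)[0]
            if ll_new >= ll_old - 1e-12:
                break
            step /= 2.0
        log_lam, nu = cand
    return math.exp(log_lam), nu


# Made-up sample of per-game counts, for illustration only.
sample = [0, 1, 1, 2, 2, 2, 3, 3, 4, 6]
lam_hat, nu_hat = fit_cmp(sample)
print(f"lambda = {lam_hat:.3f}, nu = {nu_hat:.3f}")
```

An estimate of nu near one points back to the Poisson model, while values below one indicate over-dispersion.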
And we immediately see that a Poisson model is indeed preferred. The maximum likelihood estimates from the Newton-Raphson optimization scheme even favor the value one for nu; which gives us the Poisson model explicitly!
But I thought…
While the above model fits the Poisson distribution, this is in the full unconditional model, meaning that no outside factors affect the distribution of field goal attempts. If we were to suggest that defensive variables affect field goal attempts, we would need to set up a generalized linear model, and the resulting conditional distribution may not be Poisson. However, let’s look at the other portion of the scoring attempt within TS%.
Tim Hardaway Jr.’s FTA
If we look at Tim Hardaway Jr.’s distribution of free throw attempts, we find that it is considerably different from his distribution of FGA. We see that once again the Poisson fit isn’t the greatest, but this time it can be improved.
We see here that the distribution is indeed over-dispersed. In this case, we should definitely find a good fit using the Conway-Maxwell Poisson distribution. And in this case, we find that a lambda of approximately 1.7 and a nu of approximately 0.35 help fit this distribution.
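As a sanity check on those two parameters, we can compute the mean and variance they imply and set them beside the sample values of 5.3 and 14.5 quoted earlier. This is a sketch with a truncated sum (300 terms, our choice) and a hypothetical function name; since lambda and nu are only given approximately, the implied moments are approximate too.

```python
import math


def cmp_mean_var(lam, nu, terms=300):
    """Mean and variance of a Conway-Maxwell Poisson via a truncated sum."""
    log_w = [x * math.log(lam) - nu * math.lgamma(x + 1) for x in range(terms)]
    shift = max(log_w)
    weights = [math.exp(v - shift) for v in log_w]
    total = sum(weights)
    probs = [w / total for w in weights]
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return mean, var


# Approximate fitted values reported for Hardaway Jr.'s FTA.
mean, var = cmp_mean_var(lam=1.7, nu=0.35)

print(f"Implied mean:     {mean:.2f} (sample mean 5.3)")
print(f"Implied variance: {var:.2f} (sample variance 14.5)")
print(f"Variance / mean:  {var / mean:.2f}")
```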
And it’s here that we see the fit of the Conway-Maxwell Poisson model performs much better than the Poisson fit. And it’s this type of data that the majority of NBA players follow when it comes to FGA and FTA per time period over the course of the season. What we now find is that the scoring attempts for TS% can be modeled as a mixture of Conway-Maxwell Poisson models.
What this allows us to do is the following:
- Understand the impact of player variation on TS% and start to log a well-fitting distribution for TS%.
- As long as we develop semi-independent parts to scoring chances (which we did in our three-part breakdown of scoring attempts above), we can sum the distributions.
- Logging the distribution gives us parameters, which change over time. This creates a helpful longitudinal study to monitor player learning.
- Develop a generalized linear model in an attempt to test components.
- Break away from Poisson modeling and build instead a flexible model that better represents the process we are interested in.
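The "sum the distributions" step in the list above amounts to a discrete convolution when the parts are treated as independent. Here is a small Python sketch of that operation; it leans entirely on the independence assumption, and the Poisson inputs are used only because their sum has a known answer to check against. Any list of probabilities, including a truncated Conway-Maxwell Poisson, could be passed in instead. The function names are our own.

```python
import math


def poisson_pmf(lam, terms):
    """[P(X=0), ..., P(X=terms-1)] for a Poisson with mean lam."""
    return [math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))
            for x in range(terms)]


def convolve(p, q):
    """Distribution of X + Y for independent counts X ~ p and Y ~ q."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out


# Check: Poisson(2) + Poisson(3) should match Poisson(5).
total = convolve(poisson_pmf(2.0, 60), poisson_pmf(3.0, 60))
reference = poisson_pmf(5.0, 60)

worst_gap = max(abs(total[k] - reference[k]) for k in range(20))
print(f"Largest gap over the first 20 counts: {worst_gap:.2e}")
```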
And it’s here where the fun begins and becomes challenging. We can now start to develop a distributional model for quantity in the now-traditional Quantity-and-Quality models exploited by Kirk Goldsberry. Therefore we can build a stronger model in predicting quantities of shots over a desired time period with an associated quality, when using a traditional logistic regression model.
At nearly 3,000 words, that’s a different story for a different day.
3 thoughts on “True Shooting Percentage Part I: Introduction and Framework for Advancement”
How could someone somehow use all these to model points and 3s for a player?
What I know is that points follow CMP and threes also CMP.
How could someone simulate threes given that they have some results already from simulating points?
Player A: points~ CMP(l1,n1) and threes ~ CMP(l2,n2)
Points are a cmp mixture of threes and ones or twos combined.
If I simulate 25 points for a match, how can I simulate threes?
And overall respect points and threes distributions?
Use a multilayer regression, or in fancier terms but all the same: a neural network. You can control the loss function using a lagrangian penalization term.