As of the morning of November 20th, the Four Factors line for the 7-9 Dallas Mavericks reads as follows:

- Effective Field Goal Percentage: 0.527 (8th in the league)
- Turnover Percentage: 14.4 (28th in the league)
- Offensive Rebound Percentage: 22.6 (18th in the league)
- Free Throw Rate: 0.22 (7th in the league)
- Opponent eFG%: 0.527 (21st in the league)
- Opponent TOV%: 13.6 (8th in the league)
- Defensive Rebound Percentage: 79.6 (4th in the league)
- Opponent FTr: 0.203 (15th in the league)

Overall, the Mavericks appear to be a wildly varying team: one that turns the ball over and has difficulty obtaining offensive rebounds, but can score and get to the rim. The defense is a mirror image: strong at inducing turnovers and able to rebound, but with difficulty stopping opponents from scoring. Reading through the four factors, it would appear that teams play at (and mirror) the level of Dallas, which is potentially a reason why the team is near .500.

So if we are to report how the team is doing to coaches or management, how do we go about presenting and, more importantly, discussing these numbers?

Over the previous eight seasons in the NBA, I’ve witnessed effectively every team present these numbers as **rankings** or **percentiles**. Most commonly, the phrase “We are **xx**-th in the league…” is the one I have heard. And effectively every coaching staff or management chain responds in almost identical fashion: “How do we move up…?” It’s a curious tale that once got me into a philosophical debate with a head coach about the difference between moving from 11th to 10th in one statistical category versus moving from 11th to 10th in another. The problem was that the report didn’t indicate the value of the difference between 10th and 11th. Both had been wistfully whisked away to Gaussian-land, making them look identical, when in reality the stats were telling different stories.

## Gaussianity Is Not Your Friend.

The biggest challenge is that many analysts enjoy standardizing data and treating it as Gaussian. Standardization is indeed helpful when attempting to remove scaling effects in an effort to treat distributions on **the same scale**. However, mean-zero, standard-deviation-one distributions are not all the same. Take, for instance, this fun example: one distribution is a Gaussian sample of scores that are standardized (orange group), while another is an Exponential sample of scores that are standardized (blue group). Each has the same mean and standard deviation. But what do their distributions look like…
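As a minimal sketch of this example (the seed and sample size are arbitrary choices, not the ones behind the plots), standardizing both samples makes them numerically identical in mean and standard deviation while the shapes stay different:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n = 30  # one value per NBA team

# Draw one Gaussian and one Exponential sample, then standardize each
gauss = rng.normal(size=n)
expo = rng.exponential(size=n)

def standardize(x):
    return (x - x.mean()) / x.std()

gauss_z, expo_z = standardize(gauss), standardize(expo)

# Both now have mean ~0 and standard deviation ~1...
print(gauss_z.mean(), gauss_z.std())
print(expo_z.mean(), expo_z.std())

# ...but the shapes differ: the standardized Exponential remains right-skewed
print(skew(gauss_z), skew(expo_z))
```

Plotting the two standardized samples as histograms or kernel density estimates reproduces the orange-versus-blue picture described above.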

Now, if we compute the rankings, we see that movement among the rankings for the exponential group should be much easier to achieve than movement within the Gaussian group. This is because many of the teams are concentrated tightly about a small set of values; in this example, between -1 and -2. The Gaussian group has a more difficult time moving, as the bulk of the teams are spread from -1.5 to 1.5. If we impose a Gaussian percentile, we would seriously undercut the tightness of the Exponential distribution and would therefore mislead team officials about the value of a statistic.

## Distribution of Four Factors

So let’s take a look at the distribution of the Four Factors for the 2018-19 NBA season. To do this, we merely copy the Four Factor stats from Basketball Reference to create an importable csv file that we read into a pandas data frame. We also build off the **RANK** functionality of Excel to produce sample ranks for each team in each Four Factor category. This gives us rank columns that we can also import into Python.
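The Excel **RANK** step can also be reproduced directly in pandas. A sketch with made-up values (the team abbreviations and numbers below are stand-ins, not the real 2018-19 table; in practice the frame would come from `pd.read_csv` on the exported file):

```python
import pandas as pd

# Tiny stand-in for the Basketball Reference table (values are made up)
ff = pd.DataFrame({
    "Team": ["DAL", "GSW", "PHX", "MIL"],
    "eFG%": [0.527, 0.560, 0.489, 0.571],
    "TOV%": [14.4, 13.2, 15.1, 13.8],
})

# Reproduce Excel's RANK: rank 1 = best team in the category.
# Higher is better for eFG%; lower is better for TOV%.
ff["eFG%_rank"] = ff["eFG%"].rank(ascending=False, method="min").astype(int)
ff["TOV%_rank"] = ff["TOV%"].rank(ascending=True, method="min").astype(int)
print(ff)
```

The `method="min"` argument matches Excel's tie-handling, assigning tied teams the best shared rank.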

Then, given the Four Factors table, we trim down to a condensed data set of offense-only factors. We leverage the seaborn package to produce a nice bi-variate distribution plot, which results in:

Unfortunately, on the diagonal, the estimated densities are flattened; so let’s recover Offensive Rating, Turnover Percentage, and Offensive Rebound Percentage:

We might be able to sneak under the radar for OREB% through the use of a statistical test and rely on small sample sizes. However, it’s near blasphemous to treat Free-Throw Rate, TOV%, and Effective FG% as Gaussian due to their heavy right tails. If we p-hack, we can claim Gaussianity; but that’s being deceptive (and lazy) in reporting.

Remember that example we had above? Compare those exponential plots to the KDE plots above… There’s a reason for that example.

## Building a Percentile

The way we build a percentile is simple. The basic definition of a **sample percentile** is the location beneath which a **certain percentage of points** fall. The most common example is the **median**. For the median, we are looking for the explicit value of a statistic such that **fifty percent** of the data falls below that value. For a given data set, the sample median is either a data point (if there is an odd sample size) or **any value in-between two data points** (if there is an even sample size). Let’s see an explicit example…

Consider the set of effective Field Goal percentages heading into today:

If we sorted these values (right panel above), we immediately see that the median is located between the 15th and 16th data points: any value in between .517 and .518. Our undergraduate textbook may tell us to just average these values to obtain a median of **0.5175**.

If we continued this example, we would find that the 25th percentile is 0.502. There is no give or take on that: it’s an exact data point. Similarly, the 75th percentile is 0.527. However, if we keep going out toward the tails, we see that the distribution of effective Field Goal percentage is skewed to the right. The way we view this is through a **Probability Plot**.
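The bookkeeping above can be checked numerically. A small sketch with a hypothetical even-sized sample (not the real league values) shows both conventions: averaging the two middle points for the median, and picking an exact data point for a quartile:

```python
import numpy as np

# Hypothetical even-sized sample of eFG% values (not the real league table)
efg = np.array([0.502, 0.511, 0.517, 0.518, 0.527, 0.534])

# With an even sample size, the textbook median averages the two middle points
print(np.median(efg))  # (0.517 + 0.518) / 2 = 0.5175

# np.percentile defaults to linear interpolation between order statistics;
# method="lower" (newer NumPy) instead returns an exact data point
print(np.percentile(efg, 25, method="lower"))
```

Different `method` choices give slightly different percentiles between data points, which is exactly the "any value in-between" freedom described above.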

In the probability plot, we see a semblance of a Gaussian distribution, but the tails are definitely skewed, with the observed values pulling away from the theoretical quantiles. **Side Note:** A quantile is the value of the statistic at a given percentile; some textbooks use the terms interchangeably.

So how does eFG% compare to the exponential and Gaussian distributions? Well, let’s look at the Probability Plots for both. Referring to the example above, we have:
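Such probability plots can be produced with `scipy.stats.probplot`. A sketch using simulated stand-in values (the sample here is not the real eFG% data):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Simulated stand-in for the 30 eFG% values
rng = np.random.default_rng(2)
efg = rng.normal(0.52, 0.02, 30)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(efg, dist="norm", plot=ax1)   # compare against Gaussian quantiles
stats.probplot(efg, dist="expon", plot=ax2)  # compare against Exponential quantiles
ax1.set_title("eFG% vs. Gaussian")
ax2.set_title("eFG% vs. Exponential")
fig.savefig("probability_plots.png")
```

Departures from the fitted red line in each panel are the tail bends discussed above.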

We see that the eFG% is actually closer to the Gaussian set-up, as its tails aren’t nearly as skewed. However, if we take a nuanced look at the plots, we see the same shapes in the eFG% plot as in the exponential one: both tails sit above the red line, with a pronounced bend below it. So we see a semblance of both the Gaussian and the Exponential distribution. So which is it?

## Effective Field Goal Percentage is Neither Gaussian nor Exponential…

If we apply an **Anderson-Darling goodness-of-fit** test, we find that the distribution of effective Field Goal percentages is **neither**. In fact, the p-values (and effect sizes, for those who knee-jerk against p-values) strongly disagree with the distribution being an Exponential (10E-9) or a Gaussian (10E-14). At least it notes that the sample is more akin to an exponential than to a Gaussian, as we indicated above!
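As a hedged sketch: SciPy's `scipy.stats.anderson` implements this test for a handful of null distributions, though it reports a statistic plus critical values rather than p-values (the p-values quoted above would come from a separate computation). With a simulated stand-in sample:

```python
import numpy as np
from scipy import stats

# Simulated stand-in sample; swap in the real eFG% values in practice
rng = np.random.default_rng(3)
sample = rng.exponential(size=30)

# Compare the Anderson-Darling statistic against tabulated critical values
for dist in ("norm", "expon"):
    res = stats.anderson(sample, dist=dist)
    reject = res.statistic > res.critical_values[2]  # 5% significance level
    print(dist, round(res.statistic, 3),
          "reject" if reject else "fail to reject")
```

The critical values correspond to the significance levels in `res.significance_level` (15%, 10%, 5%, 2.5%, 1% for these two distributions).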

To bring this example home, the same tests identify the exponential sample as Exponential (0.97) and the Gaussian sample as Gaussian (0.922). This means that if we do indeed apply a **parametric distribution**, we’d be directly lying by giving false information. Remember the core problem we are trying to answer? **How do we report “percentiles” effectively across all statistics?**

So what do we do to correct this issue? The simplest way is to rely on **nonparametric statistics**. We are already sort of doing this by deferring to the rank and percentile; but the traditional analyst tends to force Gaussianity, which we just dispelled using eFG%. Instead of focusing on **z-scores**, we should really zero in on the **empirical distribution function (EDF)**.

## Empirical Distribution Function

The empirical distribution function is an estimator for the cumulative distribution function. The EDF is a stepwise function that counts the percentage of data points at or below a given value. If we take a quick glance back at the sorted eFG% values, we would see **zero** until the value **0.487**, where the function jumps by 1/30. The plot then remains constant until we reach the value **.490**, where a jump from 1/30 to 2/30 occurs. This continues until we run across all data points in the sample. The resulting R plot is then:
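In Python the EDF takes only a few lines of NumPy. A sketch with toy values (not the full 30-team sample):

```python
import numpy as np

def edf(sample):
    """Return a step function F(x) = fraction of sample values <= x."""
    xs = np.sort(np.asarray(sample))
    n = xs.size

    def F(x):
        # searchsorted counts how many sorted values sit at or below x
        return np.searchsorted(xs, x, side="right") / n

    return F

# Toy check with five values; each sorted point adds a jump of 1/5
F = edf([0.487, 0.490, 0.502, 0.517, 0.527])
print(F(0.486))  # 0.0 -- below the smallest value
print(F(0.490))  # 0.4 -- two of five values are <= 0.490
print(F(0.600))  # 1.0 -- all values are below 0.600
```

Evaluating `F` on a grid and drawing it with a step plot reproduces the staircase shape described above.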

**Side Note: **This is one of those times where R completely outperforms Python. Can you believe that Python does not have an EDF function built in?!?!

Again we see that aggressive right tail in the EDF. This is due to the Golden State Warriors (2nd) and Milwaukee Bucks (1st). Now, if we simply report the position/rank, we wouldn’t necessarily be lying, but we would be misrepresenting the data. Let’s compare eFG% to OREB%. Notice in the **sunny-side up plots** above, there is actually a seemingly **negative trend** between eFG% and OREB%. We can tell this from the two-dimensional distribution pulling along the line Y = -X. Regardless, let’s plot the EDF for OREB%:

Notice that the tail is left-heavy for OREB%. This is due to the Chicago Bulls, Memphis Grizzlies, and Phoenix Suns.

Now, using the EDFs, we can apply **Bernoulli distributions** to understand the statistical properties of the statistics of interest. To see this, we just have to take a moment to recall what the EDF is doing. The EDF asks whether a **data value** is below a particular value of the domain **x**. For example, **is team i below the value of x = .175 for OREB%?** If the team is the **Chicago Bulls**, then the answer is yes. Otherwise the answer is no. By removing the label and treating all teams as a random sample, we get a value of **1 (yes)** or **0 (no)** for that **one data point**. Therefore, the probability of falling below, say, .175 is just the **TRUE CDF** at that particular value. This means we can start making inference on the true distribution without having to guess a distribution!
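A sketch of that Bernoulli view with hypothetical OREB% values (the numbers below are illustrative, not the real league table):

```python
import numpy as np

# Hypothetical OREB% values, one per team (illustrative only)
oreb = np.array([0.175, 0.181, 0.198, 0.205, 0.212, 0.226, 0.241])

x = 0.185
indicators = (oreb <= x).astype(float)  # one Bernoulli trial per team
p_hat = indicators.mean()               # the EDF evaluated at x
n = indicators.size

# Bernoulli variance estimate: Var(F_n(x)) ~ p(1 - p)/n, with no
# distributional assumption required
se = np.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat, se)
```

The standard error falls out "for free" precisely because each indicator is a Bernoulli draw with success probability equal to the true CDF at `x`.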

And since this is a Bernoulli random variable for each team, we obtain a variance estimate for free! So let’s understand what this is telling us…

## What’s better… 28th in OREB% or 28th in eFG%?

We will start simple: let’s compare two rankings across two categories. Suppose we are a team in such a position and need to focus on personnel changes or coaching strategies to help nudge the values in a positive direction. Further suppose that, while we’d love to improve both categories, we only have enough budget to isolate one category. Which do we choose?

28th in OREB% puts us at **18.1%** (Phoenix Suns) and 28th in eFG% puts us at **49.7%** (Minnesota Timberwolves). Now, the movement for a team to jump a spot requires an improvement of **1.8%** in offensive rebounding percentage, but only **0.1%** in effective Field Goal Percentage. Ideally, we would use a prior distribution on the counting stats that construct the statistic of interest; however, reporting rarely does that. If we did perform the prior calculations, we could further identify how close a team really is to capturing the next spot. Instead we focus on EDF calculations alone.

Applying the Dvoretzky-Kiefer-Wolfowitz (DKW) bound, we obtain the following confidence regions:
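The DKW bound gives a simultaneous confidence band of half-width sqrt(ln(2/alpha) / (2n)) around the EDF. A sketch, using a simulated stand-in sample:

```python
import numpy as np

def dkw_band(sample, alpha=0.05):
    """Simultaneous (1 - alpha) confidence band for the EDF via the DKW bound."""
    xs = np.sort(np.asarray(sample))
    n = xs.size
    F = np.arange(1, n + 1) / n                  # EDF value at each sorted point
    eps = np.sqrt(np.log(2.0 / alpha) / (2 * n)) # DKW half-width
    lower = np.clip(F - eps, 0.0, 1.0)
    upper = np.clip(F + eps, 0.0, 1.0)
    return xs, lower, F, upper

# Simulated stand-in for the 30 eFG% values
rng = np.random.default_rng(4)
xs, lo, F, hi = dkw_band(rng.normal(0.52, 0.02, 30))

# With n = 30 and alpha = 0.05, the half-width is roughly 0.248
print(round(float(np.sqrt(np.log(40.0) / 60.0)), 3))
```

Note how wide the band is for a 30-team sample; this width is what lets ranks several spots apart be statistically indistinguishable.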

And what we can do is measure the deviations from north-to-south to help understand where a team really sits, despite their hard-coded number of, say… **18.1%**. In this instance, 28th is as good as 25th in offensive rebounding, while 28th is as good as 23rd in effective field goal percentage. This would indicate that improving shooting would be the key emphasis over rebounding, if one had to choose.

**Side Note: **By placing a prior distribution on the counting stats, we would be able to better control the widths of these intervals.

Now that we have seen the univariate attack on these Four Factors, let’s do two things. First, revisit the **Dallas Mavericks** and second, identify next steps.

## Dallas’ Conundrum

For the Mavericks, we saw that their shooting numbers are up and their offensive rebounding numbers and turnover percentages are towards the bottom of the league. To recall, the offensive Four Factors are given by:

- Effective Field Goal Percentage: 0.527 (8th in the league)
- Turnover Percentage: 14.4 (28th in the league)
- Offensive Rebound Percentage: 22.6 (18th in the league)
- Free Throw Rate: 0.22 (7th in the league)

We saw above that the Mavericks are currently 8th in the league in eFG%. This places them within a tier of 5th through 11th, 11th being a steep drop off. For completeness, we include the TOV% and FTr plots:

We see that 28th in TOV% is a tough spot to climb out of: getting to 26th requires a change of over 3 percent. They are effectively between 30th and 26th when it comes to turning the ball over on offense. Being 18th in OREB% places them well between 22nd and 15th, indicating they can climb up, but have a better chance of slipping. Similarly, being 7th in FT rate is the entry point of the funnel in the EDF plot. This places them between 11th and 4th in the league in getting to the foul line; still fairly impressive, but teetering towards middle-of-the-pack.

So what’s the diagnosis?

**The Dallas Mavericks are performing well in scoring categories, maintaining a second-tier status in the league at roughly 8th in eFG% and FTr. Their non-shooting change-of-possession capabilities are near the bottom of the league, but are being masked by a couple of good performances. Despite rating 18th in OREB%, they are really rebounding like a 20th-place team once noise is accounted for. Their turnovers are effectively at the bottom of the league. Emphasis on protecting the basketball on offense and positioning for offensive rebounds will improve the team’s numbers.**

This sounds like “Well, duh.” But we’ve given a top-level quantification of the potential slip in Dallas at some point in the near future. This top-level breakdown of what percentiles and rankings are really telling us lets us dive into the right areas for more specific analysis.

So… what next?

## Next Steps…

In truth, the above analysis is only good at an introductory level. Realistically, a team cannot simply isolate rebounding and ignore shooting. Oftentimes, the offensive flow requires players to be out of “optimal” rebounding positions. To counteract this, we can look at the **interactions** of the statistics. And to do this, we need **Empirical Distribution Functions on Steroids**. Or as we call them… **copulas**.

This is an area of research I’ve focused on for a few years, and one very subtle thing about copulas is that they treat discrete distributions as continuous. One of my colleagues created a phenomenal attack by introducing right-censoring to overcome the discrete-to-continuous problem. He started writing a paper on this a while back, and if you notice, there may be a Justin influence on his work. Unfortunately, I departed for Orlando and was dropped down to a “thank you” despite all the hours of work put into crafting variance bounds and analyzing the NBA data set with the XPCA solutions; but nonetheless, it’s a great paper to follow for the next steps!
