In this post, we take a cursory look at streaming computations and apply them to a simple problem, **scheduling in the NBA**, in an effort to identify the teams with the most **compact schedules**; that is, the teams with the **least rest between Game 1 and Game 82**.

**Note:** This is a discussion about real data science problems. If you’re only here for the basketball analysis, skip to the next section.

A common (and sometimes forgotten!) problem in data science is the ability to **compute** statistics at all. Recall that a statistic is merely a **function of the data**. By computing statistics, we mean performing **computations** on data the computer is able to **store in memory**. Take, for instance, a 100 gigabit per second stream of data with the charge of calculating a mean.

If we can process only 1 gigabit per second of that stream, we lose **99% of our data**. If we’d like to have a response in **real time**, we’d be forced to perform **sublinear computations**.

There is no way we can calculate the mean of the full data. Even if we obtained the mean in one pass through some fairy-tale manner, we would still require **100 seconds** to process each second of data. If we need to process an entire hour’s worth of data, we are stuck waiting an **extra 356,400 seconds** for the computation to complete. That’s an extra **four days and three hours**. Sorry, MapReduce won’t help you here either (it’s slower).

A sublinear algorithm is akin to **random sampling**. In this case, we sample the data using some random number generator, selecting an adequate amount of data and computing the mean. The idea is straightforward; however, several checks need to be performed to ensure the sample is representative of the global mean. We can do that through the process of **data forking**: working on multiple streams, each requiring less data to perform its computation.

Now if our data stream is only 1 gigabit per second, we are able to compute the true mean in real time, but only for **order 1** calculations. This means we can only perform “one” action on the data as each data point comes in. Recall that the mean is computed by

**mean = (x_1 + x_2 + … + x_n) / n**

From the formula above, the time to compute the mean is actually **order n**. This means we require all **n** data points to compute the mean. From this viewpoint, we are in trouble. However, computation of the mean is probably the most well-known streaming computation out there. So there’s hope!

In the streaming paradigm, we change the method of computation to a **forked stream**. What this means is that we can construct a **window** of data by building **n** streams of data. Here, all we need to do is perform a series of three calculations, which has nothing to do with the size of the data stream. The only worry here is memory allocation, which we already assumed to be OK in our first sentence.

Let’s rewrite the mean calculation as a streaming update:

**new_sum = old_sum – x_oldest + x_newest**, with **mean = new_sum / n**

To perform this computation, we simply hold the sum and the number of data points in memory. We can start with an **initial fill** of all zeros. This means the first **n-1** means are **biased**. But this is alright, as we only care about the mean of **n** points at a given time.

As we fork the stream across **n** data points, when the **n+1** data point arrives, we subtract the first stream element, containing data point **x_1**, replace it with **x_(n+1)**, and reassign the stream order. By doing this, we perform **three (four) calculations on the data:** subtract the oldest data point, add the newest data point, and divide by the sample size. The (fourth) calculation occurs before the **initial fill washes out**: the update **n = n + 1** runs until the initial fill is removed.
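The update just described can be sketched in Python. This is a minimal illustration of the sliding-window mean; the class and variable names are ours, not from any particular library:

```python
from collections import deque

class RollingMean:
    """Streaming mean over a fixed window of n points.

    Starts from an initial fill of zeros, so the first n-1 means
    are biased toward zero, exactly as described above.
    """
    def __init__(self, n):
        self.n = n
        self.window = deque([0.0] * n, maxlen=n)  # initial fill of zeros
        self.total = 0.0

    def update(self, x):
        # O(1) work per point: subtract the oldest, add the newest, divide.
        oldest = self.window[0]
        self.window.append(x)      # deque drops the oldest automatically
        self.total += x - oldest
        return self.total / self.n

rm = RollingMean(3)
means = [rm.update(x) for x in [3.0, 6.0, 9.0, 12.0]]
```

Once the fill washes out (here, after three updates), each returned value is the true mean of the last three points.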

So let’s apply this methodology to a simple NBA problem: **scheduling.**

Opening day for the 2018 NBA season is on **October 17th**, and the season proceeds for the following **177 days**, concluding on **April 11th**. Over the course of these 177 days, a series of 1,230 games must be scheduled. The games are distributed to ensure that each of the 30 NBA teams plays 82 total games: 41 on the road and 41 at home.

Of the 82 games, each team must play each team in its own division **four times**, for 16 games. They also must play every team in the opposing conference **twice**, for 30 games. The remaining 36 games are distributed across the remaining 10 in-conference teams, resulting in six teams being played **four times** and the other four teams **three times**.

Furthermore, the NBA has a required set of nine days off:

- November 23rd: Thanksgiving
- December 24th: Christmas Eve
- February 16th – 21st: All-Star Break
- April 2nd: NCAA National Championship

All teams need to be mixed together, with no other days off for the league, in such a way that every team gets an **acceptable amount of rest**. The NBA does a phenomenal job of scheduling, despite the acceptable amount of rest never being well defined over the years.

For the 2018 season, we can walk through the entire 1,230 game schedule and construct a 30×177 binary matrix representing the entire schedule. Using the **imshow** command, we can view the matrix, where **yellow** represents a **game played** and **black** represents a **day off**.
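A minimal sketch of building such a matrix, assuming the parsed schedule has already been reduced to (team index, day index) pairs. The pairs below are a toy stand-in, not real games:

```python
import numpy as np

# 30 teams by 177 days; 1.0 marks a game played, 0.0 a day off.
n_teams, n_days = 30, 177
schedule = np.zeros((n_teams, n_days))

# Each real game contributes two entries, one per participating team.
toy_games = [(0, 0), (1, 0), (0, 2), (2, 2)]
for team, day in toy_games:
    schedule[team, day] = 1.0

# To visualize: plt.imshow(schedule, aspect='auto') shows games as bright
# cells, days off as dark cells, and league-wide days off as dark columns.
```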

We immediately see the days off in the league as black columns. By running across the rows, we can start to measure the **compactness** of the schedule. A naive measure would be to simply count the number of days off. However, every team plays 82 games in 177 days. This means every team gets 95 days “off.” Therefore, counting days off just doesn’t cut it.

Instead, we become interested in the **compactness** of the schedule. Compactness means that the schedule has **accumulation points**. In this sense, an accumulation point is a day with **many games within its neighborhood**. For instance, a team **might** have two games before a given day and two games after it; seen as a **4-in-5** set of games. That day is an accumulation point.

We can then start to map accumulation points to **high volume sections** of the schedule. That is, we hunt for **short bursts** of games where 66–75% (or more) of the days are played. We will also focus on intervals of up to **two weeks**. This allows us to **nest** bursts in the schedule.

Nesting is a situation where certain scenarios in the schedule **cannot happen** unless other scenarios occur. For instance, a **back-to-back** series of games is a high-volume set of games. However, a **3-in-4** burst of games **must require a back-to-back **to occur.

Similarly, a **4-in-6** burst of games **must require a back-to-back** and a **3-in-4** to occur. Let’s illustrate these using **binary vectors**. Suppose that we have a three-game set. Using a **one** to indicate a game is played and a **zero** to indicate a game is not played, all possible 2-in-3 bursts are **011, 101,** and **110**. This means two of the three possibilities result in back-to-back sets.

But for 3-in-4, there’s **always a back-to-back:**

- 0111 – Two Back-to-Back’s and a Three-In-A-Row!
- 1011 – Back-to-Back
- 1101 – Back-to-Back
- 1110 – Two Back-to-Back’s and a Three-In-A-Row!
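We can verify this enumeration by brute force. The helper below is our own illustrative code, not from the original analysis:

```python
from itertools import combinations

def bursts(k, n):
    """All length-n binary strings containing exactly k games."""
    out = []
    for ones in combinations(range(n), k):
        s = ['0'] * n
        for i in ones:
            s[i] = '1'
        out.append(''.join(s))
    return out

two_in_three = bursts(2, 3)     # the 011, 101, 110 patterns above
three_in_four = bursts(3, 4)    # the 0111, 1011, 1101, 1110 patterns above

# A back-to-back is simply two consecutive ones.
b2b_in_3in4 = [('11' in s) for s in three_in_four]
```

Every one of the four 3-in-4 patterns contains a back-to-back, matching the list above.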

Similarly, for 4-in-6, there’s **almost always a 3-in-4:**

- 001111 – Three Back-to-Back’s, Two 3-in-3, One 3-in-4, one 4-in-4, and one 4-in-5
- 010111 – Two Back-to-Back’s, one 3-in-3, two 3-in-4, and one 4-in-5
- 011011 – Two Back-to-Back’s, two 3-in-4, and one 4-in-5
- 011101 – Two Back-to-Back’s, a 3-in-3, three 3-in-4, and one 4-in-5
- 011110 – Three Back-to-Back’s, two 3-in-3, two 3-in-4, one 4-in-4, and two 4-in-5
- 100111 – Two Back-to-Back’s, one 3-in-3, one 3-in-4
- 101011 – One Back-to-Back, one 3-in-4
- 101101 – One Back-to-Back, one 3-in-4
- 101110 – Two Back-to-Back, one 3-in-3, two 3-in-4, and one 4-in-5
- 110011 – Two Back-to-Back’s… **no 3-in-4!**
- 110101 – One Back-to-Back, one 3-in-4
- 110110 – Two Back-to-Back’s, two 3-in-4, and a 4-in-5
- 111001 – Two Back-to-Back’s, one 3-in-3, and a 3-in-4
- 111010 – Two Back-to-Back’s, one 3-in-3, two 3-in-4, and a 4-in-5
- 111100 – Three Back-to-Back’s, two 3-in-3, one 3-in-4, one 4-in-4, and a 4-in-5

That’s 15 total options. There is one exception to the 3-in-4 rule: the **110011** burst. In this case, **there must be LOTS of rest days to avoid a 3-in-4**. In fact, for a 3-in-4 not to exist, the surrounding stretch must look like **0011001100**; that’s **four games in 10 days**. Ridiculous. There’s effectively always a 3-in-4.
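The same brute-force idea confirms the exception. Again, this is illustrative code of ours, not the original analysis:

```python
from itertools import combinations

def patterns(k, n):
    """All length-n binary strings with exactly k games."""
    out = []
    for ones in combinations(range(n), k):
        s = ['0'] * n
        for i in ones:
            s[i] = '1'
        out.append(''.join(s))
    return out

def has_3_in_4(s):
    # A 3-in-4 is any 4-day window containing at least 3 games.
    return any(s[i:i + 4].count('1') >= 3 for i in range(len(s) - 3))

four_in_six = patterns(4, 6)
no_3_in_4 = [s for s in four_in_six if not has_3_in_4(s)]
```

Of the 15 possible 4-in-6 patterns, only **110011** escapes without a 3-in-4 somewhere inside it.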

Starting with the 2018 season, the NBA announced that 4-in-5 stretches would be eliminated from the schedule, after a series of studies on the effects of travel, lack of sleep, and game density (compactness). This resulted in a **red alert** system for games, with a **Charlotte Hornets** game cited as one of the worst offenders: a 4-in-5 finishing with a back-to-back that required a Memphis-to-Charlotte transfer, resulting in “three hours of sleep.” The Hornets were dispatched with ease by the **Detroit Pistons**: 112-89.

So let’s analyze the 2018 NBA season!

To perform this analysis, we take the entire schedule and walk through every scenario of **compactness**. We look for all back-to-back games, all 3-in-4, 4-in-6, 5-in-7, and so on. We can do this by rolling through the schedule, or we can perform **streaming computations**!

To perform the streaming computation, we first construct a **window** of games. Here, we are interested only in 12-day windows. Therefore, we build an initial fill of all zeros and walk through the first 12 days of the season: **October 17th through October 28th**.

We then compute a fixed series of calculations: add the first two elements of the binary representation; if this is two, we have a **back-to-back**. Similarly, we add the first four elements; if this sum is at least three, we have a **3-in-4**. We can do this because there are no 3-in-3’s and 4-in-4’s. We perform this computation similarly for all other high volume bursts throughout the 12 days.
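This fixed series of checks might look like the following. The function is a hypothetical simplification of the CountGames helper used in the code below, and the sample window is made up:

```python
import numpy as np

def count_bursts(window):
    """Flag k-in-n bursts at the front of a 12-day binary window:
    k or more games in the first n days flags a k-in-n burst."""
    w = np.asarray(window)
    flags = {}
    for k, n in [(2, 2), (3, 4), (4, 6), (5, 7), (6, 9), (8, 12)]:
        flags['%d-in-%d' % (k, n)] = bool(w[:n].sum() >= k)
    return flags

# Toy window: games on days 1, 2, 4, 6, 9, and 11.
flags = count_bursts([1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0])
```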

Once we perform this computation, we add in the new day. Remember, in the streaming computation we just **pop** off the oldest day and add the new day at the end. Since the data is small scale, we can hold the windows in a 30×12 matrix. This can be performed in Python using the **numpy.roll** function and a wipe of the last column.

As we walk through the schedule, the window slides across the binary matrix of games and pushes the oldest games to the front of the matrix, where the fixed burst computations are performed. Therefore, we don’t have to hunt any games!

The only catch is the end of the season, where we can either place a **terminal fill** of zeros, or sweep across the final games. I chose the latter.

By taking in a schedule data file, we can walk through each line, strip the teams, and populate the games window matrix (**games**) and the season matrix (**gameMat**). We only populate **gameMat** in order to display the matrix image above.

```python
# Walk through the scores file. Each line will consist of:
# DATE: Winning Team Score Losing Team Score
for line in lines:
    # Strip to ensure no white space messes with the key structure of the team dictionary.
    winner = line[12:30].strip()
    loser = line[41:60].strip()
    day = line[8:10].strip()

    if day == currentDate:
        if gameNumber < 12:
            games[int(teams[winner])][int(gameNumber)] = 1.
            games[int(teams[loser])][int(gameNumber)] = 1.
            gameMat[int(teams[winner])][int(gameNumber)] = 1.
            gameMat[int(teams[loser])][int(gameNumber)] = 1.
        else:
            games[int(teams[winner])][11] = 1.
            games[int(teams[loser])][11] = 1.
            gameMat[int(teams[winner])][int(gameNumber)] = 1.
            gameMat[int(teams[loser])][int(gameNumber)] = 1.
    else:
        if line[11:15] == 'NONE':
            print('NO GAMES TODAY!', line[0:10])
            # New day shifts games. Let's get our counts before we run away!
            for i in range(30):
                specials = CountGames(games[i], i, specials)
            # Work on shifting games; as it is a new day!
            games = np.roll(games, -1, axis=1)
            # Must wipe out the last element of every row
            for i in range(30):
                games[int(i)][11] = 0.
            gameNumber += 1.
            continue

        # New Day!
        gameNumber += 1.
        currentDate = day
        if gameNumber < 12:
            games[int(teams[winner])][int(gameNumber)] = 1.
            games[int(teams[loser])][int(gameNumber)] = 1.
            gameMat[int(teams[winner])][int(gameNumber)] = 1.
            gameMat[int(teams[loser])][int(gameNumber)] = 1.
        else:
            # New day shifts games. Let's get our counts before we run away!
            for i in range(30):
                specials = CountGames(games[i], i, specials)
            # Work on shifting games; as it is a new day!
            games = np.roll(games, -1, axis=1)
            # Must wipe out the last element of every row
            for i in range(30):
                games[int(i)][11] = 0.
            # Now add the new guy!
            games[int(teams[winner])][11] = 1.
            games[int(teams[loser])][11] = 1.
            gameMat[int(teams[winner])][int(gameNumber)] = 1.
            gameMat[int(teams[loser])][int(gameNumber)] = 1.
```

What we see in this code block is that we apply the roll function and then wipe the new column to zeros in order to populate new data. **This is a cheat**, as the data is small. In a purely streaming computation, we would hold the running sums for each burst type, look up the teams playing on the new day, and add/subtract accordingly. Despite this, you can see how the streaming window operates.

Performing the above computations, we can then look at a **compactness measure**. In this case, we perform a simple calculation of counting **high volume days**, weighted by their volume: we add all **back-to-back** games to **75% of 3-in-4** games, **66.6% of 4-in-6** games, **71.4% of 5-in-7** games, and so on. The higher the number, the denser the schedule.
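One possible reading of this weighting as code: each burst count is weighted by its game density (games divided by days), per the percentages above. The toy counts are illustrative, not any team's actual totals:

```python
# Burst weights: games divided by days (back-to-backs at full weight,
# 3-in-4 at 75%, 4-in-6 at 66.6%, 5-in-7 at 71.4%, and so on).
WEIGHTS = {'2-in-2': 1.0, '3-in-4': 3 / 4, '4-in-6': 4 / 6,
           '5-in-7': 5 / 7, '6-in-9': 6 / 9, '8-in-12': 8 / 12}

def compactness(counts):
    """Weighted count of high-volume bursts for one team."""
    return sum(WEIGHTS[burst] * n for burst, n in counts.items())

# Toy season totals, roughly in the ballpark of the league averages.
score = compactness({'2-in-2': 14, '3-in-4': 19, '4-in-6': 20, '5-in-7': 1})
```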

In total there were **425 instances of Back-to-Back games**, with each team averaging **14** back-to-back games. The **Memphis Grizzlies** and **Utah Jazz** are the worst offenders with **16 Back-To-Back games**. Comparing this to the **Charlotte Hornets, Cleveland Cavaliers, Detroit Pistons, Golden State Warriors, Los Angeles Lakers, Miami Heat,** and **New Orleans Pelicans**, who all have **13 Back-To-Back** games, we find that the discrepancy isn’t too high. This is arguably dependent on travel time/distance between games.

Three-In-Four games are fairly common as well. In this case, we find that each NBA team averages **18.9** three-in-four games during the season. As there were a total of **567** of these instances, we found a much larger discrepancy among teams when compared to **Back-To-Back** games. For the 2018 season, the **Memphis Grizzlies** have **23 3-in-4 bursts**, while the **Cleveland Cavaliers** have **10**. The next lowest total? The **Denver Nuggets** have **15**.

Finally, the NBA season witnessed **zero 4-in-5 bursts!**

For 4-in-6 games, we see **611 instances** throughout the season. However, recall from the breakdown above, a sequence of **101101 **results in a pair of **3-in-4 bursts**. This is similar for **4-in-6 **situations when we expand the window. So don’t read this as 611 **unique** scenarios.

Despite this, teams witness **20.4 4-in-6 bursts** per season on average. The worst offender is the **Charlotte Hornets with 27**. The team with the least? You guessed it… the **Cleveland Cavaliers with 14**. The **Los Angeles Lakers** come in second lowest with **15**.

The NBA attempted to reduce the number of 5-in-7 games this season but was unable to eliminate them. Under this scenario, we find **35 total instances**, with the **Minnesota Timberwolves** and the **San Antonio Spurs** taking the brunt at three each.

There are eight teams that observe this burst **twice** during the season: **Atlanta, Boston, Chicago, Orlando, Philadelphia, Phoenix, Sacramento,** and **Toronto**. There are thirteen teams with one occurrence. However, there are **seven teams** that escape the wrath of 5-in-7 bursts: **Cleveland, Golden State, Indiana, Los Angeles Lakers, New Orleans, New York,** and **Memphis**.

Fortunately, there are no 6-in-8 games this season.

Six-In-Nine games start to become rare, as there are **87 instances** of these bursts, with each team averaging **2.9 6-in-9** bursts. Again, we find a large discrepancy across teams. The **Boston Celtics** appear in **7 games that are the tail end of a 6-in-9 burst**. **Philadelphia** (6), **Orlando** (5), and **Minnesota** (5) also take the brunt with a high number of these scenarios.

There are, however, two teams that avoid the 6-in-9 run. **What this means is they ALWAYS **get **FOUR days rest every 9 days**. These two fortunate teams? **Cleveland **and **New York. **

The final scenario we look at are 8-in-12 games. This is because there are **zero instances **of **7-in-9, 8-in-11, **and **9-in-12** bursts. In fact, there are **only three instances of 8-in-12 bursts**. The poor teams who suffer these bursts are the **Boston Celtics, Phoenix Suns, **and **Chicago Bulls**.

By computing the compactness score over this sliding window, we find that the **Philadelphia 76ers** (51.929),** Memphis Grizzlies** (51.917)**, **and **Orlando Magic** (51.595) suffer the most from the schedule. In fact, teams with compactness scores over 50 have dense schedules; with the least amount of rest. This includes the top three above as well as the **Boston Celtics **(50.512), **Phoenix Suns **(50.179), **Charlotte Hornets **(50.131), and the **Minnesota Timberwolves **(50.060).

There is a significant drop off after 50 (to 47). However, there is an exceptional outlier: **Cleveland **(29.833). This means that Cleveland’s schedule is roughly twice as spread out compared to **Philadelphia, Memphis, **and **Orlando**.

The entire table can be viewed here:

Looking at the distribution of scores, we find that Cleveland does indeed have a significant advantage in schedule. Denver finds itself in a similar situation, and this added rest along with their home court advantage may potentially be giving Denver a slight boost in the NBA standings.

Comparing to the previous season, we find that there were only 170 total days between opening day and the final day of the season. The 2018 season not only witnessed fewer 5-in-7 bursts and the elimination of 4-in-5 bursts, but also gained one extra week of play. Computing the binary matrix, we find there is also no **NCAA National Championship** day off.

We find that the distribution of compactness is **much higher** in 2017 than in 2018.

So much higher that Back-To-Back games totaled 486 (425 in 2018), 3-in-4 bursts totaled 714 (567 in 2018), and 4-in-6 bursts totaled 855 (611 in 2018). Similarly, there were **23 total 4-in-5 games in the 2017 season**. There were also dreaded **7-in-10 bursts** in the 2017 NBA season; another burst non-existent in the 2018 season. The **Los Angeles Clippers**, **Philadelphia 76ers**, and **Orlando Magic** all experienced these during the season. And as a note, each of these teams suffered significant injuries that season.

Continuing this path, the number of 8-in-12 games dropped from 22 (2017) to 3 (2018) while the number of 5-in-7 games dropped from 93 (2017) to 35 (2018).

For the 2017 season, we see almost every team above the 50 score. The exceptions are **Dallas** and **Brooklyn**. Every team, however, dropped in compactness, indicating that every team technically gets more rest this season than in the previous one. The lone near-exception is the **Minnesota Timberwolves**, who dropped only slightly, from 50.679 to 50.060. This may become a problem, as the Timberwolves are notorious for not resting their players.

While we have seen the NBA increase rest across the season, we still find kinks in the scheduling format, as some teams are generously given days off while others still find themselves in 5-in-7 ruts. So how would you improve scheduling the season? Note that we didn’t even factor in travel and time-zone changes. These compound the problems of scheduling and rest, making the job of an NBA scheduler an unenviable one.

One key assumption is that every defensive setting is in **equilibrium**. What this means is that, for a given offense, the defense is in the best position it can be in, physically. It may be an entirely wrong defensive set, communication may be poor, and the coach may lose his mind in the process; but it’s the defensive position the team takes. For better, for worse.

Similarly, the offense is viewed to be in equilibrium as well. This means that the offense is getting to the spacing that they are capable of given the circumstances of the possession. For better, for worse.

We make this assumption in an effort to extract out the gravity values. Under equilibrium, we are able to obtain a series of equations that yield **masses** for players that describe the gravitational pull of a player. The larger the mass, the more gravity that player draws **across all players**.

Once we have this assumption in hand, we can start to pick away at the interactions on the court.

Let’s start with a 0.04-second segment of time. In this case, a team steps into its initial set in the half-court and the defense is in position. Here, we note that the red team is on offense while the blue team is on defense. We landmark 12 locations on the court: 10 players, the basketball, and the basket.

In the most basic example, we can begin to dissect a portion of the court into **partial bodies**. In doing this, we obtain simple rigid body equations that help us break down the mathematics associated with the possession. For each defender, the basic building block combination is the **primary offensive player**, the **basketball**, and the **basket**. These are the simple **Man-You-Ball** and **Between-Your-Man-And-The-Basket** philosophies.

Using the partial body, we make a second assumption: **the basketball has constant mass**. What this means is that the basketball serves as a **reference point**. By making this assumption, we are able to compare masses across possessions. Under this conscious choice (as opposed to fixing the basket’s mass), we are able to see how the basket gains large amounts of gravity on drives to the basket, since a driving ball-handler will always have the basketball in hand. One could argue for making the basket the constant mass instead; that’s fine, but that’s not the choice here. Either way, without a suitable reference point, a solution of the system is not attainable, as you’ll see in a moment.

Once we extract out the partial body, we must focus on building a reference frame to perform the mathematics. As we opted to make the basketball the constant mass quantity in our system, we can build our **reference frame**.

A reference frame is the **coordinate system** for which we can perform math. Selecting a smart reference frame makes the mathematics easy. In this case, using constant mass for the basketball, we can make the **negative y-axis** serve as our reference direction. What this requires is a rotation matrix.
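For instance, a 2-D rotation aligning the defender-to-ball vector with the negative y-axis might look like this (the coordinates are hypothetical):

```python
import numpy as np

# Hypothetical court coordinates for a defender and the basketball.
defender = np.array([40.0, 30.0])
ball = np.array([44.0, 33.0])

v = ball - defender
# Rotate v onto the negative y-axis: undo its own angle, then turn -90 degrees.
theta = -np.arctan2(v[1], v[0]) - np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
v_rot = R @ v   # now points straight down the negative y-axis
```

The same rotation matrix R is then applied to every other landmark so all the angles live in one common frame.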

Now we see that the basketball lies along the negative y-axis. And since we are at equilibrium (assumption one) with the basketball having mass one (assumption two), we can solve the remainder of this partial bodies problem!

To solve the partial bodies problem we apply the equilibrium state and break up the vertical motion from the horizontal motion. Equilibrium merely suggests that the state is **not accelerating** in any direction. Therefore, the forces in the vertical direction **sum to zero** and the forces in the horizontal direction **sum to zero**. Before we begin, let’s drop down the reference axes and draw in the angles we will need very shortly.

For horizontal motion, we see that the offensive player is to the right (positive direction) while the basket is to the left (negative direction). For completeness, the basketball is neither right nor left, indicating **no horizontal influence on the system**. This is why we rotated the system.

The black line from the defender to the basket identifies the **force between the defender and the basket**. Call this **F_basket**. The red line from the defender to the offensive player identifies the **force between the defender and the offensive player**. Call this **F_player**. Finding the angle each of these two lines makes with the **x-axis**, we are able to quantify the right and left motion of the two bodies.

Using equilibrium, we obtain **0 = F_player * cos(A1) – F_basket * cos(A2)**. Since we are modeling the player interaction as **gravity**, we can write the forces as gravitational pulls. In this case, we get **F_player = m_defender * m_player / d_{def,play}^2**. Here, **m_player** is the mass of the offensive player; **m_defender** is the mass of the defender; and **d_{def,play}** is the distance between the defender and player.

As we have equilibrium, the mass of the defender cancels out of both terms and we are left with **one equation **and **two unknowns**.

For vertical displacement, we again apply equilibrium to obtain our equation. In this case, we assumed the mass of the basketball is one. This allows us to write equilibrium as **0 = F_player * sin(A1) – F_basket * sin(A2) – F_basketball. **The forces are the same as in the horizontal case. Except this time, **F_basketball** enters into the equation as **m_defender * m_basketball / d_{def, ball}^2**. Again, the mass of the defender cancels out. And from our assumption, **m_basketball = 1**. This is a constant. And we obtain the **second equation** with the same **two unknowns**.

In this case, we are able to solve for the partial body by **Gaussian elimination!**
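A sketch of that solve with hypothetical geometry; the angles and distances below are made up for illustration, and the two rows mirror the horizontal and vertical equilibrium conditions above:

```python
import numpy as np

# Hypothetical angles (with the x-axis) and defender-to-body distances.
A1, A2 = np.deg2rad(70.0), np.deg2rad(30.0)
d_play, d_bask, d_ball = 6.0, 15.0, 8.0

# Horizontal: 0 = m_player*cos(A1)/d_play^2 - m_basket*cos(A2)/d_bask^2
# Vertical:   0 = m_player*sin(A1)/d_play^2 - m_basket*sin(A2)/d_bask^2 - 1/d_ball^2
# (the defender's mass cancels; the basketball's mass is fixed at 1)
A_mat = np.array([[np.cos(A1) / d_play**2, -np.cos(A2) / d_bask**2],
                  [np.sin(A1) / d_play**2, -np.sin(A2) / d_bask**2]])
b_vec = np.array([0.0, 1.0 / d_ball**2])

m_player, m_basket = np.linalg.solve(A_mat, b_vec)
```

Two equations, two unknowns: elimination recovers the offensive player's and the basket's masses relative to the ball's unit mass.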

Expanding to the entire possession, if we simply look at the partial body problem in repeated fashion, we will not obtain full rank to extract the **eleven masses** we require! Similarly, the partial body solutions exist only in vacuums, meaning that introducing another player changes the equilibrium. We must therefore look at the **full body solution**, which for each defender yields **two equations** and **10 unknowns!** Not a good situation.

To regulate this, we apply a **jackknife technique**. The jackknife is a process that eliminates a data point from the system and computes the system without the data point. In fact, this is exactly the process we performed in the partial body system above!

This time, we must use multiple jackknife points. In this case, we only set the basketball as the reference point. Therefore we shall apply 11 jackknife iterations to obtain the entire system.

So let’s apply that to our single snapshot!

Apologies for swapping the colors of offensive and defensive players. However, as this movie plays out, we map each player on the court to the basketball. We then construct the reference frame, and then sniff out the rotation matrix. From here, we identify the horizontal and vertical components of each of the **eleven jackknifed bodies**. And we pause for a moment to realize that statement sounds more morbid than it should.

The resulting jackknife process yields a system of **twenty-four equations and eleven unknowns**. More importantly, this system of equations **has full rank!!!** This is important as we need to solve the **homogeneous system** we built. The only challenge now is that **we have a nonlinear homogeneous system**.

Note that homogeneous means that the sum is zero for all equations in the system. The nonlinearity comes from the **N-Body** problem we have constructed using the gravitational forces used above. In the above examples, we could easily solve the equation. In this case, we no longer have that luxury.

With the mechanics above, we can take a quick moment to look at the network we have just built.

The homogeneous system, in all its glory.

Here, the **green** network are the **constant **gravity values being obtained from the basketball. Recall that here, the basketball has mass one. The **cyan **network is the gravity associated with the basket. This is a residual effect from the **partial bodies problems**. Similarly, the **black **network is the **offense-defense interaction** network. This portion is also distilled from the partial body solutions. The **red** network and **blue **network come from the interaction **within each respective team**. This is the skewing that will exist due to the team interaction that the partial body solutions cannot capture.

While the network drawn is a pretty picture, we have to realize all the math performed is on the rotation spaces in the video above. This means the lines are all **weighted as gravitational forces** with **sine and cosine adjustments due to the reference frames**.

The solution to this system is not so straightforward. Newton’s method does not work, as the system is not square; in fact, gradient methods suffer greatly and will explode. Nelder-Mead can help but converges rather slowly, thanks to the 11-dimensional search space. In this case, we rely on **Powell’s Method** as our root solver.
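As a toy illustration of the approach (not the actual basketball system), we can square the residuals of a small nonlinear homogeneous-style system and hand the result to SciPy's Powell implementation:

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for the equilibrium system: residuals r(m) of a small
# nonlinear system in two unknown "masses" (hypothetical numbers).
def residuals(m):
    m1, m2 = m
    return np.array([
        m1 * m2 - 2.0,   # nonlinear coupling term
        m1 + m2 - 3.0,
    ])

# Squaring the residuals turns root finding into minimization,
# which the derivative-free Powell method can handle.
obj = lambda m: np.sum(residuals(m) ** 2)
res = minimize(obj, x0=[1.0, 1.0], method='Powell')
```

At a root of the system the squared objective reaches zero, so a near-zero final objective signals that Powell found a solution.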

**Hint: **We can turn this root finding problem into a convex optimization problem by merely taking a square of the equilibrium functions.

For this time snapshot, we obtain the masses (gravity values) as

Looking at the output above, the first five gravity values are for the offensive players. The second five gravity values are the defensive players. The last gravity value is the basket. Let’s look at the play again.

What this tells us is that on offense, **Player 3** and **Player 5** have the highest gravity in the system. **Player 3** is the elbow player on offense. We actually see the defender of the ball handler and the defender on the right wing **(Player 2)** sagging ferociously off their men, towards the basket and towards **Player 3**. Similarly, we find that **Player 5** is being guarded a little tighter than most other players.

We also see the basket obtain a substantial amount of gravity in this time snapshot. Looking at the play above, this is rather obvious, as almost **every defender is sagging closer towards the hoop than necessary**. The ball-handler’s defender is more than seven feet off his man. The defender of **Player 3** is a few feet back, **saddled in the lane**. The defender of **Player 2** (right wing) is the worst offender, almost playing in “No Man’s Land” as if to **pre-double-team Player 3**. **Player 4’s** defender is also straying towards the basket, but not as badly.

As we have shown how to extract out the gravity values (masses) of each player in the time snapshot, we populate the **tensor slice** that occupies the possession. We will see a sparse effect as this time snapshot only covers 0.04 seconds. Suppose this possession lasts 15 seconds, then we expect **374 other snapshots**. And since there are roughly 950 players (475 offense, 475 defense), we expect to only populate 10 positions.

By incorporating the result of the possession, we can begin to classify the possession as **effective** or **ineffective** when it comes to **defense** or **spacing**, depending on the goal in mind. In doing this, we will require some form of **generalized Singular Value Decomposition (SVD) **such as the **CP-Decomposition** or the **Tucker Decomposition**. (CP is preferred)

However, those steps get really messy; and you know… begins to build the actual edge.

Happy Holidays!


To give a simple overview, gravity is a measure that ranges from 0 to 100. It reflects relative distances between one defender and two specific players on the court: **the ball handler** and **the offensive player being guarded**. It is an attempt to quantify the **“Man-You-Ball”** defensive posture a player takes on the court. The tighter a defender guards an offensive player, the higher the gravity. The closer a defender moves towards the ball, the more he is distracted from guarding his man. The idea is very straightforward.

In this article, we serve up an introduction to gravity from an elementary point of view: **centers of mass**. We will use this article as a springboard into understanding the role of gravity in areas such as **hedging, switching, pre-switching,** and ultimately, **spacing**. Commonly, gravity is seen as a measurement of defensive players relative to shooters; however, it is much, much more than that.

Our goal here is not to replicate gravity, but rather give a relative definition of gravity that captures the original intention of gravity while giving us a tool set to start quantifying difficult features using spatial statistics.

During my NCAA playing days, we were taught the 1/3–1/2 rule: we guard one-third of the distance away from our man when he is one pass away, and one-half the distance away when he is two or more passes away. This idea (now nearly 20 years old, and taught even earlier) is simply **fixed gravity**.

In this case, we identify the two sources of gravity: the ball handler and the off-ball player a defender is guarding.

However, if we merely sit along this line, we run the risk of one of the simplest plays in the book: the **backdoor cut** towards the basket.

This means the defender has the potential to get burned almost every time, as they are not **protecting the basket** even when they are the lone player responsible for protecting it.

**Note:** There are situations where a defender may refuse to protect the basket, such as in BLUE defense, where the primary defender’s responsibility is to refuse penetration into the key; and a secondary defender is responsible for covering a driving lane to the basket, thus protecting the rim.

To reduce the probability of a backdoor cut, we focus on **sagging** towards the basket. In this instance, we have now defined a third gravitational pull. If we consider a simple constant model for this, let’s pick a naive number and suggest one-sixth the distance to the basket. What happens to the defender now?

First, the defender is no longer on the vector between the two offensive players. Instead, they are pulled in the direction of the basket, contained in the triangle formed by the two offensive players and the basket.

When the defender sags, he forms three new triangular regions, indicated in red. The upper triangle between his man and the ball-handler is the original **gravity-distraction** region, which we call the **contest zone**. The right-hand triangle between his man and the basket is the **recovery zone**. The left-hand triangle between the ball-handler and the basket is the **help zone**. The lengths of each of the three red spokes identify the distances to each of the three gravitational bodies.

By assuming the fixed gravity model, we merely need to find the position that gives us the one-third and one-sixth distances we require. This calculation is commonly known as an **n-body center of mass**. In the n-body center of mass, we effectively calculate
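The center-of-mass equation itself appears to have been dropped in extraction. The standard weighted-average form, using the positions and masses described in the surrounding text, would read:

```latex
\mathbf{x}_{\mathrm{defender}} \;=\; \frac{\sum_{i} m_i \, \mathbf{p}_i}{\sum_{i} m_i}
```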

Here, **p_i** are the positions of offensive player (or basket) i on the court. The values, **m_i**, are their associated **masses**, or weights. In our fixed gravity construct above, we have the one-third, one-sixth rule. Under this construction, we set the mass of the offensive player to **one** and see that the masses of the ball-handler and basket become **1/2** and **1/5**, respectively. This gives the explicit solution below.
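The explicit solution was shown as an image in the original post. A minimal sketch of the same calculation, with hypothetical court coordinates (the positions below are assumptions, not taken from any real play), might look like:

```python
# Fixed-gravity defender position as an n-body center of mass.
# Masses follow the text's construction: man = 1, ball-handler = 1/2, basket = 1/5.

def center_of_mass(positions, masses):
    """Weighted average of 2D court positions (feet)."""
    total = sum(masses)
    x = sum(m * p[0] for m, p in zip(masses, positions)) / total
    y = sum(m * p[1] for m, p in zip(masses, positions)) / total
    return (x, y)

# Hypothetical coordinates: offensive man on the wing, ball-handler
# at the top of the key, basket at the origin.
man, ball, basket = (20.0, 10.0), (0.0, 25.0), (0.0, 0.0)
defender = center_of_mass([man, ball, basket], [1.0, 0.5, 0.2])
```

With only the man and the ball in play, these masses place the defender exactly one-third of the way from his man towards the ball, matching the 1/3 rule; adding the basket's mass of 1/5 pulls him one-sixth towards the rim.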

In motion we can see the backdoor cut become guarded.

The description above is a classic denial defensive drill for wing defenders, where the ball-handler is a teammate throwing the ball into play. In reality, there are more players on the court and this begins to complicate even the fixed gravity model.

For instance, let’s place one more offensive player and two more defensive players on the court.

Here, nothing has drastically changed. We isolated each part of the defense and solved the center of mass problems with respect to the gravity rules. The sketch is nothing out of the ordinary.

If we extend to the Minnesota Timberwolves offensive set, we can see how this fixed gravity problem adjusts the defense.

We see that the defense looks a little compacted, and somewhat unrealistic. However, the basic fundamentals of the defensive scheme are preserved: one-third distance from man to ball, one-sixth distance from man to basket.

And now as a play goes into motion, we are able to capture the intricacies of the movement. Pay close attention to the screen. This becomes important.

As we can see in the optimal movement, the defender caught in the pin down must manage to go under the screen and effectively beat his man to the pick-and-roll screen. This requires first that **all screens must be avoided** and then that **all defenders are perfect at anticipation**.

In the introductory cases above, we assumed that **all offensive players have the same gravity**. We know this is not true. We know that **Stephen Curry** needs to be closely covered on the perimeter more often than **Tristan Thompson**. In this case, we are able to start changing the one-third, one-sixth rule from above to a **vector of gravity values**.

In this case, Curry can have stronger gravity than Thompson. In the case of Curry, we may see the one-third rule get tightened up to one-sixth. Similarly, we may see Tristan Thompson move from one-third to one-half.

If we are able to measure these masses/weights, we are then able to start making decisions about how to guard a possession. For instance, a wing screen with a low-gravity screener may give the defense the opportunity to either BLUE the screen, leaving a Tristan Thompson-type player as the perimeter shooter, or trap the ball handler off the screen.

Making such decisions on defense requires the **entire defense to react **(or at least two members). To start to understand these decisions, we need to look into the interaction of defensive players.

Up to this point, we focused on the most rigid form of gravity that was depicted as three-body action. That is, all players have the same gravity, that gravity is constant, and gravity depends only on the **Man-You-Ball** with **Between Your Man and the Basket** philosophies.

In reality, gravity of players is not a one-way street. Applying Newton’s third law of forces, we find that defensive players contribute equal and opposite gravity onto the offense. Let’s illustrate this:

If we look at the action in the video above, starting from the initial Timberwolves offensive set, we know that a pin down screen occurs to set up a pick-and-roll action. This means that two defensive players exhibit gravity!

This defender-driven gravity leads to the simplest defensive interaction on the court: **the switch**. Two players may switch in these exchanges. What this indicates is that the defender releases gravity on their primary man and picks up the new offensive player. In the physical setting, the defender on the switch “orbits” the first offensive player and then gets pulled into the “orbit” of the second offensive player. Therefore a **screen** is measured by the amount of third-law interaction.

Finally, if an offensive player has high gravity (let’s suppose this is **LeBron James** in the post), then other defenders may hedge in the direction of James in an attempt to reduce the probability of scoring on the possession.

In a more concrete example, let’s consider the BLUE defense. In this case, a defense may **blitz** the pick-and-roll. When the blitz occurs, a defender pulls across the lane in a two-nine attack to deter the ball-handler from attacking the rim along the baseline. This means we may end up with three defenders on two offensive players for a short period of time. More importantly, this suggests that the ball-handler has a high amount of gravity for this brief period of time.

The question is then, how does the gravity of a player change over time?

Now that we are able to model hedging, screening, and switching, we can start to ask how to compute the gravity for a player. Above, we assumed that the gravity of each player was **fixed**, **constant over time**, and **results in a center-of-mass calculation** for positioning defenders. In reality, gravity is fluid. That is, it changes as players move on the court.

A simple example would be: **How close is a center guarded when they are 85 feet from the basket?** In this case, this is simple transition. The reality is, the center is not guarded. Instead they are met down-court, if possible. But once that center comes into the half-court offense, their gravity increases. As they get closer to the rim, their gravity may increase even more.

Given this example, we need to identify a method for measuring gravity. Here’s one method: **N-body solutions**. Recall that the center-of-mass solution, given a set of defensive requirements (protect the rim, deny passes), realizes **gravity as masses**. The N-body problem is the equilibrium of gravity when all masses are put into play.

What this means is, applying Newton’s force laws, we should have **55 pairwise combinations of masses** among the 11 bodies: 5 defenders, 5 offensive players, and one basket. Since we are able to measure all 11 positions, **q**, on the court, we can write the N-body problem as
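The expression itself appears to have been lost in extraction. A plausible reconstruction, consistent with the masses **m**, positions **q**, and scaling factor **G** described in the surrounding text, is the total pairwise gravitational interaction over the 11 bodies:

```latex
\sum_{i=1}^{11} \sum_{j \neq i}
\frac{G \, m_i m_j \, (\mathbf{q}_j - \mathbf{q}_i)}
     {\lVert \mathbf{q}_j - \mathbf{q}_i \rVert^{3}}
```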

When this quantity equals zero, we have equilibrium. The values of **m** are the **masses** we are interested in. The value **G**? This is merely a scaling factor.

One method for estimating gravity in this situation is to break up possessions and compute the N-Body problem. Over the course of the possession, we can take the collections of masses for players and yield a response given by **points scored** during the possession.

Here, we are able to create a model that uses the estimated masses at each time stamp, with a class label of **points scored**.

Note that despite having 55 combinations of masses at each time step, we only obtain 11 masses for the gravity calculations. What this results in is an **N x 2P x Q x R x S x Z** tensor with the entry of the computed **mass**. Let’s break down this **tensor…**

**N:** The maximal number of time steps in a possession. This dimension will be sparse with structural zeros, as possessions tend to be 12-14 seconds in duration, or 300-350 time steps. However, if we take the **maximal** possession (say it is 40 seconds), this value is **1000 entries**.

**P:** The number of players in the league. We double this dimension to 2P, as there is an **offensive gravity** as well as a **defensive gravity** for each player. This means **2P** is typically near **950**.

**Q:** Finite baseline index. As the baseline is fifty feet long, binning the baseline into 2-foot segments gives **Q = 25**.

**R:** Finite sideline index. As the sideline is ninety-four feet long, binning the sideline into 2-foot segments gives **R = 47**.

**S: **Class of possession scored. Only two possible values.

**Z:** The number of possessions.

This results in a tensor that is **LARGE**, but also **SPARSE**. The non-zero elements are the estimated weights. Stacking these weights in this manner, we obtain a description of offenses and defenses relative to the locations on the court over the course of the possession.
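Because the tensor is large but sparse, only the non-zero masses need to be stored. A minimal dictionary-of-keys sketch, with purely illustrative indices and values (none taken from real tracking data), might look like:

```python
# Dictionary-of-keys sparse storage for the N x 2P x Q x R x S x Z gravity
# tensor described above. Absent keys are the structural zeros; only the
# estimated masses are materialized.

tensor = {}  # maps (n, p, q, r, s, z) -> estimated mass

def record_mass(n, p, q, r, s, z, mass):
    """Store one estimated gravity mass for a (time, player, court-bin,
    score-class, possession) coordinate."""
    tensor[(n, p, q, r, s, z)] = mass

# One hypothetical entry: time step 37, player slot 12, court bin (10, 23),
# scoring possession (s=1), possession index 0.
record_mass(37, 12, 10, 23, 1, 0, 0.41)
nonzeros = len(tensor)  # stays tiny relative to N * 2P * Q * R * S * Z
```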

More importantly, we have a feature space from which we can finally create a model. One such model may be a neural network or a support vector machine. Or, if we are interested in spacing, we can look at the **CANDECOMP-PARAFAC Decomposition**, which yields information on gravity values associated to particular locations on the court that lead to scoring opportunities; and, more importantly, identifies defenders that have weak gravity, indicating **over-confidence** in their defensive capabilities.

On the other hand a block, while erasing a field goal attempt from a team, does not necessarily terminate the team’s possession. Instead, blocks serve more as an intimidation factor; with the possibility of terminating a possession. Blocks may serve as a valuable tool in **controlling the paint** on defense, but it still gives the opposition hope in scoring a basket. For instance, a team with a dominant rim-protecting shot blocker in the paint may be nearly ineffective against a long-range team such as **Golden State** or **Houston**. And if that shot blocker aims to get more “Oohs” and “aahs” from the crowd by sending field goal attempts into the stands; all that happens is the offense gets a designed out-of-bounds play. A second life if you will.

In this installment, we take a look at the blocked shot and how to make blocked shots count in the league.

First, we take a look at the distribution of blocked shots for each team. Let’s start simple. First we consider the number of field goal attempts taken against each team and count the number of blocks obtained. This in turn gives us a **block percentage**.
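The block percentage described above is a simple ratio. A hedged sketch, with made-up season totals (the team counts below are illustrative assumptions, not real box-score numbers):

```python
# Block percentage: the share of opponent field goal attempts a team blocks.

def block_percentage(blocks, opp_fga):
    """Percentage of opponent FGA blocked by a team."""
    return 100.0 * blocks / opp_fga

# Hypothetical season-to-date counts for a strong shot-blocking team.
rate = block_percentage(330, 3489)  # lands near the high end of the league
```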

The first thing that jumps out is that effectively every team is at about 4-6% when it comes to blocking their opponents attempts. There are two primary exceptions: **San Antonio Spurs (7.00%) **and **Golden State Warriors (9.46%). **The Golden State Warriors rate is incredibly high as most teams tend to settle about 5% over the course of the season.

In fact the top five blocking teams, with respect to rate, are (in order) **Golden State (9.46), San Antonio (7.00), Milwaukee (6.76), Toronto (6.51), **and **Utah (6.39)**. It should be noted that blocks do not necessarily translate to wins as we see Memphis (6th) and Los Angeles Lakers (7th) creep into the top.

So why don’t blocks translate to wins? One simple answer is that blocks happen so rarely as it is (roughly one out of every 15-20 field goal attempts for almost every team) that the net difference in blocks between teams is typically 0-2 blocks a game. This translates to effectively 0-2 points a game in difference, as most teams score roughly 1.1 points per possession. Note that we don’t state 2.2 points a possession; this is because many blocks do not terminate possessions.

However, how are blocks distributed on the court?

As blocks are considered missed field goal attempts, we are able to identify the location of every block on the court, thanks to shot charts. Let’s consider the leader: **Golden State** against their rival **Houston Rockets**.

We see a healthy dose of blocks around the rim, but also an excessively healthy amount of blocks in the mid-range and along the perimeter. For example, we note that there are **25 three point attempts** blocked, along with another **20 in the mid-range**. Combined, this is 45 blocks outside of the paint, for an estimated erasing of 115 points. Compare this to Houston…

…and we get a completely different story. Don’t let the image fool you: there are four blocks beyond the perimeter; it just so happens two blocks were recorded in the exact same position. We find that Houston picks up another 13 blocks in the mid-range, which is roughly on par with Golden State. We see the force of **Clint Capela** in the post with his 51 blocks so far this season.

But what these comparisons show is that Houston is **more willing to let three point field goals go unblocked**. In fact, Golden State’s opponents have attempted 884 three point attempts, against 825 attempts for Houston’s opponents. What this shows is that teams are willing to shoot at roughly the same rate, and Houston is leaving these attempts less contested than their Warrior counterparts.

This difference may be due to one of two reasons:

The first is that a team may run the risk of fouling more often and therefore tread lightly on contesting three point attempts. Foul trouble, plus giving up 3+ points on a field goal attempt, is one of the worst policies a team can have on defense.

Second, Houston acknowledges that three point field goal percentage is roughly random. Every team defense gives up between 9-12 threes a game. While the difference is nine points, the 3FG% ranges between 34% and 40%. This means there’s maybe one lost attempt a game, and the other couple of attempts turn into two-point field goal attempts. Ultimately, why gamble on blocking threes when unblocked threes could lead to only a 1-2 point discrepancy overall? **Instead**, hard contests that force the player to put the ball on the floor possibly lead to a two-point FGA, and shave that 1-2 points over three possessions.

Whatever the reason, Houston is not taking away the points by blocking long range shots; while Golden State is. And this may become a difference maker should these two teams meet later on in the playoffs.

Let’s take a moment to look into the distribution of blocks across two-point and three-point FGA.

And this is where Golden State really starts to separate themselves; if they haven’t already. Notice that **one of every 8 two-point attempts against the Warriors results in a blocked shot**. Similarly, **one of every 35 three-point attempts results in a blocked shot**.

For two-point FGA, the **San Antonio Spurs come in second** with a relatively meager one block per 10 two-point attempts. For three-point FGA, the **Portland Trail Blazers** clock in second with one block for **every 84 attempts!**

So how are the Warriors doing this? Two primary reasons:

The first is the very same reason why San Antonio is second and Milwaukee creeps up the list: these teams are armed with Stretch Armstrong-like reaches and give shooters a false sense of security when they see their defender 6+ feet away. Due to their reach, the close-out requires less time, thus allowing the probability of a block to go up.

The second is one of the best attributes of the Warriors: their ability to guard almost all positions at all times. The main (but not only) exception is when **Stephen Curry** is forced to guard a **skilled post player**. However, players such as **Draymond Green**, **Kevin Durant**, **Andre Iguodala**, **Shaun Livingston**, **Klay Thompson**, and the emerging rookie **Jordan Bell** can guard every position on the court. This allows them to make screen-and-rolls ineffective, as they can merely switch, with pre-switch and hedge capability.

The question is, do they really make their blocks count?

In order to “make a block count,” the result should not only take away a potential FGM; but also eliminate the team’s possession. In this case, we are interested more in **what happens immediately after a block**.

So, first things first: we must find the block. Trawling through play-by-play logs, we can extract every block in every game. Directory-walking through the files, we take each game as a csv dump and extract our blocks, tossing down a **block token** called **blockHappened**. Clever name, I know!

Notice I split out every block into a dictionary that identifies the player, their team, and a sequence of six counters. The first counter is **two-point FGA blocked**. The second counter is **three-point FGA blocked**. The remainder? That’s what happens next…
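The original script isn't shown, so here is a hedged sketch of the directory walk and block tally described above. The column names (`EVENT`, `PLAYER`, `TEAM`) are assumptions, as the real play-by-play schema isn't given; the six counters mirror the dictionary in the text.

```python
import csv
import os
from collections import defaultdict

# Counters per (player, team): [2PA blocked, 3PA blocked, rebound kills,
# returned to offense, shot-clock violations, other]. Only the first two
# are filled in this sketch; the rest are populated in the follow-on step.
blockHappened = defaultdict(lambda: [0, 0, 0, 0, 0, 0])

def scan_game(path):
    """Tally blocked 2PA/3PA for each player from one game's csv dump."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            event = row["EVENT"].upper()
            if "BLOCK" in event:
                key = (row["PLAYER"], row["TEAM"])
                blockHappened[key][1 if "3PT" in event else 0] += 1

def scan_season(directory):
    """Directory-walk every game file, as described in the text."""
    for name in sorted(os.listdir(directory)):
        if name.endswith(".csv"):
            scan_game(os.path.join(directory, name))
```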

The next step is to identify the action after a block. Typically, a block results in a rebound as it is a missed FGA. However, this is not always the case. It also happens that the shot clock expires before a player is able to obtain the rebound. This actually happens **very rarely: **only **6 times **so far this season (out of 450 games).

Fun fact… some blocked shots result in jump balls. In this case, the possession terminates if the offense loses the jump. The result of the jump ball? It’s labeled a **team rebound**. These are lumped back into rebounds lost and won.

This means the third counter is the **number of possessions terminated due to rebound**. The fourth counter is the **number of blocks that return to the hands of the offense**. And the fifth counter is the **number of blocks that turn into shot clock violations**.

Finally, the last counter? **That’s the other bin**. With play-by-play data, we tend to get a handful of awkward recordings. No data is perfect, and watching blocks illustrates this mantra well. Out of the 4305 blocks, a total of **48 blocks resulted in random, nonsensical** responses. For instance, some blocks resulted in substitutions. Some blocks resulted in a secondary missed three point attempt by Trevor Ariza, or an Enes Kanter put-back (without a rebound recorded). Some blocks resulted in steals for Larry Nance Jr. Oddities. This chalks up to missing play-by-play actions. So in this case, we assume these 48 instances are **noise** and just deal with it.

As we walked through how to make blocks count, we now present the results for the **Golden State Warriors**.

Here we see the team’s breakdown of blocks. **Draymond Green** is the beast of the team, recording **seven blocks from beyond the arc**. Similarly, **David West** and **Kevin Durant** have picked up four blocks from the perimeter. These numbers, while seeming low, are staggeringly high; seeing as **multiple teams have only one block at the perimeter.**

However, do they make the blocks count? Notice that **half of Green’s blocks** return to the hands of the offense. **It’s actually one of the worst rates in the league** for **players with 20 or more blocks in the season**. Breaking this down, **there are only 10 players with worse kill rates than Draymond Green**. On the flip side, there are **58 players with better kill rates**. Note that there are four players tied with Green for kill rate. This places Draymond Green in the **15th percentile for making blocks count**.

What this indicates is that while Green is skilled at obtaining blocks, he is not making them count as much as he should. Compare this to **Jordan Bell**, who has a possession kill rate of **74.07% on blocks**, and we see that not only does Bell pick up blocks; **but he makes them count**.
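The kill rate implied above is the share of a player's blocks that terminate the opponent's possession (rebound kills plus shot-clock violations over total blocks). A sketch using the six-counter layout from earlier; the sample line is illustrative, chosen only to land near a Jordan Bell-like rate:

```python
# Possession kill rate from the six-counter block dictionary:
# [2PA blocked, 3PA blocked, rebound kills, returned, clock violations, other]

def kill_rate(counters):
    """Percent of a player's blocks that terminate the possession."""
    total_blocks = counters[0] + counters[1]
    kills = counters[2] + counters[4]
    return 100.0 * kills / total_blocks if total_blocks else 0.0

# Hypothetical line: 27 blocks, 20 of which killed the possession.
bell_like = kill_rate([22, 5, 19, 7, 1, 0])  # about 74.1%
```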

By taking a look at the kill rates for each team, we find that the **Chicago Bulls** actually have the best rate of converting blocks into changes of possession. While they are tops in rate, they do not get many blocks; therefore their kill rate is **ineffective**. Think of this as a free throw shooter who has a .875 FT% because they are 7-for-8. Sure, they have a great percentage, but they aren’t scoring points. The same phenomenon applies here.

Instead, we look for teams with high kill rates and high total blocks. These teams are now **making blocks look like steals** and getting the desired return: points-less possessions for opponents. So how do we determine such teams? We look at the blocks versus kills plot.

We can start drawing contours on this plot to start sectioning teams off. Ideally, rates stay high as the blocks go up, so teams in the upper right corner of this plot are ideal. Well… there’s no teams there.

So instead, we start cutting the region with diagonal lines. Doing this, we find that the **Golden State Warriors** are still the top team, **even when adjusting for possession counts** (note: block totals are impacted by possessions). After this, the **San Antonio Spurs are a close second**.

The third best team? **Miami**, who is in close contention with **Washington**.

Using the kill rates, we start to see the impact of blocks on the game and start to understand the player’s impact on the game. So while a player may accumulate many blocks, they may not be getting the same return as a steal; which, in the end, is what defenses are after: terminating possessions.

Below, feel free to scroll through all players in the league; distinguished by team.

Let’s start with a simple exercise with the **Washington Wizards**. Through 30 November, the Wizards have played in **2013 possessions** over the course of 21 games. The 2013 possessions yielded **2318 chances** for the Wizards.

If we were to statistically calculate the number of chances for the Wizards, we would expect

**Number of Chances = FGA + 0.44*FTA + TOV**

The value of 0.44 is an archaic one that no longer necessarily holds true; however, NBA stats will yield the Wizards’ number of chances as **1793 + 0.44\*519 + 303 = 2324.36**, which is not far off from the truth. Now, if we consider **John Wall**, we find that Wall is estimated to have completed **255 + 0.44\*112 + 49 = 353.28 chances** while participating in an estimated **977 + 0.44\*298 + 162 = 1270.12 team chances**. This comes out to a **27.815% Usage Rate**. NBA stats says:

As we see, we have correctly picked up John Wall’s usage rate. The only challenging task in the exercise above is counting the team statistics when Wall was in the game. This was performed manually using a Python script. If this exists on the web, please feel free to link in the comments! (**Note: **I assumed Basketball-Reference would have such a tool. I was unable to find it.)
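The arithmetic above can be checked with a short function built directly from the estimated-chance formula in the text (the totals are the John Wall numbers quoted above):

```python
# chances = FGA + 0.44 * FTA + TOV, per the formula in the text.

def chances(fga, fta, tov):
    return fga + 0.44 * fta + tov

def usage_rate(player_chances, team_chances_on_court):
    """Player chances as a percentage of team chances while on the court."""
    return 100.0 * player_chances / team_chances_on_court

wall = chances(255, 112, 49)          # 353.28
team = chances(977, 298, 162)         # 1270.12
wall_usage = usage_rate(wall, team)   # about 27.815
```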

If we count every possession and chance, we can see how well NBA stats estimated Wall’s usage. As we saw with the Wizards’ totals, since the number of free throws are small and the actual rate of free throws terminating possessions is **0.427, **the estimated number of possessions is only off by 6 possessions.

For John Wall, we find that he actually obtained **354 actual chances** out of **1267 actual chances** when he was on the court. This results in a **27.94% Usage Rate**. Not too shabby for the possession estimation process. Provided the number of free throws stay down, estimation will never pose a problem. However, this is John Wall we are talking about; and he can get to the rim and finish on fouls.

Once we obtain the entire team’s usage stats, we can start looking at prioritization within teams. For instance, if we consider the starting line-up of **Bradley Beal**, **John Wall**, **Marcin Gortat**, **Otto Porter Jr.**, and **Markieff Morris**, we find that their combined usage is **28.7349 + 27.9400 + 18.7114 + 15.0822 + 21.0602 = 111.5287%**. Well, that certainly is not feasible.

In this case, we can look into assuming a uniform distribution on usage to help us estimate usages for rotations. What this means is, if a player maintains a 25% usage rate overall, they apply the equivalent usage rate when aggregated with other players. Illustratively, this means **John Wall** moves from **27.94%** to **27.94 / 111.5287** = **25.05%**. This indicates that a quarter of the rotation’s chances are expected to go through Wall’s hands. Unfortunately, this does not translate well… thanks **Chris McCullough**.

What we want to do here is **build a model** that predicts usage for a player. In this sense, we can construct a **13-variable indicator feature vector** that identifies the rotation. For example, the starting unit would be labeled

**(1,1,1,0,1,0,1,0,0,0,0,0,0).**

The labeling comes from the order in the table above. We can then place a **multinomial response** variable on each chance recorded. This is effectively a **categorical value** that identifies the player who took that particular chance.

In total, there are **1,287 possible rotation combinations** for the Wizards to employ from their 13 roster players. Over the course of their first 20 games, only **124 rotations** have been put onto the court. Seeing less than 10% of the combinations, we are left with an excessively difficult task of building a regression. In fact, regression is not the way to go, as inference over the players effectively requires inference over the 1,287 possible rotations, most of which we have not measured.
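The indicator encoding and the rotation count above can be sketched in a few lines. The roster indices are placeholders; the real ordering comes from the usage table referenced in the text:

```python
from math import comb

def rotation_vector(on_court, roster_size=13):
    """13-variable indicator vector: 1 for each roster index on the court."""
    return tuple(1 if i in on_court else 0 for i in range(roster_size))

# Indices chosen to match the example starting-unit vector in the text.
starters = {0, 1, 2, 4, 6}
vec = rotation_vector(starters)      # (1,1,1,0,1,0,1,0,0,0,0,0,0)
possible_rotations = comb(13, 5)     # 1287 five-man combinations
```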

In light of this, we can apply a popular technique such as **Regularized Least Squares** (which is also called **Ridge Regression**) if we believe our data is Gaussian (it’s not). Or we can look into **Bayesian modeling**. In doing this, I could build a massive-scale post; but at approximately 800 words in, I have bigger fish to fry.

What we want to do now is compute **Team Usage**. This percentage accurately reflects the usage of a player **over the course of the season** as opposed to when they are specifically on the court. In this case, we can understand the role of a player with respect to the overall team’s performance.

In this case, we find that a player like **Markieff Morris** truly has a usage rate of **6.3417** instead of 21.0602. Why does this happen? Morris has only played in 14 of Washington’s 21 games. What’s more important here is that we can now start to look at how usage is distributed across the team.

For instance, **Morris** was inactive/suspended for the first 7 games of the season. Similarly, **Wall** was out for 5 games. Due to this, their Team Usage rates are low. Despite this, their usage rates are still relatively high with respect to when they are on the court. So let’s do the math:

**Morris: **14 / 21 games played, 6.3417 Team Usage implies **9.52% Implied Team Usage**.

**Wall: **16/21 games played, 15.2718 Team Usage implies **20.04% Implied Team Usage**.

This indicates that Wall is a priority player when it comes to chances. In fact, he is almost identical to **Bradley Beal**. Whereas, Markieff Morris is not the priority option. If we take this one step further, we can look into **implied usage**.

Implied usage corrects implied team usage to account for minutes played. By adjusting for minutes played, we should obtain numbers closer to usage. Let’s do that math on this one:

**Morris: **22.6 MPG, 9.52% implied team usage implies **20.22% Implied Usage**.

**Wall: **34.4 MPG, 20.04% implied team usage implies **27.96% Implied Usage**.
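The two-step correction above can be written as a pair of scalings: team usage up by the share of games missed, then implied team usage up by minutes relative to a 48-minute game. Small differences from the quoted figures are rounding:

```python
# Implied usage chain: team usage -> implied team usage -> implied usage.

def implied_team_usage(team_usage, games_played, team_games):
    """Scale team usage up to account for games missed."""
    return team_usage * team_games / games_played

def implied_usage(itu, mpg, game_minutes=48.0):
    """Scale implied team usage up to account for minutes played."""
    return itu * game_minutes / mpg

morris_itu = implied_team_usage(6.3417, 14, 21)   # about 9.51
morris_iu = implied_usage(morris_itu, 22.6)       # about 20.2
wall_itu = implied_team_usage(15.2718, 16, 21)    # about 20.04
wall_iu = implied_usage(wall_itu, 34.4)           # about 27.96
```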

If we go back and look at the usage rates, we find that these implied values are on par with the player usage values. Where we do see a discrepancy (**Morris is off by about 1 percent**), it is merely due to rounding, small numbers, and counting processes.

The reason we focus on team usage is to see the impact of a player over the course of the season. That, and we obtain a stochastic relationship to a player’s efficiency.

A player’s **efficiency** is a simple score that counts the number of positive actions with the basketball and demerits for negative actions on offense. The formula is given as

**{PTS + REB + AST + STL + BLK – (FGA – FGM) – (FTA – FTM) – TOV} /GP**

This formula is divided by games played, but instead let’s look at **cumulative efficiency**. This quantity shows how much a player accumulates over the course of the season. Let’s take a look at John Wall:

**325 + 54 + 147 + 17 + 18 – (255 – 111) – (112 – 84) – 49 = 340 **

**Note: **Basketball Reference is off by 1 FGA (they have 256) and 1 TOV (they have 48) in case you leverage Justin’s site.
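The cumulative efficiency arithmetic above can be checked with a direct transcription of the formula (totals are the John Wall numbers quoted in the text):

```python
# PTS + REB + AST + STL + BLK - (FGA - FGM) - (FTA - FTM) - TOV

def cumulative_efficiency(pts, reb, ast, stl, blk, fga, fgm, fta, ftm, tov):
    return pts + reb + ast + stl + blk - (fga - fgm) - (fta - ftm) - tov

wall_eff = cumulative_efficiency(325, 54, 147, 17, 18, 255, 111, 112, 84, 49)
per_game = wall_eff / 16  # over his 16 games played
```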

If we divide out by the 16 games played, Wall has an efficiency of 21.25, which is respectable. However, Wall has not played in all of his team’s games. Therefore, we should look at cumulative efficiency. And since we are doing that, we need to look at Team Usage, instead of usage, in conjunction with cumulative efficiency.

**Note:** Efficiency is a **box score statistic**. As such, it suffers from a lack of pace and possession incorporation. Due to this, teams with a higher tempo may potentially yield higher efficiencies. This is a primary reason that John Hollinger developed the **Player Efficiency Rating**.

Let’s take a step back for a moment and apply this methodology to the previous NBA season. If we plot the Team Usage against the Cumulative Efficiency, we get a picture of familiar things.

Ideally, players would be as far north as possible on this graph. These players have high efficiencies, which means these players **score lots of points** or **get lots of rebounds** or even **steal the ball or assist a ton**.

Similarly, if a player is efficient in **scoring**, then a team wants that player’s usage to skyrocket to the right of this graph. It is alright if a player obtains rebounds for the bulk of their efficiency; however, they may not be scoring much if their usage is low.

From the display, we see that **James Harden** and **Russell Westbrook** are indeed the top players in the league when it comes to scoring, rebounding, assisting, and stealing. On top of that, they are used in heavy rotation, **both accounting for over 25% of their team’s chances on offense**. We see two high scoring bigs creep up into the right corner as well with **Karl-Anthony Towns** and **Anthony Davis**.

It is very difficult to have a high efficiency due to scoring but a low usage rate. Consider a player who scores 20 points a game. Typically 20 points a game comes off of roughly 8 field goals and 4 free throws. For the sake of argument, assume that the player is low on all other totals and takes no free throws. Then this player has an efficiency starting at 20 just from the points alone. If this player shoots less than 100% from the field, then the efficiency drops below 20. Therefore, a perfect shooter with 20 points per game requires 10 field goal attempts. As a team typically takes around 85 field goals a game, this player’s usage is already treading around 12%. And this requires perfection.

If the player drifts away from perfect field goal shooting, their usage goes up. However, their efficiency goes down. We start to sway towards volume shooters.

What this starts to show is how players cluster. If we take a look at the display again, we labeled **DeAndre Jordan** and **Rudy Gobert**. These bigs were rough and tumble rebounders, shot blockers, and had high field goal percentages. However, they were low points double-double machines. **Hassan Whiteside** is also creeping in that area. This region is the superstar centers.

The small region bowing out at roughly 20% usage but with low efficiencies contains players who shoot the ball frequently, have lower field goal percentages, but more importantly, **are prone to turnovers**. The two worst culprits are **Devin Booker** and **Andrew Wiggins**; who make up those two markers. There is a third dot that has slightly more usage than Booker and Wiggins, but has high efficiency. This player looks to be part of the turnover machine crowd, but he is borderline. This player is **DeMar DeRozan**.

If we take a look at the current season, we see a similar trend.

As we are a quarter of the way through the season, we find ourselves with roughly 700 as the maximum for cumulative efficiency; on pace for roughly 2800, which is near where we topped out in 2017.

We again see familiar faces: **James Harden, LeBron James, Giannis Antetokounmpo, **and **Anthony Davis**. We have lost **Russell Westbrook** and **Karl-Anthony Towns** so far this season. What is impressive about this is LeBron James’ push to the top of this display. Once again, in his 15th year, here’s another analytic that showcases an MVP argument for James.

So let’s start identifying the **average line** for players when it comes to efficiency over usage. By finding this line, we can start figuring out players who are detrimental when usage increases and players who are producing an edge given their usage rates. What this line will not identify is whose usage should be increased; as rebounds are not considered a part of chances.

If we apply the naive tactic of fitting a regression line, we do terribly.

First off, we see that the regression line is tilted. This isn’t really a problem as that may be what the data is actually suggesting. In fact, the **R-Squared is 92.44%**. Despite this, are we really sure that the line is that good of a fit?

The particular reason that we could see a regression line with a high R-squared that still looks tilted is a phenomenon called **leverage**. Leverage is the potential for particular data points to **tilt** the regression line, driven by explanatory values that sit far from the bulk of the data (think outliers in usage).

If you’re familiar with linear regression, then you know of the **hat matrix**. The hat matrix is the quantity **H = X(X’X)^(-1)X’. **It’s purely based on the explanatory variables. In this case: usage.

Since there are 459 players in the league, the hat matrix will be a 459 by 459 matrix. The diagonal elements of the hat matrix, **h_ii**, indicate the amount of leverage for each data point.
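As a quick sketch of how these leverages fall out of the hat matrix (with synthetic usage values standing in for the 459 real players), a few lines of NumPy suffice:

```python
import numpy as np

# Leverage from the hat matrix H = X (X'X)^{-1} X'. The diagonal entries
# h_ii measure how far each usage value sits from the bulk of the data.
rng = np.random.default_rng(0)
usage = np.concatenate([rng.uniform(0.05, 0.20, 50), [0.34]])  # one extreme value

X = np.column_stack([np.ones_like(usage), usage])  # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

print(h.argmax())  # index of the planted extreme usage value
```

Note that the leverages always sum to the number of parameters (here two: intercept and slope), so a handful of large h_ii values necessarily steal weight from everyone else.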

From the leverage plot alone, we cannot ascertain whether any players are truly influential. Instead, we focus on a statistical test such as **Cook’s D-Distance**.

Cook’s D-Distance is a metric that attempts to capture the amount of influence leveraged by each data point in a linear regression. To calculate for a data point **i**, we compute the linear regression with the **i-**th data point removed from the model. Performing this for every data point, we obtain **N** models (459 for the 459 players in this case). We compute the square error for each data point estimated by each model and divide by the total error. The formula is given by
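In its standard form, with the notation here,

```latex
D_i = \frac{\sum_{j=1}^{N} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{p \, s^2}
```

where \(\hat{y}_{j(i)}\) is the fitted value for player \(j\) with player \(i\) removed, \(p\) is the number of regression parameters (two here: intercept and usage slope), and \(s^2\) is the mean squared error of the full model.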

Applying this, we find Cook’s D-Distance for every point.
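In practice, we do not need to refit 459 regressions; Cook's distance has an equivalent computational shortcut using only the residuals and leverages of the full fit. A minimal sketch on synthetic data (the planted outlier is hypothetical, not a real player):

```python
import numpy as np

# Cook's D via the standard computational shortcut
# D_i = (e_i^2 / (p * s^2)) * (h_ii / (1 - h_ii)^2),
# algebraically equivalent to refitting with each point removed.
rng = np.random.default_rng(1)
usage = rng.uniform(0.05, 0.30, 60)
eff = 2000 * usage + rng.normal(0, 20, 60)
eff[0] += 400                                  # plant one influential outlier

X = np.column_stack([np.ones_like(usage), usage])
beta, *_ = np.linalg.lstsq(X, eff, rcond=None)
resid = eff - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverages

p = X.shape[1]                                 # parameters: intercept + slope
s2 = resid @ resid / (len(eff) - p)            # mean squared error
cooks = (resid**2 / (p * s2)) * (h / (1 - h)**2)

threshold = 4 / len(eff)                       # the rule of thumb from the text
print(cooks.argmax(), cooks.max() > threshold)
```

The planted outlier is the only point whose Cook's D clears the 4/n reference line.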

The general rule of thumb is that anything over **4/n** = **0.0087 **is an influential point. The dotted reference line identifies this rule. Here, we find two very particular values that are causing problems. These two players?

**LeBron James** and **Anthony Davis**.

Therefore, these two players are going to be tilting the line. Despite this, the variance plays a crucial role here too.

**Heteroscedasticity** is when the variance is non-constant for a model. If the variance is constant, then we are said to have **homoscedastic** errors. For the homoscedastic case, we would see an equal width band of data running along the regression line. If not, we have heteroscedasticity.

If we take a look at the display again, we see as the usage goes up, the band of data about the regression line gets larger. We see quickly that we have heteroscedasticity. Despite having this, we simply don’t say that our regression is terrible. We need to understand the errors in our model. To do this, we look at a **residual plot**.

Here, we see that the residuals (errors associated with the fitted regression line) explode as we get larger usages (and inherently larger fitted efficiencies). If we take a look at the histogram, we have that the errors actually look Gaussian!

They are centered just below zero (thanks LeBron and Anthony) but have a fairly symmetric shape. Here’s the reason we have a strong fitting line. So what does this really tell us?

**While the regression line fits seemingly exceptionally well, and we are able to identify players in a sortable manner such that high efficiency players that have high usage are identifiable, we are unable to compare players within the cluster due to LeBron James and Anthony Davis, along with a variance that gets large as usage gets large. **

So let’s try a **nonparametric attack.**

**Local Linear Regression** is a nonparametric regression method that abandons the Gaussian assumption and performs a nearest-neighbor weighting using a smoothing kernel. For **LOESS** regression, the tri-cube function is the smoothing kernel.

Let’s see how this works:

Due to the nearest-neighbor technique of local linear regression, we are able to better approximate the mean values of cumulative efficiency amongst the players without LeBron James or Anthony Davis pulling the line. Given this, we find a counter-intuitive thing: **there are far more players below the red line than above the red line**. If you read the above plots this way, congratulations! If not (and Twitter was 10-for-10 in thinking the red line was **too low** in the upper right), you are not alone.

However, the purple line (LOESS regression) manages to capture this and avoid the overfitting given to us by James and Davis. Now, the challenge is identifying the right neighborhood to smooth over. If we choose too small, we overfit:

The way we find the optimal fit is to use **cross-validation**. By performing a cross-validation, we obtain the following fit.
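For the curious, the machinery is compact enough to sketch directly. This is a bare-bones local linear fit with the tri-cube kernel on synthetic data, not the exact fitting routine used for the figures; the span parameter is the neighborhood fraction that cross-validation would tune.

```python
import numpy as np

def tricube(u):
    # Tri-cube smoothing kernel: (1 - |u|^3)^3 on [-1, 1], zero outside.
    u = np.clip(np.abs(u), 0, 1)
    return (1 - u**3) ** 3

def loess_at(x0, x, y, span=0.5):
    # Fit a weighted straight line using the nearest span-fraction of points.
    k = max(2, int(span * len(x)))
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]
    w = tricube(d[idx] / d[idx].max())          # weights shrink with distance
    X = np.column_stack([np.ones(k), x[idx]])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[idx])
    return beta[0] + beta[1] * x0

# Noisy sine wave as a stand-in for a curved efficiency-usage relationship.
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 200)

fit = loess_at(0.25, x, y, span=0.2)  # near the true peak sin(pi/2) = 1
```

Too small a span chases the noise; too large a span flattens the curve back towards the straight line we just rejected, which is exactly the trade-off cross-validation arbitrates.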

Now we are able to begin discerning between players with respect to their usage and efficiency. Similarly, we can see how the expected value moves, thanks to LOESS regression, and start to find when players’ efficiencies drastically change as their usage either increases or decreases. This gives us extra insight, analytically, into possible fatigue or learning strategy within the game.


With access to play-by-play data, we can focus on properly computing the number of possessions. For a refresher, feel free to review the definition of a possession and how they are grossly overestimated by NBA possession models. Since possessions are largely overestimated, the announced offensive and defensive ratings are **lower** than their true values, introducing bias into the presented results, and therefore making it undesirable to compare teams (or players, if ratings are used at the player level) between years.

Once we correct for the actual number of possessions, we can then start looking at the nuances of offensive and defensive ratings. For instance, can a team lose a game despite having a higher offensive rating? The answer is **yes**. Similarly, if a team has a higher offensive rating than another team, is the offense truly better? The answer is **no**. In this article, we walk through offensive and defensive rating, look at a distributional representation to show how we can compare teams, and present illustrations to actually compare teams. We will focus on play-by-play data spanning over the 2018 NBA season up through November 24th.

We select the **Boston Celtics** as our case study as they currently hold the “best defensive rating” of 96.8 according to NBA stats. The question is if they, in fact, really hold opponents to 96.8 points per 100 possessions. First, let’s apply NBA’s estimation process to each game and compare it to truth.

Note that some games (against the **New York Knicks**, **Golden State Warriors**, and **Miami Heat**) have more than **five possessions of discrepancy** between them. It is physically impossible to obtain this difference in possessions according to the definition of a possession from the NBA. While this is a problem with estimating possessions, the bigger problem is that most possessions are overestimated. Of the forty possible possession counts, **only one was underestimated:** 93.88 possessions against 94 actual against the Knicks. All others? **Consistently 4-6 possessions overestimated**.

Plotting all the possessions from the 20 Celtics games, we can easily see the overestimation from the NBA possession model. This, in turn, translates into drastic differences between Offensive and Defensive Ratings between teams.

Comparing the distribution of Offensive Ratings and Defensive Ratings, we see that the overestimation in possessions dominoes into **underestimating ratings**. This is shown by the shifts to the right when comparing estimated ratings (solid lines) to the actual ratings (dashed lines).

What this really means is that the NBA reported 103.7 offensive rating and 96.8 defensive rating are actually **110.4 **and **101.2**, respectively.
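The correction itself is nothing more than a recount. A small sketch with illustrative numbers (not the Celtics' actual point totals) shows how five phantom possessions deflate a rating:

```python
def rating(points, possessions):
    # Offensive (or defensive) rating: points per 100 possessions.
    return 100 * points / possessions

# Illustrative game: the NBA estimator credits 5 extra possessions,
# so the same point total produces a deflated rating.
points, actual_poss, est_poss = 105, 95, 100
print(rating(points, est_poss))     # 105.0 (reported)
print(rating(points, actual_poss))  # ~110.53 (actual)
```

Since the point totals are exact and only the denominator is estimated, every overcounted possession pushes the reported rating below the true one.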

What this also means is that Boston Celtics opponents score more than **one point per possession**. In fact, **Every NBA team gets more than one point per possession** when we actually count possessions. Furthermore, the Celtics outscore opponents by 9.2 points per 100 possessions rather than 7.0 points per 100 possessions.

Now, if we use Offensive Rating and Defensive Rating to rate teams, what does this correction mean for the rankings?

Before we start correcting team rankings, we first take a look at the **estimated rankings**. As of this morning, NBA stats states that the rankings using offensive ratings are given by:

However, using actual possession counts, we find the list is somewhat out of order:

Thanks to an outrageously high scoring effort last night from the **Golden State Warriors** with a 143 – 94 win over the Chicago Bulls, the Warriors moved from second to first in the actual list. In fact, the **New York Knicks** are not 10th in scoring per 100 possessions, as reported by NBA stats, but rather 5th in the league.

Now, if we compared the Warriors and Rockets offense, based on Offensive Ratings alone, we would state that the Warriors are the better of the two teams. However, this required the Warriors to drop a large amount of points on one of the worst defensive teams in the league. If we removed the Bulls game, the Warriors would have an offensive rating of **116.84**, which is **less than the Houston Rockets’ 117.71**.

What this suggests is that simply using ratings to rank teams is not robust. In fact, we obtain situations where teams have winning records, such as the **Minnesota Timberwolves** (11-8), but get outscored by opponents 112.62 to 113.21. This particularly happens due to averages being non-robust estimators. The Timberwolves have participated in games with wildly varying ratings (blowouts); for better and worse.

However, despite blow-outs messing up averages and the resulting ordering of teams, we also find that at the single game level, having a higher offensive rating than a defensive rating does not necessarily mean the offense wins the game.

Of the first 276 games of the NBA season, there have been **nine games **where a team has a positive net rating **but lost the game**. It happened in both opening night games.

On opening night, the Cleveland Cavaliers obtained 98 offensive possessions to Boston’s 95 possessions. If both teams had exactly 100 points per 100 possessions, or one point per possession, then the final score should be 98-95, Cleveland. In order for Boston to catch Cleveland, they must score at least 3 extra points over their 95 possessions. This suggests they could score at a rate of up to 100 × (98 points / 95 possessions) = **103.16 points per 100 possessions and still lose**.

Using this illustration, the Cavaliers finished with an offensive rating of **104.08**. In order for the Celtics to even tie the game, they must have maintained an offensive rating of **107.37!** Since the Celtics only managed **104.21** points per 100 possessions; they did not meet the requirement to win and lost despite scoring more per possession than their opponent.
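We can verify this arithmetic directly. The point totals below (102 for Cleveland, 99 for Boston) are back-solved from the quoted ratings rather than taken from a box score:

```python
def rating(points, possessions):
    # Points per 100 possessions.
    return 100 * points / possessions

# Opening night: Cleveland 98 possessions, Boston 95 (from the text).
clev_pts, clev_poss = 102, 98   # implies the quoted 104.08 ORtg
bos_pts, bos_poss = 99, 95      # implies the quoted 104.21 ORtg -- and a loss

# The rating Boston needed just to TIE Cleveland's point total:
break_even = rating(clev_pts, bos_poss)
print(round(rating(clev_pts, clev_poss), 2))  # 104.08
print(round(break_even, 2))                   # 107.37
```

The asymmetry in possession counts is the whole story: the break-even rating for the team with fewer possessions is always higher than its opponent's rating.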

While this happened to the Celtics, they have not been a participant in this situation since. In fact, over the nine games, this phenomenon has happened to 16 teams; only Cleveland and Minnesota are second offenders. Cleveland was the winner in both situations, while Minnesota split the pair.

Here’s a list of all the games that have been affected by the ORtg > DRtg but lost phenomenon:

Finally, we can start comparing teams based on their distributions of offensive and defensive ratings. To do this, we can do something boldly naive and assume that ratings follow a Gaussian distribution. If we take a look at the density plots above; **particularly the dashed lines**, we find that defensive ratings for the Boston Celtics are effectively Gaussian; however, offensive ratings are definitely not. This means, even if we extrapolate to the full league, we are giving up some information by assuming a Gaussian distribution. That said, using this as an illustration will help us for when we want to do this for real.

Before we start comparing teams, let’s focus quickly on the Gaussian assumption. Here, offensive and defensive ratings are being viewed as a **bivariate random variable**. That is, there are two random variables for each team. While having a higher offensive rating than defensive rating does not necessarily mean a win, we can point to the low frequency of such a phenomenon happening and integrate the distribution below a “break even” line to estimate the likelihood of a team winning.

This means we must understand the distribution function for the bivariate Gaussian distribution; as this distribution is modeling our offensive and defensive ratings. The distribution function is given as
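Written out, the standard bivariate Gaussian density is

```latex
f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}
\exp\!\left(-\frac{1}{2(1-\rho^2)}\left[
\frac{(x-\mu_x)^2}{\sigma_x^2}
- \frac{2\rho(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}
+ \frac{(y-\mu_y)^2}{\sigma_y^2}\right]\right)
```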

Let’s break this down quick. First, the values of **x** and **y** are **offensive** and **defensive ratings**, respectively. This means that **mu_x** and **mu_y** are the **average offensive **and **defensive ratings**, respectively. Similarly, **sigma_x** and **sigma_y** are the **standard deviations **of the **offensive **and **defensive ratings**, respectively.

Finally, the value **rho** illustrates the **correlation between a team’s** **offensive and defensive rating.** If this correlation is exactly 1, then for every point increase (or decrease) of a team’s offensive rating, their defensive rating increases (or decreases) by exactly a point as well. Translating this to basketball speak, the more a team scores per possession, the more they give up (exactly) on defense per possession. We consider this the Enes Kanter effect.

However, if this correlation is -1, then for every point increase in offensive rating, the defensive rating **decreases** by a point. This indicates that the teams with negative correlation tend to be in blowout victories and defeats.

Illustratively, the bivariate Gaussian distribution reflects the Gaussian distributions on both the offensive and defensive ratings. Their intersection will yield ellipses, stretched by the standard deviations in ratings and rotated by the correlations between ratings.

Therefore, we should be interested in standard ellipses such as the 95% confidence ellipse for each team.

Applying this distributional assumption to the Golden State Warriors, we can plot the 95% contour for their ratings distribution. We can then compute the “break even” line and calculate what percentage of the distribution is underneath the line.
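Computing that percentage does not require numerical integration: under the bivariate Gaussian assumption, the net rating X - Y is itself Gaussian, so the mass below the break-even line has a closed form. A sketch with hypothetical team parameters (not the Warriors' fitted values):

```python
from math import erf, sqrt

def p_positive_net(mu_o, mu_d, sd_o, sd_d, rho):
    # Under a bivariate Gaussian, the net rating X - Y is Gaussian with
    # mean mu_o - mu_d and variance sd_o^2 + sd_d^2 - 2*rho*sd_o*sd_d,
    # so P(X > Y) is a single normal CDF evaluation.
    mu = mu_o - mu_d
    sd = sqrt(sd_o**2 + sd_d**2 - 2 * rho * sd_o * sd_d)
    return 0.5 * (1 + erf(mu / (sd * sqrt(2))))

# Hypothetical team: +4 net rating with a noisier offense than defense.
print(round(p_positive_net(114, 110, 12, 8, 0.3), 4))
```

Note that positive correlation shrinks the variance of the net rating, so two teams with identical means and standard deviations can still have different break-even probabilities purely through rho.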

The Warriors’ green dots (wins) are all below the break-even line. However, not all the red dots (losses) are above the line. In fact, the Houston Rockets game mentioned above is barely below the break-even line.

As we calculate the area underneath the break even line, the Golden State Warriors have **62.96% of their confidence region in the positive rating differential**. Also note that the resulting ellipse is oblong. This is due to the offensive ratings having a higher standard deviation than the defensive ratings. This identifies that the Warriors tend to have much better shooting days than others; as evidenced by some games being below 100 points and some games being higher than 130 points.

Now if we add the Houston Rockets to the mix, we can start comparing the two teams. By overlaying the two distributions, we gain insight into how the two teams operate over their first 18-19 games.

Here, we see that the Warriors’ distribution almost entirely engulfs the Rockets’ distribution. What this indicates is that the average offensive and defensive ratings are nearly equal, but the Rockets’ standard deviations are **tighter**. In fact, the Rockets have **64.24%** of their distribution underneath the break even line, indicating that the **Rockets are more capable of outscoring opponents over 100 possessions than the Golden State Warriors.**

In fact, when matching these teams up, this would even indicate that the **Rockets have a 51.37% chance of beating the Warriors,** with an expected score of 118 – 116 over 100 possessions. Sounds a little familiar, doesn’t it?

If we toss Boston into the mix, we find that all three teams are solidly below the break even line; as they are the three teams with the largest differentials.

We see that Boston is the tightest run ship of the bunch. Despite having the smallest ellipse, they have the highest below break even line probability with **68.55%**. This indicates that the Celtics are indeed the top team when considering point differentials.

Displaying every team’s pictures would be tedious at this point. Instead, we can display every team’s summary statistics: **means, standard deviations, **and **correlation**. This will give an insight into their elliptical distributions.

Here, we see that every team is above 100 points per 100 possessions; thanks to corrected possession counts. We also see correlative phenomena with teams such as the **Warriors**, **Wizards**, **Hornets**, **Pelicans**, **Celtics**, **Thunder**, **Nets**, **Bucks**, **Heat**, **Lakers**, and **Bulls**. These teams are in the camps of allowing more points as they score more points. To be clear, this does not indicate that games are close, but rather that as a team scores more per 100 possessions, their opponent scores more per 100 possessions.

For example, the Warriors may typically outscore teams 118 to 107. If they score 125, their opponent may score around 114. Similarly, if they score 110, their opponent may score around 100.

That said, the **Rockets, Knicks, Timberwolves, Magic, Hawks, **and **Mavericks** all have negative correlations, indicating that they are typically blow-out teams; for better or worse. The Timberwolves, for instance, have endured at least four blow-out losses while handing out at least four blow-out wins. What this indicates is that these teams will either excel when building leads or implode when giving up points. These are the teams that tend to be volatile over the course of the season and are **not likely to win an NBA championship**, even if their record is relatively high.

Despite this, the season is early, and these numbers are close enough to zero that anything can change over the series of four games.

We leave off with a simple comparison of teams within a division. For this example, we complete the Boston Celtics thread and look at the Atlantic Division.

Here, we see that the Philadelphia 76ers (salmon) have a wildly large defensive variation. Due to this, the Sixers are a difficult team to rate above and below others. In fact, their below break even probability is **50.37%**. This is a team that, ratings-wise, should finish with 40-42 wins and may be on target for either 36 wins or 46 wins. All purely due to variability.

The Brooklyn Nets are also a curious case as they started out strong early in the season and their distribution has been slipping from below the break even line to above the break even line; as they are now being outscored by roughly 5 points per 100 possessions. Since the team is a fast-paced offense, looking at upwards of 100+ possessions a game, this does not bode well for the young team.

Given these distributions, we find that the Atlantic Division is one of the stronger divisions in the league when it comes to offensive and defensive ratings. In comparison, we can look at the Pacific Division and find that this is actually one of the weaker divisions in the league; thanks in part to the young teams of Los Angeles (Lakers), Sacramento, and Phoenix.

In case you are unclear which ellipse is the Kings and which is the Lakers; the Sacramento Kings have the larger ellipse that points upward and rests almost entirely above the break even line. The Lakers are pointed to the right and are tightly compact near the break even line in the lower left of all the ellipses.

Comparing the two divisions, we see that while the Warriors are by and large the best team in the division (if not, arguably, the league), the rest of the division is not so hot. Every other team is below 50% when it comes to having a positive ratings differential.

What we are able to gain in insight here is how teams actually interact with their ratings and how we can actually use them to compare teams; as opposed to simply performing a sort. Furthermore, by correcting for possessions, we can actually start to compare teams from different seasons. While folks do this now with estimated ratings, those comparisons must be taken with a grain of salt as possessions are grossly overestimated.

The question now is… how do we correct for season with no play-by-play data?

One of the question marks coming into this season was the isolation tendencies of players such as Jimmy Butler and Andrew Wiggins; particularly when it comes to spacing and the ability to create. In this article, we break down the offensive schemes of the Timberwolves, their rotations, and associated statistics indicating the quality of player interaction.

First, we take a look at the rotations of the Timberwolves. A **rotation** is defined as a period of time played by five players. The collection of the first five players on the court is called the **starting rotation**. **Stability** is then defined as having rotations that typically last the longest on the court. Stability can either be a blessing or a curse for coaching staffs. If a team is stable, then the rotations that play lengthy periods of time are playing either because they are successful with limited fatigue **(solid rotations)**; or the team is in dire straits and maintains a short bench **(stretching rotations)**. Similarly, unstable teams may either have several quality, yet interchangeable, players **(distributed rotations)**; or the team is platooning players in hopes to either gain experience or find players capable of earning minutes **(platooning)**.

To determine a team’s rotation, we take a look at their **common rotation**. A common rotation is defined as the rotation that typically plays over the course of a given second. For Minnesota, with 15 games played, at each second of game time, we query the rotations that are on the court. At most, there are 15 rotations. The rotation appearing in the maximum number of games for that particular second of the game is defined as the common rotation.
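As a sketch of the query (the per-second lineup layout here is hypothetical; real play-by-play data requires assembling these stints first):

```python
from collections import Counter

def common_rotation(lineups_at_second):
    # lineups_at_second: one five-player lineup per game for a given second
    # of game time (hypothetical data layout). The common rotation is the
    # lineup appearing in the most games at that second.
    counts = Counter(frozenset(l) for l in lineups_at_second)
    lineup, games = counts.most_common(1)[0]
    return lineup, games

# The first second of the game across five hypothetical games:
games = [
    ["Wiggins", "Butler", "Towns", "Teague", "Gibson"],
    ["Wiggins", "Butler", "Towns", "Teague", "Gibson"],
    ["Wiggins", "Butler", "Towns", "Teague", "Gibson"],
    ["Wiggins", "Butler", "Towns", "Jones", "Gibson"],
    ["Wiggins", "Butler", "Towns", "Teague", "Gibson"],
]
lineup, n = common_rotation(games)  # the starters, in 4 of 5 games
```

Repeating this query for all 2,880 seconds of a regulation game traces out the common-rotation timeline used below.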

The distribution of common rotations ranges between 12 at its minimum and 51 at its maximum. To give insight, this means that rotations, on average, range between **56 seconds** and **4 minutes**. The average rotation lasts **91.78 seconds**.

From the distribution of number of rotations, we see that rotations for teams are split into the two camps: **stable** and **unstable**. If we look at the win percentages for each team, we start to see the separation into **platooning** and **solid rotations**.

We see that teams that tend to have a high number of rotations are teams that are struggling to find optimal line-ups that can sustain high-level of performance. Similarly, teams that have a low number of common rotations are teams with stable, high-performance offenses such as **Houston**, **Golden State**, and **Minnesota**.

The upper right quadrant of teams are winning teams that have a high number of common rotations over their first 14-17 games. These teams? **Boston** (Hayward injury), **Milwaukee **(Bledsoe – Monroe trade), **Philadelphia **(Limiting Embiid’s minutes), **Cleveland **(Age, multiple solid players), **San Antonio **(Age, multiple solid players).

In the lower-left corner of the plot, we obtain teams that have stable rotations but find themselves in losing situations. We actually see a trend that heads downward, indicating that the more a team loses, the more likely it is to start platooning. This is the case with the **Los Angeles Lakers** and **Chicago Bulls**.

Despite only having 12 standard rotations, the Timberwolves have played a total of **52 different rotations** across 15 games. In comparison, the Boston Celtics have played **157 different rotations across 16 games! **What this shows is that the Timberwolves have stability through player capability, injury, and roster changes; as well as that Tom Thibodeau maintains a fairly predictable rotation schedule.

The primary rotation for Minnesota is **Andrew Wiggins**, **Jimmy Butler**, **Karl-Anthony Towns**, **Jeff Teague**, and **Taj Gibson**. Together, this unit has participated in 19,837 seconds of action. That is, this unit has played together for **330 minutes and 37 seconds**. This equates to **6 games, 42 minutes and 37 seconds** of action. The most in the entire league.

Comparing how the starting rotation stacks up, the rotation has played in far less offensive possessions than defensive possessions. These situations commonly occur when free-throw shooting becomes a requirement late in games and we find the offensive-defensive substitution pattern take effect. Despite playing in 28 fewer offensive possessions than defensive possessions, the starting unit maintains a **plus 37** in scoring. In effect, the starting rotation scores **1.19 points per possession** while holding opponents to **1.09 points per possession**. While this is not the best in the league, the differential over the high volume of minutes played is promising.

Thanks to the physical abilities of Towns and Gibson, this rotation also dominates the boards; **out-rebounding opponents by 49 rebounds in 28 fewer possessions.** This would not be eyebrow-raising if the Timberwolves were a terrible shooting team; **but they aren’t**. This rotation has **only four more misses than their opponents** on field goal attempts. This indicates that the Timberwolves’ rebounding percentage is **54.19%**. For a large number of field goal attempts, this indicates a wildly high rebounding differential over their opponents.

The second most common rotation has played 3908 seconds together. This rotation consists of **Gorgui Dieng**, **Jamal Crawford**, **Nemanja Bjelica**, **Shabazz Muhammad**, and **Tyus Jones**. This rotation is considered the **second string rotation** as all players are bench players, but it logs the most time on the court after the starting unit.

**Note:** The third rotation is a mixture of the starters and second string: **Andrew Wiggins, Gorgui Dieng, Jamal Crawford, Nemanja Bjelica**, and **Tyus Jones**. This rotation plays a total of 1966 seconds.

We see that the second unit outscores opponents in high fashion much like the starting rotation; outscoring opponents by 26 points over 6 extra possessions: **1.16 points per possession vs. 1.02 points per possession**.

While stability is maintained by both the starting and second-string rotations, it’s the transition that creates problems for the Timberwolves. For instance, the mixture rotation of **Jamal Crawford, Jeff Teague, Jimmy Butler, Karl-Anthony Towns, **and **Shabazz Muhammad** has been outscored **1.45 points per possession to their 0.96 points per possession**, albeit over only 24 offensive and 22 defensive possessions together. This equates to losing 1-3 points per game. Swapping **Andrew Wiggins** in for **Shabazz Muhammad** is no better; losing an extra **0.25 points per possession**, costing the Timberwolves an average of a point a game.

As we look at the common rotation strategy employed by Thibodeau, we see the progression of scoring over the course of a game.

We see that the starting rotation typically starts the game, finishes the first half, starts the second half, and finishes the game. Their usual stretches are the first 9:05 of the first quarter, the final 5:47 of the first half, the first 8:30 of the second half, and the final 7:42 of the game. Needless to say, this is Thibodeau’s main unit in **every quarter**.

From the above plot, we also see that the Timberwolves are more of a second half team. Their standard rotations are consistently outscored in the first half; even mid-way into the third quarter. Despite this, Minnesota turns on the jets and outscores opponents in the second half; when their standard rotations are on the court.

With the primary players of **Jimmy Butler** and **Andrew Wiggins**, the Timberwolves have a pair of notorious isolation players. Similarly, with **Karl-Anthony Towns**, **Taj Gibson, **and **Shabazz Muhammad**, the Timberwolves also have a strong interior presence. In an effort to make these two components work, Minnesota requires deep threats to deter defenses from blitzing the **(obviously)** pick-and-roll offense. These shooters are **Nemanja Bjelica**, **Jeff Teague**, and **Jamal Crawford**.

Despite this, the Timberwolves are not a “bombs away” type of team. Minnesota has only launched 342 three-point field goal attempts, connecting on 129 for a 37.7% rate. This is only 22.8 three point field goals attempted per game. **This is the second lowest total in the league**, ahead of the **Sacramento Kings **(21.4 attempts per game).

What this indicates is that the Timberwolves play almost an entirely inside game.

If we take a look at the shot distribution of the Timberwolves, we find that the team clusters about the arc, as well as inside the paint. However, the Timberwolves have one of the highest rates of mid-range jump shots in the league.

Jimmy Butler takes approximately 13 field goal attempts per game and scores roughly 11.2 points per game from the field.

As a wing player on offense, we find that a hefty amount of Butler’s field goal attempts come from the 15-18 foot range, primarily from the wing positions. For a 40.0% field goal shooter, this is not a desirable shot profile. If we color code makes and misses, we find that the majority of those misses are coming from 12-18 feet out.

The other premier wing scorer is Andrew Wiggins, who shoots roughly 45.5% from the field, accounting for 15.4 points of production from the field. Wiggins’ game is eerily similar to Butler’s: isolation plays, attack the rim (if possible), but pull up from mid-range if contested.

Again, we see a slew of mid-range jump shots, with the majority being missed. While this bodes poorly for players like Wiggins and Butler, they have the added advantage of knowing either Taj Gibson or Karl-Anthony Towns is underneath the rim, able to stalk offensive rebounds. Recall from above that this is to the tune of +20 offensive boards over their opponents. That is, **28.9% of rebounds during the Timberwolves offense end up back in Minnesota’s hands**.

That Karl-Anthony Towns is a better three-point shooter than Wiggins and Butler is more a testament to the under-development of Wiggins and Butler as outside scoring threats than to the development of Towns as a perimeter scorer. Despite his size, Towns has a high probability of taking a mid-range jump shot, as he attempted 35 over the course of the first 15 games of the season. Compare this to his 54 attempts beyond the arc, and we find that over 40% of Towns’ field goal attempts come from outside the key. Both his field goal percentage from this range (33-for-89; .371) and his low propensity for obtaining rebounds from these positions on the court are desirable for defenses.

Given this, Towns scores approximately 17.3 points per game from the court over 14.7 field goal attempts, displaying decent efficiency from the field.

Jeff Teague is the other primary scorer on the Timberwolves. Averaging roughly 11.5 points per game from the field, Teague plays in a very centralized location on the court. As the primary ball handler on the offense, Teague obtains most of his attempts from the top of the key and from penetration into the key.

As we see his distribution of field goal attempts, there are only a handful of attempts outside of the 60-degree wedge from the basket. While Teague has better success from mid-range than Butler, Wiggins, and Towns, he finds himself with difficult shots in the paint, missing the majority of these attempts. Of his 113 field goal attempts from within the arc, a total of 45 attempts are taken inside the paint, outside of the charge circle. Of these 45 attempts, Teague managed to convert only 14: **a conversion rate of 31.11%**. Why are these shots important? These attempts come from the standard offensive pattern for Minnesota, which results in **floaters**.

The reason these jumpers are commonly taken in the mid-range is purely due to the offensive game plan. The standard offensive game plan is a low-motion screen-and-roll offense. This action will force a 2-on-2 game between the ball handler and the post inside the lane. As a direct result, Minnesota will either score in the post or obtain a mid-range jump shot. If both looks are well-guarded, then a pass to the perimeter opens up extra looks. However, the offense can be stagnant at times, as we shall soon see.

Minnesota attempts to play to their strengths of strong isolation wing shooters and a dominant low post scorer. In an effort to create spacing, Thibodeau leverages the pick-and-roll offense into the paint. To give an example, in their recent game against San Antonio, Minnesota ran 91 offensive possessions and went to the pick-and-roll offense **74 times out of these 91**. The remaining 17 possessions included fast-break attempts and possessions that resulted in immediate fouls.

Minnesota creates spacing by remaining relatively stagnant on the perimeter while allowing their premier big man to pull the defense into the paint. Their initial offense will look like a four-out, one-in motion offense, but it is designed to place two bigs in the same short corner of the court.

This initial offense allows for the post to set a screen at the top of the key. Since the other three players are out at the perimeter, this creates a 2-on-2 within the key. At this point, a mid-range jumper is taken, a slip pass to the rolling big is given, or a kick out to the perimeter is initiated.

Let’s see it in real time.

In this clip, we see Towns screen Butler. **LaMarcus Aldridge** and **Danny Green** cover the screen well, forcing a kick out to Towns as he is unable to roll. Picking up a one-on-one against Aldridge as Butler rolls out to clutter the left-hand wing, Towns drives to the hoop, forcing the entire Spurs defense to collapse into the lane. This leaves **Taj Gibson** open in the corner for an open three. While not a primary three-point option, Timberwolves bigs are trained to shoot the three. In this case, Gibson connects for the first basket of the game.

Here we find one of our first **wedge screens**, which are common in Thibodeau’s offense. Towns sets the screen for **Tyus Jones** from the left elbow. Setting a second screen, Towns frees **Nemanja Bjelica**. Bjelica hesitates on the perimeter attempt and kicks out to **Jamal Crawford**. This results in a sideline screen-and-roll which leads to a Towns 2-point basket.

This is probably the most sophisticated version of the Minnesota offense. Again, the shooters are planted in the corners. This time it’s Teague and Wiggins. Here, Towns and Bjelica set a dual staggered screen on Butler. Bjelica breaks off a secondary pin down on Teague in a twist action to free Teague for penetration. **Joffrey Lauvergne **picks up Teague, leaving Bjelica free to float to the extended elbow. Teague kicks out to Butler, who swings back to the open Bjelica for three.

The above possession starts with an overloaded right side, but quickly morphs back into the standard formation with Towns setting the screen on Butler. Teague and Gibson float to the corners as Wiggins creeps up along the sideline. With **Pau Gasol** and Danny Green reading the screen and roll, they entice Towns to become the long range shooter. Towns obliges and hits a low percentage basket from 20 feet out.

Here is a classic action from the Minnesota arsenal. In this case, the transition offense looks for a quick post up for Towns but doesn’t find it. Instead, Gibson and Teague look for the pick and roll at the top of the key. Teague penetrates and flips the floater at the free throw line. As usual, the basket does not fall.

Here, Minnesota breaks from standard formation to run a Warriors-style offense. Butler slips a faux screen and turns into a wheel cut underneath the basket, coming off a weak-side staggered pin down. Towns, in turn, sets the pick-and-roll. As Teague goes over the top of the screen, **Patty Mills** chases over it as well, forcing an Aldridge 1-on-2 against Teague and Towns. The options here are to find Butler coming off the screen for a jumper or take the mid-range attempt. Again, Teague takes the floater in the lane; this time for success.

Back to the patented screen-and-roll action, Teague is caught losing his dribble as **Patty Mills** and **Danny Green** collapse onto Taj Gibson. Teague skips to Green’s man, Jimmy Butler, which results in an open look for three.

Once again with the standard formation, Gasol is forced to cover Tyus Jones. This allows Towns to slip freely down the lane for an uncontested dunk.

We see that the secondary unit is once again running more sophisticated plays as they run a wheel screen with Tyus Jones. Gibson sets the screen on Crawford; however, Crawford pulls the pick-and-roll along the three-point line, allowing Danny Green to hedge the roll. Having to reset the offense with 4.9 seconds remaining on the shot clock, Minnesota goes into scramble mode, taking a difficult floater in the lane. As is the case with Karl-Anthony Towns on the floor, the rebound is left unboxed and Towns slams home the offensive rebound.

Back to the classic pattern with the primary offense on the floor. Towns comes to set the screen as little motion occurs on the perimeter. Butler takes the jumper at the elbow, misses, but manages to collect the long rebound and get fouled in the process.

Again out of standard formation, Gibson sets the screen for Teague. Teague penetrates and finds Gibson, who is fouled on the ensuing attempt.

Here is the second time we see the staggered dual pin down screen play. We start from standard formation, but the San Antonio defense responds by not letting Towns roll. Butler kicks out to Towns, who resets the offense, waiting for the wheel screen from Wiggins on the pin downs from Gibson and Butler. As this weakside motion goes on, Towns and Teague run a pair of screens, allowing Teague to drive baseline. Instead of kicking out to Butler, as Green hedges back for the potential kick-out, Teague takes the reverse lay-up attempt and misses.

Another pick and roll with Towns and Teague. Another series of no movement on the perimeter. Results in a Teague floater, but a foul on San Antonio.

Another screen and roll with a mid-range jumper from Butler. Another miss.

With a slight wrinkle with Gibson and Butler, the offense starts 12 seconds into their possession with a Towns on Teague screen. With a nice slip pass from Teague, Towns gets to the rim uncontested for another dunk.

With these 15 plays of the 94 possessions, we have provided some insight into the Minnesota offense. The reason the Minnesota offense works is the ability of the wings to penetrate and the posts to dominate the paint. Thibodeau’s offense is stagnant with little weak-side movement, and it shows in several of the plays above. Despite this, the strong guard play (Teague was 7-for-13 from the field) and the slick shooting from Towns (10-for-18 from the field, including 2-for-2 from beyond the arc) kept the Timberwolves ahead in this game. Contrary to their usual form, Minnesota shot 9-18 from three, giving them some leeway later into the game.

However, how could the offense improve to better leverage the stars on the Minnesota roster? So far it’s worked in Minnesota’s favor as they are currently 10-5 through 15 games. The question is, will it continue as the season wears on?

The most common model of this type is the **Bradley-Terry model**.

The basic form of the Bradley-Terry model focuses on pairwise match-ups between teams, dependent on location, and records whether the home team has come away with a victory. To model this, the **explanatory matrix**, **X**, is an **N-by-31** matrix of variables, where each row of the matrix represents a **game**. We use 31 variables to identify the 30 NBA teams, as well as an **intercept term** to correct for global mean of the observed results.

With the idea that each row of the explanatory matrix is a game, we indicate the home team as a “+1” and the away team as a “-1” value. For the intercept, we set the last value in the row to “+1.” For instance, last night the **Boston Celtics** defeated the **Los Angeles Lakers **107 – 96 in **Los Angeles, CA**. If we place each team in alphabetical order, the Celtics are the **second entry in the row**, after the Atlanta Hawks, while the Lakers are the **14th entry in the row**, after the Los Angeles Clippers. This means that the second entry in the row is “-1” as the Celtics are the visiting team and that the 14th entry is “1” as the Lakers are the home team. The final entry is the value **one** to correct for the intercept.

This means that the row of the explanatory matrix is given by

**(0,-1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1)**
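As a quick sketch (assuming zero-based alphabetical team indices, which mirrors the convention above but is not necessarily the author's exact implementation), this row can be built in a few lines of Python:

```python
# Sketch: build one row of the Bradley-Terry explanatory matrix.
# Teams are assumed indexed alphabetically 0-29; slot 30 is the intercept.

def game_row(home_idx, away_idx, n_teams=30):
    """Return the +1 (home) / -1 (away) / intercept row for a single game."""
    row = [0] * (n_teams + 1)
    row[home_idx] = 1     # home team
    row[away_idx] = -1    # away team
    row[n_teams] = 1      # intercept column
    return row

# Celtics (index 1, after Atlanta) visiting the Lakers (index 13):
row = game_row(home_idx=13, away_idx=1)
```

Running `game_row(13, 1)` reproduces the row displayed above: a -1 in the second entry, a +1 in the 14th, and a +1 in the last.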

Now, if we are able to write the result of the game in terms of a random variable conditioned on the teams playing and location played, a common question to ask is whether we can fit the response to a linear model. Why linear? Consider this fit:
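The fit in question is the standard linear model:

```latex
Y_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \epsilon_i = \beta_0 + \sum_{j=1}^{30} X_{ij}\,\beta_j + \epsilon_i
```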

Here, **Y_i** is the **response for Game i**. This can be whatever response we feel is right. Want to use the result of the game? Want **“+1” for a home win, “-1” for a road win?** Sure, go ahead. However, we must be careful of the **resulting distributional properties** after we make our decision. For instance, using least squares **will not work properly** if we use “1” and “-1.”

The **beta values** are the **weight** given to each explanatory variable in predicting, or explaining, the response. In the Bradley-Terry Model set-up, let’s apply the** Celtics-Lakers match-up:**

This means the linear model compares the Lakers and Celtics by looking at their coefficients. If the Celtics are the better team, then their coefficient will be **larger** than the Lakers’ coefficient; provided the **larger the response, the more likely a team wins the game**.

Using the explanatory matrix notation above, we immediately see that **beta_0** is the **league-average** **home-court advantage. **

The **epsilon term** is simply the additive error in the model. This means that while two teams play each other, and one is favored, there is some sort of associated error that could have us seeing a response that is not likely. **This accounts for situations like the Sacramento Kings defeating the Oklahoma City Thunder on November 7th**.

This given model is not quite correct, because the response is not well understood. Persisting with the usual least squares will lead us to predicting **real values** instead of **win-loss**, which is what we are ultimately after.

To account for this, enter **logistic regression**.

Logistic regression is a methodology for identifying a regression model for **binary response data**. If you are familiar with linear regression, then the following explanation can be skipped; jump down to the applications to NBA data. If you are unfamiliar, strap in; it’s going to be a mathematical ride.

The Bradley-Terry model makes an assumption that each game played is an **independent Bernoulli trial**. This means that it’s a coin flip, where the coin is weighted by a function of the teams playing and where they are playing. The distribution function is no different than that of Colley’s initial **independent model** without using the **beta prior distribution**. The Bernoulli distribution is given by
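In standard form:

```latex
f(x \mid p) = p^{x}\,(1-p)^{1-x}, \qquad x \in \{0, 1\}
```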

Here, **p** is the probability that the **home team will win the game**. The value **x** is actually the **response**, not to be confused with the values of **X** above. The response is either a **1** if the **home team wins** or a **0** if the home team loses.

The Bernoulli distribution falls under a class of models called the **exponential family model**. In regression, if we are able to write the distribution of a model in an exponential family format, we are able to identify a **link function** that allows us to build a **linear model** to understand the relationship between the explanatory variables and the response. The exponential family format is given by
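In standard notation:

```latex
f(x \mid \theta) = h(x)\,\exp\!\left( \theta\,T(x) - A(\theta) \right)
```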

The value **h(x)** turns out to be the marginal probability measure. This is not entirely necessary for our purposes. The value, **T(x)**, is called the **sufficient statistic**. This identifies a compacted way to collapse (or aggregate) the data in a manner that the distribution associated with the parameter space is unchanged. In simple terms, this is a **data reduction statistic**.

The value **theta** is called the **natural parameter**. This value identifies the link between a linear model and the parameter space identified by the sufficient statistic. The value **A(theta)** is the **log-partition function** (the cumulant generating function) associated with the distribution. If we take derivatives of this function, we obtain the **moments** of the associated distribution.

Let’s start by showing the Bernoulli distribution is indeed an exponential family model.
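Rewriting the Bernoulli density term by term:

```latex
p^{x}(1-p)^{1-x}
= \exp\!\left( x\log p + (1-x)\log(1-p) \right)
= 1 \cdot \exp\!\left( x\,\log\frac{p}{1-p} - \left(-\log(1-p)\right) \right)
```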

We find that **h(x) =1**, **T(x) = x**, **theta = log(p / (1-p))**, and **A(theta) = -log(1-p)**. The sufficient statistic shows that the data point itself contains all information about the parameter **p**.

This shows that the link is the natural log of the ratio of probability of success divided by the probability of failure for the home team. Think of this as the log of the **odds ratio** for a home team winning a game.

While we are here, let’s verify that this function indeed yields moments. The term **-log(1-p)** is not in terms of the natural parameter, so let’s first find that. We start with understanding what the natural parameter looks like:

We then substitute this into the function to obtain it in terms of the natural parameter:

We needed that negative! Look at the exponential family model and notice that there is a negative lurking there. Here, we obtain it explicitly from rewriting the function in terms of the natural parameter. Now, let’s take the derivative with respect to the natural parameter.

We can check the second derivative as well:
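Collecting these steps in one place:

```latex
\theta = \log\frac{p}{1-p} \;\Rightarrow\; p = \frac{e^{\theta}}{1+e^{\theta}}, \qquad
A(\theta) = -\log(1-p) = \log\!\left(1+e^{\theta}\right)

A'(\theta) = \frac{e^{\theta}}{1+e^{\theta}} = p, \qquad
A''(\theta) = \frac{e^{\theta}}{\left(1+e^{\theta}\right)^{2}} = p\,(1-p)
```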

So we are indeed in business as these are the first moment and second central moment (variance) of the Bernoulli distribution!

Now that we have an exponential family distribution with identity sufficient statistic, we can apply the link function. This is merely setting the **response** of the **linear model **above to being the **link function**. Explicitly, we have that
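With **x_i** the ±1 row for Game i described above, the link sets

```latex
\log\frac{p_i}{1-p_i} = \mathbf{x}_i^{\top}\boldsymbol{\beta}
= \beta_{\mathrm{home}(i)} - \beta_{\mathrm{away}(i)} + \beta_0
```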

Note that since the probability of the home team winning the game is between 0 and 1, we have that the odds ratio is between 0 and infinity. Taking the natural logarithm, we obtain a value between negative infinity and positive infinity. We are in great shape for regression!

However, since we **do not know **p, we cannot just solve these equations for the number of games we have for the season. Instead, we look back at our exponential family for help. This process is called performing a **logistic regression**. The link function above, connecting **p** to **theta**, is called the **logistic link**.

To perform the logistic regression, we do exactly as we do in standard least squares regression. We look at “squared error distance” between the model and the results and minimize these errors. In linear regression, we assume the **Gaussian distribution**. When placed into the exponential family model, we get **squared error** loss. That’s not explicitly the case here.

Let’s look at how we do this in the Gaussian case. In the basic linear regression problem, we assume **homogeneity**. This means that the variances are constant and fixed. We still have to estimate them, but they are viewed as constants.

The **negative log-likelihood **of the exponential family identifies our loss function. In this case, for Gaussian distributions, we obtain

Voila! We have least squares sitting in front of us. To recount, the first arrow simply applies negative logarithms to the exponential family distribution for the Gaussian. The second arrow just dusts off the constant terms that have no effect on the minimization procedure.

I will leave it to the reader to verify the exponential family form of the Gaussian, which results in the **identity link, theta = mu**. For now, let’s do the **exact same thing here for the Bernoulli distribution**.

Taking the negative log-likelihood for the Bernoulli distribution, we obtain
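In our notation (with **p_i** the win probability implied by the logistic link), the arrows are:

```latex
-\ell(\boldsymbol{\beta})
= -\sum_{i=1}^{N}\left[ y_i \log p_i + (1-y_i)\log(1-p_i) \right]
\;\Rightarrow\;
-\sum_{i=1}^{N}\left[ y_i\,\mathbf{x}_i^{\top}\boldsymbol{\beta} - \log\!\left(1 + e^{\mathbf{x}_i^{\top}\boldsymbol{\beta}}\right)\right]
\;\Rightarrow\;
-\nabla_{\boldsymbol{\beta}}\,\ell = -\sum_{i=1}^{N}\left(y_i - p_i\right)\mathbf{x}_i
```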

In the Gaussian case we simply take the derivative, set it equal to zero, and solve. In the Bernoulli case, we took the liberty of taking the derivative; which is the second arrow above. Setting this equal to zero and solving **does not work**. Thanks, Logistic function…

Therefore, an iterative scheme must be adopted. The simplest one out there is **Newton’s method**. Newton’s method is a calculus-based method that uses tangent lines to iteratively solve for a root (zero value). This inherently requires the function we are optimizing to have **nice tangent (derivative) behavior**. To compute Newton’s method, we take the function in question with a **good starting point** and compute the tangent line of the function evaluated at that starting point. Where this tangent line **intersects the x-axis** gives us an update for where the zero most likely is.

Since the function whose root we are attempting to find is itself the derivative, we need to take the second derivative of the negative log-likelihood function:
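Differentiating the gradient once more gives:

```latex
-\nabla^{2}_{\boldsymbol{\beta}}\,\ell
= \sum_{i=1}^{N} p_i\left(1-p_i\right)\mathbf{x}_i\mathbf{x}_i^{\top}
= \mathbf{X}^{\top}\mathbf{W}\mathbf{X}, \qquad
\mathbf{W} = \mathrm{diag}\!\left(p_i(1-p_i)\right)
```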

The final line is the Hessian value. Piecing this together, we obtain Newton’s method for solving for the coefficients of Logistic Regression!
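The resulting update, iterated to convergence, is:

```latex
\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + \left(\mathbf{X}^{\top}\mathbf{W}^{(t)}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\left(\mathbf{y} - \mathbf{p}^{(t)}\right)
```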

After choosing a good starting point, we run this until a desired convergence. The output is the **beta** vector of weights for each team. The larger the weight, the **higher associated probability** that team has of winning.

With the Celtics playing at the Lakers on November 8th, we can look at all games up until November 7th and compute the Logistic Regression using the Bradley-Terry formulation. In this case, we obtain the following rankings:

Here, we find that the Boston Celtics’ coefficient is **2.0004** while the Los Angeles Lakers coefficient is **0.4828**. Not shown above is the intercept; which is **0.2229**. Piecing this together, we take note that the Celtics are the visiting team. Therefore the linear model is given by **0.2229 – 2.0004 + 0.4828 = -1.2947**. Placing this into the **logistic link function**, we obtain a **Los Angeles Lakers probability of winning this game to be 21.5058%. **
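A few lines of Python confirm the arithmetic (coefficients as quoted above):

```python
import math

# Check the Celtics-at-Lakers example: the visiting team enters with -1,
# the home team with +1, plus the intercept (home-court) term.
intercept, celtics, lakers = 0.2229, 2.0004, 0.4828
theta = intercept - celtics + lakers          # -1.2947

# Logistic link: probability the HOME team (Lakers) wins.
p_lakers = math.exp(theta) / (1 + math.exp(theta))
print(round(theta, 4), round(100 * p_lakers, 4))  # -1.2947 21.5058
```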

Here, we must make a note that in certain cases, we obtain outrageously unrealistic results. Let’s take for instance a proposed game between the **Boston Celtics** and the **Atlanta Hawks** in **Boston**. Using the coefficients above (the Hawks’ coefficient is **-1.2877**), we have that the natural parameter response is expected to be **0.2229 + 2.0004 - (-1.2877) = 3.5110**. This leads to a **Boston Celtics probability of winning this game of 97.10%**.

While the Celtics are expected to win, why is this probability so absurdly high? This is in part due to the **large variation** associated with the model. The Bradley-Terry model imposes an **iteratively reweighted least squares** scheme where the weights are the associated Bernoulli variances for each game. This is identified immediately by placing the above gradient equation in terms of matrices.

From this construction, we obtain a method for approximating the error associated with the estimation of each team’s weight, **beta_j**. In this case, we obtain a variance for each team of roughly **11,200,000,000,000,000**. This is the exact same problem we run into with **Adjusted Plus-Minus!**

This shows that while we identify a ranking, despite it being statistically correct, the associated variance effectively says that the ranking is simply just that: a numerical ranking. Furthermore, this suggests, through Wald-type testing, that all teams are equal.

Why did this variance inflation happen? First off, **all teams have not played each other**. This identifies that the **support** for the model is missing observations. When this occurs, the model is only fit for match-ups that have at least one sample. If a match-up has not been seen, the **explanatory matrix enforces that the win-loss response does not leverage information about the relative capabilities of those two teams.** Information is lost due to the absence of observable information.

Second, the model assumes we have enough responses for each match-up in order to estimate the variance. One observation? Variance is “**zero**” in the estimable sense. In the model sense, this is effectively infinity as no information of variability exists.

The explanatory matrix is effectively a schedule matrix. Therefore, for each match-up, we need to see multiple observations in order to adequately estimate variation of results within that schedule.

One way to correct for these issues is to apply the exact same methodology as in **adjusted plus-minus**: apply a **regularizer**. This will control variance inflation, but it also performs a singular value decomposition type construction, effectively muting a team’s weight. This in turn turns that team into a **baseline** for other teams to compare against.

In this case, we will be able to gain stable estimates, but at the cost of **interpretation**. In the model above, the value **e^beta** is the **odds ratio ** for a team’s chances of winning. In the regularized setting, this is no longer the case.

Another way to correct is to play around with features. Try to use something other than schedule to build a Bradley-Terry model. Many folks over the years have attempted this. However, when going down this path, keep in mind the signal-to-noise problem that reared its ugly head above.

Finally, we leave you with some basic Python code to reproduce the results above.

First, we process the data. Assume files where each line is date, winner, score, loser, score. Then we simply open the file, read the lines, and hold them in memory. Similarly, we create an NBA team dictionary to use for indexing and to identify the number of games played.

Next, we populate the explanatory matrix and response matrix. This is just a simple sweep through the data file. We also perform some of the basic linear algebra functions that will be needed later; such as a matrix transpose, some multiplication, and rankings initialization.

Next, we perform Newton’s method to identify a set of coefficients that in turn give us our team rankings.
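As a minimal, pure-Python sketch of this step (not the exact code from the post: the three-team schedule below is a hypothetical toy example, and the small ridge term is an addition of ours, echoing the regularizer discussion above, to keep the rank-deficient Hessian invertible):

```python
import math

def sigmoid(t):
    """Numerically safe logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting (small systems only)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def newton_logistic(X, y, iters=20, ridge=1e-3):
    """Newton's method for (ridge-penalized) logistic regression.

    The ridge term keeps the Hessian invertible, since Bradley-Terry
    team columns sum to zero and the design matrix is rank-deficient."""
    n, k = len(X), len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        p = [sigmoid(sum(x * b for x, b in zip(row, beta))) for row in X]
        # Gradient: X^T (y - p) - ridge * beta
        grad = [sum(X[i][j] * (y[i] - p[i]) for i in range(n)) - ridge * beta[j]
                for j in range(k)]
        # Hessian: X^T W X + ridge * I, with W = diag(p (1 - p))
        hess = [[sum(X[i][j] * p[i] * (1 - p[i]) * X[i][l] for i in range(n))
                 + (ridge if j == l else 0.0)
                 for l in range(k)] for j in range(k)]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

# Toy schedule: columns are [A, B, C, intercept]; +1 home, -1 away.
games = [
    ( 1, -1,  0, 1, 0),  # B wins at A
    (-1,  1,  0, 1, 0),  # A wins at B
    ( 1,  0, -1, 1, 1),  # A beats C at home
    (-1,  0,  1, 1, 0),  # A wins at C
    ( 0,  1, -1, 1, 1),  # B beats C at home
    ( 0, -1,  1, 1, 1),  # C beats B at home
]
X = [list(g[:4]) for g in games]
y = [g[4] for g in games]
beta = newton_logistic(X, y)
print([round(b, 3) for b in beta])  # A (3-1) should rank above C (1-3)
```

Starting from the zero vector, rather than a random start as in the post, makes the result deterministic.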

The resulting rankings vector yields our team rankings. We can simply return this as a function for a later block or display accordingly.

Also note that as this is a 31-dimensional walk with a random initial start, each time we run Newton’s method, we will get slightly different (but effectively the same) results. To tighten this component, we either have to find an optimal hot start location or utilize a different convergence metric.

As of the morning of November 9th, 2017, there has been a total of 163 NBA games played. Applying Bradley-Terry to these games, we obtain the current rankings:

How would you build your own model?

This requires construction of a **Team Defensive Rating**, a **Defensive Points Per Scoring Possession**, and a **Stop Percentage**. In this article, we take a look at the construction of defensive rating. But more importantly, as it is a box score calculation, we look to see how it compares to truth by using play-by-play data.

The first calculation, **stop percentage**, attempts to identify the percentage of possessions that result in no points: **blocks, steals, defensive rebounds**. Since blocks do not necessarily end possessions, there must be some form of estimation to identify the percentage of blocks that result in termination of a possession.

Stops, as defined by **Dean Oliver**, is a two-part process. The first part is the **individual part**. This portion attempts to identify stops generated explicitly from the player through their **blocks**, **steals**, and **defensive rebounds**. The second part is the **team part**. This portion attempts to identify stops generated by the team when the player is on the court.

Individual stops is calculated as
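In the form Oliver published (with DFG% = Opp FGM / Opp FGA the opponents' field goal percentage, DOR% = Opp ORB / (Opp ORB + Tm DRB) the opponents' offensive rebounding percentage, and FMwt the "forced miss weight"):

```latex
\mathrm{FMwt} = \frac{\mathrm{DFG\%}\,(1-\mathrm{DOR\%})}{\mathrm{DFG\%}\,(1-\mathrm{DOR\%}) + (1-\mathrm{DFG\%})\,\mathrm{DOR\%}}

\mathrm{Stops}_{\mathrm{Ind}} = \mathrm{STL} + \mathrm{BLK}\times\mathrm{FMwt}\times\left(1 - 1.07\times\mathrm{DOR\%}\right) + \mathrm{DRB}\times\left(1-\mathrm{FMwt}\right)
```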

Note that this is a three part equation for **steals**, **blocks**, and **defensive rebounds**; in that order. Let’s break down this somewhat intimidating equation through each of these three parts.

The first part is **steals**. If a steal occurs, the possession ends. This is the primary reason the possession has ended.

The second part is **blocks**. In this case, a block does not necessarily end a possession. In fact, a block may also end a possession as a **defensive rebound** that may or may not be obtained by the player who recorded the block. So how do we break down a block?

The first parentheses deals with rebounding relative to shooting attempts. This can actually be written down in terms of a tree diagram of **conditional probabilities**.

This term looks for only two of the four instances: **defensive rebounds when field goals are made** and **offensive rebounds when field goal attempts are missed**. The latter condition identifies possessions that **continue** after a missed field goal attempt. The former term of **defensive rebounds on made field goals** should never happen. Right? **Wrong**. They do happen, but require free throws.

The last term for blocks takes an opponent’s offensive rebounding percentage and increases it by **seven percent**. This percentage increase is to correct for team rebounds. Therefore, one minus this corrected offensive rebounding percentage yields a **defensive rebounding percentage**. Thus, the blocks calculation identifies the **number of blocks that result in either defensive rebounds (second term) or continuation of play that results in made field goals (first term)**.

The **defensive rebound** portion identifies rebounds when field goal attempts are missed. We again see the continuation-of-play with made field goals percentage from the blocks calculation. This time, we find the opposite values, which are missed field goals. Multiplying by the missed field goal percentage, relative to continuation and made FGs, we obtain the **number of missed field goals that terminate in defensive rebounds**.

Piecing these together, we have steals, field goals missed and defensively rebounded, blocks that are defensively rebounded, and blocks that eventually lead to baskets.

Next, we focus on the team contribution of stops when a player is in the game. It is given by the formula
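In published form (writing DOR% = Opp ORB / (Opp ORB + Tm DRB) and FMwt = DFG%(1-DOR%) / (DFG%(1-DOR%) + (1-DFG%)DOR%), the forced miss weight built from the opponents' field goal and offensive rebounding percentages), team stops reads:

```latex
\mathrm{Stops}_{\mathrm{Tm}} = \left( \frac{\mathrm{Opp\,FGA} - \mathrm{Opp\,FGM} - \mathrm{Tm\,BLK}}{\mathrm{Tm\,MP}}\times\mathrm{FMwt}\times\left(1 - 1.07\times\mathrm{DOR\%}\right) + \frac{\mathrm{Opp\,TOV} - \mathrm{Tm\,STL}}{\mathrm{Tm\,MP}} \right)\times\mathrm{MP} + \frac{\mathrm{PF}}{\mathrm{Tm\,PF}}\times 0.4\times\mathrm{Opp\,FTA}\times\left(1-\mathrm{Opp\,FT\%}\right)^{2}
```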

Again there are three terms. The first term focuses on field goal attempts that are neither made nor blocked. Due to the inclusion of missed field goals and blocks, we have the same correction with eventually made field goals and defensive rebounds. As we look explicitly at missed field goals, the made field goals inclusion comes from possessions that **terminate on defensive rebounds** despite having a made field goal.

The second term focuses on turnovers that are not generated by steals. These are bad passes out-of-bounds, shot clock violations, traveling, double-dribbling, et cetera. The first and second terms are **scaled by minutes played**.

Since these are box score calculations, there is an assumption that a **uniform distribution of field goal attempts per second** is upheld. The scaling by minutes played leverages this uniform distribution.

The third and final term is the percentage of free throws off of fouls that result in zero points. The squared term represents **two consecutive misses**. There is an assumption of **two free throws** on average, as one-free-throw possessions are either **continuation free throws on made baskets** or **empty-possession technical fouls**. The value of **0.4** is the 15-year-old constant for the **percentage of free throws that are possession ending**. This value has since been updated to 0.44 in some cases, or learned to a value near 0.43 through the use of play-by-play data.

This time instead of scaling by minutes, we scale by fouls.

Adding Individual and Team Stops, we obtain **stops** for when a player is in the game. This is the most complicated portion of identifying defensive rating. We can then calculate **stop percentage** for a player. This is given by
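Matching the description below, this is:

```latex
\mathrm{Stop\%} = \frac{\mathrm{Stops}\times\mathrm{Tm\,MP}}{\mathrm{Tm\,Poss}\times\mathrm{MP}}
```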

This formula calculates the number of stops per possession and scales by the minutes played. Think of the formula as the following: Stops per minutes played for an individual divided by the possessions per minutes played by the team. This yields the estimated stops per possession when a player played.

Recall that possessions is a complex computation that is found in offensive ratings.

Defensive points per scoring possession is as it sounds. We compute the number of points scored and divide it by the number of possessions terminating with points scored. This is also known as **chances** by other folks (thanks Seth :P). In this case, we have the estimated scoring possessions given by **field goals made** plus **free throw trips that result in at least one point**. The defensive points per scoring possession is given by
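In symbols:

```latex
\mathrm{DPpSP} = \frac{\mathrm{Opp\,PTS}}{\mathrm{Opp\,FGM} + \left(1 - \left(1-\mathrm{Opp\,FT\%}\right)^{2}\right)\times 0.4\times\mathrm{Opp\,FTA}}
```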

I use the term **DPpSP** as defensive points per scoring possession because it saves space on the formula graphic.

Team defensive rating is simple to compute. In this case, it is merely
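That is:

```latex
\mathrm{Tm\,DRtg} = 100\times\frac{\mathrm{Opp\,PTS}}{\mathrm{Tm\,Poss}}
```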

This identifies the points given up per 100 possessions.

We are finally able to calculate defensive rating. Recall that formula to start the article? If not, here it is…

If we substitute in our short-hand terms, we obtain the exact same equation:
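In our short-hand terms, defensive rating is the 80%-20% blend discussed at the end of this article:

```latex
\mathrm{DRtg} = \mathrm{Tm\,DRtg} + 0.2\times\left(100\times\mathrm{DPpSP}\times\left(1-\mathrm{Stop\%}\right) - \mathrm{Tm\,DRtg}\right)
```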

So we can work in reverse and use the above formulas to compute defensive rating. In order to compute this explicitly, we can simply identify these box score elements:

- Steals
- Blocks
- Defensive Rebounds
- Opponent Offensive Rebounds
- Team Defensive Rebounds
- Opponent Field Goals Made
- Opponent Field Goals Attempted
- Team Blocks
- Team Minutes Played
- Opponent Turnovers
- Team Steals
- Minutes Played
- Personal Fouls
- Team Personal Fouls
- Opponent Free Throw Attempts
- Opponent Free Throws Made

Let’s consider the **October 30, 2017** game between the **Philadelphia 76ers** and the **Houston Rockets**. This game resulted in a 115-107 victory for the Philadelphia 76ers. The box scores for the game are:

Let’s look at **Joel Embiid’s **defensive rating. Here, we will not count anything other than box score statistics. In this case, we have the following:

- 2 Steals
- 1 Block
- 7 Defensive Rebounds
- 10 Opponent Offensive Rebounds
- 41 Team Defensive Rebounds
- 33 Opponent Field Goals Made
- 83 Opponent Field Goals Attempted
- 4 Team Blocks
- 240 Team Minutes Played
- 15 Opponent Turnovers
- 10 Team Steals
- 24 Minutes Played
- 5 Personal Fouls
- 31 Team Personal Fouls
- 38 Opponent Free Throw Attempts
- 28 Opponent Free Throws Made

First, we will use the common possession estimator, **Possessions = FGA + 0.44FTA – OREB + TOV**. For the Houston Rockets, this is 83 + 0.44*38 – 10 + 15 = **104.72 possessions**. Of these 104.72 possessions, a total of 107 points were scored; resulting in **1.0218 points per possession**. This gives us a team defensive rating of **102.1772 points per 100 possessions**.
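As a quick sanity check, these two numbers can be computed directly (a minimal sketch; the variable names are mine):

```python
# Estimated possessions for Houston from the box score:
# Possessions = FGA + 0.44 * FTA - OREB + TOV
fga, fta, oreb, tov, pts = 83, 38, 10, 15, 107

poss = fga + 0.44 * fta - oreb + tov
team_drtg = 100 * pts / poss  # points allowed per 100 possessions

print(round(poss, 2), round(team_drtg, 4))  # 104.72 102.1772
```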

Computing the defensive points per scoring possession, we get **DPpSP = 107 / (33 + 0.4*(1 – (1 – 28/38)^2)*38)**. This results in **2.269478 points per scoring possession**.

In the game, **Joel Embiid **recorded 2 steals, 1 block, and 7 defensive rebounds. This results in 4.465804 stops in the game. Computing the team stops portion of stops, we obtain 3.323866 stops. Combining these we obtain **7.789670 stops**.

This results in a stop percentage of

That’s a stop percentage of **74.3857 percent**. Note that this does not mean that Embiid stops 74% of possessions. This is his individual contribution scaled by a factor of **five**, and it includes team effort. This is the rationale for the 80%-20% split in defensive rating.

We can now finally calculate the defensive rating for Embiid. We obtain **DRTG = 0.8*TDR + 0.2*100*(1-STOP%)*DPpSP = 0.8*102.1772 + 0.2*100*(1-0.743857)*2.269478 = 81.74176 + 11.626218 = 93.367978.**

This indicates that Joel Embiid obtained a **93.37 defensive rating**, which is significantly better than the **102.18 team defensive rating**. We interpret this as Embiid improving team defense by an estimated 9 points per 100 possessions. Specific to this game, this indicates that Embiid **saved the 76ers roughly 4.5 points against the Houston Rockets**.
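The entire box-score calculation walked through above can be put together in one place. This is a sketch: the variable names are mine, and the stop formulas follow Dean Oliver's published individual defensive rating definitions, which the formula graphics in this article are based on.

```python
# Joel Embiid vs. Houston, October 30, 2017 -- box-score inputs from the article
stl, blk, drb, mp, pf = 2, 1, 7, 24, 5
tm_drb, tm_blk, tm_stl, tm_mp, tm_pf = 41, 4, 10, 240, 31
opp_orb, opp_fgm, opp_fga = 10, 33, 83
opp_tov, opp_fta, opp_ftm, opp_pts = 15, 38, 28, 107

# Estimated opponent possessions and team defensive rating
poss = opp_fga + 0.44 * opp_fta - opp_orb + opp_tov          # 104.72
tm_drtg = 100 * opp_pts / poss                                # ~102.18

# Shared intermediate quantities
dfg = opp_fgm / opp_fga                                       # opponent FG%
dor = opp_orb / (opp_orb + tm_drb)                            # opponent OREB%
fmwt = (dfg * (1 - dor)) / (dfg * (1 - dor) + (1 - dfg) * dor)

# Individual stops: steals, weighted blocks, weighted defensive rebounds
stops_ind = stl + blk * fmwt * (1 - 1.07 * dor) + drb * (1 - fmwt)

# Team stops credited to the player: misses and turnovers scaled by minutes,
# plus fouls that end in two missed free throws, scaled by fouls
stops_team = (((opp_fga - opp_fgm - tm_blk) / tm_mp) * fmwt * (1 - 1.07 * dor)
              + (opp_tov - tm_stl) / tm_mp) * mp \
             + (pf / tm_pf) * 0.4 * opp_fta * (1 - opp_ftm / opp_fta) ** 2
stops = stops_ind + stops_team                                # ~7.79

# Stop percentage: stops per possession while the player is on the floor
stop_pct = (stops * tm_mp) / (poss * mp)                      # ~0.7439

# Defensive points per scoring possession
dppsp = opp_pts / (opp_fgm + (1 - (1 - opp_ftm / opp_fta) ** 2) * 0.4 * opp_fta)

# Individual defensive rating: 80% team, 20% individual
drtg = 0.8 * tm_drtg + 0.2 * 100 * (1 - stop_pct) * dppsp     # ~93.37
```

Running this reproduces the intermediate values quoted above: roughly 4.4658 individual stops, 3.3239 team stops, a 74.39% stop percentage, and a 93.37 defensive rating.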

Through the use of play-by-play data, we are able to walk through the entire game and see every action occur. The first order of business is to look at the number of possessions. Counting all possession-terminating actions, we obtain a total of 204 possessions. This results in **102 possessions each for the Rockets and 76ers**. Recall that the estimated count was 104.72 possessions.

Of the 204 total possessions, we find that Embiid participated in 105 total possessions. Of these 105 total possessions, Embiid played on 51 defensive possessions while playing in 54 offensive possessions.

Getting into foul trouble with five fouls, Embiid found himself playing in six spurts throughout the course of the game.

Embiid started the game, playing the first 18 possessions. During these 18 possessions, Embiid played 9 defensive possessions that resulted in 7 points for Houston. This stint spanned 3 minutes and 43 seconds, an average of 12.39 seconds per possession.

Of the 9 defensive possessions, Embiid recorded only one defensive rebound as the Rockets scored on three of the nine possessions. **Robert Covington** was the star of this stint, recording a defensive rebound and two steals over these nine possessions.

At the end of this stint, **Embiid’s defensive rating is 77.78.**

Coming back in to close out the first quarter and start the second quarter, Embiid participated in 23 possessions over 5 minutes and 27 seconds of playing time. This included 11 defensive possessions that resulted in 14 Houston Rockets points. The game’s pace slowed a little, resulting in 14.21 seconds per possession.

Of the 11 defensive possessions, Embiid recorded 1 defensive rebound and 1 steal, terminating two of Houston’s 11 offensive possessions. Houston converted on seven of the 11 possessions, two of which were 1-for-2 trips to the free throw line. On one of the scoring possessions, Embiid committed a foul that resulted in an extra made free throw. This was a particularly weak stint for the Sixers defense, as Houston left a couple of points at the line and still managed 1.27 points per possession.

At the end of this stint, **Embiid’s defensive rating is 105.00.**

For one last stint during the first half, Embiid participated in 9 total possessions, 4 of which were defensive possessions. During these four possessions, the Houston Rockets picked up 3 total points on a single three point attempt by **James Harden**.

Despite the one scoring possession in four attempts, Embiid contributed little directly to the stops. Rebounds by **Jerryd Bayless**, **Dario Saric**, and **Ben Simmons** terminated the other three possessions.

The average possession was 13.78 seconds over Embiid’s 2 minutes and 4 seconds of action. At the end of this stint, **Embiid’s defensive rating is 100.00.**

Embiid started the second half for his fourth stint, which lasted 3 minutes and 45 seconds. During this time, Embiid played in a total of 14 possessions; 7 of which were on defense. During these seven defensive possessions, Embiid recorded nothing on the defensive end. Up to this point Embiid managed **two defensive rebounds and one steal**.

Houston managed to score only six points over the seven possessions, converting on only three. On a fourth possession, Philadelphia was bailed out by two consecutive misses from the foul line by **Clint Capela**. The other three possessions were terminated by Covington (steal, rebound) and Simmons (rebound).

The average possession lasted 16.07 seconds. At the end of this stint, **Embiid’s defensive rating is 96.77.**

Embiid entered the game late in the third quarter for a short one minute and 36 seconds for a total of six possessions. With an average possession of 16 seconds per possession, Embiid played in three defensive possessions.

Houston converted on one of the three possessions and again bailed out the 76ers by missing both free throws after an Embiid foul. With the three points coming on another Harden three-point attempt, the Rockets mustered only one point per possession during this stint.

At the end of this stint, **Embiid’s defensive rating is 97.06.**

Embiid closed out the game with a significant eight-minute-and-eleven-second stretch. This stretch witnessed 35 total possessions, as play sped up thanks to late free throws. The average possession was 14.03 seconds. Of the 35 possessions, Embiid participated in 17 defensive possessions.

It was during this time that Embiid collected many of his stats. During this stretch Embiid picked up 5 defensive rebounds and one steal. Embiid also picked up his only block of the game. Unfortunately, Houston retained possession as the block went out of bounds.

Houston converted on only seven of these possessions, and on one of them Embiid sent Houston to the line for two free throws. Fortunately for Philadelphia, Houston missed both, leaving two points at the line. Due to this, Embiid’s defensive rating improved, and he finished with 47 points allowed over 51 defensive possessions for a **defensive rating of 92.16 points per 100 possessions**.

Comparing this to the team defensive rating of **104.90**, we find that Embiid’s presence indicates an improvement in defense; however, his individual stats only seem to appear in his final stint. This implies that Embiid is not the sole reason for the defensive improvement. Looking at the actual statistics from when Embiid is on the court, **Robert Covington** turns out to be the premier defender. This in turn suggests that the combination of **Covington and Embiid** forms a solid defensive tandem for the 76ers.

While the 76ers’ actual defensive rating is **104.90**, we had an estimated team defensive rating of **102.17** from the Oliver equations above. That’s not a terrible estimate by any stretch. However, it rests on an estimate of possessions.

In turn, Embiid’s defensive rating was estimated to be **93.37** points per 100 possessions. In reality, Embiid managed a **92.16 **points per 100 possessions.

This shows that the estimation process varies about the truth, and while it is a method for approximating points per 100 possessions, we are able to compute the actual defensive rating by performing play-by-play calculations.

The estimation process misses by roughly 1-3 percent due to several factors. First, possessions are estimated using coefficients that are not tailored to the particular game. Second, possession times are assumed to be **uniform**. This means that if 104 possessions are estimated for both teams, then every possession is estimated to be 13.85 seconds long. We have seen that this is not the case. Third, points per possession are assumed to be **uniformly distributed** over possessions. This again is not the case.

For the possessions issue, we have seen that possessions have been grossly over-estimated in the past. For the uniform distribution assumption, if we were able to obtain **thousands of possessions per game**, then we might have a chance to argue for uniformity. However, in small-sample games… yes, 100 defensive possessions is a small sample relative to the possible ways to terminate possessions… any deviation from uniformity matters. And that is exactly what happens here, inducing these variations in points per 100 possessions.

Despite these flaws, if the user only manages to have box score data, we see that defensive rating is not egregious in estimation. Instead, it’s a carefully thought out process that leverages assumptions of uniformity to get close to the truth.

If we are to compare two players using defensive rating, we must perform a **test of hypotheses**. We cannot simply sort the players by defensive rating using Oliver’s defensive rating. This is because we are using an **estimation process** using the **uniform distribution**. Therefore, if a player has a defensive rating of 92.65 and another player has a defensive rating of 94.01; can we say that the first player is better? **Most likely not.**

Instead, we may say that the players are the same, as the uniformity assumptions may lead to large enough variances such that both scores are realistic for both players.

In this article, we take a look at Colley’s methodology and attempt to understand the statistics associated with the procedure.

First, we start simple: let us look at only one team and their resulting wins and losses over the course of the season. Currently, the **Atlanta Hawks** are eight games into the season and have compiled a 1-7 record. The most basic model considers each game as a **random draw** from a **Bernoulli Distribution**.

If you are unfamiliar, a Bernoulli distribution is a success/failure distribution for a given event. Here, an event for the Hawks is a game played. A win is considered a success, a loss is considered a failure. Then, we are interested in the **probability of success**, identified by the value **p**. The distribution for a Bernoulli random variable is given by:

The value, **x**, is merely an indicator of whether the Hawks won their game or lost their game. If we now consider every game as an **independent Bernoulli trial**, then we simply identify the number of wins as the sum of the x-values.

Now you may pause for a moment and ask, “**doesn’t the probability of winning change from game to game?**” While the answer to this is **“YES!” **we are interested in only the most basic model. We will expand on this in a bit.

Now, if we are to sum these 0 (loss) and 1 (win) events over **N **games, we obtain a **Binomial distribution**. A binomial distribution counts the number of successes (wins) over the course of N trials (games). There are a couple of key assumptions here:

- Games have the same probability of success.
- Each game is independent. Meaning back-to-backs don’t influence capability.

The distribution barely changes. In this case it becomes

In this case, we made a slight change. The value **x** is no longer 0 or 1; but rather is the **number of wins in N games**. For the Hawks, this would be a value between 0 and 8.
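For reference, the Binomial probability mass function takes the standard form (with N games, x wins, and success probability p):

```latex
P(X = x) = \binom{N}{x} p^{x} (1 - p)^{N - x}, \qquad x = 0, 1, \ldots, N
```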

A simple indicator of how well a team plays is to identify their probability of success. In reality, each team has their **true value**. In this set up, we don’t care about the opponents. Instead, we care only about the Hawks wins and losses. If we perform the **maximum likelihood estimator for p**, we are able to identify an estimated value of success for a team.

Maximum Likelihood is the process of taking observed data and computing the probabilistic model for the data **evaluated at the observed data**. This process is called building a **likelihood function**. We then maximize this function with respect to the parameter of interest. This will identify the **highest probability of p from observing the data**. In the general case, the likelihood function is the Binomial distribution. We then write the likelihood function as

One way to maximize this function is to take the derivative and set it equal to zero. We have to be careful and ensure that the second derivative is negative; or else we obtain a **minimum**… which is the **lowest probability of p from observing the data**. That would be bad. Here, we make a slight one-to-one transformation to make our life easier before we apply the differentiation. The resulting differentiation gives us

If we set this to zero and solve, we get
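That differentiation can be sketched explicitly; the one-to-one transformation mentioned above is the logarithm, which turns the product into a sum before differentiating:

```latex
\ell(p) = \log L(p) = \log\binom{N}{x} + x \log p + (N - x) \log(1 - p)

\frac{d\ell}{dp} = \frac{x}{p} - \frac{N - x}{1 - p} = 0
\quad\Longrightarrow\quad
\hat{p} = \frac{x}{N}
```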

This is actually **win percentage!!!** This means that if we consider this ultra-simple set-up, the easiest way to compare teams is to look at each team’s win percentage and call it a day. However, there are a lot of issues with this.

One of the first issues we run into, **particularly for small samples**, is that the maximum likelihood estimator has **large variance** and reflects only a **sample path** of the actual state. The combination of large variance and a single sample path can give us wildly varying estimates of the true probability of success. Instead, we might want to apply a **filtering technique** to control for spurious wins. For example, during the **Detroit Pistons versus Los Angeles Lakers** game on October 31, 2017, with the Lakers handily ahead, the announcers commented that since the **Detroit Pistons had defeated the Golden State Warriors two days prior**, the **Los Angeles Lakers must be better than the Golden State Warriors**. This flaw in transitive logic stems from the result above: if we isolate both games and apply straightforward win percentages, the Lakers are indeed better than the Warriors!

To remedy this, we apply a filtering technique. A common one is placement of a **prior distribution** on the parameter space. This is a **Bayesian procedure** that attempts to use prior information to control observed information, like we saw above. One of the requirements necessary for a prior is to have a **domain identical to the domain of the parameter space**. This means that if we put a prior on the probability of success, we obtain a value between zero and one.

The most common technique for the Binomial distribution is to place a **Beta distribution** prior on **p**. The formulation for a resulting **posterior distribution** is given as **(Likelihood x Prior) / marginal**. The marginal distribution is the **normalizing constant of the posterior distribution**. In this case, we have the posterior distribution to be:

Let’s break down this quantity and show that it’s not as scary. First, let’s factor out the terms that have nothing to do with the probability of success. These terms all cancel out! Next, let’s combine like terms. This leaves us with

You may recognize this. This is a **Beta distribution** but with the parameters updated by the observed data! The values of **alpha **and **beta** are used to **filter** the probability of success. The more important thing here is that if we compute the **expected value**, or **posterior mean**, we obtain an estimator for **p** given the data. In this case, the expected value is

This means the **win percentage** is pulled toward a competing value by the parameters alpha and beta. If we set **alpha = beta = 1**, we get the **Uniform distribution**! This means that every team is equal before the season begins, and as games are played, their resulting probability of winning changes from uniform. Let’s plug in one for both alpha and beta:
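A minimal sketch of this filtered estimate (the 3-7 record below is a hypothetical example of mine, not a team from the article):

```python
def filtered_win_pct(wins, games, alpha=1.0, beta=1.0):
    """Posterior mean of p under a Beta(alpha, beta) prior with a
    Binomial likelihood: (x + alpha) / (N + alpha + beta)."""
    return (wins + alpha) / (games + alpha + beta)

# With alpha = beta = 1 (the uniform prior), a hypothetical 3-7 team
# is pulled up from a raw 0.300 win percentage:
print(filtered_win_pct(3, 10))  # 4/12 = 0.333...
```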

If you made it this far, **congratulations!** This is the starting point for **Colley’s Method**. Colley’s Method assumes that every team’s games are **independent, unrelated to other teams, and unaffected by schedule, with a uniform prior**, in order to estimate the probability of success for a team.

What does this mean for the Atlanta Hawks? Well, while their 1-7 record gives them a 0.125 win percentage, their filtered win percentage is (1 + 1) / (8 + 2) = 0.200. This suggests they are simply on a sample path slightly worse than their actual probability of success.

From this point, Colley makes some subtle changes in attempts to incorporate scheduling and team-versus-team interactions. Let’s dive in finally!

One of the first steps that Colley performs on this posterior mean is to rewrite the number of wins for a team. This is a straightforward, **but misleading**, process! Instead of following Colley’s footsteps exactly, let’s apply Colley’s ideas in the general sense.

First, we take the number of wins and write them as a **weighted sum of wins, losses, and games played**. This is given by

Let’s break this down. First, we split the number of wins into a **gamma** percentage of wins plus a **1-gamma **percentage of wins. We then add and subtract in a **1-gamma** percentage of losses. This is effectively adding zero and not changing the equation.

The third step collects like terms. The final term is **1-gamma** times the **number of games played!** Since the number of games played is constant, multiplied to a constant **1-gamma**, we can rewrite this as a sum. This means **each game played by that team is identified by 1-gamma**.
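Written out, the add-and-subtract steps above are (with w wins, l losses, and N = w + l games played):

```latex
\begin{aligned}
w &= \gamma w + (1 - \gamma) w + (1 - \gamma) l - (1 - \gamma) l \\
  &= \gamma w - (1 - \gamma) l + (1 - \gamma)(w + l) \\
  &= \gamma w - (1 - \gamma) l + (1 - \gamma) N
\end{aligned}
```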

This means the result is merely the weighted number of wins, the weighted number of losses, and the weighted **strength of schedule**. If we now return to Colley’s footsteps, Colley applies **gamma = 0.5**. In this case we obtain

This is effectively the **win percentage given half weight** and the **strength of schedule, assumed uniform across all teams, given half weight**.

We can apply a different weighting scheme. Suppose I want to give only **ten percent to strength of schedule**. Then, all we need to do is set **gamma = 0.9**. This would give us

The next step in Colley’s Method is to change the underlying assumptions of uniformity. Ignoring the requirement, Colley replaces the **1/2** term in the strength of schedule to be **r_i,** or the **ranking of team i**. This leads to the equation for a win to be

This is now a single equation with **q unknowns**. Here q denotes the **number of unique teams played against** by, in this case, the Atlanta Hawks. In an effort to build a linear system of equations, allowing us to solve for these unknown ratings, we now consider this equation for **all teams**.

To understand how all teams work, let’s index every team by index j. This means **j=0 ** is the **Atlanta Hawks**, **j=1 **is the **Boston Celtics**, and so on. Then we obtain the system of equations identified as

There are a total of 30 of these equations, with unknowns **r_i^j**. The value **r_i^j** just means the **rating of team j’s opponent in game i**. If we rewrite this as a set of linear equations, we obtain the following:

Let’s walk through this. Line one is the original definition from the **Beta-Binomial posterior distribution**. Colley states that the posterior mean **(probability of success)** is indeed the **team ranking score**.

The second line substitutes in the rewritten expression for wins, with the uniform assumption discarded.

The third line rearranges the terms such that all the team rankings are on the left hand side, while the terms that do not contain the team rankings (constants) are on the right hand side.

Finally, the fourth line rearranges the sum to identify the number of times team j has played team i. Therefore we can take the sum over all teams as opposed to all games!

What this now gives us is the following: **for team j, we associate N+2 games**. This is the filtered number of games from the Beta-Binomial distribution. **For opponent i of team j, we associate -n_ij games**. This is the number of games played between team i and team j. **For team j, we associate the response 1 + (number of wins – number of losses)/2**. Think of this as a reflection of the win-loss differential for the team.

Putting this all together we can obtain a **matrix representation** of this system of linear equations.

The matrix representation is given by

This is written in short form as **Cr = b**. Here, **C** is the **schedule matrix**. This portion identifies the strength of schedule for a given team. Each row is a team’s schedule, with the diagonal being the number of games played (plus two from the Beta filtering) and each off-diagonal entry being minus the number of games played against the opponent that the column represents.

The matrix (vector), **b**, is the number of wins minus the number of losses **(win differential)** divided by 2, plus the value of one from the Beta filtering. We think of each entry as an effective win percentage for each team. To be clear, a .500 team results in 1 + 0/2 = 1, as the wins and losses cancel each other out.

Therefore the solution of rankings, **r**, is just the inverse of **C** times **b**. This is merely a strength of schedule adjustment on win percentage. There is no closed form solution for the generalized matrix, **C**, and closed form solutions for the NBA are excessively messy. Due to this, we cannot identify explicit bounds on the rankings.

Instead, let’s look at a couple simple examples. This will help show some issues with the Colley rankings system.

Let’s start simple, with four teams split evenly into two two-team conferences. Each matchup is played four times, so each team plays their lone conference opponent all four times. The resulting Colley matrix system is given by

There are a total of 25 possible outcomes for the season. Since the season is completely partitioned, we can focus on one conference. In this case, we identify the five following scenarios:

- Team A wins all four games: **b_1 = 3, b_2 = -1**
- Team A wins three games: **b_1 = 2, b_2 = 0**
- Team A wins two games: **b_1 = 1, b_2 = 1**
- Team A wins one game: **b_1 = 0, b_2 = 2**
- Team A wins zero games: **b_1 = -1, b_2 = 3**

The other conference has the same breakdown.

Now, if we invert the schedule matrix, we see how much weight the schedule places on the win-loss records for each team. Here, the inverse of the schedule matrix is given by

What this shows is that there is a **clear conference bias** as only in-conference games are weighted. We interpret that the schedule places **sixty percent weight **on the team’s win-loss record, while placing **forty percent weight** on the team’s lone opponent’s win-loss record. Therefore, if the two best teams are in the same conference, we could very well see them split games and finish 2-2. While in the other conference, one team is a complete cupcake, resulting in a 4-0 versus 0-4 record. Let’s mark these teams as **A, B, C, **and **D**, respectively.

This results in a **b** vector of **[1, 1, 3, -1]**. Multiplying by the inverse of the schedule matrix we obtain the rankings vector, **r**, as **[0.5, 0.5, 0.7, 0.3]**. A good sign is that we obtain all rankings between zero and one. **This is not always the case in the general situation**.
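This first example is small enough to verify numerically. A sketch using NumPy, with the matrix entries following the text: four games plus two from filtering on the diagonal, minus the four head-to-head games off the diagonal:

```python
import numpy as np

# Two two-team conferences; each team plays its lone opponent four times
C = np.array([[ 6, -4,  0,  0],
              [-4,  6,  0,  0],
              [ 0,  0,  6, -4],
              [ 0,  0, -4,  6]], dtype=float)

# A and B split 2-2; C sweeps D 4-0: b_j = 1 + (wins - losses) / 2
b = np.array([1, 1, 3, -1], dtype=float)

r = np.linalg.solve(C, b)
print(np.round(r, 2))  # approximately [0.5, 0.5, 0.7, 0.3]
```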

In this case, we find that Team C is ranked first despite never playing a difficult opponent. This is due to **conference biasing** through the scheduling (sampling frame). Therefore, let’s mix things up.

If we change the schedule slightly and have two games in conference and two games out of conference, we obtain the following scheduling matrix:

In this case, it is impossible to obtain two undefeated teams. Instead, we obtain more than 25 possible outcomes: 60 in total, still fewer than 64. We will not enumerate every case; instead, we focus on the **strength of schedule impact**.

Inverting the schedule matrix, we obtain

If we consider the old problem of the two best teams in the same conference and they split their two games, but win out of conference we obtain the following **b** vector: **[1, -1, 2, 2]. **This corresponds to a respective record set of: 2-2, 0-4, 3-1, 3-1. In this case, the rankings, **r**, are given by **[0.46, 0.21, 0.67, 0.67]**.
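This scenario can also be checked numerically. A sketch, where the matrix assumes two games against the conference opponent and one against each cross-conference team:

```python
import numpy as np

# Two games in conference, one against each out-of-conference team
C = np.array([[ 6, -2, -1, -1],
              [-2,  6, -1, -1],
              [-1, -1,  6, -2],
              [-1, -1, -2,  6]], dtype=float)

# Records 2-2, 0-4, 3-1, 3-1: b_j = 1 + (wins - losses) / 2
b = np.array([1, -1, 2, 2], dtype=float)

r = np.linalg.solve(C, b)
print(np.round(r, 2))  # approximately [0.46, 0.21, 0.67, 0.67]
```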

This is the best scenario possible as we have a complete sampling frame of the entire league. We may have issues with ranking as Team A will be down weighted by Team B merely due to scheduling; however this is remedied by playing both Team C and Team D.

Finally, let’s look at another out of conference situation. In this case, we have three games in conference and one game out of conference. In this case, we can set the schedule matrix as

In this case, the inverse is given by

We see that rating is now introduced to teams from teams they have never played. For instance, Team A obtains **11.25% of their ranking from a team they never played**. If that team is poor, then Team A is penalized **every time that team plays**.

In this scenario, assume Team A is the best team in the league and every other team is mediocre. Then suppose we obtain a **b** vector of **[3, 0, 0, 1]**. This corresponds to a record set of 4-0, 1-3, 1-3, 2-2, and a ranking of 128/160 = 0.8 for Team A. The overall rankings are **[0.80, 0.45, 0.30, 0.45]**.

What’s interesting here is that this is the first instance that the average of all rankings **is not 1/2**. This violates the expected requirement and is a result of the fractional sampling frame of the schedule.

Let’s verify this: Team A plays Team B three times. In the end, Team A is 3-0 and Team B is 0-3. Team A plays Team D while Team B plays Team C. This results in Team A 4-0, Team B 1-3, Team C 0-1, Team D 0-1. Team C plays Team D three times, and this results in Team C finishing 1-3 as Team D goes 2-2. This gives the **b** vector of **[3, 0, 0, 1]**.

This shows that, despite the assumed proof of an average ranking of 1/2 in Colley’s paper, here is a counter-example suggesting otherwise.

If we have a larger schedule and large population of teams, we run into other major issues. For instance, let’s consider Week 14 of the 2011 NCAA Football schedule. In this case, the 2012 sampling frame (schedule matrix) yielded such a weighting that we obtained probabilities larger than one!

**Note: **Ranking is determined to be the **probability of success** as defined in equation 1 in Colley’s paper. **Therefore it must be between 0 and 1**.

This happens because the uniform distribution is used as the **filter** and, later, the mean of the filter is thrown away in favor of a **general mean**. Since this general mean does not match the filtering distribution used, **any scheduling bias can pull the filtered probability outside of possible ranges.**

What this indicates is that Colley’s Method is **schedule/conference biased**, as witnessed by penalizing a second-best team due to scheduling **(Examples 1 and 2)**; that the assumption that the weighting results in a global mean of 0.5 is false **(Example 3)**; and that the ignored assumptions violate the definition of probability of success **(Thanks LSU…)**.

The reason this all occurs is due to the fact that the original assumption is that **every team’s sample path of games is independent of every other team!** This was the very start of this article. Recall we focused on one team. This assumption is purely false and the correction should focus more on a **graph-based/network approach**.

With the weighted win-loss versus strength of schedule partition of wins, we obtain 30 independent equations. To impose the correlation between the teams played, the general values of **r** are injected without any rationale beyond “it makes sense.” As we have seen above, despite conference biasing (which is an experimental design issue, not a Colley issue), we still require a **carefully thought-out sampling frame**. In NCAA college football, this is not the case. For the NBA, we are fine, as teams play everyone in their conference 3-4 times, everyone in their division 4 times, and everyone outside of their conference twice. Despite the distributional assumptions being incorrect, we still obtain an interpretable ranking.

If you’d like to play around with your own Colley Matrix, here’s the associated Python code:
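Since the code block is not shown here, below is a minimal sketch of a Colley solver consistent with the derivation above (the function and variable names are mine, not necessarily the original author's):

```python
import numpy as np

def colley_ratings(teams, games):
    """Build and solve the Colley system C r = b.
    `games` is a list of (winner, loser) pairs."""
    idx = {t: i for i, t in enumerate(teams)}
    n = len(teams)
    C = 2.0 * np.eye(n)   # Beta(1,1) filtering: +2 on each diagonal...
    b = np.ones(n)        # ...and +1 on each right-hand side
    for winner, loser in games:
        w, l = idx[winner], idx[loser]
        C[w, w] += 1
        C[l, l] += 1      # total games played sit on the diagonal
        C[w, l] -= 1
        C[l, w] -= 1      # minus head-to-head counts off the diagonal
        b[w] += 0.5       # b_j = 1 + (wins - losses) / 2
        b[l] -= 0.5
    return np.linalg.solve(C, b)

# Example 1 from above: A and B split 2-2, C sweeps D 4-0
games = [('A', 'B'), ('A', 'B'), ('B', 'A'), ('B', 'A'),
         ('C', 'D'), ('C', 'D'), ('C', 'D'), ('C', 'D')]
print(colley_ratings(['A', 'B', 'C', 'D'], games))  # approximately [0.5, 0.5, 0.7, 0.3]
```

Feeding in a full season of (winner, loser) pairs for all 30 NBA teams reproduces rankings of the kind shown below.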

For the NBA, considering games through November 3rd, we have witnessed 130 total games played. The current records are given by:

The resulting Colley Rankings are given by

And Atlanta has since lost an 8th game since starting this article, falling to 1-8 and the bottom of the Colley Rankings.
