Want to know how well the **Washington Wizards** play when Ian Mahinmi is substituted in for Dwight Howard? We can take a cursory glance at the five-man data:

Just note that we aggregate the data and that match-up data is effectively scrubbed. Therefore this is not a proxy for interaction in offense-defense match-ups.

As the image above shows, the files are Excel dumps with each five-man offensive lineup and its corresponding defensive lineup beneath it. In this case, we see that the lineup with Dwight Howard has played 7,627 seconds together for 264 offensive possessions, while the Ian Mahinmi lineup has played 4,146 seconds together for 152 offensive possessions.

Similarly, we see that the Howard-led unit has an offensive rating of 111.36 with a defensive rating of 124.15; while the Mahinmi-led unit has an offensive rating of 104.61 with a defensive rating of 118.06. There’s not much difference here as the net ratings are -12.79 versus -13.45, respectively.

Also note how I aggregate possessions. There are some **zero-valued** defensive possessions with a defensive FGA. These occur on the back end of free throws with substitutions, as I refuse to partition possessions and refuse to double count.

The biggest take-away is that most of the five-man lineups have only a handful of possessions. Across the entire league, less than **ten percent** of all five-man lineups have more than 27 offensive possessions played. In fact, the median is **six possessions**. So be careful using lineup data to make concrete decisions.
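To make that warning concrete, here is a minimal sketch of how such a summary could be profiled, assuming a hypothetical table with one row per five-man lineup and its aggregated offensive possession count (the column names and the tiny example data are my own, not the actual league file):

```python
import pandas as pd

# Hypothetical lineup log: one row per five-man lineup with its
# aggregated offensive possession count (column names are assumptions).
lineups = pd.DataFrame({
    "lineup_id": range(8),
    "off_possessions": [264, 152, 27, 6, 6, 3, 2, 1],
})

median_poss = lineups["off_possessions"].median()
share_over_27 = (lineups["off_possessions"] > 27).mean()

print(f"median possessions: {median_poss}")
print(f"share of lineups above 27 possessions: {share_over_27:.0%}")
```

With the full league-wide table in place of the toy frame, the same two lines would reproduce the median-of-six and under-ten-percent figures quoted above.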

One interesting tidbit is also the number of unique lineups. Typically, teams with small numbers have roster stability. Teams broken up by trades or injuries will tend to have many unique lineups. Similarly, teams that are floundering will also have a high number as they are trying out new ideas in the midst of a seemingly lost-cause season. Apologies, Atlanta.

One of the things we can do with lineup data (not with the summaries below) is break out the **maximum likelihood rotation** for a team and identify whether that rotation makes the most sense. For example, here is the Orlando Magic from this season, the team with the fewest unique five-man lineups:

Note that the optimal Orlando Magic lineup performs exceptionally well around the half-time periods, closing the second quarter and starting the third quarter. We see that the standard rotation usually struggles to start games and put away games, with the final five minutes being a dramatic drop-off. **Note:** The darker the red, the worse the team is outscored; the darker the green, the better the team outscores their opponent.

Through December 12th, here are the teams ranked by fewest unique lineups:

- Orlando Magic – 103
- Indiana Pacers – 112
- Minnesota Timberwolves – 123
- Denver Nuggets – 125
- Portland Trail Blazers – 132
- Washington Wizards – 143
- Los Angeles Clippers – 146
- Dallas Mavericks – 159
- Oklahoma City Thunder – 175
- Chicago Bulls – 176
- Brooklyn Nets – 184
- Houston Rockets – 189
- Boston Celtics – 198
- San Antonio Spurs – 200
- Toronto Raptors – 205
- Detroit Pistons – 208
- Los Angeles Lakers – 209
- Philadelphia 76ers – 213
- Sacramento Kings – 218
- Charlotte Hornets – 223
- New Orleans Pelicans – 230
- Milwaukee Bucks – 231
- Memphis Grizzlies – 234
- Miami Heat – 234
- Golden State Warriors – 236
- Cleveland Cavaliers – 239
- New York Knicks – 249
- Utah Jazz – 250
- Phoenix Suns – 255
- Atlanta Hawks – 329

Yes, that last one is not a typo. Here are all the teams’ five-man lineup summary statistics:

**Atlanta Hawks: **ATL12DEC2018

**Brooklyn Nets: **BKN12DEC2018

**Boston Celtics: **BOS12DEC2018

**Charlotte Hornets: **CHA12DEC2018

**Chicago Bulls: **CHI12DEC2018

**Cleveland Cavaliers: **CLE12DEC2018

**Dallas Mavericks: **DAL12DEC2018

**Denver Nuggets: **DEN12DEC2018

**Detroit Pistons: **DET12DEC2018

**Golden State Warriors: **GSW12DEC2018

**Houston Rockets: **HOU12DEC2018

**Indiana Pacers: **IND12DEC2018

**Los Angeles Clippers: **LAC12DEC2018

**Los Angeles Lakers: **LAL12DEC2018

**Memphis Grizzlies: **MEM12DEC2018

**Miami Heat: **MIA12DEC2018

**Milwaukee Bucks: **MIL12DEC2018

**Minnesota Timberwolves: **MIN12DEC2018

**New Orleans Pelicans: **NOP12DEC2018

**New York Knicks: **NYK12DEC2018

**Oklahoma City Thunder: **OKC12DEC2018

**Orlando Magic: **ORL12DEC2018

**Philadelphia 76ers: **PHI12DEC2018

**Phoenix Suns: **PHX12DEC2018

**Portland Trail Blazers: **POR12DEC2018

**Sacramento Kings: **SAC12DEC2018

**San Antonio Spurs: **SAS12DEC2018

**Toronto Raptors: **TOR12DEC2018

**Utah Jazz: **UTA12DEC2018

**Washington Wizards: **WAS12DEC2018

Enjoy!

To me, as a former player/coach/scout/analyst/front office specialist, this is intuitive. However, as one Eastern Conference representative responded to me: **I’m not sure about that. Spacing is about creating space.** Another Eastern Conference representative gave a different response with a mocking tone: **I’m sure Steph Curry doesn’t think “better shoot this because they are 6.2 feet apart.”** In fact, teams that have pushed back on the notion still believe that spacing can only be achieved by placing all shooters on the three point line and that the defense plays no role in spacing. Other teams have quickly identified that spacing is indeed a defensive-based analysis, where offenses are measured by the defensive reaction to an offensive scheme.

When we talk about scoring efficiency, we think of **effective field goal percentage** as the main descriptor. Remember this value provides an **unbiased sufficient statistic** for points scored from the field. What this quantity helps break down is the value of every type of shot, and serves as a basic premise of “layups and threes” over mid-range jumpers.

If we take this a step further, teams tend to shoot really well when the play results in a dunk. As most teams shoot 85% or better when the field goal attempt results in a dunk, we would expect a play ending in a dunk to have an **“expected point value”** of 1.7 points per attempt. That’s not too bad, right?

Compare this to a resulting three point attempt. Using the naive method for comparing two’s and three’s, we find that a shooter must hold a 56.7% conversion rate in order to keep pace with a dunker. And 85% is the low number for teams. Here’s the dunking distribution for **Atlanta, Boston, Brooklyn, Denver, Detroit, New York, Portland,** and **Sacramento:**

What we see from these teams is that teams effectively only attempt between 3 and 7 dunks per game. The reasons are simple: dunks have a higher propensity to be defended and are opportunistic. The latter part means that dunks are primarily taken when there is little resistance from the defense. In fact, whenever a dunk is attempted over a defender, it either turns into a highlight of a scorer destroying a defender or a defender picking up a gritty block.

Compare this to the 25 to 35 three point attempts teams hoist per game, and we see that threes come with **less resistance** than dunks. If we take an even slightly more in-depth look at rim attempts, layups are a **significant drop-off** when it comes to conversion rates. While dunks are typically converted 85 – 95 percent of the time, layups are converted only between 54 and 64 percent of the time, with the **Milwaukee Bucks** and the **Toronto Raptors** leading the league in conversion.

Here are six layup profiles from the eight teams we highlighted for dunks above:

We see between 11 and 16 layup attempts per game. At a clip of only 58%, we obtain an **“expected point value”** of only 1.16 points per attempt. This is indeed a significant drop-off from the 1.7 for dunks, but still well above mid-range attempts. When compared to three point attempts, we naively only require a 38.7% three-point percentage to keep pace with layups; according to NBA Stats, a total of 18 teams currently shoot above that percentage.
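The arithmetic above is simple enough to sketch. The conversion rates here are the round numbers quoted in the text, not league data:

```python
def expected_points(fg_pct: float, shot_value: int) -> float:
    """Expected point value: conversion rate times the value of the shot."""
    return fg_pct * shot_value

def breakeven_three_pct(target_epv: float) -> float:
    """Three-point percentage needed to match a target expected point value."""
    return target_epv / 3.0

dunk_epv = expected_points(0.85, 2)   # 1.70 points per attempt
layup_epv = expected_points(0.58, 2)  # 1.16 points per attempt

print(round(breakeven_three_pct(dunk_epv), 3))   # 0.567 to keep pace with dunks
print(round(breakeven_three_pct(layup_epv), 3))  # 0.387 to keep pace with layups
```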

This suggests that layups+dunks are a slightly better option than three point attempts. This reinforces the idea of layups and threes; but should really be viewed as **Dunks and Threes but Layups if you must**. Therefore, spacing in an offense should reflect **creating lanes for layups and dunks while opening pockets about the three point line for open looks at three.** Here’s what I visualize when I speak of spacing:

The above play is a **Hammer Action** run by the offense, initiated by a Pick-and-Roll. The dashed yellow line is the movement of the defender after the weak-side pin is hit. Just think: if the weak-side screen defender decides to help, this leaves a **wide open** space for a slip for an uncontested layup or dunk. **Spacing is created by the screen**, not the three point attempt. In fact, if this play is executed perfectly, we shouldn’t see a 3PA recorded. If it is not executed perfectly, we will see the 3PA recorded. And, of course, if the play is well-defended, we will see neither immediate action.

What this diagram shows is that spacing is indeed manipulating the defense into a high-quality field goal attempt. Spacing is created by screening, passing, and gravity; not solely three point attempts. Let’s take an in-depth look at this at the expense of DeAndre Ayton.

To illustrate spacing, we look at the notorious crossover move by Indiana’s Darren Collison on Phoenix’s DeAndre Ayton (22) during their November 27th match-up. The play was a result of two attempts at a Myles Turner (33) – Darren Collison (2) pick-and-roll action.

During the possession, Collison brings the ball up the court, and the Pacers run a double pull-through with Thaddeus Young (21) and Cory Joseph (6). Joseph goes to the strong side corner as Young goes into the strong side short-corner. The aim is to pull the defense to the strong side of the court, opening up a driving lane along the right side of the key. The goal is to hit a pick-and-roll at the three point line and attack the lane.

In the screen-cap, we see that Phoenix’s Devin Booker (1) ICE’s the screen, forcing the ball into the congested strong-side. Ayton picks up the ball, but knows he can sag back as Collison’s drive will be directly into traffic. Therefore, there’s no driving lane and Ayton is quick to recover on a slip-pass to a rolling Turner.

As there is no space to act, Turner either must force a difficult shot, or bring the ball back out of the paint. Smartly, he does so by kicking back out to Collison, who has now strayed back to the three point line. Turner then chases the pass and attempts the pick-and-roll a second time.

As we see Turner come back up, Booker holds his ground and does not ICE the pick-and-roll the second time. Ayton correctly reads the play and sags to contain Collison’s dribble; except this time Turner pops instead of rolls. This action forces Booker to go over the top of the screen, allowing the ball to come downhill into the lane.

If we compare the two frames immediately after the pick-and-roll; we see that Ayton is in nearly the same spot. He’s properly contained the ball going from right to left. However, his feet in the second pick-and-roll are tight. This means his range of motion will be poor; which is what Collison capitalizes on. Regardless of the feet, Collison’s driving lane is closed. This forces Collison to hit the brakes.

Ayton is unable to control his body due to his footwork, and Collison actually pauses to watch Ayton fall. As Booker recovers in the play, we still see he is not guarding anyone: he’s facing the wrong way to defend a catch-and-shoot by Turner, facing the wrong way to stop a DHO to Turner, and facing the wrong way to help on Collison.

Similarly, Phoenix’s T.J. Warren (12) is the next man up on the stop. He is facing the play with Thaddeus Young behind the backboard in the (now) weak-side short corner.

Instead of stepping up, Warren takes a moment to process the action. Collison drives. Believe it or not, the sequence between Ayton falling down and Collison leaping up for the layup is **1.92 seconds**. That’s a significant amount of time to watch a play unfold.

Here is a situation where spacing is created by overloading the strong side of the ball and using a screen to attack a driving lane. While Phoenix played the play properly (enough), unfortunate luck combined with two bad reactions from Booker and Warren allowed Collison to have **no one within 8 feet** of him as he went up for a layup. That’s spacing with only two guys outside of the three point line.

One way to think of creating (or losing) spacing is the ability to open (or close) passing and driving lanes. This is usually coined as **anticipation** and **reaction**. Anticipation is the result of understanding the actions of an opponent and **beating them to spots**. Reaction is the process of taking appropriate actions **after an opponent has initiated a move**. Good NBA players are athletic enough to **react**. Great NBA players are athletic enough to **anticipate and react**.

In the example above, Ayton was very strong at anticipating the action. However, in the situation with Collison, Ayton’s reaction was a bit slower. One way to measure this is through velocity and acceleration. When I described how Booker was facing the wrong way or Ayton was just “going in the wrong direction,” I was focused on their acceleration and velocity. Let’s see the above possession again, but with the **coverage map** and the **velocity vectors** attached.

If we take a quick screen-cap, we see that Booker is actually defending the roll that never comes. And since it never comes, he never corrects himself and heads away from the play; putting him out of position. This spacing is created by **Myles Turner**.

Mikal Bridges (25) needs to hold his ground as he’s the last line of defense for a kick-out to a wide open Cory Joseph. Plus, being at his height on the play, he has to come downhill to attack Collison; a likely fouling (or futile) attempt on the ball. The reaction time of Warren leaves him outside the opposite end of the restricted area when Collison lays the ball in; over six feet away.

As a side note, if you’re familiar with sketching, this year has been outrageously rough on the assumptions of the sketching model. If we take a look at the associated velocity plot of the offense, we find that many of the actions do not segment at the proper times.

Here, we see that the two pick-and-rolls do not meet the sketching segmentation process’ requirements. Over the previous two years, pick-and-rolls have become increasingly **non-traditional** in the sense that a screener no longer sets their feet and absorbs contact. In fact, many times, they will decelerate into an oncoming defender, and then glide off without stopping. We actually see that **both times in this possession**. The velocity plots concur. The closest we come is the first screen being initiated. However, Collison is already gone before Turner comes to a stop and initiates contact. This is due to the BLUE action.

Similarly, on the crossover, Collison’s speed stays above the 0.1 foot per second threshold. It’s not until he hesitates to watch Ayton fall that he comes to a stop. What this indicates is that segmentation should also include **player interaction**. By the way, if you managed to appear at a Baltimore convention I spoke at in 2015 (along with John Urschel, who spoke about Graph Laplacians), you would have seen my segmentation approach, applied to a Spurs-Heat NBA championship game from the year before.
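A minimal sketch of the speed-threshold segmentation idea, assuming a per-frame speed trace in feet per second. The 0.1 ft/s stop threshold is the one mentioned above; everything else (function name, toy trace) is my own construction:

```python
import numpy as np

def segment_by_speed(speeds, threshold=0.1):
    """Split a speed trace (feet per second) into segments wherever the
    player drops below the stop threshold. Returns (start, end) index
    pairs for the moving stretches."""
    moving = np.asarray(speeds) > threshold
    # Find the frames where the moving/stopped state flips.
    edges = np.diff(moving.astype(int))
    starts = list(np.where(edges == 1)[0] + 1)
    ends = list(np.where(edges == -1)[0] + 1)
    if moving[0]:
        starts.insert(0, 0)
    if moving[-1]:
        ends.append(len(moving))
    return [(int(s), int(e)) for s, e in zip(starts, ends)]

# A toy trace: move, stop, move again.
trace = [0.0, 2.5, 3.1, 0.05, 0.02, 4.0, 3.8, 0.0]
print(segment_by_speed(trace))  # [(1, 3), (5, 7)]
```

The hesitation problem described above shows up exactly here: a screener who glides off without dropping below the threshold never produces an edge, so no segment boundary is emitted.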

Hopefully you see why I die on this hill and say that **spacing is a defensive driven metric**. Also, if you’re curious as to how I obtained the data: I spent way too much time between November 27th and December 1st plotting points. The process I use is here: https://www.youtube.com/watch?v=PNYgqWqS1B8&t=7s

I did manage to confirm through NBA stats that the generated summary stats during the segments matched their summary statistics (within .01%). So that’s a plus. But it’s also 9 hours of my life I won’t get back.

The question is then, how do we extract information about the passing network? Obviously, we don’t have the passing network anymore (unless it’s hidden; if so, someone link me as soon as possible!); but suppose we were crafty enough to obtain tracking data? For instance, suppose we supported our favorite Division III NCAA team, strapped Go-Pro cameras to the roof, and applied basic convolutional neural networks to find players. How, then, do we extract passing information?

In this article, we summarize a methodology for grabbing passing attempts and identify why simple filtering is a massive problem.

Before we can perform any analysis on passing, we must first model the pass in tracking data. A pass can be performed in one of many ways. The **four most common** passes are the **chest pass**, the **bounce pass**, the **lob pass**, and the **handoff**. Think of these four components as straight angle, angle up, angle down, and no distance. There are other variants that exist, but these are the four we maintain our focus on.

One easy method to track passes is to **mark passes** in a game and then use the passes as training data to build a **supervised learning** model. I find this methodology may be overkill for something as simple as marking passes. Instead, let’s just apply a **notch filter**, which is a rule-based method for eliminating certain elements of the game; in this case, the passes themselves.

So let’s walk through what will seem like completely obvious rules. As we introduce each rule, we will discuss how it translates into a filter and some of the issues that arise with such a filter.

Rule one is that all completed passes must be between two teammates and within the same possession. This means that the ball has to be with one player for a period of time and then transfer to another player on the same team for a separate period of time. This indicates a **transaction** has occurred between the two players.

To apply this rule, we first **identify possessions** for each team. Within this period, we know that the offense has the ball. Therefore, second, we can merely look at the **touches** between players. A touch is just defined as the period of time a player maintains control of the ball.
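A touch itself has to be inferred from proximity. Here is a minimal sketch, assuming per-frame (x, y) positions for the ball and the players; the two-foot control radius is an assumption to tune against your data, not a published value:

```python
import math

TOUCH_RADIUS_FT = 2.0  # assumed control radius; tune to your tracking data

def touching_player(ball_xy, players):
    """Return the id of the nearest player within the control radius,
    or None when nobody plausibly controls the ball.
    `players` maps player_id -> (x, y); all units in feet."""
    best_id, best_dist = None, float("inf")
    for pid, (px, py) in players.items():
        d = math.dist(ball_xy, (px, py))
        if d < best_dist:
            best_id, best_dist = pid, d
    return best_id if best_dist <= TOUCH_RADIUS_FT else None

frame_players = {"PG": (30.0, 25.0), "SG": (45.0, 10.0)}
print(touching_player((30.5, 25.2), frame_players))  # PG
print(touching_player((38.0, 18.0), frame_players))  # None (ball in flight)
```

A run of consecutive frames resolving to the same player id then constitutes one touch.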

Therefore, all we need to do in the code is maintain the **state** of the ball being touched while the ball is moving within possession. Building a class variable may be helpful in this capacity; where whenever a pass is impossible, the state of the class is **flushed**.

**Example:** Consider a point guard that brings the ball up court. They will be listed as the touching player until they make a pass to the wing. When the ball leaves the player, possibly by a certain threshold, the player state may be blank. When the receiver catches the ball, the player state of the ball changes to a new player. **This is a pass**.

**Example:** Now when the receiver catches the ball, suppose they shoot. The ball leaves their possession and changes the player state of the ball to blank. This looks like a pass as well. However, the ball travels to the rim and misses, landing back in a teammate’s hands. We see the state change once again; but the **action** of **field goal attempt** has occurred. Therefore, we flush the previous state and update it with the rebounder. **This is not a pass**.
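The class-variable idea running through the two examples can be sketched as a tiny state machine. The names and structure here are my own illustration, not any vendor’s API:

```python
class PassTracker:
    """Minimal sketch of the touch-state idea: track who last controlled
    the ball and emit a pass when control moves between teammates.
    flush() wipes the state whenever a pass is impossible (field goal
    attempt, turnover, dead ball)."""

    def __init__(self):
        self.last_toucher = None  # (player, team) or None
        self.passes = []

    def on_touch(self, player, team):
        if self.last_toucher is not None:
            last_player, last_team = self.last_toucher
            if last_team == team and last_player != player:
                self.passes.append((last_player, player))
        self.last_toucher = (player, team)

    def flush(self):
        self.last_toucher = None

tracker = PassTracker()
tracker.on_touch("Collison", "IND")
tracker.on_touch("Turner", "IND")  # state change between teammates: a pass
tracker.flush()                    # e.g. Turner shoots; state is wiped
tracker.on_touch("Young", "IND")   # rebound touch; correctly not a pass
print(tracker.passes)              # [('Collison', 'Turner')]
```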

The issue with touch-based passing is that while we do indeed grab every single **completed** pass, we gain a lot of **false positives**. These situations are deflections: “passes” that were not intended for the receiving player but occur because the ball was deflected. Think of a tipped entry pass into the post that gets batted to another player in the short-corner.

Similarly, loose balls cause problems: suppose a point guard loses their handle and a teammate grabs the ball. **This is not a pass**, as passes are **intended**. This is difficult to gauge, especially in the following example:

**Example:** Suppose a wing throws an alley-oop to a cutting post player. However, the pass is misguided and ricochets off the rim to a different teammate. Is this a pass? **Ideally**, this is an incomplete pass that results in a loose ball, recovered by the teammate.

Therefore, we count these as **incomplete passes**. We will encounter these in the next rule.

Also, dead balls pose problems, as the clock does not always stop when there is a dead ball. For instance, a made field goal does not always stop the clock in the NBA. Hence the simple act of flipping the ball out of bounds to a teammate to be thrown back in may register as a pass. We will treat this issue later.

Finally, as an artifact of the camera system, box outs look like passes on some occasions. If a field goal attempt is taken, the ball may travel over a teammate in a low-enough fashion to look like a lob pass. Therefore, we really need to track the field goal file to notch these out.

Another simple rule is that passes that are incomplete must either be turnovers, dead balls, or loose balls. For the turnover cases, these are passes that are intercepted by opposing defensive players or passes that exit the court. Either way, the result of the pass leads to **termination of possession**. Therefore, we can use the possession-ending frame to signal an incomplete pass. **These are still passes**.

For incomplete passes that do not result in turnovers, we know that the offense has retained possession. Therefore, the ball must have been deflected out of bounds or the ball was loose and a “scramble” ensued. In this case, we look for specific information such as **change of direction with a non-teammate touch**.

We emphasize the last part: **non-teammate touch**. This is vital. A deflection cannot occur unless an opponent is there to deflect the ball. Similarly, a ball cannot ricochet off the rim without a rim. Neither an opposing player, nor the rim, is a teammate of the passer.

These are two extremely difficult areas to filter out. A strip or poke requires a defensive player to rip the ball from an offensive player. In this case, we have a **loose ball**; therefore the state of the ball should be **loose**. However, using tracking data like we have from Second Spectrum or SportVU, we no longer have the directionality of the player. Therefore, we don’t know if an offensive player is facing, is sideways to, or has their back to the defensive player. We can estimate it by where the ball is measured, but it will not be known exactly.

In these situations, strips may occur and they may look like passes. Pokes are worse, as the offensive player may have their back to the defender and the ball is poked directly to an offensive teammate. It will look exactly like a pass and end up in the teammate’s hands. However, **it is a loose ball**.

As a side note, Second Spectrum has tried cleaning these up, as some pokes have been registered as passes in the past. One such way to clean up deflections is to leverage the **loose ball** and **deflections** counting stats. While the time-stamps may not line up perfectly, they give extra information about **game state**, allowing us to sever the touch-to-touch state of the ball. Leveraging this data will also reduce **Magic Bullets**.

Magic Bullets are passes that seemingly defy physics. For instance, a player may put so much backspin on the ball, it takes a hard sharp turn to reach a teammate. In the same vein, as tracking technology cannot measure the spin of the basketball, deflected passes may look the same in x,y-space.

In total, using the touch-to-touch attack with notching based on possessions, loose balls, and deflections, we are able to account for over 99% of passes in the league with a false positive rate of about 2-3%. The bulk of the false positives are **transition of possession** cases. Remember the issue raised in rule one? These are those instances.

Transition of possession cases are situations where a possession changes hands between teams. Turnovers are obvious. Dead ball turnovers are obvious. Made field goals are not obvious. In some cases, three seconds have run off the clock as a team takes a ball after a made field goal and tosses it to a teammate underneath the hoop. This gets marked as a pass as the touch-to-touch is satisfied, the new possession has technically started despite the shot clock not, and the game clock is still running.

To deal with this is actually relatively simple. We see the ball get passed from in-bounds to out-of-bounds. Since no turnover is noted in the play-by-play, we are obviously **taking the ball out of bounds**.
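That check reduces to a court-bounds test on the touch endpoints. A sketch, assuming coordinates in feet on the standard 94-by-50 playing surface with the origin in a corner (the function names are mine):

```python
COURT_LENGTH_FT = 94.0
COURT_WIDTH_FT = 50.0

def out_of_bounds(x, y):
    """True when an (x, y) location in feet falls outside the playing surface."""
    return not (0.0 <= x <= COURT_LENGTH_FT and 0.0 <= y <= COURT_WIDTH_FT)

def is_inbound_situation(passer_xy, receiver_xy):
    """Flag touch-to-touch transfers where either endpoint is off the court;
    these are take-outs after made baskets, not live passes."""
    return out_of_bounds(*passer_xy) or out_of_bounds(*receiver_xy)

print(is_inbound_situation((47.0, 25.0), (96.0, 25.0)))  # True: ball taken out
print(is_inbound_situation((30.0, 20.0), (40.0, 35.0)))  # False: live pass
```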

As other intricate nuances of passing arise, we can include other simple filters. However, these two rules and a couple of checks against play-by-play capture almost every pass with excessively few false positives.

So let’s see this in action.

Let’s apply the above rules in terms of tracking data from the Pacers-Cavaliers opening round game from two seasons ago. In this case, Myles Turner knocks down an elbow jumper to transfer possession to the Cavaliers. Kevin Love takes out the ball and **passes** it back in to Kyrie Irving, who dribbles up the court.

Love gives a pin down to LeBron James, who receives the **pass** at the high elbow. J.R. Smith comes across the lane for a pin-down from Love, mimicking a lift action to the wing. C.J. Miles goes under the screen, allowing Smith to curl as a reaction.

Thaddeus Young, defending Love, steps up to deny the curl, allowing Love to react and slip for the easy lay-up. James makes the **pass** to the interior. Unfortunately, the layup was not easy, as Myles Turner rotates from the opposite elbow to help introduce the field goal attempt to the front row.

As we can see from the video, the exact play described breaks out. As we identify where a player obtains the basketball, a blue line appears. This line merely indicates the split for recording **which direction** a player will move. This also signifies when a **touch** starts. As the passes are made, we see a faint **PASS** blurt at the top of the screen. That’s the filter in action!

Now that we know we can extract passes, we can start looking at the distribution of passes. For instance, we can look at the locations of passes being made and the locations of passes being received. We can also build out the passing network when we introduce the labels of the players.

Now, unlike field goal attempts, passes can be made all over the court, in different directions, during the same possession. For field goal attempts, there is one fixed location to shoot at, and players do not (intentionally) shoot at the wrong hoop. Passes, however, can go forwards and backwards, and they often happen in the back court. Therefore, we should always specify direction of play when dissecting passes. The simplest solution? **Make the direction of play go left**. Therefore all back-court passes are on the right-hand side and all front-court passes occur on the left-hand side.
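The flip itself is one line of arithmetic. A sketch, again assuming 94-by-50 coordinates with the origin in a corner and a per-possession flag for which hoop the offense attacks (the flag name is mine):

```python
COURT_LENGTH_FT = 94.0
COURT_WIDTH_FT = 50.0

def normalize_left(x, y, attacking_right):
    """Mirror coordinates so every offense attacks the left hoop.
    Reflecting both axes (a 180-degree rotation about center court)
    preserves left/right handedness on the floor."""
    if attacking_right:
        return COURT_LENGTH_FT - x, COURT_WIDTH_FT - y
    return x, y

# A pass released near the right-side hoop maps to the left side.
print(normalize_left(88.0, 25.0, attacking_right=True))   # (6.0, 25.0)
print(normalize_left(6.0, 25.0, attacking_right=False))   # (6.0, 25.0)
```

Rotating rather than reflecting only the x-axis matters for the right-hand-dominance observations later: a mirror flip would silently turn right-handed drives into left-handed ones.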

When performing this simple transformation, we obtain the following plot:

Here, the black dots represent the passer’s location when they released the basketball and the blue dots represent the receiver’s location when they received the basketball. Using the left-to-right transformation, we see the flow of the game from back-court to front-court. We also see a significant set of rebounds at the opponent’s hoop!

So let’s apply some analysis to see what sorts of things this mess of a plot is telling us.

Let’s piggy-back off of a recent post about shooting trends in the NBA and perform the same analysis on passes. Performing a non-negative matrix factorization on the pass origination locations gives insight into the trends in where passes originate.
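For readers who want to replicate the approach, the decomposition is a stock scikit-learn call. The matrix below is random Poisson noise standing in for a real player-by-court-bin pass count matrix, and the shapes are purely illustrative:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Stand-in for a player-by-location matrix: counts of passes thrown
# from each court bin (rows = players, columns = spatial bins).
pass_counts = rng.poisson(lam=2.0, size=(50, 200)).astype(float)

# Factor into 10 spatial "pass types": W holds player coefficients,
# H holds the spatial basis surfaces over the court bins.
model = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(pass_counts)  # (50 players, 10 types)
H = model.components_                 # (10 types, 200 bins)

print(W.shape, H.shape)
```

With real data, each row of H is reshaped back onto the court to draw the spatial plots discussed below, and each player’s row of W gives their loading on each pass type.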

Again we break this down into 10 types of passes, and will quickly jump through them.

The most common half-court pass is the initialization pass. These are passes that come from the Pistol sequence to initiate an offense. These are also passes where the point guard tends to dribble to one wing or the other before initializing the offense. This is the primary reason for the band about the three point line.

These passes tend to be the “second pass” part of an offense. While it is not always the second pass, these are passes that are typically initiated by a wing player.

The players with large coefficients in this passing location are actually **post players** that have rotated out to the top of the key. We see several point guards with these coefficients, but they are rarely as large as, say, **Al Horford’s** coefficient.

The fourth most apparent pass is the interior pass. These are the drive-and-kick guards and the “extra pass” passers that find themselves tangled up in traffic.

Passers in these locations are actually right-dominant offensive players. If this were last year, **Evan Fournier** would be found as the king of this pass type, as over 70% of Fournier’s plays last year resulted in him catching a pass on the right wing (extended), only to make another pass, typically into the post or to the top of the key.

Another top of the key pass-type, this is the roaming guard. As opposed to the Al Horford-type passers at the top of the key, these are the guards that handle the ball typically in **odd-set** offenses such as **all-out** or **3-2**.

We are seeing repetitive passing structures; and as you can guess, passes are primarily performed by guards and wings that play above the free-throw line. These are the off-ball guards and wings for the most part.

These passes are typically generated by wings and posts that have rotated out to the high elbow. **Nikola Vucevic** is actually a big name in this category. So is **Kelly Olynyk**. The streaks are wing players (that also play 4/5 positions) that tend to drive. These players include **Giannis Antetokounmpo** and, strangely for a non-post, **Klay Thompson**.

Here, we obtain mid-range passing. Remember that these are passes and not shots. These passes are all over the place. They are kick-outs from drives (to the top of the key). They are skip passes. There’s even **Hammer passes** sneaking in there.

Finally we have the pure high post player passes. These are for post players that rarely move beyond the three point line. Here we find players such as **Clint Capela**.

One thing we learn about the different trends in passes are the areas where passes are not made. For instance, the left three-point wing does not have a significant spike in any plot. This indicates that most players **drive **or **shoot** from this location.

A semi-satisfying result is that **directly under the basket** is relatively small compared to every location around the restricted area (including behind the basket!), indicating the same phenomenon.

Regardless, by looking specifically at the players’ coefficients, we are able to start to understand the players’ **decision making processes** of how likely they are to make passes.

After 278 games played through November 24th, we have had the chance to witness every team play at least 16 games, with the Detroit Pistons lagging behind most teams, which have played 18-19 games by this point. This gives us the ability to get above the noise floor and start comparing trends between this season and last season. To do this, we simply apply the non-negative matrix factorization process to the entire season and compare the spatial basis functions. And here’s what we found out…

We find that the primary action so far this year has been rim attacks; which is a major change from last season. Last year, the league was dominated by three and rim players. These players ranged from your driving players such as James Harden of Houston and Steph Curry of Golden State to your catch-and-shoot and slip players such as Klay Thompson of Golden State and, yes… Al Horford of Boston.

Using the Milwaukee Bucks’ court as the backdrop for this season, we see the huge discrepancy between the two seasons’ trends. Notice the right-handedness of last year. This will become an important factor very soon…

Here, we see the mid-range start to fall off in favor of a right-hand dominant game. We see the three-and-rim attack as in the prior season, even with a hint of the right-hand dominance from the previous season (see the Rockets court above). However, this season is more pronounced; which may be due to only 278 games played as opposed to 1,230 games.

Note that the mid-range game is not in the top two spatial bases, which indicates a current slight decline in preference for the mid-range; in particular, the post spot-up type shots and the baseline drive-and-pull-up.

By the third spatial basis, we begin to see three point attempts jump back into the mix. We also see that spatial basis three of last year (rim attempts) is currently this year’s favorite style of field goal attempt.

Here, we are seeing effectively dunks as the fourth most prominent field goal attempt, whereas last year, the right-handed “Kosta Koufos” style of play was the fourth most dominant style of field goal attempts.

As we have seen this fourth spatial basis drop from last year, this is again an indication of the growing “no-man’s land” that is the space between the restricted area and the three point line.

We see near identical favoritism to the three point line in spatial basis five, indicating that this type of field goal attempt is still a favorite among NBA teams. This makes sense given the three-point revolution over the past few years.

In the current trend, we do see some left-hand style of play and a slight decrease at the top of the arc. The top of the arc fluctuations are most likely due to noise, while the left-handed attack at the rim is most likely compensation for the right-handed dominance from above, combined with the tendency for players to attack the rim and shoot the three. That is, this is separating the old Spatial Basis 1 (Houston plot) and Spatial Basis 5 (Houston plot) into two re-factored types: Spatial Basis 1 (Milwaukee plot) and Spatial Basis 5 (Milwaukee plot).

This is an odd result that pops up. Here, we see the well-known “Corner Three,” which is considered the “easiest three point shot in the game.” It has yet to appear this season as a preferred style of shot attempt. Instead, for this season, we see an artifact-laden plot where the mid-range game is ever-so-slightly introduced with free throw and elbow jumpers. We see some faint three point trend, with an ever-so-faint corner three attempt.

This indicates the artifacts are finding players with a high propensity to attack the rim while taking shots from the elbows, the free throw line, and the occasional corner three. These styles of players are your Kyrie Irvings of Boston, but also players such as Mo Bamba from Orlando.

Here, we once again see the mid-range fall down this list. And finally, we obtain the corner three assault from teams for this season. This shows that the corner three is still indeed a prevalent part of NBA teams’ systems.

Here we see a match-up of left-handed post play. The key difference here is that the action is tighter around the rim, indicating that players are more willing to either take the extra step, or position themselves, to be within five feet of the rim as opposed to being between 5 and 15 feet from the rim.

As we have seen over the years, field goal percentages drop significantly after players wander beyond six feet from the rim; beyond ten feet, in particular, percentages really drop off. This plot indicates that both players and coaches understand that key point and are making it an established style of field goal attempt.

Here’s another basis-to-basis match with the top-of-the-arc three point attempts. The core difference between these two plots is that players are taking the shots closer to the top of the arc as opposed to having a 15 foot window along the arc.

Some of this may be due to noise while some of this may be due to spacing effects of the team. Regardless, the style of field goal attempt is still in vogue for this season, at the same prevalence.

Finally, the mid-range game makes an appearance in the league. It should be noted that in a ten-factor decomposition, the tenth factor contains the “remainder” of the spatial noise. This indicates that the mid-range game is effectively the least favored shot style.

That said, its prevalence indicates that it is still a key part of the game. And while the mid-range is considered a low-yielding type of field goal attempt, the fact that players take them induces a “game theory” attack on their defenders. Read that as: hitting an occasional mid-range jumper forces defenses to address the mid-range game, allowing spacing to occur on offense. The mid-range game is still indeed important to spacing.

What we ultimately find out here is that the mid-range game is indeed disappearing from the game. A lot has been written about the transformation the Milwaukee Bucks have made this year with Mike Budenholzer at the helm. You can find some of those articles here, here, and here. However, these articles just look at the basics of “More Threes!” and “Less Mid-Range!” But what shots are the Bucks actually taking? We can quantify this. And within a couple of those articles, a comparison is made to the Houston Rockets. We can quantify that too.

Last year, under Jason Kidd and company, the Bucks favored an all-over-the-court style of game. It was called “position-less” basketball. What it really did was force the team to shoot all over the map in an undisciplined style of play. We see that with the shot-chart from last season:

We see that shots are indeed taken everywhere on this court. A strange occurrence is that right-direction play for the Bucks actually shows some discipline in avoiding a “foot-over-the-line” two-point attempt. In the left-hand direction, however… it’s out the window. This is just a strange coincidence.

Let’s compare that chart to this year:

Just slightly under a quarter of the way through the season, we see the Bucks taking a few mid-range shots; but nowhere near the previous volume. We still see a significant number of field goal attempts lingering in the high post/paint; but that’s an artifact of players such as Khris Middleton and Eric Bledsoe attacking from the three point line. Most important to note, there are only a couple handfuls of mid-range jumpers to the wings/baseline.

To understand this in terms of style of play, we look at the coefficients of the Bucks in terms of the spatial components. Remember, we need the breakdown above to be able to understand what is really going on.

Let’s quickly recall what those ten components were…

- Threes: 1 (right-handed), 5 (spread/wing), 6 (corner), 9 (top)
- Mid-Range: 2 (baseline), 4 (right post), 7 (pullup), 8 (right post)
- Rim: 3, 10

OK, so what did we see last year for the Bucks? We saw a **significant Khris Middleton mid-range (baseline) game**. This was far and away the most preferred conditional attack from the Bucks. As Middleton attempted 15.5 field goal attempts per game (second to Antetokounmpo at 18.7), **8.37 attempts came from the mid-range**. That means 54 percent of Middleton’s shots came from low-eFG% regions. With an eFG% of .524 last season, a field goal attempt from Middleton was worth approximately 1.048 points per chance; a little below the desired 1.10 rate. This placed Middleton at 43rd in the NBA despite the low-eFG% locations; which indicates that Middleton is most likely a well-above-average mid-range shooter.
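The points-per-chance arithmetic above is easy to check: eFG% already counts a made three as 1.5 made twos, so points per field goal attempt is simply twice the eFG%.

```python
# Points per field goal attempt ("per chance") from eFG%.
def points_per_chance(efg: float) -> float:
    return 2.0 * efg

# Middleton's .524 eFG% from last season, as quoted above.
print(round(points_per_chance(0.524), 3))  # 1.048, below the desired 1.10
```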

If we take a closer look at the Bucks’ numbers, we do indeed see a high number of zeros on the board (indicating roles of shooters), but those numbers are associated with low-minutes players. This means most zeros on the table are **structural zeros** and have no influence on the style of play. Instead, we focus on players who play significant amounts of time.

If we look at Eric Bledsoe, Giannis Antetokounmpo, Malcolm Brogdon, Khris Middleton, Jabari Parker, Brandon Jennings, Jason Terry, Thon Maker, and Matthew Dellavedova, we see that **every single one of these players has numbers all over the board**. This indicates that the Bucks were playing the older style of basketball, allowing shooters to shoot all over the map; for better or worse.

Taking a closer look, basis function 2 (mid-range baseline) is the Bucks’ most prominent field goal attempt type. This is followed by basis function 5, the spread wing three point attempt; and spatial basis 3, the primary rim action. This shows Milwaukee as playing a very late-90’s to mid-2000’s style of play. Don’t believe us? Here’s the 2004-05 season:

Let’s compare that to this season:

Here we begin to see the separation of shot types. To help ourselves again, we focus on what each spatial component represents:

- Threes: 3 (wing), 5 (spread/wing), 7 (corner), 9 (top)
- Mid-Range: 8 (post action), 10 (all mid-range)
- Rim: 1, 4, 6
- Right-Handed Action: 2

Immediately we see a lot of zeros for high-usage players and **they are all in components 8 and 10**. Look above at what those components mean. That’s right. The mid-range game is indeed gone. And Khris Middleton? **Almost nothing from full mid-range**. Hands down the biggest change of all. And his shooting has so far improved .045 in eFG%. **That’s an extra .09 points per chance** for Middleton. Remember that desired 1.10 points per chance that only about 30 players in the league can attain? Middleton is at **1.13 points per chance** when it comes to field goal attempts.

We also see other trends pop out: Giannis **is not a corner three shooter**. Donte DiVincenzo and Tony Snell are effectively the **only wing-three shooters on the team** and they play a limited role. In fact, if the Bucks are taking threes, they are spread about the high wing from Brook Lopez, Khris Middleton, and Eric Bledsoe. And if Giannis gets into the half-court offense, be prepared for a right-handed game. That’s the primary action for the Bucks; **and no one else on the team really does it**.

As we compare the current Bucks to the previous year’s Bucks, we see that the roster has very little turnover. Effectively, the addition of Brook Lopez in place of Jabari Parker and some deeper rotation shuffles are all that changed. Therefore, it is safe to infer that the change in style of play is due to the change of coaching.

Recall that last season the Rockets were by far the most disciplined team in the league. We saw this with large blocks of zeros in the spatial components.

However, the Rockets roster turned over quite heavily from last season as we saw the departures of Trevor Ariza to Phoenix and Luc Mbah a Moute back to Los Angeles. We saw the introduction of Carmelo Anthony for a short period of time and have welcomed new players such as Michael Carter-Williams and Gary Clark into the mix. The question was whether the players could adapt to new roles or fit into the system; or if the system would change to adapt to the players. So far this season, after 17 games, the Rockets have a distribution as follows:

We see some similar trends as the previous year, but we also see a lack of discipline compared to last year. While the midrange game has vanished as a whole from across the league, we see Houston pick up the post / high paint mid-range game (spatial component 8) and it is being adopted by Eric Gordon and Michael Carter-Williams. The latter can be due to learning the system; the former is definitely not.

We also see the impact of Carmelo Anthony with the high mid-range game, rivaling that of Chris Paul; who was effectively the only Rocket (other than Harden) given the green light in this area last season. Whereas Paul was effective from the mid-range (much like Khris Middleton), Carmelo was not.

For top threes, we still see Harden, Paul, and PJ Tucker maintain their roles. However, with the departure of Trevor Ariza, we see that fourth player disappear from the ranks. This is a roster construct, as neither Anthony nor Carter-Williams fits that style of shooting capability. Ariza’s shooting role has since fallen onto Gerald Green. More curious, however: Eric Gordon disappears from the ninth component.

Breaking down Gordon’s role in the offense, we see that he has yet **to attempt a field goal between 16 feet and the three point line**; the lowest such frequency of his career. We also see that Gordon is down **seven percent** when it comes to three point frequency. And this may be a direct result of his unexpected dip in three point percentage: **.243**!!!

Last year, Gordon was primarily a spread/wing three point shooter and then a corner spot-up shooter; more so on the left side than the right, as he has **no spatial component for right-dominant threes**. Compare that to this year: Gordon has been exclusively a wing and spread/wing three point shooter, as he has **no spatial component for top threes nor corner threes**. This indicates a change in the style of play for the Rockets between last year and this year; a given with quite the overhaul of the roster.

But as D’Antoni mentioned earlier this season, the Rockets may have lost their swagger for the time being. And we see it in the shooting numbers as the team feels out the right style of play.

But for now, it looks as if Milwaukee has taken over as the current king of disciplined shooting; as the transformation even converted mid-range king Khris Middleton into a better overall shooter.


My basketball world was rocked when I moved to Wisconsin and played out the remainder of my high school career in Madison, Wisconsin. In Wisconsin, the Badgers were the team of interest to the casual fan; but the team of the state was the University of Wisconsin–Platteville, headed by then-head coach Bo Ryan, the originator of the most famous Wisconsin high school offense: the Platteville Swing. There, most teams did not rely on shooting or athleticism. Instead, they focused on ball control and grinding games to a halt. Possessions were no longer 10-15 seconds as in California, but rather 45-70 seconds, as no shot clock was in existence. Scoring 60 was viewed as a “high octane offense.” For example, my senior year, we averaged 64 points a game; which was considered blasphemously high. Most teams in our conference averaged 44 points a game. We even had one conference game result in a 9-6 final.

With slowed-down possessions came the need for defensive discipline. In practices, we played the **30-point game**, which awarded **a point** for every pass of “reasonable” distance. Shots, worth **five points**, could only be taken as lay-ups or dunks. Possession flipped whenever a turnover was made or a team dribbled the ball; a turnover was worth **-5 points**. First team to 30 wins; the loser gets a “down-back” for every point they lost by. Losing by 15 was always terrible. More importantly, a defensive foul was worth **three points** to the offense.

What this taught a defense was to reinforce ball denial, how to use hands on defense, and how to force turnovers. The ultimate discriminator of team points came down to fouls. Due to this, we had to learn how to reach without reaching, block without bodying, and take charges without blocking. Try learning how to slide into a charge during a break when no dribbles are being taken. After about forty collisions, you start to become an expert.

Need to reach for a steal? You learn that the swipe is almost always called as a foul. Absorbing contact and using the upward-and-inward, two-hand motion results in jump-balls or outright steals.

The ultimate goal was to eliminate an opponent’s possession through the use of disciplined defense; specifically to disrupt the opponent’s offensive flow at the passing, cutting, and driving level.

Need specific examples? Feel free to watch the 2014-15 Wisconsin Men’s basketball team play almost any game. Or read any of these articles here, here, or here.

When we got to game time scenarios, we would keep track of possession-ending events and measure them against the fouls a player had. It’s great if a player had three steals in a game; but if it came at the price of four fouls, the defender put the offense in a better position later in the game due to the bonus and limited his own ability to be as aggressive as necessary in higher stress situations (aka Crunch Time). A prime example has been Dwane Casey trying to handle Andre Drummond in foul trouble.

On-ball defensive fouls are commonly obtained through one of three actions: reaching/hacking for steals, bodying/hacking when attempting to block, and blocking when attempting to draw a charge. There is a fourth foul, the hand-check, but this is a precursor to the above three, as its aim is to re-position an offensive player. The three main defensive statistics used to measure the impact of a player directly on action are then the **steal**, the **block**, and the **charge**. These measures do not capture the impact of all players on the court, but they do encapsulate the player positioned on the ball who performs the potential possession-killing action.

For a steal and a charge, the offense is charged with a turnover; an immediate possession-killing event. For a block, however, a possession is not necessarily terminated. We have seen extensive discussion about this in the past. Therefore, we must look at blocks that lead to a change of possession; known as **kills** in volleyball.

Therefore, we aggregate these statistics to form what we call the **Wisconsin Stat**.

The Wisconsin stat is a measurement of the number of possessions terminated directly by a player, divided by the number of fouls that defender has committed. What this value gives us is an identification of players that are “disciplined” when they get into the middle of the action. This means the defender is either a low-fouling, possession-killing block specialist, or a thief who can make clean getaways, or an expert at drawing charges.
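As a formula, this is just terminated possessions over fouls; a minimal sketch, with a guard for the zero-foul edge case discussed later:

```python
# Wisconsin Stat: possessions terminated directly by a defender
# (kills + steals + charges drawn) per personal foul committed.
def wisconsin_stat(kills: int, steals: int, charges: int, fouls: int) -> float:
    terminated = kills + steals + charges
    if fouls == 0:
        # Guard the zero-divided-by-zero phenomenon noted later in the text.
        return float("inf") if terminated > 0 else 0.0
    return terminated / fouls

# Hypothetical line: 5 kills, 20 steals, 5 charges on 30 fouls.
print(wisconsin_stat(5, 20, 5, 30))  # 1.0 -> one possession killed per foul
```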

In all three categories, we can think of current NBA leaders. For blocks, we usually think of **Hassan Whiteside**, **Rudy Gobert**, **DeAndre Jordan**, or **Andre Drummond**. For steals, we think of **Stephen Curry**, **Paul George, Robert Covington**, or **John Wall**. For charges drawn, we think of **Kyle Lowry, Ersan Ilyasova, DeMarcus Cousins,** and **Kemba Walker**. These guys always tend to hover around the tops of each of these lists. However, are they all as efficient at terminating an offensive possession?

To measure this, we simply add the three key defensive stat categories and divide by the total number of fouls committed. Using the introduction above, we can see why we would call this the “Wisconsin Stat.” Ideally, it would be the **“Ryan Score”** if we were Nylon Calculus or Liberty Ballers; but you know my disdain for naming any stat after someone’s last name.

If a player scores a one, this means they are **equally likely to commit a foul as they are to terminate a possession**. Think about that for a moment. These are players that can eliminate **five possessions a game without fouling out**. These players inherently grant you **more than five points of defense** without having to rely on a synthesized measurement such as defensive RAPM, RPM, or PIPM.

Granted, we would not want to use this measurement on its own. Instead, we’d rather focus on using the value as a transformed variable in a bigger model. For instance, what impact does such a statistic really have on **defensive efficiency**? Sure, we gain five points over the course of the game; but that requires the player to **foul five times a game**; which is almost unheard of.

Therefore, we use this statistic with care; and look into what the statistic is telling us. Which we will do here.

If we construct the Wisconsin Stat for all players in the league and then limit the list of players to those that have created at least **ten** defensive actions: Kills + Steals + Charges Drawn; we obtain a list of **165 players** across the league for the 2018-19 NBA season.
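The filtering step looks something like the following sketch; the numbers here are made up for illustration and are not the real season totals.

```python
import pandas as pd

# Hypothetical per-player season totals (NOT real data).
df = pd.DataFrame({
    "player":  ["A", "B", "C"],
    "kills":   [7, 2, 0],
    "steals":  [33, 5, 4],
    "charges": [0, 1, 2],
    "fouls":   [30, 12, 3],
})

# Total defensive actions, then the ten-action cutoff from the text.
df["actions"] = df[["kills", "steals", "charges"]].sum(axis=1)
qualified = df[df["actions"] >= 10].copy()
qualified["wisconsin"] = qualified["actions"] / qualified["fouls"]
print(qualified[["player", "wisconsin"]])  # only player A qualifies
```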

Who tops that list? **Tyus Jones** of the **Minnesota Timberwolves**. Let’s call this a coincidence: Tyus Jones was on the 2015 NCAA National Championship Duke Blue Devils team that vanquished the Wisconsin Badgers. Jones is a bench player with curiously high upside, but has been relatively moderate at the start of this season. In fact, Minnesota actually projects well on this statistic; something we will look deeper into in a moment. Therefore, we look into the “true number one” of this list: former teammate…

Butler has only garnered 13 blocks and zero charges over the course of the season; but his 33 steals plant him fourth in the league. Combine this with his 30 fouls committed, and Butler is sitting at **1.3333** possessions terminated per foul. Over the span of 15 games, that’s 2 fouls per game. Meaning, Butler salvages an average of 2.66 points per game for his teams. Second on this list is no slouch either…

Leonard has been known for his defensive prowess for a few years now. This stat backs that up. Leonard is tied for 29th in the league in steals with 23 and has similar statistics to Butler. However, Leonard has only played in 13 games this season and has registered a mere 19 fouls. This indicates that Leonard only salvages approximately 1.92 points per game for his teams.

We immediately see that Butler actually improves defenses from an individual point of view. However, teammates matter, and a team may not be able to afford the extra 0.54 fouls per game. Note that if we had a team of eight Jimmy Butlers playing a standard eight-man rotation, we’d expect roughly 16 fouls over the course of the game. If we were so lucky to have pure uniformity, we’d have **no bonus attempts**. Otherwise, every fifth foul results in two free throws; effectively 1.5 points yielded. This means that Butler’s 2.66 points per game isn’t necessarily so. It could potentially be a mere 1.16; worth less than Kawhi’s 1.92 points!
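The back-of-the-envelope bonus math from the paragraph above can be laid out explicitly:

```python
# Eight Jimmy Butlers at 2 fouls per game each -> 16 team fouls.
team_fouls = 8 * 2
per_quarter = team_fouls / 4             # perfectly uniform spread
bonus_fouls = max(0.0, per_quarter - 4)  # fouls past the fourth each quarter
pts_yielded = bonus_fouls * 1.5          # ~1.5 points per bonus trip
print(per_quarter, pts_yielded)          # 4.0 0.0 -> no bonus attempts

# Without that uniformity, every fifth foul yields ~1.5 points:
print(round(2.66 - 1.5, 2))              # 1.16, below Kawhi's 1.92
```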

Therefore, we’d have to look at the impact of fouling as well.

Before we do that, we unveil the entire list of Wisconsin Stat guys for the 2018-19 NBA season (through Thanksgiving):

We will even let you scroll through all the teams:

To understand the impact of fouling, let’s quickly trot through the scoring rules associated with fouling. First, if a shooter is fouled, then they are awarded the number of free throws equal to the value of the field goal attempt (if missed) or one free throw (if the field goal is converted). Second, two free throws are awarded for every team foul, starting with the fifth team foul in a period of play. A team foul occurs only during a defensive possession or a loose ball event. Third, the NBA implemented a “two-minute” fouling system that awards free throws, **regardless of the count of team fouls**, on the second team foul within the final two minutes of a period.

Within the NBA, a player fouls out upon committing his sixth personal foul. Therefore, players are limited in the number of fouls they can tally before the Wisconsin stat “times out.” Deeper analysis into this statistic and its associated impact requires an advanced study in **survival analysis**; that is, the probabilistic time it takes for a player to foul out of a game.

What this means, at a high level, is that the better the Wisconsin stat, the better the defender. There is a **diminishing return** against the number of fouls a player obtains. Alternatively, there is an **initialization cost** for teams that are less aggressive and create almost no fouls; limiting towards a zero-divided-by-zero phenomenon. This cost shows up in teams that are tentative and non-aggressive as a whole, but assertive when they do engage. These teams **must be good at corralling shooters** if they are to win ball-games.

Therefore, we need to find low-fouling, but not too low, and high Wisconsin stat players and teams to really identify the power of what the statistic is describing.

Let’s look at this from a team level:

We see that the Minnesota Timberwolves and Indiana Pacers top the league in the Wisconsin Stat. But if we take a look at the Defensive Ratings for both teams, we see that the Indiana Pacers are currently fifth in the league at 105.3 while the Minnesota Timberwolves are resting at 112.8, 26th in the league. We can’t definitively suggest this is a pacing issue, as the Timberwolves run only 100 possessions a game to the Pacers’ 97 possessions a game.

The difference comes in the effectiveness of **corralling** opposing teams. In fact, both teams have played 18 games and are perfect candidates for analysis. With 368 fouls, the Timberwolves are 25th in the league in fouls. Similarly, with 362 fouls, the Pacers are tied for the third lowest number of fouls in the league. Read that as both teams being **conservative** (or **assertive**) when it comes to their aggression. Compare this to the Toronto Raptors (435 fouls over 19 games converts to roughly 412 fouls over 18 games) and we see that the Raptors are far more aggressive.
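The per-18-games conversion used for the Raptors comparison is a simple proportional scaling:

```python
# Scale a foul total to a common number of games for comparison.
def fouls_over_n_games(total_fouls: int, games_played: int, n: int = 18) -> float:
    return total_fouls / games_played * n

print(round(fouls_over_n_games(435, 19)))  # Raptors: ~412 over 18 games
print(round(fouls_over_n_games(368, 18)))  # Timberwolves: 368
print(round(fouls_over_n_games(362, 18)))  # Pacers: 362
```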

When we take a look at the corralling numbers, we find the following shooting distributions of the Timberwolves and Pacers opponents:

And we immediately see that the Pacers are “better” at limiting field goals between 3-16 feet to bad looks. These are typically the hook and floater attempts. The Timberwolves, meanwhile, are a victim of high field goal percentages from 10-16 feet. Combine this with the three percent differential at the rim, and Indiana shows a better interior defensive posture. Now add in Indiana’s abnormally high defensive three-point percentage (which could be noise) and the near 40% clip at which teams take those attempts (third highest in the league): either the Pacers are such an intimidating force inside (Myles Turner, by chance) that teams settle for longer range attempts; or Indiana will yield the three, knowing they won’t put you on the line as much, and bait opponents into their **assertive defensive scheme**.

Given the above, we can measure the assertiveness of a team on defense; but realize that the story is not completely told through this statistic. What it does is identify the **assertiveness** and **aggressiveness** of a particular player and team. Combining this statistic with others, as in the comparison of the Pacers and Timberwolves above, we begin to see the actual values of assertiveness and aggressiveness each player/team exhibits.

More importantly, we can start breaking apart players at the top of the counting stats. For instance, Anthony Davis (5th) appears to be the top post defender as he barely fouls compared to his contemporaries, such as Andre Drummond (154th). And it’s why we see Dwane Casey fretting about convincing Drummond to play through the foul trouble. Ideally, Drummond would identify more optimal methods for picking up his blocks without fouling.

Similarly, it’s part of the notion of why we picked on Steph Curry as a defensive player (during my days in the Western Conference), as his Wisconsin Stat rating is actually fairly low in specific situations. And this is what separates players such as John Wall from Curry as a defensive stopper. It actually gave us an edge in those games.

At the same time, as a third reminder: buyer beware. Use this statistic as a talking point to something bigger; and perform rigorous analysis of the entire situation before interpreting the actual impact of the statistic on the game.

For fun, feel free to search all players through all games played up until Thanksgiving here: https://docs.google.com/spreadsheets/d/1EH-j63pokndc8riM0yzhG-YfRnLh3UadWOAB6PdQIug/edit#gid=0

- Effective Field Goal Percentage: 0.527 (8th in the league)
- Turnover Percentage: 14.4 (28th in the league)
- Offensive Rebound Percentage: 22.6 (18th in the league)
- Free Throw Rate: 0.22 (7th in the league)
- Opponent eFG%: 0.527 (21st in the league)
- Opponent TOV%: 13.6 (8th in the league)
- Defensive Rebound Percentage: 79.6 (4th in the league)
- Opponent FTr: .203 (15th in the league)

Overall, the Mavericks appear to be a wildly varying team that turns the ball over and has difficulty obtaining offensive rebounds, but can score and get to the line. Similarly, they are strong in inducing turnovers and can rebound on defense, but have difficulty stopping opponents from scoring. When reading through the four factors, it would appear that teams play at (and mirror) the level of Dallas; which is potentially a reason why the team is near .500.

So if we are to report how the team is doing to coaches or management, how do we go about presenting and, more importantly, discussing these numbers?

Over the previous eight seasons in the NBA, I’ve witnessed effectively every team present these numbers as **rankings** or **percentiles**. Most commonly, the phrase “We are **xx**-th in the league…” is the one that I have heard. And effectively every coaching staff or management chain responds in almost identical fashion: “How do we move up…?” It’s a curious tale that once got me into a philosophical debate with a head coach about the difference between moving from 11th to 10th in a statistical category versus moving 11th to 10th in another category. The problem was that the report didn’t indicate the value of the difference between 10th and 11th. Both had been wistfully whisked away to Gaussian-land, making them look identical; when in reality the stats were telling different stories.

The biggest challenge is that many analysts enjoy standardizing data and treating the data as Gaussian. Standardization is indeed helpful when attempting to remove scaling effects in an effort to treat distributions on **the same scale**. However, mean-zero, standard-deviation-one distributions are not all the same. Take, for instance, this fun example: one distribution is a Gaussian sample of scores that are standardized (orange group), while another distribution is an Exponential sample of scores that are standardized (blue group). Each has the same mean and standard deviation. But what do their distributions look like…
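This example is easy to reproduce in a few lines of numpy; the sample sizes here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(7)

def standardize(x):
    return (x - x.mean()) / x.std()

gauss = standardize(rng.normal(size=10_000))       # "orange" group
expo = standardize(rng.exponential(size=10_000))   # "blue" group

# Both now have mean ~0 and standard deviation ~1...
print(round(gauss.std(), 3), round(expo.std(), 3))

# ...but the exponential sample stays heavily right-skewed.
def skew(x):
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

print(round(skew(gauss), 2), round(skew(expo), 2))
```

Identical first two moments, very different shapes; which is exactly the point of the example.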

Now, if we are to compute the rankings, we see that movement among the rankings for the exponential group should be much easier to commit to than the movement within the Gaussian group. This is due to the fact that many of the teams are located in a tight concentration about a small set of values. In this example, between -1 and -2. The Gaussian group has a more difficult time moving as the bulk of the teams are spread among -1.5 to 1.5. If we impose a Gaussian percentile, we would seriously undercut the tightness of the Exponential distribution and would therefore be misleading team officials in the value of a statistic.

So let’s take a look at the distribution of the Four Factors for the 2018-19 NBA season. To do this, we merely copy the Four Factor stats from Basketball Reference to create an importable csv file that we read into a pandas data frame. We also build off the **RANK** functionality of Excel to help produce sample ranks for each team in each Four Factor category. This builds rank columns that we can also import into Python.
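The same rank columns can be produced directly in pandas, skipping the Excel step; the frame below is a small hypothetical stand-in for the Basketball Reference table:

```python
import pandas as pd

# Hypothetical Four Factors slice (NOT the real league table).
four_factors = pd.DataFrame({
    "team": ["DAL", "MIL", "HOU", "IND"],
    "eFG%": [0.527, 0.551, 0.540, 0.533],
    "TOV%": [14.4, 13.1, 13.8, 13.6],
})

# Rank 1 = best: highest eFG%, lowest TOV% (mirroring Excel's RANK).
four_factors["eFG%_rank"] = four_factors["eFG%"].rank(ascending=False).astype(int)
four_factors["TOV%_rank"] = four_factors["TOV%"].rank(ascending=True).astype(int)
print(four_factors)
```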

Then, given the Four Factors table, we trim down to a condensed data set of offense-only factors. We leverage the seaborn package to produce a nice bi-variate distribution plot, which results in:

Unfortunately, on the diagonal, the estimated densities are flattened; so let’s recover Offensive Rating, Turnover Percentage, and Offensive Rebound Percentage:

We might be able to sneak under the radar for OREB% through the use of a statistical test and rely on small sample sizes. However, it’s near blasphemous to treat Free-Throw Rate, TOV%, and Effective FG% as Gaussian due to their heavy right tails. If we p-hack, we can claim Gaussianity; but that’s being deceptive (and lazy) in reporting.

Remember that example we had above? Compare those exponential plots to the KDE plots above… There’s a reason for that example.

The way we build a percentile is simple. The basic definition of a **sample percentile** is the location beneath which a **certain percentage of points** fall. The most common example is the **median**. For the median, we are looking for the explicit value of a statistic such that **fifty percent** of the data falls below the value. For a given data set, the sample median is either a data point (if there is an odd sample size) or **any value in-between two data points** (if there is an even sample size). Let’s see an explicit example…

Consider the set of effective Field Goal percentages heading into today:

If we sort these values (right panel above), we immediately see that the median is located between the 15th and 16th data points; this would be any value between .517 and .518. Our undergraduate textbook may tell us to just average these values to obtain a median of **0.5175**.

If we continued this example, we would find that the 25th percentile is 0.502. There is no give or take on that. It’s the exact data point. Similarly, the 75th percentile is 0.527. However, if we keep going in opposite directions, we see that the distribution of effective Field Goal percentage is skewed to the right. The way we view this is through a **Probability Plot**.
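The mechanics above are easy to check with numpy; the even-sized subset below is pulled from the sorted example to show the averaging convention:

```python
import numpy as np

# Even sample size: the median averages the two middle points.
middle = np.array([0.502, 0.517, 0.518, 0.527])
med = np.median(middle)   # (0.517 + 0.518) / 2 = 0.5175

# With numpy's default linear interpolation, percentiles may or may not
# land exactly on data points, depending on the sample size.
q25, q75 = np.percentile(middle, [25, 75])
```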

For the probability plot, we see that there is a semblance of a Gaussian distribution, but the tails are definitely skewed, at both the low and high quantiles. **Side Note:** A quantile is the value of the statistic at a given percentile. Some textbooks may use the terms interchangeably.

So how does eFG% compare to the exponential and Gaussian distributions? Well, let’s look at the Probability Plots for both. Referring to the example above, we have:
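scipy can compute (and draw) these probability plots; here is a sketch with a synthetic 30-team sample standing in for the eFG% values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(0.52, 0.015, 30)   # stand-in for the 30 eFG% values

# probplot returns the ordered (theoretical quantile, observed value) pairs
# plus the least-squares line; r near 1 means the sample hugs the line.
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist='norm')
```

Passing `dist=stats.expon` instead gives the exponential comparison from the example above.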

We see that the eFG% is actually closer to the Gaussian set-up, as the tails aren’t nearly as skewed. However, if we take a nuanced look at the plots, we see the same shapes in the eFG% plot as in the exponential: both tails are above the red line, with a pronounced bend below the red line. So we see a semblance of both the Gaussian and the Exponential distribution. So which is it?

If we apply an **Anderson-Darling goodness-of-fit** test, we find that the distribution of effective Field Goal percentages is **neither**. In fact, the p-values (and effect sizes, for those who knee-jerk against p-values) strongly disagree with the distribution being Exponential (10E-9) or Gaussian (10E-14). At least the test notes that the sample is more akin to an exponential than to a Gaussian, as we indicated above!
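A sketch of that test in scipy. One caveat: `scipy.stats.anderson` reports the test statistic plus critical values at fixed significance levels rather than exact p-values, so the rejection logic compares the two directly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
expo = rng.exponential(size=500)

# anderson() returns a statistic and critical values (no exact p-value):
# reject the named distribution when the statistic exceeds the critical value.
res_expon = stats.anderson(expo, dist='expon')
res_norm = stats.anderson(expo, dist='norm')
# A true exponential sample passes as exponential and fails badly as Gaussian.
```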

To bring this example home, the same tests identify the exponential sample as Exponential (0.97) and the Gaussian sample as Gaussian (0.922). This means that if we do indeed apply a **parametric distribution**, we’d be directly lying by giving false information. Remember the core problem we are trying to answer? **How do we report “percentiles” effectively across all statistics?**

So what do we do to correct this issue? The simplest way is to rely on **nonparametric statistics**. We are already sort of doing this by deferring to the rank and percentile; but the traditional analyst tends to force Gaussianity, which we just dispelled using eFG%. Instead of focusing on **z-scores**, we should really zero in on the **empirical distribution function (EDF)**.

The empirical distribution function is an estimator for the cumulative distribution function. The EDF is a stepwise function that counts the percentage of data points at or below a given value. If we take a quick glance back at the sorted eFG% values, we would see **zero** until the value **0.487**, where we would see a jump of 1/30. The plot remains constant until we see the value **.490**, where the function jumps from 1/30 to 2/30. This continues until we run across all data points in the sample. The resulting R plot is then:

**Side Note: **This is one of those times where R completely outperforms Python. Can you believe that Python does not have an EDF function built in?!?!
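To be fair, a step EDF is only a few lines of numpy (and statsmodels does ship an `ECDF` class); a minimal sketch:

```python
import numpy as np

def edf(data):
    """Empirical distribution function: F(t) = (# points <= t) / n."""
    x = np.sort(np.asarray(data))
    n = len(x)
    def F(t):
        return np.searchsorted(x, t, side='right') / n
    return F

# Five of the sorted eFG% values from the example above.
F = edf([0.487, 0.490, 0.517, 0.518, 0.527])
```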

Again we see that aggressive right tail in the EDF. This is due to the Golden State Warriors (2nd) and Milwaukee Bucks (1st). Now, if we simply report the position/rank, we wouldn’t necessarily be lying, but we would be misrepresenting the data. Let’s compare eFG% to OREB%. Notice in the **sunny-side up plots** above, there is actually a seemingly **negative trend** between eFG% and OREB%. We can tell that from the two-dimensional distribution pulling along the line Y = -X. Regardless, let’s plot the EDF for OREB%:

Notice that the tail is left-heavy for OREB%. This is due to the Chicago Bulls, Memphis Grizzlies, and Phoenix Suns.

Now using the EDFs, we can apply **Bernoulli distributions** to understand the statistical properties of the statistics of interest. To see this, we just have to take a moment to recall what the EDF is doing. Recall that the EDF asks whether a **data value** is below a particular value of the domain **x**. For example, **is team i below the value of x = .175 for OREB%?** If the team is the **Chicago Bulls**, then the answer is yes. Otherwise the answer is no. By removing the label and treating all teams as a random sample, we get a value of **1 (yes)** or **0 (no)** for that **one data point**. Therefore, the probability of falling below, say, .175 is just the **TRUE CDF** at that particular value. This means that we can start making an inference on the true distribution without having to guess a distribution!

And since this is a Bernoulli random variable for each team, we obtain a variance estimate for free! So let’s understand what this is telling us…
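The "free" variance estimate is the plug-in Bernoulli variance at each point (values below are hypothetical):

```python
import numpy as np

# Each indicator 1{x_i <= t} is Bernoulli(F(t)), so the EDF value at t has
# variance F(t)(1 - F(t)) / n, estimated by plugging in the EDF itself.
x = np.array([0.48, 0.50, 0.51, 0.52, 0.53, 0.55])  # hypothetical values
t = 0.515
p_hat = np.mean(x <= t)                   # EDF at t
var_hat = p_hat * (1.0 - p_hat) / len(x)  # plug-in variance estimate
```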

We will start simple: let’s compare two rankings across two categories. Suppose we are a team in such a position and need to focus on personnel changes or coaching strategies to help nudge the values in a positive direction. Further suppose that, while we’d love to improve both categories, we only have enough budget to isolate one category. Which do we choose?

28th in OREB% puts us at **18.1%** (Phoenix Suns) and 28th in eFG% puts us at **49.7% **(Minnesota Timberwolves). Now, the movement for a team to jump a spot requires an improvement of **1.8% **for offensive rebounding percentage, and **0.1%** for effective Field Goal Percentage. Ideally, we would use a prior distribution on the counting stats that construct the statistic of interest; however, reporting rarely does that. If we did perform the prior calculations, we could further identify how close a team really is to capturing the next spot. Instead we focus on EDF calculations alone.

Applying the Dvoretzky-Kiefer-Wolfowitz (DKW) bound, we obtain the following confidence regions:
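The DKW band itself is one formula: with n points and confidence 1-alpha, the band half-width is sqrt(ln(2/alpha) / (2n)). A sketch:

```python
import numpy as np

def dkw_band(data, alpha=0.05):
    """DKW band: P(sup_t |F_n(t) - F(t)| > eps) <= 2 exp(-2 n eps^2)."""
    x = np.sort(np.asarray(data))
    n = len(x)
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    F = np.arange(1, n + 1) / n
    return x, np.clip(F - eps, 0, 1), F, np.clip(F + eps, 0, 1)

# With 30 teams at 95% confidence, the band is roughly +/- 0.25 around the EDF.
x, lo, F, hi = dkw_band(np.random.default_rng(4).normal(size=30))
```

That width is exactly why, with only 30 teams, several neighboring ranks are statistically indistinguishable.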

And what we can do is measure the deviations from north-to-south to help understand where a team really sits, despite its hard-coded number of, say… **18.1%**. In this instance, 28th is as good as 25th in offensive rebounding, while 28th is as good as 23rd in effective field goal percentage. This would indicate that improving shooting, rather than rebounding, would be the key emphasis; if one had to choose.

**Side Note: **By placing a prior distribution on the counting stats, we would be able to better control the widths of these intervals.

Now that we have seen the univariate attack on these Four Factors, let’s do two things. First, revisit the **Dallas Mavericks** and second, identify next steps.

For the Mavericks, we saw that their shooting numbers are up and their offensive rebounding numbers and turnover percentages are towards the bottom of the league. To recall, the offensive Four Factors are given by:

- Effective Field Goal Percentage: 0.527 (8th in the league)
- Turnover Percentage: 14.4 (28th in the league)
- Offensive Rebound Percentage: 22.6 (18th in the league)
- Free Throw Rate: 0.22 (7th in the league)

We saw above that the Mavericks are currently 8th in the league in eFG%. This places them within a tier of 5th through 11th, 11th being a steep drop off. For completeness, we include the TOV% and FTr plots:

We see that 28th in TOV% is a tough spot to climb out of: even reaching 26th requires a swing of over 3 percent. They are effectively between 30th and 26th when it comes to turning the ball over on offense. Being 18th in OREB% places them well between 22nd and 15th, indicating they can climb up, but have a better chance of slipping. Similarly, being 7th in FT rate is the entry point of the funnel in the EDF plot. This places them between 11th and 4th in the league in getting to the foul line. It’s still fairly impressive, but teetering towards middle-of-the-pack.

So what’s the diagnosis?

**The Dallas Mavericks are performing well in scoring categories, maintaining second-tier status in the league at roughly 8th in eFG% and FTr. Their non-shooting change-of-possession capabilities are near the bottom of the league, but are being masked by a couple of good performances. Despite rating 18th in OREB%, they are really rebounding like a 20th-place team once noise is accounted for. Their turnovers are effectively at the bottom of the league. Emphasis on protecting the basketball on offense and positioning for offensive rebounds will improve the team’s numbers.**

This sounds like “Well, duh.” But we’ve given a top-level quantification of the potential slip in Dallas at some point in the near future. Understanding what percentiles and rankings are really telling us at this top level lets us dive into the right areas of deeper analysis.

So… what next?

In truth, the above analysis is only good at an introductory level. Realistically, a team cannot simply isolate rebounding and ignore shooting. Oftentimes, the offensive flow requires players to be out of “optimal” rebounding positions. To counteract this, we can look at the **interactions** of the statistics. And to do this, we need **Empirical Distribution Functions on Steroids**. Or, as we call them… **copulas**.

This is an area of research I’ve focused on for a few years, and one very subtle thing about copulas is that they treat discrete distributions as continuous. One of my colleagues created a phenomenal attack by introducing right-censoring to overcome the discrete-to-continuous problem. He started writing a paper on this a while back, and if you notice, there may be a Justin influence on his work. Unfortunately, I departed for Orlando and was dropped down to a “thank you” despite all the hours of work put into crafting variance bounds and analyzing the NBA data set on the XPCA solutions; but nonetheless, it’s a great paper to follow for the next steps!

]]>So how do we go about extracting these seemingly odd stats? Most of it comes from **player tracking data. **Unfortunately, the user agreements with tracking data typically state: No dissemination of data, no dissemination of code showing processing of said data, and no dissemination of analytics/visualizations containing summaries derived from said data; unless, of course, you have prior written consent. That said, we can still develop analytics based off this style of data. It will still work when you finally gain a chance to operate on tracking data!

Whether tracking data is supplied by a company such as Second Spectrum, SportVU, KINEXON, or Catapult, the results are nearly identical: we are given a **time of play**, an **absolute time of reference**, the **player identification key**, and a **position estimate**. Depending on the company, we will have other tags associated with a datagram; but this is the meat and potatoes of the tracking data. If we take a look at the defunct SportVU datagram, we have:

[1,1451772637436,705.59,10.59,null,[[-1,-1,68.68569,4.5478,4.0184],[1610612758,200765,85.52082,40.35368,0.0],[1610612758,200752,82.67104,18.33905,0.0],[1610612758,201956,65.34798,20.40074,0.0],[1610612758,202326,76.02045,34.92187,0.0],[1610612758,203463,71.96106,6.66964,0.0],[1610612756,2199,74.39282,37.39965,0.0],[1610612756,200782,82.85718,14.61204,0.0],[1610612756,202688,88.31124,48.3688,0.0],[1610612756,203933,61.43261,19.63005,0.0],[1610612756,1626164,69.16161,4.94893,0.0]]]

Breaking this down, we have the following components:

**Period: **1

**Absolute Time: **1451772637436 (Unix time… 2:10:37.436 PM January 2nd, 2016 in Sacramento)

**Time Remaining in Period: **705.59 (11:44.41 remaining in period)

**Time Remaining in Shot Clock:** 10.59 seconds

**Empty Slot: **null, for future use.

**Position Vectors: **11 arrays of length five.

**Array Slot One: **Team ID (-1 for basketball)

**Array Slot Two: **Player ID (-1 for basketball)

**Array Slot Three: **Sideline Location (0 to 94)

**Array Slot Four: **Baseline Location (0 to 50)

**Array Slot Five: **Height in feet (ball only)
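The datagram is plain JSON, so unpacking it is one `json.loads` away. The snippet below keeps only the ball and one player from the datagram above:

```python
import json

# Shortened version of the SportVU datagram shown above.
raw = ('[1,1451772637436,705.59,10.59,null,'
       '[[-1,-1,68.68569,4.5478,4.0184],'
       '[1610612758,200765,85.52082,40.35368,0.0]]]')

period, unix_ms, clock, shot_clock, _future, positions = json.loads(raw)

# Team ID -1 flags the basketball; slot five (height) is populated for it only.
ball = next(p for p in positions if p[0] == -1)
team_id, player_id, sideline, baseline, height = ball
```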

So how do we even get these positions? Simple… Machine Learning!

The aforementioned tracking systems vary considerably when it comes to finding “dots on a map.” Second Spectrum and SportVU rely heavily on camera based technology, whereas KINEXON leverages a radio-frequency (RF) based system. The mathematics are relatively similar, with the key difference being an **invasive** or **passive system**. The KINEXON system is invasive, as it requires the player to interact with the system by wearing an RF emitter. The Second Spectrum system is passive as it requires no extra action from the player. The challenge with the latter system then becomes identification of the player.

While we do not have access to the raw camera data from Second Spectrum, we can start to hammer out details using similar set-ups. We should note that this is not Second Spectrum’s methodology; but rather a machine-learning based approach when using cameras (that’s actually been around for 15+ years!). To start, let’s consider a recent game between the Sacramento Kings and the Atlanta Hawks. Here, we have video feed from a well-known fixed point camera system. The first thing we must do is identify where the players are.

Here, we see all the players, the referees, the coach, the basketball, and the crowd. Using this well-known camera angle (and having millions of frames over the years; **except you, Omari Spellman…**), we can apply a **convolutional neural network** to extract the player entities of interest. This is the most difficult part of the process (especially for side-view camera systems), but the convolutional neural network can be applied in many ways. The way we do it is rather straightforward…

First, we construct a **court waveform**. This is taking the fixed-point locations of the court and turning them into a **waveform**; merely a Fourier transform of the court boundaries. This requires labeling. Therefore, if we take the Fourier transform of this image, we get a smeared representation of the court due to the players **obfuscating the court**. However, applying a **convolution**, we find there is enough evidence of court to treat this as a **notch filter**, and we can filter out the court. This leaves us with fuzzy players.

Second, we construct a **player transform**. This is taking a player’s movement and constructing a Fourier transform of the player. This helps in identification. To help illustrate this, let’s play a game…

NBA players are quite discriminating when you break down their body mechanics. Consider this example from the 1992 Olympic Dream Team game versus Angola:

It’s grainy, but we should **easily** identify **Magic Johnson** running the break. And it’s not because of his acrobatic backwards hook pass to Bird for three. It’s his gait. Magic is quite distinctive in his run, with his shoulders set high and his head set low and forward; almost forming a hunchback. Whereas **Karl Malone** is commonly a knees-forward runner with a straightened back and shoulders pinned back. Compare that to **Michael Jordan**, who slinks into a compact form when running. All these players have specific traits.

Given these filters, if we bash the filters against a game log, we can also build a **multinomial distribution** of likely players on the court. This helps hedge our bets on properly selecting the player on the court. But we will get to that in a moment.

Once we have the filters in place, we will have “boxes” identifying objects on the court. Next, we filter and classify. The first step is filtering. This process attempts to remove erroneous boxes and aid in classification. Commonly, this part works in conjunction with the classifier. A common filter is the **Kalman filter**. The role of this filter is to ensure the ball stays the ball and the player stays a player. More importantly, if two players screen, this filtering process helps track players correctly as the filters may not convolve enough frames to ensure the players are correct.
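As a sketch of that filtering step, here is a one-dimensional constant-velocity Kalman filter that smooths a noisy track of box centers. The noise parameters are made up, and a real tracker would run this in two dimensions per object:

```python
import numpy as np

def kalman_track(zs, dt=1.0, q=0.01, r=1.0):
    """1-D constant-velocity Kalman filter over position measurements zs."""
    F = np.array([[1.0, dt], [0.0, 1.0]])              # state transition
    H = np.array([[1.0, 0.0]])                          # we measure position only
    Q = q * np.array([[dt**4 / 4, dt**3 / 2],
                      [dt**3 / 2, dt**2]])              # process noise
    R = np.array([[r]])                                 # measurement noise
    x = np.array([[zs[0]], [0.0]])                      # state: [position, velocity]
    P = np.eye(2)
    out = []
    for z in zs:
        x = F @ x                                       # predict
        P = F @ P @ F.T + Q
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)    # Kalman gain
        x = x + K @ (np.array([[z]]) - H @ x)           # update with measurement
        P = (np.eye(2) - K @ H) @ P
        out.append(x[0, 0])
    return np.array(out)
```

Running the reversed measurement sequence through the same filter gives the back-filtering mentioned below.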

The classification step is the labeling process. This process identifies who a player is and can either be as simple as a **multinomial distribution** (this actually performs rather poorly), a **neural network** (this actually performs rather poorly too…), or a **support vector machine** (this is commonly used). The result is then given by:

This example turned out really well. I suppressed the **referee indicator** because **Dave Joerger** makes me look bad (he classifies as a referee). Also, the basketball is obfuscated in this sequence of frames. It’s correctly found using back-filtering (reversing the Kalman filter once it has located the ball a few frames later), but including it here would be misleading.

Regardless, we are able to identify where players are **within a frame of a camera view**. This isn’t their coordinate on the court. In fact, using this camera frame, we **can** extract positions, but it’s tedious and excessively painful. Much of it deals with effectively single-source collection, multi-source splicing, and the resulting obfuscation. This means the primary method for finding coordinates is the sideline camera, yet we need more than just a sideline camera to identify good points. It also means that the sideline camera isn’t the only feed we get: we also get baseline close-ups, overheads, and sideline close-ups, which immediately kill the CNN requirements above. Finally, this also means that even with our best option, we lack the ability to properly identify depth… the camera doesn’t tell us its zoom level, and unless we have enough of the **notch filter** from above, we are unable to back out the zoom effect.

Instead, the camera systems for SportVU and Second Spectrum are mounted above the court. This allows a proper fixed-point analysis with little to no obfuscation. Therefore the CNN-filter-classify method runs exceptionally smoothly and the remainder of the positioning problem is a classical **angle-of-arrival **problem.

Now that we can obtain boxes of players, referees, and basketballs, and we know the location of the cameras, we can start to build a positioning model for identifying the location of a player. So let’s start simple with the basketball.

Since the camera systems are fixed, we know in which **pixels** the court is located.

Given that knowledge, if we classify the basketball, we obtain six proposed locations; one from each camera. We can think of each pixel as a line projecting from that portion of the camera to the ball: **remember that depth is not solved using one camera**.

The resulting position of the ball is then the best intersecting point of these six straight lines. Since the cameras are fixed and the court is fixed, all we really need to know is the **angle-of-arrival **of the line from the ball into the camera. The width of the resulting pixel at the location of the ball is simply the associated **sampling error** of the position estimate.

For angle-of-arrival analysis, we set the origin of the **reference frame** at the center of the six cameras. Therefore, if we average the (x,y,z)-coordinates of the cameras, we obtain (0,0,0). We then treat the location of the basketball as the unknown **b = (x, y, z)**, whereas each camera is listed as **c_i = (x_i, y_i, z_i)**, all known locations. Using the pixels into which the ball falls, we obtain two particular angles: **alpha_i** and **beta_i** for **camera i**. Alpha measures the left-right directionality while beta measures the up-down directionality. This is what we measure through the CNN-filter-classify pixel matching method.

We can leverage trigonometry to help identify necessary parts of the model before writing the statistical equation for locating the basketball. First, we can use the measured elevation angles (beta) and the distances between the cameras to deduce the estimated distances from the cameras to the basketball.

We can apply the law of cosines to extract estimated distances between the camera and the basketball. Note that, due to error sources within the cameras, we may get differing estimates of the same camera-to-ball distance. But that’s alright; we propagate those errors into the statistical system. I typically use the mean of the extracted distances. For a basketball, this commonly jitters the center of the ball within 1-2 feet.

Once we obtain the estimated distances, **d_i**, from each camera **c_i** to the ball, **b**, we can solve the system of equations:

This system does not include camera biases, and the error terms are suppressed. However, the left-hand side holds the measured pseudo-ranges while the right-hand side is a collection of squared distances between the six cameras and the basketball. We can easily solve this using **Newton-Raphson** root-finding. By doing this, we obtain the three-dimensional position of the basketball!
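A sketch of that solve, using Gauss-Newton (Newton-Raphson applied to the least-squares range residuals). The six-camera layout is invented, and the distances are simulated noise-free:

```python
import numpy as np

def locate(cameras, d, b0, iters=25):
    """Gauss-Newton on residuals ||b - c_i|| - d_i for the ball position b."""
    b = np.array(b0, dtype=float)
    for _ in range(iters):
        diffs = b - cameras
        ranges = np.linalg.norm(diffs, axis=1)
        J = diffs / ranges[:, None]                     # Jacobian of the ranges
        step, *_ = np.linalg.lstsq(J, ranges - d, rcond=None)
        b -= step                                       # Newton-style update
    return b

# Invented six-camera rig, re-centered so the origin is the camera centroid,
# matching the reference-frame convention above.
cameras = np.array([[0, 0, 30], [94, 0, 30], [0, 50, 30],
                    [94, 50, 30], [47, 0, 35], [47, 50, 35]], float)
cameras -= cameras.mean(axis=0)
```

With noise-free distances the solver recovers the ball exactly; with jittered distances the same code returns the least-squares position.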

Now that the hard work is out of the way, we are able to start looking at characteristics of the basketball. Let’s take, for instance, a SportVU snippet from the **2016 Phoenix Suns at Sacramento Kings** game mentioned above. We start crudely using a possession frame. This is the time that the **Phoenix Suns** have the basketball, between the 5:28 and 5:15 marks in the first period.

Now, we don’t have video. Instead, we can compute distances and leverage the possession frame to identify who has the ball and what is happening. To compute who has the ball, we can start with computing the distance between the ball and the player. **We know the Suns are on offense**. Therefore, we know **Brandon Knight** has the ball.
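A toy version of that distance check (positions invented): the nearest offensive player to the ball serves as the proxy ball-handler.

```python
import numpy as np

ball = np.array([70.0, 25.0, 4.0])
# Invented (x, y, z) positions for two Suns players.
offense = {
    'Brandon Knight': np.array([69.5, 24.0, 0.0]),
    'Mirza Teletovic': np.array([80.0, 40.0, 0.0]),
}

# Compare floor (x, y) distance only, since player z is reported as 0.
dists = {name: np.linalg.norm(pos[:2] - ball[:2])
         for name, pos in offense.items()}
handler = min(dists, key=dists.get)
```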

We can even see when a basic pass occurs. What kind of pass is it and where is it going?

While Brandon Knight brought the ball up-court, he makes a pass to **Mirza Teletovic**. This occurs roughly 3.5 seconds into the possession. To identify the type of pass, we can look at the z-profile of the basketball.

At that 3.5 second mark, we see the ball get picked up and passed to Teletovic. Notice the ball has a sharp point and comes upward higher than normal. That’s Teletovic receiving a **bounce pass** and pulling the ball up. Notice that Teletovic **never dribbles the ball**. Teletovic is approximately 19 feet from the basket on the wing. In fact, a sequence of screens occurs, attempting to free up the player Teletovic will eventually pass to, who is…

…not **Devin Booker**. The ball appears to be closing in on Booker; however, the ball is being skipped back over to **Brandon Knight** on a fifteen-foot pass. Knight catches the ball off the skip pass and does something subtle in the z-coordinate of the basketball. Knight **pump fakes** before taking the shot on a **Catch-and-Shoot** field goal attempt from 23 feet out.

Knight misses the three point attempt as we see the **secondary bounce above the rim **(orange line). **DeMarcus Cousins** secures the rebound and the possession ends. In three dimensions, the play unfolded as such:

At this point, we can start developing a rule based system to tease out some basic analytics. For instance, we can use distance of ball to the player to help understand **touches**. Given touches, we can define what a **dribble** looks like in the data. Similarly, we can use point-to-point relationships to help understand **passes** and **types of passes**. In the example above, we saw **four dribbles**, **one bounce pass**, **one skip pass**, **one pump fake**, and **one catch-and-shoot **for the Suns’ offensive possession.
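Those rules can be sketched directly: a touch is the ball inside a radius of the player, and a dribble shows up as the ball's z-coordinate dipping to the floor during a touch. All thresholds below are my guesses, not the real system's values:

```python
import numpy as np

def touch_mask(ball_xy, player_xy, radius=3.0):
    """Frames where the ball is within `radius` feet of the player (floor plane)."""
    return np.linalg.norm(ball_xy - player_xy, axis=1) <= radius

def count_dribbles(ball_z, touching, floor=1.0):
    """Count downward floor contacts (z dropping below `floor`) during touches."""
    below = (np.asarray(ball_z) < floor) & np.asarray(touching)
    # A dribble starts each time `below` flips from False to True.
    return int(np.sum(below[1:] & ~below[:-1]))
```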

Armed with this knowledge of trilateration and the statistical/machine learning process of extracting position estimates, how would you start developing new techniques to measure some tracking quantities such as dribbling, passes, or even screening? But proceed with caution… someone might just call you a nerd.

]]>In this article, we will break down how to perform such a test. The nuts and bolts of the test are given in the link above, so I won’t inundate you with repetitive material.

This question seems to be answered **YES!** across the board. It may be one of those years where it seems obvious that scoring is indeed up. But let’s take a look at the trend over the years.

To start, we look at scores posted from the 2010-11 season through this season. That’s a total of 8 complete seasons and this current partial season. To capture scores, I just go to one of my favorite score repositories: Ken Massey’s Data page. Scores are outlined in a very precise way, and if you copy and paste his scores into a text file, you’ll be able to run code found on this page quite easily.

The first thing we do is read the text files in and create a scoring dictionary. The scoring dictionary just associates the collection of scores for each season. Later on, we will be able to call the season and have all the final scores at our disposal.

import os

years = ['2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019']
scores = {}
for dirName, subdirList, fileList in os.walk(inDir):
    print('Found directory: %s' % dirName)
    for fname in fileList:
        if fname[0:4] in years and fname[4:] == '.txt':
            print('\t%s' % fname)
            # We have a year's worth of data!
            f = open(inDir + '/' + fname, 'r')
            lines = f.readlines()
            f.close()  # note: close(), not the bare f.close
            scores[fname[0:4]] = []
            for line in lines:
                # The two final scores sit in fixed columns of Massey's format.
                scores[fname[0:4]].append(float(line[36:39]))
                scores[fname[0:4]].append(float(line[65:68]))

Ideally, we would plot the histograms for each season layered on top of one-another. This would give us a decent illustration of how scoring is changing.

import matplotlib.pyplot as plt

for year in range(2012, 2020):
    plt.hist(scores[str(year - 1)])
plt.title('NBA Scoring From 2011 through 2019')
plt.xlabel('Points')
plt.ylabel('Frequency')
plt.show()

However…

It appears there’s a slight shift forward, indicating scoring is increasing… but the way the histograms are presented obfuscates the year-to-year interaction. Even though the years are stacked on top of each other, can we tell the difference (at least visually) between what happened in 2011 and 2012? To help with this, we can look at the **kernel density estimator**.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

x_grid = np.linspace(60, 150, 1000)
alphs = np.linspace(.5, 1.0, 10)
for i, year in enumerate(range(2011, 2020)):
    kde = gaussian_kde(np.array(scores[str(year)]))
    est = kde.evaluate(x_grid)
    plt.plot(x_grid, est, alpha=alphs[i], lw=3, label=year)
plt.legend()
plt.title('NBA Scoring From 2011 through 2019')
plt.xlabel('Points')
plt.ylabel('Frequency')
plt.show()

The kernel density estimator is a **weighting** technique that places a **weight **at each data point. Usually, the weight is a **Gaussian** amount; but it can be other things like a **top-hat** or an **Epanechnikov**. If we apply the Gaussian weighting and then color the years with a decaying transparency to offset the colors; we obtain a much more readable graph:

And it’s here that we are able to finally see the pattern of scoring over the years. In fact, it appears that the 2011-12 and 2012-13 NBA seasons witnessed a **decrease** in scoring. Similarly, the 2016-17 and the 2017-18 NBA seasons appear to be almost identical when it comes to scoring. Meanwhile, there seems to be a **foundational shift** in scoring between the 2015-16 and 2016-17 NBA seasons, as well as between the 2017-18 and current NBA seasons. The latter has been hypothesized to be due to the freedom-of-movement and shot-clock adjustment rules established during the off-season, as well as a dramatic increase in pace of play.

So let’s test this out using the nonparametric test.

import numpy as np
from scipy.stats import mannwhitneyu

for year in range(2012, 2020):
    stat, pval = mannwhitneyu(scores[str(year - 1)], scores[str(year)],
                              alternative='less')
    print(year - 1, year,
          np.mean(scores[str(year - 1)]), np.std(scores[str(year - 1)]),
          np.mean(scores[str(year)]), np.std(scores[str(year)]),
          stat, pval)

Applying a Wilcoxon-Mann-Whitney test is easy thanks to the built-in packages of Python. All we need to do as statisticians is confirm whether we have hit all the requirements of the test. In this case, the Wilcoxon-Mann-Whitney test requires that the two samples of interest be independent. We can argue that they are, as one season’s scores do not give you information about the next season’s scores. However, there is an underlying dependence structure, as we can use tools such as RAPM to predict ratings given lineups, and the lineups don’t change much from year to year.

So we cheat a little and, for the sake of argument, suppose they are indeed independent. The dependence is actually fairly weak and, as you will see, barely impacts the results of the test.

Notice I made a decision on the 2017-18 NBA season to suggest there is no increase in scoring. This is one of those situations where we need to understand the **significance** of a p-value. In graduate and undergraduate school, we learn the five-percent rule. However, in industry, the five-percent rule almost never applies. Instead, we attempt to understand what makes sense for decisions relative to the process. In some engineering studies, p-values of .3 are “good enough” to suggest a significant effect. Whereas, in our study above, we really should be looking at something close to .00001. At that point, .023 just isn’t going to cut it.

Another note to make is that despite scoring being up this year, it’s nothing in comparison to the jumps in the 2013-14, the 2015-16, and the 2017-18 NBA seasons. We should be a little cautious, as this may be due to **sample size**. And if the season pans out in a uniform fashion to its first month, we may see the biggest change yet!

But the question is to identify if pace is the root cause of this. The simple answer is to look at the **offensive ratings**.

We pose this problem as an efficiency problem. If a team’s scoring is up and it is due to pace, then we would expect the offensive ratings to either stay the same or decrease while points per game goes up. The simple ratio of

**Points Scored = Offensive Rating x Total Possessions / 100**

computes the points scored given an offensive rating. Under this, if pace increases, then the number of possessions increase. If ratings stay the same (or decrease), then pace dictates the number of points scored.
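The arithmetic behind that claim, in two lines:

```python
def points_scored(ortg, possessions):
    """Points Scored = Offensive Rating x Total Possessions / 100."""
    return ortg * possessions / 100.0

# Same 110 rating, but five extra possessions of pace adds 5.5 points:
# pace alone can lift scoring with no change in efficiency.
```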

So we can first test if the offensive ratings are increasing. To test this, we can simply draw each team’s ratings over the course of each season. To do this, I simply compiled a **csv file** of offensive ratings from Basketball Reference. In this case, we have a table of teams by seasonal offensive ratings. To break this file up, we call on the **pandas** package in Python.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

data = pd.read_csv(inFile)
print(data['2011 ORTG'].values)  # print() for Python 3
x_grid = np.linspace(80, 130, 1000)
alphs = np.linspace(.5, 1.0, 10)
for i, year in enumerate(range(2011, 2020)):
    key = str(year) + ' ORTG'
    kde = gaussian_kde(np.array(data[key].values))
    est = kde.evaluate(x_grid)
    plt.plot(x_grid, est, alpha=alphs[i], lw=3, label=year)
plt.legend()
plt.title('NBA ORtg From 2011 through 2019')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()

The results are promising…

From this plot, we actually see (visually) that the efficiency of teams decreased from the 2010-11 season into the following two seasons. However, the 2013-14 season witnesses an odd bump, primarily due to the **Dallas Mavericks** and the **Portland Trail Blazers** that season. Afterwards, the 2014-15 NBA season bounces back to par with the less efficient seasons before the 2015-16 NBA season returns to efficiency levels on par with the 2010-11 season. This relationship indicates that pace increased in the 2012-13 season and is a primary culprit in the increase in scoring.

We once again witness the jump in efficiency for the 2016-17 NBA season, which is nearly replicated in the 2017-18 NBA season. This jump indicates that it’s not necessarily pace, but more likely a combination of pace and efficiency; this was the real introduction of the three-point mentality for teams.

Performing the Wilcoxon-Mann-Whitney tests, we see there is a clear drop between the 2011 and 2012 NBA seasons. **NOTE:** The test as written does not test this directly; I concluded this by re-running the test with **‘greater’** selected as the alternative option.

```python
from scipy.stats import mannwhitneyu

for year in range(2012, 2020):
    key = str(year) + ' ORTG'
    key1 = str(year - 1) + ' ORTG'
    stat, pval = mannwhitneyu(np.array(data[key1].values),
                              np.array(data[key].values),
                              alternative='less')
    print(year - 1, year,
          np.mean(np.array(data[key1].values)), np.std(np.array(data[key1].values)),
          np.mean(np.array(data[key].values)), np.std(np.array(data[key].values)),
          stat, pval)
```

But more importantly, we see there is effectively no progressive change in efficiency **until the 2016-17 NBA season**. Thank you, space-and-pace revolution. With only approximately 50 games played in the current season, the small p-value leaves it debatable whether we have increased efficiency this season. It’s a gathering storm of yes, but the significance just isn’t quite there.

The takeaway here is that if we say **not enough evidence to suggest efficiency has increased**, then **pacing is indeed the prime culprit for increased scoring**.

That said, the next question in order is… how long can these paces keep up over the season?

Effective field goal percentage (eFG%) is a correction to traditional field goal percentage that attempts to adjust for three point attempts. The motivation is fairly straightforward from a traditional probabilistic argument:

**Suppose a player shoots m-for-n from two point range and p-for-q from three point range. The player then has a (m+p)/(n+q) FG% and a (m+1.5*p)/(n+q) eFG%**.

If we multiply FG% by 2 and by the number of attempts (n+q), we obtain 2(m+p) points, which is not the correct point total. However, if we multiply eFG% by 2 and by (n+q), we obtain 2(m + 1.5p) = 2m + 3p points, which is indeed correct.

To fully illustrate, consider the **131-117 Memphis Grizzlies victory over the Atlanta Hawks on Friday, October 19, 2018**. In this game, Jaren Jackson Jr. shot 8-of-12 from the field, 2-of-4 from three, for a total of 18 points. Using the eFG% formula above, Jackson shot (8 + .5*2) / 12 = 9/12, or .75. Multiplying this by 2*12, we obtain his 18 points.
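The arithmetic from the example above fits in a small Python helper (the `efg` function name is mine, not a standard API):

```python
def efg(two_m, three_m, att):
    """Effective field goal percentage: threes weighted 1.5x; att = total FGA."""
    return (two_m + 1.5 * three_m) / att

# Jaren Jackson Jr.: 8-of-12 overall with 2-of-4 from three, i.e. 6 twos and 2 threes.
jjj = efg(6, 2, 12)        # (6 + 3) / 12 = 0.75
points = 2 * jjj * 12      # recovers his 18 points
```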

The reason we make adjustments such as eFG% is to better encapsulate the game. It is indeed true that a three point field goal is worth 50% more than a two point field goal; therefore it makes sense to weight it as such. However, this leads to a rather nuanced question in reverse order: **If a player shoots 40% from the three point line, how many points do we expect that player to score after N attempts? **

The result leads us to what is called an **expected point value, or EPV**. EPV is a common term for modeling points scored using a probability distribution. It can be complex, such as the models shown at the Sloan Sports Analytics Conference every so often; or it can be primitive, such as looking at end-state results, for example shooting distributions / shot charts. Regardless, for a particular possession, we ask what is the expected number of points scored. For the shooter who shoots 40% from the three point line? We expect him to score **1.2 points per possession**.

The faux pas that occurs here is that analysts tend to get into **moment matching** and immediately suggest that if the shooter instead takes a two point attempt, then they must **shoot 60% **to match the productivity of the three point shooter. This makes sense as the **expected point value** for this field goal attempt is now **1.2 points per possession**. Hence we have a tie.

The only problem is, **a shooter can never score 1.2 points on a single shot**. This is a major misconception. We say misconception because the remainder of this post will prove that, even when we hedge bets to enforce **120 points per 100 possessions** for both shooters, the two distributions identify that one shooter is indeed better than the other… despite the exact same expected point values.

To start, let’s consider a team that only shoots three pointers, where each possession is independent of the last and the shooters are stable. We have the identical set-up for a two point shooting team. We then pit these teams against one another and impose that neither team turns the ball over, grabs offensive rebounds, or draws fouls. We also impose that the teams have the exact same number of possessions. And yes, we impose that the teams’ EPVs are identical.

Due to the equivalence of EPVs, we have that the two-point team shoots **P percent** while the three point team shoots **2P/3 percent**. For example, if the two-point team shoots **60%**, the three-point team shoots **40%**.
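A one-liner confirms the equivalence of the two EPVs under this matching rule:

```python
def epv(fg_pct, shot_value):
    """Expected points per attempt for a shot worth shot_value points."""
    return fg_pct * shot_value

p_two = 0.60
p_three = 2.0 * p_two / 3.0   # the matching rule: 2*P/3 = 0.40
# both attempts are worth 1.2 expected points
```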

From the statistical set-up above, we note that each team follows its own **Binomial distribution with N trials**, with the respective probability of success being its field goal percentage. For simplicity, let’s set **N = 100** possessions. The distributions for the number of field goals made by each team are given by

Now if these two teams play, we can keep track of the score. Let **X** be the number of two point field goals made and **Y** the number of three point field goals made. Then the final score after 100 possessions is **2X – 3Y**.

Under this transformation, we are now able to start writing the probabilities of every score possible in the game. Let’s illustrate this with the three simplest scenarios.

Under a one possession game, we change **N **from 100 to 1. In this case, each team gets one shot at scoring. This results in only four possible outcomes for the final score **X – Y: **

- **0 – 0: Tie**
- **0 – 3: Three Point Team Wins**
- **2 – 0: Two Point Team Wins**
- **2 – 3: Three Point Team Wins**

In the first case, both teams miss their attempts: a 40% chance for Team X and a 60% chance for Team Y, giving a **24% chance of a tie**. Similarly, Team X has only one way to win the possession: make their shot and stop their opponent from scoring. That is a 60% chance of a made two point field goal and a 60% chance of a missed three point field goal, resulting in a **36% chance of Team X winning the possession**.

Unfortunately for Team X, Team Y has two options for winning the possession. All they need to do is score on their one possession. In this case, they have a **40%** chance of scoring. If we’d prefer to carry out the full Binomial structure, we have **.4*.4 + .4*.6**, which is .16 + .24 = .40. Any which way we do the math, **Team Y (Three Point Team) is favored to win the possession.** This is despite the expected point values being identical!
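The single-possession arithmetic above is small enough to check directly:

```python
p2, p3 = 0.60, 0.40            # make probabilities for Teams X and Y

tie    = (1 - p2) * (1 - p3)   # both miss: 0.24
x_wins = p2 * (1 - p3)         # X scores 2, Y misses: 0.36
y_wins = p3                    # Y scores 3 and cannot be caught: 0.40
```

The three cases are exhaustive, so the probabilities sum to one, and Team Y is indeed the favorite at 40%.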

What this tells us, at a cursory level, is that three’s are better than two’s in the single possession. But what about multiple possessions?

In the two possession game, we end up with **NINE** possible outcomes in the game:

- **0 – 0: Tie**
- **0 – 3: Team Y wins**
- **0 – 6: Team Y wins**
- **2 – 0: Team X wins**
- **2 – 3: Team Y wins**
- **2 – 6: Team Y wins**
- **4 – 0: Team X wins**
- **4 – 3: Team X wins**
- **4 – 6: Team Y wins**

In this situation, we have one possibility for a tie, five possibilities for the three point team to win, and three possibilities for the two point team to win. Here we see once again that the three point team has more options to win. Despite this, the probabilities tell a different story.

**Ties:** Once again we only have one way to end up tied. The probability is smaller than in the one possession scenario, since now the teams must miss more attempts. In this case, we have **.6^2** for the three point team to miss both attempts and **.4^2** for the two point team to come away empty handed. We end up with a **.36 x .16 = .0576** chance of a tie. That’s only a 5.76% chance; drastically reduced from the 24% chance in the one possession scenario.

**Twos:** For the two point team to win, we have only three scenarios to work with. The first case is **2 – 0**, which requires the two point team to make exactly one basket and the three point team to miss both of theirs. There are two ways for the two-point team to make their one basket: make the first or make the second, but not both. Therefore the probability of a **2 – 0** victory is **2*.4*.6*.6^2 = .1728**; a 17.28% chance of the score being 2 – 0 after two possessions each. Similarly for **4 – 0**, we have a 12.96% chance; and for **4 – 3**, we have a 17.28% chance. In total, the two point team wins with a probability of **.4752**.

Doing the math, this means that the three point team (Team Y) has only a **.4672 chance of winning**. This indicates that taking two point attempts is more beneficial than taking three point attempts over two possessions; making the two point attempts **more valuable**.

If we break down one more small case, we can consider a three possession game, which results in **16 different outcomes**. To save space, we will leave proving the probabilities of each as a homework exercise:

- **0 – 0: Tie (.013824 chance)**
- **0 – 3: Team Y wins (.027648 chance)**
- **0 – 6: Team Y wins (.018432 chance)**
- **0 – 9: Team Y wins (.004096 chance)**
- **2 – 0: Team X wins (.062208 chance)**
- **2 – 3: Team Y wins (.124416 chance)**
- **2 – 6: Team Y wins (.082944 chance)**
- **2 – 9: Team Y wins (.018432 chance)**
- **4 – 0: Team X wins (.093312 chance)**
- **4 – 3: Team X wins (.186624 chance)**
- **4 – 6: Team Y wins (.124416 chance)**
- **4 – 9: Team Y wins (.027648 chance)**
- **6 – 0: Team X wins (.046656 chance)**
- **6 – 3: Team X wins (.093312 chance)**
- **6 – 6: Tie (.062208 chance)**
- **6 – 9: Team Y wins (.013824 chance)**

Again we see Team Y with the upper hand in number of outcomes with nine winning outcomes; but once again they are on the short end of the stick with a probability of winning being only **.441856**; a meager 44.1856%. Team X’s chances? **.482112**, or 48.2112%. Four full percentage points more likely to win over a three possession game.
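These enumerations generalize: treating each team’s makes as Binomial counts, we can compute the exact win/tie probabilities for any number of possessions. A minimal sketch (the `win_probs` name is mine):

```python
from math import comb

def win_probs(n, p2=0.60, p3=0.40):
    """Exact P(two-point team wins), P(tie), P(three-point team wins)
    after n possessions each, by enumerating Binomial(n, p) make counts."""
    def pmf(k, n, p):
        return comb(n, k) * p**k * (1 - p)**(n - k)
    two = tie = three = 0.0
    for x in range(n + 1):           # two-point makes
        for y in range(n + 1):       # three-point makes
            pr = pmf(x, n, p2) * pmf(y, n, p3)
            if 2 * x > 3 * y:
                two += pr
            elif 2 * x < 3 * y:
                three += pr
            else:
                tie += pr
    return two, tie, three
```

`win_probs(2)` returns (.4752, .0576, .4672) and `win_probs(3)` returns (.482112, .076032, .441856), matching the hand-computed tables above.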

Now the argument goes that the more possessions we have, the closer the probabilities of winning after **N** possessions become. And that’s correct to a point. In fact, the age-old **Central Limit Theorem** tells us that if we have approximately **30 observations, **then we can throw away the Binomial distribution and start using the trusty Gaussian distribution for estimating probabilities of winning. Let’s just jump straight into 100 possessions. To start, let’s **simulate 10,000 games**:

```python
import math
import random

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

sn.set(color_codes=True)

def nCr(n, r):
    f = math.factorial
    return f(n) / f(r) / f(n - r)

twowins = 0
threewins = 0
twoscores = []
threescores = []
numPoss = 100

for i in range(10000):
    twoscore = 0
    threescore = 0
    for j in range(numPoss):
        r1 = random.random()
        r2 = random.random()
        if r1 < .60:          # two-point team converts at 60%
            twoscore += 2.
        if r2 < .40:          # three-point team converts at 40%
            threescore += 3.
    if twoscore > threescore:
        twowins += 1.
    if twoscore < threescore:
        threewins += 1.
    twoscores.append(twoscore)
    threescores.append(threescore)

sn.kdeplot(np.array(twoscores))
sn.kdeplot(np.array(threescores))
plt.title("Comparison of 100 possessions of 2's and 3's")
plt.xlabel("Points Scored")
plt.ylabel("Frequency")
plt.show()
```

The result gives us the following plot:

We indeed see the Central Limit Theorem trying to take hold as the expected point value hovers right about 120 for both teams. Despite 10,000 simulations, we still see that the three point distribution is considerably shaky while the two point distribution, while apparently tighter and smoother, is slightly biased beyond 120 points; more specifically, to the right of the three point distribution. While we may claim this is merely a sampling issue (i.e., run the simulation again and get a shaky graph biased in the other direction), we can instead compare the distributions directly using the binomial distributions.

To do this, we can write a script to compute the probabilities. All we need to do is properly write the mathematical equation for a two point team defeating a three point team in 100 possessions.

And we can turn this math into a couple lines of code:

```python
prob = 0.     # probability the three-point team wins
probtie = 0.  # probability of a tie
for k in range(numPoss + 1):                                  # two-point makes
    for l in range(int(np.ceil(2. * k / 3.)), numPoss + 1):   # three-point makes
        part = nCr(numPoss, l) * nCr(numPoss, k) * .4 ** (numPoss - k + l) * .6 ** (numPoss - l + k)
        if l > 2. * k / 3.:   # 3l > 2k: the three-point team wins
            prob += part
        else:                 # 3l == 2k exactly: a tie
            probtie += part
```

Running this code, we find that over 100 possessions, the two point team is favored to win with a probability of **.4907** compared to the three point team’s chances of **.4867**. In fact, we won’t see these numbers converge within .0001 until we get upwards of hundreds of possessions; a near impossible feat in the NBA. Central Limit Theorem be damned.

Running this over every possession, we are able to see how convergence works.

Performing the math above, we find explicitly that when taking field goals of **equivalent** expected point value, we end up favoring the team that takes two point field goals. However, there are situations where taking two point field goals is **excessively worse** than taking three point attempts. For example: down ten points with four possessions remaining.

In this situation, the two point team is **guaranteed to lose**. Whereas the three point team musters a **.00065536** chance of winning the game. It’s not great, but it’s better than Team X’s chances!
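That .00065536 figure is simply .4 to the eighth power: the three-point team must go 4-for-4 while the two-point team goes 0-for-4, since even a single made two would turn the 12-point comeback into a tie at best:

```python
p3_make_all = 0.40 ** 4            # three-point team makes all four: +12 points
p2_miss_all = (1 - 0.60) ** 4      # two-point team comes up completely empty
win = p3_make_all * p2_miss_all    # 0.4 ** 8 = 0.00065536
```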

The elephant in the room that has not yet been discussed is the practicality of the statistical model. We started with the argument that 60% from two is equivalent to 40% from three, and then showed this is not true once Central Limit Theorem assumptions fail and discreteness creeps in. Still, the question remains: **Are teams really likely to shoot 50% better from two than from three in games?** The short answer is, not really.

In fact, glancing at any shot chart, we find that teams tend to shoot sixty percent from within **three feet of the hoop**. Therefore, simply forcing teams to shoot from beyond three feet while maintaining a near-40% clip from three will almost guarantee victory for a team. And there’s one team that does that: the **Golden State Warriors**.

It actually becomes quite interesting to compare the **frequency** and **efficiency** of each team. We could start reworking the Binomial distribution to impose the percentage of shots from certain locations, better representing the probabilities of winning. Instead of going down this rabbit-hole, we simply note that there’s effectively only a 14 square foot region where teams shoot 60%, while there is approximately a 250 square foot region where teams shoot upwards of 40%.

So while it is easier to generate three point attempts, similar efficiency, when limited to similar frequency, actually leads to deficits. Quite the interesting concept when considering scoring strategy within the NBA: equal EPV does not imply equal probability of winning.

**Question:** Who is an “elite” paint defender?

**Heuristic: **Measure three variables: Field goal percentage in paint, Field goal frequency in paint, Rate of passes out of paint after player drives/cuts into paint.

This isn’t a terrible set-up. The thought process is rather direct as it focuses primarily on **end-state** results. It’s an analysis that helped idealize Roy Hibbert several years ago. And while Hibbert was a great interior presence, his numbers are rather confounded: his Defensive Rating was only at “elite” levels when effectively both Paul George and David West were on court. More importantly, the pressure Hibbert yielded on the interior, combined with the pressure of the guards and other Bigs led to teams taking less efficient field goal attempts.

**(Side Note:** When I was working with an Eastern Conference team, the lead analyst mentioned he would have rather taken Hibbert over Blake Griffin if given the opportunity. The following year, Hibbert went to LAL and I made the bet that Hibbert’s defensive numbers would decrease dramatically over the next couple of years because he no longer had D. West next to him. The comment was tongue-in-cheek, and West was emphasized because George had missed the year due to the leg injury; but the numbers did drop, and the drop has widely been blamed on the shifting tides of the NBA.)

Instead of making this an article about Roy Hibbert, let’s focus on the shifting tides that were becoming clear: Teams were finally adopting long range approaches and more complex offensive schemes to force rim protectors away from the rim. This led to sequences such as Hammer attacks or “Screen-the-Screener” action that would intentionally tangle rim protectors.

From a data science perspective, this becomes a nightmare in confounding players versus the system, i.e. the Jae Crowder problem. **The drives that result in a pass away from the basket?** Are those designed passes to find open players within a complex offensive scheme, or evidence that the rim protector is a legitimate rim protector? **The low frequency of rim two’s?** Is this because the team is using dynamic motion to open more three point attempts? Here, we begin to understand the need for a **random effects model** that begins to separate out the **system**, widely considered a lurking variable for measures such as BPM and RAPM, from the **player**. In this article, we start to look at **identifying the system**.

The challenge with constructing a lurking variable into a feature is due to the inability to measure the lurking variable accurately. For measures such as Player Efficiency Rating, Box Plus-Minus, or Regularized Adjusted Plus-Minus, the values are derived from play-by-play data or synthesized from box score data; without any ability to adjust for play type. This play type (system) becomes effectively the reason a player such as Jae Crowder performs well for a 50+ win team in Boston that cannot escape the Eastern Conference Playoffs but then struggles mightily with a 50+ win team in Cleveland that eventually makes the NBA Finals. To this end, these types of models are effectively **Y = g(f(X)), **where **Y** is the resulting measure, **X** is the player activity, **f **is the function that measures the player, and **g** is the noise model. For RAPM, this model is explicitly given here. In a primitive form, the model is given by

**Rating = Player on Court + Error**

This model is effectively a **first order random effects model**. The first order means that the model only looks at the variables measured and **no interactions**. If RAPM truly cared about pairs of players, we’d see an explosion of variables where the 1’s and 0’s are multiplied. In this case we don’t, and we are left with a first order model. The randomness comes from not being able to define in advance which players are on the court; instead, we just sample them as they come. This is different from a **fixed effects** model, where we can identify who is playing, and when, before the game begins.
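To make the first order structure concrete, here is a hedged sketch with made-up stint data: players enter as ±1 indicator columns, and a ridge penalty (the “regularized” in RAPM) shrinks the estimates. All numbers here are invented for illustration, not real stints.

```python
import numpy as np

# Hypothetical stint data: each row is a stint; each column a player.
# +1 if the player is on court for the offense, -1 for the defense, 0 if off the floor.
X = np.array([
    [ 1,  1,  0, -1, -1],
    [ 1,  0,  1, -1,  0],
    [ 0,  1,  1,  0, -1],
], dtype=float)
y = np.array([4.0, -2.0, 1.0])   # net rating observed in each stint (made up)

# First order random effects: rating = sum of on-court player effects + error.
lam = 10.0                       # ridge penalty
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
```

A second order model would append columns for products of indicators (player pairs, or play-type-by-player interactions), which is exactly the explosion of variables noted above.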

Our goal at this point is to impose a new feature: the play type. We can then look at the model

**Rating = Play Type + Player on Court + (Play Type x Player on Court) + Error**

and we have ourselves a second order random effects model. We can visualize the play type as a **grouping** or **treatment** in terms of design of experiments; but this now requires us to **cluster play types**, and hence measure the lurking variable. There are two ways to do this: **Mechanical Turk** or **Tracking Data**.

The Mechanical Turk is a method of developing labels by hand instead of by machine. Its name comes from the famous chess playing “machine” from 200+ years ago. It is also the primary method of **Synergy** and **Krossover** data collection. The process is tedious and often flawed. If you haven’t heard the question “Is that really a PnR?” 1000 times when working with Synergy, you haven’t worked with Synergy data enough. However, the process is good enough to produce actionable labels.

The PnR question from above started to cause a lot of problems for analytics departments when the question arose of whether a hip tap from a passing-by “screener,” or an unused “screen” from a “screener” 5+ feet away from the play, should be counted as a pick-and-roll (even shallow cuts were being labeled as PnR’s).

While the simple understanding of identifying a screen becomes a challenge, more complex schemes such as breaking out plays becomes a near impossible task for these methodologies. For starters, to break down a play in the NBA we typically need to at least see the play **three times**. And that’s for experts who can break plays down.

Tracking data instead allows the scientist to apply machine learning techniques to help tease out actions. In this case, we are able to **template** plays based on their spatio-temporal patterns and then cluster the actions. And while this may seem sexy and time saving, this process is too often flawed. If you haven’t heard the question “Is that really a post touch?” enough times when working with **Second Spectrum** markings data, then you haven’t worked with enough Second Spectrum data. Notice a trend here?

Regardless, attempting to identify complex plays using tracking data is also a very difficult task. There have been some public attempts, such as sketching from Andrew Miller; which performs a segmentation of track paths made by players, a functional clustering of segments (treating components as words), and modeling possessions (treating possessions as a topic modeling problem).

It’s a fairly strong attempt, and is fairly on par with my work since interacting with SportVU data with teams from many seasons ago. However, this methodology suffers from the dreaded **time-warping problem**. That is, players run at different speeds along fuzzy paths in the same direction, due to either competency or design. Taking a look at Miller’s paper above, time-warping rears its ugly head when cuts or perimeter motion take effectively anywhere between 2 and 8 seconds.

A key benefit to the procedure, and why it becomes such a strong attempt, is that this is an unsupervised technique; allowing for construction of plays without encoding plays. Along with this unsupervised formulation, interpretability is easily available as tracks are identified as the vocabulary.

The methodology I’ve been using for the better part of five years comes from development on SportVU data with that same aforementioned Eastern Conference team. When it was originally presented to the staff, I probably received the largest glassy-eyed response I’ve ever received in my life. But in the end, it was able to separate out the effect of the defensive system on a player such as **Roy Hibbert** and identify that he was a product of the system; which maximized his talents exceptionally well. And, unfortunately, it’s not as visually cool an application as Andrew’s work above.

The methodology is rather tedious: we start with a collection of unmarked plays and break out their locations at each time step in a binned structure; not unlike the shot locations in the Nonnegative Matrix Factorization procedure for field goal attempts. From there, we have to recognize that we are now victims of two types of alignment problems: play start alignment and time-warping.

For play start alignment, we employ a **Fast Fourier Transform**, treating the positions of the players and the basketball as a **signal** over time. The resulting **power spectrum** can be used to cross-correlate plays to identify differences in start times between two similar plays in the 2D Fourier spectrum. Consider this equivalent to comparing two arbitrary signals over time that carry the same frequency and the same information content with noise, disrupted by a time-delay. If the cross-correlation’s peak is at time zero, the signals are aligned in time. If the peak is offset from zero, the offset is the play misalignment. The width of the cross-correlation is two-fold: “flat” responses are differing plays and “fat” responses are similar plays with time-warping or players performing slightly different actions. Unfortunately, identifying peaked cross-correlations doesn’t help us much unless we mechanical-turk plays in advance and use them as templates. And even then, this is a **global property**: any slight change will flatten and fatten the cross-correlation and leave us with no immediate reason as to why.
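A toy example of the alignment idea, with a one-dimensional stand-in for the player/ball signal (the signal shape and the 17-frame delay are invented for illustration):

```python
import numpy as np

n = 256
t = np.arange(n)
play = np.exp(-0.5 * ((t - 60) / 6.0) ** 2)   # a stylized burst of action
delayed = np.roll(play, 17)                   # the same play, started 17 frames later

# Circular cross-correlation via the FFT; the peak location is the start-time offset.
xcorr = np.fft.ifft(np.fft.fft(delayed) * np.conj(np.fft.fft(play))).real
offset = int(np.argmax(xcorr))                # recovers the 17-frame delay
```

A sharp peak at zero means the plays are already aligned; an offset peak gives the shift, and a fattened peak signals time-warping or differing actions, as described above.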

For time-warping, we tackle this problem later.

So let’s start understanding this system through the use of a particularly well known strategy: the **Horns** structure. Under this set-up, we will start to break down different Horns plays and apply data science techniques to uncover features that break the plays down.

The Horns offense is a well-known offense that is initialized with a dual screen action towards the ball handler. The “Twist” action is when the ball-handler is screened twice, once by each screener, leading to a zig-zag pattern.

The action is straightforward and commonly used to tangle interior defenders at the free throw line. This will either open up a driving mismatch for the point guard 12-15 feet from the rim, open up a pullup three point attempt from the top of the key, or open up the initial screener underneath the rim.

To illustrate our process, we take a sequence of five Horns Twist plays to the right and plot them on the court. These plays have been subjected to the FFT alignment mentioned above, and the play tends to look very predictable.

If we up this towards twenty five samples, it begins to take a life of its own.

And now we start to see the jumbled mess we expected to see. Don’t ask for 500 of them, it colors almost a third of the court. However, we are able to start mapping out the **tensor** over time:

Applying a tensor decomposition, we start to identify characteristics, or **signatures, **of different styles of play. Here, we apply a **nonnegative CANDECOMP-PARAFAC decomposition**. This allows us to start breaking down the plays into a number of **components**.

For instance, if we settle on **one** component to represent a **Horns Twist**, we end up with a **Sideline component, **a **Baseline component**, and a **Temporal component:**

The sideline action captures components of motion that occur along the sideline direction. Similarly, the baseline action captures motion along the baseline direction. The temporal component identifies the **segments** of time when activity occurs. To reconstruct the play, we focus on the **outer-product** of these components. Taking the outer product, we see a significant amount of activity happening on the right side, right around the perimeter. This component captures the motion of the screens and the ball-handler. In this case, the non-moving shooters have relatively insignificant roles, despite the small blips at 1 and 50 in the baseline action and the small blip at 1 in the sideline action.
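The reconstruction step can be sketched with the outer product directly; the bin counts here (a 47 x 50 half-court grid, 100 time steps) and random nonnegative factors are purely illustrative, not the actual decomposition output:

```python
import numpy as np

rng = np.random.default_rng(0)
sideline = np.abs(rng.normal(size=47))    # factor over bins along the sideline direction
baseline = np.abs(rng.normal(size=50))    # factor over bins along the baseline direction
temporal = np.abs(rng.normal(size=100))   # factor over time steps

# A rank-one component's contribution to the occupancy tensor is the outer product.
component = np.einsum('i,j,k->ijk', sideline, baseline, temporal)
```

Summing these rank-one pieces over all components recovers (an approximation of) the full spatio-temporal tensor.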

More importantly, in the temporal component, we identify the screen actions. The first bump is the first screen, the second bump is the screener chasing the ball-handler and the second screen being set. The slight dip is the second screener approaching the ball-handler.

If we wish to expand on the decomposition and focus on breaking out particular components, we can. However, it should be noted that more components does not necessarily indicate better fit. In fact, if we break down the Horns Twist play with five components, we get seemingly more actions:

Here, we immediately see our rank-one action as the first component in the tensor decomposition. But now we see other activities. What is component 2 capturing? This happens early in the possession, and again late. Its location is primarily focused near the center of the court. Similarly, there are two primary actions in accordance with the sidelines: **this is effectively the initial screener’s role**. In fact, this is his screen action towards the ball handler and subsequent roll to the basket. The temporal component at the end is the motion of the ball-handler entering this region after the second screen.

Component three trims on this exact same action, allowing us flexibility in modeling the pick and roll type action that occurs on the first screen. What ultimately happens is that the collection of these components **captures players’ roles and motion** within an offensive scheme. For a common possession, I tend to use **fifty** components.

**Side Note:** There’s no distinct tried and true way to select the optimal number of components. This is an actual open research problem. Fifty is just a feel-good, warm, fuzzy number.

Alas, that time warping problem is back. Here, we mitigate it by using a dual attack on the temporal component. Using the tensor decomposition, the motion of the players will elicit similar signatures; however, we will see changes in the temporal components associated with the motion. A slow player will have distorted temporal changes. A delayed player will have a shifted temporal component.

At this point, we again appeal to the Fourier transform gods and cross-correlate these signals to find the speed/reaction of a player (fattening) and strategic delay (offset).

Let’s take a look at a subtle wrinkle.

The Horns 4-5 play is a near identical action as the Horns Twist play. The exception is that the secondary screener screens the primary screener.

This secondary screen action frees the original screener and typically sets up a three point attempt, or forces the interior defender to step up, freeing the secondary screener to slip into the lane. In this case, we selected 25 FFT’d pops:

And the associated **wormhole **plot:

As we start to break down the components, we immediately see a different structure. For a single rank-1 decomposition, we obtain a seemingly significantly different result:

We see the two screens like before, but this time they are located in different spots. We see the **identical screen** set at roughly 12 feet along the baseline and about 20 feet out from the basket. However, we see a second screen action that stops short of the first: this is the staggered screen-the-screener action. We also pick up the resulting flare of the primary screener as the rolling drop of the baseline action.

Similarly, the temporal aspect shows that the Horns 4-5 acts as a “smoother” play as two screens can interact simultaneously; as opposed to the Horns Twist that **requires a staggered time-delay screen on the ball-handler.**

Expanding out to five components, we have a similar decomposition of the play:

Comparing these components to the Horns Twist components, we start to see the massive differences between the two plays in the decomposition space.

At this point, we have to make a decision: do we store templates or use an autonomous structure? The former resorts back to a **mechanical Turk** type activity. Here, we use subject matter expertise to design out plays and then collect the plays to **diagnose** a signature. Instead of keying thousands of plays, we merely have to key approximately 200 plays and use the templates going forward. More importantly, we can diagnose specific **actions** within a play. Typically, we piecemeal actions together.

The latter is to aggregate all actions and perform a decomposition with a large rank, or large number of components. This frees us from templating, but requires a significant amount of tender love and care to tease out actions and label them. This is more in the flavor of Miller’s paper above, but potentially runs the risk of developing many false positives: ghost actions that don’t really occur but reduce the noise in the observed tensor.

Despite this, we can then take a team’s actions and merely fit the components to that team. This yields a collection of coefficients that act as weights for the types of plays the team runs. For instance, in the 2015-16 NBA season, the highest weight for the **Los Angeles Clippers** was the **Horns 4-5**. This play, coincidentally, was a bread-and-butter play for the Clippers when run with Chris Paul, Blake Griffin, and DeAndre Jordan. Furthermore (not so coincidentally), when Austin Rivers replaced Paul, the timing was significantly flattened out; indicating that the play took much longer to develop.

It’s indeed a heavy-lifting methodology; but it’s a way for data science to interact with NBA modeling by leveraging tracking data without having to impose heuristically developed features that can be tainted by lurking variables.
