Suppose a game completes and three players post the following stat lines:
- Player A: 31 points, 13 rebounds, 3 assists
- Player B: 20 points, 11 rebounds, 9 assists
- Player C: 20 points, 21 rebounds, 0 assists
Frequently, we ask who was the better player or which player contributed the most to the game. Unfortunately, most questions asked within a front office and coaching staff don’t revolve around who the best player is. While that is indeed an important question, the answer is usually obvious: Russell Westbrook and Paul George are the top scorers and Steven Adams is dominant on the boards for the Oklahoma City Thunder, while LeBron James is the most dominant player on the Los Angeles Lakers. There are already some great tools for evaluating the contribution of a player. By using Dean Oliver‘s points produced model, we can even break down some of the components of points scored, assists, and rebounds into respective amounts of contribution to a win or a loss.
Many of the questions that are asked revolve around style: what style of play does this player have, or, if we take away or limit this component of a player’s game, how are they going to respond? And it’s these questions that start to pose difficulty.
In the example above, it’s not necessarily the points the players generate, but rather: Which player is the rebounder? Which player generates their own attempts? If we can only play two of these players at a time, which players best complement each other? It’s this latter question we will focus on through a real coaching problem I worked for a team. (Note: The subject of that analysis is considerably different than what I am about to present.)
Big Boarder: Drummond in Detroit
For instance, if we were to limit Andre Drummond‘s ability to gather offensive rebounds, we know we will frustrate and severely limit the Detroit Pistons offense. Through January 14th, Drummond had posted 218 offensive rebounds, which means at least 218 extra chances to score. Over the course of 42 games, that’s a potential 6 extra expected points a game on his rebounding alone. In fact, the second chance points break down as follows:
With a total of 230 points scored over the course of the season following a Drummond offensive board, keeping Drummond off the offensive glass gives our team an added 5.47 points (current estimate) in score differential per game.
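The per-game figure above is simple arithmetic; a quick sanity check using the totals quoted in the text:

```python
# Sanity check on the back-of-envelope numbers quoted above.
offensive_rebounds = 218    # Drummond's offensive boards through January 14th
second_chance_points = 230  # points scored following those boards
games = 42

pts_per_game = second_chance_points / games
print(round(pts_per_game, 2))  # roughly 5.5 extra points per game, in line with the 5.47 estimate
```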
The question then becomes: if Drummond is kept off the glass, who is more likely to pick up the rebounding? Ideally, it’s our team. Realistically, someone will step up for the Pistons. And here lies our quest for understanding.
Our first guess is Blake Griffin. He gets about one offensive rebound a game and plays a perimeter style; however, he is Detroit’s second-leading rebounder within the starting unit. And despite Griffin’s 49 offensive rebounds, the Pistons average less than one point per offensive rebound, and that includes third-, fourth-, etc. chance points on top of second chances.
We can live with checking Blake Griffin body-to-body if necessary. So what about the second-best overall rebounder, Zaza Pachulia? Well, if we place him side-by-side with Drummond, we expect our small guards to feast, as neither Drummond nor Pachulia play the perimeter. Alternatively, if Pachulia subs in for Drummond, our work is done: the Pistons are now keeping Drummond off the offensive glass by putting him on the bench.
Ditto for Jon Leuer.
Sticking with Blake: Now What?
Now if we stick with Blake Griffin as the primary alternative crashing player, we need to begin to understand the style of Blake Griffin’s play. We can look at Griffin’s basic stats, his on/off numbers, and some of his advanced stats; but we still lack a method of comparing the player. We have seen several situations where Griffin has been on or off the court with Drummond. We can casually count the differentials in rebounding, points scored, and so on. However, with Drummond being tethered by our defense, how many examples of that have we seen? How do we quantify these instances? Do we dare impose a set of “business rules” to define a player and vehemently argue their merits despite never statistically testing those rules? Or do we apply some form of metric-based learning to help understand how a player adjusts to the various roles they must play over the course of the season? I tend to select the latter.
So how do we begin to understand the different facets to Blake Griffin’s game? A simple methodology would suggest we simply apply Euclidean distance metrics and measure the difference between stat categories of Griffin and every other player in the league. This will provide some reasonable results. If we condition on all games where Drummond is limited on rebounds and compare Blake Griffin’s stat-lines across the league, we will indeed find some interesting results.
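A minimal sketch of this naive approach; the names and stat lines below are made up for illustration, not real box scores:

```python
import numpy as np

# Hypothetical per-game stat lines: [points, rebounds, assists].
stat_lines = {
    "Griffin":  np.array([25.0, 8.0, 5.0]),
    "Player X": np.array([27.0, 6.0, 6.0]),
    "Player Y": np.array([12.0, 12.0, 1.0]),
}

# Straight-line (Euclidean) distance between Griffin and every other player.
target = stat_lines["Griffin"]
dists = {name: float(np.linalg.norm(target - line))
         for name, line in stat_lines.items() if name != "Griffin"}

# The smallest distance is declared the "most similar" player under this metric.
closest = min(dists, key=dists.get)
print(closest, round(dists[closest], 2))  # Player X 3.0
```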
For instance, if Drummond’s rebound total is lower than, say, 14 rebounds in a game, we find that not only does Blake Griffin’s scoring go up (he has averaged upwards of 8 3PA per game in recent games), but he begins to match to players like Paul George, Stephen Curry, and James Harden. While Griffin is considered one of the best players in the league, he is not on that echelon this season. Unfortunately, this is what the Euclidean metrics will say. And if you bring this to a coaching staff, you’ll most likely get one of three reactions:
- You’re kidding me, right?
- What a pile of [expletive].
- Hey, thanks for this information, this is really insightful. (Not invited to the next team prep.)
So why did this weird player comparison happen? It’s primarily because the stat-line does not sit in Euclidean space, but rather on a manifold.
Got ‘Em: Lecture on Manifold Learning!
Warning: This is a condensed, liberal walk-through of manifold learning. If intimidated, skip at your own risk!
The primary problem with the above Euclidean solution is that the Euclidean-based metrics assume that rebounds, points scored, and other stats are independent… Which we know is definitely not true. But, if you need an example…
Consider a single possession with a single field goal attempted and zero fouls. This is one of the most common possessions in the league. If this occurs, the possession effectively falls into one of three categories: made field goal, missed field goal with defensive rebound, or other. In this case, other may be an offensive rebound with a turnover, or the end of the period.
Regardless, we either score points with no rebounds or we rebound with no points scored. Unfortunately, those Euclidean metrics assume that you can both score and rebound on the same field goal. Aside: this is another reason why you should never use linear regression on possessions to help make decisions. Ahem… but that’s another story.
This possession actually lies on a manifold! A manifold is a space of points where, given a point on the manifold, the space looks locally Euclidean. The best example I can give is the circle. In this case, consider two points (p and q) on the circle:
If we are interested in the average point between p and q and we compute the Euclidean distance average, we end up with a point not on the circle at all! That’s a huge problem. In basketball terms that suggests a player is expected to score 1.2 points on zero field goal attempts. Yes, it gets that bad.
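A quick numerical check of the circle example, showing the Euclidean average of two on-circle points landing inside the circle:

```python
import math

# Two points on the unit circle, 90 degrees apart.
p = (math.cos(0.0), math.sin(0.0))                  # (1, 0)
q = (math.cos(math.pi / 2), math.sin(math.pi / 2))  # (0, 1)

# The Euclidean average of p and q...
mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)        # (0.5, 0.5)

# ...has radius ~0.707, not 1: the "average" point fell off the manifold.
radius = math.hypot(mid[0], mid[1])
print(round(radius, 3))  # 0.707
```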
Therefore, we need to learn the manifold for which the data sits on and develop the metric. Note: Metric effectively means “distance measuring” function.
For the circle, this is simple arc length. For a small neighborhood of points about the point p, we see the circle is roughly flat. If you don’t agree, try this sub-example: you’re on the Earth with no hills or valleys. It looks flat to you. And no matter where you walk, you rotate about the Earth’s center despite the world always looking flat. That’s a manifold. Your path walking between two far-away cities is no longer flat, but rather an arc. That’s your metric.
Intuitively, we all know this. But for basketball data, what is the manifold?!?! And this is where manifold learning comes into play.
Manifold learning techniques thrive on a very similar concept: assume all the points are locally Euclidean and mathematically impose rules such that some points are viewed as “too far away” to be in the “locally flat” space. Many such methods exist, including ISOMAP, Locally Linear Embedding, and Self-Organizing Maps. A recently popular one is t-distributed Stochastic Neighbor Embedding, or t-SNE.
Despite the implementation being difficult to master, the idea is relatively straightforward. We take a sequence of n points. This may be the set of stat-lines from every player in every game this season. If roughly 8 players play for each of the two teams in each of the 650 games played so far this season, we have roughly 10,400 samples to help us understand the manifold that represents player interactions. Label these points x_1, x_2, …, x_10400. Each point is p-dimensional, where p is the number of statistics we consider in the model.
Note that we might use per 100 possessions to help us normalize for playing time. And we may adjacently attach on average point differential over the possessions played to help encapsulate garbage time (instead of throwing data out).
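As a minimal sketch of the per-100 normalization (a hypothetical helper, not from any particular library):

```python
def per_100(stat_total, possessions):
    """Scale a raw counting stat to a per-100-possessions rate,
    normalizing away differences in playing time."""
    return 100.0 * stat_total / possessions

# e.g. 18 points over 60 possessions on the floor -> 30 points per 100
print(per_100(18, 60))  # 30.0
```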
Given these roughly 10,400 samples, we form the 10,400-by-10,400 matrix of Euclidean distances between each pair of points. From this, we compute the probability that two points are considered to be “local” to one another, using a Gaussian kernel centered at each point:

p_{j|i} = exp( -||x_i - x_j||^2 / 2σ_i^2 ) / Σ_{k ≠ i} exp( -||x_i - x_k||^2 / 2σ_i^2 )

where the bandwidth σ_i is tuned per point so that each neighborhood attains a desired perplexity.
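A rough sketch of this step, assuming a single fixed Gaussian bandwidth for every point (real t-SNE tunes a per-point sigma to hit a target perplexity):

```python
import numpy as np

def conditional_probs(X, sigma=1.0):
    """Row i gives p_{j|i}: the probability that point j is a
    'local' neighbor of point i under a Gaussian kernel."""
    # Squared Euclidean distances between all pairs of points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                 # a point is not its own neighbor
    return P / P.sum(axis=1, keepdims=True)  # row-normalize

# Two nearby points and one far-away point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
P = conditional_probs(X)
print(P[0])  # point 0's probability mass concentrates on its near neighbor, point 1
```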
In t-SNE, we propose a new set of points, y_1, y_2, …, y_n, in ridiculously small dimension, typically 2, onto which we “project” our data points x_1, x_2, …, x_n. We compute the distances between each pair of these projected points and assume they follow a Cauchy distribution (a t-distribution with one degree of freedom, hence the name):

q_{ij} = (1 + ||y_i - y_j||^2)^{-1} / Σ_{k ≠ l} (1 + ||y_k - y_l||^2)^{-1}
We then look at the Kullback-Leibler divergence between the projected points’ distribution and the “Euclidean distribution” we built using the matrix above, and apply gradient descent to minimize it. In doing this, new “projected” points are proposed and the process continues until the Kullback-Leibler divergence converges to some small error.
The resulting points form local neighborhoods that represent the original data set (p-many possessional stats) in a 2-dimensional context. And the cool part is… local distances are effectively preserved. This gives us a clustering approach to players…
Therefore giving us insight into the style of play for every player.
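The whole pipeline is available off the shelf. Here is a toy version using scikit-learn’s TSNE on two synthetic “styles” of stat line; the numbers are fabricated for illustration, not real per-100 data:

```python
import numpy as np
from sklearn.manifold import TSNE

# Two synthetic player "styles": high scorers vs. rebounders,
# with columns [points, rebounds, assists] per 100 possessions.
rng = np.random.default_rng(0)
scorers    = rng.normal([40.0,  8.0, 10.0], 2.0, size=(50, 3))
rebounders = rng.normal([12.0, 25.0,  2.0], 2.0, size=(50, 3))
X = np.vstack([scorers, rebounders])

# Project onto 2 dimensions; perplexity sets the "local neighborhood" size.
emb = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)
print(emb.shape)  # (100, 2) -- one 2-D point per stat line, ready to plot
```

With real data, each row of X would be one player’s stat line from one game, and the resulting scatter plot is where the cluster-reading below begins.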
Simple Example: Points, Rebounds, and Assists
So let’s perform a simple example: one where we can envision the data. In this case, we consider the traditional triple-double stat-line of points, rebounds, and assists. If we sample every single one of these throughout the course of the season, we obtain over 11,000 samples and a horrific plot.
This is the simplest plot we could give. If we start to include more statistics, we can no longer plot the distribution, but rather have to start looking at conditionals. No thanks. This plot, unfortunately, gives us absolutely no indication of player styles.
We may be able to label some points and attempt to perform clustering, but the Euclidean distance problem from above will bite us yet again.
However, if we apply a t-SNE algorithm, we obtain the updated plot:
Immediately we see major clusters form with many minor clusters within the major areas. Looking at this plot, we could make the argument that there are between 4 and 8 major clusters. If we start to append names to the clusters, we start to find specific groupings.
For instance, the upper-right grouping is high scoring (think 30–50 points per 100 possessions, in the high-usage, high team-usage region) with approximately 10 rebounds and 10 assists per 100 possessions. This is predominantly James Harden and Anthony Davis territory, with a few visits from Stephen Curry and one appearance from Kyle Kuzma on the fringe of the group, thanks to his highly efficient 37-point, 8-rebound, 3-assist game; the low assist count is what shifts him to the fringe.
Curiously, the antipodal player isn’t the 0-0-0 player, but rather the rebounders / shot blockers like Andre Drummond. Antipodal here means “on the opposite side of the manifold,” or “as far away as possible.” These players are valuable, but their style is significantly different. And we know that is true. These players are the DeAndre Jordans and Tristan Thompsons of the league.
What this plot breaks down for us is style of play: players that fall near other players share the same style for that particular game. Therefore, the next steps are to start identifying the styles of play. And seeing that I personally don’t want to spend more than 2 hours on this post tonight, I marked two of the minor groups.
Application to Our Problem
So now we return to our problem. Given the t-SNE plot, we are able to mark every game of every player. In this case, we can mark Andre Drummond’s low rebounding games and identify where the other Detroit players fall.
Before we continue, note that we left Stanley Johnson off this list only because he is nowhere close to any of the other players and is scattershot all over the map. His four major appearances in Drummond’s “low rebounding” games are near Reggie Jackson for one, near the center for another, up towards the mallet-looking portion, and slightly above Drummond during the same game.
What this informs us of is that Blake Griffin is most likely to be the rebounder if we begin to eliminate Drummond from the offensive glass. However, we have to keep track of Luke Kennard, as he is effectively the third option for rebounds. As such, the strategy would be to rotate off Kennard to pick up Griffin, if possible, and be ready for Kennard to crash. This may lead to potential low-gravity events when guarding Kennard, making him comfortable enough to stray out to the three-point line and allowing our team to better rebound a potential deep ball coming from Bullock or Jackson.
The good news is, we have insight into the styles. One final point: we performed all this analysis in an effort to determine player styles. Ultimately, this is a qualitative quantification, meaning that we leveraged analytics to get to a point where we needed to summarize the style. While we obtain “closeness” of styles, if we run this algorithm again, we may see styles change ever so slightly depending on the blending of styles. Therefore, proceed with caution: this is a data science tool for exploration and for uncovering new features and relationships within the data, through imposing some form of qualitative markings.