Consider, for a moment, being a General Manager for an NBA team faced with determining the number of years for a player contract. The problem seems simple: the team requires a certain skill set that a player possesses, and it would like to know how long that player will be able to contribute those skills to the team. This type of problem is known as a **player career arc** problem. The most common phrasing is: **“Are we able to forecast the contribution of a player over the next three years?”**

There are several ways to attack this problem. We can apply regression-type models, time series methods, or even deep learning algorithms. Each method carries both strengths and weaknesses; some of these are **data driven** and others come from **model specification**. For example, suppose we apply a regression method such as RAPM. Here, we suppose that our stint data is Gaussian, which it is not. This means the resulting coefficients are not quite Gaussian, **even though RAPM theory suggests they are**, and we are unable to apply a nice transition model to estimate the next year’s coefficients. While we can still do this and obtain reasonable estimates, they are only **reasonable to an extent** and may not be indicative of the truth.

In this post, we focus on a nonparametric attack and develop a Random Forest model to predict player career arcs. Once we present the methodology, we randomly grab a player and identify their career arc with respect to the NBA players this player most closely resembles. To do this, we use a nice property of random forests: the proximity matrix.

The player selected for this exercise? **Eric Snow**.

## Random Forests: Decision Trees with Bagging and Randomization

### Decision Trees

In order to understand what a random forest is, we first take a look at a decision tree. A **decision tree** is a multi-level partitioning algorithm that chops up our data so that players are clustered according to their traits. For example, suppose that we list players by their positions and use attributes such as **height, weight, rebounds per 36 minutes, three-point field goals attempted per 36 minutes,** and **jersey number**. In this basic exercise, if we have a player who is 6’11, weighs 250 pounds, grabs 14 rebounds and takes 0.3 threes per 36 minutes, and wears jersey number 52, then chances are **we have a center**.

The decision tree may operate in the following manner. Suppose we select an attribute at random; say it is **rebounds per 36 minutes**. We then look for an optimal **splitting point** that separates the rebounders from everyone else. This may well separate centers and power forwards from the wings and guards; however, we may not get strong separation between the post players. Similarly, **Russell Westbrook** may slip through and hang out with the post players for this past season. If we have labels, such as positions, to train on, we can use a special measure of separation such as the **Gini Index** or **Maximum Entropy**. If not, we look for an optimal split point that simply gains separation between a determined set of players.

At the next level we select another attribute at random. This time, suppose **jersey number** is selected. Now there are two sets of players to look at from the previous level: low rebounders and high rebounders. We again split up each group of players into a pair of subgroups by some measure and continue this process.

At each level, a collection of players is called a **node**. The separation of the group of players at a node is called a **split**, with the players being distributed into **leaves**. Trees do not have to have only two leaves at each level; we chose that in our example simply for easy illustration.

For a decision tree, the **root** is the collection of all players. Each level is a **depth** of the tree, consisting of the **leaves** of the previous level. Each of the leaves at the end of a tree is a **terminal leaf**. A nice illustration given by Dr. Mohammad Noor Abdul Hamid identifies splitting on Gender and Height to partition people.

Particularly notice that for the split on height, the splits **do not have to be the same** for each leaf node. That is, female heights are split differently than male heights.

Typically trees have low bias, but they tend to have high variance in their predictions and therefore poor predictive power on their own.
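As a sketch, the toy position example above could be grown with scikit-learn’s `DecisionTreeClassifier`. The players and attribute values below are made up purely for illustration; they are not from the actual data set.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Columns: height (in), weight (lb), REB/36, 3PA/36, jersey number.
# Four hypothetical player profiles, one per label.
X = np.array([
    [83, 250, 14.0, 0.3, 52],   # center-like profile
    [75, 185,  3.5, 6.0,  3],   # point-guard-like profile
    [79, 215,  6.0, 4.5, 23],   # wing-like profile
    [82, 240, 11.0, 0.8, 42],   # power-forward-like profile
])
y = np.array(["C", "PG", "SF", "PF"])

# criterion="gini" uses the Gini Index to choose each split point
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)

# A tall, heavy rebounder who rarely shoots threes lands in the "C" leaf
print(tree.predict([[83, 250, 14.0, 0.3, 52]]))
```

With labeled data this small, the tree partitions the players perfectly; the point is only to show the split-on-attribute mechanic.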

### Bagging

To move from a decision tree to a random forest, we introduce (quickly) the idea of **bagging**. **Bootstrap Aggregation**, or bagging, is a technique to help understand the accuracy of a prediction for a given learning process. The process is relatively simple: First, we take a bootstrap sample from our training set. This is a **sample with replacement**. Next, we fit our model of interest to the bootstrap sample. Call this **model number 1**.

We repeat this process of taking a bootstrap sample and fitting our model to obtain **model number 2**. In this case, model number 2 will be similar to model number 1; however the prediction for a particular input, **x**, may be different for both models. This is what helps us understand the variability associated with a model **without having to rely heavily on distributional assumptions**.

We continue this process until we have obtained **B** many models. Taking the average prediction of these B models yields our **bagging estimate** for that particular input. Not only can we assess variance, we are also able to reduce it by using the bagging estimate as our prediction. If our modeling process is the decision tree model above, this will help us reduce the high variability associated with trees. Combining this idea with the decision trees above, we obtain what we call **random forests**.
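The bootstrap-and-average loop above can be sketched in a few lines. The data here is synthetic, and the base learner is a decision tree regressor as a stand-in for whatever model is being bagged.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=200)

B = 25              # number of bootstrap models to fit
n = len(X)
models = []
for b in range(B):
    idx = rng.integers(0, n, size=n)   # bootstrap sample: draw n rows WITH replacement
    m = DecisionTreeRegressor(random_state=b).fit(X[idx], y[idx])
    models.append(m)

# Predictions for one new input across all B models
x_new = np.zeros((1, 3))
preds = np.array([m.predict(x_new)[0] for m in models])

print(preds.mean())   # the bagging estimate
print(preds.std())    # spread across models gauges prediction variability
```

The spread of `preds` is exactly the model-to-model variability the post describes, obtained without distributional assumptions.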

### Random Forests

Trees are one of the best models when it comes to capturing complex interactions within data. Unfortunately, noise is usually too high to make a good prediction and the tree ultimately becomes an **explanatory tool**. Instead, by introducing the concept of bagging to decision trees, we are able to help reduce the noise associated with fitting (also called **growing**) a tree to data.

A **random forest** is then a collection of decision trees obtained through bootstrap sampling, with each node split on a **randomly selected attribute**. This randomization is key to ensuring that the bootstrapped trees have as little correlation with one another as possible. The process is relatively straightforward:

- For each tree:
  - Draw a bootstrap sample of the data.
  - Grow a decision tree on the bootstrap sample. At each node:
    - Select a subset of attributes at random.
    - Find the best attribute to split on, using the Gini Index or Entropy.
    - Split the node into two leaf-nodes.
- Repeat B times to obtain B trees.
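The steps above map directly onto scikit-learn’s `RandomForestClassifier`, which performs the bootstrap sampling and per-node random attribute subsets internally. The data below is synthetic stand-in data, not the NBA data set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))              # five attributes per "player"
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # a stand-in for position labels

forest = RandomForestClassifier(
    n_estimators=1000,      # B = 1000 trees, matching the post
    max_features="sqrt",    # random subset of attributes tried at each node
    criterion="gini",       # Gini Index chooses the best split
    bootstrap=True,         # each tree is grown on a bootstrap sample
    random_state=0,
).fit(X, y)

print(forest.predict(X[:1]))
```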

Each resulting tree then identifies how to split up the data. If there are labels for each observation, then we take the **majority label** in the leaf to identify the value of that leaf.

In our player example above, our labels are positions: **PG, SG, SF, PF, C**. Suppose we use five attribute selections each time. For the first tree it may be **Rebounds, Jersey, Threes, Rebounds,** and **Height**. Notice that rebounds was selected twice! That’s alright. There are then **32 terminal leaves** that contain all the positions. Suppose the first terminal leaf contains **16 players** that are labeled as **PG, PG, PG, PG, PG, PG, PG, PG, SG, SG, SG, SG, SG, SF, SF, PF**. Then the terminal leaf is marked as **PG**.

We can, actually, continue the splitting process until all leaves contain a single label. Either way, for a new player, if we take their attributes and drop them into each of the B trees, we obtain B labels associated to that player. **For classification**, the majority label is the predicted label of the player. **For regression**, we simply average the outputs. So how do we compare players?

### Proximity Matrices

A **proximity matrix** is a player-by-player comparison that counts **how similar** (or proximal) **two players are**. In this case, we define proximity as **the number of trees in which two players land in the same terminal leaf**. If we consider every NBA player that started after 1980, we leverage all 2630 players to obtain a 2630×2630 matrix. The **(i,j) entry** of this matrix identifies how **close two players are according to the model**; in this case, how many terminal leaves are shared.

If we perform a redundant exercise and train on all players, the diagonal of this matrix will be **exactly the number of trees**. If two players are as opposite as humanly possible, then the value at their row and column intersection is **zero**, meaning they are not close at all.
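scikit-learn does not expose proximities directly, but a minimal sketch can count shared terminal leaves itself via `apply()`. The forest and data here are synthetic; only the counting logic is the point.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = (X[:, 0] > 0).astype(int)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# apply() returns, for every player, the terminal-leaf id in every tree:
# shape (n_players, n_trees)
leaves = forest.apply(X)
n = leaves.shape[0]

proximity = np.zeros((n, n), dtype=int)
for t in range(leaves.shape[1]):
    # +1 for every pair of players sharing a terminal leaf in tree t
    same = leaves[:, t][:, None] == leaves[:, t][None, :]
    proximity += same

# A player always shares a leaf with themselves in every tree,
# so the diagonal equals the number of trees (200 here).
print(proximity[0, 0])
```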

## Applying Random Forests to Forecast Players

For our simple exercise, we collected all summary statistics for every NBA player that started their career **after 1979**. Sorry, Kareem, you’re not included. However, **Magic Johnson** and **Larry Bird** are included! The attributes we used were:

**Age, GP, GS, MP, FG, FGA, FG%, 3P, 3PA, 3P%, 2P, 2PA, 2P%, EFG%, FT, FTA, FT%, OREB, DREB, REB, AST, STL, BLK, TO, PF, PTS**

We then took each season and split players off by **years played in the league**. This helped us redefine the 38 seasons’ worth of data into 21 years’ worth of data. Each year represented a year played by a player. For instance, **Year 1** is the collection of **Rookie Seasons** between 1980 and 2017. The final year? That’s **Year 21**, which only includes **Kevin Willis’** and **Kevin Garnett’s** final seasons in the league; 2007 and 2016, respectively.

Once we obtain the 21 files, we apply a random forest to each file, growing 1000 trees for each. This means we have **21,000 trees** over the 21 years!

Next, we consider a player of interest. Suppose they have played three years in the league. We take their first year and drop the attributes for their rookie season into the **Year 1** random forest. This yields a proximity matrix for that player.

Repeat this for years two and three, and we obtain two more proximity matrices. We then add the three proximities together to get an idea of closeness between the player of interest and the other players in the league.

This way, if a player experiences an uptick in their career, they may match weaker players early on but stronger players in their third year. The proximities will capture this. Using the proximities as weights, we then take **Year 4** values for the matched players and compute the predicted stat line for the player of interest!
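The proximity-weighted forecast can be sketched as a weighted average. The proximity counts and Year 4 stat lines below are hypothetical placeholders, not actual model output.

```python
import numpy as np

# Summed proximities between the player of interest and five comparables
prox = np.array([83, 46, 31, 29, 23], dtype=float)

# Those comparables' Year 4 stat lines, e.g. columns [PTS, AST, REB]
year4_stats = np.array([
    [9.5, 4.1, 2.3],
    [11.0, 5.2, 2.0],
    [8.2, 3.0, 2.5],
    [10.1, 6.3, 2.8],
    [7.4, 2.5, 1.9],
])

weights = prox / prox.sum()       # normalize proximities into weights
forecast = weights @ year4_stats  # proximity-weighted average stat line
print(forecast)
```

Because the weights are non-negative and sum to one, each forecasted stat falls between the minimum and maximum of the matched players’ values for that stat.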

Similarly, if a player that the player of interest matches to is no longer in the league, we mark an indicator to identify a **probability that the player of interest will be out of the league**.

Note that half of the players who started after 1979 were out of the league by their fifth year. It would be of interest to identify this probability as we trudge along trying to predict future years.

## Case Study: Eric Snow

Of the 2630 players in our data set, we performed a random selection and obtained **Eric Snow** as our case study. Eric Snow had a curious career after coming out of Michigan State in 1995. Snow was picked up by Seattle and used sparingly in his first three seasons, at roughly an 11-minute-per-game rate. After a trade to Philadelphia in 1998, Snow became more of a presence on the court alongside **Allen Iverson**, dramatically improving his scoring from 3 points a game to 12 points a game despite only tripling his minutes.

The question is: **can we predict his 2004 NBA season using this random forest methodology?**

If we applied traditional time series techniques, we would expect his numbers to increase, as they had over the preceding five-year period. Instead, we apply the random forest methodology in hopes of finding players **similar to Eric Snow** and using their **future years** to **predict Snow’s progression in the league**.

Who are some players that Snow matches to? Here are some **proximity scores:**

- **Milt Palacio** (83 matches)
- **Spud Webb** (46 matches)
- **Antonio Daniels** (31 matches)
- **Doc Rivers** (29 matches)
- **Randy Brown** (23 matches)
- **Scott Skiles** (19 matches)
- **Kevin Johnson** (12 matches)
- **Malik Sealy** (12 matches)
- **Bill Hanzlik** (11 matches)

That’s quite a cast of characters! Note that these are not all the top matching players. There is a total of **380 players that matched to Eric Snow!** Of those players, how many managed to play a ninth year? **Only 186**. That’s slightly under half. Therefore, we say that Eric Snow has roughly a **49% chance of being in the league for a ninth year.** Pressing on, that percentage drives down to **42%** and **37%** for a 10th and 11th year, respectively. Snow managed to play 13 seasons in the NBA.
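Using the match counts above, the ninth-year probability is just the fraction of Snow’s matched players who went on to play a ninth season:

```python
matched = 380        # players that matched to Eric Snow
played_year9 = 186   # of those, how many played a ninth year

p_year9 = played_year9 / matched
print(round(p_year9, 2))   # ~0.49
```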

### Predicting the 9th season…

Now Eric Snow moves on to his ninth season. Here, we use the proximity weights obtained from players like Spud Webb and Doc Rivers. This will give us a **free-flow estimate of stats**. Since coaches control the actual games played, we adjust accordingly.

In this case, the **Philadelphia 76ers** coaching tandem of **Randy Ayers** and **Chris Ford** oversaw Snow playing in all 82 games. Here are the **true stats** compared to the **predicted** stats. **NOTE:** we removed Eric Snow’s seasons from the training data, which means we could not train on Snow’s seasons 9 through 13 to predict season 9. We do maintain statistical integrity!

Here we see that we miss quite a bit in starts and assists; however, we manage to nail down items such as field goals, three-point attempts, free throw percentage, rebounds, and particularly **points scored**.

What this helps show is that players can be approximated relatively well through their proximity to other players in the league.

By the way… **Snow was also predicted to be 30.37 years old**. He was indeed 30 years old for this season.

## We Did Really Well! But Wait…

Note, however, that we can only compare players **given the data used**! This is a very important caveat.

For instance, if a player is injured, they may have poor stats for that given season. Want an example? Look at **Marc Gasol** from a couple years ago with his broken foot. In this case, we may want to impose a new attribute such as **days out with injury**.

Similarly, we used totals. Totals aren’t the greatest statistics to use. Instead, we may wish to manufacture new attributes such as **coaching type, number of possessions played, **or **strength of schedule of opponent**. We may even want to change the entire variable set-up and use **per possession** type stats.

We just have to remember that the quality of output is indicative of the quality of attributes used. How does this old adage go…? **Garbage in, garbage out?**

Also note that we cannot perform this procedure on **rookies**. This is because we **don’t use any pre-NBA data**. In this case, we must obtain features that represent all players coming into the league, as well as those who have already come into the league, **or even attempted to come into the league**, to identify a proximity for eligible players.

## Let’s See What Happens…

To test, let’s take a look at another random player. This time, **Dwight Powell (Dallas Mavericks)**. Powell is heading into his fourth year this season and we are interested in his proposed stats. In this case, we have predicted the following for Powell:

Here, we expect Powell to get roughly the same amount of minutes, however distributed over more games. Due to this, we expect his shooting to decrease, as well as his rebounding and steals. However, we expect his passing and his ability to get to the line to improve.

Let’s see how this plays out!
