Recently, Dikembe Mutombo, Jo Jo White, and Spencer Haywood were inducted as players into the NBA Hall of Fame. Mutombo, an eight-time NBA All-Star and four-time Defensive Player of the Year, ranks second all-time in blocks and played in the NBA from 1991 through 2009. White, a seven-time NBA All-Star and two-time NBA champion, set a Boston Celtics franchise record with 488 consecutive games played during a career that ran from 1969 through 1981. Haywood, a four-time NBA All-Star and the 1970 ABA Most Valuable Player, is best known for breaking down the NBA's player eligibility barrier in 1970. Each of these players deservedly earned a spot in the NBA Hall of Fame as part of the Class of 2015.

Now the question turns to who will be selected for the Class of 2016. One seemingly obvious answer is Allen Iverson. Fresh off being ruled eligible despite a brief ten-game stint in Turkey in 2010, Iverson seems to be a sure lock for the Hall of Fame given his four scoring titles (’99, ’01, ’02, ’05), eleven NBA All-Star appearances, and 2001 NBA MVP award. Compared to the Class of 2015, Iverson is indeed a sure bet to make the Hall of Fame.

So how do we quantify this certainty? Furthermore, if we had to pick four other deserving players, who would they be? To answer both questions, we turn to the practice of **machine learning**.

Here, we will take basic assumptions and put them into a **decision tree** format that analyzes each variable we include about a player and produces a yes or no response to whether they satisfy some unknown Hall of Fame criteria. Sounds easy, right? It actually is.

First, we take a look at all players that qualify for the Hall of Fame. This set includes all players that have been retired for more than five years from the NBA. Thanks to NBA stats, we were able to obtain regular season, playoff, and all-star statistics for over 2,000 historical NBA players.

Using this list of players, we clean the data by eliminating values that do not cross generation gaps, such as three-point shooting, blocks, and steals. We instead focus on items such as games played, minutes, points, field goal percentage, and rebounds. We also include the number of playoff games and All-Star games played in, the number of points scored in the playoffs, the number of NBA First Team selections, and the number of Most Valuable Player awards.

Once the data is cleaned, we **randomly select 75% of the data**, drawing separately from the previous Hall of Fame inductees and from the pool of eligible players not in the Hall so that both groups are represented. This group of players is called the **training set**, and it is what we use to train our machine learning model.
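The 75% draw from each group can be sketched as follows. This is a minimal illustration, not the code used for the article: the `stratified_split` helper, the `hof` field, and the toy player records are all assumptions made up for the example.

```python
import random

def stratified_split(players, train_fraction=0.75, seed=0):
    """Split players into train/test, sampling each class separately."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (True, False):
        # Pull out one class at a time (Hall of Famers, then non-members).
        group = [p for p in players if p["hof"] == label]
        rng.shuffle(group)
        cut = round(train_fraction * len(group))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Toy data (made up): 8 Hall of Famers and 40 eligible non-members.
players = [{"name": f"hof_{i}", "hof": True} for i in range(8)]
players += [{"name": f"non_{i}", "hof": False} for i in range(40)]

train, test = stratified_split(players)
print(len(train), len(test))  # 36 12
```

Sampling each group on its own keeps the small Hall of Fame group from being left out of either set by chance.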

Using the remaining 25% of the data, called the **test set**, we test how well our model has been trained. We measure how well our machine learning process works by counting the number of Hall of Fame players not predicted, and counting the number of eligible retirees predicted to be Hall of Famers.
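Those two counts can be computed with a few lines. The sketch below is only an illustration: the function name `count_errors` and the toy label lists are hypothetical, not taken from the actual test set.

```python
def count_errors(actual, predicted):
    """Return (missed Hall of Famers, non-members predicted as Hall of Famers)."""
    pairs = list(zip(actual, predicted))
    missed = sum(1 for a, p in pairs if a and not p)
    false_alarms = sum(1 for a, p in pairs if p and not a)
    return missed, false_alarms

# Toy labels (made up): True means Hall of Famer.
actual    = [True, True, False, False, False, True]
predicted = [True, False, False, True, False, True]

missed, false_alarms = count_errors(actual, predicted)
print(missed, false_alarms)  # 1 1
```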

The process we use to classify inductees is a **Random Forest**. This is a classification scheme developed by Leo Breiman at the University of California, Berkeley. The idea is simple. We take input values **(variables)** and randomly partition them.

**FIRST: One Tree In The Forest**

For instance, we randomly select the variable **points**. We then find the value of points that best separates the players into **HIGH** and **LOW**. This value might be 21,284 career points. The best value is the number that most cleanly puts the Hall of Famers into one class and the non-Hall of Famers into the other. We should not expect a clean split. In our data, if points were the first variable selected, the splitting value is 17,700: the **HIGH** category contains 46 Hall of Famers and 9 non-Hall of Famers, while the **LOW** category contains over 50 Hall of Famers but hundreds of non-Hall of Famers.
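One way to search for such a splitting value is sketched below. The article does not say which measure of "best" was used, so the Gini impurity here is an assumption, and the career totals and labels are made-up toy values.

```python
def gini(labels):
    """Gini impurity of a list of True/False labels (0 means pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_threshold(values, labels):
    """Try each midpoint between sorted values; keep the lowest weighted impurity."""
    pairs = sorted(zip(values, labels))
    best_score, best_cut = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        low = [lab for _, lab in pairs[:i]]
        high = [lab for _, lab in pairs[i:]]
        score = (len(low) * gini(low) + len(high) * gini(high)) / len(pairs)
        if score < best_score:
            best_score, best_cut = score, cut
    return best_cut, best_score

# Toy career point totals (made up) and whether each player is in the Hall.
points = [4000, 9000, 12000, 15000, 17000, 18500, 21000, 25000, 30000]
is_hof = [False, False, False, False, True, False, True, True, True]

cut, score = best_threshold(points, is_hof)
print(cut)  # 16000.0
```

Even at the best cut, one non-member lands in the high group, which mirrors the point that the split is not expected to be clean.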

Now that we have split on points, we select another variable at random. In this case, we can indeed get **points** again! If so, we split the **HIGH** category into a **HIGH-HIGH** and a **HIGH-LOW**. This split would be set at 19,700 points, where we find 33 Hall of Famers versus 1 non-Hall member in the **HIGH-HIGH** category. The **HIGH-LOW** category consists of players with point totals between 17,700 and 19,700, a category containing 13 Hall members versus 8 non-Hall members.

We continue this random splitting process until either every remaining group contains nothing but Hall of Famers or nothing but non-Hall of Famers, or we hit a maximum number of variables selected in the process. The result is a tree structure where we can take a player’s variable set and ask, at each split, whether the relevant total is higher or lower than the threshold.

For our example, suppose we had these splits: Points -> Points -> Number of MVP Awards -> Rebounds -> Assists -> Rebounds -> Points -> Number of All-Star Games. We then take a player of interest, say Allen Iverson, and drop his values into this **tree**. In this case, Iverson was high on points; high on the second split of points; low on the split of rebounds for all high-high scoring players; high on assists for low rebounding, high-high scoring players; and so on. You start to see the point. Wherever Iverson finally Plinkos down to, we ask of the training set: **Are there more HOF players here than non-HOF players?** If yes, mark this tree as predicting Iverson is a HOF player.
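Dropping a player down a tree can be sketched with a nested dictionary. This is a simplified illustration: the dictionary layout and the `classify` function are assumptions, only the two points splits described earlier are included, and the LOW leaf counts (50 and 400) are placeholders since the article gives only "over 50" and "hundreds".

```python
def classify(node, player):
    """Follow high/low branches to a leaf, then take the majority of training players."""
    while "leaf" not in node:
        value = player[node["variable"]]
        node = node["high"] if value > node["threshold"] else node["low"]
    hof_count, non_hof_count = node["leaf"]
    return hof_count > non_hof_count

# Each leaf stores (Hall of Famers, non-Hall of Famers) from the training set.
tree = {
    "variable": "points", "threshold": 17700,
    "low": {"leaf": (50, 400)},        # placeholder counts, see note above
    "high": {
        "variable": "points", "threshold": 19700,
        "high": {"leaf": (33, 1)},     # HIGH-HIGH
        "low": {"leaf": (13, 8)},      # HIGH-LOW
    },
}

high_scorer = {"points": 24000}   # hypothetical player
role_player = {"points": 12000}   # hypothetical player

print(classify(tree, high_scorer))  # True
print(classify(tree, role_player))  # False
```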

**NEXT: Forests from the Trees**

A Random Forest is then the following: a collection of trees where the variables are selected at random. If we use 500 trees, we then count how many trees say the given player will be a true HOF candidate. Dividing by the number of trees, we obtain a predicted probability of the player’s chances of making the Hall of Fame.
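The vote counting can be sketched as below. The five "trees" here are hypothetical single-question stand-ins with made-up thresholds, used only to keep the example short; a real forest would use full trees, typically hundreds of them.

```python
def forest_probability(trees, player):
    """Fraction of trees that vote 'Hall of Famer' for this player."""
    votes = sum(1 for tree in trees if tree(player))
    return votes / len(trees)

# Stand-in trees (hypothetical rules and thresholds, not fitted to real data).
trees = [
    lambda p: p["points"] > 17700,
    lambda p: p["all_star_games"] >= 5,
    lambda p: p["mvp_awards"] >= 1,
    lambda p: p["rebounds"] > 8000,
    lambda p: p["playoff_games"] >= 50,
]

# Hypothetical player record.
player = {"points": 20000, "all_star_games": 6, "mvp_awards": 0,
          "rebounds": 3000, "playoff_games": 70}

probability = forest_probability(trees, player)
predicted_hof = probability > 0.5
print(probability, predicted_hof)  # 0.6 True
```

Three of the five stand-in trees vote yes, so the predicted probability is 0.6 and the player is classified as a Hall of Famer.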

**APPLICATION: Identifying Candidates for the Hall of Fame**

Let’s apply this to our collected NBA data. We took in roughly 25 variables of interest: Games Played, Points, Rebounds, Minutes, Assists, Field Goals Attempted, Field Goals Made, Free Throws Attempted, Free Throws Made, Field Goal Percentage, Free Throw Percentage, the same categories for the playoffs, number of MVP awards, number of All-NBA First Team awards, number of All-Defensive First Team awards, number of All-Star appearances, and number of NBA championships.

From this relatively sparse list, the model built on the training set predicted every player in the test set correctly. This should not be expected every time, as some players scored close to the cutoff: a player with a 60% value is classified as being in the Hall because the value is above 50%. Some results were expected; some were intriguing.

**A Look at the Top:**

First we see that Allen Iverson is indeed the first player on the list. After that, we are met with Lou Hudson, Rolando Blackman, Marques Johnson, and Tim Hardaway to round out the top five players.

“Sweet Lou” Hudson is a former Atlanta Hawk, playing with the team from 1966 (then the St. Louis Hawks) until 1977, when he moved on to the Los Angeles Lakers from ’77 through ’79. In his eleven seasons with the Hawks, Hudson collected six consecutive All-Star appearances from 1969 through 1974. Despite his averaging 22.0 points a game, Hudson’s teams continually fell to the Lakers in the Division Finals and, after their move to the Eastern Division in the early ’70s, to the Boston Celtics.

Rolando Blackman is a former Dallas Maverick, playing with the team from 1981 until 1992, when he moved to the New York Knicks. The Mavericks’ best showing with Blackman was the 1988 Western Conference Finals, where his squad fell to the Los Angeles Lakers in seven games. A four-time All-Star (’85, ’86, ’87, ’90), Blackman returned to the conference finals with the New York Knicks in 1993 and eventually reached the NBA Finals in 1994, falling to the Houston Rockets in seven games. Unfortunately, Blackman appeared in only 6 playoff games in the 1994 campaign, the last of his 13-year career.

Fourth on our list is Marques Johnson, a high-impact player who had a relatively short career due to injury. Having played seven seasons with the Milwaukee Bucks (1977 – 1984) and two more seasons with the Los Angeles Clippers (1984 – 1986), Johnson is listed as playing in eleven NBA seasons. However, in his tenth season (1986-87), Johnson suffered a neck injury when he collided with Benoit Benjamin in a November game against the Dallas Mavericks. Attempting a comeback in 1989, Johnson ended his career after merely ten more games. In his short, effectively ten-year career, Johnson made five All-Star teams and helped lead the Bucks to five division titles, continually getting knocked out by the Philadelphia 76ers in the conference semi-finals and conference finals. Johnson averaged 21.0 PPG and 7.5 RPG over his 7 Milwaukee years.

The final member of our five is the third component of Run TMC: Tim Hardaway. With his TMC partners (Mitch Richmond and Chris Mullin) already in the Hall of Fame, Hardaway played six seasons in Golden State before moving to Miami. Hardaway was a game-changer with his UTEP 2-Step, made five All-Star appearances, was an All-NBA First Team player, and was the 7th NBA player to average 20 points and 10 assists in a season (1991-92). However, a significant knee injury at the end of the 1992-93 season robbed him of a year of his prime and left him without the explosiveness that had earned him three All-Star appearances in his first four years in the NBA. The loss of his quick moves pushed Hardaway to the bench and eventually to the Miami Heat, where he regained quality minutes and stats and played two ferocious series against the New York Knicks in the 1997 and 1998 playoffs. Hardaway’s 17.7 points per game, 8.2 assists per game, and 1.6 steals per game push his ranking high.

**Comparison to Current Hall of Famers**

Remember that not all Hall of Fame members make the Hall solely because of their statistics. For example, Arvydas Sabonis is in the Hall of Fame. His numbers are paltry compared to most players: 5,629 points (12.0 PPG), 3,436 rebounds (7.3 RPG), and 964 assists (2.1 APG). Sabonis played seven seasons with the Portland Trail Blazers and never made an All-Star appearance.

Similarly, Drazen Petrovic of the Trail Blazers and New Jersey Nets had a mere handful of seasons. Sabonis had a small number of NBA seasons because he spent fifteen years dominating the European circuit. Petrovic’s career was short for a different reason: his life was tragically taken in a car accident in Bavaria shortly after his fourth NBA season ended.

Both of these players are enshrined in the NBA Hall of Fame not only for being top players in the small number of years they had in the league, but also for the impact they had on European basketball.

Comparing their numbers to players above solely on statistical merit would indicate that the above players qualify for making the Hall of Fame. There are currently 109 members of the Hall of Fame inducted as players. From our models, we should expect Allen Iverson to become member 110.

However, there are 29 other names that qualify as matching criteria based on season stats, playoff stats, and individual awards. What other quantifiable variables would you include?

Hopefully you find the flexibility and interpretation of Random Forests nice to use for learning. We can also use these for identifying similarity between players through the use of **proximities**. But that’s for another day.
