When we learn about computational mathematics, software engineering, or data science in general, one of the most important questions we have to answer is how many operations are required. The answer isn’t to show the prowess of our algorithm, but rather identify how quickly the algorithm can perform. We need to know if the answer is obtainable in a reasonable amount of time.
In this post, we take a cursory look at streaming computations and perform a simple application to scheduling in the NBA in an effort to identify teams who have the most compact schedules; meaning they have the least amount of rest between Game 1 and Game 82.
A Real Data Science Problem
Note: This is discussion about real data science problems. If you’re only here for the basketball analysis skip to the next section.
A common (and sometimes forgotten!) problem in data science is the ability to compute statistics. Recall that the definition of a statistic is merely a function of the data. What we mean to compute statistics is the ability perform computations that the computer is able to store in memory. Take for instance a 100 gigabit per second stream of data with the charge of calculating a mean.
If we can compute only 1 gigabit per second stream, we lose 99% of our data. If we’d like to have a response in real time, we’d be forced to perform sublinear computations.
There is no way we can calculate the mean of the data. Suppose we get the mean in one calculation through some fairy tale manner, we are still requiring 100 seconds to compute one second of data. If we need to compute an entire hour’s worth of data, we are stuck waiting for an extra 356,400 seconds for the computation to be complete. That’s and extra four days and three hours. Sorry, MapReduce won’t help you here either (it’s slower).
A sublinear algorithm is akin to random sampling. In this case, we sample data using some random number generator, selecting an adequate amount of data and computing the mean. The idea is straightforward, however several checks need to be performed to ensure the sample is appropriate enough to represent the global mean. We can do that through the process of data forking and working on multiple streams at the expense of requiring less data to perform computations.
Real Time Computations
Now if our data stream is only 1 gigabit per second, we are able to compute the true mean at real time, but only for order 1 calculations. This means we can only perform “one” action to the data as each data point comes in. Recall that the mean is computed by
From the formula above, the time to compute the mean is actually order n. This means we require all n data points to compute the mean. From this viewpoint, we are in trouble. However, computation of the mean is probably the most well-known streaming computation out there. So there’s hope!
In the streaming paradigm, we change the method of computation to be a forked stream. What this means is that we can construct a window of data by building n streams of data. Here, all we need to do is perform a series of three calculations, which has nothing to do with the size of the data stream. The only worry here is memory allocation; which is already assumed to be OK from our first sentence.
Let’s rewrite the mean calculation.
To perform this computation, we simply hold the sum and an number of data points in memory. We can start with an initial fill of all zeros. This means the first n-1 means are biased. But this is alright as we only cared about the mean of n points at a given time.
As we fork the stream for n data points, for the n+1 data point, we subtract the first stream, containing data point x_1, replace the stream element with x_(n+1), and reassign the stream orders. By doing this, we perform three (four) calculations on the data: subtract the oldest data point, add the newest data point, and divide by the sample size. The (four) comment is performed when the initial fill washes out. The update equation is n = n + 1 until the initial fill is removed.
So let’s apply this methodology to a simple NBA problem: scheduling.
NBA Schedules: Schedule Distribution
Opening day for the 2018 NBA season is on October 17th and proceeds for the following 177 days, concluding on April 11th. Over the course of these 177 days, a series of 1230 games are needed to be scheduled. The games are distributed to ensure that all 30 NBA teams have 82 total games with 41 games on the road and 41 games at home.
Of the 82 games, each team must play their own division four times for 16 games. They also must play every team in their opposing conference two times for 30 games. The remaining 36 games are distributed across the remaining 10 teams; resulting in six teams being played four times and the other four teams three times.
Furthermore, the NBA has a required set of nine days off:
- November 23rd: Thanksgiving
- December 24th: Christmas Eve
- February 16th – 21st: All-Star Break
- April 2nd: NCAA National Championship
As all teams need to be mixed, with no other days off for the league, and in such a way that all teams get an acceptable amount of rest. The NBA does a phenomenal job in scheduling, despite the acceptable amount of rest being well-defined across the years.
For the 2018 season, we can walk through the entire 1230 game schedule and construct a 30×177 binary matrix representing the entire schedule. Using the imshow command, we can view the matrix where yellow represents the game played and black represents the day off.
We immediately see the days off in the league as they are black columns. By running across the rows, we can start to measure the compactness of the schedule. A simple, naive, measure would be to simply count the number of days off. However, every team plays 82 games in 177 days. This means every team gets 95 days “off.” Therefore counting days off just doesn’t cut it.
Instead, we become interested in compactness of the schedule. Compactness means that the schedule has accumulation points. In this sense, an accumulation point is finding days where there are many games within a neighborhood of that day. For instance, a team might have two games before a day and two games after a day; seen as a 4-in-5 set of games. That day is an accumulation point.
We can then start to map accumulation points to high volume sections of the schedule. That is, we hunt for short burst games where 66-75% (or more) of the days are played. We will also focus the course of two weeks intervals. This allows us to nest bursts in the schedule.
Nesting is a situation where certain scenarios in the schedule cannot happen unless other scenarios occur. For instance, a back-to-back series of games is a high-volume set of games. However, a 3-in-4 burst of games must require a back-to-back to occur.
Similarly, a 4-in-6 burst of games must require a back-to-back and a 3-in-4 to occur. Let’s illustrate these using binary vectors. Suppose that we have a three game set. Using a one to indicate a game is played, and zero to indicate a game is not played, then all possible 2-in-3 bursts are 011, 101, and 110. This means two of three possibilities can result in back-to-back sets.
But for 3-in-4, there’s always a back-to-back:
- 0111 – Two Back-to-Back’s and a Three-In-A-Row!
- 1011 – Back-to-Back
- 1101 – Back-to-Back
- 1110 – Two Back-to-Back’s and a Three-In-A-Row!
Similarly, for 4-in-6, there’s almost always a 3-in-4:
- 001111 – Three Back-to-Back’s, Two 3-in-3, One 3-in-4, one 4-in-4, and one 4-in-5
- 010111 – Two Back-to-Back’s, one 3-in-3, two 3-in-4, and one 4-in-5
- 011011 – Two Back-to-Back’s, two 3-in-4, and one 4-in-5
- 011101 – Two Back-to-Back’s, a 3-in-3, three 3-in-4, and one 4-in-5
- 011110 – Three Back-to-Back’s, two 3-in-3, two 3-in-4, one 4-in-4, and two 4-in-5
- 100111 – Two Back-to-Back’s, one 3-in-3, one 3-in-4
- 101011 – One Back-to-Back, one 3-in-4
- 101101 – One Back-to-Back, one 3-in-4
- 101110 – Two Back-to-Back, one 3-in-3, two 3-in-4, and one 4-in-5
- 110011 – Two Back-to-Back’s… no 3-in-4
- 110101 – One Back-to-Back, one 3-in-4
- 110110 – Two Back-to-Back’s, two 3-in-4, and a 4-in-5
- 111001 – Two Back-to-Back’s, one 3-in-3, and a 3-in-4
- 111010 – Two Back-to-Back’s, one 3-in-3, two 3-in-4, and a 4-in-5
- 111100 – Three Back-to-Back’s, two 3-in-3, one 3-in-4, one 4-in-4, and a 4-in-5
That’s 15 total options. There is one exception to the 3-in-4 option. That’s the 110011 burst. In this case there must be LOTS of rest days to avoid a 3-in-4. In fact, in order for a 3-in-4 to not exist is to require there to be a 0011001100; which is four games in 10 days. Ridiculous. There’s effectively a 3-in-4.
NBA Cutting Back on High Volume Bursts
Starting in the 2018 season, the NBA announced that 4-in-5 games would be eliminated from the schedule after a series of studies had been produced on the effect of travel, lack of sleep, and game density (compactness). This resulted in a red alert system for games; with a Charlotte Hornets game being cited as one of the worst offenders: a 4-in-5 with a back-to-back to finish the 4-in-5 being a Memphis-Charlotte transfer; resulting in “three hours of sleep.” The Hornets were dispatched by the Detroit Pistons with ease: 112-89.
So let’s analyze the 2018 NBA season!
Processing the Schedule: Streaming Computations
To perform this analysis, we take the entire schedule and walk through every scenario of compactness. We look for all back-to-back games, all 3-in-4, 4-in-6, 5-in-7, and so on. We can do this by rolling through the schedule, or we can perform streaming computations!
To perform the streaming computation, we first construct a window of games. Here, we are interested only in 12 day bursts of windows. Therefore, we build an initial fill of all zeros and walk through the process of the first 12 days of the season. This is October 17th through October 28th.
We then compute a fixed series of computations: add the first two elements of the binary representation. If this is two, we have a back-to-back. Similarly, we add the first four elements and if this is more than three, we have a 3-in-4. We can do this, as there are no 3-in-3’s and 4-in-4’s. We perform this computation similarly for all other high volume bursts through the 12 games.
Once we perform this computation, we add in the new day. Remember in the streaming computation, we just pop off the oldest day and add in the new day at the end. Since the data is small scale, we can hold the windows in a 30×12 matrix. This can be performed in Python using the numpy.roll function and a wipe.
As we walk through the schedule, the window slides across the binary matrix of games and pushes the oldest games to the front of the matrix, where the fixed burst computations are performed. Therefore, we don’t have to hunt any games!
The only catch is the end of the season, where we can either place a terminal fill of zeros, or sweep across the final games. I chose the latter.
By taking in a schedule data file, we can walk through each line, strip the teams, and populate the games window matrix (games) and the season matrix (gameMat). We only populate gameMat in order to display the matrix image above.
# Walk through the scores file. Each line will consist of: # DATE: Winning Team Score Losing Team Score for line in lines: # Strip to ensure no white space messes with the key structure of the team dictionary. winner = line[12:30].strip() loser = line[41:60].strip() day = line[8:10].strip() if day == currentDate: if gameNumber < 12: games[int(teams[winner])][int(gameNumber)] = 1. games[int(teams[loser])][int(gameNumber)] = 1. gameMat[int(teams[winner])][int(gameNumber)] = 1. gameMat[int(teams[loser])][int(gameNumber)] = 1. else: games[int(teams[winner])] = 1. games[int(teams[loser])] = 1. gameMat[int(teams[winner])][int(gameNumber)] = 1. gameMat[int(teams[loser])][int(gameNumber)] = 1. else: if line[11:15] == 'NONE': print 'NO GAMES TODAY!', line[0:10] # New day shifts games. Let's get our counts before we run away! for i in range(30): specials = CountGames(games[i],i,specials) # Work on shifting games; as it is a new day! games = np.roll(games,-1,axis=1) # Must wipe out the last element of every row for i in range(30): games[int(i)] = 0. gameNumber += 1. continue # New Day! gameNumber += 1. currentDate = day if gameNumber < 12: games[int(teams[winner])][int(gameNumber)] = 1. games[int(teams[loser])][int(gameNumber)] = 1. gameMat[int(teams[winner])][int(gameNumber)] = 1. gameMat[int(teams[loser])][int(gameNumber)] = 1. else: # New day shifts games. Let's get our counts before we run away! for i in range(30): specials = CountGames(games[i],i,specials) # Work on shifting games; as it is a new day! games = np.roll(games,-1,axis=1) # Must wipe out the last element of every row for i in range(30): games[int(i)] = 0. # Now add the new guy! games[int(teams[winner])] = 1. games[int(teams[loser])] = 1. gameMat[int(teams[winner])][int(gameNumber)] = 1. gameMat[int(teams[loser])][int(gameNumber)] = 1.
What we see in this code block is that we apply the roll function and then wipe the new column of zeros in order to populate new data. This is a cheat as the data is small. If we performed a purely streaming computation, we hold the row sums for each burst type; search the new teams playing on the new day and add/subtract accordingly. Despite this, you can see how the streaming window interacts.
Performing the above computations, we can then look at a compactness measure. In this case, we perform a simple calculation of counting high volume days, weighted by their volume. In this case, we add all back-to-back games to 75% of 3-in-4 games, 66.6% of 4-in-6 games, to 71.4% of 5-in-7 games, and so on. The higher the number, the more dense games are.
In total there were 425 instances of Back-to-Back games, with each team averaging a total of 14 back-to-back games. The Memphis Grizzlies and Utah Jazz are the worst offenders with 16 Back-To-Back games. Compare this to the Charlotte Hornets, Cleveland Cavaliers, Detroit Pistons, Golden State Warriors, Los Angeles Lakers, Miami Heat, and New Orleans Pelicans, who all have 13 Back-To-Back games; we find that the discrepancy isn’t too high. This is arguable dependent on travel time/distance between games.
Three-In-Four games are fairly common as well. In this case, we find that each NBA team averages 18.9 three-in-four games during the season. As there were a total of 567 of these instances, we found that there was a much larger discrepancy among teams when compared to Back-To-Back games. For the 2018 season, the Memphis Grizzlies have 23 3-in-4 bursts, while the Cleveland Cavaliers have 10. The next lowest total? The Denver Nuggets have 15.
Finally, the NBA season witness zero 4-in-5 bursts!
For 4-in-6 games, we see 611 instances throughout the season. However, recall from the breakdown above, a sequence of 101101 results in a pair of 3-in-4 bursts. This is similar for 4-in-6 situations when we expand the window. So don’t read this as 611 unique scenarios.
Despite this, teams witness 20.4 4-in-6 bursts per season. The worst offender is the Charlotte Hornets with 27. The team with the least? You guessed is… the Cleveland Cavaliers with 14. The Los Angeles Lakers come in second lowest with 15.
The NBA attempted to reduce the number of 5-in-7 games this season, but were unable to eliminate them. Under this scenario, we find there are 35 total instances where the Minnesota Timberwolves and the San Antonio Spurs take the brunt with three of these situations each.
There are eight teams that observe this burst two times during the season: Atlanta, Boston, Chicago, Orlando, Philadelphia, Phoenix, Sacramento, and Toronto. There are thirteen teams with one observance. However, there are seven teams that escape the wrath of 5-in-7 bursts. These teams are Cleveland, Golden State, Indiana, Los Angeles Lakers, New Orleans, New York, and Memphis.
Fortunately, there are no 6-in-8 games this season.
Six-In-Nine games start to become fleeting, as there are 87 instances of these bursts; each team averaging 2.9 6-in-9 bursts. Again, we find a large discrepancy across teams. The Boston Celtics appear in 7 games that are the tail end of a 6-in-9 burst. Phialdelphia (6), Orlando (5), and Minnesota (5) take the brunt with a high number of these scenarios.
There are, however, two teams that avoid the 6-in-9 run. What this means is they ALWAYS get FOUR days rest every 9 days. These two fortunate teams? Cleveland and New York.
The final scenario we look at are 8-in-12 games. This is because there are zero instances of 7-in-9, 8-in-11, and 9-in-12 bursts. In fact, there are only three instances of 8-in-12 bursts. The poor teams who suffer these bursts are the Boston Celtics, Phoenix Suns, and Chicago Bulls.
By computing the compactness score over this sliding window, we find that the Philadelphia 76ers (51.929), Memphis Grizzlies (51.917), and Orlando Magic (51.595) suffer the most from the schedule. In fact, teams with compactness scores over 50 have dense schedules; with the least amount of rest. This includes the top three above as well as the Boston Celtics (50.512), Phoenix Suns (50.179), Charlotte Hornets (50.131), and the Minnesota Timberwolves (50.060).
There is a significant drop off after 50 (to 47). However, there is an exceptional outlier: Cleveland (29.833). This means that Cleveland’s schedule is roughly twice as spread out compared to Philadelphia, Memphis, and Orlando.
The entire table can be viewed here:
Looking at the distribution of scores, we find that Cleveland does indeed have a significant advantage in schedule. Denver finds itself in a similar situation, and this added rest along with their home court advantage may potentially be giving Denver a slight boost in the NBA standings.
2017 NBA Season…
Comparing to the previous season, we find that there are only 170 total days between opening day and the final day of the season. The 2018 season not only witnessed less 5-in-7 bursts and elimination of 4-in-5 bursts; but also obtained one extra week of play. Computing the binary matrix, we find that the is also no NCAA National Game off.
We find that the distribution of compactness is much higher in 2017 than in 2018.
So much higher that the number of Back-To-Back games totaled 486 (425 in 2018), 3-in-4 bursts totaled 714 (567 in 2018) and 4-in-6 bursts totaled 855 (611 in 2018). Similarly, there were 23 total 4-in-5 games in the 2017 season. Also, there were dreaded 7-in-10 bursts for teams in the 2017 NBA season; another non-existent burst in the 2018 season. The Los Angeles Clippers, Philadelphia 76ers, and Orlando Magic all experienced these during the season. And as a note, each of these teams suffered significant injuries during the season.
Continuing this path, the number of 8-in-12 games dropped from 22 (2017) to 3 (2018) while the number of 5-in-7 games dropped from 93 (2017) to 35 (2018).
For the 2017 season, we see almost every team is above the 50 score. The exceptions are Dallas and Brooklyn. Every team, however, dropped in compactness; indicating that every team technically obtains more rest this season than in the previous season. The lone real exception is the Minnesota Timberwolves, who dropped from 50.679 to 50.060. This may become a problem as the Timberwolves are notorious for not resting their players.
As we have seen the NBA help increase rest for the NBA season, we still find kinks in the scheduling format as some teams are generously given days off while others still find themselves in 5-in-7 ruts. So how would you improve scheduling the season? Note that we didn’t even factor in traveling and time-zone changes. This increases the problem with scheduling and rest; making the job of an NBA scheduler an unenviable one.