ESPN recently updated their Hunt For October page and introduced probabilities for teams making the playoffs. While ESPN does not give a detailed explanation of how they obtain these probabilities, we can build our own method instead.
To do this, let’s take a simple approach: use past results of games and a little statistical machinery to build a simulation model. First, we take every team and identify their remaining schedule. Not including today, there are a total of 239 remaining Major League Baseball games. We use historical data to estimate the probability of a win for each team in each of the 239 games. We then treat each game as independent and compute the probability of every possible scenario of how those 239 games play out. Each scenario gives a final 162-game record for every team, and its associated playoff tree.
However, if we simply exhaust every possible scenario, there are 2^239 possible outcomes of games, or roughly 8.8 × 10^71. That’s close to 900 million vigintillion, a number 72 digits long. Too many to compute. So let’s do something simple instead: we simulate a large number of scenarios and average over the outcomes. This process is called Monte Carlo simulation (because each game is sampled independently, no Markov chain is needed). Here we’ll go step by step and build our own model of estimated probabilities for every team to make the playoffs.
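As a quick sanity check on the size of that scenario space, here is a minimal Python sketch. It assumes every game has exactly two outcomes (no ties), so the count is 2 raised to the number of remaining games; the variable names are ours.

```python
# Each remaining game has two possible outcomes (baseball has no ties),
# so the number of distinct scenarios is 2 ** (number of games).
REMAINING_GAMES = 239  # figure quoted in the text

scenarios = 2 ** REMAINING_GAMES
digits = len(str(scenarios))

# Print the count in scientific notation along with its number of digits.
print(f"{scenarios:.3e} scenarios ({digits} digits)")
```

Even at a billion scenarios per second, enumerating a number of this size is far out of reach, which is why we sample instead.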
1. Define the Rules for Making the Playoffs
This rule is simple. In order to make the playoffs, an MLB team must either win its five-team division (division champion) or have one of the two best records among the non-division winners in its league (wild cards). If there is a tie, the tied teams play a one-game playoff to determine the division champion or the wild card spots.
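The rule above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name playoff_teams, the "AL East" style division labels, and the toy teams and win totals are all hypothetical, and ties are not resolved by a one-game playoff here.

```python
from collections import defaultdict

def playoff_teams(wins, division_of, wild_cards_per_league=2):
    """Return (division champions, wild cards) from final win totals.

    wins:        team -> final number of wins
    division_of: team -> division label such as "AL East" (league prefix first)
    Ties are NOT handled: max/sorted simply keep the first team encountered.
    """
    by_division = defaultdict(list)
    for team, division in division_of.items():
        by_division[division].append(team)

    # Division champion: the best record in each division.
    champions = {
        division: max(teams, key=lambda t: wins[t])
        for division, teams in by_division.items()
    }

    # Wild cards: the best remaining records within each league.
    winners = set(champions.values())
    by_league = defaultdict(list)
    for team, division in division_of.items():
        if team not in winners:
            by_league[division.split()[0]].append(team)
    wild_cards = {
        league: sorted(teams, key=lambda t: wins[t], reverse=True)[:wild_cards_per_league]
        for league, teams in by_league.items()
    }
    return champions, wild_cards

# Toy example with made-up teams and win totals (not real standings).
division_of = {
    "Team A": "AL East", "Team B": "AL East",
    "Team C": "AL West", "Team D": "AL West", "Team E": "AL West",
}
wins = {"Team A": 95, "Team B": 90, "Team C": 88, "Team D": 87, "Team E": 80}
champions, wild_cards = playoff_teams(wins, division_of)
print(champions)
print(wild_cards)
```

A fuller version would detect ties and simulate the one-game playoff instead of silently picking a team.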
2. Identify the Schedule for Every Team
Each team has between 15 and 18 games remaining on its schedule. This breaks down to about five series of three or four games each, with a couple of teams having one extra single game against an opponent.
3. Build a Sampling Probability for Each Remaining MLB Game
Here we use historical data to build a probability for which team will win each of the remaining 239 MLB games. The simplest models give each team a fifty percent chance of winning every game. That means if the St. Louis Cardinals are playing the Philadelphia Phillies, there is a fifty percent chance that the Phillies will win the game.
Instead, we let the historical data adjust the probability of the Phillies beating the Cardinals. The sample probability for the Phillies to beat the Cardinals is the number of times the Phillies have defeated the Cardinals during the season divided by the number of times the teams have played each other. However, this leads to some inadmissible probabilities.
For instance, the Washington Nationals’ remaining games are against the Marlins, Orioles, Phillies, Braves, Mets and Reds. The combined record for these opponents, scaled to the number of remaining games, is 1051 – 1284 (.450). This remaining schedule gives Nationals fans hope, until you see that the Mets’ remaining schedule has a scaled 1048 – 1197 (.467) and that the Mets hold a 7.5 game lead as of this afternoon (17 September). Despite the Nationals having the easiest schedule for the remaining three weeks of the season, they are 0-5 against the 61-84 Reds. Similarly, the Nationals have a losing record against the Marlins and the Mets. Over half of their remaining NL games are against opponents against whom the Nationals have losing records. So we use the record against each team to indicate the probability of the Nationals defeating each respective team.
However, the sample probability for the Nationals defeating the Reds is 0/5 = .000. This would indicate that the Nationals have no chance of beating the Reds in a game. Since five games is a small sample, we instead apply a correction that pulls the probabilities toward 0.500, pushing them away from the unrealistic boundaries of zero and one.
A common correction is to add 2 to the number of wins and 4 to the number of games played (sometimes called the plus-four adjustment). For the Nationals versus Reds scenario, the adjusted record moves from 0 for 5 to 2 for 9. Hence the adjusted probability for the Nationals defeating the Reds is .222 instead of .000.
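The adjustment described above fits in a one-line function. The sketch below is a minimal illustration; the function name and the pseudo_wins and pseudo_games parameters are our own, with defaults matching the add-2, add-4 rule from the text.

```python
def head_to_head_probability(wins, games, pseudo_wins=2, pseudo_games=4):
    """Adjusted win probability: (wins + 2) / (games + 4) by default.

    The added pseudo-games pull small samples toward 0.500 and keep the
    estimate strictly between 0 and 1.
    """
    return (wins + pseudo_wins) / (games + pseudo_games)

# Nationals versus Reds from the text: 0 wins in 5 games.
raw = 0 / 5
adjusted = head_to_head_probability(0, 5)
print(f"raw = {raw:.3f}, adjusted = {adjusted:.3f}")
```

Note that two teams that have never met get exactly 0.500 under this rule, which is a sensible default when there is no head-to-head history.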
4. Simulate Using a Monte Carlo Process
To simulate the possible outcomes of the 239 games, we take each game and identify the probability of each team winning. For a game between the Nationals and Reds, the Nationals have a 22.22% chance of winning while the Reds have a 77.78% chance of winning. We then draw a uniform random number between 0 and 1 and compare this number to .222. If the number is smaller than .222, the Nationals win the game. Otherwise the Reds win the game.
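A single game draw as described above might look like the following sketch. The function name simulate_game and the seeded generator are our own choices for illustration, and the 2/9 probability is the adjusted Nationals versus Reds figure from the text.

```python
import random

def simulate_game(team_a, team_b, p_a, rng):
    """Return the winner of one game, where p_a is P(team_a wins).

    rng.random() is uniform on [0, 1), so team_a wins with probability p_a.
    """
    return team_a if rng.random() < p_a else team_b

rng = random.Random(2015)  # seeded only so the sketch is reproducible
p_nationals = 2 / 9        # adjusted probability from the text (about .222)

# Repeat the single-game draw many times to check the long-run frequency.
n_draws = 100_000
nationals_wins = sum(
    simulate_game("Nationals", "Reds", p_nationals, rng) == "Nationals"
    for _ in range(n_draws)
)
print(nationals_wins / n_draws)
```

Over many draws the Nationals’ share of wins should settle near .222, which is a useful check that the comparison points the right way.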
We perform this for all 239 games and obtain one sampled set of final standings. Now, this is only one simulation from the astronomically many possible outcomes. So let’s take 10,000,000 repetitions. Each of the ten million results gives us 6 division winners and 4 wild cards. We count the number of division wins for each team and divide by ten million to obtain an estimated probability of winning the division. We estimate the probability of being a wild card the same way. This process is straightforward to implement and takes approximately an hour to run on commodity hardware.
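The counting procedure can be sketched for a single division as follows. Everything specific here is an assumption for illustration: the function name, the toy two-team division, the made-up win totals and schedule, and the coin flip used to break ties in place of a one-game playoff. It also uses far fewer repetitions than the ten million in the text.

```python
import random
from collections import Counter

def estimate_division_odds(current_wins, schedule, n_sims=10_000, seed=0):
    """Estimate each team's probability of finishing first.

    current_wins: team -> wins so far (teams in one division)
    schedule:     list of (team_a, team_b, p_a), with p_a = P(team_a wins)
    """
    rng = random.Random(seed)
    titles = Counter()
    for _ in range(n_sims):
        wins = dict(current_wins)
        # Play out every remaining game independently.
        for team_a, team_b, p_a in schedule:
            winner = team_a if rng.random() < p_a else team_b
            if winner in wins:  # opponents outside the division are not tracked
                wins[winner] += 1
        best = max(wins.values())
        leaders = [team for team, w in wins.items() if w == best]
        # Simplification: a tie is settled by a fair coin flip.
        titles[rng.choice(leaders)] += 1
    # Divide title counts by the number of simulations.
    return {team: titles[team] / n_sims for team in current_wins}

# Toy division with made-up numbers: Team A leads by three with three
# head-to-head games left, each treated as a 50/50 game.
current_wins = {"Team A": 85, "Team B": 82}
schedule = [("Team A", "Team B", 0.5)] * 3
odds = estimate_division_odds(current_wins, schedule)
print(odds)
```

In this toy case Team B needs a sweep and then a coin flip, so its estimate should land near 1/16.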
5. Compute the Average Final Standings
We compute the simulated final standings by taking each team’s average number of simulated wins over the remaining games, rounding to the nearest whole number, and adding it to the team’s current win total. This gives us expected final standings with an associated expected playoff seeding.
For our expected results, we have a playoff scenario of Houston-New York and Pittsburgh-Chicago wild card games, with the winners playing Kansas City and St. Louis, respectively. The other expected playoff match-ups are Toronto-Texas and New York – Los Angeles (Mets-Dodgers). These results are the eventual predicted outcomes, but they do not give probabilities for each team of making the playoffs.
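A minimal sketch of that averaging step is shown below. The function name, the input layout (one list of simulated remaining-game wins per team), and the sample numbers are hypothetical and only illustrate the arithmetic.

```python
def expected_final_wins(current_wins, simulated_remaining_wins):
    """Current wins plus the rounded average of simulated remaining wins.

    simulated_remaining_wins: team -> list with one entry per simulation,
    each entry being the wins that team collected in its remaining games.
    """
    expected = {}
    for team, current in current_wins.items():
        samples = simulated_remaining_wins[team]
        mean_remaining = sum(samples) / len(samples)
        # Python's round() sends exact .5 cases to the nearest even integer.
        expected[team] = current + round(mean_remaining)
    return expected

# Made-up numbers: four simulations of one team's remaining games.
current = {"Team A": 80}
samples = {"Team A": [8, 9, 10, 9]}
print(expected_final_wins(current, samples))
```

Because every team’s total is rounded separately, the expected wins across the league need not add up exactly to the number of games played.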
6. Identify the Simulated Probabilities of Making the Playoffs
To estimate the probability for each team making the playoffs, we look at the simulated final standings for each of the ten million simulations. For instance, the Toronto Blue Jays had 8,240,821 instances of winning the American League East. The New York Yankees had 1,759,179 instances of winning. Even though none of these teams has been mathematically eliminated from the playoffs, there were no instances of the Baltimore Orioles, Tampa Bay Rays, or Boston Red Sox winning the division. This leads to Toronto having an 82.41% chance of winning the division.
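Turning the counts into probabilities is a single division. The sketch below also reports a simulation standard error, which is our addition and not something computed in the text: it assumes the ten million runs are independent, so each count is binomial, and it measures only sampling noise, not any error in the game probabilities themselves.

```python
import math

def estimate_with_error(count, n_sims):
    """Return (estimated probability, binomial standard error)."""
    p = count / n_sims
    se = math.sqrt(p * (1 - p) / n_sims)
    return p, se

N_SIMS = 10_000_000
# Division title counts quoted in the text for the AL East.
al_east_titles = {
    "Toronto Blue Jays": 8_240_821,
    "New York Yankees": 1_759_179,
}

for team, count in al_east_titles.items():
    p, se = estimate_with_error(count, N_SIMS)
    print(f"{team}: {p:.2%} (simulation standard error {se:.4%})")
```

With this many repetitions the sampling noise is on the order of a hundredth of a percentage point, so differences from other projections come from the model rather than from the number of runs.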
That’s a simple six-step method for estimating the probability of each team making the playoffs. Let’s compare our results with ESPN’s listing as of tonight.
As we can see, our probabilities are fairly close, with the exception of the Houston Astros and Minnesota Twins. The source of our discrepancy is relatively obvious: the Angels play six of their remaining 16 games against the Athletics and Mariners, both teams that are below .500. In our system, the Angels have a .550 probability of winning those games. The Angels also have a losing record against the Astros, which most likely pushes us to expect an extra loss or two.
Get ready for an exciting October, as the playoff teams are almost set with less than three weeks to go. Maybe we will see the Minnesota Twins or the Los Angeles Angels channel their inner 2007 Rockies and storm into the playoffs.