Sports Betting Manifesto & Elo Repo

Lately I’ve been drawn to sports betting as a staging ground for developing quantitative strategies for capital markets: it is an easily accessible, barebones market. This comes from the principle that to get an algorithm off the ground you need data, usually in the form of a time-dependent signal, and a way of assessing predictions - in ML terms, these are simply your inputs and testing data. Simple though this may seem, in the case of the financial markets the domain-space for our input vector is overwhelming, with sophisticated models using a wide range of signals as predictors, posing the problem of choosing features before we even begin developing the model.

Moreover, short of buying a Bloomberg terminal, access to these signals is nontrivial, while managing and manipulating any number of high-resolution timestamped signals can be cumbersome in its own right. 

Sports betting, on the other hand, produces a highly publicized source of data in the form of sports coverage and results, while feature selection - although it could be made arbitrarily complicated - generally does not require in-depth knowledge of the field (as might be the case with some economic indicators). Additionally, although exotic options exist (e.g. bets on the number of corners or cards in a soccer match), the most basic target is simple and concise: predict the winner.

In this sense we could view bets on winners as binary options, yielding one of two possible payouts depending on the outcome of an event. Unlike financial contracts, however, the underlying asset has no market value, and thus cannot be priced through replication the way a forward or call option can. This means the value of a bet is purely probabilistic, determined precisely as the expected payout of the agreement.

For the time being, to simplify things, we will ignore the “vig” from bookmakers and assume odds are balanced exclusively by the weight of money, and thus priced by the market. To make money then, we simply need to beat the market in assessing player strength.
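
To make “beating the market” concrete (using notation of my own rather than anything from the post: decimal odds $o$, i.e. the payout per unit staked, and our estimated win probability $p$), the expected profit of a one-unit bet is

$$\mathbb{E}[\text{profit}] = p\,o - 1,$$

so, with the vig ignored as assumed above, a bet carries positive expected value exactly when $p > 1/o$ - that is, when our probability estimate beats the market-implied one.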

So let’s take a look at this market. 

To continue painting in broad strokes, aside from the technical benefits, sports betting is an appealing field in which to apply quantitative methods because it is a fairly inefficient but highly liquid market. Indeed, sports bets are largely made recreationally, without much formal quantitative analysis, exposing bettors to a number of biases that may ultimately affect their predictions. In turn this means these options may not be fairly priced, a fact visible in the different odds offered across sites for the same outcome.

None of this is said to disparage the average sports bettor: the motivations and objectives of these bets are not comparable to the return-maximizing approach of a quantitative betting strategy. Instead, I mean to illustrate the potential for even a basic model to yield a disproportionate edge in the market. 

This is because the underlying dynamics of the sports betting industry are governed by the same principle as the financial markets: uncertainty stemming from incomplete information. Sports matches are incredibly complex interactions predicated on an (uncountable?) number of minute parameters and initial conditions, making it virtually impossible to deterministically model future events more than a few seconds in advance. Instead, we model this complexity with statistics, assigning probabilities to outcomes as a way of aggregating events and casting this computational irreducibility as “random” probabilistic events. 

In this way, the relevance of quantitative modeling in sports betting becomes obvious: as in financial markets, the more precisely and accurately you are able to transform information into predictions, the more readily you are able to capitalize off of market inefficiencies by exploiting priced-in uncertainty. 

So then what do these models look like?


The following is a repository for an Elo model I implemented to predict tennis matches, as a way to kick things off and get a sense of performance. Tennis is a convenient candidate for quantitative prediction for two reasons: 1) singles matches are interactions between two opponents, eliminating the need to model the potentially complicated group dynamics of team sports, and 2) tennis matches have two strict outcomes - a win and a loss (ignoring retirements) - removing the potential for ties. Moreover, of the sports that satisfy these two conditions, tennis is particularly attractive for its large volume of professional matches and active player base, with 66 top-level tournaments scheduled each year by the ATP and ITF. The data was courtesy of Jeff Sackmann.

The one downside to choosing men’s tennis is that it is a sport I know nothing about.

Luckily for me there exists an abundance of mathematical background on modeling competitive zero-sum games - most of the literature concerns chess, to be specific, but for the reasons listed above the leap from chess to tennis is a reasonable one. The prime example is the Elo rating system, developed by Arpad Elo and adopted by the US Chess Federation in 1960. At its core, the system models interactions between opponents in a zero-sum game by estimating expected wins for each player based on their respective Elo ratings (numbers usually between 100 and 2400). In the case of a single game this is equivalent to estimating each player’s win probability, which follows as the expectation of a Bernoulli variable.
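
For reference, the textbook form of that expectation (the repository may use slightly different constants, but the 400-point logistic scale below is the conventional choice) for players A and B with ratings $R_A$ and $R_B$ is

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad E_B = 1 - E_A,$$

which doubles as the model’s estimate of A’s win probability.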

The magic then lies in calculating player ratings from past performance so that they accurately model strength. The Elo system handles this by assigning new players a base rating, awarding points for a win, and subtracting points for a loss; in this way the system iteratively builds ratings from past data, calibrating the model over time. This has the added benefit of capturing movement in players’ skill levels (most notably due to improvement or aging).
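
As a minimal sketch of that iterative procedure (my own illustration, not the repository’s code; the base rating of 1500 and K-factor of 32 are placeholder choices):

```python
def expected_score(r_a, r_b):
    """Elo win expectation for the player rated r_a against the player rated r_b."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def update_ratings(matches, base=1500.0, k=32.0):
    """Run through (winner, loser) pairs in chronological order, giving new
    players the base rating and shifting ratings by K times the 'surprise'
    of each result."""
    ratings = {}
    for winner, loser in matches:
        r_w, r_l = ratings.get(winner, base), ratings.get(loser, base)
        e_w = expected_score(r_w, r_l)           # pre-match win expectation of the eventual winner
        ratings[winner] = r_w + k * (1.0 - e_w)  # actual score 1 vs. expected e_w
        ratings[loser] = r_l - k * (1.0 - e_w)   # zero-sum: the loser gives up the same points
    return ratings
```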

On a technical note, this was part of the motivation for choosing Elo over the Bradley-Terry model or other pairwise-comparison methods that treat strength as a static, intrinsic property of the players. There are computational considerations as well: the BT model relies on maximum likelihood estimation to compute player ratings, which can be precarious for high-dimensional datasets (large numbers of players), whereas the Elo system prescribes a robust, simple iterative update.
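
For contrast, the Bradley-Terry model assigns each player $i$ a fixed positive strength $\pi_i$ and models

$$P(i \text{ beats } j) = \frac{\pi_i}{\pi_i + \pi_j},$$

with all the $\pi_i$ fit jointly by maximum likelihood over the full match history, rather than updated one match at a time.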

A full explanation of the mathematics involved in calculating these expectations and points is provided in the notebook in the above GitHub repository. I’ll only focus on the results in this post.

As a benchmark, I looked at the model’s performance across three metrics and, for the time being, compared these to simple random guessing (a short sketch of how each can be computed follows the list). The three metrics were:

  1. Cross-entropy (CE): At its heart, this is a classification question - we are trying to predict the winner - so it makes sense to score the model using CE. Cross-entropy is non-negative and unbounded above, with lower scores representing better performance.

  2. Accuracy: This is the obvious one, and is simply the fraction of correct guesses made (expressed as a percentage).

  3. Brier Score (BS - in this case identical to the MSE): This metric looks at the calibration of the model, since we are not interested in the accuracy of a single outcome per se, but rather the correctness of the model’s estimated probabilities. That is to say, a true 60% favorite is still expected to lose 4 in 10 matches, so accuracy alone does not tell the full story. BS lies in [0, 1], again with lower scores representing better performance.
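
As promised above, here is a minimal sketch of how the three scores could be computed from predicted win probabilities and observed outcomes (the function name and the 0.5 classification threshold are my own choices, not necessarily the repository’s):

```python
import numpy as np

def score_predictions(p_pred, y_true, eps=1e-12):
    """p_pred: predicted win probabilities for player A; y_true: 1 if A won, 0 otherwise."""
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    y = np.asarray(y_true, dtype=float)
    cross_entropy = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    accuracy = np.mean((p >= 0.5) == (y == 1.0))
    brier = np.mean((p - y) ** 2)
    return cross_entropy, accuracy, brier

# Guessing p = 0.5 for every match gives CE = ln 2 ≈ 0.6931 and Brier = 0.25,
# the random-guessing benchmarks quoted below.
```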

Overall, the model performed acceptably, if somewhat underwhelmingly, notching 65.99% accuracy on the (unseen) test tournament data, considerably above the expected 50% accuracy of random guessing. Gains in BS and CE were much more marginal, however, with the random-guessing benchmarks standing at 0.25 and 0.6931 (ln 2) respectively.

So where might have things gone wrong?

Oftentimes, “inaccuracy” in Elo ratings can be caused by inadequate training data: since all players start from the same base rating, a lack of data means the model is unable to converge onto their true skill values.

To test whether this was the case, I added a method to “jump-start” the convergence by adjusting each player’s base rating to reflect their performance in the data set. I want to note that as a standalone method, this approach of “using” the training data twice is dubious and makes the model prone to amplifying biases in the training data. In this case, however, my main goal was to check convergence: if the system produced roughly the same ratings from a different starting point, that would imply the ratings did in fact approximate a steady-state solution.
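
As a rough illustration of what that jump-start might look like (my own guess at the mechanics, not the repository’s implementation; the offset scale of 400 points is arbitrary):

```python
def jumpstart_ratings(matches, base=1500.0, scale=400.0):
    """Seed each player's starting rating from their raw win rate in the
    training data, centred on the base rating, before running the usual
    iterative Elo updates."""
    wins, games = {}, {}
    for winner, loser in matches:
        wins[winner] = wins.get(winner, 0) + 1
        games[winner] = games.get(winner, 0) + 1
        games[loser] = games.get(loser, 0) + 1
    return {p: base + scale * (wins.get(p, 0) / games[p] - 0.5) for p in games}
```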

Sure enough, this is what I found: the change in initialization resulted in only minor discrepancies in the final ratings, in both ordering and magnitude.

This is mixed news, because 1) it demonstrates that the model converges efficiently on reasonably sized datasets, but 2) it also seemingly implies a ceiling on the efficacy of our model.

To that end, there are a few final adjustments that could be made to boost performance, including tuning hyper-parameters such as the points multiplier used to award players (the K-factor), or introducing additional covariates into our win-expectation estimate (e.g. surface type, a hot-streak multiplier, etc.). These modifications are immensely time-consuming, however, and there’s no guarantee of any meaningful improvement.

More useful is to analyze where our Elo approach may be deficient, in an attempt to develop something altogether new.

One area which could be improved is the lack of a measure of uncertainty in our estimates. Like all statistical estimates, even assuming our model is a perfect fit, the rating is only as good as its uncertainty. Uncertainty here might be caused by insufficient data on an individual player, or a long hiatus since their last tournament. 

Next, looking at our test data, we find that the underdog in ATP points (the player with fewer ranking points) actually beats the favorite in 31% of matches, implying a reasonably high level of volatility in the game. To account for this, we may also wish to add a volatility parameter to the model, measuring how prone a player is to swings in skill.

Both of these additions are in fact established parameters (rating deviation and volatility) in the Glicko-2 rating system, a successor to Elo often used in online competitive games like Dota 2 and CS:GO.

Last, this raises one final consideration to keep in mind going forward: how we want to gear our model to extract edge. In a perfect world, our model would tell us the winner of every match, the margin of each win, and the color of the winner’s underwear; however, this level of accuracy is unrealistic and unhelpful. Instead, the issue of volatility presents two paths we could pursue: 1) focus on stability and shave uncertainty in matches with clear favorites, prioritizing performance on the modal outcome, or 2) identify likely upsets, betting less often but on far out-of-the-money options.

Which path to choose will depend on the true frequency of these events, as well as how feasibly each can be modeled, and will need to be determined with further work; however, if we wish to consistently out-perform the vig, a type-2 model is likely our best bet.