Friday, March 16, 2012

Sports Betting and Data Mining

Several years ago, I did a little experiment with sports betting. I know virtually nothing about sports -- I can tell the difference between horse racing and golf, but that's about it.  But I had a little conjecture about sports betting, and it was easy enough to test.

Certainly, nobody would ever pay attention to any sports-related predictions I might offer, because I don't know anything about sports. This year, I didn't even realize when the Super Bowl was on until I noticed people in the grocery store buying unusual amounts of meat and cheap beer. And the conventional wisdom is that some real expertise -- which I don't have -- is necessary if you're going to predict (e.g.) football games.

So I thought, "What if the only thing that mattered for predicting football outcomes is each team's win-loss record?" How accurate could you be if the only thing you knew was which teams had beaten which other teams that season? Forget "home field advantage", news about whether a player has an injury, the weather, the enthusiasm of the crowd, the point spread, and all that stuff.

Here's what I did.  I downloaded the win/loss record for the NFL prior to the playoffs. I made a list of the teams in a random order. Then I programmed my computer to rearrange the teams on the list so that when team A appeared higher in the list than team B, it was more likely that team A had beaten team B earlier in the season (wonkishly: I used a stochastic local search algorithm). In this way, the teams with the best record tended to be higher up in the list.
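To make the idea concrete, here's a minimal sketch of that kind of stochastic local search. The team names, the toy game list, and the simple hill-climbing scheme (random swaps, keeping any swap that doesn't make the ordering less consistent with the season's results) are all illustrative assumptions -- this isn't the exact program I ran:

```python
import random

def rank_teams(teams, games, iterations=20000, seed=0):
    """Order `teams` so that winners tend to appear above losers.

    `games` is a list of (winner, loser) pairs from the regular season.
    Strategy: repeatedly swap two random teams, and keep the swap
    whenever it doesn't reduce the number of games consistent with
    the ordering (winner listed above loser).
    """
    rng = random.Random(seed)
    order = list(teams)
    rng.shuffle(order)

    def consistency(ordering):
        pos = {team: i for i, team in enumerate(ordering)}
        # A game is "consistent" if the winner sits higher (smaller index).
        return sum(1 for winner, loser in games if pos[winner] < pos[loser])

    best = consistency(order)
    for _ in range(iterations):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        score = consistency(order)
        if score >= best:
            best = score
        else:
            order[i], order[j] = order[j], order[i]  # undo a bad swap
    return order

# Toy season: A beat B and C, and B beat C.
ranking = rank_teams(["C", "A", "B"], [("A", "B"), ("A", "C"), ("B", "C")])
```

With this tiny season there is exactly one fully consistent ordering (A, B, C), and the search finds it; with a real season the win/loss graph usually contains cycles, so the search settles for an ordering that violates as few results as possible.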

Then I looked at the playoff games -- keeping in mind that I had not used the playoffs at all up to this point. I imagined that I was using the list of teams to predict the outcomes of the playoff games and the Super Bowl. So if team A was going to play team B, I'd predict that team A would win if it appeared higher up in the list (and vice-versa for team B). My computer program successfully predicted the winner of every playoff game (and the Super Bowl) except one!
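The prediction rule itself is nothing more than a position lookup in the list; a hypothetical sketch (function and team names are just for illustration):

```python
def predict_winner(ranking, team_a, team_b):
    """Predict that whichever team appears higher in the ranking wins."""
    pos = {team: i for i, team in enumerate(ranking)}
    return team_a if pos[team_a] < pos[team_b] else team_b

winner = predict_winner(["A", "B", "C"], "C", "A")  # "A": it's ranked above "C"
```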

What do I conclude from that little experiment? Maybe nothing. But what I strongly suspect is that the problem with sports predictions is that there's too much data available. It's too easy to find patterns mixed in with the mountain of data, and most of those patterns are coincidence -- they look like they'll predict the outcomes, but they won't hold up over time.

In brief, the problem with prediction isn't that there's not enough data; the problem is that there's way too much data. Here's an analogy. Suppose I give a million ordinary coins to a bunch of monkeys, who each flip their coins ten times. Some of those coins are going to land heads-up ten times in a row -- nearly a thousand of them, most likely, since a million divided by 2^10 is about 977. Now suppose I gather up the coins that landed heads-up every time, and someone asks me to bet that they'll keep landing heads-up. The fact that they happened to land heads-up those ten times doesn't mean anything! Of course some of them landed heads-up every time -- there were so many coins that it would be strange if none had. But that's not because those coins are special.
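You don't have to take my word for the arithmetic; a quick simulation of the monkeys (coin counts and seed are arbitrary choices) shows that ten-heads streaks are routine with a million fair coins, since the expected count is 10^6 / 2^10 ≈ 977:

```python
import random

def count_all_heads(n_coins=1_000_000, flips=10, seed=1):
    """Count how many fair coins land heads on every one of `flips` flips."""
    rng = random.Random(seed)
    streaks = 0
    for _ in range(n_coins):
        # all() short-circuits as soon as a coin comes up tails.
        if all(rng.random() < 0.5 for _ in range(flips)):
            streaks += 1
    return streaks

n_lucky = count_all_heads()  # expected value: 1_000_000 / 2**10, about 977
```

None of those "lucky" coins is any more likely than a fresh coin to come up heads on flip eleven.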

Sports predictions are like that -- there's so much data out there, that of course some of the data will correspond to the win/loss record of some teams. But it's not because of those teams, or something about those statistics; it's because there are so many data points that it would be strange if there weren't any correlations. So the next time someone notices that such-and-such team has always won the third game in a row before a playoff when they're playing at home and the weather is cold, be skeptical.
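The same point can be made directly about junk statistics: search enough random "indicators" and one of them will retrospectively "predict" a season perfectly. A sketch (the season, the candidate count, and the seeds are all made up for illustration):

```python
import random

def find_spurious_predictor(record, n_candidates=100_000, seed=2):
    """Search random binary 'indicators' for one matching `record` exactly.

    With 10 games there are only 2**10 = 1024 possible win/loss patterns,
    so 100,000 random candidates almost surely contain a perfect
    retrospective 'predictor' -- purely by chance.
    """
    rng = random.Random(seed)
    for i in range(n_candidates):
        candidate = [rng.randint(0, 1) for _ in record]
        if candidate == record:
            return i  # index of the junk statistic that 'called' the season
    return None

season = [random.Random(3).randint(0, 1) for _ in range(10)]  # a random 10-game record
match = find_spurious_predictor(season)
```

The matching indicator has zero predictive power for next season, for exactly the same reason the ten-heads coins aren't special.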

Believe it or not, this is actually important. As we all know, our society is collecting exponentially increasing amounts of data on everything: digitized medical records, buying habits, internet browsing patterns, social network structures, genetic data, and on and on. This is the so-called "big data" phenomenon. Big data is a treasure trove that's bound to contain lots of valuable information, and there's a growing industry of people whose job it is to "mine" that data for useful patterns.

One of the very first companies to take advantage of this was Wal-Mart. They have collected enormous, mind-boggling amounts of data on their customers, and have gone so far as to reconstruct people's most likely routes through the store; with that information, they can strategically place certain items in the path of people who are most likely to buy them. Wal-Mart knows how the weather affects people's buying habits, for example, and they have dedicated vast amounts of computational power to analyzing this mountain of data.

Now with genetic information getting cheaper to collect, medical records going digital, an overabundance of financial data, not to mention all the obvious places where data is being collected (Google, Facebook, etc.), it's become increasingly important to find ways to mine that data for practical insights. But the danger of discovering spurious correlations that have no predictive power is correspondingly greater, too.

The moral of the story is that the mixture of big data, powerful computers, and statistics is potentially dangerous. Increasingly, as I read about the economy and financial industry, or advertising, or social network analysis, it looks like the big data and the statistics are taking the place of common sense. The output of the computers is too often taken as conclusive evidence of a trend. But the statistics and correlations themselves won't tell you whether a pattern is spurious or potentially valuable -- after all, an unsophisticated statistical test would suggest that those coins are very likely to be biased toward heads. But you wouldn't want to have to bet your financial future or your health on that prediction.