Last year, Google Flu Trends blew it, and even Google admitted as much: it over-predicted the occurrence of flu by a factor of almost 2:1. That was good news for the health care system, because if Google’s predictions had been right, we would have had the worst flu season in 10 years.
Here’s how Google Flu
Trends works. It monitors a set of approximately 50 million flu-related search terms for query volume, then compares this against data collected from health care providers on doctor’s visits where influenza-like illness (ILI) is mentioned.
Since the tracking service was first introduced, there has been a remarkably close correlation between the two, with Google’s
predictions typically coming within 1% to 2% of the number of doctor’s visits where the flu bug is actually mentioned. The advantage of Google Flu Trends is that it is available about two weeks
prior to the ILI data, giving a much-needed head start for responsiveness during the height of flu season.
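To make that concrete, here is a minimal sketch of this kind of nowcasting model in Python. The logit-linear form follows the approach Google described when it launched the service; everything else -- the data, the numbers, the function names -- is invented for illustration.

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + np.exp(-x))

# Synthetic history (not real figures):
# q   = share of all searches matching flu-related terms
# ili = share of doctor's visits mentioning influenza-like illness
rng = np.random.default_rng(0)
ili = rng.uniform(0.01, 0.08, size=100)
q = inv_logit(1.2 * logit(ili) + 0.5 + rng.normal(0, 0.1, size=100))

# Fit logit(ili) = beta0 + beta1 * logit(q) by least squares --
# a linear model on log-odds, the form described for Flu Trends.
beta1, beta0 = np.polyfit(logit(q), logit(ili), deg=1)

def nowcast_ili(query_fraction):
    """Estimate the current ILI visit rate from today's query volume."""
    return inv_logit(beta0 + beta1 * logit(query_fraction))

# Today's query share is known immediately; official ILI data lags ~2 weeks.
print(f"estimated ILI rate: {nowcast_ili(0.03):.3f}")
```

Note what the model is: a correlation fitted to history. It knows how queries and ILI rates have moved together, not why.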
But last year, Google’s estimates overshot actual ILI data by a substantial
margin, effectively doubling the size of the predicted flu season.
Correlation Is Not Causation
This highlights a typical trap with big data: We tend to start following the
numbers without remembering what is generating the numbers. Google measures what’s on people’s minds. ILI data measures what people are actually going to the doctor about. The two are
highly correlated, but one doesn’t necessarily cause the other. In 2013, for instance, Google speculated that increased media coverage might have been the cause of the overinflated predictions.
More news coverage would have spiked interest, but not actual occurrences of the flu.
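A toy simulation makes the failure mode concrete. All the numbers below are invented; the point is only that when media coverage inflates search volume while actual illness stays flat, a model calibrated on the old correlation over-predicts by roughly the size of the hype.

```python
import numpy as np

rng = np.random.default_rng(1)
weeks = np.arange(20)

ili = 0.02 + 0.01 * np.sin(weeks / 3)          # actual illness: a mild season
media_buzz = np.where(weeks >= 12, 2.0, 1.0)   # heavy flu coverage from week 12
queries = ili * 1.5 * media_buzz + rng.normal(0, 0.001, 20)

# A model calibrated on the pre-buzz weeks assumes queries ~ 1.5 * ili.
predicted_ili = queries / 1.5

print("mean over-prediction after the media spike: "
      f"{(predicted_ili[12:] / ili[12:]).mean():.1f}x")
```

The model dutifully reports twice as much flu as there is, because twice as many people are talking about it.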
Allowing for the Human Variable
In the case of Google Flu Trends, because it’s using
human behavior as a signal -- in this case, online searches for information -- it’s particularly susceptible to network effects and information cascades. The problem is that these
social signals are difficult to rope into an algorithm. Once they reach a tipping point, they can break out on their own with no sign of a rational foundation. Because Google tracks the
human-generated network effect data and not the underlying foundational data, it is vulnerable to these weird variables in human behavior.
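To see why, consider a classic threshold model of a cascade -- a Granovetter-style sketch with made-up parameters, not anything Google actually runs. Each person starts searching once enough other people already are, and past a tipping point the search volume detaches from the underlying illness entirely.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical social thresholds: each person starts searching for flu
# info once this share of the population is already searching.
thresholds = np.clip(rng.normal(0.3, 0.1, size=100_000), 0.01, 1.0)

def final_search_share(spontaneous):
    """Share of people searching once the cascade settles.

    `spontaneous` is the share who search because they are actually
    sick; everyone else imitates once their threshold is crossed.
    """
    share = spontaneous
    for _ in range(100):
        new = spontaneous + (1 - spontaneous) * np.mean(thresholds < share)
        if abs(new - share) < 1e-6:
            break
        share = new
    return share

for sick in (0.02, 0.08, 0.10):
    print(f"actually sick: {sick:.2f} -> searching: {final_search_share(sick):.2f}")
```

Going from 2% to 8% actually sick barely moves the search volume; the next two points tip the whole population into searching. No regression fitted on calmer seasons will anticipate a jump like that.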
Predicting the Unexpected
A recent article in Scientific American pointed out another issue with
an over-reliance on data models. Google Flu Trends completely missed the non-seasonal H1N1 pandemic in 2009. Why? Algorithmically, Google wasn’t expecting it. In trying to eliminate noise
from the model, Google actually eliminated signals coming during an unexpected time. Models don’t do very well at predicting the unexpected.
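Here is a sketch of that blind spot, using invented data. The stand-in model below knows only the seasonal shape of flu, so an outbreak that peaks in summer is, as far as the model is concerned, noise. This is not Google’s actual algorithm -- just the simplest estimator that shares its assumption that flu follows the calendar.

```python
import numpy as np

weeks = np.arange(156)                                   # three years, weekly
season = 0.02 + 0.015 * np.cos(2 * np.pi * weeks / 52)   # winter peaks

# Fit the seasonal shape on years 1-2 (least squares on sin/cos features).
X = np.column_stack([np.ones(104),
                     np.sin(2 * np.pi * weeks[:104] / 52),
                     np.cos(2 * np.pi * weeks[:104] / 52)])
coef, *_ = np.linalg.lstsq(X, season[:104], rcond=None)

# Year 3: an out-of-season outbreak spikes ILI in the summer (~week 130).
actual = season.copy()
actual[128:136] += 0.04

X3 = np.column_stack([np.ones(52),
                      np.sin(2 * np.pi * weeks[104:] / 52),
                      np.cos(2 * np.pi * weeks[104:] / 52)])
predicted = X3 @ coef

print(f"summer peak, actual:    {actual[128:136].max():.3f}")
print(f"summer peak, predicted: {predicted[24:32].max():.3f}")
```

The fitted curve sails straight through the outbreak, predicting a small fraction of the actual summer peak, because nothing in its training history says flu can happen in June.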
Big Data Hubris
The author of
the Scientific American piece, associate editor Larry Greenemeier, nailed another common symptom of our emerging crush on data analytics: big data hubris. We somehow think the quantitative
black box will eliminate the need for more mundane data collection, such as actually tracking doctors' visits for the flu. As I mentioned before, the biggest problem with this is that the more we rely
on data, which often takes the form of arm’s-length correlated data, the further we get from exploring causality. We start focusing on “what” and forget to ask “why.”
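One practical hedge is to treat the search signal as an input rather than an oracle -- for instance, blending the fast search-based estimate with the most recent official ILI figure, which arrives two weeks late but measures the real thing. A minimal sketch, with a hypothetical blending weight:

```python
def blended_nowcast(search_estimate, last_official_ili, weight=0.5):
    """Combine the fast search-based estimate with the lagged official
    ILI rate. The 0.5 weight is hypothetical; in practice it would be
    tuned on historical nowcast errors."""
    return weight * search_estimate + (1 - weight) * last_official_ili

# During an over-hyped season: searches imply 6%, but the two-week-old
# official figure says 2.5% -- the blend tempers the runaway signal.
print(blended_nowcast(0.06, 0.025))  # 0.0425
```

The lagged data drags the estimate back toward what doctors are actually seeing, which is exactly the “mundane data collection” the black box was supposed to replace.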
We should absolutely use all the data we have available. The fact is, Google Flu Trends is a very valuable tool for health care management. It provides a lot of answers to very pertinent questions.
We just have to remember that it’s not the only answer.