Google Flu Trends revisited: Improving influenza modelling from search query logs

Wednesday 5th August 2015

Influenza virus. Credit: C. S. Goldsmith and A. Balish, CDC

In July 2014, i-sense joined up with Google to contribute to the earlier global detection of influenza outbreaks.  

Since then, researchers from i-sense, UCL, Google and Harvard University have been developing techniques for modelling influenza from search query data, in order to build a more accurate picture of influenza-like illness (ILI) in the UK. They were able to significantly improve on the original Google models and produce more accurate flu estimates for peak flu seasons in the US (from 2008 to 2013).1

You have probably heard of Google Flu Trends (GFT). It was a platform that displayed weekly estimates of ILI rates around the world. Those estimates were produced by a statistical model that mapped the frequency of Google search queries to official health surveillance reports.

The original GFT model was a good start, but it overestimated ILI rates while it was in use. In a new research paper, the researchers set out to understand the challenges and performance limitations of the original GFT algorithm and to propose improvements to its accuracy.

They adapted the original GFT query selection and modelling methods, first applying a technique known as the Elastic Net to perform more targeted query selection. Different search queries carry different “weights”. These weights indicate how relevant a query is to what we want to find out: in this case, the prevalence of flu. So a query like “flu remedies” is expected to have a larger weight than an unrelated search query.
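To make the idea of query weights concrete, here is a minimal sketch of Elastic Net regression in plain Python. It is an illustration under stated assumptions, not the researchers' implementation: the data are synthetic, the “football scores” query and the values of `alpha` and `l1_ratio` are hypothetical, and the penalty is written in one common parameterisation. A query that tracks the flu rate should receive a large weight, while an unrelated query should be shrunk to zero or close to it.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 part of the penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, alpha=0.1, l1_ratio=0.5, n_iter=200):
    """Coordinate descent for the objective
    (1/2n)*||y - Xw||^2 + alpha*(l1_ratio*||w||_1 + 0.5*(1 - l1_ratio)*||w||_2^2).
    X is a list of rows (weeks) with one column per query; no intercept is fitted."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw, because w starts at zero
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue
            # Correlation of query j with the residual that ignores query j.
            rho = sum(v * (r + v * w[j]) for v, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - v * delta for v, r in zip(col, residual)]
                w[j] = new_w
    return w


# Synthetic, standardised data (an assumption for illustration only).
rng = random.Random(0)
ili = [rng.gauss(0.0, 1.0) for _ in range(52)]  # one year of weekly ILI rates
# Column 0 follows the ILI rate closely; column 1 is unrelated noise.
X = [[rate + 0.1 * rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for rate in ili]
queries = ["flu remedies", "football scores"]  # second query is hypothetical

weights = dict(zip(queries, elastic_net(X, ili)))
print(weights)
```

Queries whose weight ends up at zero are effectively dropped, which is what makes the Elastic Net usable as a query selection step.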

Once relevant queries were automatically identified, the researchers applied a nonlinear regression model on top of groups of them. The main idea was that different groups of related queries might capture different flu-related user behaviours. For instance, the model provided a way to separate awareness of or concern about flu from actually having the disease. When they used their proposed model to evaluate US data, the researchers found that its ILI estimates compared well to flu rates from the CDC (Centers for Disease Control and Prevention).
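One way to picture a nonlinear model built on groups of queries is a kernel method whose kernel is a sum of one component per group. The sketch below uses kernel ridge regression as a simple stand-in; the article does not describe the researchers' model at this level of detail, so the choice of regressor, the grouping, the `ridge` value and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def rbf(a, b, length_scale=1.0):
    """Squared-exponential similarity between two vectors of query frequencies."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))


def grouped_kernel(row_a, row_b, groups):
    """Sum of one RBF kernel per query group, so each group contributes separately."""
    return sum(rbf([row_a[i] for i in g], [row_b[i] for i in g]) for g in groups)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit(X, y, groups, ridge=0.1):
    """Kernel ridge regression: solve (K + ridge * I) coeffs = y."""
    K = [[grouped_kernel(a, b, groups) + (ridge if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    return solve(K, y)


def predict(X_train, coeffs, groups, x_new):
    return sum(c * grouped_kernel(x_new, a, groups) for c, a in zip(coeffs, X_train))


# Synthetic data: columns 0 and 1 stand for "illness" queries, column 2 for an
# "awareness" query. Only illness drives the (nonlinear) target.
rng = random.Random(1)
X, y = [], []
for _ in range(40):
    illness, awareness = rng.uniform(0, 2), rng.uniform(0, 2)
    X.append([illness, illness + rng.gauss(0, 0.05), awareness])
    y.append(illness ** 2)

groups = [[0, 1], [2]]  # hypothetical grouping of the three queries
coeffs = fit(X, y, groups)
low = predict(X, coeffs, groups, [0.5, 0.5, 0.5])
mid = predict(X, coeffs, groups, [1.0, 1.0, 0.5])
high = predict(X, coeffs, groups, [1.5, 1.5, 0.5])
print(low, mid, high)
```

Because each group has its own kernel component, a rise in the awareness query alone does not have to move the estimate in the same way as a rise in the illness queries.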

So what went wrong with the previous Google Flu Trends model? The researchers discovered that the previous GFT model was using queries with an ambiguous or non-existent relation to flu. The most common queries, which were, incorrectly, being used to estimate ILI rates, were “flu symptoms”, “benzonatate”, “symptoms of pneumonia” and “upper respiratory infection”. Aggregating queries about different health conditions led to an over-prediction of ILI rates on many occasions.

The researchers have taken a positive step towards improving estimates of influenza-like illness using Google search queries. In the future, they hope to improve their models even further by bringing together different types of user-generated data, such as social media, search query logs and mobile phone logs.

Each year, seasonal influenza affects many thousands of people in the UK and millions worldwide, producing mild to severe symptoms. In contrast, pandemic influenza is unpredictable and has the potential to cause severe illness and death in large numbers. Web data platforms like GFT give researchers the opportunity to assess a different, much broader part of the population, including people who are missed by traditional health surveillance or who may not use the healthcare system. This could be crucial in rapidly identifying and controlling the next flu pandemic.

1 Lampos, V., Miller, A.C., Crossan, S., Stefansen, C. 'Advances in nowcasting influenza-like illness rates using search query logs', Scientific Reports, 5, 12760 (2015); DOI: 10.1038/srep12760
