Monday, September 1, 2014


Correlation is not Causation!


These are exciting days for people interested in statistics. Never before have scientists had as much data at their disposal as they do today. In fact, scientists seem to be flooded with data; there is neither enough time nor enough capacity to analyze it all as it arrives. That’s how Big Data was born: data that is too big, too fast, or too hard for current technologies to process.

A prime example of an attempt to tame Big Data is Google Flu Trends. As many of you know, Google used the incredible amount of information gathered every second by its search engine to create a tool, Google Flu Trends, to predict flu activity. The promise was startling: to detect a disease outbreak in real time. Without any doubt, predicting any event (let alone a disease outbreak!) in real time would be a great accomplishment. Who would not want something like this? No one.

But there was a little problem: Google was using that information without a clear idea of why people were searching for things related to flu. Furthermore, there is no serviceable theory of how information propagates in today’s data-flooded world. Could it be that social media overreacts to news, creating the impression that, for example, a flu outbreak is bigger than it really is? No one can know for sure. What we do know is that Google Flu Trends failed. According to research by scholars at Harvard University and Northeastern University, the Google Flu Trends prediction system overestimated the number of influenza cases in the US in 100 of the 108 weeks starting in August 2011.

Why do I mention this? Because it is essential to have a causal model (or a theory of change) before you analyze any kind of data. Without a sound model describing how and why information is searched for and propagated, it is easy to get every estimate wrong.

Consequently, before analyzing the nutritional status of children and women in our study, it was important to first understand the direct causes of malnutrition. To illustrate this, look at the following picture (taken from The Lancet):

Conceptual Model of Pathways to Death and Disability


In this conceptual model, you can see that breastfeeding practices are not directly linked to malnutrition: they affect weight only by affecting the number of infections. But when we started analyzing our data, we found that breastfeeding was highly correlated with malnutrition. Specifically, we found that, all else being equal, a child who is breastfed is 16.6% more likely to be underweight, 28.3% more likely to be severely underweight, and 12.9% more likely to be stunted. If we had stopped our analysis here, we would have concluded, wrongly, that breastfeeding was causing malnutrition.

But we knew better thanks to our causal model. We knew that we had to look for another variable that was varying with breastfeeding. That is, the relationship we found between breastfeeding and malnutrition had to be disguising a true relationship between a third, unknown variable and both breastfeeding AND malnutrition.

That’s precisely what we found: among children older than six months, those who were breastfed also ate solid food fewer times a day. Hence, our missing variable was how often children ate solid food! Specifically, breastfed children ate solid food only twice a day (and 8.5% of them were not given any solid food at all). In contrast, those who were no longer breastfed ate solid food three times a day.
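
To see how a third variable can create this kind of misleading correlation, here is a minimal sketch in Python. It does not use our survey data: every number in it (the sample size, the meal distributions, the risk table) is a made-up assumption chosen only for illustration. In this toy world, the risk of being underweight depends only on how many solid meals a child eats, never on breastfeeding, yet breastfed children still look worse off until we compare children who eat the same number of meals.

```python
import random


def simulate(n=20000, seed=0):
    """Generate synthetic children as (breastfed, meals, underweight) tuples.

    All probabilities below are hypothetical, not estimates from any study.
    """
    rng = random.Random(seed)
    # Assumed risk of being underweight, driven ONLY by solid meals per day.
    risk_by_meals = {1: 0.40, 2: 0.30, 3: 0.20, 4: 0.10}
    rows = []
    for _ in range(n):
        breastfed = rng.random() < 0.5
        # Assumption: breastfed children tend to get fewer solid meals.
        meals = rng.choice([1, 2, 2, 3]) if breastfed else rng.choice([2, 3, 3, 4])
        underweight = rng.random() < risk_by_meals[meals]
        rows.append((breastfed, meals, underweight))
    return rows


def rate(rows):
    """Share of children in rows who are underweight."""
    return sum(u for _, _, u in rows) / len(rows)


def crude_gap(rows):
    """Underweight rate of breastfed minus non-breastfed, ignoring meals."""
    bf = [r for r in rows if r[0]]
    nbf = [r for r in rows if not r[0]]
    return rate(bf) - rate(nbf)


def stratified_gaps(rows):
    """Same gap, computed separately within each meals-per-day group."""
    gaps = {}
    for meals in sorted({r[1] for r in rows}):
        stratum = [r for r in rows if r[1] == meals]
        bf = [r for r in stratum if r[0]]
        nbf = [r for r in stratum if not r[0]]
        if bf and nbf:  # skip groups that contain only one kind of child
            gaps[meals] = rate(bf) - rate(nbf)
    return gaps


rows = simulate()
print(f"Crude gap (breastfed - not breastfed): {crude_gap(rows):+.3f}")
for meals, gap in stratified_gaps(rows).items():
    print(f"Gap among children eating {meals} solid meals a day: {gap:+.3f}")
```

The crude gap comes out clearly positive, while the gaps within each meal group hover around zero. Comparing like with like in this way is only one simple way to account for a third variable; a regression that includes the number of meals would serve the same purpose.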

In sum, before you analyze anything, try to have a theory of change. Otherwise, you risk drawing the wrong conclusions, because correlation is not causation!

