These days are exciting for people interested in statistics. Never before have
scientists been able to use as much data as they can today. In fact, it seems that
scientists are flooded with data; there is not enough time or capacity to
analyze it all. That’s how Big Data was born: data that seems too big,
too fast, or too hard for current technologies to process.
A prime example of how Big Data can be tamed is Google Trends. As many of you
know, using the incredible amount of information that is being gathered every
second by its search engine, Google created a tool named Google Flu Trends to
predict flu activity. The promise was startling: to predict a disease outbreak
in real time. Without any doubt, predicting any event (let alone disease
outbreaks!) in real time would be a great accomplishment. Who would not want
something like this? NO ONE.
But there was a little problem: Google was using that information without a clear
idea of why people were searching for things related to flu. Furthermore, there
is no serviceable theory of how information propagates in
today’s data-flooded world; could it be that social media overreacts to news,
creating the impression that, for example, a flu outbreak is bigger than it really
is? No one can know for sure. What we do know is that Google Flu
Trends failed. According to research done by Harvard University and
Northwestern University, the Google Flu Trends prediction system overestimated the
number of influenza cases in the US in 100 out of the 108 weeks starting in
August 2011.
Why do I mention this? Because it is important to have a
causal model (or a theory of change) before you analyze any kind of data. Without
a sound model describing how and why information is searched for and propagated, it
is easy to get your estimates wrong.
Consequently, before analyzing the nutrition status of children and women in our study, it was
important to first understand the direct causes of malnutrition. To illustrate
this, look at the following picture (taken from The Lancet):
Conceptual Model of Pathways to Death and Disability
In this conceptual model, you can see that breastfeeding practices are not
directly linked to malnutrition: they affect weight through the number of
infections. But when we started analyzing our data, we found that breastfeeding
was highly correlated with malnutrition. Specifically, we found that, all else
being equal, a child who is breastfed is 16.6% more likely to be underweight,
28.3% more likely to be severely underweight, and 12.9% more likely to be
stunted. If we had stopped our analysis here, we would have concluded, wrongly,
that breastfeeding was causing malnutrition.
But we knew better thanks to our causal model. We knew that we would have
to look for another variable that varied with breastfeeding. That is, the
relationship that we found between breastfeeding and malnutrition had to be
disguising a true relationship between a third, unknown variable and both
breastfeeding AND malnutrition.
That’s precisely what we found: among children older than six months, breastfed
children also ate solid foods fewer
times a day. Hence, our missing variable was the amount of solid food eaten
by children! Specifically, breastfed children ate solid food only twice a day
(and 8.5% of them were not given any solid food). In contrast, those who were
no longer breastfed ate solid food three times per day.
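To make the idea concrete, here is a small simulation sketch of how a hidden third variable can create this kind of misleading correlation. Everything in it is an assumption chosen for illustration: the data are synthetic (not our study data), the probabilities are made up, and the names `simulate` and `underweight_rate` are hypothetical helpers. In this toy world, breastfeeding has no effect at all on weight; it only goes together with fewer solid meals, and fewer solid meals raise the risk of being underweight.

```python
import random


def simulate(n=40000, seed=0):
    """Generate synthetic (breastfed, meals, underweight) records."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        breastfed = rng.random() < 0.5
        # Assumption: breastfed children are more likely to get only 2 solid meals.
        p_two_meals = 0.8 if breastfed else 0.2
        meals = 2 if rng.random() < p_two_meals else 3
        # Assumption: underweight risk depends ONLY on meals, never on breastfeeding.
        p_underweight = 0.30 if meals == 2 else 0.15
        underweight = rng.random() < p_underweight
        rows.append((breastfed, meals, underweight))
    return rows


def underweight_rate(rows, breastfed, meals=None):
    """Share of underweight children in a group, optionally within one meal level."""
    group = [u for b, m, u in rows
             if b == breastfed and (meals is None or m == meals)]
    return sum(group) / len(group)


rows = simulate()

# Crude comparison: ignores how many solid meals each child gets.
crude_gap = underweight_rate(rows, True) - underweight_rate(rows, False)

# Stratified comparison: compare only children with the same number of meals.
gap_two_meals = underweight_rate(rows, True, 2) - underweight_rate(rows, False, 2)
gap_three_meals = underweight_rate(rows, True, 3) - underweight_rate(rows, False, 3)

print(f"crude gap:           {crude_gap:+.3f}")
print(f"gap within 2 meals:  {gap_two_meals:+.3f}")
print(f"gap within 3 meals:  {gap_three_meals:+.3f}")
```

The crude gap makes breastfeeding look harmful, while the gaps within each meal group shrink toward zero. Comparing like with like in this way is the simplest version of "controlling for" the third variable; a real analysis would typically use a regression with more covariates.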
In sum, before you analyze anything, try to have a theory of change.
Otherwise, you risk drawing the wrong conclusions, because correlation is not causation!