Sunday, December 19, 2010

Of Correlations, Causations and the Divide Therein

An important maxim used in science, or more precisely, in the scientific study of relationships between/amongst variables, is that 'Correlation does not imply Causation'. Indeed, until and unless such causality has been verifiably established through independent means, any attempt to indicate that it does falls under the logical fallacy of questionable cause, cum hoc, ergo propter hoc (Latin for "with this, therefore because of this").

It is important for all to understand this concept - those who are engaged in scientific studies, as well as those who read about and interpret such studies.

Correlation is a statistical relationship between two or more random variables; for simplicity's sake, let's consider two, say, A and B, such that if changes in the values of variable A statistically correspond to changes in the values of variable B, a correlation is said to exist between A and B. This reflects a statistical dependence of A on B, and vice versa, and therefore, statistically-computed correlations can be used in a predictive manner. To pick a completely random example, the epidermal growth factor receptor (EGFR) is expressed on neoplastic cells in colorectal carcinoma. The number of cells expressing EGFR was found to be correlated with the size of the tumor (adenoma), i.e., cells from a larger tumor express more EGFR. Therefore, EGFR expression may be useful as a prognostic biomarker for adenoma progression.

Those who have already identified the problem in this assertion, congratulations! As the paper cautions, although the EGFR pathway is important to colorectal carcinogenesis, it is unknown at this point whether the observed increase in EGFR expression is because neoplastic cells make more EGFR per se for some reason, or because a larger tumor would house numerically more of the cells that are capable of making EGFR. This, as you can understand, is an important distinction, and therefore, the authors conclude correctly that "Further larger studies are needed to explore EGFR expression as a biomarker for adenoma progression."

Such examples abound, all illustrating how correlations can be useful in suggesting possible causal or mechanistic relationships between variables, but more importantly, such statistical interdependence between the said variables is not sufficient for logical implication of a causal relationship. In other words, while empirically A may be observed to vary in conjunction with B, that observation is not enough to assume A causes B.

But what happens when one makes such an erroneous assumption? For starters, by settling on the first possibility listed below, one is then disregarding the four others, any or all of which may be true and account for the correlation.

  1. A may cause B.

  2. B may cause A.

  3. An unknown or uncharacterized third variable C may cause both A and B.

  4. A and B may influence each other in presence or absence of C in a feed-back loop, self-reinforcing type of system.

  5. The two variables, A and B, may change at the same time in the absence of any direct logical or actual relationship to each other, besides the fact that the changes are occurring at the same time - a situation also known as coincidence. A coincidence may allude to multiple, complex or indirect factors that are unknown or too nebulous to ascribe causality to, or may reflect pure, random chance.

Each of these five hypotheses is testable and there are statistical methods available to reduce the occurrence of coincidences. Therefore, the mere observation that A and B are statistically correlated doesn't lend itself to any definitive conclusion as to the existence and/or directionality of a causal relationship between them.
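To make hypothesis 3 concrete, here is a minimal Python sketch. Everything in it is invented for illustration: the coefficients, the sample size, and the variable names `a`, `b`, and `c` are assumptions, not data from any study mentioned in this post. It simulates a hidden variable C that drives both A and B, with no direct link between A and B, and then compares the raw correlation with the correlation that remains once C has been regressed out of both (one common way to compute a partial correlation).

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / (var_x * var_y) ** 0.5

def residuals(y, x):
    """What is left of y after a least-squares straight-line fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return [yi - my - slope * (xi - mx) for xi, yi in zip(x, y)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000                # hypothetical sample size

# C is the hidden third variable; A and B each depend on C plus their own
# independent noise. A never enters the formula for B, or vice versa.
c = [rng.gauss(0, 1) for _ in range(n)]
a = [2 * ci + rng.gauss(0, 1) for ci in c]
b = [2 * ci + rng.gauss(0, 1) for ci in c]

raw_r = pearson(a, b)                                   # A vs B, C ignored
partial_r = pearson(residuals(a, c), residuals(b, c))   # A vs B, C removed

print(f"raw correlation of A and B:       {raw_r:.2f}")
print(f"correlation after removing C:     {partial_r:.2f}")
```

Under these assumptions the raw correlation comes out strong, while the correlation after removing C sits close to zero. Of course, this only works because C was measured; an unmeasured third variable cannot be adjusted away like this, which is exactly why the raw correlation alone settles nothing.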

Determination of causality is an entirely different ball of wax, and that discussion is beyond the scope of this post. Suffice it to say that in the sciences, causality is not assumed or given. The scientific method requires that the scientists set up empirical experiments to determine causality in a relationship under investigation.

The scientific method works in logical progression.

  1. Initial observations (of a putative relationship between variables) are made.

  2. An explanation is proposed in the form of one or several hypotheses about possible causal relationships, including one of no relationship (the Null hypothesis).

  3. Certain predictions or models may be generated on the basis of each of the hypotheses, which in turn guide the experimental design.

  4. Experiments are designed to demonstrate the falsifiability of the hypotheses, i.e., to test the logical possibility that the hypotheses could be proven false by a particular empirical observation. Indeed, testing for falsifiability or refutability is a key part of the scientific process.

  5. Once designed, the experiments are used to test the hypotheses rigorously, and the data are analyzed critically to reach a conclusion, accepting or rejecting the hypotheses.

  6. But the method doesn't cease there. All empirical observations are potentially under continued scrutiny, which involves reconsideration of the derived results, as well as re-examination of the methodology, especially in the light of newer techniques that are capable of taking deeper and more accurate measurements. Such is the dynamic nature of the scientific method.

Establishment of causality, therefore, has to pass through the same rigorous filters before it can be accepted. But if it does, the conclusions may be considered unimpeachably valid, within the given set of circumstances.

So... Correlation doesn't inherently imply causation.

Now, having provided a glimpse of the logical framework of this maxim, and discussed how the scientific method is utilized to establish causality in observed relationships between/amongst variables, let's look at some modern examples.

So... what happens when scientists, study authors, investigators ignore this prime maxim?

Well, as a casual search would show, the internet is replete with examples of spurious relationships, where either it was mistakenly assumed that evidence of correlation implied a causal relationship (where none existed), or the effects of some other variable(s) were ignored/disregarded, thereby leading to often implausible, ridiculous, and frankly humorous conclusions. Here is a small sampling of such conclusions gleaned from the internet (Caveat lector: some of these may be apocryphal; however, they do illustrate the point).

  1. Sleeping with one's shoes on is strongly correlated with waking up with a headache. Therefore, sleeping with one's shoes on causes headache.

  2. As ice-cream sales increase, the rate of drowning deaths increases sharply. Therefore, ice-cream causes drowning.

  3. Examination of the records of the Netherlands for many years revealed a strong positive correlation between (i) the annual number of storks, and (ii) the annual number of human babies born. Therefore, storks bring babies, since the opposite choice, neonates gathering a flock of storks, is unlikely.

  4. Over the same period of time in history, the number of pirates has decreased, and there has been an increase in global warming. Therefore, global warming is caused by a lack of pirates - a central tenet of Pastafarianism, the religion (N.B. those who are not familiar with this great religion, quickly check out the link!).

These examples, found in Wikipedia (see here and here), illustrate the erroneous conclusions reached by ignoring the plausible effects of a third variable, or the possibility of a coincidence:

  • In 1, a state of extreme inebriation may be responsible for the inability to remove one's shoes, and the hangover the next morning, leading to the headache.

  • In 2, the warm months of summer may be instrumental in goading people to cool off either by eating ice-cream or by taking a swim (correspondingly increasing the chance of drowning) or both.

  • In 3, the relationship may be quite complex; the storks arrived at the onset of winter and established nests in chimneys and farm outbuildings - therefore, gathering more in rural areas than in cities. This may have been because of the presence of structures conducive to nesting, a cleaner, less-polluted rural environment, and/or the availability of more food and water in winter. For reasons completely unrelated to storks or any other bird, rural families tend to have more children than urban families; in addition, many babies are born in the spring in rural areas because of human (ahem!) behavior during the cold short winter days and long winter nights. Or, these two events may have been completely unrelated, a coincidence. (Note: For a humorous take on the Stork-Baby theory, check out this faux article (PDF) from Germany.)

  • In 4, another example (albeit made-up) of coincidence.
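How easily does pure coincidence produce a strong correlation? The following is a minimal Python sketch under made-up assumptions: a "target" series of 10 random values (think of ten yearly measurements) is compared against 1000 other random series generated independently of it. None of these numbers come from real data, and the series lengths and counts are arbitrary choices. The point is only that if one searches through enough unrelated variables, some of them will line up with the target by chance alone.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(1)  # fixed seed so the run is repeatable
points = 10             # hypothetical: ten yearly observations
n_candidates = 1000     # hypothetical: unrelated variables searched through

# The target and every candidate are drawn independently, so by
# construction no candidate has any real relationship to the target.
target = [rng.gauss(0, 1) for _ in range(points)]
candidates = [[rng.gauss(0, 1) for _ in range(points)]
              for _ in range(n_candidates)]

# Strongest correlation (in absolute value) found by searching everything.
best = max(abs(pearson(target, cand)) for cand in candidates)

print(f"strongest |r| among {n_candidates} unrelated series: {best:.2f}")
```

With so few data points and so many candidates, the strongest match is large even though every series is noise. A single impressive-looking correlation pulled out of a big pile of variables therefore deserves extra suspicion, which is one reason corrections for multiple comparisons exist.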

Not to belabor the point, these, precisely, are considerations that should give a pause to the investigators studying relationships of variables. However, oftentimes, the zeal of the investigator(s) in trying to find the postulated relationship becomes a hindrance to the objective assessments thereof. The following few examples illustrate how.

I have already talked about the poorly-analyzed study that purportedly showed an association between religiosity and prolonged survival following liver transplant. Other examples include the following studies:

  1. An older (1999) study published in Nature (no less!) falls squarely in this group. From the associations, the study opined that myopia in young people was caused by exposure to ambient night lights that were left on in their room. This assertion was later soundly refuted by two groups from Ohio and Boston, who found (a) no such association between pediatric myopia and ambient night lighting, and more importantly, (b) an association between parental and filial myopia, lending credence to the idea that myopic parents are more likely to leave the lights on in their children's rooms.

  2. In an interesting report in Guardian Science, Matt Parker, a mathematician with the Maths Department at Queen Mary College of the University of London, looked up publicly available data on the number of mobile phone masts (Cell phone towers) in each county across the UK, and matched it against the live birth data for the same counties, finding an extremely strong and statistically significant correlation between the two numbers. As an intellectual exercise, Matt has released his findings publicly, stating that although it is a correlation-only finding, with no evidence of causality, he is curious to find out if others interpret a causal connection from it, given the mobile-phone health scare hysteria that arises from time to time despite a resounding lack of evidence.

  3. Living near Freeways is associated with autism, concluded the recent CHARGE study, which aims to uncover genetic and environmental links to autism.

The CHARGE study observed that even after adjusting for co-variates, such as maternal smoking, socio-economic and demographic factors, maternal residence during the 3rd trimester, as well as at the time of delivery, was more likely to be near a Freeway (but not any other major road) for mothers giving birth to autistic babies. The authors speculate that proximity to the Freeway may be a surrogate for exposure to traffic-related air pollution, which is known to have adverse prenatal effects, and call for a systematic examination of the possible association of the air pollutants with autism. However, unless that latter link is established, the assertion of the link between Freeways and autism is at best premature, and bound to increase confusion (and panic) in the interim, without answering several pertinent questions, such as:

  1. Why were Freeways 'bad' and not other major roads?

  2. What about autism incidence and prevalence data from other countries where people live close to busy major roadways equivalent to Freeways?

  3. Are there autism clusters around US cities that have/had high levels of urban air-pollution?

  4. What about autism incidence and prevalence data in rural US communities?

Let's all take a deep breath and sing in chorus now: Correlation does not imply causation. If there is a true causal relationship between variables, rigorous application of the scientific method remains by far the best way to uncover it.

NOTE: One quick disclaimer. When I have quoted one or more Wikipedia articles in the text, it is because I have found them well-written, informative, and adequately illustrative; however, I shall make no claim as to their veracity and/or authenticity because I have not been able to access and verify all the background references therein. If you find an error, please feel free to chide me in the comments.


  1. Excellent post! What a clear, concise and logical treatment of a common problem.

    Spot on.