Whenever we see a relationship between two variables X and Y, it’s wise to be conservative and assume that the relationship is correlational rather than causal. In this article, we cover 4 common reasons why correlation does not equal causation.
(1) Omitted variable: There may be some variables Z that affect both X and Y.
(2) Reverse causality: Y may be causing X rather than the other way round.
(3) Sample selection: We may be looking at X and Y among a non-representative group of individuals.
(4) Measurement error: X and Y may be difficult to measure.
[This article was originally published on Medium by Towards Data Science]
The phrase “correlation does not imply causation” has become something of a cliche. It is often what impassioned readers type into the comments section when they read articles claiming implausible links between two variables.
What does “correlation does not imply causation” mean? When should we use this phrase? How can we tell the difference between correlation and causation? What are the reasons why correlation does not equal causation? These are the questions we tackle in this article.
Correlation vs. Causation
We say that X and Y are correlated when they have a tendency to change and move together, either in a positive or negative direction. In the diagrams below, X and Y have a positive correlation (left), a negative correlation (middle), and no correlation (right).
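To make the three cases concrete, here is a minimal sketch that simulates them and computes the Pearson correlation coefficient by hand. The data are synthetic, and the names (`pearson`, `y_positive`, and so on) and the noise level of 0.5 are assumptions chosen purely for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
noise = [rng.gauss(0, 1) for _ in range(n)]

y_positive = [xi + 0.5 * e for xi, e in zip(x, noise)]   # tends to rise with x
y_negative = [-xi + 0.5 * e for xi, e in zip(x, noise)]  # tends to fall as x rises
y_unrelated = noise                                      # generated independently of x

r_positive = pearson(x, y_positive)
r_negative = pearson(x, y_negative)
r_none = pearson(x, y_unrelated)
print(round(r_positive, 2), round(r_negative, 2), round(r_none, 2))
```

The first coefficient should come out strongly positive, the second strongly negative, and the third close to zero. None of them says anything about which variable, if any, is doing the causing.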
We say that X causes Y when a change in X leads to a change in Y. Another way of saying this is that the change in Y would not have happened without the preceding change in X.
The maxim “correlation does not imply causation” serves as a useful reminder of how to think about the relationship between two variables X and Y. If X and Y seem to be linked, it’s possible but not certain that X caused Y. It’s also possible that Y caused X or that some third variable (Z) caused both X and Y.
However, this phrase can sometimes be a knee-jerk reaction when one hears dubious causal links between two variables. Often, this phrase is uttered with a tone of disdain, seeming to imply that “correlations” are inferior to “causal” links. This is unfortunate because correlations can be interesting even if they don’t present causal relationships. Correlations can tell us interesting things and can help us understand possible causal links. But we need to be careful and nuanced when understanding and interpreting such correlations.
More Sex, Higher Income?
To illustrate how we can distinguish between correlation and causation, let’s look at an article that claims that more sex causes higher income. In the summer of 2013, an article was published on Gawker.com with the headline “More Buck For Your Bang: People Who Have More Sex Make The Most Money”. The author Max Rivlin-Nadler writes: “Scientists… found that people who have sex more than four times a week receive a 3.2 percent higher paycheck than those who have sex only once a week. God forbid you don’t have sex at all.”
The Gawker article was based on a study by Nick Drydakis, an Economics professor at Anglia Ruskin University, called “The Effect of Sexual Activity on Wages”. In his study, Drydakis examines the relationship between the frequency of sexual intercourse and income among 7,500 Greek households (not German households, as the Gawker article states).
To be clear, Drydakis does not claim that his study shows that more sex causes higher income. He acknowledges that, as in existing studies, disentangling correlation from causation is difficult: “it is unclear whether this correlation represents a causal relationship… since the current findings are strictly applicable only to the time, place, individual characteristics from which the sample was drawn, we should highlight that the reported results are simply an indication of the relationship between sexual activity and wages but are by no means the final word.”
The study and the corresponding (mis)interpretation of its results in the Gawker article are good examples of the “correlation does not imply causation” maxim at work. First, the study primarily focuses on correlations, but the relationship was interpreted as a causal relationship by the press. Second, this led to some backlash from readers who warned the author of the Gawker article to be careful about conflating correlation and causation. Third, despite its non-causal nature, the results of the study are arguably still interesting. It certainly got the attention of the internet at the time!
Given this, let’s look at reasons why correlation does not imply causation.
4 Reasons Why Correlation ≠ Causation
(1) We’re missing an important factor (Omitted variable)
The first reason why correlation may not equal causation is that there is some third variable (Z) that affects both X and Y at the same time, making X and Y move together. The technical term for this missing (often unobserved) variable Z is “omitted variable”. In the study on the sex-income relationship, what third factor (Z) could make people have more sex (X) and more money (Y)?
Drydakis suggests that physical health is an important third variable: “In this study, we hypothesised that because the medical and psychological literature suggest that sexual activity is associated with good health, endurance, mental well-being, mental capacities and dietary habits, it could be perceived as a health indicator, which might influence returns to labor market activity… The patterns found in this study strengthen this reasoning.” In other words, it’s likely that physical and mental health influence both sexual activity levels and income. All else equal, healthier people are likely both to have more sex and to earn a higher income, even if the former doesn’t cause the latter.
There are probably many other omitted variables that affect how often people have sex and how much income they make. For example, Drydakis also finds that certain personality traits (Z) affect both sexual activity and wages. In particular, extraversion is found to affect both these outcomes. This seems plausible: extraverted people probably find it easier to chat up people they’re romantically interested in and to present themselves in a manner that might command a higher wage (e.g. bargain/negotiate wages with confidence).
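A small simulation can show how an omitted variable manufactures a correlation. In the sketch below, a hypothetical `health` score (Z) drives both `sex_freq` (X) and `income` (Y), while X has no effect on Y at all. The variable names, coefficients, and sample size are invented for illustration and are not taken from Drydakis's data. Regressing out `health` with a simple straight-line fit is one rough way to "control for" Z.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def residuals(ys, zs):
    """Part of ys left over after a straight-line (least-squares) fit on zs."""
    mz, my = sum(zs) / len(zs), sum(ys) / len(ys)
    slope = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
             / sum((z - mz) ** 2 for z in zs))
    return [(y - my) - slope * (z - mz) for z, y in zip(zs, ys)]


rng = random.Random(1)  # fixed seed so the run is repeatable
n = 5000
health = [rng.gauss(0, 1) for _ in range(n)]     # Z: the omitted variable
sex_freq = [z + rng.gauss(0, 1) for z in health]  # X depends on Z only
income = [z + rng.gauss(0, 1) for z in health]    # Y depends on Z only, never on X

r_raw = pearson(sex_freq, income)
r_given_health = pearson(residuals(sex_freq, health),
                         residuals(income, health))
print(round(r_raw, 2), round(r_given_health, 2))
```

The raw correlation should be clearly positive, while the correlation that remains after removing the part explained by `health` should be close to zero. In real data, of course, we rarely observe every relevant Z, which is exactly why omitted variables are a problem.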
What other omitted variables can you think of?
(2) We got things the other way round (Reverse Causality)
The second reason why X and Y moving together may not imply that X causes Y is that Y might be causing X instead. The technical term for this is “reverse causality”.
In the comments section of the Gawker article, many readers picked this up as the main reason to doubt a causal effect of sexual activity on wages. One commenter writes, “People who have the most sex make the most money? Or is it that the people who make the most money have the most sex?”. Another writes: “Correction: Men who make the most money get [sic] the most sex.” Yet another writes: “I think you have that backwards. Ever tried to get laid when you’re broke?”
You get the gist.
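One reason the data alone cannot settle the direction is that correlation is symmetric: the correlation of X with Y is the same number as the correlation of Y with X. The sketch below generates synthetic data in which income (Y) drives sexual activity (X), the reverse of the headline's claim. The coefficient of 0.6 and the variable names are assumptions made up for this example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(2)  # fixed seed so the run is repeatable
n = 5000
income = [rng.gauss(0, 1) for _ in range(n)]              # Y is generated first
sex_freq = [0.6 * y + rng.gauss(0, 1) for y in income]    # X responds to Y

r_xy = pearson(sex_freq, income)  # "sex vs. income"
r_yx = pearson(income, sex_freq)  # "income vs. sex"
print(round(r_xy, 2), round(r_yx, 2))
```

Both calls return the same positive value, even though in this simulated world more sex does nothing to income. A reader shown only the correlation has no way to tell which story generated the data.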
(3) We’re looking at unusual people (Sample Selection)
The third reason why correlation does not imply causation is that the sample we’re looking at is not representative of the population of interest. The technical term for this is that we have “sample selection”.
The classic example of sample selection is based on Nobel winner James Heckman’s work. Suppose you want to know why someone is paid the wage that they get. You go out to collect some data on some outcomes Y (e.g. wages) and some variables of interest X that you think might affect how much you get paid (educational degree, occupation). Heckman noted that in such an investigation, you can only collect information on the wage of people who actually work. You can’t get information on the outcomes (Y) of people who don’t work. Furthermore, people who work are selected in some non-random way from the population (e.g. you’re unlikely to find new mothers in this group). Thus, estimating the determinants of wages from this selected group may lead us to draw inaccurate conclusions.
What does this have to do with the study on the sex-income relationship? It could be the case that a large group of people who are unemployed (and hence earn nothing) are having lots of sex, but they never appear in the data because their wages are zero. If these people did appear in the data, then the relationship might look very different!
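The scenario above can be sketched with synthetic data. Every number here is an assumption invented for illustration: 70% of people work, workers' wages rise slightly with sexual activity, and the non-workers earn nothing but have more sex on average. Nothing in it comes from the actual study.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(3)  # fixed seed so the run is repeatable
n = 5000
sex_freq, income, employed = [], [], []
for _ in range(n):
    works = rng.random() < 0.7                 # hypothetical employment rate
    if works:
        s = max(0.0, rng.gauss(2, 1))          # times per week, clipped at zero
        w = 30 + 2 * s + rng.gauss(0, 5)       # wage rises mildly with s
    else:
        s = max(0.0, rng.gauss(4, 1))          # non-workers: more sex on average
        w = 0.0                                # and no wage income
    sex_freq.append(s)
    income.append(w)
    employed.append(works)

# What a wage survey sees: only people with a job.
workers = [(s, w) for s, w, e in zip(sex_freq, income, employed) if e]
r_workers = pearson([s for s, _ in workers], [w for _, w in workers])

# What we would see if everyone were in the data.
r_everyone = pearson(sex_freq, income)
print(round(r_workers, 2), round(r_everyone, 2))
```

Under these made-up assumptions, the correlation among workers is positive, but it turns negative once the non-workers are included. The sign of the relationship depends on who made it into the sample.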
(4) It’s difficult to measure things (Measurement error)
The fourth reason why correlation does not imply causation is that the outcomes that we’re interested in are difficult to measure and hence can only be imperfectly observed. The technical term for this is “measurement error”.
This is a particular concern when you ask people to report things about controversial or sensitive topics such as, er, how frequently they have sex. Similar issues hold when you ask people about who they plan to vote for or their attitudes towards various contentious issues.
When certain types of people with certain traits are more likely to misreport the variable that we’re interested in, then this can lead us to infer incorrect relationships. As one reader comments: “It could also be so that people who lie about their income, tend to also lie about how many times they have sex.”
Consider a similar scenario where certain types of men who feel the need to overreport their number of sexual encounters are also the type of men who pursue high-salary jobs, because they believe that both these things bring them prestige. In this case, it’ll look as though people who report more sexual encounters also earn higher wages, when in fact this relationship is driven by misreported sexual activity.
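This misreporting story can also be simulated. In the sketch below, true sexual frequency is generated independently of wages, but a hypothetical "prestige seeker" trait both raises wages and inflates the reported frequency. The 30% share, the size of the overreport, and the wage premium are all invented for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(4)  # fixed seed so the run is repeatable
n = 5000
true_freq, reported_freq, wage = [], [], []
for _ in range(n):
    prestige_seeker = rng.random() < 0.3       # hypothetical trait
    t = rng.gauss(2, 1)                        # true frequency, unrelated to wages
    true_freq.append(t)
    # Prestige seekers overreport by a fixed amount (an assumption).
    reported_freq.append(t + (2.0 if prestige_seeker else 0.0))
    # They also chase higher-paying jobs (an assumption).
    wage.append(30 + (10.0 if prestige_seeker else 0.0) + rng.gauss(0, 5))

r_true = pearson(true_freq, wage)          # what is really going on
r_reported = pearson(reported_freq, wage)  # what the survey data show
print(round(r_true, 2), round(r_reported, 2))
```

The true frequency is essentially uncorrelated with wages, yet the reported frequency shows a clear positive correlation. The apparent link is created entirely by who misreports, not by anything sex does to pay.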
Indeed, correlation does not imply causation. Whenever we see a relationship between two variables, it’s wise to be conservative and assume that the relationship is correlational rather than causal.
However, this doesn’t mean that correlations are useless. Rather than dismissing correlations entirely, we need to be careful how we interpret these correlations. Being aware that “correlation does not imply causation” is a starting point, but throwing this phrase around without considering precisely why correlations might not equal causation adds little to the discussion.
In this article, we covered 4 common reasons why correlation does not equal causation. Whenever we see two variables moving together and feel inclined to draw a causal relationship between the two, we should stop and ask ourselves the following questions:
(1) Are we forgetting about any variables Z that affect both X and Y? (Omitted variables)
(2) Is Y causing X rather than X causing Y? (Reverse causality)
(3) Who is missing? (Sample selection)
(4) How easy is it to measure X and Y? (Measurement error)
Getting creative with our answers to these questions is one way to avoid conflating correlation and causation. Hopefully, it might even lead us to consider other relationships that we previously did not think about. After all, looking at correlations might be a starting point to uncovering something more interesting.