Skip to content

Tag: omitted variables

Correlation does not imply causation. Then what does it imply?

‘Correlation does not imply causation’ is an adage students from all social sciences are made to recite from a very early age. What is less often systematically discussed is what could be actually going on so that two phenomena are correlated but not causally related. Let’s try to make a list: 1) The correlation might be due to chance. T-tests and p-values are generally used to guard against this possibility. 1a) The correlation might be due to coincidence. This is essentially a variant of the previous point but with focus on time series. It is especially easy to mistake pure noise (randomness) for patterns (relationships) when one looks at two variables over time. If you look at the numerous ‘correlation is not causation’ jokes and cartoons on the internet, you will note that most concern the spurious correlation between two variables over time (e.g. number of pirates and global warming): it is just easier to find such examples in time series than in cross-sectional data. 1b) Another reason to distrust correlations is the so-called ‘ecological inference‘ problem. The problem arises when data is available at several levels of observation (e.g. people nested in municipalities nested in states). Correlation of two variables aggregated at a higher level (e.g. states) cannot be used to imply correlation of these variables at the lower (e.g. people). Hence, the higher-level correlation is a statistical artifact, although not necessarily due to mistaking ‘noise’ for ‘signal’. 2) The correlation might be due to a third variable being causally related to the two correlated variables we observe. This is the well-known omitted…