In Part I, I argue that the search for and discovery of statistically significant relationships does not amount to explanation and is often misplaced in the social sciences because the variables which are purported to have effects on the outcome cannot be manipulated.
Just to make sure that my message is not misinterpreted – I am not arguing for a fixation on maximizing R-squared and other measures of model fit in statistical work, instead of the current focus on the size and significance of individual coefficients. R-squared has been rightly criticized as a standard of how good a model is** (see for example here). But I am not aware of any other measure or standard that can convincingly compare the explanatory potential of different models in different contexts. Predictive success might be one way to go, but prediction is altogether something else than explanation.
I don’t expect much to change in the future with regard to the problem I outlined. In practice, all one could hope for is some clarity on the part of researchers about whether their objective is to explain (account for) or to find significant effects. The standards for evaluating progress towards the former objective (model fit, predictive success, ‘coverage’ in the QCA sense) should be different from the standards for the latter (statistical and practical significance, and the practical possibility of manipulating the exogenous variables).
Take the so-called garbage-can regressions, for example. These are models with tens of variables, all of which are interpreted causally if they reach the magic 5% significance level. The futility of this approach is matched only by its popularity in political science and public administration research. If the research objective is to explore a causal relationship, one had better focus on that variable and include covariates only if they are suspected to be correlated both with the outcome and with the main independent variable of interest. Including everything else that happens to be within easy reach leads to inefficiency in the estimation, and, beyond that, one should refrain altogether from interpreting the significance of these covariates causally. On the other hand, if the objective is to comprehensively explain (account for) a certain phenomenon, then including as many variables as possible might be warranted, but then the significance of individual variables is of little interest.
The goal of research is important when choosing the research design and the analytic approach. Different standards apply to explanation, the discovery of causal effects, and prediction.
**Just one small example from my current work: a model with one dependent and one exogenous time-series variable, both in levels, and with a lagged dependent variable included on the right-hand side of the equation, produces an R-squared of 0.93. The same model in first differences has an R-squared of 0.03, while the regression coefficient of the exogenous variable remains significant in both models. So we can ‘explain’ 90% of the variation in the first case by reference to the past values of the outcome. Does this amount to an explanation in any meaningful sense? I guess that depends on the context. Does it provide any leverage to the researcher to manipulate the outcome? Not at all.
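The pattern in this footnote is easy to reproduce with simulated data. The sketch below is only an illustration, not the model from my own work: the data-generating process (an outcome with persistence 0.95, a white-noise exogenous variable with an effect of 0.2, 5,000 observations) and the helper `ols` are assumptions made up for the example, and the t-statistics use conventional, non-robust standard errors.

```python
import numpy as np

def ols(y, X):
    """OLS of y on X (X must already contain a constant column).

    Returns coefficients, conventional t-statistics and R-squared.
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    sigma2 = resid @ resid / (n - k)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t_stats = beta / np.sqrt(np.diag(cov))
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, t_stats, r2

# Hypothetical data-generating process (assumed for illustration only):
# a highly persistent outcome with a small effect of an exogenous variable.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
e = rng.normal(size=n)
y = np.zeros(n)
for i in range(1, n):
    y[i] = 0.95 * y[i - 1] + 0.2 * x[i] + e[i]

# Model in levels: y_t on a constant, y_{t-1} and x_t.
X_lev = np.column_stack([np.ones(n - 1), y[:-1], x[1:]])
b_lev, t_lev, r2_levels = ols(y[1:], X_lev)

# Same model in first differences: dy_t on a constant, dy_{t-1} and dx_t.
dy = np.diff(y)
dx = np.diff(x)
X_dif = np.column_stack([np.ones(n - 2), dy[:-1], dx[1:]])
b_dif, t_dif, r2_diff = ols(dy[1:], X_dif)

print(f"levels:      R2 = {r2_levels:.2f}, t(x) = {t_lev[2]:.1f}")
print(f"differences: R2 = {r2_diff:.2f}, t(x) = {t_dif[2]:.1f}")
```

Under these assumptions the R-squared in levels should be high and the R-squared in first differences close to zero, while the coefficient of the exogenous variable stays comfortably significant in both. Almost all of the fit in levels comes from the lagged outcome, which is exactly the part the researcher cannot manipulate.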