Explanation and the quest for ‘significant’ relationships. Part I

The ultimate goal of social science is causal explanation*. The actual goal of most academic research is to discover significant relationships between variables. The two goals are supposed to be strongly related – by discovering the significant effects of exogenous (independent) variables, one accounts for the outcome of interest. In fact, the working assumption of the empiricist paradigm of social science research is that the two goals are essentially the same – explanation is the sum of the significant effects that we have discovered. Just look at what all the academic articles with ‘explanation’, ‘determinants’, and ‘causes’ in their titles do – they report significant effects, or associations, between variables.

The problem is that explanation and collecting significant associations are not the same. Of course they are not. The point is obvious to everyone uninitiated into the quantitative empiricist tradition of doing research, but it seems to be lost on many of its practitioners. We could have discovered a significant determinant of X and still be miles (or even light-years) away from a convincing explanation of why and when X occurs. This is not because of the difficulties of causal identification – we could have satisfied all conditions for causal inference from observational data, and the problem would still remain. Nor would it go away once we pay attention (as we should) to the fact that statistical significance is not the same as practical significance. Even the discovery of convincingly identified causal effects, large enough to be of practical rather than merely statistical significance, does not amount to explanation. A successful explanation needs to account for the variation in X, and causal associations need not – they might be significant yet not make a visible dent in the unexplained variation in X. The difference I am talking about is partly akin to the difference between looking at the significance of individual regression coefficients and looking at the fit of the model as a whole (more on that in Part II). The current standards of social science research emphasize the former rather than the latter, which allows significant relationships to be sold as explanations.
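To make the gap concrete, here is a minimal simulated sketch (my own illustration, not drawn from any of the studies discussed here): the predictor easily clears any conventional significance threshold, yet it accounts for a negligible share of the variation in the outcome.

    # Minimal simulation: a 'significant' predictor that explains almost nothing.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(42)
    n = 10_000                            # large sample, as in much survey-based research
    x = rng.normal(size=n)                # the 'significant determinant'
    y = 0.05 * x + rng.normal(size=n)     # true effect is tiny relative to the noise

    model = sm.OLS(y, sm.add_constant(x)).fit()
    print(model.pvalues[1])   # typically far below 0.05: 'statistically significant'
    print(model.rsquared)     # on the order of 0.002-0.003: almost no variation explained

With ten thousand observations even a tiny effect comes out ‘significant’, while the model fit shows we have explained next to nothing about why the outcome varies.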

The objection can be made that the discovery of causal effects is all we should aim for, and all we could hope for. Even if a causal relationship doesn’t account for a large share of the variation in the outcome of interest, it still makes a difference. After all, this is the approach taken in epidemiology, agricultural science and other fields (like beer production) where the statistical research paradigm has its origins. A pill might not cure all headaches, but if it has a positive and statistically significant effect, it will still help millions. But here is the trick – the quest for statistically significant relationships in epidemiology, agriculture, etc. is valuable because all these effects can be considered interventions: the researchers have control over the formula of the pill, the amount of pesticide, or the type of hops. In contrast, social science researchers too often seek and discover significant relationships between an outcome and variables that could not even remotely be considered interventions. So we end up with a pile of significant relationships which do not account for enough variation to count as a proper explanation, and which have no value as interventions because their manipulation is beyond our reach. To sum up, observational social science has borrowed an approach to causality that makes sense for experimental research and applied its standards (namely, statistical significance) in a context where the discovery of significant relationships is less valuable because the ‘treatments’ cannot be manipulated. Meanwhile, what should really count – explaining when, how and why a phenomenon happens – is relegated to the background in the false belief that the quest for significant relationships is somehow a substitute. It is like trying to discover the fundamental function of the lungs with epidemiological methods, and claiming success when you show that cold air significantly reduces lung capacity. The inference might still be valuable, but it is no substitute for the original goal.

In Part II, I will discuss what needs to be changed, and what can be changed in the current practice of empirical social science research to address the problem outlined above.

*In my understanding, all explanation is causal, so ‘causal explanation’ is a tautology. I will therefore drop the ‘causal’ part for the rest of the text.

Writing with the rear-view mirror

Social science research is supposed to work like this:
1) You want to explain a certain case or a class of phenomena;
2) You develop a theory and derive a set of hypotheses;
3) You test the hypotheses with data;
4) You conclude about the plausibility of the theory;
5) You write a paper with a structure (research question, theory, empirical analysis, conclusions) that mirrors the steps above.

But in practice, social science research often works like this:
1) You want to explain a certain case or a class of phenomena;
2) You test a number of hypotheses with data;
3) You pick the hypotheses that matched the data best and combine them in a theory;
4) You conclude that this theory is plausible and relevant;
5) You write a paper with a structure (research question, theory, empirical analysis, conclusions) that does not reflect the steps above.

In short, an inductive quest for a plausible explanation is masked and reported as deductive theory-testing. This fallacy is both well-known and rather common (at least in the fields of political science and public administration). And, in my experience, it turns out to be tacitly supported by the policies of some journals and reviewers.

For one of my previous research projects, I studied the relationship between public support and policy output in the EU. Since the state of the economy can influence both, I included levels of unemployment as a potential omitted variable in the empirical analysis. It turned out that lagged unemployment is positively related to the volume of policy output. In the paper, I mentioned this result in passing but didn’t really discuss it at length because 1) the original relationship between public support and policy output was not affected, and 2) although highly statistically significant, the result was quite puzzling.

When I submitted the paper to a leading political science journal, a large part of the reviewers’ critiques focused on the fact that I did not offer an explanation for the link between unemployment and policy output in the paper. But why should I? I had no good explanation for why these variables should be related (with a precisely four-year lag) when I did the empirical analysis, so why pretend? Of course, I suspected unemployment as a confounding variable for the original relationship I wanted to study, so I took the pains to collect the data and run the tests, but that certainly doesn’t count as an explanation for the observed statistical relationship between unemployment and policy output. The point is, it would have been entirely possible to write the paper as if I had strong ex ante theoretical reasons to expect that rising unemployment increases the policy output of the EU, and that the empirical test supports (or, more precisely, does not reject) this hypothesis. That would certainly have greased the review process, and it only takes moving a few paragraphs from the concluding section to the theory part of the paper. So, if your data has a surprising story to tell, make sure it looks like you anticipated it all along – you even had a theory that predicted it! This is what I call ‘writing with the rear-view mirror’.

Why is it a problem? After all, an empirical association is an empirical association, no matter whether you theorized about it beforehand or not. So where is the harm? As I see it, by pretending to have theoretically anticipated an empirical association, you grant it undue credence. Not only are the data consistent with a link between two variables, but there are supposedly strong theoretical grounds to believe the link should be there. A surprising statistical association, however robust, is just what it is – a surprising statistical association that possibly deserves speculation, exploration and further research. A robust statistical association ‘predicted’ by a previously developed theory is way more – it is a claim that we understand how the world works.

As long as journals and reviewers act as if proper science never deviates from the hypothetico-deductive canon, writers will pretend to follow it. As long as openly descriptive and exploratory research is frowned upon, sham theory-testing will prevail.

Eventually, my paper on the links between public support, unemployment and policy output in the EU got accepted (in a different journal). Surprisingly, given the bumpy review process, it has just been selected as the best article published in that journal in 2011. Needless to say, an explanation of why unemployment might be related to EU policy output is still wanting.

Unit of analysis vs. Unit of observation

Having graded another batch of 40 student research proposals, I am reminded that the distinction between ‘unit of analysis’ and ‘unit of observation’ is, yet again, one of the trickiest for students to master.

After several years of experience, I think I have a good grasp of the difference between the two, but it obviously remains a challenge to explain it to students. King, Keohane and Verba (1994) [KKV] introduce the distinction in the context of descriptive inference, where it serves the argument that what goes under the heading of a ‘case study’ often actually contains many observations (p.52; see also pp.116-117). But, admittedly, the book is somewhat unclear about the distinction, and unambiguous definitions are not provided.

In my understanding, the unit of analysis (a case) is the level at which you pitch the conclusions, while the unit of observation is the level at which you collect the data. The two can be the same, but they need not be. In the context of quantitative research, the units of observation could be students and the units of analysis classes, if classes are compared. Or students can be both the units of observation and the units of analysis, if students are compared. Or students can be the units of analysis and grades the units of observation, if several observations (grades) are available per student. So it all depends on the design. Simply put, the unit of observation is the row in the data table, but the unit of analysis can sit at a higher level of aggregation.
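A toy example may help (the data are made up purely for illustration): the grade records are the units of observation, and depending on whether students or classes are compared, the unit of analysis sits one or two levels of aggregation higher.

    # Hypothetical grade records: the unit of observation is the individual grade.
    import pandas as pd

    grades = pd.DataFrame({
        "student": ["Ann", "Ann", "Bo", "Bo", "Cem", "Cem"],
        "class":   ["A",   "A",   "A",  "A",  "B",   "B"],
        "grade":   [7.5,   8.0,   6.0,  6.5,  9.0,   8.5],
    })

    # If students are compared, the unit of analysis is the student:
    per_student = grades.groupby("student")["grade"].mean().reset_index()

    # If classes are compared, the unit of analysis moves up another level:
    per_class = grades.groupby("class")["grade"].mean().reset_index()

    print(per_student)   # one row per student
    print(per_class)     # one row per class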

In the context of qualitative research it is more difficult to draw the distinction, partly because the difference between analysis and observation is in general less clear-cut. In some sense, the same unit (case) traced over time provides distinct observations, but I am not sure to what extent these snapshots would be regarded as distinct ‘observations’ by qualitative researchers.

But more importantly, I am starting to feel that the distinction between units of analysis and units of observation creates more confusion than clarity. For the purposes of research design instruction, we would be better off if the term ‘case’ did not exist at all, so that we could simply speak about observations (single observation vs. single case study, observation selection vs. case selection, etc.). Of course, language policing never works, so we seem to be stuck with an unfortunate but unavoidable ambiguity.

Slavery, ethnic diversity and economic development

What is the impact of the slave trades on economic progress in Africa? Are the modern African states which ‘exported’ a higher number of slaves more likely to be underdeveloped several centuries afterwards?

Harvard economist Nathan Nunn addresses these questions in his chapter for the “Natural experiments of history” collection. The edited volume is supposed to showcase a number of innovative methods for doing empirical research to a broader audience, and historians in particular. But what Nunn’s study actually illustrates is the difficulty of making causal inferences based on observational data. He claims that slave exports contributed to economic underdevelopment, partly through impeding ethnic consolidation. But his data is entirely consistent with a very different interpretation: ethnic diversity in a region led to a higher volume of slave exports and is contributing to economic underdevelopment today. If this interpretation is correct, it could render the correlation between slave exports and the lack of economic progress in different African states spurious – a possibility that is not addressed in the chapter.

The major argument of Nunn’s piece is summarized in the following scatterplot. Modern African states from which more slaves were captured and exported (correcting for the size of the country) between the fifteenth and the nineteenth centuries have lower incomes per capita in 2000 (see Figure 5.1 on p.162; the plot reproduced below is actually from an article in the Quarterly Journal of Economics and looks essentially the same):

The link only grows stronger after we take into account potential ‘omitted variables’ like geographical location, natural openness, climate, natural resources, history of colonial rule, religion and the legal system. Hence, the relationship seems to be more than a correlation, and Nunn boldly endorses a causal interpretation: “the slave trades are partly responsible for Africa’s current underdevelopment” (p.165).

Not being a specialist in the history of slavery, I reacted with initial disbelief – the relationship seems almost too good to be true. Especially when we consider the rather noisy slave exports data, which attributes imperfect estimates of slave exports to modern states that didn’t exist at the time when the slaves were captured and traded. While it is entirely plausible that slave exports and economic underdevelopment are related, such a strong association between a purported cause and its effect several centuries apart invites skepticism.

It seemed perfectly possible to me that the ethnic heterogeneity of a territory could account for both the volume of slave exports and current economic underdevelopment. In my layman’s worldview, people are more likely to hunt and enslave people from another tribe or ethnicity than from their own. At the same time, African countries in which different ethnicities coexist might face greater difficulties in providing public goods and establishing the political institutions conducive to economic prosperity. So I was a bit surprised that the analysis doesn’t control for ethnic diversity, in addition to size, climate, openness, etc.

But towards the end of the essay, the relationship between slave exports and ethnic diversity is actually presented, and the correlation at the country level turns out to be very high. Nunn, however, interprets the relationship in the opposite direction: for him, slave exports caused ethnic diversity by impeding ethnic consolidation (which in turn contributes to economic underdevelopment today). He doesn’t even consider the possibility of reverse causality, although the volume of slave exports could easily be a consequence rather than a cause of the ethnic diversity of a region.

Of course, the data alone cannot tell us which interpretation is more likely to be correct. And this is exactly the point. When the assignment of countries to different levels of slave exports is not controlled by the researcher or randomized by nature, it is imperative that all interpretations consistent with the data are discussed and evaluated – especially in a volume which aims to bring research methodology lessons to the masses.

And finally, if my suggestion that ethnic diversity is more likely to be a cause than an effect of slave exports is correct, can ethnic diversity explain away the correlation between slave exports and economic performance? Nunn doesn’t test this conjecture, but he has the data available on his website, so why don’t we go ahead and check? While I can’t be entirely sure that I replicate exactly what the original article does [there is no do-file online], a regression of income on slave exports with ethnic diversity included as a covariate takes away the bulk of the significance of slave exports.
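For anyone who wants to run the same check, here is a sketch of the kind of regression I have in mind. The file and variable names below are placeholders of my own invention – Nunn’s dataset uses its own labels – and the point is simply to compare the slave-exports coefficient with and without ethnic diversity in the model.

    # Sketch of the check described above, assuming a country-level dataset with
    # hypothetical file and column names; the actual variable names may differ.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("nunn_country_data.csv")   # placeholder file name

    # Baseline: log income per capita in 2000 on (normalized) slave exports
    baseline = smf.ols("log_income_2000 ~ slave_exports_norm", data=df).fit()

    # The same regression with ethnic fractionalization added as a covariate
    with_diversity = smf.ols(
        "log_income_2000 ~ slave_exports_norm + ethnic_fractionalization",
        data=df,
    ).fit()

    print(baseline.summary())        # compare the slave-exports coefficient...
    print(with_diversity.summary())  # ...before and after controlling for diversity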

Is unit homogeneity a sufficient assumption for causal inference?

Is unit homogeneity a sufficient condition (assumption) for causal inference from observational data?

Re-reading King, Keohane and Verba’s bible on research design [lovingly known to all exposed to it as KKV], I think they regard unit homogeneity and conditional independence as alternative assumptions for causal inference. For example: “we provide an overview here of what is required in terms of the two possible assumptions that enable us to get around the fundamental problem [of causal inference]” (p.91, emphasis mine). However, I don’t see how unit homogeneity on its own can rule out endogeneity (establish the direction of causality). In my understanding, endogeneity is automatically ruled out under conditional independence, but not under unit homogeneity (“Two units are homogeneous when the expected values of the dependent variables from each unit are the same when our explanatory variable takes on a particular value” [p.91]).
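To make the contrast explicit, the two assumptions can be written side by side in potential-outcomes notation (the rendering is mine, not KKV’s):

    % The two assumptions in potential-outcomes notation (my rendering, not KKV's).
    % Unit homogeneity (KKV, p.91): for any two units i and j, expected outcomes
    % coincide once the explanatory variable is fixed at the same value x:
    \[
      \mathbb{E}[\, Y_i \mid X_i = x \,] \;=\; \mathbb{E}[\, Y_j \mid X_j = x \,]
      \qquad \text{for all units } i, j \text{ and all values } x .
    \]
    % Conditional independence: given covariates Z_i, the potential outcomes are
    % independent of which value of X the unit actually receives:
    \[
      \{\, Y_i(x) : x \,\} \;\perp\!\!\!\perp\; X_i \,\mid\, Z_i .
    \]
    % The first statement only compares expected outcomes across units; it is silent
    % about how X_i comes to take its value, so by itself it cannot exclude that Y
    % influences X (endogeneity). The second statement, by construction, does.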

Going back to Holland’s seminal article, which provides the basis of KKV’s approach, we can confirm that unit homogeneity is listed as a sufficient condition for inference (p.948). But Holland divides variables into pre-exposure and post-exposure before he even discusses any of the additional assumptions, so reverse causality is ruled out from the start. Hence, in Holland’s context unit homogeneity can indeed be regarded as sufficient, but in KKV’s context it needs to be coupled with some further condition (temporal precedence, for example) to ascertain the causal direction when making inferences from data.

The point is minor but can create confusion when presenting unit homogeneity and conditional independence side by side as alternative assumptions for inference.

Social science in the courtroom

Everyone who is interested in the sociology of science, causal inference from observational data, employment gender discrimination, judicial sagas, or academic spats should read the latest issue of Sociological Methods & Research. The whole issue is devoted to the Wal-Mart Stores, Inc. v. Dukes et al. case – “the largest class-action employment discrimination suit in history” – with a focus on the uses of social science evidence in the courtroom.

The focal point of contestation is the report of Dr. Bielby, an expert for the plaintiffs. In a nutshell, the report says that the gender bias in promotion decisions at Wal-Mart can be attributed to the lack of effort to create a strong corporate culture and to limit the discretion managers have in promotion decisions, which in turn allows for biased decisions. The evidence is mostly 1) a literature review that supports the causal links between corporate policies and corporate culture, corporate culture and individual behavior, discretion and biased individual behavior, and corporate policies and outcomes, and 2) a description of the corporate policies and culture at Wal-Mart which points to a relatively weak policy against gender discrimination and considerable discretion for managers in promotion decisions. Dr. Bielby describes the method as follows: “…look at distinctive features of the firm’s policies and practices and … evaluate them against what social scientific research shows to be factors that create and sustain bias and those that minimize bias” [the method is designated as “social framework analysis”].

What gives the case broader significance (apart from the fact that it directly concerns between half a million and a million and a half current and former employees of Wal-Mart) is the letter [amicus brief] that the American Sociological Association (ASA) decided to send in support of Dr. Bielby’s report. In the letter, the ASA states that “the methods Dr. Bielby used are those social scientists rely on in scientific research that is published in top-quality peer-reviewed journals” and that “well done case studies are methodologically valid”. However, the Supreme Court apparently begged to differ and rejected the plaintiffs’ claim.

The current issue of Sociological Methods & Research has two articles which attack the decision of the ASA to endorse Dr. Bielby’s methodology and two articles that support it. In my opinion, the former are right. Mitchell, Monahan, and Walker characterize Dr. Bielby’s approach as “subjective judgments about litigation materials collected and provided to the expert by the attorneys”, but even if that goes too far, Sørensen and Sharkey definitely have a point in writing that what Dr. Bielby does is engage in abductive reasoning – “generate a hypothesized explanation from an observed empirical phenomenon” – hardly a reliable way to make a valid inference about causes and effects. Employment discrimination might be consistent with high managerial discretion, but it is not necessarily caused by it.

What makes this academic exchange particularly juicy is the fact that most contributors (the editor of the journal included) have been opponents in the courtroom as well – not directly, but as experts for the two sides in numerous employment discrimination suits. Which probably raises the stakes, I guess. Here is the editor describing the process of putting the special issue together:

“Managing” these interchanges has been far more difficult than I had thought. Even around very technical issues, scholars can get very heated. Part of the problem, I believe, is that the academy and, certainly, the social sciences, and most specifically sociology, do not have a well-articulated set of norms about how to engage in constructive scientific discourse. Too often I have seen the following:
1. Claims that a person holds a position or has said something when he or she did not, that is, “putting words in a person’s mouth.”
2. Misconstrual, intentionally or not, of the meaning of what a person has written.
3. Questioning the expertise, intelligence, motives, or morals of an author.
4. Obfuscation by bringing in irrelevant or tangential points or material. (p.552-3)

Academic discourse at its best.