The ‘Global South’ is a terrible term. Don’t use it!

The Rise of the ‘Global South’

The ‘Global South’ and ‘Global North’ are increasingly popular terms used to categorize the countries of the world. According to Wikipedia, the term ‘Global South’ originated in postcolonial studies and was first used in 1969. The Google N-gram chart below shows the rise of the term ‘Global South’ from 1980 to 2008, and the rise is even more impressive afterwards.

Nowadays, the ‘Global South’ is used as a shorthand for anything from poor and less developed to oppressed and powerless. Despite this vagueness, the term is prominent in serious academic publications, and it even features in the names of otherwise reputable institutions. But, its popularity notwithstanding, the ‘Global South’ is a terrible term. Here is why.

 

There is no Global South

The Global South/Global North terms are inaccurate and misleading. First, they are descriptively inaccurate, even when they refer to general notions such as (economic) development. Second, they are homogenizing, obscuring important differences between the countries supposedly part of the Global South and Global North groups. In this respect, these terms are no better than the alternatives they are trying to replace, such as ‘the West’ or the ‘Third World’. Third, the Global South/Global North terms imply a geographic determinism that is wrong and demotivating. Poor countries are not doomed to be poor because they happen to be in the South, and their geographic position is not a verdict on their developmental prospects.

 

The Global South/Global North terms are inaccurate and misleading

Let me show you just how bad these terms are. I focus on human development, broadly defined and measured by the United Nations’ Human Development Index (HDI). The HDI tracks life expectancy, education, and standard of living, so it captures more than purely economic aspects of development.

The chart below plots the geographic latitude of a country’s capital against the country’s HDI score for 2017. (Click on the image for a larger size or download a higher-resolution pdf.) It is quite clear that a straight line from South to North is a poor description of the relationship between geographic latitude and human development. The correlation between the two is 0.48. A linear regression of HDI on latitude returns a positive coefficient, and the R-squared is 0.23. But, as is obvious from the plot, the relationship is not linear. In fact, some of the southernmost countries on the planet, such as Australia and New Zealand, but also Chile and Argentina, are in the top ranks of human development. The best summary of the relationship between HDI and latitude is curvilinear, as indicated by the Loess (nonparametric local regression) fit.

[Figure: HDI (2017) plotted against the latitude of each country’s capital, with linear and Loess fits]

You can say that we always knew that and that the ‘Global South’ was meant to refer to ‘distance from the equator’ rather than to latitude as such. But, first, this is rather offensive to people in New Zealand, Australia, South Africa and the southern part of South America. And, second, the relationship between human development and geographic position, as measured by distance from the equator, is still far from deterministic. The next plot (click on the image for a larger size, download a pdf version here) shows exactly that. Now, overall, the relationship is stronger: the correlation is 0.64. And beyond roughly the 10th degree, it is also rather linear, as indicated by the match between the linear regression line and the Loess fit. Still, there is important heterogeneity within the South/close-to-the-equator and North/far-from-the-equator groups of countries. Singapore’s HDI is almost as high as that of Sweden, despite the two being on opposite ends of the geographic scale. Ecuador’s HDI is just above Ukraine’s, although the former is more than 50 degrees closer to the equator than the latter. Gabon’s HDI is higher than Moldova’s, despite Gabon being 46 degrees closer to the equator than Moldova.

[Figure: HDI (2017) plotted against the distance of each country’s capital from the equator, with linear and Loess fits]

This is not to deny that there is a link between geographic position and human development. By the standards of social science, this is a rather strong correlation and a fairly smooth relationship. It is remarkable that no country more than 35 degrees from the equator has an HDI lower than 0.65 (excluding North Korea, for which the UN provides no HDI data). But there is still important diversity in human development within the different geographic zones. Moreover, the correlation between geographic position and development need not be causal, let alone deterministic.
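If you want a sense of what such an analysis looks like before downloading the full script and datafile linked at the end of this post, here is a minimal sketch in R. It is illustrative rather than the original code: the file name and the column names (capital_latitude, hdi) are placeholders, and the numbers in the comments are simply the ones quoted above.

```r
# Illustrative sketch only; file and column names are placeholders.
d <- read.csv("hdi_latitude_2017.csv")
d <- na.omit(d[, c("capital_latitude", "hdi")])

# Raw (signed) latitude of the capital vs. HDI
cor(d$capital_latitude, d$hdi)                  # reported as 0.48 in the text
summary(lm(hdi ~ capital_latitude, data = d))   # R-squared reported as 0.23

# Distance from the equator vs. HDI
d$dist_equator <- abs(d$capital_latitude)
cor(d$dist_equator, d$hdi)                      # reported as 0.64 in the text

# Scatterplot with a linear fit and a Loess (local regression) fit
plot(d$dist_equator, d$hdi,
     xlab = "Distance from the equator (degrees)", ylab = "HDI (2017)")
abline(lm(hdi ~ dist_equator, data = d), lty = 2)
o <- order(d$dist_equator)
lines(d$dist_equator[o], fitted(loess(hdi ~ dist_equator, data = d))[o])
```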

There are good arguments to be made that geography shapes and constrains the economic and social development of nations. My personal favorite is Jared Diamond’s idea that Eurasia’s continental spread along an East-West axis made it easier for food innovations and agricultural technology to diffuse, compared to the Americas’ continental spread along a North-South axis. But geography is not a verdict on development, as plenty of nations have demonstrated. Yet the Global South/Global North categories suggest otherwise.

 

What to use instead?

OK, so ‘Global South’ and ‘Global North’ are bad terms, but what to use instead? There is no obvious substitute that is more descriptively accurate, less homogenizing and less suggestive of (geographic) determinism. Then don’t use any categorization that is so general and coarse. There is a good reason why there is no appropriate alternative term: the countries of the world are too diverse to fit into two boxes: one for South and one for North, one for developed and one for non-developed, one for powerful and one for oppressed.

Be specific about what the term is referring to, and be concrete about the set of countries that is covered. If you mean the 20 poorest countries in the world, say ‘the 20 poorest countries in the world’, not ‘countries of the Global South’. If you mean technologically underdeveloped countries, say that and not ‘countries of the Third World’. If you mean rich, former colonial powers from Western Europe, say that and not ‘the Global North’. It takes a few more words, but it is more accurate and less misleading.

It is a bit ironic that the Global South/Global North terms are most popular among scholars and activists who are extremely sensitive to the power of words to shape public discourses, homogenize diverse populations, and support narratives that take on a life of their own, influencing politics and public policy. If that is the case, it is all the more imperative to avoid terms that are inaccurate, homogenizing and misleading on a global scale.

If you want to look at the data yourself, the R script for the figures is here and the datafile is here.

What’s a demockracy?

– What’s a democracy?

– Democracy means that people rule and the government respects the opinions of the citizens.

– So the government should do what the people want?

– In principle, yes, but…

– Can a majority of the people decide to abolish the parliament?

– No, the basic institutions of the state are usually set in the Constitution, and constitutional rules are not to be changed like that. Everything that is in the Constitution is off limits.

– OK, I can see why. Can the people decide that different groups deserve different pay for the same job?

– No, even if this is not outlawed by the Constitution, there is the Universal Declaration of Human Rights, and fundamental human rights are not to be changed by democratic majorities.

– Makes sense. Can the people decide on gay marriage? That’s not in the Declaration.

– Well, there are certain human rights that are not yet in constitutions or universal declarations, but we now recognize them as essential, so they are also not subject to majorities.

– OK, so in democracies the government does what the people want, but not when it comes to constitutional issues, recognized fundamental human rights, and other very important norms.

– Yes.

– So can the people decide to change the interest rate?

– Oh, no! Not even politicians can do that. Monetary policy is delegated to independent central banks.

– But people can decide on regulating tel…

– Nope, regulation is basically all delegated to independent agencies, so that’s out.

– Hm, ok, so can the people decide to change the terms of foreign trade?

– Not really, these are set in international treaties so people cannot change anything that is in international treaties just like that.

– Got it. But people surely can decide if their country goes to war or not?

– Well, foreign policy is tricky, there is a lot of secret information involved, complex strategies to be made and it needs rapid responses, so, no.

– OK, can people decide on pensions, then?

– Pensions affect the future lives of those who can’t vote yet, so current majorities can’t really decide.

– OK, so in democracies the government does what the people want, but not when it comes to constitutional issues, recognized fundamental human rights and other very important norms, and not on anything that is in international treaties, and not on monetary policy or any regulatory issues, and not on foreign policy, and not on pensions. But for the rest the government should do what the majority of people want?

– Well, not really. It might not be clear what people want: there could be cyclical majorities among policy alternatives. And it might not be clear how to respond: respecting majorities on particular issues might lead to disrespecting a majority of the people overall.

– That sounds complicated. But if there are no cyclical majorities and one can satisfy a majority of people on a majority of the issues, then one should do what the people want?

– Nope. People might not want what’s good for them. People don’t understand policy and don’t follow political developments closely enough. And people are duped by politicians and the media.

– Hard to disagree. I think I got it now: Democracy is a political system in which the government does what the people want, but not when it comes to constitutional issues, recognized fundamental human rights and other very important norms, and not on anything that is in international treaties, and not on monetary policy or any regulatory issues, and not on foreign policy, and not on pensions, and not on anything where it is unclear what the majority wants or how to satisfy a majority of people on a majority of issues, and then only if the people want what’s right for them, to be decided by some experts in government or outside. Now that’s what I call a real demockracy!

Books on data visualization

Here is a compilation of new and classic books on data visualization:

 

Scott Murray (2017) Interactive Data Visualization for the Web 

Elijah Meeks (2017) D3.js in Action: Data Visualization with JavaScript

Alberto Cairo (2016) The Truthful Art: Data, Charts, and Maps for Communication 

Andy Kirk (2016) Data Visualization 

David McCandless (2014) Knowledge is Beautiful 

 

Edward Tufte (2006) Beautiful Evidence  

Edward Tufte (2001) The Visual Display of Quantitative Information 

Edward Tufte (1997) Visual Explanations: Images and Quantities, Evidence and Narrative 

Edward Tufte (1990) Envisioning Information 

The Discursive Dilemma and Research Project Evaluation

tl;dr: When we collectively evaluate research proposals, we can reach opposite verdicts depending on how we aggregate the individual evaluations; that is a problem, and nobody seems to care or to provide guidance on how to proceed.

Imagine that three judges need to reach a verdict together using majority rule. To do that, the judges have to decide independently whether each of two factual propositions related to the suspected crime is true. (And they all agree that the defendant is guilty if and only if both propositions are true.)

The distribution of the judges’ beliefs is given in the table below. Judge 1 believes that both propositions are true, and as a result, considers the conclusion (defendant is guilty) true as well. Judges 2 and 3 consider that only one of the propositions is true and, as a result, reach a conclusion of ‘not guilty’. When the judges vote in accordance with their conclusions, a majority finds the defendant ‘not guilty’.

 

|                   | Proposition 1 | Proposition 2 | Conclusion         |
| Judge 1           | true          | true          | true (guilty)      |
| Judge 2           | false         | true          | false (not guilty) |
| Judge 3           | true          | false         | false (not guilty) |
| Majority decision | TRUE          | TRUE          | FALSE (not guilty) |

However, there is a majority that finds each of the two propositions true (see the last line in the table)! Therefore, if the judges vote on each proposition separately rather than directly on the conclusion, they will have to find the defendant ‘guilty’. That is, the judges will reach the opposite conclusion, even though nothing changes about their beliefs, they still agree that both propositions need to be true for a verdict of ‘guilty’, and the decision-making rule (majority) remains the same. The only thing that differs is the method through which the individual beliefs are combined: either by aggregating the conclusions or by aggregating the premises.
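To make the two aggregation routes concrete, here is a small sketch in R (the representation of the judges’ beliefs and the little majority() helper are mine, purely for illustration):

```r
# Each row is a judge, each column a factual proposition.
# Verdict rule (agreed by all): guilty if and only if both propositions are true.
beliefs <- rbind(Judge1 = c(TRUE,  TRUE),
                 Judge2 = c(FALSE, TRUE),
                 Judge3 = c(TRUE,  FALSE))

majority <- function(x) sum(x) > length(x) / 2

# Conclusion-based: each judge forms a verdict first, then the verdicts are pooled.
verdicts <- apply(beliefs, 1, all)   # Judge1 TRUE, Judge2 FALSE, Judge3 FALSE
majority(verdicts)                   # FALSE -> 'not guilty'

# Premise-based: each proposition is pooled first, then the verdict rule is applied.
pooled <- apply(beliefs, 2, majority)   # TRUE, TRUE
all(pooled)                             # TRUE -> 'guilty'
```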

This fascinating result, in which the outcome of a collective decision-making process changes depending on whether the decision-making procedure is premise-based or conclusion-based, is known as the ‘discursive dilemma’ or ‘doctrinal paradox’. The paradox is but one manifestation of a more general impossibility result:

“There exists no aggregation procedure (generating complete, consistent and deductively closed collective sets of judgments) which satisfies universal domain, anonymity and systematicity.” (List and Pettit, 2002)

Christian List published a survey of the topic in 2006 and keeps an annotated bibliography. The paradox is related to, but distinct from, Arrow’s impossibility theorem, which deals with the aggregation of preferences.

After this short introduction, let’s get to the point. My point is that the collective evaluation of scientific research proposals often falls victim to the discursive dilemma. Let me explain how.

Imagine three scientific experts evaluating an application for research funding that has three components. (These components can be three different aspects of the research proposal itself or three different parts of the application, such as the CV, the proposal, and the implementation plan.) For now, imagine that the experts evaluate each component only as being of excellent quality or not (a binary choice). Each expert uses majority rule to aggregate his or her scores on the three components into an overall conclusion, and the three experts reach a final collective conclusion using majority rule as well.

The distribution of the evaluations of the three experts on each of the three components of the application is given in the table below. Reviewer 1 finds Parts A and C excellent but Part B poor. Reviewer 2 finds Parts B and C excellent but Part A poor. And Reviewer 3 finds Parts A and B poor and Part C excellent. Overall, Reviewers 1 and 2 reach a conclusion of ‘excellent’ for the total application, while Reviewer 3 reaches a conclusion of ‘poor’. Aggregating the conclusions by majority rule, the application should be evaluated as ‘excellent’. However, looking at each part individually, there is a majority that finds both Part A and Part B ‘poor’, therefore the total evaluation should be ‘poor’ as well.

 

|                   | Part A    | Part B    | Part C    | Conclusion |
| Reviewer 1        | excellent | poor      | excellent | EXCELLENT  |
| Reviewer 2        | poor      | excellent | excellent | EXCELLENT  |
| Reviewer 3        | poor      | poor      | excellent | POOR       |
| Majority decision | POOR      | POOR      | EXCELLENT | ?          |

So which one is it? Is this an excellent proposal or not, according to our experts?

I do not know.

But I find it quite important to recognize that we can get completely different results from the evaluation process depending on how we aggregate the individual scores, even with exactly the same distribution of the scores and even when every expert is entirely consistent in his/her evaluation.

But before we discuss the normative appeal of the two different aggregation options, is this a realistic problem or a convoluted scenario made up to illustrate a theoretical point but of no relevance to the practice of research evaluation?

Well, I have been involved in a fair share of research evaluations for journals, publishing houses, and different national science foundations, as well as for the European Research Council (ERC). Based on my personal experience, I think that quite often there is a tension between aggregating expert evaluations by conclusions and by premises. Moreover, I have not seen clear guidelines on how to proceed when the different types of aggregation lead to different conclusions. As a result, the aggregation method is selected according to the implicit personal preferences of the one doing the aggregation.

Let’s go through a scenario that I am sure anyone who has been involved in some of the big ERC evaluations of individual research applications will recognize.

Two of the three reviewers find two of the three parts of the application ‘poor’, and the third reviewer finds one of the three parts poor and the other two parts ‘good’ (see the table below).

|                   | Part A | Part B | Part C | Conclusion |
| Reviewer 1        | poor   | poor   | good   | POOR       |
| Reviewer 2        | good   | poor   | poor   | POOR       |
| Reviewer 3        | good   | poor   | good   | GOOD       |
| Majority decision | GOOD   | POOR   | GOOD   | ?          |

Thus a majority of the final scores (the conclusions) indicates a ‘poor’ application. However, when the reviewers need to indicate the parts of the application that are ‘poor’, they cannot find many! For two out of the three parts, there is a majority that finds them ‘good’. Accordingly, by majority rule these cannot be listed as ‘weaknesses’ or given a poor score. Yet the total proposal is evaluated as ‘poor’ (i.e. unfundable).

There are three ways things can go from here, based on my experience. One response is, after having seen that there is no majority evaluating many parts of the application as ‘poor’ (or as a ‘weakness’), to adjust the overall scores of the application upwards. In other words, the conclusion is brought in line with the result of the premise-based aggregation. A second response is to ask the individual reviewers to reflect back on their evaluations and reconsider whether their scores on the individual parts need to be adjusted downwards (so the premises are brought in line with the result of the conclusion-based aggregation). A third response is to keep both the negative overall conclusion and a very, very short list of ‘weaknesses’ or arguments about which parts of the proposal are actually weak.

Now you know why you sometimes get evaluations saying that your project application is unfundable, but failing to point out what its problems are.

Again, I am not arguing that one of these responses or ways to resolve the dilemma is always the correct one (although I do have a preference, see below). But I think that (a) the problem should be recognized, and (b) there should be explicit guidelines on how to conduct the aggregation, so that less discretion is left to those doing it.

If I had to choose, I would go for conclusion-based aggregation. Typically, my evaluation of a project is not a direct sum of the evaluations of the individual parts, and it is based on more than can be expressed with the scores on the application’s components. Also, typically, having formed a conclusion about the overall merits of the proposal, I will search for good arguments for why the proposal is poor, but also add some nice things to say to balance the wording of the evaluation. But it is the overall conclusion that matters, and the rest is discursive post hoc justification framed to fit the requirements of the specific context of the evaluation process.

Another argument to be made in favor of conclusion-based aggregation is the idea that reviewers represent particular ‘world-views’ or perspectives, for example, stemming from their scientific (sub)discipline. Therefore, evaluations of individual parts of a research application should not be aggregated by majority, since the evaluations are not directly comparable. If I consider that a literature review presented in a project proposal is incomplete based on my knowledge of a specific literature, this assessment should not be overruled by two assessments that the literature review is complete coming from reviewers who are experts in different literatures than I am: we could all be right in light of what we know.

In fact, the only scenario in which premise-based aggregation (with subsequent adjustment of the conclusions) makes sense to me is one where all reviewers know, on average, the same things and provide scores that are, on average, unbiased but contain some random noise. In this case, majority aggregation of the premises filters out the noise.
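That intuition is easy to check with a small simulation. The sketch below is mine and rests on exactly those assumptions: three reviewers score three parts, each reviewer observes the true (binary) quality of each part but misjudges it independently with some probability, and both routes use the majority rules described above. The error rate and the ‘true’ part qualities are made up for illustration.

```r
# Illustrative simulation of the 'noisy but unbiased reviewers' scenario.
set.seed(1)
p_error <- 0.2                      # assumed chance of misjudging a part
truth   <- c(TRUE, FALSE, TRUE)     # assumed true quality of Parts A, B, C
true_verdict <- sum(truth) >= 2     # overall verdict if quality were known

majority <- function(x) sum(x) > length(x) / 2

hits <- replicate(10000, {
  # rows = reviewers, columns = parts; each cell flips with probability p_error
  scores <- matrix(ifelse(runif(9) < p_error, !truth, truth), nrow = 3, byrow = TRUE)
  conclusion_based <- majority(apply(scores, 1, function(s) sum(s) >= 2))
  premise_based    <- sum(apply(scores, 2, majority)) >= 2
  c(conclusion = conclusion_based == true_verdict,
    premise    = premise_based    == true_verdict)
})
rowMeans(hits)   # share of runs in which each route recovers the true verdict
```

With these made-up parameters, the premise-based route tends to recover the true verdict somewhat more often, which is the noise-filtering intuition described above.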

But I am sure that there are more and different arguments to be made, once we realize that the discursive dilemma is a problem for research evaluations and that, currently, different aggregation practices are allowed to proliferate unchecked.

I suspect that many readers, even if they got this far in the text, will remain unconvinced about the relevance of the problem I describe, because they think that (a) research evaluation is rarely binary but involves continuous scores, and (b) aggregation is rarely based on majority rule.

The first objection is easier to deal with. First, sometimes evaluation is binary, for example, when the evaluation committee needs to list ‘strengths’ and ‘weaknesses’. Second, even when evaluation is formally on a categorical or continuous scale, it is binary in practice, because anything below the top end of the scale is ‘unfundable’. Third, the discursive dilemma is also relevant for continuous judgements.

The second objection is pertinent. It is not that majority rule is not used when aggregating individual scores: it is, sometimes formally, more often informally. But in the practice of research evaluation these days, having anything less than a perfect score means that a project is not going to be funded. So whatever the method of aggregation, any objection (low score) by any reviewer is typically sufficient to derail an application. This is likely a much bigger normative problem for research evaluation, but one that requires a separate discussion.

And since we have to spend a lot of time preparing comprehensive evaluation reports, also for projects that are not going to be funded, the discursive dilemma needs to be addressed so that the final evaluations are consistent and clear to the researchers.

Intuitions about case selection are often wrong

Imagine the following simple setup: there are two switches (X and Z) and a lamp (Y). Both switches and the lamp are ‘On’. You want to know what switch X does, but you have only one try to manipulate the switches. Which one would you choose to switch off: X, Z or it doesn’t matter?

These are the results of the quick Twitter poll I did on the question:

Clearly, almost half of the respondents think it doesn’t matter, switching X is the second choice, and only 2 out of 15 would switch Z to learn what X does. Yet, it is by pressing Z that we have the best chance of learning something about the effect of X. This seems quite counter-intuitive, so let me explain.

First, let’s clarify the assumptions embedded in the setup: (A1) both switches and the lamp can be either ‘On’ [1] or ‘Off’ [0]; (A2) the lamp is controlled only by these switches: there is nothing outside the system that controls its output; (A3) X and Z can work individually or in combination (so that the lamp is ‘On’ only if both switches are ‘On’ simultaneously).

Now let’s represent the information we have in a table:

| Switch X | Switch Z | Lamp Y |
| 1        | 1        | 1      |
| 0        | 0        | 0      |

We are allowed to make one experiment in the setup (press only one switch). In other words, we can add an observation for one more row of the table. Which one should it be?

Well, let’s see what happens if we switch off X (let’s call this strategy S1). There are two possible outcomes: either the lamp stays on (S1a) or it goes off (S1b).

In the first case (represented as the second line in the table below) we can conclude that X is not necessary for the lamp to be ‘On’, but we do not know whether X can switch on the lamp on its own (whether it is sufficient to do so).

| Switch X | Switch Z | Lamp Y |
| 1        | 1        | 1      |
| 0        | 1        | 1      |
| 0        | 0        | 0      |

In the second case, if the lamp goes off when we switch off X, we know that X is necessary for the outcome, but we do not know whether X can turn on the lamp on its own or only in combination with Z.

| Switch X | Switch Z | Lamp Y |
| 1        | 1        | 1      |
| 0        | 1        | 0      |
| 0        | 0        | 0      |

To sum up, by pressing X we learn either that (S1a) X is not necessary or that (S1b) X matters but we do not know whether on its own or only in combination with Z.

 

Now, let’s see what happens if we press Z (strategy S2). Again either the lamp stays on (S2a) or it goes off (S2b).

Under the first scenario, we learn that X is sufficient to turn on the lamp.

| Switch X | Switch Z | Lamp Y |
| 1        | 1        | 1      |
| 1        | 0        | 1      |
| 0        | 0        | 0      |

Under the second scenario, we learn that X is not sufficient to turn on the light. It is still possible that it is necessary for turning on the lamp in combination with Z.

| Switch X | Switch Z | Lamp Y |
| 1        | 1        | 1      |
| 1        | 0        | 0      |
| 0        | 0        | 0      |

To sum up, by pressing Z we learn either that (S2a) X can turn on the lamp or (S2b) that it cannot turn on the lamp on its own but is possibly necessary in combination with Z. 

Comparing the two sets of inferences, I think it is clear that the second one is much more informative. By pressing Z we learn either that we can turn on the lamp by pressing X or that we cannot unless Z is ‘On’. By pressing X we learn next to nothing: either we are still in the dark about whether X works on its own to turn on the lamp (sorry for the pun), or we learn that X matters but we still do not know whether we also need Z to be ‘On’.

If you are still unconvinced, the following table summarizes all inferences under all strategies and contingencies about each of the possible effects (X, Z, and the interaction XZ):

| X works on its own | Z works on its own | Only XZ works | Strategy |
| ?                  | True               | False         | S1a      |
| ?                  | False              | ?             | S1b      |
| True               | ?                  | False         | S2a      |
| False              | ?                  | ?             | S2b      |

It should be obvious now that we are better off by pressing Z to learn about the effect of X.
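The same bookkeeping can be done mechanically. Here is a small enumeration sketch in R (the candidate mechanisms and their labels are mine, one way of spelling out assumption A3): list every mechanism consistent with the two observed rows, and see how each one-shot experiment splits them.

```r
# Candidate mechanisms consistent with the observed rows (X=1,Z=1 -> on; X=0,Z=0 -> off)
candidates <- list("X alone" = function(x, z) x,
                   "Z alone" = function(x, z) z,
                   "X or Z"  = function(x, z) x | z,
                   "X and Z" = function(x, z) x & z)

# Group the candidates by the lamp state they predict for a given experiment
split_by_outcome <- function(x, z)
  split(names(candidates), sapply(candidates, function(f) f(x, z)))

split_by_outcome(x = FALSE, z = TRUE)   # strategy S1: switch off X
# $`FALSE` "X alone", "X and Z" -> lamp goes off: X matters, alone or only with Z?
# $`TRUE`  "Z alone", "X or Z"  -> lamp stays on: X not necessary, sufficiency unknown

split_by_outcome(x = TRUE, z = FALSE)   # strategy S2: switch off Z
# $`FALSE` "Z alone", "X and Z" -> lamp goes off: X not sufficient on its own
# $`TRUE`  "X alone", "X or Z"  -> lamp stays on: X is sufficient on its own
```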

Good, but what’s the relevance of this little game? Well, the game resembles a research design situation in which we have one observation (case), we have the resources to add only one more, and we have to select which observation to make. In other words, the game is about case selection.

For example, we observe a case with a rare outcome – say, successful regional integration. We suspect that two factors are at play, both of which are present in the case – say, a high volume of trade within the integrating block and a democratic form of government for all units. And we want to probe the effect of trade volume in particular. In that case, the analysis above suggests that we should choose a case that has the same volume of trade but a non-democratic form of government, rather than a case with a low volume of trade and a democratic form of government.

This result is counter-intuitive, so let’s spell out why. First, note that we are interested in the effect of X (the effect of the switch and of trade volume) and not in explaining Y (how to turn on the lamp or how regional integration comes about). This is a subtle difference in interpretation, but one that is crucial for the analysis. Second, note that we are more interested in the effect of X than in the effect of Z, although both are potential causes of Y. If both X and Z are of equal interest, then obviously it doesn’t matter which observation we make. Third, the result hinges on the assumption that there is nothing other than X or Z (or their interaction) that matters for Y. Once we admit other possible causal variables into the set-up, we are no longer better off switching Z to learn the effect of X.

Sooooo, don’t take this little game as general advice on case selection. But it definitely shows that when it comes to research design our intuitions cannot always be trusted.

P.S. One assumption on which the analysis does not depend is binary effects and outcomes: it works equally well with probabilistic effects that are additive or multiplicative (involving an interaction). 

Learn more about research design.

More on QCA solution types and causal analysis

Following up on my post on QCA solution types and their appropriateness for causal analysis, Eva Thomann was kind enough to provide a reply. I am posting it here in its entirety:

Why I still don’t prefer parsimonious solutions (Eva Thomann)

Thank you very much, Dimiter, for initiating this blog debate and inviting me to reply. In your blog post, you outline why, absent counterevidence, you find it justified to reject applied Qualitative Comparative Analysis (QCA) paper submissions that do not use the parsimonious solution. I think I agree with some but not all of your points. Let me start by clarifying a few things.

Point of clarification 1: COMPASSS statement is about bad reviewer practice

It’s good to see that we all seem to agree that “no single criterion in isolation should be used to reject manuscripts during anonymous peer review”. The reviewer practice addressed in the COMPASSS statement is a bad practice. Highlighting this bad reviewer practice is the sole purpose of the statement. Conversely, the COMPASSS statement does not take sides when it comes to preferring specific solution types over others. The statement also does not imply anything about the frequency of this reviewer practice – that part of your post is pure speculation. Personally, I have heard people complaining about getting papers rejected for promoting or using conservative (QCA-CS), intermediate (QCA-IS) and parsimonious (QCA-PS) solutions with about the same frequency. But it is of course impossible for COMPASSS to get a representative picture of this phenomenon.

The term “empirically valid” refers to the, to the best of my knowledge, entirely undisputed fact that all solution types are (at least) based on the information contained in the empirical data. The disputed question is how we can or should go “beyond the facts” in causally valid ways when deriving QCA solutions.

Having said this, I will take off my “hat” as a member of the COMPASSS steering committee and contribute a few points to this debate. These points represent my own personal view and not that of COMPASSS or any of its bodies. I write as someone who sometimes uses QCA in her research and teaches it, too. Since I am not a methodologist, I won’t talk about fundamental issues of ontology and causality. I hope others will jump in on that.

Point of clarification 2: There is no point in personalizing this debate

In your comment you frequently refer to “the COMPASSS people”. But I find that pointless: COMPASSS hosts a broad variety of methodologists, users, practitioners, developers and teachers with different viewpoints and of different “colours and shapes”, some closer to “case-based” research, others closer to statistical/analytical research. Amongst others, Michael Baumgartner, whom you mention, is himself a member of the advisory board, and he has had methodological debates with his co-authors as well. Just because we can procedurally agree on a bad reviewer practice, it neither means we substantively agree on everything, nor does it imply that we disagree. History has amply shown how unproductive it can be for scientific progress when debates like these become personalized. Thus, if I could make a wish to you and everyone else engaging in this debate, it would be to talk about arguments rather than specific people. In what follows I will therefore refer to different approaches instead, unless referring to specific scholarly publications.

Point of clarification 3: There is more than one perspective on the validity of different solutions

As to your earlier point, which you essentially repeat here, that “if two solutions produce different causal recipes, e.g. (1) AB -> E and (2) ABC -> E, it cannot be that both (1) and (2) are valid”, my answer is: it depends on what you mean by “valid”.

It is common to look at QCA results as subset relations, here: statements of sufficiency. In a paper that is forthcoming in Sociological Methods & Research, Martino Maggetti and I call this the “approach emphasizing substantive interpretability”. From this perspective, the forward arrow “->” reads “is sufficient for”, and 1) in fact implies 2). Sufficiency means that X (here: AB) is a subset of Y (here: E). ABC is a subset of AB and hence it is also a subset of E, if AB is a subset of E. Logic dictates that any subset of a sufficient condition is also sufficient. Both are valid – they describe the sufficiency patterns in the data (and sometimes, some remainders) with different degrees of complexity.
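To see the subset point in a toy example (an illustrative sketch in R, with made-up crisp-set data constructed only so that AB -> E holds exactly):

```r
# All crisp-set combinations of A, B and C, with an outcome E for which AB -> E holds
cases <- expand.grid(A = 0:1, B = 0:1, C = 0:1)
cases$E <- with(cases, as.integer(A & B))

with(cases, all(E[A & B] == 1))       # AB is a subset of E: TRUE
with(cases, all(E[A & B & C] == 1))   # hence ABC is also a subset of E: TRUE
```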

Scholars promoting an “approach emphasizing redundancy-free models” agree with that, if we speak of mere (monotonic) subset relations. Yet they require QCA solutions to be minimal statements of causal relevance. From this perspective, the arrow (it then is <->, see below) reads “is causally relevant for”, and if 1) is true, 2) cannot be true: 2) additionally grants causal relevance to C, but in 1) we said only AB are causally relevant. As a causal statement, we can think of 2) as claiming more than 1).

To proponents of the approach emphasizing substantive interpretability (and I am one of them), it all boils down to the question:

“Can something be incorrect that follows logically and inevitably from a correct statement?”

Their brains shout:

“No, of course it can’t!”

I am making an informed guess here: this fact is so blatantly obvious to most people well-versed in set theory that it does not require a formal reply.

For everyone else, it is important to understand that in order to follow the reasoning you are proposing in your comment, you have to buy into a whole set of assumptions that underlie the method promoted in the publication you are referring to (Baumgartner 2015), called Coincidence Analysis or CNA. Let me illustrate this.

Point of clarification 4: QCA is not CNA

In fact, one cannot accept 2) if 1) is true in the special case when the condition “AB” is both minimally sufficient and contained in a minimally necessary condition for an outcome – which is also the situation you refer to (in your point 3). We then have to replace the forward arrow “->” with “<->”. In such a situation, the X set and the Y set are equivalent. Of course, if AB and E are equivalent, then ABC and E are not equivalent at the same time. In reality, this – simultaneous necessity and sufficiency – is a rare scenario that requires a solution to be maximally parsimonious and to have both a high consistency (indicating sufficiency) AND a very high coverage (indicating necessity).

But QCA – as opposed to CNA – is designed to assess necessary conditions and/or sufficient conditions. They don’t have to be both. As soon as we are speaking of a condition that is sufficient but not necessary (or not part of a necessary condition), then, if 1) is correct, 2) also has to be correct. You acknowledge this when saying that “if A is sufficient for E, AB is also sufficient, for any arbitrary B”.

I will leave it to the methodologists to clarify whether it is ontologically desirable to empirically analyse sufficient but not necessary (or necessary but not sufficient) conditions. As a political scientist, I find it theoretically and empirically interesting, and I believe this is in the tradition of much comparative political research. It is clear, and you seem to agree, that what we find to be correct entirely depends on how we define “correct” – there’s a danger of circularity here. At this point it has to be pointed out that CNA is not QCA. Both are innovative, elegant and intriguing methods with their own pros and cons. I am personally quite fascinated by CNA and would like to see more applications of it, but I am not convinced that we can or need to transfer its assumptions to QCA.

What I like about the recent publications advocating an approach emphasizing redundancy-free models is that they highlight that not all conditions contained in QCA solutions may be causally interpretable, if only we knew the true data-generating process (DGP). This points to the general question of causal arguments made with QCA under limited diversity, which has received ample scholarly attention for quite a while already.

Point of agreement 1: We need a cumulative process of rational critique

You argue that “the point about non-parsimonious solutions deriving faulty causal inferences seems settled, at least until there is a published response that rebukes it”. But QCA scholars have long highlighted issues of implausible and untenable counterfactuals entailed in parsimonious solutions (e.g. here, here, here, here, here, here and here). None of the published articles advocating redundancy-free models has so far made a concrete attempt to rebuke these arguments. Following your line of reasoning, the points made by previous scholarship about parsimonious solutions deriving faulty causal inferences seem equally settled, at least until there is a published response that rebukes them.

Indeed, advocates of redundancy-free models seem either to dismiss the relevance of counterfactuals altogether, because CNA, so it is argued, does not rely on counterfactuals to derive solutions; OR to argue that in the presence of limited diversity all solutions rely on counterfactuals. (Wouldn’t it be contradictory to argue both?) I personally would agree with the latter point. There can be no doubt that QCA (as opposed, perhaps, to CNA) is a set-theoretic, truth-table-based method that, in the presence of limited diversity, involves counterfactuals. Newer algorithms (such as eQMC, used in the QCA package for R) no longer actively “rely on” remainders for minimization, and they exclude difficult and untenable counterfactuals rather than including tenable and “easy” counterfactuals. But the reason why QCA involves counterfactuals remains that intermediate and parsimonious QCA solutions involve configurations of conditions some of which are empirically observed, while others (the counterfactuals) are not. There can be only one conclusion: the question of whether these counterfactuals are valid requires our keen attention.

Where does that leave us? To me, all of that certainly does not mean that “the reliance on counterfactuals cannot be used to arbitrate this debate”. It means that different scholars have highlighted different issues relating to the validity of all solution types. None of these points has been conclusively rebuked so far. That, of course, leaves users in an intricate situation. They should not be punished for consistently and correctly following the protocols proposed by methodologists of one or another approach.

Point of agreement 2: In the presence of limited diversity, QCA solutions can err in different directions

Parsimonious solutions are by no means unaffected by the problem that limited empirical diversity challenges our confidence in inferences. Indeed, we should be careful not to overlook that they err, too. As Ingo Rohlfing has pointed out, the question of in which direction we want to err is different from the question of which solution is correct. The answer to the former question probably depends.

Let us return to the above example and assume that we have a truth table characterized by limited diversity. We get a conservative solution

(CS) ABC -> E,

and a parsimonious solution

(PS) A -> E.

Let us further assume that we know (which in reality we never do) that the true DGP is

(DGP) AB -> E.

Neither CS nor PS gives us the true DGP. To recap: to scholars emphasizing redundancy-free models, PS is “correct” because they define as “correct” a solution that does not contain causally irrelevant conditions. But note that PS here is also incomplete: the true result in this example is that, in order to observe the outcome E, A alone is not enough; it has to combine with B. Claiming that A alone is enough involves a counterfactual that could well be untenable. But the evidence alone does not allow us to conclude that B is irrelevant for E. It is usually only by making this type of oversimplification that parsimonious solutions reach the high coverage values required to be “causally interpretable” under an approach emphasizing redundancy-free models.

To anyone with some basic training in QCA, this should raise some serious questions: isn’t one of the core assumptions of QCA that we cannot interpret the single conditions in its results in isolation, because they unfold their effect only in combination with other conditions? How, then, does QCA-PS fare when assessed against this assumption? I have not read a conclusive answer to this question yet.

Baumgartner and Thiem (2017) point out that with imperfect data, no method can be expected to deliver complete results. That may well be, but in QCA we deal with two types of completeness: complete AND-configurations, and the inclusion of all substitutable paths or “causal recipes” combined with the logical OR. In order to interpret a QCA solution as a sufficient condition, I want to be reasonably sure that the respective AND-configuration in fact reliably implies the outcome (even if the solution omits other configurations that may not have been observed in my data). Using this criterion, QCA-PS arguably fares worst (it most often misses causally relevant factors) and QCA-CS fares best (though it most often also still includes causally irrelevant factors).

To be sure, QCA-PS is sufficient for the outcome in the dataset under question. But I am unsure how I have to read it: “either X implies Y, or I did not observe X”? Or “X is causally relevant for Y in the data under question, but I don’t know whether it suffices on its own”? There may well be specific situations in which all we want to know is whether some conditions are causally relevant subsets of sufficient conditions or not. But I find it misleading to claim that this is the only legitimate or even the main research interest of studies using QCA. I can think of many situations, such as public health crises or enforcing EU law, in which reliably achieving or preventing an outcome would have priority.

Let me be clear. The problem we are talking about is really neither QCA nor some solution type. The elephant in the room is essentially that observational data are rarely perfect and do not obey the laws of logic. But is QCA-PS really the best, or the only, or at all a way out of this problem?

Point of agreement 3: There are promising and less promising strategies for causal assessment

The technical moment of QCA shares with statistical techniques that it is simply a cross-case comparison of data-set observations. As such, it also shares with other methods the limited possibility of directly deriving causal inferences from observational data. Most QCA scholars would therefore be very cautious about interpreting QCA results causally when using observational data and in the presence of limited diversity. Obviously, a set relation does not equal causation. How, then, could a specific minimization algorithm alone plausibly facilitate causal interpretability?

QCA (as opposed to CNA) was always designed to be a multi-method approach. This means that the inferences of the cross-case comparison are not just interpreted as such, but strengthened and complemented with additional insights, usually theoretical, conceptual and case knowledge. Or, as Ragin (2008: 173) puts it:

“Social research (…) is built upon a foundation of substantive and theoretical knowledge, not just methodological technique”.

This way, we can combine the advantages offered by different methods and sources. Used in a formalized way, the combination of QCA with process tracing can even help to disentangle causally relevant from causally irrelevant conditions. This, of course, does not preclude the possibility that some solution types may lend themselves more to causal interpretation than others. It does suggest, though, that focusing on specific solution types alone is an ill-suited strategy for making valid causal assessments.

Point of disagreement: Nobody assumes that “everything matters”

Allow me to disagree that an approach emphasizing substantive interpretability assumes that “everything is relevant”. Of course that is nonsense. As with any other social science method I know, the researcher first screens the literature and the field in order to identify potentially relevant explanatory factors. The logic of truth table analysis (as opposed to CNA?) is then to start out with the subset of these previously identified conditions that themselves consistently form a subset of the outcome set, and then to search for evidence that they are irrelevant. This is not even an assumption, and it is very far from being “everything”.

Ways ahead

In my view it makes sense to have a division of labour: users follow protocols, methodologists foster methodological innovation and progress. I hope the above has made it clear that we are in the midst of what is, in my view, a welcome and needed debate about what “correctness” and “validity” mean in the QCA context. I find it useful to think of this as a diversity of approaches to QCA. It is important that researchers reflect on the ontology that underlies their work, but we should avoid drawing premature conclusions as well.

Currently (but I may be proven wrong) I am thinking that each solution type has its merits and limitations. We cannot eliminate limited diversity, but we can use different solution types for different purposes. For example, if policymakers seek to avoid investing public money in potentially irrelevant measures, the parsimonious solution could be best. If they are interested in creating situations that are 100% sure to ensure an outcome (e.g. disease prevention), then the conservative solution is best and the parsimonious solution very risky. If we have strong theoretical knowledge or prior evidence available for counterfactual reasoning, intermediate solutions are best. And so on. From this perspective, it is good that we can refer to different solution types with QCA. It forces researchers to think consciously about what the goal of their analysis is and how it can be adequately reached. It prevents them from just mechanically running some algorithm on their data.

All of the above is why I agree with the COMPASSS statement that …

“the current state of the art is characterized by discussions between leading methodologists about these questions, rather than by definitive and conclusive answers. It is therefore premature to conclude that one solution type can generally be accepted or rejected as “correct”, as opposed to other solution types”.