What’s a demockracy?

– What’s a democracy?

– Democracy means that people rule and the government respects the opinions of the citizens.

– So the government should do what the people want?

– In principle, yes, but…

– Can a majority of the people decide to abolish the parliament?

– No, the basic institutions of the state are usually set in the Constitution, and constitutional rules are not to be changed like that. Everything in the Constitution is off limits.

– OK, I can see why. Can the people decide that different groups deserve different pay for the same job?

– No, even if this is not outlawed by the Constitution, there is the Universal Declaration of Human Rights, and fundamental human rights are not to be changed by democratic majorities.

– Makes sense. Can the people decide on gay marriage? That’s not in the Declaration.

– Well, there are certain human rights that are not yet in constitutions or the Universal Declaration, but we now recognize them as essential, so they are also not subject to majorities.

– OK, so in democracies the government does what the people want, but not when it comes to constitutional issues, recognized fundamental human rights, and other very important norms.

– Yes.

– So can the people decide to change the interest rate?

– Oh, no! Not even politicians can do that. Monetary policy is delegated to independent central banks.

– But people can decide on regulating tel…

– Nope, regulation is basically all delegated to independent agencies, so that’s out.

– Hm, ok, so can the people decide to change the terms of foreign trade?

– Not really; the terms of trade are set in international treaties, and people cannot change anything in an international treaty just like that.

– Got it. But people surely can decide if their country goes to war or not?

– Well, foreign policy is tricky: there is a lot of secret information involved, complex strategies to be devised, and a need for rapid responses, so, no.

– OK, can people decide on pensions, then?

– Pensions affect the future lives of those who can’t vote yet, so current majorities can’t really decide.

– OK, so in democracies the government does what the people want, but not when it comes to constitutional issues, recognized fundamental human rights and other very important norms, and not on anything that is in international treaties, and not on monetary policy or any regulatory issues, and not on foreign policy, and not on pensions. But for the rest the government should do what the majority of people want?

– Well, not really. It might not be clear what people want: there could be cyclical majorities among policy alternatives. And it might not be clear how to respond: respecting majorities on particular issues might lead to disrespecting a majority of the people overall.

– That sounds complicated. But if there are no cyclical majorities and one can satisfy a majority of people on a majority of the issues, then one should do what the people want?

– Nope. People might not want what’s good for them. People don’t understand policy and don’t follow political developments closely enough. And people are duped by politicians and the media.

– Hard to disagree. I think I got it now: Democracy is a political system in which the government does what the people want, but not when it comes to constitutional issues, recognized fundamental human rights and other very important norms, and not on anything that is in international treaties, and not on monetary policy or any regulatory issues, and not on foreign policy, and not on pensions, and not on anything where it is unclear what the majority wants or how to satisfy a majority of people on a majority of the issues, and then only if the people want what’s right for them, to be decided by some experts in government or outside. Now that’s what I call a real demockracy!

The Discursive Dilemma and Research Project Evaluation

tl;dr When we collectively evaluate research proposals, we can reach opposite verdicts depending on how we aggregate the individual evaluations. That’s a problem, and nobody seems to care or provide guidance on how to proceed.

Imagine that three judges need to reach a verdict together using majority rule. To do that, the judges have to decide independently whether each of two factual propositions related to the suspected crime is true. (They all agree that the defendant is guilty if and only if both propositions are true.)

The distribution of the judges’ beliefs is given in the table below. Judge 1 believes that both propositions are true, and as a result, considers the conclusion (defendant is guilty) true as well. Judges 2 and 3 consider that only one of the propositions is true and, as a result, reach a conclusion of ‘not guilty’. When the judges vote in accordance with their conclusions, a majority finds the defendant ‘not guilty’.

 

                   Proposition 1   Proposition 2   Conclusion
Judge 1            true            true            true (guilty)
Judge 2            false           true            false (not guilty)
Judge 3            true            false           false (not guilty)
Majority decision  TRUE            TRUE            FALSE (not guilty)

However, there is a majority that finds each of the two propositions true (see the last line in the table)! Therefore, if the judges vote on each proposition separately rather than directly on the conclusion, they will have to find the defendant ‘guilty’. That is, the judges will reach the opposite conclusion, even though nothing changes about their beliefs, they still agree that both propositions need to be true for a verdict of ‘guilty’, and the decision-making rule (majority) remains the same. The only thing that differs is the method through which the individual beliefs are combined: either by aggregating the conclusions or by aggregating the premises.
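The two routes can be made concrete in a few lines of Python (a minimal sketch of the example above; the function names are mine, and the vote values are taken directly from the table):

```python
# Sketch of the doctrinal paradox: the same individual judgments yield
# opposite verdicts under conclusion-based vs premise-based aggregation.

def majority(votes):
    """True iff a strict majority of the boolean votes is True."""
    return sum(votes) > len(votes) / 2

# Each judge's beliefs on the two factual propositions (from the table).
judges = {
    "Judge 1": (True, True),
    "Judge 2": (False, True),
    "Judge 3": (True, False),
}

# Conclusion-based: each judge first derives a verdict (guilty iff both
# propositions are true), then the verdicts are aggregated by majority.
conclusions = [p1 and p2 for p1, p2 in judges.values()]
conclusion_based = majority(conclusions)   # False -> 'not guilty'

# Premise-based: each proposition is put to a majority vote first, then
# the verdict follows from the collectively accepted premises.
p1_majority = majority([p1 for p1, _ in judges.values()])  # True
p2_majority = majority([p2 for _, p2 in judges.values()])  # True
premise_based = p1_majority and p2_majority                # True -> 'guilty'

print(conclusion_based, premise_based)  # False True
```

Nothing about the judges’ beliefs changes between the two computations; only the order of aggregation and inference differs.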

This fascinating result, in which the outcome of a collective decision-making process changes depending on whether the decision-making procedure is premise-based or conclusion-based, is known as the ‘discursive dilemma’ or ‘doctrinal paradox’. The paradox is but one manifestation of a more general impossibility result:

“There exists no aggregation procedure (generating complete, consistent and deductively closed collective sets of judgments) which satisfies universal domain, anonymity and systematicity.” (List and Pettit, 2002).

Christian List published a survey of the topic in 2006 and keeps an annotated bibliography. The paradox is related to, but separate from, Arrow’s impossibility theorem, which deals with the aggregation of preferences.

After this short introduction, let’s get to the point. My point is that the collective evaluation of scientific research proposals often falls victim to the discursive dilemma. Let me explain how.

Imagine three scientific experts evaluating an application for research funding that has three components. (These components can be three different aspects of the research proposal itself or three different parts of the application, such as CV, proposal, and implementation plan.) For now, imagine that the experts evaluate each component only as of excellent quality or not (a binary choice). Each expert reaches an overall conclusion by majority rule over his or her component scores, and the three experts reach a final collective verdict using majority rule as well.

The distribution of the evaluations of the three experts on each of the three components of the application is given in the table below. Reviewer 1 finds Parts A and C excellent but Part B poor. Reviewer 2 finds Parts B and C excellent but Part A poor. And Reviewer 3 finds Parts A and B poor and Part C excellent. Overall, Reviewers 1 and 2 reach a conclusion of ‘excellent’ for the total application, while Reviewer 3 reaches a conclusion of ‘poor’. Aggregating the conclusions by majority rule, the application should be evaluated as ‘excellent’. However, looking at each part individually, there is a majority that finds both Parts A and B ‘poor’, so the total evaluation should be ‘poor’ as well.

 

                   Part A     Part B     Part C     Conclusion
Reviewer 1         excellent  poor       excellent  EXCELLENT
Reviewer 2         poor       excellent  excellent  EXCELLENT
Reviewer 3         poor       poor       excellent  POOR
Majority decision  POOR       POOR       EXCELLENT  ?

So which one is it? Is this an excellent proposal or not, according to our experts?

I do not know.

But I find it quite important to recognize that we can get completely different results from the evaluation process depending on how we aggregate the individual scores, even with exactly the same distribution of the scores and even when every expert is entirely consistent in his/her evaluation.
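The divergence in the reviewer table can be reproduced in the same way as for the judges (a sketch; the scores follow the table above, encoded as booleans where True stands for ‘excellent’, and the function and variable names are mine):

```python
# The two aggregation routes applied to the reviewer table.
# True = 'excellent', False = 'poor'.

def majority(votes):
    """True iff a strict majority of the boolean votes is True."""
    return sum(votes) > len(votes) / 2

# Rows: Reviewers 1-3; columns: Parts A, B, C (from the table above).
scores = [
    [True,  False, True],   # Reviewer 1
    [False, True,  True],   # Reviewer 2
    [False, False, True],   # Reviewer 3
]

# Conclusion-based: each reviewer's verdict is the majority of his or her
# own part scores; the verdicts are then aggregated across reviewers.
verdicts = [majority(row) for row in scores]          # [True, True, False]
conclusion_based = majority(verdicts)                 # True  -> 'excellent'

# Premise-based: take the majority per part first, then aggregate the
# collectively decided part scores.
part_majorities = [majority(col) for col in zip(*scores)]  # [False, False, True]
premise_based = majority(part_majorities)                  # False -> 'poor'

print("conclusion-based:", conclusion_based)  # True
print("premise-based:", premise_based)        # False
```

Same scores, same majority rule, opposite overall verdicts.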

But before we discuss the normative appeal of the two different aggregation options, is this a realistic problem or a convoluted scenario made up to illustrate a theoretical point but of no relevance to the practice of research evaluation?

Well, I have been involved in a fair share of research evaluations for journals, publishing houses, different national science foundations, as well as for the European Research Council (ERC). Based on my personal experience, I think that quite often there is a tension between aggregating expert evaluations by conclusion and by premises. Moreover, I have not seen clear guidelines on how to proceed when the different types of aggregation lead to different conclusions. As a result, the aggregation method is selected by the implicit personal preferences of the one doing the aggregation.

Let’s go through a scenario that I am sure anyone who has been involved in some of the big ERC evaluations of individual research applications will recognize.

Two of the three reviewers find two of the three parts of the application ‘poor’, and the third reviewer finds one of the three parts poor and the other two parts ‘good’ (see the table below).

                   Part A  Part B  Part C  Conclusion
Reviewer 1         poor    poor    good    POOR
Reviewer 2         good    poor    poor    POOR
Reviewer 3         good    poor    good    GOOD
Majority decision  GOOD    POOR    GOOD    ?

Thus a majority of the final scores (the conclusions) indicates a ‘poor’ application. However, when the reviewers need to indicate the parts of the application that are ‘poor’, they cannot find many! There is a majority for two out of the three parts that finds them ‘good’. Accordingly, by majority rule these cannot be listed as ‘weaknesses’ or given a poor score. Yet the total proposal is evaluated as ‘poor’ (i.e. unfundable).

There are three ways things go from here, based on my experience. One response is, after seeing that no majority evaluates many parts of the application as ‘poor’ (or as a ‘weakness’), to adjust the overall scores of the application upwards. In other words, the conclusion is brought in line with the result of the premise-based aggregation. A second response is to ask the individual reviewers to reflect back on their evaluations and reconsider whether their scores on the individual parts need to be adjusted downwards (so the premises are brought in line with the result of the conclusion-based aggregation). A third response is to keep both a negative overall conclusion and a very, very short list of ‘weaknesses’ or arguments about which parts of the proposal are actually weak.

Now you know why you sometimes get evaluations saying that your project application is unfundable, but failing to point out what its problems are.

Again, I am not arguing that one of these responses or ways to solve the dilemma is always the correct one (although I do have a preference, see below). But I think (a) the problem should be recognized, and (b) there should be explicit guidelines on how to conduct the aggregation, so that less discretion is left to those doing it.

If I had to choose, I would go for conclusion-based aggregation. Typically, my evaluation of a project is not a direct sum of the evaluations of the individual parts, and it is based on more than can be expressed with the scores on the application’s components. Also typically, having formed a conclusion about the overall merits of the proposal, I will search for good arguments for why the proposal is poor, but also add some nice things to say to balance the wording of the evaluation. But it is the overall conclusion that matters, and the rest is discursive post hoc justification framed to fit the requirements of the specific context of the evaluation process.

Another argument to be made in favor of conclusion-based aggregation is the idea that reviewers represent particular ‘world-views’ or perspectives, for example, stemming from their scientific (sub)discipline. Therefore, evaluations of individual parts of a research application should not be aggregated by majority, since the evaluations are not directly comparable. If I consider that a literature review presented in a project proposal is incomplete based on my knowledge of a specific literature, this assessment should not be overruled by two assessments that the literature review is complete coming from reviewers who are experts in different literatures than I am: we could all be right in light of what we know.

In fact, the only scenario in which premise-based aggregation (with subsequent adjustment of the conclusions) makes sense to me is one where all reviewers know, on average, the same things and they provide, on average, scores without bias but with some random noise. In this case, majority aggregation of the premises filters the noise.
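This noise-filtering intuition can be checked with a quick simulation (a hypothetical sketch: the 30% error rate, the number of simulated parts, and the fifty-fifty quality distribution are my illustrative assumptions, not figures from any actual evaluation):

```python
import random

# Sketch of the noise-filtering argument: three reviewers see the same
# true quality of each part but report it with independent random error.
# Per-part majority voting recovers the truth more often than any single
# reviewer does.

random.seed(42)

N_PARTS = 10_000   # number of simulated application parts
ERROR = 0.3        # probability that a reviewer mis-scores a part

def noisy_report(truth):
    """A reviewer's score: the true value, flipped with probability ERROR."""
    return truth if random.random() > ERROR else not truth

true_quality = [random.random() < 0.5 for _ in range(N_PARTS)]

single_correct = 0
majority_correct = 0
for truth in true_quality:
    reports = [noisy_report(truth) for _ in range(3)]
    single_correct += (reports[0] == truth)
    majority_correct += ((sum(reports) >= 2) == truth)

# With ERROR = 0.3, a lone reviewer is right about 70% of the time, while
# the per-part majority is right about 78% of the time
# (3p^2(1-p) + p^3 with p = 0.7).
print(single_correct / N_PARTS, majority_correct / N_PARTS)
```

Of course, the simulation’s gain from majority voting evaporates once the reviewers’ errors are correlated or biased, which is exactly the ‘different world-views’ scenario above.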

But I am sure that there are more and different arguments to be made, once we realize that the discursive dilemma is a problem for research evaluations and that currently different aggregation practices are allowed to proliferate unchecked.

I suspect that many readers, even if they got this far in the text, would be unconvinced about the relevance of the problem I describe, because they think that (a) research evaluation is rarely binary but involves continuous scores, and (b) aggregation is rarely based on majority rule.

The first objection is easier to deal with: First, sometimes evaluation is binary, for example, when the evaluation committee needs to list ‘strengths’ and ‘weaknesses’. Second, even when evaluation is formally on a categorical or continuous scale, it is in practice binary, because anything below the top end of the scale is ‘unfundable’. Third, the discursive dilemma is also relevant for continuous judgments.

The second objection is pertinent. It is not that majority rule is not used when aggregating individual scores: it is, sometimes formally, more often informally. But in the practice of research evaluation these days, having anything less than a perfect score means that a project is not going to be funded. So whatever the method of aggregation, any objection (low score) by any reviewer is typically sufficient to derail an application. This is likely a much bigger normative problem for research evaluation, but one that requires a separate discussion.

And since we have to spend a lot of time preparing comprehensive evaluation reports, also of projects that are not going to be funded, the discursive dilemma needs to be addressed so that the final evaluations are consistent and clear to the researchers.

Torture and game theory

The latest issue of Political Research Quarterly has an interesting and important exchange about the use of game theory to understand the effectiveness of torture for eliciting truthful information. In this post I summarize the discussion, which is quite instructive for illustrating the prejudices and misunderstandings people have about the role and utility of game theory as a tool to gain insights into the social world.

In the original article, Schiemann builds a strategic incomplete-information game between a detainee (who can either possess valuable information or not, and be either ‘strong’ or ‘weak’) and a state which can be either ‘pragmatic’ (using torture only for valuable information) or ‘sadistic’ (torturing in all circumstances). There are two additional parameters capturing uncertainty about the value and completeness of the information provided by the detainee, and two styles of interrogation (providing leading evidence or not). The article then proceeds to identify the equilibria of the game, which turn out to be quite a few (six), and quite different – in some, truthful information is provided while in others, not; in some, torture is applied while in others, not; and so on. At this point you would be excused for wondering what the point of the formal modeling is if it only shows that, depending on the parameters, different things are possible.

Schiemann, however, makes a brilliant move by comparing each of these equilibria to some minimal normative standards that proponents of torture claim to uphold – namely, that torture should not be used on detainees who have provided all their information, that transmitted information should be generally reliable, and that in all cases only the minimum effective amount of torture should be applied. It turns out that none of these minimal normative standards is sustained by any of the equilibria of the game. If interrogational torture is to generate valuable information, ‘innocent detainees must be tortured for telling the truth’. The intuition is that unless the threat of torture is present, even ‘weak’ detainees would not confess; but for the threat of torture to be credible, it needs to be applied to innocent detainees as well (who, of course, from the point of view of the state are observationally equivalent to strong and knowledgeable detainees). Things get even uglier: ‘Proposition 4. Once torture is admitted as an interrogation technique, the strategic incentives facing the interrogator result in increasingly harsh forms of torture.’ Overall, the conclusion is that ‘An outcome resulting in valuable information…is possible, but the conditions supporting it are empirically unlikely.’

Let’s recap what Schiemann’s formal analysis has demonstrated: torture can never extract valuable information unless innocents are tortured and the frequency and intensity of torture are rather high, and even then it would be very difficult to separate valid information from all the other ‘confessions’ made during the interrogations. For me, this is a devastating critique of the use of torture – the analysis not only shows that the effectiveness of torture is likely to be very low (empirical evidence has already pointed in that direction), but it shows why torture doesn’t work (unless one violates minimal normative standards that even proponents of the practice espouse).

Dustin Ells Howes, however, begs to differ. In a response to this analysis, entitled ‘Torture Is Not a Game: On the Limitations and Dangers of Rational Choice Methods’, he questions the fundamental premise of the analysis: that torture can be modeled as a strategic interaction between agents who possess information, preferences and control over their actions. His main point is that under torture humans cannot be considered to have any agency at all. Fair enough, but then he proceeds to discuss how some individuals can withstand torture after all by the force of ‘free will’. So, ultimately, there are distinct states of the world that follow the exercise of torture – ‘confession’ (false or real) and ‘no confession’. So what’s the quibble with the game-theoretic analysis? Granted, it sounds a bit perverse to call confessing a choice under circumstances that destroy your entire sense of being a person, in addition to the overwhelming physical pain they bring, but it matters little for the analysis whether you label it ‘choice’ or something else (‘expression of a strong free will’[?]). The fact remains that the state cannot sort out in advance which detainees possess information, which will confess, and which have already told everything they know. So torturing often and harshly, and punishing innocents and those who actually reveal everything they know, is unavoidable once one accepts the use of torture as a legitimate tool.

But in the mind of Howes, one should not even try to reason about the effectiveness of torture. It is dangerous to attempt to model torture, because, even if the current model shows that torture is ineffective and unjustifiable, once the principle of reasoning on the basis of formal models is accepted, others will build models that might show that torture works.

‘..[B]y placing his model within the framework of social science, he invites others to challenge him on that basis. If creating a formal model of interrogational torture is a legitimate way to argue against it, then social scientists could legitimately use the same methods to argue for it.’

At one level, I agree. Decades of game-theoretic modeling in economics have shown that by choosing the right assumptions and setup of the game, one can derive any result one wishes. But at the same time, there is something characteristically medieval about the argument – torture should be beyond the realm of reason; the only arguments we should have about the practice should be emotional and moral, not rational and theoretical. Was it Anselm of Canterbury who lamented that he could prove the existence of God, anticipating that reason would ultimately be used to deny God’s presence?

What’s more annoying about Howes’ critique is that instead of discussing the original analysis, he prefers to attack rational choice in general as a research paradigm: ‘The most strident critics of rational choice theory argue that it distorts reality in a way that is corrosive to democracy.’… ‘The close relationship between the rise of rational choice theory in the social sciences and U.S. government and military initiatives is well established.’ This makes as much sense as rejecting the physics of nuclear fusion because its study has its origins in Nazi Germany. From these blanket statements about rational choice, Howes jumps to the conclusion that ‘Schiemann’s formal model is conducive to bureaucratic violence.’ Not sure what that means, but it sounds nasty.

Predictably, Schiemann’s response easily demolishes these ‘critiques’ and reaffirms the utility of game theory to shed light on normative political questions. But I find it a bit disturbing that critiques of the use of reason (and models) to shed light on social and political phenomena can still find a place on the pages of scientific journals at all.

P.S. Exchanges on the pages of academic journals are a great way to learn. Here is another post which reviews an exchange related to gender discrimination at work.