tl;dr: When we collectively evaluate research proposals, we can reach opposite verdicts depending on how we aggregate the individual evaluations. That is a problem, yet nobody seems to care or to provide guidance on how to proceed.
Imagine that three judges need to reach a verdict together using majority rule. To do that, the judges have to decide independently whether each of two factual propositions related to the suspected crime is true. (They all agree that the defendant is guilty if and only if both propositions are true.)
The distribution of the judges’ beliefs is given in the table below. Judge 1 believes that both propositions are true, and as a result, considers the conclusion (defendant is guilty) true as well. Judges 2 and 3 consider that only one of the propositions is true and, as a result, reach a conclusion of ‘not guilty’. When the judges vote in accordance with their conclusions, a majority finds the defendant ‘not guilty’.
| |Proposition 1|Proposition 2|Conclusion|
|---|---|---|---|
|Judge 1|true|true|true (guilty)|
|Judge 2|false|true|false (not guilty)|
|Judge 3|true|false|false (not guilty)|
|Majority decision|TRUE|TRUE|FALSE (not guilty)|
However, there is a majority that finds each of the two propositions true (see the last line in the table)! Therefore, if the judges vote on each proposition separately rather than directly on the conclusion, they will have to find the defendant ‘guilty’. That is, the judges will reach the opposite conclusion, even though nothing changes about their beliefs, they still agree that both propositions need to be true for a verdict of ‘guilty’, and the decision-making rule (majority) remains the same. The only thing that differs is the method through which the individual beliefs are combined: either by aggregating the conclusions or by aggregating the premises.
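The two procedures can be written down in a few lines. This is only a minimal sketch: the beliefs are the ones from the table, while the names `judges` and `majority` are my own choices for illustration.

```python
# Beliefs from the table: (Proposition 1, Proposition 2) for each judge.
judges = {
    "Judge 1": (True, True),
    "Judge 2": (False, True),
    "Judge 3": (True, False),
}


def majority(votes):
    """Return True when strictly more than half of the votes are True."""
    votes = list(votes)
    return sum(votes) > len(votes) / 2


# Conclusion-based: each judge first derives a verdict (guilty if and only if
# both propositions are true), then the verdicts are put to a majority vote.
conclusion_based = majority(p1 and p2 for p1, p2 in judges.values())

# Premise-based: each proposition is put to a majority vote separately, and
# the verdict is derived from the two collective judgments.
premise_1 = majority(p1 for p1, _ in judges.values())
premise_2 = majority(p2 for _, p2 in judges.values())
premise_based = premise_1 and premise_2

print("Conclusion-based verdict: guilty =", conclusion_based)  # False
print("Premise-based verdict:    guilty =", premise_based)     # True
```

The inputs and the voting rule are identical in both cases; only the order of 'derive the verdict' and 'take the majority' is swapped.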
This fascinating result, in which the outcome of a collective decision-making process changes depending on whether the decision-making procedure is premise-based or conclusion-based, is known as the ‘discursive dilemma’ or ‘doctrinal paradox’. The paradox is but one manifestation of a more general impossibility result:
“There exists no aggregation procedure (generating complete, consistent and deductively closed collective sets of judgments) which satisfies universal domain, anonymity and systematicity.” (List and Pettit, 2002).
Christian List published a survey of the topic in 2006 and keeps an annotated bibliography. The paradox is related to, but separate from, Arrow’s impossibility theorem, which deals with the aggregation of preferences.
After this short introduction, let’s get to the point. My point is that the collective evaluation of scientific research proposals often falls victim to the discursive dilemma. Let me explain how.
Imagine three scientific experts evaluating an application for research funding that has three components. (These components can be about three different aspects of the research proposal itself or about three different parts of the application, such as CV, proposal, and implementation plan.) For now, imagine that the experts only evaluate each component as of excellent quality or not (binary choice). Each expert uses majority rule to aggregate his or her own scores on the three components into an overall conclusion, and the three experts reach a final conclusion using majority rule as well.
The distribution of the evaluations of the three experts on each of the three components of the application is given in the table below. Reviewer 1 finds Parts A and C excellent but Part B poor. Reviewer 2 finds Parts B and C excellent but Part A poor. And Reviewer 3 finds Parts A and B poor and Part C excellent. Overall, Reviewers 1 and 2 reach a conclusion of ‘excellent’ for the total application, while Reviewer 3 reaches a conclusion of ‘poor’. By aggregating the conclusions by majority rule, the application should be evaluated as ‘excellent’. However, looking at each part individually, there is a majority that finds both Parts A and B ‘poor’; therefore the total evaluation should be ‘poor’ as well.
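| |Part A|Part B|Part C|Conclusion|
|---|---|---|---|---|
|Reviewer 1|excellent|poor|excellent|excellent|
|Reviewer 2|poor|excellent|excellent|excellent|
|Reviewer 3|poor|poor|excellent|poor|
|Majority decision|POOR|POOR|EXCELLENT|EXCELLENT|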
So which one is it? Is this an excellent proposal or not, according to our experts?
I do not know.
But I find it quite important to recognize that we can get completely different results from the evaluation process depending on how we aggregate the individual scores, even with exactly the same distribution of the scores and even when every expert is entirely consistent in his/her evaluation.
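To make the reviewer example concrete, here is a small sketch of both aggregation routes. The scores are the ones described above (True stands for ‘excellent’, False for ‘poor’); the function names are hypothetical, and I assume, as in the example, that majority rule is used both within each review and across reviews.

```python
# True = 'excellent', False = 'poor'; values taken from the example above.
scores = {
    "Reviewer 1": {"A": True, "B": False, "C": True},
    "Reviewer 2": {"A": False, "B": True, "C": True},
    "Reviewer 3": {"A": False, "B": False, "C": True},
}
parts = ["A", "B", "C"]


def majority(votes):
    """Return True when strictly more than half of the votes are True."""
    votes = list(votes)
    return sum(votes) > len(votes) / 2


def conclusion_based(scores):
    """Each reviewer forms an overall verdict; the verdicts are then voted on."""
    verdicts = [majority(review[p] for p in parts) for review in scores.values()]
    return majority(verdicts)


def premise_based(scores):
    """Each part is voted on across reviewers; the part verdicts are then combined."""
    part_verdicts = [majority(review[p] for review in scores.values()) for p in parts]
    return majority(part_verdicts)


def label(verdict):
    return "excellent" if verdict else "poor"


print("Conclusion-based:", label(conclusion_based(scores)))  # excellent
print("Premise-based:   ", label(premise_based(scores)))     # poor
```

In terms of the score table, the first route takes majorities along the rows and then down the final column, while the second takes majorities down the columns and then along the final row.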
But before we discuss the normative appeal of the two different aggregation options, is this a realistic problem or a convoluted scenario made up to illustrate a theoretical point but of no relevance to the practice of research evaluation?
Well, I have been involved in a fair share of research evaluations for journals, publishing houses, different national science foundations as well as for the European Research Council (ERC). Based on my personal experience, I think that quite often there is a tension between aggregating expert evaluations by conclusion and by premises. Moreover, I have not seen clear guidelines on how to proceed when the different types of aggregation lead to different conclusions. As a result, the aggregation method is selected by the implicit personal preferences of the one doing the aggregation.
Let’s go through a scenario that I am sure anyone who has been involved in some of the big ERC evaluations of individual research applications will recognize.
Two of the three reviewers find two of the three parts of the application ‘poor’, and the third reviewer finds one of the three parts poor and the other two parts ‘good’ (see the table below).
|Part A||Part B||Part C||Conclusion|
Thus a majority of the final scores (the conclusions) indicate a ‘poor’ application. However, when the reviewers need to indicate the parts of the application that are ‘poor’, they cannot find many! There is a majority for two out of the three parts that finds them ‘good’. Accordingly, by majority rule these cannot be listed as ‘weaknesses’ or given a poor score. Yet the total proposal is evaluated as ‘poor’ (i.e. unfundable).
There are three ways things go from here, based on my experience. One response is, after having seen that there is no majority evaluating many parts of the application as ‘poor’ (or as a ‘weakness’), to adjust upwards the overall scores of the application. In other words, the conclusion is brought in line with the result of the premise-based aggregation. A second response is to ask the individual reviewers to reflect back on their evaluations and reconsider whether their scores on the individual parts need be adjusted downwards (so the premises are brought in line with the result of the conclusion-based aggregation). A third response is to keep both a negative overall conclusion and a very, very short list of ‘weaknesses’ or arguments about which parts of the proposal are actually weak.
Now you know why you sometimes get evaluations saying that your project application is unfundable, but failing to point out what its problems are.
Again, I am not arguing that one of these responses or ways to solve the dilemma is always the correct one (although I do have a preference, see below). But I think (a) the problem should be recognized, and (b) there should be explicit guidelines on how to conduct the aggregation, so that there is less discretion left to those doing it.
If I had to choose, I would go for conclusion-based aggregation. Typically, my evaluation of a project is not a direct sum of the evaluations of the individual parts, and it is based on more than can be expressed with the scores on the application’s components. Also typically, having formed a conclusion about the overall merits of the proposal, I will search for good arguments to make why the proposal is poor, but also add some nice things to say to balance the wording of the evaluation. But it is the overall conclusion that matters, and the rest is discursive post hoc justification that is framed to fit the requirements of the specific context of the evaluation process.
Another argument to be made in favor of conclusion-based aggregation is the idea that reviewers represent particular ‘world-views’ or perspectives, for example, stemming from their scientific (sub)discipline. Therefore, evaluations of individual parts of a research application should not be aggregated by majority, since the evaluations are not directly comparable. If I consider that a literature review presented in a project proposal is incomplete based on my knowledge of a specific literature, this assessment should not be overruled by two assessments that the literature review is complete coming from reviewers who are experts in different literatures than I am: we could all be right in light of what we know.
In fact, the only scenario in which premise-based aggregation (with subsequent adjustment of the conclusions) makes sense to me is one where all reviewers know, on average, the same things and they provide, on average, scores without bias but with some random noise. In this case, majority aggregation of the premises filters the noise.
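The noise-filtering claim can be checked with a short calculation in the spirit of Condorcet’s jury theorem. The sketch below rests on assumptions of my own: each reviewer independently gets the binary judgment on a part right with the same probability `p`, and the number of reviewers `n` is odd. The function name `majority_accuracy` is hypothetical.

```python
from math import comb


def majority_accuracy(p: float, n: int = 3) -> float:
    """Probability that a strict majority of n independent reviewers is right,
    when each reviewer is right with probability p (n is assumed to be odd)."""
    k_min = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


for p in (0.6, 0.7, 0.8):
    print(f"single reviewer: {p:.2f}  majority of three: {majority_accuracy(p):.3f}")
# single reviewer: 0.60  majority of three: 0.648
# single reviewer: 0.70  majority of three: 0.784
# single reviewer: 0.80  majority of three: 0.896
```

The gain disappears when `p` is 0.5 and turns into a loss below that, and it relies on the errors being independent. If reviewers judge from different bodies of knowledge, as in the literature review example above, that assumption does not hold.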
But I am sure that there are more and different arguments to be made, once we realize that the discursive dilemma is a problem for research evaluations and that currently different aggregation practices are allowed to proliferate unchecked.
I suspect that many readers, even if they got this far in the text, would be unconvinced about the relevance of the problem I describe, because they think that (a) research evaluation is rarely binary but involves continuous scores, and (b) aggregation is rarely based on majority rule.
The first objection is easier to deal with: First, sometimes evaluation is binary, for example, when the evaluation committee needs to list ‘strengths’ and ‘weaknesses’. Second, even when evaluation is formally on a categorical or continuous scale, it is in practice binary because anything below the top end of the scale is ‘unfundable’. Third, the discursive dilemma is also relevant for continuous judgements.
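To illustrate the third point, here is a purely hypothetical example with scores on a 1 to 5 scale. Everything in it is an assumption made for illustration: the numbers, the funding threshold, the use of the median to aggregate across reviewers, and the ‘weakest link’ rule under which a proposal is only as good as its worst part.

```python
from statistics import median

# Hypothetical scores on a 1-5 scale (not taken from any real evaluation).
scores = {
    "Reviewer 1": {"A": 5, "B": 3, "C": 5},
    "Reviewer 2": {"A": 3, "B": 5, "C": 5},
    "Reviewer 3": {"A": 5, "B": 5, "C": 3},
}
parts = ["A", "B", "C"]
FUNDING_THRESHOLD = 4  # assumed cut-off for a fundable proposal

# Conclusion-based: each reviewer's overall score is the score of the weakest
# part in their own review; the panel score is the median of those scores.
overall_by_reviewer = [min(review.values()) for review in scores.values()]
conclusion_based = median(overall_by_reviewer)

# Premise-based: the panel first takes the median score of each part, and
# only then applies the weakest-link rule to the aggregated part scores.
part_medians = [median(review[p] for review in scores.values()) for p in parts]
premise_based = min(part_medians)

print("Conclusion-based score:", conclusion_based,
      "fundable:", conclusion_based >= FUNDING_THRESHOLD)  # 3, False
print("Premise-based score:   ", premise_based,
      "fundable:", premise_based >= FUNDING_THRESHOLD)     # 5, True
```

Every reviewer sees one weak part, but each sees a different one, so no part looks weak once the parts are aggregated separately. The same kind of gap can arise with other non-linear rules for combining part scores.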
The second objection is pertinent. It is not that majority rule is not used when aggregating individual scores: it is, sometimes formally, more often informally. But in the practice of research evaluation these days, having anything less than a perfect score means that a project is not going to be funded. So whatever the method of aggregation, any objection (low score) by any reviewer is typically sufficient to derail an application. This is likely a much bigger normative problem for research evaluation, but one that requires a separate discussion.
And since we have to spend a lot of time preparing comprehensive evaluation reports, including for projects that are not going to be funded, the discursive dilemma needs to be addressed so that the final evaluations are consistent and clear to the researchers.