The Discursive Dilemma and Research Project Evaluation

tl;dr: When we collectively evaluate research proposals, we can reach opposite verdicts depending on how we aggregate the individual evaluations. That’s a problem, and nobody seems to care or to provide guidance on how to proceed.

Imagine that three judges need to reach a verdict together using majority rule. To do that, the judges have to decide independently whether each of two factual propositions related to the suspected crime is true. (And they all agree that the defendant is guilty if and only if both propositions are true.)

The distribution of the judges’ beliefs is given in the table below. Judge 1 believes that both propositions are true, and as a result, considers the conclusion (defendant is guilty) true as well. Judges 2 and 3 consider that only one of the propositions is true and, as a result, reach a conclusion of ‘not guilty’. When the judges vote in accordance with their conclusions, a majority finds the defendant ‘not guilty’.


                   Proposition 1   Proposition 2   Conclusion
Judge 1            true            true            true (guilty)
Judge 2            false           true            false (not guilty)
Judge 3            true            false           false (not guilty)
Majority decision  TRUE            TRUE            FALSE (not guilty)

However, there is a majority that finds each of the two propositions true (see the last line in the table)! Therefore, if the judges vote on each proposition separately rather than directly on the conclusion, they will have to find the defendant ‘guilty’. That is, the judges will reach the opposite conclusion, even though nothing changes about their beliefs, they still agree that both propositions need to be true for a verdict of ‘guilty’, and the decision-making rule (majority) remains the same. The only thing that differs is the method through which the individual beliefs are combined: either by aggregating the conclusions or by aggregating the premises.
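To make the two procedures concrete, here is a minimal sketch in Python (the coding of the table and the helper function are my own, purely for illustration) that aggregates the same beliefs premise by premise and conclusion by conclusion:

```python
# A minimal sketch (my own coding of the table above): True marks a proposition
# a judge believes, and the agreed doctrine is: guilty iff both propositions hold.
judges = {
    "Judge 1": {"p1": True,  "p2": True},
    "Judge 2": {"p1": False, "p2": True},
    "Judge 3": {"p1": True,  "p2": False},
}

def majority(votes):
    """True if a strict majority of the boolean votes is True."""
    return sum(votes) > len(votes) / 2

# Conclusion-based: each judge applies the doctrine first, then the individual
# verdicts are aggregated by majority.
verdicts = [j["p1"] and j["p2"] for j in judges.values()]
conclusion_based = majority(verdicts)

# Premise-based: each proposition is put to a majority vote first, then the
# doctrine is applied to the collective premises.
premise_based = majority([j["p1"] for j in judges.values()]) and \
                majority([j["p2"] for j in judges.values()])

print("Conclusion-based:", "guilty" if conclusion_based else "not guilty")  # not guilty
print("Premise-based:   ", "guilty" if premise_based else "not guilty")     # guilty
```

The two procedures disagree even though every individual set of beliefs is perfectly consistent.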

This fascinating result, in which the outcome of a collective decision-making process changes depending on whether the decision-making procedure is premise-based or conclusion-based, is known as the ‘discursive dilemma’ or ‘doctrinal paradox’. The paradox is but one manifestation of a more general impossibility result:

“There exists no aggregation procedure (generating complete, consistent and deductively closed collective sets of judgments) which satisfies universal domain, anonymity and systematicity.” (List and Pettit, 2002).

Christian List published a survey of the topic in 2006 and keeps an annotated bibliography. The paradox is related to, but distinct from, Arrow’s impossibility theorem, which deals with the aggregation of preferences.

After this short introduction, let’s get to the point. My point is that the collective evaluation of scientific research proposals often falls victim to the discursive dilemma. Let me explain how.

Imagine three scientific experts evaluating an application for research funding that has three components. (These components can be three different aspects of the research proposal itself or three different parts of the application, such as the CV, the proposal, and the implementation plan.) For now, imagine that the experts evaluate each component only as excellent or not (a binary choice). Each expert aggregates his or her scores on the three components into an overall conclusion by majority rule, and the three experts’ conclusions are then combined using majority rule as well.

The distribution of the evaluations of the three experts on each of the three components of the application is given in the table below. Reviewer 1 finds Parts A and C excellent but Part B poor. Reviewer 2 finds Parts B and C excellent but Part A poor. And Reviewer 3 finds Parts A and B poor and Part C excellent. Overall, Reviewers 1 and 2 reach a conclusion of ‘excellent’ for the total application, while Reviewer 3 reaches a conclusion of ‘poor’. Aggregating the conclusions by majority rule, the application should be evaluated as ‘excellent’. However, looking at each part individually, there is a majority that finds both Part A and Part B ‘poor’, so the total evaluation should be ‘poor’ as well.


                   Part A     Part B     Part C     Conclusion
Reviewer 1         excellent  poor       excellent  EXCELLENT
Reviewer 2         poor       excellent  excellent  EXCELLENT
Reviewer 3         poor       poor       excellent  POOR
Majority decision  POOR       POOR       EXCELLENT  ?
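The same kind of sketch applied to this table (again, the coding is mine; ‘excellent’ is True and ‘poor’ is False) prints the two contradictory answers:

```python
# The reviewer table above, with 'excellent' coded as True and 'poor' as False.
reviews = {
    "Reviewer 1": {"A": True,  "B": False, "C": True},
    "Reviewer 2": {"A": False, "B": True,  "C": True},
    "Reviewer 3": {"A": False, "B": False, "C": True},
}

def majority(votes):
    return sum(votes) > len(votes) / 2

# Conclusion-based: each reviewer's own majority over the parts, then a majority of reviewers.
conclusions = [majority(list(r.values())) for r in reviews.values()]
print(majority(conclusions))  # True  -> 'excellent'

# Premise-based: a majority per part, then the reviewers' own rule (majority of parts).
per_part = [majority([r[part] for r in reviews.values()]) for part in ("A", "B", "C")]
print(majority(per_part))     # False -> 'poor'
```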

So which one is it? Is this an excellent proposal or not, according to our experts?

I do not know.

But I find it quite important to recognize that we can get completely different results from the evaluation process depending on how we aggregate the individual scores, even with exactly the same distribution of the scores and even when every expert is entirely consistent in his/her evaluation.

But before we discuss the normative appeal of the two different aggregation options, is this a realistic problem or a convoluted scenario made up to illustrate a theoretical point but of no relevance to the practice of research evaluation?

Well, I have been involved in a fair share of research evaluations for journals, publishing houses, and different national science foundations, as well as for the European Research Council (ERC). Based on my personal experience, I think that quite often there is a tension between aggregating expert evaluations by conclusion and by premises. Moreover, I have not seen clear guidelines on how to proceed when the different types of aggregation lead to different conclusions. As a result, the aggregation method is selected according to the implicit personal preferences of whoever is doing the aggregation.

Let’s go through a scenario that I am sure anyone who has been involved in some of the big ERC evaluations of individual research applications will recognize.

Two of the three reviewers find two of the three parts of the application ‘poor’, and the third reviewer finds one of the three parts poor and the other two parts ‘good’ (see the table below).

                   Part A   Part B   Part C   Conclusion
Reviewer 1         poor     poor     good     POOR
Reviewer 2         good     poor     poor     POOR
Reviewer 3         good     poor     good     GOOD
Majority decision  GOOD     POOR     GOOD     ?

Thus a majority of the final scores (the conclusions) indicates a ‘poor’ application. However, when the reviewers need to point out the parts of the application that are ‘poor’, they cannot find many! For two out of the three parts, there is a majority that finds them ‘good’. Accordingly, by majority rule these cannot be listed as ‘weaknesses’ or given a poor score. Yet the total proposal is evaluated as ‘poor’ (i.e. unfundable).

There are three ways things go from here, based on my experience. One response, after having seen that there is no majority evaluating many parts of the application as ‘poor’ (or as a ‘weakness’), is to adjust the overall scores of the application upwards. In other words, the conclusion is brought in line with the result of the premise-based aggregation. A second response is to ask the individual reviewers to reflect on their evaluations and reconsider whether their scores on the individual parts need to be adjusted downwards (so the premises are brought in line with the result of the conclusion-based aggregation). A third response is to keep both a negative overall conclusion and a very, very short list of ‘weaknesses’ or arguments about which parts of the proposal are actually weak.

Now you know why you sometimes get evaluations that say your project application is unfundable but fail to point out what its problems are.

Again, I am not arguing that one of these responses or ways to resolve the dilemma is always the correct one (although I do have a preference, see below). But I think that (a) the problem should be recognized, and (b) there should be explicit guidelines on how to conduct the aggregation, so that less discretion is left to those doing it.

If I had to choose, I would go for conclusion-based aggregation. Typically, my evaluation of a project is not a direct sum of the evaluations of its individual parts, and it is based on more than can be expressed with the scores on the application’s components. Also typically, having formed a conclusion about the overall merits of the proposal, I will search for good arguments for why the proposal is poor, but also add some nice things to say to balance the wording of the evaluation. But it is the overall conclusion that matters, and the rest is discursive post hoc justification framed to fit the requirements of the specific evaluation process.

Another argument to be made in favor of conclusion-based aggregation is the idea that reviewers represent particular ‘world-views’ or perspectives, for example, stemming from their scientific (sub)discipline. Therefore, evaluations of individual parts of a research application should not be aggregated by majority, since the evaluations are not directly comparable. If I consider that a literature review presented in a project proposal is incomplete based on my knowledge of a specific literature, this assessment should not be overruled by two assessments that the literature review is complete coming from reviewers who are experts in different literatures than I am: we could all be right in light of what we know.

In fact, the only scenario in which premise-based aggregation (with subsequent adjustment of the conclusions) makes sense to me is one where all reviewers know, on average, the same things and provide, on average, unbiased scores with some random noise. In this case, majority aggregation of the premises filters out the noise.
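A quick way to check that intuition is a toy simulation under exactly those assumptions (a common true quality for each part, unbiased reviewers, independent noise). Everything below, from the function name to the noise level, is invented for illustration:

```python
import random

random.seed(1)

def simulate(n_trials=10_000, n_parts=3, n_reviewers=3, noise=0.3):
    """Toy simulation: unbiased reviewers who misjudge each part with probability `noise`."""
    premise_correct = conclusion_correct = 0
    for _ in range(n_trials):
        truth = [random.random() < 0.5 for _ in range(n_parts)]     # true quality of each part
        true_overall = sum(truth) > n_parts / 2                     # the 'correct' overall verdict
        scores = [[t if random.random() > noise else not t for t in truth]
                  for _ in range(n_reviewers)]                      # noisy individual scores

        # Premise-based: majority per part, then majority of parts.
        per_part = [sum(r[p] for r in scores) > n_reviewers / 2 for p in range(n_parts)]
        premise_correct += (sum(per_part) > n_parts / 2) == true_overall

        # Conclusion-based: each reviewer's majority of parts, then majority of reviewers.
        verdicts = [sum(r) > n_parts / 2 for r in scores]
        conclusion_correct += (sum(verdicts) > n_reviewers / 2) == true_overall

    return premise_correct / n_trials, conclusion_correct / n_trials

# In this setup the premise-based verdict matches the 'true' overall verdict
# at least as often as the conclusion-based one.
print(simulate())
```

Drop the assumption of a shared, unbiased view of the proposal, and this argument for premise-based aggregation evaporates.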

But I am sure that there are more and different arguments to be made, once we realize that the discursive dilemma is a problem for research evaluations and that currently different aggregation practices are allowed to proliferate unchecked.

I suspect that many readers, even if they got this far in the text, will remain unconvinced about the relevance of the problem I describe, because they think that (a) research evaluation is rarely binary but involves continuous scores, and (b) aggregation is rarely based on majority rule.

The first objection is easier to deal with: First, sometimes evaluation is binary, for example, when the evaluation committee needs to list ‘strengths’ and ‘weaknesses’. Second, even when evaluation is formally on a categorical or continuous scale, it is in practice binary because anything below the top end of the scale is ‘unfundable’. Third, the discursive dilemma is also relevant for continuous judgements.
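On the third point, here is a continuous-score version of the same divergence (all numbers are invented; the only assumption is that some threshold, here an average of 4.0 out of 5, separates ‘fundable’ from ‘unfundable’ at some stage of the process):

```python
# Invented continuous scores on a 1-5 scale; assume a part or a proposal counts
# as 'strong' only if its (average) score is at least 4.0.  Same structure as
# the binary tables above, just with numbers.
THRESHOLD = 4.0
scores = {
    "Reviewer 1": [5, 3, 5],   # parts A, B, C
    "Reviewer 2": [3, 5, 5],
    "Reviewer 3": [3, 3, 5],
}

def mean(xs):
    return sum(xs) / len(xs)

# Conclusion-based: each reviewer's own average is thresholded into a verdict,
# and the proposal is fundable if a majority of verdicts are positive.
verdicts = [mean(s) >= THRESHOLD for s in scores.values()]
print(sum(verdicts) > len(verdicts) / 2)          # True  -> fundable

# Premise-based: average each part across reviewers first, threshold each part,
# then require a majority of strong parts.
strong_parts = [mean(part) >= THRESHOLD for part in zip(*scores.values())]
print(sum(strong_parts) > len(strong_parts) / 2)  # False -> not fundable
```

The divergence comes entirely from the threshold: pure averaging cannot produce it, but as soon as scores are cut into ‘fundable’ and ‘unfundable’ at some intermediate stage, the order of aggregation matters again.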

The second objection is pertinent. It is not that majority rule is not used when aggregating individual scores: it is, sometimes formally, more often informally. But in the practice of research evaluation these days, having anything less than a perfect score means that a project is not going to be funded. So whatever the method of aggregation, any objection (low score) by any reviewer is typically sufficient to derail an application. This is likely a much bigger normative problem for research evaluation, but one that requires a separate discussion.

And since we have to spend a lot of time preparing comprehensive evaluation reports, including for projects that are not going to be funded, the discursive dilemma needs to be addressed so that the final evaluations are consistent and clear to the researchers.

Why political scientists should continue to (fail to) predict elections

The results of the British elections last week have already claimed the heads of three party leaders. But together with Labour, the Liberal Democrats and UKIP, there was another group that lost big time in these elections: pollsters and electoral prognosticators. Not only were the polls and predictions way off the mark in terms of the actual vote shares and seats received by the different parties; crucially, their major expectation of a hung parliament did not materialize, as the Conservatives cruised to a small but comfortable majority of the seats. Even more remarkably, all polls and predictions were wrong, and they were all wrong in pretty much the same way. Not pretty.

This calls for reflection upon the exploding number of electoral forecasting models that sprang up in the build-up to the 2015 national elections in the UK. Many of these models were offered by political scientists and promoted by academic institutions (for example, here, here, and here). At some point, it became passé to be a major political science institution in the country and not have an electoral forecast. The field became so crowded that the elections were branded ‘a nerd feast’ and the competition of predictions ‘the battle of the nerds’. The feast is over and everyone lost. It is the time of the scavengers.

The massive failure of British polls and predictions has already led to a frenzy of often vicious attacks on the pollsters and prognosticators coming from politicians, journalists and pundits, in the UK and beyond. A formal inquiry has been launched. The unmistakable smell of schadenfreude is hanging in the air. Most disturbingly, some respected political scientists have voiced a hope that the failure puts a stop to the game of predicting voting results altogether and dismissed electoral predictions as unscientific.


This is wrong. Political scientists should continue to build predictive models of elections. This work has scientific merit, and it has public value. Moreover, political scientists have a mission to participate in the game of electoral forecasting. Their mission is to emphasize the large uncertainties surrounding all kinds of electoral predictions. They should not be in the game in order to win, but to correct others’ all-too-eager attempts to mislead the public with predictions offered with a false sense of precision and certainty.

The rising number of electoral forecasts made by political scientists has more than a little to do with a certain jealousy of Nate Silver – the American forecaster who gained international fame and recognition with his successful predictions of the US presidential elections. (By the way, this time round, Nate Silver got it just as wrong as the others.) For once, there was something sexy about political science work, but the irony was that political scientists were not part of it. And if Nate, who is not a professional political scientist, can do it, so can we – academic experts with life-long experience in the study of voting and elections and hard-earned mastery of sophisticated statistical techniques. So academia was drawn into the forecasting game.

And that’s fine. Political scientists should be in the business of electoral forecasting because this business is important and because it is here to stay. News outlets have an insatiable appetite for election stories as voting day draws near, and the release of polls and forecasts provides a good excuse to indulge in punditry and sometimes even meaningful discussion. So predictions will continue to be offered and if political scientists move away somebody else will take their place. And the newcomers cannot be trusted to have the public interest at heart.

Election forecasts are important because they feed into the electoral campaign and into the strategic calculations of political parties and of individual voters. Voting is rarely a naïve expression of political preferences. Especially in a highly non-proportional electoral system such as the UK’s, voters and parties have a strong incentive to behave strategically in view of the information that polls and forecasts provide. (By the way, ironically, the one prognosis that political scientists got relatively right – the exit poll – is the one that probably matters least, as it only spares us the impatience of waiting a few more hours for the official results.)

Hence, political scientists, as servants of the public interest, have a mission to offer impartial and professional electoral forecasts based on state-of-the-art methodology and deep substantive knowledge. They must also discuss, correct and, when appropriate, trash the forecasts offered by others.

And they have one major point to make – all predictions carry a much larger degree of uncertainty than prognosticators want (us) to believe. It is a simple point that experience has proven right time and again. But it is one that still needs to be pounded over and over, as pollsters, forecasters and the media get carried away all too easily.

It is in this sense that the commentators are right: predictions, if not properly bracketed by valid estimates of uncertainty, are unscientific and pure charlatanry. And it is in this sense that most forecasts offered by political scientists at the latest British elections were a failure. They did not properly gauge the uncertainty of their estimates and, as a result, misled the public. That they didn’t predict the result is less damaging than the fact that they pretended they could.

Since the bulk of the data doing the heavy lifting in most electoral prediction models is poll data, the failure of prediction can be traced to a failure of polling. But pollsters cannot be blamed for the fact that prognosticators did not adjust the uncertainty estimates of their predictions. The tight sampling margins of error reported by pollsters might be appropriate for characterizing the uncertainty of polling estimates (under certain assumptions) of public preferences at a point in time, but they are invariably too low when it comes to making predictions from these estimates. Predictions have other important sources of uncertainty in addition to sampling error, and by not taking these into account prognosticators are fooling themselves and others. Another point forecasters should have known: combining different polls reduces sampling margins of error, but if all the polls are biased (as they proved to be in the British case), the predictions can still be seriously off the mark.
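A back-of-the-envelope illustration of that last point (all numbers are invented; a shared two-point bias stands in for whatever systematically skewed the samples):

```python
import random
import statistics

random.seed(7)

TRUE_SHARE = 0.37   # the 'true' vote share (invented)
BIAS = -0.02        # a common bias shared by all polls (invented)
N = 1000            # respondents per poll

def one_poll():
    """One poll: N Bernoulli draws around the true share, shifted by the common bias."""
    p = TRUE_SHARE + BIAS
    return sum(random.random() < p for _ in range(N)) / N

polls = [one_poll() for _ in range(20)]
pooled = statistics.mean(polls)

# The spread of the individual polls reflects sampling error (roughly 1.5 points here);
print(round(statistics.stdev(polls), 4))
# pooling 20 polls cuts that error by roughly sqrt(20), but the pooled estimate is
# still centred on TRUE_SHARE + BIAS, not on TRUE_SHARE.
print(round(pooled, 4), "vs true share", TRUE_SHARE)
```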

Offering predictions with wide margins of uncertainty is not sexy. Correcting others for the illusory precision of their forecasts is tedious and risks being viewed as pedantic. But this is the role political scientists need to play in the game of electoral forecasting, and being tedious, pedantic and decidedly unsexy is the price they have to pay.

Constructivism in the world of Dragons

Here is an analysis of Game of Thrones from a realist international relations perspective. Inevitably, here is the response from a constructivist angle. These are supposed to be fun, so I approached them with a light heart and popcorn. But halfway through the second article I actually felt sick to my stomach. I am not exaggerating, and it wasn’t the popcorn – seeing the same ‘arguments’ between realists and constructivists rehearsed in this new setting, the same lame responses to the same lame points, the same ‘debate’ where nobody ever changes their mind, the same dreaded confluence of normative, theoretical, and empirical notions that plagues this never-ending exchange in the real (sorry, socially constructed) world – all that really gave me physical pain. I felt entrapped – even in this fantasy world there was no escape from the Realist and the Constructivist. The Seven Kingdoms were infected by the triviality of our IR theories. The magic of their world was desecrated. Forever…

Nothing wrong with the particular analyses. But precisely because they manage to be good examples of the genres they imitate, the bad taste in my mouth felt so real. So is it about interests or norms? Oh no. Is it realpolitik or the slow construction of a common moral order? Do leaders disregard the common folk at their own peril? Oh, please stop. How do norms construct identities? Noooo moooore. Send the Dragons!!!

By the way, here is just one example of how George R.R. Martin can explain a difficult political idea better than an entire conference of realists and constructivists. Why do powerful people keep their promises? Is it ’cause their norms make them do it, or because it is in their interests, or whatever? Why do the Lannisters always pay their debts even though they appear to be some of the meanest, most self-centered characters in the entire world of Game of Thrones? We literally see the answer when Tyrion Lannister tries to escape from the sky cells: the Lannisters’ reputation for paying their debts is the only thing that saves him, the only thing he has left to pay Mord, but it is enough (see episode 1.6). Having a reputation for paying your debts is one of the greatest assets you can have in every world. And it is worth all the pennies you pay to preserve it, even when you could actually get away with not honoring your commitments. It could not matter less whether you call this an interest-based or a norm-based explanation: it just clicks. But it takes creativity and insight to convey the point, not impotent meta-theoretical disputes.

The failure of political science

Last week the American Senate supported, with a clear bipartisan majority, a decision to stop funding for political science research from the National Science Foundation. Of all disciplines, only political science was singled out for the cuts, and the money will go to cancer research instead.

The decision is obviously wrong for so many reasons but my point is different. How could political scientists who are supposed to understand better than anyone else how politics works allow this to happen? What does it tell us about the state of the discipline that the academic experts in political analysis cannot prevent overt political action that hurts them directly and rather severely?

To me, this failure of American political scientists to protect their own turf in the political game is scandalous. It is as bad as Nobel-winning economists Robert Merton and Myron Scholes leading the hedge fund ‘Long-Term Capital Management’ to bust and losing 4.6 billion dollars with the help of their Nobel-winning economic theories. As Merton and Scholes’ hedge fund story reveals the true real-world value of (much of) financial economics theory, so does the humiliation of political science by Congress reveal the true real-world value of (much of) political science theory.

Think about it – the world-leading academic specialists on collective action, interest representation and mobilization could not get themselves mobilized, organized and represented in Washington to protect their funding. The professors of the political process and legislative institutions could not find a way to work these same institutions to their own advantage. The experts on political preferences and incentives did not see the broad bipartisan coalition against political science forming. That’s embarrassing.

It is even more embarrassing because American political science is the most productive, innovative, and competitive in the world. There is no doubt that almost all of the best new ideas, methods, and theories in political science over the last 50 years have come from the US. (And a lot of these innovations have been made possible by the funding received from the National Science Foundation.) So it is not that individual American political scientists are not smart – of course they are – but for some reason, as a collective body, they have not been able to benefit from their own knowledge and insights. Or perhaps that knowledge and those insights about US politics are deficient in important ways. The fact remains: political scientists were beaten at what should have been their own game. Hopefully some kind of lesson will emerge from all that…

P.S. No reason for public administration, sociology and other related disciplines to be smug about pol sci’s humiliation – they have been saved (for now) mostly by their own irrelevance. 

The education revolution at our doorstep

University education is on the brink of radical transformation. The revolution is already happening, and the Khan Academy, Udacity, Coursera and the Marginal Revolution University are just the harbingers of a change that will soon sweep over universities throughout the world.

Alex Tabarrok has a must-read piece on the coming revolution in education here. The entire piece is highly recommended, so I am not gonna even try to summarize it here, but this part stands out:

Teaching today is like a stage play. A play can be seen by at most a few hundred people at a single sitting and it takes as much labor to produce the 100th viewing as it does to produce the first. As a result, plays are expensive. Online education makes teaching more like a movie. Movies can be seen by millions and the cost per viewer declines with more viewers. Now consider quality. The average movie actor is a better actor than the average stage actor.

As a result, Tabarrok predicts that the market for teachers will become a winner-take-all market with very big payments at the top: the best teachers will be followed by millions and paid accordingly.

My prediction is that the revolution in education will also lead to greater specialization – maybe you can’t be the best Development Economics teacher, but you can be the best teacher of Nineteenth-Century Agricultural Development in South-East Denmark: the economies of scale brought by online education can make such uber-specialization of teaching portfolios profitable (or, indeed, necessary).

Surprisingly or not, it is American entrepreneurs and institutions who are leading this revolution. In Europe, online education is still relegated to pre-master programs and the like, and is too often a thoughtless extrapolation of traditional education practices to the online environment. Sooner rather than later, the revolution will be at our doorstep. We had better start preparing.

[P.S. The Guardian also ran a recent piece on the topic.]

Science is like sex…

‘Science is like sex – it might have practical consequences but that’s not why you do it!’

This seems to be a modified version of a quote by the physicist Richard Feynman, which I heard last week at a meeting organized by the Dutch Organization for Scientific Research (the major research funding agency in the Netherlands). It kind of sums up the attitude of natural scientists to the increasing pressure all researchers face to justify their grant applications in terms of the possible practical use (utilization, or valorization) of their research results. Which is totally fine by me. I perfectly understand that it is impossible to anticipate all the possible future practical consequences of fundamental research. On the other hand, I see no harm in forcing researchers to, at the very least, think about the possible real-world applications of their work. The current equilibrium, in which reflection on possible practical applications is required but ‘utilization’ is neither necessary nor sufficient for getting a grant, seems like a good compromise.
Of course, I come from a field (public administration) where demonstrating the scientific contribution is usually more difficult than showing the practical applicability of the results, so my view might be biased. I am not even sure what fundamental research in the social sciences looks like. Even rather esoteric work on non-cooperative game theory was directly spurred by practical concerns related to the Cold War (and sponsored by the RAND Corporation) and has rather directly led to the design of real-world social institutions (like the networks for kidney exchange) that won Al Roth his recent Nobel prize.