{"id":934,"date":"2017-11-21T10:51:15","date_gmt":"2017-11-21T10:51:15","guid":{"rendered":"http:\/\/re-design.dimiter.eu\/?p=934"},"modified":"2017-11-21T10:51:15","modified_gmt":"2017-11-21T10:51:15","slug":"the-discursive-dilemma-and-research-project-evaluation","status":"publish","type":"post","link":"http:\/\/re-design.dimiter.eu\/?p=934","title":{"rendered":"The Discursive Dilemma and Research Project Evaluation"},"content":{"rendered":"<p><em><strong>tl; dr<\/strong> When we collectively evaluate research proposals,\u00a0we can reach the opposite verdict\u00a0depending on how we aggregate the individual evaluations, and that&#8217;s a problem, and nobody seems to care or provide guidance how to proceed.<\/em><\/p>\n<p>Imagine that three judges need to reach a verdict together using majority rule. To do that, the judges have to decide independently if each of two factual propositions related to the suspected crime is\u00a0true. (And they all agree that <em>if and only if<\/em> both propositions are true, the defendant is guilty).<\/p>\n<p>The distribution of the judges&#8217; beliefs is given in the table below. Judge 1 believes that both\u00a0propositions are true, and as a result, considers the conclusion (defendant is guilty) true as well. Judges 2 and 3 consider that only one of the propositions is true and, as a result, reach a conclusion\u00a0of &#8216;not guilty&#8217;. 
When the judges vote in accordance with their conclusions, a majority finds the defendant &#8216;not guilty&#8217;.<\/p>\n<p>&nbsp;<\/p>\n<table style=\"width: 100%;\">\n<tbody>\n<tr>\n<th><\/th>\n<th style=\"text-align: center;\">Proposition 1<\/th>\n<th style=\"text-align: center;\">Proposition 2<\/th>\n<th style=\"text-align: center;\">Conclusion<\/th>\n<\/tr>\n<tr>\n<td>Judge 1<\/td>\n<td style=\"text-align: center;\">true<\/td>\n<td style=\"text-align: center;\">true<\/td>\n<td style=\"text-align: center;\">true\u00a0(guilty)<\/td>\n<\/tr>\n<tr>\n<td>Judge 2<\/td>\n<td style=\"text-align: center;\">false<\/td>\n<td style=\"text-align: center;\">true<\/td>\n<td style=\"text-align: center;\">false (not guilty)<\/td>\n<\/tr>\n<tr>\n<td>Judge 3<\/td>\n<td style=\"text-align: center;\">true<\/td>\n<td style=\"text-align: center;\">false<\/td>\n<td style=\"text-align: center;\">false\u00a0(not guilty)<\/td>\n<\/tr>\n<tr>\n<td><strong>Majority decision<\/strong><\/td>\n<td style=\"text-align: center;\"><b>TRUE<\/b><\/td>\n<td style=\"text-align: center;\"><strong>TRUE<\/strong><\/td>\n<td style=\"text-align: center;\"><strong>FALSE\u00a0(not guilty)<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>However, there is a majority that finds each of the two propositions true (see the last line in the table)! Therefore, if the judges vote on each proposition separately rather than directly on the conclusion, they will have to find the defendant &#8216;guilty&#8217;. That is, <strong>the judges will reach the opposite conclusion, even though nothing changes about their beliefs, they still agree that both propositions need to be true for a verdict of &#8216;guilty&#8217;, and the decision-making rule (majority) remains the same<\/strong>. 
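<\/p>\n<p>The two aggregation procedures can be sketched in a few lines of Python. This is an illustrative sketch of the judge example above, not part of the original post, and the function and variable names are mine:<\/p>\n

```python
# The three-judge example: each judge holds beliefs about two premises,
# and the verdict is guilty only if both premises are true.

def majority(votes):
    # True iff strictly more than half of the votes are True
    return 2 * sum(votes) > len(votes)

# (proposition 1, proposition 2) for Judges 1, 2 and 3
judges = [(True, True), (False, True), (True, False)]

# Each judge derives an individual verdict from the two premises
verdicts = [p1 and p2 for (p1, p2) in judges]

# Conclusion-based aggregation: majority over the individual verdicts
conclusion_based = majority(verdicts)

# Premise-based aggregation: majority on each premise, then combine
premise_based = (majority([p1 for (p1, p2) in judges])
                 and majority([p2 for (p1, p2) in judges]))

print(conclusion_based)  # False: not guilty
print(premise_based)     # True: guilty
```

\n<p>The same computation, with three premises per reviewer and excellent\/poor in place of true\/false, reproduces the research-evaluation scenarios discussed below. 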
The only thing that differs is the method through which the individual beliefs are combined: either by aggregating the conclusions or by aggregating the premises.<\/p>\n<p>This fascinating result, in which the outcome of a collective decision-making\u00a0process changes depending on whether the decision-making procedure is\u00a0premise-based or conclusion-based,\u00a0is known as\u00a0the &#8216;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Discursive_dilemma\" target=\"_blank\"><em>discursive dilemma<\/em><\/a>&#8217; or &#8216;<em>doctrinal paradox<\/em>&#8217;. The paradox is but one manifestation\u00a0of a more general impossibility result:<\/p>\n<blockquote><p>&#8220;<em>There exists no aggregation procedure (generating complete, consistent and deductively closed collective sets of judgments) which satisfies universal domain, anonymity and systematicity<\/em>.&#8221; <a href=\"https:\/\/www.princeton.edu\/~ppettit\/papers\/Aggregating_EconomicsandPhilosophy_2002.pdf\" target=\"_blank\">(List and Pettit, 2002)<\/a>.<\/p><\/blockquote>\n<p>Christian List published\u00a0<a href=\"http:\/\/eprints.lse.ac.uk\/5829\/1\/The_discursive_dilemma_and_public_reason_(publishers_pdf)_.pdf\" target=\"_blank\">a survey<\/a> of the topic in 2006 and keeps an <a href=\"http:\/\/personal.lse.ac.uk\/list\/doctrinalparadox.htm\" target=\"_blank\">annotated bibliography<\/a>. The paradox is related to, but separate from, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Arrow%27s_impossibility_theorem\" target=\"_blank\">Arrow&#8217;s impossibility theorem<\/a>, which deals with the aggregation of <em>preferences<\/em>.<\/p>\n<p>After this short introduction, let&#8217;s get to the\u00a0point. 
My point is that <strong>the collective evaluation of scientific research proposals often falls victim to the discursive dilemma<\/strong>. Let me explain how.<\/p>\n<p>Imagine three scientific experts evaluating an application for research\u00a0funding\u00a0that has three components. (These components can be about three different aspects of the research\u00a0proposal itself or about three different parts of the application, such as CV,\u00a0proposal, and implementation plan.) For now, imagine that the experts evaluate each\u00a0component only as either excellent or not (a binary choice). Each expert uses majority rule to aggregate his or her scores on the three sections into an overall conclusion, and the three experts reach a final conclusion using majority rule as well.<\/p>\n<p>The distribution of the evaluations of the three experts on each of the three components of the application is given in the table below. Reviewer 1 finds Parts A and C excellent but Part B poor. Reviewer 2 finds Parts B and C excellent but Part A poor. And Reviewer 3 finds Parts A and B poor and Part C excellent. Overall, Reviewers 1 and 2 reach a conclusion of &#8216;excellent&#8217; for the total application, while Reviewer 3 reaches a conclusion of &#8216;poor&#8217;. By aggregating the conclusions by majority rule, the application should be evaluated as &#8216;excellent&#8217;. 
However, looking at each part individually, there is a majority that finds both Parts A and B &#8216;poor&#8217;; therefore, the total evaluation should be &#8216;poor&#8217; as well.<\/p>\n<p>&nbsp;<\/p>\n<table style=\"width: 100%;\">\n<tbody>\n<tr>\n<th><\/th>\n<th style=\"text-align: center;\">Part A<\/th>\n<th style=\"text-align: center;\">Part B<\/th>\n<th style=\"text-align: center;\">Part C<\/th>\n<th style=\"text-align: center;\">Conclusion<\/th>\n<\/tr>\n<tr>\n<td>Reviewer 1<\/td>\n<td style=\"text-align: center;\">excellent<\/td>\n<td style=\"text-align: center;\">poor<\/td>\n<td style=\"text-align: center;\">excellent<\/td>\n<td style=\"text-align: center;\">EXCELLENT<\/td>\n<\/tr>\n<tr>\n<td>Reviewer 2<\/td>\n<td style=\"text-align: center;\">poor<\/td>\n<td style=\"text-align: center;\">excellent<\/td>\n<td style=\"text-align: center;\">excellent<\/td>\n<td style=\"text-align: center;\">EXCELLENT<\/td>\n<\/tr>\n<tr>\n<td>Reviewer 3<\/td>\n<td style=\"text-align: center;\">poor<\/td>\n<td style=\"text-align: center;\">poor<\/td>\n<td style=\"text-align: center;\">excellent<\/td>\n<td style=\"text-align: center;\">POOR<\/td>\n<\/tr>\n<tr>\n<td><strong>Majority decision<\/strong><\/td>\n<td style=\"text-align: center;\"><b>POOR<\/b><\/td>\n<td style=\"text-align: center;\"><strong>POOR<\/strong><\/td>\n<td style=\"text-align: center;\"><strong>EXCELLENT<\/strong><\/td>\n<td style=\"text-align: center;\"><strong>?<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>So which one is it? 
Is this an excellent proposal or not, according to our experts?<\/p>\n<p><strong>I do not know.<\/strong><\/p>\n<p>But I find it quite important to recognize that we can get completely different results from the evaluation process depending on how we aggregate the individual scores, even with exactly the same distribution of the scores and even when every expert is entirely consistent in his\/her evaluation.<\/p>\n<p>But\u00a0before we discuss the normative appeal of the two different aggregation options, is this a realistic problem or a\u00a0convoluted scenario made up to illustrate a theoretical point but of no relevance to the practice of research evaluation?<\/p>\n<p>Well, I have been involved in a fair share of research evaluations for journals, publishing houses, and different national science foundations, as well as for the European Research Council (ERC). Based on\u00a0my personal experience, I think that quite often there is a tension between aggregating expert evaluations by conclusion and by premises. Moreover, I have not seen clear guidelines on how to proceed when the different types of aggregation lead to different\u00a0conclusions. 
As a result, the aggregation method is selected\u00a0according to the implicit personal preferences of the one doing the aggregation.<\/p>\n<p>Let&#8217;s go through a scenario that I am sure anyone who has been involved in some of the big ERC evaluations of individual research applications will recognize.<\/p>\n<p>Two of the three reviewers find two of the three parts of the application &#8216;poor&#8217;, and the third reviewer\u00a0finds one of the three parts &#8216;poor&#8217; and the other two parts &#8216;good&#8217; (see the table below).<\/p>\n<table style=\"width: 100%;\">\n<tbody>\n<tr>\n<th><\/th>\n<th style=\"text-align: center;\">Part A<\/th>\n<th style=\"text-align: center;\">Part B<\/th>\n<th style=\"text-align: center;\">Part C<\/th>\n<th style=\"text-align: center;\">Conclusion<\/th>\n<\/tr>\n<tr>\n<td>Reviewer 1<\/td>\n<td style=\"text-align: center;\">poor<\/td>\n<td style=\"text-align: center;\">poor<\/td>\n<td style=\"text-align: center;\">good<\/td>\n<td style=\"text-align: center;\">POOR<\/td>\n<\/tr>\n<tr>\n<td>Reviewer 2<\/td>\n<td style=\"text-align: center;\">good<\/td>\n<td style=\"text-align: center;\">poor<\/td>\n<td style=\"text-align: center;\">poor<\/td>\n<td style=\"text-align: center;\">POOR<\/td>\n<\/tr>\n<tr>\n<td>Reviewer 3<\/td>\n<td style=\"text-align: center;\">good<\/td>\n<td style=\"text-align: center;\">poor<\/td>\n<td style=\"text-align: center;\">good<\/td>\n<td style=\"text-align: center;\">GOOD<\/td>\n<\/tr>\n<tr>\n<td><strong>Majority decision<\/strong><\/td>\n<td style=\"text-align: center;\"><b>GOOD<\/b><\/td>\n<td style=\"text-align: center;\"><strong>POOR<\/strong><\/td>\n<td style=\"text-align: center;\"><b>GOOD<\/b><\/td>\n<td style=\"text-align: center;\"><strong>?<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Thus a majority of the final scores (the conclusions) indicates a &#8216;poor&#8217; application. However, when the reviewers need to identify the parts of the application that are &#8216;poor&#8217;, they cannot find many! 
For two out of the three parts, there is a majority that finds them &#8216;good&#8217;. Accordingly, by majority rule these cannot be listed as &#8216;weaknesses&#8217; or given a poor score. Yet the total proposal <em>is<\/em>\u00a0evaluated as &#8216;poor&#8217; (i.e. unfundable).<\/p>\n<p>Based on my experience, there are three ways things go from here. One response, after having seen that there is no majority\u00a0evaluating many parts of the application as &#8216;poor&#8217; (or as a &#8216;weakness&#8217;), is to adjust the overall scores\u00a0of the application upwards. In other words, the conclusion is brought in line with the result of the premise-based aggregation. A second response is to ask the individual reviewers to reflect back on their evaluations and reconsider whether their scores on the individual parts need to be adjusted downwards (so the premises are brought in line with the result of the conclusion-based aggregation). A third response is to keep both a negative overall conclusion <em>and<\/em>\u00a0a very, very short list of &#8216;weaknesses&#8217; or\u00a0arguments about which parts of the proposal are actually weak.<\/p>\n<p><strong>Now you know why you sometimes\u00a0get evaluations saying that your project application is\u00a0unfundable, but failing to point out what its\u00a0problems are.<\/strong><\/p>\n<p>Again, I am not arguing that one of these responses or ways to solve the dilemma is always the correct one (although I do have a preference; see below). But I think (a) the problem should be recognized, and (b) there should be explicit guidelines on how to conduct the aggregation, so that there is less discretion left to those doing it.<\/p>\n<p>If I had to choose, <strong>I would go for conclusion-based aggregation<\/strong>. 
Typically, my evaluation of a project is not a direct\u00a0sum of the evaluations of the individual parts, and it is based on more than can be expressed with the scores on the application&#8217;s\u00a0components. Also typically, having formed a conclusion about the overall merits of the proposal, I will search for good arguments for why the proposal is poor, but also add some nice things to say to balance the wording of the evaluation. But it is the overall conclusion that matters, and the rest is\u00a0discursive post hoc justification that is framed to fit the requirements of the specific context of the evaluation process.<\/p>\n<p>Another argument to be made in favor of conclusion-based aggregation\u00a0is the idea that reviewers represent particular &#8216;world-views&#8217; or perspectives, for example, stemming from their scientific (sub)discipline. Therefore, evaluations of individual parts of a research application should not\u00a0be aggregated by\u00a0majority, since the evaluations are not directly comparable. If I consider that a literature review presented in a project proposal is incomplete based on my knowledge of a specific literature, this assessment should not be overruled by two assessments\u00a0that the literature review is complete coming from reviewers who are experts in literatures different from mine: we could all be right in light of what we know.<\/p>\n<p>In fact, the only scenario\u00a0in which premise-based aggregation (with subsequent adjustment of the conclusions) makes sense to me is one where all reviewers know, on average, the same things and provide, on average, scores without bias but with some random noise. 
In this case, majority aggregation of the premises filters out the noise.<\/p>\n<p>But I am sure that there are more and different arguments to be made, once we realize that the discursive dilemma is a problem for research evaluations and that currently different aggregation practices are allowed to proliferate unchecked.<\/p>\n<p>I suspect that many readers, even if they got this far in the text, would be unconvinced of the relevance of the problem I describe, because they think that (a) research evaluation is rarely binary but involves continuous scores, and (b) aggregation is rarely based on majority rule.<\/p>\n<p>The first objection is easier to deal with. First, sometimes evaluation <em>is<\/em> binary, for example, when the evaluation committee needs to list &#8216;strengths&#8217; and &#8216;weaknesses&#8217;. Second, even when evaluation is formally on a categorical or continuous scale, it is in practice binary because anything below the top end of the scale is &#8216;unfundable&#8217;. Third, the discursive dilemma\u00a0is also relevant for <a href=\"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00355-009-0428-y.pdf\" target=\"_blank\">continuous judgements<\/a>.<\/p>\n<p>The second objection is pertinent. It is not that majority rule is not used when aggregating individual scores: it is, sometimes formally, more often informally. But in the practice of research evaluation these days, having anything less than a perfect score means that a project is not going to be funded. So whatever the method of aggregation, any objection (a low score) by any reviewer is typically sufficient to derail an application. 
This is\u00a0likely a much bigger normative problem for research evaluation, but one that requires a separate discussion.<\/p>\n<p>And since we have to spend a lot of time preparing comprehensive evaluation reports, also of projects that are not going to be funded, <strong>the discursive dilemma needs to be addressed so that the final evaluations are consistent and clear to the researchers.<\/strong><\/p>\n","protected":false},"excerpt":{"rendered":"<p>tl;dr When we collectively evaluate research proposals,\u00a0we can reach the opposite verdict\u00a0depending on how we aggregate the individual evaluations. That&#8217;s a problem, and nobody seems to care or provide guidance on how to proceed. Imagine that three judges need to reach a verdict together using majority rule. To do that, the judges have to decide independently whether each of two factual propositions related to the suspected crime is\u00a0true. (And they all agree that if and only if both propositions are true, the defendant is guilty.) The distribution of the judges&#8217; beliefs is given in the table below. Judge 1 believes that both\u00a0propositions are true and, as a result, considers the conclusion (defendant is guilty) true as well. Judges 2 and 3 each consider only one of the propositions true and, as a result, reach a conclusion\u00a0of &#8216;not guilty&#8217;. When the judges vote in accordance with their conclusions, a majority finds the defendant &#8216;not guilty&#8217;. &nbsp; Proposition 1 Proposition 2 Conclusion Judge 1 true true true\u00a0(guilty) Judge 2 false true false (not guilty) Judge 3 true false false\u00a0(not guilty) Majority decision TRUE TRUE FALSE\u00a0(not guilty) However, there is a majority that finds each of the two propositions true (see the last line in the table)! Therefore, if the judges vote on each proposition separately rather than directly on the conclusion, they will have to find the defendant &#8216;guilty&#8217;. 
That is, the judges will reach the opposite conclusion, even though nothing changes about their beliefs, they still agree that both&#8230;<\/p>\n<div class=\"more-link-wrapper\"><a class=\"more-link\" href=\"http:\/\/re-design.dimiter.eu\/?p=934\">Continue reading<span class=\"screen-reader-text\">The Discursive Dilemma and Research Project Evaluation<\/span><\/a><\/div>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","jetpack_publicize_message":"The Discursive Dilemma and Research Project Evaluation","jetpack_is_tweetstorm":false},"categories":[32,44],"tags":[728,729,732,731,730],"jetpack_featured_media_url":"","jetpack_publicize_connections":[],"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p7g3hj-f4","jetpack-related-posts":[{"id":436,"url":"http:\/\/re-design.dimiter.eu\/?p=436","url_meta":{"origin":934,"position":0},"title":"Models in Political Science","date":"April 9, 2012","format":false,"excerpt":"Inside Higher Ed has a good interview with David Primo and Kevin Clarke on their new book A Model Discipline: Political Science and the Logic of Representations.\u00a0 The book and the interview criticize the hypothetico-deductive tradition in social science: The actual research was prompted by a student who asked, \"Why\u2026","rel":"","context":"In &quot;Observational studies&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":180,"url":"http:\/\/re-design.dimiter.eu\/?p=180","url_meta":{"origin":934,"position":1},"title":"Predicting the votes of judges","date":"November 29, 2011","format":false,"excerpt":"Here is a (short) and interesting paper that uses an innovative approach to predict the votes of the US Supreme Court: Successful attempts to predict judges' votes shed light into how legal decisions are made and, ultimately, into the behavior and evolution of the judiciary. 
Here, we investigate to what\u2026","rel":"","context":"In &quot;Bayesian methods&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":489,"url":"http:\/\/re-design.dimiter.eu\/?p=489","url_meta":{"origin":934,"position":2},"title":"Tit-for-tat no more: new insights into the origin and evolution of cooperation","date":"June 26, 2012","format":false,"excerpt":"The Prisoner's Dilemma (PD)\u00a0is\u00a0the paradigmatic\u00a0scientific model to understand human cooperation. You would think that after\u00a0several decennia of analyzing this\u00a0deceivingly simple game, nothing new can be learned. Not quite. This new paper discovers a whole new class of strategies that provide a unilateral advantage to the players using them in playing\u2026","rel":"","context":"In &quot;Game theory&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":354,"url":"http:\/\/re-design.dimiter.eu\/?p=354","url_meta":{"origin":934,"position":3},"title":"Compiling government positions from the Manifesto Project data with R","date":"March 12, 2012","format":false,"excerpt":"****N.B. I have updated the functions in February 2019 to makes use of the latest Manifesto data. 
See for details here.*** The Manifesto Project (former Manifesto Research Group, Comparative Manifestos Project) has assembled a database of 'quantitative content analyses of parties\u2019 election programs from more than 50 countries covering all\u2026","rel":"","context":"In &quot;Policy making&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":324,"url":"http:\/\/re-design.dimiter.eu\/?p=324","url_meta":{"origin":934,"position":4},"title":"Google tries to find the funniest videos","date":"February 15, 2012","format":false,"excerpt":"Following my recent post on the project which tries to explain why some video clips go viral, here is a report on Google's efforts to find the funniest videos: You\u2019d think the reasons for something being funny were beyond the reach of science \u2013 but Google\u2019s brain-box researchers have managed\u2026","rel":"","context":"In &quot;Advertising research&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":724,"url":"http:\/\/re-design.dimiter.eu\/?p=724","url_meta":{"origin":934,"position":5},"title":"The failure of political science","date":"March 25, 2013","format":false,"excerpt":"Last week the American Senate\u00a0supported with a clear bi-partisan majority a decision to stop funding for political science research from the National Science Foundation. Of all disciplines, only political science has been singled out for the cuts and the money will go for cancer research instead. 
The decision is obviously\u2026","rel":"","context":"In &quot;Science politicisation&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=\/wp\/v2\/posts\/934"}],"collection":[{"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=934"}],"version-history":[{"count":11,"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=\/wp\/v2\/posts\/934\/revisions"}],"predecessor-version":[{"id":945,"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=\/wp\/v2\/posts\/934\/revisions\/945"}],"wp:attachment":[{"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=934"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=934"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=934"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}