What are the effects of COVID-19 on mortality? Individual-level causes of death and population-level estimates of causal impact


How many people have died from COVID-19? What is the impact of COVID-19 on mortality in a population? Can we use excess mortality to estimate the effects of COVID-19?

In this text I will explain why the answers to the first two questions need not be the same. That is, the sum of cases where COVID-19 has been determined to be the direct[1] cause of death need not equal the population-level estimate of the causal impact of COVID-19. When measurement of the individual-level causes of death is imperfect, using excess mortality (observed minus expected deaths) to measure the impact of COVID-19 leads to an underestimate of the number of individual cases where COVID-19 was the direct cause of death.


The major assumption on which the argument rests is that some of the people who died from COVID-19 would have died from other causes within a specified, relatively short time frame (say, within a month). It seems very reasonable to assume that at least some of the victims of COVID-19 would have succumbed to other causes of death. This is especially easy to imagine given that COVID-19 disproportionately kills the very old and that the ultimate causes of death it provokes – respiratory problems, lung failure, etc. – are shared with other common diseases with high mortality among the older population, such as the flu.

Defining individual and population-level causal effects

With this crucial assumption in mind, we can construct the following simple table. Cell A contains the people who would have survived if they had not caught the Coronavirus, but who caught it and died. Cell B contains the people who caught the Coronavirus and died, but would have died from other causes even if they had not caught the virus[2]. Cell C contains the people who caught the virus and survived, and who would have survived even if they had not caught it. Cell D contains the people who would have died if they had not caught the virus, but who caught it and survived. Cell C is of no interest for the current argument, and for now we can assume that cases in Cell D are implausible (although this might change if we consider indirect effects of the pandemic and the policy measures it provoked; for now, we ignore such indirect effects). Cell E contains the people who did not catch the virus and survived (also not interesting for the argument). Cell F contains the people who did not catch the virus and died from other causes. As a matter of definition, total mortality within a period is A + B + F.

| | Caught the Coronavirus and died | Caught the Coronavirus and survived | Did not catch the Coronavirus |
| --- | --- | --- | --- |
| Would have survived without the virus | A | C | E |
| Would have died even without the virus | B | D | F |

The number of individual-level deaths directly caused by COVID-19 that can be observed is the sum of cells A + B. Without further assumptions and specialized knowledge, we cannot estimate what share of this total would have died anyway. For now, just assume that this share is positive; that is, such cases exist. The population-level causal impact of COVID-19 is A, or, in words, those who have died from COVID-19 minus those who would have died from other causes within the same period. The population-level causal effect is defined counterfactually. Again, without further assumptions about the ratio of B to A, the population-level causal impact of COVID-19 is not identifiable. An important conclusion is that the population-level causal impact of COVID-19 on mortality is not necessarily equal to the sum of the individual cases where COVID-19 was the cause of death.
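The distinction between the two quantities can be made concrete with a toy potential-outcomes simulation. All probabilities below are invented purely for illustration; each simulated person has an outcome with and without the virus, and the cell they fall into follows from that pair.

```python
import random

random.seed(1)

# Toy potential-outcomes population; the infection and death
# probabilities are invented purely for illustration.
N = 100_000
cells = {c: 0 for c in "ABCDEF"}

for _ in range(N):
    caught = random.random() < 0.10          # caught the virus?
    dies_without = random.random() < 0.01    # would die from other causes anyway
    # The virus adds extra risk on top of the baseline; by construction,
    # anyone who would die without the virus also dies with it, so cell D
    # stays empty -- matching the text's assumption that D is implausible.
    dies_with = dies_without or random.random() < 0.02

    if caught and dies_with:
        cells["B" if dies_without else "A"] += 1
    elif caught:
        cells["D" if dies_without else "C"] += 1
    else:
        cells["F" if dies_without else "E"] += 1

individual_covid_deaths = cells["A"] + cells["B"]  # COVID-19 the direct cause
population_level_effect = cells["A"]               # deaths that would not have occurred

print(individual_covid_deaths, population_level_effect)
```

Whenever cell B is non-empty, the sum of individual-level COVID-19 deaths (A + B) exceeds the population-level causal effect (A), which is the whole point of the argument.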

Scenario I: perfect measures of individual-level causes of death

Assume for the moment that all individual cases where COVID-19 was the cause of death are observed and recorded. Under this assumption, what does excess mortality measure? Excess mortality is defined as the difference between the observed (O) and predicted (P) number of deaths within a period, with the prediction (expectation) coming from historical averages, statistical models or anything else[3]. Under our definitions, the observed mortality O in a period contains groups A + B + F. An accurate prediction P counts those who would have died even without the virus: B + F. So the difference between observed and predicted, O – P, gives A, the number of people who died from COVID-19 but would have survived otherwise. Therefore, excess mortality identifies the population-level causal impact of COVID-19 (see also the figure below).
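Under Scenario I the bookkeeping is simple arithmetic. Here is a minimal sketch with invented cell counts, assuming the prediction P is perfectly accurate (i.e. P = B + F):

```python
# Hypothetical cell counts for one period; all numbers are invented.
A = 300    # died from COVID-19, would have survived otherwise
B = 100    # died from COVID-19, but would have died from other causes anyway
F = 5000   # did not catch the virus, died from other causes

O = A + B + F   # observed mortality
P = B + F       # predicted (expected) mortality, assumed perfectly accurate

excess_mortality = O - P          # identifies the population-level effect A
recorded_covid_deaths = A + B     # sum of individual-level COVID-19 deaths

print(excess_mortality)           # 300, equal to A
print(recorded_covid_deaths)      # 400, exceeds the excess by B
```

The gap between the two printed numbers is exactly B, the people already counted in the prediction P.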

One implication of this line of reasoning is that, under perfect measurement of the individual-level causes of death and a positive number of people who would have died from other causes had they not died from COVID-19 (cell B), the sum of the cases where COVID-19 was recorded as a cause of death should exceed the excess in observed mortality O – P. (See the situation in France, where this might be happening.)

Scenario II: imperfect measures of individual-level causes of death

Let’s consider now a more realistic scenario, where determining and recording the individual causes of death is imperfect. Under this assumption, the observed number of deaths in a period still contains O = A + B + F. Excess mortality O – P still identifies the population-level effect A. However, this is not the number of deaths directly caused by COVID-19, which also includes those who would have died anyway (B): a category that is already included in the prediction about mortality during this period[4].

In other words, excess mortality underestimates the sum of individual cases where COVID-19 is the direct cause of death. The amount of underestimation depends on the share of people who died from COVID-19 but would have died from other causes anyway. The larger the share, the larger the underestimation. To put it bluntly, COVID-19 kills more people than excess mortality suggests. This is because the expected number of deaths, on which the calculation of excess mortality depends, includes people who would have died from other causes but were in fact killed by the virus.


These are the main conclusions from the analysis:

  1. The sum of individual-level cases where COVID-19 was the direct cause of death is not the same as the population-level causal impact of the virus.
  2. Excess mortality provides a valid estimate of the population-level causal impact.
  3. When measurement of the individual causes of death is imperfect, excess mortality provides an underestimate of the sum of individual cases where COVID-19 was the cause of death.
  4. With perfect measurement of the individual causes of death, the excess in mortality should be lower than the sum of the individual cases where COVID-19 was the cause of death.


[1] I suspect some will object that the coronavirus and COVID-19 are never the direct causes of death but only provoke other diseases that ultimately kill people. This is irrelevant for the argument: I use ‘COVID-19 as a direct cause of death’ as a shorthand for a death that was caused by COVID-19 provoking some other condition that ultimately kills.

[2] Formally, for people in cell B, COVID-19 is a sufficient but not necessary condition for dying within a certain period. For people in cell A, COVID-19 is both necessary and sufficient. Because of the counterfactual definition of the population-level effect, it only tracks cases where the cause was both necessary and sufficient.

[3] In reality, the models used to predict the expected mortality are imperfect and incorporate considerable uncertainties. These uncertainties compound the estimation problems discussed in the text, but the problems would exist even if the expected mortality were predicted perfectly.

[4] Extending the analysis to include indirect effects of COVID-19 and the policy responses it led to is interesting and important, but very challenging. There are multiple plausible mechanisms for indirect effects, some of which would act to decrease mortality (e.g. less pollution, fewer traffic accidents, fewer crime-related murders, etc.) and some of which would act to increase mortality (e.g. due to stress, not seeking medical attention on time, postponed medical operations, increases in domestic violence, self-medication gone wrong, etc.). The time horizon of the estimation becomes even more important, as some of these mechanisms need more time to exert their effects (e.g. reduced pollution). Once we admit indirect effects, the calculation of the direct population-level effect of COVID-19 from excess mortality data becomes impossible without some assumptions about the share and net effect of the indirect mechanisms, and the estimation of the sum of individual-level effects becomes even more muddled.

Explanation and the quest for ‘significant’ relationships. Part I

The ultimate goal of social science is causal explanation*. The actual goal of most academic research is to discover significant relationships between variables. The two goals are supposed to be strongly related – by discovering (the) significant effects of exogenous (independent) variables, one accounts for the outcome of interest. In fact, the working assumption of the empiricist paradigm of social science research is that the two goals are essentially the same – explanation is the sum of the significant effects that we have discovered. Just look at what all the academic articles with ‘explanation’, ‘determinants’, and ‘causes’ in their titles do – they report significant effects, or associations, between variables.

The problem is that explanation and collecting significant associations are not the same. Of course they are not. The point is obvious to all uninitiated into the quantitative empiricist tradition of doing research, but seems to be lost on many of its practitioners. We could have discovered a significant determinant of X and still be miles (or even light-years) away from a convincing explanation of why and when X occurs. This is not because of the difficulties of causal identification – we could have satisfied all conditions for causal inference from observational data, and the problem would still remain. And it would not go away after we pay attention (as we should) to the fact that statistical significance is not the same as practical significance. Even the discovery of convincingly identified causal effects, large enough to be of practical rather than only statistical significance, does not amount to explanation. A successful explanation needs to account for the variation in X, and causal associations need not – they might be significant yet not make a visible dent in the unexplained variation in X. The difference I am talking about is partly akin to the difference between looking at the significance of individual regression coefficients and looking at the model fit as a whole (more on that will follow in Part II). The current standards of social science research tend to emphasize the former rather than the latter, which allows significant relationships to be sold as explanations.
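The coefficient-versus-fit contrast is easy to demonstrate with simulated data. In the sketch below (the slope, noise level, and sample size are all chosen arbitrarily for illustration), the effect of x on y clears any conventional significance threshold, yet explains a negligible share of the variation in y.

```python
import math
import random

random.seed(42)

# Simulated data: x has a real but tiny effect on y.
n = 10_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.08 * xi + random.gauss(0, 1) for xi in x]

# Pearson correlation, computed by hand to keep the sketch dependency-free.
mx, my = sum(x) / n, sum(y) / n
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
r = cov / (sx * sy)

r_squared = r ** 2                              # share of variation "explained"
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))  # significance of the slope

print(f"R^2 = {r_squared:.4f}")   # close to zero: almost nothing is explained
print(f"t   = {t_stat:.1f}")      # well above 2: highly "significant"
```

With a large enough sample, almost any non-zero association becomes statistically significant while leaving the outcome essentially unexplained.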

The objection can be made that the discovery of causal effects is all we should aim for, and all we could hope for. Even if a causal relationship doesn’t account for a large amount of variation in the outcome of interest, it still makes a difference. After all, this is the approach taken in epidemiology, the agricultural sciences and other fields (like beer production) where the statistical research paradigm has its origins. A pill might not treat all headaches, but if it has a positive and statistically significant effect, it will still help millions. But here is the trick – the quest for statistically significant relationships in epidemiology, agriculture, etc. is valuable because all these effects can be considered as interventions – the researchers have control over the formula of the pill, the amount of pesticide, or the type of hops. In contrast, social science researchers too often seek and discover significant relationships between an outcome and variables that could not even remotely be considered interventions. So we end up with a pile of significant relationships which do not account for enough variation to count as a proper explanation, and which have no value as interventions because their manipulation is beyond our reach. To sum up, observational social science has borrowed an approach to causality that makes sense for experimental research, and applied its standards (namely, statistical significance) to a context where the discovery of significant relationships is less valuable because the ‘treatments’ cannot be manipulated. Meanwhile, what should really count – explaining when, how and why a phenomenon happens – is relegated to the background in the false belief that the quest for significant relationships is somehow a substitute. It is like trying to discover the fundamental function of the lungs with epidemiological methods, and claiming success when you prove that cold air significantly reduces lung capacity. While the inference might still be valuable, it is no substitute for the original goal.

In Part II, I will discuss what needs to be changed, and what can be changed in the current practice of empirical social science research to address the problem outlined above.

*In my understanding, all explanation is causal. Hence, ‘causal explanation’ is a tautology, and I am going to drop the ‘causal’ part for the rest of the text.