Skip to content

Tag: prediction

Modeling mortality

To grasp the true impact of COVID-19 on our societies, we need to know the effect of the pandemic on mortality. In other words, we need to know how many deaths can be attributed to the virus, directly and indirectly. It is already popular to visualize mortality in order to gauge the impact of the pandemic in different countries. You might have seen at least some of these graphs and websites: FT, Economist, Our World in Data, CBS, EFTA, CDC, EUROSTAT, and EUROMOMO. But estimating the impact of COVID-19 on mortality is also controversial, with people either misunderstanding or distrusting the way in which the impact is measured and assessed. That’s why, I put together a step-by-step guide about how we can go about estimating the impact of COVID-19 on mortality. In the guide, I build a large number of statistical models that we can use to predict expected mortality in 2020. The complexity of the models ranges from the simplest, based only on weekly averages from past years, to what is currently the state of the art. But this is not all. What I also do is review the predictive performance of all of these models, so that we know which ones work best. I run the models on publicly available data from the Netherlands, I use only the open software R, and I share the code, so anyone can check, replicate and extend the exercise. The guide is available here: http://dimiter.eu/Visualizations_files/nlmortality/Modeling-Mortality.html I hope this guide will provide some transparency about how expected mortality is and can be estimated…

Predicting movie ratings with IMDb data and R

It’s Oscars season again so why not explore how predictable (my) movie tastes are. This has literally been a million dollar problem and obviously I am not gonna solve it here, but it’s fun and slightly educational to do some number crunching, so why not. Below, I will proceed from a simple linear regression to a generalized additive model to an ordered logistic regression analysis. And I will illustrate the results with nice plots along the way. Of course, all done in R (you can get the script here). Data The data for this little project comes from the IMDb website and, in particular, from my personal ratings of 442 titles recorded there. IMDb keeps the movies you have rated in a nice little table which includes information on the movie title, director, duration, year of release, genre, IMDb rating, and a few other less interesting variables. Conveniently, you can export the data directly as a csv file. Outcome variable The outcome variable that I want to predict is my personal movie rating. IMDb lets you score movies with one to ten stars. Half-points and other fractions are not allowed. It is a tricky variable to work with. It is obviously not a continuous one; at the same time ten ordered categories are a bit too many to treat as a regular categorical variable. Figure 1 plots the frequency distribution (black bars) and density (red area) of my ratings and the density of the IMDb scores (in blue) for the 442 observations…