
Tag: R

Modeling mortality

To grasp the true impact of COVID-19 on our societies, we need to know the effect of the pandemic on mortality. In other words, we need to know how many deaths can be attributed to the virus, directly and indirectly. Visualizing mortality has already become a popular way to gauge the impact of the pandemic in different countries. You might have seen at least some of these graphs and websites: FT, Economist, Our World in Data, CBS, EFTA, CDC, EUROSTAT, and EUROMOMO. But estimating the impact of COVID-19 on mortality is also controversial, with people either misunderstanding or distrusting the way in which the impact is measured and assessed. That’s why I put together a step-by-step guide on how we can go about estimating the impact of COVID-19 on mortality. In the guide, I build a large number of statistical models that we can use to predict expected mortality in 2020. The complexity of the models ranges from the simplest, based only on weekly averages from past years, to what is currently the state of the art. But this is not all: I also review the predictive performance of all of these models, so that we know which ones work best. I run the models on publicly available data from the Netherlands, I use only the open-source software R, and I share the code, so anyone can check, replicate and extend the exercise. The guide is available here: http://dimiter.eu/Visualizations_files/nlmortality/Modeling-Mortality.html I hope this guide will provide some transparency about how expected mortality is, and can be, estimated…
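To give a flavour of the simplest model mentioned above, here is a minimal sketch in R of a weekly-average baseline: expected deaths per week taken as the mean of the same week in previous years, with excess deaths as the difference. The data frame and its columns are made-up placeholders, not the Dutch data used in the guide.

# Minimal sketch of a weekly-average baseline for expected mortality.
# 'deaths_weekly' and its columns are hypothetical stand-ins, with synthetic numbers.
library(dplyr)

set.seed(1)
deaths_weekly <- expand.grid(year = 2015:2020, week = 1:52)
deaths_weekly$deaths <- round(2800 + 400 * cos(2 * pi * deaths_weekly$week / 52) +
                                rnorm(nrow(deaths_weekly), sd = 80))

# Expected deaths per week: average of the same week in the pre-2020 years
baseline <- deaths_weekly %>%
  filter(year < 2020) %>%
  group_by(week) %>%
  summarise(expected = mean(deaths))

# Excess mortality in 2020: observed minus expected
excess_2020 <- deaths_weekly %>%
  filter(year == 2020) %>%
  left_join(baseline, by = "week") %>%
  mutate(excess = deaths - expected)

head(excess_2020)

The guide itself goes well beyond this baseline, but the same observed-minus-expected logic carries through the more sophisticated models.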

Government positions from party-level Manifesto data (with R)

In empirical research in political science and public policy, we often need estimates of the political positions of governments (cabinets) and the salience of different issues for different governments. Data on policy positions and issue salience are available, but typically at the level of political parties. One prominent source of data on issue salience and positions is the Manifesto Corpus, a database of the electoral manifestos of political parties. To ease the aggregation of government positions and salience from party-level Manifesto data, I developed a set of functions in R that accomplish just that, combining the Manifesto data with data on the duration and composition of governments from ParlGov. To see how the functions work, read this detailed tutorial. You can access all the functions at the dedicated GitHub repository. And you can contribute to this project by forking the code on GitHub. If you have questions or suggestions, get in touch. Enjoy!
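To illustrate the aggregation idea (not the actual functions from the repository), here is a minimal sketch of a cabinet position computed as the seat-weighted mean of its member parties' positions; the data frame, column names and numbers are hypothetical.

# Illustrative sketch: cabinet position as the seat-weighted mean of party positions.
# All names and numbers below are made up; 'rile' stands in for a Manifesto-style
# left-right score, 'seats' for the party's seat share from ParlGov-type data.
library(dplyr)

cabinet_parties <- data.frame(
  cabinet = c("Cabinet A", "Cabinet A", "Cabinet B"),
  party   = c("Party 1", "Party 2", "Party 3"),
  seats   = c(80, 40, 95),
  rile    = c(-15.2, 5.6, 12.3)
)

cabinet_positions <- cabinet_parties %>%
  group_by(cabinet) %>%
  summarise(gov_rile = weighted.mean(rile, w = seats))

cabinet_positions

The functions in the repository handle the additional bookkeeping (matching parties across datasets and tracking cabinet duration), but the weighting step is the core of the aggregation.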

Visualizing asylum statistics

Note: of potential interest to R users for the dynamic Google chart generated via googleVis in R and discussed towards the end of the post. Here you can go directly to the graph. An emergency refugee center, opened in September 2013 in an abandoned school in Sofia, Bulgaria. Photo by Alessandro Penso, Italy, OnOff Picture. First prize at World Press Photo 2013 in the category General News (Single). The tragic lives of asylum-seekers make for moving stories and powerful photos. When individual tragedies are aggregated into abstract statistics, the message gets harder to sell. Yet statistics are arguably more relevant for policy and provide a deeper understanding, if not as much empathy, as individual stories. In this post, I will offer a few graphs that present some of the major trends and patterns in the numbers of asylum applications and asylum recognition rates in Europe over the last twelve years. I focus on two issues: which European countries take the brunt of the asylum flows, and the link between the share of applications that each country receives and its asylum recognition rate. Asylum applications and recognition rates: Before delving into the details, let’s look at the big picture first. Each year between 2001 and 2012, 370,000 people on average applied for asylum protection in one of the member states of the European Union (plus Norway and Switzerland). As can be seen from Figure 1, the number fluctuates between 250,000 and 500,000 per year, and there is no clear trend. Altogether, during this 12-year period, approximately 4.5 million…
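The note mentions a dynamic Google chart built with googleVis, but the excerpt does not show the call. As a hypothetical example of how such a chart can be produced, here is a gvisMotionChart on synthetic data; the country-year numbers are invented for illustration and are not the asylum statistics discussed in the post, and motion charts relied on Flash, so they may not render in current browsers.

# Hypothetical googleVis example with synthetic data (illustration only)
library(googleVis)

asylum <- data.frame(
  country          = rep(c("Germany", "France", "Sweden"), each = 3),
  year             = rep(2010:2012, times = 3),
  applications     = c(41000, 45000, 64500, 47800, 52100, 54900, 31800, 29600, 43900),
  recognition_rate = c(0.21, 0.24, 0.28, 0.14, 0.13, 0.14, 0.30, 0.33, 0.36)
)

# One country per bubble, animated over years
chart <- gvisMotionChart(asylum, idvar = "country", timevar = "year")
plot(chart)  # opens the interactive chart in the default browser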

Predicting movie ratings with IMDb data and R

It’s Oscars season again, so why not explore how predictable (my) movie tastes are. This has literally been a million-dollar problem and obviously I am not gonna solve it here, but it’s fun and slightly educational to do some number crunching, so why not. Below, I will proceed from a simple linear regression to a generalized additive model to an ordered logistic regression analysis. And I will illustrate the results with nice plots along the way. Of course, all done in R (you can get the script here). Data: The data for this little project comes from the IMDb website and, in particular, from my personal ratings of 442 titles recorded there. IMDb keeps the movies you have rated in a nice little table which includes information on the movie title, director, duration, year of release, genre, IMDb rating, and a few other less interesting variables. Conveniently, you can export the data directly as a csv file. Outcome variable: The outcome variable that I want to predict is my personal movie rating. IMDb lets you score movies with one to ten stars. Half-points and other fractions are not allowed. It is a tricky variable to work with: it is obviously not continuous, yet ten ordered categories are a bit too many to treat as a regular categorical variable. Figure 1 plots the frequency distribution (black bars) and density (red area) of my ratings and the density of the IMDb scores (in blue) for the 442 observations…
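The three modelling steps named above (linear regression, GAM, ordered logit) can be sketched in a few lines of R. The sketch below uses a synthetic stand-in for the exported IMDb csv, with made-up column names and predictors, so it illustrates the sequence rather than reproducing the actual script.

# Sketch of the modelling sequence: lm -> gam -> polr, on synthetic data
library(mgcv)   # for gam()
library(MASS)   # for polr()

# Synthetic stand-in for the exported ratings file (illustration only)
set.seed(42)
ratings <- data.frame(
  imdb_rating = runif(200, 4, 9),
  runtime     = runif(200, 80, 180)
)
ratings$my_rating <- pmin(10, pmax(1, round(ratings$imdb_rating + rnorm(200))))

# 1. Simple linear regression
m_lm <- lm(my_rating ~ imdb_rating + runtime, data = ratings)

# 2. Generalized additive model with smooth terms
m_gam <- gam(my_rating ~ s(imdb_rating) + s(runtime), data = ratings)

# 3. Ordered logistic regression, treating the 1-10 rating as ordinal
ratings$my_rating_f <- factor(ratings$my_rating, ordered = TRUE)
m_ologit <- polr(my_rating_f ~ imdb_rating + runtime, data = ratings, Hess = TRUE)

summary(m_lm); summary(m_gam); summary(m_ologit)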

Music Network Visualization

Note: probably of interest only to the intersection of readers who are into niche music genres and those interested in network visualization. My music interests have always been rather, hmm…, eclectic. Somehow IDM, ambient, darkwave, triphop, acid jazz, bossa nova, qawali, Mali blues and other more or less obscure genres have managed to happily co-exist in my music collection. The sheer diversity always invited the question of whether there is some structure to the collection, or whether each genre is an island of its own. Sounds like a job for network visualization! Now, there are plenty of music network viz applications on the web. But they don’t show my collection, and they just seem unsatisfactory for various reasons. So I decided to craft my own visualization using R and igraph. As a first step, I collected, for each artist in my last.fm library, the artists that the site classifies as similar. So I piggyback on last.fm for the network similarity measures. I also get the most-often used tag for each artist and the number of plays it has on the site. The rest is pretty straightforward, as can be seen from the code.

# Load the igraph and foreign packages (install if needed)
require(igraph)
require(foreign)
# Load the dataset
lastfm <- read.csv("http://www.dimiter.eu/Data_files/lastfm_network_ad.csv", header = TRUE, encoding = "UTF-8")
# Index the links between artists in the library
lastfm$include <- ifelse(lastfm$Similar %in% lastfm$Artist, 1, 0)
# Import as a graph
lastfm.network <- graph.data.frame(lastfm, directed = FALSE)
# Create some attributes
last.attr <- lastfm[-which(duplicated(lastfm$Artist)), c(5, 3, 4)]
# Attach the attributes to the artists from the library (only)
V(lastfm.network)[1:106]$listeners <- last.attr[, 2]
V(lastfm.network)[107:length(V(lastfm.network))]$listeners <- NA
V(lastfm.network)[1:106]$tag <- last.attr[, 3]
V(lastfm.network)[107:length(V(lastfm.network))]$tag <- NA
# Scale label size by the artist's listener count
V(lastfm.network)$label.cex$tag <- ifelse(V(lastfm.network)$listeners > 1200000, 1.4, (ifelse(V(lastfm.network)$listeners > 500000, 1.2, (ifelse(V(lastfm.network)$listeners > 100000, 1.1,…
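The excerpt cuts off before the plotting step, so here is a minimal, self-contained sketch of how a similarity network like this can be drawn with igraph. The tiny edge list is invented for illustration and is not the last.fm dataset loaded above.

# Minimal, self-contained igraph plotting sketch (toy edge list, illustration only)
library(igraph)

edges <- data.frame(
  Artist  = c("Boards of Canada", "Boards of Canada", "Aphex Twin", "Massive Attack"),
  Similar = c("Aphex Twin", "Autechre", "Autechre", "Portishead")
)
g <- graph.data.frame(edges, directed = FALSE)

plot(g,
     vertex.size      = 5,
     vertex.label.cex = 0.8,
     edge.color       = "grey70",
     layout           = layout.fruchterman.reingold(g))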