
Category: Data visualization

Predicting movie ratings with IMDb data and R

It’s Oscars season again so why not explore how predictable (my) movie tastes are. This has literally been a million dollar problem and obviously I am not gonna solve it here, but it’s fun and slightly educational to do some number crunching, so why not. Below, I will proceed from a simple linear regression to a generalized additive model to an ordered logistic regression analysis. And I will illustrate the results with nice plots along the way. Of course, all done in R (you can get the script here).

Data

The data for this little project comes from the IMDb website and, in particular, from my personal ratings of 442 titles recorded there. IMDb keeps the movies you have rated in a nice little table which includes information on the movie title, director, duration, year of release, genre, IMDb rating, and a few other less interesting variables. Conveniently, you can export the data directly as a csv file.

Outcome variable

The outcome variable that I want to predict is my personal movie rating. IMDb lets you score movies with one to ten stars. Half-points and other fractions are not allowed. It is a tricky variable to work with. It is obviously not a continuous one; at the same time ten ordered categories are a bit too many to treat as a regular categorical variable. Figure 1 plots the frequency distribution (black bars) and density (red area) of my ratings and the density of the IMDb scores (in blue) for the 442 observations…
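To make the three modelling steps named in this excerpt concrete, here is a minimal sketch on simulated data. It is not the author's script or IMDb export: the data frame `movies`, its columns `imdb` and `rating`, and the numbers used to simulate them are all assumptions made for illustration. It relies on the `mgcv` and `MASS` packages that ship with R.

```r
# Sketch on SIMULATED data (hypothetical; not the author's IMDb export or script)
library(mgcv)  # gam()
library(MASS)  # polr()

set.seed(42)
n <- 442                                        # number of titles mentioned in the post
imdb <- round(runif(n, 5, 9), 1)                # made-up IMDb scores
latent <- -6 + 1.6 * imdb + rnorm(n, sd = 1.2)  # made-up latent taste
rating <- pmin(10, pmax(1, round(latent)))      # whole stars between 1 and 10
movies <- data.frame(imdb = imdb, rating = rating)

# 1. Simple linear regression: treats the rating as continuous
m.lm <- lm(rating ~ imdb, data = movies)

# 2. Generalized additive model: lets the effect of the IMDb score be a smooth curve
m.gam <- gam(rating ~ s(imdb), data = movies)

# 3. Ordered logistic regression: treats the rating as ordered categories
movies$rating.ord <- factor(movies$rating, ordered = TRUE)
m.polr <- polr(rating.ord ~ imdb, data = movies, Hess = TRUE)

# Predicted probability of each star category for every title
probs <- predict(m.polr, type = "probs")
```

Each row of `probs` sums to one across the star categories, which is the main practical difference from the first two models: they return a single expected rating per title, not a distribution over ratings.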

The evolution of EU legislation (graphed with ggplot2 and R)

During the last half century the European Union has adopted more than 100 000 pieces of legislation. In this presentation I look into the patterns of legislative adoption over time. I tried to create clear and engaging graphs that provide some insight into the evolution of law-making activity: not an easy task given the byzantine nature of policy making in the EU and the complex nomenclature of types of legal acts possible. The main plot showing the number of adopted directives, regulations and decisions since 1967 is pasted below. There is much more in the presentation. The time series data is available here, as well as the R script used to generate the plots (using ggplot2). Some of the graphs are also available as interactive visualizations via ManyEyes here, here, and here (requires Java). Enjoy.

New data source for political science researchers

Political Data Yearbook Interactive is a new source for data on election results, turnout and government composition for all EU and some non-European countries. It is basically an online version of the yearbooks that the ECPR has printed for many years as part of the European Journal of Political Research. The interactive online tool has some (limited) visualization options and can export data in several formats.

Music Network Visualization

Note: probably of interest only to the intersection of the readers who are into niche music genres and those interested in network visualization.

My music interests have always been rather, hmm…, eclectic. Somehow IDM, ambient, darkwave, trip hop, acid jazz, bossa nova, qawwali, Mali blues and other more or less obscure genres have managed to happily co-exist in my music collection. The sheer diversity always invited the question whether there is some structure to the collection, or whether each genre is an island of its own. Sounds like a job for network visualization!

Now, there are plenty of music network viz applications on the web. But they don’t show my collection, and just seem unsatisfactory for various reasons. So I decided to craft my own visualization using R and igraph. As a first step I collected, for all artists in my last.fm library, the artists that the site classifies as similar. So I piggyback on last.fm for the network similarity measures. I also get info on the most-often used tag for the artist and the number of plays it has on the site. The rest is pretty straightforward as can be seen from the code.
    # Load the igraph and foreign packages (install if needed)
    require(igraph)
    require(foreign)
    lastfm<-read.csv("http://www.dimiter.eu/Data_files/lastfm_network_ad.csv", header=T, encoding="UTF-8") #Load the dataset
    lastfm$include<-ifelse(lastfm$Similar %in% lastfm$Artist==T,1,0) #Index the links between artists in the library
    lastfm.network<-graph.data.frame(lastfm, directed=F) #Import as a graph
    last.attr<-lastfm[-which(duplicated(lastfm$Artist)),c(5,3,4) ] #Create some attributes
    V(lastfm.network)[1:106]$listeners<-last.attr[,2]
    V(lastfm.network)[107:length(V(lastfm.network))]$listeners<-NA
    V(lastfm.network)[1:106]$tag<-last.attr[,3]
    V(lastfm.network)[107:length(V(lastfm.network))]$tag<-NA #Attach the attributes to the artist from the library (only)
    # Scale the label size by the number of listeners
    V(lastfm.network)$label.cex<-ifelse(V(lastfm.network)$listeners>1200000, 1.4,
      (ifelse(V(lastfm.network)$listeners>500000, 1.2,
        (ifelse(V(lastfm.network)$listeners>100000, 1.1,…
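The code above attaches the attributes by position (vertices 1 to 106), which relies on the library artists coming first in the vertex list. A possible alternative is to look the attributes up by vertex name. The sketch below uses a toy data frame with made-up artists, listener counts and tags (none of it comes from the last.fm export), and `graph_from_data_frame()`, the current igraph name for `graph.data.frame()`.

```r
library(igraph)

# Toy data: hypothetical artists and numbers, NOT the author's last.fm export
lastfm.toy <- data.frame(
  Artist    = c("A", "A", "B", "B", "C"),
  Similar   = c("B", "X", "C", "Y", "A"),
  listeners = c(900, 900, 400, 400, 150),
  tag       = c("ambient", "ambient", "idm", "idm", "trip-hop"),
  stringsAsFactors = FALSE
)

# Flag the links whose target is also an artist in the library
lastfm.toy$include <- as.integer(lastfm.toy$Similar %in% lastfm.toy$Artist)

# The first two columns define the edges; 'include' becomes an edge attribute
g <- graph_from_data_frame(lastfm.toy[, c("Artist", "Similar", "include")],
                           directed = FALSE)

# One row of attributes per library artist
attrs <- unique(lastfm.toy[, c("Artist", "listeners", "tag")])

# Look up each vertex by name; artists outside the library get NA
idx <- match(V(g)$name, attrs$Artist)
V(g)$listeners <- attrs$listeners[idx]
V(g)$tag <- attrs$tag[idx]
```

With the lookup done by name, the result does not depend on the order in which igraph happens to create the vertices.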

Network visualization in R with the igraph package

In this post I showed a visualization of the organizational network of my department. Since several people asked for details on how the plot was produced, I will provide the code and some extensions below. The plot was done entirely in R (2.14.01) with the help of the igraph package. It is a great package but I found the documentation somewhat difficult to use, so hopefully this post can be a helpful introduction to network visualization with R. Here we go:

    # Load the igraph package (install if needed)
    require(igraph)

    # Data format. The data is in 'edges' format meaning that each row records a relationship (edge) between two people (vertices).
    # Additional attributes can be included. Here is an example:
    #   Supervisor  Examiner  Grade  Spec(ialization)
    #   AA          BD        6      X
    #   BD          CA        8      Y
    #   AA          DE        7      Y
    #   …           …         …      …
    # In this anonymized example, we have data on co-supervision with additional information about grades and specialization.
    # It is also possible to have the data in a matrix form (see the igraph documentation for details)

    # Load the data. The data needs to be loaded as a table first:
    bsk<-read.table("http://www.dimiter.eu/Data_files/edgesdata3.txt", sep='\t', dec=',', header=T) #specify the path, separator (tab, comma, …), decimal point symbol, etc.

    # Transform the table into the required graph format:
    bsk.network<-graph.data.frame(bsk, directed=F) #the 'directed' attribute specifies whether the edges are directed
    # or equivalent irrespective of the position (1st vs 2nd column). For directed graphs use 'directed=T'

    # Inspect the data:…
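For readers who want to try the 'edges' format without downloading the file, the three example rows from the comments above can be typed in directly. This is only a sketch and not the continuation of the author's script: the object names `bsk.toy` and `g` are made up here, and `graph_from_data_frame()` is the current igraph name for `graph.data.frame()`.

```r
library(igraph)

# The three example rows from the data-format comment, typed in by hand
bsk.toy <- data.frame(
  Supervisor = c("AA", "BD", "AA"),
  Examiner   = c("BD", "CA", "DE"),
  Grade      = c(6, 8, 7),
  Spec       = c("X", "Y", "Y"),
  stringsAsFactors = FALSE
)

# The first two columns become the edges; the remaining columns become edge attributes
g <- graph_from_data_frame(bsk.toy, directed = FALSE)

V(g)$name    # the people (vertices)
E(g)$Grade   # the grade attached to each co-supervision (edge)
degree(g)    # number of edges per person
```

Because `Grade` and `Spec` travel with the edges, they can later be mapped to edge width or colour when plotting.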

Scatterplots vs. regression tables (Economics professors edition)

I have always considered scatterplots to be the best available device to show relationships between variables. But it must be even better to have the regression table and a full description of the results in addition, right? Not so fast: A new paper shows that professional economists make largely correct inferences about data when looking at a scatterplot, but get confused when they are shown the details of the regressions next to the scatterplot, and totally mess it up when they are shown only the numbers without the plot! Wow! If you needed any more persuasion that graphing your data and your results is more important than those regression tables with zillions of numbers, now you have it.

P.S. The authors of this research could have done a better job themselves in communicating their findings visually…

[via Felix Salmon]

The illusion of predictability: How regression statistics mislead experts
Emre Soyer & Robin M. Hogarth

Abstract
Does the manner in which results are presented in empirical studies affect perceptions of the predictability of the outcomes? Noting the predominant role of linear regression analysis in empirical economics, we asked 257 academic economists to make probabilistic inferences given different presentations of the outputs of this statistical tool. Questions concerned the distribution of the dependent variable conditional on known values of the independent variable. Answers based on the presentation mode that is standard in the literature led to an illusion of predictability; outcomes were perceived to be more predictable than could be justified by the model. In particular, many respondents failed to take…
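The gap between a well-estimated coefficient and a predictable outcome can be seen in a small simulation. The sketch below is only an illustration of that general point, not a reconstruction of the paper's survey or data: the sample size, the slope of 0.3 and the noise level are assumptions picked for the example.

```r
# SIMULATED data (an assumption for illustration, not the paper's materials)
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)   # weak true effect, a lot of unexplained noise

m <- lm(y ~ x)
slope.p <- summary(m)$coefficients["x", "Pr(>|t|)"]  # p-value of the slope
r2 <- summary(m)$r.squared                           # share of variance explained

new <- data.frame(x = 1)
conf.int <- predict(m, new, interval = "confidence")  # uncertainty about the MEAN of y at x = 1
pred.int <- predict(m, new, interval = "prediction")  # uncertainty about a SINGLE new y at x = 1

# Plug-in normal approximation of the chance that one outcome at x = 1 is positive
p.positive <- pnorm(0, mean = conf.int[1, "fit"], sd = sigma(m), lower.tail = FALSE)
```

In a setup like this the regression table looks convincing because the slope is estimated precisely, yet `pred.int` is much wider than `conf.int` and `p.positive` stays well short of certainty. A scatterplot of `x` against `y` would show that spread at a glance.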