{"id":767,"date":"2014-03-02T12:35:03","date_gmt":"2014-03-02T12:35:03","guid":{"rendered":"http:\/\/rulesofreason.wordpress.com\/?p=767"},"modified":"2014-03-02T12:35:03","modified_gmt":"2014-03-02T12:35:03","slug":"predicting-movie-ratings-with-imdb-data-and-r","status":"publish","type":"post","link":"http:\/\/re-design.dimiter.eu\/?p=767","title":{"rendered":"Predicting movie ratings with IMDb data and R"},"content":{"rendered":"<p>It&#8217;s <a href=\"http:\/\/www.imdb.com\/oscars\/nominations\/\" target=\"_blank\">Oscars<\/a> season again so why not explore how predictable (my) movie tastes are. This has literally been a million dollar <a href=\"http:\/\/en.wikipedia.org\/wiki\/Netflix_Prize\" target=\"_blank\">problem<\/a>\u00a0and obviously I am not gonna solve it here, but it&#8217;s fun and slightly educational to do some number crunching, so why not. Below, I will proceed from a simple linear regression to a generalized additive model to an ordered logistic regression analysis. And I will illustrate the results with nice plots along the way. Of course, all done in <code>R<\/code> (you can get the script <a href=\"http:\/\/www.dimiter.eu\/Visualizations_files\/imdb\/imdb_analysis.r\" target=\"_blank\">here<\/a>).<\/p>\n<p><strong>Data<br \/>\n<\/strong>The data for this little project comes from the <a href=\"http:\/\/www.imdb.com\/\" target=\"_blank\">IMDb website<\/a> and, in particular, from my personal <a href=\"http:\/\/www.imdb.com\/user\/ur49179813\/ratings\" target=\"_blank\">ratings<\/a> of 442 titles recorded there. IMDb keeps the movies you have rated in a nice little table which includes information on the movie title, director, duration, year of release, genre, IMDb rating, and a few other less interesting variables. Conveniently, you can export the data directly as a <code>csv<\/code> file.<\/p>\n<p><strong>Outcome variable<br \/>\n<\/strong>The outcome variable that I want to predict is my personal movie rating. 
IMDb lets you score movies with one to ten stars. Half-points and other fractions are not allowed. It is a tricky variable to work with. It is obviously not a continuous one; at the same time ten ordered categories are a bit too many to treat as a regular categorical variable. Figure 1 plots the frequency distribution (black bars) and\u00a0<span style=\"font-style:inherit;line-height:1.625;\">density (red area) of my ratings and the density of the IMDb scores (in blue) for the 442 observations in the data.<\/span><\/p>\n<p><a href=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure1.png\"><img data-attachment-id=\"772\" data-permalink=\"http:\/\/re-design.dimiter.eu\/?attachment_id=772\" data-orig-file=\"https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure1-e1455367651264.png?fit=511%2C511\" data-orig-size=\"511,511\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"icon\" data-image-description=\"\" data-medium-file=\"https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure1-e1455367651264.png?fit=300%2C300\" data-large-file=\"https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure1-e1455367651264.png?fit=511%2C511\" loading=\"lazy\" class=\"alignnone size-full wp-image-772\" alt=\"figure1\" src=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure1.png?resize=584%2C345\" width=\"584\" height=\"345\" data-recalc-dims=\"1\" \/><\/a><\/p>\n<p>The mean of my ratings is a good 0.9 points lower than the IMDb scores, which are also 
less dispersed and have a higher peak (can you say &#8216;kurtosis&#8217;).<\/p>\n<p><strong>Data-generating process<br \/>\n<\/strong>Some reflection on how the data is generated can highlight its potential shortcomings. First, life is short and I try not to waste my time watching bad movies. Second, even if I get fooled into starting a bad movie, usually I would not bother rating it on IMDb. There are occasional two- and three-star scores, but these are usually movies that were terrible\u00a0<strong>and<\/strong>\u00a0annoyed me for one reason or another (like, for <a href=\"http:\/\/www.imdb.com\/title\/tt0456149\/\" target=\"_blank\">example<\/a>, getting a Cannes award or <a href=\"http:\/\/www.imdb.com\/title\/tt0107048\/\" target=\"_blank\">featuring<\/a> Bill Murray). The data-generating process leads to a <strong>selection bias<\/strong> with two important implications. First, the effective range of variation of both the outcome and the main predictor variables is restricted, giving the models less information to work with. Second, because movies with a decent IMDb rating that I disliked have a lower chance of being recorded in the dataset, the relationship we find in the sample will overestimate the real link between my ratings and the IMDb ones.<\/p>\n<p><strong>Take one: linear regression<br \/>\n<\/strong>Enough preliminaries, let&#8217;s get down to business. An ordinary linear regression model is a common starting point for analysis and its results can serve as a baseline. Here are the estimates that <code>lm<\/code> provides for regressing my ratings on IMDb scores:<\/p>\n<pre>summary(lm(mine~imdb, data=d))\n\nCoefficients:\n            Estimate Std. 
Error t value Pr(&gt;|t|)    \n(Intercept)  -0.6387     0.6669  -0.958    0.339    \nimdb          0.9686     0.0884  10.957   ***\n---\nResidual standard error: 1.254 on 420 degrees of freedom\nMultiple R-squared: 0.2223,\tAdjusted R-squared: 0.2205<\/pre>\n<p>The negative intercept, together with a slope just below one, indicates that on average my ratings are more than half a point lower than the IMDb scores. The coefficient of the IMDb score is positive and very close to one, which implies that a one-point higher (lower) IMDb rating would predict, on average, a roughly one-point higher (lower) personal rating. Figure 2 plots the relationship between the two variables (for an interactive version of the scatter plot, click <a href=\"http:\/\/www.dimiter.eu\/Visualizations_files\/imdb\/imdb.html\" target=\"_blank\">here<\/a>):<\/p>\n<p><a href=\"https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure21.png\"><img data-attachment-id=\"779\" data-permalink=\"http:\/\/re-design.dimiter.eu\/?attachment_id=779\" data-orig-file=\"https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure21.png?fit=1197%2C708\" data-orig-size=\"1197,708\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"figure2\" data-image-description=\"\" data-medium-file=\"https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure21.png?fit=300%2C177\" data-large-file=\"https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure21.png?fit=1024%2C606\" loading=\"lazy\" class=\"alignnone size-full wp-image-779\" alt=\"figure2\" 
src=\"https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure21.png?resize=584%2C345\" width=\"584\" height=\"345\" srcset=\"https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure21.png?w=1197 1197w, https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure21.png?resize=300%2C177 300w, https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure21.png?resize=768%2C454 768w, https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure21.png?resize=1024%2C606 1024w\" sizes=\"(max-width: 584px) 100vw, 584px\" data-recalc-dims=\"1\" \/><\/a><\/p>\n<p>The solid black line is the regression fit; the blue one shows a non-parametric loess smooth, which suggests some non-linearity in the relationship that we will explore later.<\/p>\n<p>Although the IMDb score coefficient is highly statistically significant, that should not fool us into thinking that we have gained much predictive capacity. The model fit is rather poor. The root mean squared error is 1.25, which is large given the variation in the data. But the inadequate fit is most clearly visible if we plot the actual data versus the predictions. Figure 3 below does just that. The grey bars show the prediction plus\/minus two <a href=\"http:\/\/www.stat.cmu.edu\/~cshalizi\/uADA\/13\/lectures\/ch09.pdf\" target=\"_blank\">predictive standard errors<\/a>. If the predictions derived from the model were good, the dots (observations) would be very close to the diagonal (indicated by the dotted line). In this case, they are not. 
The model does a particularly bad job in predicting very low and very high ratings.<\/p>\n<p><a href=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure3.png\"><img data-attachment-id=\"775\" data-permalink=\"http:\/\/re-design.dimiter.eu\/?attachment_id=775\" data-orig-file=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure3.png?fit=1197%2C850\" data-orig-size=\"1197,850\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"figure3\" data-image-description=\"\" data-medium-file=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure3.png?fit=300%2C213\" data-large-file=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure3.png?fit=1024%2C727\" loading=\"lazy\" class=\"alignnone size-full wp-image-775\" alt=\"figure3\" src=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure3.png?resize=584%2C414\" width=\"584\" height=\"414\" srcset=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure3.png?w=1197 1197w, https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure3.png?resize=300%2C213 300w, https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure3.png?resize=768%2C545 768w, https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure3.png?resize=1024%2C727 1024w\" sizes=\"(max-width: 584px) 100vw, 584px\" data-recalc-dims=\"1\" \/><\/a><\/p>\n<p><span style=\"font-style:inherit;line-height:1.625;\">We can also see how little 
information IMDb scores contain about (my) personal scores by going back to the raw data. Figure 4 plots the density of my ratings for two sets of values of IMDb scores &#8211; from 6.5 to 7.5 (blue) and from 7.5 to 8.5 (red). The means for the two sets differ somewhat, but the overlap in the density is great.<\/span><\/p>\n<p><a href=\"https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure4.png\"><img data-attachment-id=\"776\" data-permalink=\"http:\/\/re-design.dimiter.eu\/?attachment_id=776\" data-orig-file=\"https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure4.png?fit=1197%2C708\" data-orig-size=\"1197,708\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"figure4\" data-image-description=\"\" data-medium-file=\"https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure4.png?fit=300%2C177\" data-large-file=\"https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure4.png?fit=1024%2C606\" loading=\"lazy\" class=\"alignnone size-full wp-image-776\" alt=\"figure4\" src=\"https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure4.png?resize=584%2C345\" width=\"584\" height=\"345\" srcset=\"https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure4.png?w=1197 1197w, https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure4.png?resize=300%2C177 300w, https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure4.png?resize=768%2C454 768w, 
https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure4.png?resize=1024%2C606 1024w\" sizes=\"(max-width: 584px) 100vw, 584px\" data-recalc-dims=\"1\" \/><\/a><\/p>\n<p><span style=\"font-style:inherit;line-height:1.625;\">In sum, knowing the IMDb rating provides some information but on its own doesn&#8217;t get us very far in predicting what my score would be.<\/span><\/p>\n<p><strong>Take two: adding predictors<\/strong><br \/>\n<span style=\"font-style:inherit;line-height:1.625;\">Let&#8217;s add more variables to see if things improve. Some playing around shows that among the available candidates only the year of release of the movie and dummies for a few genres and directors (selected only from those with more than four movies in the data) give any leverage.<\/span><\/p>\n<pre> summary(lm(mine~imdb+d$comedy +d$romance+d$mystery+d$\"Stanley Kubrick\"+d$\"Lars Von Trier\"+d$\"Darren Aronofsky\"+year.c, data=d))\n\nCoefficients:\n                      Estimate Std. Error t value Pr(&gt;|t|)    \n(Intercept)           1.074930   0.651223   1.651  .  \nimdb                  0.727829   0.087238   8.343  ***\nd$comedy             -0.598040   0.133533  -4.479  ***\nd$romance            -0.411929   0.141274  -2.916  ** \nd$mystery             0.315991   0.185906   1.700  .  \nd$\"Stanley Kubrick\"   1.066991   0.450826   2.367  *  \nd$\"Lars Von Trier\"    2.117281   0.582790   3.633  ***\nd$\"Darren Aronofsky\"  1.357664   0.584179   2.324  *  \nyear.c                0.016578   0.003693   4.488  ***\n---\nResidual standard error: 1.156 on 413 degrees of freedom\nMultiple R-squared: 0.3508,\tAdjusted R-squared: 0.3382<\/pre>\n<p>The fit improves somewhat. The root mean squared error of this model is 1.14. 
Moreover, looking again at the actual versus predicted ratings, the fit is better, especially for highly rated movies &#8211; no surprise given that the director dummies pick these up.<\/p>\n<p><a href=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure5.png\"><img data-attachment-id=\"778\" data-permalink=\"http:\/\/re-design.dimiter.eu\/?attachment_id=778\" data-orig-file=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure5.png?fit=1197%2C850\" data-orig-size=\"1197,850\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"figure5\" data-image-description=\"\" data-medium-file=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure5.png?fit=300%2C213\" data-large-file=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure5.png?fit=1024%2C727\" loading=\"lazy\" class=\"alignnone size-full wp-image-778\" alt=\"figure5\" src=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure5.png?resize=584%2C414\" width=\"584\" height=\"414\" srcset=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure5.png?w=1197 1197w, https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure5.png?resize=300%2C213 300w, https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure5.png?resize=768%2C545 768w, https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/02\/figure5.png?resize=1024%2C727 1024w\" sizes=\"(max-width: 584px) 100vw, 584px\" data-recalc-dims=\"1\" 
\/><\/a><\/p>\n<p>The last variable in the regression above is the year of release of the movie. It is coded as the difference from 2014, so the positive coefficient implies that older movies get higher ratings. The statistically significant effect, however, has no straightforward predictive interpretation. The reason is again <strong>selection bias<\/strong>. Among movies released before the 1990s, I have watched only those that have withstood the test of time. So even though in the <em>sample<\/em> older films have higher scores, it is highly unlikely that if I pick a <em>random<\/em> film made in the 1970s I would like it more than a random film made after 2010. In any case, Figure 6 below plots the year of release versus the residuals from the regression of my ratings on IMDb scores (for the subset of films after 1960). We can see that the relationship is likely nonlinear (and that I really dislike comedies from the 1980s).<\/p>\n<p><a href=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure6.png\"><img data-attachment-id=\"788\" data-permalink=\"http:\/\/re-design.dimiter.eu\/?attachment_id=788\" data-orig-file=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure6.png?fit=1197%2C708\" data-orig-size=\"1197,708\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"figure6\" data-image-description=\"\" data-medium-file=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure6.png?fit=300%2C177\" 
data-large-file=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure6.png?fit=1024%2C606\" loading=\"lazy\" class=\"alignnone size-full wp-image-788\" alt=\"figure6\" src=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure6.png?resize=584%2C345\" width=\"584\" height=\"345\" srcset=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure6.png?w=1197 1197w, https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure6.png?resize=300%2C177 300w, https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure6.png?resize=768%2C454 768w, https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure6.png?resize=1024%2C606 1024w\" sizes=\"(max-width: 584px) 100vw, 584px\" data-recalc-dims=\"1\" \/><\/a><\/p>\n<p>So far both regressions assumed that the relationship between the predictors and the outcome is linear. Needless to say, there is no compelling reason why this should be the case. Maybe our predictions will improve if we allow the relationships to take any form. This calls for a generalized additive model.<\/p>\n<p><strong>Take three: generalized additive model (GAM)<br \/>\n<\/strong>In <code>R<\/code>, we can use the <code>mgcv<\/code> package to fit a GAM. It doesn&#8217;t make sense to hypothesize non-linear effects for binary variables, so we only smooth the effects of IMDb rating and year of release. But why stop there? Perhaps the non-linear effects of IMDb rating and release year are not independent, so why not allow them to interact!<\/p>\n<pre>library(mgcv)\nsummary(gam(mine ~ te(imdb,year.c)+d$\"comedy \" +d$\"romance \"+d$\"mystery \"+d$\"Stanley Kubrick\"+d$\"Lars Von Trier\"+d$\"Darren Aronofsky\", data = d)) \n\nParametric coefficients:\n                     Estimate Std. 
Error t value Pr(&gt;|t|)    \n(Intercept)           6.80394    0.07541  90.225   ***\nd$\"comedy \"          -0.60742    0.13254  -4.583   ***\nd$\"romance \"         -0.43808    0.14133  -3.100   ** \nd$\"mystery \"          0.32299    0.18331   1.762   .  \nd$\"Stanley Kubrick\"   0.83139    0.45208   1.839   .  \nd$\"Lars Von Trier\"    2.00522    0.57873   3.465   ***\nd$\"Darren Aronofsky\"  1.26903    0.57525   2.206   *  \n---\nApproximate significance of smooth terms:\n                  edf Ref.df     F p-value    \nte(imdb,year.c) 10.85  13.42 11.09<\/pre>\n<p>Well, the root mean squared error drops to 1.11 and the jointly smoothed (with a full tensor product smooth) variables are significant, but the added predictive value is minimal in this case. Nevertheless, the plot below shows the smoothed terms are more appropriate than the linear ones, and that there is a complex interaction between the two:<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure71.png\"><img data-attachment-id=\"791\" data-permalink=\"http:\/\/re-design.dimiter.eu\/?attachment_id=791\" data-orig-file=\"https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure71.png?fit=1197%2C708\" data-orig-size=\"1197,708\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"figure7\" data-image-description=\"\" data-medium-file=\"https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure71.png?fit=300%2C177\" 
data-large-file=\"https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure71.png?fit=1024%2C606\" loading=\"lazy\" class=\"alignnone size-full wp-image-791\" alt=\"figure7\" src=\"https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure71.png?resize=584%2C345\" width=\"584\" height=\"345\" srcset=\"https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure71.png?w=1197 1197w, https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure71.png?resize=300%2C177 300w, https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure71.png?resize=768%2C454 768w, https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure71.png?resize=1024%2C606 1024w\" sizes=\"(max-width: 584px) 100vw, 584px\" data-recalc-dims=\"1\" \/><\/a><\/p>\n<p><strong>Take four: models for categorical data<br \/>\n<\/strong>So far we treated personal movie ratings as if they were a continuous variable, but they are not &#8211; taking into account that they are essentially an ordered categorical variable might help. 
But ten categories, while possible to model, would make the analysis rather unwieldy, so we recode the personal ratings into five categories without much loss of information: 5 or less, 6, 7, 8, and 9 or more.<\/p>\n<p>We can first see a nonparametric conditional density plot of the newly created categorical variable as a function of IMDb scores:<br \/>\n<a href=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure8.png\"><img data-attachment-id=\"792\" data-permalink=\"http:\/\/re-design.dimiter.eu\/?attachment_id=792\" data-orig-file=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure8.png?fit=1197%2C708\" data-orig-size=\"1197,708\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"figure8\" data-image-description=\"\" data-medium-file=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure8.png?fit=300%2C177\" data-large-file=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure8.png?fit=1024%2C606\" loading=\"lazy\" class=\"alignnone size-full wp-image-792\" alt=\"figure8\" src=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure8.png?resize=584%2C345\" width=\"584\" height=\"345\" srcset=\"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure8.png?w=1197 1197w, https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure8.png?resize=300%2C177 300w, https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure8.png?resize=768%2C454 768w, 
https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure8.png?resize=1024%2C606 1024w\" sizes=\"(max-width: 584px) 100vw, 584px\" data-recalc-dims=\"1\" \/><\/a><\/p>\n<p>The plot shows the observed density for each category of the outcome variable along the range of the predictor. For example, for a film with an IMDb rating of &#8216;6&#8217;, about 35% of the personal scores are &#8216;5&#8217;, a further 50% are &#8216;6&#8217;, and the remaining 15% are &#8216;7&#8217;. Remember that the plot is based on the <strong>observed conditional frequencies<\/strong> only (with some smoothing), not on the projections of a model. But the small ups and downs seem pretty idiosyncratic. We can also fit an ordered logistic regression model, which would be appropriate for the categorical outcome variable we have, and plot its <strong>predicted probabilities<\/strong> given the model.<\/p>\n<p>First, here is the output of the model:<\/p>\n<pre>library(MASS)\nsummary(polr(as.factor(mine.c) ~ imdb+year.c,  Hess=TRUE, data = d))\nCoefficients:\n        Value Std. Error t value\nimdb   1.4103   0.149921   9.407\nyear.c 0.0283   0.006023   4.699\n\nIntercepts:\n    Value   Std. Error t value\n5|6  9.0487  1.0795     8.3822\n6|7 10.6143  1.1075     9.5840\n7|8 12.1539  1.1435    10.6289\n8|9 14.0234  1.1876    11.8079\n\nResidual Deviance: 1148.665 \nAIC: 1160.665<\/pre>\n<p>The coefficients of the two predictors are significant. 
The plot below shows the predicted probability of the outcome variable &#8211; personal movie rating &#8211; being in each of the five categories as a function of IMDb rating and illustrates the substantive scale of the effect.<\/p>\n<p><a href=\"https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure9.png\"><img data-attachment-id=\"795\" data-permalink=\"http:\/\/re-design.dimiter.eu\/?attachment_id=795\" data-orig-file=\"https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure9.png?fit=1197%2C708\" data-orig-size=\"1197,708\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"figure9\" data-image-description=\"\" data-medium-file=\"https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure9.png?fit=300%2C177\" data-large-file=\"https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure9.png?fit=1024%2C606\" loading=\"lazy\" class=\"alignnone size-full wp-image-795\" alt=\"figure9\" src=\"https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure9.png?resize=584%2C345\" width=\"584\" height=\"345\" srcset=\"https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure9.png?w=1197 1197w, https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure9.png?resize=300%2C177 300w, https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure9.png?resize=768%2C454 768w, https:\/\/i1.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2014\/03\/figure9.png?resize=1024%2C606 1024w\" sizes=\"(max-width: 
584px) 100vw, 584px\" data-recalc-dims=\"1\" \/><\/a><\/p>\n<p>Compared to the non-parametric conditional density plot above, these model-based predictions are much smoother and have &#8216;disciplined&#8217; the effect of the predictor to follow a systematic pattern.<\/p>\n<p>It is interesting to ponder which of the two would be more useful for out-of-sample predictions. Although the non-parametric one is more faithful to the current data, I think I would go for the parametric model projections. After all, is it really plausible that a random film with an IMDb rating of 5 would have a lower chance of getting a 5 from me than a film with an IMDb rating of 6, as the non-parametric conditional density plot suggests? I don&#8217;t think so. Interestingly, in this case the parametric model has actually corrected for some of the selection bias and made for more plausible out-of-sample predictions.<\/p>\n<p><strong>Conclusion<br \/>\n<\/strong>In sum, whatever the method, it is not very fruitful to try to predict how much a person (or at least, the particular person writing this) would like a movie based on the average rating the movie gets and covariates like the genre or the director. Non-linear regressions and other modeling tricks offer only marginal predictive improvements over a simple linear regression approach, but bring plenty of insight about the data itself.<\/p>\n<p>What is the way ahead? Obviously, one would want to get more relevant predictors, but, unfortunately, IMDb <a href=\"http:\/\/www.imdb.com\/help\/show_leaf?usedatasoftware\" target=\"_blank\">seems<\/a> to have a policy against web scraping from its database, so one would either have to ask for permission or look at a different website with a more liberal policy (like Rotten Tomatoes perhaps). For me, the purpose of this exercise has been mostly its methodological and educational value, so I think I will leave it at that. 
Finally, don&#8217;t forget to check out the interactive <a href=\"http:\/\/www.dimiter.eu\/Visualizations_files\/imdb\/imdb_big.html\" target=\"_blank\">scatterplot<\/a>\u00a0of the data used here, which shows a user&#8217;s entire movie rating history at a glance.<\/p>\n<p><strong>Endnote<br \/>\n<\/strong>As you may have noticed, the IMDb ratings come at a greater level of precision (like 7.3) than the one available for individual users (like 7). So a user who really thinks that a film is worth 7.5 has to pick 7 or 8, but its average IMDb score could well be 7.5. If the rating categories available to the user are indeed too coarse, this would show up in the relationship with the IMDb score: movies with an average score of 7.5 would be less predictable than movies with an average score of either 7 or 8. To test this conjecture, I reran the linear regression models on two subsets of the data: one comprising the movies with an average IMDb rating between 5.9 and 6.1, 6.9 and 7.1, etc., and a second one comprising those with an average IMDb rating between 5.4 and 5.6, 6.4 and 6.6, etc. The fit of the regression for the first group was better than for the second (RMSE of 1.07 vs. 1.11), but, frankly, I expected a more dramatic difference. So maybe ten categories are just enough.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>It&#8217;s Oscars season again so why not explore how predictable (my) movie tastes are. This has literally been a million dollar problem\u00a0and obviously I am not gonna solve it here, but it&#8217;s fun and slightly educational to do some number crunching, so why not. Below, I will proceed from a simple linear regression to a generalized additive model to an ordered logistic regression analysis. And I will illustrate the results with nice plots along the way. Of course, all done in R (you can get the script here). 
Data The data for this little project comes from the IMDb website and, in particular, from my personal ratings of 442 titles recorded there. IMDb keeps the movies you have rated in a nice little table which includes information on the movie title, director, duration, year of release, genre, IMDb rating, and a few other less interesting variables. Conveniently, you can export the data directly as a csv file. Outcome variable The outcome variable that I want to predict is my personal movie rating. IMDb lets you score movies with one to ten stars. Half-points and other fractions are not allowed. It is a tricky variable to work with. It is obviously not a continuous one; at the same time ten ordered categories are a bit too many to treat as a regular categorical variable. Figure 1 plots the frequency distribution (black bars) and\u00a0density (red area) of my ratings and the density of the IMDb scores (in blue) for the 442 observations&#8230;<\/p>\n<div class=\"more-link-wrapper\"><a class=\"more-link\" href=\"http:\/\/re-design.dimiter.eu\/?p=767\">Continue reading<span class=\"screen-reader-text\">Predicting movie ratings with IMDb data and R<\/span><\/a><\/div>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","jetpack_publicize_message":"","jetpack_is_tweetstorm":false},"categories":[11,31,39],"tags":[117,266,267,286,324,386,411,417,418,453,460,472,514,515,539,619,681],"jetpack_featured_media_url":"","jetpack_publicize_connections":[],"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p7g3hj-cn","jetpack-related-posts":[{"id":496,"url":"http:\/\/re-design.dimiter.eu\/?p=496","url_meta":{"origin":767,"position":0},"title":"Scatterplots vs. 
regression tables (Economics professors edition)","date":"July 10, 2012","format":false,"excerpt":"I have always considered scatterplots to be the best available device to show relationships between variables. But it must be even better to have the regression table and a full description of the results in addition, right? Not so fast: A new paper shows that\u00a0professional economists make largely correct inferences\u2026","rel":"","context":"In &quot;Data visualization&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1013,"url":"http:\/\/re-design.dimiter.eu\/?p=1013","url_meta":{"origin":767,"position":1},"title":"The political geography of human development","date":"November 12, 2018","format":false,"excerpt":"The research I did for the previous post on the inadequacy of the widely-used term 'Global South' led me to some surprising results about the political geography of development. Although the relationship between latitude and human development is not linear, distance from the equator turned out to have a rather\u2026","rel":"","context":"In &quot;Data visualization&quot;","img":{"alt_text":"","src":"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2018\/11\/f3_hdi_eq.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":969,"url":"http:\/\/re-design.dimiter.eu\/?p=969","url_meta":{"origin":767,"position":2},"title":"The 'Global South' is a terrible term. Don't use it!","date":"November 6, 2018","format":false,"excerpt":"The Rise of the 'Global South' The 'Global South' and 'Global North' are increasingly popular terms used to categorize the countries of the world.\u00a0According to Wikipedia, the term 'Global South' originated in postcolonial studies, and was first used in 1969. 
The Google N-gram chart below shows the rise of the\u2026","rel":"","context":"In &quot;Classification&quot;","img":{"alt_text":"","src":"https:\/\/i2.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2018\/11\/f2_hdi_eq.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":307,"url":"http:\/\/re-design.dimiter.eu\/?p=307","url_meta":{"origin":767,"position":3},"title":"No use for big data in electioneering, according to Hollywood","date":"February 8, 2012","format":false,"excerpt":"Over the last year two major Hollywood movies that touch upon the use of big data and sophisticated data analysis hit the big screen. Which, of course, is two more than the mean (or was that the median). Moneyball shows how crunching numbers helps win baseball games and Margin Call\u2026","rel":"","context":"In &quot;Humour&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1053,"url":"http:\/\/re-design.dimiter.eu\/?p=1053","url_meta":{"origin":767,"position":4},"title":"Modeling mortality","date":"January 20, 2021","format":false,"excerpt":"To grasp the true impact of COVID-19 on our societies, we need to know the effect of the pandemic on mortality. In other words, we need to know how many deaths can be attributed to the virus, directly and indirectly. It is already popular to visualize mortality in order to\u2026","rel":"","context":"In &quot;Data visualization&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":260,"url":"http:\/\/re-design.dimiter.eu\/?p=260","url_meta":{"origin":767,"position":5},"title":"When 'just looking' beats regression","date":"January 30, 2012","format":false,"excerpt":"In a draft paper\u00a0currently under review I argue that the institutionalization of\u00a0a common EU\u00a0asylum policy has not led to a race to the bottom with respect to asylum applications, refugee status grants, and some other indicators. 
The graph below traces the number of asylum applications lodged in 29 European countries\u2026","rel":"","context":"In &quot;Time series analysis&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/re-design.dimiter.eu\/wp-content\/uploads\/2012\/01\/asylumapplications5.png?resize=350%2C200","width":350,"height":200},"classes":[]}],"_links":{"self":[{"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=\/wp\/v2\/posts\/767"}],"collection":[{"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=767"}],"version-history":[{"count":0,"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=\/wp\/v2\/posts\/767\/revisions"}],"wp:attachment":[{"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=767"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=767"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/re-design.dimiter.eu\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=767"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}