Modeling and Prediction for Movies

Setup

Load Packages

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.5.3

library(statsr)

## Warning: package 'statsr' was built under R version 3.5.2

library(GGally)

## Warning: package 'GGally' was built under R version 3.5.3

library(gridExtra)
library(boot)
library(broom)

## Warning: package 'broom' was built under R version 3.5.3

Load Data

load("movies.Rdata")

Project Sections

This project consists of six parts, and some of these parts have sub-sections. Kindly click on the Section Breakdown tab for the details of each section and click the numbers to navigate between sections.

Section Breakdown

1. Data.

2. Research Question.

3. Exploratory Data Analysis and Inference.

-3.1 Characteristics of Top 200 Box Office Movies.

—3.1.1 Difference in the Average IMDb Rating of Movies in and not in the Top 200 Box Office List.

—3.1.2. Confidence Interval for the Average IMDb Rating of Movies in the Top 200 Box Office List.

—3.1.3. Top 200 Box Office Movies and Best Picture Oscar Nomination/Award.

—3.1.4. Relationship between IMDb Rating and Some Variables, Focusing on Movies in the Top 200 Box Office List.

-3.2. Relationship between IMDb Rating, Critics’ Scores and Selected Variables.

4. Modeling.

-4.1. Simple Linear Regression.

-4.2. Multiple Linear Regression.

—4.2.1. Model Selection.

—-4.2.1.1 Model Diagnostics.

—-4.2.1.2 Interpreting the Regression Parameters: Slope and Intercept.

—-4.2.1.3 Interpreting the p-value and Strength of the Fit of the Linear Model using Adjusted R2

5. Prediction.

6. Conclusion

Part 1: Data

The data set consists of 651 movies randomly sampled from movies produced and released before 2016, with 32 variables. Information about the dataset was obtained from Rotten Tomatoes and IMDb.

The observations can be classified as a product of retrospective observational study, hence no random assignment. We cannot make causal conclusions from the data (we can only associate). The data were randomly sampled as stated above, hence generalizable.

Since there is no causal relationship but only association, the data is generalizable to all movies produced and released before 2016.

Part 2: Research Question

Anectodally, the goal of a movie is to entertain an audience and to generate revenue while doing so. One way to track a movie’s revenue is via Box Office Mojo called box office revenue. A movie making it to “Top 200 Box Office List” is a good way to go. Some will argue that it is a significant measure of a movie’s popularity and I do agree.

As a movie producer, as much as a critics score/rating is important, the audience score/rating should be of utmost consideration if the aim is to entertain, keeping in mind revenue. So, we want to know if critics score is a significant predictor of audience rating/score while simultaneously controlling for other variables.

For our project, we will develop a parsimonious model to predict the audience rating from critics score and other variables in the data.

In answering our research question, we will look at the following:

The characteristics of movies that made it to the Top 200 Box Office List.
Relationship between IMDb rating, critics’ score, and selected variables

Interest: A movie making it to the Top 200 Box Office List is an evidence of the audience’ choice, and a good way to determine the popularity of a movie is to predict the rating of the audience itself.

Part 3: Exploratory Data Analysis and Inference

For ease of analysis, first we check the summary and structure of our data, using summary (movies) and str(movies) respectively. See the codebook for more details. The result of our summary will not be included in the project.

str(movies)

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  32 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

summary(movies)

3.1 Characteristics of Top 200 Box Office Movies

We will examine the relationship between the audience rating/score (imdb_rating and audience_score-Rotten Tomatoes) and movies in the Top 200 Box Office List to determine the best variable for our response variable.

First, let’s get the summary of our Top 200 Box Office List (yes and no).

summary(movies$top200_box)

##  no yes 
## 636  15

From our sampled movies, only 15 movies are listed, and 636 movies are not in the Top 200 Box Office List.

We will look at the audience rating/score of the sampled movies both in and not in the Top 200 Box Office List, using both imdb_rating and audience_score variables.

# `imdb_rating` of `top200_box`
ggplot(data = movies, aes(x = top200_box, y = imdb_rating)) +
  geom_boxplot()

The imdb_rating for movies in the Top 200 Box Office List is high and slightly skewed to the left while movies not in Top 200 Box Office List are strongly left-skewed with notable outliers. A typical imdb_rating for movies in the Top 200 Box Office List is about 7.3 and less variable than movies not in the top 200. The bottom 25% imdb_rating of movies in the top 200 Box Office List is below 6.9 and the least score about 5.6.

We will calculate the summary statistics of the imdb_rating of movies in and not in the Top 200 Box Office List to get an accurate estimate.

# imdb_rating of Movies in the Top 200 Box Office List
movies %>% 
  filter(top200_box == "yes") %>%
  summarise(min = min(imdb_rating), max = max(imdb_rating), median = median(imdb_rating), 
            mean = mean(imdb_rating), sd = sd(imdb_rating), IQR = IQR(imdb_rating), 
            Q1 = quantile(imdb_rating, 0.25), Q3 = quantile(imdb_rating, 0.75))

## # A tibble: 1 x 8
##     min   max median  mean    sd   IQR    Q1    Q3
##   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   5.6     8    7.3  7.14 0.692  0.85  6.85   7.7

# imdb_rating of Movies not in the Top 200 Box Office List
movies %>% 
  filter(top200_box == "no") %>%
  summarise(min = min(imdb_rating), max = max(imdb_rating), median = median(imdb_rating), 
            mean = mean(imdb_rating), sd = sd(imdb_rating), IQR = IQR(imdb_rating), 
            Q1 = quantile(imdb_rating, 0.25), Q3 = quantile(imdb_rating, 0.75))

## # A tibble: 1 x 8
##     min   max median  mean    sd   IQR    Q1    Q3
##   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   1.9     9    6.6  6.48  1.09  1.40   5.9   7.3

Movies not in the Top 200 Box Office List have an imdb_rating as low as 1.9 to a high 9.

Before we delve deeper into analyzing imdb_rating of movies in the Top 200 Box Office List, let us visualize the audience_score for movies in and not in the Top 200 Box Office List. This is to guide us in our decision for the better variable to use for our response variable.

# audience_score of top200_box
ggplot(data = movies, aes(x = top200_box, y = audience_score)) +
  geom_boxplot()

The audience_score for movies in the Top 200 Box Office List are skewed to the right with visible outliers on the left while movies not in Top 200 Box Office List are slightly left-skewed. A typical audience_score for movies in the Top 200 Box Office List is about 81% and less variable than movies not in the top 200. The bottom 25% audience_score of movies in the top 200 Box Office List is below 70% and the least score of about 34%.

We can see that these distributions are more variable compared to that of imdb_rating. Also, for movies listed in the Top 200 Box Office List, the distribution of audience_score has notable outliers while that of imdb_rating is slightly left-skewed. Since we are interested in learning what attributes make a movie popular, hence focusing on movies in the Top 200 Box Office List, we will take the imdb_rating as a better estimate for our audience rating. [Back to sections]

3.1.1 Difference in the Average `imdb_ratings` of Movies in and not in the Top 200 Box Office List

Exercise: Is the average imdb_rating of movies in the Top 200 Box Office List greater than the average imdb_rating of movies not in the top 200 Box Office List?

We are interested in finding out if the average imdb_rating of movies in the top 200 Box Office List is generally greater than movies not in the List. Our point estimate is the observed difference between the sampled imdb_ratings of movies in and not in the Top 200 Box Office List.

The conditions of independence are met - movies were randomly sampled, and both categories represent less than 10% of movies in and not in the Top 200 Box Office List. The movies are independent of each other, and both categories are too (non-paired).
For the sample size/skew condition - the imdb_rating distribution of movies not in the Top 200 Box Office List is strongly skewed to the left, and a sample size of 636 is sufficient to model its mean. While for movies in the Top 200 Box Office List, the distribution of imdb_rating is slightly left-skewed and with a sample size of 15, we can relax the normality condition, as slight skew is not problematic.

Since the conditions are met, we will go ahead with the hypothesis test for inference, for comparing two independent means.

\(H_0\): \(mu_{rating.movies.in}\) \(-\) \(mu_{rating.movies.not.in}\) \(=\) \(0\). There is no difference between the average imdb_ratings of movies in and not in the Top 200 Box Office List.

\(H_A\): \(mu_{rating.movies.in}\) \(-\) \(mu_{rating.movies.not.in}\) \(>\) \(0\). The average imdb_rating of movies in the Top 200 Box Office List is greater than the average imdb_rating of movies not in the Top 200 Box Office List.

inference(y = imdb_rating, x = top200_box, data = movies, type = "ht", statistic = "mean", order = "yes, no",
          method = "theoretical", alternative = "greater")

## Warning: Missing null value, set to 0

## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_yes = 15, y_bar_yes = 7.14, s_yes = 0.6916
## n_no = 636, y_bar_no = 6.4778, s_no = 1.088
## H0: mu_yes =  mu_no
## HA: mu_yes > mu_no
## t = 3.6046, df = 14
## p_value = 0.0014

The p-value is less than 0.05, so we reject the null hypothesis. The data provide convincing evidence that the average imdb_rating of movies in the Top 200 Box Office List is greater than the average imdb_rating of movies not in the Top 200 Box Office List. Back to sections

3.1.2 Confidence Interval for the Average `imdb_rating` of Movies in the Top 200 Box Office List

Since our data provide convincing evidence that the imdb_rating of movies in Top 200 Box Office List is greater, we will go ahead to establish a confidence interval for the average imdb_rating of movies in Top 200 Box Office List.

The independence condition is met as discussed previously. Also, we can relax the normality condition for the sample size of 15. We can thus apply the t-distribution for the confidence interval for one sample mean, using the formula below.

\[ t-confidence\ interval\ for\ the\ mean = \bar{x} \pm t_{df}^*SE \]

# Confidence Interval for the Average imdb_rating of Movies in the Top 200 Box Office List
n <- 15
tstar = qt(0.975, df = 14)

movies %>%
  filter(top200_box == "yes") %>%
summarise(lower = mean(imdb_rating) - tstar * (sd(imdb_rating) / sqrt(n)),
            upper = mean(imdb_rating) + tstar * (sd(imdb_rating) / sqrt(n)))

## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1  6.76  7.52

We are 95% confident that the average imdb_rating of movies in Top 200 Box Office List is between 6.76 to 7.52.

If we are not satisfied with the sample size condition, we can decide to use the bootstrap method to estimate the sample mean (average imdb_rating of sampled movies in the Top 200 Box Office List) and estimate our confidence interval for the population.

# Bootstrapping
moviesbt <- movies %>%
  filter(top200_box == "yes") %>%
  select(imdb_rating)

moviesboot <- moviesbt$imdb_rating

set.seed(102)

bootmean <- function(original_vector, resample_vector){
  mean(original_vector[resample_vector])}

mean_imdbrating <- boot(moviesboot, bootmean, R = 10000)

tidy(mean_imdbrating)

## # A tibble: 1 x 3
##   statistic      bias std.error
##       <dbl>     <dbl>     <dbl>
## 1      7.14 -0.000559     0.172

The bootstrap mean of 7.14 is an accurate estimate of our sample mean.

boot.ci(mean_imdbrating)

## Warning in boot.ci(mean_imdbrating): bootstrap variances needed for
## studentized intervals

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = mean_imdbrating)
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   ( 6.803,  7.478 )   ( 6.827,  7.506 )  
## 
## Level     Percentile            BCa          
## 95%   ( 6.774,  7.453 )   ( 6.713,  7.407 )  
## Calculations and Intervals on Original Scale

The percentile bootstrap confidence interval can be adopted as the bias of the sample mean estimate is negligible, and the distribution is slightly skewed. However, the bootstrap confidence interval corresponds with our t-confidence interval so we can maintain our previous result.

So can we conclude that imdb_rating is a significant predictor of a movie being in the Top 200 Box Office List? This question requires further analysis, not within the scope of our study at the moment. We have seen from our analysis so far that movies in the Top 200 Box Office List have high imdb_rating and as such popular. Back to sections

3.1.3 Top 200 Box Office Movies and Best Picture Oscar Nomination/Award

Some argue that Oscar’ nomination of a movie is a way to promote a movie and nominees are solely based on professional’s choice. We will visualize the movies in and not in the Top 200 Box Office List that were nominated for best picture Oscar award and for those that won the award; then carry out a hypothesis test to evaluate if there is a difference in the proportion.

Plot1 <- movies %>%
  group_by(top200_box) %>% 
  count(best_pic_win) %>% 
  mutate(proportion = n/sum(n))%>%
ggplot(aes(x = top200_box, y = proportion, fill = best_pic_win)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(proportion * 100, 2), "%"), vjust = -0.2)) +
  labs(title = "Best Picture Oscar Award for Top 200 
       Box Office Movies (No/Yes)", x = "Top 200 Box Office List") +
  theme(axis.title.y = element_blank() , axis.text.y = element_blank(), axis.ticks.y = element_blank())
  

Plot2 <- movies %>%
  group_by(top200_box) %>% 
  count(best_pic_nom) %>% 
  mutate(proportion = n/sum(n))%>%
ggplot(aes(x = top200_box, y = proportion, fill = best_pic_nom)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(proportion  * 100, 2), "%"), vjust = -0.02)) +
  labs(title = "Nominated for Best Picture Oscar Award 
for Top 200 Box Office Movies (No/Yes)", x = "Top 200 Box Office List") +
  theme(axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank())

grid.arrange(Plot1, Plot2, ncol = 2)

0.94% of movies not in the Top 200 Box Office List won best picture Oscar award while 6.67% of movies in the Top 200 Box Office List won best picture Oscar award.

Also, 3.14% of movies not in the Top 200 Box Office List were nominated for the best picture Oscar award while 13.33% of movies in the Top 200 Box Office List were nominated for the best picture Oscar award.

From the plot, the proportion of sampled movies in the Top 200 Box Office List nominated for best picture Oscar award, also those who won the award, is greater than those not in the Top 200 Box Office List. So the question that comes to mind is: is there actually a significant difference in the proportions of movies in and not in Top 200 Box Office List, nominated for best picture Oscar award? This question is applicable to the proportion of movies in and not in the Top 200 Box Office List that won the best picture Oscar award.

Before we go ahead with the hypothesis test, we will calculate the counts and check if the conditions for inference for conducting a hypothesis test, to compare two proportions, are met:

# Count of Movies in and not in the Top 200 Box Office List that were Nominated for Best Picture Oscar Award
movies %>%
  group_by(top200_box) %>% 
  count(best_pic_nom)

## # A tibble: 4 x 3
## # Groups:   top200_box [2]
##   top200_box best_pic_nom     n
##   <fct>      <fct>        <int>
## 1 no         no             616
## 2 no         yes             20
## 3 yes        no              13
## 4 yes        yes              2

Hypothesis Test for Movies in and not in the Top 200 Box Office List that were Nominated for Best Picture Oscar Award:

\(H_0\): \(p_{movies.in.nom}\) \(=\) \(p_{movies.not.in.nom}\). The population proportion of movies in the Top 200 Box Office List nominated for best picture Oscar award is same as the population proportion of movies not in the Top 200 Box Office List nominated for best picture Oscar award.

\(H_A\): \(p_{movies.in.nom}\) \(!=\) \(p_{movies.not.in.nom}\). There is a difference between the population proportions of movies in the Top 200 Box Office List nominated for the best picture Oscar award and those not in the Top 200 Box Office List nominated for best picture Oscar award.

Conditions for inference for conducting a hypothesis test, to compare two proportions:

Independence: within groups-met: random sample: both populations were sampled randomly; the 10% condition is met for both populations. So, sampled movies in the Top 200 Box Office List nominated for the best picture Oscar award are independent of each other and also those not in the Top 200 Box Office List nominated for best picture Oscar award. between groups-met: We do not expect sampled movies in the Top 200 Box Office List nominated for the best picture Oscar award and those not in the Top 200 Box Office List nominated for best picture Oscar award to be dependent.
Sample size/skew: We need the pooled proportion to check the success-failure condition.

\[Success\ Condition = n \hat{p}_{pool} >= 10 \] \[Failure\ Condition = n(1 - \hat{p}_{pool}) >= 10 \]

# Pooled proportion = total successes/total n
phat_pool = (20 + 2)/(616 + 20 + 13 + 2)
phat_pool

## [1] 0.03379416

# Movies in the Top 200 Box Office List: success
15 * phat_pool

## [1] 0.5069124

# Movies in the Top 200 Box Office List: failure
15 * (1 - phat_pool)

## [1] 14.49309

# Movies not in the Top 200 Box Office List: success
636 * phat_pool

## [1] 21.49309

# Movies not in the Top 200 Box Office List: failure
636 * (1 - phat_pool)

## [1] 614.5069

The success condition for the movies in the Top 200 Box Office List is not met, so we conduct our inference via simulation.

inference(y = best_pic_nom, x = top200_box, data = movies, type = "ht", statistic = "proportion", success = "yes",
          method = "simulation", alternative = "twosided", nsim = 15000)

## Warning: Missing null value, set to 0

## Response variable: categorical (2 levels, success: yes)
## Explanatory variable: categorical (2 levels) 
## n_no = 636, p_hat_no = 0.0314
## n_yes = 15, p_hat_yes = 0.1333
## H0: p_no =  p_yes
## HA: p_no != p_yes
## p_value = 0.1805

At 5% significant level, the p-value is greater; hence we fail to reject the null hypothesis. The data do not provide strong evidence that the population proportion of movies in the Top 200 Box Office List nominated for the best picture Oscar award is different from those not in the Top 200 Box Office List nominated for best picture Oscar award.

So we will go ahead with the inference for those that won the best picture Oscar award.

# Count of Movies in and not in the Top 200 Box Office List that won Best Picture Oscar Award
movies %>%
  group_by(top200_box) %>% 
  count(best_pic_win)

## # A tibble: 4 x 3
## # Groups:   top200_box [2]
##   top200_box best_pic_win     n
##   <fct>      <fct>        <int>
## 1 no         no             630
## 2 no         yes              6
## 3 yes        no              14
## 4 yes        yes              1

Hypothesis Test for Movies in and not in the Top 200 Box Office List that won Best Picture Oscar Award:

\(H_0\): \(p_{movies.in.win}\) \(=\) \(p_{movies.not.in.win}\). The population proportion of movies in the Top 200 Box Office List that won best picture Oscar award is same as the population proportion of movies not in the Top 200 Box Office List that won best picture Oscar award.

\(H_A\): \(p_{movies.in.win}\) \(!=\) \(p_{movies.not.in.win}\). There is a difference between the population proportions of movies in the Top 200 Box Office List that won best picture Oscar award and those not in the Top 200 Box Office List that won best picture Oscar award.

Conditions for inference for conducting a hypothesis test, to compare two proportions:

Independence: within groups-met: random sample: both populations were sampled randomly; the 10% condition is met for both populations. So, sampled movies in the Top 200 Box Office List that won best picture Oscar award are independent of each other and also those not in the Top 200 Box Office List that won best picture Oscar award. between groups-met: We do not expect sampled movies in the Top 200 Box Office List that won the best picture Oscar award and those not in the Top 200 Box Office List that won best picture Oscar award to be dependent.
Sample size/skew: We need the pooled proportion to check the success-failure condition for a hypothesis test.

\[Success\ Condition = n \hat{p}_{pool} >= 10 \] \[Failure\ Condition = n(1 - \hat{p}_{pool}) >= 10 \]

# Pooled proportion = total successes/total n
phat_pool_win = (6 + 1)/(630 + 6 + 14 + 1)
phat_pool_win

## [1] 0.01075269

# Movies in the Top 200 Box Office List: success
15 * phat_pool_win

## [1] 0.1612903

# Movies in the Top 200 Box Office List: failure
15 * (1 - phat_pool_win)

## [1] 14.83871

# Movies not in the Top 200 Box Office List: success
636 * phat_pool_win

## [1] 6.83871

# Movies not in the Top 200 Box Office List: failure
636 * (1 - phat_pool_win)

## [1] 629.1613

The success conditions for the movies in and not in the Top 200 Box Office List are not met, so we conduct our inference via simulation.

inference(y = best_pic_win, x = top200_box, data = movies, type = "ht", statistic = "proportion", success = "yes",
          method = "simulation", alternative = "twosided", nsim = 15000)

## Warning: Missing null value, set to 0

## Response variable: categorical (2 levels, success: yes)
## Explanatory variable: categorical (2 levels) 
## n_no = 636, p_hat_no = 0.0094
## n_yes = 15, p_hat_yes = 0.0667
## H0: p_no =  p_yes
## HA: p_no != p_yes
## p_value = 0.2984

The p-value is greater than 0.05, so we fail to reject the null hypothesis. The data do not provide strong evidence that the population proportion of movies in the Top 200 Box Office List that won best picture Oscar award is different from those not in the Top 200 Box Office List that won best picture Oscar award. Back to sections

Oscar’ best picture nomination and win seem independent of whether a movie is in the Top 200 Box Office List or not.

3.1.4 Relationship between IMDb Rating and Some Variables, Focusing on Movies in the Top 200 Box Office List

We will evaluate the relationship betwen imdb_rating, critics_score and top 200 Box Office movies (in and not in the list).

Plot3 <- ggplot(data = movies) +
  geom_jitter(aes( x = critics_score, y = imdb_rating, color = top200_box)) + 
  ggtitle("Imdb_rating ~ Critics Score ~ Top 200 
           Box Office Movies (No/Yes)") +
  theme(legend.justification = c(1,0), legend.position = c(1,0))

Plot4 <- ggplot(data = movies, aes(x = top200_box, y = critics_score)) +
  geom_boxplot() + 
  ggtitle("Critics Score ~ Top 200 Box Office 
               Movies (No/Yes)")

grid.arrange(Plot3, Plot4, ncol = 2)

# Correlation Coefficient of the Relationship between imdb_rating and critics_score
summarise(.data = movies, cor(x = critics_score, y = imdb_rating))

## # A tibble: 1 x 1
##   `cor(x = critics_score, y = imdb_rating)`
##                                       <dbl>
## 1                                     0.765

The plot shows a strong, positive linear relationship between imdb_rating and critics_score with a correlation coefficient, R of 0.76. In the next section, we will find out if critics score (critics_score) is a significant predictor of audience rating (imdb_rating) while simultaneously controlling for other variables.

The distribution of critics_score of movies in the Top 200 Box Office List is moderately skewed to the left with values between about 53% and 95% and two notable outliers of about 31% and 38%, but less variable than the critics_score of movies not in the Top 200 Box Office List. The critics_score of movies not in the Top 200 Box Office List is slightly left skewed with values as low as 1% and as high as 100%. We already know, from our previous analysis, that the imdb_rating of movies in the Top 200 Box Office List is greater than that of movies not in the Top 200 Box Office List (also as displayed in the Imdb_rating ~ Critics Score ~ Top 200 Movies’ plot)

The summary statistics for critics score are as follows:

# critics_score of Movies in the Top 200 Box Office List
movies %>% 
  filter(top200_box == "yes") %>%
  summarise(min = min(critics_score), max = max(critics_score), median = median(critics_score), 
            mean = mean(critics_score), sd = sd(critics_score), IQR = IQR(critics_score), 
            Q1 = quantile(critics_score, 0.25), Q3 = quantile(critics_score, 0.75))

## # A tibble: 1 x 8
##     min   max median  mean    sd   IQR    Q1    Q3
##   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1    31    94     83  75.5  19.8    18  70.5  88.5

# critics_score of Movies not in the Top 200 Box Office List
movies %>% 
  filter(top200_box == "no") %>%
  summarise(min = min(critics_score), max = max(critics_score), median = median(critics_score), 
            mean = mean(critics_score), sd = sd(critics_score), IQR = IQR(critics_score), 
            Q1 = quantile(critics_score, 0.25), Q3 = quantile(critics_score, 0.75))

## # A tibble: 1 x 8
##     min   max median  mean    sd   IQR    Q1    Q3
##   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1   100   60.5  57.3  28.4    49    33    82

A typical critics score of movies in the Top 200 Box Office List is 83% with variability of 18% while a typical critics score of movies not in the Top 200 Box Office List is 60.5% with a variability of 49%.

Next, we visualize the relationship between imdb_rating, runtime and top200_box and also that of imdb_rating, genre and top200_box

Plot5 <- ggplot(data = movies) +
  geom_jitter(aes( x = runtime, y = imdb_rating, color = top200_box)) + 
  ggtitle("IIMDb rating ~ Runtime ~ Top 200 Box Office Movies (No/Yes)")

Plot6 <- ggplot(data = movies) +
  geom_histogram(aes( x = imdb_rating, fill = genre)) +
  facet_grid(top200_box ~. ) + 
  ggtitle("IMDb rating ~ Genre ~ Top 200 Box Office Movies (No/Yes)")

grid.arrange(Plot5, Plot6, ncol = 2)

## Warning: Removed 1 rows containing missing values (geom_point).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Correlation Coefficient of the Relationship between imdb_rating and runtime
movies %>% 
  filter(!is.na(runtime)) %>%
  summarise(cor(x = runtime, y = imdb_rating))

## # A tibble: 1 x 1
##   `cor(x = runtime, y = imdb_rating)`
##                                 <dbl>
## 1                               0.268

The IMDb rating ~ runtime ~ top 200 Box Office movies (no/yes) plot shows a weak, positive, linear relationship between imdb_rating and runtime with a correlation coefficient, \(R\) of 0.27 with about one notable leverage point that could be influential. Back to diagnostics plot

The runtime for movies in the Top 200 Box Office List is between about 87.5mins to 162.5mins with an outlier of about 194 mins while the runtime of movies not in the Top 200 Box Office List seems to be concentrated around about 105mins with outliers, to the left as low as about 37mins and to the right, as high as about 265mins.

The imdb_rating varies within each genre of movies, and each distribution is typically left-skewed. The movies in and not in the Top 200 Box Office List are of different genres. A great percentage of our sampled movies is drama; this could be by chance or most movies produced are predominantly drama. Back to sections

3.2 Relationship between IMDb Rating, Critics’ Scores and Selected Variables

The relationship between imdb_rating and critics_score are positive and strongly linear as discussed earlier. We will visualize their relationship with selected variables from the dataframe movies and with a new variable best_dir_act_s_win.

Creating the Variable - best_dir_act_s_win:

We want to know if there is a difference in the critics_score and imdb_rating for movies that featured at least two of either one of the main actor, main actress or a director who won an Oscar award previously. So, we create a new variable best_dir_act_s_win and add to a new dataframe movies1, from the movies dataframe.

movies1 <- movies %>%
  mutate(best_dir_act_s_win = ifelse(best_dir_win == "yes" & best_actor_win == "yes", "yes",
                              ifelse(best_dir_win == "yes" & best_actress_win == "yes", "yes", 
                              ifelse(best_actor_win == "yes" & best_actress_win == "yes", "yes", 
                              ifelse(best_dir_win == "yes" & best_actor_win == "yes" & 
                                       best_actress_win == "yes", "yes", "no")))))

We will go ahead to visualize the relationships.

Plot7 <- ggplot(data = movies) +
  geom_jitter(aes( x = critics_score, y = imdb_rating, color = best_pic_win)) 

Plot8 <- ggplot(data = movies) +
  geom_jitter(aes( x = critics_score, y = imdb_rating, color = best_pic_nom)) +
  scale_color_manual(values = c("orange", "blue"))

Plot9 <- ggplot(data = movies) +
  geom_jitter(aes( x = critics_score, y = imdb_rating, color = best_actor_win)) +
  scale_color_manual(values = c("violet", "blue"))

Plot10 <- ggplot(data = movies) +
  geom_jitter(aes( x = critics_score, y = imdb_rating, color = best_actress_win)) +
  scale_color_manual(values = c("salmon", "purple"))

Plot11 <- ggplot(data = movies) +
  geom_jitter(aes( x = critics_score, y = imdb_rating, color = best_dir_win)) +
  scale_color_manual(values = c("gold", "darkgreen")) 

Plot12 <- ggplot(data = movies1) +
  geom_jitter(aes( x = critics_score, y = imdb_rating, color = best_dir_act_s_win)) +
  scale_color_manual(values = c("orange", "blue"))

Plot13 <- ggplot(data = movies) +
  geom_jitter(aes( x = critics_score, y = imdb_rating, color = genre))

Plot14 <- ggplot(data = movies) +
  geom_histogram(aes( x = critics_score, fill = genre))

grid.arrange(Plot7, Plot8, Plot9, Plot10, Plot11, Plot12, Plot13, Plot14, ncol = 2, 
             top = "Relationship between IMDb Rating, Critics Score,and Selected Variables")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The sampled movies nominated for a best picture Oscar award and those that won, typically have high imdb_rating and critics_score.

The imdb_rating of movies that won best picture Oscar award varies from about 7.3 to 8.5 (skewed to the right) and typically have a critics_score between about 81.25% to 97.5%. While for movies that did not win the best picture Oscar award has imdb_rating of about 1.9 to 8.5 and crtics_score of about 2% to 100%. Since the number of movies that did not win a best picture Oscar award outnumbers those that won, their imdb_rating and critics_score are more variable. For movies nominated for the best picture Oscar award, there is a notable outlier in the distribution of critics_score (a score as low as 31% compared to the typical critics_score range of 75% to 97%).

There seems to be little or no difference in the imdb_rating and critics_score of movies that one of the main actors had ever won an Oscar and those that had actors that had never won an Oscar. Same can be said for movies that have a main actress that has ever won an Oscar or not. However, for movies that featured a main actress that has ever won an Oscar, the minimum imdb_rating is about 4.25, higher compared to those that did not, with a minimum imdb_rating of about 1.9. For movies directed by a director that has ever won an Oscar, the minimum imdb_rating is about 5.05, and there is little difference in their critics_score compared to those not directed by an Oscar winner.

Movies that either has one of the main actor, main actress or a director that has ever won an Oscar award have a minimum critics_score lower than 12.5% each. From our plot of the new variable best_dir_act_s_win, we can see that movies that have at least two of either one of the main actor, main actress or a director who has ever won an Oscar award have a minimum critics_score of about 39.5%. Also, the minimum imdb_rating for this category is about 5.8, highest compared to the imdb_rating of movies that either has one of the main actor, main actress or a director that has ever won an Oscar award.

It seems movies directed by a director that has ever won an Oscar award tend to have above average audience rating, but a critic’s score seems to be independent of whether one of the main actor, main actress or a director has ever won an Oscar award. However, critic’s score seems to be influenced by a movie that has at least two of either one of the main actor, main actress or a director who has ever won an Oscar award. This might have been due to chance, that is the movies are extremely good movies; hence the critics’ scores. We need to carry out further analysis to determine if there is an actual difference in critics_score for the different scenarios.

The imdb_rating and critics_score varies within each genre of movies. Back to sections

Part 4: Modeling

4.1 Simple Linear Regression

We will develop a simple linear model, to analyze if critics_score is a statistically significant predictor of imdb_rating before we proceed with developing a parsimonious model, to predict the audience rating from critics_score and other variables in the data.

ggplot(data = movies, aes(x = critics_score, y = imdb_rating)) +
  geom_jitter() +         # we use geom_jitter in case of overplotting
  geom_smooth(method = "lm", se = FALSE)

The relationship between imdb_rating and critics_score is positive and strongly linear as stated earlier.

Let us go ahead with the linear model, write out the equation and then define the slope and intercept of the relationship.

m_imdbcritics <- lm(imdb_rating ~ critics_score, data = movies)
summary(m_imdbcritics)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.93679 -0.39499  0.04512  0.43875  2.47556 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.8075715  0.0620690   77.45   <2e-16 ***
## critics_score 0.0292177  0.0009654   30.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6991 on 649 degrees of freedom
## Multiple R-squared:  0.5853, Adjusted R-squared:  0.5846 
## F-statistic: 915.9 on 1 and 649 DF,  p-value: < 2.2e-16

\[\widehat{imdb\_rating} = 4.8075715 + 0.0292177\ critics\_score \]

For each additional percentage increase in critics score on Rotten Tomatoes, we would expect the imdb_rating of movies to increase on average by 0.0292177.

Movies with a critics_score of 0 are expected on average to have an imdb_rating of 4.8075715.

We will examine if the conditions of the least squares regression are reasonable before we interpret the p-value and the strength of the fit of the linear model, R-squared \((R^2)\).

The linearity condition is met as the relationship between imdb_rating and critics_score are linear.
The independence condition is met as the data were randomly sampled and the sampled movies in this distribution are less than 10% of movies produced and released before 2016.

# Nearly Normal Residuals
Plot15 <- ggplot(data = m_imdbcritics, aes(x = .resid)) +
  geom_histogram() +
  ggtitle("Histogram of Residuals")

Plot16 <- ggplot(data = m_imdbcritics, aes(sample = .resid)) +
  stat_qq()  + 
  stat_qq_line() + 
  ggtitle("Normal Probability Plot of Residuals")

# Constant Variability of Residuals
Plot17 <- ggplot(data = m_imdbcritics, aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted Values", y = "Residuals") +
  ggtitle("Residuals vs. Fitted Values")

# Plotting the Absolute Values of the Residuals
Plot18 <- ggplot(data = m_imdbcritics, aes(x = .fitted, y = abs(.resid))) +
  geom_jitter() +
  ggtitle("Absolute Value of Residuals vs. Fitted Values")

grid.arrange(Plot15, Plot16, Plot17, Plot18, ncol = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram and normal probability plot of residuals show that the residuals are nearly normally distributed and centered around 0.

The residual plot and the plot of the absolute value of the residuals against the fitted values show that the variability of residuals around the 0 line is approximately constant with a few outliers

Since the conditions of least squares regression are reasonable, at 5% significant level, critics_score is a statistically significant predictor of imdb_rating. Also, 58.46% of the variability in imdb_rating is explained by the model (explained by the critics_score).Back to sections

4.2 Multiple Linear Regression

We know a linear relationship exists between imdb_rating and critics_score; now, we will develop a parsimonious model to predict the IMDb rating from critics_score and other variables in the data.

Before we proceed with the stepwise model selection, we need to decide the variables we will include in our model.

The following variables will not be included in our model:

actor1 to actor5: As stated in the project instruction page, the information in the variables was used to determine if the movie features an actor or an actress who have ever won an Oscar award; hence they will not be included in our model.
title, studio and director: Each movie has its own unique title; hence there are more than 600 titles in our movies data. For the director variable, some movies in our observation might have been directed by the same director. However there are numerous movie directors, and the best way to single them out is to use the best_dir_win variable (whether or not the director of the movie ever won an Oscar). Both variables are nominal/regular categorical variables with more than 600 different characters hence difficult to model.

The studio variable, however an ordinal categorical variable, has 211 levels thus will be difficult to model.
thtr_rel_year,'thtr_rel_day, dvd_rel_year and dvd_rel_day: We will exclude these variables but include the month the movie is released in theaters and on DVD. We believe the month of release should be of utmost consideration compared to the day and year as we have various seasons and the festive periods which could influence the audience choice.
imdb_num_votes: This variable will not be included in our model as before a movie is released to the public and even immediately after its release; we would not have an idea of the intended number of votes.
critics_rating: The critics rating on Rotten Tomatoes will not be included in our model, to prevent multicollinearity, as this variable is determined with the critics_score variable.
audience_score and audience_rating: We can say the audience_score variable on Rotten Tomatoes is similar to the imdb_rating on IMDb as they are both numerical variables that measure an audience opinion, hence it is advisable to exclude the audience_score variable from our model. Also, since the audience_rating is determined from the audience_score variable, it should not be included in the model. We can examine the relationship between these variables and the critics_score variable to make a better decision.

ggpairs(movies, columns = 16:18 )

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The critics_score and audience_score variables are collinear (correlated). Also, there appears to be a constant variance in the audience_rating variable with respect to the critics_score variable. In addition to the reason mentioned earlier, to prevent model estimate complication, it is best not to include the audience_score and audience_rating variables in the model. Back to sections

4.2.1 Model Selection

For our stepwise model selection, we will use the backward elimination method using the p-value approach. We decided to use the backward elimination method as we already have a full model in mind, which an elimination method will be applied, to drop one predictor at a time to achieve a parsimonious model. We chose the p-value approach as our focus is to obtain statistically significant predictors of our response variable, imdb_rating.

We will proceed with our full model, drop the variable with the highest p-value, refit the model and repeat the process until we have a model that has all significant variables. We choose a significant level of 5%. Click on each tab to navigate the steps.

Full Model

# Full Model
m_imdbrating <- lm(imdb_rating ~ critics_score + title_type + genre + runtime + mpaa_rating + thtr_rel_month +
                     dvd_rel_month + best_actor_win + best_actress_win + best_dir_win + top200_box, data = movies)

summary(m_imdbrating)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score + title_type + genre + 
##     runtime + mpaa_rating + thtr_rel_month + dvd_rel_month + 
##     best_actor_win + best_actress_win + best_dir_win + top200_box, 
##     data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.71544 -0.36767  0.03346  0.39370  1.91265 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.183819   0.349046  11.986  < 2e-16 ***
## critics_score                   0.025088   0.001089  23.036  < 2e-16 ***
## title_typeFeature Film         -0.037049   0.247762  -0.150   0.8812    
## title_typeTV Movie             -0.509404   0.388331  -1.312   0.1901    
## genreAnimation                 -0.223722   0.261373  -0.856   0.3924    
## genreArt House & International  0.519468   0.207366   2.505   0.0125 *  
## genreComedy                    -0.144871   0.112384  -1.289   0.1979    
## genreDocumentary                0.633454   0.264579   2.394   0.0170 *  
## genreDrama                      0.134124   0.098443   1.362   0.1736    
## genreHorror                    -0.201666   0.165829  -1.216   0.2244    
## genreMusical & Performing Arts  0.388767   0.226550   1.716   0.0867 .  
## genreMystery & Suspense         0.104309   0.125530   0.831   0.4063    
## genreOther                      0.079219   0.189205   0.419   0.6756    
## genreScience Fiction & Fantasy -0.268602   0.249220  -1.078   0.2816    
## runtime                         0.007297   0.001604   4.548 6.53e-06 ***
## mpaa_ratingNC-17               -0.590952   0.505261  -1.170   0.2426    
## mpaa_ratingPG                  -0.206294   0.188791  -1.093   0.2749    
## mpaa_ratingPG-13               -0.179865   0.194097  -0.927   0.3545    
## mpaa_ratingR                   -0.103888   0.187919  -0.553   0.5806    
## mpaa_ratingUnrated             -0.262053   0.216882  -1.208   0.2274    
## thtr_rel_month                  0.008668   0.007741   1.120   0.2632    
## dvd_rel_month                   0.016038   0.007982   2.009   0.0450 *  
## best_actor_winyes               0.004729   0.078741   0.060   0.9521    
## best_actress_winyes             0.013014   0.086434   0.151   0.8804    
## best_dir_winyes                 0.078327   0.109672   0.714   0.4754    
## top200_boxyes                   0.116845   0.180540   0.647   0.5177    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6609 on 616 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.6364, Adjusted R-squared:  0.6216 
## F-statistic: 43.12 on 25 and 616 DF,  p-value: < 2.2e-16

The variable, best_actor_win has the highest p-value hence we drop it and proceed with the model.

Step 1

# Step 1
m_imdbrating1 <- lm(imdb_rating ~ critics_score + title_type + genre + runtime + mpaa_rating + thtr_rel_month + 
                     dvd_rel_month + best_actress_win + best_dir_win + top200_box, data = movies)

summary(m_imdbrating1)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score + title_type + genre + 
##     runtime + mpaa_rating + thtr_rel_month + dvd_rel_month + 
##     best_actress_win + best_dir_win + top200_box, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.71641 -0.36717  0.03491  0.39351  1.91186 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.182239   0.347772  12.026  < 2e-16 ***
## critics_score                   0.025088   0.001088  23.055  < 2e-16 ***
## title_typeFeature Film         -0.036812   0.247530  -0.149   0.8818    
## title_typeTV Movie             -0.509687   0.387989  -1.314   0.1894    
## genreAnimation                 -0.223465   0.261127  -0.856   0.3925    
## genreArt House & International  0.519088   0.207102   2.506   0.0125 *  
## genreComedy                    -0.144873   0.112293  -1.290   0.1975    
## genreDocumentary                0.633643   0.264347   2.397   0.0168 *  
## genreDrama                      0.134299   0.098320   1.366   0.1725    
## genreHorror                    -0.201873   0.165660  -1.219   0.2235    
## genreMusical & Performing Arts  0.388518   0.226329   1.717   0.0866 .  
## genreMystery & Suspense         0.104957   0.124965   0.840   0.4013    
## genreOther                      0.079389   0.189031   0.420   0.6746    
## genreScience Fiction & Fantasy -0.269015   0.248924  -1.081   0.2802    
## runtime                         0.007315   0.001575   4.644 4.17e-06 ***
## mpaa_ratingNC-17               -0.588904   0.503701  -1.169   0.2428    
## mpaa_ratingPG                  -0.205933   0.188543  -1.092   0.2752    
## mpaa_ratingPG-13               -0.179751   0.193931  -0.927   0.3544    
## mpaa_ratingR                   -0.103823   0.187764  -0.553   0.5805    
## mpaa_ratingUnrated             -0.262131   0.216703  -1.210   0.2269    
## thtr_rel_month                  0.008680   0.007732   1.123   0.2620    
## dvd_rel_month                   0.015996   0.007944   2.014   0.0445 *  
## best_actress_winyes             0.013348   0.086185   0.155   0.8770    
## best_dir_winyes                 0.078480   0.109553   0.716   0.4740    
## top200_boxyes                   0.116999   0.180376   0.649   0.5168    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6604 on 617 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.6364, Adjusted R-squared:  0.6222 
## F-statistic: 44.99 on 24 and 617 DF,  p-value: < 2.2e-16

The variable, title_type: Feature Film has the highest p-value but we cannot drop it, as one of the levels of movie types is significant. Instead, we drop the variable, best_actress_win and proceed.

Note that we cannot drop the individual level of a variable, especially when one of its level is significant, we keep the whole variable in. It simply means a variable with different levels can only be dropped as a whole when all of its levels are insignificant.

Step 2

# Step 2
m_imdbrating2 <- lm(imdb_rating ~ critics_score + title_type + genre + runtime + mpaa_rating + thtr_rel_month + 
                     dvd_rel_month + best_dir_win + top200_box, data = movies)

summary(m_imdbrating2)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score + title_type + genre + 
##     runtime + mpaa_rating + thtr_rel_month + dvd_rel_month + 
##     best_dir_win + top200_box, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.71843 -0.35795  0.03548  0.39665  1.91051 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.177784   0.346307  12.064  < 2e-16 ***
## critics_score                   0.025093   0.001087  23.086  < 2e-16 ***
## title_typeFeature Film         -0.036587   0.247331  -0.148   0.8824    
## title_typeTV Movie             -0.508036   0.387536  -1.311   0.1904    
## genreAnimation                 -0.221253   0.260529  -0.849   0.3961    
## genreArt House & International  0.520586   0.206713   2.518   0.0120 *  
## genreComedy                    -0.143122   0.111634  -1.282   0.2003    
## genreDocumentary                0.634741   0.264043   2.404   0.0165 *  
## genreDrama                      0.136296   0.097394   1.399   0.1622    
## genreHorror                    -0.201021   0.165437  -1.215   0.2248    
## genreMusical & Performing Arts  0.388399   0.226149   1.717   0.0864 .  
## genreMystery & Suspense         0.107231   0.124001   0.865   0.3875    
## genreOther                      0.080689   0.188696   0.428   0.6691    
## genreScience Fiction & Fantasy -0.268988   0.248727  -1.081   0.2799    
## runtime                         0.007350   0.001557   4.719 2.93e-06 ***
## mpaa_ratingNC-17               -0.590622   0.503181  -1.174   0.2409    
## mpaa_ratingPG                  -0.205563   0.188379  -1.091   0.2756    
## mpaa_ratingPG-13               -0.179396   0.193765  -0.926   0.3549    
## mpaa_ratingR                   -0.104001   0.187612  -0.554   0.5795    
## mpaa_ratingUnrated             -0.262590   0.216511  -1.213   0.2257    
## thtr_rel_month                  0.008692   0.007725   1.125   0.2610    
## dvd_rel_month                   0.015997   0.007938   2.015   0.0443 *  
## best_dir_winyes                 0.078934   0.109428   0.721   0.4710    
## top200_boxyes                   0.118609   0.179934   0.659   0.5100    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6599 on 618 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.6364, Adjusted R-squared:  0.6228 
## F-statistic: 47.02 on 23 and 618 DF,  p-value: < 2.2e-16

We drop the variable top200_box next.

Step 3

# Step 3
m_imdbrating3 <- lm(imdb_rating ~ critics_score + title_type + genre + runtime + mpaa_rating + thtr_rel_month + 
                     dvd_rel_month + best_dir_win, data = movies)

summary(m_imdbrating3)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score + title_type + genre + 
##     runtime + mpaa_rating + thtr_rel_month + dvd_rel_month + 
##     best_dir_win, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.72008 -0.36482  0.03522  0.40108  1.91190 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.178039   0.346148  12.070  < 2e-16 ***
## critics_score                   0.025172   0.001080  23.312  < 2e-16 ***
## title_typeFeature Film         -0.033715   0.247179  -0.136   0.8916    
## title_typeTV Movie             -0.505287   0.387336  -1.305   0.1925    
## genreAnimation                 -0.237694   0.259214  -0.917   0.3595    
## genreArt House & International  0.513053   0.206302   2.487   0.0131 *  
## genreComedy                    -0.149155   0.111207  -1.341   0.1803    
## genreDocumentary                0.627323   0.263682   2.379   0.0177 *  
## genreDrama                      0.128916   0.096704   1.333   0.1830    
## genreHorror                    -0.206209   0.165174  -1.248   0.2123    
## genreMusical & Performing Arts  0.377977   0.225492   1.676   0.0942 .  
## genreMystery & Suspense         0.100736   0.123553   0.815   0.4152    
## genreOther                      0.077349   0.188541   0.410   0.6818    
## genreScience Fiction & Fantasy -0.264241   0.248509  -1.063   0.2881    
## runtime                         0.007461   0.001548   4.821  1.8e-06 ***
## mpaa_ratingNC-17               -0.606883   0.502347  -1.208   0.2275    
## mpaa_ratingPG                  -0.215668   0.187668  -1.149   0.2509    
## mpaa_ratingPG-13               -0.191741   0.192769  -0.995   0.3203    
## mpaa_ratingR                   -0.118980   0.186145  -0.639   0.5229    
## mpaa_ratingUnrated             -0.277498   0.215228  -1.289   0.1978    
## thtr_rel_month                  0.009009   0.007707   1.169   0.2428    
## dvd_rel_month                   0.016126   0.007932   2.033   0.0425 *  
## best_dir_winyes                 0.078200   0.109372   0.715   0.4749    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6596 on 619 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.6361, Adjusted R-squared:  0.6232 
## F-statistic: 49.19 on 22 and 619 DF,  p-value: < 2.2e-16

We drop the best_dir_win variable and proceed with the model.

Step 4

# Step 4
m_imdbrating4 <- lm(imdb_rating ~ critics_score + title_type + genre + runtime + mpaa_rating + thtr_rel_month + 
                     dvd_rel_month, data = movies)

summary(m_imdbrating4)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score + title_type + genre + 
##     runtime + mpaa_rating + thtr_rel_month + dvd_rel_month, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.72216 -0.36671  0.04915  0.40998  1.91162 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.146908   0.343263  12.081  < 2e-16 ***
## critics_score                   0.025277   0.001069  23.636  < 2e-16 ***
## title_typeFeature Film         -0.026566   0.246879  -0.108   0.9143    
## title_typeTV Movie             -0.500053   0.387114  -1.292   0.1969    
## genreAnimation                 -0.238386   0.259110  -0.920   0.3579    
## genreArt House & International  0.508303   0.206114   2.466   0.0139 *  
## genreComedy                    -0.149033   0.111163  -1.341   0.1805    
## genreDocumentary                0.628875   0.263569   2.386   0.0173 *  
## genreDrama                      0.126453   0.096605   1.309   0.1910    
## genreHorror                    -0.205506   0.165106  -1.245   0.2137    
## genreMusical & Performing Arts  0.375823   0.225383   1.667   0.0959 .  
## genreMystery & Suspense         0.100550   0.123504   0.814   0.4159    
## genreOther                      0.072877   0.188363   0.387   0.6990    
## genreScience Fiction & Fantasy -0.259781   0.248333  -1.046   0.2959    
## runtime                         0.007685   0.001515   5.074 5.16e-07 ***
## mpaa_ratingNC-17               -0.608089   0.502146  -1.211   0.2264    
## mpaa_ratingPG                  -0.211103   0.187486  -1.126   0.2606    
## mpaa_ratingPG-13               -0.189250   0.192662  -0.982   0.3263    
## mpaa_ratingR                   -0.114865   0.185983  -0.618   0.5371    
## mpaa_ratingUnrated             -0.278364   0.215140  -1.294   0.1962    
## thtr_rel_month                  0.009061   0.007703   1.176   0.2399    
## dvd_rel_month                   0.015735   0.007910   1.989   0.0471 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6593 on 620 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.6358, Adjusted R-squared:  0.6235 
## F-statistic: 51.54 on 21 and 620 DF,  p-value: < 2.2e-16

We drop the thtr_rel_month variable next.

Step 5

# Step 5
m_imdbrating5 <- lm(imdb_rating ~ critics_score + title_type + genre + runtime + mpaa_rating + 
                     dvd_rel_month, data = movies)

summary(m_imdbrating5)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score + title_type + genre + 
##     runtime + mpaa_rating + dvd_rel_month, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.72236 -0.35491  0.04024  0.40182  1.93769 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.173066   0.342648  12.179  < 2e-16 ***
## critics_score                   0.025288   0.001070  23.641  < 2e-16 ***
## title_typeFeature Film         -0.026714   0.246956  -0.108   0.9139    
## title_typeTV Movie             -0.511830   0.387104  -1.322   0.1866    
## genreAnimation                 -0.228620   0.259057  -0.883   0.3778    
## genreArt House & International  0.511141   0.206163   2.479   0.0134 *  
## genreComedy                    -0.144192   0.111121  -1.298   0.1949    
## genreDocumentary                0.631203   0.263643   2.394   0.0170 *  
## genreDrama                      0.124222   0.096616   1.286   0.1990    
## genreHorror                    -0.204178   0.165154  -1.236   0.2168    
## genreMusical & Performing Arts  0.377072   0.225450   1.673   0.0949 .  
## genreMystery & Suspense         0.092460   0.123350   0.750   0.4538    
## genreOther                      0.062294   0.188206   0.331   0.7408    
## genreScience Fiction & Fantasy -0.265299   0.248365  -1.068   0.2859    
## runtime                         0.008114   0.001471   5.516 5.08e-08 ***
## mpaa_ratingNC-17               -0.616055   0.502255  -1.227   0.2204    
## mpaa_ratingPG                  -0.209321   0.187537  -1.116   0.2648    
## mpaa_ratingPG-13               -0.194926   0.192661  -1.012   0.3120    
## mpaa_ratingR                   -0.113208   0.186035  -0.609   0.5431    
## mpaa_ratingUnrated             -0.282718   0.215175  -1.314   0.1894    
## dvd_rel_month                   0.014229   0.007808   1.822   0.0689 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6595 on 621 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.635,  Adjusted R-squared:  0.6232 
## F-statistic: 54.02 on 20 and 621 DF,  p-value: < 2.2e-16

Next, we drop the mpaa_rating variable.

Step 6

# Step 6
m_imdbrating6 <- lm(imdb_rating ~ critics_score + title_type + genre + runtime + dvd_rel_month, data = movies)

summary(m_imdbrating6)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score + title_type + genre + 
##     runtime + dvd_rel_month, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.77800 -0.35270  0.03902  0.40298  1.86330 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.005772   0.304622  13.150  < 2e-16 ***
## critics_score                   0.025334   0.001045  24.254  < 2e-16 ***
## title_typeFeature Film          0.011252   0.244832   0.046   0.9634    
## title_typeTV Movie             -0.499670   0.386556  -1.293   0.1966    
## genreAnimation                 -0.147920   0.237019  -0.624   0.5328    
## genreArt House & International  0.486790   0.201358   2.418   0.0159 *  
## genreComedy                    -0.155098   0.109920  -1.411   0.1587    
## genreDocumentary                0.590588   0.260285   2.269   0.0236 *  
## genreDrama                      0.127380   0.094237   1.352   0.1770    
## genreHorror                    -0.188218   0.161701  -1.164   0.2449    
## genreMusical & Performing Arts  0.373625   0.224545   1.664   0.0966 .  
## genreMystery & Suspense         0.117185   0.120530   0.972   0.3313    
## genreOther                      0.043995   0.187422   0.235   0.8145    
## genreScience Fiction & Fantasy -0.244167   0.248203  -0.984   0.3256    
## runtime                         0.007854   0.001445   5.434 7.92e-08 ***
## dvd_rel_month                   0.014078   0.007792   1.807   0.0713 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6597 on 626 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.6318, Adjusted R-squared:  0.623 
## F-statistic: 71.62 on 15 and 626 DF,  p-value: < 2.2e-16

We drop the title_type variable.

Step 7

# Step 7
m_imdbrating7 <- lm(imdb_rating ~ critics_score + genre + runtime + dvd_rel_month, data = movies)

summary(m_imdbrating7)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score + genre + runtime + 
##     dvd_rel_month, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.77108 -0.35353  0.03966  0.40566  1.86784 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.007538   0.180606  22.189  < 2e-16 ***
## critics_score                   0.025333   0.001033  24.513  < 2e-16 ***
## genreAnimation                 -0.146492   0.237179  -0.618   0.5370    
## genreArt House & International  0.487025   0.201499   2.417   0.0159 *  
## genreComedy                    -0.154777   0.109874  -1.409   0.1594    
## genreDocumentary                0.580995   0.134949   4.305 1.93e-05 ***
## genreDrama                      0.119758   0.094144   1.272   0.2038    
## genreHorror                    -0.186927   0.161817  -1.155   0.2485    
## genreMusical & Performing Arts  0.368513   0.211406   1.743   0.0818 .  
## genreMystery & Suspense         0.116403   0.120602   0.965   0.3348    
## genreOther                      0.011119   0.186526   0.060   0.9525    
## genreScience Fiction & Fantasy -0.243577   0.248379  -0.981   0.3271    
## runtime                         0.007963   0.001445   5.512 5.19e-08 ***
## dvd_rel_month                   0.013812   0.007790   1.773   0.0767 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6602 on 628 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.6301, Adjusted R-squared:  0.6225 
## F-statistic:  82.3 on 13 and 628 DF,  p-value: < 2.2e-16

We drop the dvd_rel_month variable.

Final Model

# Final Model
mfinal_imdbrating <- lm(imdb_rating ~ critics_score + genre + runtime, data = movies)

summary(mfinal_imdbrating)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score + genre + runtime, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.77547 -0.34271  0.03323  0.40747  1.83491 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.095442   0.170768  23.982  < 2e-16 ***
## critics_score                   0.025661   0.001034  24.828  < 2e-16 ***
## genreAnimation                 -0.166659   0.237643  -0.701   0.4834    
## genreArt House & International  0.394428   0.195927   2.013   0.0445 *  
## genreComedy                    -0.159168   0.109287  -1.456   0.1458    
## genreDocumentary                0.581629   0.133603   4.353 1.56e-05 ***
## genreDrama                      0.114409   0.093403   1.225   0.2211    
## genreHorror                    -0.183407   0.162012  -1.132   0.2580    
## genreMusical & Performing Arts  0.347183   0.211859   1.639   0.1018    
## genreMystery & Suspense         0.112592   0.120381   0.935   0.3500    
## genreOther                      0.001071   0.186950   0.006   0.9954    
## genreScience Fiction & Fantasy -0.413202   0.236343  -1.748   0.0809 .  
## runtime                         0.007824   0.001445   5.416 8.65e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6639 on 637 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.6324, Adjusted R-squared:  0.6255 
## F-statistic: 91.34 on 12 and 637 DF,  p-value: < 2.2e-16

We can summarize our final model for predicting imdb_rating with the equation below:

\[\begin{align} \widehat{imdb\_rating} = 4.095 + 0.026\ critics\_score\ - 0.167\ genre: Animation\ + 0.394\ genre: Art\ House\ \&\ International\\ -\ 0.159\ genre: Comedy\ + 0.582\ genre: Documentary\ + 0.114\ genre: Drama\ - 0.183\ genre: Horror\\ +\ 0.347\ genre: Musical\ \&\ Performing Arts\ + 0.112\ genre: Mystery\ \&\ Suspense\ + 0.001\ genre: Other\\ -\ 0.413\ genre: Science\ Fiction\ \&\ Fantasy\ + 0.008\ runtime\end{align} \]

4.2.1.1 Model Diagnostics

Before we interpret our parameter estimate, p-value and the strength of the fit of the linear model, adjusted R-squared \((R^2)\), we will examine if the conditions of the least squares regression are reasonable, using diagnostic plots.

# Linear Relationship between (Numerical) x and y
Plot19 <- ggplot(data = mfinal_imdbrating, aes(x = critics_score, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  ggtitle("Residuals vs. Critics Score")

Plot20 <- ggplot(data = mfinal_imdbrating, aes(x = runtime, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  ggtitle("Residuals vs. Runtime")

# Nearly Normal Residuals
Plot21 <- ggplot(data = mfinal_imdbrating, aes(x = .resid)) +
  geom_histogram() +
  ggtitle("Histogram of Residuals")

Plot22 <- ggplot(data = mfinal_imdbrating, aes(sample = .resid)) +
  stat_qq()  + 
  ggtitle("Normal Probability Plot of Residuals")

# Constant Variability of Residuals
Plot23 <- ggplot(data = mfinal_imdbrating, aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted Values", y = "Residuals") +
  ggtitle("Residuals vs. Fitted Values")

# Plotting the Absolute Values of the Residuals # to also confirm constant variability
Plot24 <- ggplot(data = mfinal_imdbrating, aes(x = .fitted, y = abs(.resid))) +
  geom_jitter() +
  ggtitle("Absolute Value of Residuals vs. Fitted Values")

# Variability across Categorical Variable- "genre"
Plot25 <- ggplot(data = mfinal_imdbrating, aes(x = genre, y = .resid)) +
  geom_boxplot() +
  ggtitle("Residuals vs. Genre") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


Plot26 <- ggplot(data = mfinal_imdbrating, aes(x = seq_along(.resid), y = .resid)) +
  geom_jitter() +
  labs(x = "Order of Collection", y = "Residuals", title = "Residuals in their Order of Data Collection")

grid.arrange(Plot19, Plot20, Plot21, Plot22, Plot23, Plot24, Plot25, Plot26, ncol = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Linear Relationship between x vs. y: The plot of residuals vs. critics_score shows random scatter of residuals around 0; so does the plot of residuals vs. runtime. However, for the plot of residuals vs. runtime, the variability of points around the least square line is not actually constant, indicating heteroscedasticity, though a linear relationship exists but weaker compared to the plot of residuals vs. critics_score. (See our earlier discussion on the bivariate relationship between imdb_rating and runtime).

From the residuals vs. genre plot, there is a difference in the variability of the residuals across the groups.
Nearly Normal Residuals: From the normal probability plot and histogram of residuals, there is random scatter of residuals around 0. However, there are a few outliers to the left which are no cause for concern as the number of observations is large enough. The condition of the nearly normal distribution of residuals around 0 is met.
Constant Variability of Residuals: The residuals vs. fitted values plot shows random scatter of residuals with a nearly constant width around 0. However, there are few outliers for low and high fitted values. Also, the absolute value of the residuals against their fitted values plot shows the nearly constant variance of the residuals.
Independent residuals: The plot of residuals in their order of data collection shows no connection (no structure) between observations, hence independent. Also, the data were randomly sampled, and less than 10% of movies produced and released before 2016.

The conditions for this regression model are reasonable, so we will go ahead to interpret the parameter estimates, p-value and strength of the fit of the linear model, adjusted \(R^2\). Back to sections

4.2.1.2 Interpreting the Regression Parameters: Slope and Intercept

Slope of Critics’ Score: All else held constant, for each additional percentage increase in critics score on Rotten Tomatoes, we would expect the imdb_rating of movies to increase on average by 0.026. The addition of other variables has reduced the slope estimate of critics_score by 12.17% when compared to the slope from our previous model (simple linear regression model).
Confidence Interval for the Slope of Critics’ Score`:

\[\begin{align} =\ point\ estimate\ \pm\ margin\ of\ error \\ =\ b_1 \pm\ t^*_{df}SE_{b1}\end{align}\]

SE = 0.001034
b1 = 0.025661
tstar_df = qt(0.975, df = 637)
lower = b1 - (tstar_df * SE)
upper = b1 + (tstar_df * SE)

confint <- matrix(c(lower, upper), ncol = 2, nrow = 1)
colnames(confint) <- c("Lower", "Upper")
rownames(confint) <- c("Confidence Interval")
as.table(confint)

##                          Lower      Upper
## Confidence Interval 0.02363054 0.02769146

We are 95% confident that, all else being equal, the model predicts that for each additional percentage increase in critics score on Rotten Tomatoes, we would expect the imdb_rating of movies to increase on average between 0.024 and 0.028.

Slope of Runtime: All else held constant, for each 1-minute increase in the runtime of movie, the model predicts the imdb_rating of movies to increase on average by 0.008.
Intercept: Action & Adventure movies with no critics_score and no runtime are expected to have an imdb_rating of 4.095. The reference level for movies’ genre is Action & Adventure.

The intercept is meaningless in context, serves to adjust the height of the line as no runtime simply means no movie.

From our model, for movie genres, Science Fiction & Fantasy movie is predicted to have the lowest imdb_rating, and Documentary movie is expected to have the highest imdb_rating, all else held constant.

4.2.1.3 Interpreting the p-value and Strength of the Fit of the Linear Model, using Adjusted \(R^2\)

To determine the significance of the whole model and one of the predictors, critics_score, we will interpret the p-values and make an inference.

Inference for the Model as a Whole

\(H_0\): \(\beta_i\) \(=\) \(\beta_1\) \(=\)……\(=\) \(\beta_k\) \(=\) \(0\)

\(H_A\): At least one \(\beta_i\) is different than \(0\).

The result from our final model is as follows: \(*[F-statistic: 91.34\ on\ 12\ and\ 637\ DF,\ p-value: < 2.2e-16]*\)

At 5% significant level, we reject the null hypothesis; hence the model is significant. This simply means that at least one of the slopes is not zero.

Hypothesis Testing for the Slope of Critics’ Score

Question: Is critics_score a significant predictor of the IMDb rating of movies, given all other variables in the model?

\(H_0\): \(\beta_1\) \(=\) \(0\), when all other variables are included in the model.

\(H_A\): \(\beta_1\) \(\not=\) \(0\), when all other variables are included in the model.

From our model result, the p-value of the critics_score \((< 2 \times 10^{-16})\) is less than \(0.05\), so we reject the null hypothesis. critics_score is thus a significant predictor of the imdb_rating of movies, given all other variables in the model.

Strength of the Fit of the Linear Model, Adjusted \(R^2\)

62.55% of the variability in imdb_rating is explained by the model. So, 37.45% of the variability in imdb_rating is not explained by this model.

The \(adjusted\ R^2,\ 0.6255,\) of the final model increased by 0.63% from that of the full model \((adjusted\ R^2 = 0.6216)\).

There is a 7% increase in the adjusted \(R^2\) of this model compared to the adjusted \(R^2\) of the simple linear model that considered only one predictor variable (critics_score).

Part 5: Prediction

We will use the created model to predict the imdb_rating of some new movies from 2016 (not in our sample distribution) and quantify the prediction uncertainty, at a confidence level of 95%. The critics_score, genre, runtime and actual imdb_rating of each movie were obtained from Rotten Tomatoes and IMDb.

Zootopia <- data.frame(critics_score = 97, genre = "Animation", runtime = 108)
predict1 <- predict(mfinal_imdbrating, Zootopia, interval = "prediction")

Batman <- data.frame(critics_score = 27, genre = "Action & Adventure", runtime = 151)
predict2 <- predict(mfinal_imdbrating, Batman, interval = "prediction")

Deadpool <- data.frame(critics_score = 84, genre = "Action & Adventure", runtime = 108)
predict3 <- predict(mfinal_imdbrating, Deadpool, interval = "prediction")

Pets <- data.frame(critics_score = 73, genre = "Animation", runtime = 87)
predict4 <- predict(mfinal_imdbrating, Pets, interval = "prediction")

Captain <- data.frame(critics_score = 82, genre = "Drama", runtime = 118)
predict5 <- predict(mfinal_imdbrating, Captain, interval = "prediction")

Zoolander2 <- data.frame(critics_score = 23, genre = "Comedy", runtime = 101)
predict6 <- predict(mfinal_imdbrating, Zoolander2, interval = "prediction")

Me_before_you <- data.frame(critics_score = 56, genre = "Drama", runtime = 106)
predict7 <- predict(mfinal_imdbrating, Me_before_you, interval = "prediction")

imdb_rating <- c(8, 8, 6.5, 6.5, 7.9, 4.7, 7.4)

predict <- data.frame ("Movie" = c("Zootopia", "Batman V Superman: Dawn of Justice", "Deadpool (2016)", 
                                 "The Secret Life of Pets (2016)", "Captain Fantastic", "Zoolander 2 (2016)", 
                                 "Me Before You"),
"Prediction" = c(sprintf("%2.1f", predict1[1]), sprintf("%2.1f", predict2[1]), 
                 sprintf("%2.1f", predict3[1]), sprintf("%2.1f", predict4[1]), 
                 sprintf("%2.1f", predict5[1]), sprintf("%2.1f", predict6[1]), 
                 sprintf("%2.1f", predict7[1])),
"Conf_Int." = c(sprintf("%2.1f-%2.1f", predict1[2], predict1[3]), 
                sprintf("%2.1f-%2.1f", predict2[2], predict2[3]), 
                sprintf("%2.1f-%2.1f", predict3[2], predict3[3]), 
                sprintf("%2.1f-%2.1f", predict4[2], predict4[3]), 
                sprintf("%2.1f-%2.1f", predict5[2], predict5[3]), 
                sprintf("%2.1f-%2.1f", predict6[2], predict6[3]), 
                sprintf("%2.1f-%2.1f", predict7[2], predict7[3])),
"Actual_rating" = imdb_rating)

predict

##                                Movie Prediction Conf_Int. Actual_rating
## 1                           Zootopia        7.3   5.9-8.6           8.0
## 2 Batman V Superman: Dawn of Justice        6.0   4.6-7.3           8.0
## 3                    Deadpool (2016)        7.1   5.8-8.4           6.5
## 4     The Secret Life of Pets (2016)        6.5   5.1-7.9           6.5
## 5                  Captain Fantastic        7.2   5.9-8.5           7.9
## 6                 Zoolander 2 (2016)        5.3   4.0-6.6           4.7
## 7                      Me Before You        6.5   5.2-7.8           7.4

The description of each column of the table above is as follows:

Movie: New movies from 2016, not in the sample distribution.
Prediction: The model prediction for each movie.
Conf_Int.: 95% confidence interval of the expected/predicted imdb_rating
Actual_rating: Actual IMDb rating of each movie.

We will interpret only the confidence interval of the Zootopia movie for the purpose of this course.

The model predicts, with 95% confidence, that the Zootopia movie (an animated movie) with a critics_score of 97% is expected to have an average imdb_rating between 5.9 to 8.6.

Note that the actual imdb_rating of the Zootopia movie was within the predicted 95% confidence interval. Back to sections

Part 6: Conclusion

From our data, movies in the Top 200 Box Office List generally have high imdb_rating with few having below a 6.9 but does not necessarily mean nomination of the best picture Oscar award nor a win. Also, the critics_score is typically high with less than 30% having below a 70.5% but with an exception as low as 31% (just like the Batman V Superman: Dawn of Justice movie with a critics_score of 27%). Critics_score seems to be independent of if an actor, actress or director has ever won an Oscar, but movies that featured at least two of either an actor, actress or director who has previously won an award appears to have slightly higher critics_score.

Critics_score is a significant predictor of imdb_rating from our model and 62.55% of the variability in imdb_rating is explained by the model. This is evident in our predictions above - the actual imdb_rating of all except the Batman V Superman: Dawn of Justice movies in our prediction list was within the predicted 95% confidence interval. The prediction was accurate for the movie “The Secret Life of Pets (2016)”.

The model, however significant has some shortcomings as listed below:

For the model to be feasible, we need to develop a model to predict the critics’ score on Rotten Tomatoes.
We observed some fluctuation in the variability of residuals across the genre groups. Also, for runtime, there appears to be a non-constant variability of residual points scattered around 0. This suggests that a statistically significant runtime variable might not be significant, though it is important to note that the p-value is actually very small \((8.65 \times10^{-8})\).
Though there was a 0.63% increase in the adjusted \(R^2\) of the model, from the full model, indicating a higher predictive power, the p-value approach of model selection relies on an arbitrary significant level hence different significant level means different model.
We are not definite if appropriate weighting measures were applied to the data to guide against bias due to a sampling of movies that have more than one part, to ensure the independence of each sampled movie.
Most movies are a combination of more than one genre, for example, the Batman V Superman: Dawn of Justice is both an Action & Adventure and a Science, Fiction & Fantasy movie; which was not considered in this model.

The critics_score, genre and runtime of movies, however statistically significant, does not explain 37.45% of the variability in the imdb_rating for our model. Hence it is necessary to pay attention to the overall quality of the movie and other variables like the script, picture, sound, not present in this data.

A low critics_score does not necessarily equate to a low imdb_rating and guarantee that a movie will not be listed in top 200 Box Office Movies.

For further analysis, we recommend the following:

Weighted or/and a robust regression to guide against bias due to heteroscedasticity.
For model selection, apply backward elimination method using an adjusted \(R^2\) approach, for more reliable predictions.

References

For this analysis, I used contents from the following websites and materials as a guide:

1. http://www.cookbook-r.com/Graphs/Facets_(ggplot2)/

2. http://rpubs.com/chiajieh

3. https://rstudio-pubs-static.s3.amazonaws.com/236787_a3b63b84e9b8423abf88411ee83106f3.html

4. https://stackoverflow.com

5. http://www.sthda.com

6. Cosma Shalizi. [“Using R Markdown for Class Reports”.] (http://www.stat.cmu.edu/~cshalizi/rmarkdown/). (2016).

7. David M Diez, Christopher D Barr and Mine Cetinkaya-Rundel. “OpenIntro Statistics, Third Edition”. (2016).

8. Eugenia Inzaugarat. “Linear and Bayesian Modeling in R: Predicting Movie Popularity” on Medium. (2018).

9. Jim Frost. “Heteroscedasticity in Regression Analysis”

10. Karl Broman. “Knitr with R Markdown”

11. Luke Johnston. “Resampling Techniques In R: Bootstrapping And Permutation Testing” on University of Toronto Coders.

12. “Statistics with R Specialization”, by Duke University on Coursera - (ongoing)

13. Yan Holtz. “Pimp my RMD: a few tips for R Markdown”. (2018).

14. Yi Yang. “An Example R Markdown”. (2017).

15. Yihui Xie, J. J. Allaire and Garrett Grolemund. “R Markdown: The Definitive Guide”. (2019).