LAB 4F: This Model Is Big Enough for All of Us
Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.
Building better models

So far in the labs, we've learned how to make predictions using the line of best fit, also known as a linear model or regression model.

We've also learned how to measure our model's prediction accuracy using cross-validation.

In this lab, we'll investigate the following question:
Will including more variables in our model improve its predictions?
Divide & Conquer

Start by loading the movie data and split it into two sets (see Lab 4C for help).
– A set named training that includes 75% of the data.
– A set named test that includes the remaining 25%.
– Remember to use set.seed.
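If you want a reminder of what a 75/25 split can look like, here is a minimal sketch in base R. It uses the built-in mtcars data as a stand-in, since it ships with R; the seed value 123 and the name train_rows are assumptions chosen for illustration, and Lab 4C may use different functions to do the same job.

```r
# Stand-in data: mtcars ships with R. Swap in the movie data for the lab.
set.seed(123)                       # assumed seed; makes the random split repeatable

n <- nrow(mtcars)                   # total number of rows
train_rows <- sample(seq_len(n), size = floor(0.75 * n))  # pick 75% of the row numbers

training <- mtcars[train_rows, ]    # rows that were picked
test     <- mtcars[-train_rows, ]   # every row that was not picked

nrow(training)                      # how many rows ended up in each set
nrow(test)
```

Because the row numbers are sampled without replacement, no row can land in both sets.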

Create a linear model, using the training data, that predicts gross using runtime.
– Compute the MSE of the model by making predictions for the test data.
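The fit-then-predict-then-score steps can be sketched as follows. This is only an illustration on the built-in mtcars data, predicting mpg from wt; the seed and the names fit1, preds, and mse1 are assumptions, and your own code should use the movie data and the variables named above.

```r
# Stand-in data and split (mtcars instead of the movie data).
set.seed(123)
train_rows <- sample(seq_len(nrow(mtcars)), size = floor(0.75 * nrow(mtcars)))
training <- mtcars[train_rows, ]
test     <- mtcars[-train_rows, ]

# Fit the model on the training data only.
fit1 <- lm(mpg ~ wt, data = training)

# Predict the outcome for rows the model has never seen.
preds <- predict(fit1, newdata = test)

# MSE: the average of the squared differences between actual and predicted values.
mse1 <- mean((test$mpg - preds)^2)
mse1
```

The key design choice is that the model is fit on training but scored on test, so the MSE reflects how well the model predicts new data.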
Do you think that a movie's runtime is the only factor that goes into how much a movie will make? What else might affect a movie's gross?
Including more info

Data scientists often find that including more relevant information in their models leads to better predictions.
– Fill in the blanks below to predict gross using runtime and reviews_num.
lm(____ ~ ____ + ____, data = training)

Does this new model make more or less accurate predictions? Describe the process you used to arrive at your conclusion.
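One possible process for comparing two models is to compute each model's MSE on the same test data and see which is smaller. The sketch below does this on the built-in mtcars data rather than the movie data; test_mse is a hypothetical helper written for this example, not a function from any package.

```r
# Stand-in data and split (mtcars instead of the movie data).
set.seed(123)
train_rows <- sample(seq_len(nrow(mtcars)), size = floor(0.75 * nrow(mtcars)))
training <- mtcars[train_rows, ]
test     <- mtcars[-train_rows, ]

# Hypothetical helper: fit a formula on `train`, return the MSE on `test`.
test_mse <- function(formula, train, test) {
  fit <- lm(formula, data = train)
  outcome <- all.vars(formula)[1]            # name of the variable left of the ~
  mean((test[[outcome]] - predict(fit, newdata = test))^2)
}

mse_one <- test_mse(mpg ~ wt, training, test)        # one predictor
mse_two <- test_mse(mpg ~ wt + hp, training, test)   # two predictors

# The model with the smaller test MSE made more accurate predictions.
c(one_predictor = mse_one, two_predictors = mse_two)
```

Both models are scored on the same test rows, which is what makes the two MSE values comparable.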

Write down the code you would use to include a 3rd variable, of your choosing, in your lm().
On your own

Write down which other variables in the movie data you think would help you make better predictions.
– Are there any variables that you think would not improve our predictions?

Create a model for all of the variables you think are relevant.
– Assess whether your model makes more accurate predictions for the test data than the model that included only runtime and reviews_num.

With your neighbors, determine which combination of variables leads to the best predictions for the test data.
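When there are several combinations to try, it can help to score them all in one go. The sketch below is one way to do that, again on the built-in mtcars data as a stand-in; the candidate formulas and the names candidates and mse_by_model are assumptions made up for this example.

```r
# Stand-in data and split (mtcars instead of the movie data).
set.seed(123)
train_rows <- sample(seq_len(nrow(mtcars)), size = floor(0.75 * nrow(mtcars)))
training <- mtcars[train_rows, ]
test     <- mtcars[-train_rows, ]

# A named list of candidate models, each one a different combination of variables.
candidates <- list(
  wt         = mpg ~ wt,
  wt_hp      = mpg ~ wt + hp,
  wt_hp_qsec = mpg ~ wt + hp + qsec
)

# Fit each candidate on the training data and compute its MSE on the test data.
mse_by_model <- sapply(candidates, function(f) {
  fit <- lm(f, data = training)
  mean((test$mpg - predict(fit, newdata = test))^2)
})

sort(mse_by_model)               # smallest test MSE first
names(which.min(mse_by_model))   # the combination with the best test predictions
```

To compare your results with your neighbors', everyone needs the same seed and the same split; otherwise the MSE values come from different test sets.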