LAB 4F: This Model Is Big Enough for All of Us
Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.
Building better models
- So far in the labs, we've learned how to make predictions using the line of best fit, also known as a linear model or regression model.
- We've also learned how to measure our model's prediction accuracy by cross-validation.
- In this lab, we'll investigate the following question:
  Will including more variables in our model improve its predictions?
Divide & Conquer
- Start by loading the movie data and splitting it into two sets (see Lab 4C for help).
  – A set named training that includes 75% of the data.
  – A set named test that includes the remaining 25%.
  – Remember to use set.seed.
- Create a linear model, using the training data, that predicts gross using runtime.
  – Compute the MSE of the model by making predictions for the test data.
- Do you think that a movie's runtime is the only factor that goes into how much a movie will make? What else might affect a movie's gross?
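The steps above (split, fit, compute the test MSE) can be sketched as follows. This is only an illustration under stated assumptions: it uses the built-in mtcars data as a stand-in for the movie data, base R functions that may differ from the ones used in Lab 4C, and an arbitrary seed of 123.

```r
# ASSUMPTION: mtcars (ships with R) stands in for the movie data.
# mpg plays the role of gross, and wt plays the role of runtime.
set.seed(123)                        # arbitrary seed so the split is reproducible

n <- nrow(mtcars)
train_rows <- sample(seq_len(n), size = round(0.75 * n))  # 75% of the row numbers

training <- mtcars[train_rows, ]     # rows used to fit the model
test     <- mtcars[-train_rows, ]    # the remaining 25%, held out

# Fit a one-variable linear model using the training data only
model_one <- lm(mpg ~ wt, data = training)

# Predict the outcome for the test data, then average the squared errors
pred_one <- predict(model_one, newdata = test)
mse_one  <- mean((test$mpg - pred_one)^2)
mse_one
```

The model never sees the test rows while it is being fit, so the MSE measures how well it predicts new data.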
Including more info
- Data scientists often find that including more relevant information in their models leads to better predictions.
  – Fill in the blanks below to predict gross using runtime and reviews_num.

    lm(____ ~ ____ + ____, data = training)
- Does this new model make more or less accurate predictions than the model that used only runtime? Describe the process you used to arrive at your conclusion.
- Write down the code you would use to include a third variable of your choosing in your lm() call.
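One way to compare a one-variable model with a two-variable model is to compute both test MSEs from the same split. The sketch below is an assumption-laden illustration, not the lab's answer: it again uses the built-in mtcars data in place of the movie data, and test_mse is a hypothetical helper name introduced only for this example.

```r
# ASSUMPTION: mtcars stands in for the movie data; the seed is arbitrary.
set.seed(123)
train_rows <- sample(seq_len(nrow(mtcars)), size = round(0.75 * nrow(mtcars)))
training <- mtcars[train_rows, ]
test     <- mtcars[-train_rows, ]

# Hypothetical helper: mean squared error of a model's predictions on `data`
test_mse <- function(model, data, outcome) {
  mean((data[[outcome]] - predict(model, newdata = data))^2)
}

# Extra predictors are added to the formula with +
model_one <- lm(mpg ~ wt, data = training)
model_two <- lm(mpg ~ wt + hp, data = training)

mse_one <- test_mse(model_one, test, "mpg")
mse_two <- test_mse(model_two, test, "mpg")

# The model with the smaller test MSE made more accurate predictions
c(one_variable = mse_one, two_variables = mse_two)
```

Both models are fit on the same training rows and scored on the same test rows, so any difference in MSE comes from the variables included and not from the split.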
On your own
- Write down which other variables in the movie data you think would help you make better predictions.
  – Are there any variables that you think would not improve our predictions?
- Create a model for all of the variables you think are relevant.
  – Assess whether your model makes more accurate predictions for the test data than the model that included only runtime and reviews_num.
- With your neighbors, determine which combination of variables leads to the best predictions for the test data.
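When several combinations of variables are in play, it can help to score them all in one pass. The following is a minimal sketch, assuming the built-in mtcars data as a stand-in for the movie data. The candidate formulas and their names are illustrative choices, not recommendations.

```r
# ASSUMPTION: mtcars stands in for the movie data; the seed is arbitrary.
set.seed(123)
train_rows <- sample(seq_len(nrow(mtcars)), size = round(0.75 * nrow(mtcars)))
training <- mtcars[train_rows, ]
test     <- mtcars[-train_rows, ]

# Illustrative candidate models, each written as a formula
candidates <- list(
  wt         = mpg ~ wt,
  wt_hp      = mpg ~ wt + hp,
  wt_hp_qsec = mpg ~ wt + hp + qsec
)

# Fit each candidate on the training data and compute its test MSE
mse_by_model <- sapply(candidates, function(f) {
  fit <- lm(f, data = training)
  mean((test$mpg - predict(fit, newdata = test))^2)
})

# Smallest test MSE first
sort(mse_by_model)
```

A different seed produces a different split, and the ranking can change with it, so it is worth checking whether the best combination stays the same across a few seeds.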