LAB 4F: This Model Is Big Enough for All of Us
Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.
Building better models
- So far in the labs, we've learned how to make predictions using the line of best fit, also known as a linear model or regression model.
- We've also learned how to measure our model's prediction accuracy by cross-validation.
- In this lab, we'll investigate the following question: Will including more variables in our model improve its predictions?
Divide & Conquer
- Start by loading the movie data and splitting it into two sets (see Lab 4C for help).
  – A set named training that includes 75% of the data.
  – A set named test that includes the remaining 25%.
- Remember to use set.seed.
- Create a linear model, using the training data, that predicts gross using runtime.
  – Compute the MSE of the model by making predictions for the test data.
- Do you think that a movie's runtime is the only factor that goes into how much a movie will make? What else might affect a movie's gross?
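The split, fit, and MSE steps above can be sketched in base R. This is only an illustration under stated assumptions: the data frame `houses` and its columns `sqft` and `price` are simulated stand-ins (not the movie data), and the functions you used in Lab 4C to split the data may differ from the base R indexing shown here.

```r
# Simulated stand-in data; these names are hypothetical, not from the movie data.
set.seed(123)  # makes the simulated values and the random split reproducible
n <- 200
houses <- data.frame(sqft = runif(n, 500, 3000))
houses$price <- 50000 + 120 * houses$sqft + rnorm(n, sd = 20000)

# Randomly pick 75% of the row numbers for training; the rest form the test set.
train_rows <- sample(seq_len(n), size = round(0.75 * n))
training <- houses[train_rows, ]
test     <- houses[-train_rows, ]

# Fit the line of best fit using the training data only.
fit_one <- lm(price ~ sqft, data = training)

# Predict for the test data, then average the squared prediction errors (MSE).
pred_one <- predict(fit_one, newdata = test)
mse_one  <- mean((test$price - pred_one)^2)
mse_one
```

Because the model never sees the test rows while it is being fit, the MSE computed this way reflects how well the model predicts new observations.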
Including more info
- Data scientists often find that including more relevant information in their models leads to better predictions.
  – Fill in the blanks below to predict gross using runtime and reviews_num.
    lm(____ ~ ____ + ____, data = training)
- Does this new model make more or less accurate predictions? Describe the process you used to arrive at your conclusion.
- Write down the code you would use to include a third variable, of your choosing, in your lm().
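One way to judge whether extra variables help is to fit each model on the same training data and compare their MSEs on the same test data. The sketch below assumes a simulated data frame `homes` with hypothetical columns `sqft`, `age`, `noise`, and `price`; the helper `test_mse` is also made up for this illustration and is not part of the lab's tools.

```r
# Simulated stand-in data; all names here are hypothetical.
set.seed(42)
n <- 300
homes <- data.frame(
  sqft  = runif(n, 500, 3000),
  age   = runif(n, 0, 80),
  noise = rnorm(n)              # unrelated to price by construction
)
homes$price <- 50000 + 120 * homes$sqft - 1500 * homes$age + rnorm(n, sd = 20000)

train_rows <- sample(seq_len(n), size = round(0.75 * n))
training <- homes[train_rows, ]
test     <- homes[-train_rows, ]

# Hypothetical helper: MSE of a model's predictions for a data set.
test_mse <- function(model, data, outcome) {
  mean((data[[outcome]] - predict(model, newdata = data))^2)
}

# Each extra predictor is added to the formula with a plus sign.
fit_one   <- lm(price ~ sqft, data = training)
fit_two   <- lm(price ~ sqft + age, data = training)
fit_three <- lm(price ~ sqft + age + noise, data = training)

mse_one   <- test_mse(fit_one, test, "price")
mse_two   <- test_mse(fit_two, test, "price")
mse_three <- test_mse(fit_three, test, "price")
c(one = mse_one, two = mse_two, three = mse_three)
```

In this simulation `age` really does affect `price`, so the two-variable model would be expected to have the smaller test MSE, while `noise` is unrelated to `price` and may nudge the test MSE slightly up or down.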
On your own
- Write down which other variables in the movie data you think would help you make better predictions.
  – Are there any variables that you think would not improve our predictions?
- Create a model for all of the variables you think are relevant.
  – Assess whether your model makes more accurate predictions for the test data than the model that included only runtime and reviews_num.
- With your neighbors, determine which combination of variables leads to the best predictions for the test data.
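When there are several combinations of variables to compare, it can help to store the candidate formulas in a list and compute the test MSE for each one in a single pass. The following is a minimal sketch, again using simulated data with hypothetical names (`homes`, `sqft`, `age`, `noise`, `price`, `candidates`) rather than the movie data.

```r
# Simulated stand-in data; all names here are hypothetical.
set.seed(7)
n <- 300
homes <- data.frame(
  sqft  = runif(n, 500, 3000),
  age   = runif(n, 0, 80),
  noise = rnorm(n)
)
homes$price <- 50000 + 120 * homes$sqft - 1500 * homes$age + rnorm(n, sd = 20000)

train_rows <- sample(seq_len(n), size = round(0.75 * n))
training <- homes[train_rows, ]
test     <- homes[-train_rows, ]

# Candidate models, labeled by the variables they include.
candidates <- list(
  "sqft"               = price ~ sqft,
  "sqft + age"         = price ~ sqft + age,
  "sqft + age + noise" = price ~ sqft + age + noise
)

# Fit each candidate on the training data and compute its MSE on the test data.
mse_by_model <- sapply(candidates, function(f) {
  fit <- lm(f, data = training)
  mean((test$price - predict(fit, newdata = test))^2)
})

sort(mse_by_model)               # smallest test MSE first
names(which.min(mse_by_model))   # label of the best-predicting candidate
```

Every candidate is fit on the same training set and scored on the same test set, so the MSEs can be compared directly.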