LAB 4F: This Model Is Big Enough for All of Us

Lab 4F - This model is big enough for all of us!

Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.

Building better models

So far, in the labs, we've learned how to make predictions using the line of best fit, also knowns as linear models or regression models.
We've also learned how to measure our model's prediction accuracy by cross-validation.
In this lab, we'll investigate the following question:

Will including more variables in our model improve its predictions?

Divide & Conquer

Start by loading the movie data and split it into two sets (See Lab 4C for help).

– A set named training that includes 75% of the data.

– A set named test that includes the remaining 25%.
- Remember to use set.seed.
Create a linear model, using the training data, that predicts gross using runtime.

– Compute the MSE of the model by making predictions for the test data.
Do you think that a movie's runtime is the only factor that goes into how much a movie will make? What else might affect a movie's gross?

Including more info

Data scientists often find that including more relevant information in their models leads to better predictions.

– Fill in the blanks below to predict gross using runtime and reviews_num.
```
lm(____ ~ ____ + ____, data = training)
```
Does this new model make more or less accurate predictions? Describe the process you used to arrive at your conclusion.
Write down the code you would use to include a 3rd variable, of your choosing, in your lm().

Own your own

Write down which other variables in the movie data you think would help you make better predictions.

– Are there any variables that you think would not improve our predictions?
Create a model for all of the variables you think are relevant.

– Assess whether your model makes more accurate predictions for the test
data than the model that included only runtime and reviews_num
With your neighbors, determine which combination of variables leads to the best predictions for the test data.