LAB 4C: Cross-Validation

Lab 4C - Cross-Validation

Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.

What is cross-validation?

In the previous two labs, we learned how to:

– Create a linear model predicting height from the arm_span data (4A).

– See how well our model predicts height on the arm_span data by computing mean squared error (MSE)(4B).
In this lab, we will see how well our model predicts the heights of people we haven't yet measured.
To do this, we will use a method called cross-validation.
Cross-validation consists of three steps:

– Step 1: Split the data into training and test sets.

– Step 2: Create a model using the training set.

– Step 3: Use this model to make predictions on the test set.

Step 1: train-test split

Waiting for new observations can take a long time. The U.S. takes a census of its population once every 10 years, for example.
Instead of waiting for new observations, data scientists will take their current data and divide it into two distinct sets.
Split the arm_span data into training and test sets using the following two steps.
First, fill in the blanks below to randomly select which rows of arm_span will go into the training set.
```
set.seed(123)
train_rows <- sample(1:____, size = 85)
```
Second, use the slice function to create two dataframes: one called train consisting of the train_rows, and another called test consisting of the remaining rows of arm_span.
```
train <- slice(arm_span, ____)
test <- slice(____, - ____)
```
Explain these lines of code and describe the train and test datasets.

Aside: set.seed()

When we split data, we're randomly separating our observations into training and testing sets.

– It's important to notice that no single observation will be placed in both sets.
Because we're splitting the data sets randomly, our models can also vary slightly, person-to-person.

– This is why it's important to use set.seed.
By using set.seed, we're able to reproduce the random splitting so that each person's model outputs the same results.

Whenever you split data into training and testing, always use set.seed first.

Aside: train-test ratio

When splitting data into training and testing sets, we need to have enough observations in our data so that we can build a good model.

– This is why we kept 85 observations in our training data.
As datasets grow larger, we can use a larger proportion of the data to test with.

Step 2: train the model

Step 2 is to create a linear model relating height and armspan using the training data.
Fit a line of best fit model to our training data and assign it the name best_train.
Recall that the slope and intercept of our linear model are chosen to minimize MSE.
Since the MSE being minimized is from the training data, we can call it training MSE.

Step 3: test the model

Step 3 is to use the model we built on the training data to make predictions on the test data.
Note that we are NOT recomputing the slope and intercept to fit the test data best. We use the same slope and intercept that were computed in step 2.
Because we're using the line of best fit, we can use the predict() function we introduced in the last lab to make predictions.

– Fill in the blanks below to add predicted heights to our test data:
```
test <- mutate(test, ____ = predict(best_train, newdata = ____))
```
Hint: the predict function without the argument newdata will output predictions on the training data. To output predictions on the test data, supply the test data to the newdata argument.
Calculate the test MSE in the same way as you did in the previous lab (test MSE is simply MSE of the predictions on the test data).

Recap

Another way to describe the three steps is
- Step 1: Split the data into training and test sets.
- Step 2: Choose a slope and intercept that minimize training MSE.
- Step 3: Using the same slope and intercept from step 2, make predictions on the test set, and use these predictions to compute test MSE.
This begs the question, why do we care about test MSE?

Why cross-validate?

Why go to all this trouble to compute test MSE when we could just compute MSE on the original dataset?
When we compute MSE on the original dataset, we are measuring the ability of a model to make predictions on the current batch of data.
Relying on a single dataset can lead to models that are so specific to the current batch of data that they're unable to make good predictions for future observations.

– This phenomenon is known as overfitting.
By splitting the data into a training and test set, we are hiding a proportion of the data from the model. This emulates future observations, which are unseen.
Test MSE estimates the ability of a model to make predictions of future observations.

Example of overfitting

The following example motivates cross-validation by illustrating the dangers of overfitting.
We randomly select 7 points from the arm_span dataset and fit two models: a linear model, and a polynomial model.

– You will learn how to fit a polynomial model in the next lab.
Below is a plot of these 7 training points, and two curves representing the value of height each model would predict given a value of armspan.

Which model does a better job of predicting the 7 training points?
Which model do you think will do a better job of predicting the rest of the data?

Example of overfitting, continued

Below is a plot of the rest of the arm_span dataset, along with the predictions each model would make.

Which model does a better job of generalizing to the rest of the arm_span dataset?