# LAB 4C: Cross-Validation

Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.

### What is cross-validation?

• In the previous two labs, we learned how to:

– Create a linear model predicting `height` from `armspan` using the `arm_span` data (Lab 4A).

– Measure how well our model predicts `height` on the `arm_span` data by computing the mean squared error (MSE) (Lab 4B).

• In this lab, we will see how well our model predicts the heights of people we haven't yet measured.

• To do this, we will use a method called cross-validation.

• Cross-validation consists of three steps:

– Step 1: Split the data into training and test sets.

– Step 2: Create a model using the training set.

– Step 3: Use this model to make predictions on the test set.

### Step 1: train-test split

• Waiting for new observations can take a long time. The U.S. takes a census of its population once every 10 years, for example.

• Instead of waiting for new observations, data scientists take their current data and divide it into two distinct sets: a training set and a test set.

• Split the `arm_span` data into `training` and `test` sets using the following two steps.

• First, fill in the blanks below to randomly select which rows of `arm_span` will go into the `training` set.

```
set.seed(123)
train_rows <- sample(1:____, size = 85)
```
• Second, use the `slice` function to create two dataframes: one called `train` consisting of the `train_rows`, and another called `test` consisting of the remaining rows of `arm_span`.

```
train <- slice(arm_span, ____)
test <- slice(____, - ____)
```
• Explain these lines of code and describe the `train` and `test` datasets.
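• For reference, one possible completion of the two code blocks above is sketched below (check your own answers first). It assumes the `arm_span` data and the `dplyr`-style functions from the earlier labs are already loaded, and that the first blank is the total number of rows in `arm_span`:

```
# A possible completion; the blanks above are the exercise.
set.seed(123)

# Randomly choose 85 row numbers out of all the rows of arm_span.
train_rows <- sample(1:nrow(arm_span), size = 85)

# Keep the sampled rows for training and all remaining rows for testing.
train <- slice(arm_span, train_rows)
test  <- slice(arm_span, -train_rows)
```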

### Aside: set.seed()

• When we split data, we're randomly separating our observations into training and testing sets.

– It's important to notice that no single observation will be placed in both sets.

• Because we split the data randomly, each person's split, and therefore each person's model, can vary slightly.

– This is why it's important to use `set.seed`.

• By using `set.seed`, we're able to reproduce the random splitting so that each person's model outputs the same results.

Whenever you split data into training and testing sets, always use `set.seed` first.
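• A small illustration of why this matters is sketched below; the specific numbers drawn are unimportant, what matters is that the two seeded draws agree:

```
# Without set.seed, two calls to sample() will usually give different results.
sample(1:100, size = 5)
sample(1:100, size = 5)

# With the same seed set before each call, the results are reproducible.
set.seed(123)
first_draw <- sample(1:100, size = 5)

set.seed(123)
second_draw <- sample(1:100, size = 5)

identical(first_draw, second_draw)   # TRUE
```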

### Aside: train-test ratio

• When splitting data into training and testing sets, we need to keep enough observations in the training set to build a good model.

– This is why we kept 85 observations in our `training` data.

• As datasets grow larger, we can use a larger proportion of the data to `test` with.
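• If you ever want to size a split by proportion rather than by a fixed count, one way to do it is sketched below (the 80% training share here is only an illustration, not a value used in this lab):

```
# Sketch: compute the training-set size as a proportion of the data.
n_total <- nrow(arm_span)
n_train <- round(0.8 * n_total)   # 80% is an arbitrary example proportion

set.seed(123)
train_rows <- sample(1:n_total, size = n_train)
```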

### Step 2: train the model

• Step 2 is to create a linear model relating `height` and `armspan` using the `training` data.

• Fit a line of best fit model to our `training` data and assign it the name `best_train`.

• Recall that the slope and intercept of our linear model are chosen to minimize MSE.

• Since the MSE being minimized is from the training data, we can call it training MSE.
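• One way to fit this model is sketched below, using R's `lm()` function on the `train` data frame from step 1 (if your earlier labs used a different line-fitting function, use that instead); the column name `predicted_height` is just an illustrative choice:

```
# Fit a line of best fit for height as a function of armspan,
# using only the training data.
best_train <- lm(height ~ armspan, data = train)

# Training MSE: the mean squared difference between the observed heights
# and the heights the model predicts for the training data.
# (predict() without newdata returns predictions for the training data.)
train <- mutate(train, predicted_height = predict(best_train))
training_mse <- mean((train$height - train$predicted_height)^2)
training_mse
```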

### Step 3: test the model

• Step 3 is to use the model we built on the `training` data to make predictions on the `test` data.

• Note that we are NOT recomputing the slope and intercept to fit the test data best. We use the same slope and intercept that were computed in step 2.

• Because we're using the line of best fit, we can use the `predict()` function we introduced in the last lab to make predictions.

Fill in the blanks below to add predicted heights to our `test` data:

```
test <- mutate(test, ____ = predict(best_train, newdata = ____))
```
• Hint: the `predict` function without the argument `newdata` will output predictions on the `training` data. To output predictions on the `test` data, supply the `test` data to the `newdata` argument.

• Calculate the test MSE in the same way as you did in the previous lab (the test MSE is simply the MSE of the predictions on the test data).
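• A possible completion of the code above, together with the test MSE calculation, is sketched below; it assumes the `best_train` model from step 2 and again uses `predicted_height` as an illustrative column name:

```
# Use the model fit on the training data to predict heights in the test data.
test <- mutate(test, predicted_height = predict(best_train, newdata = test))

# Test MSE: the mean squared difference between the observed and predicted
# heights in the test set.
test_mse <- mean((test$height - test$predicted_height)^2)
test_mse
```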

### Recap

• Another way to describe the three steps is:

• Step 1: Split the data into `training` and `test` sets.

• Step 2: Choose a slope and intercept that minimize training MSE.

• Step 3: Using the same slope and intercept from step 2, make predictions on the `test` set, and use these predictions to compute test MSE.

• This raises the question: why do we care about test MSE?

### Why cross-validate?

• Why go to all this trouble to compute test MSE when we could just compute MSE on the original dataset?

• When we compute MSE on the original dataset, we are measuring the ability of a model to make predictions on the current batch of data.

• Relying on a single dataset can lead to models that are so specific to the current batch of data that they're unable to make good predictions for future observations.

– This phenomenon is known as overfitting.

• By splitting the data into a training set and a test set, we hide a portion of the data from the model. These held-out observations stand in for future observations, which are unseen.

• Test MSE estimates how well a model will predict future observations.

### Example of overfitting

• The following example motivates cross-validation by illustrating the dangers of overfitting.

• We randomly select 7 points from the `arm_span` dataset and fit two models: a linear model, and a polynomial model.

– You will learn how to fit a polynomial model in the next lab.

• Below is a plot of these 7 `training` points, along with two curves showing the value of `height` each model would predict for a given value of `armspan`.

• Which model does a better job of predicting the 7 `training` points?

• Which model do you think will do a better job of predicting the rest of the data?

### Example of overfitting, continued

• Below is a plot of the rest of the `arm_span` dataset, along with the predictions each model would make.

• Which model does a better job of generalizing to the rest of the `arm_span` dataset?
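• If you want to experiment with this idea yourself, a rough sketch is below. The seed and the polynomial degree are arbitrary choices for illustration (the plots in the slides were not necessarily produced this way), and fitting the polynomial assumes the 7 sampled `armspan` values are all distinct:

```
# Sketch: fit a line and a high-degree polynomial to 7 randomly chosen points,
# then compare how well each one predicts the rest of the data.
set.seed(99)                                    # arbitrary seed, for illustration
tiny_rows  <- sample(1:nrow(arm_span), size = 7)
tiny_train <- slice(arm_span, tiny_rows)
rest       <- slice(arm_span, -tiny_rows)

line_model <- lm(height ~ armspan, data = tiny_train)
poly_model <- lm(height ~ poly(armspan, 6), data = tiny_train)  # degree chosen to over-fit

# The polynomial fits the 7 training points almost perfectly...
mean((tiny_train$height - predict(line_model))^2)
mean((tiny_train$height - predict(poly_model))^2)

# ...but it typically generalizes worse to the rest of the data.
mean((rest$height - predict(line_model, newdata = rest))^2)
mean((rest$height - predict(poly_model, newdata = rest))^2)
```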