LAB 4B: What’s the Score?
Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.
Previously
- In the previous lab, we learned we could make predictions about one variable by utilizing the information of another.
- In this lab, we will learn how to measure the accuracy of our predictions.
– This in turn will let us evaluate how well a model performs at making predictions.
– We'll also use this information later to compare different models to find which model makes the best predictions.
Predictions using a line
- Load the arm_span data again.
– Create an xyplot with height on the y-axis and armspan on the x-axis.
– Type add_line() to run the add_line function; you'll be prompted to click twice in the plot window to create a line that you think fits the data well.
- Fill in the blanks below to create a function that will make predictions of people's heights based on their armspan:
    predict_height <- function(armspan) { ____ * armspan + ____ }
Make your predictions
- Fill in the blanks to include your predictions in the arm_span data.
    ____ <- mutate(____, predicted_height = ____(____))
- Now that we've made our predictions, we'll need to figure out a way to decide how accurate our predictions are.
– We'll want to compare our predicted heights to the actual heights.
– At the end, we'll want to come up with a single number summary that describes our model's accuracy.
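To see the shape of these two steps without giving away your own line, here is a minimal sketch on made-up numbers. The toy data, the name predict_toy, and the slope and intercept are all assumptions chosen for illustration; they are not the lab's arm_span data or the answer to the blanks above. It uses base R so it runs without extra packages, and the last assignment does the same job as mutate().

```r
# Made-up measurements in inches; NOT the lab's arm_span data.
toy <- data.frame(
  armspan = c(60, 62, 65, 68, 70),
  height  = c(61, 62, 66, 67, 71)
)

# Hypothetical slope (0.9) and intercept (7), chosen only for illustration.
predict_toy <- function(armspan) {
  0.9 * armspan + 7
}

# Add the predictions as a new column (base R stand-in for mutate()).
toy$predicted_height <- predict_toy(toy$armspan)

print(toy)
```

Your own function will have whatever slope and intercept add_line() reported for the line you drew.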
Sums of differences
- A residual is the difference between the actual and predicted value of a quantity of interest.
- Fill in the blanks below to add a column of residuals to arm_span.
    ____ <- mutate(____, residual = ____ - ____)
- What do the residuals measure?
- One method we might consider to measure our model's accuracy is to sum the residuals.
- Fill in the blanks below to calculate our accuracy summary.
    summarize(____, sum(____))
- Hint: Like mutate, the first argument of summarize is a dataframe, and the second argument is the action to perform on a column of the dataframe. Whereas the output of mutate is a column, the output of summarize is (usually) a single number summary.
- Describe and interpret, in words, what the output of your accuracy summary means.
- Write down why adding positive and negative errors together is problematic for assessing prediction accuracy.
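A small sketch of how the sum of residuals can mislead, using made-up heights (an assumption for illustration, not the lab's data). Base R is used here so the example runs on its own; in the lab you would use mutate() and summarize() instead.

```r
# Made-up actual and predicted heights in inches; NOT the lab's data.
toy <- data.frame(
  height           = c(60, 64, 68, 72),
  predicted_height = c(62, 62, 70, 70)
)

# Residual = actual - predicted
toy$residual <- toy$height - toy$predicted_height

# Every prediction misses by 2 inches ...
print(toy$residual)

# ... yet the overestimates and underestimates cancel in the sum.
print(sum(toy$residual))
```

Compare the individual residuals with their sum before writing your journal answer.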
Mean squared error
- When adding residuals, the positive errors in our predictions (underestimates) are cancelled out by the negative errors (overestimates), which leads to the impression that our model is making better predictions than it actually is.
- To solve this problem, we square the errors, because squared values are never negative.
- The mean squared error (MSE) is calculated by squaring all of the residuals, and then taking the mean of the squared residuals.
- Fill in the blanks below to calculate the MSE of your line.
    summarize(____, mean((____)^2))
- Compare your MSE with a neighbor. Whose line was more accurate and why?
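The order of the two steps matters: square first, then take the mean. The sketch below uses made-up residuals (an assumption for illustration) to show how the two orders can give very different answers.

```r
# Made-up residuals in inches; NOT computed from the lab's data.
residual <- c(-2, 2, -2, 2)

# Mean squared error: square each residual, then average.
mse <- mean(residual^2)

# A common slip: average first, then square. The cancellation problem returns.
squared_mean <- mean(residual)^2

print(c(mse = mse, squared_mean = squared_mean))
```

Check the placement of the parentheses in your own summarize() call against the first calculation.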
Regression lines
- If you were to go around your class, each student would have created a different line that they feel fits the data best.
– This is a problem because everyone's line will make slightly different predictions.
- To avoid this variation in predictions, data scientists use regression lines.
- We also refer to regression lines as linear models.
- This line connects the mean height of people with similar armspans.
- Fill in the blanks below to create a regression line using lm, which stands for linear model.
    best_fit <- lm(____ ~ ____, data = arm_span)
Predicting with regression lines
- Making predictions with models that R is familiar with is simpler than with lines, or models, we come up with ourselves.
– Fill in the blanks to make predictions using best_fit:
    ____ <- mutate(____, predicted_height = predict(____))
- Hint: the predict function takes a linear model as input, and outputs the predictions of that model.
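As a rough illustration of how lm() and predict() fit together, here is a sketch on made-up numbers. The toy data and the name toy_fit are assumptions for illustration, not the lab's arm_span data or best_fit. The formula puts the variable being predicted on the left of the ~ and the variable used to predict it on the right.

```r
# Made-up measurements in inches; NOT the lab's arm_span data.
toy <- data.frame(
  armspan = c(60, 62, 65, 68, 70),
  height  = c(61, 62, 66, 67, 71)
)

# Fit a regression line: response ~ predictor
toy_fit <- lm(height ~ armspan, data = toy)

# The intercept and slope that lm() chose
print(coef(toy_fit))

# predict() with only the model returns one prediction per row of the data
toy$predicted_height <- predict(toy_fit)

print(toy)
```

Called with just the model, predict() returns predictions for the same rows the model was fitted on, which is why they can be stored directly as a new column.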
The magic of lm()
- The lm() function creates the line of best fit equation by finding the line that minimizes the mean squared error. In that sense, it is the best-fitting line possible.
- Calculate the MSE for the values predicted using the regression line.
- Compare the MSE of the linear model you fitted with add_line() to the MSE of the linear model obtained with lm. Which linear model performed better?
- Ask your neighbors if any of their lines beat the lm line in terms of the MSE. Were any of them successful?
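The comparison can be sketched on made-up numbers as follows. The toy data, the helper mse(), and the hand-picked slope and intercept are all assumptions for illustration; your own values will come from arm_span and the line you drew.

```r
# Made-up measurements in inches; NOT the lab's arm_span data.
toy <- data.frame(
  armspan = c(60, 62, 65, 68, 70),
  height  = c(61, 62, 66, 67, 71)
)

# Hypothetical helper: mean of the squared residuals.
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

# Predictions from a hand-picked line (hypothetical slope and intercept).
hand_predicted <- 0.9 * toy$armspan + 7

# Predictions from the regression line.
toy_fit <- lm(height ~ armspan, data = toy)
lm_predicted <- predict(toy_fit)

mse_hand <- mse(toy$height, hand_predicted)
mse_lm   <- mse(toy$height, lm_predicted)

print(c(hand = mse_hand, lm = mse_lm))
```

Whatever line is chosen by hand, its MSE on the same data cannot be smaller than the MSE of the lm() line; at best, it can tie.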