LAB 4B: What’s the Score?

Lab 4B - What's the score?

Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.

Previously

In the previous lab, we learned we could make predictions about one variable by utilizing the information of another.
In this lab, we will learn how to measure the accuracy of our predictions.

– This in turn will let us evaluate how well a model performs at making predictions.

– We'll also use this information later to compare different models to find which model makes the best predictions.

Predictions using a line

Load the arm_span data again.

– Create an xyplot with height on the y-axis and armspan on the x-axis.

– Type add_line() to run the add_line function; you'll be prompted to click twice in the plot window to create a line that you think fits the data well.
Fill in the blanks below to create a function that will make predictions of people's heights based on their armspan:
```
predict_height <- function(armspan) {
____ * armspan + ____
}
```

Make your predictions

Fill in the blanks to include your predictions in the arm_span data.
```
____ <- mutate(____, predicted_height = ____(____))
```
Now that we've made our predictions, we'll need to figure out a way to decide how accurate our predictions are.

– We'll want to compare our predicted heights to the actual heights.

– At the end, we'll want to come up with a single number summary that describes our model's accuracy.

Sums of differences

A residual is the difference between the actual and predicted value of a quantity of interest.
Fill in the blanks below to add a column of residucals to arm_span.
```
____ <- mutate(____, residual = ____ - ____)
```
What do the residuals measure?
One method we might consider to measure our model's accuracy is to sum the residuals.
Fill in the blanks below to calculate our accuracy summary.
```
summarize(____, sum(____))
```
Hint: Like mutate, the first argument of summarize is a dataframe, and the second argument is the action to perform on a column of the dataframe. Whereas the output of mutate is a column, the output of summarize is (usually) a single number summary.
Describe and interpret, in words, what the output of your accuracy summary means.
Write down why adding positive and negative errors together is problematic for assessing prediction accuracy.

Mean squared error

When adding residuals, the positive errors in our predictions (underestimates) are cancelled out by negative errors (overestimates) which lead to the impression that our model is making better predictions than it actually is.
To solve this problem we calculate the squared values of the errors because squared values are always positive.
The mean squared error (MSE) is calculated by squaring all of the residuals, and then taking the mean of the squared residuals.
Fill in the blanks below to calculate the MSE of your line.
```
summarize(____, mean((____))^2)
```
Compare your MSE with a neighbor. Whose line was more accurate and why?

Regression lines

If you were to go around your class, each student would have created a different line that they feel fit the data best.
- Which is a problem because everyone's line will make slightly different predictions.
To avoid this variation in predictions, data scientists will use regression lines.
- We also refer to regression lines as lienar models.
- This line connects the mean height of people with similar armspans.
- Fill in the blanks below to create a regression line using lm, which stands for linear model.
  
  best_fit <- lm(_ ~ _, data = arm_span)

Predicting wiht regression lines

Making predictions with models R is familiar with is simpler than with lines, or models, we come up with ourselves.

– Fill in the blanks to make predictions using best_fit:
```
____ <- mutate(____, predicted_height = predict(____))
```
Hint: the predict function takes a linear model as input, and outputs the predictions of that model.

The magic of lm()

The lm() function creates the line of best fit equation by finding the line that minimizes the mean squared error. Meaning, it's the best fitting line possible.
Calculate the MSE for the values predicted using the regression line.
Compare the MSE of the linear model you fitted with add_line() to the MSE of the linear model obtained with lm. Which linear model performed better?
Ask your neighbors if any of their lines beat the lm line in terms of the MSE. Were any of them successful?