LAB 4D: Interpreting Correlations
Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.
Some background...
- So far, we’ve learned about measuring the success of a model based on how close its predictions come to the actual observations.
- The correlation coefficient is a tool that gives us a fairly good idea of how these predictions will turn out without having to make predictions on future observations.
- For this lab, we will be using the movie data set to investigate the following question: Which variables are better predictors of a movie's critics_rating when the predictions are made using a line of best fit?
Correlation coefficients
- The correlation coefficient describes the strength and direction of the linear trend.
- It's only useful when the trend is linear and both variables are numeric.
- Are these variables linearly related? Why or why not?
Correlation review I
- A correlation coefficient close to 1 indicates a very strong linear relationship with a positive slope. A value close to -1 indicates a very strong linear relationship with a negative slope.
- Does this plot have a positive or negative correlation?
Correlation review II
- Recall that if there is no linear relation between two numerical variables, the correlation coefficient is close to 0.
- What do you guess the correlation coefficient will be for these two variables?
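The three cases from the review slides can be sketched with a few lines of base R. The vectors below are made up for illustration and are not taken from the movie data; the sketch assumes base R's cor(x, y) form rather than the lab's version of the function.

```r
x <- 1:20

# Strong positive trend: a rising line plus a small alternating wiggle.
y_pos <- 2 * x + rep(c(-3, 3), times = 10)

# Strong negative trend: a falling line plus the same wiggle.
y_neg <- -2 * x + rep(c(-3, 3), times = 10)

# A clear pattern that is NOT linear: a U-shape centered on the mean of x.
y_curve <- (x - mean(x))^2

r_pos   <- cor(x, y_pos)    # close to 1
r_neg   <- cor(x, y_neg)    # close to -1
r_curve <- cor(x, y_curve)  # close to 0, even though x and y are clearly related

print(round(c(positive = r_pos, negative = r_neg, curved = r_curve), 2))
```

The curved case is a reminder that a coefficient near 0 only rules out a linear trend, which is why it helps to look at the scatterplot first.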
The movie data
- Load the movie data using the data command.
- The data comes from a variety of sources like IMDB and Rotten Tomatoes.
  – critics_rating contains values between 0 and 100, 100 being the best.
  – audience_rating contains values between 0 and 10, 10 being the best.
  – n_critics and n_audience describe the number of reviews used for the ratings.
  – gross and budget describe the amount of money the film made and took to make.
Calculating Correlation Coefficients!
- We can use the cor() function to find the correlation coefficient of the variables from the previous plot, which happen to be audience_rating and critics_rating.
  – But note, the cor() function removes any observations which contain an NA value in either variable.
  – Calculate the correlation coefficient for these variables using the cor function. The inputs to the function work just like the inputs of the xyplot function.
Now answer the following.
- What was the value of the correlation coefficient you calculated?
- How does this actual value compare with the one you estimated previously?
- Does this indicate a strong, weak, or moderate association? Why?
- How would the scatterplot need to change in order for the correlation to be stronger?
- How would it need to change in order for the correlation to be weaker?
Correlation and Predictions
- Find the two variables that look to have the strongest correlation with critics_rating.
  – Compute the correlation coefficients for critics_rating and each of the two variables.
  – Use the correlation coefficient to determine which variable has a stronger linear relationship with critics_rating.
- Fit two lm models to predict critics_rating with each variable and compute the MSE for each.
  – Use the MSE to determine which variable is a better predictor of critics_rating.
- How are the correlation coefficient and the MSE related?
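One way to organize this comparison is sketched below. The toy data frame and the mse() helper are hypothetical, introduced only for illustration; the values are made up and do not come from the movie data. The MSE here is computed on the same observations used to fit each line.

```r
# Hypothetical data for eight films (made-up values, not from the movie data).
toy <- data.frame(
  critics_rating  = c(35, 48, 52, 60, 71, 78, 85, 90),
  audience_rating = c(4.1, 5.0, 5.6, 6.2, 6.9, 7.4, 8.1, 8.8),
  n_critics       = c(120, 40, 200, 95, 60, 180, 150, 75)
)

# Hypothetical helper: mean squared error is the average squared residual.
mse <- function(fit) mean(residuals(fit)^2)

# One line of best fit per candidate predictor.
fit_audience <- lm(critics_rating ~ audience_rating, data = toy)
fit_ncritics <- lm(critics_rating ~ n_critics, data = toy)

# Put the correlation coefficient and the MSE side by side.
results <- data.frame(
  predictor = c("audience_rating", "n_critics"),
  r   = c(cor(toy$audience_rating, toy$critics_rating),
          cor(toy$n_critics, toy$critics_rating)),
  mse = c(mse(fit_audience), mse(fit_ncritics))
)
print(results)
```

Laying the two measures out in one table makes it easier to look for a pattern between them when you repeat the steps on the real data.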
On your own
- Select two different numerical variables from the movie data. Plot the variables using the xyplot() function.
  – Would calculating a correlation coefficient for the two variables be appropriate? Justify your answer.
  – Predict what value you think the correlation coefficient will be. Compare this value to the actual value. Finally, interpret what the actual correlation coefficient means.
- Work with your classmates to determine which two variables have the strongest correlation coefficient.
- Why do you think these variables are so strongly related? Is the correlation coefficient an appropriate way to describe the relationship? Why or why not?
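If checking every pair of variables by hand becomes tedious, a correlation matrix can do the search in one step. This is a minimal sketch in base R, assuming a data frame whose numeric columns are the candidates; the films data frame is hypothetical, with made-up values rather than the real movie data.

```r
# Hypothetical data for six films (made-up values, not from the movie data).
films <- data.frame(
  budget          = c(10, 20, 30, 40, 50, 60),
  gross           = c(30, 20, 75, 60, 110, 90),
  audience_rating = c(6.1, 7.9, 5.2, 6.8, 7.5, 5.9),
  critics_rating  = c(55, 82, 40, 66, 79, 50)
)

# Correlation between every pair of numeric columns,
# using whichever rows are complete for each pair.
numeric_cols <- films[sapply(films, is.numeric)]
cm <- cor(numeric_cols, use = "pairwise.complete.obs")

# Keep each pair once by blanking the diagonal and the upper triangle.
cm[upper.tri(cm, diag = TRUE)] <- NA

# Locate the pair whose coefficient is largest in absolute value.
idx <- which(abs(cm) == max(abs(cm), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(cm)[idx[1, "row"]], colnames(cm)[idx[1, "col"]])

print(round(cm, 2))
print(strongest)
```

The search uses the absolute value so that a strong negative relationship counts as strong. A scatterplot of the winning pair is still needed to judge whether a linear description is appropriate.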