Lab 1E: What’s the Relationship?

Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.

Finding patterns in data.

  • To discover (really) interesting observations or relationships in data, we need to find them!

    – Which is difficult if we only look at the raw data.

  • The best tool for finding patterns is often ... your own eyes.

    – Plots are an excellent way to help your eye search for patterns.

  • In this lab, we'll learn how to include more variables in our plots to make them more informative.

  • Import the data from your class' Food Habits campaign and name it food.

Where are the variables?

  • How many variables were used to create this plot? Which variables were used and how were they used?

Multiple variable plots

  • The previous graph is an example of a multiple variable plot, which means that more than a single variable was used. In this case:

  • Variable 1: height

  • Variable 2: gender

  • Multiple variable plots are tools for finding relationships between variables.

  • Let's take our food data and make some new multiple variable plots you haven't created before!

Scatterplots

  • Scatterplots are useful for viewing how one numerical variable relates to another numerical variable.

Creating scatterplots

  • Fill in the blanks to create a scatterplot with sodium on the y-axis and sugar on the x-axis.
    xyplot(____ ~ ____, data = food)
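
The formula syntax can be tried out first on a small made-up data set. This is only a sketch: the data frame toy and its columns hours and score are hypothetical and are not part of the Food Habits data. It assumes the lattice package, which provides xyplot, is installed.

```r
library(lattice)  # provides xyplot()

# Made-up data for illustration only (not the Food Habits data)
toy <- data.frame(
  hours = c(1, 2, 3, 4, 5, 6),
  score = c(52, 58, 61, 70, 74, 83)
)

# The formula reads y ~ x: score goes on the y-axis, hours on the x-axis
p <- xyplot(score ~ hours, data = toy)
print(p)
```

Whatever is written to the left of the ~ is drawn on the y-axis, and whatever is written to the right is drawn on the x-axis.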
    

Scatterplots in action

  • Use a scatterplot to answer the following questions:

    Do snacks that have more protein also have more calories? Why do you think that?

    What happens if you swap the protein and calories variables in your code? Does the relationship between the variables change?

    Does the relationship between protein and calories change when the snack is either Salty or Sweet? Write down the code you used to answer this question.
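
One way to compare a relationship across categories is to facet the scatterplot with the | symbol. The sketch below uses a made-up data frame; toy, hours, score and group are hypothetical names, not variables from the Food Habits data.

```r
library(lattice)  # provides xyplot()

# Made-up data for illustration only (not the Food Habits data)
toy <- data.frame(
  hours = c(1, 2, 3, 4, 5, 6, 7, 8),
  score = c(52, 58, 61, 70, 55, 57, 60, 59),
  group = factor(c("A", "A", "A", "A", "B", "B", "B", "B"))  # categorical
)

# y ~ x | z draws one panel for each category of z
p_facet <- xyplot(score ~ hours | group, data = toy)
print(p_facet)

# Swapping the two sides of the ~ exchanges the axes; the points are the same
p_swapped <- xyplot(hours ~ score | group, data = toy)
print(p_swapped)
```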

4-variable scatterplots

  • When we make scatterplots, we can include:

    – 1 numerical variable on the x-axis

    – 1 numerical variable on the y-axis

    – 1 categorical variable to facet our scatterplot

    – 1 more categorical variable to set the color of the points

  • To change the color of our points, we can include the groups argument much like we did for bargraphs (use the search feature in the History pane if you need help).

  • Create a scatterplot that uses these 4 variables: sodium, sugar, cost, salty_sweet.
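
A 4-variable scatterplot can be sketched on made-up data as follows. The names toy, hours, score, group and level are hypothetical stand-ins chosen for illustration. The groups argument colors the points, and auto.key = TRUE (an optional lattice argument) adds a legend so the colors can be read.

```r
library(lattice)  # provides xyplot()

# Made-up data for illustration only (not the Food Habits data)
toy <- data.frame(
  hours = c(1, 2, 3, 4, 5, 6, 7, 8),
  score = c(52, 58, 61, 70, 55, 57, 60, 59),
  group = factor(rep(c("A", "B"), each = 4)),        # used for faceting
  level = factor(rep(c("low", "high"), times = 4))   # used for point color
)

# Two numerical variables in the formula, one facet after |, one color group
p4 <- xyplot(score ~ hours | group, data = toy,
             groups = level,
             auto.key = TRUE)
print(p4)
```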

Multiple facets

  • It can sometimes be helpful to facet on more than 1 variable.

    – Splitting the data using 2 facets can give us additional insights that might otherwise be hidden.

  • Create a dotPlot or histogram of the calories variable, but facet the data using:

    healthy_level + salty_sweet
    
  • How does the healthy_level of a Salty or Sweet snack impact the number of calories in the snack?

  • Although we are treating healthy_level as a categorical variable, R recognizes it as a numerical variable.

    Verify this using the str function.

    – Notice that the faceted histograms or dotPlots do not have labels but rather tick-marks.

    – You will have the opportunity to convert the healthy_level variable into a factor later on.

  • Faceting your data on a numerical variable is NOT recommended.

    – Numerical variables often have so many different values that they overwhelm the plot and make it hard to read.
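
The ideas on this slide can be sketched with made-up data. The data frame toy and its columns score, rating and group are hypothetical; rating plays the role of a category that happens to be stored as numbers. The sketch assumes the lattice package, which provides histogram.

```r
library(lattice)  # provides histogram()

# Made-up data for illustration only (not the Food Habits data)
toy <- data.frame(
  score  = c(52, 58, 61, 70, 55, 57, 60, 59, 66, 72, 49, 63),
  rating = c(1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2),   # numbers used as categories
  group  = factor(rep(c("A", "B"), each = 6))       # a true categorical variable
)

# str() shows how R stores each column: rating is numeric, group is a factor
str(toy)

# Facet on two variables by joining them with + after the |
p_hist <- histogram(~ score | rating + group, data = toy)
print(p_hist)
```

Because rating is stored as numbers, its panels are marked by position rather than by a text label, while the panels for group show the category names.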

On your own

  • Answer the following questions by creating an appropriate graph or graphs:

    Do healthier snacks have more or fewer ingredients than less healthy snacks?

    What other variables seem to be related to the number of ingredients of a snack? Describe their relationships.