Lab 2G - Getting It Together
Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.
Putting data together
-
In the labs so far, we've only ever looked at individual data files.
-
But we can often gain new insights by including information from a separate data set.
-
In this lab, we will learn how to merge information from our personality color data with our stress/chill data.
-
Export, upload, and import your Personality Color dataset and name it colors.
-
Then export, upload, and import your Stress/Chill dataset and name it stress.
Looking at Stress/Chill
-
We would like to analyze the research question:
How do people's personality colors and/or sports participation affect their stress levels?
-
We already have data about personality color and a separate data set about stress.
– What we don't have is a single data set with information from both ... yet.
-
We'll start then by strategizing how to merge our data together.
Deciding how to merge
-
Before we merge data, we need to decide how we plan to merge it:
-
We can stack our datasets, that is, take one dataset's rows and add them to the bottom of the other dataset.
-
We can also join our data sets horizontally. This is where we take one dataset's columns and add them to the end of the other dataset's columns based on matching an ID variable.
– The ID variable will have entries that we use to match observations in both datasets.
-
To answer the statistical question of interest, would it make more sense to stack or join our colors and stress data?
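The difference between stacking and joining can be sketched on small toy data frames. Everything below (class_a, class_b, scores, hours, and their values) is made up for illustration and is not the class data.

```r
# Toy data only: all names and values here are invented for illustration.
class_a <- data.frame(id = c(1, 2), score = c(10, 12))
class_b <- data.frame(id = c(3, 4), score = c(9, 15))

# Stacking: both data frames have the same variables,
# so the result has more rows but the same columns.
stacked <- rbind(class_a, class_b)

scores <- data.frame(id = c(1, 2, 3), score = c(10, 12, 9))
hours  <- data.frame(id = c(1, 2, 3), sleep = c(8, 6, 7))

# Joining: both data frames describe the same observations,
# so the result has more columns, matched row by row on the ID variable.
joined <- merge(scores, hours, by = "id")

dim(stacked)  # number of rows, then number of columns
dim(joined)
```

Comparing the two dim results shows which direction each kind of merge grows the data.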
Finding variables in common:
-
Look at the names of the variables in each dataset.
– To merge different datasets together, we need to find variables they have in common.
-
Which variables do the datasets have in common?
-
Which variable would make sense to merge the datasets together with? Why not the others?
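One possible way to compare variable names is sketched below on toy data. The pets and homes data frames are hypothetical; intersect is a base R function that returns the values two vectors share.

```r
# Toy data only: pets and homes are invented for illustration.
pets  <- data.frame(owner = c("A", "B"), pet = c("cat", "dog"), age = c(3, 5))
homes <- data.frame(owner = c("A", "B"), city = c("X", "Y"), age = c(40, 35))

names(pets)
names(homes)

# Variable names that appear in both data frames.
shared <- intersect(names(pets), names(homes))
shared
```

In this toy case two names are shared, but age means the pet's age in one data frame and the owner's age in the other. A shared name alone does not make a variable a sensible one to match on.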
Caution required
-
Whether stacking or joining, we need to be careful when we merge data:
-
When stacking data, we need to be absolutely certain that the variables we're stacking represent the exact same measurements.
– We wouldn't want to stack height in meters and height in inches, for instance (without converting one to the other).
-
When joining data, we need to make sure that each ID in our primary dataset matches one and only one observation in the joining data.
– Otherwise, R won't know which observation to match to.
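A small sketch of what a duplicated ID does to a join, using invented toy data (main and extra are hypothetical names). In base R, merge does not stop with an error in this situation; it keeps every possible match, so the joined data can end up with more rows than the primary dataset.

```r
# Toy data only: main and extra are invented for illustration.
main  <- data.frame(id = c(1, 2, 3), stress = c(4, 2, 5))

# id 2 appears twice in the data we are joining on.
extra <- data.frame(id = c(1, 2, 2, 3),
                    color = c("blue", "gold", "green", "orange"))

joined <- merge(main, extra, by = "id")

nrow(main)    # rows in the primary data
nrow(joined)  # rows after joining
joined
```

Comparing the two row counts is a quick way to notice that something matched more than once.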
Getting ready
-
Our goal is to add the variables from the colors data onto the stress data.
-
Start by ensuring that every user.id in the colors data is unique.
– If there's a duplicate, have your teacher remove the duplicate from your class' Web Response Manager and then re-export, upload, and import your colors data.
-
After we add the data from colors to stress, how many rows should our merged data have? Write this number down.
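One way to check an ID variable for duplicates is sketched below on a toy data frame. The data frame toy and its values are made up; duplicated and unique are base R functions. The same idea should carry over to the real colors data, assuming its ID column is named user.id as described above.

```r
# Toy data only: the IDs and scores are invented for illustration.
toy <- data.frame(user.id = c("a1", "b2", "b2", "c3"),
                  score = c(1, 2, 3, 4))

# TRUE if any ID appears more than once.
has_dupes <- any(duplicated(toy$user.id))

# Which IDs are repeated?
dupes <- as.character(toy$user.id[duplicated(toy$user.id)])

# TRUE only when every ID is unique.
all_unique <- length(unique(toy$user.id)) == nrow(toy)

has_dupes
dupes
all_unique
```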
Putting them together
-
We can use the merge function to join our datasets together using the variables that appear in both sets.
-
Fill in the blanks below to join the information from the colors data onto the stress data.
merge(____, ____, by = "____")
-
Assign this merged data set the name stress_colors.
– Make sure your data has the same number of observations that you wrote down on the previous slide.
Saving your data:
-
View your merged data and make sure nothing appears to be blatantly wrong with it.
-
Why didn't we stack the rows of data instead?
-
What happens if you swap the order of the data sets in the merge function?
-
Fill in the blank below to save our stress_colors data for later use.
save(stress_colors, file = "stress_colors.rda")
-
Be sure to look in the Files tab to make sure your data was saved.
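As a rough sketch of how saving and reloading works, the example below saves a toy data frame to a temporary folder and reads it back. The object toy and the file name toy.rda are invented for illustration; save, load, and file.exists are base R functions.

```r
# Toy data only: toy and "toy.rda" are invented for illustration.
toy <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

# Save into R's temporary folder so nothing in your project is touched.
path <- file.path(tempdir(), "toy.rda")
save(toy, file = path)

# Check from the console that the file exists.
file.exists(path)

# Remove the object, then load it back from the file.
rm(toy)
load(path)  # restores an object with its original name, toy
toy
```

Note that load brings the object back under the name it had when it was saved, not under the file's name.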
Moving on
-
In the next lab, we'll begin analyzing our merged data. In the meantime:
-
Make a few plots using variables from the stress data and facet or group the plots based on variables from the colors data.
– Write down the most interesting discovery you make by just exploring your data. Write out how you found your discovery and interpret what it means for the people in your class.
-
With our colors data, we could answer questions about the typical color scores in your class. Why can we no longer answer this question in our stress_colors data?