Lab 3B: Confound It All!
Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.
Finding data in new places
Since your first forays into doing data science, you've used data from two sources:
– Built-in datasets from RStudio.
– Campaign data from the Campaign Manager.
Data can be found in many other places though, especially online.
In this lab, we'll read an observational study dataset from a website.
– We'll then use this data to explore which factors are associated with a person's lung capacity.
Importing our data
Rather than exporting the data and then uploading and importing it, we'll pull the data straight from the webpage into R.
You can find the data online here:
– (Right-click and select Open in New Window)
Click on the Import Dataset button under the Environment tab.
– Then click on the From Text (readr) option.
– Type or copy/paste the URL into the box.
– Click Update.
Before importing, change the following Import Options:
– Uncheck the First Row as Names
– Change Delimiter to Whitespace
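The same import can also be written as code. The following is a minimal sketch, not the lab's required method: it uses base R's read.table as a stand-in for the readr-based import dialog, and it reads a few made-up rows from a text string because the file's URL is not repeated here. The names raw_text, lungs_raw and data_url are hypothetical.

```r
# Hypothetical stand-in for the web file: four made-up rows, NOT the study's values.
# Columns are separated by runs of spaces and there is no header row.
raw_text <- "
10  2.10  55.5  0
12  2.60  60.0  0
15  3.40  66.5  1
17  3.90  69.0  1
"

# header = FALSE plays the role of unchecking "First Row as Names";
# the default sep = "" splits on any amount of whitespace.
lungs_raw <- read.table(text = raw_text, header = FALSE)

# To read from the web instead, pass the URL string as the first argument:
# lungs_raw <- read.table(data_url, header = FALSE)   # data_url is hypothetical

# Without a header row, R names the columns V1, V2, V3, V4.
str(lungs_raw)
```

Because the file has no header row, the columns arrive with placeholder names, which is why the data needs cleaning before use.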
Our new data
Variables that were measured include:
– Age in years.
– Lung capacity, measured in liters.
– Height, in inches.
– Whether the participant was a smoker (coded "1") or a non-smoker.
About the data
The data come from the Forced Expiratory Volume (FEV) study that took place in the late 1970s.
The observations come from a sample of 654 youths, aged 3 to 19, in and around East Boston.
Researchers were interested in answering the research question:
What is the effect of childhood smoking on lung health?
Cleaning your data
Now that we've got the data loaded, we need to clean it to get it ready for use (see Lab 1F for help). Specifically:
– We want to name the variables, in order; the last one is "smoker".
– Change the type of variable for smoker from numeric to character.
After changing the variable types for
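The cleaning steps above can be sketched in base R as follows. This is an illustration under assumptions, not the approach from Lab 1F: the data frame name lungs and the column names age, lung_cap and height are hypothetical (only "smoker" appears in the lab text), the "0" code for non-smokers is assumed, and the values are made up. The number of names supplied must match the number of columns in your imported data.

```r
# Toy data standing in for the imported file (made-up values, default V1..V4 names).
lungs <- data.frame(V1 = c(10, 12, 15, 17),
                    V2 = c(2.1, 2.6, 3.4, 3.9),
                    V3 = c(55, 60, 66, 69),
                    V4 = c(0, 0, 1, 1))   # "0" for non-smokers is an assumption

# Only "smoker" comes from the lab text; the other three names are assumptions.
names(lungs) <- c("age", "lung_cap", "height", "smoker")

# Convert the numeric codes to character so R treats them as labels, not quantities.
lungs$smoker <- as.character(lungs$smoker)

# Check the result: smoker should now be listed as chr.
str(lungs)
```

Storing smoker as character matters later: plotting functions then treat it as a grouping variable rather than a number.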
Analyzing our data
The lungs data is from an observational study.
Write down a reason the researchers couldn't use an experiment to test the effects of smoking on children's lungs.
Observational studies are often helpful for analyzing how variables are related:
– Do you think that a person's age affects their lung capacity? Make a sketch of what you think a scatterplot of the two variables would look like and explain.
lungsdata to create an
– Interpret the plot and describe why the relationship between the two variables makes sense.
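One way to draw a scatterplot of two numeric variables is sketched below with base R's plot function; the lab itself may expect a different plotting function. The data frame name lungs and the column names age and lung_cap are assumptions, and the values are made up for illustration only.

```r
# Toy data (made-up values); age and lung_cap are assumed column names.
lungs <- data.frame(age      = c(5, 8, 10, 12, 14, 16, 18),
                    lung_cap = c(1.2, 1.8, 2.3, 2.9, 3.3, 3.8, 4.0))

# Scatterplot with the explanatory variable (age) on the x-axis
# and the response (lung capacity) on the y-axis.
plot(lung_cap ~ age, data = lungs,
     xlab = "Age (years)", ylab = "Lung capacity (liters)")

# Correlation coefficient: a one-number summary of the linear trend in the toy data.
trend <- cor(lungs$age, lungs$lung_cap)
```

With your own data, compare the plot to the sketch you made beforehand and note where your prediction differed.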
Smoking and lung capacity
Make a plot that can be used to answer the statistical investigative question:
Do people who smoke tend to have lower lung capacity than those who do not smoke?
Use your plot to answer the question.
– Were you surprised by the answer? Why?
– Can you suggest a possible confounding factor that might be affecting the result?
Create three subsets of the data:
– One that includes only 13-year-olds ...
– One that includes only 15-year-olds ...
– and one that includes only 17-year-olds.
Make a plot that compares the lung capacity of smokers and non-smokers for each subset.
How does the relationship between smoking and lung capacity change as we increase the age from 13 to 15 to 17?
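Comparing smokers and non-smokers within a single age group is one way to hold age fixed. The sketch below shows the idea in base R on made-up values; it is not the lab's answer. The names lungs, age, lung_cap, age13, age15, age17 and medians13 are hypothetical, as is the "0" code for non-smokers.

```r
# Toy data (made-up values): four youths at each of the three ages.
lungs <- data.frame(
  age      = c(13, 13, 13, 13, 15, 15, 15, 15, 17, 17, 17, 17),
  lung_cap = c(3.0, 3.2, 3.1, 2.9, 3.6, 3.8, 3.5, 3.4, 4.1, 4.3, 3.9, 4.0),
  smoker   = rep(c("0", "0", "1", "1"), times = 3)   # "0" = non-smoker is assumed
)

# One subset per age, so smokers are only compared with non-smokers of the same age.
age13 <- subset(lungs, age == 13)
age15 <- subset(lungs, age == 15)
age17 <- subset(lungs, age == 17)

# Side-by-side boxplots of lung capacity by smoking status within one subset.
boxplot(lung_cap ~ smoker, data = age13,
        xlab = "Smoker", ylab = "Lung capacity (liters)", main = "Age 13")

# Median lung capacity for each group; repeat for age15 and age17.
medians13 <- tapply(age13$lung_cap, age13$smoker, median)
```

Repeating the plot and the medians for each subset lets you see whether the gap between the two groups changes from one age to the next.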
Sum it up!
Does smoking affect lung capacity? If so, how?
– Support your answers with appropriate plots.
– Explain why you included the variables you used in your plots.