Lesson 13: RStudio Basics
Objective:
Students will learn RStudio’s interface, as well as a few basic commands to discover the structure behind a data set.
Materials:

Computer

Projector

RStudio: https://portal.idsucla.org

Video showing how to log in to RStudio Cloud for the first time: https://www.youtube.com/watch?v=vgh7C8U8Ekk
Vocabulary:
pane, preview, console, plot, environment
RStudio Commands:
data( ), View( ), names( ), help( ), dim( ), tally( ), load_labs( )
Essential Concepts:
The computer has its own syntax, and it can only understand you if you speak its language.
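
To make this concept concrete, here is a minimal sketch (not part of the IDS lab materials) showing that R treats capitalization as part of its syntax. The data frame `mini` is a hypothetical stand-in with made-up values, and `Names` is a deliberate misspelling of `names`.

```r
# Hypothetical two-column stand-in data set (made-up values, not the real cdc data).
mini <- data.frame(height = c(1.60, 1.75), grade = c("9", "10"))

# Correct syntax: names() with a lowercase n lists the variables.
correct <- names(mini)
print(correct)

# Incorrect syntax: R is case-sensitive, so Names() is not a function it knows.
# tryCatch() captures the error instead of stopping the script.
misspelled <- tryCatch(Names(mini), error = function(e) "error")
print(misspelled)
```

The same idea applies to `View( )`, which starts with a capital V.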
Lesson:

Inform students that the Dashboard and PlotApp are data visualization tools that are coded in R, the statistical programming language that academics and professional statisticians use. The Introduction to Data Science course will use RStudio, which also runs on R. Students will learn to write R code in RStudio for data analysis.

Demonstrate how to access RStudio by projecting the URL: https://portal.idsucla.org on a screen. Then, click on the RStudio icon on the page.

Inform students that they will log into RStudio using the "Log In with Google" option. Note that it is not the same as their IDS App & IDS Homepage login.

Once logged in, show each pane, or rectangular area, of the RStudio interface:

preview (spreadsheet): where they will be able to see the variables and observations (index); rows and columns of data

console: where they will be entering their code

plot: where their plots/graphs/visualizations will be generated

environment: where they will see values and objects


Inform students that they will be looking at a data set from The Centers for Disease Control and Prevention (CDC), a government agency that collects data about teenagers on a variety of topics.

Demonstrate how to load the CDC data file into the workspace and view it by typing the following commands in the console:
>data(cdc)
>View(cdc)

Examine the environment pane. Ask a student to describe how the data are displayed. The data are displayed in rows and columns.

Demonstrate how to list the variables found in the CDC data set. Students may take notes and write down commands in their DS journals:

>names(cdc)

Ask: What do you notice? What is one variable of this data set? How many variables are there? How does this output compare to the information in the preview pane? This command lists the names of each variable in the data set.
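
For teachers who want to rehearse this step without the course data loaded, the sketch below uses a hypothetical stand-in data frame, `mini_cdc`, with made-up values and only three of the variables. It illustrates what `names( )` returns and how to count the variables; it is not the real `cdc` data set.

```r
# Hypothetical stand-in for cdc (made-up values; the real data set has many more variables).
mini_cdc <- data.frame(
  height    = c(1.60, 1.75, 1.68),
  weight    = c(55, 70, 62),
  seat_belt = c("Always", "Never", "Always")
)

# names() returns the variable (column) names as a character vector.
variable_names <- names(mini_cdc)
print(variable_names)

# length() counts how many names there are, i.e. how many variables.
number_of_variables <- length(variable_names)
print(number_of_variables)
```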


Demonstrate how to obtain more detailed information about the data set by typing the following command in the console:

>help(cdc)

Ask: What unit of measurement is height reported in? Height was reported in meters.


Demonstrate how to find the number of rows and columns in the data set.

>dim(cdc)

Ask: Which number do you think represents the rows? Which one represents the columns? How does this output compare to the information in the preview and environment panes? How many observations are there in the data set? How many variables does this data set contain? There are 13,677 rows, or 13,677 observations; and there are 30 columns, or 30 variables. This information is also visible in the preview pane.
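
The order of the two numbers can be checked on a small example. The sketch below assumes a hypothetical stand-in data frame, `mini_cdc`, with made-up values; `nrow( )` and `ncol( )` are base R functions that are not on this lesson's command list and are used here only to confirm which number is which.

```r
# Hypothetical stand-in for cdc: 4 observations (rows) and 2 variables (columns).
mini_cdc <- data.frame(
  height    = c(1.60, 1.75, 1.68, 1.80),
  seat_belt = c("Always", "Never", "Always", NA)
)

# dim() reports the number of rows first, then the number of columns.
size <- dim(mini_cdc)
print(size)

# nrow() and ncol() return each number separately.
n_observations <- nrow(mini_cdc)
n_variables <- ncol(mini_cdc)
print(n_observations)
print(n_variables)
```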


Next, show students how to count the observations in each category of a specific variable.

>tally(~seat_belt, data = cdc)

Ask: What do you notice? Describe the output. Notice that six categories are displayed. Each category shows the number of observations contained in it. E.g., “Never” has 294 observations, meaning 294 teens reported never wearing their seat belt as a passenger in a motor vehicle. <NA> stands for Not Available and represents teens who did not provide information about their seat belt habits.
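
The idea of counting observations per category, including missing responses, can be sketched without the course packages. The example below swaps in base R's `table( )` as a stand-in for `tally( )` and uses a hypothetical vector of made-up responses, so the counts do not match the real `cdc` output.

```r
# Hypothetical seat belt responses (made-up values); NA marks a missing response.
seat_belt <- c("Always", "Never", "Always", NA, "Rarely", "Always")

# table() counts the observations in each category.
# useNA = "ifany" adds a separate count for missing responses when there are any.
counts <- table(seat_belt, useNA = "ifany")
print(counts)

# The category counts add up to the total number of observations.
total <- sum(counts)
print(total)
```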


Change the variable to height.

>tally(~height, data = cdc)

Ask: What do you notice? Describe the output. The levels are missing. This happens because the variable height contains numbers, not categories.
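
One way to see the difference in levels directly is sketched below. It assumes a hypothetical stand-in data frame in which `seat_belt` is stored as a factor (R's type for categories); whether the real `cdc` data stores it the same way is an assumption. `levels( )` is a base R function that is not on this lesson's command list.

```r
# Hypothetical stand-in for cdc (made-up values).
mini_cdc <- data.frame(
  seat_belt = factor(c("Always", "Never", "Always")),
  height    = c(1.60, 1.75, 1.68)
)

# A categorical variable stored as a factor has levels: its categories.
belt_levels <- levels(mini_cdc$seat_belt)
print(belt_levels)

# A numerical variable has no levels, so levels() returns NULL.
height_levels <- levels(mini_cdc$height)
print(height_levels)
```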


Let’s take a closer look at the variables seat_belt and height. Maximize the console. Ask teams to discuss the following question:
What is the difference between the data from the variables seat_belt and height? The data from the seat_belt variable are categorical, which means they consist of groupings. The data from the variable height are numerical, which means they consist of numbers.

Summarize: In data science, the variable seat_belt is what we call a categorical variable, and the variable height is what we call a numerical variable.

Let’s look at the other variables in this data set. In pairs, categorize each variable as categorical or numerical:

eat_fruit (categorical)

weight (numerical)

grade (categorical)

gender (categorical)
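
After pairs have finished, the classification can be checked in code. The sketch below is one possible approach, not part of the lab: it builds a hypothetical stand-in data frame with made-up values and labels each variable by how it is stored, using the base R functions `sapply( )` and `is.numeric( )`.

```r
# Hypothetical stand-in for cdc (made-up values; category labels are illustrative).
mini_cdc <- data.frame(
  eat_fruit = factor(c("1 time per day", "Did not eat fruit", "1 time per day")),
  weight    = c(55, 70, 62),
  grade     = factor(c("9", "10", "11")),
  gender    = factor(c("Female", "Male", "Female"))
)

# is.numeric() is TRUE for variables stored as numbers.
# ifelse() turns each TRUE/FALSE into a label and keeps the variable names.
variable_type <- ifelse(sapply(mini_cdc, is.numeric), "numerical", "categorical")
print(variable_type)
```

Note that this check looks only at how a variable is stored. A variable such as grade would be labeled numerical if it were stored as plain numbers, even though students should still treat it as categorical.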


Inform students that they will be learning RStudio code to work with data. They will be completing RStudio labs throughout the course.

Demonstrate how to load the menu of labs by typing the following code:
>load_labs( )

The load_labs( ) command displays a list of available labs and a selection prompt. To select Lab 1A, type the number 1 after the selection prompt.

Next, direct students’ attention to the plot pane. Show them the location of Lab 1A’s presentation.

Click on the arrows at the bottom right-hand side of the presentation to view each slide. Pause on a slide titled “R’s most important syntax.” There are 3 boxes, each containing a line of code.

Explain that every time they see a grey box with a line of code, they are to type the code in the console. The output will appear either on the console itself or on the plot pane.

Type in one of the lines of code. In this particular case, the output will be a plot. Show students the location of the plot and demonstrate how to toggle between the plots and presentation tabs.

Inform students that they will be completing the first lab, 1A, the next day.
Class Scribes:
One team of students will give a brief talk to discuss what they think the 3 most important topics of the day were.
Homework & Next 3 Days
Students should continue to collect nutritional facts data using the Food Habits Participatory Sensing campaign on their smart devices or via web browser.