# Lab 2B - Oh the Summaries ...

## Lab 2B - Oh the Summaries...

Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.

### Just the beginning

• Means, medians,and MAD are just a few examples of numerical summaries.

• In this lab, we will learn how to calculate and interpret additional summaries of distributions such as: minimums, maximums, ranges, quartiles and IQRs.

– We'll also learn how to write our first custom function!

• Start by loading your Personality Color data again and name it `colors`.

### Extreme values

• Besides looking at typical values, sometimes we want to see extreme values, like the smallest and largest values.

– To find these values, we can use the `min`, `max` or `range` functions. These functions use a similar syntax as the `mean` function.

• Find and write down the `min` value and `max` value for your predominant color.

• Apply the `range` function to your predominant color and describe the output.

– The range of a variable is the difference between a variable’s smallest and largest value.

– Notice, however, that our `range` function calculates the maximum and minimum values for a variable, but not the difference between them.

– Later in this lab you will create a custom `Range` function that will calculate the difference.

### Quartiles (Q1 & Q3)

• The median of our data is the value that splits our data in half.

– Half of our data is smaller than the median, half is larger.

• Q1 and Q3 are similar.

– 25% of our data are smaller than Q1, 75% are larger. - 75% of our data are smaller than Q3, 25% are larger.

• Fill in the blanks to compute the value of Q1 for your predominant color.

``````quantile(~____, data = ____, p = 0.25)
``````
• Use a similar line of code to calculate Q3, which is the value that's larger than 75% of our data.

### The Inter-Quartile-Range (IQR)

• Make a `dotPlot` of your predominant color's scores. Make sure to include the `nint` option.

• Visually (Don't worry about being super-precise):

Cut the distribution into quarters so the number of data points is equal for each piece. (Each piece should contain 25% of the data.)

• Hint: You might consider using the `add_line(vline = )` to add vertical lines at the quarter marks.

Write down the numbers that split the data up into these 4 pieces.

How long is the interval of the middle two pieces?

– This length is the IQR.

### Calculating the IQR

• The `IQR` is another way to describe spread.

– It describes how wide or narrow the middle 50% of our data are.

• Just like we used the `min` and `max` to compute the `range`, we can also use the 1st and 3rd quartiles to compute the IQR.

• Use the values of Q1 and Q3 you calculated previously and find the IQR by hand.

Then use the `iqr()` function to calculate it for you.

• Which personality color score has the widest spread according to the IQR? Which is narrowest?

### Boxplots

• By using the medians, quartiles, and min/max, we can construct a new single variable plot called the box and whisker plot, often shortened to just boxplot.

• By showing someone a `dotPlot`, how would you teach them to make a boxplot? Write out your explanation in a series of steps for the person to use.

Use the steps you write to create a sketch of a boxplot for your predominant color's scores in your journal.

Then use the `bwplot` function to create a boxplot using `R`.

### Our favorite summaries

• In the past two labs, we've learned how to calculate numerous numerical summaries.

– Computing lots of different summaries can be tedious.

• Fill in the blanks below to compute some of our favorite summaries for your predominant color all at once.

``````favstats(~____, data=colors)
``````

### Calculating a range value

• We saw in the previous slide that the `range` function calculates the maximum and minimum values for a variable, but not the difference between them.

• We could calculate this difference in two steps:

– Step 1: Use the `range` function to `assign` the max and min values of a variable the name `values`. This will store the output from the `range` function in the environment pane.

``````values <- range(~____, data=colors)
``````

– Step 2: Use the `diff` function to calculate the difference of `values`. The input for the `diff` function needs to be a vector containig two numeric values.

``````diff(values)
``````
• Use these two steps to calculate the range of your predominant color.

### Introducing custom functions

• Calculating the range of many variables can be tedious if we have to keep performing the same two steps over and over.

– We can combine these two steps into one by writing our own custom `function`.

• Custom functions can be used to combine a task that would normally take many steps to compute and simplify them into one.

• The next slide shows an example of how we can create a custom function called `mm_diff` to calculate the absolute difference between the `mean` and `median` value of a `variable` in our `data`.

### Example function

``````mm_diff <- function(variable, data) {
mean_val <- mean(variable, data = data)
med_val <- median(variable, data = data)
abs(mean_val - med_val)
}
``````
• The function takes two generic arguments: `variable` and `data`

• It then follows the steps between the curly braces `{ }`

– Each of the generic arguments is used inside the `mean` and `median` functions.

• Copy and paste the code above into an R script and run it.

• The `mm_diff` function will appear in your Environment pane.

### Using mm_diff()

• After running the code used to create the function, we can use it just like we would any other numerical summary.

In the console, fill in the blanks below to calculate the absolute difference between the `mean` and `median` values of your predominant color:

``````____(~____, data = ____)
``````
• Which of the four colors has the largest absolute difference between the `mean` and `median` values?

By examining a `dotPlot` for this personality color, make an argument why either the `mean` or `median` would be the better description of the center of the data.

### Our first function

• Using the previous example as a guide, create a function called `Range` (Note the capial 'R') that calculates the range of a variable by filling in the blanks below:

``````____ <- function (____, ____) {
values <- range(____, data = ____)
diff(___)
}
``````
• Use the `Range` function to find the personality color with the largest difference between the `max` and `min` values.

• Create a function called `myIQR` that uses the `quantile` function to compute the middle 30% of the data.