Lab 2B - Oh the Summaries ...
Directions: Follow along with the slides and answer the questions in bold font in your journal.
Just the beginning

Means, medians, and MAD are just a few examples of numerical summaries.

In this lab, we will learn how to calculate and interpret additional summaries of distributions such as: minimums, maximums, ranges, quartiles and IQRs.
– We'll also learn how to write our first custom function!

Start by loading your Personality Color data again and name it colors.
Extreme values

Besides looking at typical values, sometimes we want to see extreme values, like the smallest and largest values.
– To find these values, we can use the min, max, or range functions. These functions use a syntax similar to that of the mean function.
Find the min value and max value for your predominant color.
Apply the range function to your predominant color and describe the output.
– The range of a variable is the difference between a variable's smallest and largest value.
– Notice, however, that our range function calculates the maximum and minimum values for a variable, but not the difference between them.
– Later in this lab you will create a custom Range function that will calculate the difference.
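
As a small illustration of these three functions, here is a sketch in base R on a made-up vector of scores. The name scores and its values are assumptions for illustration, not the Personality Color data; in the lab you would use the formula syntax with data = colors instead.

```r
# Hypothetical scores, invented for illustration only
scores <- c(4, 7, 9, 10, 12, 13, 15, 18, 20)

smallest <- min(scores)    # the smallest value
largest  <- max(scores)    # the largest value
both     <- range(scores)  # a vector holding the min and the max, in that order

print(smallest)
print(largest)
print(both)
```

Notice that range gives back two numbers rather than a single difference.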
Quartiles (Q1 & Q3)

The median of our data is the value that splits our data in half.
– Half of our data is smaller than the median, half is larger.

Q1 and Q3 are similar.
– 25% of our data is smaller than Q1; 75% is larger.

Fill in the blanks to compute the value of Q1 for your predominant color.
quantile(~____, data = ____, p = 0.25)

Use a similar line of code to calculate Q3, which is the value that's larger than 75% of our data.
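
To see what quartiles look like on a tiny data set, the sketch below uses base R's quantile function on a made-up vector (scores is a hypothetical name, and base R takes the vector directly rather than the formula syntax used in the lab).

```r
# Hypothetical scores, invented for illustration only
scores <- c(4, 7, 9, 10, 12, 13, 15, 18, 20)

# Q1: the value with about 25% of the data below it
q1 <- quantile(scores, probs = 0.25)

# Q3: the value with about 75% of the data below it
q3 <- quantile(scores, probs = 0.75)

print(q1)
print(q3)
```

With these nine values, Q1 lands on the third smallest score and Q3 on the third largest.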
The InterQuartileRange (IQR)

Make a dotPlot of your predominant color's scores. Make sure to include the nint option.
Visually (don't worry about being super precise):
– Cut the distribution into quarters so the number of data points is equal for each piece. (Each piece should contain 25% of the data.)
 Hint: You might consider using the add_line(vline = ) function to add vertical lines at the quarter marks.
– Write down the numbers that split the data up into these 4 pieces.
– How long is the interval of the middle two pieces?
– This length is the IQR.
Calculating the IQR

The IQR is another way to describe spread.
– It describes how wide or narrow the middle 50% of our data are.

Just like we used the min and max to compute the range, we can also use the 1st and 3rd quartiles to compute the IQR.
Use the values of Q1 and Q3 you calculated previously and find the IQR by hand.
– Then use the iqr() function to calculate it for you.
Which personality color score has the widest spread according to the IQR? Which is narrowest?
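
The by-hand calculation can be checked on a small example. This sketch assumes a made-up vector called scores and uses base R's IQR function (capital letters) in place of the lab's iqr() function.

```r
# Hypothetical scores, invented for illustration only
scores <- c(4, 7, 9, 10, 12, 13, 15, 18, 20)

# Step 1: find the two quartiles (unname drops the "25%" / "75%" labels)
q1 <- unname(quantile(scores, probs = 0.25))
q3 <- unname(quantile(scores, probs = 0.75))

# Step 2: the IQR is the distance between them
iqr_by_hand <- q3 - q1

# Base R can do both steps at once
iqr_builtin <- IQR(scores)

print(iqr_by_hand)
print(iqr_builtin)
```

Both approaches should agree, since the built-in function is doing the same subtraction.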
Boxplots

By using the medians, quartiles, and min/max, we can construct a new single variable plot called the box and whisker plot, often shortened to just boxplot.

By showing someone a dotPlot, how would you teach them to make a boxplot? Write out your explanation in a series of steps for the person to use.
– Use the steps you write to create a sketch of a boxplot for your predominant color's scores in your journal.
– Then use the bwplot function to create a boxplot using R.
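
The five numbers a boxplot is built from can be collected in one call. The sketch below is an assumption-laden illustration: it uses base R's quantile function on a made-up vector named scores, not the lab's data or plotting functions.

```r
# Hypothetical scores, invented for illustration only
scores <- c(4, 7, 9, 10, 12, 13, 15, 18, 20)

# min, Q1, median, Q3, max: the landmarks of a box and whisker plot
five <- quantile(scores, probs = c(0, 0.25, 0.5, 0.75, 1))

print(five)
```

The box spans the middle three numbers, and the whiskers reach toward the first and last.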
Our favorite summaries

In the past two labs, we've learned how to calculate numerous numerical summaries.
– Computing lots of different summaries can be tedious.

Fill in the blanks below to compute some of our favorite summaries for your predominant color all at once.
favstats(~____, data=colors)
Calculating a range value

We saw earlier in this lab that the range function calculates the maximum and minimum values for a variable, but not the difference between them.
We could calculate this difference in two steps:
– Step 1: Use the range function to assign the max and min values of a variable the name values. This will store the output from the range function in the environment pane.
values <- range(~____, data = colors)
– Step 2: Use the diff function to calculate the difference of values. The input for the diff function needs to be a vector containing two numeric values.
diff(values)

Use these two steps to calculate the range of your predominant color.
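
Here is how the two steps play out on a small example. The vector scores is made up for illustration, and base R's range is called on the vector directly instead of with the lab's formula syntax.

```r
# Hypothetical scores, invented for illustration only
scores <- c(4, 7, 9, 10, 12, 13, 15, 18, 20)

# Step 1: store the min and max under the name values
values <- range(scores)

# Step 2: subtract the min from the max
range_value <- diff(values)

print(values)
print(range_value)
```

The result of the second step is a single number: the largest score minus the smallest.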
Introducing custom functions

Calculating the range of many variables can be tedious if we have to keep performing the same two steps over and over.
– We can combine these two steps into one by writing our own custom function.
Custom functions can combine a task that would normally take many steps into a single step.

The next slide shows an example of how we can create a custom function called mm_diff to calculate the absolute difference between the mean and median value of a variable in our data.
Example function
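mm_diff <- function(variable, data) {
  # Compute the mean and the median of the variable
  mean_val <- mean(variable, data = data)
  med_val <- median(variable, data = data)
  # Return the absolute difference between them
  abs(mean_val - med_val)
}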

The function takes two generic arguments: variable and data

It then follows the steps between the curly braces { }
– Each of the generic arguments is used inside the mean and median functions.
Copy and paste the code above into an R script and run it.

The mm_diff function will appear in your Environment pane.
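
The same pattern also works outside the lab's formula syntax. As a sketch only, the function below (mm_diff_vec, a hypothetical name not used in the lab) takes a plain numeric vector and is tried on made-up data containing one unusually large value.

```r
# A base R variation on mm_diff that takes a numeric vector directly.
# mm_diff_vec is a hypothetical name chosen for this illustration.
mm_diff_vec <- function(x) {
  mean_val <- mean(x)
  med_val <- median(x)
  abs(mean_val - med_val)
}

# Made-up data with one unusually large value
skewed <- c(1, 2, 3, 4, 20)

result <- mm_diff_vec(skewed)
print(result)
```

The single large value pulls the mean up while the median stays put, so the difference between them grows.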
Using mm_diff()

After running the code used to create the function, we can use it just like we would any other numerical summary.
– In the console, fill in the blanks below to calculate the absolute difference between the mean and median values of your predominant color:
____(~____, data = ____)

Which of the four colors has the largest absolute difference between the mean and median values?
– By examining a dotPlot for this personality color, make an argument why either the mean or median would be the better description of the center of the data.
Our first function

Using the previous example as a guide, create a function called Range (note the capital 'R') that calculates the range of a variable by filling in the blanks below:
____ <- function(____, ____) {
  values <- range(____, data = ____)
  diff(____)
}

Use the Range function to find the personality color with the largest difference between the max and min values.
On your own
 Create a function called myIQR that uses the quantile function to compute the width of the middle 30% of the data.