Skip to content

Lab 2B - Oh the Summaries ...

Lab 2B - Oh the Summaries ...

Directions: Follow along with the slides and answer the questions in bold font in your journal.

Just the beginning

  • Means, medians,and MAD are just a few examples of numerical summaries.

  • In this lab, we will learn how to calculate and interpret additional summaries of distributions such as: minimums, maximums, ranges, quartiles and IQRs.

    – We'll also learn how to write our first custom function!

  • Start by loading your Personality Color data again and name it colors.

Extreme values

  • Besides looking at typical values, sometimes we want to see extreme values, like the smallest and largest values.

    – To find these values, we can use the min, max or range functions. These functions use a similar syntax as the mean function.

  • Find the min value and max value for your predominant color.

  • Apply the range function to your predominant color and describe the output.

    – The range of a variable is the difference between a variable’s smallest and largest value.

    – Notice, however, that our range function calculates the maximum and minimum values for a variable, but not the difference between them.

    – Later in this lab you will create a custom Range function that will calculate the difference.

Quartiles (Q1 & Q3)

  • The median of our data is the value that splits our data in half.

    – Half of our data is smaller than the median, half is larger.

  • Q1 and Q3 are similar.

    – 25% of our data is smaller than Q1, 75% are larger.

  • Fill in the blanks to compute the value of Q1 for your predominant color.

    quantile(~____, data = ____, p = 0.25)
    
  • Use a similar line of code to calculate Q3, which is the value that's larger than 75% of our data.

The Inter-Quartile-Range (IQR)

  • Make a dotPlot of your predominant color's scores. Make sure to include the nint option.

  • Visually (Don't worry about being super-precise):

    – Cut the distribution into quarters so the number of data points is equal for each piece. (Each piece should contain 25% of the data.)

    • Hint: You might consider using the add_line(vline = ) to add vertical lines at the quarter marks.

    Write down the numbers that split the data up into these 4 pieces.

    How long is the interval of the middle two pieces?

    – This length is the IQR.

Calculating the IQR

  • The IQR is another way to describe spread.

    – It describes how wide or narrow the middle 50% of our data are.

  • Just like we used the min and max to compute the range, we can also use the 1st and 3rd quartiles to compute the IQR.

  • Use the values of Q1 and Q3 you calculated previously and find the IQR by hand.

    Then use the iqr() function to calculate it for you.

  • Which personality color score has the widest spread according to the IQR? Which is narrowest?

Boxplots

  • By using the medians, quartiles, and min/max, we can construct a new single variable plot called the box and whisker plot, often shortened to just boxplot.

  • By showing someone a dotPlot, how would you teach them to make a boxplot? Write out your explanation in a series of steps for the person to use.

    Use the steps you write to create a sketch of a boxplot for your predominant color's scores in your journal.

    Then use the bwplot function to create a boxplot using R.

Our favorite summaries

  • In the past two labs, we've learned how to calculate numerous numerical summaries.

    – Computing lots of different summaries can be tedious.

  • Fill in the blanks below to compute some of our favorite summaries for your predominant color all at once.

    favstats(~____, data=colors)
    

Calculating a range value

  • We saw in the previous slide that the range function calculates the maximum and minimum values for a variable, but not the difference between them.

  • We could calculate this difference in two steps:

    – Step 1: Use the range function to assign the max and min values of a variable the name values. This will store the output from the range function in the environment pane.

    values <- range(~____, data=colors)
    

    – Step 2: Use the diff function to calculate the difference of values. The input for the diff function needs to be a vector containig two numeric values.

    diff(values)
    
  • Use these two steps to calculate the range of your predominant color.

Introducing custom functions

  • Calculating the range of many variables can be tedious if we have to keep performing the same two steps over and over.

    – We can combine these two steps into one by writing our own custom function.

  • Custom functions can be used to combine a task that would normally take many steps to compute and simplify them into one.

  • The next slide shows an example of how we can create a custom function called mm_diff to calculate the absolute difference between the mean and median value of a variable in our data.

Example function

mm_diff <- function(variable, data) {
  mean_val <- mean(variable, data = data)
  med_val <- median(variable, data = data)
  abs(mean_val - med_val)
}
  • The function takes two generic arguments: variable and data

  • It then follows the steps between the curly braces { }

    – Each of the generic arguments is used inside the mean and median functions.

  • Copy and paste the code above into an R script and run it.

  • The mm_diff function will appear in your Environment pane.

Using mm_diff()

  • After running the code used to create the function, we can use it just like we would any other numerical summary.

    – In the console, fill in the blanks below to calculate the absolute difference between the mean and median values of your predominant color:

    ____(~____, data = ____)
    
  • Which of the four colors has the largest absolute difference between the mean and median values?

    By examining a dotPlot for this personality color, make an argument why either the mean or median would be the better description of the center of the data.

Our first function

  • Using the previous example as a guide, create a function called Range (Note the capial 'R') that calculates the range of a variable by filling in the blanks below:

    ____ <- function (____, ____) {
      values <- range(____, data = ____)
      diff(___)
    }
    
  • Use the Range function to find the personality color with the largest difference between the max and min values.

On your own

  • Create a function called myIQR that uses the quantile function to compute the middle 30% of the data.