LAB 4H: Finding Clusters
Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.
Clustering data

We've seen previously that data scientists have methods to predict values of specific variables.
– We used regression to predict numerical values and classification to predict categories.

Clustering is similar to classification in that we want to group people into categories. But there's one important difference:
– In clustering, we don't know how many groups to use because we're not predicting the value of a known variable!

In this lab, we'll learn how to use the kmeans clustering algorithm to group our data into clusters.
The kmeans algorithm

The kmeans algorithm works by splitting our data into k different clusters.
– The number of clusters, the value of k, is chosen by the data scientist.

The algorithm works only for numerical variables and only when we have no missing data.
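
As a rough illustration of these requirements, the sketch below uses base R's kmeans function (not the kclusters function used in this lab) on a small made-up data set. The data frame toy and its columns x and y are hypothetical and exist only for this example.

```r
# Made-up data for illustration only: two loose groups of points,
# plus one row with a missing value.
toy <- data.frame(
  x = c(1.0, 1.2, 0.8, 5.0, 5.3, 4.9, NA),
  y = c(2.0, 1.8, 2.2, 6.0, 6.1, 5.8, 6.0)
)

# kmeans cannot handle missing values, so keep only the complete rows.
toy_complete <- toy[complete.cases(toy), ]

# Ask for k = 2 clusters. The starting centers are chosen at random,
# so set a seed and try several starts (nstart) for a stable result.
set.seed(1)
fit <- kmeans(toy_complete, centers = 2, nstart = 10)

# One cluster label (1 or 2) for each remaining row.
fit$cluster
```

Both columns are numerical, and the row containing NA is dropped before clustering. Which group ends up labeled 1 and which is labeled 2 is arbitrary.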

To start, use the data function to load the futbol data set.
– This data contains 23 players from the US Men's National Soccer team (USMNT) and 22 quarterbacks from the National Football League (NFL).

Create a scatterplot of the players' ht_inches and wt_lbs and color each dot based on the league they play for.
Running kmeans

After plotting the players' heights and weights, we can see that there are two clusters, or different types, of players:
– Players in the NFL tend to be taller and weigh more than the shorter and lighter USMNT players.

Fill in the blanks below to use kmeans to cluster the same height and weight data into two groups:
kclusters(____~____, data = futbol, k = ____)

Use this code and the mutate function to add the values from kclusters to the futbol data. Call the variable clusters.
kmeans vs. groundtruth

In comparing our football and soccer players, we know for certain which league each player plays in.
– We call this knowledge groundtruth.

Knowing the groundtruth for this example is helpful to illustrate how kmeans works, but in reality, data scientists would run kmeans without knowing the groundtruth.

Compare the clusters chosen by kmeans to the groundtruth. How successful was kmeans at recovering the league information?
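
One possible way to make this kind of comparison is a two-way table. The sketch below uses base R's table function on made-up labels; the vectors truth and clusters are hypothetical and are not taken from the futbol data.

```r
# Hypothetical known groups (the "groundtruth") and cluster labels
# for eight made-up observations.
truth    <- c("A", "A", "A", "A", "B", "B", "B", "B")
clusters <- c(2, 2, 2, 1, 1, 1, 1, 1)

# Cross-tabulate: rows are the known groups, columns are the cluster numbers.
comparison <- table(truth, clusters)
comparison

# Cluster numbers are arbitrary labels, so pair each cluster with the
# known group it mostly contains (the largest count in each column).
recovered <- sum(apply(comparison, 2, max))

# Proportion of observations that land in the cluster matching their group.
recovered / length(truth)
```

In this made-up example, 7 of the 8 observations fall in the cluster that mostly contains their own group. A table with large counts in only one cell per column suggests the clusters line up well with the known groups.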
On your own

Load your class' timeuse data (remember to run timeuse_format so each row represents the mean time each student spent participating in the various activities).
Create a scatterplot of the homework and videogames variables.
– Based on this graph, identify and remove any outliers by using the filter function.
Use kclusters with k=2 for homework and videogames.
– Describe how the groups differ from each other in terms of how long each group spends playing videogames and doing homework.
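
One way to describe how clusters differ is to compare the mean of each variable within each cluster. The sketch below does this with base R's aggregate function on made-up data; the data frame toy_time and its columns activity_a and activity_b are hypothetical stand-ins, not the class timeuse data.

```r
# Made-up data: minutes spent on two hypothetical activities,
# with cluster labels already assigned.
toy_time <- data.frame(
  activity_a = c(30, 45, 40, 120, 110, 130),
  activity_b = c(90, 100, 80, 10, 20, 15),
  clusters   = c(1, 1, 1, 2, 2, 2)
)

# Mean of each variable within each cluster.
cluster_means <- aggregate(cbind(activity_a, activity_b) ~ clusters,
                           data = toy_time, FUN = mean)
cluster_means
```

Here cluster 1 spends less time on activity_a and more on activity_b than cluster 2, which is the kind of contrast the description above asks for.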