LAB 4H: Finding Clusters
Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.
Clustering data
-
We've seen previously that data scientists have methods to predict values of specific variables.
– We used regression to predict numerical values and classification to predict categories.
-
Clustering is similar to classification in that we want to group people into categories. But there's one important difference:
– In clustering, we don't know how many groups to use because we're not predicting the value of a known variable!
-
In this lab, we'll learn how to use the k-means clustering algorithm to group our data into clusters.
The k-means algorithm
-
The k-means algorithm works by splitting our data into k different clusters.
– The number of clusters, the value of k, is chosen by the data scientist.
-
The algorithm works only for numerical variables and only when we have no missing data.
-
To start, use the data function to load the futbol data set.
– This data contains 23 players from the US Men's National Soccer team (USMNT) and 22 quarterbacks from the National Football League (NFL).
-
Create a scatterplot of the players' ht_inches and wt_lbs and color each dot based on the league they play for.
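The idea of splitting numerical data into k clusters, and the need to remove missing values first, can be sketched with base R's kmeans function. This is only an illustration under stated assumptions: it uses made-up data (the hypothetical toy data frame below), not the futbol data, and it uses kmeans rather than the kclusters function from this lab, which may behave differently in its details.

```r
# Hypothetical example data: two groups of points with different centers.
set.seed(1)
toy <- data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 10)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 10))
)
toy$x[3] <- NA   # simulate one missing value

# k-means cannot handle missing values, so drop incomplete rows first.
toy_complete <- na.omit(toy)

# Ask for k = 2 clusters; nstart = 10 tries several random starting points
# and keeps the best result.
fit <- kmeans(toy_complete, centers = 2, nstart = 10)

fit$cluster   # cluster number (1 or 2) assigned to each remaining row
fit$centers   # mean x and mean y of each cluster
```

The cluster numbers themselves carry no meaning: which group is called 1 and which is called 2 can change from run to run.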
Running k-means
-
After plotting the players' heights and weights, we can see that there are two clusters, or different types, of players:
– Players in the NFL tend to be taller and weigh more than the shorter and lighter USMNT players.
-
Fill in the blanks below to use k-means to cluster the same height and weight data into two groups:
kclusters(____~____, data = futbol, k = ____)
-
Use this code and the mutate function to add the values from kclusters to the futbol data. Call the variable clusters.
k-means vs. ground-truth
-
In comparing our football and soccer players, we know for certain which league each player plays in.
– We call this knowledge ground-truth.
-
Knowing the ground-truth for this example is helpful to illustrate how k-means works, but in reality, data scientists would run k-means without knowing the ground-truth.
-
Compare the clusters chosen by k-means to the ground-truth. How successful was k-means at recovering the league information?
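One possible way to compare cluster assignments against known groups is a two-way table. The sketch below uses base R's table function on made-up labels (the truth and clusters vectors are hypothetical and are not taken from the futbol data), so it shows the technique rather than the answer to the question above.

```r
# Hypothetical labels: the true group of ten items and the cluster
# number an algorithm assigned to each one.
truth    <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B")
clusters <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2)

# Rows are the true groups, columns are the clusters.
comparison <- table(truth, clusters)
comparison

# Cluster numbers are arbitrary, so pair each cluster with its most
# common true group and count how many items fall in those pairs.
matched <- sum(apply(comparison, 2, max))
matched / length(truth)   # share of items grouped with their cluster's majority
```

A table with most counts concentrated in one cell per column suggests the clusters line up well with the known groups; counts spread evenly across a column suggest they do not.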
On your own
-
Load your class's timeuse data (remember to run timeuse_format so each row represents the mean time each student spent participating in the various activities).
-
Create a scatterplot of the homework and videogames variables.
– Based on this graph, identify and remove any outliers by using the filter function.
-
Use kclusters with k=2 for homework and videogames.
– Describe how the groups differ from each other in terms of how long each group spends playing videogames and doing homework.