LAB 4H: Finding Clusters
Directions: Follow along with the slides and answer the questions in bold font in your journal.
We've seen previously that data scientists have methods to predict values of specific variables.
– We used regression to predict numerical values and classification to predict categories.
Clustering is similar to classification in that we want to group people into categories. But there's one important difference:
– In clustering, we don't know how many groups to use because we're not predicting the value of a known variable!
In this lab, we'll learn how to use the k-means clustering algorithm to group our data into clusters.
The k-means algorithm
The k-means algorithm works by splitting our data into k different clusters.
– The number of clusters, the value of k, is chosen by the data scientist.
The algorithm works only for numerical variables and only when we have no missing data.
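As a rough illustration of these two requirements, the sketch below uses base R's kmeans function (from the stats package) rather than the kclusters function used in this lab. The data frame toy and all of its values are made up for illustration; they are not the futbol data.

```r
# Made-up heights (inches) and weights (pounds); NOT the futbol data.
toy <- data.frame(
  height = c(68, 69, 70, 71, 75, 76, 77, NA),
  weight = c(150, 155, 160, 165, 220, 225, 230, 210)
)

# k-means needs complete numerical data, so drop rows with missing values.
toy_complete <- na.omit(toy)

# The data scientist chooses k; here we ask for k = 2 clusters.
set.seed(123)  # k-means starts from random centers, so fix the seed
fit <- kmeans(toy_complete, centers = 2, nstart = 10)

# One cluster number (1 or 2) for each remaining row.
fit$cluster
```

The cluster numbers themselves carry no meaning: which group is called 1 and which is called 2 can change from run to run.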
To start, use the data function to load the futbol data.
– This data contains 23 players from the US Men's National Soccer team (USMNT) and 22 quarterbacks from the National Football League (NFL).
Create a scatterplot of the players' heights and weights (wt_lbs) and color each dot based on the league they play for.
After plotting the players' heights and weights, we can see that there are two clusters, or different types, of players:
– Players in the NFL tend to be taller and weigh more than the shorter and lighter USMNT players.
Fill in the blanks below to use k-means to cluster the same height and weight data into two groups:
kclusters(____~____, data = futbol, k = ____)
Use this code and the mutate function to add the values from kclusters to the futbol data. Call the variable
k-means vs. ground-truth
In comparing our football and soccer players, we know for certain which league each player plays in.
– We call this knowledge ground-truth.
Knowing the ground-truth for this example helps illustrate how k-means works, but in reality, data scientists would run k-means without knowing the ground-truth.
Compare the clusters chosen by k-means to the ground-truth. How successful was k-means at recovering the
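One simple way to make this kind of comparison is a two-way table of cluster numbers against the known groups. The sketch below uses base R's table function. The vectors truth and cluster hold made-up labels for six hypothetical players; they are not results from the futbol data.

```r
# Made-up labels for illustration; NOT results from the futbol data.
truth   <- c("NFL", "NFL", "NFL", "USMNT", "USMNT", "USMNT")
cluster <- c(1, 1, 2, 2, 2, 2)

# Rows are the true leagues, columns are the k-means cluster numbers.
comparison <- table(truth, cluster)
comparison

# Cluster numbers are arbitrary, so match each cluster to its most common
# league, then compute the share of players that agree with that match.
agreement <- sum(apply(comparison, 2, max)) / length(truth)
agreement
```

In this made-up example, five of the six players fall in the cluster that matches their league, so the agreement is 5/6.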
On your own
Load your class' timeuse data (remember to run timeuse_format so each row represents the mean time each student spent participating in the various activities).
Create a scatterplot of
– Based on this graph, identify and remove any outliers by using the
– Describe how the groups differ from each other in terms of how long each group spends playing