# LAB 4H: Finding Clusters

## Lab 4H - Finding clusters

### Clustering data

• We've seen previously that data scientists have methods to predict values of specific variables.

– We used regression to predict numerical values and classification to predict categories.

• Clustering is similar to classification in that we want to group people into categories. But there's one important difference:

– In clustering, we don't know how many groups to use because we're not predicting the value of a known variable!

• In this lab, we'll learn how to use the k-means clustering algorithm to group our data into clusters.

### The k-means algorithm

• The k-means algorithm works by splitting our data into k different clusters.

– The number of clusters, the value of k, is chosen by the data scientist.

• The algorithm works only for numerical variables and only when we have no missing data.

• To start, use the `data` function to load the `futbol` data set.

– This data contains 23 players from the US Men's National Soccer team (USMNT) and 22 quarterbacks from the National Football League (NFL).

• Create a scatterplot of the players `ht_inches` and `wt_lbs` and color each dot based on the `league` they play for.

### Running k-means

• After plotting the player's heights and weights, we can see that there are two clusters, or different types, of players:

– Players in the NFL tend to be taller and weigh more than the shorter and lighter USMNT players.

• Fill in the blanks below to use k-means to cluster the same height and weight data into two groups:

``````kclusters(____~____, data = futbol, k = ____)
``````
• Use this code and the mutate function to add the values from `kclusters` to the `futbol` data. Call the variable `clusters`.

### k-means vs. ground-truth

• In comparing our football and soccer players, we know for certain which league each player plays in.

– We call this knowledge ground-truth.

• Knowing the ground-truth for this example is helpful to illustrate how k-means works, but in reality, data-scientists would run k-means not knowing the ground-truth.

• Compare the clusters chosen by k-means to the ground-truth. How successful was k-means at recovering the `league` information?

• Load your class' `timeuse` data (remember to run `timeuse_format` so each row represents the mean time each student in spent participating in the various activities).
• Create a scatterplot of `homework` and `videogames` variables.
– Based on this graph, identify and remove any outliers by using the `subset` function.
• Use `kclusters` with `k=2` for `homework` and `videogames`.
Describe how the groups differ from eachother in terms of how long each group spends playing `videogames` and doing `homework`.