# Essential Concepts

**IDS Unit 4: Essential Concepts**

__Lesson 1: Water Usage__

Data can be used to make predictions. Official data sets rely on censuses or random samples and can be used to make generalizations. On the other hand, data from Participatory Sensing campaigns are not random: they depend on sensors (in our case, humans) to gather them, which limits our ability to generalize.

__Lesson 2: Exploring Water Usage__

Exploring different data sets can give us insight into the same process. Comparing an official data set with a Participatory Sensing data set can yield more information than either data set alone. Research questions provide an overall direction for making comparisons between data sets.

__Lesson 3: Evaluating and Implementing a Water Campaign__

Statistical questions guide a Participatory Sensing campaign so that we can learn about a community or ourselves. These campaigns should be evaluated before they are implemented to make sure they are reasonable and ethically sound.

__Lesson 4: Learning About Our Water Campaign__

Statistical questions guide a Participatory Sensing campaign so that we can learn about a community or ourselves. These campaigns should be tried out before they are implemented, to make sure they collect the data they are meant to collect, and then refined accordingly.

__Lesson 5: Statistical Predictions using One Variable__

Anyone can make a prediction, but statisticians measure how successful their predictions are. This lesson encourages the classroom to consider different measures of success.

__Lesson 6: Statistical Predictions by Applying the Rule__

If we use the mean squared residuals rule, then the mean of our current data is the best prediction of future values. If we use the mean absolute error rule, then the median of the current data is the best prediction of future values.
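
The two rules can be checked numerically. The following is a minimal sketch in Python, assuming a small made-up data set (`gallons`) and a simple grid of candidate guesses; the function names are illustrative and not part of the IDS materials.

```python
from statistics import mean, median

def mean_squared_error(data, guess):
    """Average of the squared residuals (data value minus guess)."""
    return mean((value - guess) ** 2 for value in data)

def mean_absolute_error(data, guess):
    """Average of the absolute residuals (data value minus guess)."""
    return mean(abs(value - guess) for value in data)

# Hypothetical data, invented for illustration only.
gallons = [2, 3, 3, 4, 18]

best_by_squares = mean(gallons)     # prediction suggested by the squared residuals rule
best_by_absolute = median(gallons)  # prediction suggested by the absolute error rule

# Try every guess from 0.0 to 20.0 in steps of 0.1 and keep the one
# with the lowest score under each rule.
candidates = [step / 10 for step in range(0, 201)]
grid_best_squares = min(candidates, key=lambda g: mean_squared_error(gallons, g))
grid_best_absolute = min(candidates, key=lambda g: mean_absolute_error(gallons, g))

print("mean:", best_by_squares, "grid winner (squared):", grid_best_squares)
print("median:", best_by_absolute, "grid winner (absolute):", grid_best_absolute)
```

In this sketch the grid search lands on the mean under the squared rule and on the median under the absolute rule. The single large value (18) pulls the mean upward but leaves the median unchanged, which is why the two rules can recommend different predictions.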

__Lesson 7: Statistical Predictions Using Two Variables__

When predicting values of a variable y that is associated with x, we can improve our predictions by using our knowledge about x. Basically, we “subset” the data for a given value of x and use the mean of y for that subset as our prediction. If the resulting means follow a trend, we can model this trend to generalize to as-yet unseen values of x.
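
As a rough illustration of the subsetting idea, the sketch below groups made-up (x, y) pairs by x and predicts with each group's mean. The data, the variable meanings, and the fallback to the overall mean are assumptions chosen for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (x, y) pairs invented for illustration:
# x = people in a household, y = gallons of water used per day.
observations = [(1, 40), (1, 50), (2, 80), (2, 100), (3, 130), (3, 150)]

def mean_y_by_x(pairs):
    """Subset the data by each value of x and return the mean y of each subset."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    return {x: mean(ys) for x, ys in groups.items()}

group_means = mean_y_by_x(observations)
overall_mean = mean(y for _, y in observations)

def predict(x):
    """Predict y with the subset mean for x; use the overall mean if x was never seen."""
    return group_means.get(x, overall_mean)

print(group_means)
print(predict(2), predict(7))
```

The fallback for an unseen x (here `predict(7)`) ignores x entirely, which is exactly the gap that modeling the trend in the subset means is meant to close.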

__Lesson 8: What’s the Trend?__

Associations are important because they help us make better predictions; the stronger the trend, the better the prediction we can make. “Better” in this case means that our mean squared residuals can be made smaller.

__Lesson 9: Spaghetti Line__

We can often use a straight line to summarize a trend. “Eyeballing” a straight line through a scatterplot is one way to do this.

__Lesson 10: Predicting Values__

The regression line can be used to make good predictions about values of y for any given value of x. This works for exactly the same reason the mean works well for one variable: the predictions will make your score on the mean squared residuals as small as possible.
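
To make the comparison concrete, here is a small sketch that fits a least-squares line by hand and scores it against two alternatives: an "eyeballed" line and the one-variable mean. The data and the eyeballed slope and intercept are made up for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def mean_squared_residual(xs, ys, slope, intercept):
    """Score a line: the average squared gap between actual and predicted y."""
    return mean((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
regression_score = mean_squared_residual(xs, ys, slope, intercept)
eyeballed_score = mean_squared_residual(xs, ys, 1.0, 1.0)      # a guessed line: y = 1 + x
mean_only_score = mean_squared_residual(xs, ys, 0.0, mean(ys))  # ignore x, always predict mean y

print(slope, intercept)
print(regression_score, eyeballed_score, mean_only_score)
```

On these numbers the regression line scores lower than both the guessed line and the flat line at the mean, which is the sense in which it is the "best" line under the mean squared residuals rule.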

__Lesson 11: How Strong Is It?__

A high absolute value for correlation means a strong linear trend. A value close to 0 means a weak linear trend.
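
The sketch below computes the correlation coefficient directly from its definition for three made-up data sets: one with a fairly strong positive trend, one with a perfect negative trend, and one with no linear trend. The data are assumptions chosen so the contrast is easy to see.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
fairly_strong = [2, 4, 5, 4, 5]     # rises with x, with some scatter
perfect_negative = [10, 8, 6, 4, 2]  # falls on an exact straight line
no_linear_trend = [3, 1, 4, 1, 3]    # no overall rise or fall

r_strong = correlation(xs, fairly_strong)
r_negative = correlation(xs, perfect_negative)
r_none = correlation(xs, no_linear_trend)

print(round(r_strong, 3), round(r_negative, 3), round(r_none, 3))
```

Note that the perfect negative trend has a large absolute value even though the coefficient itself is negative: the sign gives the direction of the trend, and the absolute value gives its strength.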

__Lesson 12: More Variables to Make Better Predictions__

We can use scatterplots to assess which variables might lead to strong predictive models. Sometimes using several predictors in one model can produce stronger models.

__Lesson 13: Combination of Variables__

If multiple predictors are associated with the response variable, combining them in one model can produce better predictions, as measured by the mean absolute error.
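
One way to see this is to fit a model with one predictor and a model with two, then compare their mean absolute errors. The sketch below is an illustration under stated assumptions: the data are made up, the models are fit by ordinary least squares (which minimizes squared residuals, not absolute error), and the small equation solver is written out only to keep the example self-contained.

```python
from statistics import mean

def solve(a, b):
    """Solve the square linear system a * coeffs = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(rows, ys):
    """Least-squares coefficients [b0, b1, ...] for y = b0 + b1*x1 + ..."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

def predict(coeffs, row):
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], row))

def mean_absolute_error(coeffs, rows, ys):
    return mean(abs(y - predict(coeffs, row)) for row, y in zip(rows, ys))

# Hypothetical data: (people in household, shower minutes) -> gallons used.
rows = [(1, 5), (2, 3), (3, 8), (4, 2), (5, 7), (6, 4)]
ys = [26, 25, 41, 35, 48, 48]

one_predictor_rows = [(people,) for people, _ in rows]
mae_one = mean_absolute_error(fit(one_predictor_rows, ys), one_predictor_rows, ys)
mae_two = mean_absolute_error(fit(rows, ys), rows, ys)

print(round(mae_one, 2), round(mae_two, 2))
```

In this made-up example the second predictor carries information the first one lacks, so the two-predictor model has the smaller mean absolute error. A predictor that is not associated with the response would not be expected to help in the same way.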

__Lesson 14: Improving your Model__

If a linear model is fit to a non-linear trend, it will not do a good job of predicting. For this reason, we need to identify non-linear trends by looking at a scatterplot and then choose a model that matches the trend.
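
The sketch below illustrates the point with made-up data that curve upward. It fits a straight line in x and, as one possible way to match the curve, a straight line in x squared; the choice of a quadratic shape is an assumption for this example, not a rule from the lesson.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def mean_squared_residual(xs, ys, slope, intercept):
    return mean((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data with a curved (roughly quadratic) trend.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [1, 0, 4, 10, 15, 25, 37]

# Model 1: a straight line in x.
line_slope, line_intercept = fit_line(xs, ys)
line_score = mean_squared_residual(xs, ys, line_slope, line_intercept)

# Model 2: a straight line in x squared, which bends to follow the curve.
xs_squared = [x ** 2 for x in xs]
curve_slope, curve_intercept = fit_line(xs_squared, ys)
curve_score = mean_squared_residual(xs_squared, ys, curve_slope, curve_intercept)

print(round(line_score, 2), round(curve_score, 2))
```

With these numbers the model that matches the curved shape has a much smaller mean squared residual than the straight line, even though both are fit with the same method.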

__Lesson 15: The Growth of Landfills__

Modeling does not always have to produce an equation. Instead, we can create models to address real-world problems related to our community.

__Lesson 16: Exploring Trash via the Dashboard__

Exploring the IDS Dashboard provides a visual approach to data analysis.

__Lesson 17: Exploring Trash via RStudio__

RStudio can be used to verify initial findings from data analysis done via the IDS Dashboard.

__Lesson 18: Grow Your Own Classification Tree__

Many data sets have multiple predictors and are very non-linear. We can still use these data, but we need to model them differently, for example with a decision tree. Decision trees are a useful tool for classifying observations into groups.

__Lesson 19: Data Scientists or Doctors?__

We can judge the usefulness of different decision trees by comparing the number of misclassifications each one makes.
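
As a small illustration, the sketch below writes two very simple decision trees as Python functions and counts how many observations each one misclassifies. The athletes, the split values, and the labels are all made up for this example.

```python
# Hypothetical observations: (height in inches, weight in pounds, actual sport).
athletes = [
    (70, 160, "soccer"),
    (68, 150, "soccer"),
    (72, 175, "soccer"),
    (74, 240, "football"),
    (76, 300, "football"),
    (71, 210, "football"),
    (73, 180, "soccer"),
    (69, 195, "football"),
]

def tree_a(height, weight):
    """A one-split tree that only looks at height."""
    return "football" if height >= 73 else "soccer"

def tree_b(height, weight):
    """A one-split tree that only looks at weight."""
    return "football" if weight >= 200 else "soccer"

def misclassifications(tree, rows):
    """Count the observations whose predicted label differs from the actual label."""
    return sum(1 for height, weight, label in rows if tree(height, weight) != label)

errors_a = misclassifications(tree_a, athletes)
errors_b = misclassifications(tree_b, athletes)

print("tree A errors:", errors_a)
print("tree B errors:", errors_b)
```

On this made-up data the tree that splits on weight makes fewer mistakes, so it would be judged the more useful of the two here.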

__Lesson 20: Where Do I Belong?__

We can identify groups, or “clusters,” in data based on a few characteristics. For example, it is easy to classify a classroom into males and females, but what if you only knew each student’s arm span? How well could you classify their genders now?
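
One common way to find two groups from a single measurement is k-means clustering with k = 2. The sketch below is a simplified version of that technique applied to made-up arm spans; it is offered as an illustration and is not necessarily the method used in the lesson.

```python
from statistics import mean

def two_means(values, iterations=10):
    """Split values into two clusters by repeatedly assigning each value
    to its nearest center and then moving each center to its cluster mean."""
    centers = [min(values), max(values)]  # start from the two extremes
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for value in values:
            nearest = 0 if abs(value - centers[0]) <= abs(value - centers[1]) else 1
            groups[nearest].append(value)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical arm spans in centimeters, invented for illustration only.
arm_spans = [150, 152, 155, 158, 160, 171, 174, 176, 180, 183]

centers, clusters = two_means(arm_spans)
print(centers)
print(clusters)
```

The algorithm only sees arm span, so the clusters it finds are groups of similar measurements. How well those clusters line up with any known label, such as the one asked about above, has to be checked separately against the actual labels.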

__Lesson 21: Our Class Network__

Networks are formed when observations are interconnected. In a social setting, we can examine how different people are connected by tracing their relationships with other people in the network.
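
A network like this can be stored as a list of connections and explored with a few lines of code. The sketch below assumes a made-up set of friendships; the names and helper functions are hypothetical.

```python
# Hypothetical friendships, invented for illustration only.
friendships = [
    ("Ana", "Ben"),
    ("Ana", "Cal"),
    ("Ben", "Cal"),
    ("Cal", "Dee"),
    ("Dee", "Eli"),
]

def build_network(edges):
    """Map each person to the set of people they are directly connected to."""
    network = {}
    for a, b in edges:
        network.setdefault(a, set()).add(b)
        network.setdefault(b, set()).add(a)
    return network

def mutual_connections(network, person_a, person_b):
    """People who are directly connected to both person_a and person_b."""
    return network[person_a] & network[person_b]

network = build_network(friendships)
most_connected = max(network, key=lambda person: len(network[person]))

print(mutual_connections(network, "Ben", "Dee"))
print(most_connected, len(network[most_connected]))
```

Here Ben and Dee are not directly connected, but they share a connection through Cal, who also has the most direct connections in this small network.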