IDS Unit 4: Essential Concepts
Data can be used to make predictions. Official data sets rely on censuses or random samples and can be used to make generalizations. On the other hand, data from Participatory Sensing campaigns are not random: they rely on sensors (in our case, humans) to gather them, which limits our ability to generalize.
Exploring different data sets can give us insight about the same processes. Information from an official data set compared with a Participatory Sensing data set can yield more information than one data set alone. Research questions provide an overall direction to make comparisons between data sets.
Statistical questions guide a Participatory Sensing campaign so that we can learn about a community or ourselves. These campaigns should be evaluated before implementing to make sure they are reasonable and ethically sound.
Statistical questions guide a Participatory Sensing campaign so that we can learn about a community or ourselves. These campaigns should be tried out before they are implemented, to make sure they collect the data they are meant to collect, and then refined accordingly.
Anyone can make a prediction. But statisticians measure the success of their predictions. This lesson encourages the classroom to consider different measures of success.
If we use the mean squared residuals rule, then the mean of our current data is the best prediction of future values. If we use the mean absolute error rule, then the median of the current data is the best prediction of future values.
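The claim above can be checked by brute force. The sketch below uses made-up numbers and hypothetical helper names (mean_squared_residual, mean_absolute_error); it tries many candidate predictions and keeps the one that scores best under each rule.

```python
from statistics import mean, median

# Hypothetical data, made up for illustration only.
values = [2, 3, 3, 4, 10]

def mean_squared_residual(data, guess):
    """Average of the squared distances between each value and the guess."""
    return sum((v - guess) ** 2 for v in data) / len(data)

def mean_absolute_error(data, guess):
    """Average of the absolute distances between each value and the guess."""
    return sum(abs(v - guess) for v in data) / len(data)

# Candidate predictions 0.0, 0.1, ..., 12.0.
candidates = [c / 10 for c in range(0, 121)]

# Keep the candidate with the lowest score under each rule.
best_by_squares = min(candidates, key=lambda c: mean_squared_residual(values, c))
best_by_absolute = min(candidates, key=lambda c: mean_absolute_error(values, c))

print("best under mean squared residuals:", best_by_squares, "| mean:", mean(values))
print("best under mean absolute error:", best_by_absolute, "| median:", median(values))
```

With these numbers, the single large value (10) pulls the mean upward but leaves the median alone, so the two rules favor different predictions.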
When predicting values of a variable y, and if y is associated with x, then we can get improved predictions by using our knowledge about x. Basically, we “subset” the data for a given value of x, and use the mean y for those subset values. If the resulting means follow a trend, we can model this trend to generalize to as-yet unseen values of x.
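The "subset, then take the mean" idea can be sketched as follows. The data and variable meanings (hours of sleep, quiz score) are hypothetical and only serve to show the mechanics.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (x, y) pairs: x = hours of sleep, y = quiz score.
data = [(6, 70), (6, 74), (7, 78), (7, 82), (8, 85), (8, 89)]

# Subset the y values by their x value.
groups = defaultdict(list)
for x, y in data:
    groups[x].append(y)

# The prediction for a given x is the mean y within that subset.
mean_y_by_x = {x: mean(ys) for x, ys in sorted(groups.items())}

# Prediction that ignores x: the overall mean of y.
overall_mean = mean(y for _, y in data)

# Compare the two approaches using mean squared residuals.
msr_ignoring_x = mean((y - overall_mean) ** 2 for _, y in data)
msr_using_x = mean((y - mean_y_by_x[x]) ** 2 for x, y in data)

print(mean_y_by_x)
print("ignoring x:", msr_ignoring_x, "| using x:", msr_using_x)
```

Here the subset means rise as x rises, which is the kind of trend a model could then extend to values of x that were not observed.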
Associations are important because they help us make better predictions; the stronger the trend, the better the prediction we can make. “Better” in this case means that our mean squared residuals can be made smaller.
We can often use a straight line to summarize a trend. “Eyeballing” a straight line through a scatterplot is one way to do this.
The regression line can be used to make good predictions about values of y for any given value of x. This works for exactly the same reason the mean works well for one variable: the predictions will make your score on the mean squared residuals as small as possible.
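A minimal sketch of this idea, using made-up data and the standard least-squares formulas for slope and intercept: the fitted line is scored with mean squared residuals and compared against a few slightly different lines.

```python
from statistics import mean

# Hypothetical (x, y) data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def mean_squared_residual(slope, intercept):
    """Score a line: average of squared (actual y - predicted y)."""
    return mean((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

# Least-squares slope and intercept.
x_bar, y_bar = mean(x), mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

best_score = mean_squared_residual(slope, intercept)

# Nudge the slope and intercept; each nudged line should score worse.
nudged_scores = [
    mean_squared_residual(slope + ds, intercept + di)
    for ds in (-0.1, 0, 0.1)
    for di in (-0.5, 0, 0.5)
    if (ds, di) != (0, 0)
]

print("regression line: y =", slope, "* x +", intercept)
print("its score:", best_score, "| best nudged score:", min(nudged_scores))
```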
A high absolute value for correlation means a strong linear trend. A value close to 0 means a weak linear trend.
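To make this concrete, the sketch below computes the correlation coefficient by hand for three made-up data sets. The function name correlation and all numbers are illustrative assumptions.

```python
from math import sqrt
from statistics import mean

def correlation(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y_strong_up = [2.1, 3.9, 6.2, 7.8, 10.1]    # rises steadily with x
y_strong_down = [10.1, 7.8, 6.2, 3.9, 2.1]  # falls steadily with x
y_weak = [5, 1, 4, 2, 5]                    # no clear linear pattern

r_up = correlation(x, y_strong_up)
r_down = correlation(x, y_strong_down)
r_weak = correlation(x, y_weak)

print("strong increasing trend:", r_up)
print("strong decreasing trend:", r_down)
print("weak trend:", r_weak)
```

The decreasing data set gives a strongly negative value, which is why the strength of the trend is judged by the absolute value rather than the sign.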
We can use scatterplots to assess which variables might lead to strong predictive models. Sometimes using several predictors in one model can produce stronger models.
If multiple predictors are associated with the response variable, a better predictive model will be produced, as measured by the mean absolute error.
If a linear model is fit to a non-linear trend, it will not do a good job of predicting. For this reason, we need to look at a scatterplot to identify non-linear trends, and then choose a model that matches the trend.
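As a small illustration, the sketch below builds made-up data that follow a curve (y = x squared, an assumption chosen for the example), fits a least-squares line to them, and compares its mean squared residuals with those of a model that matches the curve.

```python
from statistics import mean

# Hypothetical data built to follow a curved trend: y = x squared.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]

# Fit a straight line with the least-squares formulas.
x_bar, y_bar = mean(x), mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# Score the line and a model that matches the curve.
line_score = mean((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
curve_score = mean((yi - xi ** 2) ** 2 for xi, yi in zip(x, y))

print("fitted line: y =", slope, "* x +", intercept)
print("line score:", line_score, "| curve score:", curve_score)
```

The fitted line comes out flat because the left and right halves of the curve cancel each other, so it predicts the same y for every x even though the data clearly change with x.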
Modeling does not always have to produce an equation. Instead, we can create models to answer real-world problems related to our community.
Exploring the IDS Dashboard provides a visual approach to data analysis.
RStudio can be used to verify initial results/findings from data analysis done via the IDS Dashboard.
Many data sets have multiple predictors and are strongly non-linear. We can still use these data, but we need to model them differently, for example with a decision tree. Decision trees are a useful tool for classifying observations into groups.
We can compare the usefulness of different decision trees by counting the number of misclassifications each one makes.
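A sketch of this comparison, under loud assumptions: the athlete data, the split points, and the two hand-written trees (tree_a, tree_b) are all invented for illustration and were not produced by any tree-fitting software.

```python
# Hypothetical athletes: (height in cm, weight in kg, actual sport).
players = [
    (201, 98, "basketball"), (198, 95, "basketball"), (193, 88, "basketball"),
    (188, 110, "football"), (185, 105, "football"), (191, 112, "football"),
    (180, 75, "soccer"), (175, 70, "soccer"), (183, 78, "soccer"),
]

def tree_a(height, weight):
    """One split, on height only."""
    return "basketball" if height >= 190 else "soccer"

def tree_b(height, weight):
    """Two splits: weight first, then height."""
    if weight >= 100:
        return "football"
    if height >= 190:
        return "basketball"
    return "soccer"

def count_misclassifications(tree):
    """Number of players whose predicted sport differs from the actual one."""
    return sum(1 for h, w, sport in players if tree(h, w) != sport)

errors_a = count_misclassifications(tree_a)
errors_b = count_misclassifications(tree_b)
print("tree A misclassifications:", errors_a)
print("tree B misclassifications:", errors_b)
```

Tree A can never predict "football", so it misclassifies every football player. The second split in tree B fixes that for this particular data set.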
We can identify groups, or “clusters,” in data based on a few characteristics. For example, it is easy to classify a classroom into males and females, but what if you only knew each student’s arm span? How well could you classify their genders now?
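One way to look for two clusters using a single measurement is sketched below. It uses a simple two-group version of k-means, which is a technique chosen for this example and not one named in the text; the arm spans are made up and the function name two_means_1d is hypothetical.

```python
def two_means_1d(values, rounds=10):
    """Split numbers into two clusters around two centers (assumes at least two distinct values)."""
    low_center, high_center = min(values), max(values)
    for _ in range(rounds):
        # Assign each value to whichever center is closer.
        low_group = [v for v in values if abs(v - low_center) <= abs(v - high_center)]
        high_group = [v for v in values if abs(v - low_center) > abs(v - high_center)]
        # Move each center to the mean of its group.
        low_center = sum(low_group) / len(low_group)
        high_center = sum(high_group) / len(high_group)
    return low_group, high_group, low_center, high_center

# Hypothetical arm spans in centimeters, made up for illustration only.
arm_spans = [150, 152, 155, 158, 160, 170, 172, 175, 178, 181]

low_group, high_group, low_center, high_center = two_means_1d(arm_spans)
print("cluster 1:", low_group, "center:", low_center)
print("cluster 2:", high_group, "center:", high_center)
```

The method only finds groups of similar arm spans. Whether those groups line up with any known category, such as gender, has to be checked separately against the actual labels.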
Networks are made when observations are interconnected. In a social setting, we can examine how different people are connected by finding the relationships they share with other people in the network.
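A small network can be sketched as a dictionary that maps each person to the set of people they are connected to. The names and connections below are hypothetical, and mutual_connections is an illustrative helper.

```python
# Hypothetical friendship network: each person maps to the set of their friends.
friends = {
    "Ana": {"Ben", "Cai"},
    "Ben": {"Ana", "Cai", "Dee"},
    "Cai": {"Ana", "Ben"},
    "Dee": {"Ben", "Eli"},
    "Eli": {"Dee"},
}

def mutual_connections(person_a, person_b):
    """People who are connected to both person_a and person_b."""
    return friends[person_a] & friends[person_b]

# Ana and Dee are not friends directly, but they may share a connection.
shared = mutual_connections("Ana", "Dee")

# The person with the most direct connections.
most_connected = max(friends, key=lambda person: len(friends[person]))

print("Ana and Dee are both connected to:", shared)
print("most connected person:", most_connected)
```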