Unit 4 Vocabulary
census
an official count or survey of a population, typically recording various details of individuals
Classification and Regression Trees (CART)
a predictive algorithm used in machine learning; it explains how a target variable's values can be predicted from the values of other variables
classify
to identify which of a set of categories (sub-populations) an observation (or set of observations) belongs to
cluster
a group of similar things or people positioned or occurring closely together
clustering
the process of grouping a set of objects (or people) in such a way that objects (or people) in the same group are more similar to each other than to those in other groups
correlation coefficient
a statistical measure of the strength and direction of the linear relationship between two variables; its value ranges from -1 to 1
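As an illustration, the sketch below computes one common version of this measure, the Pearson correlation coefficient, using only plain Python. The function name and the hours/scores data are made up for the example and are not part of the unit materials.

```python
import math

def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of the products of each pair's deviations from the means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: product of the square roots of the summed squared deviations.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_movement / (spread_x * spread_y)

# Hypothetical data: hours studied and test scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]
r = correlation_coefficient(hours, scores)
print(round(r, 2))
```

A value near 1 suggests a strong positive association, a value near -1 a strong negative association, and a value near 0 little or no linear association.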
decision tree
a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance outcomes
k-means
a clustering algorithm that aims to partition data into k clusters so that data points in the same cluster are similar and data points in different clusters are farther apart
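A minimal sketch of the idea, assuming 2-D points, squared straight-line distance, and starting centers chosen by hand (real tools usually pick starting centers at random). The helper names `assign` and `k_means` and the data are hypothetical.

```python
def assign(points, centers):
    """Put each point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
        clusters[distances.index(min(distances))].append(p)
    return clusters

def k_means(points, centers, iterations=10):
    """Repeat two steps: assign points to centers, then move the centers."""
    for _ in range(iterations):
        clusters = assign(points, centers)
        new_centers = []
        for center, cluster in zip(centers, clusters):
            if cluster:
                # Move the center to the mean of the points assigned to it.
                new_centers.append((sum(p[0] for p in cluster) / len(cluster),
                                    sum(p[1] for p in cluster) / len(cluster)))
            else:
                # An empty cluster keeps its old center.
                new_centers.append(center)
        centers = new_centers
    return centers, assign(points, centers)

# Made-up data with two visible groups, so k = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = k_means(points, centers=[(0, 0), (10, 10)])
print(centers)
print(clusters)
```

The number of starting centers passed in is the k in k-means; here the three points near (1, 1) end up in one cluster and the three points near (8, 8) in the other.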
linear
used to describe a straight-line relationship between two variables
line of best fit
a line through a scatterplot of data points that best expresses the relationship between those points
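One common way to choose such a line is least squares, which picks the slope and intercept that make the squared vertical distances to the points as small as possible. The sketch below assumes that method; the function name and the data are made up for illustration.

```python
def line_of_best_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how much y changes, on average, for each one-unit change in x.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    # The least-squares line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied and test scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]
slope, intercept = line_of_best_fit(hours, scores)

# Use the line to predict the score for a student who studies 6 hours.
predicted_score = slope * 6 + intercept
print(slope, intercept, predicted_score)
```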
market
as used here, market data: the live stream of trade-related data, which encompasses information such as price, bid/ask quotes, and market volume
mean absolute error
the average amount of error in your predictions or measurements; it is the mean of the absolute differences between each predicted (or measured) value and the "true" value
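As a small worked example, mean absolute error can be computed by averaging the sizes of the errors, ignoring whether each one is too high or too low. The function name and the numbers are invented for illustration.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up "true" values and the values a model predicted for them.
actual = [3, 5, 8, 10]
predicted = [2, 5, 10, 9]

# Absolute errors are 1, 0, 2, 1, so the mean is 4 / 4.
mae = mean_absolute_error(actual, predicted)
print(mae)
```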
mean squared error
tells you how close a regression line is to a set of points; it is the average of the squared differences between the predicted values and the actual values
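The sketch below shows the calculation and also why squaring matters: one large miss raises the mean squared error much more than several small misses. The function name and both sets of predictions are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [10, 10, 10, 10]
small_misses = [9, 11, 9, 11]    # every prediction is off by 1
one_big_miss = [10, 10, 10, 14]  # three exact predictions, one off by 4

mse_small = mean_squared_error(actual, small_misses)  # (1 + 1 + 1 + 1) / 4
mse_big = mean_squared_error(actual, one_big_miss)    # (0 + 0 + 0 + 16) / 4
print(mse_small, mse_big)
```

Both sets of predictions are off by 1 on average, but the set with the single large miss has the larger mean squared error.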
misclassification rate
the proportion of observations that were predicted to be in one category but were actually in another
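A short sketch of the calculation: count the predictions that do not match the actual category, then divide by the total number of observations. The function name and the labels are made up for the example.

```python
def misclassification_rate(actual, predicted):
    """Fraction of observations whose predicted category is wrong."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

# Hypothetical labels for five emails.
actual = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "spam", "ham", "ham", "ham"]

# Two of the five predictions are wrong.
rate = misclassification_rate(actual, predicted)
print(rate)
```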
model
a simplified version or representation of real-life situations or data; it is used to make sense of data or to make predictions based on it
negative association
when the values of one variable tend to decrease as the values of the other variable increase
network
a system designed to transfer data from one network access point to one or more other network access points via data switching, transmission lines, and system controls
no association
when there is no apparent relationship between two variables; the points in a scatterplot are scattered and do not follow a line or any other pattern
nodes
points of intersection or connection within a data communication network
non-linear
describes a relationship that does not follow a straight line; in non-linear regression, observational data are modeled by a function that is a nonlinear combination of the model parameters and depends on one or more independent variables, and the data are fitted by a method of successive approximations
observed value
the value that is actually observed (what actually happened)
polynomial trends
patterns in data that are curved or that break from a straight linear trend; they often occur in large sets of data that contain many fluctuations
positive association
when the values of one variable tend to increase as the values of the other variable increase
predicted value
the value that a model projects for an observation, for example the value given by the equation of the line of best fit
regression line
a line that best describes the behavior of a set of data
residual
the difference between our prediction and the actual outcome; also called an "error"
rule
a set way to calculate or solve a problem
shape
describes the distribution (or pattern) of the data within a dataset
strength of association
how much two variables covary and the extent to which the independent variable affects the dependent variable
testing data
a random subset consisting of about 15-25% of the original dataset on which a model is tested
training data
a random subset consisting of about 75-85% of the original dataset on which a model is trained
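As a rough sketch, a dataset can be divided into training and testing data by shuffling its rows and setting a fraction aside. The function name, the 20% test fraction (one value inside the range given above), and the fixed seed are assumptions made for this example.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then split it into (training, testing)."""
    shuffled = rows[:]
    # A fixed seed makes the random shuffle repeatable.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in dataset: 100 rows, represented here by the numbers 0 to 99.
rows = list(range(100))
training, testing = train_test_split(rows)
print(len(training), len(testing))
```

Every row lands in exactly one of the two subsets, so the model is tested on data it did not see during training.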
trend
the general pattern in a set of data; it is often shown with a trend line (line of best fit), which represents the behavior of the data and helps determine whether there is a pattern