# Unit 4 Vocabulary

### census

an official count or survey of a population, typically recording various details of individuals

### Classification and Regression Trees (CART)

a predictive algorithm used in machine learning; it explains how a target variable's values can be predicted from the values of other variables

### classify

to identify which of a set of categories (sub-populations) an observation (or set of observations) belongs to

### cluster

a group of similar things or people positioned or occurring closely together

### clustering

the process of grouping a set of objects (or people) so that those in the same group are more similar to each other than to those in other groups

### correlation coefficient

a statistical measure, ranging from -1 to 1, of the strength and direction of the linear relationship between two variables
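
As a rough illustration, the sketch below computes the Pearson correlation coefficient, the most common version of this measure. The function name and the `hours` and `scores` data are made up for the example.

```python
import math

def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of each pair's deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: scores rise in lockstep with hours studied
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]
r = correlation_coefficient(hours, scores)
print(r)
```

A value near 1 suggests a strong positive association, a value near -1 a strong negative association, and a value near 0 little or no linear association.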

### decision tree

a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance outcomes
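
A decision tree can be read as a series of yes/no questions. The sketch below hand-codes a tiny tree; the questions, the 60-degree cutoff, and the outcomes are hypothetical and only meant to show the root, internal node, and leaf structure.

```python
def predict_activity(is_raining, temperature_f):
    """Hand-written decision tree with made-up split rules."""
    # Root node: the first question asked of every observation
    if is_raining:
        return "stay inside"  # leaf
    # Internal node: only reached when it is not raining
    if temperature_f >= 60:
        return "play outside"  # leaf
    return "stay inside"  # leaf

print(predict_activity(is_raining=False, temperature_f=72))
```

An algorithm such as CART would choose the questions and cutoffs from data rather than having them written by hand.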

### k-means

an algorithm that aims to partition data into k clusters so that data points in the same cluster are similar and data points in different clusters are farther apart
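
The idea can be sketched in a few lines. This is a minimal version for 2-D points that assumes the starting centroids are supplied by the caller and runs a fixed number of rounds; the points and starting guesses are invented for the example.

```python
def kmeans(points, centroids, iterations=10):
    """Very small k-means for 2-D points; `centroids` are the starting guesses."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                ))
            else:
                new_centroids.append(old)  # an empty cluster keeps its old centroid
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data with two visible groups, so k = 2
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(clusters)
```

Practical implementations usually pick the starting centroids at random and stop once the assignments no longer change.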

### linear

used to describe a straight-line relationship between two variables

### line of best fit

a line through a scatterplot of data points that best expresses the relationship between those points
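
One common way to find such a line is least squares, which picks the slope and intercept that minimize the sum of squared vertical distances to the points. The sketch below assumes that meaning of "best"; the data are made up.

```python
def line_of_best_fit(xs, ys):
    """Least-squares slope and intercept for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how the two variables vary together, relative to the spread of x
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    # The least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points lying exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = line_of_best_fit(xs, ys)
print(slope, intercept)
```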

### market

refers to the live streaming of trade-related data; it encompasses a range of information such as price, bid/ask quotes and market volume

### mean absolute error

the average size of the errors in a set of predictions; it is the mean of the absolute differences between each predicted value and the observed ("true") value
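
As a small worked example, the mean absolute error can be computed by averaging the absolute differences between observed and predicted values. The numbers below are invented for illustration.

```python
def mean_absolute_error(observed, predicted):
    """Average of the absolute differences between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

# Hypothetical values: the absolute errors are 1, 0, 2, and 4
observed = [10, 12, 15, 20]
predicted = [11, 12, 13, 24]
mae = mean_absolute_error(observed, predicted)
print(mae)  # 1.75
```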

### mean squared error

a measure of how close a regression line is to a set of points; it is found by averaging the squared differences between the predicted and actual values
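
A short sketch of the calculation, using invented numbers: each difference is squared before averaging, so large misses count more heavily than small ones.

```python
def mean_squared_error(observed, predicted):
    """Average of the squared differences between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

# Hypothetical values: the squared errors are 1, 0, 4, and 16
observed = [10, 12, 15, 20]
predicted = [11, 12, 13, 24]
mse = mean_squared_error(observed, predicted)
print(mse)  # 5.25
```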

### misclassification rate

the proportion of observations that were predicted to be in one category but were actually in another
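
For example, the rate can be found by counting the predictions that disagree with the actual category and dividing by the total number of observations. The labels below are hypothetical.

```python
def misclassification_rate(actual, predicted):
    """Share of observations whose predicted category differs from the actual one."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

# Hypothetical labels: the second and fifth predictions are wrong
actual = ["spam", "ham", "spam", "ham", "ham"]
predicted = ["spam", "spam", "spam", "ham", "spam"]
rate = misclassification_rate(actual, predicted)
print(rate)  # 0.4
```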

### model

provides a simplified version or representation of real-life situations or data. It is used to make sense of data or make predictions based on it.

### negative association

when the values of one variable tend to decrease as the values of the other variable increase

### network

a system designed to transfer data from one network access point to one or more other network access points via data switching, transmission lines, and system controls

### no association

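when the values of one variable show no tendency to increase or decrease as the values of the other variable change; the dots on a scatterplot are scattered and no line describes them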

### nodes

points of intersection/connection within a data communication network

### non-linear

a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables; the data are fitted by a method of successive approximations

### observed value

the value that is actually observed (what actually happened)

### polynomial trends

describes a pattern in data that is curved or breaks from a straight linear trend; it often occurs in a large set of data that contains many fluctuations

### positive association

when the values of one variable tend to increase as the values of the other variable increase

### predicted value

the value of the response variable that a model, such as the equation of the line of best fit, projects for a given input

### regression line

a line that best describes the behavior of a set of data

### residual

the difference between our prediction and the actual outcome; also called an "error"
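
The sketch below computes residuals for a straight-line model, assuming the common convention of observed value minus predicted value. The slope, intercept, and data are hypothetical.

```python
def residuals(xs, ys, slope, intercept):
    """Observed y minus the y predicted by the line, for each point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Hypothetical line y = 2x + 1, which predicts 3, 5, and 7 for these x values
xs = [1, 2, 3]
ys = [3.5, 4.0, 7.0]
errors = residuals(xs, ys, slope=2, intercept=1)
print(errors)  # [0.5, -1.0, 0.0]
```

Under this convention, a positive residual means the point lies above the line and a negative residual means it lies below.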

### rule

a set way to calculate or solve a problem

### shape

describes the distribution (or pattern) of the data within a dataset

### strength of association

how much two variables covary and the extent to which the independent variable affects the dependent variable

### testing data

a random subset consisting of about 15-25% of the original dataset on which a model is tested

### training data

a random subset consisting of about 75-85% of the original dataset on which a model is trained
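
A random split into training and testing data might look like the sketch below. It assumes an 80/20 split, which falls within the ranges given above; the function name, the `seed` parameter, and the 20-row dataset are made up for the example.

```python
import random

def split_data(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows, then cut it into training and testing subsets."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    shuffled = rows[:]         # copy so the original order is left alone
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical dataset of 20 row IDs
rows = list(range(20))
training, testing = split_data(rows)
print(len(training), len(testing))  # 16 4
```

Every row lands in exactly one of the two subsets, so the model is tested on data it did not see during training.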

### trend

a general pattern in the behavior of a set of data; it is often represented by a line of best fit