What will happen? (e.g., what will Apple’s stock price be tomorrow?)
What should we do? (e.g., What actions should we take to reduce employee turnover?)
It is a mathematical representation of a real-world process or system. In other words, it is a real-life situation expressed as math.
Numbers with meaning: higher means more, lower means less (e.g., age, sales, temperature, income).
Numbers without meaning (e.g., zip codes), non-numeric data (e.g., hair color), binary data (e.g., male/female, yes/no, on/off).
A. The average cost of a house in the United States every year since 1820
B. The height of each professional basketball player in the NBA at the start of the season
Solution: A
A. The contents of a person’s Twitter feed
B. The amount of money in a person’s bank account
Solution: B
A survey of 25 people recorded each person’s family size and type of car. Which of these is a data point?
A. The 14th person’s family size and car type
B. The 14th person’s family size
C. The car type of each person
Solution: A. A data point is all the information about one observation
We need to scale data when using gradient-descent-based algorithms (linear and logistic regression, neural networks) and distance-based algorithms (KNN, k-means, SVM, PCA), because these are very sensitive to the range of the data points.
\((X-X_{mean})/X_{standard\ deviation}\)
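A minimal sketch of this standardization in plain Python. The data values are made up, and using the population standard deviation is an assumption (the sample standard deviation is also common).

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(x - mu) / sigma for x in values]

ages = [20, 30, 40, 50, 60]  # made-up example data
z_scores = standardize(ages)
```

After scaling, the values have mean 0 and standard deviation 1, so no attribute dominates just because of its units.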
Real relationships between attributes and responses. They are the same in all data sets.
They are random and different in all data sets.
The model’s performance on its training data is usually too optimistic. The model is fit to both real and random patterns in the data, so it becomes overly specialized to the specific randomness in the training set, which doesn’t exist in other data.
The model will appear to be better than it really is, which is also called overfitting.
The model will be fit to both real and random patterns in the data. The model’s effectiveness on this data set will include both types of patterns, but its true effectiveness on other data sets (with different random patterns) will only include the real patterns.
Used to fit the models
Used to choose best model
Sometimes the randomness will make the performance look worse than it really is, and sometimes it will make the performance look better than it really is.
To estimate the generalization performance of chosen model
When we are choosing between multiple models.
Split the training/validation data into k-parts; we train on k-1 parts and validate on the remaining part.
The average of all k evaluations.
We train the model again using all the data.
Better use of data, a better estimate of model quality, and more effective model selection.
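The k-fold procedure above can be sketched as follows. This is an illustration only: the "model" is a hypothetical mean-only predictor, the responses are made up, and the folds are contiguous (in practice the data would usually be shuffled first).

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k nearly equal, contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(y, k):
    """Average validation MSE of a mean-only model over k folds."""
    scores = []
    for fold in k_fold_indices(len(y), k):
        held_out = set(fold)
        train = [y[i] for i in range(len(y)) if i not in held_out]
        prediction = sum(train) / len(train)  # "fit" on the other k-1 parts
        mse = sum((y[i] - prediction) ** 2 for i in fold) / len(fold)  # validate on the held-out part
        scores.append(mse)
    return sum(scores) / k  # the average of all k evaluations

y = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0]  # made-up responses
cv_mse = cross_validate(y, k=3)
```

Each data point is used k-1 times for training and once for validation.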
grouping data points (e.g., market segmentation) and discovering groups in data points (e.g., personalized medicine)
k-1 times for training, and 1 time for validation
The center of a cluster.
We take the mean of all the data points in cluster.
The higher the k, the higher the bias; the lower the k, the higher the variance. When k = 1, the model is at its most complex and thus likely to overfit the data.
Elbow method: for each number of clusters k, we calculate the total distance of each data point to its cluster center and plot it against k. We look for the kink (the "elbow") in the graph.
When we see a new data point, we just choose whichever cluster center is closest.
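A small sketch of the three clustering calculations above: the cluster center as a mean, the total distance used in the elbow plot, and assigning a new point to the nearest center. The points are made up and Euclidean distance is an assumption.

```python
from math import dist

def nearest_center(point, centers):
    """Index of the cluster center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))

def centroid(points):
    """Mean of all points in a cluster, coordinate by coordinate."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def total_distance(points, centers):
    """Sum of each point's distance to its nearest center (the elbow-plot quantity)."""
    return sum(dist(p, centers[nearest_center(p, centers)]) for p in points)

cluster_a = [(1.0, 1.0), (1.0, 3.0), (3.0, 1.0), (3.0, 3.0)]  # made-up points
cluster_b = [(9.0, 9.0), (11.0, 11.0)]
centers = [centroid(cluster_a), centroid(cluster_b)]
new_point_cluster = nearest_center((2.5, 2.0), centers)  # classify a new point
```

For an elbow plot, `total_distance` would be computed once per candidate number of clusters and plotted against that number.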
With classification models, we know each data point’s attributes and we already know the right classification for the data points (supervised). In clustering (unsupervised) we know the attributes but we don’t know what group any of these data points are in.
Supervised: the response is known. Unsupervised: the response is not known.
A group of astronomers has a set of long-exposure CCD images of various distant objects. They do not know yet which types of object each one is, and would like your help using analytics to determine which ones look similar. Which is more appropriate: classification or clustering? - Clustering
Suppose one astronomer has categorized hundreds of the images by hand, and now wants your help using analytics to automatically determine which category each new image belongs to. Which is more appropriate: classification or clustering? - Classification
A. The outlier is incorrectly entered data, not real data.
B. Outliers like this only happen occasionally.
Solution: A. If the data point isn’t a true one, you should remove it from your data set.
A data point that is very different from the rest.
Box plot.
Omit them or use imputation.
ARIMA and exponential smoothing both estimate the value of an attribute; GARCH estimates the variance.
When there are other factors or predictors that affect the response. Regression helps show the relationships between factors and a response.
Prescriptive analytics: Determining the best course of action
Regression is often good for describing and predicting, but is not as helpful for suggesting a course of action
No. Regression can show relationships between observations, but it doesn’t show whether one thing causes another.
The variance is different in different ranges of the data
Transform data so there’s no correlation between dimensions and rank the new dimensions in likely order of importance.
Each original attribute’s implied regression coefficient is equal to a linear combination of the principal components’ regression coefficients.
This is equivalent to using the inverse transformation.
False. Unlike a model like regression where we can show the result as a simple linear combination of each attribute times its regression coefficient, in a random forest model there are so many different trees used simultaneously that it’s difficult to interpret exactly how any factor or factors affect the result.
…is a probability (a number between zero and one) or is binary (either zero or one).
A model is built to determine whether data points belong to a category or not. A “true negative” result is a data point that is not in the category, and the model correctly says so. ‘True’ and ‘false’ refer to whether the model is correct or not, and ‘positive’ and ‘negative’ refer to whether the model says the point is in the category.
False. Sometimes the cost of a false positive is so high that it’s worth accepting more false negatives, or vice versa.
Predicts the value based on other factors (regression), using earlier values of the same series as those factors (auto). ARIMA autoregresses on the differences. It uses p time periods of previous observations to predict d-th order differences, and also incorporates a moving average by looking at q previous errors \((\hat{x}_t - x_t)\).
The pth-order autoregression, dth-order differences, and qth-order moving average.
white noise
random walk
Auto Regression model, only the auto regressive part is active
Moving Average model - only the moving average part is active
ARIMA(0, 1, 1) is basic exponential smoothing.
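For reference, a minimal sketch of the basic exponential smoothing recursion that ARIMA(0, 1, 1) corresponds to. The series, the value of alpha, and starting the recursion at the first observation are all assumptions for illustration.

```python
def exponential_smoothing(xs, alpha):
    """Basic exponential smoothing: S_t = alpha * x_t + (1 - alpha) * S_{t-1}."""
    smoothed = [xs[0]]  # assumption: initialize with the first observation
    for x in xs[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

series = [10.0, 12.0, 11.0, 13.0]  # made-up time series
smoothed = exponential_smoothing(series, alpha=0.5)
forecast_next = smoothed[-1]  # the one-step-ahead forecast is the last smoothed value
```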
Model that estimates or forecasts the variance of something that we have time series data for.
Traditional portfolio optimization: balances the expected return of a set of investments against the amount of volatility. Variance is a proxy for the amount of volatility or risk here.
It is the line that minimizes the sum of squared errors.
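A short sketch of finding that line for one predictor, using the standard closed-form least-squares formulas. The data points are made up.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of squared differences between actual and predicted responses."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0]  # made-up predictor values
ys = [3.0, 5.0, 7.0, 9.0]  # made-up responses (exactly y = 1 + 2x)
intercept, slope = fit_line(xs, ys)
best_sse = sum_squared_errors(xs, ys, intercept, slope)
worse_sse = sum_squared_errors(xs, ys, intercept, slope + 0.1)  # any other line has a larger error
```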
Encourages fewer parameters k and higher likelihood. Works well with a lot of data points.
One thing causes another.
Two things tend to happen or not happen together.
Estimates how much variability the model accounts for.
Same as \(r^2\) but favors simpler models by penalizing for using too many variables.
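Both measures can be computed directly. This sketch uses the usual adjusted R-squared formula with n data points and p predictors; the actual and predicted values are made up.

```python
def r_squared(actual, predicted):
    """Fraction of variability in the response that the model accounts for."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Penalize R-squared for the number of predictors p, given n data points."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

actual = [1.0, 2.0, 3.0, 4.0, 5.0]      # made-up responses
predicted = [1.0, 2.0, 3.0, 4.0, 4.0]   # made-up model output
r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n=len(actual), p=1)
```

With these numbers, R-squared is 0.9 and the adjusted value is slightly lower, reflecting the penalty for the one predictor.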
Q-Q plot
Reduces the effect of randomness, and earlier principal components are likely to have higher signal-to-noise ratios.
Classification and Regression Trees
For every pair of leaves created by the same branch, we use the other half of the data to see whether the estimation error is actually improved by branching. If the branching does improve the error, the branches stay; if the branching makes the error worse or does not change it, we remove the branches.
Introduce randomness. We generate many different trees. They will have different strengths and weaknesses. The average of all these trees is better than a single tree with specific strengths and weaknesses.
It has better overall estimates. While each tree might be overfitting in one place or another, the trees don’t necessarily overfit in the same way. Averaging over all the trees tends to smooth out those overreactions to random effects.
Harder to explain/interpret results. Can’t give us a specific regression or classification model from the data.
Use the average of the predicted response
Use the mode – the most common predicted response
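The two aggregation rules can be illustrated with the standard library. The per-tree outputs are made up; in a real random forest they would come from the individual trees.

```python
from statistics import mean, mode

tree_regression_outputs = [10.0, 12.0, 11.0, 15.0]    # made-up predictions, one per tree
tree_class_votes = ["yes", "no", "yes", "yes", "no"]  # made-up votes, one per tree

forest_estimate = mean(tree_regression_outputs)  # regression: average of the predicted responses
forest_class = mode(tree_class_votes)            # classification: most common predicted response
```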
point in the category, correctly classified
point not in category, model says it is
point not in category, correctly classified
point in the category, model says it is not
The fraction of category members that are correctly classified: TP / (TP + FN)
The fraction of non-category members that are correctly identified: TN / (TN + FP)
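A small sketch that counts the four outcomes and computes both fractions. The labels are made up, with 1 meaning "in the category" and 0 meaning "not in the category".

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, TN, FN) for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn

actual = [1, 1, 1, 0, 0, 0, 0, 1]      # made-up true labels
predicted = [1, 0, 1, 0, 1, 0, 0, 1]   # made-up model output
tp, fp, tn, fn = confusion_counts(actual, predicted)
sensitivity = tp / (tp + fn)  # fraction of category members correctly classified
specificity = tn / (tn + fp)  # fraction of non-members correctly identified
```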
sensitivity plotted against 1 - specificity
Probability that the model estimates a random “yes” point higher than a random “no” point
We are just guessing
Gives a quick-and-dirty estimate of quality, but does not differentiate between the cost of FN and FP.
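The probability interpretation of the area under the ROC curve can be checked directly by comparing every "yes" point with every "no" point. The scores are made up, and counting ties as half a win is a common convention assumed here.

```python
def auc_by_pairs(scores_yes, scores_no):
    """Fraction of (yes, no) pairs in which the "yes" point gets the higher score."""
    wins = 0.0
    for y in scores_yes:
        for n in scores_no:
            if y > n:
                wins += 1.0
            elif y == n:
                wins += 0.5  # assumption: a tie counts as half
    return wins / (len(scores_yes) * len(scores_no))

scores_yes = [0.9, 0.8, 0.4]  # made-up model scores for true "yes" points
scores_no = [0.7, 0.3, 0.2]   # made-up model scores for true "no" points
auc = auc_by_pairs(scores_yes, scores_no)  # 8 of the 9 pairs are ordered correctly
```

A value of 0.5 means the model ranks pairs no better than guessing.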
Transformations of input data, interaction terms, variable selection, and use within trees.
Logistic regression takes longer to calculate, has no closed-form solution, and makes model quality harder to assess (no R-squared value).
Bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
An error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).
Data point that is very different from the rest
The probability (probability density) of some observed outcomes given a set of parameter values.
Parameters that give the highest probability
The set of parameters that minimizes the sum of squared errors.
Cause overfitting
Very low.
Less data is required; less chance of insignificant factors and easier to interpret.
We select the best new factor and, if it is good enough (by R^2, AIC, or p-value), add it to our model and refit the model with the current set of factors. At the end, we remove any factors that fail a certain threshold.
We start with all factors and find the worst one, judged against a supplied threshold (p = 0.15). If it fails the threshold, we remove it and start the process over. We continue until we have the number of factors that we want, then remove any factors that fail a second threshold (p = 0.05) and fit the model with the remaining set of factors.
It is a combination of forward selection and backward elimination. We can start with either all factors or no factors, and at each step we remove or add a factor. After adding each new factor, and again at the end, we immediately eliminate any factors that no longer appear to be good enough.
If we’re testing to see whether red cars sell for higher prices than blue cars, we need to account for the type and age of the cars in our data set. This is called Controlling.
When you can collect data quickly. When the data is representative and the amount of data is small compared to the whole population.
discard the data, use categorical variables to indicate missing data, estimate missing values
Pros: not potentially introducing errors; easy to implement
Cons: don’t want to lose too many data points; potential for censored or biased missing data
Advantage: hedge against being too wrong and easy to compute
Disadvantage: the imputation can be biased. For example, people with high incomes are less likely to answer a survey, so the mean/median will underestimate the missing values.
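A minimal sketch of median imputation. The income values are made up and `None` is used here to mark a missing entry.

```python
from statistics import median

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

incomes = [42.0, None, 55.0, 61.0, None, 48.0]  # made-up data; None marks a missing value
imputed = impute_median(incomes)
```

If the missing incomes tend to be the high ones, as in the survey example, the filled-in value of 51.5 would be too low.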
It reduces or eliminates the problem of bias. Also gives better values for missing data
Disadvantages: we have to build, validate and test a whole other model just to fill in the missing data and then we have to do it all over again to get the answer we want. Also we are using the same data twice: once for imputation and a second time to fit the model.
overfitting: when # of predictors is close to or larger than # of data points. Model may fit too closely to random effects
simplicity: simple models are usually better
LASSO: It is a statistical method that applies a penalty to the coefficients of the variables in a regression model, shrinking some of them to zero. This results in a more simplified and interpretable model with fewer predictors.
Recursive feature elimination (RFE): RFE is an iterative method that starts with all the predictors and then removes the least important one at each iteration, based on a certain criterion, until a desired number of predictors is reached.
Principal component analysis (PCA)
PCA is a technique used to simplify complex datasets by reducing their dimensions (variables) while retaining most of their information. It works by transforming the variables into a set of new variables, called principal components, which are uncorrelated linear combinations of the original variables, ordered by how much variance they capture. The first principal component captures the largest amount of variability in the data, followed by the second, and so on.