Announcements
Midterm
- Application exam available now. Due NEXT Friday at noon
- Concepts exam in discussion section next Thursday
- Review for concepts exam next Tuesday in Lab (w/ me)
Feedback
- Noise in discussion videos
- Quiz
- Interactive (ask your questions, answer my questions; groups?)
- Slides are just to organize my thoughts (often won't cover all topics; flexible based on where the discussion goes)
Bias and Variance - The MOST Fundamental Issues in Inferential Statistics
Describe these two properties of estimates in general
- need to think about repeated estimation of something (\(\hat{f}\), \(\hat{DGP}\); \(\hat{accuracy}\), \(\hat{rmse}\))
Describe bias and variance associated with developing models (estimates of the DGP)
- Examples for our models (\(\hat{f}\), \(\hat{DGP}\))
- What factors affect each?
Describe bias and variance of our performance estimates
- Examples for our performance estimates (\(\hat{accuracy}\), \(\hat{rmse}\))
- What factors affect each?
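A minimal simulation sketch (my own illustration, not from the lecture) of what "repeated estimation" means here: draw many samples from a known DGP, fit a model and compute a performance estimate each time, then look at how the estimates spread (variance) and whether their average is systematically off from the truth (bias).

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    # Known DGP: y = 2 + 3x + noise with SD = 1
    x = rng.uniform(-1, 1, n)
    y = 2 + 3 * x + rng.normal(0, 1, n)
    return x, y

rmses = []
for _ in range(1000):                 # repeat the whole estimation process
    x, y = simulate(50)
    b1, b0 = np.polyfit(x, y, 1)      # fit a simple linear model (an f-hat)
    x_new, y_new = simulate(50)       # new data from the same DGP
    pred = b0 + b1 * x_new
    rmses.append(np.sqrt(np.mean((y_new - pred) ** 2)))

# Spread of the estimates across repetitions ~ variance of the estimator;
# systematic deviation of their mean from the true value ~ bias.
print(np.mean(rmses), np.std(rmses))
```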
Curse of Dimensionality
- What is high dimensional data?
- What happens when \(p \ge n\) in the GLM?
- Connect to issues of overfitting and interpretation
- Why would we even want high dimensional data?
- Describe two broad categories of approaches to solve the high-dimensionality problem
- Feature selection (subset approaches, LASSO)
- Dimensionality reduction (Ridge, PCA, PCR, PLS, autoencoders, word embeddings)
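A quick sketch of the \(p \ge n\) problem (my own example, assuming scikit-learn): with more noise predictors than cases, an unregularized linear model can fit the training data essentially perfectly yet predicts new data poorly, which is exactly the overfitting and interpretation concern above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n, p = 20, 30                        # p > n
X_train = rng.normal(size=(n, p))    # noise features, unrelated to y
y_train = rng.normal(size=n)
X_test = rng.normal(size=(n, p))
y_test = rng.normal(size=n)

model = LinearRegression().fit(X_train, y_train)
print("train R^2:", model.score(X_train, y_train))   # ~1.0: perfect (over)fit
print("test  R^2:", model.score(X_test, y_test))     # near or below 0
```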
Steps for three subsetting approaches
Three approaches
Advantages and disadvantages
- Best subset
- Potentially better at finding the DGP because it considers all possible models for each specific k
- Computationally very expensive (can’t do with p > 40?)
- LASSO is very similar and computationally much faster!
- Forward stepwise
- Lower computational costs than best subset
- Won't always find the best model (3 X example)
- Can be used with p > n
- Backward stepwise
- Lower computational costs than best subset
- Maybe better in scenarios like the 3 X example
- Can’t be used with p > n
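A rough sketch of forward stepwise selection (my own minimal version, not a library function; a real workflow would compare the candidate model sizes with cross-validation, AIC/BIC, or adjusted R^2 rather than training R^2):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y, max_k):
    # Assumes X is a 2-D numpy array of numeric features
    selected, remaining, path = [], list(range(X.shape[1])), []
    for _ in range(max_k):
        best_j, best_score = None, -np.inf
        for j in remaining:                      # try adding each remaining feature
            cols = selected + [j]
            score = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)                  # keep the single best addition
        remaining.remove(best_j)
        path.append(list(selected))
    return path                                  # one candidate model per k = 1..max_k
```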
Cost functions for OLS, LASSO and Ridge
NEED TO KNOW THESE FULL FORMULAS (AT LEAST FOR REGRESSION) and PENALTY FOR CLASSIFICATION
OLS
- \(\frac{1}{n}\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}\)
Ridge (l2)
- \(\frac{1}{n}([\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}] + [\:\lambda\sum_{j = 1}^{p} \beta_j^{2}\:])\)
LASSO
- \(\frac{1}{n}([\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}] + [\:\lambda\sum_{j = 1}^{p} |\beta_j|\:])\)
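The same three cost functions written out in code (a direct transcription of the formulas above; the function names are mine):

```python
import numpy as np

def ols_cost(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def ridge_cost(y, y_hat, beta, lam):
    # squared (L2) penalty on the parameter estimates (intercept excluded)
    return (np.sum((y - y_hat) ** 2) + lam * np.sum(beta ** 2)) / len(y)

def lasso_cost(y, y_hat, beta, lam):
    # absolute-value (L1) penalty on the parameter estimates (intercept excluded)
    return (np.sum((y - y_hat) ** 2) + lam * np.sum(np.abs(beta))) / len(y)
```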
Lambda
- How does the value of Lambda affect LASSO and Ridge?
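A sketch of lambda's effect (my own example data via scikit-learn; note that sklearn calls lambda `alpha` in `Ridge`/`Lasso`). It also previews the comparison below: as lambda grows, LASSO pushes more estimates to exactly zero while Ridge only shrinks them toward zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 10 features, only 3 truly related to y
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10, random_state=0)

for lam in [0.01, 1, 100]:
    lasso = Lasso(alpha=lam).fit(X, y)
    ridge = Ridge(alpha=lam).fit(X, y)
    print(f"lambda={lam:>6}: LASSO zero estimates={np.sum(lasso.coef_ == 0)}, "
          f"Ridge mean |estimate|={np.mean(np.abs(ridge.coef_)):.2f}")
```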
LASSO vs. Ridge Comparison
With respect to the parameter estimates:
LASSO yields sparse solution (some parameter estimates set to exactly zero)
Ridge tends to retain all features (parameter estimates don’t get set to exactly zero)
LASSO selects one feature from a correlated group and sets the others to zero
Ridge shrinks all parameter estimates for correlated features
Ridge tends to outperform LASSO with respect to prediction in new data. There are cases where LASSO can predict better (when most features have zero effect and only a few are non-zero), but even in those cases Ridge is competitive.
Advantages of LASSO
Does feature selection (sets parameter estimates to exactly 0)
- Yields a sparse solution
- Sparse model is more interpretable?
- Sparse model is easier to implement? (fewer features included so don’t need to measure as many predictors)
More robust to outliers (similar to LAD vs. OLS)
Tends to do better when there are a small number of robust features and the others are close to zero or zero
Advantages of Ridge
Computationally superior (closed-form solution vs. iterative optimization; only one solution minimizes the cost function)
More robust to measurement error in features (remember no measurement error is an assumption for unbiased estimates in OLS regression)
Tends to do better when there are many features with large (and comparable) effects (i.e., most features are related to the outcome)
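For reference (a standard result, not spelled out in these notes), the closed-form Ridge solution behind the first advantage above, assuming standardized features and an unpenalized intercept handled separately:
- \(\hat{\beta}^{ridge} = (X^{T}X + \lambda I)^{-1}X^{T}Y\)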
Elasticnet
Cost function
- \(\frac{1}{n}([\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}] + [\:\lambda (\alpha\sum_{j = 1}^{p} |\beta_j| + (1-\alpha)\sum_{j = 1}^{p} \beta_j^{2})\:])\)
- Lambda works the same
- What does alpha (the blending/mixture hyperparameter) do?
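A scikit-learn sketch (my own example; sklearn's `alpha` argument plays the role of lambda here, `l1_ratio` plays the role of the mixing alpha, and its internal scaling of the penalty differs slightly from the formula above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10, random_state=0)

for mix in [0.1, 0.5, 1.0]:     # closer to 0 = more Ridge-like; 1.0 = pure LASSO penalty
    fit = ElasticNet(alpha=1.0, l1_ratio=mix).fit(X, y)
    print(f"l1_ratio={mix}: zero estimates = {np.sum(fit.coef_ == 0)}")
```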
PCA, PCR, and PLS
PCA
- See appendix and Section 12.2 in James et al.
PCR
- GLM with PCA as part of feature engineering
- Tune number of components retained
- Doesn't derive components based on \(Y\)
- Very similar to Ridge
PLS
- Modification where components are derived to predict \(Y\)
- Still prefer ridge generally
All do dimensionality reduction but NOT feature selection; you still need to measure all predictors.
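A sketch of PCR and PLS in scikit-learn (my own example data; in practice the number of components would be tuned with resampling):

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10, random_state=0)

# PCR: PCA as feature engineering; components chosen from X alone (ignores Y)
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression()).fit(X, y)

# PLS: components derived to also predict Y
pls = PLSRegression(n_components=5).fit(X, y)

print("PCR train R^2:", round(pcr.score(X, y), 3))
print("PLS train R^2:", round(pls.score(X, y), 3))
# Both reduce dimensionality, but neither drops raw predictors: every original
# feature still has to be measured to compute the components.
```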