Announcements
Midterm
- Application exam available now. Due NEXT Friday at noon
- Concepts exam in discussion section next Thursday
- Review for concepts exam next Tuesday in Lab (w/ me)
Feedback
- Noise in discussion videos
- Quiz
- Interactive (ask your questions, answer my questions; groups?)
- Slides are just to organize my thoughts (often won't cover all topics; flexible based on where the discussion goes)
Bias and Variance - The MOST Fundamental Issues in Inferential Statistics
Describe these two properties of estimates in general
- need to think about repeated estimation of something (\(\hat{f}\), \(\hat{DGP}\); \(\hat{accuracy}\), \(\hat{rmse}\))
Describe bias and variance associated with developing models (estimates of the DGP)
- Examples for our models (\(\hat{f}\), \(\hat{DGP}\))
- What factors affect each?
Describe bias and variance of our performance estimates
- Examples for our performance estimates (\(\hat{accuracy}\), \(\hat{rmse}\))
- What factors affect each?
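A minimal simulation sketch (my own illustration, not from the lecture) of what "repeated estimation" means here: draw many samples from a known DGP, fit a model and compute a performance estimate each time, then look at how the estimates spread (variance) and whether their average is systematically off from the truth (bias).

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    # Known DGP: y = 2 + 3x + noise with SD = 1
    x = rng.uniform(-1, 1, n)
    y = 2 + 3 * x + rng.normal(0, 1, n)
    return x, y

rmses = []
for _ in range(1000):                 # repeat the whole estimation process
    x, y = simulate(50)
    b1, b0 = np.polyfit(x, y, 1)      # fit a simple linear model (an f-hat)
    x_new, y_new = simulate(50)       # new data from the same DGP
    pred = b0 + b1 * x_new
    rmses.append(np.sqrt(np.mean((y_new - pred) ** 2)))

# Spread of the estimates across repetitions ~ variance of the estimator;
# systematic deviation of their mean from the true value ~ bias.
print(np.mean(rmses), np.std(rmses))
```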
Curse of Dimensionality
- What is high dimensional data?
- What happens when \(p \ge n\) in the GLM?
- Connect to issues of overfitting and interpretation
- Why would we even want high dimensional data?
- Describe two broad categories of approaches to solve the high-dimensionality problem
- Feature selection (subset approaches, LASSO)
- Dimensionality reduction (Ridge, PCA, PCR, PLS, autoencoders, word embeddings)
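A quick sketch of the \(p \ge n\) problem (my own example, assuming scikit-learn): with more noise predictors than cases, an unregularized linear model can fit the training data essentially perfectly yet predicts new data poorly, which is exactly the overfitting and interpretation concern above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n, p = 20, 30                        # p > n
X_train = rng.normal(size=(n, p))    # noise features, unrelated to y
y_train = rng.normal(size=n)
X_test = rng.normal(size=(n, p))
y_test = rng.normal(size=n)

model = LinearRegression().fit(X_train, y_train)
print("train R^2:", model.score(X_train, y_train))   # ~1.0: perfect (over)fit
print("test  R^2:", model.score(X_test, y_test))     # near or below 0
```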
Steps for three subsetting approaches
Three approaches
Advantages and disadvantages
- Best subset
- Potentially better at finding the DGP because it considers all possible models for each specific k
- Computationally very expensive (can’t do with p > 40?)
- LASSO is very similar and computationally much faster!
- Forward stepwise
- Lower computational costs than best subset
- Won't always find the best model (3 X example)
- Can be used with p > n
- Backward stepwise
- Lower computational costs than best subset
- Maybe better in scenarios like the 3 X example
- Can’t be used with p > n
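A rough sketch of forward stepwise selection (my own minimal version, not a library function; a real workflow would compare the candidate model sizes with cross-validation, AIC/BIC, or adjusted R^2 rather than training R^2):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y, max_k):
    # Assumes X is a 2-D numpy array of numeric features
    selected, remaining, path = [], list(range(X.shape[1])), []
    for _ in range(max_k):
        best_j, best_score = None, -np.inf
        for j in remaining:                      # try adding each remaining feature
            cols = selected + [j]
            score = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)                  # keep the single best addition
        remaining.remove(best_j)
        path.append(list(selected))
    return path                                  # one candidate model per k = 1..max_k
```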
Cost functions for OLS, LASSO and Ridge
NEED TO KNOW THESE FULL FORMULAS (AT LEAST FOR REGRESSION) and PENALTY FOR CLASSIFICATION
OLS
- \(\frac{1}{n}\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}\)
Ridge (l2)
- \(\frac{1}{n}([\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}] + [\:\lambda\sum_{j = 1}^{p} \beta_j^{2}\:])\)
LASSO
- \(\frac{1}{n}([\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}] + [\:\lambda\sum_{j = 1}^{p} |\beta_j|\:])\)
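The same three cost functions written out in code (a direct transcription of the formulas above; the function names are mine):

```python
import numpy as np

def ols_cost(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def ridge_cost(y, y_hat, beta, lam):
    # squared (L2) penalty on the parameter estimates (intercept excluded)
    return (np.sum((y - y_hat) ** 2) + lam * np.sum(beta ** 2)) / len(y)

def lasso_cost(y, y_hat, beta, lam):
    # absolute-value (L1) penalty on the parameter estimates (intercept excluded)
    return (np.sum((y - y_hat) ** 2) + lam * np.sum(np.abs(beta))) / len(y)
```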
Lambda
- How does the value of Lambda affect LASSO and Ridge?
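A sketch of lambda's effect (my own example data via scikit-learn; note that sklearn calls lambda `alpha` in `Ridge`/`Lasso`). It also previews the comparison below: as lambda grows, LASSO pushes more estimates to exactly zero while Ridge only shrinks them toward zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 10 features, only 3 truly related to y
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10, random_state=0)

for lam in [0.01, 1, 100]:
    lasso = Lasso(alpha=lam).fit(X, y)
    ridge = Ridge(alpha=lam).fit(X, y)
    print(f"lambda={lam:>6}: LASSO zero estimates={np.sum(lasso.coef_ == 0)}, "
          f"Ridge mean |estimate|={np.mean(np.abs(ridge.coef_)):.2f}")
```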
LASSO vs. Ridge Comparison
With respect to the parameter estimates:
LASSO yields sparse solution (some parameter estimates set to exactly zero)
Ridge tends to retain all features (parameter estimates don’t get set to exactly zero)
LASSO selects one feature from a correlated group and sets the others to zero
Ridge shrinks all parameter estimates for correlated features
Ridge tends to outperform LASSO with respect to prediction in new data. There are cases where LASSO can predict better (when most features have zero effect and only a few are non-zero), but even in those cases Ridge is competitive.
Advantages of LASSO
Does feature selection (sets parameter estimates to exactly 0)
- Yields a sparse solution
- Sparse model is more interpretable?
- Sparse model is easier to implement? (fewer features included so don’t need to measure as many predictors)
More robust to outliers (similar to LAD vs. OLS)
Tends to do better when there are a small number of robust features and the others are close to zero or zero
Advantages of Ridge
Computationally superior (closed-form solution vs. iterative optimization; only one solution minimizes the cost function)
More robust to measurement error in features (remember no measurement error is an assumption for unbiased estimates in OLS regression)
Tends to do better when there are many features with large (and comparable) effects (i.e., most features are related to the outcome)
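For reference (a standard result, not spelled out in these notes), the closed-form Ridge solution behind the first advantage above, assuming standardized features and an unpenalized intercept handled separately:
- \(\hat{\beta}^{ridge} = (X^{T}X + \lambda I)^{-1}X^{T}Y\)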
Elasticnet
Cost function
- \(\frac{1}{n}([\sum_{i = 1}^{n} (Y_i - \hat{Y_i})^{2}] + [\:\lambda (\alpha\sum_{j = 1}^{p} |\beta_j| + (1-\alpha)\sum_{j = 1}^{p} \beta_j^{2})\:])\)
- Lambda works the same
- What does alpha (the blending/mixture hyperparameter) do?
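A scikit-learn sketch (my own example; sklearn's `alpha` argument plays the role of lambda here, `l1_ratio` plays the role of the mixing alpha, and its internal scaling of the penalty differs slightly from the formula above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10, random_state=0)

for mix in [0.1, 0.5, 1.0]:     # closer to 0 = more Ridge-like; 1.0 = pure LASSO penalty
    fit = ElasticNet(alpha=1.0, l1_ratio=mix).fit(X, y)
    print(f"l1_ratio={mix}: zero estimates = {np.sum(fit.coef_ == 0)}")
```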
PCA, PCR, and PLS
PCA
- See appendix and Section 12.2 in James et al.
PCR
- GLM with PCA as part of feature engineering
- Tune number of components retained
- Doesn't derive components based on \(Y\)
- Very similar to Ridge
PLS
- Modification where components are derived to predict \(Y\)
- Still prefer ridge generally
All do dimensionality reduction but NOT feature selection; you still need to measure all predictors.
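A sketch of PCR and PLS in scikit-learn (my own example data; in practice the number of components would be tuned with resampling):

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10, random_state=0)

# PCR: PCA as feature engineering; components chosen from X alone (ignores Y)
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression()).fit(X, y)

# PLS: components derived to also predict Y
pls = PLSRegression(n_components=5).fit(X, y)

print("PCR train R^2:", round(pcr.score(X, y), 3))
print("PLS train R^2:", round(pls.score(X, y), 3))
# Both reduce dimensionality, but neither drops raw predictors: every original
# feature still has to be measured to compute the components.
```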