IAML Unit 2: Discussion

Housekeeping

  • Unit 2 solutions
    • Review in lab
    • Format of lab open for influence
  • Course feedback for extra credit (added to your quiz score) starting with the next quiz!
  • Unit 3 homework
    • test set predictions
    • free lunch!
  • An attempt at a reprex is now generally expected when asking for code help. A very important skill to learn (see the sketch below)!
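
For reference, a minimal sketch of what a reprex looks like (the dplyr example here is purely illustrative; your reprex should contain only the code needed to reproduce your problem):

```r
# A reprex contains everything needed to reproduce the issue:
# libraries, a small built-in (or simulated) dataset, and the minimal
# code that produces the error or unexpected output.
library(dplyr)

mtcars |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg))

# With this code on your clipboard, reprex::reprex() renders a
# formatted, shareable version you can paste into Slack.
```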

Quiz review and student questions

  • Quiz review

  • Student Questions

    • Great questions!
    • Write them down as you read and watch lectures
    • Use them to guide what topics we discuss
    • Bring them up during this discussion as we cover relevant topics
    • Not all questions will fit (Slack as an alternative for follow-up)

More central topics

  1. Data are very different in 752 vs. 610/710.
    • More real world
    • Messy
    • Often higher dimensionality (more predictors)
  2. Review
    • Goal is to develop a model that closely approximates the DGP
    • Goal is to evaluate (estimate) how close our model is to the DGP (how much error), with as little error as possible in that estimate
    • Bias and overfitting/variance apply to any estimate (both the model and its performance)
    • candidate model configurations
    • fit, select, evaluate
    • training, validation, test (held-in; held-out)


  3. Review: 2.2.1 Stages of Data Analysis and Model Development


  4. Best practices (discuss quickly)
    • CSV for data sharing, viewing, and git (though be careful with data in GitHub or other public repos!)
    • Variable values saved as text when nominal and ordinal (self-documenting)
    • Create a data dictionary; documentation is critical!!
    • snake_case for variables and self-documenting (and systematic) names
    • Write a function for classing data; you will do this often (see the sketch below)
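
A hedged sketch of the classing idea; the file path, variable names, and factor levels below are all hypothetical:

```r
library(tidyverse)

# Hypothetical function that sets variable classes after reading a csv.
# Nominal/ordinal values are stored as text in the csv, so the raw file
# stays self-documenting and factor levels are set explicitly here.
class_demo_data <- function(d) {
  d |>
    mutate(
      sex = factor(sex, levels = c("female", "male")),
      education = factor(education,
                         levels = c("high_school", "college", "graduate"),
                         ordered = TRUE)
    )
}

data_all <- read_csv("data/demo_data.csv") |>
  class_demo_data()
```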


  5. Review: 2.3.1 Data Leakage Issues
    • Review section in web book
    • Cleaning EDA is done with the full dataset but stays univariate and very limited (variable names, values, finding errors)
    • Modeling EDA is only done with the training set (or even an “eyeball” sample); NEVER use the validation or test set
    • Never estimate anything with the full dataset (e.g., missing values, standardization) if you plan to evaluate the model
    • Use recipes: prep (all estimation) with held-in data, then bake the held-in data (already in the object) and the appropriate held-out set (see the sketch below)
    • Put the test set aside
    • You work with the validation set (to select the best configuration) but never explore with it (the test set will still catch the leakage, but you will be misled into being overly optimistic and will spoil the test set)
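
A minimal leakage-safe sketch with tidymodels; the outcome y is hypothetical, and data_all continues from the classing sketch above:

```r
library(tidymodels)

set.seed(123)
splits <- initial_split(data_all, prop = 3/4)
data_trn <- training(splits)  # held-in: all estimation happens here
data_val <- testing(splits)   # held-out: baked and scored, never explored

rec <- recipe(y ~ ., data = data_trn) |>
  step_impute_median(all_numeric_predictors()) |>  # medians estimated from held-in data only
  step_normalize(all_numeric_predictors())         # means/SDs estimated from held-in data only

rec_prep <- prep(rec)                            # all estimation uses held-in data
feat_trn <- bake(rec_prep, new_data = NULL)      # held-in features (data already in object)
feat_val <- bake(rec_prep, new_data = data_val)  # held-out features use held-in estimates
```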


  6. Review: 2.4.2 Prepping and Baking a Recipe
    • Review section in web book
    • Prep always with held-in data; bake both held-in (data already in object) & held-out
    • Converts a matrix (data frame) of predictors into a feature matrix
    • Feature engineering (transformations, combinations of predictors, interactions, imputation, dimensionality reduction); see the sketch below
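
A sketch of common feature engineering steps; predictor names x1 and x2 are hypothetical, and which steps you need depends on your EDA:

```r
rec_feat <- recipe(y ~ ., data = data_trn) |>
  step_YeoJohnson(all_numeric_predictors()) |>  # shape transformations
  step_dummy(all_nominal_predictors()) |>       # nominal predictors -> dummy features
  step_interact(~ x1:x2) |>                     # a hypothetical interaction feature
  step_normalize(all_numeric_predictors())      # put features on a common scale
```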


  7. Functions sidenote - fun_modeling.R on GitHub
    • Look at the functions; don’t use them as a black box
    • Make your own function script (see the sketch below)
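
For example, a personal function script you source() at the top of analysis scripts; the file name, function, and variable below are hypothetical:

```r
# in my_fun.R: a simple helper you understand line by line
plot_hist <- function(d, var_name) {
  ggplot2::ggplot(d, ggplot2::aes(x = .data[[var_name]])) +
    ggplot2::geom_histogram(bins = 30)
}

# in an analysis script
source("my_fun.R")
plot_hist(data_trn, "age")
```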


  8. EDA for modeling (see the sketch after this list)
    • Limitless; we have just scratched the surface
    • Differs somewhat based on the dimensionality of the dataset
    • Differs based on the measurement of X and Y
    • A very important (and under-appreciated) part of science!
    • Learning about DGP
      • Understand univariate distributions, frequencies
      • Bivariate relationships
      • Interactions (3 or more variables)
      • Patterns in data
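
A small sketch of modeling EDA on the training set; skimr is one convenient option, and x1 and y are hypothetical names:

```r
library(tidyverse)
library(skimr)

# Univariate: distributions and frequencies for every variable
data_trn |> skim()

# Bivariate: relationship between a predictor and the outcome
data_trn |>
  ggplot(aes(x = x1, y = y)) +
  geom_point(alpha = 0.3) +
  geom_smooth()  # smoothed trend hints at linear vs. more complex shapes
```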

Extra topics, time permitting

  1. Missing data (see the sketch below)
    • Exclude vs. impute in the training data. What about outcomes?
    • How to impute
    • Missing predictors in the validation or test set (you can’t exclude those cases). Exclude cases with missing outcomes.
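
A hedged sketch: drop cases with missing outcomes from the training data, and impute predictors inside the recipe so held-out sets get the training-set estimates at bake time:

```r
data_trn <- data_trn |> drop_na(y)  # outcome missing -> exclude (training data only)

rec_miss <- recipe(y ~ ., data = data_trn) |>
  step_impute_median(all_numeric_predictors()) |>  # training medians applied at bake
  step_impute_mode(all_nominal_predictors())       # training modes applied at bake
```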


  2. Outliers
    • Drop or fix errors!
    • Goal is always to estimate the DGP
    • Exclude
    • Retain
    • Bring to the fence (see the sketch below)
    • Don’t exclude/change outcomes in the validation/test sets
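
A sketch of bringing values to the fence for a hypothetical training-set predictor x1, using Tukey's 1.5 * IQR fences:

```r
fence <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  lower <- q[[1]] - 1.5 * IQR(x, na.rm = TRUE)
  upper <- q[[2]] + 1.5 * IQR(x, na.rm = TRUE)
  pmin(pmax(x, lower), upper)  # values beyond a fence are pulled back to it
}

data_trn <- data_trn |> mutate(x1 = fence(x1))
```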


  3. Issues with high dimensionality
    • Hard to do predictor-level EDA
    • Common choices (normality transformations)
    • Observed vs. predicted plots
    • Methods for automated variable selection (glmnet; see the sketch below)
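
A sketch of automated variable selection with LASSO (glmnet) via tidymodels; the penalty value here is hypothetical and would normally be selected with resampling:

```r
fit_lasso <-
  linear_reg(penalty = 0.1, mixture = 1) |>  # mixture = 1 -> pure LASSO
  set_engine("glmnet") |>
  fit(y ~ ., data = feat_trn)

tidy(fit_lasso)  # predictors with coefficients shrunk to 0 are effectively dropped
```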


  4. Distributional Shape
    • Measurement issues (interval scale)
    • Implications for relationships with other variables
    • Solutions? (see the sketch below)
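
One common solution, sketched in a recipe: a Yeo-Johnson transformation, which reduces skew and (unlike a log) tolerates zero and negative values:

```r
rec_shape <- recipe(y ~ ., data = data_trn) |>
  step_YeoJohnson(all_numeric_predictors())  # lambda estimated from held-in data
```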


  5. Linearity vs. More Complex Relationships
    • Transformations (see the sketch below)
    • Choice of statistical algorithm
    • Do you need a linear model?
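
A sketch of one transformation option: a natural spline basis for a hypothetical predictor x1, which lets a linear algorithm fit a curved relationship:

```r
rec_spline <- recipe(y ~ ., data = data_trn) |>
  step_ns(x1, deg_free = 3)  # replaces x1 with 3 spline basis features
```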


  6. Interactions
    • Domain expertise
    • Visual options for interactions
    • But what do you do with high-dimensional data?
    • Explanatory vs. prediction goals (algorithms that accommodate interactions); see the sketch below
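
A sketch of both options, with hypothetical x1 (numeric) and x2 (nominal): visualize a candidate interaction in the training set, then add it as a feature:

```r
# Visual check: different slopes across levels of x2 suggest an interaction
data_trn |>
  ggplot(aes(x = x1, y = y, color = x2)) +
  geom_smooth(method = "lm")

# Feature: dummy-code x2 first, then cross its dummies with x1
rec_int <- recipe(y ~ ., data = data_trn) |>
  step_dummy(all_nominal_predictors()) |>
  step_interact(~ x1:starts_with("x2"))
```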


  7. How to handle all of these decisions in the machine learning framework
    • Goal is to develop a model that most closely approximates the DGP
    • How do the validation and test sets help with this?
    • Preregistration?
      • Pre-reg for performance metric, resampling method
      • Use of resampling for other decisions
      • Use of resampling to find correct model to test explanatory goals


  8. Model Assumptions
    • Why do we make assumptions?
      • Inference
      • Not needed for prediction (but?)
    • Flexibility wrt DGP