IAML Unit 2: Discussion

Housekeeping

  • Unit 2 solutions
    • Review in lab
    • Format of lab open for influence
  • Course feedback for extra credit (added to your quiz score) starting with the next quiz!
  • Unit 3 homework
    • test set predictions
    • free lunch!
  • An attempt at a reprex is now generally expected when asking for code help. A very important skill to learn (see the sketch below)!
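
For reference, a minimal sketch of what a reprex looks like (the dplyr example here is purely illustrative; your reprex should contain only the code needed to reproduce your problem):

```r
# A reprex contains everything needed to reproduce the issue:
# libraries, a small built-in (or simulated) dataset, and the minimal
# code that produces the error or unexpected output.
library(dplyr)

mtcars |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg))

# With this code on your clipboard, reprex::reprex() renders a
# formatted, shareable version you can paste into Slack.
```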

Quiz review and student questions

  • Quiz review

  • Student Questions

    • Great questions!
    • Write them down as you read and watch lectures
    • Use them to guide what topics we discuss
    • Bring them up during this discussion as we cover relevant topics
    • Not all questions will fit (Slack as an alternative for follow-up)

More central topics

  1. Data are very different in 752 vs. 610/710.
    • More real world
    • Messy
    • Often higher dimensionality (more predictors)
  2. Review
    • Goal is to develop a model that closely approximates the DGP
    • Goal is to evaluate (estimate) how close our model is to the DGP (how much error), with as little error as possible in that estimate
    • Bias and overfitting/variance apply to any estimate (both the model and its performance)
    • candidate model configurations
    • fit, select, evaluate
    • training, validation, test (held-in; held-out)


  3. Review: 2.2.1 Stages of Data Analysis and Model Development


  4. Best practices (discuss quickly)
    • CSV for data sharing, viewing, and git (though be careful with data in GitHub or other public repos!)
    • Variable values saved as text when nominal and ordinal (self-documenting)
    • Create a data dictionary; documentation is critical!!
    • snake_case for variables and self-documenting (and systematic) names
    • Write a function for classing data; you will do this often (see the sketch below)
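
A hedged sketch of the classing idea; the file path, variable names, and factor levels below are all hypothetical:

```r
library(tidyverse)

# Hypothetical function that sets variable classes after reading a csv.
# Nominal/ordinal values are stored as text in the csv, so the raw file
# stays self-documenting and factor levels are set explicitly here.
class_demo_data <- function(d) {
  d |>
    mutate(
      sex = factor(sex, levels = c("female", "male")),
      education = factor(education,
                         levels = c("high_school", "college", "graduate"),
                         ordered = TRUE)
    )
}

data_all <- read_csv("data/demo_data.csv") |>
  class_demo_data()
```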


  5. Review: 2.3.1 Data Leakage Issues
    • Review section in web book
    • Cleaning EDA is done with the full dataset but stays univariate and very limited (variable names, values, finding errors)
    • Modeling EDA is only done with the training set (or even an “eyeball” sample); NEVER use the validation or test set
    • Never estimate anything with the full dataset (e.g., missing values, standardization) if you plan to evaluate the model
    • Use recipes: prep (all estimation) with held-in data, then bake the held-in data (already in the object) and the appropriate held-out set (see the sketch below)
    • Put the test set aside
    • You work with the validation set (to select the best configuration) but never explore with it (the test set will still catch the leakage, but you will be misled into being overly optimistic and will spoil the test set)
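
A minimal leakage-safe sketch with tidymodels; the outcome y is hypothetical, and data_all continues from the classing sketch above:

```r
library(tidymodels)

set.seed(123)
splits <- initial_split(data_all, prop = 3/4)
data_trn <- training(splits)  # held-in: all estimation happens here
data_val <- testing(splits)   # held-out: baked and scored, never explored

rec <- recipe(y ~ ., data = data_trn) |>
  step_impute_median(all_numeric_predictors()) |>  # medians estimated from held-in data only
  step_normalize(all_numeric_predictors())         # means/SDs estimated from held-in data only

rec_prep <- prep(rec)                            # all estimation uses held-in data
feat_trn <- bake(rec_prep, new_data = NULL)      # held-in features (data already in object)
feat_val <- bake(rec_prep, new_data = data_val)  # held-out features use held-in estimates
```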


  6. Review: 2.4.2 Prepping and Baking a Recipe
    • Review section in web book
    • Prep always with held-in data; bake both held-in (data already in object) & held-out
    • Converts a matrix (data frame) of predictors into a feature matrix
    • Feature engineering (transformations, combinations of predictors, interactions, imputation, dimensionality reduction); see the sketch below
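
A sketch of common feature engineering steps; predictor names x1 and x2 are hypothetical, and which steps you need depends on your EDA:

```r
rec_feat <- recipe(y ~ ., data = data_trn) |>
  step_YeoJohnson(all_numeric_predictors()) |>  # shape transformations
  step_dummy(all_nominal_predictors()) |>       # nominal predictors -> dummy features
  step_interact(~ x1:x2) |>                     # a hypothetical interaction feature
  step_normalize(all_numeric_predictors())      # put features on a common scale
```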


  7. Functions sidenote - fun_modeling.R on GitHub
    • Look at the functions; don’t use them as a black box
    • Make your own function script (see the sketch below)
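
For example, a personal function script you source() at the top of analysis scripts; the file name, function, and variable below are hypothetical:

```r
# in my_fun.R: a simple helper you understand line by line
plot_hist <- function(d, var_name) {
  ggplot2::ggplot(d, ggplot2::aes(x = .data[[var_name]])) +
    ggplot2::geom_histogram(bins = 30)
}

# in an analysis script
source("my_fun.R")
plot_hist(data_trn, "age")
```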


  8. EDA for modeling (see the sketch after this list)
    • Limitless; we have just scratched the surface
    • Differs somewhat based on the dimensionality of the dataset
    • Differs based on the measurement of X and Y
    • A very important (and under-appreciated) part of science!
    • Learning about DGP
      • Understand univariate distributions, frequencies
      • Bivariate relationships
      • Interactions (3 or more variables)
      • Patterns in data
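
A small sketch of modeling EDA on the training set; skimr is one convenient option, and x1 and y are hypothetical names:

```r
library(tidyverse)
library(skimr)

# Univariate: distributions and frequencies for every variable
data_trn |> skim()

# Bivariate: relationship between a predictor and the outcome
data_trn |>
  ggplot(aes(x = x1, y = y)) +
  geom_point(alpha = 0.3) +
  geom_smooth()  # smoothed trend hints at linear vs. more complex shapes
```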

Extra topics, time permitting

  1. Missing data (see the sketch below)
    • Exclude vs. impute in the training data. What about outcomes?
    • How to impute
    • Missing predictors in the validation or test set (you can’t exclude those cases). Exclude cases with missing outcomes.
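
A hedged sketch: drop cases with missing outcomes from the training data, and impute predictors inside the recipe so held-out sets get the training-set estimates at bake time:

```r
data_trn <- data_trn |> drop_na(y)  # outcome missing -> exclude (training data only)

rec_miss <- recipe(y ~ ., data = data_trn) |>
  step_impute_median(all_numeric_predictors()) |>  # training medians applied at bake
  step_impute_mode(all_nominal_predictors())       # training modes applied at bake
```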


  2. Outliers
    • Drop or fix errors!
    • Goal is always to estimate the DGP
    • Exclude
    • Retain
    • Bring to the fence (see the sketch below)
    • Don’t exclude/change outcomes in the validation/test sets
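
A sketch of bringing values to the fence for a hypothetical training-set predictor x1, using Tukey's 1.5 * IQR fences:

```r
fence <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  lower <- q[[1]] - 1.5 * IQR(x, na.rm = TRUE)
  upper <- q[[2]] + 1.5 * IQR(x, na.rm = TRUE)
  pmin(pmax(x, lower), upper)  # values beyond a fence are pulled back to it
}

data_trn <- data_trn |> mutate(x1 = fence(x1))
```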


  3. Issues with high dimensionality
    • Hard to do predictor-level EDA
    • Common choices (normality transformations)
    • Observed vs. predicted plots
    • Methods for automated variable selection (glmnet; see the sketch below)
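
A sketch of automated variable selection with LASSO (glmnet) via tidymodels; the penalty value here is hypothetical and would normally be selected with resampling:

```r
fit_lasso <-
  linear_reg(penalty = 0.1, mixture = 1) |>  # mixture = 1 -> pure LASSO
  set_engine("glmnet") |>
  fit(y ~ ., data = feat_trn)

tidy(fit_lasso)  # predictors with coefficients shrunk to 0 are effectively dropped
```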


  4. Distributional Shape
    • Measurement issues (interval scale)
    • Implications for relationships with other variables
    • Solutions? (see the sketch below)
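
One common solution, sketched in a recipe: a Yeo-Johnson transformation, which reduces skew and (unlike a log) tolerates zero and negative values:

```r
rec_shape <- recipe(y ~ ., data = data_trn) |>
  step_YeoJohnson(all_numeric_predictors())  # lambda estimated from held-in data
```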


  5. Linearity vs. More Complex Relationships
    • Transformations (see the sketch below)
    • Choice of statistical algorithm
    • Do you need a linear model?
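
A sketch of one transformation option: a natural spline basis for a hypothetical predictor x1, which lets a linear algorithm fit a curved relationship:

```r
rec_spline <- recipe(y ~ ., data = data_trn) |>
  step_ns(x1, deg_free = 3)  # replaces x1 with 3 spline basis features
```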


  6. Interactions
    • Domain expertise
    • Visual options for interactions
    • But what do you do with high-dimensional data?
    • Explanatory vs. prediction goals (algorithms that accommodate interactions); see the sketch below
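
A sketch of both options, with hypothetical x1 (numeric) and x2 (nominal): visualize a candidate interaction in the training set, then add it as a feature:

```r
# Visual check: different slopes across levels of x2 suggest an interaction
data_trn |>
  ggplot(aes(x = x1, y = y, color = x2)) +
  geom_smooth(method = "lm")

# Feature: dummy-code x2 first, then cross its dummies with x1
rec_int <- recipe(y ~ ., data = data_trn) |>
  step_dummy(all_nominal_predictors()) |>
  step_interact(~ x1:starts_with("x2"))
```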


  7. How to handle all of these decisions in the machine learning framework
    • Goal is to develop a model that most closely approximates the DGP
    • How do the validation and test sets help with this?
    • Preregistration?
      • Pre-reg for performance metric, resampling method
      • Use of resampling for other decisions
      • Use of resampling to find correct model to test explanatory goals


  8. Model Assumptions
    • Why do we make assumptions?
      • Inference
      • Not needed for prediction (but?)
    • Flexibility wrt DGP