IAML Unit 3: Discussion

Announcements

  • Feedback - THANKS!
    • consistent feedback will be implemented (as best I can!)
    • vocabulary and concepts - developing appendix
    • you can clone book_iaml; render the slides with your notes included, or render to PDF
    • Tuning lab and discussion
      • Rank order based on frequency and importance
      • Put some in Slack
      • Can’t do them all. Ask in Slack, in office hours, or after discussion/lab
  • Homework is basically the same for unit 4
    • New dataset - Titanic
    • Do EDA, but we don’t need to see it
    • Fit KNN and RDA models (you will learn about LDA, QDA, and RDA in this unit)
    • Submit predictions. Free lunch!
    • And for this free lunch….
library(tidyverse)

# load the print_kbl() helper from the lab_support repo
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/print_kbl.R?raw=true",
                     sha1 = "021a7f7cddc1f0ffcd0613e57b94c81246b84f7b")

# unit 3 competition leaderboard (test-set RMSE, rounded)
read_csv(here::here("./application_assignments/competitions/2025_unit_03.csv")) |>
  mutate(rmse_test = round(rmse_test)) |> 
  print_kbl()
name rmse_test
janssen 24598
Wong 30461
Murray 30647
higgins 30943
Zhai 31293
xu 32621
baker 32743
Friedl 33022
Zhao 33458
lau 33612
Nasir 33615
Tanchone 33615
Villwock 33770
Cha 34591
yin 34646
Vo 34895
Jolly 35821
Jett 36317
Hwang 36431
ahmed 37965
Yu 39839
cox 39934
khoury 47947
holenarasipura 68107
Zhang Inf

Bias and Variance of estimates

Bias and variance - general

  • property of any estimate
  • for now, we will focus on the model (and its \(\hat{Y}\))
  • Describe each
  • Scale example for a single measurement (see the sketch below)
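  • A minimal sketch of that scale example (hypothetical numbers): one scale is biased but consistent, the other is unbiased but noisy
library(tidyverse)

set.seed(123)
true_weight <- 170  # hypothetical true value

scale_a <- rnorm(1000, mean = true_weight + 5, sd = 0.5)  # biased (+5) but low variance
scale_b <- rnorm(1000, mean = true_weight, sd = 5)        # unbiased but high variance

tibble(scale = rep(c("a", "b"), each = 1000),
       reading = c(scale_a, scale_b)) |>
  group_by(scale) |>
  summarize(mean_reading = mean(reading),  # distance from 170 reflects bias
            sd_reading = sd(reading))      # spread reflects variance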

Bias and variance - GLM vs. KNN

Bias and variance - k in KNN?

Impact of correlated features

  • in GLM
  • in KNN

Held-in and held-out data

Connect to training, validation, and test sets

Held-in and held-out data

  • fit models vs calculate performance metric

Why do we need held-out data?

  • Example: selecting the best k - train vs. validation data (see the sketch below)
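  • A minimal sketch (hypothetical simulated data) of why held-out data is needed: training RMSE keeps improving as k shrinks, but validation RMSE does not
library(tidymodels)

set.seed(102030)
d_all <- tibble(x = runif(500, 0, 10),
                y = rnorm(500, 5 * sin(x), 2))  # hypothetical DGP

splits <- initial_split(d_all, prop = 0.75)
d_trn <- training(splits)
d_val <- testing(splits)

rmse_by_k <- function(k) {
  fit_k <- nearest_neighbor(neighbors = k) |>
    set_engine("kknn") |>
    set_mode("regression") |>
    fit(y ~ x, data = d_trn)
  tibble(k = k,
         rmse_trn = rmse_vec(d_trn$y, predict(fit_k, d_trn)$.pred),
         rmse_val = rmse_vec(d_val$y, predict(fit_k, d_val)$.pred))
}

map(c(1, 5, 10, 25, 50, 100), rmse_by_k) |>
  bind_rows()
# k = 1 looks nearly perfect in train but not in validation;
# only the validation RMSE identifies a sensible k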

Performance metrics

  • SSE, MSE, RMSE, MAE, \(R^2\) (see the sketch below)
  • How else is the SSE class of metrics used in the GLM?
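  • A minimal sketch of these metrics computed by hand and with yardstick, using hypothetical toy vectors
library(yardstick)

obs  <- c(10, 12, 15, 18, 20)  # hypothetical observed y
pred <- c(11, 11, 16, 16, 22)  # hypothetical predicted y-hat

sse <- sum((obs - pred)^2)    # sum of squared errors
mse <- sse / length(obs)      # mean squared error
rmse_by_hand <- sqrt(mse)     # back on the scale of y
mae_by_hand <- mean(abs(obs - pred))

rmse_vec(obs, pred)  # matches rmse_by_hand
mae_vec(obs, pred)   # matches mae_by_hand
rsq_vec(obs, pred)   # R squared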

KNN

  • How does KNN use the training data to make predictions? (see the sketch after this list)
  • What is k, and how does it get used when making predictions?
  • What is the impact of k on bias and variance/overfitting?
  • k=1: performance in train? in val?
  • Distance measures: use Euclidean (default in kknn)!
  • Tuning K: stay “tuned”
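  • A minimal sketch (hypothetical toy data) of the idea: find the k nearest training observations by Euclidean distance and average their y values (kknn’s default kernel actually weights neighbors by distance, but the logic is the same)
library(tidyverse)

d_trn <- tibble(x1 = c(1, 2, 3, 8, 9, 10),   # hypothetical training data
                x2 = c(1, 1, 2, 8, 9, 9),
                y  = c(5, 6, 7, 20, 22, 21))

new_obs <- tibble(x1 = 2.5, x2 = 1.5)  # observation to predict
k <- 3

d_trn |>
  mutate(dist = sqrt((x1 - new_obs$x1)^2 + (x2 - new_obs$x2)^2)) |>  # Euclidean distance
  slice_min(dist, n = k) |>   # keep the k nearest neighbors
  summarize(pred = mean(y))   # average their outcomes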

Interaction in KNN - Consider bias first (but also variance) in this example

  • Simulate data
  • Fit models for lm and knn with and without interaction
  • Took some shortcuts (no recipe, predict back into train)
library(tidymodels)
n <- 200
set.seed(5433)

d <- tibble(x1 = runif(n, 0,100), # uniform
               x2 = rep(c(0,1), n/2), # dichotomous
               x1_x2 = x1*x2, # interaction
               y = rnorm(n, 0 + 1*x1 + 10*x2 + 10* x1_x2, 20)) #DGP + noise

fit_lm <- 
  linear_reg() |>   
  set_engine("lm") |>   
  fit(y ~ x1 + x2, data = d)

fit_lm_int <- 
  linear_reg() |>   
  set_engine("lm") |>   
  fit(y ~ x1 + x2 + x1_x2, data = d)

fit_knn <- 
  nearest_neighbor(neighbors = 20) |>   
  set_engine("kknn") |>   
  set_mode("regression") |> 
  fit(y ~ x1 + x2, data = d)

fit_knn_int <- 
  nearest_neighbor(neighbors = 20) |>   
  set_engine("kknn") |>   
  set_mode("regression") |> 
  fit(y ~ x1 + x2 + x1_x2, data = d)

d <- d |> 
  mutate(pred_lm = predict(fit_lm, d)$.pred,
         pred_lm_int = predict(fit_lm_int, d)$.pred,
         pred_knn = predict(fit_knn, d)$.pred,
         pred_knn_int = predict(fit_knn_int, d)$.pred)
  • Predictions from linear model with and without interaction
    • You NEED interaction features with LM
d |> 
  ggplot(aes(x = x1, group = factor(x2), color = factor(x2))) +
    geom_line(aes(y = pred_lm)) +
    geom_point(aes(y = y)) +
    ggtitle("lm without interaction") +
    ylab("y") +
    scale_color_discrete(name = "x2")

d |> 
  ggplot(aes(x = x1, group = factor(x2), color = factor(x2))) +
    geom_line(aes(y = pred_lm_int)) +
    geom_point(aes(y = y)) +
    ggtitle("lm with interaction") +
    ylab("y") +
    scale_color_discrete(name = "x2")

  • Predictions from KNN with and without interaction
    • You do NOT need interaction features with KNN!
d |> 
  ggplot(aes(x = x1, group = factor(x2), color = factor(x2))) +
    geom_line(aes(y = pred_knn)) +
    geom_point(aes(y = y)) +
    ggtitle("KNN without interaction") +
    ylab("y") +
    scale_color_discrete(name = "x2")

d |> 
  ggplot(aes(x = x1, group = factor(x2), color = factor(x2))) +
    geom_line(aes(y = pred_knn_int)) +
    geom_point(aes(y = y)) +
    ggtitle("KNN with interaction") +
    ylab("y") +
    scale_color_discrete(name = "x2")

LM vs. KNN - better with some predictors or overall?

  • “Why do some features seem to improve performance more in linear models or only in KNNs?”
  • “What are some contexts where KNN doesn’t work well? In other words, what are the advantages/disadvantages of using KNN?”
    • Always comes down to bias vs. variance
    • Flexibility and N are key moderators of these two factors.
    • k? - impact on bias, variance?
  • KNN for explanation?
    • Visualizations (think of the interaction plot above) make the effect clear
    • Will learn more (better visualizations, variable importance, model comparisons) in a later unit

GLM assumptions

What are the GLM assumptions and what happens if they are violated?

  • (No) measurement error in the predictors

  • Independent (errors)

  • Normal (distribution of errors)

  • Constant (variance of errors)

  • Linear (relationship between predictors and outcome)

Normalizing transformations - Yeo-Johnson

  • when is it needed for lm?
  • when is it needed for KNN? (see the sketch below)
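  • A minimal sketch of applying Yeo-Johnson to numeric predictors in a recipe (the data here are hypothetical)
library(tidymodels)

set.seed(456)
d <- tibble(x = rexp(200, rate = 0.1),       # hypothetical right-skewed predictor
            y = rnorm(200, log(x + 1), 1))

rec <- recipe(y ~ x, data = d) |>
  step_YeoJohnson(all_numeric_predictors())  # lambda is estimated from the training data

rec |>
  prep(training = d) |>
  bake(new_data = NULL) |>   # the transformed training data
  pull(x) |>
  hist(main = "x after Yeo-Johnson")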

Categorical coding

  • Why do we do it? (Not needed for every algorithm!)
  • Describe the values assigned to the dummy coded features
  • Why these values? In other words, how can you interpret the effect of a dummy coded feature?
  • “Dummy variable trap”
  • How is it different from one-hot coding? When should you use or not use one-hot coding? (see the sketch below)
  • Target encoding
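  • A minimal sketch of dummy vs. one-hot coding with recipes (hypothetical data)
library(tidymodels)

d <- tibble(color = factor(c("red", "green", "blue", "green")),  # hypothetical
            y = c(1, 2, 3, 4))

# dummy coding: C - 1 features; the reference level is coded as all zeros
recipe(y ~ color, data = d) |>
  step_dummy(color) |>
  prep() |>
  bake(new_data = NULL)

# one-hot coding: C features, one per level (redundant with an intercept in lm,
# i.e., the "dummy variable trap", but fine for algorithms like KNN)
recipe(y ~ color, data = d) |>
  step_dummy(color, one_hot = TRUE) |>
  prep() |>
  bake(new_data = NULL)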

Exploration

  • “I feel that I can come up with models that decrease the RMSE, but I don’t have good priors on whether adding any particular variable or observation will result in an improved model. I still feel a little weird just adding and dropping variables into a KNN and seeing what gets the validation RMSE the lowest (even though because we’re using validation techniques it’s a fine technique)”
    • Exploration is learning. This is research. If you knew the answer you wouldn’t be doing the study
    • Domain knowledge is still VERY important
    • Some algorithms (LASSO, glmnet) will help with feature selection
    • staying organized
      • Script structure
      • Good documentation - QMD as analysis notebook
    • Some overfitting to validation will occur? Consequence? Solutions?

“Curse of dimensionality” - Bias vs. variance

  • Missing features produce biased models.
  • Unnecessary features, or even many features relative to N, produce variance (simulated in the sketch below).
  • Does your available N support the features you need to have low bias?
    • Mostly an empirical question - can’t really tell otherwise outside of simulated data. Validation set is critical!
    • Flexible models often need more N holding features constant
    • Regularization (unit 6) will work well when lots of features
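  • A minimal sketch (hypothetical simulation) of how irrelevant features hurt KNN in validation even though they add no signal
library(tidymodels)

set.seed(20010)
n_obs <- 300
d_all <- tibble(x1 = runif(n_obs, 0, 10),
                y = rnorm(n_obs, 2 * x1, 3))  # only x1 is in the DGP

# add 20 irrelevant noise features
for (j in 1:20) {
  d_all[[paste0("noise_", j)]] <- runif(n_obs, 0, 10)
}

splits <- initial_split(d_all, prop = 0.75)
d_trn <- training(splits)
d_val <- testing(splits)

val_rmse <- function(a_formula) {
  fit_k <- nearest_neighbor(neighbors = 20) |>
    set_engine("kknn") |>
    set_mode("regression") |>
    fit(a_formula, data = d_trn)
  rmse_vec(d_val$y, predict(fit_k, d_val)$.pred)
}

val_rmse(y ~ x1)  # only the true feature
val_rmse(y ~ .)   # true feature plus 20 noise features; distances become less informative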

Transformations of numeric predictors

  • Use of plot_truth() [predicted vs. observed]

  • Residuals do not have mean of 0 for every \(\hat{y}\)

    • Consequence: biased parameter estimates. A linear model is a bad approximation of the DGP
    • Also a bad test of questions about the predictor (underestimate the effect? be misinformed?)
  • Non-normal residuals

    • Consequence: lm parameter estimates still unbiased (for linear DGP) but more “efficient” solutions exist
    • Bad for prediction b/c higher variance than other solutions
    • May suggest omission of variables
  • Heteroscedasticity

    • Consequence: Inefficient and inaccurate standard errors.
    • Statistical tests wrong
    • Poor prediction for some \(\hat{y}\) (where the residual variance is larger)
    • higher variance overall than other solutions - bad again for prediction
  • Transformation of outcome?

    • metric is now on the transformed scale
    • back-transform predictions to raw units (see the sketch below)
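  • A minimal sketch (hypothetical data) of the issue: if you model log(y), RMSE is in log units, so back-transform the predictions before computing RMSE in raw units
library(tidymodels)

set.seed(8675)
d <- tibble(x = runif(300, 0, 10),
            y = exp(rnorm(300, 0.3 * x, 0.4)))  # hypothetical right-skewed outcome

d <- d |>
  mutate(y_log = log(y))

fit_log <- linear_reg() |>
  set_engine("lm") |>
  fit(y_log ~ x, data = d)

d |>
  mutate(pred_log = predict(fit_log, d)$.pred,
         pred_raw = exp(pred_log)) |>  # back-transform to the raw scale of y
  summarize(rmse_log_units = rmse_vec(y_log, pred_log),  # not comparable across models of y
            rmse_raw_units = rmse_vec(y, pred_raw))      # comparable to models fit on raw y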

Sources of error

  • What are the two broad sources of error?
  • What are the two broad sources of reducible error? Describe what they are and the factors that affect them.
  • Why do we need independent validation data to select the best model configuration?
  • Why do we need test data if we used validation data to select among many model configurations?
  • What is RMSE? Connect it to a metric you already know. How is it used in lm (two ways) and in KNN (one way)?
  • How do bias and variance manifest when you look at your performance metric (RMSE) in the training and validation sets?
  • Will the addition of new features to a (lm?) model always reduce RMSE in train? In validation? Connect this to the concepts of bias and variance.

“In GLM, why correlation/collinearity among predictors will cause larger variance? Is it because of overfitting?”
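  • A minimal sketch (hypothetical simulation) of one answer: across repeated samples the coefficient estimates stay unbiased, but their sampling variance inflates when predictors are highly correlated because they carry overlapping information; this is a property of the estimates, not overfitting per se
library(tidyverse)

set.seed(1234)

sim_b1 <- function(rho, n_obs = 100) {
  x1 <- rnorm(n_obs)
  x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n_obs)  # correlated with x1 at about rho
  y <- rnorm(n_obs, 1 * x1 + 1 * x2, 2)            # both true effects = 1
  coef(lm(y ~ x1 + x2))[["x1"]]                    # estimate for x1 in this sample
}

tibble(b1_rho_0  = map_dbl(1:500, ~ sim_b1(rho = 0)),
       b1_rho_90 = map_dbl(1:500, ~ sim_b1(rho = 0.9))) |>
  summarize(across(everything(), list(mean = mean, sd = sd)))
# both means are near 1 (unbiased), but the SD (variance) of the estimate is
# much larger when the predictors are highly correlated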

KNN (black box) for explanatory purposes