IAML Unit 4: Discussion

Announcements

  • Please meet with the TA or me if you can’t generate predictions from your models
  • And the winner is…..
library(tidyverse)
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/print_kbl.R?raw=true",
                     sha1 = "021a7f7cddc1f0ffcd0613e57b94c81246b84f7b")
read_csv(here::here("./application_assignments/competitions/2025_unit_04.csv"),
         show_col_types = FALSE) |>
  mutate(acc_test = round(acc_test, 3)) |> 
  print_kbl()
name acc_test
Tanchone 0.79
Hwang 0.78
Diao 0.78
Lin 0.78
Villwock 0.78
Higgins 0.77
Khoury 0.77
Luo 0.76
Janssen 0.76
Yu 0.76
Zhao 0.76
Cha 0.75
Yin 0.75
Baker 0.75
Lau 0.75
Xu 0.75
Vo 0.74
Wong 0.74
Holenarasipura 0.74
Zhang 0.74
Ahmed 0.74
Hey 0.73
Jolly 0.73
Murray 0.70
Cox 0.65

Kaggle Competitions

Quiz review!

Comparisons across algorithms

Logistic regression models the DGP for conditional probabilities using the logistic function

  • Get parameter estimates for effects of X
  • Makes a strong assumption about the shape of the DGP: linear on the log-odds of Y
  • Yields a linear decision boundary
  • Best suited to binary outcomes but can handle more than two levels (with some effort)
    • (Briefly describe: the one-vs-all approach of fitting multiple binary models)
  • Needs numeric features but can dummy code categorical variables (as with lm)
  • Problems when classes are fully separable (or even mostly separable); a minimal fitting sketch follows this list
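For reference, here is a minimal tidymodels fitting sketch. The tibble data_trn (factor outcome y, numeric features x1 and x2) is a hypothetical placeholder at this point; a data set like it is simulated later in this document.

library(tidymodels)

fit_logistic <- 
  logistic_reg() |> 
  set_engine("glm") |> 
  fit(y ~ x1 + x2, data = data_trn)

# parameter estimates are on the log-odds scale
tidy(fit_logistic)

# estimated conditional probabilities, Pr(Y = k | X)
predict(fit_logistic, new_data = data_trn, type = "prob")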

LDA uses Bayes theorem to estimate conditional probabilities

  • LDA models the distributions of the Xs separately for each class
  • Then uses Bayes theorem to estimate \(Pr(Y = k | X)\) for each k and assigns the observation to the class with the highest probability

\(Pr(Y = k|X) = \frac{\pi_k * f_k(X)}{\sum_{l = 1}^{K} \pi_l * f_l(X)}\)

where

  • \(\pi_k\) is the prior probability that an observation comes from class k (estimated from frequencies of k in training)
  • \(f_k(X)\) is the density function of X for an observation from class k
    • \(f_k(X)\) is large if there is a high probability that an observation in class k has that set of values for X and small if that probability is low
    • \(f_k(X)\) is difficult to estimate unless we make some simplifying assumptions:
    • X is multivariate normal
    • Common covariance matrix (\(\Sigma\)) across the K classes
    • With these assumptions, we can estimate \(\pi_k\), \(\mu_k\), and \(\Sigma\) from the training set and calculate \(Pr(Y = k|X)\) for each k (a sketch of this calculation follows this list)
  • Parametric model but parameters not useful for interpretation of effects of X
  • Linear decision boundary
  • Assumptions about multivariate normal X and common \(\Sigma\)
  • Dummy features may not work well given the assumption of normally distributed X?
  • May require smaller sample sizes to fit than logistic regression if assumptions met
  • Can natively handle more than two levels for the outcome
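To make the Bayes theorem formula concrete, here is a hedged sketch that estimates \(\pi_k\), \(\mu_k\), and the pooled \(\Sigma\) from training data and then computes \(Pr(Y = k|X)\) by hand for one observation. It assumes a training tibble data_trn (factor outcome y, numeric features x1 and x2, like the one simulated later in this document) and uses mvtnorm::dmvnorm() for the multivariate normal densities; it is an illustration of the formula, not how the MASS or klaR engines are implemented.

library(tidyverse)

# class priors pi_k from training frequencies
pi_k <- data_trn |> 
  count(y) |> 
  mutate(pi = n / sum(n)) |> 
  pull(pi)

# class means mu_k and a common (pooled) covariance Sigma
x_by_class <- data_trn |> 
  select(x1, x2) |> 
  split(data_trn$y)
mu_k <- lapply(x_by_class, colMeans)
sigma_pooled <- Reduce(`+`, lapply(x_by_class, function(d) cov(d) * (nrow(d) - 1))) /
  (nrow(data_trn) - nlevels(data_trn$y))

# posterior Pr(Y = k | X = x_0) for one new observation x_0
x_0 <- c(x1 = 1, x2 = -2)
f_k <- sapply(mu_k, function(m) mvtnorm::dmvnorm(x_0, mean = m, sigma = sigma_pooled))
(pi_k * f_k) / sum(pi_k * f_k)  # assign x_0 to the class with the highest posterior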

QDA relaxes one restrictive assumption of LDA

  • Still requires multivariate normal X
  • But it allows each class to have its own \(\Sigma\)
  • This makes it:
    • More flexible
    • Able to model non-linear decision boundaries including 2-way interactions (see formula for discriminant in James et al. (2023))
    • How does relaxing the common \(\Sigma\) allow it to handle interactions?
    • But requires substantial increase in parameter estimation (more potential to overfit)
  • Still problems with dummy features (not normal; product terms?)
  • Can natively handle more than 2 levels of outcome like LDA
  • Compare to LDA and Logistic Regression on bias-variance trade off?

RDA may be better than both LDA and QDA? More on the idea of blending when we get to the elastic net
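As a preview, here is a hedged sketch of an RDA fit with the klaR engine (the same engine used for QDA later in this document), again assuming the hypothetical data_trn from above. frac_common_cov = 1 recovers LDA, frac_common_cov = 0 recovers QDA, and intermediate values blend the two; in practice it would typically be tuned rather than fixed.

library(tidymodels)
library(discrim, exclude = "smoothness")

# a 50/50 blend of the common (LDA) and class-specific (QDA) covariance matrices
fit_rda <- 
  discrim_regularized(frac_common_cov = 0.5, frac_identity = 0) |> 
  set_engine("klaR") |> 
  fit(y ~ x1 + x2, data = data_trn)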

KNN classification works similarly to KNN regression

  • But now looks at the proportion of the nearest neighbors in each class to estimate the conditional probabilities
  • Doesn’t make the assumptions about X or \(\Sigma\) required by LDA and QDA
  • Not limited to linear decision boundaries like logistic and LDA
  • Very flexible - low bias but high variance?
  • K can be adjusted to impact bias-variance trade-off
  • KNN can handle outcomes with more than two levels natively (a brief fitting sketch follows this list)
  • KNN can be computationally costly with big N (and many X)
    • Can down-sample training data to reduce this problem
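A minimal fitting sketch with the kknn engine, again assuming the hypothetical data_trn and a held-out data_val (factor outcome y, numeric features x1 and x2):

library(tidymodels)

# KNN is distance-based, so features should be put on the same scale
rec_knn <- recipe(y ~ x1 + x2, data = data_trn) |> 
  step_normalize(all_numeric_predictors())

fit_knn <- 
  workflow() |> 
  add_recipe(rec_knn) |> 
  add_model(nearest_neighbor(neighbors = 20) |>   # K controls the bias-variance trade-off
              set_engine("kknn") |> 
              set_mode("classification")) |> 
  fit(data = data_trn)

# estimated conditional probabilities in validation
predict(fit_knn, new_data = data_val, type = "prob")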

Summary

  • Both logistic and LDA are linear functions of X and therefore produce linear decision boundaries

  • LDA makes additional assumptions about X (multivariate normal and common \(\Sigma\)) beyond logistic regression. Relative performance depends on how well these assumptions hold

  • QDA relaxes the LDA assumption about common \(\Sigma\) (and RDA can relax it partially)

    • This also allows for nonlinear decision boundaries including 2-way interactions among features
    • QDA is therefore more flexible, which means possibly less bias but more potential for overfitting
  • Both QDA and LDA assume multivariate normal X so may not accommodate categorical predictors very well. Logistic and KNN do accommodate categorical predictors

  • KNN is non-parametric and therefore the most flexible

    • Can also handle interactions and non-linear effects natively (logistic regression needs feature engineering to do this)
    • Increased overfitting, decreased bias?
    • Not very interpretable. But LDA/QDA, although parametric, aren’t as interpretable as logistic regression
  • Logistic regression fails when classes are perfectly separated (but does that ever happen?) and is less stable when classes are well separated

  • LDA, KNN, and QDA naturally accommodate more than two classes

    • Logistic requires an additional tweak (e.g., multiple one-vs-all models)
  • Sample size issues

    • Logistic regression requires relatively large sample sizes.
    • LDA may perform better than logistic regression with smaller sample sizes if assumptions are met (QDA?)
    • KNN can be computationally very costly with large sample sizes (and large number of X) but could always downsample training set.

Interactions in LDA and QDA

  • Simulate multivariate normal distribution for X (x1 and x2) using MASS package
  • Separately for trn and val
  • NOTE: I first did this with uniform distributions on X and the models fit more poorly. Why?
library(tidymodels)
library(discrim, exclude = "smoothness")
set.seed(5433)
means <- c(0, 0)        # means for x1 and x2
sigma <- diag(2) * 100  # covariance matrix: variances of 100, no correlation between x1 and x2
data_trn <- MASS::mvrnorm(n = 300, mu = means, Sigma = sigma) |>  
    magrittr::set_colnames(str_c("x", 1:length(means))) |>  
    as_tibble()

data_val <- MASS::mvrnorm(n = 3000, mu = means, Sigma = sigma) |>  
    magrittr::set_colnames(str_c("x", 1:length(means))) |>  
    as_tibble()
  • Write function for interactive DGP based on x1 and x2
  • Will apply this to the rows of data_trn and data_val
  • Can specify any values for b
  • b[4] is the interaction parameter in the DGP
b <- c(0, 0, 0, .5)

# inverse logit of a linear predictor that includes an x1:x2 interaction
calc_p <- function(x, b){
   exp(b[1] + b[2]*x$x1 + b[3]*x$x2 + b[4]*x$x1*x$x2) /
     (1 + exp(b[1] + b[2]*x$x1 + b[3]*x$x2 + b[4]*x$x1*x$x2))
}
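As an aside, calc_p() is just the inverse logit of the linear predictor, so the same probabilities could be computed with base R’s plogis():

# equivalent to calc_p(), using base R's logistic CDF (inverse logit)
calc_p2 <- function(x, b){
  plogis(b[1] + b[2]*x$x1 + b[3]*x$x2 + b[4]*x$x1*x$x2)
}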
  • Add p and then observed classes to trn and val
data_trn <- data_trn |> 
  mutate(p = calc_p(data_trn, b)) |> 
  mutate(y = rbinom(nrow(data_trn), 1, p),
         y = factor(y, levels = 0:1, labels = c("neg", "pos")))

head(data_trn, 10)
# A tibble: 10 × 4
        x1     x2        p y    
     <dbl>  <dbl>    <dbl> <fct>
 1   9.85   -5.53 1.52e-12 neg  
 2  -2.48    5.64 9.21e- 4 neg  
 3  -4.74  -13.7  1.00e+ 0 pos  
 4  -7.02   12.6  6.09e-20 neg  
 5   0.942  -3.48 1.63e- 1 pos  
 6   0.151   2.78 5.52e- 1 neg  
 7   7.74   -6.84 3.24e-12 neg  
 8  -4.85  -10.1  1.00e+ 0 pos  
 9  -0.937   2.03 2.78e- 1 neg  
10 -16.5     6.08 1.83e-22 neg  
data_val <- data_val |> 
  mutate(p = calc_p(data_val, b)) |> 
  mutate(y = rbinom(nrow(data_val), 1, p),
         y = factor(y, levels = 0:1, labels = c("neg", "pos")))
  • Let’s look at what an interactive DGP looks like for two features and a binary outcome
  • Parameter estimates set up a “cross-over” interaction
data_val |> 
  ggplot(mapping = aes(x = x1, y = x2, color = y)) +
    geom_point(size = 2, alpha = .5)
  • Fit models in trn
fit_lda <- 
  discrim_linear() |> 
  set_engine("MASS") |> 
  fit(y ~ x1 + x2, data = data_trn)

fit_lda_int <- 
  discrim_linear() |> 
  set_engine("MASS") |> 
  fit(y ~ x1 + x2 + x1*x2, data = data_trn)

fit_qda <- 
  discrim_regularized(frac_common_cov = 0, frac_identity = 0) |> 
  set_engine("klaR") |> 
  fit(y ~ x1 + x2, data = data_trn)

fit_qda_int <- 
  discrim_regularized(frac_common_cov = 0, frac_identity = 0) |> 
  set_engine("klaR") |> 
  fit(y ~ x1 + x2 + x1*x2, data = data_trn)
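The plots below use plot_decision_boundary(), which is assumed to come from the course’s lab_support helpers (sourced outside this document). Purely as an illustration, a minimal sketch of such a helper might look like the following; the interpretation of n_points (here, grid points per axis) is a guess.

# illustrative sketch only; not the course's actual helper
plot_decision_boundary <- function(data, model, x_names, y_name, n_points = 100) {
  # grid of feature values spanning the observed range
  grid <- expand.grid(
    seq(min(data[[x_names[1]]]), max(data[[x_names[1]]]), length.out = n_points),
    seq(min(data[[x_names[2]]]), max(data[[x_names[2]]]), length.out = n_points)
  ) |> 
    setNames(x_names) |> 
    as_tibble()

  # predicted class at every grid point from the fitted parsnip model
  grid$pred <- predict(model, new_data = grid)$.pred_class

  ggplot(data, aes(x = .data[[x_names[1]]], y = .data[[x_names[2]]])) +
    geom_tile(data = grid, aes(fill = pred), alpha = .3) +
    geom_point(aes(color = .data[[y_name]]), size = 1, alpha = .5)
}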
  • Additive LDA model decision boundary and performance in val
data_val |> 
  plot_decision_boundary(fit_lda, x_names = c("x1", "x2"), y_name = "y",
                         n_points = 400)
  • Interactive LDA model decision boundary and performance in val
data_val |> 
  plot_decision_boundary(fit_lda_int, x_names = c("x1", "x2"), y_name = "y",
                         n_points = 400) 
  • Additive QDA model decision boundary
data_val |> 
  plot_decision_boundary(fit_qda, x_names = c("x1", "x2"), y_name = "y",
                         n_points = 400)
  • Interactive QDA model decision boundary
data_val |> 
  plot_decision_boundary(fit_qda_int, x_names = c("x1", "x2"), y_name = "y",
                         n_points = 400) 
  • Costs for QDA vs. interactive LDA in this example, and more generally with more features?
  • What if you were using RDA, which can model the full range of models between linear and quadratic?

Categorical predictors

  • All algorithms so far require numeric features
  • Ordinal features can sometimes be made numeric by just substituting an ordered numeric vector (i.e., 1, 2, 3, etc.)
  • Nominal needs something more
  • Our go-to method is dummy features
    • What is the big problem with dummy features?
    • Collapsing levels?
    • Difference between dummy coding and one-hot encoding
    • What is the “dummy variable trap” with one-hot encoding?
    • Issue of binary scores for LDA/QDA
  • Target encoding example
    • Country of origin for car example (but maybe think of many countries?)
    • Why isn’t this data leakage?
    • Problems with step_mutate()
    • Can manually do it with our current resampling
    • See step_lencode_*() in the embed package (sketch after this list)
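A hedged sketch contrasting these options with recipes (plus the embed package for target encoding). The tibble data_trn here is hypothetical, with a factor outcome y and a nominal feature country; check the embed documentation for the exact step_lencode_*() arguments.

library(tidymodels)
library(embed)

# dummy coding: k - 1 indicator features (avoids the "dummy variable trap")
rec_dummy <- recipe(y ~ country, data = data_trn) |> 
  step_dummy(country)

# one-hot encoding: k indicator features, one per level
rec_onehot <- recipe(y ~ country, data = data_trn) |> 
  step_dummy(country, one_hot = TRUE)

# target encoding: each level replaced by an outcome-based score estimated
# from the training data only (so no leakage into validation)
rec_target <- recipe(y ~ country, data = data_trn) |> 
  step_lencode_glm(country, outcome = vars(y))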

DGP and Errors

  • DGP on probability
  • What is the irreducible error for classification?
  • DGP on X1 - draw it with varying degrees of error
  • DGP and error on two features

Bayes Classifier

The previous figure displays simulated data for a classification problem with K = 2 classes as a function of X1 and X2

The Bayes classifier assigns each observation to its most likely class given the conditional probabilities at its values of X1 and X2

  • \(Pr(Y = k | X = x_0)\) for \(k = 1, \dots, K\)
  • For K = 2, this means assigning to the class with Pr > .50
  • This decision boundary for the two class problem is displayed in the figure

The Bayes classifier provides the minimum error rate for test data

  • Error rate for any \(x_0\) will be \(1 - \max_k Pr(Y = k | X = x_0)\)
  • Overall error rate will be the average of this across all possible X
  • This is the irreducible error for classification problems
  • This is a theoretical model because (except with simulated data) we don’t know the conditional probabilities based on X
  • Many classification models try to estimate these conditionals
  • Probability vs. odds vs. log-odds (worked example after this list)
  • How to interpret parameter estimates (effects of X)
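A small worked example of moving among these scales (the values are arbitrary):

p <- 0.80
odds <- p / (1 - p)    # 4: the event is 4 times as likely as its complement
log_odds <- log(odds)  # ~1.39: the scale on which logistic regression is linear in X

# back to a probability (inverse logit)
exp(log_odds) / (1 + exp(log_odds))  # 0.80
plogis(log_odds)                     # same result with base R's logistic CDF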

PCA

  • https://setosa.io/ev/principal-component-analysis/
  • https://www.cs.cmu.edu/~elaw/papers/pca.pdf

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2023. An Introduction to Statistical Learning: With Applications in R. 2nd ed. Springer Texts in Statistics. New York: Springer-Verlag.