IAML Unit 4: Discussion

Announcements

  • Please meet with the TA or me if you can’t generate predictions from your models
  • And the winner is…..
library(tidyverse)
devtools::source_url("https://github.com/jjcurtin/lab_support/blob/main/print_kbl.R?raw=true",
                     sha1 = "021a7f7cddc1f0ffcd0613e57b94c81246b84f7b")
read_csv(here::here("./application_assignments/competitions/2025_unit_04.csv"),
         show_col_types = FALSE) |>
  mutate(acc_test = round(acc_test, 3)) |> 
  print_kbl()
name acc_test
Tanchone 0.79
Hwang 0.78
Diao 0.78
Lin 0.78
Villwock 0.78
Higgins 0.77
Khoury 0.77
Luo 0.76
Janssen 0.76
Yu 0.76
Zhao 0.76
Cha 0.75
Yin 0.75
Baker 0.75
Lau 0.75
Xu 0.75
Vo 0.74
Wong 0.74
Holenarasipura 0.74
Zhang 0.74
Ahmed 0.74
Hey 0.73
Jolly 0.73
Murray 0.70
Cox 0.65

Kaggle Competitions

Quiz review!

Comparisons across algorithms

Logistic regression models the DGP for conditional probabilities using the logistic function

  • Get parameter estimates for effects of X
  • Makes a strong assumption about the shape of the DGP: linear on the log-odds of Y
  • Yields a linear decision boundary
  • Best suited to binary outcomes but can handle more than two levels (with some effort)
    • (Briefly describe: the one-vs-all approach of fitting multiple binary models)
  • Needs numeric features but can dummy code categorical variables (as with lm)
  • Problems when classes are fully separable (or even mostly separable); a minimal fitting sketch follows this list
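For reference, here is a minimal tidymodels fitting sketch. The tibble data_trn (factor outcome y, numeric features x1 and x2) is a hypothetical placeholder at this point; a data set like it is simulated later in this document.

library(tidymodels)

fit_logistic <- 
  logistic_reg() |> 
  set_engine("glm") |> 
  fit(y ~ x1 + x2, data = data_trn)

# parameter estimates are on the log-odds scale
tidy(fit_logistic)

# estimated conditional probabilities, Pr(Y = k | X)
predict(fit_logistic, new_data = data_trn, type = "prob")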

LDA uses Bayes theorem to estimate conditional probabilities

  • LDA models the distributions of the Xs separately for each class
  • Then uses Bayes theorem to estimate \(Pr(Y = k | X)\) for each k and assigns the observation to the class with the highest probability

\(Pr(Y = k|X) = \frac{\pi_k * f_k(X)}{\sum_{l = 1}^{K} \pi_l * f_l(X)}\)

where

  • \(\pi_k\) is the prior probability that an observation comes from class k (estimated from frequencies of k in training)
  • \(f_k(X)\) is the density function of X for an observation from class k
    • \(f_k(X)\) is large if there is a high probability that an observation in class k has that set of values for X and small if that probability is low
    • \(f_k(X)\) is difficult to estimate unless we make some simplifying assumptions:
    • X is multivariate normal
    • Common covariance matrix (\(\Sigma\)) across the K classes
    • With these assumptions, we can estimate \(\pi_k\), \(\mu_k\), and \(\Sigma\) from the training set and calculate \(Pr(Y = k|X)\) for each k (a sketch of this calculation follows this list)
  • Parametric model but parameters not useful for interpretation of effects of X
  • Linear decision boundary
  • Assumptions about multivariate normal X and common \(\Sigma\)
  • Dummy features may not work well given the assumption of normally distributed X?
  • May require smaller sample sizes to fit than logistic regression if assumptions met
  • Can natively handle more than two levels for the outcome
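To make the Bayes theorem formula concrete, here is a hedged sketch that estimates \(\pi_k\), \(\mu_k\), and the pooled \(\Sigma\) from training data and then computes \(Pr(Y = k|X)\) by hand for one observation. It assumes a training tibble data_trn (factor outcome y, numeric features x1 and x2, like the one simulated later in this document) and uses mvtnorm::dmvnorm() for the multivariate normal densities; it is an illustration of the formula, not how the MASS or klaR engines are implemented.

library(tidyverse)

# class priors pi_k from training frequencies
pi_k <- data_trn |> 
  count(y) |> 
  mutate(pi = n / sum(n)) |> 
  pull(pi)

# class means mu_k and a common (pooled) covariance Sigma
x_by_class <- data_trn |> 
  select(x1, x2) |> 
  split(data_trn$y)
mu_k <- lapply(x_by_class, colMeans)
sigma_pooled <- Reduce(`+`, lapply(x_by_class, function(d) cov(d) * (nrow(d) - 1))) /
  (nrow(data_trn) - nlevels(data_trn$y))

# posterior Pr(Y = k | X = x_0) for one new observation x_0
x_0 <- c(x1 = 1, x2 = -2)
f_k <- sapply(mu_k, function(m) mvtnorm::dmvnorm(x_0, mean = m, sigma = sigma_pooled))
(pi_k * f_k) / sum(pi_k * f_k)  # assign x_0 to the class with the highest posterior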

QDA relaxes one restrictive assumption of LDA

  • Still requires multivariate normal X
  • But it allows each class to have its own \(\Sigma\)
  • This makes it:
    • More flexible
    • Able to model non-linear decision boundaries including 2-way interactions (see formula for discriminant in James et al. (2023))
    • How does relaxing the common \(\Sigma\) allow it to handle interactions?
    • But requires substantial increase in parameter estimation (more potential to overfit)
  • Still problems with dummy features (not normal; product terms?)
  • Can natively handle more than 2 levels of outcome like LDA
  • Compare to LDA and Logistic Regression on bias-variance trade off?

RDA may be better than both LDA and QDA? More on the idea of blending when we get to the elastic net
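As a preview, here is a hedged sketch of an RDA fit with the klaR engine (the same engine used for QDA later in this document), again assuming the hypothetical data_trn from above. frac_common_cov = 1 recovers LDA, frac_common_cov = 0 recovers QDA, and intermediate values blend the two; in practice it would typically be tuned rather than fixed.

library(tidymodels)
library(discrim, exclude = "smoothness")

# a 50/50 blend of the common (LDA) and class-specific (QDA) covariance matrices
fit_rda <- 
  discrim_regularized(frac_common_cov = 0.5, frac_identity = 0) |> 
  set_engine("klaR") |> 
  fit(y ~ x1 + x2, data = data_trn)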

KNN classification works similarly to KNN regression

  • But now looks at the proportion of the nearest neighbors in each class to estimate the conditional probabilities
  • Doesn’t make the assumptions about X or \(\Sigma\) required by LDA and QDA
  • Not limited to linear decision boundaries like logistic and LDA
  • Very flexible - low bias but high variance?
  • K can be adjusted to impact bias-variance trade-off
  • KNN can handle outcomes with more than two levels natively (a brief fitting sketch follows this list)
  • KNN can be computationally costly with big N (and many X)
    • Can down-sample training data to reduce this problem
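A minimal fitting sketch with the kknn engine, again assuming the hypothetical data_trn and a held-out data_val (factor outcome y, numeric features x1 and x2):

library(tidymodels)

# KNN is distance-based, so features should be put on the same scale
rec_knn <- recipe(y ~ x1 + x2, data = data_trn) |> 
  step_normalize(all_numeric_predictors())

fit_knn <- 
  workflow() |> 
  add_recipe(rec_knn) |> 
  add_model(nearest_neighbor(neighbors = 20) |>   # K controls the bias-variance trade-off
              set_engine("kknn") |> 
              set_mode("classification")) |> 
  fit(data = data_trn)

# estimated conditional probabilities in validation
predict(fit_knn, new_data = data_val, type = "prob")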

Summary

  • Both logistic and LDA are linear functions of X and therefore produce linear decision boundaries

  • LDA makes additional assumptions about X (multivariate normal and common \(\Sigma\)) beyond logistic regression. Relative performance depends on how well these assumptions hold

  • QDA relaxes the LDA assumption about common \(\Sigma\) (and RDA can relax it partially)

    • This also allows for nonlinear decision boundaries including 2-way interactions among features
    • QDA is therefore more flexible, which means possibly less bias but more potential for overfitting
  • Both QDA and LDA assume multivariate normal X so may not accommodate categorical predictors very well. Logistic and KNN do accommodate categorical predictors

  • KNN is non-parametric and therefore the most flexible

    • Can also handle interactions and non-linear effects natively (logistic regression needs feature engineering to do this)
    • Increased overfitting, decreased bias?
    • Not very interpretable. But LDA/QDA, although parametric, aren’t as interpretable as logistic regression
  • Logistic regression fails when classes are perfectly separated (but does that ever happen?) and is less stable when classes are well separated

  • LDA, KNN, and QDA naturally accommodate more than two classes

    • Logistic requires an additional tweak (e.g., multiple one-vs-all models)
  • Sample size issues

    • Logistic regression requires relatively large sample sizes.
    • LDA may perform better than logistic regression with smaller sample sizes if assumptions are met (QDA?)
    • KNN can be computationally very costly with large sample sizes (and large number of X) but could always downsample training set.

Interactions in LDA and QDA

  • Simulate multivariate normal distribution for X (x1 and x2) using MASS package
  • Separately for trn and val
  • NOTE: I first did this with uniform distributions on X and the models fit more poorly. Why?
library(tidymodels)
library(discrim, exclude = "smoothness")
set.seed(5433)
means <- c(0, 0)        # means for x1 and x2
sigma <- diag(2) * 100  # covariance matrix: variances of 100, no correlation between x1 and x2
data_trn <- MASS::mvrnorm(n = 300, mu = means, Sigma = sigma) |>  
    magrittr::set_colnames(str_c("x", 1:length(means))) |>  
    as_tibble()

data_val <- MASS::mvrnorm(n = 3000, mu = means, Sigma = sigma) |>  
    magrittr::set_colnames(str_c("x", 1:length(means))) |>  
    as_tibble()
  • Write function for interactive DGP based on x1 and x2
  • Will apply this to the rows of data_trn and data_val
  • Can specify any values for b
  • b[4] is the interaction parameter in the DGP
b <- c(0, 0, 0, .5)

# inverse logit of a linear predictor that includes an x1:x2 interaction
calc_p <- function(x, b){
   exp(b[1] + b[2]*x$x1 + b[3]*x$x2 + b[4]*x$x1*x$x2) /
     (1 + exp(b[1] + b[2]*x$x1 + b[3]*x$x2 + b[4]*x$x1*x$x2))
}
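As an aside, calc_p() is just the inverse logit of the linear predictor, so the same probabilities could be computed with base R’s plogis():

# equivalent to calc_p(), using base R's logistic CDF (inverse logit)
calc_p2 <- function(x, b){
  plogis(b[1] + b[2]*x$x1 + b[3]*x$x2 + b[4]*x$x1*x$x2)
}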
  • Add p and then observed classes to trn and val
data_trn <- data_trn |> 
  mutate(p = calc_p(data_trn, b)) |> 
  mutate(y = rbinom(nrow(data_trn), 1, p),
         y = factor(y, levels = 0:1, labels = c("neg", "pos")))

head(data_trn, 10)
# A tibble: 10 × 4
        x1     x2        p y    
     <dbl>  <dbl>    <dbl> <fct>
 1   9.85   -5.53 1.52e-12 neg  
 2  -2.48    5.64 9.21e- 4 neg  
 3  -4.74  -13.7  1.00e+ 0 pos  
 4  -7.02   12.6  6.09e-20 neg  
 5   0.942  -3.48 1.63e- 1 pos  
 6   0.151   2.78 5.52e- 1 neg  
 7   7.74   -6.84 3.24e-12 neg  
 8  -4.85  -10.1  1.00e+ 0 pos  
 9  -0.937   2.03 2.78e- 1 neg  
10 -16.5     6.08 1.83e-22 neg  
data_val <- data_val |> 
  mutate(p = calc_p(data_val, b)) |> 
  mutate(y = rbinom(nrow(data_val), 1, p),
         y = factor(y, levels = 0:1, labels = c("neg", "pos")))
  • Let’s look at what an interactive DGP looks like for two features and a binary outcome
  • Parameter estimates set up a “cross-over” interaction
data_val |> 
  ggplot(mapping = aes(x = x1, y = x2, color = y)) +
    geom_point(size = 2, alpha = .5)
  • Fit models in trn
fit_lda <- 
  discrim_linear() |> 
  set_engine("MASS") |> 
  fit(y ~ x1 + x2, data = data_trn)

fit_lda_int <- 
  discrim_linear() |> 
  set_engine("MASS") |> 
  fit(y ~ x1 + x2 + x1*x2, data = data_trn)

fit_qda <- 
  discrim_regularized(frac_common_cov = 0, frac_identity = 0) |> 
  set_engine("klaR") |> 
  fit(y ~ x1 + x2, data = data_trn)

fit_qda_int <- 
  discrim_regularized(frac_common_cov = 0, frac_identity = 0) |> 
  set_engine("klaR") |> 
  fit(y ~ x1 + x2 + x1*x2, data = data_trn)
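The plots below use plot_decision_boundary(), which is assumed to come from the course’s lab_support helpers (sourced outside this document). Purely as an illustration, a minimal sketch of such a helper might look like the following; the interpretation of n_points (here, grid points per axis) is a guess.

# illustrative sketch only; not the course's actual helper
plot_decision_boundary <- function(data, model, x_names, y_name, n_points = 100) {
  # grid of feature values spanning the observed range
  grid <- expand.grid(
    seq(min(data[[x_names[1]]]), max(data[[x_names[1]]]), length.out = n_points),
    seq(min(data[[x_names[2]]]), max(data[[x_names[2]]]), length.out = n_points)
  ) |> 
    setNames(x_names) |> 
    as_tibble()

  # predicted class at every grid point from the fitted parsnip model
  grid$pred <- predict(model, new_data = grid)$.pred_class

  ggplot(data, aes(x = .data[[x_names[1]]], y = .data[[x_names[2]]])) +
    geom_tile(data = grid, aes(fill = pred), alpha = .3) +
    geom_point(aes(color = .data[[y_name]]), size = 1, alpha = .5)
}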
  • Additive LDA model decision boundary and performance in val
data_val |> 
  plot_decision_boundary(fit_lda, x_names = c("x1", "x2"), y_name = "y",
                         n_points = 400)
  • Interactive LDA model decision boundary and performance in val
data_val |> 
  plot_decision_boundary(fit_lda_int, x_names = c("x1", "x2"), y_name = "y",
                         n_points = 400) 
  • Additive QDA model decision boundary
data_val |> 
  plot_decision_boundary(fit_qda, x_names = c("x1", "x2"), y_name = "y",
                         n_points = 400)
  • Interactive QDA model decision boundary
data_val |> 
  plot_decision_boundary(fit_qda_int, x_names = c("x1", "x2"), y_name = "y",
                         n_points = 400) 
  • Costs for QDA vs. interactive LDA in this example, and more generally with more features?
  • What if you were using RDA, which can model the full range of models between linear and quadratic?

Categorical predictors

  • All algorithms so far require numeric features
  • Ordinal features can sometimes be made numeric by just substituting an ordered numeric vector (i.e., 1, 2, 3, etc.)
  • Nominal needs something more
  • Our go-to method is dummy features
    • What is the big problem with dummy features?
    • Collapsing levels?
    • Difference between dummy coding and one-hot encoding
    • What is the “dummy variable trap” with one-hot encoding?
    • Issue of binary scores for LDA/QDA
  • Target encoding example
    • Country of origin for car example (but maybe think of many countries?)
    • Why isn’t this data leakage?
    • Problems with step_mutate()
    • Can manually do it with our current resampling
    • See step_lencode_*() in the embed package (sketch after this list)
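A hedged sketch contrasting these options with recipes (plus the embed package for target encoding). The tibble data_trn here is hypothetical, with a factor outcome y and a nominal feature country; check the embed documentation for the exact step_lencode_*() arguments.

library(tidymodels)
library(embed)

# dummy coding: k - 1 indicator features (avoids the "dummy variable trap")
rec_dummy <- recipe(y ~ country, data = data_trn) |> 
  step_dummy(country)

# one-hot encoding: k indicator features, one per level
rec_onehot <- recipe(y ~ country, data = data_trn) |> 
  step_dummy(country, one_hot = TRUE)

# target encoding: each level replaced by an outcome-based score estimated
# from the training data only (so no leakage into validation)
rec_target <- recipe(y ~ country, data = data_trn) |> 
  step_lencode_glm(country, outcome = vars(y))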

DGP and Errors

  • DGP on probability
  • What is the irreducible error for classification?
  • DGP on X1 - draw it with varying degrees of error
  • DGP and error on two features

Bayes Classifier

The previous figure displays simulated data for a classification problem with K = 2 classes as a function of X1 and X2

The Bayes classifier assigns each observation to its most likely class given the conditional probabilities at its values of X1 and X2

  • \(Pr(Y = k | X = x_0)\) for \(k = 1, \dots, K\)
  • For K = 2, this means assigning to the class with Pr > .50
  • This decision boundary for the two class problem is displayed in the figure

The Bayes classifier provides the minimum error rate for test data

  • Error rate for any \(x_0\) will be \(1 - \max_k Pr(Y = k | X = x_0)\)
  • Overall error rate will be the average of this across all possible X
  • This is the irreducible error for classification problems
  • This is a theoretical model because (except with simulated data) we don’t know the conditional probabilities based on X
  • Many classification models try to estimate these conditionals
  • Probability vs. odds vs. log-odds (worked example after this list)
  • How to interpret parameter estimates (effects of X)
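A small worked example of moving among these scales (the values are arbitrary):

p <- 0.80
odds <- p / (1 - p)    # 4: the event is 4 times as likely as its complement
log_odds <- log(odds)  # ~1.39: the scale on which logistic regression is linear in X

# back to a probability (inverse logit)
exp(log_odds) / (1 + exp(log_odds))  # 0.80
plogis(log_odds)                     # same result with base R's logistic CDF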

PCA

  • https://setosa.io/ev/principal-component-analysis/
  • https://www.cs.cmu.edu/~elaw/papers/pca.pdf

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2023. An Introduction to Statistical Learning: With Applications in R. 2nd ed. Springer Texts in Statistics. New York: Springer-Verlag.