Unit 14: The Generalized Linear Model

Question

What are the assumptions of general linear models?

The general linear model makes the 5 assumptions below. When these assumptions are met, OLS regression coefficients are MVUE (Minimum Variance Unbiased Estimators) and BLUE (Best Linear Unbiased Estimators).

  1. Exact X: The IVs are assumed to be known exactly (i.e., without measurement error).
  2. Independence: Residuals are independently distributed (prob. of obtaining a specific observation does not depend on other observations).
  3. Normality: All residual distributions are normally distributed.
  4. Constant variance: All residual distributions have a constant variance, \(SEE^2\).
  5. Linearity: All residual distributions (i.e., for each \(Y'\)) are assumed to have means equal to zero.

Question

What are the consequences of violating each of these assumptions? What options exist when these assumptions are violated?

  1. Exact X: Biased parameters (to the degree that measurement error exists). Use reliable measures.
  2. Independence: Inaccurate standard errors, degrees of freedom and significance tests. Use repeated measures or linear mixed effects models or ANCOVA.
  3. Normality: Inefficient (with large N). Use power transformations, generalized linear models.
  4. Constant variance: Inefficient and inaccurate standard errors. Use power transformations, SE corrections, weighted least squares, generalized linear models.
  5. Linearity: Biased parameter estimates. Use power transformations, polynomial regression, generalized linear models.

An Example

Predicting admission to a grad program in engineering based on quantitative GRE, GPA, and Undergraduate Institution Rank.

data |> head()
# A tibble: 6 × 6
  sub_id admit   gre   gpa rank  rank_2   
   <dbl> <dbl> <dbl> <dbl> <fct> <fct>    
1    290     0   420  2.1  4     Low rank 
2     41     0   560  2.2  2     High rank
3    373     0   680  2.3  1     High rank
4     49     0   440  2.05 4     Low rank 
5    157     0   560  2.42 2     High rank
6    295     0   480  2.55 1     High rank
data |> 
  skim() |> 
  focus(n_missing, numeric.mean, numeric.sd, 
        min = numeric.p0, max = numeric.p100) |> 
  yank("numeric")

Variable type: numeric

skim_variable n_missing mean sd min max
sub_id 0 200.50 115.61 1.00 400
admit 0 0.42 0.49 0.00 1
gre 0 587.70 115.52 220.00 800
gpa 0 3.39 0.39 2.05 4

You will occasionally see linear regressions conducted with binary outcomes.

  • The binary outcome is coded as 0 and 1.
  • This model is called the linear probability model where \(\hat{Y}\) is considered the probability of membership in the group coded 1
m_lm <- lm(admit ~ gpa, data = data)
m_lm |> tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -1.94     0.182      -10.7 1.45e-23
2 gpa            0.697    0.0534      13.1 1.15e-32

Question

What is the effect of GPA on admission to grad school?

Question

What are the problems with using a general linear model to assess the effects of these predictors on admission outcomes?

  1. Residuals will not be normal (not efficient).

  2. Residual variance often will not be constant (not efficient, SEs are inaccurate).

  3. Probability may not be linear on X (yielding biased parameter estimates).

  4. \(\hat{Y}\) is not constrained between 0 – 1 (model may make nonsensical predictions if we consider \(\hat{Y}\) to be probabilites).

Here is a scatterplot of admit as a function of GPA.

Code
data |> 
  ggplot(aes(x = gpa, y = admit)) +
  geom_point() +
  scale_x_continuous(limits = c(0, 4)) +
  labs(y = "p(admit)")

Overplotting can be a problem in scatterplots when one (or both) of the variables are integers.

Jittering that variable can help.

Code
data |> 
  ggplot(aes(x = gpa, y = admit)) +
  geom_jitter(width = 0, height = .05) +
  scale_x_continuous(limits = c(0, 4)) +
  labs(y = "p(admit)")

Clearly not a good model.

Code
data |> 
  ggplot(aes(x = gpa, y = admit)) +
  geom_jitter(width = 0, height = .05) +
  scale_x_continuous(limits = c(0, 4)) +
  geom_smooth(formula = y ~ x, method = "lm", color = "red", se = FALSE,
              fullrange = TRUE)  +
  labs(y = "p(admit)")

Still a problem even if we limit the plot (like we should!) to just the range of observed data for gpa

Code
data |> 
  ggplot(aes(x = gpa, y = admit)) +
  geom_jitter(width = 0, height = .05) +
  scale_x_continuous(limits = c(2, 4)) +
  scale_y_continuous(limits = c(-2, 1.2)) +
  geom_smooth(formula = y ~ x, method = "lm", color = "red", se = FALSE,
              fullrange = TRUE)  +
  labs(y = "p(admit)")

Global test of assumptions

gvlma::gvlma(m_lm)

Call:
lm(formula = admit ~ gpa, data = data)

Coefficients:
(Intercept)          gpa  
    -1.9432       0.6969  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = m_lm) 

                      Value       p-value                   Decision
Global Stat        42.36416 0.00000001402 Assumptions NOT satisfied!
Skewness            5.12561 0.02357534169 Assumptions NOT satisfied!
Kurtosis           13.81465 0.00020175698 Assumptions NOT satisfied!
Link Function      23.39426 0.00000131972 Assumptions NOT satisfied!
Heteroscedasticity  0.02965 0.86328440028    Assumptions acceptable.

Normality of residuals

Code
ggplot() +
  geom_density(aes(x = value), data = enframe(rstudent(m_lm), name = NULL)) +
  labs(title = "Density Plot to Assess Normality of Residuals",
       x = "Studentized Residual") +
  geom_line(aes(x = x, y = y), 
            data = tibble(x = seq(-4, 4, length.out = 100),
                          y = dnorm(seq(-4, 4, length.out = 100), 
                                    mean = 0, sd = sd(rstudent(m_lm)))), 
            linetype = "dashed", color = "blue")

Normality of residuals

Code
car::qqPlot(m_lm, id = FALSE, simulate = TRUE,
            main = "QQ Plot to Assess Normality", 
            ylab = "Studentized Residuals")

Constant variance of residuals

Code
tibble(x = predict(m_lm),
       y = rstudent(m_lm)) |> 
  ggplot(aes(x = x, y = y)) +
  geom_point(alpha = .6) +
  geom_hline(yintercept = 0, linetype = "dashed", 
             color = "blue", linewidth = 1) +
  labs(title = "Studentized Residuals vs.  Predicted Values",
       x = "Predicted Values",
       y = "Studentized Residuals")

car::spreadLevelPlot(m_lm)


Suggested power transformation:  0.3622581 

Linearity

Code
car::crPlots(m_lm, ask = FALSE)

Generalized Linear Models

The Generalized Linear Model is a family of models that includes linear regression, logistic regression, Poisson regression, and others.

The linear model is a special case of this more general(ized) model.

Some of the other models in this family are used to to address problems with model assumptions of the linear model.

  • binary outcomes (logistic regression): binomial family
  • count data (bounded at 0): poisson family

glm(formula, family = familytype(link = linkfunction), data = data)

Family Default Link Function
binomial (link = "logit")
gaussian (link = "identity")
Gamma (link = "inverse")
inverse.gaussian (link = "1/mu^2")
poisson (link = "log")
quasi (link = "indentity", variance = "constant")
quasibinomial (link = "logit")
quasipoisson (link = "log")

Let’s fit a logistic regression (binomial) model to predict admission to grad school based on GPA.

m_glm <- glm(admit ~ gpa, data = data, family = binomial(logit))
m_glm |> tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -14.5      1.51      -9.62 6.65e-22
2 gpa             4.12     0.433      9.51 1.85e-21
Code
x <- tibble(gpa = seq(0, 6, length.out = 500))

preds <- predict(m_glm, newdata = x) |> 
  as_tibble() |> 
  mutate(value = boot::inv.logit(value)) |> 
  bind_cols(x)

ggplot(data = data, aes(x = gpa, y = admit)) +
  geom_jitter(width = 0, height = .05) +
  scale_x_continuous(limits = c(0, 6)) +
  geom_line(data = preds, aes(y = value, x = gpa), color = "blue")  +
  labs(y = "p(admit)")

NOTE: I plotted GPA from 0 to 6 to show the full range of the logistic function. In practice, you would limit the x-axis to the range of observed data.

Question

What other non-linear shapes do you know how to model in the general linear model?

  1. Simple monotone relationships with power transforms.

  2. You will also eventually learn how to model quadratic, cubic, etc relationships with polynomial regression.

Simple monotone

Quadradic

Cubic

Not simple monotone (logistic function has two bends)

Not cubic (cubic is unbounded)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -14.5      1.51      -9.62 6.65e-22
2 gpa             4.12     0.433      9.51 1.85e-21

There are two characteristics that vary across the families within the generalized linear models

  • The distribution of the residuals
  • The link function that connects the \(X\)s to \(\hat{Y}\)

Linear Regression:

  • Residuals are gaussian.
  • Model predicts raw scores on \(Y\). Link function is identity/no transformation (\(1 * \hat{Y}\) ).
    • \(1 * \hat{Y} = b_0 + b_1X_1 + ... + b_kX_k\)

Logistic Regression:

  • Residuals are binomial
  • Model predicts probability of Y = 1 (\(\pi\)). Link function is logit (or log-odds; A transformation of these predicted probabilities).
    • \(ln(\frac{\pi}{1-\pi}) = b_0 + b_1X_1 + ... + b_kX_k\)
    • It is still a linear (on \(X\)) model

Logs and Exponentials

You are likely familiar with logs using base 10.

log10(10)
[1] 1
log10(100)
[1] 2
log10(1000)
[1] 3
log10(1)
[1] 0
log10(15)
[1] 1.176091
log10(0)
[1] -Inf
log10(.1)
[1] -1

The natural log (often abbreviated ln) is similar but uses base \(e\) (approx 2.718) rather than base 10.

log(2.718282)
[1] 1
log(10)
[1] 2.302585
log(1)
[1] 0
log(0)
[1] -Inf

The inverse of the natural log is the exponential function: exp().

  • This function simply raises \(e\) to the power of \(X\) (whatever value you provide).
  • Logistic regression uses natural logs and exponentials for the transformations that connect \(\hat{Y}\) with \(X\)s.
exp(1)
[1] 2.718282
exp(2)
[1] 7.389056
exp(0)
[1] 1
exp(-1)
[1] 0.3678794

This is the model we are fitting for logistic regression.

  • \(ln(\frac{\pi}{1-\pi}) = b_0 + b_1X_1 + ... + b_kX_k\)


We can manipulate the formula to isolate \(\pi\) on one side such that it predicts the probability (\(\pi\)) of \(Y = 1\) as a function of \(X\)s.

  • \(\pi = \frac{e^{b_0 + b_1X_1 + ... + b_kX_k}}{1 + e^{b_0 + b_1X_1 + ... + b_kX_k}}\)
  • \(\pi\) = \(\frac{1}{1 + e^{-(b_0 + b_1X_1 + ... + b_kX_k)}}\)
  • This is often how we will use the model for prediction (and plots)
  • Probability is non-linear on \(X\) (as we saw in the previous plot)
m_glm |> tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -14.5      1.51      -9.62 6.65e-22
2 gpa             4.12     0.433      9.51 1.85e-21

\(\pi\) = \(\frac{1}{1 + e^{-(14.5 + 4.1*X_1)}}\)

Logistic regression (and all generalized linear models) can accommodate multiple \(X\)s.

  • \(\pi = \frac{1}{1 + e^{-(b_0 + b_1X_1 + ... + b_kX_k)}}\)

But lets consider simple (one \(X\)) logistic regression to understand what the intercept and parameter estimates for \(X\) tell us about the probability of \(Y\).

  • \(\pi = \frac{1}{1 + e^{-(b_0 + b_1X_1)}}\)

Over the next several slides, we will work through variations of \(b_0\) and \(b_1\) to understand how they affect the shape and location of the logistic function.

\(b_0 = 0\); \(b_1 = 0\)

  • No relationship between \(X\) and \(\pi\). \(\pi\) is always .5.

\(b_0 = 0\); \(b_1 = 0 \text{ to }1\)

  • As \(b_1\) becomes more positive, the slope of the logistic function increases.
  • The steeper the slope, the more quickly the probability of \(y = 1\) changes as \(x\) changes.
  • As \(X\) increases, the probability of \(Y = 1\) increases.
  • \(\pi\) is 0.5 at mean of \(x\) (because \(b_0 = 0\)).

\(b_0 = 0\); \(b_1 = -1 \text{ to }0\)

  • As \(b_1\) becomes more negative (relative to 0), the negative slope of the logistic function increases.
  • The steeper the slope, the more quickly the probability of \(y = 1\) changes as \(x\) changes.
  • As \(X\) increases, the probability of \(Y = 1\) decreases.
  • \(\pi\) is 0.5 at mean of \(x\) (because \(b_0 = 0\)).

\(b_0 = -5 \text{ to } 5\); \(b_1 = 0\)

  • When \(b_0 = 0\), the model predicts \(\pi = 0.5\) at the mean of \(X\).
  • As \(b_0\) becomes more positive, the model predicts higher probabilites for \(Y = 1\) at the mean of \(X\).
  • As \(b_0\) becomes more negative, the model predicts lower probabilites for \(Y = 1\)
  • This is clearest when \(b_1 = 0\).

\(b_0 = -5 \text{ to } 5\); \(b_1 = 1\)

  • But the same is true for non-zero values for \(b_1\).
  • Increases in probability for \(Y = 1\) result in left shift of the logistic function.
  • Decreases in probability for \(Y = 1\) result in right shift of the logistic function.

Odds

Odds are a transformation of probabilites that can be useful in some situations (e.g., gambling).

Odds = \(\frac{\pi}{1-\pi}\)

Question

What are the odds of obtaining a head on a fair coin toss?

Odds = \(\frac{0.5}{1-0.5} = \frac{0.5}{0.5} = 1 [1:1]\)

Question

What would the odds of obtaining a head be if I altered the coin to have a probability of heads = 0.67?

Odds = \(\frac{0.67}{1-0.67} = \frac{0.67}{0.33} = 2 [2:1]\)

\(\pi = \frac{\text{odds}}{\text{odds} + 1}\)

Question

What is the probability of an event that has an odds of 3?

\(\pi = \frac{3}{3 + 1} = .75\)

Odds can range from 0 - infinity.

Odds = \(\frac{\pi}{1-\pi}\)

Question

What are the approximate odds of getting into grad school with a GPA of 3.5?

Odds = \(\frac{0.5}{1-0.5} = 1[1:1]\)

We can connect our \(X\)s to either the probability of \(Y = 1\) or the odds of \(Y = 1\)

Logistic function (probability Y = 1):

  • \(\pi = \frac{1}{1 + e^{-(b_0 + b_1X_1)}}\)
  • \(0 \le\pi \le 1\)

Convert \(\pi\) to odds (odds of Y = 1)

  • \(\frac{\pi}{1-\pi} = e^{b_0 + b_1X_1 + ... + b_kX_k}\)
  • \(0 \le \text{odds} \le\text{INF}\)

Question

What are the predicted odds of getting into grad school with a GPA of 3.5?

Code
m_glm |> tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -14.5      1.51      -9.62 6.65e-22
2 gpa             4.12     0.433      9.51 1.85e-21

\(\frac{\pi}{1-\pi} = e^{b_0 + b_1X_1 + ... + b_kX_k}\)

\(= e^{-14.5 + 4.1 * 3.5}\)

\(= e^{-0.15}\)

\(= 0.86 [0.86:1]\)

Note: Probability of 0.50 occurs with odds of 1.

Remember that we said the link function for logistic regression is the logit function. This is yet one more transformation.

Logistic function (probability Y = 1):

  • \(\pi = \frac{e^{b_0 + b_1X_1 + ... + b_kX_k}}{1 + e^{b_0 + b_1X_1 + ... + b_kX_k}}\)
  • \(0 \le\pi \le 1\)

Convert \(\pi\) to odds (odds of Y = 1):

  • \(\frac{\pi}{1-\pi} = e^{b_0 + b_1X_1 + ... + b_kX_k}\)
  • \(0 \le \text{odds} \le\text{INF}\)

Convert odds to log-odds(logit function; log-odds of Y = 1):

  • \(ln(\frac{\pi}{1-\pi}) = b_0 + b_1X_1 + ... + b_kX_k\)
  • \(-\text{INF} \le logit \le \text{INF}\)
  • And now we are back to a linear (on \(X\)) model!

Question

What are the log-odds (logit) of getting into grad school with a GPA of 3.5?

Code
tidy(m_glm)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -14.5      1.51      -9.62 6.65e-22
2 gpa             4.12     0.433      9.51 1.85e-21

\(logit= b_0 + b_1X_1 + ... + b_kX_k\)

\(= b_0 + b_1*gpa\)

\(= -14.5 + 4.1 * 3.5\)

\(= -0.15\)

Note: Odds of 1 occur when logit = 0.

Log-odds are not very intuitive.

However, they are a linear function of our regressors. GLM estimates the parameters in the logit function.

Code
m_glm |> tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -14.5      1.51      -9.62 6.65e-22
2 gpa             4.12     0.433      9.51 1.85e-21

\(logit(pY) = -14.5 + 4.1*gpa\)

We transform the logit function to either odds or probability functions to convey relationships in meaningful units.

Log-odds are a linear function of \(X\)s, but because log-odds are not intuitive, it decreases the descriptive value of the raw parameter estimates.

\(logit(admission)= -14.5 + 4.1*gpa\)

Not clear how to interpret a parameter estimate of 4.1 for GPA.

The log-odds of admission increases by 4.1 units for every 1 point increase on GPA??

Odds and probability are more descriptive but they are not linear functions of the \(X\)s so their parameter estimates aren’t very useful to describe effect of \(X\)s because they are not constant across the values of \(X\) (the slope is not linear).

Can’t make simple statement about unit change in odds or probability as a result of unit change in GPA.

\(\frac{\pi}{1-\pi} = e^{b_0 + b_1X_1 + ... + b_kX_k}\)

\(\pi = \frac{1}{1 + e^{-(b_0 + b_1X_1 + ... + b_kX_k)}}\)

Odds Ratios

This is where the odds ratio comes in!

Odds are defined at a specific point for \(X\).

Odds ratio (\(\Psi\)) = the change in odds for a change in \(X\) of some magnitude \(c\).

\(\Psi = \frac{odds(X+c)}{odds(X)}\)

\(= \frac{e^{b_0 +b_1(X+c)}}{e^{b_0+b_1(X)}}\)

\(= e^{c*b_1}\)

Question

What is the odds ratio for a change in GPA of 1.0?

Code
tidy(m_glm)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -14.5      1.51      -9.62 6.65e-22
2 gpa             4.12     0.433      9.51 1.85e-21

\(\Psi = e^{c*b_1}\)

\(= e^{1.0*4.1}\)

\(= 60.3\)

The odds of getting into grad school increase by a factor of 60.3 for every 1 point increase in GPA.

This is the preferred descriptor for the effect of \(X\). For a quantitative variable, choose a meaningful value for \(c\) (though often 1 is most appropriate).

\(\Psi = e^{c*b_1}\)

  • When \(b_1 = 0\), \(\Psi = 1\), indicating no change in odds for change in \(X\).

  • When \(b_1 > 0\), \(\Psi > 1\), indicating an increase in odds with increasing \(X\).

  • When \(b_1 < 0\), \(\Psi < 1\), indicting a decrease in odds with increasing \(X\).

  • Odds ratio (\(\Psi\)) is never negative.

Parameter Significance Tests

There are three common statistical tests for a parameter in logistic regression:

  1. Z test: Reported in the tidy() output.

    \(z = \frac{b_j}{SE_j}\)

  2. Wald test: Reported by SPSS. Wald is asymptotically distributed as Chi-square with 1 df.

    \(wald = \frac{b_j^2}{SE_j^2}\)

Both 1 and 2 are not preferred as they have higher type II error rates than the third option.

  1. Likelihood ratio test: Reported in the anova output.

Likelihood Ratio Test

Deviance is the maximum likelihood generalization of SSE from OLS.

The likelihood ratio test involves a comparison of two models’ deviances.

To test the effect of GPA, compare deviances.

  • Compact: Intercept only (null model)
  • Augmented: Model with GPA
  • LR test = compact - augmented
  • Distributed as chi-square with df = df(augmented) - df(compact).
compact <- glm(admit ~ 1, data = data, family = binomial(logit))
augmented <- glm(admit ~ gpa, data = data, family = binomial(logit))

anova(compact, 
      augmented)
Analysis of Deviance Table

Model 1: admit ~ 1
Model 2: admit ~ gpa
  Resid. Df Resid. Dev Df Deviance              Pr(>Chi)    
1       399     543.58                                      
2       398     400.03  1   143.54 < 0.00000000000000022 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Maximum Likelihood Estimation (MLE)

The regression coefficients are estimated using maximum likelihood estimation.

It is iterative (Newton’s method).

Model may fail to converge if:

  • Ratio of predictors to infrequent cases (10 events per predictor).
  • High multicollinearity.
  • Sparseness (cells with zero events). Bigger problem for categorical predictor.
  • Complete separation: Predictors perfectly predict the criterion.

MLE can yield biased parameters with small samples. Definitions of what is not small vary, but \(\ge 200\) is a reasonable minimum.

Model Assumptions

  1. Exact X.
  2. Independence.
  3. Logistic function and logit correctly specify the form of the relationship (equivalent to the correct fit in linear regression). Logistic function is almost always correct for dichotomous data. You can examine the logit function directly to assess the shape of the relationship.

Question

What about categorical variables?

  • Handled exactly as in linear regression.

  • Contrast codes, dummy codes.

  • Issues of family-wise error rates apply as before for non-orthogonal and unplanned contrasts. Both Fisher LSD and Holm-Bonferroni are available.

  • Odds ratio is for contrast (assuming unit weighted).

Given that you will need to do model comparisons to test contrasts (pocs or dummy), we prefer to manually code contrasts.

Here is an example with dummy codes for the 4 levels of rank (using low rank/4 as reference) and gpa

data <- data |> 
  mutate(d_rank1 = if_else(rank == 1, 1, 0),
         d_rank2 = if_else(rank == 2, 1, 0),
         d_rank3 = if_else(rank == 3, 1, 0))

data |> head()
# A tibble: 6 × 9
  sub_id admit   gre   gpa rank  rank_2    d_rank1 d_rank2 d_rank3
   <dbl> <dbl> <dbl> <dbl> <fct> <fct>       <dbl>   <dbl>   <dbl>
1    290     0   420  2.1  4     Low rank        0       0       0
2     41     0   560  2.2  2     High rank       0       1       0
3    373     0   680  2.3  1     High rank       1       0       0
4     49     0   440  2.05 4     Low rank        0       0       0
5    157     0   560  2.42 2     High rank       0       1       0
6    295     0   480  2.55 1     High rank       1       0       0

Test main effect of rank, controlling for GPA (main effect is only needed if using Fisher LSD).

  • Compact model excludes all the dummy codes
  • Compact model DOES include other covariates/predictors (if present in augmented model)
m_rank <- glm(admit ~ gpa + d_rank1 + d_rank2 + d_rank3, data = data, family = binomial(logit))
anova(glm(admit ~ gpa, data = data, family = binomial(logit)), 
      m_rank)
Analysis of Deviance Table

Model 1: admit ~ gpa
Model 2: admit ~ gpa + d_rank1 + d_rank2 + d_rank3
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
1       398     400.03                        
2       395     384.85  3   15.184 0.001666 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Test each pairwise contrast

  • rank 1 vs. rank 4
anova(glm(admit ~ gpa + d_rank2 + d_rank3, data = data, family = binomial(logit)), 
      m_rank)
Analysis of Deviance Table

Model 1: admit ~ gpa + d_rank2 + d_rank3
Model 2: admit ~ gpa + d_rank1 + d_rank2 + d_rank3
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       396     395.77                          
2       395     384.85  1   10.917 0.0009529 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • rank 2 vs. rank 4
anova(glm(admit ~ gpa + d_rank1 + d_rank3, data = data, family = binomial(logit)), 
      m_rank)
Analysis of Deviance Table

Model 1: admit ~ gpa + d_rank1 + d_rank3
Model 2: admit ~ gpa + d_rank1 + d_rank2 + d_rank3
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1       396     389.56                       
2       395     384.85  1   4.7062  0.03005 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • rank 3 vs. rank 4
anova(glm(admit ~ gpa + d_rank1 + d_rank2, data = data, family = binomial(logit)), 
      m_rank)
Analysis of Deviance Table

Model 1: admit ~ gpa + d_rank1 + d_rank2
Model 2: admit ~ gpa + d_rank1 + d_rank2 + d_rank3
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       396     385.12                     
2       395     384.85  1  0.27266   0.6015

Question

What are the odds of getting into grad school from a top (1) ranked university with a gpa of 3.5?

You can use predict() with response = “link” to get the log-odds. Then use exp() to get the odds.

odds1 <- tibble(gpa = 3.5, d_rank1 = 1, d_rank2 = 0, d_rank3 = 0) |> 
  predict(m_rank, newdata = _, type = "link") |> 
  exp()
odds1
       1 
2.277199 

Question

What are the odds of getting into grad school from the lowest (4) ranked university?

odds4 <- tibble(gpa = 3.5, d_rank1 = 0, d_rank2 = 0, d_rank3 = 0) |> 
  predict(m_rank, newdata = _, type = "link") |> 
  exp()
odds4
        1 
0.5178722 

And if you wanted the probability for any combination of \(X\), use response rather than link in the predict() function. And remove exp()

tibble(gpa = 3.5, d_rank1 = 0, d_rank2 = 0, d_rank3 = 0) |> 
  predict(m_rank, newdata = _, type = "response")
       1 
0.341183 

The odds for low and high rank can be used to calculate the odds ratio.

odds for High rank

odds1
       1 
2.277199 

odds for Low rank

odds4
        1 
0.5178722 

Odds ratio

odds1/odds4
       1 
4.397221 

Or we can get the odds ratio directly from the parameter estimate from this contrast (\(c\) = 1 for unit weighted contrast or dummy codes)

m_rank |> tidy()
# A tibble: 5 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  -15.6       1.61     -9.66  4.39e-22
2 gpa            4.26      0.453     9.41  4.95e-21
3 d_rank1        1.48      0.460     3.22  1.27e- 3
4 d_rank2        0.810     0.380     2.13  3.32e- 2
5 d_rank3        0.206     0.396     0.521 6.02e- 1

\(c\) = 1

m_rank |> tidy() |> 
  filter(term == "d_rank1") |> 
  pull(estimate) |> 
  exp()
[1] 4.397221

NOTE: The model we fit was additive so the odds ratio for rank1 vs. rank4 (i.e., parameter estimate for d_rank is the same across all levels of gpa

Question

What about multiple predictors?

Nothing new! (we just used multiple predictors and regressors in the last example)

We tested the main effect and pairwise contrasts for rank, controlling for gpa above. Here we test of the effect of gpa controlling for rank.

anova(glm(admit ~ d_rank1 + d_rank2 + d_rank3, data = data, family = binomial(logit)), 
      m_rank)
Analysis of Deviance Table

Model 1: admit ~ d_rank1 + d_rank2 + d_rank3
Model 2: admit ~ gpa + d_rank1 + d_rank2 + d_rank3
  Resid. Df Resid. Dev Df Deviance              Pr(>Chi)    
1       396     527.82                                      
2       395     384.85  1   142.97 < 0.00000000000000022 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What about interactions? This is a more complicated story.

Interactions with respect to log-odds(\(Y\)) are handled exactly in the linear model

  • Use product terms for regressors
  • Center regressors for overall/marginal effects
  • Use dummy coded regressors for simple effects
  • Overall and simple effects are quantified as odds ratios
  • Interaction term is a ratio of odds ratios!

However, we often want to consider interactions with respect to changes in the probability of \(Y = 1\).

  • This is NOT what is tested and an interaction for log-odds (\(Y\)) is not the same as an interaction for probability (\(Y\)).
  • The interaction can be significant in log-odds(\(Y\)) but not for probsbility (\(Y\)) and vice versa!
  • Also, the interaction is not constant across values of other covariates/predictors in the model!
  • Use interactions sparingly if at all in logistic regression.
  • Maybe linear probability model?

You can read more about this issue elsewhere and I plan to create a more elaborate demonstration of this issue using R in an appendix to this course.

Results

We analyzed admission (coded 1/0 for yes vs. no, respectively) in a generalized linear model (GLM) that included GPA (quantfied on a four point scale) and School rank (quantified across four levels). We used the binomial family with the logit link function for the GLM because the dependent variable, admission, was dichotomous. School rank was coded using dummy-coded regressors to represent pairwise contrasts between each of the top three highest ranks (1-3) vs. the lowest rank (rank 4). We report the raw parameter estimates from the GLM and the odds ratio to quantify effect size of significant effects. We report Holm-Bonferroni corrected p-values for the pairwise contrasts across school ranks to protect against inflation of family-wise type 1 error rates.

The effect of GPA was significant, \(b = 4.26, SE = 0.45, \chi^2(1) = 142.97, p < .001\), which indicates that the odds of admission increase by a factor of 70.8 for every one point increase in GPA. Applicants from the highest ranked school were significantly more likely to be admitted relative to applicants from the lowest ranked school, \(b = 1.48, SE = 0.46, odds ratio = 4.4, \chi^2(1) = 10.91, p < .001\). There were no signficant differences in admission rates between second and third ranked schools vs. the lowest ranked school. See figure 1 for display of the probability of admission as a function of GPA and school rank. [Might also consider a table for all parameter estimates, standard errors, odds ratios, and p-values]

Figure 1

Code
ranks <- tibble(rank = fct(rep(c("1", "2", "3", "4"), 50), 
                           levels = c("1", "2", "3", "4")),
                d_rank1 = rep(c(1, 0, 0, 0), 50), 
                d_rank2 = rep(c(0, 1, 0, 0), 50), 
                d_rank3 = rep(c(0, 0, 1, 0), 50))
gpa <- tibble(gpa = rep(seq(2, 4, length.out = 50), 4))
x <- gpa |> 
  bind_cols(ranks)

preds <- x |> 
  bind_cols(predict(m_rank, newdata = x, type = "link",
                 se.fit = TRUE)) |> 
  mutate(upper = plogis(fit + 1.96* se.fit),
         lower = plogis(fit - 1.96 * se.fit),
         fit = plogis(fit)) |>
  select(-se.fit, -residual.scale)

ggplot() +
  geom_jitter(data = data, aes(x = gpa, y = admit, color = rank), width = 0, height = .04) +
  geom_line(data = preds, aes(y = fit, x = gpa, color = rank), 
            linewidth = .75) +
  labs(y = "p(admit)",
       color = NULL) +
  scale_x_continuous(limits = c(2, 4)) 

All Common Transformations

In logistic regression, we need to move back and forth between logit(\(Y), odds(\)Y$), and \(\pi_Y\). We also need to calculate odds ratios. There are lots of ways to transform between these metrics.

The remaining slides put all the formulas for this transformations/calculations in one place!

The Model

Predicting logit(\(Y\)) (aka log-odds(\(Y\))). This is the native model in the GLM

  • \(logit(Y) = b_0 + b_1X_1 + ... + b_kX_k\)

Predicting odds(\(Y\))

  • \(odds(Y) = e^{b_0 + b_1X_1 + ... + b_kX_k}\)

Predicting \(\pi_Y\)

  • \(\pi_Y\) = \(\frac{e^{b_0 + b_1X_1 + ... + b_kX_k}}{1 + e^{b_0 + b_1X_1 + ... + b_kX_k}}\)
  • \(\pi_Y\) = \(\frac{1}{1 + e^{-(b_0 + b_1X_1 + ... + b_kX_k)}}\)}

Calculating odds

From logit(\(Y\)) to odds

  • odds(\(Y\)) = \(e^{logit(Y)}\) [use exp() in R]

From \(\pi_Y\) to odds

  • odds(\(Y\)) = \(\frac{\pi_Y}{1-\pi_Y}\)

Calculating probability

From odds(\(Y\)) to \(\pi_Y\) \(\pi_Y\) = \(\frac{odds(Y)}{1 + odds(Y)}\)

For logit(\(Y\)) to \(\pi_Y\)

  • \(\pi_Y\) = \(\frac{e^{logit(Y)}}{1 + e^{logit(Y)}}\) [use plogis() in R]

Calculating logit

From odds(\(Y\)) to logit(\(Y\))

  • logit(\(Y\)) = \(ln(odds(Y))\) [use log() in R]

From probability (\(Y\)) to logit(\(Y\))

  • logit(\(Y\)) = \(ln(\frac{\pi_Y}{1-\pi_Y})\) [use qlogis() in R]

Odds Ratios

The odds ratio is the ratio of odds for a change in \(X\) of some magnitude \(c\). For unit weighted contrasts (dummy or contrast codes), \(c = 1\) by default

  • \(\Psi = e^{c*b_1}\)

The odds ratio between two levels of \(X\) can also be calculated manually as

  • Odds(\(X\) at level 1) / Odds(\(X\) at level 2)