What are the assumptions of general linear models?
The general linear model makes the 5 assumptions below. When these assumptions are met, OLS regression coefficients are MVUE (Minimum Variance Unbiased Estimators) and BLUE (Best Linear Unbiased Estimators).
Exact X: The IVs are assumed to be known exactly (i.e., without measurement error).
Independence: Residuals are independently distributed (prob. of obtaining a specific observation does not depend on other observations).
Normality: All residual distributions are normally distributed.
Constant variance: All residual distributions have a constant variance, \(SEE^2\).
Linearity: All residual distributions (i.e., for each \(Y'\)) are assumed to have means equal to zero.
Question
What are the consequences of violating each of these assumptions? What options exist when these assumptions are violated?
Exact X: Biased parameters (to the degree that measurement error exists). Use reliable measures.
Independence: Inaccurate standard errors, degrees of freedom and significance tests. Use repeated measures or linear mixed effects models or ANCOVA.
Normality: Inefficient (with large N). Use power transformations, generalized linear models.
Constant variance: Inefficient and inaccurate standard errors. Use power transformations, SE corrections, weighted least squares, generalized linear models.
Linearity: Biased parameter estimates. Use power transformations, polynomial regression, generalized linear models.
An Example
Predicting admission to a grad program in engineering based on quantitative GRE, GPA, and Undergraduate Institution Rank.
What is the effect of GPA on admission to grad school?
Question
What are the problems with using a general linear model to assess the effects of these predictors on admission outcomes?
Residuals will not be normal (not efficient).
Residual variance often will not be constant (not efficient, SEs are inaccurate).
Probability may not be linear on X (yielding biased parameter estimates).
\(\hat{Y}\) is not constrained between 0 – 1 (model may make nonsensical predictions if we consider \(\hat{Y}\) to be probabilites).
Here is a scatterplot of admit as a function of GPA.
Code
data |>ggplot(aes(x = gpa, y = admit)) +geom_point() +scale_x_continuous(limits =c(0, 4)) +labs(y ="p(admit)")
Overplotting can be a problem in scatterplots when one (or both) of the variables are integers.
Jittering that variable can help.
Code
data |>ggplot(aes(x = gpa, y = admit)) +geom_jitter(width =0, height = .05) +scale_x_continuous(limits =c(0, 4)) +labs(y ="p(admit)")
Clearly not a good model.
Code
data |>ggplot(aes(x = gpa, y = admit)) +geom_jitter(width =0, height = .05) +scale_x_continuous(limits =c(0, 4)) +geom_smooth(formula = y ~ x, method ="lm", color ="red", se =FALSE,fullrange =TRUE) +labs(y ="p(admit)")
Still a problem even if we limit the plot (like we should!) to just the range of observed data for gpa
Code
data |>ggplot(aes(x = gpa, y = admit)) +geom_jitter(width =0, height = .05) +scale_x_continuous(limits =c(2, 4)) +scale_y_continuous(limits =c(-2, 1.2)) +geom_smooth(formula = y ~ x, method ="lm", color ="red", se =FALSE,fullrange =TRUE) +labs(y ="p(admit)")
Global test of assumptions
gvlma::gvlma(m_lm)
Call:
lm(formula = admit ~ gpa, data = data)
Coefficients:
(Intercept) gpa
-1.9432 0.6969
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = m_lm)
Value p-value Decision
Global Stat 42.36416 0.00000001402 Assumptions NOT satisfied!
Skewness 5.12561 0.02357534169 Assumptions NOT satisfied!
Kurtosis 13.81465 0.00020175698 Assumptions NOT satisfied!
Link Function 23.39426 0.00000131972 Assumptions NOT satisfied!
Heteroscedasticity 0.02965 0.86328440028 Assumptions acceptable.
Normality of residuals
Code
ggplot() +geom_density(aes(x = value), data =enframe(rstudent(m_lm), name =NULL)) +labs(title ="Density Plot to Assess Normality of Residuals",x ="Studentized Residual") +geom_line(aes(x = x, y = y), data =tibble(x =seq(-4, 4, length.out =100),y =dnorm(seq(-4, 4, length.out =100), mean =0, sd =sd(rstudent(m_lm)))), linetype ="dashed", color ="blue")
Normality of residuals
Code
car::qqPlot(m_lm, id =FALSE, simulate =TRUE,main ="QQ Plot to Assess Normality", ylab ="Studentized Residuals")
Constant variance of residuals
Code
tibble(x =predict(m_lm),y =rstudent(m_lm)) |>ggplot(aes(x = x, y = y)) +geom_point(alpha = .6) +geom_hline(yintercept =0, linetype ="dashed", color ="blue", linewidth =1) +labs(title ="Studentized Residuals vs. Predicted Values",x ="Predicted Values",y ="Studentized Residuals")
car::spreadLevelPlot(m_lm)
Suggested power transformation: 0.3622581
Linearity
Code
car::crPlots(m_lm, ask =FALSE)
Generalized Linear Models
The Generalized Linear Model is a family of models that includes linear regression, logistic regression, Poisson regression, and others.
The linear model is a special case of this more general(ized) model.
Some of the other models in this family are used to to address problems with model assumptions of the linear model.
binary outcomes (logistic regression): binomial family
count data (bounded at 0): poisson family
glm(formula, family = familytype(link = linkfunction), data = data)
Family
Default Link Function
binomial
(link = "logit")
gaussian
(link = "identity")
Gamma
(link = "inverse")
inverse.gaussian
(link = "1/mu^2")
poisson
(link = "log")
quasi
(link = "indentity", variance = "constant")
quasibinomial
(link = "logit")
quasipoisson
(link = "log")
Let’s fit a logistic regression (binomial) model to predict admission to grad school based on GPA.
m_glm <-glm(admit ~ gpa, data = data, family =binomial(logit))m_glm |>tidy()
NOTE: I plotted GPA from 0 to 6 to show the full range of the logistic function. In practice, you would limit the x-axis to the range of observed data.
Question
What other non-linear shapes do you know how to model in the general linear model?
Simple monotone relationships with power transforms.
You will also eventually learn how to model quadratic, cubic, etc relationships with polynomial regression.
Simple monotone
Quadradic
Cubic
Not simple monotone (logistic function has two bends)
But lets consider simple (one \(X\)) logistic regression to understand what the intercept and parameter estimates for \(X\) tell us about the probability of \(Y\).
\(\pi = \frac{1}{1 + e^{-(b_0 + b_1X_1)}}\)
Over the next several slides, we will work through variations of \(b_0\) and \(b_1\) to understand how they affect the shape and location of the logistic function.
\(b_0 = 0\); \(b_1 = 0\)
No relationship between \(X\) and \(\pi\). \(\pi\) is always .5.
\(b_0 = 0\); \(b_1 = 0 \text{ to }1\)
As \(b_1\) becomes more positive, the slope of the logistic function increases.
The steeper the slope, the more quickly the probability of \(y = 1\) changes as \(x\) changes.
As \(X\) increases, the probability of \(Y = 1\) increases.
\(\pi\) is 0.5 at mean of \(x\) (because \(b_0 = 0\)).
\(b_0 = 0\); \(b_1 = -1 \text{ to }0\)
As \(b_1\) becomes more negative (relative to 0), the negative slope of the logistic function increases.
The steeper the slope, the more quickly the probability of \(y = 1\) changes as \(x\) changes.
As \(X\) increases, the probability of \(Y = 1\) decreases.
\(\pi\) is 0.5 at mean of \(x\) (because \(b_0 = 0\)).
\(b_0 = -5 \text{ to } 5\); \(b_1 = 0\)
When \(b_0 = 0\), the model predicts \(\pi = 0.5\) at the mean of \(X\).
As \(b_0\) becomes more positive, the model predicts higher probabilites for \(Y = 1\) at the mean of \(X\).
As \(b_0\) becomes more negative, the model predicts lower probabilites for \(Y = 1\)
This is clearest when \(b_1 = 0\).
\(b_0 = -5 \text{ to } 5\); \(b_1 = 1\)
But the same is true for non-zero values for \(b_1\).
Increases in probability for \(Y = 1\) result in left shift of the logistic function.
Decreases in probability for \(Y = 1\) result in right shift of the logistic function.
Odds
Odds are a transformation of probabilites that can be useful in some situations (e.g., gambling).
Odds = \(\frac{\pi}{1-\pi}\)
Question
What are the odds of obtaining a head on a fair coin toss?
We transform the logit function to either odds or probability functions to convey relationships in meaningful units.
Log-odds are a linear function of \(X\)s, but because log-odds are not intuitive, it decreases the descriptive value of the raw parameter estimates.
\(logit(admission)= -14.5 + 4.1*gpa\)
Not clear how to interpret a parameter estimate of 4.1 for GPA.
The log-odds of admission increases by 4.1 units for every 1 point increase on GPA??
Odds and probability are more descriptive but they are not linear functions of the \(X\)s so their parameter estimates aren’t very useful to describe effect of \(X\)s because they are not constant across the values of \(X\) (the slope is not linear).
Can’t make simple statement about unit change in odds or probability as a result of unit change in GPA.
The odds of getting into grad school increase by a factor of 60.3 for every 1 point increase in GPA.
This is the preferred descriptor for the effect of \(X\). For a quantitative variable, choose a meaningful value for \(c\) (though often 1 is most appropriate).
\(\Psi = e^{c*b_1}\)
When \(b_1 = 0\), \(\Psi = 1\), indicating no change in odds for change in \(X\).
When \(b_1 > 0\), \(\Psi > 1\), indicating an increase in odds with increasing \(X\).
When \(b_1 < 0\), \(\Psi < 1\), indicting a decrease in odds with increasing \(X\).
Odds ratio (\(\Psi\)) is never negative.
Parameter Significance Tests
There are three common statistical tests for a parameter in logistic regression:
Z test: Reported in the tidy() output.
\(z = \frac{b_j}{SE_j}\)
Wald test: Reported by SPSS. Wald is asymptotically distributed as Chi-square with 1 df.
\(wald = \frac{b_j^2}{SE_j^2}\)
Both 1 and 2 are not preferred as they have higher type II error rates than the third option.
Likelihood ratio test: Reported in the anova output.
Likelihood Ratio Test
Deviance is the maximum likelihood generalization of SSE from OLS.
The likelihood ratio test involves a comparison of two models’ deviances.
To test the effect of GPA, compare deviances.
Compact: Intercept only (null model)
Augmented: Model with GPA
LR test = compact - augmented
Distributed as chi-square with df = df(augmented) - df(compact).
compact <-glm(admit ~1, data = data, family =binomial(logit))augmented <-glm(admit ~ gpa, data = data, family =binomial(logit))anova(compact, augmented)
The regression coefficients are estimated using maximum likelihood estimation.
It is iterative (Newton’s method).
Model may fail to converge if:
Ratio of predictors to infrequent cases (10 events per predictor).
High multicollinearity.
Sparseness (cells with zero events). Bigger problem for categorical predictor.
Complete separation: Predictors perfectly predict the criterion.
MLE can yield biased parameters with small samples. Definitions of what is not small vary, but \(\ge 200\) is a reasonable minimum.
Model Assumptions
Exact X.
Independence.
Logistic function and logit correctly specify the form of the relationship (equivalent to the correct fit in linear regression). Logistic function is almost always correct for dichotomous data. You can examine the logit function directly to assess the shape of the relationship.
Question
What about categorical variables?
Handled exactly as in linear regression.
Contrast codes, dummy codes.
Issues of family-wise error rates apply as before for non-orthogonal and unplanned contrasts. Both Fisher LSD and Holm-Bonferroni are available.
Odds ratio is for contrast (assuming unit weighted).
Given that you will need to do model comparisons to test contrasts (pocs or dummy), we prefer to manually code contrasts.
Here is an example with dummy codes for the 4 levels of rank (using low rank/4 as reference) and gpa
data <- data |>mutate(d_rank1 =if_else(rank ==1, 1, 0),d_rank2 =if_else(rank ==2, 1, 0),d_rank3 =if_else(rank ==3, 1, 0))data |>head()
What about interactions? This is a more complicated story.
Interactions with respect to log-odds(\(Y\)) are handled exactly in the linear model
Use product terms for regressors
Center regressors for overall/marginal effects
Use dummy coded regressors for simple effects
Overall and simple effects are quantified as odds ratios
Interaction term is a ratio of odds ratios!
However, we often want to consider interactions with respect to changes in the probability of \(Y = 1\).
This is NOT what is tested and an interaction for log-odds (\(Y\)) is not the same as an interaction for probability (\(Y\)).
The interaction can be significant in log-odds(\(Y\)) but not for probsbility (\(Y\)) and vice versa!
Also, the interaction is not constant across values of other covariates/predictors in the model!
Use interactions sparingly if at all in logistic regression.
Maybe linear probability model?
You can read more about this issue elsewhere and I plan to create a more elaborate demonstration of this issue using R in an appendix to this course.
Results
We analyzed admission (coded 1/0 for yes vs. no, respectively) in a generalized linear model (GLM) that included GPA (quantfied on a four point scale) and School rank (quantified across four levels). We used the binomial family with the logit link function for the GLM because the dependent variable, admission, was dichotomous. School rank was coded using dummy-coded regressors to represent pairwise contrasts between each of the top three highest ranks (1-3) vs. the lowest rank (rank 4). We report the raw parameter estimates from the GLM and the odds ratio to quantify effect size of significant effects. We report Holm-Bonferroni corrected p-values for the pairwise contrasts across school ranks to protect against inflation of family-wise type 1 error rates.
The effect of GPA was significant, \(b = 4.26, SE = 0.45, \chi^2(1) = 142.97, p < .001\), which indicates that the odds of admission increase by a factor of 70.8 for every one point increase in GPA. Applicants from the highest ranked school were significantly more likely to be admitted relative to applicants from the lowest ranked school, \(b = 1.48, SE = 0.46, odds ratio = 4.4, \chi^2(1) = 10.91, p < .001\). There were no signficant differences in admission rates between second and third ranked schools vs. the lowest ranked school. See figure 1 for display of the probability of admission as a function of GPA and school rank. [Might also consider a table for all parameter estimates, standard errors, odds ratios, and p-values]
In logistic regression, we need to move back and forth between logit(\(Y), odds(\)Y$), and \(\pi_Y\). We also need to calculate odds ratios. There are lots of ways to transform between these metrics.
The remaining slides put all the formulas for this transformations/calculations in one place!
The Model
Predicting logit(\(Y\)) (aka log-odds(\(Y\))). This is the native model in the GLM
From odds(\(Y\)) to \(\pi_Y\)\(\pi_Y\) = \(\frac{odds(Y)}{1 + odds(Y)}\)
For logit(\(Y\)) to \(\pi_Y\)
\(\pi_Y\) = \(\frac{e^{logit(Y)}}{1 + e^{logit(Y)}}\) [use plogis() in R]
Calculating logit
From odds(\(Y\)) to logit(\(Y\))
logit(\(Y\)) = \(ln(odds(Y))\) [use log() in R]
From probability (\(Y\)) to logit(\(Y\))
logit(\(Y\)) = \(ln(\frac{\pi_Y}{1-\pi_Y})\) [use qlogis() in R]
Odds Ratios
The odds ratio is the ratio of odds for a change in \(X\) of some magnitude \(c\). For unit weighted contrasts (dummy or contrast codes), \(c = 1\) by default
\(\Psi = e^{c*b_1}\)
The odds ratio between two levels of \(X\) can also be calculated manually as