
Understand 5 assumptions of GLMs.
Understand consequences of assumption violations.
Learn how to detect violations (statistical tests and visual diagnosis).
Recognize the range of options available when violations are detected. Implementation of these options will be covered over the semester.
Examination of assumptions will further inform you about your data.
All GLM procedures make the five assumptions below.
When these assumptions are met, OLS regression coefficients are MVUE (Minimum Variance Unbiased Estimators) and BLUE (Best Linear Unbiased Estimators).
With the exception of \(\#1\), these assumptions are expressed (and assessed) with respect to the residuals around the prediction line.
Violations of the Exact X assumption lead to biased (i.e., inaccurate) estimates of regression coefficients.
Violations are caused by problems with reliability of measurement of your predictors.
Question
In simple, bivariate regression, how will reducing reliability affect the regression model?
It will reduce \(b_1\). We will underestimate the strength of the relationship between \(X\) and \(Y\).
In multiple predictor models, the bias can be either positive or negative, depending on the nature of the correlations among the predictors. Use reliable variables!
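A quick simulation sketches this attenuation in the bivariate case (simulated data; names and values are illustrative, not course data):

set.seed(1)

# True predictor and outcome: the true slope is 2
x_true <- rnorm(1000)
y <- 2 * x_true + rnorm(1000)

# Observed predictor with measurement error (reliability ~ .5)
x_obs <- x_true + rnorm(1000)

coef(lm(y ~ x_true))["x_true"]  # ~2: unbiased with exact X
coef(lm(y ~ x_obs))["x_obs"]    # ~1: attenuated toward 0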

Question
What are the implications of unreliable \(X\) for the use of covariates to control variables?
Covariates only control for the construct they measure to the degree that they are reliable (and valid) measures of that construct.
Analyses that rely on unreliable covariates do not control the variance for the construct well.
Violations of the independence of residuals assumption can compromise the validity of our statistical tests (inaccurate standard errors).
Violations of residual independence are a function of the research design, caused by repeated measures on the same individual or by related individuals/observations (participants in the same family, school, etc.).
Often difficult to detect in data but clear from research design.
Can be fixed by a variety of approaches, including repeated measures analyses or multi-level, mixed effects, and/or hierarchical linear models (next semester); a minimal preview is sketched below.
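As a preview, a random-intercept mixed-effects model is one common fix. A minimal sketch with simulated repeated-measures data (variable names and values are illustrative, not course data):

library(lme4)

set.seed(42)
# 20 subjects x 5 trials each; subject-level intercepts induce
# dependence among residuals within subjects
d <- data.frame(subid = factor(rep(1:20, each = 5)),
                x = rnorm(100))
d$y <- 2 + 0.5 * d$x + rnorm(20)[as.integer(d$subid)] + rnorm(100, sd = 0.5)

# The (1 | subid) term models the within-subject dependence that
# ordinary lm() would ignore
m_mixed <- lmer(y ~ x + (1 | subid), data = d)
summary(m_mixed)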
The remaining three assumptions (normally distributed residuals with a mean of 0 and constant variance) can be assessed via examination of the residuals.
Use of Graphical Methods is emphasized.
Statistical tests of assumptions exist but should be used cautiously.
Assessment of assumptions about residuals is an inexact science: Conclusions are tentative.
The process of examining residuals will increase your understanding of your data.
- May suggest transformations of your data.
- May suggest alternative analytic strategies.
- Will increase your confidence in your conclusions.
library(tidyverse)
library(patchwork)  # for combining the histograms with `+` below

# Simulate residuals that meet the assumptions: 1000 errors at each of 25
# predicted values, all drawn from a normal distribution with mean 0, SD 1
norm_tbl <- tibble(y_hat = rep(1:25, 1000),
                   e = rnorm(25000, mean = 0, sd = 1))

norm_tbl |>
  ggplot(aes(x = y_hat,
             y = e)) +
  geom_jitter(alpha = .4, width = .75, height = 0, size = .5) +
  geom_hline(yintercept = 0, color = "blue", linewidth = 1) +
  geom_vline(xintercept = 10, color = "red", linewidth = 1) +
  geom_vline(xintercept = 15, color = "red", linewidth = 1) +
  geom_vline(xintercept = 20, color = "red", linewidth = 1) +
  scale_x_continuous(breaks = c(0, 5, 10, 15, 20, 25))
# Distribution of the errors at three of the highlighted predicted values
hist_10 <- norm_tbl |>
  filter(y_hat == 10) |>
  ggplot(aes(x = e)) +
  geom_histogram(aes(y = after_stat(density)), color = "black",
                 fill = "light grey", bins = 10) +
  geom_density() +
  labs(title = "y_hat = 10")

hist_15 <- norm_tbl |>
  filter(y_hat == 15) |>
  ggplot(aes(x = e)) +
  geom_histogram(aes(y = after_stat(density)), color = "black",
                 fill = "light grey", bins = 10) +
  geom_density() +
  labs(title = "y_hat = 15")

hist_20 <- norm_tbl |>
  filter(y_hat == 20) |>
  ggplot(aes(x = e)) +
  geom_histogram(aes(y = after_stat(density)), color = "black",
                 fill = "light grey", bins = 10) +
  geom_density() +
  labs(title = "y_hat = 20")

# Combine the three panels side by side (patchwork)
hist_10 + hist_15 + hist_20
The errors for each \(\hat{Y}\) are assumed to be normally distributed. Normally distributed errors are required for OLS regression coefficients to be MVUE but not BLUE.
The central limit theorem indicates that even with non-normal errors, significance tests and confidence intervals are approximately correct with large \(N\).
Coefficients are still the most efficient unbiased estimators among linear estimators (i.e., BLUE), but more efficient non-linear estimators may exist (e.g., generalized linear models such as Poisson regression for thick-tailed distributions).
Mean may not be best measure of center of a highly skewed distribution.
Multimodal error distributions suggest the omission of one or more categorical variables that divide the data into groups.
Transformations may correct shape of residuals (Unit 9).
Let's refit our last model from the previous unit.
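A sketch of the refit (the formula matches the model output shown later in this unit; assumes data_rm_outliers is already loaded):

# Refit the final model from the previous unit
m_2 <- lm(fps ~ bac + ta + sex_c, data = data_rm_outliers)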
We can use a density plot to plot the model’s residuals against a normal distribution.
# Density of the studentized residuals (solid), overlaid with a normal
# curve with mean 0 and the residuals' SD (dashed blue)
ggplot() +
  geom_density(aes(x = value), data = enframe(rstudent(m_2), name = NULL)) +
  labs(title = "Density Plot to Assess Normality of Residuals",
       x = "Studentized Residual") +
  geom_line(aes(x = x, y = y),
            data = tibble(x = seq(-4, 4, length.out = 100),
                          y = dnorm(seq(-4, 4, length.out = 100),
                                    mean = 0, sd = sd(rstudent(m_2)))),
            linetype = "dashed", color = "blue")
Better still, we can use qqPlot() from the car package to assess the normality of residuals.
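A minimal call (car::qqPlot() accepts a fitted lm object directly):

# QQ plot of studentized residuals against t-distribution quantiles,
# with a point-wise confidence envelope
car::qqPlot(m_2, id = FALSE)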
Here are three examples of QQ plots for normal distributions from small samples (N = 100).

And here is one for the chi-squared distribution (positively skewed).

And one for the t-distribution (which is heavy-tailed).
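Plots like these can be regenerated with a sketch along the following lines (simulated samples; the seed and df values are illustrative):

set.seed(26)

# Normal samples: points should track the line, within the envelope
car::qqPlot(rnorm(100), id = FALSE, main = "Normal, N = 100")

# Positive skew (chi-squared): points bow upward at the upper end
car::qqPlot(rchisq(100, df = 3), id = FALSE,
            main = "Chi-squared (positive skew)")

# Heavy tails (t): points fall below the line at the low end and
# above it at the high end
car::qqPlot(rt(100, df = 3), id = FALSE,
            main = "t (heavy-tailed)")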

The errors for each \(\hat{Y}\) are assumed to have a constant variance (homoscedasticity). This is necessary for the OLS estimated coefficients to be BLUE.
If the errors are heteroscedastic, the coefficients remain unbiased but the efficiency (precision of estimation) is impaired and the coefficient SEs become inaccurate. The degree of the problem depends on severity of violation and sample size.
A rough rule is that estimation is seriously degraded if the ratio of the largest to smallest variance is 10 or greater (or, more conservatively, 4 or greater).
Transformations may fix this issue (next unit).
Weighted Least Squares provides an alternative approach to estimation when heteroscedasticity exists (maybe next semester?).
Corrections also exist for SEs when errors are heteroscedastic (more on this in a moment).
Can look at a plot of studentized residuals vs. predicted values, e.g.:
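A minimal version of this plot for m_2:

# Studentized residuals vs. predicted values; look for roughly constant
# spread around the horizontal line across the range of predictions
tibble(y_hat = predict(m_2),
       e = rstudent(m_2)) |>
  ggplot(aes(x = y_hat, y = e)) +
  geom_point(alpha = .6) +
  geom_hline(yintercept = 0, color = "blue", linewidth = 1)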
A spread-level plot is a plot of log(abs(studentized residuals)) vs. log(predicted values).
Predicted values must all be positive to take logs; a start value (a constant added to every prediction) can handle negative predicted values.
# `start` shifts the predictions so all are positive before taking logs;
# this choice is one reasonable option (the value used in the original
# analysis may differ)
start <- abs(min(predict(m_2))) + 1

tibble(x = log(predict(m_2) + start),  # shift to positive values
       y = log(abs(rstudent(m_2)))) |>
  ggplot(aes(x = x, y = y)) +
  geom_point(alpha = .6) +
  geom_smooth(formula = y ~ x, method = "lm", se = FALSE,
              color = "blue", linetype = "dashed") +
  labs(title = "Spread-Level Plot for m_2",
       x = "Predicted Values",
       y = "|Studentized Residuals|") +
  scale_x_continuous(trans = "exp",
                     breaks = scales::trans_breaks("exp", function(x) log(x)),
                     labels = scales::trans_format("exp",
                                                   format = scales::math_format(.x))) +
  scale_y_continuous(trans = "exp",
                     breaks = scales::trans_breaks("exp", function(y) log(y)),
                     labels = scales::trans_format("exp", scales::math_format(.x)))
\(1-b\) (where \(b\) is the slope of the regression line in this plot) is the suggested power transformation for \(Y\) to stabilize variance.
# Suggested power = 1 - b, where b is the slope of the spread-level
# regression (a sketch consistent with the definition above; the exact
# value depends on the `start` used)
b <- coef(lm(log(abs(rstudent(m_2))) ~ log(predict(m_2) + start)))[2]
1 - b
0.5226341
see also: car::spreadLevelPlot()
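A minimal call:

# Draws the spread-level plot and prints a suggested power transformation;
# note that it requires positive fitted values (cf. the start shift above)
car::spreadLevelPlot(m_2)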
Two groups independently developed a score test for constant variance (Breusch & Pagan, 1979; Cook & Weisberg, 1983):
Available as ncvTest() in car package.
Do not use it blindly.
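For example:

# Score test of the null hypothesis of constant error variance
car::ncvTest(m_2)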
Standard errors are inaccurate when variance of residuals is not constant.
A procedure to provide White (1980) corrected SEs is described in Fox (2008), Chapter 12, pp. 275-276. Compare the uncorrected and corrected tests of the coefficients below:
Uncorrected Tests of Coefficients
White (1980) Heteroscedasticity-corrected SEs and Tests
library(broom)  # for tidy()

# Heteroscedasticity-Corrected Covariance Matrices (hccm)
corrected_ses <- sqrt(diag(car::hccm(m_2)))

tidy(m_2) |>
  select(term, estimate) |>
  add_column(std.error = corrected_ses) |>
  mutate(statistic = estimate / std.error,
         p.value = 2 * (pt(abs(statistic),
                           df = m_2$df.residual,
                           lower.tail = FALSE)))

# A tibble: 4 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   26.6      6.90        3.85 0.000220
2 bac         -229.      72.1        -3.18 0.00204
3 ta             0.126    0.0304      4.13 0.0000799
4 sex_c        -15.5      5.70       -2.72 0.00791
If the linearity assumption is not met, coefficients are biased.
Plot the partial residual (\(e_{i(j)} = e_i + b_jX_{ij}\)) against each predictor.
These plots can include factors but cannot include interactions with factors; code such regressors manually. See the sketch below.
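A minimal call using car's component-plus-residual plots (valid here because m_2 includes no interactions):

# Partial residual (component + residual) plot for each predictor;
# compare the smoothed trend to the linear fit to spot non-linearity
car::crPlots(m_2)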
Peña and Slate (2006) validated a global test of linear model assumptions.
It is provided by gvlma() from the gvlma package.
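The call producing the output below:

# Global validation of linear model assumptions for m_2
gvlma::gvlma(x = m_2)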
Call:
lm(formula = fps ~ bac + ta + sex_c, data = data_rm_outliers)

Coefficients:
(Intercept)          bac           ta        sex_c
    26.5528    -228.8721       0.1256     -15.4874


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05

Call:
 gvlma::gvlma(x = m_2)

                    Value  p-value                   Decision
Global Stat        9.8263 0.043458 Assumptions NOT satisfied!
Skewness           0.2921 0.588886   Assumptions acceptable.
Kurtosis           1.0393 0.307987   Assumptions acceptable.
Link Function      0.1898 0.663091   Assumptions acceptable.
Heteroscedasticity 8.3051 0.003953 Assumptions NOT satisfied!
Power transformations (next unit) are very useful for correcting problems with normality, constant variance, and linearity of errors.
Polynomial regression (710) is useful when you have quadratic, cubic, etc. effects of \(X\)s on \(Y\).
Generalized linear models (e.g., Logistic regression; last unit) are also available.
1. Exact X
2. Independence
3. Normally distributed errors
4. Constant variance for errors
5. Linearity (error distributions all have a mean of 0)