Unit 8: Dealing with Messy Data II - Model Assumptions

Goals of Unit

Understand 5 assumptions of GLMs.
Understand consequences of assumption violations.
- Inefficient standard errors (low power, type II errors).
- Inaccurate standard errors (incorrect statistical tests, type I errors).
- Inaccurate parameter estimates.
Learn how to detect violations (statistical tests and visual diagnosis).
Recognize the range of options available when violations are detected. Implementation of these options will be covered over the semester.
Examination of assumptions will further inform you about your data.

Assessment of Assumptions for Sig. Tests

All GLM procedures commonly make the 5 assumptions below.

When these assumptions are met, OLS regression coefficients are MVUE (Minimum Variance Unbiased Estimators) and BLUE (Best Linear Unbiased Estimators).

With the exception of \(\#1\), these assumptions are expressed (and assessed) with respect to the residuals around the prediction line.

Exact X: The IVs are assumed to be known exactly (i.e., without measurement error).
Independence: Residuals are independently distributed (prob. of obtaining a specific observation does not depend on other observations).
Normality: All residual distributions are normally distributed.
Constant variance: All residual distributions have a constant variance.
Linearity: All residual distributions (i.e., for each \(\hat{Y}\)) are assumed to have means equal to zero.

Exact X

Violations of the Exact X assumption lead to biased (i.e., inaccurate) estimates of regression coefficients.

Violations are caused by problems with reliability of measurement of your predictors.

Question

In simple, bivariate regression, how will reducing reliability affect the regression model?

Question

In simple, bivariate regression, how will reducing reliability affect the regression model?

It will reduce \(b_1\). We will underestimate the strength of relationship between \(X\) and \(Y\).

In multiple predictor models, the bias can be either positive or negative based on the nature of the correlations among the predictors. Use reliable variables!.

Question

What are the implications of unreliable \(X\) for the use of covariates to control variables?

Covariates only control from the construct they measure to the degree that they are reliable (and valid) measures of that construct.

Analysis that rely on unreliable covariates are not controlling the variance for the construct well.

Independence

Violations of the independence of residuals assumption can compromise the validity of our statistical tests (inaccurate standard errors).

Violations of residuals independence is a function of the research design caused by repeated measures on the same individual or related individuals/observations (participants in same family, school, etc).

Often difficult to detect in data but clear from research design.

Can be fixed by a variety of approaches including repeated measures analyses or multi-level, mixed effects, and/or hierarchical linear models (next semester).

Assessment of Residuals: General Issues

Remaining three assumptions (Normality distributed residuals with a mean of 0 and constant variance) can be assessed via examination of the residuals.

Use of Graphical Methods is emphasized.

Statistical tests of assumptions exist but should be used cautiously.

Assessment of assumptions about residuals is an inexact science: Conclusions are tentative.

The process of examining residuals will increase your understanding of your data.

- May suggest transformations of your data.
- May suggest alternative analytic strategies.
- Will increase your confidence in your conclusions.

Residuals by Predicted Y: Normal M = 0, S = Constant

Code

norm_tbl <- tibble(y_hat = rep(1:25, 1000),
                   e = rnorm(25000, mean = 0, sd = 1))

norm_tbl |>
  ggplot(aes(x = y_hat,
             y = e)) +
    geom_jitter(alpha = .4, width = .75, height = 0, size = .5) +
    geom_hline(yintercept = 0, color = "blue", linewidth = 1) +
    geom_vline(xintercept = 10, color = "red", linewidth = 1) +
    geom_vline(xintercept = 15, color = "red", linewidth = 1) +
    geom_vline(xintercept = 20, color = "red", linewidth = 1) +
    scale_x_continuous(breaks = c(0, 5, 10, 15, 20, 25))

Code

hist_10 <- norm_tbl |> 
  filter(y_hat == 10) |> 
  ggplot(aes(x = e)) +
  geom_histogram(aes(y = after_stat(density)), color = "black", 
                 fill = "light grey", bins = 10) +
  geom_density() +
  labs(title = "y_hat = 10")

hist_15 <- norm_tbl |> 
  filter(y_hat == 15) |> 
  ggplot(aes(x = e)) +
  geom_histogram(aes(y = after_stat(density)), color = "black", 
                 fill = "light grey", bins = 10) +
  geom_density() +
  labs(title = "y_hat = 15")

hist_20 <- norm_tbl |> 
  filter(y_hat == 20) |> 
  ggplot(aes(x = e)) +
  geom_histogram(aes(y = after_stat(density)), color = "black", 
                 fill = "light grey", bins = 10) +
  geom_density() +
  labs(title = "y_hat = 20")

hist_10 + hist_15 + hist_20

Normally Distributed Errors

The errors for each \(\hat{Y}\) are assumed to be normally distributed. Normally distributed errors are required for OLS regression coefficients to be MVUE but not BLUE.

Central limit theory indicates that even with non-normal errors, significance tests and confidence intervals are approximately correct with large \(N\).

Coefficients are still best unbiased efficient estimators among linear solutions (i.e., BLUE) but more efficient non-linear solutions may exist (e.g., Generalized Linear Models such as Poisson regression for thick tailed distributions).

Mean may not be best measure of center of a highly skewed distribution.

Multimodal error distributions suggest the omission of one or more categorical variables that divide the data into groups.

Transformations may correct shape of residuals (Unit 9).

Code

path_data <- "data_lecture"

data <- read_csv(here::here(path_data, "07_three_predictors_fps.csv"), 
                 show_col_types = FALSE) |> 
   mutate(sex_c = if_else(sex == "female", -.5, .5))

Lets refit our last model from the previous unit

data_rm_outliers <- data |> 
  filter(!subid %in% c("0125", "2112"))

m_2 <- lm(fps ~ bac + ta + sex_c, data = data_rm_outliers)

Density plot of residuals

We can use a density plot to plot the model’s residuals against a normal distribution.

ggplot() +
  geom_density(aes(x = value), data = enframe(rstudent(m_2), name = NULL)) +
  labs(title = "Density Plot to Assess Normality of Residuals",
       x = "Studentized Residual") +
  geom_line(aes(x = x, y = y), 
            data = tibble(x = seq(-4, 4, length.out = 100),
                          y = dnorm(seq(-4, 4, length.out = 100), 
                                    mean = 0, sd = sd(rstudent(m_2)))), 
            linetype = "dashed", color = "blue")

Quantile-Quantile Plot

Better still We can use qqplot() from the car package to assess normality of residuals.

car::qqPlot(m_2, id = FALSE, simulate = TRUE,
            main = "QQ Plot to Assess Normality", 
            ylab = "Studentized Residuals")

Here are three examples of qq plots for normal distributions of small samples (N = 100)

Code

set.seed(102030)
tibble(x = rnorm(100, mean=0, sd=1)) |> 
  pull(x) |>
  car::qqPlot(id = FALSE, 
              main = "QQ Plot for Random Normal (N = 100)", 
       ylab = "n1")

Code

tibble(x = rnorm(100, mean=0, sd=1)) |> 
  pull(x) |>
  car::qqPlot(id = FALSE, 
              main = "QQ Plot for Random Normal (N = 100)", 
              ylab = "n2")

Code

tibble(x = rnorm(100, mean=0, sd=1)) |> 
  pull(x) |>
  car::qqPlot(id = FALSE, 
              main = "QQ Plot for Random Normal (N = 100)", 
              ylab = "n3")

And here for the chi-squared (positive skewed)

And the t-distributions (which is heavy tailed)

Constant Variance

The errors for each \(\hat{Y}\) are assumed have a constant variance (homoscedasticity). This is necessary for the OLS estimated coefficients to be BLUE.

If the errors are heteroscedastic, the coefficients remain unbiased but the efficiency (precision of estimation) is impaired and the coefficient SEs become inaccurate. The degree of the problem depends on severity of violation and sample size.

Rough rule is that estimation is seriously degraded if the ratio of largest to smallest variance is 10 or greater (or more conservatively, 4 or greater)

Transformations may fix this issue (next unit).
Weighted Least Squares provides an alternative to estimation when heteroscedasticity exists (maybe next semester?).
Corrections also exist for SEs when errors are heteroscedastic (more on this in a moment).

Residual by Predicted Values

Can look at plot of studentized residuals vs. predicted values

Code

tibble(x = predict(m_2),
       y = rstudent(m_2)) |> 
  ggplot(aes(x = x, y = y)) +
  geom_point(alpha = .6) +
  geom_hline(yintercept = 0, linetype = "dashed", 
             color = "blue", linewidth = 1) +
  labs(title = "Studentized Residuals vs.  Predicted Values",
       x = "Predicted Values",
       y = "Studentized Residuals")

Spread Level Plot

A spread level plot is a plot of the log(abs(studentized residuals) vs. log(predicted values).

Predicted values must all be positive to take log. Can use start to handle negative predicted values.

min(predict(m_2))

[1] -6.428208

start <- 7

Code

tibble(x = log(predict(m_2) + start), # shift to positive values
       y = log(abs(rstudent(m_2)))) |> 
  ggplot(aes(x = x, y = y)) +
  geom_point(alpha = .6) +
  geom_smooth(formula = y ~ x, method = "lm", se = FALSE, 
              color = "blue", linetype = "dashed") +
  labs(title = "Spread-Level Plot for m_2",
       x = "Predicted Values",
       y = "|Studentized Residuals|") +
  scale_x_continuous(trans = "exp", 
                     breaks = scales::trans_breaks("exp", function(x) log(x)),
                     labels = scales::trans_format("exp", 
                                                   format = scales::math_format(.x))) +
  scale_y_continuous(trans = "exp", 
                     breaks = scales::trans_breaks("exp", function(y) log(y)),
                     labels = scales::trans_format("exp", scales::math_format(.x)))

\(1-b\) (from the regression line) is the suggested power transformation for \(Y\) to stabilize variance.

1 - (lm(log(abs(rstudent(m_2))) ~ log(predict(m_2) + start))$coefficients[2])

log(predict(m_2) + start) 
                0.5226341

Statistical Test

Two groups independently developed test for constant variance:

Breusch,T. S. and Pagan,A. R. (1979) A simple test for heteroscedasticity and random coefficient variation. Econometrica, 47, 1287-1294.
Cook & Weisberg (1983)^[Cook,R. D. and Weisberg,S. (1983) Diagnostics for heteroscedasticity in regression. Biometrika, 70, 1-10.

Available as ncvTest() in car package.

Do not use it blindly.

All statistical tests are designed to be sensitive to specific types of violations.
May miss other types of violations.
May also provide false positive due to other aspects of the distribution.

car::ncvTest(m_2)

Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 21.89294, Df = 1, p = 0.0000028829

Correcting SEs

Standard errors are inaccurate when variance of residuals is not constant.

A procedure to provide White (1980) corrected SEs is described in Fox (2008), chapter 12, pp 275-276.

See:

White,H. (1980) A heterskedastic consistent covariance matrix estimator and a direct test of heteroskedasticity. Econometrica 48, 817–838.
Long,J.S. and Ervin,L.H. (2000) Using heteroscedasity consistent standard errors in the linear regression model. The American Statistician 54, 217–224.

Uncorrected Tests of Coefficients

tidy(m_2)

# A tibble: 4 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   26.6      6.44        4.13 0.0000825
2 bac         -229.      72.4        -3.16 0.00214  
3 ta             0.126    0.0273      4.61 0.0000135
4 sex_c        -15.5      5.70       -2.72 0.00790

White (1980) Heteroscedasticity-corrected SEs and Tests

# Heteroscedasticity-Corrected Covariance Matrices (hccm)
corrected_ses <- sqrt(diag(car::hccm(m_2)))

tidy(m_2) |> 
  select(term, estimate) |> 
  add_column(std.error = corrected_ses) |> 
  mutate(statistic = estimate/std.error,
         p.value = 2 * (pt(abs(statistic), 
                         df = m_2$df.residual, 
                         lower.tail=FALSE)))

# A tibble: 4 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   26.6      6.90        3.85 0.000220 
2 bac         -229.      72.1        -3.18 0.00204  
3 ta             0.126    0.0304      4.13 0.0000799
4 sex_c        -15.5      5.70       -2.72 0.00791

Linearity: Component + Residuals Plots

If Linearity assumption is not met, coefficients are biased.

Plot partial residual (\(e_{i(j)} = e_i + b_jX_{ij}\)) by each predictor.
Can include factors but can not include interactions with factors. Code regressors manually.

car::crPlots(m_2, ask = FALSE)

Global Test of Model Assumptions

Pena and Slate (2006) validated a global test of linear model assumptions.

Pena,E.A. and Slate,E.H. (2006). Global validation of linear model assumptions, Journal of the American Statistical Association, 101, 341-354.

Provided by gvlma() using gvlma package

gvlma::gvlma(m_2)


Call:
lm(formula = fps ~ bac + ta + sex_c, data = data_rm_outliers)

Coefficients:
(Intercept)          bac           ta        sex_c  
    26.5528    -228.8721       0.1256     -15.4874  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = m_2) 

                    Value  p-value                   Decision
Global Stat        9.8263 0.043458 Assumptions NOT satisfied!
Skewness           0.2921 0.588886    Assumptions acceptable.
Kurtosis           1.0393 0.307987    Assumptions acceptable.
Link Function      0.1898 0.663091    Assumptions acceptable.
Heteroscedasticity 8.3051 0.003953 Assumptions NOT satisfied!

Some Solutions

Power transformations (next unit) are very useful for correcting problems with normality, constant, variance, and linearity of errors.
Polynomial regression (710) is useful when you have quadratic, cubic, etc. effects of \(X\)s on \(Y\).
Generalized linear models (e.g., Logistic regression; last unit) are also available.

Summary of Violation Consequences and Solutions

Exact X
- Inaccurate (biased) parameter estimates.
- Use reliable measures of \(X\)s, use SEM with latent variables.
Independence
- Inaccurate standard errors.
- Use repeated measures, use multi-level models.
Normally distributed errors
- Inefficient standard errors.
- Consider omitted variables, use transformations, use generalized linear models.
Constant variance for errors
- Inaccurate and inefficient standard errors.
- Use SE corrections (but still inefficient), use transformations, use weighted least squares.
Linearity (Error distributions all have mean of 0)
- Inaccurate (biased) parameter estimates.
- Use transformations, use polynomial regression, use generalized linear models.