Unit 3: Inferences About a Single Mean (1 Parameter Models)

Units 3-4 Organization

  • First, consider details of simplest model (one parameter estimate; mean-only model; no \(X\)s)

  • Next, examine simple (bivariate) regression (two parameter estimates; one \(X\) for one quantitative predictor)

  • These provide a critical foundation for all linear models

  • Subsequent units will generalize to one dichotomous variable (Unit 5), multiple predictors (Units 6-7), and beyond…

Linear Models as Models

Linear models (including regression) are models.

\(DATA = MODEL + ERROR\)


Three general uses for models:

  1. Describe and summarize DATA (\(Y\)s) in a simpler form using MODEL.
  2. Predict DATA (\(Y\)s) from MODEL.
  3. Understand (test inferences about) complex relationships between individual regressors (\(X\)s) in MODEL and the DATA (\(Y\)s). How precise are estimates of relationship?


\(DATA = MODEL + ERROR\)


MODELS are simplifications of reality

  • As such, there is ERROR
  • They also make assumptions that must be evaluated

Fear Potentiated Startle

We were interested in producing anxiety in the laboratory

  • To do this, we developed a procedure where we expose people to periods of unpredictable electric shock administration alternating with periods of safety

  • We measure their startle response in the shock and safe periods

  • We use the difference between their startle during the shock vs. safe periods (shock − safe) to determine if they are anxious

  • This difference is called fear-potentiated startle (FPS). Our procedure works if FPS > 0. We need a model of FPS scores to determine if FPS > 0

Fear Potentiated Startle: One parameter model

A very simple model for the population of FPS scores would predict the same value (\(\beta_0\)) for everyone in the population.

\(\hat{Y}_i=\beta_0\)


We would like this value to be the best prediction.


Question

In the context of DATA = MODEL + ERROR, how can we quantify best?

We want to predict some characteristic about the population of FPS scores that minimizes the ERROR from our model.

ERROR = DATA – MODEL

\(\varepsilon_i=Y_i-\hat{Y}_i\)

There is an error (\(\varepsilon_i\)) for each population score.

Total Error

Question

How might we quantify total model error?

1. Sum of errors across all scores in the population isn’t ideal because positive and negative errors will tend to cancel each other out. We won’t use this.

  • \(\sum(Y_i-\hat{Y}_i)\)

2. Sum of absolute values of errors could work. If we selected \(\beta_0\) to minimize the sum of the absolute value of errors, \(\beta_0\) would equal the median of the population. We could use this, but there is a better option.

  • \(\sum(|Y_i-\hat{Y}_i|)\)

3. Sum of squared errors (SSE) could work. If we selected \(\beta_0\) to minimize the sum of squared errors, \(\beta_0\) would equal the mean of the population. This is the best option and what we will use.

  • \(\sum(Y_i-\hat{Y}_i)^2\)
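These claims can be checked numerically. A minimal sketch with a made-up sample (not the FPS data), comparing candidate values of \(b_0\):

```r
# Toy example: which single predicted value minimizes each error summary?
y <- c(1, 2, 3, 4, 100)                 # small sample with one extreme score

sse <- function(b0) sum((y - b0)^2)     # sum of squared errors
sae <- function(b0) sum(abs(y - b0))    # sum of absolute errors

sse(mean(y)) < sse(median(y))           # TRUE: the mean minimizes SSE
sae(median(y)) < sae(mean(y))           # TRUE: the median minimizes sum of |errors|
```

Note how the extreme score pulls the mean (22) well above the median (3), yet each estimator still wins on its own criterion.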

One Parameter Model for FPS

For the moment, let's just accept that we prefer to minimize SSE (more on that in a moment). You should predict the population mean FPS for everyone.

  • \(\hat{Y}_i=\beta_0\) … where \(\beta_0=\mu\)


Question

What is the problem with this model?

We don’t know the population mean for FPS scores (\(\mu\)).

Question

How could we solve this problem to develop a model?

We can collect a sample from the population and use the sample mean (\(\overline{X}\)) as an estimate of the population mean (\(\mu\)).

\(\overline{X}\) is an unbiased estimate for \(\mu\).

Model Parameter Estimation

DATA = MODEL + ERROR


Population model

  • \(Y_i=\beta_0+\varepsilon_i\)

  • \(\hat{Y}_i=\beta_0\) … where \(\beta_0=\mu\)


Estimate population parameters from sample

  • \(Y_i=b_0+e_i\)

  • \(\hat{Y}_i=b_0\) … where \(b_0=\overline{X}\)

Least Squares Criterion

In ordinary least squares (OLS) regression and other least squares linear models, the model parameter estimates (e.g., \(b_0\)) are calculated such that they minimize the sum of squared errors (SSE) in the sample in which you estimate the model.


\(\text{SSE}=\sum(Y_i-\hat{Y}_i)^2\)


\(\text{SSE}=\sum e_i^2\)

Properties of Parameter Estimates

There are three properties that make a parameter estimate attractive.

  1. Unbiased: Mean of the sampling distribution for the parameter estimate is equal to the value for that parameter in the population.


  2. Efficient: The sample estimates are close to the population parameter. In other words, the narrower the sampling distribution for any specific sample size \(N\), the more efficient the estimator. Efficient means a small SE for the parameter estimate.


  3. Consistent: As the sample size increases, the sampling distribution becomes narrower (more efficient). Consistent means that as \(N\) increases, the SE for the parameter estimate decreases.
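A quick simulation sketch can make consistency concrete. The mean and SD below are made-up values loosely matching the FPS sample:

```r
# Simulate the sampling distribution of the mean at two sample sizes and
# compare their SDs (i.e., the SE of the mean). Consistency: SE shrinks as N grows.
set.seed(123)
sd_of_means <- function(n, reps = 2000) {
  sd(replicate(reps, mean(rnorm(n, mean = 32, sd = 37.5))))
}

sd_of_means(25)    # larger SE
sd_of_means(100)   # smaller SE: the estimator is consistent
```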

Least Squares Criterion (Continued)

If the \(\varepsilon_i\) are normally distributed, both the median and the mean are unbiased and consistent estimators.


The variance of the sampling distribution for the mean is:

\(\frac{\sigma^2}{N}\)


The variance of the sampling distribution for the median is:

\(\frac{\pi\sigma^2}{2N}\)


Therefore, the mean is the more efficient estimator.

For this reason, we tend to prefer to estimate our models by minimizing the sum of squared errors.
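This efficiency result can be verified by simulation. A sketch with standard-normal samples; the ratio of the two sampling variances should approach \(\pi/2 \approx 1.57\):

```r
# Compare the sampling variance of the mean vs. the median for normal data
set.seed(456)
n <- 100
reps <- 5000
samples <- matrix(rnorm(n * reps), nrow = n)   # each column is one sample

var_mean   <- var(apply(samples, 2, mean))     # approx sigma^2 / N
var_median <- var(apply(samples, 2, median))   # approx pi * sigma^2 / (2N)

var_median / var_mean                          # near pi/2
```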

Fear-potentiated Startle During Threat of Shock

options(conflicts.policy = "depends.ok") 
library(tidyverse)
library(broom)

theme_set(theme_classic())
path_data <- "data_lecture"  

We will frequently use the tidy() function from the broom package, so I am loading that package here. The alternative is to not load the package and instead use broom::tidy() each time we want to use the function. I prefer to load the package given that it's part of the extended tidyverse.

theme_set(theme_classic()) sets my preferred theme for ggplots. Setting it here affects all plots in this document.


data <- read_csv(here::here(path_data, "03_single_mean_fps.csv"), 
                 show_col_types = FALSE) |> 
  glimpse()
Rows: 96
Columns: 2
$ subid <dbl> 11, 12, 13, 14, 15, 16, 21, 22, 23, 24, 25, 26, 111, 112, 113, 1…
$ fps   <dbl> 19.4909278, 48.4069444, -22.5285000, 6.7237833, 89.6587222, 40.5…

Descriptives and Univariate Plots

data |> 
  summarise(n = n(),
            mean = mean(fps),
            sd = sd(fps),
            min = min(fps),
            max = max(fps))
# A tibble: 1 × 5
      n  mean    sd   min   max
  <int> <dbl> <dbl> <dbl> <dbl>
1    96  32.2  37.5 -98.1  163.
data |> 
  ggplot(aes(x = fps)) +
  geom_histogram(color = "black", fill = "light grey", bins = 20) + 
  scale_x_continuous(breaks = c(-100, -50, 0, 50, 100, 150, 200)) 

FPS Experiment: The Inference Details

Goal: Determine if our shock threat procedure is effective at potentiating startle (increasing startle during threat relative to safe).


Step 1: Create a simple model of FPS scores in the population

  • \(\text{FPS}=\beta_0\)

Step 2. Collect sample of \(N=96\) to estimate \(\beta_0\)

Step 3. Calculate sample parameter estimate (\(b_0\)) that minimizes SSE in sample

Step 4. Use \(b_0\) to test hypotheses

  • \(H_0: \beta_0 = 0\)
  • \(H_a: \beta_0 \neq 0\)

Estimating a One Parameter Model in R

m <- lm(fps ~ 1, data = data)

Here we are fitting a linear model with FPS regressed on the intercept only. In other words, we are fitting a model that predicts FPS using only the sample mean.

We can get the predicted value for each individual in the sample using this model with the function predict().

\(\hat{Y}=32.19\)


predict(m)
       1        2        3        4        5        6        7        8 
32.19084 32.19084 32.19084 32.19084 32.19084 32.19084 32.19084 32.19084 
       9       10       11       12       13       14       15       16 
32.19084 32.19084 32.19084 32.19084 32.19084 32.19084 32.19084 32.19084 
(output truncated: all 96 predicted values are 32.19084)

We can pull out the errors (\(e_i=Y_i-\hat{Y}_i\)) for each observation in the sample using residuals().


residuals(m)
           1            2            3            4            5            6 
 -12.6999127   16.2161040  -54.7193405  -25.4670572   57.4678817    8.3829373 
           7            8            9           10           11           12 
  -2.7175738  -16.8541238   19.6643817   58.6873817   78.7543262   35.0963817 
          13           14           15           16           17           18 
 -29.4627960   72.7258928  -31.7275460   36.5672151   19.1260706  -30.9964738 
          19           20           21           22           23           24 
   1.5669373   11.7176040    9.3662151  -25.3710072 -130.2886183   53.1913817 
          25           26           27           28           29           30 
  29.8681317   59.8164373  -14.1219516   34.7095484   17.9774928   47.3338484 
          31           32           33           34           35           36 
  61.4058262   67.7537262  104.6339928   36.5526595   14.2658262  -16.7506349 
          37           38           39           40           41           42 
 -29.6592294   12.9909373   20.9858817  -29.1695572  -24.1598966  -19.2076849 
          43           44           45           46           47           48 
  11.7108262  -25.2434516  -18.4250627  -20.3317905   -8.4337683  -18.0094960 
          49           50           51           52           53           54 
 -12.7704849    3.9210484  -58.2597294  -35.5108960  -32.0183927   -1.7377294 
          55           56           57           58           59           60 
   0.3123817  -35.5405016  -12.5921183   25.0772151  -20.6439405   37.4066428 
          61           62           63           64           65           66 
   9.3974373  130.5457706    5.2138262  -13.0036627   -9.8150183  -27.4784549 
          67           68           69           70           71           72 
  17.0578817   27.5951151  -28.0089794  -28.5735072  -23.4260627    4.5087151 
          73           74           75           76           77           78 
  77.8639373  -21.4575572  -18.5716738  -17.1700072   27.4325484  -26.4386960 
          79           80           81           82           83           84 
 -18.1054016    6.1488262  -14.5139683    1.6943262  -21.4997294  -25.3833322 
          85           86           87           88           89           90 
 -26.9358794  -17.5872294  -25.7722738    4.8073817  -26.9565572  -32.1845627 
          91           92           93           94           95           96 
 -31.0086183  -34.0540127  -17.4630572  -31.4756127  -31.8114616  -15.9328183 

We can also easily calculate the SSE with the following code:


sum(residuals(m)^2)
[1] 133888.3


  • This tells us how well the model fits the data.
  • Specifically, it is the sum of the squared differences between the predicted values and the actual participant scores.

We also may want to look at the parameter estimates

  • We will also call these model (or regression) coefficients (in this case, we are only looking at the intercept)
  • We can use the tidy() function from the broom package to do this


m |> 
  tidy()

Again, remember you could have instead used broom::tidy() if you didn’t load the broom package.
# A tibble: 1 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     32.2      3.83      8.40 4.26e-13


The estimate column gives \(b_0\), the unbiased sample estimate of \(\beta_0\); the std.error column gives its standard error.

It is also called the intercept in regression (more on this later).

\(\hat{Y}_i=b_0=32.2\)

m |> 
  tidy()
# A tibble: 1 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     32.2      3.83      8.40 4.26e-13


The statistic is the t-statistic to test the \(H_0\) that \(\beta_0=0\).

The probability (p-value) of obtaining a sample \(b_0=32.2\) if \(H_0\) is true (\(\beta_0=0\)) is < .0001.

Sampling Distribution: Testing Inferences About \(\beta_0\)

m |> 
  tidy()
# A tibble: 1 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     32.2      3.83      8.40 4.26e-13

Question

Describe the logic of how the standard error, t statistic and p-value were determined given your understanding of sampling distributions.

1. Establish null and alternative hypotheses.

  • \(H_0: \beta_0 = 0; H_a: \beta_0 \neq 0\)

2. If \(H_0\) is true, the sampling distribution for \(b_0\) will have a mean of 0. We can estimate standard deviation of the sampling distribution with SE for \(b_0\).

3. \(b_0\) is approximately 8 standard deviations above the expected mean of the distribution if \(H_0\) is true.

  • \(t(df=N-P)=\frac{b_0-0}{\text{SE}_{b_0}}=\frac{32.2-0}{3.8}=8.40\)

4. We can use pt() to calculate the exact \(p\) value for this parameter estimate given \(H_0\).

pt(8.40,95,lower.tail=FALSE) * 2
[1] 0.0000000000004293253

The probability of obtaining a sample \(b_0\) = 32.2 (or more extreme) if \(H_0\) is true is very low (< .05). Therefore we reject \(H_0\) and conclude that \(\beta_0 \neq 0\) and \(b_0\) is our best (unbiased) estimate of it.

Of course, we don’t need to calculate the p-value with pt() because we got it directly from tidy().

tibble(b0 = seq(-40, 40, .01),
       density = dt(b0 / tidy(m)$std.error, m$df.residual)) |> 
  ggplot(aes(x = b0, y = density)) +
  geom_line() +
  geom_vline(xintercept = tidy(m)$estimate, color = "red") +
  labs(title = "Sampling Distribution for b0")

Statistical Inference and Model Comparisons

Statistical inference about parameters is fundamentally about model comparisons.

  • You are implicitly (t-test of parameter estimate) or explicitly (F-test of model comparison) comparing two different models of your data.

  • We follow Judd et al. and call these two models the compact model and the augmented model.

  • The compact model will represent reality as the null hypothesis predicts. The augmented model will represent reality as the alternative hypothesis predicts.

  • The compact model is simpler (fewer parameters) than the augmented model. It is also nested in the augmented model (i.e., its parameters are a subset of the augmented model’s parameters).

Model Comparisons: Testing Inferences about \(\beta_0\)

\(\hat{\text{FPS}_i}=\beta_0\)

  • \(H_0: \beta_0 = 0\)
  • \(H_a: \beta_0 \neq 0\)


Compact model:

  • \(\hat{\text{FPS}_i}=0\)
  • We estimate 0 parameters (\(P=0\)) in this compact model.

Augmented model:

  • \(\hat{\text{FPS}_i}=\beta_0 (\approx b_0)\)
  • We estimate 1 parameter (\(P=1\)) in this augmented model.


Choosing between these two models is equivalent to testing if \(\beta_0 = 0\) as you did with the t-test.

Model Comparison Plots

data |> 
  ggplot(aes(x = "", y = fps)) +
  geom_jitter(width = 0.1, alpha = .6, size = 2) +
  theme(axis.line.x = element_blank(),
        axis.ticks.x = element_blank()) +
  xlab(NULL) +
  geom_hline(yintercept = 0, color = "red", linewidth = 1) +
  geom_hline(yintercept = tidy(m)$estimate, color = "blue", linewidth = 1)

Model Comparisons: Testing Inferences about \(\beta_0\) (Continued)

Compact model:

  • \(\hat{\text{FPS}_i}=0\)
  • \(\text{SSE}_c = \sum(Y_i-0)^2\)

Augmented model:

  • \(\hat{\text{FPS}_i}=\beta_0 (\approx b_0)\)
  • \(\text{SSE}_a = \sum(Y_i-\hat{Y}_i)^2\)

We can compare (and choose between) these two models by comparing their total error (SSE) in our sample.

For any model: \(\text{SSE} = \sum(Y_i-\hat{Y}_i)^2\)


We can calculate the SSE for the compact model (\(\text{SSE}_c = \sum(Y_i-0)^2\)) directly:

sum((data$fps - 0)^2)
[1] 233368.3


We can calculate the SSE for the augmented model (\(\text{SSE}_a = \sum(Y_i- 32.2 )^2\)) directly:

sum((data$fps - tidy(m)$estimate)^2) 
[1] 133888.3


We can also get the SSE for the augmented model by summing the squared errors/residuals of the fitted model

sum(residuals(m)^2)
[1] 133888.3

Compact model:

  • \(\hat{\text{FPS}_i}=0\)
  • \(\text{SSE}_c =\) 233368.3
  • P = 0

Augmented model:

  • \(\hat{\text{FPS}_i}=\beta_0 (\approx b_0)\)
  • \(\text{SSE}_a =\) 133888.3
  • P = 1


\(F(P_a-P_c, N-P_a) = \frac{(\text{SSE}_c-\text{SSE}_a)/(P_a-P_c)}{\text{SSE}_a/(N-P_a)}\)

\(F(1-0, 96 -1)= \frac{(233368.3 - 133888.3)/(1-0)}{133888.3 / (96 - 1)}\)

\(F(1, 95 )= 70.59, p < .0001\)


pf(70.59, 1, 95, lower.tail = FALSE)
[1] 0.0000000000004255967
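As a sanity check, the two approaches agree: for a comparison that differs by one parameter, the F statistic is the square of the t statistic reported by tidy().

```r
# With a 1-df model comparison, F = t^2
t_stat <- 8.4015      # t for the intercept from tidy(m)
t_stat^2              # approx 70.59, matching the F statistic above
```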

Sampling Distribution vs. Model Comparison

The two approaches to testing \(H_0\) about parameters (\(\beta_0, \beta_j\)) are statistically equivalent.

They are complementary approaches with respect to conceptual understanding of GLMs.


Sampling Distribution

  • Focus on population parameters and their estimates.
  • Tight connection to sampling and probability distributions.
  • Understanding of SE (sampling error/power; confidence intervals, graphic displays).


Model Comparison

  • Focus on models themselves.
  • Highlights model fit (SSE) and model parsimony (P).
  • Clearer link to PRE (\(\eta_p^2\)).
  • Test comparisons that differ by >1 parameter (discouraged).

Effect Sizes

Your parameter estimates are descriptive.

  • They describe effects in the original units of the predictors and DV.
  • Report them in your paper.


There are many other effect size estimates available. You will learn two that we prefer.

  • Partial eta squared (\(\eta_p^2\)): Judd et al call this PRE (proportional reduction in error).
  • Eta squared (\(\eta^2\)): This is also commonly referred to as \(\Delta R^2\) in regression.

Partial eta squared (\(\eta_p^2\); PRE)

Compact model:

  • \(\hat{\text{FPS}_i}=0\)
  • SSE = 233368.3
  • P = 0

Augmented model:

  • \(\hat{\text{FPS}_i}=\beta_0 (\approx b_0)\)
  • SSE = 133888.3
  • P = 1


We can use the formula for PRE to calculate how much the error is reduced by the augmented model relative to the compact model:

\(\frac{\text{SSE}_c-\text{SSE}_a}{\text{SSE}_c} = \frac{233368.3 - 133888.3}{233368.3} = 0.426\)

Our more complex model that includes \(\beta_0\) reduces prediction error (SSE) by approximately 43%. Not bad!
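The arithmetic above, using the SSE values reported earlier:

```r
# PRE: proportional reduction in error from compact to augmented model
sse_c <- 233368.3     # SSE for the compact model (predict 0 for everyone)
sse_a <- 133888.3     # SSE for the augmented model (predict the mean)

(sse_c - sse_a) / sse_c    # approx 0.426
```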


Note

We won’t introduce eta squared/\(\Delta R^2\) until we have two parameters in the model.

Confidence Interval for \(b_0\)

A confidence interval (CI) is an interval for a parameter estimate in which you can be fairly confident that you will capture the true population parameter (in this case, \(\beta_0\)).

  • Most commonly reported is the 95% CI.
  • Across repeated samples, 95% of the calculated CIs will include the population parameter.


confint(m)

Use the confint() function to calculate confidence intervals. The default is to provide 95% CIs, but you can change this using the level parameter if you wish.
               2.5 %   97.5 %
(Intercept) 24.58426 39.79742

Question

Given what you now know about confidence intervals and sampling distributions, what should the formula be?

\(\text{CI}(b_0)= b_0 \pm t(\alpha;N-P) * \text{SE}_{b_0}\)

For the 95% confidence interval this is approximately 2 SEs around our unbiased estimate of \(\beta_0\).
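A by-hand sketch of this formula, using the rounded estimate and SE from tidy() (so the endpoints will differ slightly from confint(m)):

```r
# 95% CI for b0: estimate +/- critical t * SE
b0 <- 32.19           # rounded estimate from tidy(m)
se <- 3.83            # rounded SE from tidy(m)
df <- 95              # N - P = 96 - 1

crit <- qt(.975, df)              # critical t, about 1.99 (close to 2)
c(b0 - crit * se, b0 + crit * se) # close to confint(m): 24.58, 39.80
```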

Question

How can we tell if a parameter is significant using only the confidence interval?

If a parameter estimate differs significantly from 0 at \(\alpha\) = .05, then its 95% confidence interval will not include 0.

This is also true for testing whether the parameter estimate is equal to any other non-zero value for the population parameter.

The one parameter (mean-only) model: Special Case

Question

What special case (specific analytic test) is statistically equivalent to the test of the null hypothesis: \(\beta_0\) = 0 in the one parameter model?

The one sample t-test testing if a population mean = 0.

t.test(data$fps)

    One Sample t-test

data:  data$fps
t = 8.4015, df = 95, p-value = 0.0000000000004261
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 24.58426 39.79742
sample estimates:
mean of x 
 32.19084 

Testing \(\beta_0\) = non-zero values

Question

How could you test an \(H_0\) regarding \(\beta_0\) = some value other than 0 (e.g., 10)? HINT: There are at least three methods.

Option 1: Compare SSE for the augmented model (\(\hat{Y}_i= \beta_0\)) to SSE from a different compact model for this new \(H_0\) (\(\hat{Y}_i= 10\)).

  • compact model: \(\hat{\text{FPS}_i}=10\)
  • augmented model: \(\hat{\text{FPS}_i}=\beta_0\)


Option 2: Recalculate t-statistic using this new \(H_0\).

  • \(t= \frac{b_0 - 10}{\text{SE}_{b_0}}\)


Option 3: Does the confidence interval for the parameter estimate contain this other value? No p-value provided.

confint(m)
               2.5 %   97.5 %
(Intercept) 24.58426 39.79742
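A sketch of Option 2 by hand, again using the rounded values from tidy():

```r
# Test H0: beta0 = 10 by shifting the numerator of the t statistic
b0 <- 32.19           # rounded estimate from tidy(m)
se <- 3.83            # rounded SE from tidy(m)

t_new <- (b0 - 10) / se
t_new                                        # about 5.79

2 * pt(abs(t_new), 95, lower.tail = FALSE)   # two-tailed p-value, well below .05
```

So we would also reject this \(H_0\): the population mean FPS appears to differ from 10 as well.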

Intermission…

One parameter (\(\beta_0\)) mean-only model

  • Description: \(b_0\) describes mean of \(Y\).
  • Prediction: \(b_0\) is predicted value that minimizes sample SSE.
  • Inference: Use \(b_0\) to test if \(\beta_0 = 0\) (default) or any other value. One sample t-test.


Two parameter (\(\beta_0, \beta_1\)) model

  • Description: \(b_1\) describes how \(Y\) changes as a function of \(X_1\). \(b_0\) describes the expected value of \(Y\) at a specific value (0) of \(X_1\).
  • Prediction: \(b_0\) and \(b_1\) yield predicted values that vary by \(X_1\) and minimize SSE in sample.
  • Inference: Test if \(\beta_1 = 0\). Pearson’s \(r\); independent sample t-test. Test if \(\beta_0=0\). Analogous to one-sample t-test controlling for \(X_1\), if \(X_1\) is mean-centered. Very flexible!