Introductions (preferred name, program/department and year; share name and pronouns in Slack)
Structure of course (after this short week)
Why these tools?
Quarto?
tidyverse?
tidymodels?
Why me?
Environment
Exams, Application Assignments and Quizzes
ChatGPT and other LLMs
We will start with a discussion of Yarkoni and Westfall today, and continue on Thursday, integrating the ISL chapter.
What follows today is mostly my notes on key topics from the paper! Not meant to be a lecture.
Question
What is the difference between association and prediction?
Association quantifies the relationship between variables within a sample (predictors-outcome).
Prediction requires using an established model to predict (future?) outcomes for new ("out-of-sample", "held-out") participants.
Much research in psychology demonstrates association but calls it prediction!
Association (sometimes substantially) overestimates the predictive strength of our models
Prediction requires:
Model trained/fit on data that does NOT include the participants/observations to be predicted (model already exists)
Predictors to be measured before outcome (for true prediction)
Consider how to make a lapse prediction for a new person when implementing a prediction model (see the sketch below for the general pattern)
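A minimal sketch of the distinction in base R with simulated data (the variables and numbers here are hypothetical, purely for illustration): association quantifies the predictor-outcome relationships within one sample, whereas prediction fits the model in a training sample and then generates predictions for held-out participants the model has never seen.

```r
set.seed(123)

# simulated data; x1 and x2 are hypothetical predictors of a numeric outcome y
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 0.5 * d$x1 + 0.3 * d$x2 + rnorm(n)

# association: quantify the x-y relationships within the full sample
fit_all <- lm(y ~ x1 + x2, data = d)
summary(fit_all)$r.squared  # in-sample (often optimistic) fit

# prediction: train on some participants, predict for held-out participants
train_rows <- sample(n, size = 150)
fit_train  <- lm(y ~ x1 + x2, data = d[train_rows, ])
preds      <- predict(fit_train, newdata = d[-train_rows, ])

# out-of-sample error for the held-out participants
mean((d$y[-train_rows] - preds)^2)
```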
The goal of scientific psychology is to understand human behavior. This involves both explaining behavior (i.e., identifying causes) and predicting (yet to be observed) behaviors.
Prediction focuses on questions about:
Question
Examples of valuable prediction without explanation?
Can you have explanation without prediction?
Question
Supervised vs. unsupervised machine learning?
Question
Supervised regression vs classification?
NOTE: We will not discuss reinforcement learning in this course
Question
What is a data generating process?
Question
Why do we estimate the data generating process?
Same as use of models in 610
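One way to make the idea concrete (a toy simulation, not from the paper): in a simulation we know the DGP because we wrote it ourselves, and "estimating the DGP" is fitting a model to a sample drawn from it.

```r
set.seed(456)

# a known (toy) data generating process: y = 2 + 3*x + noise
dgp <- function(n) {
  x <- runif(n, 0, 10)
  y <- 2 + 3 * x + rnorm(n, sd = 5)   # irreducible noise
  data.frame(x, y)
}

# in real research we never see the DGP; we only see a sample from it
samp <- dgp(100)

# estimating the DGP = fitting a model to the sample
coef(lm(y ~ x, data = samp))  # estimates of the true intercept (2) and slope (3)
```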
Question
What is it and why do we do it?
Question
How are replication and cross-validation different?
Reducible vs. Irreducible error?
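The standard decomposition of expected prediction error (e.g., as presented in ISL), where the true DGP is $Y = f(X) + \epsilon$ and $\hat{f}$ is our estimated model:

$$
E\big[(Y - \hat{f}(X))^2\big] = \underbrace{E\big[(f(X) - \hat{f}(X))^2\big]}_{\text{reducible}} \;+\; \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible}}
$$

We can shrink the reducible error by estimating $f$ better; the irreducible error (noise in the outcome itself) sets a floor on prediction accuracy no matter how good the model is.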
Question
What are the three general steps by which we estimate and evaluate the data generating process with a sample of data? Let's use all this vocabulary!
Question
What is underfitting, overfitting, bias, and variance?
Bias and variance are general concepts to understand during any estimation process
Conceptual example of bias-variance: Darts from Yarkoni and Westfall (2017)

Second Conceptual Example: Models to measure/estimate weight
Biased models are generally less complex (i.e., underfit) than the data-generating process for your outcome
Biased models lead to errors in prediction because the model will systematically over- or under-predict outcomes (scores or probabilities) for specific values of predictor(s) (bad for prediction goals!)
Parameter estimates from biased models may over- or under-estimate the true effect of a predictor (bad for explanatory goals!)
Question
Are GLMs biased models?
GLM parameter estimates are BLUE (best linear unbiased estimators) if the assumptions of the GLM are met. Parameter estimates from any sample are unbiased estimates of the linear model coefficients for the population model.
However, if the DGP is not linear (i.e., that assumption is violated), the linear model will produce biased parameter estimates.
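A small simulation of this point (a toy illustration, not from the paper): when the true DGP is quadratic, a purely linear model systematically over- and under-predicts at different values of x, no matter how large the sample.

```r
set.seed(789)

# nonlinear (quadratic) DGP
n <- 1000
x <- runif(n, -2, 2)
y <- x^2 + rnorm(n, sd = 0.5)

# an underfit (biased) linear model vs. a model that matches the DGP
fit_linear <- lm(y ~ x)
fit_quad   <- lm(y ~ x + I(x^2))

# systematic prediction error for the linear model across values of x
new_x <- data.frame(x = c(-2, 0, 2))
predict(fit_linear, new_x)  # misses the true values of ~4, 0, 4
predict(fit_quad, new_x)    # tracks the true DGP much more closely
```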
Bias seems like a bad thing (and it is, but it's not the only source of reducible error)
Both bias (due to underfitting) and variance (due to overfitting) are sources of (reducible) prediction errors (and imprecise/inaccurate parameter estimates). They are also often inversely related (i.e., the trade-off).
The world is complex. In many instances, the true DGP is not simple or linear, so a simple (less flexible) model will underfit it; making the model more flexible can reduce that bias, but at the cost of increased variance.
Question
Consider the example where p = n in the general linear model (counting the intercept among the p parameters). What happens in this situation? How is this related to overfitting and model flexibility?
The model will perfectly fit the sample data even when there is no relationship between the predictors and the outcome (e.g., any two points can be fit perfectly with one predictor, a line; any three points can be fit perfectly with two predictors, a plane). This model will NOT predict well in new data. It is overfit because n - 1 predictors is too flexible for the linear model; you will fit the noise in the training data.
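A quick demonstration (simulated noise only; no real relationship exists): with n = 4 observations and 3 predictors plus an intercept, the linear model reproduces the training sample perfectly, but the "fit" is entirely noise.

```r
set.seed(101)

# 4 observations, 3 predictors + intercept = 4 parameters; pure noise
d <- data.frame(y  = rnorm(4),
                x1 = rnorm(4),
                x2 = rnorm(4),
                x3 = rnorm(4))

fit <- lm(y ~ x1 + x2 + x3, data = d)

summary(fit)$r.squared  # 1: perfect fit to the training sample
resid(fit)              # essentially all zeros

# but predictions for new (noise) observations will be poor
predict(fit, newdata = data.frame(x1 = rnorm(4), x2 = rnorm(4), x3 = rnorm(4)))
```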
Factors that increase overfitting
You might have noticed that many of the above factors contribute to the standard error of a parameter estimate/model coefficient from the GLM
The standard error increases as model overfitting increases due to these factors
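For reference (the standard textbook result, stated here rather than in the paper), the standard error of a GLM coefficient displays these factors directly: it grows with residual noise, shrinks with sample size and predictor variance, and grows when predictor $j$ is correlated with the other predictors:

$$
SE(\hat{\beta}_j) = \sqrt{\frac{\hat{\sigma}^2}{(n - 1)\, s_{x_j}^2 \,(1 - R_j^2)}}
$$

where $\hat{\sigma}^2$ is the residual variance, $s_{x_j}^2$ is the sample variance of predictor $j$, and $R_j^2$ is the $R^2$ from regressing predictor $j$ on all of the other predictors.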
Question
Explain the link between model variance/overfitting, standard errors, and sampling distributions?
All parameter estimates have a sampling distribution. This is the distribution of estimates that you would get if you repeatedly fit the same model to new samples.
When a model is overfit, that means that aspects of the model (its parameter estimates, its predictions) will vary greatly from sample to sample. This is represented by a large standard error (the SD of the sampling distribution) for the model’s parameter estimates.
It also means that the predictions you will make in new data will be very different depending on the sample that was used to estimate the parameters.
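A brief simulation of a sampling distribution (hypothetical DGP; the specific numbers are illustrative): repeatedly draw new samples from the same DGP, refit the same model, and look at how much the parameter estimate bounces around. The SD of those estimates is the quantity the SE from a single sample is trying to estimate.

```r
set.seed(202)

# one draw from a simple DGP with a true slope of 0.3
draw_sample <- function(n = 50) {
  x <- rnorm(n)
  y <- 0.3 * x + rnorm(n)
  data.frame(x, y)
}

# fit the same model to 1000 fresh samples and save the slope each time
slopes <- replicate(1000, coef(lm(y ~ x, data = draw_sample()))["x"])

sd(slopes)  # empirical SD of the sampling distribution

# the SE from any single sample is an estimate of that SD
summary(lm(y ~ x, data = draw_sample()))$coefficients["x", "Std. Error"]
```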
Question
Describe the problem of p-hacking with respect to overfitting.
When you p-hack, you are overfitting the training set (your sample). You try out many, many different model configurations and choose the one you like best rather than what works well in new data. This model likely capitalizes on noise in your sample. It won’t fit well in another sample.
In other words, your conclusions are not linked to the true DGP and would be different if you used a different sample.
In a different vein, your significance test is wrong. The SE does not reflect the model variance that resulted from testing many different configurations because your final model didn't "know" about the other models. Statistically invalid conclusion!
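A simulation of why this matters (all noise; there is no true effect anywhere): if we try many candidate predictors and report only the best one, we "find" a significant effect far more often than the nominal 5%.

```r
set.seed(303)

# one p-hacked study: try 20 noise predictors, keep only the smallest p value
phack_once <- function(n = 50, n_predictors = 20) {
  y  <- rnorm(n)
  ps <- replicate(n_predictors, {
    x <- rnorm(n)
    summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
  })
  min(ps)
}

best_ps <- replicate(500, phack_once())
mean(best_ps < .05)   # far above .05, even though no effect exists
```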
Parameter estimates from an overfit model are specific to the sample within which they were trained and are not true for other samples or the population as a whole
Parameter estimates from overfit models have large (true) SEs, so they may be VERY different in other samples
With traditional (one-sample) statistics, this can lead us to incorrect conclusions about the effect of predictors associated with these parameter estimates (bad for explanatory goals!).
If the parameter estimates are very different sample to sample (and different from the true population parameters), this means the model will predict poorly in new samples (bad for prediction goals!). We fix this by using resampling to evaluate model performance.
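A minimal sketch of the fix (a simple training/held-out split in base R; in this course we will use tidymodels resampling tools for the same purpose): performance estimated in the training sample is optimistic, while performance in held-out data is an honest estimate of how the model will do in new samples.

```r
set.seed(404)

# 20 noise predictors and a modest sample: a recipe for overfitting
n <- 100
X <- as.data.frame(matrix(rnorm(n * 20), nrow = n))
d <- cbind(y = rnorm(n), X)

train_rows <- sample(n, 70)
fit <- lm(y ~ ., data = d[train_rows, ])

# in-sample (training) R^2 is nonzero despite no true signal (overfitting)
summary(fit)$r.squared

# held-out performance is the honest estimate (the true R^2 is 0 here)
held_out <- d[-train_rows, ]
mean((held_out$y - predict(fit, newdata = held_out))^2)  # held-out prediction error
var(held_out$y)  # compare: error from simply predicting the mean
```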
