Introductions (preferred name, pronouns, program/department and year)
Structure of course
Why these tools?
Quarto?
tidyverse?
tidymodels?
Why me?
Environment
Exams, Application Assignments and Quizzes
Approximately weekly quizzes will be administered through Canvas and due each Wednesday at 8 pm (15% of grade)
Approximately weekly application assignments will be submitted via Canvas and due each Wednesday at 8 pm (25% of grade)
Midterm application/project exam due March 5th at 8 pm (15% of grade)
Midterm concepts exam in class on March 6th (15% of grade)
Final concepts exam during finals week on May 6th from 11 - 12:15 (15% of grade)
Final application exam due May 9th at 8 pm (15% of grade)
ChatGPT and other LLMs
Question
What is the difference between association vs. prediction?
Association quantifies the relationship between variables within a sample (predictors and outcome). Prediction requires using an established model to predict (future?) outcomes for new (“out-of-sample,” “held-out”) participants.
Much research in psychology demonstrates association but calls it prediction!
Association (sometimes substantially) overestimates the predictive strength of our models
Prediction requires:
Model trained/fit on data that does NOT include the outcome to be predicted (model already exists)
May also require predictors to be measured before the outcome
Consider how to make a lapse prediction for a new person when implementing a prediction model (see the sketch after this list)
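A minimal sketch of this distinction in R, using simulated data (all object names here are hypothetical): association is quantified within the training sample, while prediction applies the already-fit model to new, held-out participants.

```r
# Simulate a training sample and a separate held-out sample
set.seed(20010)
n <- 200
train <- data.frame(x = rnorm(n))
train$y <- 0.5 * train$x + rnorm(n)
new_data <- data.frame(x = rnorm(n))
new_data$y <- 0.5 * new_data$x + rnorm(n)

# Association: quantify the predictor-outcome relationship in-sample
fit <- lm(y ~ x, data = train)
summary(fit)$r.squared        # in-sample R^2 (association)

# Prediction: use the already-fit model on out-of-sample participants
pred <- predict(fit, newdata = new_data)
cor(pred, new_data$y)^2       # out-of-sample R^2 (prediction)
```

The out-of-sample R^2 will typically be lower than the in-sample value, which is one way association overestimates predictive strength.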
Goal of scientific psychology is to understand human behavior. It involves both explaining behavior (i.e., identifying causes) and predicting (yet to be observed) behaviors.
Question
Examples of valuable prediction without explanation?
Can you have explanation without prediction?
Question
Supervised vs. unsupervised machine learning?
Question
Supervised regression vs classification?
Question
What is a data generating process?
Question
Why do we estimate the data generating process?
Question
What is cross-validation and why do we do it?
Question
How are replication and cross-validation different?
Reducible vs. Irreducible error?
Question
What are the three general steps by which we estimate and evaluate the data generating process with a sample of data? Let's use all this vocabulary!
Question
What is underfitting, overfitting, bias, and variance?
Bias and variance are general concepts to understand during any estimation process
Conceptual example of bias-variance: Darts from Yarkoni and Westfall (2017)
Second Conceptual Example: Models to measure/estimate weight
Biased models are generally less complex (i.e., underfit) than the data generating process for your outcome
Biased models lead to errors in prediction because the model will systematically over- or under-predict outcomes (scores or probabilities) for specific values of predictor(s) (bad for prediction goals!)
Parameter estimates from biased models may over or under-estimate the true effect of a predictor (bad for explanatory goals!)
Question
Are GLMs biased models?
GLM parameter estimates are BLUE (best linear unbiased estimators) if the assumptions of the GLM are met. Parameter estimates from any sample are unbiased estimates of the linear model coefficients for the population model.
However, if the DGP is not linear (i.e., that assumption is violated), the linear model will produce biased parameter estimates.
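A minimal sketch of this point, fitting a linear model to a simulated quadratic DGP (simulated data; all names are hypothetical):

```r
# True DGP is quadratic; the linear model underfits it
set.seed(23)
x <- runif(1000, -2, 2)
y <- x^2 + rnorm(1000, sd = 0.5)

fit_linear    <- lm(y ~ x)           # misses the curvature (biased)
fit_quadratic <- lm(y ~ poly(x, 2))  # matches the true functional form

# The linear model systematically under-predicts at extreme x and
# over-predicts near x = 0; the quadratic model does not
mean(residuals(fit_linear)[abs(x) > 1.5])     # positive: under-prediction
mean(residuals(fit_linear)[abs(x) < 0.5])     # negative: over-prediction
mean(residuals(fit_quadratic)[abs(x) > 1.5])  # near zero
```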
Bias seems like a bad thing (and it is, but it's not the only source of reducible error)
Both bias (due to underfitting) and variance (due to overfitting) are sources of (reducible) prediction errors (and imprecise/inaccurate parameter estimates). They are also often inversely related (i.e., the trade-off).
The world is complex. In many instances, the true DGP is complex too, so we need flexible models to capture it; but more flexible models carry higher variance (i.e., greater risk of overfitting).
Question
Consider the example of p = n in the general linear model. What happens in this situation? How is it related to overfitting and model flexibility?
The model will perfectly fit the sample data even when there is no relationship between the predictors and the outcome. E.g., any two points can be fit perfectly with one predictor (a line), and any three points can be fit perfectly with two predictors (a plane). This model will NOT predict well in new data. It is overfit because n - 1 predictors is too flexible for the linear model; you will fit the noise in the training data.
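A minimal sketch of this in R (simulated noise; all names are hypothetical): with two predictors plus an intercept fit to three observations, the model fits the training sample perfectly but fails in new data.

```r
set.seed(1)
n <- 3
d <- data.frame(
  y  = rnorm(n),  # outcome is pure noise
  x1 = rnorm(n),  # predictors are unrelated to y
  x2 = rnorm(n)
)

# Two predictors + intercept = 3 parameters for 3 observations
fit <- lm(y ~ x1 + x2, data = d)
summary(fit)$r.squared  # 1: a perfect (over)fit to noise
# (R may even warn about an "essentially perfect fit" here)

# The same model predicts at chance in new data from the same DGP
new_d <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
cor(predict(fit, newdata = new_d), new_d$y)^2  # near 0
```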
Factors that increase overfitting
You might have noticed that many of the above factors contribute to the standard error of a parameter estimate/model coefficient from the GLM
The standard error increases as model overfitting increases due to these factors
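As a refresher (a standard GLM result, added here for reference), the standard error of coefficient $\hat{\beta}_j$ in a linear model can be written as:

$$SE(\hat{\beta}_j) = \sqrt{\frac{\hat{\sigma}^2}{(n - 1)\,\widehat{\mathrm{Var}}(x_j)\,\left(1 - R_j^2\right)}}$$

where $\hat{\sigma}^2$ is the residual (error) variance, $n$ is the sample size, and $R_j^2$ is the $R^2$ from regressing $x_j$ on the other predictors. More noise, smaller samples, and greater overlap among predictors all inflate the SE, paralleling the factors that increase overfitting.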
Question
Explain the link between model variance/overfitting, standard errors, and sampling distributions?
All parameter estimates have a sampling distribution. This is the distribution of estimates that you would get if you repeatedly fit the same model to new samples.
When a model is overfit, that means that aspects of the model (its parameter estimates, its predictions) will vary greatly from sample to sample. This is represented by a large standard error (the SD of the sampling distribution) for the model’s parameter estimates.
It also means that the predictions you will make in new data will be very different depending on the sample that was used to estimate the parameters.
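A minimal sketch of this idea in R: simulate the sampling distribution of a coefficient by refitting the same model in many new samples, and compare its SD to the analytic SE from a single sample (simulated data; names are hypothetical).

```r
set.seed(102)

# Fit the same model to a fresh sample and return the slope estimate
one_fit <- function(n = 50) {
  x <- rnorm(n)
  y <- 0.3 * x + rnorm(n)
  coef(lm(y ~ x))[["x"]]
}

# The sampling distribution: estimates across 2000 new samples
estimates <- replicate(2000, one_fit())
sd(estimates)  # empirical SD of the sampling distribution (the true SE)

# Compare to the model-based SE reported for one sample
x <- rnorm(50)
y <- 0.3 * x + rnorm(50)
summary(lm(y ~ x))$coefficients["x", "Std. Error"]
```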
Question
Describe problem of p-hacking with respect to overfitting?
When you p-hack, you are overfitting the training set (your sample). You try out many, many different model configurations and choose the one you like best rather than what works well in new data. This model likely capitalizes on noise in your sample. It won’t fit well in another sample.
In other words, your conclusions are not linked to the true DGP and would be different if you used a different sample.
In a different vein, your significance test is wrong. The SE does not reflect the model variance that resulted from testing many different configurations because your final model didn't "know" about the other models. Statistically invalid conclusion!
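A minimal sketch of p-hacking as overfitting (simulated data; all names are hypothetical): try many noise predictors and keep the smallest p-value.

```r
set.seed(555)
n <- 50
y <- rnorm(n)                          # outcome unrelated to anything
X <- matrix(rnorm(n * 20), ncol = 20)  # 20 candidate noise predictors

# "Try many configurations and choose the one you like best"
p_values <- apply(X, 2, function(x) {
  summary(lm(y ~ x))$coefficients[2, 4]  # p-value for the slope
})
min(p_values)  # often < .05 even though every null is true

# Across 20 independent tests, P(at least one p < .05) is about
# 1 - .95^20, i.e., roughly 64%; the reported SE and p ignore this
```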
Parameter estimates from an overfit model are specific to the sample within which they were trained and are not true for other samples or the population as a whole
Parameter estimates from overfit models have big (TRUE) SE and so they may be VERY different in other samples
With traditional (one-sample) statistics, this can lead us to incorrect conclusions about the effect of predictors associated with these parameter estimates (bad for explanatory goals!).
If the parameter estimates are very different sample to sample (and different from the true population parameters), this means the model will predict poorly in new samples (bad for prediction goals!). We fix this by using resampling to evaluate model performance.
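A minimal sketch of using resampling to evaluate model performance, with the tidymodels packages this course uses (10-fold cross-validation on simulated data; all object names are hypothetical):

```r
library(tidymodels)

set.seed(1407)
d <- tibble(x1 = rnorm(200), x2 = rnorm(200)) |>
  mutate(y = 0.4 * x1 + rnorm(200))

# Split the sample into 10 folds; each fold serves once as held-out data
folds <- vfold_cv(d, v = 10)

# Fit a linear model in each training fold, evaluate in the held-out fold
fits <- linear_reg() |>
  fit_resamples(y ~ x1 + x2, resamples = folds)

collect_metrics(fits)  # held-out RMSE and R^2, not in-sample fit
```

Because performance is always computed in data the model did not see, this guards against the overfitting problems described above.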