The foundation of predictive analytics. Learn to build, interpret, and test regression models.
Before we dive into regression, let's understand what we're building.
"All models are wrong, some are useful" (George Box)
Greenland looks huge on most maps, but it's smaller than Africa. The map is "wrong" but still useful.
We can't capture every variable. We focus on the ones that matter most for prediction.
Where the name "regression" comes from: a 140-year-old discovery.
Francis Galton studied the heights of parents and their children.
Galton found that unusually tall parents tend to have children shorter than themselves, and unusually short parents tend to have taller children. He called this pull toward the average "regression," and the name stuck.
The mathematical foundation of linear regression.
What we're trying to predict. The outcome variable.
The baseline value when X = 0.
How much Y changes when X increases by 1.
Random variation we can't explain.
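The four pieces above combine into the standard simple regression equation:

Y = β₀ + β₁X + ε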
Let's predict post-graduate earnings based on college characteristics.
# Build the regression model
model <- lm(Earnings ~ Cost + Grad + Debt + City,
            data = college)

# View the results
summary(model)
How well does our model explain the data?
Testing hypotheses about our coefficients.
Testing whether our estimated coefficients are statistically different from zero.
If βⱼ = 0, then Xⱼ has no relationship with Y. Does the data support this?
t = (bⱼ − βⱼ₀) / se(bⱼ)

bⱼ = estimated coefficient
βⱼ₀ = hypothesized value (usually 0)
se(bⱼ) = standard error
df = n − k − 1
The p-value is the largest α at which we would fail to reject H₀.
If α = 0.05 → reject H₀ if p < 0.05
If α = 0.01 → reject H₀ if p < 0.01
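The decision rule above can be sketched with made-up numbers (all values below are hypothetical, chosen only to show the arithmetic):

```r
# Sketch: the t-test for one coefficient by hand
b  <- 2.5                  # hypothetical estimated coefficient
se <- 1.1                  # hypothetical standard error
df <- 46                   # n - k - 1 for a hypothetical n = 50, k = 3
t_stat <- (b - 0) / se     # testing H0: beta = 0
p_val  <- 2 * pt(-abs(t_stat), df)   # two-sided p-value
p_val < 0.05               # TRUE means reject H0 at alpha = 0.05
```

In practice R computes all of this for you in the summary output; the sketch just shows where the numbers come from.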
Should we keep or drop a variable?
Testing if Cost affects Earnings.
Null Hypothesis: H₀: β_Cost = 0
Alternative Hypothesis: H₁: β_Cost ≠ 0
H₀ says: Cost does NOT affect earnings (after controlling for other variables).
If we reject H₀ (p < 0.05), Cost does have a significant effect.
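A minimal sketch of reading this test off the fitted model (assumes the `model` object fit from the `lm()` call earlier):

```r
# Pull the t-test for Cost out of the coefficient table
coefs <- summary(model)$coefficients
coefs["Cost", ]                     # Estimate, Std. Error, t value, Pr(>|t|)
coefs["Cost", "Pr(>|t|)"] < 0.05    # TRUE means reject H0 at the 5% level
```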
Testing multiple coefficients at once.
The t-test only tests one coefficient. What if we want to test if all predictors together matter?
H₀: β₁ = β₂ = ⋯ = βₖ = 0 ("None of the predictors matter")
Tests whether predictors have a joint statistical influence. R shows the F-statistic and p-value at the bottom of summary output.
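If you want the F-test numbers programmatically rather than reading them off the printed summary, a sketch (again assuming the `model` object from earlier):

```r
# The overall F-test that summary(model) prints at the bottom
fstat <- summary(model)$fstatistic            # named vector: value, numdf, dendf
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)  # its p-value
```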
Getting output from R doesn't mean the output is reliable. We need to check our work.
Just because you feel fine doesn't mean everything is fine. You get bloodwork done to check for hidden problems. Residual plots are the bloodwork for your regression model.
ACTUAL VALUE
y
PREDICTED VALUE
ŷ
RESIDUAL
e
Residuals are the mistakes your model makes. Studying the pattern of mistakes tells us if the model is working properly.
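In R, both pieces are one call away (a sketch, assuming the `model` fit from earlier):

```r
# A residual is just actual minus predicted
head(fitted(model))      # yhat: what the model predicts
head(residuals(model))   # e = y - yhat: the model's mistakes
```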
Unequal Spread
Errors are bigger for some predictions than others
Heteroscedasticity
Curved Pattern
Model is systematically off in some ranges
Nonlinearity
Outliers
A few extreme points distorting everything
Influential Points
This is the single most important diagnostic plot in regression. Here's how to read it.
X-AXIS
Fitted Values (ŷ)
What your model predicted
Y-AXIS
Residuals (e = y − ŷ)
How far off each prediction was
Residuals
| * *
+ | * * * * *
| * * * *
0 |--*----*--*-----*--*-----
| * * * * *
- | * * * *
| * *
|_________________________ Fitted
"Looks like static on a TV" = good
Residuals
| *
+ | * * * *
| * * *
0 |--*--*--*-----------*--------
| * * *
- | * * * *
| *
|_________________________ Fitted
This is heteroscedasticity!
Residuals
| * *
+ |* *
| * *
0 |------------*--*---------
| * *
- | *
|_________________________ Fitted
This means the relationship isn't linear β you may need a polynomial term or log transform.
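One illustrative way to respond to a curved pattern (a sketch, not a prescription; assumes the `college` data from earlier):

```r
# If the residuals curve, let the model bend --
# e.g., a quadratic term for Cost
m2 <- lm(Earnings ~ poly(Cost, 2) + Grad + Debt + City, data = college)
plot(m2, which = 1)   # re-check: the curve should flatten out
```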
plot(model, which = 1)
This produces the residual vs. fitted plot. R also draws a red smoothed line; it should be flat and close to the dashed zero line.
When your model's errors aren't evenly spread, your conclusions may be wrong.
PRONUNCIATION
hetero (different) + scedasticity (spread)
HET-er-oh-skeh-das-TIS-ih-tee
Literally "different spread": the errors have different amounts of scatter.
Imagine predicting home prices based on square footage:
| Home Size | Predicted Price | Actual Prices You See | Error Range |
|---|---|---|---|
| 500 sq ft (studio) | 100K | 90K – 110K | ± 10K |
| 2,000 sq ft (house) | 400K | 320K – 480K | ± 80K |
| 5,000 sq ft (mansion) | 1M | 600K – 1.5M | ± 400K |
The errors get bigger as the home gets bigger. Small homes are easy to price; mansions are wildly unpredictable.
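You can simulate exactly this fan shape with made-up home-price data (every number below is invented for illustration):

```r
# Sketch: error that grows with home size produces the fan shape
set.seed(1)
sqft  <- runif(200, 500, 5000)
price <- 100 + 0.2 * sqft + rnorm(200, sd = 0.05 * sqft)  # noise scales with size
fan   <- lm(price ~ sqft)
plot(fan, which = 1)   # residuals spread out as fitted values grow
```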
The formulas assume constant error spread. If that's violated, your SEs are too big or too small.
Wrong SEs → wrong t-statistics → wrong p-values. You might think a variable is significant when it's not!
Your 95% interval might actually be 80% or 99%; you can't tell without fixing the problem.
Good news: the β estimates themselves are still unbiased. It's the confidence in those estimates that's messed up.
(What we want)
| * *
+ | * * * *
| * * * *
0 |--*----*--*-----*--*--
| * * * * *
- | * * * *
| * *
|______________________
Same spread everywhere → reliable results
(The problem)
| *
+ | * * * *
| * * *
0 |--*--*-----------*-----
| * * *
- | * * * *
| *
|______________________
Fan shape → SEs and p-values can't be trusted
Transform the Y variable
Taking log(Y) often stabilizes the variance. Very common with dollar amounts.
Use robust standard errors
Keep the same model but calculate SEs that don't assume constant variance. Available in R with packages like sandwich.
Weighted Least Squares (WLS)
Give less weight to observations with high variance. Advanced but effective.
plot(model, which = 1)  # after applying a fix, re-check the residual plot
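A minimal sketch of the robust-SE remedy (assumes the `model` fit from earlier and that the `sandwich` and `lmtest` packages are installed):

```r
library(sandwich)   # heteroscedasticity-consistent covariance estimators
library(lmtest)     # coeftest() for re-testing coefficients

# Same coefficient estimates, but SEs that don't assume constant variance
coeftest(model, vcov = vcovHC(model, type = "HC1"))
```

"HC1" is one common choice of correction; `vcovHC` offers several variants, and which is best depends on sample size.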