Topic overview
Advanced Topics in Regression Analysis
Interactions, nonlinear models, log transformations, and cross-validation.
Learning objectives
- What is an interaction effect in a regression model? When do we need an interaction term?
- How do we capture a U-shaped or inverted U-shaped relationship in a regression model?
- How do we interpret the coefficients of a log-log regression model?
- What is an exponential regression model? How do we interpret its coefficients?
- What is cross-validation, and how do we use it?
Key Learning Summary
The core ideas to take away from this week.
Interactions Reveal Hidden Dynamics
An interaction term captures when the effect of one variable depends on the level of another. Without it, you assume every group has the same slope — which can hide critical insights like a widening gender pay gap.
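As a quick sketch of what this looks like in R (simulated data; wage, exper, and female are hypothetical names, not course data):

```r
# Simulated wage data where the experience slope differs by gender
set.seed(1)
n <- 200
female <- rbinom(n, 1, 0.5)
exper  <- runif(n, 0, 30)
wage   <- 15 + 0.8 * exper - 2 * female - 0.3 * exper * female + rnorm(n, sd = 3)
df <- data.frame(wage, exper, female)

# exper * female expands to exper + female + exper:female
fit <- lm(wage ~ exper * female, data = df)

coef(fit)["exper"]                               # slope of experience when female = 0
coef(fit)["exper"] + coef(fit)["exper:female"]   # slope of experience when female = 1
```

Without the interaction term, both groups would be forced to share one slope, hiding the gap that widens with experience.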
Not All Relationships Are Linear
Quadratic models (with an I(X^2) term) capture U-shaped or inverted U-shaped relationships. The turning point is at -β₁/(2β₂). Wages, for example, peak around age 50 and then decline; a linear model would miss this entirely.
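A minimal sketch of fitting the quadratic and recovering the turning point, using simulated data built so the true peak is at age 50:

```r
# Simulated inverted-U wage-age profile: wage = -40 + 4*age - 0.04*age^2 + noise
set.seed(2)
age  <- runif(300, 18, 70)
wage <- -40 + 4 * age - 0.04 * age^2 + rnorm(300, sd = 5)
df <- data.frame(wage, age)

fit <- lm(wage ~ age + I(age^2), data = df)

# Turning point: -beta1 / (2 * beta2); should come out near 50
-coef(fit)[2] / (2 * coef(fit)[3])
```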
Log Models Give You Percentages and Elasticities
Logging the dependent variable turns coefficients into percentage changes. Logging both sides gives you elasticities — the most common way businesses measure responsiveness (e.g., price elasticity of demand).
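A short sketch of a log-log demand model on simulated data, where the true price elasticity is set to -1.5 (price and quantity are made-up names for illustration):

```r
# Constant-elasticity demand: log(quantity) = 5 - 1.5*log(price) + noise
set.seed(3)
price    <- runif(200, 1, 20)
quantity <- exp(5 - 1.5 * log(price) + rnorm(200, sd = 0.2))
df <- data.frame(price, quantity)

fit <- lm(log(quantity) ~ log(price), data = df)

# Estimated elasticity: a 1% price increase -> about a -1.5% change in quantity
coef(fit)[2]
```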
Never Compare R² Across Different Dependent Variables
A model predicting Y and a model predicting ln(Y) have R² values on different scales. To compare fairly, back-transform predictions to the original scale and compute a comparable R².
R² Can Lie — Cross-Validation Tells the Truth
R² always increases with more variables, even junk ones. Cross-validation tests on unseen data, revealing which model actually predicts best. K-fold CV is more robust than a single holdout split.
RMSE Is Your Prediction Scorecard
Root Mean Squared Error measures average prediction error in the original units of Y. Lower is better. Compare RMSE across models on validation data — the model with the lowest average RMSE wins.
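A minimal holdout illustration, assuming simulated data and hypothetical column names Y and X:

```r
# Simulated curved relationship, so the quadratic model should win
set.seed(4)
df <- data.frame(X = runif(300, 0, 10))
df$Y <- 2 + 3 * df$X - 0.2 * df$X^2 + rnorm(300)

# 70/30 train-validation split
train_idx <- sample(nrow(df), size = round(0.7 * nrow(df)))
train <- df[train_idx, ]
valid <- df[-train_idx, ]

m1 <- lm(Y ~ X, data = train)
m2 <- lm(Y ~ X + I(X^2), data = train)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(valid$Y, predict(m1, newdata = valid))   # linear model
rmse(valid$Y, predict(m2, newdata = valid))   # quadratic model: lower RMSE wins
```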
The one sentence to remember: This week's tools — interactions, polynomials, logs, and cross-validation — let you model real-world complexity and honestly measure which model predicts best on new data.
Key Vocabulary
Terms you should be able to define and use confidently.
Interaction Effects
Interaction Effect
When the effect of one independent variable on Y depends on the level of another independent variable. Modeled by multiplying the two variables together (X₁ × X₂).
Interaction Term
The product of two variables (X₁ × X₂) added to a regression. Its coefficient (β₃) measures how much the slope of X₁ changes per one-unit increase in X₂, and vice versa.
Synergy Effect
When combining two factors produces value beyond the sum of their individual effects. A positive interaction coefficient indicates synergy.
Nonlinear Models
Quadratic Model
A regression with a squared term (X²) to capture U-shaped or inverted U-shaped relationships. Y = β₀ + β₁X + β₂X².
Turning Point
The X value where a quadratic relationship switches from increasing to decreasing (or vice versa). Calculated as -β₁/(2β₂).
Log-Linear Model
ln(Y) = β₀ + β₁X. A one-unit increase in X leads to approximately (β₁ × 100)% change in Y. Used when Y grows exponentially.
Linear-Log Model
Y = β₀ + β₁ ln(X). A 1% increase in X leads to β₁/100 unit change in Y. Captures diminishing returns.
Log-Log Model
ln(Y) = β₀ + β₁ ln(X). The coefficient β₁ is an elasticity: a 1% increase in X leads to a β₁% change in Y.
Elasticity
A unit-free measure of responsiveness: the percentage change in Y for a 1% change in X. Central to pricing, demand analysis, and marketing.
Back-Transformation Correction
When predicting Y from a model of ln(Y), multiply the naive prediction e^(predicted ln(Y)) by e^(sₑ²/2), where sₑ is the residual standard error, to correct for retransformation bias. Without the correction, predictions are systematically too low.
Cross-Validation
Overfitting
When a model fits the training data too well — capturing noise rather than signal. Shows as high R² but poor performance on new data.
Holdout Method
Splitting data into training (build the model) and validation (test it). Simple but results depend on which observations land in each set.
K-Fold Cross-Validation
Divide data into K equal parts. Each part takes a turn as the validation set while the rest train the model. Average the K RMSE values. More robust than a single holdout (see the sketch below).
RMSE (Root Mean Squared Error)
Average prediction error in the original units of Y. Calculated as √(mean((actual - predicted)²)). Lower is better. The standard metric for comparing predictive models.
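The sketch referenced above: a base-R K-fold loop on simulated data (no packages assumed; Y and X are hypothetical column names). It fits the model K times and averages the per-fold RMSE:

```r
set.seed(5)
df <- data.frame(X = runif(200, 0, 10))
df$Y <- 1 + 2 * df$X + rnorm(200)

K <- 5
folds <- sample(rep(1:K, length.out = nrow(df)))  # random fold label for each row

rmse_k <- numeric(K)
for (k in 1:K) {
  train <- df[folds != k, ]   # K-1 folds train the model
  valid <- df[folds == k, ]   # fold k is held out for validation
  fit <- lm(Y ~ X, data = train)
  rmse_k[k] <- sqrt(mean((valid$Y - predict(fit, newdata = valid))^2))
}

mean(rmse_k)  # average RMSE across the K folds
```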
R Functions & Syntax
X1 * X2
In an R formula, this automatically includes X1, X2, AND the interaction X1:X2. Shorthand for X1 + X2 + X1:X2.
I(X^2)
The I() wrapper tells R to treat ^2 as "square this variable" rather than a formula operator. Required for polynomial terms.
log()
Natural logarithm in R. Use inside lm() formulas: lm(log(Y) ~ log(X)) for a log-log model.
sigma()
Returns the residual standard error (sₑ) from a model. Needed for the back-transformation correction: exp(pred + sigma(model)^2/2).
sqrt(mean((actual - predicted)^2))
Manual RMSE calculation in R. Compare this value across models on the same validation set to find the best predictor.
Key Formulas
The essential formulas for this week. Focus on what each one does and when to use it.
Interaction Models
With Interaction Term
Y = β₀ + β₁X₁ + β₂X₂ + β₃(X₁ × X₂). The effect of X₁ depends on X₂: for a dummy X₂, the slope of X₁ is β₁ when X₂ = 0 and β₁ + β₃ when X₂ = 1.
In R: lm(Y ~ X1 * X2, data = df) (the * expands to X1 + X2 + X1:X2)
Nonlinear Models
Quadratic Model
Y = β₀ + β₁X + β₂X². β₂ < 0 → inverted U (rises then falls); β₂ > 0 → U-shape (falls then rises).
In R: lm(Y ~ X + I(X^2), data = df)
Turning Point
X* = -β₁/(2β₂), the X value where the curve peaks (inverted U) or bottoms out (U-shape). Example: if wages follow an inverted U in age, this gives the age at which earnings are highest.
In R: -coef(model)[2] / (2 * coef(model)[3])
Log Specifications
| Model | Equation | Interpretation of β₁ |
|---|---|---|
| Linear | Y = β₀ + β₁X | 1-unit ↑ in X → β₁ unit ↑ in Y |
| Log-Linear | ln(Y) = β₀ + β₁X | 1-unit ↑ in X → (β₁ × 100)% ↑ in Y |
| Linear-Log | Y = β₀ + β₁ ln(X) | 1% ↑ in X → β₁/100 unit ↑ in Y |
| Log-Log | ln(Y) = β₀ + β₁ ln(X) | 1% ↑ in X → β₁% ↑ in Y (elasticity) |
Back-Transformation Correction
When your model predicts ln(Y), apply Ŷ = e^(predicted ln(Y) + sₑ²/2) to get predictions in original units. The sₑ²/2 term corrects for retransformation bias.
In R: exp(predict(model, newdata) + sigma(model)^2/2)
Cross-Validation
RMSE (Root Mean Squared Error)
Average prediction error in the original units of Y. Compute on the validation set. Lower RMSE = better predictions. The model with the lowest average RMSE (across K folds) wins.
In R: sqrt(mean((actual - predicted)^2))
Comparable R² (for log models)
When comparing models with different dependent variables (Y vs ln(Y)), back-transform predictions to the original scale and compute the squared correlation. This makes R² values comparable.
In R: cor(df$Y, exp(predict(model) + sigma(model)^2/2))^2
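A worked sketch on simulated data, comparing a linear and a log-linear model on the same scale (column names are hypothetical):

```r
# Y grows exponentially in X, so the log-linear model should fit better
set.seed(6)
df <- data.frame(X = runif(200, 1, 10))
df$Y <- exp(0.5 + 0.3 * df$X + rnorm(200, sd = 0.3))

m_lin <- lm(Y ~ X, data = df)
m_log <- lm(log(Y) ~ X, data = df)

# Back-transform the log model's predictions with the retransformation correction
yhat_log <- exp(predict(m_log) + sigma(m_log)^2 / 2)

cor(df$Y, predict(m_lin))^2  # R-squared of the linear model
cor(df$Y, yhat_log)^2        # comparable R-squared for the log-linear model
```

Both numbers are now squared correlations on the original Y scale, so they can be compared directly.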