Topic overview

Advanced Topics in Regression Analysis

Interactions, nonlinear models, log transformations, and cross-validation.

Learning objectives

  • What is an interaction effect in a regression model? When does a model need an interaction term?
  • How can we capture a U-shaped or inverted U-shaped relationship in a regression model?
  • How do we interpret the coefficients of the log-log regression model?
  • What is the exponential regression model, and how do we interpret its coefficients?
  • What is cross-validation, and how do we use it?

Start here

Open the lecture experience and follow the guided flow.

🎯

Key Learning Summary

The core ideas to take away from this week.

1. Interactions Reveal Hidden Dynamics

An interaction term captures when the effect of one variable depends on the level of another. Without it, you assume every group has the same slope — which can hide critical insights like a widening gender pay gap.

2. Not All Relationships Are Linear

Quadratic models (I(X^2) in R) capture U-shaped and inverted U-shaped relationships. The turning point is at -β₁/(2β₂). Wages peak around age 50, then decline — a linear model would miss this entirely.

3. Log Models Give You Percentages and Elasticities

Logging the dependent variable turns coefficients into percentage changes. Logging both sides gives you elasticities — the most common way businesses measure responsiveness (e.g., price elasticity of demand).

4. Never Compare R² Across Different Dependent Variables

A model predicting Y and a model predicting ln(Y) have R² values on different scales. To compare fairly, back-transform predictions to the original scale and compute a comparable R².

5. R² Can Lie — Cross-Validation Tells the Truth

R² never decreases when you add more variables — even junk ones. Cross-validation tests on unseen data, revealing which model actually predicts best. K-fold CV is more robust than a single holdout split.

6. RMSE Is Your Prediction Scorecard

Root Mean Squared Error measures average prediction error in the original units of Y. Lower is better. Compare RMSE across models on validation data — the model with the lowest average RMSE wins.

The one sentence to remember: This week's tools — interactions, polynomials, logs, and cross-validation — let you model real-world complexity and honestly measure which model predicts best on new data.

📖

Key Vocabulary

Terms you should be able to define and use confidently.

Interaction Effects

Interaction Effect

When the effect of one independent variable on Y depends on the level of another independent variable. Modeled by multiplying the two variables together (X₁ × X₂).

concept

Interaction Term

The product of two variables (X₁ × X₂) added to a regression. Its coefficient (β₃) measures how much the slope of X₁ changes per one-unit increase in X₂, and vice versa.

concept

Synergy Effect

When combining two factors produces value beyond the sum of their individual effects. A positive interaction coefficient indicates synergy.

concept
Nonlinear Models

Quadratic Model

A regression with a squared term (X²) to capture U-shaped or inverted U-shaped relationships. Y = β₀ + β₁X + β₂X².

model

Turning Point

The X value where a quadratic relationship switches from increasing to decreasing (or vice versa). Calculated as -β₁/(2β₂).

model

Log-Linear Model

ln(Y) = β₀ + β₁X. A one-unit increase in X leads to approximately (β₁ × 100)% change in Y. Used when Y grows exponentially.

model

Linear-Log Model

Y = β₀ + β₁ ln(X). A 1% increase in X leads to β₁/100 unit change in Y. Captures diminishing returns.

model

Log-Log Model

ln(Y) = β₀ + β₁ ln(X). The coefficient β₁ is an elasticity: a 1% increase in X leads to a β₁% change in Y.

model

Elasticity

A unit-free measure of responsiveness: the percentage change in Y for a 1% change in X. Central to pricing, demand analysis, and marketing.

model

Back-Transformation Correction

When predicting Y from a model fit to ln(Y), you must multiply the naive prediction by e^(sₑ²/2), where sₑ is the residual standard error, to correct for retransformation bias. Without this correction, predictions are systematically too low.

model
Cross-Validation

Overfitting

When a model fits the training data too well — capturing noise rather than signal. Shows as high R² but poor performance on new data.

evaluation

Holdout Method

Splitting data into training (build the model) and validation (test it). Simple but results depend on which observations land in each set.

evaluation
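The holdout method can be sketched in a few lines of R. This is a minimal example, assuming a 70/30 split (an illustrative choice) and R's built-in mtcars data; the predictors hp and wt stand in for whatever variables your model uses.

```r
# Single holdout split: fit on training rows, score on validation rows.
set.seed(4)                      # make the random split reproducible
data(mtcars)
idx   <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]           # 70% of rows build the model
valid <- mtcars[-idx, ]          # 30% of rows test it

fit  <- lm(mpg ~ hp + wt, data = train)
pred <- predict(fit, newdata = valid)
rmse <- sqrt(mean((valid$mpg - pred)^2))
rmse                             # prediction error in mpg units on unseen rows
```

Rerunning with a different seed gives a different RMSE — exactly the sensitivity that motivates K-fold CV.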

K-Fold Cross-Validation

Divide data into K equal parts. Each part takes a turn as the validation set while the rest train the model. Average the K RMSE values. More robust than a single holdout.

evaluation

RMSE (Root Mean Squared Error)

Average prediction error in the original units of Y. Calculated as √(mean((actual - predicted)²)). Lower is better. The standard metric for comparing predictive models.

evaluation
R Functions & Syntax

X1 * X2

In an R formula, this automatically includes X1, X2, AND the interaction X1:X2. Shorthand for X1 + X2 + X1:X2.

R

I(X^2)

The I() wrapper tells R to treat ^2 as "square this variable" rather than a formula operator. Required for polynomial terms.

R

log()

Natural logarithm in R. Use inside lm() formulas: lm(log(Y) ~ log(X)) for a log-log model.

R

sigma()

Returns the residual standard error (sₑ) from a model. Needed for the back-transformation correction: exp(pred + sigma(model)^2/2).

R

sqrt(mean((actual - predicted)^2))

Manual RMSE calculation in R. Compare this value across models on the same validation set to find the best predictor.

R
📐

Key Formulas

The essential formulas for this week. Focus on what each one does and when to use it.

Interaction Models

With Interaction Term

Y = β₀ + β₁X₁ + β₂X₂ + β₃(X₁ × X₂) + ε

The effect of X₁ on Y is β₁ + β₃X₂, so it depends on the level of X₂. For a dummy X₂: when X₂ = 0, the effect of X₁ is β₁; when X₂ = 1, it is β₁ + β₃.

In R: lm(Y ~ X1 * X2, data = df) — the * automatically expands to X1 + X2 + X1:X2.
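A minimal sketch on simulated data (the variable names and true coefficients are illustrative, not from the lecture): X2 is a 0/1 dummy, so the fitted slope of X1 should differ across the two groups by the interaction coefficient.

```r
# Simulate data where the slope of X1 depends on the level of X2.
set.seed(1)
n  <- 200
X1 <- runif(n, 0, 10)                 # e.g., years of experience
X2 <- rbinom(n, 1, 0.5)               # e.g., a 0/1 group dummy
Y  <- 2 + 1.5 * X1 + 3 * X2 + 0.8 * X1 * X2 + rnorm(n)

fit <- lm(Y ~ X1 * X2)                # * expands to X1 + X2 + X1:X2
coef(fit)                             # X1:X2 estimates β₃ (true value 0.8 here)
```

The slope of X1 is the X1 coefficient when X2 = 0, and the X1 coefficient plus the X1:X2 coefficient when X2 = 1.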

Nonlinear Models

Quadratic Model

Y = β₀ + β₁X + β₂X² + ε

β₂ < 0 → inverted U (rises then falls). β₂ > 0 → U-shape (falls then rises).

In R: lm(Y ~ X + I(X^2), data = df)

Turning Point

X* = -β₁ / (2β₂)

The X value where the curve peaks (inverted U) or bottoms out (U-shape). Example: if wages peak, this tells you the age at which earnings are highest.

In R: -coef(model)[2] / (2 * coef(model)[3])
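The turning-point formula can be checked on simulated data. This sketch builds an inverted-U wage-age profile with a known peak at age 50 (the numbers are illustrative) and recovers it from the fitted coefficients:

```r
# Simulate an inverted-U relationship: wages rise, peak at 50, then fall.
set.seed(2)
age  <- runif(300, 20, 65)
wage <- -40 + 4 * age - 0.04 * age^2 + rnorm(300, sd = 2)   # true peak: -4/(2*-0.04) = 50

fit <- lm(wage ~ age + I(age^2))          # I() makes ^2 mean "square this variable"
turning_point <- -coef(fit)[2] / (2 * coef(fit)[3])
turning_point                             # should land close to 50
```

Note that coef(fit)[2] is β₁ (the age coefficient) and coef(fit)[3] is β₂ (the age² coefficient) because the intercept occupies position 1.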

Log Specifications
Model       Equation                  Interpretation of β₁
Linear      Y = β₀ + β₁X              1-unit ↑ in X → β₁ unit ↑ in Y
Log-Linear  ln(Y) = β₀ + β₁X          1-unit ↑ in X → (β₁ × 100)% ↑ in Y
Linear-Log  Y = β₀ + β₁ ln(X)         1% ↑ in X → β₁/100 unit ↑ in Y
Log-Log     ln(Y) = β₀ + β₁ ln(X)     1% ↑ in X → β₁% ↑ in Y (elasticity)
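All four specifications in the table can be fit with the same lm() call, wrapping log() around whichever side is logged. A short sketch using R's built-in mtcars data (mpg and hp are both positive, so logs are defined):

```r
# Fit the four specifications from the table on one dataset.
data(mtcars)
m_lin    <- lm(mpg ~ hp, data = mtcars)              # Linear
m_loglin <- lm(log(mpg) ~ hp, data = mtcars)         # Log-Linear
m_linlog <- lm(mpg ~ log(hp), data = mtcars)         # Linear-Log
m_loglog <- lm(log(mpg) ~ log(hp), data = mtcars)    # Log-Log

coef(m_loglog)["log(hp)"]   # elasticity: % change in mpg per 1% change in hp
```

For this dataset the elasticity is negative — more horsepower goes with lower fuel economy in percentage terms.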

Back-Transformation Correction

Ŷ = e^(predicted ln(Y) + sₑ²/2)

When your model predicts ln(Y), you must apply this correction to get predictions in original units. The sₑ²/2 term corrects for retransformation bias.

In R: exp(predict(model, newdata) + sigma(model)^2/2)
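The bias the correction fixes is easy to see on simulated log-normal data (the coefficients below are illustrative): naive exp() predictions undershoot on average, while the corrected ones track the actual mean.

```r
# Demonstrate retransformation bias and its correction.
set.seed(3)
x <- runif(500, 1, 10)
y <- exp(1 + 0.3 * x + rnorm(500, sd = 0.5))   # Y is log-normal around the line

fit <- lm(log(y) ~ x)
naive     <- exp(predict(fit))                      # systematically too low
corrected <- exp(predict(fit) + sigma(fit)^2 / 2)   # multiply by e^(se^2/2)

c(actual = mean(y), naive = mean(naive), corrected = mean(corrected))
```

Here sigma(fit) is the residual standard error sₑ, so the correction factor is about exp(0.5²/2) ≈ 1.13 — naive predictions are roughly 13% too low on average.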

Cross-Validation

RMSE (Root Mean Squared Error)

RMSE = √( (1/n) Σ(yᵢ - ŷᵢ)² )

Average prediction error in the original units of Y. Compute on the validation set. Lower RMSE = better predictions. The model with the lowest average RMSE (across K folds) wins.

In R: sqrt(mean((actual - predicted)^2))
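K-fold CV can be written by hand in a few lines. A sketch with K = 5 on R's built-in mtcars data (the model mpg ~ hp + wt is an illustrative choice):

```r
# 5-fold cross-validation by hand: each fold takes a turn as validation set.
set.seed(5)
data(mtcars)
K     <- 5
folds <- sample(rep(1:K, length.out = nrow(mtcars)))   # random fold labels

rmse_k <- sapply(1:K, function(k) {
  train <- mtcars[folds != k, ]                        # K-1 folds train the model
  valid <- mtcars[folds == k, ]                        # 1 fold validates it
  fit   <- lm(mpg ~ hp + wt, data = train)
  sqrt(mean((valid$mpg - predict(fit, newdata = valid))^2))
})
mean(rmse_k)                                           # average RMSE across folds
```

To compare candidate models, run the same loop for each (with the same folds) and pick the one with the lowest average RMSE.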

Comparable R² (for log models)

R²_comp = cor(Y_actual, Ŷ_backtransformed)²

When comparing models with different dependent variables (Y vs ln(Y)), back-transform predictions to the original scale and compute the squared correlation. This makes R² values comparable.

In R: cor(df$Y, exp(predict(model) + sigma(model)^2/2))^2
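Putting the pieces together, here is a sketch of a fair fit comparison between a linear and a log-linear model, again using the built-in mtcars data as a stand-in for your own:

```r
# Compare a linear and a log-linear model on the SAME (original) scale.
data(mtcars)
m_lin <- lm(mpg ~ hp, data = mtcars)
m_log <- lm(log(mpg) ~ hp, data = mtcars)

r2_lin <- summary(m_lin)$r.squared                   # already on the Y scale
pred   <- exp(predict(m_log) + sigma(m_log)^2 / 2)   # back-transform + correction
r2_log <- cor(mtcars$mpg, pred)^2                    # comparable R² for the log model

c(linear = r2_lin, log_comparable = r2_log)
```

Reporting summary(m_log)$r.squared instead would compare fit to ln(Y) against fit to Y — exactly the apples-to-oranges comparison this formula exists to avoid.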