Topic overview

Regression Analysis

The foundation of predictive analytics. Learn to build, interpret, and test regression models.

Learning objectives

  • Understand the assumptions behind the linear regression model.
  • Understand the difference between correlation and causality.
  • Estimate and interpret a linear regression model.
  • Interpret goodness-of-fit measures.
  • Conduct tests of significance.
  • Address common violations of the OLS assumptions.
  • Get started with R Markdown.

Start here

Open the lecture experience and follow the guided flow.

🎯

Key Learning Summary

The core ideas to take away from this week.

1. OLS Assumptions Matter

Linear regression relies on assumptions like linearity, independent errors, and constant variance. Violations can make your inference unreliable.

2. Correlation Is Not Causation

A strong relationship does not prove cause and effect. You need theory, design, or controls to make causal claims.

3. Interpret Coefficients in Context

Each coefficient is the expected change in Y for a one-unit change in X, holding other variables constant.

4. Goodness-of-Fit Is Not Enough

R-squared and adjusted R-squared summarize fit, but you still need significance tests and diagnostics.

5. Significance Tests Guide Decisions

t-tests, p-values, and confidence intervals tell you which predictors matter and how precise your estimates are.

6. Document Work with R Markdown

Use R Markdown to combine code, output, and explanation in one reproducible report.

The one sentence to remember: Linear regression is powerful, but only when its assumptions are checked and its results are interpreted responsibly.
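For the R Markdown point above, a minimal sketch of what such a report looks like (the title and chunk contents are hypothetical, not from the lecture): an R Markdown file is a YAML header, prose, and executable R chunks.

````markdown
---
title: "Regression Analysis"   # hypothetical report title
output: html_document
---

We regress fuel efficiency on weight using the built-in `mtcars` data.

```{r model}
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)
```
````

Knitting this file runs the chunk and weaves the code, its output, and the surrounding prose into one reproducible document.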

📖

Key Vocabulary

Terms you should be able to define and use confidently.

Core Concepts

Statistical Model

A simplified representation of reality built from assumptions.


Dependent Variable (Y)

The outcome we want to explain or predict.


Independent Variable (X)

A predictor used to explain variation in Y.


Correlation vs. Causation

Correlation shows association. Causation requires evidence of a true cause-and-effect link.

Fit & Inference

R-squared

Proportion of variation in Y explained by the model.


Adjusted R-squared

R-squared adjusted for the number of predictors.


t-statistic

Measures how many standard errors a coefficient is from zero.


p-value

The probability of seeing a result at least as extreme as the one observed if the true effect were zero.


Confidence Interval

A range of plausible values for the true coefficient.

Diagnostics

Residual

The difference between actual and predicted values: e = y − ŷ.


Heteroscedasticity

The spread of residuals changes across fitted values.


Outlier

An extreme observation that can distort the model.


Multicollinearity

Predictors are highly correlated, inflating standard errors.

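Multicollinearity is usually quantified with variance inflation factors (VIFs). A minimal sketch that computes them by hand in base R, using an illustrative model on the built-in mtcars data (for models like this, the car package's vif() reports the same quantity):

```r
# Illustrative model: wt and disp are strongly correlated predictors
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif_by_hand <- function(model) {
  X <- model.matrix(model)[, -1]  # predictor columns, intercept dropped
  sapply(colnames(X), function(v) {
    r2 <- summary(lm(X[, v] ~ X[, colnames(X) != v]))$r.squared
    1 / (1 - r2)
  })
}

vif_by_hand(fit)  # values well above ~5-10 signal inflated standard errors
```

Each VIF says how much the variance of that coefficient's estimate is inflated relative to a world where the predictors were uncorrelated.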
📐

Key Formulas

The essentials you should recognize and use.

Core Regression

Regression Equation

ŷ = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ

The model’s predicted value for Y given the inputs.

Residual

e = y − ŷ

The model’s error for each observation.
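Both formulas are easy to verify numerically. A minimal sketch in R, using an illustrative model on the built-in mtcars data (the variables are not from the lecture):

```r
# Fit a linear regression of fuel efficiency on weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Predicted values: y-hat = b0 + b1*wt + b2*hp
y_hat <- fitted(fit)

# Residuals by the definition e = y - y-hat
e <- mtcars$mpg - y_hat

# These match the residuals R stores in the fitted model object
all.equal(unname(e), unname(residuals(fit)))
```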

Fit & Inference

R-squared

R² = 1 − SSE / SST

The fraction of total variation in Y explained by the model; SSE is the sum of squared errors and SST is the total sum of squares.

t-statistic

t = (β̂ − 0) / SE(β̂)

Used to test whether a coefficient is statistically different from zero.

Confidence Interval

β̂ ± t* · SE(β̂)

A range of plausible values for the true coefficient.
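All three of these quantities can be pulled from a fitted model in R. A sketch, again with an illustrative mtcars model, that also rebuilds R-squared directly from its definition:

```r
# Fit the model and extract fit and inference quantities
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

s$r.squared        # R^2 = 1 - SSE / SST
s$adj.r.squared    # penalized for the number of predictors
s$coefficients     # estimates, SEs, t-statistics, p-values
confint(fit, level = 0.95)  # beta-hat +/- t* x SE(beta-hat)

# Check R^2 by hand against its definition
sse <- sum(residuals(fit)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
1 - sse / sst      # matches s$r.squared
```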

Diagnostics

Homoscedasticity

Var(e | X) = constant

The spread of residuals should be roughly constant across fitted values.
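A standard way to check this assumption is to plot residuals against fitted values and, more formally, to regress the squared residuals on the fitted values (the idea behind the Breusch-Pagan test). A sketch with an illustrative mtcars model:

```r
# Visual check: look for a fan or funnel shape in this plot
fit <- lm(mpg ~ wt + hp, data = mtcars)
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)

# Numeric check: regress squared residuals on fitted values;
# a clearly nonzero slope suggests the residual variance
# changes with the fitted values (heteroscedasticity)
aux <- lm(residuals(fit)^2 ~ fitted(fit))
summary(aux)$coefficients
```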