Topic overview

Regression Analysis

The foundation of predictive analytics. Learn to build, interpret, and test regression models.

Learning objectives

  • Understand the assumptions behind the linear regression model.
  • Understand the difference between correlation and causality.
  • Estimate and interpret a linear regression model.
  • Interpret goodness-of-fit measures.
  • Conduct tests of significance.
  • Address common violations of the OLS assumptions.
  • Get started with R Markdown.

Start here

Open the lecture experience and follow the guided flow.

🎯

Key Learning Summary

The core ideas to take away from this week.

1. OLS Assumptions Matter

Linear regression relies on assumptions like linearity, independent errors, and constant variance. Violations can make your inference unreliable.

2. Correlation Is Not Causation

A strong relationship does not prove cause and effect. You need theory, design, or controls to make causal claims.

3. Interpret Coefficients in Context

Each coefficient is the expected change in Y for a one-unit change in X, holding other variables constant.

4. Goodness-of-Fit Is Not Enough

R-squared and adjusted R-squared summarize fit, but you still need significance tests and diagnostics.

5. Significance Tests Guide Decisions

t-tests, p-values, and confidence intervals tell you which predictors matter and how precise your estimates are.

6. Document Work with R Markdown

Use R Markdown to combine code, output, and explanation in one reproducible report.

The one sentence to remember: Linear regression is powerful, but only when its assumptions are checked and its results are interpreted responsibly.
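For the R Markdown point above, a minimal sketch of what such a report looks like (the title and chunk contents are hypothetical, not from the lecture): an R Markdown file is a YAML header, prose, and executable R chunks.

````markdown
---
title: "Regression Analysis"   # hypothetical report title
output: html_document
---

We regress fuel efficiency on weight using the built-in `mtcars` data.

```{r model}
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)
```
````

Knitting this file runs the chunk and weaves the code, its output, and the surrounding prose into one reproducible document.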

📖

Key Vocabulary

Terms you should be able to define and use confidently.

Core Concepts

Statistical Model

A simplified representation of reality built from assumptions.


Dependent Variable (Y)

The outcome we want to explain or predict.


Independent Variable (X)

A predictor used to explain variation in Y.


Correlation vs. Causation

Correlation shows association. Causation requires evidence of a true cause-and-effect link.

Fit & Inference

R-squared

Proportion of variation in Y explained by the model.


Adjusted R-squared

R-squared adjusted for the number of predictors.


t-statistic

Measures how many standard errors a coefficient is from zero.


p-value

The probability of seeing a result at least as extreme as the one observed if the true effect were zero.


Confidence Interval

A range of plausible values for the true coefficient.

Diagnostics

Residual

The difference between actual and predicted values: e = y − ŷ.


Heteroscedasticity

The spread of residuals changes across fitted values.


Outlier

An extreme observation that can distort the model.


Multicollinearity

Predictors are highly correlated, inflating standard errors.

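Multicollinearity is usually quantified with variance inflation factors (VIFs). A minimal sketch that computes them by hand in base R, using an illustrative model on the built-in mtcars data (for models like this, the car package's vif() reports the same quantity):

```r
# Illustrative model: wt and disp are strongly correlated predictors
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif_by_hand <- function(model) {
  X <- model.matrix(model)[, -1]  # predictor columns, intercept dropped
  sapply(colnames(X), function(v) {
    r2 <- summary(lm(X[, v] ~ X[, colnames(X) != v]))$r.squared
    1 / (1 - r2)
  })
}

vif_by_hand(fit)  # values well above ~5-10 signal inflated standard errors
```

Each VIF says how much the variance of that coefficient's estimate is inflated relative to a world where the predictors were uncorrelated.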
📐

Key Formulas

The essentials you should recognize and use.

Core Regression

Regression Equation

ŷ = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ

The model’s predicted value for Y given the inputs.

Residual

e = y − ŷ

The model’s error for each observation.
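Both formulas are easy to verify numerically. A minimal sketch in R, using an illustrative model on the built-in mtcars data (the variables are not from the lecture):

```r
# Fit a linear regression of fuel efficiency on weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Predicted values: y-hat = b0 + b1*wt + b2*hp
y_hat <- fitted(fit)

# Residuals by the definition e = y - y-hat
e <- mtcars$mpg - y_hat

# These match the residuals R stores in the fitted model object
all.equal(unname(e), unname(residuals(fit)))
```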

Fit & Inference

R-squared

R² = 1 − SSE / SST

The fraction of total variation in Y explained by the model; SSE is the sum of squared errors and SST is the total sum of squares.

t-statistic

t = (β̂ − 0) / SE(β̂)

Used to test whether a coefficient is statistically different from zero.

Confidence Interval

β̂ ± t* · SE(β̂)

A range of plausible values for the true coefficient.
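All three of these quantities can be pulled from a fitted model in R. A sketch, again with an illustrative mtcars model, that also rebuilds R-squared directly from its definition:

```r
# Fit the model and extract fit and inference quantities
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

s$r.squared        # R^2 = 1 - SSE / SST
s$adj.r.squared    # penalized for the number of predictors
s$coefficients     # estimates, SEs, t-statistics, p-values
confint(fit, level = 0.95)  # beta-hat +/- t* x SE(beta-hat)

# Check R^2 by hand against its definition
sse <- sum(residuals(fit)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
1 - sse / sst      # matches s$r.squared
```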

Diagnostics

Homoscedasticity

Var(e | X) = constant

The spread of residuals should be roughly constant across fitted values.
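A standard way to check this assumption is to plot residuals against fitted values and, more formally, to regress the squared residuals on the fitted values (the idea behind the Breusch-Pagan test). A sketch with an illustrative mtcars model:

```r
# Visual check: look for a fan or funnel shape in this plot
fit <- lm(mpg ~ wt + hp, data = mtcars)
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)

# Numeric check: regress squared residuals on fitted values;
# a clearly nonzero slope suggests the residual variance
# changes with the fitted values (heteroscedasticity)
aux <- lm(residuals(fit)^2 ~ fitted(fit))
summary(aux)$coefficients
```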