The foundation of predictive analytics. Learn to build, interpret, and test regression models.
Before we dive into regression, let's understand what we're building.
"All models are wrong, some are useful" (George Box)
Greenland looks huge on most maps, but it's smaller than Africa. The map is "wrong" but still useful.
We can't capture every variable. We focus on the ones that matter most for prediction.
Where the name "regression" comes from: a 140-year-old discovery.
Francis Galton studied the heights of parents and their children.
Galton found that unusually tall parents tend to have children shorter than themselves, and unusually short parents tend to have taller children. He called this pull toward the average "regression," and the name stuck.
The mathematical foundation of linear regression.
What we're trying to predict. The outcome variable.
The baseline value when X = 0.
How much Y changes when X increases by 1.
Random variation we can't explain.
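The four pieces above combine into the standard simple regression equation:

Y = β₀ + β₁X + ε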
Let's predict post-graduate earnings based on college characteristics.
# Build the regression model
model <- lm(Earnings ~ Cost + Grad + Debt + City,
            data = college)

# View the results
summary(model)
How well does our model explain the data?
Testing hypotheses about our coefficients.
Testing whether our estimated coefficients are statistically different from zero.
If βⱼ = 0, then Xⱼ has no relationship with Y. Does the data support this?
t = (bⱼ − βⱼ₀) / se(bⱼ)

bⱼ = estimated coefficient
βⱼ₀ = hypothesized value (usually 0)
se(bⱼ) = standard error
df = n − k − 1
The p-value is the largest α at which we would fail to reject H₀.
If α = 0.05 → reject H₀ if p < 0.05
If α = 0.01 → reject H₀ if p < 0.01
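The decision rule above can be sketched with made-up numbers (all values below are hypothetical, chosen only to show the arithmetic):

```r
# Sketch: the t-test for one coefficient by hand
b  <- 2.5                  # hypothetical estimated coefficient
se <- 1.1                  # hypothetical standard error
df <- 46                   # n - k - 1 for a hypothetical n = 50, k = 3
t_stat <- (b - 0) / se     # testing H0: beta = 0
p_val  <- 2 * pt(-abs(t_stat), df)   # two-sided p-value
p_val < 0.05               # TRUE means reject H0 at alpha = 0.05
```

In practice R computes all of this for you in the summary output; the sketch just shows where the numbers come from.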
Should we keep or drop a variable?
Testing if Cost affects Earnings.
Null Hypothesis: H₀: β_Cost = 0
Alternative Hypothesis: H₁: β_Cost ≠ 0
H₀ says: Cost does NOT affect earnings (after controlling for other variables).
If we reject H₀ (p < 0.05), Cost does have a significant effect.
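A minimal sketch of reading this test off the fitted model (assumes the `model` object fit from the `lm()` call earlier):

```r
# Pull the t-test for Cost out of the coefficient table
coefs <- summary(model)$coefficients
coefs["Cost", ]                     # Estimate, Std. Error, t value, Pr(>|t|)
coefs["Cost", "Pr(>|t|)"] < 0.05    # TRUE means reject H0 at the 5% level
```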
Testing multiple coefficients at once.
The t-test only tests one coefficient. What if we want to test if all predictors together matter?
H₀: β₁ = β₂ = ⋯ = βₖ = 0 ("None of the predictors matter")
Tests whether predictors have a joint statistical influence. R shows the F-statistic and p-value at the bottom of summary output.
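If you want the F-test numbers programmatically rather than reading them off the printed summary, a sketch (again assuming the `model` object from earlier):

```r
# The overall F-test that summary(model) prints at the bottom
fstat <- summary(model)$fstatistic            # named vector: value, numdf, dendf
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)  # its p-value
```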
Getting output from R doesn't mean the output is reliable. We need to check our work.
Just because you feel fine doesn't mean everything is fine. You get bloodwork done to check for hidden problems. Residual plots are the bloodwork for your regression model.
ACTUAL VALUE
y
PREDICTED VALUE
ŷ
RESIDUAL
e
Residuals are the mistakes your model makes. Studying the pattern of mistakes tells us if the model is working properly.
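In R, both pieces are one call away (a sketch, assuming the `model` fit from earlier):

```r
# A residual is just actual minus predicted
head(fitted(model))      # yhat: what the model predicts
head(residuals(model))   # e = y - yhat: the model's mistakes
```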
Unequal Spread
Errors are bigger for some predictions than others
Heteroscedasticity
Curved Pattern
Model is systematically off in some ranges
Nonlinearity
Outliers
A few extreme points distorting everything
Influential Points
This is the single most important diagnostic plot in regression. Here's how to read it.
X-AXIS
Fitted Values (ŷ)
What your model predicted
Y-AXIS
Residuals (e = y − ŷ)
How far off each prediction was
Residuals
| * *
+ | * * * * *
| * * * *
0 |--*----*--*-----*--*-----
| * * * * *
- | * * * *
| * *
|_________________________ Fitted
"Looks like static on a TV" = good
Residuals
| *
+ | * * * *
| * * *
0 |--*--*--*-----------*--------
| * * *
- | * * * *
| *
|_________________________ Fitted
This is heteroscedasticity!
Residuals
| * *
+ |* *
| * *
0 |------------*--*---------
| * *
- | *
|_________________________ Fitted
This means the relationship isn't linear β you may need a polynomial term or log transform.
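One illustrative way to respond to a curved pattern (a sketch, not a prescription; assumes the `college` data from earlier):

```r
# If the residuals curve, let the model bend --
# e.g., a quadratic term for Cost
m2 <- lm(Earnings ~ poly(Cost, 2) + Grad + Debt + City, data = college)
plot(m2, which = 1)   # re-check: the curve should flatten out
```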
plot(model, which = 1)
This produces the residual vs. fitted plot. R also draws a red smoothed line; it should be flat and close to the dashed zero line.
When your model's errors aren't evenly spread, your conclusions may be wrong.
PRONUNCIATION
hetero (different) + scedasticity (spread)
HET-er-oh-skeh-das-TIS-ih-tee
Literally "different spread": the errors have different amounts of scatter.
Imagine predicting home prices based on square footage:
| Home Size | Predicted Price | Actual Prices You See | Error Range |
|---|---|---|---|
| 500 sq ft (studio) | 100K | 90K – 110K | ± 10K |
| 2,000 sq ft (house) | 400K | 320K – 480K | ± 80K |
| 5,000 sq ft (mansion) | 1M | 600K – 1.5M | ± 400K |
The errors get bigger as the home gets bigger. Small homes are easy to price; mansions are wildly unpredictable.
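You can simulate exactly this fan shape with made-up home-price data (every number below is invented for illustration):

```r
# Sketch: error that grows with home size produces the fan shape
set.seed(1)
sqft  <- runif(200, 500, 5000)
price <- 100 + 0.2 * sqft + rnorm(200, sd = 0.05 * sqft)  # noise scales with size
fan   <- lm(price ~ sqft)
plot(fan, which = 1)   # residuals spread out as fitted values grow
```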
The formulas assume constant error spread. If that's violated, your SEs are too big or too small.
Wrong SEs → wrong t-statistics → wrong p-values. You might think a variable is significant when it's not!
Your 95% interval might actually be 80% or 99%; you can't tell without fixing the problem.
Good news: the β estimates themselves are still unbiased. It's the confidence in those estimates that's messed up.
(What we want)
| * *
+ | * * * *
| * * * *
0 |--*----*--*-----*--*--
| * * * * *
- | * * * *
| * *
|______________________
Same spread everywhere → reliable results
(The problem)
| *
+ | * * * *
| * * *
0 |--*--*-----------*-----
| * * *
- | * * * *
| *
|______________________
Fan shape → SEs and p-values can't be trusted
Transform the Y variable
Taking log(Y) often stabilizes the variance. Very common with dollar amounts.
Use robust standard errors
Keep the same model but calculate SEs that don't assume constant variance. Available in R with packages like sandwich.
Weighted Least Squares (WLS)
Give less weight to observations with high variance. Advanced but effective.
plot(model, which = 1)  # after applying a fix, re-check the residual plot
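A minimal sketch of the robust-SE remedy (assumes the `model` fit from earlier and that the `sandwich` and `lmtest` packages are installed):

```r
library(sandwich)   # heteroscedasticity-consistent covariance estimators
library(lmtest)     # coeftest() for re-testing coefficients

# Same coefficient estimates, but SEs that don't assume constant variance
coeftest(model, vcov = vcovHC(model, type = "HC1"))
```

"HC1" is one common choice of correction; `vcovHC` offers several variants, and which is best depends on sample size.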