Before we learn logistic regression, let's remember what we already know.
What is LPM?
The Linear Probability Model is simply using
regular linear regression (lm()) when
your outcome variable is binary (0 or 1).
P(Y = 1) = β₀ + β₁X₁ + β₂X₂ + ε
✓ Why It's Tempting
• Simple to run (just use lm())
• Easy to interpret coefficients
• Familiar from Week 1
✗ Why It Fails
• Can predict probabilities > 100%
• Can predict probabilities < 0%
• Assumes constant effect across all X
⚠️
The Key Problem
LPM doesn't "know" that probabilities must stay between 0 and 1. We need a smarter approach.
Phase I • The Problem
The Classification Problem
Many business decisions boil down to yes or no.
🤔
The Question We're Asking
"Given what we know about X, what's the
probability that Y = 1?"
🏦
Loan Default
Y = 1 if customer defaults
P(Default) = ?
📧
Email Spam
Y = 1 if email is spam
P(Spam) = ?
🏃
Customer Churn
Y = 1 if customer leaves
P(Churn) = ?
What We Need From Our Model
1️⃣
Output a probability
Between 0 and 1
2️⃣
Handle multiple predictors
X₁, X₂, X₃...
3️⃣
Be interpretable
Explain WHY
Phase I • The Concept
Logistic Regression is Everywhere
Before we dive into the math, let's see why this matters.
🏦
Banking
Will this customer default?
📧
Email
Is this message spam?
🏥
Healthcare
Will this patient be readmitted?
🛒
Retail
Will this customer churn?
💳
Fraud
Is this transaction suspicious?
📱
Marketing
Will this user click the ad?
🎯
Today's Mission
We'll build a spam filter from scratch. By the end, you'll understand
the math, write the R code, and make real business decisions.
Phase I • The Concept
The Failure of the Ruler
Why can't we just use linear regression for classification?
Linear vs. Logistic Regression
⚠️ Red dashed lines: probability boundaries (0% and 100%)
✓ Logistic: always stays between 0 and 1
✗ Linear Regression Says
"120% chance it's spam"
That's mathematically impossible!
✓ We Need
Walls at 0 and 1
Probabilities must stay bounded.
The Two Fatal Flaws of LPM
✗ Unbounded Predictions
Predictions can exceed 1 or go below 0. A probability of 1.037 or -0.2 makes no sense!
✗ Constant Effect Assumption
Assumes each unit of X has the same effect. But going from 90% to 100% should be harder
than 50% to 60%!
LPM vs. Logistic Regression
1
Bounded outputs: Logistic regression guarantees
predictions between 0 and 1
2
S-curve effect: Logistic regression captures
diminishing returns near 0 and 1
3
Proper interpretation: Coefficients represent
changes in log-odds, which can be converted to odds ratios
Phase I • The Concept
Enter the Sigmoid: The S-Curve That Saves Us
The mathematical wall builder that keeps probabilities in bounds.
The Sigmoid Formula
P = 1 / (1 + e^(-z))
where z = β₀ + β₁X₁ + β₂X₂ + ...
The Sigmoid (S-Curve)
z → -∞ : P → 0%
z = 0 : P = 50%
z → +∞ : P → 100%
💡
The Key Insight
No matter what number you put in, the sigmoid squashes it between 0 and 1.
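As a quick sketch of the formula above, here is the sigmoid written as a plain R function (base R also ships it as plogis()):

```r
# Sigmoid: squashes any real number z into (0, 1)
sigmoid <- function(z) {
  1 / (1 + exp(-z))
}

sigmoid(0)    # 0.5 exactly: z = 0 means a 50/50 call
sigmoid(-5)   # about 0.007: large negative z pushes P toward 0
sigmoid(15)   # about 0.9999997: large positive z pushes P toward 1
```

Try any number you like, even a million: the output never leaves the 0-to-1 range.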
Phase I • The Concept
Demystifying the Math
You don't need calculus. Just three simple ideas.
e ≈ 2.718
What is e?
It's a special constant โ like pi is 3.14,
e is 2.718.
Think of it this way:
Put 1 dollar in a bank at 100 percent interest, compounded continuously. After 1 year:
You have 2.718 dollars
(e dollars)
That's it. It's just a number. R knows it already.
ln()
What is natural log?
ln() answers one question:
"e to WHAT power gives me this number?"
ln(1) = 0 because e^0 = 1
ln(2.718) = 1 because e^1 = 2.718
ln(7.389) = 2 because e^2 = 7.389
ln() and e are opposites
They undo each other, like square and square root.
ln(e^x) = x
e^(ln(x)) = x
σ(z)
The Sigmoid Function
Takes any number and squashes it to a
probability between 0 and 1.
z = -1,000,000 → σ ≈ 0.00
z = -5 → σ ≈ 0.01
z = 0 → σ = 0.50
z = +5 → σ ≈ 0.99
z = +1,000,000 → σ ≈ 1.00
Negative input → probability near 0
Positive input → probability near 1
How These Three Work Together
YOUR REGRESSION
β₀ + β₁X
Linear combination
↓
ln() TRANSFORMS
Probability → Log-odds
Makes it unbounded
↓
σ SQUASHES BACK
Log-odds → Probability
Bounded 0 to 1 ✓
These three tools work together. You won't calculate any of them by hand; R does it all.
Quick Check
A logistic regression model calculates z = ฮฒโ + ฮฒโX = +15 for a particular customer. After passing through the sigmoid, the predicted probability will be closest to:
💡
The Bottom Line
You don't need to memorize formulas. e is a number (2.718), ln() is its reverse, and the
sigmoid squashes everything to a probability. In R, just add type = "response" and it
handles all three for you.
Phase II • The Translation
Why Your R Output Looks Weird
You'll run your first logistic regression and see numbers that don't look like probabilities.
Here's why.
What You'll See in R
Coefficients:
Estimate
(Intercept) -3.824
Hyperlinks 0.513
✗ Wrong interpretation:
"Each hyperlink increases spam probability by 51.3%"
✓ Right interpretation:
"Each hyperlink increases the log-odds of spam by 0.513"
Why Doesn't R Just Show Probabilities?
Probability is trapped between 0 and 1. But regression needs room to work; it needs values
that can go as high or low as needed.
So logistic regression works in a different language
called log-odds, which can range from -∞ to +∞.
Then it converts back to probability when you ask for predictions.
Probability
trapped: 0 to 1
→ transform →
Log-Odds
free: -∞ to +∞
→ regression →
Probability
convert back
Why You Need to Know This
🔑
The Key Insight
Logistic regression is just linear regression on a transformed scale (log-odds).
The coefficients live in that transformed world. To get back to probability, you need to
convert, and R can do it for you.
Phase II • The Translation
The Translation Chain
Three steps to go from probability to something linear (and back).
1
Probability → Odds
What are odds?
You already know this from everyday language. When someone says
"4 to 1 odds", they mean: for every 1 time it
doesn't happen, it happens 4 times.
Odds are just another way to express probability: instead of saying
"80% chance", you say
"4 to 1 in favor."
Formula
Odds = P / (1-P)
↓
Example: P = 0.80
0.80 / 0.20 = 4
What changed: Probability is stuck between 0
and 1. Odds can go from 0 to infinity; the top is now unbounded.
2
Odds → Log-Odds (Logit)
Odds fixed the top (can go to infinity) but the bottom is still stuck at 0. Taking the
natural log fixes that too: now values can range from
negative infinity to positive infinity.
Formula
Logit = ln(Odds)
↓
Example: Odds = 4
ln(4) = 1.39
Now we have a fully unbounded scale. Perfect for linear modeling!
See the Pattern: The Full Chain at Different Probabilities
Probability → Odds → Log-Odds (Intuition)
0.10 → 0.11 → -2.20 (Very unlikely: large negative)
0.50 → 1.00 → 0.00 (Coin flip: log-odds = zero!)
0.80 → 4.00 → 1.39 (Likely: positive)
0.99 → 99.0 → 4.60 (Almost certain: large positive)
Below 50% → negative log-odds
Exactly 50% → log-odds = 0
Above 50% → positive log-odds
Now that we have an unbounded scale, the model can fit a linear equation to it:
3
The Model Fits a Line to Log-Odds
ln(Odds) = β₀ + β₁X₁ + β₂X₂
This is just like linear regression! The only difference: the coefficients are changes in
log-odds, not changes in Y directly.
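The three-step chain above can be traced in a few lines of R. Base R's qlogis() is the logit function, so it should agree with the hand computation:

```r
p <- 0.80             # step 0: a probability
odds <- p / (1 - p)   # step 1: probability -> odds (0.80 / 0.20 = 4)
logit <- log(odds)    # step 2: odds -> log-odds (ln(4), about 1.39)

odds
logit
qlogis(p)             # base R's logit; same value as log(p / (1 - p))
```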
Quick Check
If a logistic regression model predicts a log-odds of 0 for a customer, what is their predicted probability?
Phase II • The Translation
Converting Back to Probability
R does this for you with
type = "response".
MODEL CALCULATES
β₀ + β₁X
Log-odds
↓
EXPONENTIATE
e^(log-odds)
Odds
↓
CONVERT
Odds/(1+Odds)
Probability ✓
✗ Without type = "response"
predict(model, newdata)
Returns: 1.39 (log-odds)
Not interpretable!
✓ With type = "response"
predict(model, newdata, type = "response")
Returns: 0.80 (probability)
80% chance ✓
💡
Quick Reference
Probability: What we want (0 to 1)
Odds: P/(1-P), used for interpretation
Log-Odds: What the model actually predicts
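The convert-back chain can be verified by hand. Base R's plogis() is the inverse logit, which is exactly the transformation that type = "response" applies for you:

```r
log_odds <- 1.39          # what predict() returns by default
odds <- exp(log_odds)     # exponentiate: about 4.01
p <- odds / (1 + odds)    # convert: about 0.80

p
plogis(log_odds)          # base R's inverse logit; same answer
```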
Phase II • The Translation
Interpreting Coefficients
The rule: e^β = odds ratio.
Interpretation Rule
When X increases by 1, the odds are multiplied by
e^β.
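A sketch of the rule in action, using the Hyperlinks coefficient from the spam example (0.513). With a fitted model you would run exp(coef(model)) to get every odds ratio at once:

```r
beta <- 0.513            # log-odds coefficient for Hyperlinks
odds_ratio <- exp(beta)  # about 1.67

odds_ratio               # each extra hyperlink multiplies the odds by ~1.67
(odds_ratio - 1) * 100   # about +67: the percent change in odds
```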
Phase II • The Translation
Flashcard Checkpoint
Quick drill on odds ratio interpretation.
Phase III • R Lab
R Lab: Five Examples
We'll build models together in RStudio. Here's the roadmap.
🏦
Ex 1
LPM on Mortgage
Why it fails
🏦
Ex 2
Logistic on Mortgage
The fix
📧
Ex 3
Spam Detection
Full workflow
🏋️
Ex 4
Membership
Dummy variables
📊
Ex 5
Spam Holdout
Model comparison
💡
Follow Along
Open RStudio now. We'll write code together; I'll explain each line before we run it.
Phase III • R Lab
Example 1: LPM on Mortgage Data
Let's start with what we already know โ linear regression โ and see where it breaks.
🏦
Mortgage Loan Approval
Can we predict whether a loan gets approved?
Y = Approval (0 or 1)
X1 = Down Payment %
X2 = Income-to-Loan Ratio %
Step 1: Load & Run LPM
library(readxl)  # needed for read_excel()
loan <- read_excel("Data/Mortgage.xlsx")
model <- lm(y ~ x1 + x2, data = loan)
summary(model)
R Output
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.868151 0.281053 -3.089 0.004615 **
x1 0.018840 0.006992 2.694 0.011977 *
x2 0.025831 0.006290 4.107 0.000333 ***
Now Let's Predict: Down Payment = 60%, Income Ratio = 30%
predict(model,
data.frame(x1 = 60, x2 = 30))
R Returns
1.037
😱
That's a probability of
103.7%
A probability greater than 100% is impossible. This is exactly why LPM fails.
⚠️
This Is Why We Need Logistic Regression
The LPM doesn't know that probabilities must stay between 0 and 1. Same data, same question;
let's fix this with glm().
Phase III • R Lab
Example 2: Logistic Regression โ The Fix
Same data, same question โ but now the math keeps probabilities where they belong.
The Key Change: lm() → glm()
# Same data, same formula; different function
model <- glm(
  y ~ x1 + x2,
  family = binomial, data = loan)
summary(model)
glm() not lm()
Same formula: y ~ x1 + x2
family = binomial
R Output
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.36709 3.19580 -2.931 0.00338 **
x1 0.13490 0.06401 2.107 0.03508 *
x2 0.17822 0.06463 2.758 0.00582 **
Note: These coefficients are in log-odds, not
probability. Remember what we learned!
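Using the fitted coefficients above, we can redo Example 1's troublesome prediction (x1 = 60, x2 = 30) by hand. This is a sketch of what predict(model, ..., type = "response") computes for us:

```r
# Coefficients from the glm() output above
b0 <- -9.36709; b1 <- 0.13490; b2 <- 0.17822

z <- b0 + b1 * 60 + b2 * 30   # log-odds: about 4.07
p <- 1 / (1 + exp(-z))        # sigmoid: about 0.983

p  # roughly 98.3%: a valid probability, unlike LPM's 1.037
```

Same inputs that broke the LPM, but the sigmoid keeps the answer inside [0, 1].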
Say this: "Each additional recipient increases the odds of spam by about
11%."
Hyperlinks: +67.1%
Log-odds coef: 0.513 → Odds ratio: 1.670
📈
Say this: "Each additional hyperlink increases spam odds by 67%. This
is our strongest signal!"
Characters: -1.4%
Log-odds coef: -0.014 → Odds ratio: 0.986
📉
Say this: "Longer emails are slightly less likely to be spam. Spammers
keep it short!"
💡
The Pattern
Positive % → increases odds of Y=1. Negative % → decreases odds. The bigger
the number, the stronger the effect. Hyperlinks (+67%) is the dominant predictor here.
Phase III • R Lab
Example 3: Making Predictions
What's the probability an email with 20 recipients is spam? What about 21?
First: What Are Average Values?
mean(spam$Hyperlinks)
# 6.226
mean(spam$Characters)
# 58.602
We'll hold Hyperlinks and Characters at their averages, and vary only Recipients.
Predict for 20 vs 21 Recipients
prob <- predict(model,
data.frame(Recipients =
c(20, 21),
Hyperlinks = 6.226,
Characters = 58.602),
type = "response")
1 2
0.667 0.690
20 Recipients
66.7%
chance of being spam
21 Recipients
69.0%
chance of being spam
One extra recipient → probability goes from 66.7% to 69.0%, a
2.3 percentage point increase.
Notice: the effect on probability is NOT constant. At different starting points, the same +1
recipient would produce a different change. That's the S-curve at work!
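The S-curve claim is easy to check directly. With a hypothetical slope of 0.5 (a number chosen only to illustrate), the same +1 in X moves the probability far more near the middle of the curve than near the top:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))
b1 <- 0.5  # hypothetical slope for illustration

# Near the middle of the S-curve (z = 0, P = 50%)
sigmoid(0 + b1) - sigmoid(0)   # about +0.122

# Near the top of the S-curve (z = 3, P ~ 95%)
sigmoid(3 + b1) - sigmoid(3)   # about +0.018
```

Same one-unit change in X, very different change in probability: that is exactly the non-constant effect the LPM cannot capture.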
Phase III • R Lab
Example 4: Customer Loyalty
New twist: one of our predictors is categorical
(plan type). We need a dummy variable.
🏋️
Gym Membership Loyalty (Membership.xlsx)
Y = Loyal (0/1), X1 = Age, X2 = Income, X3 = Plan Type (Single vs Family)
Each unit of income increases loyalty odds by 4.5%
-57.7%
Being on a Single plan
decreases loyalty odds by 58% vs. Family plan
Predict: 50-year-old, Income = 80, Single Plan
predict(model,
data.frame(Age = 50, Income = 80, Single = 1),
type = "response")
39.2% probability of being a loyal customer
💡
Dummy Variable Interpretation
The Single coefficient (-57.7%) compares Single plan holders
to the reference group (Family plan holders), holding Age and Income constant.
This is how we handle categorical predictors.
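If the plan type arrives as text rather than 0/1, you create the dummy yourself (or let R's factor handling do it). A minimal sketch, assuming a hypothetical column PlanType with values "Single" and "Family":

```r
# Hypothetical data frame standing in for Membership.xlsx
members <- data.frame(PlanType = c("Single", "Family", "Single"))

# Dummy variable: 1 for Single, 0 for the reference group (Family)
members$Single <- ifelse(members$PlanType == "Single", 1, 0)

members$Single  # 1 0 1
```

Family is the reference group here because it is coded 0; the model's Single coefficient then measures the difference from Family.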
Phase IV • Evaluation
We've Built Models. But How Good Are They?
So far we've built 4 models and interpreted their coefficients. Now the critical question.
🤔
"If I use this model on new emails tomorrow,
will it still work?"
We don't know yet, because we've only tested on data the model already saw.
✓
What We've Done
Built models, read output, interpreted coefficients, made predictions
⚠️
What We Haven't Done
Tested whether our model works on data it hasn't seen
🔜
What's Next
Learn the holdout method, then apply it in Example 5
Phase IV • Evaluation
The Overfitting Trap
Your model looks great, but can you trust the accuracy score?
📝
The Answer-Key Problem
Imagine studying for an exam using the exact answer key.
You'd score 95%. But on a new exam? Maybe 60%.
Testing a model on its own training data is the same mistake.
Two Spam Models: Which Is Better?
Model A
Training accuracy: 92%
Accuracy on new data: 68%
📉 24-point drop: overfitting!
Model B
Training accuracy: 81%
Accuracy on new data: 79%
✓ Only 2-point drop: reliable!
Quick Check
Your company needs to deploy a spam filter. Based on the numbers above, which model should you choose?
✓
The Rule
Never evaluate a model on the data it was trained on. Always test on
unseen data.
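A minimal sketch of the split itself, using simulated stand-in data rather than the real spam file (the names spam, train, and valid are placeholders):

```r
set.seed(123)  # make the random split reproducible

# Simulated stand-in for the real dataset (200 rows)
spam <- data.frame(Spam = rbinom(200, 1, 0.4),
                   Recipients = rpois(200, 10))

# 75% of row indices go to training; the rest form the validation set
train_rows <- sample(nrow(spam), size = 0.75 * nrow(spam))
train <- spam[train_rows, ]
valid <- spam[-train_rows, ]

nrow(train)  # 150
nrow(valid)  # 50
```

The negative index spam[-train_rows, ] is the key trick: it guarantees the two sets share no rows.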
Phase IV โข Evaluation
The Holdout Method
Split your data once. Train on one part, test on the other.
# Step 2: Train
model <- glm(Spam ~ Recipients + Hyperlinks,
  family = binomial, data = train)
# Step 3: Predict on validation set
preds <- predict(model, newdata = valid,
  type = "response")
Practice
Complete the Code
You've trained a model on the training set. Now you need to make predictions on the validation set. Fill in the blanks:
preds <- predict(model, newdata = , type = )
Hints
Which dataset should the model predict on โ the one it trained on, or the one it hasn't seen?
What type argument converts log-odds to probability?
💡
Why This Matters for Business
If you skip validation and deploy Model A from the previous slide, your spam filter would
misclassify about 32% of emails in production. The holdout method
catches this before deployment. It takes 3 lines of R code and could save the project.
Phase V • R Lab: Validation
Let's Apply This in R
Now that we understand why we validate, let's do it with our spam data.
The Plan for Example 5
1
Split
75% train, 25% test
2
Train 2 Models
3 vars vs 2 vars
3
Predict
On validation set
4
Compare
Accuracy, sensitivity, specificity
💡
Open RStudio
We're going back to R. Same spam dataset, but this time we'll split it and prove which model
actually works better on new data.
Phase III • R Lab
Example 5: The Holdout Method
Back to spam data. Let's test how well our model works on new data.
This cutoff is a choice! We'll explore how changing it affects results later (threshold
tuning).
โ
Key Detail
Notice we train on TData but predict on
VData. The model never sees the validation
data during training. That's the whole point!
Phase III • R Lab
Example 5: Comparing Two Models
Does adding Characters as a predictor actually help?
Train Model 2 (only 2 predictors)
Model2 <- glm(
  Spam ~ Recipients + Hyperlinks,
  family = binomial, data = TData)
pHat2 <- predict(Model2, VData,
  type = "response")
yHat2 <- ifelse(pHat2 >= 0.5, 1, 0)
100 * mean(VData$Spam == yHat2)
Model 1
Recipients + Hyperlinks + Characters
Validation Accuracy
68.0%
Model 2
Recipients + Hyperlinks only
Validation Accuracy
65.6%
Model 1 wins by 2.4 percentage points. Adding Characters improves
predictions on unseen data.
Quick Check
If Model 2 had a HIGHER training accuracy than Model 1 (e.g., 75% vs 72%), but Model 1 has higher VALIDATION accuracy (68% vs 65.6%), which model should you deploy?
Phase III • R Lab
Example 5: Beyond Accuracy
Accuracy tells us the overall score. But where is the model making mistakes?
Calculate Sensitivity & Specificity
# True Positives: predicted spam AND actually spam
yTP1 <- ifelse(yHat1 == 1 & VData$Spam == 1, 1, 0)
# True Negatives: predicted not-spam AND actually not-spam
yTN1 <- ifelse(yHat1 == 0 & VData$Spam == 0, 1, 0)
# Sensitivity: Of actual spam, how many did we catch?
100 * (sum(yTP1) / sum(VData$Spam == 1))
# Specificity: Of actual non-spam, how many did we correctly ID?
100 * (sum(yTN1) / sum(VData$Spam == 0))
Model 1 vs Model 2: Full Scorecard
Metric
Model 1 (3 vars)
Model 2 (2 vars)
What It Measures
Accuracy
68.0%
65.6%
Overall correct predictions
Sensitivity
72.1%
67.2%
Of actual spam, how many caught?
Specificity
68.8%
64.1%
Of actual non-spam, how many identified?
Sensitivity = 72.1%
Of all the actual spam emails, Model 1 correctly
flagged 72.1% of them.
The remaining 27.9% of spam slipped through. 😬
Specificity = 68.8%
Of all the actual non-spam emails, Model 1
correctly identified 68.8% as safe.
31.2% of real emails were wrongly flagged as spam. 😬
✓
Model 1 Wins Across the Board
Adding Characters as a predictor improved all three metrics. Accuracy +2.4
points, Sensitivity +4.9 points, Specificity +4.7 points. The variable earns its spot in the
model.
Phase IV • The Decision
The Confusion Matrix
Accuracy sounds good, but it can hide costly mistakes.
⚠️
Accuracy is a Lie
If 99% of emails are NOT spam, a model that says "Nothing is spam" gets 99%
accuracy but catches zero spam.
Interactive Confusion Matrix
Actual →
Positive (1)
Negative (0)
Predicted ↓
METRICS
Accuracy: 68.8%
(TP + TN) / Total
Precision: 66.7%
TP / (TP + FP): "Of predicted positives, how many were correct?"
Recall (Sensitivity): 72.1%
TP / (TP + FN): "Of actual positives, how many did we catch?"
F1 Score: 69.3%
Harmonic mean of Precision & Recall
Specificity: 65.6%
TN / (TN + FP): "Of actual negatives, how many did we correctly identify?"
Phase IV • The Decision
The Threshold "Volume Knob"
Your model outputs probabilities. You decide where to draw the line.
Threshold Control
Move the cutoff and watch precision/recall change.
Threshold: 0.50
Accuracy: 80.0%
Precision: 80.0%
Recall: 80.0%
TP: 8 · FP: 2
TN: 8 · FN: 2
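The metrics above can be reproduced by hand from the four confusion-matrix counts, here using the TP = 8, FP = 2, TN = 8, FN = 2 shown at this threshold:

```r
TP <- 8; FP <- 2; TN <- 8; FN <- 2

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 0.80: all correct calls
precision   <- TP / (TP + FP)                   # 0.80: of predicted positives
recall      <- TP / (TP + FN)                   # 0.80: of actual positives (sensitivity)
specificity <- TN / (TN + FP)                   # 0.80: of actual negatives
f1          <- 2 * precision * recall / (precision + recall)  # 0.80
```

Plug in your own counts and every metric updates; that is all the interactive widget is doing.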
Phase IV • The Decision
When to Prioritize What?
The right choice depends on the cost of each error.
🏥
Prioritize Recall
Missing a cancer case could be fatal. Use a lower threshold.
📧
Prioritize Precision
Blocking a real email is costly. Use a higher threshold.
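The tradeoff shows up even on a tiny made-up example (hypothetical probabilities and labels below): lowering the cutoff catches more true positives (recall up) at the cost of more false alarms (precision down).

```r
# Hypothetical predicted probabilities and true labels
p_hat <- c(0.95, 0.80, 0.65, 0.45, 0.30, 0.10)
y     <- c(1,    1,    0,    1,    0,    0)

recall_at <- function(cut) {
  y_hat <- ifelse(p_hat >= cut, 1, 0)
  sum(y_hat == 1 & y == 1) / sum(y == 1)      # of actual positives, caught
}
precision_at <- function(cut) {
  y_hat <- ifelse(p_hat >= cut, 1, 0)
  sum(y_hat == 1 & y == 1) / sum(y_hat == 1)  # of flagged, truly positive
}

recall_at(0.7);    recall_at(0.4)     # 2/3 -> 3/3: lower cutoff catches more
precision_at(0.7); precision_at(0.4)  # 1.00 -> 0.75: but flags more innocents
```

There is no free lunch here: the threshold is a business decision, not a statistics decision.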