Before we learn logistic regression, let's remember what we already know.
What is LPM?
The Linear Probability Model is simply using
regular linear regression (lm()) when
your outcome variable is binary (0 or 1).
P(Y = 1) = β₀ + β₁X₁ + β₂X₂ + ε
✓ Why It's Tempting
• Simple to run (just use lm())
• Easy to interpret coefficients
• Familiar from Week 1
✗ Why It Fails
• Can predict probabilities > 100%
• Can predict probabilities < 0%
• Assumes constant effect across all X
⚠️
The Key Problem
LPM doesn't "know" that probabilities must stay between 0 and 1. We need a smarter approach.
Phase I • The Problem
The Classification Problem
Many business decisions boil down to yes or no.
🤔
The Question We're Asking
"Given what we know about X, what's the
probability that Y = 1?"
🏦
Loan Default
Y = 1 if customer defaults
P(Default) = ?
📧
Email Spam
Y = 1 if email is spam
P(Spam) = ?
🏃
Customer Churn
Y = 1 if customer leaves
P(Churn) = ?
What We Need From Our Model
1️⃣
Output a probability
Between 0 and 1
2️⃣
Handle multiple predictors
X₁, X₂, X₃...
3️⃣
Be interpretable
Explain WHY
Phase I • The Concept
Logistic Regression is Everywhere
Before we dive into the math, let's see why this matters.
🏦
Banking
Will this customer default?
📧
Email
Is this message spam?
🏥
Healthcare
Will this patient be readmitted?
🛒
Retail
Will this customer churn?
💳
Fraud
Is this transaction suspicious?
📱
Marketing
Will this user click the ad?
🎯
Today's Mission
We'll build a spam filter from scratch. By the end, you'll understand
the math, write the R code, and make real business decisions.
Phase I • The Concept
The Failure of the Ruler
Why can't we just use linear regression for classification?
Linear vs. Logistic Regression
⚠️ Red dashed lines: probability boundaries (0% and 100%)
✓ Logistic: always stays between 0 and 1
✗ Linear Regression Says
"120% chance it's spam"
That's mathematically impossible!
✓ We Need
Walls at 0 and 1
Probabilities must stay bounded.
The Two Fatal Flaws of LPM
✗ Unbounded Predictions
Predictions can exceed 1 or go below 0. A probability of 1.037 or -0.2 makes no sense!
✗ Constant Effect Assumption
Assumes each unit of X has the same effect. But going from 90% to 100% should be harder
than 50% to 60%!
LPM vs. Logistic Regression
1
Bounded outputs: Logistic regression guarantees
predictions between 0 and 1
2
S-curve effect: Logistic regression captures
diminishing returns near 0 and 1
3
Proper interpretation: Coefficients represent
changes in log-odds, which can be converted to odds ratios
Phase I • The Concept
Enter the Sigmoid: The S-Curve That Saves Us
The mathematical wall builder that keeps probabilities in bounds.
The Sigmoid Formula
P = 1 / (1 + e^(-z))
where z = β₀ + β₁X₁ + β₂X₂ + ...
The Sigmoid (S-Curve)
z → -∞ : P → 0%
z = 0 : P = 50%
z → +∞ : P → 100%
💡
The Key Insight
No matter what number you put in, the sigmoid squashes it between 0 and 1.
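As a quick sketch of the formula above, here is the sigmoid written as a plain R function (base R also ships it as plogis()):

```r
# Sigmoid: squashes any real number z into (0, 1)
sigmoid <- function(z) {
  1 / (1 + exp(-z))
}

sigmoid(0)    # 0.5 exactly: z = 0 means a 50/50 call
sigmoid(-5)   # about 0.007: large negative z pushes P toward 0
sigmoid(15)   # about 0.9999997: large positive z pushes P toward 1
```

Try any number you like, even a million: the output never leaves the 0-to-1 range.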
Phase I • The Concept
Demystifying the Math
You don't need calculus. Just three simple ideas.
e ≈ 2.718
What is e?
It's a special constant โ like pi is 3.14,
e is 2.718.
Think of it this way:
Put 1 dollar in a bank at 100 percent interest, compounded continuously. After 1 year:
You have 2.718 dollars
(e dollars)
That's it. It's just a number. R knows it already.
ln()
What is natural log?
ln() answers one question:
"e to WHAT power gives me this number?"
ln(1) = 0 because e^0 = 1
ln(2.718) = 1 because e^1 = 2.718
ln(7.389) = 2 because e^2 = 7.389
ln() and e are opposites
They undo each other, like square and square root.
ln(e^x) = x
e^(ln(x)) = x
σ(z)
The Sigmoid Function
Takes any number and squashes it to a
probability between 0 and 1.
z = -1,000,000 → σ ≈ 0.00
z = -5 → σ ≈ 0.01
z = 0 → σ = 0.50
z = +5 → σ ≈ 0.99
z = +1,000,000 → σ ≈ 1.00
Negative input → probability near 0
Positive input → probability near 1
How These Three Work Together
YOUR REGRESSION
β₀ + β₁X
Linear combination
↓
ln() TRANSFORMS
Probability → Log-odds
Makes it unbounded
↓
σ SQUASHES BACK
Log-odds → Probability
Bounded 0 to 1 ✓
These three tools work together. You won't calculate any of them by hand; R does it all.
Quick Check
A logistic regression model calculates z = ฮฒโ + ฮฒโX = +15 for a particular customer. After passing through the sigmoid, the predicted probability will be closest to:
💡
The Bottom Line
You don't need to memorize formulas. e is a number (2.718), ln() is its reverse, and the
sigmoid squashes everything to a probability. In R, just add type = "response" and it
handles all three for you.
Phase II • The Translation
Why Your R Output Looks Weird
You'll run your first logistic regression and see numbers that don't look like probabilities.
Here's why.
What You'll See in R
Coefficients:
Estimate
(Intercept) -3.824
Hyperlinks 0.513
✗ Wrong interpretation:
"Each hyperlink increases spam probability by 51.3%"
✓ Right interpretation:
"Each hyperlink increases the log-odds of spam by 0.513"
Why Doesn't R Just Show Probabilities?
Probability is trapped between 0 and 1. But regression needs room to work; it needs values
that can go as high or low as needed.
So logistic regression works in a different language
called log-odds, which can range from -∞ to +∞.
Then it converts back to probability when you ask for predictions.
Probability
trapped: 0 to 1
→ transform →
Log-Odds
free: -∞ to +∞
→ regression →
Probability
convert back
Why You Need to Know This
🔑
The Key Insight
Logistic regression is just linear regression on a transformed scale (log-odds).
The coefficients live in that transformed world. To get back to probability, you need to
convert, and R can do it for you.
Phase II • The Translation
The Translation Chain
Three steps to go from probability to something linear (and back).
1
Probability → Odds
What are odds?
You already know this from everyday language. When someone says
"4 to 1 odds", they mean: for every 1 time it
doesn't happen, it happens 4 times.
Odds are just another way to express probability: instead of saying
"80% chance", you say
"4 to 1 in favor."
Formula
Odds = P / (1-P)
↓
Example: P = 0.80
0.80 / 0.20 = 4
What changed: Probability is stuck between 0
and 1. Odds can go from 0 to infinity; the top is now unbounded.
2
Odds → Log-Odds (Logit)
Odds fixed the top (can go to infinity) but the bottom is still stuck at 0. Taking the
natural log fixes that too: now values can range from
negative infinity to positive infinity.
Formula
Logit = ln(Odds)
↓
Example: Odds = 4
ln(4) = 1.39
Now we have a fully unbounded scale. Perfect for linear modeling!
See the Pattern: The Full Chain at Different Probabilities
Probability → Odds → Log-Odds (Intuition)
0.10 → 0.11 → -2.20 (Very unlikely: large negative)
0.50 → 1.00 → 0.00 (Coin flip: log-odds = zero!)
0.80 → 4.00 → 1.39 (Likely: positive)
0.99 → 99.0 → 4.60 (Almost certain: large positive)
Below 50% → negative log-odds
Exactly 50% → log-odds = 0
Above 50% → positive log-odds
Now that we have an unbounded scale, the model can fit a linear equation to it:
3
The Model Fits a Line to Log-Odds
ln(Odds) = β₀ + β₁X₁ + β₂X₂
This is just like linear regression! The only difference: the coefficients are changes in
log-odds, not changes in Y directly.
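The three-step chain above can be traced in a few lines of R. Base R's qlogis() is the logit function, so it should agree with the hand computation:

```r
p <- 0.80             # step 0: a probability
odds <- p / (1 - p)   # step 1: probability -> odds (0.80 / 0.20 = 4)
logit <- log(odds)    # step 2: odds -> log-odds (ln(4), about 1.39)

odds
logit
qlogis(p)             # base R's logit; same value as log(p / (1 - p))
```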
Quick Check
If a logistic regression model predicts a log-odds of 0 for a customer, what is their predicted probability?
Phase II • The Translation
Converting Back to Probability
R does this for you with
type = "response".
MODEL CALCULATES
β₀ + β₁X
Log-odds
↓
EXPONENTIATE
e^(log-odds)
Odds
↓
CONVERT
Odds/(1+Odds)
Probability ✓
✗ Without type = "response"
predict(model, newdata)
Returns: 1.39 (log-odds)
Not interpretable!
✓ With type = "response"
predict(model, newdata, type = "response")
Returns: 0.80 (probability)
80% chance ✓
💡
Quick Reference
Probability: What we want (0 to 1)
Odds: P/(1-P), used for interpretation
Log-Odds: What the model actually predicts
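The convert-back chain can be verified by hand. Base R's plogis() is the inverse logit, which is exactly the transformation that type = "response" applies for you:

```r
log_odds <- 1.39          # what predict() returns by default
odds <- exp(log_odds)     # exponentiate: about 4.01
p <- odds / (1 + odds)    # convert: about 0.80

p
plogis(log_odds)          # base R's inverse logit; same answer
```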
Phase II • The Translation
Interpreting Coefficients
The rule: e^β = odds ratio.
Interpretation Rule
When X increases by 1, the odds are multiplied by
e^β.
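A sketch of the rule in action, using the Hyperlinks coefficient from the spam example (0.513). With a fitted model you would run exp(coef(model)) to get every odds ratio at once:

```r
beta <- 0.513            # log-odds coefficient for Hyperlinks
odds_ratio <- exp(beta)  # about 1.67

odds_ratio               # each extra hyperlink multiplies the odds by ~1.67
(odds_ratio - 1) * 100   # about +67: the percent change in odds
```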
Phase II • The Translation
Flashcard Checkpoint
Quick drill on odds ratio interpretation.
Phase III • R Lab
R Lab: Five Examples
We'll build models together in RStudio. Here's the roadmap.
🏦
Ex 1
LPM on Mortgage
Why it fails
🏦
Ex 2
Logistic on Mortgage
The fix
📧
Ex 3
Spam Detection
Full workflow
🏋️
Ex 4
Membership
Dummy variables
📊
Ex 5
Spam Holdout
Model comparison
💡
Follow Along
Open RStudio now. We'll write code together; I'll explain each line before we run it.
Phase III • R Lab
Example 1: LPM on Mortgage Data
Let's start with what we already know โ linear regression โ and see where it breaks.
🏦
Mortgage Loan Approval
Can we predict whether a loan gets approved?
Y = Approval (0 or 1)
X1 = Down Payment %
X2 = Income-to-Loan Ratio %
Step 1: Load & Run LPM
library(readxl)  # needed for read_excel()
loan <- read_excel("Data/Mortgage.xlsx")
model <- lm(y ~ x1 + x2, data = loan)
summary(model)
R Output
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.868151 0.281053 -3.089 0.004615 **
x1 0.018840 0.006992 2.694 0.011977 *
x2 0.025831 0.006290 4.107 0.000333 ***
Now Let's Predict: Down Payment = 60%, Income Ratio = 30%
predict(model,
data.frame(x1 = 60, x2 = 30))
R Returns
1.037
😱
That's a probability of
103.7%
A probability greater than 100% is impossible. This is exactly why LPM fails.
⚠️
This Is Why We Need Logistic Regression
The LPM doesn't know that probabilities must stay between 0 and 1. Same data, same question;
let's fix this with glm().
Phase III • R Lab
Example 2: Logistic Regression โ The Fix
Same data, same question โ but now the math keeps probabilities where they belong.
The Key Change: lm() → glm()
# Same data, same formula; different function
model <- glm(
  y ~ x1 + x2,
  family = binomial, data = loan)
summary(model)
glm() not lm()
Same formula: y ~ x1 + x2
family = binomial
R Output
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.36709 3.19580 -2.931 0.00338 **
x1 0.13490 0.06401 2.107 0.03508 *
x2 0.17822 0.06463 2.758 0.00582 **
Note: These coefficients are in log-odds, not
probability. Remember what we learned!
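Using the fitted coefficients above, we can redo Example 1's troublesome prediction (x1 = 60, x2 = 30) by hand. This is a sketch of what predict(model, ..., type = "response") computes for us:

```r
# Coefficients from the glm() output above
b0 <- -9.36709; b1 <- 0.13490; b2 <- 0.17822

z <- b0 + b1 * 60 + b2 * 30   # log-odds: about 4.07
p <- 1 / (1 + exp(-z))        # sigmoid: about 0.983

p  # roughly 98.3%: a valid probability, unlike LPM's 1.037
```

Same inputs that broke the LPM, but the sigmoid keeps the answer inside [0, 1].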
Say this: "Each additional recipient increases the odds of spam by about
11%."
Hyperlinks: +67.1%
Log-odds coef: 0.513 → Odds ratio: 1.670
📈
Say this: "Each additional hyperlink increases spam odds by 67%. This
is our strongest signal!"
Characters: -1.4%
Log-odds coef: -0.014 → Odds ratio: 0.986
📉
Say this: "Longer emails are slightly less likely to be spam. Spammers
keep it short!"
💡
The Pattern
Positive % → increases odds of Y=1. Negative % → decreases odds. The bigger
the number, the stronger the effect. Hyperlinks (+67%) is the dominant predictor here.
Phase III • R Lab
Example 3: Making Predictions
What's the probability an email with 20 recipients is spam? What about 21?
First: What Are Average Values?
mean(spam$Hyperlinks)
# 6.226
mean(spam$Characters)
# 58.602
We'll hold Hyperlinks and Characters at their averages, and vary only Recipients.
Predict for 20 vs 21 Recipients
prob <- predict(model,
data.frame(Recipients =
c(20, 21),
Hyperlinks = 6.226,
Characters = 58.602),
type = "response")
1 2
0.667 0.690
20 Recipients
66.7%
chance of being spam
21 Recipients
69.0%
chance of being spam
One extra recipient → probability goes from 66.7% to 69.0%, a
2.3 percentage point increase.
Notice: the effect on probability is NOT constant. At different starting points, the same +1
recipient would produce a different change. That's the S-curve at work!
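The S-curve claim is easy to check directly. With a hypothetical slope of 0.5 (a number chosen only to illustrate), the same +1 in X moves the probability far more near the middle of the curve than near the top:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))
b1 <- 0.5  # hypothetical slope for illustration

# Near the middle of the S-curve (z = 0, P = 50%)
sigmoid(0 + b1) - sigmoid(0)   # about +0.122

# Near the top of the S-curve (z = 3, P ~ 95%)
sigmoid(3 + b1) - sigmoid(3)   # about +0.018
```

Same one-unit change in X, very different change in probability: that is exactly the non-constant effect the LPM cannot capture.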
Phase III • R Lab
Example 4: Customer Loyalty
New twist: one of our predictors is categorical
(plan type). We need a dummy variable.
🏋️
Gym Membership Loyalty (Membership.xlsx)
Y = Loyal (0/1), X1 = Age, X2 = Income, X3 = Plan Type (Single vs Family)
Each unit of income increases loyalty odds by 4.5%
-57.7%
Being on a Single plan
decreases loyalty odds by 58% vs. Family plan
Predict: 50-year-old, Income = 80, Single Plan
predict(model,
data.frame(Age = 50, Income = 80, Single = 1),
type = "response")
39.2% probability of being a loyal customer
💡
Dummy Variable Interpretation
The Single coefficient (-57.7%) compares Single plan holders
to the reference group (Family plan holders), holding Age and Income constant.
This is how we handle categorical predictors.
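If the plan type arrives as text rather than 0/1, you create the dummy yourself (or let R's factor handling do it). A minimal sketch, assuming a hypothetical column PlanType with values "Single" and "Family":

```r
# Hypothetical data frame standing in for Membership.xlsx
members <- data.frame(PlanType = c("Single", "Family", "Single"))

# Dummy variable: 1 for Single, 0 for the reference group (Family)
members$Single <- ifelse(members$PlanType == "Single", 1, 0)

members$Single  # 1 0 1
```

Family is the reference group here because it is coded 0; the model's Single coefficient then measures the difference from Family.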
Phase IV • Evaluation
We've Built Models. But How Good Are They?
So far we've built 4 models and interpreted their coefficients. Now the critical question.
🤔
"If I use this model on new emails tomorrow,
will it still work?"
We don't know yet, because we've only tested on data the model already saw.
✓
What We've Done
Built models, read output, interpreted coefficients, made predictions
⚠️
What We Haven't Done
Tested whether our model works on data it hasn't seen
🔜
What's Next
Learn the holdout method, then apply it in Example 5
Phase IV • Evaluation
The Overfitting Trap
Your model looks great, but can you trust the accuracy score?
📝
The Answer-Key Problem
Imagine studying for an exam using the exact answer key.
You'd score 95%. But on a new exam? Maybe 60%.
Testing a model on its own training data is the same mistake.
Two Spam Models: Which Is Better?
Model A
Training accuracy: 92%
Accuracy on new data: 68%
📉 24-point drop: overfitting!
Model B
Training accuracy: 81%
Accuracy on new data: 79%
✓ Only 2-point drop: reliable!
Quick Check
Your company needs to deploy a spam filter. Based on the numbers above, which model should you choose?
✓
The Rule
Never evaluate a model on the data it was trained on. Always test on
unseen data.
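A minimal sketch of the split itself, using simulated stand-in data rather than the real spam file (the names spam, train, and valid are placeholders):

```r
set.seed(123)  # make the random split reproducible

# Simulated stand-in for the real dataset (200 rows)
spam <- data.frame(Spam = rbinom(200, 1, 0.4),
                   Recipients = rpois(200, 10))

# 75% of row indices go to training; the rest form the validation set
train_rows <- sample(nrow(spam), size = 0.75 * nrow(spam))
train <- spam[train_rows, ]
valid <- spam[-train_rows, ]

nrow(train)  # 150
nrow(valid)  # 50
```

The negative index spam[-train_rows, ] is the key trick: it guarantees the two sets share no rows.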
Phase IV โข Evaluation
The Holdout Method
Split your data once. Train on one part, test on the other.
# Step 2: Train
model <- glm(Spam ~ Recipients + Hyperlinks,
  family = binomial, data = train)
# Step 3: Predict on validation set
preds <- predict(model, newdata = valid,
  type = "response")
Practice
Complete the Code
You've trained a model on the training set. Now you need to make predictions on the validation set. Fill in the blanks:
preds <- predict(model, newdata = , type = )
Hints
Which dataset should the model predict on โ the one it trained on, or the one it hasn't seen?
What type argument converts log-odds to probability?
💡
Why This Matters for Business
If you skip validation and deploy Model A from the previous slide, your spam filter would
misclassify about 32% of emails in production. The holdout method
catches this before deployment. It takes 3 lines of R code and could save the project.
Phase V • R Lab: Validation
Let's Apply This in R
Now that we understand why we validate, let's do it with our spam data.
The Plan for Example 5
1
Split
75% train, 25% test
2
Train 2 Models
3 vars vs 2 vars
3
Predict
On validation set
4
Compare
Accuracy, sensitivity, specificity
💡
Open RStudio
We're going back to R. Same spam dataset, but this time we'll split it and prove which model
actually works better on new data.
Phase III • R Lab
Example 5: The Holdout Method
Back to spam data. Let's test how well our model works on new data.
This cutoff is a choice! We'll explore how changing it affects results later (threshold
tuning).
โ
Key Detail
Notice we train on TData but predict on
VData. The model never sees the validation
data during training. That's the whole point!
Phase III • R Lab
Example 5: Comparing Two Models
Does adding Characters as a predictor actually help?
Train Model 2 (only 2 predictors)
Model2 <- glm(
  Spam ~ Recipients + Hyperlinks,
  family = binomial, data = TData)
pHat2 <- predict(Model2, VData,
  type = "response")
yHat2 <- ifelse(pHat2 >= 0.5, 1, 0)
100 * mean(VData$Spam == yHat2)
Model 1
Recipients + Hyperlinks + Characters
Validation Accuracy
68.0%
Model 2
Recipients + Hyperlinks only
Validation Accuracy
65.6%
Model 1 wins by 2.4 percentage points. Adding Characters improves
predictions on unseen data.
Quick Check
If Model 2 had a HIGHER training accuracy than Model 1 (e.g., 75% vs 72%), but Model 1 has higher VALIDATION accuracy (68% vs 65.6%), which model should you deploy?
Phase III • R Lab
Example 5: Beyond Accuracy
Accuracy tells us the overall score. But where is the model making mistakes?
Calculate Sensitivity & Specificity
# True Positives: predicted spam AND actually spam
yTP1 <- ifelse(yHat1 == 1 & VData$Spam == 1, 1, 0)
# True Negatives: predicted not-spam AND actually not-spam
yTN1 <- ifelse(yHat1 == 0 & VData$Spam == 0, 1, 0)
# Sensitivity: Of actual spam, how many did we catch?
100 * (sum(yTP1) / sum(VData$Spam == 1))
# Specificity: Of actual non-spam, how many did we correctly ID?
100 * (sum(yTN1) / sum(VData$Spam == 0))
Model 1 vs Model 2: Full Scorecard
Metric
Model 1 (3 vars)
Model 2 (2 vars)
What It Measures
Accuracy
68.0%
65.6%
Overall correct predictions
Sensitivity
72.1%
67.2%
Of actual spam, how many caught?
Specificity
68.8%
64.1%
Of actual non-spam, how many identified?
Sensitivity = 72.1%
Of all the actual spam emails, Model 1 correctly
flagged 72.1% of them.
The remaining 27.9% of spam slipped through. 😬
Specificity = 68.8%
Of all the actual non-spam emails, Model 1
correctly identified 68.8% as safe.
31.2% of real emails were wrongly flagged as spam. 😬
✓
Model 1 Wins Across the Board
Adding Characters as a predictor improved all three metrics. Accuracy +2.4
points, Sensitivity +4.9 points, Specificity +4.7 points. The variable earns its spot in the
model.
Phase IV • The Decision
The Confusion Matrix
Accuracy sounds good, but it can hide costly mistakes.
⚠️
Accuracy is a Lie
If 99% of emails are NOT spam, a model that says "Nothing is spam" gets 99%
accuracy but catches zero spam.
Interactive Confusion Matrix
Actual →
Positive (1)
Negative (0)
Predicted ↓
METRICS
Accuracy: 68.8%
(TP + TN) / Total
Precision: 66.7%
TP / (TP + FP): "Of predicted positives, how many were correct?"
Recall (Sensitivity): 72.1%
TP / (TP + FN): "Of actual positives, how many did we catch?"
F1 Score: 69.3%
Harmonic mean of Precision & Recall
Specificity: 65.6%
TN / (TN + FP): "Of actual negatives, how many did we correctly identify?"
Phase IV • The Decision
The Threshold "Volume Knob"
Your model outputs probabilities. You decide where to draw the line.
Threshold Control
Move the cutoff and watch precision/recall change.
Threshold: 0.50
Accuracy: 80.0%
Precision: 80.0%
Recall: 80.0%
TP: 8 · FP: 2
TN: 8 · FN: 2
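The metrics above can be reproduced by hand from the four confusion-matrix counts, here using the TP = 8, FP = 2, TN = 8, FN = 2 shown at this threshold:

```r
TP <- 8; FP <- 2; TN <- 8; FN <- 2

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 0.80: all correct calls
precision   <- TP / (TP + FP)                   # 0.80: of predicted positives
recall      <- TP / (TP + FN)                   # 0.80: of actual positives (sensitivity)
specificity <- TN / (TN + FP)                   # 0.80: of actual negatives
f1          <- 2 * precision * recall / (precision + recall)  # 0.80
```

Plug in your own counts and every metric updates; that is all the interactive widget is doing.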
Phase IV • The Decision
When to Prioritize What?
The right choice depends on the cost of each error.
🏥
Prioritize Recall
Missing a cancer case could be fatal. Use a lower threshold.
📧
Prioritize Precision
Blocking a real email is costly. Use a higher threshold.
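The tradeoff shows up even on a tiny made-up example (hypothetical probabilities and labels below): lowering the cutoff catches more true positives (recall up) at the cost of more false alarms (precision down).

```r
# Hypothetical predicted probabilities and true labels
p_hat <- c(0.95, 0.80, 0.65, 0.45, 0.30, 0.10)
y     <- c(1,    1,    0,    1,    0,    0)

recall_at <- function(cut) {
  y_hat <- ifelse(p_hat >= cut, 1, 0)
  sum(y_hat == 1 & y == 1) / sum(y == 1)      # of actual positives, caught
}
precision_at <- function(cut) {
  y_hat <- ifelse(p_hat >= cut, 1, 0)
  sum(y_hat == 1 & y == 1) / sum(y_hat == 1)  # of flagged, truly positive
}

recall_at(0.7);    recall_at(0.4)     # 2/3 -> 3/3: lower cutoff catches more
precision_at(0.7); precision_at(0.4)  # 1.00 -> 0.75: but flags more innocents
```

There is no free lunch here: the threshold is a business decision, not a statistics decision.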