Each week builds on the last. You are not starting over; you are adding tools.
Linear model
Assumptions
Interpretation
Quadratic
Log models
Elasticity
Odds
Sigmoid
Confusion
Distance
Probabilities
Lift · ROC
Every predictive model you build follows this same cycle.
Section 1 · The Bridge
So why learn something new?
A dataset contains flowers with petal length, petal width, and sepal length.
The species could be Setosa, Versicolor, or Virginica. Can logistic regression handle this as-is?
Section 2 · The Intuition
No formula learned. No training phase. Just similarity.
The Neighbor Analogy
You move to a new city. You don't know anyone. You need to pick a neighborhood.
You look at who already lives there. If the people nearby are similar to you, you figure it's probably a good fit.
kNN does exactly this. A new data point looks at its nearest neighbors and goes with the majority.
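The neighbor vote can be sketched in a few lines of base R. This is a minimal illustration, not the course's implementation: the toy points, the labels "A" and "B", and the helper name `knn_vote` are all hypothetical.

```r
# Hypothetical toy data: two already-scaled features and a known class for six points
train_x <- matrix(c(1.0, 1.2,
                    0.9, 1.0,
                    1.1, 0.8,
                    3.0, 3.2,
                    3.1, 2.9,
                    2.8, 3.0),
                  ncol = 2, byrow = TRUE)
train_y <- c("A", "A", "A", "B", "B", "B")

# Hypothetical helper: classify one new point by majority vote of its k nearest neighbors
knn_vote <- function(new_point, x, y, k = 3) {
  # Euclidean distance from the new point to every stored point
  dists <- sqrt(rowSums(sweep(x, 2, new_point)^2))
  # Labels of the k closest stored points
  nearest <- y[order(dists)[1:k]]
  # Most frequent label among those neighbors
  names(which.max(table(nearest)))
}

knn_vote(c(1.0, 1.0), train_x, train_y, k = 3)
```

Nothing is fitted in advance here: the stored data is the model, and all the work happens when a new point arrives.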
Section 3 · Why This Matters
Let's practice identifying classification problems, and catch a trap.
Spotify wants to sort a new song into a genre.
Call out your answers:
Target?
What are we predicting?
Features?
What inputs do we have?
Type?
Binary? Multi-class?
Section 4 · Visual Walkthrough
Step through the algorithm one stage at a time.
Past Open House Attendees
Here's the historical data.
Each dot is a past attendee. Blue enrolled, red didn't. The axes represent Hours (x) and Income scaled (y).
Skip this step and kNN silently ignores most of your features.
Raw Data: Two Gym Attendees
| Feature | Person A | Person B | Difference | % of Total Distance |
|---|---|---|---|---|
| Age | 25 years | 30 years | 5 | 0.0% |
| Income | $45,000 | $80,000 | 35,000 | 100.0% |
| Hours | 3 hrs | 5 hrs | 2 | 0.0% |
Euclidean Distance Calculation
d = √(5² + 35,000² + 2²) = √(25 + 1,225,000,000 + 4) ≈ 35,000
Income accounts for more than 99.99% of the distance.
Age and Hours are effectively invisible. kNN is making predictions based on one feature only, even though you gave it three.
The rule: Always scale your features before running kNN. Two lines of R code. No exceptions.
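The arithmetic can be checked directly in R. The first part uses only the two attendees from the table. The second part is a sketch that adds two made-up attendees (an assumption, included only so `scale()` has some spread to work with) to show how the picture changes after standardizing.

```r
# Values taken from the two-attendee table
person_a <- c(Age = 25, Income = 45000, Hours = 3)
person_b <- c(Age = 30, Income = 80000, Hours = 5)

# Each feature's squared difference, the raw distance, and each feature's share of it
sq_diff  <- (person_a - person_b)^2
raw_dist <- sqrt(sum(sq_diff))
share    <- 100 * sq_diff / sum(sq_diff)

raw_dist
share

# ASSUMPTION: rows 3 and 4 are hypothetical attendees, not from the course data
gym_toy <- data.frame(Age    = c(25, 30, 41, 52),
                      Income = c(45000, 80000, 62000, 38000),
                      Hours  = c(3, 5, 1, 8))
gym_scaled <- scale(gym_toy)  # center each column at 0 with standard deviation 1

# Same comparison between persons A and B, now on the standardized values
sq_diff_scaled <- (gym_scaled[1, ] - gym_scaled[2, ])^2
share_scaled   <- 100 * sq_diff_scaled / sum(sq_diff_scaled)
round(share_scaled, 1)
```

After scaling, Age and Hours contribute a visible share of the distance instead of rounding to zero.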
| Model | Output | Assumption | Interpretability |
|---|---|---|---|
| Regression | Number (continuous) | Linear relationship | High (coefficients) |
| Logistic | Probability (0 to 1) | Linear log-odds | Medium (odds ratios) |
| kNN | Class label | Similarity by distance | Low (neighbors vote) |
If you don't standardize, one big feature dominates all distance.
Every new point compares to every old point. It scales poorly.
It doesn't explain "why." It just votes.
The naive rule is the bar we have to clear. If kNN can't beat it, it adds no value.
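As a rough sketch of that benchmark: the naive rule predicts the most common class for everyone, so its accuracy is simply the share of the majority class. The outcome vector below is hypothetical, not the gym data.

```r
# Hypothetical outcome vector: 1 = enrolled, 0 = did not enroll
enroll <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)

# The naive rule ignores all features and always predicts the majority class
majority_class <- names(which.max(table(enroll)))
naive_accuracy <- mean(enroll == as.numeric(majority_class))

majority_class
naive_accuracy
```

Any model's validation accuracy is worth reading against this number rather than against zero.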
Before anything else, look at what you have. Confirm columns, types, and size.
R Code
# Load the readxl package, which provides read_excel()
library(readxl)
# Load the data
Gym <- read_excel("Data-1/Gym.xlsx")
# See the first few rows
head(Gym)
Scale the features so no single variable dominates. Then split into training and validation sets.
Standardize Features
# Standardize predictor columns (Age, Income, Hours)
Gym1 <- scale(Gym[2:4])
# Reattach target variable
Gym1 <- data.frame(Gym1, Gym$Enroll)
# Rename the target column
colnames(Gym1)[4] <- 'Enroll'
# Convert target to factor (category, not number)
Gym1$Enroll <- as.factor(Gym1$Enroll)
Gym[2:4]: Columns 2, 3, 4 (Age, Income, Hours). Excludes Member (col 1) and Enroll (col 5).
as.factor(): Tells R that Enroll is a category (0 or 1), not a number.
Common Bug
Make sure you convert Enroll on Gym1, not Gym.
Three decisions: How to evaluate? What k values? Then train.
How do we evaluate the model during training without touching the validation set?
myCtrl <- trainControl(method = "cv", number = 10)
10-Fold Cross-Validation
… continues for all 10 rounds. Each fold gets a turn as the test set.
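The rotation itself can be sketched in base R. This is only an illustration of the idea; it is not how `trainControl()` is implemented, and the row count of 50 is a made-up number.

```r
set.seed(1)
n       <- 50   # hypothetical number of training rows
k_folds <- 10

# Randomly assign each row to one of 10 equally sized folds
fold_id <- sample(rep(seq_len(k_folds), length.out = n))

for (fold in seq_len(k_folds)) {
  test_rows  <- which(fold_id == fold)   # this fold is held out for scoring
  train_rows <- which(fold_id != fold)   # the other nine folds are used for fitting
  # A model would be fit on train_rows and scored on test_rows here
}

table(fold_id)
```

Every row is used for testing exactly once and for fitting nine times, which is why the validation set can stay untouched.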
The model made predictions. How good are they?
Predict & Evaluate
# Apply model to validation set
KNN_Class <- predict(KNN_fit, newdata = validationSet)
# Convert Enroll to factor for comparison
validationSet$Enroll <- as.factor(validationSet$Enroll)
# Create confusion matrix
confusionMatrix(KNN_Class, validationSet$Enroll, positive = '1')
The default 0.5 threshold is a choice, not a law. Different cutoffs produce different tradeoffs.
Get Probabilities Instead of Class Labels
# Predict probabilities, not just 0/1
KNN_Class_prob <- predict(KNN_fit, newdata = validationSet, type = 'prob')
# View the first few rows
head(KNN_Class_prob)
Key insight: Instead of a hard 0/1, we now see a probability for each person. The default rule: if column 2 > 0.5, predict "Enroll." But 0.5 is arbitrary.
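To see the tradeoff, here is a small sketch that applies three cutoffs to the same probabilities. The probabilities, the outcomes, and the helper `rates_at` are hypothetical; they stand in for column 2 of the probability output and the true Enroll values.

```r
# Hypothetical predicted probabilities of enrolling and the true outcomes
prob_enroll <- c(0.90, 0.80, 0.65, 0.55, 0.45, 0.30, 0.20, 0.10)
actual      <- c(1,    1,    1,    0,    1,    0,    0,    0)

# Hypothetical helper: sensitivity and specificity at a given cutoff
rates_at <- function(cutoff) {
  predicted <- as.numeric(prob_enroll > cutoff)
  c(cutoff      = cutoff,
    sensitivity = sum(predicted == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(predicted == 0 & actual == 0) / sum(actual == 0))
}

cutoff_table <- t(sapply(c(0.25, 0.50, 0.75), rates_at))
cutoff_table
```

Lowering the cutoff catches more true enrollees at the cost of more false alarms; raising it does the opposite.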
Three visualizations that answer business questions, not just statistical ones.
Business question: "If I can only contact a fraction of people, which fraction should I pick?"
Generate Gains Table
# Convert Enroll back to numeric for gains package
validationSet$Enroll <- as.numeric(as.character(validationSet$Enroll))
# Generate gains table
gains_table <- gains(validationSet$Enroll, KNN_Class_prob[, 2])
# Plot cumulative lift chart
plot(
  c(0, gains_table$cume.pct.of.total * sum(validationSet$Enroll))
  ~ c(0, gains_table$cume.obs),
  xlab = "# of cases", ylab = "Cumulative",
  main = "Cumulative Lift Chart", type = "l")
# Add baseline (random selection)
lines(
  c(0, sum(validationSet$Enroll))
  ~ c(0, dim(validationSet)[1]),
  col = "red", lty = 2)
The model is trained and evaluated. Now use it on people we've never seen, which is the whole point of building a model.
Load the Scoring Data
# New people from a recent open house (no Enroll column)
myScoreData <- read_excel("Data-1/myScoreData.xlsx")
head(myScoreData)
The full kNN pipeline, end to end.
Model at a Glance
5
Best k
via 10-fold CV
84.0%
Accuracy
on validation set
0.953
AUC
excellent discrimination
82.0%
Sensitivity
at default 0.5 cutoff
Five things to remember from this session.
kNN asked "who are your neighbors?" Naive Bayes asks "what does the evidence say?"
Same goal: classify an observation. Completely different logic.
This is the fastest way to feel why priors matter.
A medical test is 95% accurate.
You test positive.
What is the probability you actually have the condition?
Naive Bayes is a classification method based on probability.
It predicts which class is most likely after examining the evidence. Given what we know about a person, it calculates the probability of each class and picks the winner.
Outputs actual probabilities, not just labels
Trains almost instantly, even on large datasets
Surprisingly strong for many real problems
Because it makes a simplifying assumption that is almost never true, and still works anyway.
Naive Bayes assumes that all predictor variables are independent of each other, given the class. It treats each piece of evidence separately.
Features are almost always correlated.
Then multiply all pieces together.
Hint: Think of two variables where knowing one gives you information about the other.
One equation. Four pieces. This is the engine behind the model.
Starting belief before seeing the record. How common is this class overall?
If this class were true, how likely would we see these features?
The updated probability after seeing the evidence. This is the prediction.
A normalizing constant. In practice we just compare class scores.
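For reference, the four pieces described above fit together in the standard statement of Bayes' theorem. The symbols $C$ (the class) and $x$ (the observed features) are notation chosen here, not taken from the slides.

```latex
\underbrace{P(C \mid x)}_{\text{posterior}}
  = \frac{\overbrace{P(x \mid C)}^{\text{likelihood}} \;
          \overbrace{P(C)}^{\text{prior}}}
         {\underbrace{P(x)}_{\text{evidence (normalizing constant)}}}
```

Because $P(x)$ is the same for every class, comparing $P(x \mid C)\,P(C)$ across classes is enough to pick the winner.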
Think of Bayes as an updating machine. You feed it a starting belief and evidence, and it hands back an updated belief.
Prior: It rains 30% of days in this city. Before looking outside, P(Rain) = 0.30.
Likelihood: You see dark clouds. On rainy days, dark clouds appear 90% of the time.
Posterior: After seeing clouds, P(Rain | Clouds) is now much higher than 30%.
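The update can be worked through numerically. The example gives the prior and the chance of clouds on rainy days, but not the chance of clouds on dry days, so the 0.20 below is an assumption made only to complete the arithmetic.

```r
prior_rain          <- 0.30  # from the example: it rains on 30% of days
p_clouds_given_rain <- 0.90  # from the example: dark clouds on 90% of rainy days
p_clouds_given_dry  <- 0.20  # ASSUMPTION: not stated in the example

# Overall chance of seeing dark clouds (the normalizing constant)
p_clouds <- p_clouds_given_rain * prior_rain +
  p_clouds_given_dry * (1 - prior_rain)

# Bayes' theorem: updated belief after seeing clouds
posterior_rain <- p_clouds_given_rain * prior_rain / p_clouds
posterior_rain
```

Under that assumed value the belief in rain rises from 0.30 to roughly 0.66; a different value for cloudy dry days would move the posterior, which is why both likelihoods matter.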
This is the shortcut that makes Naive Bayes practical.
Estimate the joint probability of every possible combination of 5 features. With small data, most combinations have zero observations.
Five separate, small frequency tables. Easy to estimate even with modest data.
The decision rule: pick the class that maximizes P(class) × P(feature 1 | class) × … × P(feature 5 | class).
This is where the theorem stops feeling abstract and starts feeling usable.
We are trying to estimate:
Given what we know about a person, what is the probability they are living in poverty?
| Bayes term | In our problem | Where it comes from |
|---|---|---|
| Prior | Percent of training rows in poverty | Count Poverty=1 divided by total rows |
| Likelihoods | P(College | Pov=1), P(Married | Pov=1), etc. | Feature frequencies within each class |
| Posterior | Final probability for a new person | Prior times all likelihoods, then compare classes |
Before code, we slow down and classify one person by hand.
| Person | Education | Marital Status | Poverty |
|---|---|---|---|
| 1 | College | Married | 0 |
| 2 | College | Married | 0 |
| 3 | High School | Single | 1 |
| 4 | High School | Single | 1 |
| 5 | College | Single | 0 |
| 6 | High School | Married | 1 |
New person arrives:
Education = High School | Marital Status = Single
Should we classify them as Poverty = 0 or Poverty = 1?
Hint: Look for repeated patterns in the tiny table.
Half in poverty, half not. Equal starting point.
| Feature | P(feature | Poverty=1) | P(feature | Poverty=0) |
|---|---|---|
| High School | 3/3 = 1.00 | 0/3 = 0.00 ⚠️ |
| Single | 2/3 = 0.67 | 1/3 = 0.33 |
P(High School | Poverty = 0) = 0. No one in the non-poverty group had High School in our tiny dataset.
One missing combination wipes out an entire class. We need a fix.
Add 1 to every count. No probability is ever exactly zero.
| Feature | P(f | Pov=1) | P(f | Pov=0) |
|---|---|---|
| High School | (3+1)/(3+2) = 4/5 = 0.80 | (0+1)/(3+2) = 1/5 = 0.20 |
| Single | (2+1)/(3+2) = 3/5 = 0.60 | (1+1)/(3+2) = 2/5 = 0.40 |
Classification: Poverty = 1
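The hand calculation can be reproduced in base R from the six-person table. This is a sketch of the arithmetic only, not the model the course trains later; the helper name `smoothed` and the shortened column name `Marital` are made up for this example.

```r
# The six-person training table
toy <- data.frame(
  Education = c("College", "College", "High School", "High School", "College", "High School"),
  Marital   = c("Married", "Married", "Single", "Single", "Single", "Married"),
  Poverty   = c(0, 0, 1, 1, 0, 1)
)

# Hypothetical helper: P(feature == value | Poverty == cls) with add-one smoothing
smoothed <- function(feature, value, cls) {
  in_class <- toy$Poverty == cls
  n_levels <- length(unique(toy[[feature]]))   # 2 levels for each feature here
  (sum(toy[[feature]][in_class] == value) + 1) / (sum(in_class) + n_levels)
}

# Score each class for the new person: prior times the two smoothed likelihoods
score <- sapply(c("0", "1"), function(cls) {
  cls_num <- as.numeric(cls)
  mean(toy$Poverty == cls_num) *
    smoothed("Education", "High School", cls_num) *
    smoothed("Marital", "Single", cls_num)
})

score               # unnormalized class scores
score / sum(score)  # scores rescaled to sum to 1
```

The scores come out to 0.5 × 0.80 × 0.60 = 0.24 for Poverty = 1 and 0.5 × 0.20 × 0.40 = 0.04 for Poverty = 0, which matches the classification above.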
We are using census data. The goal: classify whether an individual is living in poverty.
Start by understanding the outcome and the available predictors.
library(readxl)
library(caret)
library(pROC)
library(gains)
Census_Data <- read_excel("Data-1/Census_Data.xlsx")
head(Census_Data)
Three preparation steps before we train.
Convert the target, create train/validation sets, and define resampling.
Census_Data$Poverty <- as.factor(Census_Data$Poverty)
set.seed(1)
myIndex <- createDataPartition(Census_Data$Poverty, p = 0.6, list = FALSE)
trainSet <- Census_Data[myIndex, ]
validationSet <- Census_Data[-myIndex, ]
myCtrl <- trainControl(method = "cv", number = 10)
Once the idea is clear, the training code is surprisingly short.
Use all predictors and evaluate with cross-validation.
set.seed(1)
nb_fit <- train(Poverty ~ .,
                data = trainSet,
                method = "nb",
                trControl = myCtrl)
nb_fit
This is the first moment where we judge whether the model is useful.
nb_class <- predict(nb_fit, newdata = validationSet)
confusionMatrix(nb_class, validationSet$Poverty, positive = "1")
| Metric | Meaning |
|---|---|
| Accuracy | Overall percent correct |
| Sensitivity | How many true poverty cases did we catch? |
| Specificity | How many non-poverty cases did we correctly reject? |
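Each metric is a simple ratio of confusion-matrix counts. The counts below are hypothetical and are not output from the census model.

```r
# Hypothetical confusion-matrix counts (positive class = poverty)
tp <- 40;  fn <- 10   # actual poverty cases: caught vs. missed
tn <- 130; fp <- 20   # actual non-poverty cases: correctly rejected vs. false alarms

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # overall percent correct
sensitivity <- tp / (tp + fn)                   # share of true poverty cases caught
specificity <- tn / (tn + fp)                   # share of non-poverty cases correctly rejected

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```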
This is the shift from classification to decision support.
nb_class_prob <- predict(nb_fit, newdata = validationSet, type = "prob")
head(nb_class_prob)
Barely over the line
Low confidence
Very likely in poverty
High confidence
A cutoff is not a fact of nature. It is a policy choice.
Higher thresholds make the model more conservative.
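# Fix the levels explicitly so the factor matches Poverty even if no one clears the cutoff
nb_class_075 <- factor(ifelse(nb_class_prob[, 2] > 0.75, "1", "0"), levels = c("0", "1"))
confusionMatrix(nb_class_075, validationSet$Poverty, positive = "1")
Lift matters when you can only act on the highest-priority cases.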
Lift matters when you have limited resources and can only intervene on the top cases.
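# gains() expects a numeric 0/1 response, so convert the factor first
poverty_num <- as.numeric(as.character(validationSet$Poverty))
lift_obj <- gains(poverty_num, nb_class_prob[, 2])
# depth = cumulative % of cases targeted; cume.pct.of.total = cumulative share of positives captured
plot(c(0, lift_obj$depth),
     c(0, lift_obj$cume.pct.of.total * 100),
     type = "l",
     xlab = "Percent of population targeted",
     ylab = "Percent of positives captured")
ROC gives us one last view before we leave the lab: overall ranking quality across thresholds.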
AUC answers: how well does the model separate the two classes overall?
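One way to read AUC is as the chance that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. The sketch below computes it that way by brute force on hypothetical values; it illustrates the interpretation and is not how the pROC package computes it internally.

```r
# Hypothetical predicted probabilities and true outcomes
prob   <- c(0.90, 0.80, 0.65, 0.55, 0.45, 0.30, 0.20, 0.10)
actual <- c(1,    1,    1,    0,    1,    0,    0,    0)

pos <- prob[actual == 1]
neg <- prob[actual == 0]

# Compare every positive with every negative; a tie counts as half a win
wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc_by_hand <- mean(wins)
auc_by_hand
```

Here the positives win 15 of the 16 pairings. A value of 0.5 would mean the ranking is no better than chance.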
roc_obj <- roc(validationSet$Poverty, nb_class_prob[, 2])
plot(roc_obj, col = "blue")
auc(roc_obj)
The end goal is not just fitting a model. It is making decisions about new records.
Classification models create decision support for unseen records.
new_people <- data.frame(
Age = c(22, 47),
Education = c("High School", "College"),
Sex = c("Male", "Female"),
Ethnicity = c("GroupA", "GroupB"),
MaritalStatus = c("Single", "Married")
)
predict(nb_fit, newdata = new_people, type = "prob")
Week 4 lands best when we compare the mindsets, not just the formulas.
| Model | How it thinks | Best use case | Tradeoff |
|---|---|---|---|
| Logistic | Explain probability with coefficients | Interpretability matters | May miss nonlinear local patterns |
| kNN | Compare to similar neighbors | Local similarity drives behavior | Slow prediction, needs scaling |
| Naive Bayes | Combine evidence probabilistically | Categorical and text-like predictors | Assumes independence |
The best model is not just the one with the best score. It is the one that makes the best decisions for the context.
"A model is only as useful as the decision it improves."
Hint: Think about ranking limited resources.
If students remember these five ideas, they will remember the method.