BANA Learning Hub

Our Journey So Far

Each week builds on the last. You are not starting over — you are adding tools.

📈

Week 1

Regression

Linear model

Assumptions

Interpretation

“What drives Y?”

🧮

Week 2

Nonlinear

Quadratic

Log models

Elasticity

“Is it curved?”

🔀

Week 3

Logistic

Odds

Sigmoid

Confusion

“Yes or no?”

🎯

Week 4

kNN + Naive Bayes

Distance

Probabilities

Lift · ROC

“Which class?”

The Learning Loop

Every predictive model you build follows this same cycle.

🎯

Define the ask

▼

🔍

Explore the data

▼

⚙️

Prepare to model

▼

🧠

Model and train

▼

📊

Evaluate and decide

🔄 iterate


The steps stay the same — your tools get sharper

Section 1 · The Bridge

You Already Know Logistic Regression.

So why learn something new?

🗳️ Quick Poll

A dataset contains flowers with petal length, petal width, and sepal length.

The species could be Setosa, Versicolor, or Virginica. Can logistic regression handle this as-is?

Section 2 · The Intuition

What if the model just… looked around?

No formula learned. No training phase. Just similarity.

🏙️

The Neighbor Analogy

You move to a new city. You don't know anyone. You need to pick a neighborhood.

You look at who already lives there. If the people nearby are similar to you, you figure it's probably a good fit.

kNN does exactly this. A new data point looks at its nearest neighbors and goes with the majority.
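The neighbor vote can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the course's implementation: the helper name `knn_vote`, the toy points, and the choice of k = 3 are all made up for the example.

```r
# Minimal kNN sketch in base R (hypothetical helper, toy data)
knn_vote <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new point to every stored point
  diffs <- sweep(as.matrix(train_x), 2, new_x)
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest points
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote: the most frequent label wins (ties go to the first label)
  names(which.max(table(nearest)))
}

# Toy data: two features, two classes (made up for illustration)
train_x <- data.frame(x1 = c(1, 1.2, 0.8, 5, 5.2, 4.8),
                      x2 = c(1, 0.9, 1.1, 5, 5.1, 4.9))
train_y <- c("A", "A", "A", "B", "B", "B")

knn_vote(train_x, train_y, new_x = c(1.1, 1.0), k = 3)  # the three nearest points are all "A"
```

Notice that nothing is "fitted" here: the function just stores the data and measures distances at prediction time.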

Section 3 · Why This Matters

Where kNN Shows Up in the Real World

Let's practice identifying classification problems — and catch a trap.

⚡ Rapid Fire — Scenario 1 of 4

🎵

Spotify wants to sort a new song into a genre.

Call out your answers:

Target?

What are we predicting?

Features?

What inputs do we have?

Type?

Binary? Multi-class?

Section 4 · Visual Walkthrough

Watch kNN Make a Prediction

Step through the algorithm one stage at a time.

Past Open House Attendees

Enrolled · Didn't

Here's the historical data.

Each dot is a past attendee. Blue enrolled, red didn't. The axes represent Hours (x) and Income scaled (y).

Critical

Feature Scaling

Skip this step and kNN silently ignores most of your features.

Raw Data — Two Gym Attendees

Feature	Person A	Person B	Difference	% of Total Distance
Age	25 years	30 years	5	0.0%
Income	$45,000	$80,000	35,000	100.0%
Hours	3 hrs	5 hrs	2	0.0%

Euclidean Distance Calculation

d = √(5² + 35,000² + 2²) = √(25 + 1,225,000,000 + 4) ≈ 35,000

⚠️

Income accounts for more than 99.99% of the squared distance.

Age and Hours are effectively invisible. kNN is making predictions based on one feature only, even though you gave it three.

The rule: Always scale your features before running kNN. Two lines of R code. No exceptions.
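To see the effect in numbers, here is a small base R sketch using Person A and Person B from the table. The two extra rows are hypothetical, added only so that `scale()` has some spread to standardize; the exact percentages after scaling depend on those made-up rows.

```r
# Person A and Person B from the raw-data table
a <- c(Age = 25, Income = 45000, Hours = 3)
b <- c(Age = 30, Income = 80000, Hours = 5)

sq_diff <- (a - b)^2
sqrt(sum(sq_diff))                        # Euclidean distance, about 35,000
round(100 * sq_diff / sum(sq_diff), 4)    # each feature's share of the squared distance

# Hypothetical extra rows so that scale() has a spread to work with
people <- rbind(a, b,
                c(Age = 40, Income = 60000, Hours = 1),
                c(Age = 35, Income = 52000, Hours = 4))
scaled <- scale(people)                   # each column: mean 0, standard deviation 1

sq_diff_scaled <- (scaled[1, ] - scaled[2, ])^2
round(100 * sq_diff_scaled / sum(sq_diff_scaled), 1)  # all three features now contribute
```

After standardizing, every feature is measured in standard deviations, so no column wins just because its units are large.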

Section 5 · What Makes kNN Different

kNN vs. Logistic vs. Regression

Three models, three very different mindsets.

Model	Output	Assumption	Interpretability
Regression	Number (continuous)	Linear relationship	High (coefficients)
Logistic	Probability (0–1)	Linear log-odds	Medium (odds ratios)
kNN	Class label	Similarity by distance	Low (neighbors vote)

Section 5 · Key Warnings

What kNN Gets Wrong If You’re Careless

Three things you must remember before you trust the output.

⚖️

Scale First

If you don’t standardize, one big feature dominates all distance.

🐢

Slow Prediction

Every new point is compared with every stored point, so prediction slows down as the data grows.

❓

No Coefficients

It doesn’t explain “why.” It just votes.

The naive rule (always predict the most common class) is the bar we have to clear. If kNN can’t beat it, it adds no value.

💡

Transition

Now that you know what can go wrong, let’s go deeper into how we choose k.

Phase II • Tuning

Choosing k

Small k overfits. Large k oversmooths. We use cross-validation to pick.

1

Very noisy

5

Balanced

10

Very smooth
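One way to watch k change the result is to try a few values on a small dataset. The sketch below is an illustration under assumptions, not the course lab: it uses R's built-in `iris` data and `knn.cv()` from the `class` package, which runs leave-one-out cross-validation rather than the 10-fold version used later.

```r
library(class)  # ships with standard R installations; provides knn.cv()

# Built-in iris data: four numeric features, three species
x <- scale(iris[, 1:4])   # standardize first, as kNN requires
y <- iris$Species

set.seed(1)               # knn.cv() breaks voting ties at random
k_values <- c(1, 5, 10)
cv_accuracy <- sapply(k_values, function(k) {
  pred <- knn.cv(train = x, cl = y, k = k)   # leave-one-out predictions
  mean(pred == y)                            # share classified correctly
})
data.frame(k = k_values, accuracy = round(cv_accuracy, 3))
```

The k with the highest cross-validated accuracy is the one to carry forward. On other data the winner may well differ.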

🔍 Learning Loop → Explore

Step 1: Load & Inspect the Data

Before anything else — look at what you have. Confirm columns, types, and size.

R Code

# Load readxl for read_excel()
library(readxl)

# Load the data
Gym <- read_excel("Data-1/Gym.xlsx")

# See the first few rows
head(Gym)

⚙️ Learning Loop → Prepare

Step 2: Prepare the Data

Scale the features so no single variable dominates. Then split into training and validation sets.

Standardize Features

# Standardize predictor columns (Age, Income, Hours)
Gym1 <- scale(Gym[2:4])

# Reattach target variable
Gym1 <- data.frame(Gym1, Gym$Enroll)

# Rename the target column
colnames(Gym1)[4] <- 'Enroll'

# Convert target to factor (category, not number)
Gym1$Enroll <- as.factor(Gym1$Enroll)

Gym[2:4]

Columns 2, 3, 4 — Age, Income, Hours. Excludes Member (col 1) and Enroll (col 5).

as.factor()

Tells R that Enroll is a category (0 or 1), not a number.

⚠️

Common Bug

Make sure you convert Enroll on Gym1, not Gym.

🧠 Learning Loop → Model

Step 3: Train the kNN Model

Three decisions: How to evaluate? What k values? Then train.

How do we evaluate the model during training — without touching the validation set?

library(caret)  # provides trainControl() and train()
myCtrl <- trainControl(method = "cv", number = 10)

10-Fold Cross-Validation

Round 1: test · train

Round 2: train · test · train

Round 3: train · test · train

… continues for all 10 rounds. Each fold gets a turn as the test set.
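The remaining two decisions (which k values to try, then training) can be sketched with caret. Treat this as a hedged outline rather than the lab's exact code: the synthetic `toy` data stands in for the standardized gym data, and the 60/40 split and the grid `k = 1:10` are assumptions. The object names `trainSet`, `validationSet`, and `KNN_fit` are chosen to match the evaluation step.

```r
library(caret)

# Synthetic stand-in for the standardized gym data (made up for illustration)
set.seed(1)
n <- 200
toy <- data.frame(Age = rnorm(n), Income = rnorm(n), Hours = rnorm(n))
toy$Enroll <- as.factor(ifelse(toy$Income + toy$Hours + rnorm(n, sd = 0.5) > 0, 1, 0))

# Hold out 40% for validation (the 60/40 split is an assumption)
myIndex <- createDataPartition(toy$Enroll, p = 0.6, list = FALSE)
trainSet <- toy[myIndex, ]
validationSet <- toy[-myIndex, ]

# 10-fold cross-validation, run on the training set only
myCtrl <- trainControl(method = "cv", number = 10)

# Candidate k values (assumed range), then train
myGrid <- expand.grid(k = 1:10)
KNN_fit <- train(Enroll ~ ., data = trainSet, method = "knn",
                 trControl = myCtrl, tuneGrid = myGrid)
KNN_fit$bestTune   # the k with the best cross-validated accuracy
```

Printing `KNN_fit` lists the cross-validated accuracy for every k in the grid, which is how the best k gets chosen without touching the validation set.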

📊 Learning Loop → Evaluate

Step 4: Evaluate — Confusion Matrix

The model made predictions. How good are they?

Predict & Evaluate

# Apply model to validation set
KNN_Class <- predict(KNN_fit, newdata = validationSet)

# Convert Enroll to factor for comparison
validationSet$Enroll <- as.factor(validationSet$Enroll)

# Create confusion matrix
confusionMatrix(KNN_Class, validationSet$Enroll, positive = '1')

📊 Learning Loop → Evaluate

Step 5: Cutoff Tuning

The default 0.5 threshold is a choice, not a law. Different cutoffs produce different tradeoffs.

Get Probabilities Instead of Class Labels

# Predict probabilities, not just 0/1
KNN_Class_prob <- predict(KNN_fit, newdata = validationSet, type = 'prob')

# View the first few rows
head(KNN_Class_prob)

Key insight: Instead of a hard 0/1, we now see a probability for each person. The default rule: if column 2 > 0.5, predict “Enroll.” But 0.5 is arbitrary.
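To make the tradeoff concrete, the sketch below applies three cutoffs to the same set of probabilities. The probabilities, the true labels, and the helper `cutoff_metrics` are all made up for illustration; in the lab you would use column 2 of `KNN_Class_prob` and the validation labels instead.

```r
# Made-up predicted probabilities of class "1" and the true labels
prob_1 <- c(0.95, 0.80, 0.65, 0.55, 0.45, 0.30, 0.20, 0.10)
actual <- c(1,    1,    0,    1,    1,    0,    0,    0)

# Sensitivity and specificity at a given cutoff (hypothetical helper)
cutoff_metrics <- function(prob, actual, cutoff) {
  pred <- as.integer(prob > cutoff)
  c(cutoff = cutoff,
    sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(pred == 0 & actual == 0) / sum(actual == 0))
}

metrics <- t(sapply(c(0.25, 0.50, 0.75), cutoff_metrics,
                    prob = prob_1, actual = actual))
metrics
```

In this toy example a lower cutoff catches every true positive but raises more false alarms, while a higher cutoff does the opposite. Which side to favor depends on the cost of each mistake.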

📊 Learning Loop → Evaluate

Step 6: Performance Charts

Three visualizations that answer business questions, not just statistical ones.

Business question: “If I can only contact a fraction of people, which fraction should I pick?”

Generate Gains Table

# Load gains for gains()
library(gains)

# Convert Enroll back to numeric for gains package
validationSet$Enroll <- as.numeric(as.character(validationSet$Enroll))

# Generate gains table
gains_table <- gains(validationSet$Enroll, KNN_Class_prob[, 2])

# Plot cumulative lift chart: cumulative enrollees captured vs. cases contacted
plot(
  c(0, gains_table$cume.pct.of.total * sum(validationSet$Enroll))
  ~ c(0, gains_table$cume.obs),
  xlab = "# of cases", ylab = "Cumulative",
  main = "Cumulative Lift Chart", type = "l")

# Add baseline (random selection)
lines(
  c(0, sum(validationSet$Enroll))
  ~ c(0, nrow(validationSet)),
  col = "red", lty = 2)

📊 Learning Loop → Evaluate

Step 7: Score New Records

The model is trained and evaluated. Now use it on people we've never seen — the whole point of building a model.

Load the Scoring Data

# New people from a recent open house — no Enroll column
myScoreData <- read_excel("Data-1/myScoreData.xlsx")
head(myScoreData)
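One point worth checking before scoring: new records need to be on the same scale as the data the model was trained on. The sketch below shows one way to do that in base R, reusing the means and standard deviations that `scale()` stored for the training data. The numbers are made up, and this is an option to consider rather than a statement of how the course lab scales its scoring file.

```r
# Made-up training predictors standing in for Age, Income, Hours
train_raw <- data.frame(Age = c(25, 30, 40, 35),
                        Income = c(45000, 80000, 60000, 52000),
                        Hours = c(3, 5, 1, 4))
train_scaled <- scale(train_raw)

# scale() stores the means and standard deviations it used as attributes
train_means <- attr(train_scaled, "scaled:center")
train_sds <- attr(train_scaled, "scaled:scale")

# Made-up new records, standardized with the TRAINING means and sds
new_raw <- data.frame(Age = c(28, 45), Income = c(50000, 90000), Hours = c(2, 6))
new_scaled <- scale(new_raw, center = train_means, scale = train_sds)
new_scaled
```

The scaled new records can then be passed to `predict()` as `newdata`, after converting them to a data frame with the same column names the model saw in training.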

📊 Learning Loop → Evaluate

Summary — What We Built

The full kNN pipeline, end to end.


Model at a Glance

5

Best k

via 10-fold CV

84.0%

Accuracy

on validation set

0.953

AUC

excellent discrimination

82.0%

Sensitivity

at default 0.5 cutoff

📊 Learning Loop → Evaluate

Key Takeaways

Five things to remember from this session.

Session B · The Bridge

From Distance to Probability

kNN asked "who are your neighbors?" Naive Bayes asks "what does the evidence say?"

📍

kNN (what we just learned)

• Find the k closest data points
• Let them vote on the class
• Distance is the core idea
• No explicit model, just memory

🎲

Naive Bayes (what we learn now)

• Calculate the probability of each class
• Combine several pieces of evidence
• Probability is the core idea
• Explicit probabilistic model

Same goal: classify an observation. Completely different logic.

Session B · The Hook

A Question Before We Start

This is the fastest way to feel why priors matter.

A medical test is 95% accurate.
You test positive.
What is the probability you actually have the condition?

1,000

people tested

1

true positive

~50

false positives
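The counts above can be checked with a short calculation. Two assumptions are baked in, because the slide does not state them outright: the condition affects 1 person in 1,000 (inferred from the single true positive), and "95% accurate" means the test is right 95% of the time for people with the condition and for people without it.

```r
prevalence <- 1 / 1000   # assumption: inferred from 1 true positive per 1,000 tested
sensitivity <- 0.95      # assumption: 95% of people with the condition test positive
specificity <- 0.95      # assumption: 95% of people without it test negative

# Overall chance of a positive test: true positives plus false positives
p_positive <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Chance of having the condition, given a positive test
p_condition_given_positive <- sensitivity * prevalence / p_positive
round(p_condition_given_positive, 3)   # about 0.019, i.e. roughly 2%
```

The answer is tiny because the false positives from the 999 healthy people swamp the one real case. That starting rarity is the prior.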

💡

Transition

Priors give us the starting point. Next we turn that intuition into a formal classification rule.

Session B · Definition

Naive Bayes in Plain English

Naive Bayes is a classification method based on probability.

It predicts which class is most likely after examining the evidence. Given what we know about a person, it calculates the probability of each class and picks the winner.

📊

Probabilistic

Outputs actual probabilities, not just labels

⚡

Fast

Trains almost instantly, even on large datasets

🎯

Effective Baseline

Surprisingly strong for many real problems

Quick Check

How does Naive Bayes decide which class to assign?

Session B · The "Naive" Part

Why Is It Called "Naive"?

Because it makes a simplifying assumption that is almost never true — and still works anyway.

The Assumption

Naive Bayes assumes that all predictor variables are independent of each other, given the class. It treats each piece of evidence separately.

🌎 Reality

• Education and age are related
• Marital status and age are related
• Education and income are related

Features are almost always correlated.

🤖 What the Model Assumes

• Education? I'll look at that alone.
• Age? Separate calculation.
• Marital status? Also separate.

Then multiply all pieces together.

Give an example of two predictors in a dataset that are clearly not independent in real life.

Hint: Think of two variables where knowing one gives you information about the other.


Bayes Concepts · The Formula

Bayes' Theorem

One equation. Four pieces. This is the engine behind the model.

P(Class | Evidence) = P(Evidence | Class) × P(Class) / P(Evidence)

P(Class)

Prior

Starting belief before seeing the record. How common is this class overall?

P(Evidence | Class)

Likelihood

If this class were true, how likely would we see these features?

P(Class | Evidence)

Posterior

The updated probability after seeing the evidence. This is the prediction.

P(Evidence)

Evidence

A normalizing constant. It is the same for every class, so in practice we just compare class scores.

Practice

Complete the Bayes vocabulary

Use the core Bayes terms in the right order.

The ____ is our starting belief, the ____ measures how well the evidence fits a class, and the ____ is the updated probability we use to classify.

Hints

Start with the belief before evidence.
Then use the evidence fit.
End with the updated belief.

Bayes Concepts · The Updating Process

Prior → Evidence → Posterior

Think of Bayes as an updating machine. You feed it a starting belief and evidence, and it hands back an updated belief.

Step 1

Prior

Starting belief

→

Step 2

× Likelihood

Multiply by evidence

→

Step 3

Posterior

Updated belief

🌧️ Weather Analogy

Prior: It rains 30% of days in this city. Before looking outside, P(Rain) = 0.30.

Likelihood: You see dark clouds. On rainy days, dark clouds appear 90% of the time.

Posterior: After seeing clouds, P(Rain | Clouds) is now much higher than 30%.
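To put a number on "much higher", one more ingredient is needed: how often dark clouds appear on days when it does not rain. The analogy does not give that figure, so the 20% below is purely an assumption for illustration.

```r
p_rain <- 0.30                 # prior, from the analogy
p_clouds_given_rain <- 0.90    # likelihood, from the analogy
p_clouds_given_dry <- 0.20     # ASSUMPTION: not stated in the analogy

# Overall chance of seeing dark clouds
p_clouds <- p_clouds_given_rain * p_rain + p_clouds_given_dry * (1 - p_rain)

# Posterior: chance of rain after seeing clouds
p_rain_given_clouds <- p_clouds_given_rain * p_rain / p_clouds
round(p_rain_given_clouds, 3)  # about 0.659 under the assumed 20%
```

Change the assumed 20% and the posterior moves with it, which is exactly why the likelihood for every class matters, not just the one you expect.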

Match each Bayes term to its plain-English meaning

Choose the correct bucket for each term.

Prior

Likelihood

Posterior

Evidence

How common is this class?

How typical is this evidence?

Updated probability

Normalizing constant

Bayes Concepts · The Math of Independence

What Independence Actually Does to the Formula

This is the shortcut that makes Naive Bayes practical.

❌ Without the Assumption

P(Age, Edu, Sex, Eth, Mar | Class)

Estimate the joint probability of every possible combination of 5 features. With small data, most combinations have zero observations.

✅ With the Assumption

P(Age | Class) × P(Edu | Class) × P(Sex | Class) × P(Eth | Class) × P(Mar | Class)

Five separate, small frequency tables. Easy to estimate even with modest data.

The decision rule: pick the class that maximizes

P(Class) × P(x₁ | Class) × P(x₂ | Class) × ... × P(xₙ | Class)

Quick Check

Why does Naive Bayes often still work well even though features are not truly independent?

Bayes Concepts · Our Problem

Connecting Bayes to the Census Dataset

This is where the theorem stops feeling abstract and starts feeling usable.

We are trying to estimate:

P(Poverty = 1 | Age, Education, Sex, Ethnicity, Marital Status)

Given what we know about a person, what is the probability they are living in poverty?

Bayes term	In our problem	Where it comes from
Prior	Percent of training rows in poverty	Count Poverty=1 divided by total rows
Likelihoods	P(College | Pov=1), P(Married | Pov=1), etc.	Feature frequencies within each class
Posterior	Final probability for a new person	Prior times all likelihoods, then compare classes

Quick Check

Before we build the model, which variable do you think will be the strongest single predictor of poverty?

Hand-Worked Example · The Data

Naive Bayes by Hand

Before code, we slow down and classify one person by hand.

Person	Education	Marital Status	Poverty
1	College	Married	0
2	College	Married	0
3	High School	Single	1
4	High School	Single	1
5	College	Single	0
6	High School	Married	1

New person arrives:

Education = High School | Marital Status = Single

Should we classify them as Poverty = 0 or Poverty = 1?

Before calculating anything, what is your gut prediction? Which class wins, and why?

Hint: Look for repeated patterns in the tiny table.


Hand-Worked Example · The Zero Problem

Step-by-Step Calculation

Step 1 — Priors

P(Poverty = 1) = 3/6 = 0.50

P(Poverty = 0) = 3/6 = 0.50

Half in poverty, half not. Equal starting point.

Step 2 — Likelihoods (raw, no smoothing)

Feature	P(feature | Poverty=1)	P(feature | Poverty=0)
High School	3/3 = 1.00	0/3 = 0.00 ⚠️
Single	2/3 = 0.67	1/3 = 0.33

Zero Probability Problem

P(High School | Poverty = 0) = 0. No one in the non-poverty group had High School in our tiny dataset.

0.50 × 0.00 × 0.33 = 0.00

One missing combination wipes out an entire class. We need a fix.
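The same arithmetic can be reproduced in base R from the six-person table. The helper name `raw_score` is made up for this sketch; it simply multiplies the prior by the two unsmoothed likelihoods.

```r
# The six-person table from the hand-worked example
people <- data.frame(
  Education = c("College", "College", "High School", "High School", "College", "High School"),
  Marital   = c("Married", "Married", "Single", "Single", "Single", "Married"),
  Poverty   = c(0, 0, 1, 1, 0, 1)
)

# Raw (unsmoothed) Naive Bayes score for one class: prior x likelihoods
raw_score <- function(cls, edu, mar) {
  in_class <- people$Poverty == cls
  prior <- mean(in_class)                              # share of rows in this class
  p_edu <- mean(people$Education[in_class] == edu)     # P(education | class)
  p_mar <- mean(people$Marital[in_class] == mar)       # P(marital status | class)
  prior * p_edu * p_mar
}

raw_score(1, "High School", "Single")   # 0.50 x 1.00 x 0.67
raw_score(0, "High School", "Single")   # 0.50 x 0.00 x 0.33 = 0
```

The second score is exactly zero no matter what the other features say, which is the problem in a nutshell.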

Quick Check

What goes wrong when one conditional probability is zero?

Hand-Worked Example · Laplace Smoothing

The Fix: Add-1 Smoothing

Laplace Smoothing

Add 1 to every count. No probability is ever exactly zero.

P(feature | class) = (count + 1) / (class total + number of categories)

Recalculated Likelihoods

Feature	P(f | Pov=1)	P(f | Pov=0)
High School	(3+1)/(3+2) = 4/5 = 0.80	(0+1)/(3+2) = 1/5 = 0.20
Single	(2+1)/(3+2) = 3/5 = 0.60	(1+1)/(3+2) = 2/5 = 0.40

Poverty = 1

0.50 × 0.80 × 0.60 = 0.240

Poverty = 0

0.50 × 0.20 × 0.40 = 0.040

Normalize

P(Pov=1 | High School, Single) = 0.240 / (0.240 + 0.040) = 85.7%

P(Pov=0 | High School, Single) = 0.040 / (0.240 + 0.040) = 14.3%

Classification: Poverty = 1

❗

Key Insight

Smoothing does not invent evidence. It prevents rare unseen combinations from destroying the arithmetic.
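Here is the smoothed calculation as a base R sketch, using the same six-person table. The helper name `smoothed_score` is made up for illustration; it applies the add-1 formula to each feature and then multiplies by the prior.

```r
# The six-person table from the hand-worked example
people <- data.frame(
  Education = c("College", "College", "High School", "High School", "College", "High School"),
  Marital   = c("Married", "Married", "Single", "Single", "Single", "Married"),
  Poverty   = c(0, 0, 1, 1, 0, 1)
)

# Add-1 smoothing: (count + 1) / (class total + number of categories)
smoothed_score <- function(cls, edu, mar) {
  in_class <- people$Poverty == cls
  n_class <- sum(in_class)
  p_edu <- (sum(people$Education[in_class] == edu) + 1) /
    (n_class + length(unique(people$Education)))
  p_mar <- (sum(people$Marital[in_class] == mar) + 1) /
    (n_class + length(unique(people$Marital)))
  mean(in_class) * p_edu * p_mar      # prior x smoothed likelihoods
}

scores <- c(pov1 = smoothed_score(1, "High School", "Single"),
            pov0 = smoothed_score(0, "High School", "Single"))
scores                           # 0.24 and 0.04
round(scores / sum(scores), 3)   # normalized: 0.857 and 0.143
```

Neither class is wiped out any more, and the class with the stronger evidence still wins clearly.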

R Lab · Step 1

Load and Inspect the Data

We are using census data. The goal: classify whether an individual is living in poverty.

R Lab Companion

Load libraries and inspect the dataset

Start by understanding the outcome and the available predictors.

R — load libraries and data

library(readxl)
library(caret)
library(pROC)
library(gains)

Census_Data <- read_excel("Data-1/Census_Data.xlsx")
head(Census_Data)

Imports the census dataset and previews the first six rows.


R Lab · Step 2

Factor, Split, and Set Up Cross-Validation

Three preparation steps before we train.

R Lab Companion

Prepare the data for classification

Convert the target, create train/validation sets, and define resampling.

R — prepare the data

Census_Data$Poverty <- as.factor(Census_Data$Poverty)

set.seed(1)
myIndex <- createDataPartition(Census_Data$Poverty, p = 0.6, list = FALSE)
trainSet <- Census_Data[myIndex, ]
validationSet <- Census_Data[-myIndex, ]

myCtrl <- trainControl(method = "cv", number = 10)

Factor the target, protect against overfitting with a validation split, and use cross-validation for a more stable estimate.

Quick Check

Why can't we leave Poverty as a numeric variable?

R Lab · Step 3

Train the Naive Bayes Model

Once the idea is clear, the training code is surprisingly short.

R Lab Companion

Fit the model with caret

Use all predictors and evaluate with cross-validation.

R — fit Naive Bayes via caret

set.seed(1)
nb_fit <- train(Poverty ~ .,
              data = trainSet,
              method = "nb",
              trControl = myCtrl)
nb_fit

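`Poverty ~ .` means use every remaining variable to predict poverty. The model then learns priors and conditional probability tables for each class. Note that caret's `method = "nb"` relies on the klaR package, so klaR must be installed before this runs.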

R Lab · Step 4

Confusion Matrix at Default Cutoff

This is the first moment where we judge whether the model is useful.

R — predict and evaluate

nb_class <- predict(nb_fit, newdata = validationSet)
confusionMatrix(nb_class, validationSet$Poverty, positive = "1")

Metric	Meaning
Accuracy	Overall percent correct
Sensitivity	How many true poverty cases did we catch?
Specificity	How many non-poverty cases did we correctly reject?

Quick Check

If policymakers care most about identifying vulnerable populations, which metric matters most?

R Lab · Step 5

From Class Labels to Probabilities

This is the shift from classification to decision support.

R — predict probabilities

nb_class_prob <- predict(nb_fit, newdata = validationSet, type = "prob")
head(nb_class_prob)

0.51

Barely over the line

Low confidence

0.93

Very likely in poverty

High confidence

Different positive classifications can carry very different levels of confidence. That is what makes probabilities useful for ranking and triage.

R Lab · Step 6

Changing the Cutoff

A cutoff is not a fact of nature. It is a policy choice.

R Lab Companion

Raise the threshold and compare the tradeoff

Higher thresholds make the model more conservative: it flags fewer people as positive, which typically lowers sensitivity and raises specificity.

R — apply a higher threshold

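# Fix the factor levels so they match the reference even if one class is never predicted
nb_class_075 <- factor(ifelse(nb_class_prob[, 2] > 0.75, "1", "0"),
                       levels = levels(validationSet$Poverty))
confusionMatrix(nb_class_075, validationSet$Poverty, positive = "1")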

R Lab · Step 7

Lift: Are We Better Than Random Selection?

Lift matters when you can only act on the highest-priority cases.

R Lab Companion

Rank cases by predicted probability

Lift matters when you have limited resources and can only intervene on the top cases.

R — lift chart

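# gains() needs a numeric 0/1 outcome, so convert the factor back
actual_pov <- as.numeric(as.character(validationSet$Poverty))
lift_obj <- gains(actual_pov, nb_class_prob[, 2])

# depth = cumulative % of cases targeted; cume.pct.of.total = cumulative share of positives
plot(c(0, lift_obj$depth),
     c(0, lift_obj$cume.pct.of.total * 100),
     type = "l",
     xlab = "Percent of population targeted",
     ylab = "Percent of positives captured")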

If the top 20% of scored cases contains much more than 20% of actual poverty cases, the model is doing something valuable.
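The "top 20%" check can also be done by hand, which helps show what the chart is plotting. The ten scores and outcomes below are made up for illustration; with real data you would use the predicted probabilities and the actual 0/1 outcomes from the validation set.

```r
# Made-up scores and outcomes for ten people
prob_1 <- c(0.92, 0.85, 0.77, 0.64, 0.58, 0.41, 0.33, 0.25, 0.12, 0.05)
actual <- c(1,    1,    0,    1,    0,    0,    1,    0,    0,    0)

# Sort people from highest to lowest predicted probability
ranked <- actual[order(prob_1, decreasing = TRUE)]

# Cumulative share of people targeted vs. share of positives captured
pct_targeted <- 100 * seq_along(ranked) / length(ranked)
pct_captured <- 100 * cumsum(ranked) / sum(ranked)

data.frame(pct_targeted, pct_captured)[c(2, 5, 10), ]
```

In this toy example the top 20% of scores contains 50% of the positives, which is 2.5 times what random selection would be expected to find.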

R Lab · Step 8

ROC Curve and AUC

ROC gives us one last view before we leave the lab: overall ranking quality across thresholds.

R Lab Companion

Evaluate ranking quality across thresholds

AUC answers: how well does the model separate the two classes overall?

R — ROC and AUC

roc_obj <- roc(validationSet$Poverty, nb_class_prob[, 2])
plot(roc_obj, col = "blue")
auc(roc_obj)

AUC compresses the ROC story into one number: closer to 1 is strong separation, closer to 0.5 is weak separation.
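AUC also has a plain-English reading: pick one actual positive and one actual negative at random, and AUC is the chance the positive gets the higher score. The base R sketch below computes it that way on made-up data, as a cross-check on intuition rather than a replacement for `auc()`.

```r
# Made-up scores and outcomes for ten people
prob_1 <- c(0.92, 0.85, 0.77, 0.64, 0.58, 0.41, 0.33, 0.25, 0.12, 0.05)
actual <- c(1,    1,    0,    1,    0,    0,    1,    0,    0,    0)

pos <- prob_1[actual == 1]   # scores of actual positives
neg <- prob_1[actual == 0]   # scores of actual negatives

# Compare every positive with every negative; ties count as half a win
wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc_manual <- mean(wins)
round(auc_manual, 3)         # 20 wins out of 24 pairs, about 0.833
```

Seen this way, AUC is about ranking: it rewards the model for putting positives ahead of negatives, whatever cutoff is chosen later.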

Quick Check

What does an AUC near 0.50 mean?

R Lab · Step 9

Score New People

The end goal is not just fitting a model. It is making decisions about new records.

R Lab Companion

Apply the model to new observations

Classification models create decision support for unseen records.

R — score new records

new_people <- data.frame(
  Age = c(22, 47),
  Education = c("High School", "College"),
  Sex = c("Male", "Female"),
  Ethnicity = c("GroupA", "GroupB"),
  MaritalStatus = c("Single", "Married")
)

predict(nb_fit, newdata = new_people, type = "prob")

This is the business payoff: a ranked or labeled list of people to prioritize for outreach, intervention, or support.

Week 4 · Comparison

Three Classification Mindsets

Week 4 lands best when we compare the mindsets, not just the formulas.

Model	How it thinks	Best use case	Tradeoff
Logistic	Explain probability with coefficients	Interpretability matters	May miss nonlinear local patterns
kNN	Compare to similar neighbors	Local similarity drives behavior	Slow prediction, needs scaling
Naive Bayes	Combine evidence probabilistically	Categorical and text-like predictors	Assumes independence

Week 4 · Reflection

What Matters More: Accuracy or Consequences?

The best model is not just the one with the best score. It is the one that makes the best decisions for the context.

“A model is only as useful as the decision it improves.”

— Week 4, Classification takeaway

Suppose a social program has funds to help only 20% of applicants. Why would predicted probabilities be more useful than simple class labels?

Hint: Think about ranking limited resources.


Week 4 · Takeaways

Naive Bayes in One Page

If students remember these five ideas, they will remember the method.

Flip through the five big ideas

These are the concepts that matter most from Session B.


Glossary

Prior

The starting probability of a class before seeing any features.

Likelihood

How compatible the observed features are with a class.

Posterior

The updated class probability after using the evidence.

Laplace smoothing

Add-1 correction that prevents zero conditional probabilities.

AUC

Area under the ROC curve; summarizes ranking performance across all cutoffs.