BANA Learning Hub

Our Journey So Far

Each week builds on the last. You are not starting over — you are adding tools.

📈

Week 1

Regression

Linear model

Assumptions

Interpretation

“What drives Y?”

🧮

Week 2

Nonlinear

Quadratic

Log models

Elasticity

“Is it curved?”

🔀

Week 3

Logistic

Odds

Sigmoid

Confusion matrix

“Yes or no?”

🎯

Week 4

kNN + Naive Bayes

Distance

Probabilities

Lift · ROC

“Which class?”

🌲

Week 5

Tree Models

Splits

Forests

Ensembles

“Which split?”

The Learning Loop

Every predictive model you build follows this same cycle.

🎯

Define the ask

▼

🔍

Explore the data

▼

⚙️

Prepare to model

▼

🧠

Model and train

▼

📊

Evaluate and decide

🔄 iterate


The steps stay the same — your tools get sharper

Session A · The Bridge

From Probabilities to Rules

Logistic regression explained probabilities with coefficients. Naive Bayes combined evidence with probability. Decision trees ask one question at a time.

📈

Logistic

Weighted equation

🧠

Naive Bayes

Probability update

🌳

Decision Tree

Sequence of if/then splits

Session A · The Hook

A Model That Thinks in Questions

A decision tree behaves like a smart interviewer: ask the most informative question first, then keep narrowing.

Should this customer get a HELOC?

Click 'Next' to follow the tree's reasoning one question at a time.

Income < $76,500?


Why is this model appealing to managers?

Because every prediction can be explained as a simple path of decisions. Instead of saying "the coefficient shifted the log-odds," you can say "income was high, so the model routed the case here."

Step through the tree above. That path from root to leaf is the explanation. No other model we have studied gives you that for free.


Session A · Definition

What Is a Decision Tree?

A decision tree partitions the data into smaller and smaller groups using recursive splitting.

A decision tree is a predictive model that repeatedly splits the data using the variable and cutoff that best separate the outcome.

✂️

Split

Cut the data into groups

🔁

Repeat

Keep splitting child nodes

🍃

Predict

Use the terminal leaf output

Session A · Anatomy

The Parts of a Tree

Every tree is built from the same four pieces.

🌳

Root

The first split

🧩

Node

A place where data is split again

➡️

Branch

A yes/no path

🍃

Leaf

Final prediction

See all four parts in action

The amber node at top is the Root. Each split creates Branches (Yes/No). Internal amber nodes are Nodes. Green and red endpoints are Leaves.

Debt-to-Income < 35.5?


Step through the tree above. Notice how the Root badge marks the first split, the Yes/No labels are the Branches, the second amber box is an internal Node, and the endpoint boxes are the Leaves where predictions live.

Quick Check

Which part of the tree contains the actual prediction?

Tree Concepts · Two Flavors

Classification Trees vs. Regression Trees

The same algorithm adapts to two different types of problems. The difference is in the target variable.

	Classification Tree	Regression Tree
Target	Category (yes / no)	Number (dollars, count)
Splitting criterion	Gini impurity	Sum of Squared Errors (SSE)
Leaf prediction	Majority class vote	Mean of observations
R method	method = "class"	method = "anova"
Today's example	Will this person take a HELOC?	What is their credit card balance?

“Classification trees vote. Regression trees average.”

— Week 5, One-line summary

Quick Check

A bank wants to predict how much money a customer will borrow. Which tree type should they use?

Tree Concepts · Splitting

How a Tree Chooses the Next Question

At each node, the model searches for the split that improves purity the most.

Weak split

Both child nodes still look mixed.


Strong split

The children become much cleaner than the parent.


Decision Rule

The tree is greedy: it chooses the best split available right now, then repeats the process inside each child node.

Look at the Gini values above. The weak split barely moves the needle. The strong split drops impurity dramatically. The tree always picks the split that reduces Gini the most.

Reading the confidence bars

Each leaf has a confidence bar. A nearly full bar means the node is very pure. A half-full bar means the node is still mixed, so the split did not help much.

Compare the weak split's two near-50% bars to the strong split's 88% and 12% outcomes. That visual difference is what Gini impurity is trying to capture.
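To make the comparison concrete, here is a minimal sketch of how a split can be scored: compute Gini in each child, then weight by child size. The class counts are hypothetical (chosen to mimic near-50% bars versus 88% and 12% bars), and `gini()` and `split_gini()` are illustrative helper names, not rpart functions.

```r
# Gini impurity of one node from its class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Size-weighted Gini of the child nodes produced by one split
split_gini <- function(children) {
  sizes <- sapply(children, sum)
  sum(sizes / sum(sizes) * sapply(children, gini))
}

# Hypothetical parent node: 100 customers, evenly mixed
parent <- c(yes = 50, no = 50)

# Hypothetical weak split: both children stay close to 50/50
weak <- list(c(yes = 26, no = 24), c(yes = 24, no = 26))

# Hypothetical strong split: children are 88% and 12% "yes"
strong <- list(c(yes = 44, no = 6), c(yes = 6, no = 44))

gain_weak   <- gini(parent) - split_gini(weak)    # about 0.0008
gain_strong <- gini(parent) - split_gini(strong)  # about 0.2888

gain_weak
gain_strong
```

The greedy rule described above amounts to picking whichever candidate split has the largest gain.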

Quick Check

Why does the tree prefer the Income split over the Color split?

Tree Concepts · Gini Impurity

Gini Measures How Mixed a Node Is

For classification trees, lower Gini means a cleaner node.

Gini = 1 - Σ p(class)^2

0.00

Perfectly pure

0.32

Some mixing

0.50

Maximum mix for two classes
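A quick way to check the three benchmark values is to type the formula into R. This is a small sketch; the 80/20 mix is an assumption chosen because it reproduces the 0.32 shown above.

```r
# Gini impurity from a vector of class proportions
gini <- function(p) 1 - sum(p^2)

pure  <- gini(c(1.0, 0.0))  # one class only
mixed <- gini(c(0.8, 0.2))  # assumed 80/20 mix
even  <- gini(c(0.5, 0.5))  # two classes, evenly split

round(c(pure = pure, mixed = mixed, even = even), 2)
# pure 0.00, mixed 0.32, even 0.50
```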

Practice

Complete the idea

Lower impurity is better.

If a node is 100% one class, its Gini impurity is ____. If it is 50/50 between two classes, the Gini impurity is ____.

Tree Concepts · Overfitting

A Tree Can Keep Growing Until It Memorizes

If you let the tree split without restraint, it eventually fits noise instead of signal.

Practical tree

Simple, readable, and likely to generalize.

Overfit tree

Huge, unstable, and tuned to peculiar training cases.

⚠️

Core Lesson

Training error can always be pushed downward by making the tree more complex. Validation error tells us when that complexity stops helping.

Tree Concepts · Pruning

Grow Big, Then Cut Back

The usual tree workflow is not “guess the perfect size.” It is “grow large, then prune intelligently.”

Step 1: Grow a large tree.

Step 2: Use the cp table to inspect cross-validated error.

Step 3: Prune back to a simpler tree that generalizes better.

Tree Concepts · Complexity Table

Reading the cp Table

The cp table summarizes the tradeoff between complexity and predictive error.

Column	Meaning
nsplit	How many splits are in the tree
rel error	Training error relative to the root node
xerror	Cross-validated error
xstd	Standard error of xerror

We usually choose the simplest tree whose xerror is within one standard error of the minimum. This is the 1-SE rule.
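The 1-SE rule can also be applied in code rather than by eye. The sketch below is one possible way to do it: `pick_cp_1se()` is a hypothetical helper (not part of rpart), and the `kyphosis` data that ships with rpart stands in for the course data so the example runs on its own.

```r
library(rpart)

# Return the cp of the simplest tree whose cross-validated error is
# within one standard error of the minimum xerror.
pick_cp_1se <- function(fit) {
  tab <- fit$cptable
  best <- which.min(tab[, "xerror"])
  threshold <- tab[best, "xerror"] + tab[best, "xstd"]
  # Rows run from fewest to most splits, so the first row at or below
  # the threshold is the simplest qualifying tree.
  chosen <- which(tab[, "xerror"] <= threshold)[1]
  unname(tab[chosen, "CP"])
}

set.seed(1)  # cross-validation folds are random
big_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class", cp = 0, minsplit = 2, minbucket = 1)

cp_1se <- pick_cp_1se(big_tree)
small_tree <- prune(big_tree, cp = cp_1se)
printcp(small_tree)
```

Because xerror comes from random cross-validation folds, the chosen cp can change with the seed.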

R Lab · The Business Problem

Case Study: Who Takes a Home Equity Line of Credit?

A bank wants to predict which customers will respond to a HELOC offer using age, sex, and income.

The target

HELOC = 1 if the customer takes the offer, 0 otherwise.

Quick Check

Before building the model, which variable do you expect to split first?

R Lab · Step 1

Load the HELOC Data

We start by loading the customer file and identifying the target.

R Lab Companion

Load the classification dataset

Inspect the target and predictors before training the tree.

R — load HELOC data

r

library(readxl)
library(caret)
library(rpart)
library(rpart.plot)
library(ROCR)
library(gains)

HELOC <- read_excel("data-1/HELOC.xlsx")
head(HELOC)


R Lab · Step 2

Factor and Split the Data

As with earlier classification models, the target must be coded as a factor.

R Lab Companion

Prepare for tree modeling

Convert the target and create train/validation sets.

R — prepare HELOC data

r

HELOC$HELOC <- as.factor(HELOC$HELOC)

set.seed(1)
myIndex <- createDataPartition(HELOC$HELOC, p = 0.7, list = FALSE)
trainSet <- HELOC[myIndex, ]
validationSet <- HELOC[-myIndex, ]

The holdout split lets us evaluate whether the tree generalizes or merely memorizes.

R Lab · Step 3

Fit the Default Tree

Start with the default settings so we can see what the baseline tree looks like.

R Lab Companion

Build a first-pass classification tree

This gives us a readable model before we explore overfitting.

R — default classification tree

r

default_tree <- rpart(HELOC ~ ., data = trainSet, method = "class")
summary(default_tree)

The summary reveals split variables, node sizes, and which predictors matter most.

R Lab · Visualization

Read the Tree Visually

Tree plots make the decision logic accessible to a non-technical audience.

R Lab Companion

Visualize the default tree

Read from top to bottom and left to right.

R — visualize the tree

r

prp(default_tree, type = 1, extra = 1, under = TRUE)

The root node is the first global split. Each subsequent branch narrows the customer into a smaller subgroup.

R Lab · Step 4

Build and Inspect the Full Tree

We start with the default tree, then deliberately grow a full tree so the overfitting pattern becomes obvious in the cp table.

R — build default tree

r

default_tree <- rpart(HELOC ~ ., data = trainSet, method = "class")
prp(default_tree, extra = 1, under = TRUE)
printcp(default_tree)

Reading the default tree cp table

CP	nsplit	rel error	xerror	xstd
0.1758	0	1.000	1.000	0.090
0.0220	2	0.648	0.758	0.082
0.0110	6	0.549	0.890	0.087
0.0100	7	0.538	0.890	0.087

What these numbers mean

Row 1 (0 splits): no tree at all. Predict the majority class for everyone. That is why both rel error and xerror start at 1.000.

Row 2 (2 splits): the sweet spot. Cross-validated error falls to 0.758, about a 24% improvement over the baseline.

Rows 3–4: more splits reduce training error a little more, but xerror gets worse. The tree is now fitting noise instead of signal.

Root node error: 91/350 = 0.26 tells us only 26% of training customers took a HELOC. The data are imbalanced, which will matter later when we inspect sensitivity.
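Because rel error and xerror are scaled so that the root node equals 1, multiplying by the root node error converts them back to ordinary misclassification rates. A short sketch using the numbers from the table above:

```r
# Values copied from the default tree's cp table
root_node_error    <- 91 / 350  # error of predicting "No" for everyone
rel_error_2_splits <- 0.648
xerror_2_splits    <- 0.758

# Absolute error rates for the 2-split tree
train_error <- root_node_error * rel_error_2_splits
cv_error    <- root_node_error * xerror_2_splits

round(c(root = root_node_error, train = train_error, cv = cv_error), 3)
```

Read this way, the 2-split tree moves the cross-validated error rate from about 26% to roughly 20%.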

R — build full (unpruned) tree

r

full_tree <- rpart(HELOC ~ ., data = trainSet,
                  method = "class",
                  cp = 0,
                  minsplit = 2,
                  minbucket = 1)
printcp(full_tree)

What the full tree reveals

The full tree grows to 78 splits and drives training rel error down to about 0.011. On training data, it looks almost perfect.

But the validation story is the opposite: at 78 splits the xerror rises to about 0.923. That is barely better than having no model at all.

The best xerror in the full tree is about 0.714 at 19 splits. That is only slightly better than the 2-split tree's 0.758, so the extra complexity buys very little.

The overfitting pattern

As the tree gets larger, training error keeps falling but cross-validated error bottoms out and then rises. That divergence is the clearest numerical signature of overfitting in the chapter.

Quick Check

The full tree has training rel error near 0.011 but cross-validated error near 0.923. What does that tell you?

R Lab · Step 5

Prune the Tree

We do not deploy the giant tree. We cut it back to the simplest version that still performs competitively in cross-validation.

R — prune

r

library(rattle)  # provides fancyRpartPlot(); it was not loaded in Step 1

pruned_tree <- prune(full_tree, cp = 0.023)
fancyRpartPlot(pruned_tree,
               palettes = c("Blues", "Greens"),
               type = 4,
               cex = 0.6,
               tweak = 0.9)

Why cp = 0.023?

The minimum xerror in the full tree was about 0.714 at cp = 0.0069, with xstd about 0.080.

The 1-SE rule says to choose the simplest tree whose xerror is within one standard error of that minimum. That threshold is about 0.794.

The 2-split tree has xerror about 0.758, which is comfortably below 0.794. So we prefer 3 leaves over a much larger tree that adds complexity without much payoff.

The pruned HELOC tree

This is the final 2-split tree students should read and explain.


Plain-English interpretation

1. High income (≥ $76,500) leads to a strong No HELOC prediction with about 94% confidence.

2. Lower income + younger than 45 is the one segment where the tree predicts Yes, with about 67% confidence.

3. Lower income + older flips back to No, with about 75% confidence.
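Those three sentences are the whole model. As a sketch, they can be written as plain if/else logic. `heloc_rule()` is a hypothetical helper, the argument names are assumptions about the column names, and the leaf probabilities are matched to the stated confidence levels (94% No, 67% Yes, 75% No).

```r
# Hard-coded version of the pruned 2-split tree (illustration only)
heloc_rule <- function(income, age) {
  if (income >= 76500) {
    # Leaf 1: high income
    list(prediction = "No", prob_yes = 0.062)
  } else if (age < 45) {
    # Leaf 2: lower income and younger than 45
    list(prediction = "Yes", prob_yes = 0.667)
  } else {
    # Leaf 3: lower income and 45 or older
    list(prediction = "No", prob_yes = 0.250)
  }
}

heloc_rule(income = 60000, age = 30)$prediction  # "Yes"
heloc_rule(income = 60000, age = 50)$prediction  # "No"
heloc_rule(income = 90000, age = 30)$prediction  # "No"
```

In practice you would still call `predict()` on the fitted tree; the point is only that the path from root to leaf is the explanation.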

Quick Check

Why pick a 2-split tree instead of the largest tree with the absolute lowest training error?

R Lab · Step 6

Evaluate with a Confusion Matrix

This is where the pruned tree meets reality: does it actually find likely HELOC customers in the validation set?

R — confusion matrix

r

predicted_class <- predict(pruned_tree, validationSet, type = "class")
confusionMatrix(predicted_class, validationSet$HELOC, positive = "1")

Actual output

	Actually No (0)	Actually Yes (1)
Predicted No	97	22
Predicted Yes	14	17

What looks okay

Accuracy = 76%. That sounds decent at first glance.

Specificity = 87.4%. When someone will not take a HELOC, the tree is usually right.

What is concerning

Sensitivity = 43.6%. Out of 39 actual HELOC takers, the model only catches 17.

Positive predictive value = 54.8%. When the model says Yes, it is right only a little more than half the time.

No Information Rate = 74%. The 76% accuracy barely beats predicting No for everyone.

Business interpretation

For a marketing campaign, the weak point is obvious: the model misses 22 out of 39 actual HELOC takers. It is a conservative tree that plays defense well but does a mediocre job identifying the customers we actually want to find.
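Every metric quoted above can be rebuilt by hand from the four cells of the confusion matrix. A short sketch using the counts from the output:

```r
# Cells of the validation confusion matrix (positive class = 1)
tn <- 97  # predicted No,  actually No
fn <- 22  # predicted No,  actually Yes
fp <- 14  # predicted Yes, actually No
tp <- 17  # predicted Yes, actually Yes
n  <- tn + fn + fp + tp

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)           # share of actual takers caught
specificity <- tn / (tn + fp)           # share of non-takers correctly rejected
ppv         <- tp / (tp + fp)           # how often a Yes prediction is right
nir         <- max(tp + fn, tn + fp) / n  # accuracy of always guessing the majority

round(100 * c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, nir = nir), 1)
# 76.0, 43.6, 87.4, 54.8, 74.0
```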

Quick Check

If the bank mainly wants to find likely HELOC takers, which metric deserves the most attention?

R Lab · Step 7

Predicted Probabilities

The tree can return probabilities, not just labels. That lets us rank customers instead of treating every Yes prediction as equally confident.

R — predicted probabilities

r

predicted_prob <- predict(pruned_tree, validationSet, type = "prob")
head(predicted_prob)

Output

Person	P(HELOC = 0)	P(HELOC = 1)
1	0.750	0.250
2	0.938	0.062
3	0.333	0.667
4	0.750	0.250
5	0.333	0.667
6	0.938	0.062

Notice the quirk

There are only three distinct probability values for P(HELOC = 1): 0.667, 0.250, and 0.062. That is because the pruned tree has exactly three leaves.

Everyone in the same leaf receives the same probability. This is one of the clearest differences between trees and logistic regression: trees are easier to explain, but their probabilities are coarse and bucket-like.

R Lab · Step 8

Gains Table and Lift

Gains tell us whether the tree is useful for prioritization. If we can contact only part of the list, who should go first?

R — gains table

r

validationSet$HELOC <- as.numeric(as.character(validationSet$HELOC))
gains_table <- gains(validationSet$HELOC, predicted_prob[,2])

Warning: fewer distinct predicted values than groups requested.

That warning is expected here. The function wants 10 groups, but our tree only creates 3 distinct probability scores, so R collapses the gains table into 3 groups.

Gains table output

Depth	N	Cume N	Mean Resp	Cume % of Total	Lift	Score
21%	31	31	0.55	43.6%	211	0.67
45%	37	68	0.35	76.9%	135	0.25
100%	82	150	0.11	100.0%	42	0.06

Top 21%

The top 31 customers capture 43.6% of all actual HELOC takers. Lift 211 means this group is 2.11x more likely than average to respond.

Top 45%

Once we add the second group, we reach 68 people and capture 76.9% of the responders.

Bottom group

The lowest-scored group has lift 42. They are less than half as likely as average to take the offer, so they are poor prospects.

Business translation

If you can contact only 31 people, call the top-scored group first. You reach nearly 44% of all actual HELOC takers in the validation set by targeting only 21% of the file.
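The lift and cumulative columns are simple ratios. The sketch below rebuilds them from group sizes and responder counts. The 17 responders in the top group come from the confusion matrix; the 13 and 9 are derived here from N times Mean Resp (rounded), so treat them as reconstructed rather than printed values.

```r
# One row per score group, highest score first
groups <- data.frame(
  score      = c(0.67, 0.25, 0.06),
  n          = c(31, 37, 82),
  responders = c(17, 13, 9)  # 13 and 9 are reconstructed, see note above
)

overall_rate <- sum(groups$responders) / sum(groups$n)  # 39 / 150 = 0.26

groups$depth_pct <- round(100 * cumsum(groups$n) / sum(groups$n))
groups$mean_resp <- round(groups$responders / groups$n, 2)
groups$cume_pct  <- round(100 * cumsum(groups$responders) /
                            sum(groups$responders), 1)
# Lift = group response rate relative to the overall rate, times 100
groups$lift <- round(100 * (groups$responders / groups$n) / overall_rate)

groups
```

A lift above 100 marks a group that responds more often than the file as a whole; below 100 marks a group that responds less often.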

R Lab · Step 9

ROC Curve and AUC

The ROC curve summarizes how well the probability scores separate likely HELOC takers from non-takers across all thresholds.

R — ROC and AUC

r

library(pROC)  # roc(), plot.roc(), and auc() come from pROC, not ROCR

roc_object <- roc(validationSet$HELOC, predicted_prob[,2])
plot.roc(roc_object)
auc(roc_object)

Output

AUC = 0.7395

0.50

Random

0.74

Ours

0.80+

Good

1.00

Perfect

How to explain 0.7395

If we randomly choose one true HELOC taker and one non-taker, the model assigns the higher score to the HELOC taker about 74% of the time.

That is acceptable, but not strong. It also reflects the simplicity of the pruned tree: with only 3 leaves, we only get 3 distinct probability scores.

Why the ROC curve is blocky

Because the tree gives only 3 distinct probabilities, the ROC curve moves in just a few visible steps. A logistic regression would look smoother because it generates many more distinct scores.
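With only three score buckets, the AUC can be worked out by counting taker/non-taker pairs directly. This sketch assumes the per-leaf counts reconstructed from the gains table and confusion matrix (takers 17, 13, 9; non-takers 14, 24, 73); tied pairs count as half a win.

```r
# One entry per leaf, highest score first
scores <- c(0.667, 0.250, 0.062)
pos    <- c(17, 13, 9)    # actual HELOC takers in each leaf (reconstructed)
neg    <- c(14, 24, 73)   # actual non-takers in each leaf (reconstructed)

wins <- 0
for (i in seq_along(scores)) {
  for (j in seq_along(scores)) {
    # Credit for a taker in leaf i paired with a non-taker in leaf j
    credit <- if (scores[i] > scores[j]) 1 else if (scores[i] == scores[j]) 0.5 else 0
    wins <- wins + credit * pos[i] * neg[j]
  }
}

auc_manual <- wins / (sum(pos) * sum(neg))
round(auc_manual, 4)  # 0.7395
```

The many tied pairs inside each leaf are the numeric counterpart of the blocky curve: the model cannot rank customers who share a leaf.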

Quick Check

If the tree's AUC is around 0.74, what is the best interpretation?

R Lab · Step 10

Score New Customers

The real purpose of predictive analytics: apply the model to people we have never seen before.

R — score new data

r

myScoreData <- read_excel("data-1/HELOC_Score.xlsx")

predicted_class_score <- predict(pruned_tree, myScoreData, type = "class")
predicted_prob_score <- predict(pruned_tree, myScoreData, type = "prob")

myScoreData <- data.frame(myScoreData,
                         Predicted_HELOC = predicted_class_score,
                         Prob_Yes = predicted_prob_score[, 2])
myScoreData

R Lab Companion

Turn the model into an operational tool

Attach class labels and probabilities, then rank the list.

Predicted_HELOC is the yes/no flag. Prob_Yes is the more useful business variable for ranking and targeting.

Always remember: the model reflects historical patterns. It is a decision aid, not an oracle.
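Ranking the scored list is one line with `order()`. The data frame below is hypothetical and only stands in for the scored file; the column names follow the ones created above.

```r
# Hypothetical scored customers (stand-in for the real scored file)
scored <- data.frame(
  Customer        = 1:5,
  Predicted_HELOC = c("0", "1", "0", "1", "0"),
  Prob_Yes        = c(0.250, 0.667, 0.062, 0.667, 0.250)
)

# Highest probability first, so the best prospects sit at the top
ranked <- scored[order(scored$Prob_Yes, decreasing = TRUE), ]
ranked
```

Customers who share a leaf tie on `Prob_Yes`, so a second sort key would be needed to break ties within a bucket.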

Session B · Regression Trees

Same Structure, Different Target

Everything you just learned about classification trees applies here. The only difference: the target is a number, not a category.

🏷️

Classification

• Target is a class
• Splits minimize Gini
• Leaves vote

💰

Regression

• Target is numeric
• Splits minimize SSE
• Leaves average

Regression Trees · Evaluation

MAE and MSE Replace the Confusion Matrix

When the target is continuous, we ask how far off the predictions are, not whether the class was right or wrong.

MAE

MAE = Σ |actual - predicted| / n

Average absolute miss in the original units.

MSE

MSE = Σ (actual - predicted)^2 / n

Penalizes large errors much more strongly.
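A tiny made-up example shows how one large miss affects the two metrics differently. All numbers here are hypothetical and chosen for easy arithmetic.

```r
# Hypothetical balances: four small misses and one very large miss
actual    <- c(3000, 4000, 5000, 6000, 20000)
predicted <- c(3500, 3500, 5500, 5500,  8000)

errors <- actual - predicted   # -500, 500, -500, 500, 12000

mae  <- mean(abs(errors))      # 2800
mse  <- mean(errors^2)         # 29,000,000
rmse <- sqrt(mse)              # about 5385, back in dollar units

c(MAE = mae, MSE = mse, RMSE = rmse)
```

The single 12,000 miss pushes RMSE to nearly double the MAE, which is the pattern to look for when a few cases dominate the error.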

R Lab · Regression Step 1

Load the Balance Data

Predicting credit card balance from age, sex, and income.

R Lab Companion

Load and split the regression dataset

No factor conversion this time, because Balance stays numeric.

R — load and split

r

myData <- read_excel("data-1/Balance.xlsx")
head(myData)

set.seed(1)
myIndex <- createDataPartition(myData$Balance, p = 0.7, list = FALSE)
trainSet <- myData[myIndex, ]
validationSet <- myData[-myIndex, ]


R Lab · Regression Step 2

Build, Inspect, and Prune

Same pattern: default tree, full tree, cp table, then prune.

R — default tree + full tree + prune

r

default_tree <- rpart(Balance ~ ., data = trainSet, method = "anova")
summary(default_tree)

full_tree <- rpart(Balance ~ ., data = trainSet,
                 method = "anova",
                 cp = 0, minsplit = 2, minbucket = 1)
printcp(full_tree)

pruned_tree <- prune(full_tree, cp = 0.033)
prp(pruned_tree, type = 1, extra = 1, under = TRUE)

R Lab Companion

method = "anova" is the switch

That tells `rpart()` we are optimizing a continuous outcome rather than a class label.

Variable importance often reveals which numeric predictor drives the first major split.

Income commonly dominates because it creates the largest reductions in squared error.

R Lab · Regression Step 3

Evaluate the Regression Tree

For regression trees, we judge the model by how far off its numeric predictions are, not by a confusion matrix.

R — MAE and MSE

r

predicted <- predict(pruned_tree, validationSet, type = "vector")
actual <- validationSet$Balance

## Mean Absolute Error
MAE <- sum(abs(predicted - actual)) / length(actual)
MAE

## Mean Squared Error and its square root (RMSE)
MSE <- mean((predicted - actual)^2)
MSE
RMSE <- sqrt(MSE)
RMSE

Mean Absolute Error

$3,786

Average miss in original dollar units

Root Mean Squared Error

$5,120

Derived from MSE = 26,219,132

How to interpret those errors

MAE = $3,786 means the model misses the true balance by almost four thousand dollars on average.

RMSE = $5,120 is larger because big misses get squared before averaging. A few customers with unusually high or low balances are hurting the model more than MAE alone reveals.

This is also the cost of simplicity: the pruned tree only makes 2 splits, so it can only predict a small number of balance buckets rather than a smooth range of values.

Variable importance from the regression tree

Income: 76%

Age: 24%

Income dominates the tree's splitting power. Sex does not appear, which tells students the tree found it unhelpful for predicting balance.

Quick Check

Why is RMSE noticeably larger than MAE in this example?

R Lab · Regression Step 4

Score New Customers

This is the deployment moment: we send in unseen customers and the tree returns predicted balances.

R — score new data

r

myScoreData <- read_excel("data-1/Balance_Score.xlsx")
predicted_value_score <- predict(pruned_tree, myScoreData)
myScoreData <- data.frame(myScoreData, predicted_value_score)
colnames(myScoreData)[4] <- "Balance"
myScoreData

What the scored output teaches us

#	Age	Sex	Income	Predicted Balance
1	35	Female	65,000	3,106
2	56	Male	160,000	14,725
3	43	Male	32,000	3,106
8	23	Male	28,000	3,106
16	42	Male	113,000	7,978

The same-prediction problem

Many different customers receive the exact same predicted balance because the regression tree predicts the leaf average.

In the professor's output, 15 out of 20 people landed in the same bucket and received essentially the same prediction: about $3,106.

That is the defining limitation of regression trees: they produce step functions, not smooth curves. Interpretability goes up, but nuance goes down.
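The step-function behavior is easy to confirm on any data. This sketch uses R's built-in `cars` data as a stand-in (not the Balance file) and checks that the number of distinct predictions equals the number of leaves.

```r
library(rpart)

# Regression tree on a built-in dataset (stand-in for the Balance data)
fit <- rpart(dist ~ speed, data = cars, method = "anova")

preds <- predict(fit, cars)

# Leaves are the rows of fit$frame labelled "<leaf>"
n_leaves   <- sum(fit$frame$var == "<leaf>")
n_distinct <- length(unique(preds))

n_leaves
n_distinct          # same as the number of leaves
table(round(preds, 1))  # how many rows fall into each prediction bucket
```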

Business takeaway

A regression tree is useful when stakeholders want understandable dollar buckets and simple rules. But if the goal is highly individualized predictions, a smoother model such as linear regression will usually produce more nuanced scores.

R Lab Companion

Why rename the output column

The `colnames()` step is a tiny but important deployment detail.

Business users should see a clean field like `Balance`, not an awkward modeling artifact like `predicted_value_score`.

Clean naming makes scored files easier to read, share, and operationalize.

Looking Ahead · What Comes Next

Trees Have a Weakness. Ensembles Fix It.

One tree is interpretable, but unstable. Ensemble methods improve stability and predictive power.

🎒

Bagging

Build many trees on bootstrap samples and average them.

🌲🌲🌲

Random Forest

Like bagging, but also randomize which predictors each tree can consider.

🔁

Boosting

Build trees sequentially, each one focusing on prior mistakes.
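As a preview, bagging can be sketched in a few lines with tools already used this week. This is an illustration of the idea only, not how a dedicated ensemble package implements it; it uses the built-in `cars` data, and the choice of 25 trees is arbitrary.

```r
library(rpart)

set.seed(1)
n_trees <- 25          # arbitrary number of bootstrap trees
n <- nrow(cars)

# Fit one regression tree per bootstrap sample and predict for every row
bag_preds <- sapply(seq_len(n_trees), function(b) {
  boot_rows <- sample(n, replace = TRUE)  # sample rows with replacement
  tree <- rpart(dist ~ speed, data = cars[boot_rows, ], method = "anova")
  predict(tree, newdata = cars)
})

# Bagged prediction = average across the trees
bagged <- rowMeans(bag_preds)

# Compare with a single tree fit on the full data
single <- predict(rpart(dist ~ speed, data = cars, method = "anova"), cars)
length(unique(single))
length(unique(bagged))
```

Averaging many slightly different trees tends to smooth out the boxy steps of any single tree, at the cost of losing the one readable rule path.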


Reflection · The Full Picture So Far

Five Weeks, Five Model Mindsets

Each model solves predictive problems differently. Choosing well means understanding the tradeoffs.

Model	Core idea	Best when	Watch out for	Interpretability
Linear Regression	Fit a line	Continuous target, roughly linear patterns	Nonlinearity, outliers	High
Logistic Regression	Model log-odds	Binary outcomes, coefficients matter	Rigid boundaries	High
kNN	Nearest neighbors vote	Local similarity matters	Needs scaling, slow prediction	Low
Naive Bayes	Multiply conditional evidence	Fast baseline, categorical features	Independence assumption	Medium
Decision Tree	Ask the best next question	Need rules and visual logic	Overfitting, instability	Very High

Match each model to its core mechanism

Choose the correct bucket for each model.

Linear Regression

Logistic Regression

kNN

Naive Bayes

Decision Tree

Fits a line

Models log-odds

Uses nearby cases

Combines probabilities

Recursively splits data

Reflection · Critical Thinking

When Do Trees Win? When Do They Lose?

Decision trees are great at some jobs and weak at others. The real skill is knowing when to use them.

Trees win when...

• Stakeholders need an explainable rule path
• Relationships are nonlinear or interaction-heavy
• Predictors mix numeric and categorical types
• You want a visual representation of decisions

Trees lose when...

• Fine-grained probabilities are needed
• Stability matters across small data shifts
• A smooth boundary would fit better than boxy splits
• You forget to prune and validation collapses

Quick Check

What is wrong with the request, 'Give me the model with the highest training accuracy'?

Week 5 · Final Takeaways

Decision Trees in One Page

“A perfect training model is often a bad model. The best model is the simplest one that generalizes.”

— Week 5, Pruning principle

Concept	Key idea
What it is	A flowchart of recursive if/then splits
Classification vs. Regression	Classification trees vote. Regression trees average.
Gini	Measures class mixing inside a node
Overfitting	Large trees memorize training noise
Pruning	Cut back using cp and validation error
Probabilities	One probability per leaf, not per person
Regression metrics	MAE and MSE summarize numeric miss
Biggest strength	Interpretability


This week's R functions

rpart()

prp() / fancyRpartPlot()

printcp()

prune()

predict(type = "class")

predict(type = "prob")

predict(type = "vector")

confusionMatrix()

gains()

roc() / auc()

Coming next

One tree is fragile. A forest of trees is robust.

Next: bagging, random forests, and boosting.

Same building block, dramatically better performance.