Each week builds on the last. You are not starting over; you are adding tools.
Linear model
Assumptions
Interpretation
Quadratic
Log models
Elasticity
Odds
Sigmoid
Confusion
Distance
Probabilities
Lift · ROC
Splits
Forests
Ensembles
Every predictive model you build follows this same cycle.
Logistic regression explained probabilities with coefficients. Naive Bayes combined evidence with probability. Decision trees ask one question at a time.
Weighted equation
Probability update
Sequence of if/then splits
A decision tree behaves like a smart interviewer: ask the most informative question first, then keep narrowing.
Click 'Next' to follow the tree's reasoning one question at a time.
Because every prediction can be explained as a simple path of decisions. Instead of saying "the coefficient shifted the log-odds," you can say "income was high, so the model routed the case here."
Step through the tree above. That path from root to leaf is the explanation. No other model we have studied gives you that for free.
Hint: Think about how easy it is to explain each prediction.
A decision tree partitions the data into smaller and smaller groups using recursive splitting.
A decision tree is a predictive model that repeatedly splits the data using the variable and cutoff that best separate the outcome.
Cut the data into groups
Keep splitting child nodes
Use the terminal leaf output
Every tree is built from the same four pieces.
The first split
A place where data is split again
A yes/no path
Final prediction
The amber node at top is the Root. Each split creates Branches (Yes/No). Internal amber nodes are Nodes. Green and red endpoints are Leaves.
Step through the tree above. Notice how the Root badge marks the first split, the Yes/No labels are the Branches, the second amber box is an internal Node, and the endpoint boxes are the Leaves where predictions live.
The same algorithm adapts to two different types of problems. The difference is in the target variable.
| | Classification Tree | Regression Tree |
|---|---|---|
| Target | Category (yes / no) | Number (dollars, count) |
| Splitting criterion | Gini impurity | Sum of Squared Errors (SSE) |
| Leaf prediction | Majority class vote | Mean of observations |
| R method | method = "class" | method = "anova" |
| Today's example | Will this person take a HELOC? | What is their credit card balance? |
"Classification trees vote. Regression trees average."
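The vote-versus-average distinction can be illustrated with a single leaf. This is a minimal sketch in base R; the leaf contents below are made-up values for illustration, not rows from the HELOC or Balance files.

```r
# Toy leaf contents (made-up values, for illustration only).
leaf_classes  <- c("No", "No", "Yes", "No", "Yes")   # a classification leaf
leaf_balances <- c(2500, 3100, 4000, 2900, 3000)     # a regression leaf

# Classification: predict the most frequent class in the leaf.
class_counts     <- table(leaf_classes)
class_prediction <- names(class_counts)[which.max(class_counts)]

# Regression: predict the mean of the observations in the leaf.
numeric_prediction <- mean(leaf_balances)

class_prediction     # "No"
numeric_prediction   # 3100
```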
At each node, the model searches for the split that improves purity the most.
Both child nodes still look mixed.
The children become much cleaner than the parent.
The tree is greedy: it chooses the best split available right now, then repeats the process inside each child node.
Look at the Gini values above. The weak split barely moves the needle. The strong split drops impurity dramatically. The tree always picks the split that reduces Gini the most.
Each leaf has a confidence bar. A nearly full bar means the node is very pure. A half-full bar means the node is still mixed, so the split did not help much.
Compare the weak split's two near-50% bars to the strong split's 88% and 12% outcomes. That visual difference is what Gini impurity is trying to capture.
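The greedy search for one split can be sketched in a few lines of base R. This is an illustration under simplifying assumptions, not how `rpart()` is implemented internally: the helper names `gini()` and `split_gini()` are hypothetical, the data are made up, and only one numeric predictor is searched.

```r
# Gini impurity of a vector of class labels.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted Gini of the two children produced by the rule x < cutoff.
split_gini <- function(x, y, cutoff) {
  left  <- y[x <  cutoff]
  right <- y[x >= cutoff]
  (length(left) * gini(left) + length(right) * gini(right)) / length(y)
}

# Made-up toy data: income in $1,000s and a yes/no outcome.
income <- c(20, 35, 40, 55, 60, 80, 90, 120)
took   <- c("Yes", "Yes", "No", "Yes", "Yes", "No", "No", "No")

# Candidate cutoffs: midpoints between consecutive sorted values.
sorted  <- sort(unique(income))
cutoffs <- (head(sorted, -1) + tail(sorted, -1)) / 2

# Score every candidate and keep the one with the lowest weighted Gini.
scores      <- sapply(cutoffs, function(cut) split_gini(income, took, cut))
best_cutoff <- cutoffs[which.min(scores)]

gini(took)     # parent impurity: 0.5
best_cutoff    # 70
min(scores)    # about 0.2
```

A full tree would repeat the same search inside each child node, over every predictor.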
For classification trees, lower Gini means a cleaner node.
Perfectly pure
Some mixing
Maximum mix for two classes
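For reference, the standard definition of Gini impurity is shown below, where p_k is the share of class k in the node. The notation is an assumption, since the text does not write the formula out. The last line applies it to a node that is 88% one class and 12% the other.

```latex
\begin{align*}
G &= 1 - \sum_{k=1}^{K} p_k^{2} \\
\text{two classes:}\quad G &= 1 - p^{2} - (1-p)^{2} = 2p(1-p) \\
p = 1 &\;\Rightarrow\; G = 0 \quad \text{(perfectly pure)} \\
p = 0.5 &\;\Rightarrow\; G = 0.5 \quad \text{(maximum mix for two classes)} \\
p = 0.88 &\;\Rightarrow\; G = 1 - 0.88^{2} - 0.12^{2} = 0.2112
\end{align*}
```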
If you let the tree split without restraint, it eventually fits noise instead of signal.
Simple, readable, and likely to generalize.
Huge, unstable, and tuned to peculiar training cases.
The usual tree workflow is not "guess the perfect size." It is "grow large, then prune intelligently."
Step 1: Grow a large tree.
Step 2: Use the cp table to inspect cross-validated error.
Step 3: Prune back to a simpler tree that generalizes better.
The cp table summarizes the tradeoff between complexity and predictive error.
| Column | Meaning |
|---|---|
| nsplit | How many splits are in the tree |
| rel error | Training error relative to the root node |
| xerror | Cross-validated error |
| xstd | Standard error of xerror |
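These columns can also be read straight off a fitted tree. A minimal sketch, assuming the `kyphosis` example data that ships with `rpart` as a stand-in for the course file:

```r
library(rpart)

# kyphosis is a small example data set bundled with rpart;
# it stands in for the course data here.
set.seed(1)  # cross-validation folds are random, so fix the seed
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis, method = "class")

# The cp table is stored as a matrix on the fitted object.
cp_table <- fit$cptable
colnames(cp_table)   # "CP" "nsplit" "rel error" "xerror" "xstd"

# Row with the lowest cross-validated error.
best_row <- which.min(cp_table[, "xerror"])
cp_table[best_row, ]
```

`printcp(fit)` prints the same table along with the root node error.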
A bank wants to predict which customers will respond to a HELOC offer using age, sex, and income.
HELOC = 1 if the customer takes the offer, 0 otherwise.
We start by loading the customer file and identifying the target.
Inspect the target and predictors before training the tree.
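library(readxl)      # read_excel()
library(caret)       # createDataPartition(), confusionMatrix()
library(rpart)       # rpart(), prune(), printcp()
library(rpart.plot)  # prp()
library(rattle)      # fancyRpartPlot()
library(ROCR)
library(gains)       # gains()
library(pROC)        # roc(), plot.roc(), auc()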
HELOC <- read_excel("data-1/HELOC.xlsx")
head(HELOC)
As with earlier classification models, the target must be coded as a factor.
Convert the target and create train/validation sets.
HELOC$HELOC <- as.factor(HELOC$HELOC)
set.seed(1)
myIndex <- createDataPartition(HELOC$HELOC, p = 0.7, list = FALSE)
trainSet <- HELOC[myIndex, ]
validationSet <- HELOC[-myIndex, ]
Start with the default settings so we can see what the baseline tree looks like.
This gives us a readable model before we explore overfitting.
default_tree <- rpart(HELOC ~ ., data = trainSet, method = "class")
summary(default_tree)
Tree plots make the decision logic accessible to a non-technical audience.
Read from top to bottom and left to right.
prp(default_tree, type = 1, extra = 1, under = TRUE)
We start with the default tree, then deliberately grow a full tree so the overfitting pattern becomes obvious in the cp table.
default_tree <- rpart(HELOC ~ ., data = trainSet, method = "class")
prp(default_tree, extra = 1, under = TRUE)
printcp(default_tree)
| CP | nsplit | rel error | xerror | xstd |
|---|---|---|---|---|
| 0.1758 | 0 | 1.000 | 1.000 | 0.090 |
| 0.0220 | 2 | 0.648 | 0.758 | 0.082 |
| 0.0110 | 6 | 0.549 | 0.890 | 0.087 |
| 0.0100 | 7 | 0.538 | 0.890 | 0.087 |
Row 1 (0 splits): no tree at all. Predict the majority class for everyone. That is why both rel error and xerror start at 1.000.
Row 2 (2 splits): the sweet spot. Cross-validated error falls to 0.758, about a 24% improvement over the baseline.
Rows 3 and 4: more splits reduce training error a little more, but xerror gets worse. The tree is now fitting noise instead of signal.
Root node error: 91/350 = 0.26 tells us only 26% of training customers took a HELOC. The data are imbalanced, which will matter later when we inspect sensitivity.
full_tree <- rpart(HELOC ~ ., data = trainSet,
method = "class",
cp = 0,
minsplit = 2,
minbucket = 1)
printcp(full_tree)
The full tree grows to 78 splits and drives training rel error down to about 0.011. On training data, it looks almost perfect.
But the cross-validated story is the opposite: at 78 splits the xerror rises to about 0.923. That is barely better than having no model at all.
The best xerror in the full tree is about 0.714 at 19 splits. That is only slightly better than the 2-split tree's 0.758, so the extra complexity buys very little.
As the tree gets larger, training error keeps falling but cross-validated error bottoms out and then rises. That divergence is the clearest numerical signature of overfitting in the chapter.
We do not deploy the giant tree. We cut it back to the simplest version that still performs competitively on validation data.
pruned_tree <- prune(full_tree, cp = 0.023)
fancyRpartPlot(pruned_tree,
palettes = c("Blues", "Greens"),
type = 4,
cex = 0.6,
tweak = 0.9)
The minimum xerror in the full tree was about 0.714 at cp = 0.0069, with xstd about 0.080.
The 1-SE rule says to choose the simplest tree whose xerror is within one standard error of that minimum. That threshold is about 0.794.
The 2-split tree has xerror about 0.758, which is comfortably below 0.794. So we prefer 3 leaves over a much larger tree that adds complexity without much payoff.
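The 1-SE arithmetic above can be checked in base R. This sketch uses only the rows quoted in the text, so `cp_excerpt` is an excerpt rather than the full cp table, and the variable names are illustrative.

```r
# Rows quoted in the text for the full tree (an excerpt, not the whole cp table).
cp_excerpt <- data.frame(
  nsplit = c(0, 2, 19, 78),
  xerror = c(1.000, 0.758, 0.714, 0.923)
)
xstd_at_min <- 0.080  # standard error reported for the minimum-xerror row

# 1-SE threshold: minimum xerror plus one standard error.
threshold <- min(cp_excerpt$xerror) + xstd_at_min   # 0.794

# Simplest tree (fewest splits) whose xerror is within the threshold.
eligible <- cp_excerpt[cp_excerpt$xerror <= threshold, ]
chosen   <- eligible[which.min(eligible$nsplit), ]
chosen$nsplit   # 2
```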
This is the final 2-split tree students should read and explain.
1. High income (≥ $76,500) leads to a strong No HELOC prediction with about 94% confidence.
2. Lower income + younger than 45 is the one segment where the tree predicts Yes, with about 67% confidence.
3. Lower income + older flips back to No, with about 75% confidence.
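Because the pruned tree is just three rules, it can be transcribed by hand. The sketch below is an illustration, not output from `rpart`: the function name `score_heloc()` and the example customers are hypothetical, and the handling of values that sit exactly on a cutoff is an assumption.

```r
# Hand-transcribed version of the three rules above.
# Cutoffs (76,500 and 45) and leaf probabilities come from the text;
# behavior exactly at a cutoff is an assumption.
score_heloc <- function(income, age) {
  ifelse(income >= 76500, 0.062,     # high income: predict No
         ifelse(age < 45, 0.667,     # lower income, younger: predict Yes
                0.250))              # lower income, older: predict No
}

# Hypothetical customers, for illustration only.
customers <- data.frame(income = c(120000, 50000, 50000),
                        age    = c(40, 30, 60))
customers$prob_yes  <- score_heloc(customers$income, customers$age)
customers$predicted <- ifelse(customers$prob_yes > 0.5, "Yes", "No")
customers
```

Each row's prediction can be explained by naming the rule it followed, which is the interpretability argument in code form.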
This is where the pruned tree meets reality: does it actually find likely HELOC customers in the validation set?
predicted_class <- predict(pruned_tree, validationSet, type = "class")
confusionMatrix(predicted_class, validationSet$HELOC, positive = "1")
| | Actually No (0) | Actually Yes (1) |
|---|---|---|
| Predicted No | 97 | 22 |
| Predicted Yes | 14 | 17 |
Accuracy = 76%. That sounds decent at first glance.
Specificity = 87.4%. When someone will not take a HELOC, the tree is usually right.
Sensitivity = 43.6%. Out of 39 actual HELOC takers, the model only catches 17.
Positive predictive value = 54.8%. When the model says Yes, it is right only a little more than half the time.
No Information Rate = 74%. The 76% accuracy barely beats predicting No for everyone.
For a marketing campaign, the weak point is obvious: the model misses 22 out of 39 actual HELOC takers. It is a conservative tree that plays defense well but does a mediocre job identifying the customers we actually want to find.
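Each statistic quoted above can be recomputed from the four cells of the confusion matrix. A minimal base R sketch, with variable names chosen for illustration:

```r
# Counts from the validation confusion matrix above.
tn <- 97; fn <- 22   # predicted No:  actually No, actually Yes
fp <- 14; tp <- 17   # predicted Yes: actually No, actually Yes
n  <- tn + fn + fp + tp

accuracy    <- (tp + tn) / n               # share of all predictions that are right
sensitivity <- tp / (tp + fn)              # share of actual takers caught
specificity <- tn / (tn + fp)              # share of actual non-takers caught
ppv         <- tp / (tp + fp)              # share of Yes predictions that are right
nir         <- max(tn + fp, tp + fn) / n   # share of the larger actual class

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, nir = nir), 3)
# accuracy 0.760, sensitivity 0.436, specificity 0.874, ppv 0.548, nir 0.740
```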
The tree can return probabilities, not just labels. That lets us rank customers instead of treating every Yes prediction as equally confident.
predicted_prob <- predict(pruned_tree, validationSet, type = "prob")
head(predicted_prob)
| Person | P(HELOC = 0) | P(HELOC = 1) |
|---|---|---|
| 1 | 0.750 | 0.250 |
| 2 | 0.938 | 0.062 |
| 3 | 0.333 | 0.667 |
| 4 | 0.750 | 0.250 |
| 5 | 0.333 | 0.667 |
| 6 | 0.938 | 0.062 |
There are only three distinct probability values for P(HELOC = 1): 0.667, 0.250, and 0.062. That is because the pruned tree has exactly three leaves.
Everyone in the same leaf receives the same probability. This is one of the clearest differences between trees and logistic regression: trees are easier to explain, but their probabilities are coarse and bucket-like.
Gains tell us whether the tree is useful for prioritization. If we can contact only part of the list, who should go first?
validationSet$HELOC <- as.numeric(as.character(validationSet$HELOC))
gains_table <- gains(validationSet$HELOC, predicted_prob[,2])
That warning is expected here. The function wants 10 groups, but our tree only creates 3 distinct probability scores, so R collapses the gains table into 3 groups.
| Depth | N | Cume N | Mean Resp | Cume % of Total | Lift | Score |
|---|---|---|---|---|---|---|
| 21% | 31 | 31 | 0.55 | 43.6% | 211 | 0.67 |
| 45% | 37 | 68 | 0.35 | 76.9% | 135 | 0.25 |
| 100% | 82 | 150 | 0.11 | 100.0% | 42 | 0.06 |
The top 31 customers capture 43.6% of all actual HELOC takers. Lift 211 means this group is 2.11x more likely than average to respond.
Once we add the second group, we reach 68 people and capture 76.9% of the responders.
The lowest-scored group has lift 42. They are less than half as likely as average to take the offer, so they are poor prospects.
If you can contact only 31 people, call the top-scored group first. You reach nearly 44% of all actual HELOC takers by targeting only 21% of the file.
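The lift and cumulative capture figures follow from simple ratios. In this base R sketch the responder counts per group (17, 13, 9) are back-calculated from Mean Resp times N in the gains table, so treat them as an inference rather than reported output.

```r
# Group sizes from the gains table; responders back-calculated from Mean Resp x N.
n_group    <- c(31, 37, 82)
responders <- c(17, 13, 9)

overall_rate <- sum(responders) / sum(n_group)        # 39 / 150 = 0.26
mean_resp    <- responders / n_group                  # response rate per group
lift         <- 100 * mean_resp / overall_rate        # 100 = average
cume_capture <- 100 * cumsum(responders) / sum(responders)

round(lift)              # 211 135  42
round(cume_capture, 1)   # 43.6  76.9 100.0
```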
The ROC curve summarizes how well the probability scores separate likely HELOC takers from non-takers across all thresholds.
roc_object <- roc(validationSet$HELOC, predicted_prob[,2])
plot.roc(roc_object)
auc(roc_object)
Output: AUC = 0.7395
AUC scale: 0.50 = random, 0.74 = ours, 0.80+ = good, 1.00 = perfect.
If we randomly choose one true HELOC taker and one non-taker, the model assigns the higher score to the HELOC taker about 74% of the time.
That is acceptable, but not strong. It also reflects the simplicity of the pruned tree: with only 3 leaves, we only get 3 distinct probability scores.
Because the tree gives only 3 distinct probabilities, the ROC curve moves in just a few visible steps. A logistic regression would look smoother because it generates many more distinct scores.
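With only three score groups, the AUC can be reproduced by counting pairs. This base R sketch assumes the per-group counts implied by the gains table and confusion matrix (takers 17, 13, 9; non-takers 14, 24, 73), and counts ties within a leaf as half a win.

```r
# Actual takers and non-takers in each score group, highest score first.
# Counts are back-calculated from the gains table and confusion matrix.
pos <- c(17, 13, 9)    # took the HELOC
neg <- c(14, 24, 73)   # did not

# Count (taker, non-taker) pairs where the taker has the higher score;
# pairs that share a leaf (same score) count as half.
wins <- 0
for (i in seq_along(pos)) {
  lower_neg <- sum(neg[seq_along(neg) > i])   # non-takers in lower-scored groups
  wins <- wins + pos[i] * (lower_neg + 0.5 * neg[i])
}

auc_by_hand <- wins / (sum(pos) * sum(neg))
round(auc_by_hand, 4)   # 0.7395
```

The result matches the reported AUC, which shows how much of the curve's shape is determined by the three leaves.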
The real purpose of predictive analytics: apply the model to people we have never seen before.
myScoreData <- read_excel("data-1/HELOC_Score.xlsx")
predicted_class_score <- predict(pruned_tree, myScoreData, type = "class")
predicted_prob_score <- predict(pruned_tree, myScoreData, type = "prob")
myScoreData <- data.frame(myScoreData,
Predicted_HELOC = predicted_class_score,
Prob_Yes = predicted_prob_score[, 2])
myScoreData
Attach class labels and probabilities, then rank the list.
Everything you just learned about classification trees applies here. The only difference: the target is a number, not a category.
When the target is continuous, we ask how far off the predictions are, not whether the class was right or wrong.
Average absolute miss in the original units.
Penalizes large errors much more strongly.
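In symbols, the error measures described above are usually written as follows, where y_i is the actual value, \hat{y}_i is the predicted value, and n is the number of validation cases. The notation is an assumption, since the text does not define symbols.

```latex
\begin{align*}
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \\
\mathrm{MSE}  &= \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^{2} \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}}
\end{align*}
```

Squaring makes a miss of 10 count 100 times as much as a miss of 1, which is why MSE and RMSE react more strongly to large errors than MAE does.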
Predicting credit card balance from age, sex, and income.
No factor conversion this time, because Balance stays numeric.
myData <- read_excel("data-1/Balance.xlsx")
head(myData)
set.seed(1)
myIndex <- createDataPartition(myData$Balance, p = 0.7, list = FALSE)
trainSet <- myData[myIndex, ]
validationSet <- myData[-myIndex, ]
Same pattern: default tree, full tree, cp table, then prune.
default_tree <- rpart(Balance ~ ., data = trainSet, method = "anova")
summary(default_tree)
full_tree <- rpart(Balance ~ ., data = trainSet,
method = "anova",
cp = 0, minsplit = 2, minbucket = 1)
printcp(full_tree)
pruned_tree <- prune(full_tree, cp = 0.033)
prp(pruned_tree, type = 1, extra = 1, under = TRUE)
Setting method = "anova" tells `rpart()` we are optimizing a continuous outcome rather than a class label.
For regression trees, we judge the model by how far off its numeric predictions are, not by a confusion matrix.
predicted <- predict(pruned_tree, validationSet, type = "vector")
actual <- validationSet$Balance
## Mean Absolute Error
MAE <- sum(abs(predicted - actual)) / length(actual)
MAE
## Mean Squared Error and its square root (RMSE)
MSE <- mean((predicted - validationSet$Balance)^2)
MSE
RMSE <- sqrt(MSE)
RMSE
Mean Absolute Error: $3,786 (average miss in original dollar units)
Root Mean Squared Error: $5,120 (derived from MSE = 26,219,132)
MAE = $3,786 means the model misses the true balance by almost four thousand dollars on average.
RMSE = $5,120 is larger because big misses get squared before averaging. A few customers with unusually high or low balances are hurting the model more than MAE alone reveals.
This is also the cost of simplicity: the pruned tree only makes 2 splits, so it can only predict a small number of balance buckets rather than a smooth range of values.
Income dominates the tree's splitting power. Sex does not appear, which tells students the tree found it unhelpful for predicting balance.
This is the deployment moment: we send in unseen customers and the tree returns predicted balances.
myScoreData <- read_excel("data-1/Balance_Score.xlsx")
predicted_value_score <- predict(pruned_tree, myScoreData)
myScoreData <- data.frame(myScoreData, predicted_value_score)
colnames(myScoreData)[4] <- "Balance"
myScoreData
| # | Age | Sex | Income | Predicted Balance |
|---|---|---|---|---|
| 1 | 35 | Female | 65,000 | 3,106 |
| 2 | 56 | Male | 160,000 | 14,725 |
| 3 | 43 | Male | 32,000 | 3,106 |
| 8 | 23 | Male | 28,000 | 3,106 |
| 16 | 42 | Male | 113,000 | 7,978 |
Many different customers receive the exact same predicted balance because the regression tree predicts the leaf average.
In the professor's output, 15 out of 20 people landed in the same bucket and received essentially the same prediction: about $3,106.
That is the defining limitation of regression trees: they produce step functions, not smooth curves. Interpretability goes up, but nuance goes down.
A regression tree is useful when stakeholders want understandable dollar buckets and simple rules. But if the goal is highly individualized predictions, a smoother model such as linear regression will usually produce more nuanced scores.
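The `colnames()` step is a small but important deployment detail: it renames the fourth column, which holds the predictions, to Balance so the scored file is clearly labeled.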
One tree is interpretable, but unstable. Ensemble methods improve stability and predictive power.
Build many trees on bootstrap samples and average them.
Like bagging, but also randomize which predictors each tree can consider.
Build trees sequentially, each one focusing on prior mistakes.
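The first of these ideas, bagging, can be sketched with the tools already used in this lesson. This is a minimal illustration and not a production method: it assumes the built-in `mtcars` data as a stand-in, picks 25 trees arbitrarily, and uses hypothetical variable names.

```r
library(rpart)

# mtcars ships with R and stands in for the course data here.
set.seed(1)
n_trees <- 25   # arbitrary choice for illustration
preds <- matrix(NA_real_, nrow = nrow(mtcars), ncol = n_trees)

for (b in seq_len(n_trees)) {
  # Bootstrap sample: draw rows with replacement.
  boot_rows <- sample(nrow(mtcars), replace = TRUE)
  tree_b <- rpart(mpg ~ wt + hp + disp,
                  data = mtcars[boot_rows, ], method = "anova")
  # Store this tree's predictions for every original row.
  preds[, b] <- predict(tree_b, newdata = mtcars)
}

# Bagged prediction: average the individual trees.
bagged_pred <- rowMeans(preds)
head(round(bagged_pred, 1))
```

Averaging over many bootstrap trees smooths out the instability of any single tree, at the cost of losing the single readable flowchart.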
Each model solves predictive problems differently. Choosing well means understanding the tradeoffs.
| Model | Core idea | Best when | Watch out for | Interpretability |
|---|---|---|---|---|
| Linear Regression | Fit a line | Continuous target, roughly linear patterns | Nonlinearity, outliers | High |
| Logistic Regression | Model log-odds | Binary outcomes, coefficients matter | Rigid boundaries | High |
| kNN | Nearest neighbors vote | Local similarity matters | Needs scaling, slow prediction | Low |
| Naive Bayes | Multiply conditional evidence | Fast baseline, categorical features | Independence assumption | Medium |
| Decision Tree | Ask the best next question | Need rules and visual logic | Overfitting, instability | Very High |
Decision trees are great at some jobs and weak at others. The real skill is knowing when to use them.
"A perfect training model is often a bad model. The best model is the simplest one that generalizes."
| Concept | Key idea |
|---|---|
| What it is | A flowchart of recursive if/then splits |
| Classification vs. Regression | Classification trees vote. Regression trees average. |
| Gini | Measures class mixing inside a node |
| Overfitting | Large trees memorize training noise |
| Pruning | Cut back using cp and validation error |
| Probabilities | One probability per leaf, not per person |
| Regression metrics | MAE and MSE summarize numeric miss |
| Biggest strength | Interpretability |
One tree is fragile. A forest of trees is robust.
Next: bagging, random forests, and boosting.
Same building block, dramatically better performance.