6  Tree-Based Methods

Author

Daniel Redel

7 Classification Trees

We first use classification trees to analyze the Carseats data set. In these data, Sales is a continuous variable, and so we begin by recoding it as a binary variable:

  • We create a new variable High that equals "No" when Sales <= 8 and "Yes" otherwise. Sales itself is later excluded from the model formula, since it perfectly determines High.
carseats <- ISLR::Carseats %>% 
  as_tibble() %>% 
  mutate(High = factor(ifelse(Sales <= 8, "No", "Yes")))

7.1 Summary Statistics

Table 7.1: Carseats Dataset
Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US High
9.50 138 73 11 276 120 Bad 42 17 Yes Yes Yes
11.22 111 48 16 260 83 Good 65 10 Yes Yes Yes
10.06 113 35 10 269 80 Medium 59 12 Yes Yes Yes
7.40 117 100 4 466 97 Medium 55 14 Yes Yes No
4.15 141 64 3 340 128 Bad 38 13 Yes No No
10.81 124 113 13 501 72 Bad 78 16 No Yes Yes
carseats %>% 
  tbl_summary(statistic = list(
    all_continuous() ~ "{mean} ({sd})"),
    digits = all_continuous() ~ 3)
Table 7.2: Summary Statistics

Characteristic N = 400¹
Sales 7.496 (2.824)
CompPrice 124.975 (15.335)
Income 68.658 (27.986)
Advertising 6.635 (6.650)
Population 264.840 (147.376)
Price 115.795 (23.677)
ShelveLoc
    Bad 96 (24%)
    Good 85 (21%)
    Medium 219 (55%)
Age 53.323 (16.200)
Education
    10 48 (12%)
    11 48 (12%)
    12 49 (12%)
    13 43 (11%)
    14 40 (10%)
    15 36 (9.0%)
    16 47 (12%)
    17 49 (12%)
    18 40 (10%)
Urban 282 (71%)
US 258 (65%)
High 164 (41%)
¹ Mean (SD); n (%)

7.2 Classification Trees

Now we try to predict High using all variables but Sales.

tree_carseats <- tree(High ~ . - Sales, data = carseats)
summary(tree_carseats)

Classification tree:
tree(formula = High ~ . - Sales, data = carseats)
Variables actually used in tree construction:
[1] "ShelveLoc"   "Price"       "Income"      "CompPrice"   "Population" 
[6] "Advertising" "Age"         "US"         
Number of terminal nodes:  27 
Residual mean deviance:  0.4575 = 170.7 / 373 
Misclassification error rate: 0.09 = 36 / 400 

The summary() function lists the variables that are used as internal nodes in the tree, the number of terminal nodes, and the (training) error rate. We see that the training error rate is 9%.
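As a quick sanity check, the training error can be recomputed by hand from the fitted tree's predictions on the training data. A minimal sketch, which should reproduce the 36/400 = 9% reported above:

## Predicted classes on the training data vs. the observed High values
train_pred <- predict(tree_carseats, newdata = carseats, type = "class")
mean(train_pred != carseats$High)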

From here on, however, we will use the rpart package to fit our trees.

The rpart() function has default stopping parameters that keep the tree from growing as large as the tree() fit above, namely minsplit and minbucket (a sketch of how to relax them through rpart.control() follows this list):

  • minsplit is "the minimum number of observations that must exist in a node in order for a split to be attempted"

  • minbucket is "the minimum number of observations in any terminal node"
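These stopping rules (together with the complexity parameter cp) can be overridden through rpart.control(). A minimal sketch with purely illustrative values, not the settings used in the fits below:

## rpart defaults: minsplit = 20, minbucket = round(minsplit/3), cp = 0.01
tree_deep <- rpart(High ~ . - Sales,
                   data = carseats,
                   method = "class",
                   control = rpart.control(minsplit = 5, minbucket = 2, cp = 0.001))

For the analysis below we keep the defaults: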

tree_carseats <- rpart(High ~ . - Sales, 
                       data = carseats, 
                       method = "class")

The rpart.plot package (together with rattle's fancyRpartPlot(), which builds on it) provides functions to let us easily visualize the decision tree. As the name implies, it only works with rpart trees.

fancyRpartPlot(tree_carseats, caption = NULL)

Figure 7.1: Decision Tree of Sales
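The same tree can also be drawn with rpart.plot() from the rpart.plot package itself; a minimal sketch with all arguments left at their defaults:

## Plain rpart.plot version of the same fitted tree
rpart.plot::rpart.plot(tree_carseats)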

We can see that shelving location (ShelveLoc) appears to be the most important variable for predicting high sales, since it forms the first split of the tree. However, if we actually report the variable importance measures, Price turns out to be the most important variable here:

Code
data.frame(vi=tree_carseats$variable.importance) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 7.3: Variable Importance
vi
Price 39.3458304
ShelveLoc 28.9918954
Age 13.0761815
Advertising 12.7110483
CompPrice 10.2253806
Income 6.2611750
Population 3.1669718
Education 0.9670985
US 0.1411614

7.2.1 Performance Assessment

In order to properly evaluate the performance of a classification tree on these data, we must estimate the test error rather than simply computing the training error.

set.seed(345)
index <- sample(1:nrow(carseats), 0.7*nrow(carseats)) 

train <- carseats[index,] # Create the training data 
test <- carseats[-index,] # Create the test data

dim(train)
[1] 280  12
dim(test)
[1] 120  12

We can build a confusion matrix to assess prediction accuracy on our test data:

tree_carseats <- rpart(High ~ . - Sales, 
                       data = train, 
                       method = "class")

tree.pred <- predict(tree_carseats, newdata = test, type = "class")
cm <- confusionMatrix(factor(tree.pred), test$High)
cm
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  51  13
       Yes 22  34
                                          
               Accuracy : 0.7083          
                 95% CI : (0.6184, 0.7877)
    No Information Rate : 0.6083          
    P-Value [Acc > NIR] : 0.01454         
                                          
                  Kappa : 0.4081          
                                          
 Mcnemar's Test P-Value : 0.17630         
                                          
            Sensitivity : 0.6986          
            Specificity : 0.7234          
         Pos Pred Value : 0.7969          
         Neg Pred Value : 0.6071          
             Prevalence : 0.6083          
         Detection Rate : 0.4250          
   Detection Prevalence : 0.5333          
      Balanced Accuracy : 0.7110          
                                          
       'Positive' Class : No              
                                          

We have an Accuracy of 70.83%.
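The same number can be recovered directly from the confusion-matrix counts, as a quick check:

## Correct predictions (diagonal) over all test cases: (51 + 34) / 120
sum(diag(cm$table)) / sum(cm$table)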

The confusion matrix can also be displayed more visually as a ggplot2 chart:

plt <- as.data.frame(cm$table)

plotTable <- plt %>%
  mutate(Correct = ifelse(Prediction == Reference, "TRUE", "FALSE")) %>%
  group_by(Reference) %>%
  mutate(prop = Freq/sum(Freq))
Code
ggplot(plotTable, aes(x = Reference, y = Prediction, fill = Correct, alpha = prop)) +
  geom_tile() +
  geom_text(aes(label = Freq), vjust = .5, fontface  = "bold", alpha = 1) +
  scale_fill_manual(values = c("TRUE" = "#AED8CC", "FALSE" = "#CD6688")) +
  theme_bw() +
  xlim(rev(levels(plt$Reference)))

Figure 7.2: Confusion Matrix: Decision Tree

7.2.2 Pruning the Tree

Next, we consider whether pruning the tree might lead to improved results.

  • The printcp() and plotcp() functions report the cross-validated error for each value of nsplit and can be used to decide how far to prune the tree.

  • The CP value with the lowest cross-validated error (xerror) in that table is the optimal one.

  • By default, rpart already performs some automated tuning through its complexity parameter (cp = 0.01); setting cp = 0 below grows the full tree so that we can choose the cost-complexity penalty ourselves.

set.seed(345)
tree_carseats <- rpart(High ~ . - Sales, 
                       data = train, 
                       method = "class",
                       cp = 0)
plotcp(tree_carseats)
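The same cost-complexity table can be printed to the console with printcp() (a formatted version appears in Table 7.4 below):

printcp(tree_carseats)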

Code
data.frame(tree_carseats$cptable) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 7.4: Cost-Complexity
CP nsplit rel.error xerror xstd
0.3418803 0 1.0000000 1.0000000 0.0705378
0.0726496 1 0.6581197 0.6581197 0.0638599
0.0598291 3 0.5128205 0.6324786 0.0630644
0.0341880 4 0.4529915 0.5641026 0.0607035
0.0170940 5 0.4188034 0.5811966 0.0613278
0.0000000 6 0.4017094 0.5811966 0.0613278

We prune back to the CP value with the lowest xerror:

ptree <- prune(tree_carseats,
              cp = tree_carseats$cptable[which.min(tree_carseats$cptable[,"xerror"]),"CP"])

prp(ptree, yesno = 2, extra = 6, box.palette=c("#CD6688", "#AED8CC")) # or extra=4

Figure 7.3: Pruned Classification Tree
tree.pred <- predict(ptree, newdata = test, type = "class")
cm <- confusionMatrix(factor(tree.pred), test$High)
cm$overall[1]
Accuracy 
     0.7 
  • Note that we use the prp() function (from the rpart.plot package) to customize our tree visualization.

8 Regression Trees

Here we fit a regression tree to the Boston data set. First, we split the data into training and test sets:

set.seed(345)

index <- sample(1:nrow(Boston), 0.7*nrow(Boston)) 

boston_train <- Boston[index,] # Create the training data 
boston_test <- Boston[-index,] # Create the test data

Fitting the model to the training data set:

set.seed(345)
tree_boston <- rpart(medv ~., 
                       data = boston_train, 
                       method = "anova") ## for regression trees
prp(tree_boston,yesno = 2, box.palette=c("#CD6688", "#AED8CC"))

Figure 8.1: Regression Tree of Home Value

8.0.1 Performance Assessment

To evaluate predictive performance for regression trees, we will use the MSE and RMSE metrics:

eval_results <- function(true, predicted, df) {
  SSE <- sum((predicted - true)^2)
  MSE <- SSE / nrow(df)    # Mean squared error
  RMSE <- sqrt(MSE)        # Root mean squared error
  
  # Return the model performance metrics as a data frame
  data.frame(MSE = MSE, RMSE = RMSE)
}
tree.pred <- predict(tree_boston, newdata = boston_test) ## no "class" type

eval_results(boston_test$medv, tree.pred, boston_test) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 8.1: Model Performance: Regression Tree
MSE RMSE
33.37421 5.777041

8.0.2 Pruning the Tree

Let’s consider the full tree:

set.seed(345)
tree_boston <- rpart(medv ~., 
                       data = boston_train, 
                       method = "anova", cp = 0.00) 
plotcp(tree_boston)

Figure 8.2: Cost-Complexity Plot for the Full Regression Tree
tree.pred <- predict(tree_boston, newdata = boston_test) ## no "class" type

eval_results(boston_test$medv, tree.pred, boston_test) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 8.2: Model Performance: Full Regression Tree
MSE RMSE
27.48546 5.242658

Again, we prune back to the CP value with the lowest xerror:

ptree <- prune(tree_boston,
              cp = tree_boston$cptable[which.min(tree_boston$cptable[,"xerror"]),"CP"])

prp(ptree, yesno = 2, box.palette=c("#CD6688", "#AED8CC")) # or extra=4

Figure 8.3: Pruned Regression Tree

Let’s get our MSE/RMSE metrics:

tree.pred <- predict(ptree, newdata = boston_test) ## no "class" type

eval_results(boston_test$medv, tree.pred, boston_test) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 8.3: Model Performance: Pruned Regression Tree
MSE RMSE
27.55519 5.249303

9 Bagging & Random Forests

Here we apply bagging and random forests to the Boston data set. We will be using the randomForest package as the engine.

  • A bagging model is a special case of a random forest with m = p, i.e. with mtry set equal to the number of predictors.

  • We can therefore set mtry = ncol(boston_train) - 1, which uses every column of the training data as a split candidate except the response medv.

9.1 Bagging

Let’s start with our bagging model. In bagging, we are using all the predictors available for each split.

set.seed(345)
bag_boston <- randomForest(medv ~ ., data = boston_train,
                           mtry = ncol(boston_train)-1, importance = TRUE)
bag_boston

Call:
 randomForest(formula = medv ~ ., data = boston_train, mtry = ncol(boston_train) -      1, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 13

          Mean of squared residuals: 12.39109
                    % Var explained: 83.76
Code
vip(bag_boston, aesthetics = list(fill = "#CD6688")) + theme_bw()

Figure 9.1: Variable Importance
bag.pred <- predict(bag_boston, newdata = boston_test)

eval_results(boston_test$medv, bag.pred, boston_test) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 9.1: Model Performance: Bagging
MSE RMSE
18.11275 4.255907

The test MSE from bagging (18.11) is much lower than that of the single pruned regression tree (27.56). Let's see whether the test error decreases further with a random forest.

9.2 Random Forest

Next, let us take a look at a Random Forest.

  • By default, randomForest() uses p/3 variables when building a random forest for regression trees, and sqrt(p) variables when building a random forest for classification trees.

  • Here we use mtry = 6.

set.seed(345)
rf_boston <- randomForest(medv ~ ., data = boston_train,
                           mtry = 6, importance = TRUE)
rf_boston

Call:
 randomForest(formula = medv ~ ., data = boston_train, mtry = 6,      importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 6

          Mean of squared residuals: 10.91157
                    % Var explained: 85.7
Code
vip(rf_boston, aesthetics = list(fill = "#CD6688")) + theme_bw()

Figure 9.2: Variable Importance
rf.pred <- predict(rf_boston, newdata = boston_test)

eval_results(boston_test$medv, rf.pred, boston_test) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 9.2: Model Performance: Random Forest
MSE RMSE
16.69036 4.085384

10 Boosting

We will now fit a boosted tree model.

  • To fit boosted regression trees we need to specify distribution = "gaussian".

  • To fit boosted classification trees instead, we would use distribution = "bernoulli".

We set n.trees = 5000 to grow 5,000 trees, and interaction.depth = 4 to limit the depth of each tree to 4:

set.seed(345)
boost_boston <- gbm(medv ~ .,
                    data = boston_train,
                    distribution = "gaussian",
                    n.trees = 5000,
                    interaction.depth = 4)

summary(boost_boston)

            var     rel.inf
rm           rm 42.88339199
lstat     lstat 24.71055094
dis         dis  6.80641563
crim       crim  5.75903125
ptratio ptratio  4.33128086
black     black  4.09087730
nox         nox  4.06061420
age         age  3.83636955
tax         tax  1.53100288
rad         rad  0.95943008
indus     indus  0.70584492
zn           zn  0.22531868
chas       chas  0.09987172
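Since rm and lstat are by far the most influential variables, we can also look at their partial dependence on the response; a minimal sketch using plot.gbm() with default settings:

## Partial dependence of medv on rm and on lstat
plot(boost_boston, i.var = "rm")
plot(boost_boston, i.var = "lstat")

Turning to test-set performance: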
## predict.gbm() takes the test data via newdata and the number of trees via n.trees
pred_boost <- predict(boost_boston, newdata = boston_test, n.trees = 5000)

eval_results(boston_test$medv, pred_boost, boston_test) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 10.1: Model Performance: Boosting
MSE RMSE
411.5822 20.28749
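Boosting also has a shrinkage (learning-rate) parameter that gbm() exposes. A minimal sketch of refitting with shrinkage = 0.2 (an illustrative value, not tuned here) and recomputing the test error with eval_results() from above:

set.seed(345)
boost_boston2 <- gbm(medv ~ .,
                     data = boston_train,
                     distribution = "gaussian",
                     n.trees = 5000,
                     interaction.depth = 4,
                     shrinkage = 0.2)

## Test-set predictions and error for the refitted model
pred_boost2 <- predict(boost_boston2, newdata = boston_test, n.trees = 5000)
eval_results(boston_test$medv, pred_boost2, boston_test)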