6  Tree-Based Methods

Author

Daniel Redel

7 Classification Trees

We first use classification trees to analyze the Carseats data set. In these data, Sales is a continuous variable, and so we begin by recoding it as a binary variable:

  • We create a new variable High that equals "No" when Sales <= 8 and "Yes" otherwise. Sales itself is later excluded from the model formula, since it perfectly determines High.
carseats <- ISLR::Carseats %>% 
  as_tibble() %>% 
  mutate(High = factor(ifelse(Sales <= 8, "No", "Yes")))

7.1 Summary Statistics

Table 7.1: Carseats Dataset
Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US High
9.50 138 73 11 276 120 Bad 42 17 Yes Yes Yes
11.22 111 48 16 260 83 Good 65 10 Yes Yes Yes
10.06 113 35 10 269 80 Medium 59 12 Yes Yes Yes
7.40 117 100 4 466 97 Medium 55 14 Yes Yes No
4.15 141 64 3 340 128 Bad 38 13 Yes No No
10.81 124 113 13 501 72 Bad 78 16 No Yes Yes
carseats %>% 
  tbl_summary(statistic = list(
    all_continuous() ~ "{mean} ({sd})"),
    digits = all_continuous() ~ 3)
Table 7.2: Summary Statistics

Characteristic N = 400¹
Sales 7.496 (2.824)
CompPrice 124.975 (15.335)
Income 68.658 (27.986)
Advertising 6.635 (6.650)
Population 264.840 (147.376)
Price 115.795 (23.677)
ShelveLoc
    Bad 96 (24%)
    Good 85 (21%)
    Medium 219 (55%)
Age 53.323 (16.200)
Education
    10 48 (12%)
    11 48 (12%)
    12 49 (12%)
    13 43 (11%)
    14 40 (10%)
    15 36 (9.0%)
    16 47 (12%)
    17 49 (12%)
    18 40 (10%)
Urban 282 (71%)
US 258 (65%)
High 164 (41%)
¹ Mean (SD); n (%)

7.2 Classification Trees

Now we try to predict High using all variables but Sales.

tree_carseats <- tree(High ~ . - Sales, data = carseats)
summary(tree_carseats)

Classification tree:
tree(formula = High ~ . - Sales, data = carseats)
Variables actually used in tree construction:
[1] "ShelveLoc"   "Price"       "Income"      "CompPrice"   "Population" 
[6] "Advertising" "Age"         "US"         
Number of terminal nodes:  27 
Residual mean deviance:  0.4575 = 170.7 / 373 
Misclassification error rate: 0.09 = 36 / 400 

The summary() function lists the variables that are used as internal nodes in the tree, the number of terminal nodes, and the (training) error rate. We see that the training error rate is 9%.
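As a quick sanity check, the training error can be recomputed by hand from the fitted tree's predictions on the training data. A minimal sketch, which should reproduce the 36/400 = 9% reported above:

## Predicted classes on the training data vs. the observed High values
train_pred <- predict(tree_carseats, newdata = carseats, type = "class")
mean(train_pred != carseats$High)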

From here on, however, we will use the rpart package to fit our trees.

The rpart() function has default stopping parameters that keep the tree from growing as large as the tree() fit above, namely minsplit and minbucket (a sketch of how to relax them through rpart.control() follows this list):

  • minsplit is "the minimum number of observations that must exist in a node in order for a split to be attempted"

  • minbucket is "the minimum number of observations in any terminal node"
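These stopping rules (together with the complexity parameter cp) can be overridden through rpart.control(). A minimal sketch with purely illustrative values, not the settings used in the fits below:

## rpart defaults: minsplit = 20, minbucket = round(minsplit/3), cp = 0.01
tree_deep <- rpart(High ~ . - Sales,
                   data = carseats,
                   method = "class",
                   control = rpart.control(minsplit = 5, minbucket = 2, cp = 0.001))

For the analysis below we keep the defaults: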

tree_carseats <- rpart(High ~ . - Sales, 
                       data = carseats, 
                       method = "class")

The rpart.plot package (together with rattle's fancyRpartPlot(), which builds on it) provides functions to let us easily visualize the decision tree. As the name implies, it only works with rpart trees.

fancyRpartPlot(tree_carseats, caption = NULL)

Figure 7.1: Decision Tree of Sales
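The same tree can also be drawn with rpart.plot() from the rpart.plot package itself; a minimal sketch with all arguments left at their defaults:

## Plain rpart.plot version of the same fitted tree
rpart.plot::rpart.plot(tree_carseats)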

We can see that shelving location (ShelveLoc) appears to be the most important variable for predicting high sales, since it forms the first split of the tree. However, if we actually report the variable importance measures, Price turns out to be the most important variable here:

Code
data.frame(vi=tree_carseats$variable.importance) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 7.3: Variable Importance
vi
Price 39.3458304
ShelveLoc 28.9918954
Age 13.0761815
Advertising 12.7110483
CompPrice 10.2253806
Income 6.2611750
Population 3.1669718
Education 0.9670985
US 0.1411614

7.2.1 Performance Assessment

In order to properly evaluate the performance of a classification tree on these data, we must estimate the test error rather than simply computing the training error.

set.seed(345)
index <- sample(1:nrow(carseats), 0.7*nrow(carseats)) 

train <- carseats[index,] # Create the training data 
test <- carseats[-index,] # Create the test data

dim(train)
[1] 280  12
dim(test)
[1] 120  12

We can build a confusion matrix to assess prediction accuracy on our test data:

tree_carseats <- rpart(High ~ . - Sales, 
                       data = train, 
                       method = "class")

tree.pred <- predict(tree_carseats, newdata = test, type = "class")
cm <- confusionMatrix(factor(tree.pred), test$High)
cm
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  51  13
       Yes 22  34
                                          
               Accuracy : 0.7083          
                 95% CI : (0.6184, 0.7877)
    No Information Rate : 0.6083          
    P-Value [Acc > NIR] : 0.01454         
                                          
                  Kappa : 0.4081          
                                          
 Mcnemar's Test P-Value : 0.17630         
                                          
            Sensitivity : 0.6986          
            Specificity : 0.7234          
         Pos Pred Value : 0.7969          
         Neg Pred Value : 0.6071          
             Prevalence : 0.6083          
         Detection Rate : 0.4250          
   Detection Prevalence : 0.5333          
      Balanced Accuracy : 0.7110          
                                          
       'Positive' Class : No              
                                          

We have an Accuracy of 70.83%.
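The same number can be recovered directly from the confusion-matrix counts, as a quick check:

## Correct predictions (diagonal) over all test cases: (51 + 34) / 120
sum(diag(cm$table)) / sum(cm$table)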

The confusion matrix can also be displayed more visually as a ggplot2 chart:

plt <- as.data.frame(cm$table)

plotTable <- plt %>%
  mutate(Correct = ifelse(Prediction == Reference, "TRUE", "FALSE")) %>%
  group_by(Reference) %>%
  mutate(prop = Freq/sum(Freq))
Code
ggplot(plotTable, aes(x = Reference, y = Prediction, fill = Correct, alpha = prop)) +
  geom_tile() +
  geom_text(aes(label = Freq), vjust = .5, fontface  = "bold", alpha = 1) +
  scale_fill_manual(values = c("TRUE" = "#AED8CC", "FALSE" = "#CD6688")) +
  theme_bw() +
  xlim(rev(levels(plt$Reference)))

Figure 7.2: Confusion Matrix: Decision Tree

7.2.2 Pruning the Tree

Next, we consider whether pruning the tree might lead to improved results.

  • The printcp() and plotcp() functions report the cross-validated error for each value of nsplit and can be used to decide how far to prune the tree.

  • The CP value with the lowest cross-validated error (xerror) in that table is the optimal one.

  • By default, rpart already performs some automated tuning through its complexity parameter (cp = 0.01); setting cp = 0 below grows the full tree so that we can choose the cost-complexity penalty ourselves.

set.seed(345)
tree_carseats <- rpart(High ~ . - Sales, 
                       data = train, 
                       method = "class",
                       cp = 0)
plotcp(tree_carseats)
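The same cost-complexity table can be printed to the console with printcp() (a formatted version appears in Table 7.4 below):

printcp(tree_carseats)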

Code
data.frame(tree_carseats$cptable) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 7.4: Cost-Complexity
CP nsplit rel.error xerror xstd
0.3418803 0 1.0000000 1.0000000 0.0705378
0.0726496 1 0.6581197 0.6581197 0.0638599
0.0598291 3 0.5128205 0.6324786 0.0630644
0.0341880 4 0.4529915 0.5641026 0.0607035
0.0170940 5 0.4188034 0.5811966 0.0613278
0.0000000 6 0.4017094 0.5811966 0.0613278

We prune back to the CP value with the lowest xerror:

ptree <- prune(tree_carseats,
              cp = tree_carseats$cptable[which.min(tree_carseats$cptable[,"xerror"]),"CP"])

prp(ptree, yesno = 2, extra = 6, box.palette=c("#CD6688", "#AED8CC")) # or extra=4

Figure 7.3: Pruned Classification Tree
tree.pred <- predict(ptree, newdata = test, type = "class")
cm <- confusionMatrix(factor(tree.pred), test$High)
cm$overall[1]
Accuracy 
     0.7 
  • Note that we use the prp() function (from the rpart.plot package) to customize our tree visualization.

8 Regression Trees

Here we fit a regression tree to the Boston data set. First, we split the data into training and test sets:

set.seed(345)

index <- sample(1:nrow(Boston), 0.7*nrow(Boston)) 

boston_train <- Boston[index,] # Create the training data 
boston_test <- Boston[-index,] # Create the test data

Fitting the model to the training data set:

set.seed(345)
tree_boston <- rpart(medv ~., 
                       data = boston_train, 
                       method = "anova") ## for regression trees
prp(tree_boston,yesno = 2, box.palette=c("#CD6688", "#AED8CC"))

Figure 8.1: Regression Tree of Home Value

8.0.1 Performance Assessment

To evaluate predictive performance for regression trees, we will use the MSE and RMSE metrics:

eval_results <- function(true, predicted, df) {
  SSE <- sum((predicted - true)^2)
  MSE <- SSE / nrow(df)    # Mean squared error
  RMSE <- sqrt(MSE)        # Root mean squared error
  
  # Return the model performance metrics as a data frame
  data.frame(MSE = MSE, RMSE = RMSE)
}
tree.pred <- predict(tree_boston, newdata = boston_test) ## no "class" type

eval_results(boston_test$medv, tree.pred, boston_test) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 8.1: Model Performance: Regression Tree
MSE RMSE
33.37421 5.777041

8.0.2 Pruning the Tree

Let’s consider the full tree:

set.seed(345)
tree_boston <- rpart(medv ~., 
                       data = boston_train, 
                       method = "anova", cp = 0.00) 
plotcp(tree_boston)

Figure 8.2: Cost-Complexity Plot for the Full Regression Tree
tree.pred <- predict(tree_boston, newdata = boston_test) ## no "class" type

eval_results(boston_test$medv, tree.pred, boston_test) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 8.2: Model Performance: Full Regression Tree
MSE RMSE
27.48546 5.242658

Again, we prune back to the CP value with the lowest xerror:

ptree <- prune(tree_boston,
              cp = tree_boston$cptable[which.min(tree_boston$cptable[,"xerror"]),"CP"])

prp(ptree, yesno = 2, box.palette=c("#CD6688", "#AED8CC")) # or extra=4

Figure 8.3: Pruned Regression Tree

Let’s get our MSE/RMSE metrics:

tree.pred <- predict(ptree, newdata = boston_test) ## no "class" type

eval_results(boston_test$medv, tree.pred, boston_test) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 8.3: Model Performance: Pruned Regression Tree
MSE RMSE
27.55519 5.249303

9 Bagging & Random Forests

Here we apply bagging and random forests to the Boston data set. We will be using the randomForest package as the engine.

  • A bagging model is a special case of a random forest with m = p, i.e. with mtry set equal to the number of predictors.

  • We can therefore set mtry = ncol(boston_train) - 1, which uses every column of the training data as a split candidate except the response medv.

9.1 Bagging

Let’s start with our bagging model. In bagging, we are using all the predictors available for each split.

set.seed(345)
bag_boston <- randomForest(medv ~ ., data = boston_train,
                           mtry = ncol(boston_train)-1, importance = TRUE)
bag_boston

Call:
 randomForest(formula = medv ~ ., data = boston_train, mtry = ncol(boston_train) -      1, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 13

          Mean of squared residuals: 12.39109
                    % Var explained: 83.76
Code
vip(bag_boston, aesthetics = list(fill = "#CD6688")) + theme_bw()

Figure 9.1: Variable Importance
bag.pred <- predict(bag_boston, newdata = boston_test)

eval_results(boston_test$medv, bag.pred, boston_test) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 9.1: Model Performance: Bagging
MSE RMSE
18.11275 4.255907

The test MSE from bagging (18.11) is much lower than that of the single pruned regression tree (27.56). Let's see whether the test error decreases further with a random forest.

9.2 Random Forest

Next, let us take a look at a Random Forest.

  • By default, randomForest() uses p/3 variables when building a random forest for regression trees, and sqrt(p) variables when building a random forest for classification trees.

  • Here we use mtry = 6.

set.seed(345)
rf_boston <- randomForest(medv ~ ., data = boston_train,
                           mtry = 6, importance = TRUE)
rf_boston

Call:
 randomForest(formula = medv ~ ., data = boston_train, mtry = 6,      importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 6

          Mean of squared residuals: 10.91157
                    % Var explained: 85.7
Code
vip(rf_boston, aesthetics = list(fill = "#CD6688")) + theme_bw()

Figure 9.2: Variable Importance
rf.pred <- predict(rf_boston, newdata = boston_test)

eval_results(boston_test$medv, rf.pred, boston_test) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 9.2: Model Performance: Random Forest
MSE RMSE
16.69036 4.085384

10 Boosting

We will now fit a boosted tree model.

  • To fit boosted regression trees we need to specify distribution = "gaussian".

  • To fit boosted classification trees instead, we would use distribution = "bernoulli".

We set n.trees = 5000 to grow 5,000 trees, and interaction.depth = 4 to limit the depth of each tree to 4:

set.seed(345)
boost_boston <- gbm(medv ~ .,
                    data = boston_train,
                    distribution = "gaussian",
                    n.trees = 5000,
                    interaction.depth = 4)

summary(boost_boston)

            var     rel.inf
rm           rm 42.88339199
lstat     lstat 24.71055094
dis         dis  6.80641563
crim       crim  5.75903125
ptratio ptratio  4.33128086
black     black  4.09087730
nox         nox  4.06061420
age         age  3.83636955
tax         tax  1.53100288
rad         rad  0.95943008
indus     indus  0.70584492
zn           zn  0.22531868
chas       chas  0.09987172
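Since rm and lstat are by far the most influential variables, we can also look at their partial dependence on the response; a minimal sketch using plot.gbm() with default settings:

## Partial dependence of medv on rm and on lstat
plot(boost_boston, i.var = "rm")
plot(boost_boston, i.var = "lstat")

Turning to test-set performance: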
## predict.gbm() takes the test data via newdata and the number of trees via n.trees
pred_boost <- predict(boost_boston, newdata = boston_test, n.trees = 5000)

eval_results(boston_test$medv, pred_boost, boston_test) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 10.1: Model Performance: Boosting
MSE RMSE
411.5822 20.28749
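Boosting also has a shrinkage (learning-rate) parameter that gbm() exposes. A minimal sketch of refitting with shrinkage = 0.2 (an illustrative value, not tuned here) and recomputing the test error with eval_results() from above:

set.seed(345)
boost_boston2 <- gbm(medv ~ .,
                     data = boston_train,
                     distribution = "gaussian",
                     n.trees = 5000,
                     interaction.depth = 4,
                     shrinkage = 0.2)

## Test-set predictions and error for the refitted model
pred_boost2 <- predict(boost_boston2, newdata = boston_test, n.trees = 5000)
eval_results(boston_test$medv, pred_boost2, boston_test)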