carseats <- ISLR::Carseats %>%
as_tibble() %>%
mutate(High = factor(ifelse(Sales <= 8, "No", "Yes")))
6 Tree-Based Methods
7 Classification Trees
We first use classification trees to analyze the Carseats data set. In these data, Sales is a continuous variable, and so we begin by recoding it as a binary variable:
- We create a new variable High to denote whether Sales <= 8; the Sales predictor is then removed from the model formula, as it is a perfect predictor of High.
7.1 Summary Statistics
| Sales | CompPrice | Income | Advertising | Population | Price | ShelveLoc | Age | Education | Urban | US | High |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9.50 | 138 | 73 | 11 | 276 | 120 | Bad | 42 | 17 | Yes | Yes | Yes |
| 11.22 | 111 | 48 | 16 | 260 | 83 | Good | 65 | 10 | Yes | Yes | Yes |
| 10.06 | 113 | 35 | 10 | 269 | 80 | Medium | 59 | 12 | Yes | Yes | Yes |
| 7.40 | 117 | 100 | 4 | 466 | 97 | Medium | 55 | 14 | Yes | Yes | No |
| 4.15 | 141 | 64 | 3 | 340 | 128 | Bad | 38 | 13 | Yes | No | No |
| 10.81 | 124 | 113 | 13 | 501 | 72 | Bad | 78 | 16 | No | Yes | Yes |
carseats %>%
tbl_summary(statistic = list(
all_continuous() ~ "{mean} ({sd})"),
digits = all_continuous() ~ 3)
| Characteristic | N = 400¹ |
|---|---|
| Sales | 7.496 (2.824) |
| CompPrice | 124.975 (15.335) |
| Income | 68.658 (27.986) |
| Advertising | 6.635 (6.650) |
| Population | 264.840 (147.376) |
| Price | 115.795 (23.677) |
| ShelveLoc | |
| Bad | 96 (24%) |
| Good | 85 (21%) |
| Medium | 219 (55%) |
| Age | 53.323 (16.200) |
| Education | |
| 10 | 48 (12%) |
| 11 | 48 (12%) |
| 12 | 49 (12%) |
| 13 | 43 (11%) |
| 14 | 40 (10%) |
| 15 | 36 (9.0%) |
| 16 | 47 (12%) |
| 17 | 49 (12%) |
| 18 | 40 (10%) |
| Urban | 282 (71%) |
| US | 258 (65%) |
| High | 164 (41%) |
| ¹ Mean (SD); n (%) | |
7.2 Classification Trees
Now we try to predict High using all variables but Sales.
tree_carseats <- tree(High ~ . - Sales, data = carseats)
summary(tree_carseats)
Classification tree:
tree(formula = High ~ . - Sales, data = carseats)
Variables actually used in tree construction:
[1] "ShelveLoc" "Price" "Income" "CompPrice" "Population"
[6] "Advertising" "Age" "US"
Number of terminal nodes: 27
Residual mean deviance: 0.4575 = 170.7 / 373
Misclassification error rate: 0.09 = 36 / 400
The summary() function lists the variables that are used as internal nodes in the tree, the number of terminal nodes, and the (training) error rate. We see that the training error rate is 9%.
However, we will actually use the rpart package for the remainder of this analysis.
The rpart() function has some default parameters that prevented our tree from growing, namely minsplit and minbucket (a sketch of how to override them follows below).
minsplit is "the minimum number of observations that must exist in a node in order for a split to be attempted"; minbucket is "the minimum number of observations in any terminal node".
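If we wanted to let the tree grow larger, these defaults could be overridden through rpart.control(). The call below is only an illustrative sketch; the object name tree_unrestricted and the specific values are assumptions, not settings used elsewhere in this document.
## Illustrative sketch: relaxing the default stopping rules
## (assumes the rpart package is loaded, as elsewhere in this document)
tree_unrestricted <- rpart(High ~ . - Sales,
                           data = carseats,
                           method = "class",
                           control = rpart.control(minsplit = 5,   # allow splits on small nodes
                                                   minbucket = 2,  # allow small terminal nodes
                                                   cp = 0.001))    # keep splits with small improvement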
tree_carseats <- rpart(High ~ . - Sales,
data = carseats,
method = "class")The rpart.plot package provides functions to let us easily visualize the decision tree. As the name implies, it only works with rpart trees.
We can see that the most important variable for predicting high sales appears to be shelving location (ShelveLoc), as it forms the first split. But if we actually report the variable importance, we will see that Price is the most important variable here:
data.frame(vi=tree_carseats$variable.importance) %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
7.2.1 Performance Assessment
In order to properly evaluate the performance of a classification tree on these data, we must estimate the test error rather than simply computing the training error.
set.seed(345)
index <- sample(1:nrow(carseats), 0.7*nrow(carseats))
train <- carseats[index,] # Create the training data
test <- carseats[-index,] # Create the test data
dim(train)
[1] 280  12
dim(test)
[1] 120  12
We can build a confusion matrix to assess prediction accuracy on our test data:
tree_carseats <- rpart(High ~ . - Sales,
data = train,
method = "class")
tree.pred <- predict(tree_carseats, newdata = test, type = "class")
cm <- confusionMatrix(factor(tree.pred), test$High)
cm
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 51 13
Yes 22 34
Accuracy : 0.7083
95% CI : (0.6184, 0.7877)
No Information Rate : 0.6083
P-Value [Acc > NIR] : 0.01454
Kappa : 0.4081
Mcnemar's Test P-Value : 0.17630
Sensitivity : 0.6986
Specificity : 0.7234
Pos Pred Value : 0.7969
Neg Pred Value : 0.6071
Prevalence : 0.6083
Detection Rate : 0.4250
Detection Prevalence : 0.5333
Balanced Accuracy : 0.7110
'Positive' Class : No
We have an accuracy of (51 + 34) / 120 = 70.83%.
A more visual representation of the confusion matrix can be done by generating a ggplot2 chart:
plt <- as.data.frame(cm$table)
plotTable <- plt %>%
mutate(Correct = ifelse(plt$Prediction == plt$Reference, "TRUE", "FALSE")) %>%
group_by(Reference) %>%
mutate(prop = Freq/sum(Freq))
ggplot(plotTable, aes(x = Reference, y = Prediction, fill = Correct, alpha = prop)) +
geom_tile() +
geom_text(aes(label = Freq), vjust = .5, fontface = "bold", alpha = 1) +
scale_fill_manual(values = c("TRUE" = "#AED8CC", "FALSE" = "#CD6688")) +
theme_bw() +
xlim(rev(levels(plt$Reference)))
7.2.2 Pruning the Tree
Next, we consider whether pruning the tree might lead to improved results.
The printcp() and plotcp() functions provide the cross-validated error for each value of nsplit and can be used to prune the tree. The CP value with the lowest cross-validated error (xerror) reported by printcp() is the optimal one. By default, rpart already performs some automated tuning.
set.seed(345)
tree_carseats <- rpart(High ~ . - Sales,
data = train,
method = "class",
cp = 0)
plotcp(tree_carseats)
data.frame(tree_carseats$cptable) %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| CP | nsplit | rel.error | xerror | xstd |
|---|---|---|---|---|
| 0.3418803 | 0 | 1.0000000 | 1.0000000 | 0.0705378 |
| 0.0726496 | 1 | 0.6581197 | 0.6581197 | 0.0638599 |
| 0.0598291 | 3 | 0.5128205 | 0.6324786 | 0.0630644 |
| 0.0341880 | 4 | 0.4529915 | 0.5641026 | 0.0607035 |
| 0.0170940 | 5 | 0.4188034 | 0.5811966 | 0.0613278 |
| 0.0000000 | 6 | 0.4017094 | 0.5811966 | 0.0613278 |
We prune the tree to the CP value with the lowest xerror:
ptree <- prune(tree_carseats,
cp = tree_carseats$cptable[which.min(tree_carseats$cptable[,"xerror"]),"CP"])
prp(ptree, yesno = 2, extra = 6, box.palette = c("#CD6688", "#AED8CC")) # or extra = 4
tree.pred <- predict(ptree, newdata = test, type = "class")
cm <- confusionMatrix(factor(tree.pred), test$High)
cm$overall[1]
Accuracy
    0.7
- Note that we use the prp() function to customize the tree visualization.
8 Regression Trees
Here we fit a regression tree to the Boston data set. First, we create a training set, and fit the tree to the training data:
set.seed(345)
index <- sample(1:nrow(Boston), 0.7*nrow(Boston))
boston_train <- Boston[index,] # Create the training data
boston_test <- Boston[-index,] # Create the test data
Fitting the model to the training data set:
set.seed(345)
tree_boston <- rpart(medv ~.,
data = boston_train,
method = "anova") ## for regression treesprp(tree_boston,yesno = 2, box.palette=c("#CD6688", "#AED8CC"))8.0.1 Performance Assessment
To evaluate predictive performance for regression trees, we will use the MSE and RMSE metrics:
eval_results <- function(true, predicted, df) {
SSE <- sum((predicted - true)^2)
MSE <- SSE/nrow(df) # Mean-Squared Error
RMSE = sqrt(SSE/nrow(df)) # Root Mean-Squared Error
# Model performance metrics
data.frame(
MSE = MSE,
RMSE = RMSE
)
}
8.0.2 Pruning the Tree
Let’s first grow the full (unpruned) tree by setting cp = 0:
set.seed(345)
tree_boston <- rpart(medv ~.,
data = boston_train,
method = "anova", cp = 0.00)
plotcp(tree_boston)
tree.pred <- predict(tree_boston, newdata = boston_test) ## no "class" type
eval_results(boston_test$medv, tree.pred, boston_test) %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Again, we prune to the CP value with the lowest xerror:
ptree <- prune(tree_boston,
cp = tree_boston$cptable[which.min(tree_boston$cptable[,"xerror"]),"CP"])
prp(ptree, yesno = 2, box.palette = c("#CD6688", "#AED8CC")) # or extra = 4
Let's get the MSE/RMSE metrics for the pruned tree:
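A minimal sketch following the same pattern as for the unpruned tree, reusing ptree, boston_test, and the eval_results() helper defined above (the object name ptree.pred is illustrative):
ptree.pred <- predict(ptree, newdata = boston_test)
eval_results(boston_test$medv, ptree.pred, boston_test) %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))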
9 Bagging & Random Forests
Here we apply bagging and random forests to the Boston data set. We will be using the randomForest package as the engine.
A bagging model is a special case of a random forest where mtry is equal to the number of predictors. We can set mtry to ncol(boston_train) - 1, which uses all columns of the data except the dependent variable medv as candidate predictors at each split.
9.1 Bagging
Let’s start with our bagging model. In bagging, all of the available predictors are considered at each split.
set.seed(345)
bag_boston <- randomForest(medv ~ ., data = boston_train,
mtry = ncol(boston_train)-1, importance = TRUE)
bag_boston
Call:
randomForest(formula = medv ~ ., data = boston_train, mtry = ncol(boston_train) - 1, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 13
Mean of squared residuals: 12.39109
% Var explained: 83.76
vip(bag_boston, aesthetics = list(fill = "#CD6688")) + theme_bw()
bag.pred <- predict(bag_boston, newdata = boston_test)
eval_results(boston_test$medv, bag.pred, boston_test) %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
The test MSE is much lower using bagging than using a single decision tree. Let's see whether the test error decreases further when using a random forest instead.
9.2 Random Forest
Next, let us take a look at a Random Forest.
By default, randomForest() uses p/3 variables when building a random forest of regression trees, and sqrt(p) variables when building a random forest of classification trees. Here we use mtry = 6.
set.seed(345)
rf_boston <- randomForest(medv ~ ., data = boston_train,
mtry = 6, importance = TRUE)
rf_boston
Call:
randomForest(formula = medv ~ ., data = boston_train, mtry = 6, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 6
Mean of squared residuals: 10.91157
% Var explained: 85.7
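To check whether the test error does decrease relative to bagging, here is a minimal sketch that reuses boston_test and the eval_results() helper from above (the object name rf.pred is illustrative):
rf.pred <- predict(rf_boston, newdata = boston_test)
eval_results(boston_test$medv, rf.pred, boston_test) %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))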
vip(rf_boston, aesthetics = list(fill = "#CD6688")) + theme_bw()
10 Boosting
We will now fit a boosted tree model.
To fit boosted regression trees we need to specify distribution = "gaussian". To fit classification trees instead, we would use distribution = "bernoulli".
We set n.trees = 5000 to grow 5,000 trees and interaction.depth = 4 to limit the depth of each tree to 4:
set.seed(345)
boost_boston <- gbm(medv ~ .,
data = boston_train,
distribution = "gaussian",
n.trees = 5000,
interaction.depth = 4)
summary(boost_boston)
            var     rel.inf
rm rm 42.88339199
lstat lstat 24.71055094
dis dis 6.80641563
crim crim 5.75903125
ptratio ptratio 4.33128086
black black 4.09087730
nox nox 4.06061420
age age 3.83636955
tax tax 1.53100288
rad rad 0.95943008
indus indus 0.70584492
zn zn 0.22531868
chas chas 0.09987172
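The boosted model can be evaluated on the test set in the same way; note that predict() for gbm objects requires the number of trees via n.trees. A minimal sketch reusing boston_test and the eval_results() helper (the object name boost.pred is illustrative):
boost.pred <- predict(boost_boston, newdata = boston_test, n.trees = 5000)
eval_results(boston_test$medv, boost.pred, boston_test) %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))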