5 Model Selection & Regularization
6 ISLR: Baseball Players Salary
Here we apply regularization methods to the Hitters data. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year.
First, let’s remove the observations with missing values:
library(dplyr)
hitters <- ISLR::Hitters %>% na.omit()
| | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | Salary | NewLeague |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -Alan Ashby | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | 69 | 321 | 414 | 375 | N | W | 632 | 43 | 10 | 475.0 | N |
| -Alvin Davis | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | 63 | 224 | 266 | 263 | A | W | 880 | 82 | 14 | 480.0 | A |
| -Andre Dawson | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | 225 | 828 | 838 | 354 | N | E | 200 | 11 | 3 | 500.0 | N |
| -Andres Galarraga | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | 12 | 48 | 46 | 33 | N | E | 805 | 40 | 4 | 91.5 | N |
| -Alfredo Griffin | 594 | 169 | 4 | 74 | 51 | 35 | 11 | 4408 | 1133 | 19 | 501 | 336 | 194 | A | W | 282 | 421 | 25 | 750.0 | A |
| -Al Newman | 185 | 37 | 1 | 23 | 8 | 21 | 2 | 214 | 42 | 1 | 30 | 9 | 24 | N | E | 76 | 127 | 7 | 70.0 | A |
6.1 Summary Statistics
We can display summary statistics for a selection of variables using the gtsummary package:
library(gtsummary)
hitters %>%
  dplyr::select(Salary, Years, Hits, CHits, HmRun, Runs, AtBat, CAtBat, Walks) %>%
  tbl_summary(statistic = list(
    all_continuous() ~ "{mean} ({sd})"),
    digits = all_continuous() ~ 3)
| Characteristic | N = 263¹ |
|---|---|
| Salary | 535.926 (451.119) |
| Years | 7.312 (4.794) |
| Hits | 107.829 (45.125) |
| CHits | 722.186 (648.200) |
| HmRun | 11.620 (8.757) |
| Runs | 54.745 (25.540) |
| AtBat | 403.643 (147.307) |
| CAtBat | 2,657.544 (2,286.583) |
| Walks | 41.114 (21.718) |
| ¹ Mean (SD) | |
Let’s take a closer look at the Salary distribution:
Code
library(ggplot2)
hitters %>%
  ggplot(aes(x = Salary, fill = Division)) +
  geom_density(alpha = 0.7) +
  scale_fill_manual(values = mycolors) + theme_bw()
6.2 Ridge Regression
We will use the glmnet package to perform ridge regression. The main function in this package is glmnet(), which can be used to fit ridge regression models, lasso models, and more.
This function has slightly different syntax from other model-fitting functions that we have encountered thus far. In particular, we must pass in an x matrix as well as a y vector, and we do not use the y ∼ x syntax.
library(glmnet)
x <- model.matrix(Salary~., data = hitters)[,-1]
y <- hitters$Salary
The glmnet() function has an alpha argument that determines what type of model is fit: if alpha=0 a ridge regression model is fit, and if alpha=1 a lasso model is fit. We first fit a ridge regression model.
We choose to compute the ridge regression over a range of lambda values from 10^10 (very close to the null model, including only the intercept) to 10^(-2) (very close to the full OLS model). Note that by default, the glmnet() function standardizes the variables so that they are on the same scale. To turn off this default setting, use the argument standardize = FALSE.
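As a minimal sketch on simulated data (x_sim and y_sim are invented for illustration, not part of the Hitters analysis), the two settings look like this:

```r
library(glmnet)

set.seed(1)
x_sim <- matrix(rnorm(100 * 3), ncol = 3)  # three simulated predictors
y_sim <- rnorm(100)                        # simulated response

# Default: glmnet() standardizes each column internally before fitting
fit_std   <- glmnet(x_sim, y_sim, alpha = 0, standardize = TRUE)

# Standardization off: the penalty now depends on each column's raw scale,
# so predictors measured in large units are penalized relatively less
fit_unstd <- glmnet(x_sim, y_sim, alpha = 0, standardize = FALSE)
```

Either way, glmnet() reports the coefficients back on the original scale of the predictors; the option only changes how the penalty weighs each one.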
grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)
Ridge gives us a path of possible models:
par(mai=c(.9,.8,.8,.8))
par(mfrow=c(1,1))
plot(ridge.mod, xvar = "lambda", label = TRUE)
Let’s quickly check how the coefficients change at different values of \lambda:
First with \lambda=11,498:
Code
ridge.mod[["lambda"]][50]
[1] 11497.57
Code
library(kableExtra)
coefficients <- coef(ridge.mod)[, 50]
as.data.frame(coefficients) %>%
kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | coefficients |
|---|---|
| (Intercept) | 407.3560502 |
| AtBat | 0.0369572 |
| Hits | 0.1381803 |
| HmRun | 0.5246300 |
| Runs | 0.2307015 |
| RBI | 0.2398415 |
| Walks | 0.2896187 |
| Years | 1.1077029 |
| CAtBat | 0.0031318 |
| CHits | 0.0116536 |
| CHmRun | 0.0875457 |
| CRuns | 0.0233799 |
| CRBI | 0.0241383 |
| CWalks | 0.0250154 |
| LeagueN | 0.0850281 |
| DivisionW | -6.2154410 |
| PutOuts | 0.0164826 |
| Assists | 0.0026130 |
| Errors | -0.0205027 |
| NewLeagueN | 0.3014335 |
Now with \lambda=705:
Code
ridge.mod[["lambda"]][60]
[1] 705.4802
Code
coefficients <- coef(ridge.mod)[, 60]
as.data.frame(coefficients) %>%
kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | coefficients |
|---|---|
| (Intercept) | 54.3251995 |
| AtBat | 0.1121111 |
| Hits | 0.6562241 |
| HmRun | 1.1798091 |
| Runs | 0.9376971 |
| RBI | 0.8471855 |
| Walks | 1.3198795 |
| Years | 2.5964042 |
| CAtBat | 0.0108341 |
| CHits | 0.0467456 |
| CHmRun | 0.3377732 |
| CRuns | 0.0935553 |
| CRBI | 0.0978040 |
| CWalks | 0.0718961 |
| LeagueN | 13.6837019 |
| DivisionW | -54.6587775 |
| PutOuts | 0.1185229 |
| Assists | 0.0160604 |
| Errors | -0.7035865 |
| NewLeagueN | 8.6118121 |
Main Idea: As \lambda decreases, the coefficients are shrunk less and therefore tend to be larger in magnitude (although individual coefficients need not change monotonically along the path).
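One quick way to see this shrinkage (a sketch, assuming the ridge.mod object fitted above is still in memory) is to compare the l2 norm of the coefficient vector at the two values of \lambda:

```r
# l2 norm of the non-intercept coefficients at lambda ~ 11,498 (column 50)
# versus lambda ~ 705 (column 60); the norm grows as lambda shrinks
sqrt(sum(coef(ridge.mod)[-1, 50]^2))
sqrt(sum(coef(ridge.mod)[-1, 60]^2))
```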
6.2.1 Validation Approach (Train and Test Data)
We now split the samples into a training set (70%) and a test set (30%) in order to estimate the test error of ridge regression and the lasso.
set.seed(123456)
index <- sample(1:nrow(hitters), 0.7*nrow(hitters))
train_hitters <- hitters[index,] # Create the training data
test_hitters <- hitters[-index,] # Create the test data
dim(train_hitters)
[1] 184 20
dim(test_hitters)
[1] 79 20
We convert the training and test data into model matrices:
# Train
x_train <- model.matrix(Salary~., data = train_hitters)[,-1]
y_train <- train_hitters$Salary
# Test
x_test <- model.matrix(Salary~., data = test_hitters)[,-1]
y_test <- test_hitters$Salary
Now, to evaluate its predictive performance, we will use the MSE/RMSE metrics:
eval_results <- function(true, predicted, df) {
SSE <- sum((predicted - true)^2)
MSE <- SSE/nrow(df) # Mean-Squared Error
RMSE = sqrt(SSE/nrow(df)) # Root Mean-Squared Error
# Model performance metrics
data.frame(
MSE = MSE,
RMSE = RMSE
)
}
Let’s run the ridge regression on our training dataset and evaluate it using \lambda=4:
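As a quick sanity check of eval_results() on a toy data frame (the values here are invented purely for illustration):

```r
toy <- data.frame(y = c(1, 2, 3))

# Perfect predictions give zero error
eval_results(true = toy$y, predicted = toy$y, df = toy)      # MSE = 0, RMSE = 0

# Every prediction off by 1: SSE = 3, so MSE = 3/3 = 1 and RMSE = 1
eval_results(true = toy$y, predicted = toy$y + 1, df = toy)  # MSE = 1, RMSE = 1
```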
ridge.mod1 <- glmnet(x_train, y_train, alpha = 0, lambda = grid)
# Test Performance (MSE)
predict_ridge <- predict(ridge.mod1, s = 4, newx = x_test)
ridge_MSE <- eval_results(y_test, predict_ridge, test_hitters)
| MSE | RMSE |
|---|---|
| 123038.6 | 350.7686 |
What if we use a very large \lambda ?
ridge.mod2 <- glmnet(x_train, y_train, alpha = 0, lambda = grid)
# Test Performance (MSE)
predict_ridge <- predict(ridge.mod2, s = 1e10, newx = x_test)
ridge_MSE <- eval_results(y_test, predict_ridge, test_hitters)
| MSE | RMSE |
|---|---|
| 203250.7 | 450.8334 |
So fitting a ridge regression model with \lambda = 4 leads to a much lower test MSE than fitting a model with a very large \lambda, which is essentially the null model with just an intercept.
6.2.2 Choosing \lambda with Cross-Validation
In general, instead of arbitrarily choosing \lambda, it would be better to use cross-validation to choose the tuning parameter. We can do this using the built-in cross-validation function, cv.glmnet().
By default, the cv.glmnet() function performs ten-fold cross-validation, though this can be changed using the argument nfolds; we use 5 in our example. Note that we set a random seed first so our results will be reproducible, since the choice of the cross-validation folds is random.
set.seed(12345)
mod.ridge.cv <- cv.glmnet(x_train, y_train, alpha = 0, type.measure = "mse", nfolds = 5)
plot(mod.ridge.cv)
As you can see from the plot, there are two candidate values of \lambda: lambda.min, the value that minimizes the cross-validated error, and lambda.1se, the largest \lambda whose error is within one standard error of that minimum (a more heavily regularized, simpler model). We will consider both:
# Save Optimal Lambdas
lambda_min.ridge <- mod.ridge.cv$lambda.min
print(lambda_min.ridge)
[1] 1.189541
coeff_min <- coef(mod.ridge.cv, s = "lambda.min")[,1]
lambda_1se.ridge <- mod.ridge.cv$lambda.1se
print(lambda_1se.ridge)
[1] 113.5474
coeff_1se <- coef(mod.ridge.cv, s = "lambda.1se")[,1]
Code
data.frame(cbind(coeff_min, coeff_1se)) %>%
kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | coeff_min | coeff_1se |
|---|---|---|
| (Intercept) | 242.3167138 | 293.7115872 |
| AtBat | -2.3746030 | 0.0000000 |
| Hits | 7.0088455 | 0.0000000 |
| HmRun | -4.2660149 | 0.0000000 |
| Runs | -3.0257019 | 0.0000000 |
| RBI | 3.1703769 | 0.0000000 |
| Walks | 7.9740935 | 2.5880568 |
| Years | -20.3546933 | 0.0000000 |
| CAtBat | -0.0017199 | 0.0000000 |
| CHits | 0.0000000 | 0.0000000 |
| CHmRun | 1.5844970 | 0.0000000 |
| CRuns | 1.0091501 | 0.2339336 |
| CRBI | 0.0000000 | 0.2171932 |
| CWalks | -0.5131176 | 0.0000000 |
| LeagueN | 43.1992322 | 0.0000000 |
| DivisionW | -134.4769998 | 0.0000000 |
| PutOuts | 0.3003368 | 0.0000000 |
| Assists | 0.3766270 | 0.0000000 |
| Errors | -3.8907645 | 0.0000000 |
| NewLeagueN | 0.0000000 | 0.0000000 |
Now let’s refit with lambda_min.ridge, use the model to predict y_test, and check the MSE:
ridge.mod3 <- glmnet(x_train, y_train, alpha = 0, lambda = lambda_min.ridge)
# Test Performance (MSE)
predict_ridge <- predict(ridge.mod3, newx = x_test)
ridge_MSE <- eval_results(y_test, predict_ridge, test_hitters)
| MSE | RMSE |
|---|---|
| 124994.8 | 353.5461 |
This MSE is in fact slightly higher than the one we obtained with \lambda = 4 (124,994.8 vs. 123,038.6): the cross-validated \lambda minimizes the estimated error on the training folds, which does not guarantee the lowest error on this particular test set.
6.3 LASSO Regression
We now ask whether the LASSO can yield either a more accurate or a more interpretable model than ridge regression. In order to fit a LASSO model, we once again use the glmnet() function; however, this time we use the argument alpha=1.
lasso.mod1 <- glmnet(x_train, y_train, alpha = 1, lambda = grid)
plot(lasso.mod1)
We can see from the coefficient plot that, depending on the choice of tuning parameter, some of the coefficients will be exactly equal to zero.
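To make this concrete (a sketch, assuming lasso.mod1 fitted above), we can count the nonzero coefficients at each \lambda along the path:

```r
# Number of nonzero coefficients (intercept excluded) at each lambda in grid;
# the lasso sets every coefficient to zero at the largest lambda
nnz <- colSums(as.matrix(coef(lasso.mod1))[-1, ] != 0)
nnz[1]  # 0 at lambda = 10^10
```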
6.3.1 Choosing \lambda with Cross-Validation
Now we perform cross-validation to find the best value of \lambda:
set.seed(12345)
mod.lasso.cv <- cv.glmnet(x_train, y_train, type.measure = "mse", nfolds = 5)
plot(mod.lasso.cv)
# Save Optimal Lambdas
lambda_min.lasso <- mod.lasso.cv$lambda.min
print(lambda_min.lasso)
[1] 1.189541
coeff_min <- coef(mod.lasso.cv, s = "lambda.min")[,1]
lambda_1se.lasso <- mod.lasso.cv$lambda.1se
print(lambda_1se.lasso)
[1] 113.5474
coeff_1se <- coef(mod.lasso.cv, s = "lambda.1se")[,1]
Code
data.frame(cbind(coeff_min, coeff_1se)) %>%
kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | coeff_min | coeff_1se |
|---|---|---|
| (Intercept) | 242.3167138 | 293.7115872 |
| AtBat | -2.3746030 | 0.0000000 |
| Hits | 7.0088455 | 0.0000000 |
| HmRun | -4.2660149 | 0.0000000 |
| Runs | -3.0257019 | 0.0000000 |
| RBI | 3.1703769 | 0.0000000 |
| Walks | 7.9740935 | 2.5880568 |
| Years | -20.3546933 | 0.0000000 |
| CAtBat | -0.0017199 | 0.0000000 |
| CHits | 0.0000000 | 0.0000000 |
| CHmRun | 1.5844970 | 0.0000000 |
| CRuns | 1.0091501 | 0.2339336 |
| CRBI | 0.0000000 | 0.2171932 |
| CWalks | -0.5131176 | 0.0000000 |
| LeagueN | 43.1992322 | 0.0000000 |
| DivisionW | -134.4769998 | 0.0000000 |
| PutOuts | 0.3003368 | 0.0000000 |
| Assists | 0.3766270 | 0.0000000 |
| Errors | -3.8907645 | 0.0000000 |
| NewLeagueN | 0.0000000 | 0.0000000 |
lasso.mod2 <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_min.lasso)
# Test Performance (MSE)
predict_lasso <- predict(lasso.mod2, newx = x_test)
lasso_MSE <- eval_results(y_test, predict_lasso, test_hitters)
| MSE | RMSE |
|---|---|
| 124994.8 | 353.5461 |
6.4 k-fold CV Model Comparison
Here we simply combine regularization with k-fold cross-validation to compare out-of-sample model performance.
6.4.1 Lasso CV
set.seed(12345)
n = nrow(hitters)
K = 10 # number of folds
foldid = rep(1:K, each=ceiling(n/K))[sample(1:n)]
OOS.lasso.min=data.frame(MSE=rep(NA,K), RMSE=rep(NA, K))
for(k in 1:K){
train = which(foldid!=k)
# train.data conversion
k.train <- hitters[train,]
x_train <- model.matrix(Salary~., data = k.train)[,-1]
y_train <- k.train$Salary
# test.data conversion
k.test <- hitters[-train,]
x_test <- model.matrix(Salary~., data = k.test)[,-1]
y_test <- k.test$Salary
#Choosing Tuning Parameter
mod.lasso.cv <- cv.glmnet(x_train, y_train, type.measure = "mse", nfolds = 5)
lambda_min <- mod.lasso.cv$lambda.min
lambda_1se <- mod.lasso.cv$lambda.1se
#fit regression on train
mod.lasso <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_min)
#predict on test
predictions_test <- predict(mod.lasso, s = lambda_min, newx = x_test)
#MSE
OOS.lasso.min$MSE[k] <- eval_results(y_test, predictions_test, k.test)[,1]
OOS.lasso.min$RMSE[k] <- eval_results(y_test, predictions_test, k.test)[,2]
# print progress
cat(k, " ")
}
1 2 3 4 5 6 7 8 9 10
mean(OOS.lasso.min$RMSE)
[1] 329.0406
set.seed(12345)
n = nrow(hitters)
K = 10 # number of folds
foldid = rep(1:K, each=ceiling(n/K))[sample(1:n)]
OOS.lasso.1se=data.frame(MSE=rep(NA,K), RMSE=rep(NA, K))
for(k in 1:K){
train = which(foldid!=k)
# train.data conversion
k.train <- hitters[train,]
x_train <- model.matrix(Salary~., data = k.train)[,-1]
y_train <- k.train$Salary
# test.data conversion
k.test <- hitters[-train,]
x_test <- model.matrix(Salary~., data = k.test)[,-1]
y_test <- k.test$Salary
#Choosing Tuning Parameter
mod.lasso.cv <- cv.glmnet(x_train, y_train, type.measure = "mse", nfolds = 5)
lambda_min <- mod.lasso.cv$lambda.min
lambda_1se <- mod.lasso.cv$lambda.1se
#fit regression on train
mod.lasso <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_1se)
#predict on test
predictions_test <- predict(mod.lasso, s = lambda_1se, newx = x_test)
#MSE
OOS.lasso.1se$MSE[k] <- eval_results(y_test, predictions_test, k.test)[,1]
OOS.lasso.1se$RMSE[k] <- eval_results(y_test, predictions_test, k.test)[,2]
# print progress
cat(k, " ")
}
1 2 3 4 5 6 7 8 9 10
mean(OOS.lasso.1se$RMSE)
[1] 358.0974
6.4.2 Ridge CV
set.seed(12345)
n = nrow(hitters)
K = 10 # number of folds
foldid = rep(1:K, each=ceiling(n/K))[sample(1:n)]
OOS.ridge.min=data.frame(MSE=rep(NA,K), RMSE=rep(NA, K))
for(k in 1:K){
train = which(foldid!=k)
# train.data conversion
k.train <- hitters[train,]
x_train <- model.matrix(Salary~., data = k.train)[,-1]
y_train <- k.train$Salary
# test.data conversion
k.test <- hitters[-train,]
x_test <- model.matrix(Salary~., data = k.test)[,-1]
y_test <- k.test$Salary
#Choosing Tuning Parameter
mod.ridge.cv <- cv.glmnet(x_train, y_train, alpha=0, type.measure = "mse", nfolds = 5)
lambda_min <- mod.ridge.cv$lambda.min
lambda_1se <- mod.ridge.cv$lambda.1se
#fit regression on train
mod.ridge <- glmnet(x_train, y_train, alpha = 0, lambda = lambda_min)
#predict on test
predictions_test <- predict(mod.ridge, s = lambda_min, newx = x_test)
#MSE
OOS.ridge.min$MSE[k] <- eval_results(y_test, predictions_test, k.test)[,1]
OOS.ridge.min$RMSE[k] <- eval_results(y_test, predictions_test, k.test)[,2]
# print progress
cat(k, " ")
}
1 2 3 4 5 6 7 8 9 10
mean(OOS.ridge.min$RMSE)
[1] 331.3975
set.seed(12345)
n = nrow(hitters)
K = 10 # number of folds
foldid = rep(1:K, each=ceiling(n/K))[sample(1:n)]
OOS.ridge.1se=data.frame(MSE=rep(NA,K), RMSE=rep(NA, K))
for(k in 1:K){
train = which(foldid!=k)
# train.data conversion
k.train <- hitters[train,]
x_train <- model.matrix(Salary~., data = k.train)[,-1]
y_train <- k.train$Salary
# test.data conversion
k.test <- hitters[-train,]
x_test <- model.matrix(Salary~., data = k.test)[,-1]
y_test <- k.test$Salary
#Choosing Tuning Parameter
mod.ridge.cv <- cv.glmnet(x_train, y_train, alpha=0, type.measure = "mse", nfolds = 5)
lambda_min <- mod.ridge.cv$lambda.min
lambda_1se <- mod.ridge.cv$lambda.1se
#fit regression on train
mod.ridge <- glmnet(x_train, y_train, alpha = 0, lambda = lambda_1se)
#predict on test
predictions_test <- predict(mod.ridge, s = lambda_1se, newx = x_test)
#MSE
OOS.ridge.1se$MSE[k] <- eval_results(y_test, predictions_test, k.test)[,1]
OOS.ridge.1se$RMSE[k] <- eval_results(y_test, predictions_test, k.test)[,2]
# print progress
cat(k, " ")
}
1 2 3 4 5 6 7 8 9 10
mean(OOS.ridge.1se$RMSE)
[1] 359.3406
Finally, we build the plot that compares all four models:
OOS.lasso.min <- OOS.lasso.min %>% mutate(Model="Lasso Min")
OOS.lasso.1se <- OOS.lasso.1se %>% mutate(Model="Lasso 1SE")
OOS.ridge.min <- OOS.ridge.min %>% mutate(Model="Ridge Min")
OOS.ridge.1se <- OOS.ridge.1se %>% mutate(Model="Ridge 1SE")
### MERGE
OOS <- rbind(OOS.lasso.min, OOS.lasso.1se, OOS.ridge.min, OOS.ridge.1se)
6.4.2.1 Model Comparison Plot
Code
OOS %>%
#filter(!Model=="OLS") %>%
ggplot( aes(x=reorder(Model,RMSE), y=RMSE, fill=Model)) +
geom_boxplot() +
geom_jitter(shape=16, position=position_jitter(0.2)) +
theme_bw() + theme(legend.position="none") +
xlab("") +
scale_y_continuous(labels = scales::comma) +
scale_fill_manual(values = mycolors)
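As a numeric companion to the boxplot (a sketch, assuming the merged OOS data frame and dplyr from above), the mean RMSE per model can be tabulated directly:

```r
# Mean out-of-sample RMSE for each model, lowest first
OOS %>%
  group_by(Model) %>%
  summarise(mean_RMSE = mean(RMSE)) %>%
  arrange(mean_RMSE)
```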