5  Model Selection & Regularization

Author

Daniel Redel

6 ISLR: Baseball Players' Salary

Here we apply regularization methods to the Hitters data. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year.

First, let's drop the observations with missing values.

library(dplyr)
library(glmnet)   # glmnet(), cv.glmnet()

hitters <- ISLR::Hitters %>% na.omit()
Table 6.1: Baseball Dataset
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
-Alan Ashby 315 81 7 24 38 39 14 3449 835 69 321 414 375 N W 632 43 10 475.0 N
-Alvin Davis 479 130 18 66 72 76 3 1624 457 63 224 266 263 A W 880 82 14 480.0 A
-Andre Dawson 496 141 20 65 78 37 11 5628 1575 225 828 838 354 N E 200 11 3 500.0 N
-Andres Galarraga 321 87 10 39 42 30 2 396 101 12 48 46 33 N E 805 40 4 91.5 N
-Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19 501 336 194 A W 282 421 25 750.0 A
-Al Newman 185 37 1 23 8 21 2 214 42 1 30 9 24 N E 76 127 7 70.0 A

6.1 Summary Statistics

We can show some summary statistics:

hitters %>% 
  dplyr::select(Salary, Years, Hits, CHits, HmRun, Runs, AtBat, CAtBat, Walks) %>% 
  tbl_summary(statistic = list(
    all_continuous() ~ "{mean} ({sd})"),
    digits = all_continuous() ~ 3)
Table 6.2:

Summary Statistics

Characteristic N = 263¹
Salary 535.926 (451.119)
Years 7.312 (4.794)
Hits 107.829 (45.125)
CHits 722.186 (648.200)
HmRun 11.620 (8.757)
Runs 54.745 (25.540)
AtBat 403.643 (147.307)
CAtBat 2,657.544 (2,286.583)
Walks 41.114 (21.718)
¹ Mean (SD)

Let’s take a closer look at the Salary distribution:

hitters %>% 
  ggplot(aes(x = Salary, fill = Division)) +
  geom_density(alpha = 0.7) +  
  scale_fill_manual(values = mycolors) + theme_bw()

Figure 6.1: Salary Distribution of Baseball Players

6.2 Ridge Regression

We will use the glmnet package to perform ridge regression. The main function in this package is glmnet(), which can be used to fit ridge regression models, lasso models, and more.

This function has slightly different syntax from other model-fitting functions that we have encountered thus far. In particular, we must pass in an x matrix as well as a y vector, and we do not use the y ∼ x syntax.

x <- model.matrix(Salary~., data = hitters)[,-1]
y <- hitters$Salary
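The `model.matrix()` call is doing real work here: glmnet() only accepts a numeric matrix, so each factor (League, Division, NewLeague) must be expanded into 0/1 dummy columns, and `[, -1]` drops the intercept column. A minimal sketch on a made-up three-row data frame (a toy example, not the Hitters data):

```r
# Hypothetical toy data: one numeric predictor and one factor
toy <- data.frame(
  Salary = c(100, 200, 300),
  Hits   = c(50, 60, 70),
  League = factor(c("A", "N", "A"))
)

# model.matrix() adds an intercept and a dummy "LeagueN"; [, -1] drops the intercept
X <- model.matrix(Salary ~ ., data = toy)[, -1]
colnames(X)       # "Hits" "LeagueN"
X[, "LeagueN"]    # 0 1 0: "N" rows get 1, the baseline level "A" gets 0
```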

The glmnet() function has an alpha argument that determines what type of model is fit. If alpha=0 then a ridge regression model is fit, and if alpha=1 then a lasso model is fit. We first fit a ridge regression model.

  • We choose to compute the ridge regression using a range of lambda values that goes from 10^10 (very close to the null model, including only the intercept) to 10^(-2) (very close to the full OLS model).

  • Note that by default, the glmnet() function standardizes the variables so that they are on the same scale. To turn off this default setting, use the argument standardize = FALSE.

grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)
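The grid is evenly spaced on the log10 scale, running from the near-null model down to the near-OLS model; a quick sanity check in base R:

```r
grid <- 10^seq(10, -2, length = 100)

length(grid)  # 100 candidate values
grid[1]       # 1e+10: heaviest penalty, close to the intercept-only model
grid[100]     # 0.01: lightest penalty, close to ordinary least squares
```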

Ridge gives us a path of possible models:

par(mai=c(.9,.8,.8,.8))
par(mfrow=c(1,1))
plot(ridge.mod, xvar = "lambda", label = TRUE)

Let's quickly check how the coefficients change at different values of \lambda.

First with \lambda=11,498:

ridge.mod[["lambda"]][50]
[1] 11497.57
coefficients <- coef(ridge.mod)[, 50]

as.data.frame(coefficients) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 6.3: Ridge Coefficients (Lambda=11,498)
coefficients
(Intercept) 407.3560502
AtBat 0.0369572
Hits 0.1381803
HmRun 0.5246300
Runs 0.2307015
RBI 0.2398415
Walks 0.2896187
Years 1.1077029
CAtBat 0.0031318
CHits 0.0116536
CHmRun 0.0875457
CRuns 0.0233799
CRBI 0.0241383
CWalks 0.0250154
LeagueN 0.0850281
DivisionW -6.2154410
PutOuts 0.0164826
Assists 0.0026130
Errors -0.0205027
NewLeagueN 0.3014335

Now with \lambda=705:

ridge.mod[["lambda"]][60]
[1] 705.4802
coefficients <- coef(ridge.mod)[, 60]

as.data.frame(coefficients) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 6.4: Ridge Coefficients (Lambda=705)
coefficients
(Intercept) 54.3251995
AtBat 0.1121111
Hits 0.6562241
HmRun 1.1798091
Runs 0.9376971
RBI 0.8471855
Walks 1.3198795
Years 2.5964042
CAtBat 0.0108341
CHits 0.0467456
CHmRun 0.3377732
CRuns 0.0935553
CRBI 0.0978040
CWalks 0.0718961
LeagueN 13.6837019
DivisionW -54.6587775
PutOuts 0.1185229
Assists 0.0160604
Errors -0.7035865
NewLeagueN 8.6118121

Main Idea: The coefficients tend to be larger at lower values of \lambda, although individual coefficients can occasionally grow as \lambda increases.
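This shrinkage pattern can be seen directly from the closed-form ridge solution \hat{\beta} = (X'X + \lambda I)^{-1} X'y. A base-R sketch on simulated data (not the Hitters fit; glmnet also standardizes the predictors and scales the penalty slightly differently):

```r
set.seed(1)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))   # standardized predictors
y <- X %*% c(3, -2, 1, 0, 0) + rnorm(n)

# Closed-form ridge estimate for a given lambda
ridge_coef <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}

# The overall size (L2 norm) of the coefficient vector shrinks as lambda grows
norms <- sapply(c(0, 1, 10, 100, 1000),
                function(l) sqrt(sum(ridge_coef(X, y, l)^2)))
all(diff(norms) < 0)  # TRUE: monotone shrinkage toward zero
```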

6.2.1 Validation Approach (Train and Test Data)

We now split the samples into a training set (70%) and a test set (30%) in order to estimate the test error of ridge regression and the lasso.

set.seed(123456) 
index <- sample(1:nrow(hitters), 0.7*nrow(hitters)) 

train_hitters <- hitters[index,] # Create the training data 
test_hitters <- hitters[-index,] # Create the test data

dim(train_hitters)
[1] 184  20
dim(test_hitters)
[1] 79 20

We convert both sets into model matrices:

# Train
x_train <- model.matrix(Salary~., data = train_hitters)[,-1]
y_train <- train_hitters$Salary

# Test
x_test <- model.matrix(Salary~., data = test_hitters)[,-1]
y_test <- test_hitters$Salary

Now, to evaluate its predictive performance, we will use the MSE and RMSE metrics:

eval_results <- function(true, predicted, df) {
  SSE  <- sum((predicted - true)^2)
  MSE  <- SSE / nrow(df)   # Mean Squared Error
  RMSE <- sqrt(MSE)        # Root Mean Squared Error

  # Model performance metrics
  data.frame(MSE = MSE, RMSE = RMSE)
}
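A quick sanity check of the metric function on toy numbers (restating it here so the snippet is self-contained):

```r
eval_results <- function(true, predicted, df) {
  SSE  <- sum((predicted - true)^2)
  MSE  <- SSE / nrow(df)   # Mean Squared Error
  RMSE <- sqrt(MSE)        # Root Mean Squared Error
  data.frame(MSE = MSE, RMSE = RMSE)
}

# One prediction off by 2, so SSE = 4 and MSE = 4/3 over 3 observations
res <- eval_results(true = c(1, 2, 3),
                    predicted = c(1, 2, 5),
                    df = data.frame(x = 1:3))
res$MSE   # 1.333...
res$RMSE  # 1.154...
```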

Let’s run the ridge regression in our training dataset and evaluate using \lambda=4:

ridge.mod1 <- glmnet(x_train, y_train, alpha = 0, lambda = grid)

# Test Performance (MSE)
predict_ridge <- predict(ridge.mod1, s = 4, newx = x_test)
ridge_MSE <- eval_results(y_test, predict_ridge, test_hitters)
Table 6.5: Ridge Performance on Test
MSE RMSE
123038.6 350.7686

What if we use a very large \lambda?

ridge.mod2 <- glmnet(x_train, y_train, alpha = 0, lambda = grid)

# Test Performance (MSE)
predict_ridge <- predict(ridge.mod2, s = 1e10, newx = x_test)
ridge_MSE <- eval_results(y_test, predict_ridge, test_hitters)
Table 6.6: Ridge Performance on Test
MSE RMSE
203250.7 450.8334

So fitting a ridge regression model with \lambda = 4 leads to a much lower test MSE than fitting a model with a huge \lambda, which is essentially the intercept-only model.

6.2.2 Choosing \lambda with Cross-Validation

In general, instead of arbitrarily choosing λ, it would be better to use cross-validation to choose the tuning parameter λ. We can do this using the built-in cross-validation function, cv.glmnet().

  • By default, the function cv.glmnet() performs ten-fold cross-validation, though this can be changed using the argument nfolds. Let's use 5 folds in our example.

  • Note that we set a random seed first so our results will be reproducible, since the choice of the cross-validation folds is random.

set.seed(12345) 
mod.ridge.cv <- cv.glmnet(x_train, y_train, alpha = 0, type.measure = "mse", nfolds = 5)
plot(mod.ridge.cv)
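cv.glmnet() reports two candidate values of \lambda: lambda.min, the value that minimizes the mean cross-validated error, and lambda.1se, the largest \lambda whose error stays within one standard error of that minimum (the "1-SE rule", which favors a more heavily penalized, simpler model). The rule itself is easy to sketch on made-up CV results:

```r
# Hypothetical CV output: mean error (cvm) and its standard error (cvsd)
lambdas <- c(0.1, 1, 10, 100, 1000)
cvm     <- c(110, 100, 104, 112, 140)
cvsd    <- c(  6,   5,   5,   6,   8)

i_min      <- which.min(cvm)
lambda_min <- lambdas[i_min]                  # 1: minimizes CV error
threshold  <- cvm[i_min] + cvsd[i_min]        # 105: minimum plus one SE
lambda_1se <- max(lambdas[cvm <= threshold])  # 10: largest lambda within 1 SE
```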

As you can see from the plot, there are two candidate optimal values of \lambda: lambda.min and lambda.1se. We will consider both:

# Save Optimal Lambdas
lambda_min.ridge <- mod.ridge.cv$lambda.min
print(lambda_min.ridge)
[1] 1.189541
coeff_min <- coef(mod.ridge.cv, s = "lambda.min")[,1]

lambda_1se.ridge <- mod.ridge.cv$lambda.1se
print(lambda_1se.ridge)
[1] 113.5474
coeff_1se <- coef(mod.ridge.cv, s = "lambda.1se")[,1]
data.frame(cbind(coeff_min, coeff_1se)) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 6.7: Ridge Coefficients - Optimal Lambda
coeff_min coeff_1se
(Intercept) 242.3167138 293.7115872
AtBat -2.3746030 0.0000000
Hits 7.0088455 0.0000000
HmRun -4.2660149 0.0000000
Runs -3.0257019 0.0000000
RBI 3.1703769 0.0000000
Walks 7.9740935 2.5880568
Years -20.3546933 0.0000000
CAtBat -0.0017199 0.0000000
CHits 0.0000000 0.0000000
CHmRun 1.5844970 0.0000000
CRuns 1.0091501 0.2339336
CRBI 0.0000000 0.2171932
CWalks -0.5131176 0.0000000
LeagueN 43.1992322 0.0000000
DivisionW -134.4769998 0.0000000
PutOuts 0.3003368 0.0000000
Assists 0.3766270 0.0000000
Errors -3.8907645 0.0000000
NewLeagueN 0.0000000 0.0000000

Now let's refit with lambda.min, predict y_test, and check the MSE:

ridge.mod3 <- glmnet(x_train, y_train, alpha = 0, lambda = lambda_min.ridge)
# Test Performance (MSE)
predict_ridge <- predict(ridge.mod3, newx = x_test)
ridge_MSE <- eval_results(y_test, predict_ridge, test_hitters)
Table 6.8: Ridge Performance on Test
MSE RMSE
124994.8 353.5461

On this particular split, this MSE is in fact slightly higher than the one we obtained with \lambda = 4: the cross-validated \lambda minimizes the estimated CV error, which does not guarantee the lowest error on any single test split.

6.3 LASSO Regression

We now ask whether the LASSO can yield either a more accurate or a more interpretable model than ridge regression. In order to fit a LASSO model, we once again use the glmnet() function; however, this time we use the argument alpha=1.

lasso.mod1 <- glmnet(x_train, y_train, alpha = 1, lambda = grid)
plot(lasso.mod1)

We can see from the coefficient plot that depending on the choice of tuning parameter, some of the coefficients will be exactly equal to zero.
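Those exact zeros come from the soft-thresholding at the heart of the lasso: in the special case of an orthonormal design, the lasso estimate is the OLS estimate pulled toward zero by \lambda and truncated at zero (glmnet handles the general case by coordinate descent, applying this update one coefficient at a time). A base-R sketch:

```r
# Soft-thresholding operator: shrink by lambda, set to zero if the sign would flip
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

b_ols <- c(3.0, -1.5, 0.4, -0.2)
soft_threshold(b_ols, lambda = 0.5)
# 2.5 -1.0 0.0 0.0: coefficients smaller than lambda become exactly zero
```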

6.3.1 Choosing \lambda with Cross-Validation

Now we perform cross-validation to find the best value of \lambda:

set.seed(12345) 
mod.lasso.cv <- cv.glmnet(x_train, y_train, alpha = 1, type.measure = "mse", nfolds = 5)
plot(mod.lasso.cv)

# Save Optimal Lambdas
lambda_min.lasso <- mod.lasso.cv$lambda.min
print(lambda_min.lasso)
[1] 1.189541
coeff_min <- coef(mod.lasso.cv, s = "lambda.min")[,1]

lambda_1se.lasso <- mod.lasso.cv$lambda.1se
print(lambda_1se.lasso)
[1] 113.5474
coeff_1se <- coef(mod.lasso.cv, s = "lambda.1se")[,1]
data.frame(cbind(coeff_min, coeff_1se)) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 6.9: LASSO Coefficients - Optimal Lambda
coeff_min coeff_1se
(Intercept) 242.3167138 293.7115872
AtBat -2.3746030 0.0000000
Hits 7.0088455 0.0000000
HmRun -4.2660149 0.0000000
Runs -3.0257019 0.0000000
RBI 3.1703769 0.0000000
Walks 7.9740935 2.5880568
Years -20.3546933 0.0000000
CAtBat -0.0017199 0.0000000
CHits 0.0000000 0.0000000
CHmRun 1.5844970 0.0000000
CRuns 1.0091501 0.2339336
CRBI 0.0000000 0.2171932
CWalks -0.5131176 0.0000000
LeagueN 43.1992322 0.0000000
DivisionW -134.4769998 0.0000000
PutOuts 0.3003368 0.0000000
Assists 0.3766270 0.0000000
Errors -3.8907645 0.0000000
NewLeagueN 0.0000000 0.0000000
Now we fit the lasso at lambda.min and evaluate on the test set:

lasso.mod2 <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_min.lasso)
# Test Performance (MSE)
predict_lasso <- predict(lasso.mod2, newx = x_test)
lasso_MSE <- eval_results(y_test, predict_lasso, test_hitters)
Table 6.10: LASSO Performance on Test
MSE RMSE
124994.8 353.5461

6.4 k-fold CV Model Comparison

This is just a way of combining regularization with k-fold cross-validation to compare out-of-sample model performance.

6.4.1 Lasso CV

set.seed(12345)  
n <- nrow(hitters)
K <- 10  # number of folds
foldid <- rep(1:K, each = ceiling(n / K))[sample(1:n)]

OOS.lasso.min=data.frame(MSE=rep(NA,K), RMSE=rep(NA, K))

for(k in 1:K){
  
  train = which(foldid!=k) 
  
  # train.data conversion
  k.train <- hitters[train,]
  x_train <- model.matrix(Salary~., data = k.train)[,-1]
  y_train <- k.train$Salary
   
  # test.data conversion
  k.test <- hitters[-train,]
  x_test <- model.matrix(Salary~., data = k.test)[,-1]
  y_test <- k.test$Salary 
  
  #Choosing Tuning Parameter
  mod.lasso.cv <- cv.glmnet(x_train, y_train, type.measure = "mse", nfolds = 5)
  lambda_min <- mod.lasso.cv$lambda.min
  lambda_1se <- mod.lasso.cv$lambda.1se
  
  #fit regression on train
  mod.lasso <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_min)
  
  #predict on test
  predictions_test <- predict(mod.lasso, s = lambda_min, newx = x_test)
  
  #MSE
  OOS.lasso.min$MSE[k] <- eval_results(y_test, predictions_test, k.test)[,1]
  OOS.lasso.min$RMSE[k] <- eval_results(y_test, predictions_test, k.test)[,2]

    # print progress
  cat(k, "  ")
  
}
1   2   3   4   5   6   7   8   9   10   
mean(OOS.lasso.min$RMSE)
[1] 329.0406
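The fold labels above come from rep(1:K, each = ceiling(n/K))[sample(1:n)]: repeat each label 1..K enough times to cover n observations, then keep the first n of them in randomly permuted order. A quick check that every observation gets a label and the folds are nearly balanced (with n = 263, folds 1-9 get 27 observations and fold 10 gets the remaining 20):

```r
set.seed(12345)
n <- 263; K <- 10
foldid <- rep(1:K, each = ceiling(n / K))[sample(1:n)]

length(foldid)        # 263: one fold label per observation
sort(unique(foldid))  # 1 2 ... 10: all K folds appear
table(foldid)         # folds 1-9 have 27 observations each, fold 10 has 20
```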
set.seed(12345)  
n <- nrow(hitters)
K <- 10  # number of folds
foldid <- rep(1:K, each = ceiling(n / K))[sample(1:n)]

OOS.lasso.1se=data.frame(MSE=rep(NA,K), RMSE=rep(NA, K))


for(k in 1:K){
  
  train = which(foldid!=k) 
  
  # train.data conversion
  k.train <- hitters[train,]
  x_train <- model.matrix(Salary~., data = k.train)[,-1]
  y_train <- k.train$Salary
   
  # test.data conversion
  k.test <- hitters[-train,]
  x_test <- model.matrix(Salary~., data = k.test)[,-1]
  y_test <- k.test$Salary 
  
  #Choosing Tuning Parameter
  mod.lasso.cv <- cv.glmnet(x_train, y_train, type.measure = "mse", nfolds = 5)
  lambda_min <- mod.lasso.cv$lambda.min
  lambda_1se <- mod.lasso.cv$lambda.1se
  
  #fit regression on train
  mod.lasso <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_1se)
  
  #predict on test
  predictions_test <- predict(mod.lasso, s = lambda_1se, newx = x_test)
  
  #MSE
  OOS.lasso.1se$MSE[k] <- eval_results(y_test, predictions_test, k.test)[,1]
  OOS.lasso.1se$RMSE[k] <- eval_results(y_test, predictions_test, k.test)[,2]

    # print progress
  cat(k, "  ")
  
}
1   2   3   4   5   6   7   8   9   10   
mean(OOS.lasso.1se$RMSE)
[1] 358.0974

6.4.2 Ridge CV

set.seed(12345)  
n <- nrow(hitters)
K <- 10  # number of folds
foldid <- rep(1:K, each = ceiling(n / K))[sample(1:n)]

OOS.ridge.min=data.frame(MSE=rep(NA,K), RMSE=rep(NA, K))


for(k in 1:K){
  
  train = which(foldid!=k) 
  
  # train.data conversion
  k.train <- hitters[train,]
  x_train <- model.matrix(Salary~., data = k.train)[,-1]
  y_train <- k.train$Salary
   
  # test.data conversion
  k.test <- hitters[-train,]
  x_test <- model.matrix(Salary~., data = k.test)[,-1]
  y_test <- k.test$Salary
  
  #Choosing Tuning Parameter
  mod.ridge.cv <- cv.glmnet(x_train, y_train, alpha=0, type.measure = "mse", nfolds = 5)
  lambda_min <- mod.ridge.cv$lambda.min
  lambda_1se <- mod.ridge.cv$lambda.1se
  
  #fit regression on train
  mod.ridge <- glmnet(x_train, y_train, alpha = 0, lambda = lambda_min)
  
  #predict on test
  predictions_test <- predict(mod.ridge, s = lambda_min, newx = x_test)
  
  #MSE
  OOS.ridge.min$MSE[k] <- eval_results(y_test, predictions_test, k.test)[,1]
  OOS.ridge.min$RMSE[k] <- eval_results(y_test, predictions_test, k.test)[,2]

    # print progress
  cat(k, "  ")
  
}
1   2   3   4   5   6   7   8   9   10   
mean(OOS.ridge.min$RMSE)
[1] 331.3975
set.seed(12345)  
n <- nrow(hitters)
K <- 10  # number of folds
foldid <- rep(1:K, each = ceiling(n / K))[sample(1:n)]

OOS.ridge.1se=data.frame(MSE=rep(NA,K), RMSE=rep(NA, K))


for(k in 1:K){
  
  train = which(foldid!=k) 
  
  # train.data conversion
  k.train <- hitters[train,]
  x_train <- model.matrix(Salary~., data = k.train)[,-1]
  y_train <- k.train$Salary
   
  # test.data conversion
  k.test <- hitters[-train,]
  x_test <- model.matrix(Salary~., data = k.test)[,-1]
  y_test <- k.test$Salary 
  
  #Choosing Tuning Parameter
  mod.ridge.cv <- cv.glmnet(x_train, y_train, alpha=0, type.measure = "mse", nfolds = 5)
  lambda_min <- mod.ridge.cv$lambda.min
  lambda_1se <- mod.ridge.cv$lambda.1se
  
  #fit regression on train
  mod.ridge <- glmnet(x_train, y_train, alpha = 0, lambda = lambda_1se)
  
  #predict on test
  predictions_test <- predict(mod.ridge, s = lambda_1se, newx = x_test)
  
  #MSE
  OOS.ridge.1se$MSE[k] <- eval_results(y_test, predictions_test, k.test)[,1]
  OOS.ridge.1se$RMSE[k] <- eval_results(y_test, predictions_test, k.test)[,2]

    # print progress
  cat(k, "  ")
  
}
1   2   3   4   5   6   7   8   9   10   
mean(OOS.ridge.1se$RMSE)
[1] 359.3406

Finally, we build a plot comparing all four models:

OOS.lasso.min <- OOS.lasso.min %>% mutate(Model="Lasso Min")
OOS.lasso.1se <- OOS.lasso.1se %>% mutate(Model="Lasso 1SE")

OOS.ridge.min <- OOS.ridge.min %>% mutate(Model="Ridge Min")
OOS.ridge.1se <- OOS.ridge.1se %>% mutate(Model="Ridge 1SE")

### MERGE
OOS <- rbind(OOS.lasso.min, OOS.lasso.1se, OOS.ridge.min, OOS.ridge.1se)

6.4.2.1 Model Comparison Plot

Code
OOS %>%
  ggplot( aes(x=reorder(Model,RMSE), y=RMSE, fill=Model)) + 
  geom_boxplot() +
  geom_jitter(shape=16, position=position_jitter(0.2)) +
  theme_bw() + theme(legend.position="none") +
  xlab("") +
  scale_y_continuous(labels = scales::comma) +
  scale_fill_manual(values = mycolors)

Figure 6.2: Model Comparison: k-fold Out-of-Sample RMSE