5  Model Selection & Regularization

Author

Daniel Redel

6 ISLR: Baseball Players' Salary

Here we apply regularization methods to the Hitters data. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year.

First, let's drop the observations with missing values.

library(dplyr)
library(glmnet)   # glmnet(), cv.glmnet()

hitters <- ISLR::Hitters %>% na.omit()
Table 6.1: Baseball Dataset
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
-Alan Ashby 315 81 7 24 38 39 14 3449 835 69 321 414 375 N W 632 43 10 475.0 N
-Alvin Davis 479 130 18 66 72 76 3 1624 457 63 224 266 263 A W 880 82 14 480.0 A
-Andre Dawson 496 141 20 65 78 37 11 5628 1575 225 828 838 354 N E 200 11 3 500.0 N
-Andres Galarraga 321 87 10 39 42 30 2 396 101 12 48 46 33 N E 805 40 4 91.5 N
-Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19 501 336 194 A W 282 421 25 750.0 A
-Al Newman 185 37 1 23 8 21 2 214 42 1 30 9 24 N E 76 127 7 70.0 A

6.1 Summary Statistics

We can show some summary statistics:

hitters %>% 
  dplyr::select(Salary, Years, Hits, CHits, HmRun, Runs, AtBat, CAtBat, Walks) %>% 
  tbl_summary(statistic = list(
    all_continuous() ~ "{mean} ({sd})"),
    digits = all_continuous() ~ 3)
Table 6.2:

Summary Statistics

Characteristic N = 263¹
Salary 535.926 (451.119)
Years 7.312 (4.794)
Hits 107.829 (45.125)
CHits 722.186 (648.200)
HmRun 11.620 (8.757)
Runs 54.745 (25.540)
AtBat 403.643 (147.307)
CAtBat 2,657.544 (2,286.583)
Walks 41.114 (21.718)
¹ Mean (SD)

Let’s take a closer look at the Salary distribution:

hitters %>% 
  ggplot(aes(x = Salary, fill = Division)) +
  geom_density(alpha = 0.7) +  
  scale_fill_manual(values = mycolors) + theme_bw()

Figure 6.1: Salary Distribution of Baseball Players

6.2 Ridge Regression

We will use the glmnet package to perform ridge regression. The main function in this package is glmnet(), which can be used to fit ridge regression models, lasso models, and more.

This function has slightly different syntax from other model-fitting functions that we have encountered thus far. In particular, we must pass in an x matrix as well as a y vector, and we do not use the y ∼ x syntax.

x <- model.matrix(Salary~., data = hitters)[,-1]
y <- hitters$Salary
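The `model.matrix()` call is doing real work here: glmnet() only accepts a numeric matrix, so each factor (League, Division, NewLeague) must be expanded into 0/1 dummy columns, and `[, -1]` drops the intercept column. A minimal sketch on a made-up three-row data frame (a toy example, not the Hitters data):

```r
# Hypothetical toy data: one numeric predictor and one factor
toy <- data.frame(
  Salary = c(100, 200, 300),
  Hits   = c(50, 60, 70),
  League = factor(c("A", "N", "A"))
)

# model.matrix() adds an intercept and a dummy "LeagueN"; [, -1] drops the intercept
X <- model.matrix(Salary ~ ., data = toy)[, -1]
colnames(X)       # "Hits" "LeagueN"
X[, "LeagueN"]    # 0 1 0: "N" rows get 1, the baseline level "A" gets 0
```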

The glmnet() function has an alpha argument that determines what type of model is fit. If alpha=0 then a ridge regression model is fit, and if alpha=1 then a lasso model is fit. We first fit a ridge regression model.

  • We choose to compute the ridge regression using a range of lambda values that goes from 10^10 (very close to the null model, including only the intercept) to 10^(-2) (very close to the full OLS model).

  • Note that by default, the glmnet() function standardizes the variables so that they are on the same scale. To turn off this default setting, use the argument standardize = FALSE.

grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)
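The grid is evenly spaced on the log10 scale, running from the near-null model down to the near-OLS model; a quick sanity check in base R:

```r
grid <- 10^seq(10, -2, length = 100)

length(grid)  # 100 candidate values
grid[1]       # 1e+10: heaviest penalty, close to the intercept-only model
grid[100]     # 0.01: lightest penalty, close to ordinary least squares
```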

Ridge gives us a path of possible models:

par(mai=c(.9,.8,.8,.8))
par(mfrow=c(1,1))
plot(ridge.mod, xvar = "lambda", label = TRUE)

Let's quickly check how the coefficients change at different values of \lambda.

First with \lambda=11,498:

ridge.mod[["lambda"]][50]
[1] 11497.57
coefficients <- coef(ridge.mod)[, 50]

as.data.frame(coefficients) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 6.3: Ridge Coefficients (Lambda=11,498)
coefficients
(Intercept) 407.3560502
AtBat 0.0369572
Hits 0.1381803
HmRun 0.5246300
Runs 0.2307015
RBI 0.2398415
Walks 0.2896187
Years 1.1077029
CAtBat 0.0031318
CHits 0.0116536
CHmRun 0.0875457
CRuns 0.0233799
CRBI 0.0241383
CWalks 0.0250154
LeagueN 0.0850281
DivisionW -6.2154410
PutOuts 0.0164826
Assists 0.0026130
Errors -0.0205027
NewLeagueN 0.3014335

Now with \lambda=705:

ridge.mod[["lambda"]][60]
[1] 705.4802
coefficients <- coef(ridge.mod)[, 60]

as.data.frame(coefficients) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 6.4: Ridge Coefficients (Lambda=705)
coefficients
(Intercept) 54.3251995
AtBat 0.1121111
Hits 0.6562241
HmRun 1.1798091
Runs 0.9376971
RBI 0.8471855
Walks 1.3198795
Years 2.5964042
CAtBat 0.0108341
CHits 0.0467456
CHmRun 0.3377732
CRuns 0.0935553
CRBI 0.0978040
CWalks 0.0718961
LeagueN 13.6837019
DivisionW -54.6587775
PutOuts 0.1185229
Assists 0.0160604
Errors -0.7035865
NewLeagueN 8.6118121

Main Idea: The coefficients tend to be larger at lower values of \lambda, although individual coefficients can occasionally grow as \lambda increases.
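This shrinkage pattern can be seen directly from the closed-form ridge solution \hat{\beta} = (X'X + \lambda I)^{-1} X'y. A base-R sketch on simulated data (not the Hitters fit; glmnet also standardizes the predictors and scales the penalty slightly differently):

```r
set.seed(1)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))   # standardized predictors
y <- X %*% c(3, -2, 1, 0, 0) + rnorm(n)

# Closed-form ridge estimate for a given lambda
ridge_coef <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}

# The overall size (L2 norm) of the coefficient vector shrinks as lambda grows
norms <- sapply(c(0, 1, 10, 100, 1000),
                function(l) sqrt(sum(ridge_coef(X, y, l)^2)))
all(diff(norms) < 0)  # TRUE: monotone shrinkage toward zero
```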

6.2.1 Validation Approach (Train and Test Data)

We now split the samples into a training set (70%) and a test set (30%) in order to estimate the test error of ridge regression and the lasso.

set.seed(123456) 
index <- sample(1:nrow(hitters), 0.7*nrow(hitters)) 

train_hitters <- hitters[index,] # Create the training data 
test_hitters <- hitters[-index,] # Create the test data

dim(train_hitters)
[1] 184  20
dim(test_hitters)
[1] 79 20

We convert both sets into model matrices:

# Train
x_train <- model.matrix(Salary~., data = train_hitters)[,-1]
y_train <- train_hitters$Salary

# Test
x_test <- model.matrix(Salary~., data = test_hitters)[,-1]
y_test <- test_hitters$Salary

Now, to evaluate its predictive performance, we will use the MSE and RMSE metrics:

eval_results <- function(true, predicted, df) {
  SSE  <- sum((predicted - true)^2)
  MSE  <- SSE / nrow(df)   # Mean Squared Error
  RMSE <- sqrt(MSE)        # Root Mean Squared Error

  # Model performance metrics
  data.frame(MSE = MSE, RMSE = RMSE)
}
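A quick sanity check of the metric function on toy numbers (restating it here so the snippet is self-contained):

```r
eval_results <- function(true, predicted, df) {
  SSE  <- sum((predicted - true)^2)
  MSE  <- SSE / nrow(df)   # Mean Squared Error
  RMSE <- sqrt(MSE)        # Root Mean Squared Error
  data.frame(MSE = MSE, RMSE = RMSE)
}

# One prediction off by 2, so SSE = 4 and MSE = 4/3 over 3 observations
res <- eval_results(true = c(1, 2, 3),
                    predicted = c(1, 2, 5),
                    df = data.frame(x = 1:3))
res$MSE   # 1.333...
res$RMSE  # 1.154...
```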

Let’s run the ridge regression in our training dataset and evaluate using \lambda=4:

ridge.mod1 <- glmnet(x_train, y_train, alpha = 0, lambda = grid)

# Test Performance (MSE)
predict_ridge <- predict(ridge.mod1, s = 4, newx = x_test)
ridge_MSE <- eval_results(y_test, predict_ridge, test_hitters)
Table 6.5: Ridge Performance on Test
MSE RMSE
123038.6 350.7686

What if we use a very large \lambda?

ridge.mod2 <- glmnet(x_train, y_train, alpha = 0, lambda = grid)

# Test Performance (MSE)
predict_ridge <- predict(ridge.mod2, s = 1e10, newx = x_test)
ridge_MSE <- eval_results(y_test, predict_ridge, test_hitters)
Table 6.6: Ridge Performance on Test
MSE RMSE
203250.7 450.8334

So fitting a ridge regression model with \lambda = 4 leads to a much lower test MSE than fitting a model with a huge \lambda, which is essentially the intercept-only model.

6.2.2 Choosing \lambda with Cross-Validation

In general, instead of arbitrarily choosing λ, it would be better to use cross-validation to choose the tuning parameter λ. We can do this using the built-in cross-validation function, cv.glmnet().

  • By default, the function cv.glmnet() performs ten-fold cross-validation, though this can be changed using the argument nfolds. Let's use 5 folds in our example.

  • Note that we set a random seed first so our results will be reproducible, since the choice of the cross-validation folds is random.

set.seed(12345) 
mod.ridge.cv <- cv.glmnet(x_train, y_train, alpha = 0, type.measure = "mse", nfolds = 5)
plot(mod.ridge.cv)
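cv.glmnet() reports two candidate values of \lambda: lambda.min, the value that minimizes the mean cross-validated error, and lambda.1se, the largest \lambda whose error stays within one standard error of that minimum (the "1-SE rule", which favors a more heavily penalized, simpler model). The rule itself is easy to sketch on made-up CV results:

```r
# Hypothetical CV output: mean error (cvm) and its standard error (cvsd)
lambdas <- c(0.1, 1, 10, 100, 1000)
cvm     <- c(110, 100, 104, 112, 140)
cvsd    <- c(  6,   5,   5,   6,   8)

i_min      <- which.min(cvm)
lambda_min <- lambdas[i_min]                  # 1: minimizes CV error
threshold  <- cvm[i_min] + cvsd[i_min]        # 105: minimum plus one SE
lambda_1se <- max(lambdas[cvm <= threshold])  # 10: largest lambda within 1 SE
```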

As you can see from the plot, there are two candidate optimal values of \lambda: lambda.min and lambda.1se. We will consider both:

# Save Optimal Lambdas
lambda_min.ridge <- mod.ridge.cv$lambda.min
print(lambda_min.ridge)
[1] 1.189541
coeff_min <- coef(mod.ridge.cv, s = "lambda.min")[,1]

lambda_1se.ridge <- mod.ridge.cv$lambda.1se
print(lambda_1se.ridge)
[1] 113.5474
coeff_1se <- coef(mod.ridge.cv, s = "lambda.1se")[,1]
data.frame(cbind(coeff_min, coeff_1se)) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 6.7: Ridge Coefficients - Optimal Lambda
coeff_min coeff_1se
(Intercept) 242.3167138 293.7115872
AtBat -2.3746030 0.0000000
Hits 7.0088455 0.0000000
HmRun -4.2660149 0.0000000
Runs -3.0257019 0.0000000
RBI 3.1703769 0.0000000
Walks 7.9740935 2.5880568
Years -20.3546933 0.0000000
CAtBat -0.0017199 0.0000000
CHits 0.0000000 0.0000000
CHmRun 1.5844970 0.0000000
CRuns 1.0091501 0.2339336
CRBI 0.0000000 0.2171932
CWalks -0.5131176 0.0000000
LeagueN 43.1992322 0.0000000
DivisionW -134.4769998 0.0000000
PutOuts 0.3003368 0.0000000
Assists 0.3766270 0.0000000
Errors -3.8907645 0.0000000
NewLeagueN 0.0000000 0.0000000

Now let's refit with lambda.min, predict y_test, and check the MSE:

ridge.mod3 <- glmnet(x_train, y_train, alpha = 0, lambda = lambda_min.ridge)
# Test Performance (MSE)
predict_ridge <- predict(ridge.mod3, newx = x_test)
ridge_MSE <- eval_results(y_test, predict_ridge, test_hitters)
Table 6.8: Ridge Performance on Test
MSE RMSE
124994.8 353.5461

On this particular split, this MSE is in fact slightly higher than the one we obtained with \lambda = 4: the cross-validated \lambda minimizes the estimated CV error, which does not guarantee the lowest error on any single test split.

6.3 LASSO Regression

We now ask whether the LASSO can yield either a more accurate or a more interpretable model than ridge regression. In order to fit a LASSO model, we once again use the glmnet() function; however, this time we use the argument alpha=1.

lasso.mod1 <- glmnet(x_train, y_train, alpha = 1, lambda = grid)
plot(lasso.mod1)

We can see from the coefficient plot that depending on the choice of tuning parameter, some of the coefficients will be exactly equal to zero.
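Those exact zeros come from the soft-thresholding at the heart of the lasso: in the special case of an orthonormal design, the lasso estimate is the OLS estimate pulled toward zero by \lambda and truncated at zero (glmnet handles the general case by coordinate descent, applying this update one coefficient at a time). A base-R sketch:

```r
# Soft-thresholding operator: shrink by lambda, set to zero if the sign would flip
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

b_ols <- c(3.0, -1.5, 0.4, -0.2)
soft_threshold(b_ols, lambda = 0.5)
# 2.5 -1.0 0.0 0.0: coefficients smaller than lambda become exactly zero
```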

6.3.1 Choosing \lambda with Cross-Validation

Now we perform cross-validation to find the best value of \lambda:

set.seed(12345) 
mod.lasso.cv <- cv.glmnet(x_train, y_train, alpha = 1, type.measure = "mse", nfolds = 5)
plot(mod.lasso.cv)

# Save Optimal Lambdas
lambda_min.lasso <- mod.lasso.cv$lambda.min
print(lambda_min.lasso)
[1] 1.189541
coeff_min <- coef(mod.lasso.cv, s = "lambda.min")[,1]

lambda_1se.lasso <- mod.lasso.cv$lambda.1se
print(lambda_1se.lasso)
[1] 113.5474
coeff_1se <- coef(mod.lasso.cv, s = "lambda.1se")[,1]
data.frame(cbind(coeff_min, coeff_1se)) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 6.9: LASSO Coefficients - Optimal Lambda
coeff_min coeff_1se
(Intercept) 242.3167138 293.7115872
AtBat -2.3746030 0.0000000
Hits 7.0088455 0.0000000
HmRun -4.2660149 0.0000000
Runs -3.0257019 0.0000000
RBI 3.1703769 0.0000000
Walks 7.9740935 2.5880568
Years -20.3546933 0.0000000
CAtBat -0.0017199 0.0000000
CHits 0.0000000 0.0000000
CHmRun 1.5844970 0.0000000
CRuns 1.0091501 0.2339336
CRBI 0.0000000 0.2171932
CWalks -0.5131176 0.0000000
LeagueN 43.1992322 0.0000000
DivisionW -134.4769998 0.0000000
PutOuts 0.3003368 0.0000000
Assists 0.3766270 0.0000000
Errors -3.8907645 0.0000000
NewLeagueN 0.0000000 0.0000000
Now we fit the lasso at lambda.min and evaluate on the test set:

lasso.mod2 <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_min.lasso)
# Test Performance (MSE)
predict_lasso <- predict(lasso.mod2, newx = x_test)
lasso_MSE <- eval_results(y_test, predict_lasso, test_hitters)
Table 6.10: LASSO Performance on Test
MSE RMSE
124994.8 353.5461

6.4 k-fold CV Model Comparison

This is just a way of combining regularization with k-fold cross-validation to compare out-of-sample model performance.

6.4.1 Lasso CV

set.seed(12345)  
n <- nrow(hitters)
K <- 10  # number of folds
foldid <- rep(1:K, each = ceiling(n / K))[sample(1:n)]

OOS.lasso.min=data.frame(MSE=rep(NA,K), RMSE=rep(NA, K))

for(k in 1:K){
  
  train = which(foldid!=k) 
  
  # train.data conversion
  k.train <- hitters[train,]
  x_train <- model.matrix(Salary~., data = k.train)[,-1]
  y_train <- k.train$Salary
   
  # test.data conversion
  k.test <- hitters[-train,]
  x_test <- model.matrix(Salary~., data = k.test)[,-1]
  y_test <- k.test$Salary 
  
  #Choosing Tuning Parameter
  mod.lasso.cv <- cv.glmnet(x_train, y_train, type.measure = "mse", nfolds = 5)
  lambda_min <- mod.lasso.cv$lambda.min
  lambda_1se <- mod.lasso.cv$lambda.1se
  
  #fit regression on train
  mod.lasso <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_min)
  
  #predict on test
  predictions_test <- predict(mod.lasso, s = lambda_min, newx = x_test)
  
  #MSE
  OOS.lasso.min$MSE[k] <- eval_results(y_test, predictions_test, k.test)[,1]
  OOS.lasso.min$RMSE[k] <- eval_results(y_test, predictions_test, k.test)[,2]

    # print progress
  cat(k, "  ")
  
}
1   2   3   4   5   6   7   8   9   10   
mean(OOS.lasso.min$RMSE)
[1] 329.0406
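The fold labels above come from rep(1:K, each = ceiling(n/K))[sample(1:n)]: repeat each label 1..K enough times to cover n observations, then keep the first n of them in randomly permuted order. A quick check that every observation gets a label and the folds are nearly balanced (with n = 263, folds 1-9 get 27 observations and fold 10 gets the remaining 20):

```r
set.seed(12345)
n <- 263; K <- 10
foldid <- rep(1:K, each = ceiling(n / K))[sample(1:n)]

length(foldid)        # 263: one fold label per observation
sort(unique(foldid))  # 1 2 ... 10: all K folds appear
table(foldid)         # folds 1-9 have 27 observations each, fold 10 has 20
```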
set.seed(12345)  
n <- nrow(hitters)
K <- 10  # number of folds
foldid <- rep(1:K, each = ceiling(n / K))[sample(1:n)]

OOS.lasso.1se=data.frame(MSE=rep(NA,K), RMSE=rep(NA, K))


for(k in 1:K){
  
  train = which(foldid!=k) 
  
  # train.data conversion
  k.train <- hitters[train,]
  x_train <- model.matrix(Salary~., data = k.train)[,-1]
  y_train <- k.train$Salary
   
  # test.data conversion
  k.test <- hitters[-train,]
  x_test <- model.matrix(Salary~., data = k.test)[,-1]
  y_test <- k.test$Salary 
  
  #Choosing Tuning Parameter
  mod.lasso.cv <- cv.glmnet(x_train, y_train, type.measure = "mse", nfolds = 5)
  lambda_min <- mod.lasso.cv$lambda.min
  lambda_1se <- mod.lasso.cv$lambda.1se
  
  #fit regression on train
  mod.lasso <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_1se)
  
  #predict on test
  predictions_test <- predict(mod.lasso, s = lambda_1se, newx = x_test)
  
  #MSE
  OOS.lasso.1se$MSE[k] <- eval_results(y_test, predictions_test, k.test)[,1]
  OOS.lasso.1se$RMSE[k] <- eval_results(y_test, predictions_test, k.test)[,2]

    # print progress
  cat(k, "  ")
  
}
1   2   3   4   5   6   7   8   9   10   
mean(OOS.lasso.1se$RMSE)
[1] 358.0974

6.4.2 Ridge CV

set.seed(12345)  
n <- nrow(hitters)
K <- 10  # number of folds
foldid <- rep(1:K, each = ceiling(n / K))[sample(1:n)]

OOS.ridge.min=data.frame(MSE=rep(NA,K), RMSE=rep(NA, K))


for(k in 1:K){
  
  train = which(foldid!=k) 
  
  # train.data conversion
  k.train <- hitters[train,]
  x_train <- model.matrix(Salary~., data = k.train)[,-1]
  y_train <- k.train$Salary
   
  # test.data conversion
  k.test <- hitters[-train,]
  x_test <- model.matrix(Salary~., data = k.test)[,-1]
  y_test <- k.test$Salary
  
  #Choosing Tuning Parameter
  mod.ridge.cv <- cv.glmnet(x_train, y_train, alpha=0, type.measure = "mse", nfolds = 5)
  lambda_min <- mod.ridge.cv$lambda.min
  lambda_1se <- mod.ridge.cv$lambda.1se
  
  #fit regression on train
  mod.ridge <- glmnet(x_train, y_train, alpha = 0, lambda = lambda_min)
  
  #predict on test
  predictions_test <- predict(mod.ridge, s = lambda_min, newx = x_test)
  
  #MSE
  OOS.ridge.min$MSE[k] <- eval_results(y_test, predictions_test, k.test)[,1]
  OOS.ridge.min$RMSE[k] <- eval_results(y_test, predictions_test, k.test)[,2]

    # print progress
  cat(k, "  ")
  
}
1   2   3   4   5   6   7   8   9   10   
mean(OOS.ridge.min$RMSE)
[1] 331.3975
set.seed(12345)  
n <- nrow(hitters)
K <- 10  # number of folds
foldid <- rep(1:K, each = ceiling(n / K))[sample(1:n)]

OOS.ridge.1se=data.frame(MSE=rep(NA,K), RMSE=rep(NA, K))


for(k in 1:K){
  
  train = which(foldid!=k) 
  
  # train.data conversion
  k.train <- hitters[train,]
  x_train <- model.matrix(Salary~., data = k.train)[,-1]
  y_train <- k.train$Salary
   
  # test.data conversion
  k.test <- hitters[-train,]
  x_test <- model.matrix(Salary~., data = k.test)[,-1]
  y_test <- k.test$Salary 
  
  #Choosing Tuning Parameter
  mod.ridge.cv <- cv.glmnet(x_train, y_train, alpha=0, type.measure = "mse", nfolds = 5)
  lambda_min <- mod.ridge.cv$lambda.min
  lambda_1se <- mod.ridge.cv$lambda.1se
  
  #fit regression on train
  mod.ridge <- glmnet(x_train, y_train, alpha = 0, lambda = lambda_1se)
  
  #predict on test
  predictions_test <- predict(mod.ridge, s = lambda_1se, newx = x_test)
  
  #MSE
  OOS.ridge.1se$MSE[k] <- eval_results(y_test, predictions_test, k.test)[,1]
  OOS.ridge.1se$RMSE[k] <- eval_results(y_test, predictions_test, k.test)[,2]

    # print progress
  cat(k, "  ")
  
}
1   2   3   4   5   6   7   8   9   10   
mean(OOS.ridge.1se$RMSE)
[1] 359.3406

Finally, we build a plot comparing all four models:

OOS.lasso.min <- OOS.lasso.min %>% mutate(Model="Lasso Min")
OOS.lasso.1se <- OOS.lasso.1se %>% mutate(Model="Lasso 1SE")

OOS.ridge.min <- OOS.ridge.min %>% mutate(Model="Ridge Min")
OOS.ridge.1se <- OOS.ridge.1se %>% mutate(Model="Ridge 1SE")

### MERGE
OOS <- rbind(OOS.lasso.min, OOS.lasso.1se, OOS.ridge.min, OOS.ridge.1se)

6.4.2.1 Model Comparison Plot

Code
OOS %>%
  ggplot( aes(x=reorder(Model,RMSE), y=RMSE, fill=Model)) + 
  geom_boxplot() +
  geom_jitter(shape=16, position=position_jitter(0.2)) +
  theme_bw() + theme(legend.position="none") +
  xlab("") +
  scale_y_continuous(labels = scales::comma) +
  scale_fill_manual(values = mycolors)

Figure 6.2: Model Comparison: k-fold Out-of-Sample RMSE