4 Resampling Methods
5 ISLR: Cross-Validation and The Bootstrap
In this lab, we explore the resampling techniques covered in this chapter. Some of the commands in this lab may take a while to run on your computer.
When fitting a model, it is often desirable to calculate a performance metric that quantifies how well the model fits the data.
If a model is evaluated on the same data it was fit on, the results are likely to be over-optimistic. We therefore split our data into a training set and a test set: we fit the model on the training data and evaluate it on similar, held-out data.
5.1 The Validation Approach
We begin by splitting the set of observations into two halves, selecting a random subset of 196 observations out of the original 392 observations. We refer to these observations as the training set; here we use sample_n() and anti_join(), and the commented line shows the equivalent split with base R's sample().
set.seed(1)
train_auto <- Auto %>%
  sample_n(size = 196)
test_auto <- Auto %>%
  anti_join(train_auto)
# train <- sample(392, 196)
Linear Regression Fit:
lm_auto <- lm(mpg ~ horsepower, data = train_auto)
# lm_auto <- lm(mpg ~ horsepower, data = Auto, subset = train)
MSE:
mse <- mean((test_auto$mpg - predict(lm_auto, test_auto))^2)
MSE: 23.26601
Polynomial Regression Fit:
lm_auto2 <- lm(mpg ~ poly(horsepower, 2), data = train_auto)
mse2 <- mean((test_auto$mpg - predict(lm_auto2, test_auto))^2)
MSE: 18.71646
Quadratic regression performs better than a linear model. Let’s also try a cubic model:
lm_auto3 <- lm(mpg ~ poly(horsepower, 3), data = train_auto)
mse3 <- mean((test_auto$mpg - predict(lm_auto3, test_auto))^2)
MSE: 18.79401
5.2 Leave-One-Out Cross-Validation (LOOCV)
Estimating the test error with LOOCV in linear regression:
glm_auto <- glm(mpg ~ horsepower, data = Auto)
coef(glm_auto)
(Intercept)  horsepower 
 39.9358610  -0.1578447 
The glm() function can be used together with cv.glm() from the boot library to estimate the k-fold cross-validation prediction error. To do this, we pass the fitted glm_auto object to the cv.glm() function:
cv_err <- cv.glm(Auto, glm_auto)
cv_err$delta
[1] 24.23 24.23
We can repeat this procedure for increasingly complex polynomial fits and compare their cross-validation errors. The following example wraps the fit in a helper function and uses map_dbl() to estimate the LOOCV error for polynomial fits of order 1 through 5.
# LOOCV error for a polynomial fit of degree n
loocv_error_poly <- function(n){
  glm_auto <- glm(mpg ~ poly(horsepower, n), data = Auto)
  cv_err <- cv.glm(Auto, glm_auto)
  cv_err[["delta"]][[1]]
}
map_dbl(1:5, loocv_error_poly)
[1] 24.23151 19.24821 19.33498 19.42443 19.03321
5.3 k-Fold Cross-Validation
In addition to LOOCV, cv.glm() can also be used to run $k$-fold cross-validation. In the following example, we estimate the cross-validation error of polynomials of order 1 through 10 using 10-fold cross-validation.
set.seed(17)
# 10-fold CV error for a polynomial fit of degree n
k10_error_poly <- function(n){
  glm_auto <- glm(mpg ~ poly(horsepower, n), data = Auto)
  cv_err_10 <- cv.glm(Auto, glm_auto, K = 10)
  cv_err_10[["delta"]][[1]]
}
map_dbl(1:10, k10_error_poly)
 [1] 24.27207 19.26909 19.34805 19.29496 19.03198 18.89781 19.12061 19.14666
 [9] 18.87013 20.95520
In both LOOCV and k-fold cross-validation, we get lower test errors with quadratic models than linear models, but cubic and higher-order polynomials don’t offer any significant improvement.
5.4 The Bootstrap
This section illustrates the use of the bootstrap on the simple example from Section 5.2 of ISLR, as well as on an example involving estimating the accuracy of the linear regression model on the Auto data set.
5.4.1 Accuracy of a Statistic of Interest
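The statistic of interest here, as in ISLR Section 5.2, is the fraction $\alpha$ of money invested in asset $X$ (with the remainder in asset $Y$) that minimizes the variance of the portfolio. Written out (the formula below is implied by the code rather than stated explicitly in this lab), its estimator is
$$\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}},$$
where $\hat{\sigma}_X^2$, $\hat{\sigma}_Y^2$, and $\hat{\sigma}_{XY}$ are the sample variances and covariance of the returns X and Y in the Portfolio data.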
First we create a function to compute the alpha statistic:
alpha_fn <- function(data, index){
  X <- data$X[index]
  Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}

set.seed(7)
alpha_fn(Portfolio, sample(100, 100, replace = TRUE))
[1] 0.5385326
Instead of manually repeating this procedure with different samples from our dataset, we can automate this process with the boot() function as shown below.
boot(Portfolio, alpha_fn, R = 1000)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Portfolio, statistic = alpha_fn, R = 1000)
Bootstrap Statistics :
     original       bias    std. error
t1* 0.5758321 0.0007959475  0.08969074
5.4.2 Accuracy of an OLS Model
We can apply the same bootstrap approach to the Auto dataset by creating a bootstrap function that fits a linear model to our dataset.
coefs_boot <- function(data, index){
  coef(lm(mpg ~ horsepower, data = data, subset = index))
}

coefs_boot(Auto, 1:392)
(Intercept)  horsepower 
 39.9358610  -0.1578447 
We can run this manually on different samples from the dataset.
coefs_boot(Auto, sample(1:392, 392, replace = TRUE))
(Intercept)  horsepower 
 38.5659756  -0.1447879 
Standard Errors
Finally, we can also automate this by fitting the model on 10,000 bootstrap replicates of our dataset:
boot(Auto, coefs_boot, R = 10000)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Auto, statistic = coefs_boot, R = 10000)
Bootstrap Statistics :
       original          bias    std. error
t1*  39.9358610  0.0356735447  0.864233803
t2*  -0.1578447 -0.0004414161  0.007467949
The summary() function can be used to compute standard errors for the regression coefficients.
summary(lm(mpg ~ horsepower, data = Auto))$coef
              Estimate  Std. Error   t value      Pr(>|t|)
(Intercept) 39.9358610 0.717498656  55.65984 1.220362e-187
horsepower  -0.1578447 0.006445501 -24.48914  7.031989e-81
6 DSM: Cross-Validation and Bootstrap
In this exercise, we will work with a dataset about home loan eligibility. A housing finance company provides home loans to customers in urban, semi-urban, and rural areas. After a customer applies, the company validates the customer's eligibility for the loan, but this manual validation process is time-consuming.
Our aim is to create a predictive model that gives the likely outcome of a loan application and speeds up the procedure for the finance company.
Since acceptance is a binary decision, we will use logistic regression for our predictions. After running the logit model, we will calculate how many misclassifications we have. Then we will use cross-validation to calculate the test error rate.
In the last part of the exercise, we will apply bootstrapping to calculate standard errors of the coefficients. We have an analytical formula for the standard errors, and the glm() function reports them automatically. We will compare the standard errors from theory with the standard errors from bootstrapping.
6.1 Question 1
Load the dataset homeloan.csv and summarize it. Eliminate rows with missing values. Convert the categorical variables to factors.
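A minimal sketch of one way to do this, assuming homeloan.csv is in the working directory and the categorical columns are read in as character vectors:
library(tidyverse)

# Read the data (the column names used in later sketches are assumptions about the CSV layout)
homeloan <- read_csv("homeloan.csv")
summary(homeloan)

homeloan <- homeloan %>%
  drop_na() %>%                                   # eliminate rows with missing values
  mutate(across(where(is.character), as.factor))  # convert categorical variables to factors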
6.2 Question 2
Estimate a logit regression using LoanAmount, Self Employed, Education, Married, Gender, ApplicantIncome, Credit History and Property Area as independent variables and Loan Status as the dependent variable. Summarize the estimation results.
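A sketch of how the estimation could look; the exact column names (Self_Employed, Credit_History, Property_Area, Loan_Status, and so on) are assumptions about how the CSV is coded:
# Logit model; family = binomial gives logistic regression
logit_fit <- glm(Loan_Status ~ LoanAmount + Self_Employed + Education + Married +
                   Gender + ApplicantIncome + Credit_History + Property_Area,
                 family = binomial, data = homeloan)
summary(logit_fit)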
6.3 Question 3
Based on the results of this estimation, make predictions using 0.5 as the classification threshold and calculate the resulting error rate.
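A sketch of the prediction step, assuming Loan_Status is coded with levels "N" and "Y" and logit_fit is the model object from Question 2:
pred_prob  <- predict(logit_fit, type = "response")   # fitted probabilities
pred_class <- ifelse(pred_prob > 0.5, "Y", "N")       # classify with a 0.5 threshold
mean(pred_class != homeloan$Loan_Status)              # misclassification (error) rate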
6.4 Question 4
Write a for loop to implement LOOCV using the following steps (a sketch is given after the list):
Create an empty array with one entry per observation.
Letting i be the index of the for loop, estimate the model without the $i$th observation.
Predict the outcome of the $i$th observation using the estimation results.
Store the predicted outcome in the array you created.
Calculate the error rate with these new predictions.
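A minimal LOOCV sketch, reusing the assumed column names and coding from the earlier questions:
n <- nrow(homeloan)
loocv_pred <- rep(NA_character_, n)   # empty array, one slot per observation

for (i in seq_len(n)) {
  fit_i <- glm(Loan_Status ~ LoanAmount + Self_Employed + Education + Married +
                 Gender + ApplicantIncome + Credit_History + Property_Area,
               family = binomial, data = homeloan[-i, ])              # drop the ith observation
  prob_i <- predict(fit_i, newdata = homeloan[i, ], type = "response")
  loocv_pred[i] <- ifelse(prob_i > 0.5, "Y", "N")                     # store the prediction
}

mean(loocv_pred != homeloan$Loan_Status)   # LOOCV error rate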
6.5 Question 5
Compare the error rates from Q3 and Q4.
6.6 Question 6
Write a for loop to implement bootstrapping using the following steps (a sketch is given after the list):
Create an empty array of size 1000 by (number of regressors + 1) to store the estimation results.
For each iteration of the for loop, draw a random sample from your data, with replacement, of the same size as your dataset.
Run a logit regression on this new dataset and store its coefficients in the array you created.
Find the standard deviation of each coefficient stored in the array.
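A bootstrap sketch under the same assumptions, with logit_fit taken from the Question 2 sketch:
B <- 1000
p <- length(coef(logit_fit))                   # number of regressors + 1 (intercept)
boot_coefs <- matrix(NA_real_, nrow = B, ncol = p)

set.seed(1)
for (b in seq_len(B)) {
  idx   <- sample(nrow(homeloan), replace = TRUE)   # resample rows with replacement
  fit_b <- glm(Loan_Status ~ LoanAmount + Self_Employed + Education + Married +
                 Gender + ApplicantIncome + Credit_History + Property_Area,
               family = binomial, data = homeloan[idx, ])
  boot_coefs[b, ] <- coef(fit_b)                    # store this replicate's coefficients
}

apply(boot_coefs, 2, sd)   # bootstrap standard errors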
6.7 Question 7
Compare the standard errors from theory with the standard errors calculated by bootstrapping.
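One way to place the two sets of standard errors side by side, reusing the (assumed) objects from the sketches above:
data.frame(
  theory    = summary(logit_fit)$coefficients[, "Std. Error"],  # model-based standard errors
  bootstrap = apply(boot_coefs, 2, sd)                          # bootstrap standard errors
)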