For part c

BetaHat3 = function(units, xy){
  # Columns of xy: 1 = covariate (age), 2 = fitted values, 3 = residuals.
  if(length(units) == nrow(xy)){ # in the bootstrap part: length(units)==n
    # Resample only the residuals; the covariate and fitted values stay in place.
    ystar = xy[, 2] + xy[units, 3]
    m = lm(ystar ~ xy[, 1])
  }
  else{ # in the jack-knife part: length(units)==n-1
    # With the jack-knife (done internally by bcanon), we also need to leave
    # one element out of the covariate, fitted values and residuals.
    # Otherwise we would be adding vectors of different lengths.
    ystar = xy[units, 2] + xy[units, 3]
    m = lm(ystar ~ xy[units, 1])
  }
  return(m$coefficients[2]) # slope estimate
}

res_bca3 = bcanon(1:n, BetaHat3, nboot=N, cbind(diabetes$age, fitted, resid))
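
To see what the bootstrap branch of BetaHat3 is doing, here is a minimal base-R sketch of the same residual-resampling idea. It makes several assumptions: the data are simulated rather than the diabetes data, the names n_sim, fitted_sim, resid_sim and B are made up for illustration, and the interval is a plain percentile interval, not the BCa interval that bcanon computes.

```r
# Simulated data standing in for the diabetes data (illustration only).
set.seed(1)
n_sim = 50
x = runif(n_sim, 0, 10)
y = 2 + 0.5 * x + rnorm(n_sim)

fit = lm(y ~ x)
fitted_sim = fitted(fit)
resid_sim = residuals(fit)

# Residual bootstrap: keep x and the fitted values fixed,
# resample the residuals with replacement, refit, and store the slope.
B = 1000
slopes = numeric(B)
for (b in 1:B) {
  units = sample(1:n_sim, n_sim, replace = TRUE)
  ystar = fitted_sim + resid_sim[units]
  slopes[b] = coef(lm(ystar ~ x))[2]
}

# Simple 95% percentile interval for the slope (not BCa).
ci = quantile(slopes, c(0.025, 0.975))
ci
```

The loop body mirrors the first branch of BetaHat3: only the residual index changes between replicates.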

PRESS statistic

To find the cross-validated residuals and hence the PRESS statistic, we can repeatedly fit the model, leaving out one row at a time, and each time predict the outcome Y for the data-point that was left out. In the following code, Hospitals[-i,] gives the dataset with the ith row left out. Then predict(modeli, Hospitals[i,]) returns the prediction using the covariates in the ith row. So pred1 is filled in with the cross-validated predictions ŷ[i].

n = nrow(Hospitals)
pred1 = vector(length=n)
for(i in 1:n){
modeli = lm(Y ~ X5, Hospitals[-i,])
pred1[i] = predict(modeli, Hospitals[i,])
}
The cross-validated residuals can then be found as
cv_res1 = Hospitals$Y - pred1
Plot both the ordinary residuals and the cross-validated residuals against X5. (If we make a plot with plot(…), we can add another scatter-plot to it using points(…).) Which set of residuals tends to be further from 0?
Now calculate the value of the PRESS statistic:
PRESS1 = sum(cv_res1^2)
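
As a check on the loop, the sketch below computes PRESS in two ways on a dataset that ships with R. It assumes mtcars with mpg ~ wt as a stand-in for Hospitals with Y ~ X5, and the names dat, press_loop and press_hat are made up for illustration. The second computation uses the standard least-squares identity that the cross-validated residual equals the ordinary residual divided by one minus the leverage, so no refitting is needed.

```r
# mtcars ships with R; mpg ~ wt is a hypothetical stand-in for Y ~ X5.
dat = mtcars
n_dat = nrow(dat)

# Leave-one-out loop, as in the text.
pred = numeric(n_dat)
for (i in 1:n_dat) {
  model_i = lm(mpg ~ wt, data = dat[-i, ])
  pred[i] = predict(model_i, newdata = dat[i, ])
}
press_loop = sum((dat$mpg - pred)^2)

# Shortcut from a single fit: residual / (1 - leverage).
full = lm(mpg ~ wt, data = dat)
press_hat = sum((residuals(full) / (1 - hatvalues(full)))^2)

# Compare the two values up to numerical tolerance.
all.equal(press_loop, press_hat)
```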

Repeat this for the other two models. Looking at the PRESS values, which of the three models is best for predicting values of the response variable? Does adding more variables always decrease the value of the PRESS statistic?
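
To avoid copying the loop for each model, it can be wrapped in a small function. The following is only a sketch under stated assumptions: press is a hypothetical helper (not part of any package), and the three formulas on mtcars are illustrative stand-ins for the three Hospitals models.

```r
# Hypothetical helper: leave-one-out PRESS for any lm formula and data frame.
press = function(formula, data) {
  response = all.vars(formula)[1]   # name of the outcome variable
  n_rows = nrow(data)
  pred = numeric(n_rows)
  for (i in 1:n_rows) {
    model_i = lm(formula, data = data[-i, ])
    pred[i] = predict(model_i, newdata = data[i, ])
  }
  sum((data[[response]] - pred)^2)
}

# Three nested models on mtcars (stand-ins for the three Hospitals models).
forms = list(mpg ~ wt, mpg ~ wt + hp, mpg ~ wt + hp + qsec)
press_values = sapply(forms, press, data = mtcars)

# Ordinary residual sums of squares, for comparison with PRESS.
rss_values = sapply(forms, function(f) sum(residuals(lm(f, data = mtcars))^2))

data.frame(
  model = sapply(forms, function(f) paste(deparse(f), collapse = "")),
  PRESS = press_values,
  RSS = rss_values
)
```

With the Hospitals data, the same call would look like press(Y ~ X5, Hospitals), and similarly for the other two model formulas.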

