Reference date of book: 2019-12-06
Chapter 4: Linear Regression
4.2.2 Inference Notes
(Ctrl-f) "Regresion" & "Remdial"
[4.7 Partial least squares]
set.seed(123)
cv_model_pls <- train(
  Sale_Price ~ .,
  data = ames_train,
  method = "pls",
  trControl = trainControl(method = "cv", number = 10),
  preProcess = c("zv", "center", "scale"),
  tuneLength = 20
)
# Model with lowest RMSE
cv_model_pls$bestTune
I'm not able to replicate m = 3 with cv_model_pls$bestTune.
I've tried it on two different computers, and I'm getting m = 19 or 20 instead.
I also experimented with tuneLength = 40, and cv_model_pls$bestTune landed between 19 and 21.
Given the large discrepancy between m = 3 and m = 19, I thought I'd flag it.
After reading the line "Using PLS with m=3 principal components corresponded with the lowest cross-validated RMSE of $29,970", I wondered how I would go about verifying the RMSE other than reading it off the ggplot graph itself.
Suggestion: Consider including the following code to help the reader extract the lowest RMSE for themselves:
library(tidyverse)
# Assuming $bestTune gives ncomp = 19
cv_model_pls$results %>%
  dplyr::filter(ncomp == 19)
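Alternatively, the minimum cross-validated RMSE can be pulled out directly, without hard-coding the component number (still assuming the cv_model_pls object fitted above):

```r
library(dplyr)

# Row of the tuning results with the smallest cross-validated RMSE
cv_model_pls$results %>%
  dplyr::filter(RMSE == min(RMSE))

# Or just the RMSE value itself
min(cv_model_pls$results$RMSE)
```

This avoids the seed/platform sensitivity of bestTune altogether, since whatever ncomp wins locally will be the row returned.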
Fig 4.10
There's a typo in the caption: The 10-fold cross "valdation" RMSE
Online supplementary material
(https://koalaverse.github.io/homlr/notebooks/04-linear-regression.nb.html), there's a section with repetitive words:
(Ctrl-f)
“Prediction from a rank-deficient fit…”
Chapter 5: Logistic Regression
5.5 Assessing model accuracy
"There are 16 numeric features in our data set so the following code performs a 10-fold cross-validated PLS model while tuning the number of principal components to use from 1–16. "
Suggestion - Consider including the following code to allow the reader to extract the number of numeric features for themselves:
length(attrition[sapply(attrition, is.numeric)])
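A slightly more direct alternative (assuming attrition is a plain data frame, as in the chapter) is to sum the logical vector instead of subsetting:

```r
# sapply() returns TRUE/FALSE per column; summing gives the count of numeric features
sum(sapply(attrition, is.numeric))
```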
Suggestion - Consider including the following code to allow the reader to extract the lowest RMSE for themselves:
cv_model_pls$results %>%
  dplyr::filter(ncomp == 14)
Question - Could you elaborate on the intuition behind limiting tuneLength to the number of numeric features? Why can't we set tuneLength to the number of all features?
Chapter 6: Regularized Regression
6.2 Why regularize?
(Ctrl-f) "classicial"
(Ctrl-f) bet on sparsity principal - should be "principle"
6.3 Implementation
(Ctrl-f) Here we just peak - should be "peek"
6.4 Tuning
Suggestion - Consider including the following code to allow the reader to extract the number of Lasso coefficients at the lowest MSE:
lasso$nzero[lasso$lambda == lasso$lambda.min] # No. of coef | Min MSE
lasso$nzero[lasso$lambda == lasso$lambda.1se] # No. of coef | 1-SE MSE
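For context, the two lines above assume lasso is a cv.glmnet fit as in the chapter; a minimal sketch (the variable names X and Y are my assumption, not the book's exact code):

```r
library(glmnet)

# 10-fold cross-validated lasso (alpha = 1) on a numeric model matrix X
# and response vector Y; $nzero then holds the number of nonzero
# coefficients at each candidate lambda
lasso <- cv.glmnet(x = X, y = Y, alpha = 1)
```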
Chapter 7: Multivariate Adaptive Regression Splines
7.5 Feature Interpretation
With the latest version of vip (0.2.1), the following code gives a deprecation warning:
# variable importance plots
> p1 <- vip(cv_mars, num_features = 40, bar = FALSE, value = "gcv") + ggtitle("GCV")
Warning message:
In vip.default(cv_mars, num_features = 40, bar = FALSE, value = "gcv") :
The `bar` argument has been deprecated in favor of the new `geom` argument. It will be removed in version 0.3.0.
> p2 <- vip(cv_mars, num_features = 40, bar = FALSE, value = "rss") + ggtitle("RSS")
Warning message:
In vip.default(cv_mars, num_features = 40, bar = FALSE, value = "rss") :
The `bar` argument has been deprecated in favor of the new `geom` argument. It will be removed in version 0.3.0.
Suggestion: Code tweaked below.
p1 <- vip(cv_mars, num_features = 40, geom = "point", value = "gcv") + ggtitle("GCV")
p2 <- vip(cv_mars, num_features = 40, geom = "point", value = "rss") + ggtitle("RSS")
gridExtra::grid.arrange(p1, p2, ncol = 2)
Thank you!