koalaverse / homlr

Supplementary material for Hands-On Machine Learning with R, an applied book covering the fundamentals of machine learning with R.

Home Page: https://koalaverse.github.io/homlr

License: Creative Commons Attribution Share Alike 4.0 International

r machine-learning data-science supervised-learning unsupervised-learning

homlr's People

Contributors

bgreenwell, bradleyboehmke, brandongreenwell-8451


homlr's Issues

Convert chapters to RStudio Cloud notebooks for interactive use

I'm adding notebooks for each chapter to a homlr RStudio Cloud project. This will allow people to run the code chunks (and reproduce figures/tables) for each chapter.

Question: Do we only put the code blocks in the notebook or should we dump the entire chapter into each notebook? Reference notebooks: 02-modeling-process for an example of just the code chunks, and 03-feature-engineering for an example of the entire chapter.

If we put in the entire chapter I think we need to add a copyright notice at the bottom. Also, we won't be able to get everything to render properly unless we can get notebooks to leverage the additional CSS files. But even then, chapter cross-links won't persist.

Thoughts @bgreenwell?

Code for Chapter 8 not working

I attempted to use the read_mnist() function from dslabs and it returned this error:

Error in readBin(conn, "integer", n = prod(dim), size = 1, signed = FALSE) : 
  cannot read from connection
In addition: Warning message:
In readBin(conn, "integer", n = prod(dim), size = 1, signed = FALSE) :
  URL 'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz': Timeout of 60 seconds was reached

It looks like http://yann.lecun.com/exdb/mnist/ is no longer live, but with a little help from the Brave browser I found an old image of the site via the Wayback Machine and downloaded the files.

I modified the function to read the data out of local copies:

read_mnist_local <- function () {
    mnist <- list(train = list(images = c(), labels = c()), test = list(images = c(), 
        labels = c()))
    for (ttt in c("train", "t10k")) {
        # images: the IDX format stores a 4-byte magic number, the array
        # dimensions, then the raw pixel bytes
        fn <- paste0(ttt, "-images-idx3-ubyte.gz")
        # url <- url(paste0("", fn), "rb")
        # conn <- gzcon(url)
        conn <- gzcon(file(fn, "rb"))
        magic <- readBin(conn, "integer", n = 1, size = 4, endian = "big")
        typ <- bitwAnd(bitwShiftR(magic, 8), 255)  # data type code
        ndm <- bitwAnd(magic, 255)                 # number of dimensions
        dim <- readBin(conn, "integer", n = ndm, size = 4, endian = "big")
        data <- readBin(conn, "integer", n = prod(dim), size = 1, 
            signed = FALSE)
        tt <- ttt
        if (tt == "t10k") 
            tt <- "test"  # the t10k files go in the "test" slot
        mmm <- matrix(data, nrow = dim[1], byrow = TRUE)  # one image per row
        mnist[[tt]][["images"]] <- mmm
        close(conn)
        # labels: magic number, label count, then one byte per label
        fn <- paste0(ttt, "-labels-idx1-ubyte.gz")
        # url <- url(paste0("", fn), "rb")
        # conn <- gzcon(url)
        conn <- gzcon(file(fn, "rb"))
        magic <- readBin(conn, "integer", n = 1, size = 4, endian = "big")
        nlb <- readBin(conn, "integer", n = 1, size = 4, endian = "big")
        data <- readBin(conn, "integer", n = nlb, size = 1, signed = FALSE)
        mnist[[tt]][["labels"]] <- data
        close(conn)
    }
    mnist
}
 
# import MNIST training data
#mnist <- dslabs::read_mnist()
mnist <- read_mnist_local()

… and all is good. I don’t know the proper solution (other than hosting the files) but I figured I should share this in the hope it helps others.

chapter 5, latex error

Book version: 2019-05-25

In section 5.2 there is some LaTeX code rendered as regular text: "-1.144649210^{-4} units".

Typo in 13.9

Hi - I found a small typo in the second paragraph of the "Final Thoughts" section, 13.9. It currently reads:

..."M=# hidden nodes, and L=# epchos."

Should read: "epochs"

Found in book v. 2020-02-01. Thanks for a great book! Very useful.

Figure 20.8 not working

The following code:

set.seed(123)

fviz_nbclust(
  ames_1hot_scaled, 
  kmeans, 
  method = "wss", 
  k.max = 25, 
  verbose = FALSE
)

Returns:

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

My environment is:
R version 4.0.5 (2021-03-31)
factoextra 1.0.7
AmesHousing 0.0.4
caret 6.0.86
dplyr 1.0.5
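
A hedged guess at a workaround: kmeans() fails with "NA/NaN/Inf in foreign function call" when the input contains NaN, and scale() produces NaN columns whenever a column has zero variance (sd = 0), which can happen after one-hot encoding. Dropping such columns before scaling may resolve it; ames_1hot here is assumed to be the chapter's one-hot-encoded data:

# drop zero-variance columns before scaling; scale() maps them to NaN
zv <- apply(ames_1hot, 2, function(x) var(x) == 0)
ames_1hot_scaled <- scale(ames_1hot[, !zv])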

Chapter 1 section 1.2 Unsupervised learning

Book version 2019-06-01

The goal of clustering is to segment observations into similar groups based on the observed variables; for example, to divide consumers into different homogeneous groups, a process known as market segmentation. In dimension reduction, we are often concerned with reducing the number of variables in a data set. For example, classical linear regression models break down in the presence of highly correlated features. Some dimension reduction techniques can be used reduce the feature set to a potentially smaller set of uncorrelated variables.

There is a missing "to" in the last sentence of the paragraph above. Highlighted in bold below.

For example, classical linear regression models break down in the presence of highly correlated features. Some dimension reduction techniques can be used to reduce the feature set to a potentially smaller set of uncorrelated variables.

Edit
Same section as above

Such a reduced feature set is often used as input to downstream supervised learning models (e.g., principle component regression).

The highlighted text should read "principal component regression".

Proposed Ch. 1 exercises

A few proposed questions for chapter 1:

  1. Identify four real-life applications of supervised and unsupervised problems.

    • Explain what makes these problems supervised versus unsupervised.
    • For each problem identify the target variable (if applicable) and potential features.
  2. Identify and contrast a regression problem with a classification problem.

    • What is the target variable in each problem and why would being able to accurately predict this target be beneficial to society?
    • What are potential features and where could you collect this information?
    • What determines whether the problem is a regression or a classification problem?
  3. Identify three open source datasets suitable for machine learning (e.g., https://bit.ly/35wKu5c).

    • Explain the type of machine learning models that could be constructed from the data (e.g., supervised versus unsupervised and regression versus classification).
    • What are the dimensions of the data?
    • Is there a code book that explains who collected the data, why it was originally collected, and what each variable represents?
    • If the dataset is suitable for supervised learning, which variable(s) could be considered as a useful target? Which variable(s) could be considered as features?
  4. Identify examples of misuse of machine learning in society. What was the ethical concern?

Proposed Ch. 2 exercises

  1. Load the Boston housing data set from the pdp package. These data come from a classic paper that analyzed the relationship between several characteristics (e.g., crime rate, average rooms per dwelling, property tax value) and the median value of homes within a census tract (cmedv). See ?pdp::boston for details and further references.

    • What are the dimensions of this data set?
    • Perform some exploratory data analysis on this data set (be sure to assess the distribution of the target variable cmedv).
  2. Split the Boston housing data into a training set and test set using a 70-30% split (see the sketch after this exercise list).

    • How many observations are in the training set and test set?
    • Compare the distribution of cmedv between the training set and test set.
  3. Load the spam data set from the kernlab package.

    • What is the distribution of the target variable (type) across the entire data set?
    • Create a 70/30 training/test split stratified by the target variable.
    • Compare the distribution of the target variable between the training set and test set.
  4. Using the Boston housing training data created in 2), fit a linear regression model that uses all available features to predict cmedv.

    • Create a model with lm(), glm(), and caret::train().
    • How do the coefficients compare across these models?
    • How does the MSE/RMSE compare across these models?
    • Which method is caret::train() using to fit a linear regression model?
  5. Using the Boston housing training data created in exercise 2), perform a 10-fold cross-validated linear regression model, repeated 5 times, that uses all available features to predict cmedv.

    • What is the average RMSE across all 50 model iterations?
    • Plot the distribution of the RMSE across all 50 model iterations.
    • Describe the results.
    • Repeat this exercise for the spam data from exercise 3); since the target (type) is binary, be sure to use a more appropriate metric (e.g., AUC or misclassification error).
  6. Repeat exercise 5) on the Boston housing data; however, instead of a linear regression model, use a k-nearest neighbor model that executes a hyperparameter grid search where k ranges from 2--20. How does this model's results compare to the linear regression results?
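
For reference, a minimal sketch of the stratified 70/30 split pattern exercises 2) and 3) call for (assumes the pdp and rsample packages are installed):

library(rsample)

# 70/30 split of the Boston housing data, stratified by the target
boston <- pdp::boston
set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = "cmedv")
boston_train <- training(split)
boston_test  <- testing(split)
nrow(boston_train); nrow(boston_test)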

Error in 2.1: attrition data is in a different package

At the end of section 2.1 is a sample code block. This line does not work:

churn <- rsample::attrition %>% 
  mutate_if(is.ordered, .funs = factor, ordered = FALSE)

It returns this error: Error: 'attrition' is not an exported object from 'namespace:rsample'

It's failing because the attrition data was moved to modeldata. See the info for version 0.0.7 here: https://cloud.r-project.org/web/packages/rsample/news/news.html.

To fix:

  1. Add library(modeldata) to first code block in 2.1.
  2. Change rsample::attrition to attrition.
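
With those two changes, the fixed block would look something like this (dplyr is assumed to be loaded, as in the chapter's first code block):

library(modeldata)  # now hosts the attrition data
data("attrition")

churn <- attrition %>% 
  mutate_if(is.ordered, .funs = factor, ordered = FALSE)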

chapter 5, suggestion: reasons to avoid linear regression

Book version: 2019-05-25

Section 5.2: Consider mentioning other reasons that linear regression is unsuitable for classification, such as sensitivity to imbalances and outliers. The section might give the impression that it's not so terrible to use what we used to call the "linear probability model" in econometrics. (Bleah.)

ICE plots - Error in guess(varying)

I'm following your tutorial on binary classification with random forests. When attempting to plot ICE plots I get an error:

Error in guess(varying) : 
  failed to guess time-varying variables from their names

Updating pdp to dev did not help. Is this an issue with pdp and reshape() or am I doing something wrong?
The code I'm running is an exact copy of

p1 <- m3_ranger_prob %>%
  partial(pred.var = "OverTime", ice = TRUE, center = TRUE, pred.fun = custom_pred, train = attrit_train) %>%
  autoplot(rug = TRUE, train = attrit_train, alpha = 0.2)

just with different variables.

Figure 4.4 no longer works

The code for Figure 4.4 no longer works because broom::augment does not add a .resid variable. If .resid is replaced with .std.resid the figure is produced.
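
A hedged sketch of the fix, assuming the figure is built from broom::augment() output on the chapter's model (cv_model1 and ames_train as defined in chapter 4):

df1 <- broom::augment(cv_model1$finalModel, data = ames_train)

p1 <- ggplot(df1, aes(.fitted, .std.resid)) +  # was aes(.fitted, .resid)
  geom_point(size = 1, alpha = .4) +
  xlab("Predicted values") +
  ylab("Residuals")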

Question RE: Chapter 13.4.2.2 Implementation

Running:

  • R version 3.6.2 (2019-12-12)
  • RStudio, Version 1.2.5033
  • Keras, 2.2.5.0
  • tensorflow, 2.0.0
  • tfestimators, version 1.9.1

In Chapter 13.4.2.2 Implementation

  1. After loading the libraries in section 13.1 Prerequisites and
  2. Importing MNIST training data
  3. Ran 13.4.1.3 Implementation: ...keras_model_sequential() %>% {128, 64, 10}...
  4. After running code block in 13.4.2.2 Implementation
model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = p) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")

I get the error:

Error in py_call_impl(callable, dots$args, dots$keywords) : ValueError: Error converting shape to a TensorShape: invalid literal for int() with base 10: 'tfestimators'.

I'm currently seeking info on keras and tensorflow github sites...

Is this a new issue?

Error description in Chapter 4, Section 4.5 Model concerns

Version: 2020-02-01

In chapter 4, section 4.5 Model concerns, regarding the 5th assumption of no or little multicollinearity:

Looking at our full model where both of these variables are included, we see that Garage_Cars is found to be statistically significant but Garage_Area is not:

However, the following result shows that Garage_Area is found to be statistically significant but Garage_Cars is not. The surrounding content also needs to be modified accordingly.

Minor typos in several chapters

Hi, thanks for this book. While reading I have come across many minor typos in the text. I'll try to reference them here.

The typos and my proposed corrections will be in bold; redundant words are struck out.

In Chapter 4.8 : The last paragraph of the section :

gradually decreases for lessor important variables.
Correct would be :
gradually decreases for lesser important variables.

In Chapter 6.2 : the paragraph right after the note to the reader.

Many real-life data sets, like those cmmon to text mining
Correct would be :
Many real-life data sets, like those common to text mining

In Chapter 9.5 . The last sentence of the second paragraph

"Basically, this is telling us that Overall_Qual is an important predictor os sales price"
Correct would be :
Basically, this is telling us that Overall_Qual is an important predictor of sales price.

In Chapter 10.2. The last sentence of the "note to the reader"

you’ll often find that the averaged guesses tends to be a lot closer to the true numnber.
Correct would be
you’ll often find that the averaged guesses tends to be a lot closer to the true number

In 10.3. The fourth (4th) line of the first paragraph

we’re keeping bias low and **avriance** high
Correct would be:
we’re keeping bias low and variance high

In 10.4 last sentence of the third paragraph

how the OOB error closely closely approximates the test error.

In Chapter 11, second line

collection of de-correlated trees to to further improve

In 11.4.3, the fifth line

if computation time is a concern than you can
Correct would be:
if computation time is a concern then you can

In 11.7, second line

(witht he exception of surrogate splits)
Correct would be :
(with the exception of surrogate splits)

I'm still reading the book, so I will keep posting the issues.

Typos in Chapter 6 Regularized Regression

Section 6.5 Feature interpretation:

However, not that one of the top 20 most influential variables is Overall_QualPoor.

"not" should be "note".

Section 6.7 Final thoughts:

It provides a great option for handling the n > p problem, helps minimize the impact of multicollinearity, and can perform automated feature selection.

"n > p" should be "n < p".

lambda needs to be defined (or estimated) in 3.2 section

y <- forecast::BoxCox(10, lambda)
returns an error (since lambda has not been defined (or estimated) earlier).

Adding a line like
lambda <- -0.03616899
would fix this issue.
(Note that one can estimate lambda by using the BoxCox.lambda() function; see the sketch below.)
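
A small sketch combining both notes (assuming the chapter's ames_train is available to estimate lambda from):

# estimate lambda from the response, then apply the transformation
lambda <- forecast::BoxCox.lambda(ames_train$Sale_Price)
y <- forecast::BoxCox(10, lambda)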

Figure 3.1 not reproducible under the new version of purrr or broom package

Hi there. It seems that Figure 3.1 is not reproducible because of the new version of purrr or broom:

library(purrr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
ggplot2::theme_set(ggplot2::theme_light())

ames <- AmesHousing::make_ames()
#> Warning: `funs()` is deprecated as of dplyr 0.8.0.
#> Please use a list of either functions or lambdas: 
#> 
#>   # Simple named list: 
#>   list(mean = mean, median = median)
#> 
#>   # Auto named with `tibble::lst()`: 
#>   tibble::lst(mean, median)
#> 
#>   # Using lambdas
#>   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.

# Load and split the Ames housing data using stratified sampling
set.seed(123)  # for reproducibility
split <- rsample::initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- rsample::training(split)
ames_test <- rsample::testing(split)

models <- c("Non-log transformed model residuals", 
            "Log transformed model residuals")
list(
  m1 = lm(Sale_Price ~ Year_Built, data = ames_train),
  m2 = lm(log(Sale_Price) ~ Year_Built, data = ames_train)
) %>%
  map2_dfr(models, ~ broom::augment(.x) %>% mutate(model = .y)) %>%
  ggplot(aes(.resid)) +
    geom_histogram(bins = 75) +
    facet_wrap(~ model, scales = "free_x") +
    ylab(NULL) +
    xlab("Residuals")
#> Warning: Removed 2053 rows containing non-finite values (stat_bin).

Created on 2020-07-18 by the reprex package (v0.3.0)

The use of map2_dfr is so succinct that I spent hours understanding what's going on here. However, the result of map2_dfr now has a problem: the .resid column is NA in the log-transformed model. Could you please have a look at how to handle it? Many thanks for your guidance!

My sessioninfo is as follows:

sessionInfo(c("purrr", "broom"))
#> R version 3.6.2 (2019-12-12)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Catalina 10.15.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
#> 
#> attached base packages:
#> character(0)
#> 
#> other attached packages:
#> [1] purrr_0.3.4 broom_0.7.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.6.2  magrittr_1.5    graphics_3.6.2  tools_3.6.2    
#>  [5] htmltools_0.5.0 utils_3.6.2     yaml_2.2.1      grDevices_3.6.2
#>  [9] stats_3.6.2     datasets_3.6.2  stringi_1.4.6   rmarkdown_2.3.2
#> [13] highr_0.8       knitr_1.29      methods_3.6.2   stringr_1.4.0  
#> [17] xfun_0.15       digest_0.6.25   rlang_0.4.7     base_3.6.2     
#> [21] evaluate_0.14

Created on 2020-07-18 by the reprex package (v0.3.0)

Is this a potential problem with the code or my machine?

I am working through sections of HOML/autoencoders.html and ran into an error message.

In section 19.2.3 Visualizing the reconstruction, I found an error associated with this line:

# Predict reconstructed pixel values  
best_model_id <- grid_perf@model_ids[[1]]

after this line I get:

Error: object 'grid_perf' not found

Up to this point I have followed the code from the autoencoder section. Should I look at my setup, or is this a change in H2O.ai and the code?
HTH

various minor typos

version 2019-10-19

3.3.1

Final paragraph reads

Perhaps the values were never recoded

should it be "recorded"?

4.2.1

There is a discrepancy in the Residual standard error/RMSE reported for model1: 55750 vs. 55753.45

summary(model1)
## Residual standard error: 55750 on 2052 degrees of freedom

sigma(model1) # RMSE
## [1] 55753.45

5.3

2nd para reads

In the background glm() , uses ML estimation

there seems to be an unneeded comma (and space?)

6.3

2nd code chunk comment says:

# Apply ridge regression to attrition data

I believe this should be the Ames data

Small typo in 3.2.2 - "engineering steps yo take" should be "engineering steps you take"

Thank you for giving us feedback on the book! To help us as much as possible please follow these guidelines depending on the type of issue you are submitting and be sure to state the version of the book by referencing the date on the book's first page (under the title and authors).

  • Comments or questions regarding a specific chapter:
    • If this is regarding a specific passage please try to be as specific as possible about where the issue is (e.g. chapter, section, figure number, etc.) and, whenever possible, copy/paste the text in question using > in the issue.
    • If you feel information needs to be added please provide exemplar verbiage you feel should be included, why it should be included, any references required, and the specific location where it should be placed.
  • Illustrations: Please state the Figure or Table number under concern. If you are suggesting an alternative way to illustrate the idea please supply supporting code or content.
  • Not reproducible: If you are having problems reproducing example code, complete the following checklist:
    • make sure you have all packages listed in the chapter prereq section installed
    • verify if your package versions differ from those used in the book (check software information section in Preface)
    • please list where your error is occurring and what the error is
  • Typo: Be as specific as possible regarding where the typo is (e.g. chapter, section, figure number, etc.) and, whenever possible, copy/paste the text in question using > in the issue.
  • References:
    • Incorrect reference: please provide specific location of the current reference and the suggested edits required to correct the problem.
    • Missing references: please provide Google Scholar Bibtex format for the suggested reference and why & where it should be included.
  • General comments: Although general comments are welcome, please be as specific as possible.
  • Design: Although design comments are welcome, please understand that we may or may not have control over the specific design issue you are identifying. Please be as specific as possible about the problem you've identified and provide an example of a remedy.

chapter 5, suggestion: consider avoiding discussion about log odds

Book version: 2019-05-25

I tend to find discussion of log odds confusing and not easily interpretable, and in the chapter the log-odds interpretation isn't really used for anything. Unless you're going to talk more generally about GLMs, consider just keeping the discussion in terms of good ol' probability.

Figure 3.1 code does not work

The code for figure 3.1 does not render properly.


There is a known bug in broom::augment.lm() that causes this:
tidymodels/broom#937

This shows the guts of the problem:

library(broom)
aModel <- augment(lm(log(Sale_Price) ~ Year_Built, data = ames_train))
names(aModel)
aModel <- augment(lm(Sale_Price ~ Year_Built, data = ames_train))
names(aModel)

This code works:

library(tidyverse)
ames <- AmesHousing::make_ames() %>% 
  mutate(log_Sale_Price = log(Sale_Price))

# Load and split the Ames housing data using stratified sampling
set.seed(123)  # for reproducibility
split  <- rsample::initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train  <- rsample::training(split)
ames_test   <- rsample::testing(split)

models <- c("Non-log transformed model residuals", 
            "Log transformed model residuals")
list(
  m1 = lm(Sale_Price ~ Year_Built, data = ames_train) ,
  m2 = lm(log_Sale_Price ~ Year_Built, data = ames_train)
) %>%
  map2_dfr(models, ~ broom::augment(.x) %>% mutate(model = .y)) %>%
  ggplot(aes(.resid)) +
  geom_histogram(bins = 75) +
  facet_wrap(~ model, scales = "free") +
  ylab(NULL) +
  xlab("Residuals")

5.1 code no longer works

This line of code: df <- attrition %>% mutate_if(is.ordered, factor, ordered = FALSE) no longer works because the attrition dataset has been moved to the modeldata package. Add data("attrition") before that line to resolve the problem.

Chapter 6, Figure 6.1 code chunks 18 and 19 have an error

I am running the code of chapter 6 with R version 4.0.2.

While running the code on my local machine I got this error:

"Error: x must be a vector, not a data.frame/partial object."

I believe there is an issue in the pdp package.

I can run the code and produce the charts flawlessly on RStudio cloud.

Regards,
Mohamed.

Possible typo

I'm relatively new to GBM but I'm wondering if this is a typo?

[12.2.1] Sequential training with respect to errors

Should Step 5 instead read:
5. Add this new tree to our algorithm:

F_3(x) = F_2(x) + h_2(x)

?
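
For reference, the general recursion these steps instantiate (with each new tree h_m fit to the residuals of the current ensemble), which with m = 3 gives exactly the line above:

F_m(x) = F_{m-1}(x) + h_m(x), \qquad h_m(x) \approx y - F_{m-1}(x)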

Other slight typos:
[3.3.2.1] Estimated Statistic
The data output for ames_recipe %>% step_medianimpute(Gr_Liv_Area) shouldn't read

Box-Cox transformation on all_outcomes()

Instead, it should read

Log transformation on all_outcomes()

as ames_recipe was last defined as

ames_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_log(all_outcomes())

[3.4] Feature Filtering
The library needed for rownames_to_column() (tibble) was not loaded in the prior code; I had to use library(tidyverse) to run rownames_to_column() (see the sketch below).
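
A minimal sketch of the dependency (the nearZeroVar() pipeline is a hypothetical stand-in for the chapter's filtering code):

library(tibble)  # provides rownames_to_column()
library(dplyr)   # provides %>% and filter()

caret::nearZeroVar(ames_train, saveMetrics = TRUE) %>%
  rownames_to_column() %>%
  filter(nzv)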

Chapter 2, Section 2.5.3 Hyperparameter Tuning typo

Typo in Section 2.5.3 Hyperparameter Tuning.

In this sentence:

Figure 2.10 illustrates this point. Smaller k values (e.g., 2, 5, or 10) lead to high variance (but lower variance) and larger values (e.g., 150) lead to high bias (but lower variance).

Did you mean to say:

Figure 2.10 illustrates this point. Smaller k values (e.g., 2, 5, or 10) lead to high variance (but lower ~~variance~~ bias) and larger values (e.g., 150) lead to high bias (but lower variance).

Also, great book, I'm in a group that's going through ISLR, and your book is a nice supplement to that.

Thanks

FYI: Minor Spelling

  • In Section: 19.4 Denoising autoencoders
  • Just under figure: Figure 19.8: Original digit sampled from the MNIST test set (left), corrupted data with on/off imputation (middle), and corrupted data with Gaussian imputation (right).
  • On line #4: "have been corrupted with Gaussian noise (inputs_currupted_gaussian) and supply the original input"
    • "inputs_currupted_gaussian" => corrupted
  • In Code section: "# Train a denoise autoencoder"
    • "training_frame = inputs_currupted_gaussian," => corrupted

Unless this is your intention.

Minor spelling

Three times in Chapter 12 you use the word "minimas". That ought to be either "minimums" or "minima", I think, "minima" already being the Latin plural.

chapter 5, discussion: include a little more detail on model metrics

Book version: 2019-05-25

It's been a number of chapters since the intro, and some readers are likely to need a brief refresher on ROC curves and sensitivity / specificity & friends. The PLS example doesn't really seem to come to anything for the attrition dataset, and could probably be excised to make more room.

Undefined variables/arguments

Version: Online ebook (2020-02-01)

I.2.4.1 k-fold cross validation

The code below fails as x and y were not defined before calling the function

# Example using h2o
h2o.cv <- h2o.glm(
  x = x, 
  y = y, 
  training_frame = ames.h2o,
  nfolds = 10  # perform 10-fold CV
) 

This code snippet, called earlier, provides x and y within the function call:

model_fn(
  x = c("Year_Sold", "Longitude", "Latitude"),
  y = "Sale_Price",
  data = ames.h2o
)

Solutions:

  1. Define both x and y externally and pass them to these functions.
x <- c("Year_Sold", "Longitude", "Latitude")
y <- "Sale_Price"

OR

  2. Create a new assignment only for y; features is the same as x defined above, so use features instead of x.
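
Putting solution 1 together (a sketch; ames.h2o is the chapter's H2O frame):

# define the feature names and target once, then reuse them
x <- c("Year_Sold", "Longitude", "Latitude")
y <- "Sale_Price"

h2o.cv <- h2o.glm(
  x = x, 
  y = y, 
  training_frame = ames.h2o,
  nfolds = 10  # perform 10-fold CV
)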

rsample::attrition does not work - Section 2.1

Version: Online ebook (2020-02-01)

When I run the following lines shown at the end of Section 2.1, I get an error: 'attrition' is not an exported object from 'namespace:rsample'.

# Job attrition data
churn <- rsample::attrition %>% 
  mutate_if(is.ordered, .funs = factor, ordered = FALSE)

Instead, the following worked in my environment.

library(modeldata)
data(attrition)
churn <- attrition %>% 
  mutate_if(is.ordered, .funs = factor, ordered = FALSE)

My environment is:

R version 4.0.2 (2020-06-22)

modeldata 0.0.2

rsample 0.0.7

chapter 6, my results for cv_glmnet differ from yours

Starting in section 6.4, there are a number of places that my results don't match your code examples. This seems to stem from this code:

# for reproducibility
set.seed(123)

# grid search across 
cv_glmnet <- train(
  x = X,
  y = Y,
  method = "glmnet",
  preProc = c("zv", "center", "scale"),
  trControl = trainControl(method = "cv", number = 10),
  tuneLength = 10
)
  1. bestTune in that code snippet differs from yours (mine is 0.04688353)
  2. My RMSE in the comparison to the previous Ames model at the end of 6.4 differs from yours:
    > RMSE(exp(pred), exp(Y))
    [1] 23503.05
  3. My VIP in 6.5, figure 6.10, looks somewhat different
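
Differences like these often trace back to package versions (defaults and tuning grids can change between releases), so comparing your installed versions against the book's software information section in the Preface is a quick first check:

# compare installed versions against those listed in the book's Preface
packageVersion("glmnet")
packageVersion("caret")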

Possibly wrong number in caption of Figure 7.7

First off, thank you so much for creating this book and making it freely available! It has been a really great resource so far for me.

I think there's a small inaccuracy in the caption of Figure 7.7:

Figure 7.7: Cross-validated accuracy rate for the 30 different hyperparameter combinations in our grid search. The optimal model retains 45 terms and includes no interaction effects.

From looking at the graph and the output in the code chunk above Figure 7.7, I think "45 terms" should be "12 terms" (seems to be a remnant from Figure 7.4).

Typos/Suggestions/Questions for Chapter 4/5/6/7

Reference date of book: 2019-12-06

Chapter 4: Linear Regression

4.2.2 Inference Notes

(Ctrl-f) "Regresion" & "Remdial"

[4.7 Partial least squares]

set.seed(123)

cv_model_pls <- train(
  Sale_Price ~ ., 
  data = ames_train, 
  method = "pls",
  trControl = trainControl(method = "cv", number = 10),
  preProcess = c("zv", "center", "scale"),
  tuneLength = 20
)

#model with lowest RMSE
cv_model_pls$bestTune

I'm not able to replicate m=3 with cv_model_pls$bestTune.
I've tried it on two different computers, and I'm getting closer to m=19 or 20.
I experimented with tuneLength = 40 and cv_model_pls$bestTune was between 19-21.
Given the big discrepancy between m=3 and m=19, I thought I'd flag it.

After reading this line "Using PLS with m=3 principal components corresponded with the lowest cross-validated RMSE of $29,970", I was wondering how I would go about verifying the RMSE other than looking at the ggplot graph itself.

Suggestion: Consider including the following code to aid the reader in extracting the lowest RMSE for themselves:

library(tidyverse)
# assuming $bestTune gives ncomp = 19
cv_model_pls$results %>%
  dplyr::filter(ncomp == 19)

Fig 4.10

There's a typo in the caption: The 10-fold cross "valdation" RMSE

Online supplementary material

In the notebook (https://koalaverse.github.io/homlr/notebooks/04-linear-regression.nb.html) there's a section with repeated words:
(Ctrl-f)
“Prediction from a rank-deficient fit…”

Chapter 5: Logistic Regression

5.5 Assessing model accuracy

"There are 16 numeric features in our data set so the following code performs a 10-fold cross-validated PLS model while tuning the number of principal components to use from 1–16. "

Suggestion - Consider including the following code to allow the reader to extract the number of numeric features for themselves:

length(attrition[sapply(attrition, is.numeric)])

Suggestion - Consider including the following code to allow the reader to extract the lowest RMSE for themselves:

cv_model_pls$results %>%
  dplyr::filter(ncomp == 14)

Question - Could you elaborate on the intuition behind limiting tuneLength to the number of numeric features? Why can't we set tuneLength to the number of all features?

Chapter 6: Regularized Regression

6.2 Why regularize?

(Ctrl-f) "classicial"
(Ctrl-f) bet on sparsity principal - should be "principle"

6.3 Implementation

(Ctrl-f) Here we just peak - should be "peek"

6.4 Tuning

Suggestion - Consider including the following code to allow the reader to extract the number of nonzero Lasso coefficients at the lowest MSE:

lasso$nzero[lasso$lambda == lasso$lambda.min] # No. of coef | Min MSE
lasso$nzero[lasso$lambda == lasso$lambda.1se] # No. of coef | 1-SE MSE

Chapter 7: Multivariate Adaptive Regression Splines

7.5 Feature Interpretation

With the latest version of vip (0.2.1), the following code gives a warning:

# variable importance plots
> p1 <- vip(cv_mars, num_features = 40, bar = FALSE, value = "gcv") + ggtitle("GCV")

Warning message:
In vip.default(cv_mars, num_features = 40, bar = FALSE, value = "gcv") :
  The `bar` argument has been deprecated in favor of the new `geom` argument. It will be removed in version 0.3.0.
> p2 <- vip(cv_mars, num_features = 40, bar = FALSE, value = "rss") + ggtitle("RSS")

Warning message:
In vip.default(cv_mars, num_features = 40, bar = FALSE, value = "rss") :
  The `bar` argument has been deprecated in favor of the new `geom` argument. It will be removed in version 0.3.0.

Suggestion: Code tweaked below.

p1 <- vip(cv_mars, num_features = 40, geom = "point", value = "gcv") + ggtitle("GCV")

p2 <- vip(cv_mars, num_features = 40, geom = "point", value = "rss") + ggtitle("RSS")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Thank you!

section 9.7 sqrt rendering issue

In the fourth sentence of the first paragraph in section 9.7, log and exp are rendering correctly, but sqrt is showing as \sqrt with a box around it. The date on the Preface page is 5/29.

Minor typo

Change ?dslabs::read_mnist() to ?dslabs::read_mnist on page 10. Note, however, that the code still launches the help page.

Spelling Error - Chapter 12

Chapter 12

"xgboost will except three different kinds of matrices for the features: ordinary R matrix, sparse matrices from the Matrix package, or xgboost’s internal xgb.DMatrix objects. See ?xgboost::xgboost for details."

The word 'except' should be spelt 'accept' instead.

Minor typo in Chapter 6

Online version: 2020-02-01

Chapter 6, Section 6.5 Feature interpretation

However, not that one of the top 20 most influential variables is Overall_QualPoor.

I believe "not" should be "note."

Questions regrading chapter 3 Feature & Target Engineering

Hello @bradleyboehmke and @bgreenwell. I'm reading through this book and it is fantastic!

I have a little question about Chapter 3, where the book introduces a feature engineering workflow using the recipes package and emphasizes that we should create a preprocessing blueprint but apply it later, within each resample. This workflow can be perfectly embedded in caret, as mentioned in section 3.8.3:

Consequently, the goal is to develop our blueprint, then within each resample iteration we want to apply prep() and bake() to our resample training and validation data. Luckily, the caret package simplifies this process. We only need to specify the blueprint and caret will automatically prepare and bake within each resample.
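
A sketch of the pattern the quoted passage describes: pass the recipe (blueprint) directly to caret::train(), which then preps and bakes it inside each resample (the method and data here are illustrative):

cv <- caret::train(
  blueprint,                     # a recipes blueprint, prepped per resample
  data = ames_train,
  method = "glmnet",
  trControl = caret::trainControl(method = "cv", number = 10)
)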

My question is whether this principle is also implemented by other machine learning packages such as h2o. In Chapter 15 Stacked Models, 15.1 Prerequisites, the training and test sets were prepared before running the h2o training process, not within each resample:

# Make sure we have consistent categorical levels
blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_other(all_nominal(), threshold = 0.005)

# Create training & test sets for h2o
train_h2o <- prep(blueprint, training = ames_train, retain = TRUE) %>%
  juice() %>%
  as.h2o()
test_h2o <- prep(blueprint, training = ames_train) %>%
  bake(new_data = ames_test) %>%
  as.h2o()

I was wondering, does this violate the principle that we should do the preprocessing within each resample? Do other packages seldom implement this principle, unlike caret? Besides, Chapter 3 introduces many feature engineering steps. Does h2o's AutoML handle these steps automatically?

Your kind guidance would be much appreciated!

4.8 Feature interpretation: "Once we've found the model that minimizes the predictive accuracy"


Two typos

Date on the book's first page: 2019-06-25

In 11 Random Forests, 11.4 Hyperparameters, 11.4.2. mtry, last sentence of first paragraph:

When there are many relevant predictors, a lower mtry might perfrom better.

In 12 Gradient Boosting, 12.2 How boosting works, 12.2.1 A sequential ensemble approach, second sentence of fourth paragraph:

The idea behind boosting is that each model in the sequence slightly improves upon the perfroamnce of the previous one (essentially, by focusing on the rows of the training data where the previous tree had the largest errors or residuals).

chapter 5, possible issue in odds / marginal effect discussion

Book version: 2019-05-25

Section 5.2 "Thus, for every one dollar increase in MonthlyIncome, the odds of an employee attriting decreases slightly, represented by a slightly less than 50% probability."

I'm not sure this or the code following it works. Since a logit model is nonlinear, the change in probability will depend on where in the support of MonthlyIncome we're looking. The 0.499 figure is a transformation of the coefficient for MonthlyIncome, but I'm not sure this is meaningful because it doesn't include the constant. It could be that my stubborn refusal to view binary classification in terms of odds is leading me astray here, though.

This is related to an issue in section 5.4: "however, working OverTime tends to nearly double the probability of attrition."

I may not understand. But from a monthly income of 1000, model3 would indicate that working overtime raises the probability of attrition by about 2.75 times.

pred_prob <- function(ot) {
  predict(model3, tibble(MonthlyIncome = 1000, OverTime = ot), type = "response")
}
pred_prob("Yes") / pred_prob("No")  # => 2.75488

chapter 6, typos

  • Section 6.2: "The easiest way to understand regularized regression is to explain how and why [it] is applied to ordinary least squares (OLS)."

  • Section 6.2: "The objective in OLS regression is to find the hyperplane (e.g., a
    straight line in two dimensions)" Should be "multiple dimensions", since it's not clear you only mean the specific example

  • Section 6.2: "Having a large number of feature[s] invites"

  • Section 6.2.1: "This is indicitive of ..." should be "indicative"

  • Section 6.5: "Variable importance for regularized models provide[s]"
