
Elaboration On Pre-Processing (fes, closed)

LoweCoryr commented on August 23, 2024

Comments (3)

topepo commented on August 23, 2024

Outside of resampling, all preprocessing quantities are computed using the training set. So you would compute your modes or means on the training set and impute all samples using those values. Your test set or unknown samples probably have a different distribution (depending on what was sampled), but the model is expecting something consistent with the training set.

As a pathological counter-example, suppose you are predicting one new sample. You couldn't impute a missing predictor value using the current sample set; a single sample gives you nothing to average.

This specific issue isn't the point though; it's one of consistency. Non-training set samples should be preprocessed in a manner that is consistent with what was used to build the model.
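A minimal sketch of this idea, using Python and scikit-learn purely for illustration (the thread itself is tool-agnostic, so the library choice here is an assumption):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Training data; the first predictor has a missing value.
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [np.nan, 30.0],
                    [4.0, 40.0]])

# A single new sample to predict, also missing that predictor.
X_new = np.array([[np.nan, 25.0]])

# The imputation statistic (here, the mean) is estimated on the
# training set only...
imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)

# ...and that same training-set mean is applied to new samples.
# Imputing X_new from itself would be impossible: a single sample
# has nothing to average.
print(imputer.transform(X_new))  # [[2.333..., 25.0]]
```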

For resampling, using the terminology here, the preprocessing is recomputed for each analysis set and these same values (e.g. the mean or mode in your imputation example) are used on the corresponding assessment set.

For 10-fold CV, this results in 10 different imputation values, but that is the point: we want to measure the variation in the preprocessing steps and let that variation show its impact on the assessment set performance measures.
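As a sketch of that per-fold recomputation (again assuming scikit-learn as the tool), putting the imputer inside a modeling pipeline makes each fold re-estimate its own imputation values on that fold's analysis set:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=100, n_features=5, noise=1.0, random_state=0)
# Poke some holes in the predictors so there is something to impute.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Because the imputer sits inside the pipeline, 10-fold CV re-fits it
# 10 times: each fold's means come from that fold's analysis set and
# are applied to the corresponding assessment set.
pipe = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
scores = cross_val_score(pipe, X, y, cv=10)
print(scores.mean())
```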

> apply the same values from the training set to the test set, which may lead to over-fitting.

It wouldn't lead to overfitting, since the test set is being imputed with values estimated from other samples (the training set). Quite the opposite; this strategy is how we measure overfitting.


LoweCoryr commented on August 23, 2024

Great answer. Thank you so, so much. I am looking forward to reading the rest of your book, and your other content too. Maybe you could add this explanation to the book, as it's a great one.

I was researching this question and ran across this resource (the first answer at the link below). It discusses how, when you split your data into training and test sets, the random split can leave the test set with categorical values that never appear in the training set (or vice versa). This mismatch needs to be handled when encoding (see the sketch after the link). If this isn't in your book, you may want to mention it.

https://www.kaggle.com/c/avito-demand-prediction/discussion/56550
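One common guard against that mismatch, sketched here with scikit-learn's one-hot encoder (the specific tool is an assumption, not something from the thread), is to tell the encoder to tolerate levels it never saw when it was fit:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Suppose the random split left the level "green" out of the training set.
X_train = np.array([["red"], ["blue"], ["red"]])
X_test = np.array([["blue"], ["green"]])

# handle_unknown="ignore" encodes unseen levels as all zeros instead of
# raising an error, so the train/test level mismatch is survivable.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_train)
print(enc.transform(X_test).toarray())
# [[1. 0.]    <- "blue" (fitted categories are sorted: ['blue', 'red'])
#  [0. 0.]]   <- unseen "green" maps to all zeros
```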

Good luck!


topepo commented on August 23, 2024

Thanks.

We address that in chapter 5.
