
Elaboration On Pre-Processing (fes, closed)

LoweCoryr commented on August 23, 2024

Comments (3)

topepo commented on August 23, 2024

Outside of resampling, all preprocessing quantities are computed using the training set. So you would compute your modes or means on the training set and impute all samples using those values. Your test set or unknown samples probably have a different distribution (depending on what was sampled), but the model is expecting something consistent with the training set.

As a pathological counter-example, suppose you are predicting one new sample. You couldn't impute a missing predictor value using the current sample set; a single sample gives you nothing to average.

This specific issue isn't the point though; it's one of consistency. Non-training set samples should be preprocessed in a manner that is consistent with what was used to build the model.
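A minimal sketch of this idea, using Python and scikit-learn purely for illustration (the thread itself is tool-agnostic, so the library choice here is an assumption):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Training data; the first predictor has a missing value.
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [np.nan, 30.0],
                    [4.0, 40.0]])

# A single new sample to predict, also missing that predictor.
X_new = np.array([[np.nan, 25.0]])

# The imputation statistic (here, the mean) is estimated on the
# training set only...
imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)

# ...and that same training-set mean is applied to new samples.
# Imputing X_new from itself would be impossible: a single sample
# has nothing to average.
print(imputer.transform(X_new))  # [[2.333..., 25.0]]
```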

For resampling, using the terminology here, the preprocessing is recomputed for each analysis set and these same values (e.g. the mean or mode in your imputation example) are used on the corresponding assessment set.

For 10-fold CV, this results in 10 different imputation values, but that is the point: we want to measure the variation in the preprocessing steps and let that variation show its impact on the assessment set performance measures.
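As a sketch of that per-fold recomputation (again assuming scikit-learn as the tool), putting the imputer inside a modeling pipeline makes each fold re-estimate its own imputation values on that fold's analysis set:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=100, n_features=5, noise=1.0, random_state=0)
# Poke some holes in the predictors so there is something to impute.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Because the imputer sits inside the pipeline, 10-fold CV re-fits it
# 10 times: each fold's means come from that fold's analysis set and
# are applied to the corresponding assessment set.
pipe = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
scores = cross_val_score(pipe, X, y, cv=10)
print(scores.mean())
```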

> apply the same values from the training set to the test set, which may lead to over-fitting.

It wouldn't lead to overfitting, since the test set is being imputed with values estimated from other samples (the training set). Quite the opposite; this strategy is how we measure overfitting.


LoweCoryr commented on August 23, 2024

Great answer. Thank you so, so much. I am looking forward to reading the rest of your book, and your other content too. Maybe you could add this explanation to the book, as it's a great one.

I was researching this question and ran across this resource (the first answer at the link below). It discusses how, when you split your data into training and test sets, the random split can leave the test set with categorical values that never appear in the training set (or vice versa). This mismatch needs to be handled when encoding (see the sketch after the link). If this isn't in your book, you may want to mention it.

https://www.kaggle.com/c/avito-demand-prediction/discussion/56550
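One common guard against that mismatch, sketched here with scikit-learn's one-hot encoder (the specific tool is an assumption, not something from the thread), is to tell the encoder to tolerate levels it never saw when it was fit:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Suppose the random split left the level "green" out of the training set.
X_train = np.array([["red"], ["blue"], ["red"]])
X_test = np.array([["blue"], ["green"]])

# handle_unknown="ignore" encodes unseen levels as all zeros instead of
# raising an error, so the train/test level mismatch is survivable.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_train)
print(enc.transform(X_test).toarray())
# [[1. 0.]    <- "blue" (fitted categories are sorted: ['blue', 'red'])
#  [0. 0.]]   <- unseen "green" maps to all zeros
```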

Good luck!


topepo commented on August 23, 2024

Thanks.

We address that in chapter 5.
