
Comments (5)

MBrouns commented on September 24, 2024

I got asked something similar in today's training: someone wanted to drop rows with too many missing values from train but not from test, so I was toying around to see if I could find something that would work.

I might have figured out a way but I'm not sure I like it all that much:

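import hashlib

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TrainOnlyMixin(TransformerMixin, BaseEstimator):

    def fit(self, X, y=None):
        # Remember a fingerprint of the training frame.
        self.df_hash_ = self.hash_df(X)
        return self

    @staticmethod
    def hash_df(df):
        # One hash per row (index included), folded into a single digest.
        row_hashes = pd.util.hash_pandas_object(df, index=True).values
        return hashlib.sha256(row_hashes).hexdigest()

    def transform(self, X, y=None):
        # Same data as in fit -> train branch; anything else -> test branch.
        # Subclasses supply transform_train and transform_test.
        if self.hash_df(X) == self.df_hash_:
            return self.transform_train(X)
        return self.transform_test(X)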

I basically store a hash of the train dataframe and compare X with it in transform and then call transform_train or transform_test. I think this can be made quite generic and I can't think of a case where it wouldn't work. What do you think?
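To make the idea concrete, here is a rough sketch of how a subclass might use it for the "noise on train only" case. It repeats the mixin so the block runs on its own. `TrainNoise`, its `scale` and `random_state` parameters, and the toy frames are all made up for illustration; none of this is existing scikit-lego code.

```python
import hashlib

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TrainOnlyMixin(TransformerMixin, BaseEstimator):
    """Remember a hash of the frame seen in fit, branch on it in transform."""

    def fit(self, X, y=None):
        self.df_hash_ = self.hash_df(X)
        return self

    @staticmethod
    def hash_df(df):
        row_hashes = pd.util.hash_pandas_object(df, index=True).values
        return hashlib.sha256(row_hashes).hexdigest()

    def transform(self, X, y=None):
        if self.hash_df(X) == self.df_hash_:
            return self.transform_train(X)
        return self.transform_test(X)


class TrainNoise(TrainOnlyMixin):
    """Hypothetical transformer: Gaussian noise on train, nothing on test."""

    def __init__(self, scale=0.1, random_state=0):
        self.scale = scale
        self.random_state = random_state

    def transform_train(self, X):
        rng = np.random.default_rng(self.random_state)
        return X + rng.normal(0.0, self.scale, size=X.shape)

    def transform_test(self, X):
        return X


train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
test = pd.DataFrame({"a": [7.0, 8.0], "b": [9.0, 10.0]})

noise = TrainNoise(scale=0.1).fit(train)
train_out = noise.transform(train)  # hash matches the fitted frame: noise added
test_out = noise.transform(test)    # hash differs: returned unchanged
```

Note that the check is on content, not identity: any frame with the same values and index as the training frame takes the train branch.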

from scikit-lego.

koaning commented on September 24, 2024

I think we agree this can't be done. Close issue?


koaning commented on September 24, 2024

My first thought: WE NEED TO BE CAREFUL, there are methods that might do a partial fit and then this would totally break.

My second thought: but yeah ... this could totally work

Third thought: do we have other use cases for it?

Fourth thought: will it also work on numpy arrays?

Thought five: I'm not sure if I want to implement transform_test on each object. Can't we just return X?


MBrouns commented on September 24, 2024
  1. Pipelines don't implement partial_fit, so that shouldn't be a problem, I think

  2. 😄

  3. So far I've come up with the following use cases:
    a. dropping rows with many missing values from train but not from test
    b. adding noise to train but not to test
    c. removing outliers from training data

  4. As is, no, but it is possible to calculate a hash on a numpy array afaik

  5. Agreed, probably have this as a mixin where transform_test has a default implementation to return X
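
Points 4 and 5 could look roughly like the sketch below: a hash helper that also accepts numpy arrays, and a mixin whose `transform_test` returns X by default. The names `hash_data`, `TrainOnlyTransformerMixin` and `DropSparseRows` are hypothetical and only for illustration. It assumes numeric arrays (hashing raw bytes of object arrays would not be stable), and it only filters X, so y is not touched.

```python
import hashlib

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


def hash_data(X):
    """Hash a DataFrame/Series, or anything numpy can turn into an array."""
    if isinstance(X, (pd.DataFrame, pd.Series)):
        payload = pd.util.hash_pandas_object(X, index=True).values
    else:
        arr = np.ascontiguousarray(X)
        # Shape and dtype go into the hash so a reshaped copy does not collide.
        payload = str((arr.shape, arr.dtype.str)).encode() + arr.tobytes()
    return hashlib.sha256(payload).hexdigest()


class TrainOnlyTransformerMixin(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        self.X_hash_ = hash_data(X)
        return self

    def transform(self, X, y=None):
        if hash_data(X) == self.X_hash_:
            return self.transform_train(X)
        return self.transform_test(X)

    def transform_train(self, X):
        raise NotImplementedError("subclasses define the train-only behaviour")

    def transform_test(self, X):
        # Default from point 5: test data passes through untouched.
        return X


class DropSparseRows(TrainOnlyTransformerMixin):
    """Hypothetical: drop rows with too many NaNs, on train only."""

    def __init__(self, max_missing=1):
        self.max_missing = max_missing

    def transform_train(self, X):
        X = np.asarray(X, dtype=float)
        keep = np.isnan(X).sum(axis=1) <= self.max_missing
        return X[keep]


train = np.array([[1.0, 2.0, 3.0], [np.nan, np.nan, 6.0], [7.0, np.nan, 9.0]])
test = np.array([[np.nan, np.nan, 1.0]])

dropper = DropSparseRows(max_missing=1).fit(train)
train_out = dropper.transform(train)  # middle row has 2 NaNs and is dropped
test_out = dropper.transform(test)    # falls through to the default, returns X
```

With the default in place, a subclass only has to write `transform_train`.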


MBrouns commented on September 24, 2024

#81

