
Comments (11)

rhiever commented on July 17, 2024

Looks very useful! Have you considered merging your functions (that don't overlap) into sklearn.preprocessing? I think sklearn would benefit from more preprocessors.

wrt merging into datacleaner: I have some concerns. For example, what would it do with continuous variables? Would it ignore them entirely?

I also have concerns with whether feature preprocessing is within the scope of datacleaner. I see data cleaning as a step before feature preprocessing, as shown below. Data cleaning entails encoding the data properly (usually: numerically), removing or imputing missing data, removing quirks from the data, and so on. Feature preprocessing is definitely an important step, but I see it more as part of the modeling step.

[image: analysis pipeline diagram, with data cleaning preceding feature preprocessing]

I've been developing TPOT to automate the parts that follow the data cleaning step, and would love to add more feature preprocessing operators if they add value beyond the ones already implemented in TPOT/sklearn.preprocessing. I've already "seen the light" wrt the power of using the right feature preprocessor.

BTW: Do you know about my sklearn benchmark project? I've evaluated about 30 million sklearn models so far, and would like to look into evaluating feature preprocessors on my ~180 data set benchmark as well.

wdm0006 commented on July 17, 2024

I have considered a PR into sklearn.preprocessing, and will probably try one out in the coming weeks. I think I need to improve the documentation and write automated tests before that, though; for now I am dogfooding it.

For continuous variables, they would be ignored. The behavior is identical to your usage of label encoding: any column of type object would be encoded as categorical, and floats or ints would be assumed continuous (for good or for ill). I don't think that makes any new assumptions.
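
For what it's worth, here's a minimal sketch of that dtype check (frame and column names are just for illustration; it mirrors the test datacleaner already uses):

import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'red'],  # object dtype -> treated as categorical
    'price': [1.5, 2.0, 3.25],        # float dtype -> assumed continuous
})

for column in df.columns:
    if str(df[column].values.dtype) == 'object':
        print(column, '-> would be encoded as categorical')
    else:
        print(column, '-> assumed continuous, left alone')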

As far as applicability goes, it is definitely a grey area. Here are a few points as to why I think it makes sense:

  • Encoders like hashing can greatly reduce dataset size, which lets datacleaner produce a clean and (more) portable version of the dataset.
  • Naively encoding the string columns as numeric, as the library currently does, already makes an encoding decision, so I think it is more prudent to offer options for that step, or not do it at all. Otherwise the tool is making decisions it shouldn't be making, namely how to encode possibly categorical data (see the sketch after this list).
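
To make that concrete, here is the kind of implicit decision a plain ordinal encoding makes -- the integers impose an ordering the original categories never had (a minimal sketch):

from sklearn.preprocessing import LabelEncoder

colors = ['red', 'green', 'blue', 'green']
# Classes are sorted alphabetically, so blue=0, green=1, red=2:
print(LabelEncoder().fit_transform(colors))  # [2 1 0 1]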

I'll look at TPOT. I have seen it but haven't used it for anything yet; it looks like an interesting project (the sklearn benchmark as well).

rhiever commented on July 17, 2024

I'm liking the sound of this a little more. The goal of datacleaner is to automatically put the data into a good state for analysis, but not necessarily to make major feature encoding decisions that would make it difficult for the practitioner to encode the data in their own way later. That's why I don't mind doing a direct string label --> numerical label encoding (since numerical encodings are necessary for sklearn etc.), but I would avoid transforming the data into a one-hot feature representation.

"Encoders like hashing can greatly reduce dataset size, which lets datacleaner produce a clean and (more) portable version of the dataset."

I'm very intrigued by this -- do you have a demo? Are there other things like this that we could do without affecting the basic feature representation?

wdm0006 commented on July 17, 2024

Sure, check out the tables in this post: "Beyond One-Hot". They are basically trying to find high-scoring encoders with low dimensionality (fewer columns). In those cases, every column was categorical (strings), so all had to be encoded as numbers somehow.

The hashing encoder is not in those tables. Unlike the others, it encodes multiple columns at once and allows for a configurable output dimensionality. So if you have 128 categorical input columns, you could encode them as 3 (or 10 or 20 or whatever) columns with the hashing trick. It might not be perfect, but it's smaller. Here you can see the performance degrading with really low-dimension outputs (hashing_2 and 4 vs. 16+).
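
A rough sketch of what that looks like, assuming the current category_encoders API (where n_components sets the output width):

import pandas as pd
import category_encoders as ce  # the encoder library under discussion

# 128 categorical input columns...
df = pd.DataFrame({'cat_%d' % i: ['a', 'b', 'c', 'a'] for i in range(128)})

# ...hashed down to 8 numeric output columns
encoded = ce.HashingEncoder(n_components=8).fit_transform(df)
print(encoded.shape)  # (4, 8)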

I agree that the default for datacleaner should be bare-minimum encoding (so just ordinal), and that one-hot is risky for very high-dimensional data (you could end up with a huge number of columns), but I think the scikit-learn philosophy of good options with sensible defaults (so multiple options, defaulting to ordinal) would make sense here. If not that, then don't do encoding at all at this stage.

rhiever commented on July 17, 2024

Hmmm... I think you've convinced me, at the very least, that the encoder should be configurable.

So lines such as this one can be replaced with an arbitrary encoder (LabelEncoder, OneHotEncoder, etc. -- anything that follows the sklearn interface), with LabelEncoder as the default. It won't even be necessary to import any other libraries into datacleaner itself, because those encoders would be passed into the function and imported by the user.

It might be nice to even allow a list of encoders to be passed, but that may complicate things too much and step too far into the feature preprocessing stage.

wdm0006 commented on July 17, 2024

Sounds good!

I'll put together a pull request. One implementation detail that I'm not sure of is how to pass the number of output columns to the hashing encoder (if at all). All of the other encoders need no input parameters, but the hashing encoder takes that one.

Maybe just have one flag for the encoder, with a hyphen and a number for hashing, like so:

for default:

datacleaner my_data.csv -o my_clean.data.csv -is , -os ,

for binary encoding:

datacleaner my_data.csv -o my_clean.data.csv -is , -os , -en BinaryEncoder

for hashing encoder with 32 output dims:

datacleaner my_data.csv -o my_clean.data.csv -is , -os , -en HashingEncoder-32

for hashing encoder with default params:

datacleaner my_data.csv -o my_clean.data.csv -is , -os , -en HashingEncoder
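
The parsing could be as simple as this rough sketch (function name is just for illustration):

def parse_encoder_flag(flag):
    # Split an -en value like 'HashingEncoder-32' into a name plus an
    # optional integer parameter; plain names come back with None.
    name, sep, param = flag.partition('-')
    return name, int(param) if sep else None

print(parse_encoder_flag('HashingEncoder-32'))  # ('HashingEncoder', 32)
print(parse_encoder_flag('BinaryEncoder'))      # ('BinaryEncoder', None)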

That seem alright?

rhiever commented on July 17, 2024

This might have to be a feature that's limited to the script version, because supporting it on the CLI would mean parsing out every possible encoder from the command line. That would be way too complicated, add several dependencies, and bloat the code in the long run.

I was thinking the function would look something like this:

def autoclean(input_dataframe, drop_nans=False, copy=False, encoder=LabelEncoder):
    """Performs a series of automated data cleaning transformations on the provided data set
<snip>
        # Encode all strings with numerical equivalents
        if str(input_dataframe[column].values.dtype) == 'object':
            input_dataframe[column] = encoder().fit_transform(input_dataframe[column].values)
<snip>

That of course limits us to encoders that take no input parameters, but I think I'm okay with that.
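
Hypothetical usage against that signature, with any zero-argument encoder class (df stands in for your own frame):

from sklearn.preprocessing import LabelEncoder

cleaned = autoclean(df)                        # LabelEncoder by default
cleaned = autoclean(df, encoder=LabelEncoder)  # or passed explicitly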

rhiever commented on July 17, 2024

A workaround the user could implement to pass an encoder with parameters would be a zero-argument wrapper that returns a pre-configured encoder (so it still works with the encoder() call in autoclean), e.g.,

def HashingEncoder_32():
    return HashingEncoder(n_components=32)
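
That wrapper could then be passed straight through the proposed signature, e.g. autoclean(df, encoder=HashingEncoder_32).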

We could document that for advanced users.

wdm0006 commented on July 17, 2024

That may be. With the exception of the parameter for hashing, this actually implements all of the encoders in just a few lines.

https://github.com/wdm0006/datacleaner/blob/master/datacleaner/datacleaner.py#L87

I think I could also parse out the parameter ahead of time without too much hassle.

rhiever commented on July 17, 2024

The issue I have with that implementation is that it adds another dependency. I really want to minimize dependencies wherever possible.

rhiever commented on July 17, 2024

Alrighty, it's merged! Thank you for coding that up -- I think it will add some useful flexibility to datacleaner.

Please ping me if you have any thoughts on how to support that functionality on the command line.
