rdt's People

Contributors

amontanez24, csala, fealho, frances-h, gsheni, jdtheripperpc, katxiao, kveerama, lajohn4747, manuelalvarezc, npatki, pvk-developer, r-palazzo, rwedge, sarahmish, sbrugman, sdv-team

rdt's Issues

KeyError: nan problem with CatTransformer get_val function

https://github.com/HDI-Project/RDT/blob/7bc38da2c9cd737fdd2ff3207d3cd3482701bb6d/rdt/transformers/CatTransformer.py#L86

This dictionary will represent all nans as type <class 'float'>. However, sometimes the original value in the column will be of type <class 'numpy.float64'>. If this is the case, then when looking up the key of type <class 'numpy.float64'>, the dictionary created on line 86 will give KeyError: nan.

To recreate:

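import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, 5])
s2 = s.fillna(-1)
# Count occurrences per value, then relabel the -1 placeholder as nan.
d = s2.groupby(s2).count().rename({-1: np.nan}).to_dict()
d[s[0]]  # KeyError: nan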
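
One possible workaround, sketched here under the assumption that the fix can live in the lookup rather than in how the dictionary is built, is to treat every NaN-like key as the same entry. The helper name lookup is hypothetical and is not part of RDT.

```python
import numpy as np
import pandas as pd


def lookup(mapping, key):
    """Look up key in mapping, treating all NaN-like keys as one entry."""
    if pd.isnull(key):
        # nan != nan, so a plain dict lookup cannot be relied on here.
        for candidate, value in mapping.items():
            if pd.isnull(candidate):
                return value
        raise KeyError(key)
    return mapping[key]


counts = {np.nan: 1, 1.0: 1, 5.0: 1}
nan_key = np.float64('nan')  # a different object than np.nan
result = lookup(counts, nan_key)
```

The linear scan only runs for null keys, so regular lookups keep their usual dictionary cost.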

Read anonymize configuration from metadata

Currently, CatTransformer accepts anonymize and category as class-level arguments. It would be cleaner to read those parameters from the metadata. This would also make the functionality available at the HyperTransformer level.

Add support to Python 3.7

Currently, we only support Python 3.5 and 3.6. We need to add support for the newest version of Python. To do so, we need to check:

  1. All dependencies of the package are compatible with 3.7
  2. The project builds after adding environment 3.7 on TravisCI
  3. The supported versions are correctly listed in setup.py

Add option to anonymize data

Sometimes transformers are used with sensitive data. In that case we don't want reverse_transform to return the original values, nor do we want to keep them after extracting their distribution.

To achieve this, we should anonymize the data before creating CatTransformer.probability_map, so that it is mapped to new values taken from faker while preserving the original distribution.

To set this option, there would be two flags on rdt.transformers.CatTransformer.fit:

  • anonimize: bool
  • category: str

If the flag pii is set, the values should be mapped to values generated by faker before generating the probability_map. The value of category should be one of the attributes of the faker object; that list includes, but is not limited to:

  • first_name
  • last_name
  • name
  • ssn
  • phone_number
  • email
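
A minimal sketch of the mapping step, assuming each distinct original value is replaced by one generated value so that the frequencies stay the same. The function anonymize and the callable make_fake_value are hypothetical; in practice the callable would be a faker method such as first_name, which is replaced here by a simple counter to keep the example self-contained.

```python
import itertools

import pandas as pd


def anonymize(column, make_fake_value):
    """Replace each distinct value with a generated one, keeping frequencies."""
    mapping = {}
    for value in column.dropna().unique():
        # Assumption: the generator does not return the same value twice.
        mapping[value] = make_fake_value()
    return column.map(mapping)


counter = itertools.count()


def fake_name():
    """Stand-in for a faker method such as first_name."""
    return 'person_{}'.format(next(counter))


original = pd.Series(['ann', 'bob', 'ann'])
anonymized = anonymize(original, fake_name)
```

Because the replacement is one-to-one, a probability map computed on the anonymized column has the same distribution as one computed on the original.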

Split HyperTransformer in two

Right now, rdt.hyper_transformer.HyperTransformer has the logic to handle both tables and datasets.
We could split it into one transformer that works at the table level and another that does the same for datasets.

Fix extra columns

  • categorical should return data with extra column to help reverse transformation
  • missing columns should be returned with output as well

Enforce python standards.

Currently, there are some minor details that break the Python standards, such as:

  • Redundant names, or using uppercase.
  • Docstrings not following PEP257.
  • Requirements out of setup.py
  • No module contents declaration on rdt.__init__

Add documentation

Proper documentation for the project is missing. Among the things that should be done are:

  • Complete API Reference
  • Data requirements and meta.json reference
  • Fix warnings on doc generation.

Add a numerical transformer for positive numbers

We should implement a transformer that takes values from (0, +inf) and puts them into the real numbers, and is able to reverse this transformation. It will be helpful in order to model and sample positive values.

To implement it, we can use the log function to move from R+ --> R, and exp to reverse the transformation.
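
A minimal sketch of the idea, assuming the column is a pandas.Series of strictly positive values. The class name PositiveNumberTransformer is hypothetical and does not exist in RDT.

```python
import numpy as np
import pandas as pd


class PositiveNumberTransformer:
    """Hypothetical transformer mapping (0, +inf) onto the real line."""

    def transform(self, column):
        # log is only defined for strictly positive values.
        if (column <= 0).any():
            raise ValueError('All values must be strictly positive.')
        return np.log(column)

    def reverse_transform(self, column):
        # exp maps any real number back into (0, +inf).
        return np.exp(column)


data = pd.Series([0.5, 1.0, 20.0])
transformer = PositiveNumberTransformer()
transformed = transformer.transform(data)
recovered = transformer.reverse_transform(transformed)
```

Since exp always returns a positive number, any value sampled in the transformed space is a valid positive value after reversing.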

New transformer for missing values

  • Add a new transformer class that replaces missing values.
  • Should create new column for missing values stating whether or not the value in the column was missing in the original data.
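
The two points above can be sketched as a single function. The name fill_missing, the default fill value and the _missing column suffix are assumptions made for illustration only.

```python
import pandas as pd


def fill_missing(column, fill_value=0):
    """Return the filled column plus a flag column marking original nulls."""
    was_missing = column.isnull().astype(int)
    filled = column.fillna(fill_value)
    return pd.DataFrame({
        column.name: filled,
        # Hypothetical naming convention for the indicator column.
        column.name + '_missing': was_missing,
    })


ages = pd.Series([1.0, None, 3.0], name='age')
output = fill_missing(ages)
```

Keeping the indicator column makes the operation reversible: rows flagged with 1 can be set back to null during the reverse transformation.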

Fix CircleCI build

CircleCI build seems to be failing. Make required changes to make it work.

Add col_meta and missing as optional constructor arguments in Transformers

Currently, for each call to a transformer we need to specify the arguments col_meta and missing. It would be a good idea to add them as optional arguments on __init__ and, in case they were set, not require them on the calls to transform or reverse_transform.

This way, code written in the old format will keep working, but we give the chance to make it easier. We could also add a deprecation warning and deprecate these arguments on the method calls for the 0.2.0 release.
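
A rough sketch of how both call styles could coexist. The class body is a placeholder, the helper _resolve is hypothetical, and the real transformers would apply their own logic where the comment indicates.

```python
import warnings


class BaseTransformer:
    """Sketch of a transformer accepting col_meta and missing at init."""

    def __init__(self, col_meta=None, missing=None):
        self.col_meta = col_meta
        self.missing = missing

    def _resolve(self, col_meta, missing):
        # Old-style calls still work, but warn about the deprecation.
        if col_meta is not None or missing is not None:
            warnings.warn(
                'Passing col_meta or missing to the method is deprecated; '
                'pass them to __init__ instead.',
                DeprecationWarning,
                stacklevel=3,
            )
        col_meta = self.col_meta if col_meta is None else col_meta
        missing = self.missing if missing is None else missing
        if col_meta is None:
            raise ValueError('col_meta must be set at init or passed in.')
        return col_meta, missing

    def transform(self, col, col_meta=None, missing=None):
        col_meta, missing = self._resolve(col_meta, missing)
        # Placeholder: a real transformer would transform col here.
        return col_meta, missing


new_style = BaseTransformer(col_meta={'name': 'a'}, missing=True)
new_result = new_style.transform([1, 2])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    old_result = BaseTransformer().transform([1, 2], {'name': 'a'}, False)
```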

Make DatetimeTransformer timezone-aware

The current DatetimeTransformer is converting the datetime values to integer and back without considering the timezones.

The timezones could be added as a new column, either categorical, using the CategorialTransformer, in a fashion similar to how the NullTransformer is being used, or numerical.

Hyper transformer needs access to transformer classes

In order to do hyperTransformer.fit_transform or hyperTransformer.reverse_transform, the hyper transformer needs access to the individual transformers. This used to be accomplished by importing everything from the transformers package, but this is no longer being imported. When trying to run fit_transform, the following error occurs:

File "/Users/andrew/Documents/MEng_Thesis/dataprep/rdt/hyper_transformer.py", line 52, in fit_transform
table, table_meta, transformer_dict, transformer_list, missing)
File "/Users/andrew/Documents/MEng_Thesis/dataprep/rdt/hyper_transformer.py", line 112, in fit_transform_table
transformer = self.get_class(transformer_name)
File "/Users/andrew/Documents/MEng_Thesis/dataprep/rdt/hyper_transformer.py", line 24, in get_class
return getattr(globals()[class_name], class_name)
KeyError: 'NumberTransformer'
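
One way to avoid depending on globals(), sketched here as an assumption rather than the project's actual fix, is to resolve the class from its module explicitly. The example uses the standard library module collections as a stand-in for the transformers package so that it runs on its own.

```python
import importlib


def get_class(module_name, class_name):
    """Resolve a class by name from an explicitly imported module."""
    module = importlib.import_module(module_name)
    try:
        return getattr(module, class_name)
    except AttributeError:
        raise ImportError(
            'Module {} has no class {}'.format(module_name, class_name)
        )


# Stand-in for something like get_class('rdt.transformers', 'NumberTransformer').
resolved = get_class('collections', 'OrderedDict')
```

With this approach the lookup no longer breaks when the wildcard import is removed, because the module is imported inside the function.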

Handle missing data appropriately

Currently, missing data is replaced with np.nan. This won't work as an input to Copula. Each transformer should handle missing data appropriately in its class.

Refactor unittests

Right now some of the tests are not as well designed as they should be. We should refactor them in order to:

  • Have no test that requires external data.
  • Check that all tests skipped because of open issues are up to date.

Add CLA

Configure a service to allow contributors to sign a CLA before submitting their contributions.

Drop the usage of meta

RDT shouldn't load data from disk nor need a metadata.json file to operate. Its behavior should be the following:

  • Both HyperTransformer and Transformers should receive input data that is already in the required format. That means that HyperTransformer shouldn't load data from disk, and the individual transformers shouldn't prepare it.

  • The filling or not of missing values should be taken out from individual transformers and be handled at the HyperTransformer, as specified in this issue.

Remove unused methods

The methods HyperTransformer.get_types and HyperTransformer.impute_table are not used anywhere in the project, and their only purpose is no longer needed in the current implementation.

Minor coding issues

  • The first iteration of CatTransformer.get_probability_map can be changed to self.probability_map = col.groupby(col).count().to_dict()

  • Transform method should be implemented on BaseTransformer and inherited on the other transformers.

  • HyperTransformer should crash at init if no meta_file is provided. It will raise an AttributeError later on, so it's better to fail early.

  • HyperTransformer methods shouldn’t be prefixed with hyper.

  • rdt.transformers.__init__ should be replaced with a proper module declaration

DTTransformer Out of bounds nanosecond timestamp

https://github.com/HDI-Project/RDT/blob/7bc38da2c9cd737fdd2ff3207d3cd3482701bb6d/rdt/transformers/DTTransformer.py#L28

Pandas represents timestamps in nanoseconds, so there is an upper limit to the timespans that can be represented. If a timestamp is out of this range, the code breaks with a pandas.tslib.OutOfBoundsDatetime error. Full discussion can be seen here: https://stackoverflow.com/questions/32888124/pandas-out-of-bounds-nanosecond-timestamp-after-offset-rollforward-plus-adding-a
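
The size of that range can be worked out with plain arithmetic, assuming timestamps are stored as signed 64-bit counts of nanoseconds since 1970-01-01. The sketch below uses only the standard library and truncates to microseconds, so the bounds are approximate at the sub-microsecond level.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

# Largest signed 64-bit integer, interpreted as nanoseconds.
max_nanoseconds = 2 ** 63 - 1

# datetime only has microsecond resolution, so drop the last three digits.
span = timedelta(microseconds=max_nanoseconds // 1000)

upper_bound = EPOCH + span
lower_bound = EPOCH - span

# Roughly 292 years on each side of the epoch.
span_in_years = span.days / 365.25
```

Any date outside the interval between lower_bound and upper_bound cannot be represented at nanosecond resolution, which is what triggers the error described above.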

NullTransformer unreacheable operation

The NullTransformer first tries to fill null values using the average of the values, if that average is not null, and falls back to fillna(0) if that's not possible.

However, the way the average is computed in the code means it will always be np.nan if there is a single null value (and that's exactly the case where we need the mean to be not null in order to replace null values).
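
The difference can be illustrated with a small example, assuming the current code uses a mean that propagates nulls in the way numpy.mean does. This is a sketch of the behavior, not a copy of the NullTransformer code.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# numpy.mean propagates nan, so a single null makes the result nan.
propagating_mean = np.mean(values.values)

# pandas.Series.mean skips nulls by default.
skipping_mean = values.mean()

# Fall back to 0 only when no non-null value exists at all.
fill_value = 0 if np.isnan(skipping_mean) else skipping_mean
filled = values.fillna(fill_value)
```

With a null-skipping mean, the fillna(0) branch is only reached when the whole column is null.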

Enforce int dtype on integer columns

Although the implementation of rdt.transformer.NumberTransformer tries to keep integer values as such, when doing the reverse transform the returned pandas.Series sometimes has a dtype of float64. That is, even if the values are valid integers, the column is treated as a float by pandas.

Could we find a way to check that integer columns are returned with dtype int64?
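
One possible approach, sketched under the assumption that the column is known to be an integer column from its metadata, is to round and cast explicitly at the end of the reverse transform. The function restore_integers is hypothetical; it falls back to the nullable Int64 dtype when nulls are present, since int64 cannot hold them.

```python
import numpy as np
import pandas as pd


def restore_integers(column):
    """Round a float column and cast it back to an integer dtype."""
    rounded = column.round()
    if rounded.isnull().any():
        # int64 cannot represent nulls; use the nullable integer dtype.
        return rounded.astype('Int64')
    return rounded.astype('int64')


complete = restore_integers(pd.Series([1.0, 2.0, 2.9999]))
with_nulls = restore_integers(pd.Series([1.0, np.nan, 3.0]))
```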

Add attributes NullTransformer and col_meta.

Transformers create a NullTransformer when called with the missing param.

The missing param should be made an attribute and, at init time, an instance of NullTransformer should be created in another attribute if applicable.

Also, col_meta should be passed at init time and not on transform or fit calls.

Exclude logic for missing values in Transformers.

Currently, each Transformer accepts an argument missing on its transform and reverse_transform methods and, in case it is True, handles the missing values.

This behavior could be moved to HyperTransformer, possibly annotating the corresponding columns in the meta.json for more granular control.

Make CatTransformer.probability_map deterministic

When transforming data with CatTransformer, the different values with their mean and standard deviation are preserved in the attribute CatTransformer.probability_map.

However, that process isn't fully deterministic, which makes the class harder to test and debug. Until everybody uses Python 3.7, where dict order is assured, we should find a way to ensure that the same CatTransformer.probability_map is generated for the same input values.
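
A sketch of one way to make the map reproducible: sort the categories with a fixed rule before assigning intervals and store them in an OrderedDict. The interval layout and the choice of one sixth of the interval width as the standard deviation are assumptions made for illustration and may differ from what CatTransformer actually stores.

```python
from collections import OrderedDict

import pandas as pd


def get_probability_map(column):
    """Build a category -> (interval, mean, std) map in a fixed order."""
    frequencies = column.value_counts(normalize=True)
    # Most frequent first; ties are broken by the category's string form.
    ordered = sorted(
        frequencies.items(),
        key=lambda item: (-item[1], str(item[0])),
    )
    probability_map = OrderedDict()
    start = 0.0
    for category, frequency in ordered:
        end = start + frequency
        mean = (start + end) / 2
        std = frequency / 6  # assumed convention, see the note above
        probability_map[category] = ((start, end), mean, std)
        start = end
    return probability_map


first = get_probability_map(pd.Series(['b', 'a', 'a', 'c']))
second = get_probability_map(pd.Series(['c', 'a', 'b', 'a']))
```

Because the ordering depends only on frequencies and category names, shuffling the input rows does not change the result.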

Make the output numeric only

  • Current output has transformed columns as well as not transformed columns (eg. id columns). This should be changed to only have the numeric columns.

DTTransformer check data format before transform.

Currently DTTransformer expects a format key in its meta with the output date in strpformat, which will be used to format the output of DTTransformer.reverse_transform.

However, no validation of the input format is made before the transformation, which leaves the door open to silent errors in case of a mistake in the meta.json.

To avoid this, I propose to check that the data is in the specified format before transforming.
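
A minimal sketch of such a check, assuming the values arrive as strings and the format uses the standard strptime directives. The function find_invalid_dates is hypothetical and only relies on the standard library.

```python
from datetime import datetime


def find_invalid_dates(values, date_format):
    """Return the values that do not match the given date format."""
    invalid = []
    for value in values:
        try:
            datetime.strptime(value, date_format)
        except (TypeError, ValueError):
            # TypeError covers non-string values, ValueError a bad format.
            invalid.append(value)
    return invalid


bad_values = find_invalid_dates(['2019-01-31', '31/01/2019'], '%Y-%m-%d')
```

The transformer could raise an error when the returned list is not empty, instead of silently producing wrong values.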
