rdt,sdv-dev

KeyError: nan problem with CatTransformer get_val function

https://github.com/HDI-Project/RDT/blob/7bc38da2c9cd737fdd2ff3207d3cd3482701bb6d/rdt/transformers/CatTransformer.py#L86

This dictionary will represent all nans as type <class 'float'>. However, sometimes the original value in the column will be of type <class 'numpy.float64'>. If this is the case, then when looking up the key of type <class 'numpy.float64'>, the dictionary created on line 86 will give KeyError: nan.

To recreate:

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, 5])
s2 = s.fillna(-1)
d = s2.groupby(s2).count().rename({-1: np.nan}).to_dict()
d[s[0]]  # raises KeyError: nan
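
One possible workaround, shown here only as a sketch, is to treat every NaN-like key as the same missing-value key when reading the dictionary. The helper name lookup_with_nan is hypothetical and is not part of RDT.

```python
import numpy as np
import pandas as pd


def lookup_with_nan(mapping, key):
    """Look up key, matching any NaN-like key against any NaN-like entry."""
    if pd.isnull(key):
        # NaN never compares equal to itself, so find the NaN entry explicitly.
        for candidate, value in mapping.items():
            if pd.isnull(candidate):
                return value
        raise KeyError(key)
    return mapping[key]


s = pd.Series([np.nan, 1, 5])
s2 = s.fillna(-1)
d = s2.groupby(s2).count().rename({-1: np.nan}).to_dict()
result = lookup_with_nan(d, s[0])  # works for float and numpy.float64 NaN alike
```

The linear scan only runs for NaN keys, so regular lookups keep their normal dictionary cost.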

Ensure backwards compatibility and raise deprecation warnings

We should add a DeprecationWarning to method calls that are going to lose backwards compatibility.

Read anonymize configuration from metadata

Currently, CatTransformer accepts anonymize and category as class-level arguments. It would be cleaner to read those parameters from the metadata. This would also make the functionality available at HyperTransformer level.

Add support to Python 3.7

Currently, we only support Python 3.5 and 3.6. We need to add the newest version of Python. To do so, we need to check:

All dependencies of the package are compatible with 3.7
The project builds after adding environment 3.7 on TravisCI
The supported versions are correctly listed in setup.py

Add option to anonymize data

Sometimes transformers are used with sensitive data, and we don't want reverse_transform to return the original values, nor the transformer to keep them after extracting their distribution.

To achieve this, we should anonymize the data before creating CatTransformer.probability_map, so that it is mapped to new values taken from faker while preserving the original distribution.

To set this option, there would be two flags on rdt.transformers.CatTransformer.fit:

anonymize: bool
category: str

If the flag is set, the values should be mapped to values generated by faker before generating the probability_map. The supported values of category are the attributes of the faker object; the list includes, but is not limited to:

first_name
last_name
name
ssn
phone_number
email
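
A minimal sketch of the mapping step, assuming the generator is any zero-argument callable. The function name anonymize_column is hypothetical, and a counter-based stand-in replaces faker so the example runs without extra dependencies.

```python
import itertools

import pandas as pd


def anonymize_column(col, generator):
    """Map every distinct non-null value to a freshly generated one.

    Equal originals share one replacement, so value frequencies are unchanged.
    """
    mapping = {}
    used = set()
    for value in col.dropna().unique():
        replacement = generator()
        # Draw again on collisions so two originals never merge into one value.
        # (A generator with too few distinct outputs would loop forever here.)
        while replacement in used:
            replacement = generator()
        used.add(replacement)
        mapping[value] = replacement
    return col.map(mapping)


# Stand-in generator; with faker installed one could pass
# getattr(faker.Faker(), category) instead, e.g. category="first_name".
counter = itertools.count(1)
col = pd.Series(["alice", "bob", "alice", "carol"])
anonymized = anonymize_column(col, lambda: "name_{}".format(next(counter)))
```

Because the mapping is built before any probabilities are computed, the original values never need to be stored on the transformer.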

Finish number transformer

Split HyperTransformer in two

Right now, rdt.hyper_transformer.HyperTransformer has the logic to handle both tables and datasets.
We could split it into a transformer that works at table level and another that does the same with datasets.

Code linting

Make the project flake8 + isort compatible

Fix extra columns

categorical should return data with extra column to help reverse transformation
missing columns should be returned with output as well

Enforce python standards.

Currently, there are some minor details that break the Python standards, such as:

Redundant names, or using uppercase.
Docstrings not following PEP257.
Requirements out of setup.py
No module contents declaration on rdt.__init__

Create way to pass HyperTransformer table dict

https://github.com/HDI-Project/RDT/blob/6a088df4f73aad4fb0d52b8d8336d3758e86aa67/rdt/hyper_transformer.py#L25

RDT always sets the table dict to be created by loading data from the meta.json. There should be a way to pass the HyperTransformer a list or dictionary of tables, and have RDT create the table_dict from that.

NumberTransformer get default_val in a safer way.

NumberTransformer.get_default_val will fail if given an empty column.

Add documentation

Proper documentation for the project is missing. Among the things that should be done are:

Complete API Reference
Data requirements and meta.json reference
Fix warnings on doc generation.

Create Categorical Transformer

Review each transformer and make them time efficient

Add a numerical transformer for positive numbers

We should implement a transformer that takes values from (0, +inf) and puts them into the real numbers, and is able to reverse this transformation. It will be helpful in order to model and sample positive values.

To implement it, we can use the log function to move from R+ --> R, and exp to reverse the transformation.
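
A minimal sketch of such a transformer follows. The class name PositiveNumberTransformer is hypothetical, and the strict positivity check is an assumption about how invalid input should be handled.

```python
import numpy as np
import pandas as pd


class PositiveNumberTransformer:
    """Hypothetical transformer mapping (0, +inf) onto the real line."""

    def transform(self, col):
        # log is only defined for strictly positive values.
        if (col <= 0).any():
            raise ValueError("all values must be strictly positive")
        return np.log(col)

    def reverse_transform(self, col):
        # exp is the inverse of log, so any real number maps back to (0, +inf).
        return np.exp(col)


col = pd.Series([0.5, 1.0, 20.0])
transformer = PositiveNumberTransformer()
transformed = transformer.transform(col)
recovered = transformer.reverse_transform(transformed)
```

Any real value sampled in the transformed space maps back to a positive number, which is the property that makes the transformation useful for sampling.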

Make everything fit pep8 standards

new transformer for missing values

Add a new transformer class that replaces missing values.
It should create a new column stating whether or not the value in the column was missing in the original data.
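
As a rough sketch of the idea, assuming a 0/1 indicator column: the function names and the "?" prefix for the indicator column are hypothetical choices, not an existing RDT convention.

```python
import numpy as np
import pandas as pd


def fill_missing(col, fill_value=0):
    """Return a frame with the filled column plus a 0/1 missing indicator."""
    indicator_name = "?" + str(col.name)  # hypothetical naming convention
    return pd.DataFrame({
        col.name: col.fillna(fill_value),
        indicator_name: col.isnull().astype(int),
    })


def restore_missing(frame, name):
    """Undo fill_missing by putting NaN back where the indicator is 1."""
    indicator_name = "?" + str(name)
    return frame[name].where(frame[indicator_name] == 0)


col = pd.Series([1.0, np.nan, 3.0], name="age")
filled = fill_missing(col)
restored = restore_missing(filled, "age")
```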

Fix CircleCI build

CircleCI build seems to be failing. Make required changes to make it work.

NumberTransformer reverse transform can't handle nan

On line 104 of NumberTransformer.py, if the value for x is nan then the code will raise an error.

Move functions from rdt.utils to HyperTransformer class

The contents of rdt.utils are either used only by HyperTransformer or not used at all. Moving them will keep all the logic related to HyperTransformer in the same place.

Add col_meta and missing as optional constructor arguments in Transformers

Currently, each call to a transformer needs to specify the arguments col_meta and missing. It would be a good idea to add them as optional arguments on __init__ and, in case they were set, not require them on the calls to transform or reverse_transform.

This way code written in the old format will keep working, but we give the chance to make it easier. We could also add a deprecation warning and deprecate these arguments on the method calls for the 0.2.0 release.
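
The proposal could look roughly like the sketch below. BaseTransformer here is a stand-in class rather than the RDT implementation, and the helper _resolve_args is a hypothetical name.

```python
import warnings


class BaseTransformer:
    """Stand-in base class; not the actual RDT implementation."""

    def __init__(self, col_meta=None, missing=None):
        self.col_meta = col_meta
        self.missing = missing

    def _resolve_args(self, col_meta, missing):
        """Prefer method-level arguments (old style) but warn about them."""
        if col_meta is not None or missing is not None:
            warnings.warn(
                "Passing col_meta or missing to this method is deprecated; "
                "pass them to __init__ instead.",
                DeprecationWarning,
                stacklevel=3,  # point at the caller of transform()
            )
        if col_meta is None:
            col_meta = self.col_meta
        if missing is None:
            missing = self.missing
        return col_meta, missing

    def transform(self, col, col_meta=None, missing=None):
        col_meta, missing = self._resolve_args(col_meta, missing)
        # A real transformer would use col_meta and missing here;
        # they are returned only to make the resolution visible.
        return col, col_meta, missing


transformer = BaseTransformer(col_meta={"name": "age"}, missing=True)
new_style = transformer.transform([1, 2])
```

Old-style calls keep working and emit a DeprecationWarning, while new-style calls rely on the constructor values.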

Make DatetimeTransformer timezone-aware

The current DatetimeTransformer is converting the datetime values to integer and back without considering the timezones.

The timezones could be added as a new column, either categorical, using the CategoricalTransformer in a fashion similar to how the NullTransformer is being used, or numerical.

Hyper transformer needs access to transformer classes

In order to do hyperTransformer.fit_transform or hyperTransformer.reverse_transform, the hyper transformer needs access to the individual transformers. This used to be accomplished by importing everything from the transformers package, but this is no longer being imported. When trying to run fit_transform, the following error occurs:

File "/Users/andrew/Documents/MEng_Thesis/dataprep/rdt/hyper_transformer.py", line 52, in fit_transform
table, table_meta, transformer_dict, transformer_list, missing)
File "/Users/andrew/Documents/MEng_Thesis/dataprep/rdt/hyper_transformer.py", line 112, in fit_transform_table
transformer = self.get_class(transformer_name)
File "/Users/andrew/Documents/MEng_Thesis/dataprep/rdt/hyper_transformer.py", line 24, in get_class
return getattr(globals()[class_name], class_name)
KeyError: 'NumberTransformer'
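
The traceback shows get_class resolving the class name through globals(), which only works while every transformer is imported into that module. One alternative, sketched here with stand-in classes and a hypothetical TRANSFORMERS registry, is an explicit name-to-class mapping.

```python
class NumberTransformer:
    """Stand-in for the real transformer class."""


class CatTransformer:
    """Stand-in for the real transformer class."""


# Explicit registry: lookups no longer depend on what happens to be imported.
TRANSFORMERS = {
    cls.__name__: cls
    for cls in (NumberTransformer, CatTransformer)
}


def get_class(class_name):
    """Return the transformer class registered under class_name."""
    try:
        return TRANSFORMERS[class_name]
    except KeyError:
        raise ValueError("Unknown transformer: {}".format(class_name))


transformer = get_class("NumberTransformer")()
```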

Handle missing data appropriately

Currently, missing data is replaced with np.nan. This won't work as an input to Copula. Each transformer should handle missing data appropriately in its class.

Add Tests for individual transformers

Refactor unittests

Right now some of the tests are not as well designed as they should be. We should refactor them in order to:

Have no test that requires external data.
Check that all tests skipped because of issues are up to date.

Add cookie cutter

Add CLA

Configure a service to allow contributors to sign a CLA before submitting their contributions.

Infinity not handled in reverse transform

Sometimes the value infinity is received as input to a reverse transform. The transformers should handle this appropriately instead of raising an error.
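
What "appropriately" means is not specified in the issue. One option, shown purely as an assumption, is to treat infinite values as missing before the rest of the reverse transform runs. The function name replace_infinity is hypothetical.

```python
import numpy as np
import pandas as pd


def replace_infinity(col):
    """Turn +inf and -inf into NaN so later steps only deal with one case."""
    return col.replace([np.inf, -np.inf], np.nan)


col = pd.Series([1.0, np.inf, -np.inf])
cleaned = replace_infinity(col)
```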

Add demo file downloader

Drop the usage of meta

RDT shouldn't load data from disk nor need a metadata.json file to operate. Its behavior should be the following:

Both HyperTransformer and Transformers should have as input data which is already in the required format. That means that HyperTransformer shouldn't load data from disk, nor the individual transformers prepare it.
The filling or not of missing values should be taken out from individual transformers and be handled at the HyperTransformer, as specified in this issue.

Remove unused methods

The methods HyperTransformer.get_types and HyperTransformer.impute_table are not used anywhere in the project, and their only purpose is no longer needed in the current implementation.

Fix issues with cookiecutter package

Set author name and email as DAI LAB MIT.
Check if cli is really needed
Move code out of the root of the repository.

DTTransformer does not handle nan in reverse_transform

https://github.com/HDI-Project/RDT/blob/7bc38da2c9cd737fdd2ff3207d3cd3482701bb6d/rdt/transformers/DTTransformer.py#L88

If a nan value is passed into the safe_data method on line 88, then an error will be raised.

Minor coding issues

The first iteration of CatTransformer.get_probability_map can be changed to self.probability_map = col.groupby(col).count().to_dict()
Transform method should be implemented on BaseTransformer and inherited on the other transformers.
HyperTransformer should crash at init if no meta_file is provided. Otherwise it will raise an AttributeError later on, so it's better to fail early.
HyperTransformer methods shouldn't be prefixed with hyper.
rdt.transformers.__init__ should be replaced with a proper module declaration

DTTransformer Out of bounds nanosecond timestamp

https://github.com/HDI-Project/RDT/blob/7bc38da2c9cd737fdd2ff3207d3cd3482701bb6d/rdt/transformers/DTTransformer.py#L28

Pandas represents timestamps in nanoseconds, so there is an upper limit to the timespans that can be represented. If a timestamp is out of this range, the code breaks with a pandas.tslib.OutOfBoundsDatetime error. Full discussion can be seen here: https://stackoverflow.com/questions/32888124/pandas-out-of-bounds-nanosecond-timestamp-after-offset-rollforward-plus-adding-a
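
One way to avoid the crash, sketched here under the assumption that unrepresentable dates may be treated as missing, is to parse with errors="coerce". Entries pandas cannot represent then become NaT instead of raising. With nanosecond resolution the representable range is given by pd.Timestamp.min and pd.Timestamp.max.

```python
import pandas as pd

raw = pd.Series(["2018-01-01", "3000-01-01"])

# errors="coerce" converts entries that cannot be represented into NaT
# rather than raising OutOfBoundsDatetime.
parsed = pd.to_datetime(raw, errors="coerce")

# Entries that were present in the input but could not be represented.
lost = parsed.isnull() & raw.notnull()
```

The lost mask makes it possible to report or log the affected rows rather than dropping them silently.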

NullTransformer unreachable operation

NullTransformer first tries to fill null values with the mean of the column, if the mean is not null; if that's not possible, it falls back to fillna(0).

However, the way the mean is computed in the code always yields np.nan if there is a single null value (and that's exactly the case where we need a non-null mean to replace null values).
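
The issue does not show the offending code. A common cause of this symptom is computing the mean on the raw NumPy array, where np.mean propagates NaN; that cause is an assumption here. A sketch of a NaN-safe fill value, with the hypothetical name get_fill_value:

```python
import numpy as np
import pandas as pd


def get_fill_value(col):
    """Mean of the non-null values, or 0 when there are none."""
    fill_value = col.mean()  # Series.mean skips NaN by default (skipna=True)
    if pd.isnull(fill_value):
        # Empty or all-null column: fall back to 0, as fillna(0) would.
        fill_value = 0
    return fill_value


partially_null = pd.Series([1.0, np.nan, 3.0])
all_null = pd.Series([np.nan, np.nan])

filled = partially_null.fillna(get_fill_value(partially_null))
```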

Enforce int dtype on integer columns

Although the implementation of rdt.transformer.NumberTransformer tries to keep integer values as such, the pandas.Series returned by reverse transform sometimes has a dtype of float64. That is, even if the values are valid integers, the column is treated as a float by pandas.

Could we find a way to check that integer columns are returned with dtype int64?
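
One possible check, as a sketch: cast back to int64 only when the column contains no NaN or infinite values, since neither can be stored in an int64 column. The function name restore_integer_dtype is hypothetical.

```python
import numpy as np
import pandas as pd


def restore_integer_dtype(col):
    """Round and cast a float column to int64 when every value is finite."""
    if col.isnull().any() or np.isinf(col).any():
        # NaN and infinity cannot be represented as int64; leave as float.
        return col
    return col.round().astype("int64")


as_int = restore_integer_dtype(pd.Series([1.0, 2.0, 3.0]))
still_float = restore_integer_dtype(pd.Series([1.0, np.nan]))
```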

Document specification for meta.json

Currently, the documentation contains no complete specification of the meta.json that the transformers need in order to work.

Pass col_meta at class-level instead of method-level

The col_meta and missing arguments should only be passed when creating the class, not when using any of its methods.

Add attributes NullTransformer and col_meta.

Transformers create NullTransformer when called with missing param.

The missing param should be made an attribute and, at init time, an instance of NullTransformer should be created in another attribute if applicable.

Also, col_meta should be passed at init time and not on transform or fit calls.

Exclude logic for missing values in Transformers.

Currently, each Transformer accepts an argument missing on its transform and reverse_transform and, in case it is True, handles the missing values.

This behavior could be moved to HyperTransformer, maybe annotating the corresponding columns in the meta.json for more granular control.

Add circleci

Make CatTransformer.probability_map deterministic

When transforming data with CatTransformer, the different values with their mean and standard deviation are preserved in the attribute CatTransformer.probability_map.

However, that process isn't fully deterministic, which makes the class harder to test and debug. Until everybody uses Python 3.7, where dict order is guaranteed, we should find a way to ensure that the same input values always generate the same CatTransformer.probability_map.
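
A sketch of one way to get a stable ordering, assuming the non-determinism comes from dict iteration order: sort the category labels and store the result in an OrderedDict. The function name build_probability_map, the interval layout, and the spread formula are illustrative assumptions and are not taken from the RDT code.

```python
from collections import OrderedDict

import pandas as pd


def build_probability_map(col):
    """Assign each category a sub-interval of [0, 1] in sorted label order."""
    counts = col.groupby(col).count()
    total = counts.sum()
    result = OrderedDict()
    start = 0.0
    # Sorting the labels fixes the iteration order regardless of dict ordering.
    for label in sorted(counts.index, key=str):
        end = start + counts[label] / total
        mean = (start + end) / 2
        std = (end - start) / 6  # assumed spread; not taken from RDT
        result[label] = ((start, end), mean, std)
        start = end
    return result


col = pd.Series(["b", "a", "b", "c"])
probability_map = build_probability_map(col)
```

With this layout the same set of values yields the same map regardless of row order, which is what makes the output easy to assert on in tests.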

Improve anonymization through integration with other libraries

Currently we are using the faker library to anonymize data on the categorical transformer. We should investigate if there are better options available in terms of both performance and supported data types, such as mimesis.

Update README to include Categorical Transformer

The hyper transformer snippet in the README does not include the CatTransformer. This should be fixed.

Add unittests for hyper transformer

Make the output numeric only

Current output has transformed columns as well as not transformed columns (eg. id columns). This should be changed to only have the numeric columns.

DTTransformer check data format before transform.

Currently DTTransformer expects a format key in its meta with the output date format in strftime syntax, which is used to format the output of DTTransformer.reverse_transform.

However, the input format is not validated before the transformation, which leaves the door open to silent errors in case of a mistake in the meta.json.

To avoid this, I propose checking that the data is in the specified format before transforming.
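
The proposed check could be sketched with the standard library's datetime.strptime, assuming the column holds date strings. The function name check_date_format is hypothetical.

```python
from datetime import datetime


def check_date_format(values, date_format):
    """Raise ValueError listing every value that does not match date_format."""
    invalid = []
    for value in values:
        try:
            datetime.strptime(value, date_format)
        except (TypeError, ValueError):
            # TypeError covers non-string entries, ValueError covers mismatches.
            invalid.append(value)
    if invalid:
        raise ValueError(
            "Values not matching {!r}: {!r}".format(date_format, invalid)
        )


check_date_format(["2018-01-31", "2018-02-01"], "%Y-%m-%d")  # passes silently
```

Collecting all mismatches before raising gives a single error message that points at every bad entry in the column.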
