
sdv-dev / sdv


Synthetic data generation for tabular data

Home Page: https://docs.sdv.dev/sdv

License: Other

Makefile 0.41% Python 99.59%
synthetic-data machine-learning relational-datasets multi-table time-series synthetic-data-generation sdv data-generation generative-adversarial-network gan

sdv's People

Contributors

amontanez24, arashakhgari, aylr, csala, deathn0t, dyuliu, fealho, frances-h, github-actions[bot], gsheni, jdtheripperpc, katxiao, kveerama, lajohn4747, ludovicc, manuelalvarezc, npatki, pvk-developer, r-palazzo, rollervan, rwedge, sarahmish, sdv-team, tssbas, xamm


sdv's Issues

Add Evaluation Metrics

Find a way to evaluate the output of SDV.

  • Time
  • Accuracy for numeric columns
  • Accuracy for categorical columns

Update SDV to work with Copula updates

  • Copulas changed the name of the classes that get referenced by SDV
  • Running SDV currently raises the following error:
    ModuleNotFoundError: No module named 'copulas.multivariate.GaussianCopula'

Add copula models to modeler.py

  • SDV version:
  • Python version:
  • Operating System:

Description

The modeler should now store copula models as it runs RCPA. RCPA should now add the flattened models to tables.

Extended parameters not being passed up

Description

If there is a table that has a parent and a child, it currently isn't passing its added parameters up to its parent during modeling.

Fix should be on line 147

Add support for Vine Copulas as a modeler.

It could be useful to add support for different models. To achieve that we should:

  1. Wait until this issue of Copulas is done and released.

  2. Update our requirements to work with the latest version of Copulas.

  3. Add a new method sdv.Modeler.flatten_dict, that gets a nested dictionary and returns it flattened:

     >>> nested_dict
     {
         'one_attribute': 0,
         'nested_attribute': {
             'foo': 'bar'
         }
     }

     >>> sdv.Modeler.flatten_dict(nested_dict)
     {
         'one_attribute': 0,
         'nested_attribute__foo': 'bar'
     }
  4. Add a new method sdv.Sampler.unflatten_dict that does the exact opposite, that is:

     >>> assert nested_dict == sdv.Sampler.unflatten_dict(sdv.Modeler.flatten_dict(nested_dict))
     >>> assert flattened_dict == sdv.Modeler.flatten_dict(sdv.Sampler.unflatten_dict(flattened_dict))
  5. Change the behavior of sdv.Modeler.flatten_model so that it receives a model as input and returns a pandas.Series with the flattened model dict.

  6. Rename the distribution keyword on sdv.Modeler.__init__ to model_kwargs. It defaults to None, but when it is present it is passed to the model when instances are created.

  7. Change the behavior of sdv.Sampler._make_model_from_params so that, after the parameters have been retrieved from the parent_row, they are transformed into a dictionary and passed to sdv.Sampler.unflatten_dict, and the result is passed to model.from_dict.
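
As an illustration of steps 3 and 4, here is a minimal sketch of the two helpers written as plain functions. It is not the SDV implementation. The '__' separator comes from the example above, while the prefix and separator parameters are assumptions made for this sketch. It also assumes string keys that do not themselves contain '__' and no empty nested dicts, since those cases would not survive the round trip.

```python
def flatten_dict(nested, prefix='', separator='__'):
    """Return a flat copy of `nested`, joining nested keys with `separator`."""
    flat = {}
    for key, value in nested.items():
        flat_key = f'{prefix}{separator}{key}' if prefix else str(key)
        if isinstance(value, dict):
            # Recurse, carrying the path built so far as the prefix.
            flat.update(flatten_dict(value, flat_key, separator))
        else:
            flat[flat_key] = value
    return flat


def unflatten_dict(flat, separator='__'):
    """Rebuild the nested dict that `flatten_dict` was given."""
    nested = {}
    for flat_key, value in flat.items():
        *parents, leaf = flat_key.split(separator)
        node = nested
        for parent in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(parent, {})
        node[leaf] = value
    return nested


nested_dict = {'one_attribute': 0, 'nested_attribute': {'foo': 'bar'}}
flattened_dict = flatten_dict(nested_dict)
```

With these definitions both round-trip assertions from step 4 hold for the example dict.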

Enforce data constraints

After this issue is solved, we should be ready to enforce data constraints on sampled data.

In order to implement them, they should be checked after the data is sampled and reverse_transformed, but before it is returned. The check should live in sdv.Sampler.sample_rows, as it is the common entry point to the sampling process for the three public methods. The roadmap should be as follows:

1-. Create a method sdv.Sampler.check_constraints that receives a sampled and reverse-transformed dataframe and returns an array of indices corresponding to the rows that fulfill the constraints.

2-. Modify the method sdv.Sampler.sample_rows, which handles the sampling process, so that before returning the result it checks that the data fulfills the constraints, discards the rows that fail, and samples again until it reaches the desired number of rows.
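
The two steps above amount to a rejection-sampling loop, which could be sketched as follows. To stay self-contained this sketch represents rows as plain dicts instead of a dataframe. The sample_batch callable, the constraints list of predicates and the max_attempts guard are all assumptions introduced for illustration; they are not part of SDV.

```python
import random


def check_constraints(rows, constraints):
    """Return the indices of the rows that satisfy every constraint."""
    return [
        index for index, row in enumerate(rows)
        if all(constraint(row) for constraint in constraints)
    ]


def sample_rows(sample_batch, constraints, num_rows, max_attempts=100):
    """Sample until `num_rows` valid rows are collected or attempts run out."""
    valid_rows = []
    for _ in range(max_attempts):
        missing = num_rows - len(valid_rows)
        if missing <= 0:
            break
        batch = sample_batch(missing)
        # Keep only the rows that pass; the rest are discarded and resampled.
        valid_rows.extend(batch[i] for i in check_constraints(batch, constraints))
    if len(valid_rows) < num_rows:
        raise ValueError('Could not sample enough rows that satisfy the constraints.')
    return valid_rows[:num_rows]


# Hypothetical stand-in for "sample and reverse_transform".
rng = random.Random(0)


def sample_batch(size):
    return [{'age': rng.randint(-10, 90)} for _ in range(size)]


rows = sample_rows(sample_batch, [lambda row: row['age'] >= 0], num_rows=5)
```

The attempt limit is there because a constraint that can never be satisfied would otherwise loop forever.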

Enforce coding standards.

Fix python standards violations in the project such as:

  • Invalid file names.
  • Docstrings improperly formatted.

Also:

  • Remove unused files.
  • Delete unused variables
  • Refactor repeated chunks of code
  • Make sure make test-all passes without issues

Sample Parents Using Copulas

  • SDV version:
  • Python version:
  • Operating System:

Description

Using the models generated by the modeler, we want to sample rows for parents. Every time a new row is sampled, the primary key and row should be stored, so that children can generate models for the primary keys.

Remove hyper transformer dependency from Modeler

Description

  • In Modeler.py, hyper_transformer is used to clean tables before modeling (remove added NaNs). This dependency should be removed, and the imputing should be done within Modeler.py itself.

Add get_dataframe and get_metadata functions to DataNavigator

Currently, it is unclear how a user can access the dataframes or metadata for a specific table. Functions should be added to data_navigator to make this easier.

-def get_dataframe(table_name): returns the dataframe for the specified table
-def get_meta_data(table_name): returns the meta information for the specified table

synthesize rows given some restrictions

  • SDV version: 0.1.0
  • Python version: 3.6
  • Operating System: Fedora release 28 (Twenty Eight)

Description

In the master thesis and old documentation, it is stated that users are able to sample from tables according to arbitrary conditions on certain features. In the current version, I can't find anything like this in the documentation.

What I Did

I looked into the documentation and the source code.

Notes

I might be wrong but, probably, SDV team is waiting for a PR on this: sdv-dev/Copulas#47

README fixes

After running the README step by step I found some issues that need to be fixed:

  • Format Python snippets as Python instead of bash.

  • In the install instructions, replace the conda instructions with a vanilla venv, if it's really needed (we can just put the normal install-from-sources instructions).

  • In the code examples, replace import * with imports of the concrete modules.

  • When showing the values of a dataframe:
    · Avoid using print, as it is redundant.
    · Don't print the whole dataframe; a transposed head (df.head(3).T) will be more readable.

  • When users_meta (a nested dict) is obtained, it is displayed using print, which flattens it, making its structure harder to understand. It would be better to display it without print or to use pprint instead.

  • On save_model, create the models folder if it doesn't exist.

Even though it doesn't crash, warnings arise at some points in the execution. Solving them would be a plus:

>>> modeler.model_database()
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/numpy/lib/function_base.py:3082: RuntimeWarning: invalid value encountered in subtract
  X -= avg[:, None]
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/pandas/core/frame.py:5550: RuntimeWarning: Degrees of freedom <= 0 for slice
  baseCov = np.cov(mat.T)
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/numpy/lib/function_base.py:3088: RuntimeWarning: divide by zero encountered in double_scalars
  c *= 1. / np.float64(fact)
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/numpy/lib/function_base.py:3088: RuntimeWarning: invalid value encountered in multiply
  c *= 1. / np.float64(fact)
/home/xino/Pythia/MIT/SDV/sdv/Modeler.py:83: RuntimeWarning: '>' not supported between instances of 'str' and 'int', sort order is undefined for incomparable objects
  extended_table = extended_table.append(row, ignore_index=True)
>>> sampler.sample_all()
/home/xino/.virtualenvs/sdv_mit/src/copulas/copulas/multivariate/GaussianCopula.py:88: RuntimeWarning: covariance is not positive-semidefinite.
  samples = np.random.multivariate_normal(clean_mean, clean_cov, size=s)
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1907: RuntimeWarning: invalid value encountered in add
  lower_bound = self.a * scale + loc

Add support for modelling multi-parent tables

Currently SDV can neither model nor sample a table with multiple foreign_keys, whether they refer to different tables or to the same one repeated.

We should find a way to model and sample such tables.

SDV Modeler Index issue

  • Lines 54, 63 and 64 of Modeler should be changed to use iloc, because sometimes the index of the dataframe doesn't start at 0 and increase normally.

Fix NaN problem that happens when modelling

  • SDV version:
  • Python version:
  • Operating System:

Description

Covariance matrices are being filled with NaNs. This is likely because the foreign key column is being modeled (all of its values are the same, which causes that column to become all NaNs when creating a copula from it).

What I Did

Running RCPA causes many of the copula models to get covariance matrices filled with NaNs

Rename Variables

  • SDV version:
  • Python version:
  • Operating System:

Description

In DataNavigator:
transformed -> transformed_data
_parse_data -> _parse_meta_data

In Modeler:
model_type -> tuple of the overall model type name and a list of parameters, i.e. ('GaussianCopula', ['GaussianUnivariate'])
sets -> conditional_data

Sample Children Using Copulas

  • SDV version:
  • Python version:
  • Operating System:

Description

Users should be able to generate rows for child tables. These tables should have foreign keys that refer to primary keys actually generated by parents.

Fix hyper_fit_transform call

RDT changed the method in hyper transformer from hyper_fit_transform to fit_transform. SDV still calls the old method on line 108 of DataNavigator.py. This should be changed to call the new method.

Create sampling logic w/ dummy values

  • SDV version:
  • Python version:
  • Operating System:

Description

Add the ability to sample tables recursively, but using random values instead of having the model generate the values.

Add documentation/reference for meta.json

Currently there is no documentation about what a meta.json file should contain for a given dataset. Minimal documentation should contain:

  • Reference for datasets
  • Reference for fields
  • Examples for simple cases
  • Links to external sources (RDT and Dataset Manager)

Ignore foreign key when modelling

  • SDV version:
  • Python version:
  • Operating System:

Description

When creating a copula model to get the conditional data, the foreign key column should be ignored. This is because all of its values will be the same, which breaks the copula model.

Generate multiple rows

  • SDV version:
  • Python version:
  • Operating System:

Description

Generate more than one row at a time.

Add testing datasets with more complex relationships

The current dataset we are using for unit testing is quite simple, and I'm afraid some issues may arise when working with datasets with more complex relations.

My proposal is to add datasets with:

  • A single parent and multiple children
  • A child with multiple parents
  • Multiple multi-level relations (a child whose parents are also parents of some of the parents of the child,...)

This would help us to catch edge cases we may have not considered.

Improve copula parameters sampling

During the modeling of the database in sdv.Modeler, extensions are created for each row of the parent tables containing the parameters to model the children tables.

At sampling time, these extensions are sampled too, and later the parameters are extracted and used to create the models that sample the children rows.

When creating new models from the sampled parameters, sometimes the models are created with inconsistent values. So far the following have been found:

  1. The sampled covariance matrix may not be positive-semidefinite, which is a requirement for the copulas.multivariate.GaussianMultivariate copula and raises this warning:

    sdv_mit/lib/python3.6/site-packages/copulas/multivariate/gaussian.py:199: RuntimeWarning: covariance is not positive-semidefinite.
       samples = np.random.multivariate_normal(means, clean_cov, size=size)
    
  2. If by any chance the sampled value for the std of the copulas.univariate.GaussianUnivariate distribution is negative or zero, the generated samples will be np.nan.
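
One possible way to guard against both problems, shown here only as a sketch, is to repair the sampled parameters before building the model. The helper names clip_to_psd and safe_std, and the epsilon default, are hypothetical. Eigenvalue clipping is a generic linear-algebra technique and is not something the issue proposes. It changes the sampled values, so whether it is acceptable here is an open question.

```python
import numpy as np


def clip_to_psd(covariance, min_eigenvalue=0.0):
    """Clip negative eigenvalues so the matrix becomes positive-semidefinite."""
    # Symmetrize first, since eigh assumes a symmetric input.
    symmetric = (covariance + covariance.T) / 2
    eigenvalues, eigenvectors = np.linalg.eigh(symmetric)
    clipped = np.clip(eigenvalues, min_eigenvalue, None)
    # Rebuild the matrix from the clipped spectrum.
    return eigenvectors @ np.diag(clipped) @ eigenvectors.T


def safe_std(std, epsilon=1e-6):
    """Replace a non-positive standard deviation with a small positive value."""
    return std if std > epsilon else epsilon


sampled_cov = np.array([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues are 3 and -1
repaired_cov = clip_to_psd(sampled_cov)
```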

NaNs in Covariance matrix for parent models

  • SDV version:
  • Python version:
  • Operating System:

Description

If you model the database, the parent models receive data with NaNs, and then end up with covariance matrices that have NaNs. This makes sampling impossible.

Two possible causes:

  1. Some parent primary keys are never referenced, so the extension is null.
  2. The covariance matrix for different conditional data may have different sizes. This is likely a bug in copulas where numpy.cov is taking the rows as variables instead of the columns.
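
The second cause can be checked in isolation. By default numpy.cov treats each row as a variable (rowvar=True), so a table with one observation per row yields a matrix sized by the number of rows, not the number of columns. The data below is made up for illustration.

```python
import numpy as np

# 5 observations (rows) of 3 variables (columns), like a table of conditional data.
data = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.5],
    [3.0, 4.0, 2.5],
    [4.0, 3.0, 0.0],
    [5.0, 6.0, 1.0],
])

cov_default = np.cov(data)                   # rowvar=True: each row is a variable
cov_by_column = np.cov(data, rowvar=False)   # each column is a variable

print(cov_default.shape)    # (5, 5)
print(cov_by_column.shape)  # (3, 3)
```

With the default, the size of the matrix depends on how many rows the conditional data has, which would explain covariance matrices of different sizes.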

Minor issues after code review

  • Repeated string values should be defined as module-level constants (e.g. ‘GENERATED_PRIMARY_KEY’ in sdv.Modeler).

  • On sdv.DataNavigator.DataNavigator: delete the getter methods and access the attributes directly.

  • Replace dict lookups with dict.get calls where possible.

  • Delete sdv.DataNavigator.DataNavigator.__init__: it simply calls super, so it does nothing by itself, and the parent __init__ is already used through inheritance.

  • Delete repeated methods sdv.Modeler.get_model and sdv.Modeler._get_model.

  • In if statements, replace comparisons against an empty set with a truthiness check on the object itself, like: if not self.attribute instead of if self.attribute == set().
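
A few of these points in a small, self-contained form. The variable names below are invented for the example and do not come from the SDV code base.

```python
# A repeated string defined once as a module-level constant.
GENERATED_PRIMARY_KEY = 'GENERATED_PRIMARY_KEY'

parameters = {'distribution': 'GaussianUnivariate'}

# dict.get returns a default instead of raising KeyError for a missing key.
model_kwargs = parameters.get('model_kwargs', {})

# An empty set is falsy, so no comparison against set() is needed.
children = set()
has_no_children = not children  # same result as: children == set()
```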

Prepare 0.1.0 release

This issue includes all the tasks that need to be done before the release of the 0.1.0 version:

  1. Installation works on a clean environment using make dist and installing the resulting tarball.
  2. Build passes with make test-all.
  3. Documentation includes necessary steps for installation and usage, minimal api reference and contributing guide.
  4. README examples work perfectly and reflect the latest changes made to the project.

Add TravisCI

Add TravisCI to run builds after each commit, merge and PR.

Ensure unicity on primary keys on different calls

Currently, primary keys are generated using the exrex module and the regex from the meta.json file. The way it's implemented, if we sample a single time, the primary keys are guaranteed to be unique. However, if we sample more than once, it's possible to obtain keys that were already returned by a previous call.

Should we ensure uniqueness in this scenario?
Note that if we do this, we will only be able to sample as many rows as there are distinct matches of the regex. Afterwards we'll need a way to reset the database before sampling anything else.

For example, if we had a dataset consisting of a single table with a single column, which is the primary key with regex [1-5]{1}, then the following could happen:

>>> ...
>>> first_samples = sampler.sample_all(num_rows=3)
>>> first_samples.T
   primary_key
0            1
1            2
2            3

# There is no guarantee that, if we sample one more row, its primary key will be either 4 or 5
>>> second_sample = sampler.sample_all(num_rows=1)
>>> second_sample
   primary_key
0            3
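
If uniqueness across calls is wanted, one option is to remember which keys have already been handed out. The sketch below is only an illustration of that idea. PrimaryKeyGenerator is a hypothetical class, and the candidate keys are passed in as a list rather than generated with exrex, to keep the example self-contained.

```python
class PrimaryKeyGenerator:
    """Hand out each candidate key at most once across calls."""

    def __init__(self, candidates):
        self._candidates = list(candidates)
        self._next_index = 0

    def sample(self, num_rows):
        end = self._next_index + num_rows
        if end > len(self._candidates):
            # The regex has a finite number of matches, so keys can run out.
            raise ValueError('Not enough unused primary keys left; call reset() first.')
        keys = self._candidates[self._next_index:end]
        self._next_index = end
        return keys

    def reset(self):
        """Forget which keys were used, as if the database were empty again."""
        self._next_index = 0


generator = PrimaryKeyGenerator(['1', '2', '3', '4', '5'])  # matches of [1-5]{1}
first_keys = generator.sample(3)
second_keys = generator.sample(1)
```

Here the second call can only return a key that the first call did not, and asking for more keys than remain raises an error until reset is called.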

Set Copulas as dependency

  • SDV version:
  • Python version:
  • Operating System:

Description

The Copulas library needs to be a dependency. We should be able to use Copulas in SDV.

What I Did

Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.

Change the modeler save/load API.

Right now, the functionality to save/load a model is invoked like this:

modeler.save_model('demo_model')
modeler = sdv.utils.load_model('sdv/models/demo_model.pkl')

Here we are saving and loading the same model.
A few problems arise:

1-. The input value of both functions should be the same, to avoid confusion.

2-. The saving is done by building a path relative to the modeler file, while the loading uses the path as given. This can cause unexpected behavior for the end user. Could we take this value from a configuration file?

3-. It makes little sense to have a function that loads a class instance as a standalone function in a separate module, when it could be a classmethod on the Modeler class.
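
A sketch of what a symmetric API could look like, with save and load taking the same path and load defined as a classmethod. The Modeler class below is a stand-in with a single made-up attribute, not the real sdv.Modeler, and the method names save and load are assumptions. It pickles only the instance state, and it creates the target folder when it is missing.

```python
import os
import pickle


class Modeler:
    """Stand-in for sdv.Modeler with a symmetric save/load pair."""

    def __init__(self, models=None):
        self.models = models or {}

    def save(self, path):
        """Pickle the instance state to `path`, creating parent folders if needed."""
        folder = os.path.dirname(path)
        if folder:
            os.makedirs(folder, exist_ok=True)
        with open(path, 'wb') as pickle_file:
            pickle.dump(vars(self), pickle_file)

    @classmethod
    def load(cls, path):
        """Rebuild an instance from the state written by `save` to the same `path`."""
        with open(path, 'rb') as pickle_file:
            state = pickle.load(pickle_file)
        return cls(**state)
```

Usage would then be modeler.save('models/demo_model.pkl') followed by Modeler.load('models/demo_model.pkl'), with the same path in both calls.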

Remove Primary Key Requirement

  • SDV version:
  • Python version:
  • Operating System:

Description

SDV currently requires that a primary key be defined for every table in the meta. This is not actually necessary and the dependency should be removed.

Create Flatten Model Function

  • SDV version:
  • Python version:
  • Operating System:

Description

Given a Copula model, there should be a function that converts its parameters into an array (flattens the model).

Separate Data Loading into another class

Description

  • Create another class responsible for loading data and returning a DataNavigator object.
  • Remove all loading logic from DataNavigator and move to this new class

Use pypi versions of Copulas and RDT

This week two of the dependencies of the project have released new versions. We should check that everything works fine with the new versions and change the project dependencies to the newer versions.

Modeler parameter not being used (?)

SDV version: 0.1.0
Python version: 3.6
Operating System: Fedora release 28 (Twenty Eight)

Description

I was trying to use the univariate KDE. To do that, I tried to set the distribution parameter in sdv.Modeler constructor to sdv.univariate.KDEUnivariate. The fitted modeler still uses sdv.univariate.GaussianUnivariate.

What I Did

I ran the following code:

import pandas as pd
import numpy as np

from copulas.univariate import KDEUnivariate
from copulas.univariate import GaussianUnivariate
from copulas.multivariate import VineCopula
from copulas.multivariate import GaussianMultivariate
from copulas.multivariate.tree import TreeTypes

from sdv import Sampler
from sdv import Modeler
from sdv import CSVDataLoader
from functools import partial


data_loader = CSVDataLoader('boston.json')
dn = data_loader.load_data()
dn.transform_data()
modeler = Modeler(dn, distribution=KDEUnivariate)
modeler.model_database()
sampler = Sampler(dn, modeler)

I checked the distribution of the TAX feature: in the synthetic data it follows a Gaussian distribution, while in the original data it is not Gaussian. To check that, I looked into both the modeler and the following KDE plots:

image

image

If you want to run the code, you can use the annexed CSV and JSON files.

boston-data.zip
