sdv-dev / sdv
Synthetic data generation for tabular data
Home Page: https://docs.sdv.dev/sdv
License: Other
SDV needs to pin working commits for copulas and rdt until the tagged versions are released.
Find a way to evaluate the output of SDV.
The modeler should now store copula models as it runs RCPA. RCPA should then add the flattened models to the tables.
Add dataprep to SDV.
If a table has both a parent and a child, it currently isn't passing its added parameters up to its parent during modeling.
Fix should be on line 147
It could be useful to add support for different models. To achieve that we should:
Wait until this issue in Copulas is done and released.
Update our requirements to work with the latest version of copulas.
Add a new method sdv.Modeler.flatten_dict that gets a nested dictionary and returns it flattened:
>>> nested_dict
{
    'one_attribute': 0,
    'nested_attribute': {
        'foo': 'bar'
    }
}
>>> sdv.Modeler.flatten_dict(nested_dict)
{
    'one_attribute': 0,
    'nested_attribute__foo': 'bar'
}
Add a new method sdv.Sampler.unflatten_dict that does the exact opposite, that is:
>>> assert nested_dict == sdv.Sampler.unflatten_dict(sdv.Modeler.flatten_dict(nested_dict))
>>> assert flattened_dict == sdv.Modeler.flatten_dict(sdv.Sampler.unflatten_dict(flattened_dict))
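A minimal sketch of how this pair could work, assuming the double-underscore separator shown above (and that no original key itself contains `__`):

```python
def flatten_dict(nested, prefix=''):
    """Flatten a nested dict, joining key paths with '__'."""
    flat = {}
    for key, value in nested.items():
        full_key = prefix + '__' + key if prefix else key
        if isinstance(value, dict):
            # Recurse into sub-dicts, carrying the accumulated key path.
            flat.update(flatten_dict(value, full_key))
        else:
            flat[full_key] = value
    return flat


def unflatten_dict(flat):
    """Rebuild the nested dict from '__'-separated keys."""
    nested = {}
    for key, value in flat.items():
        *path, leaf = key.split('__')
        target = nested
        for part in path:
            target = target.setdefault(part, {})
        target[leaf] = value
    return nested
```

With these two functions, both asserts above hold for any dict whose keys avoid the separator.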
Change the behavior of sdv.Modeler.flatten_model so that it receives a model as input and returns a pandas.Series with the flattened model dict.
Rename the distribution keyword on sdv.Modeler.__init__ to model_kwargs, defaulting to None; when present, it is passed to the model when instances are created.
Change the behavior of sdv.Sampler._make_model_from_params so that after the parameters have been retrieved from the parent_row, they are transformed into a dictionary, passed to sdv.Sampler.unflatten_dict, and the result passed to model.from_dict.
Primary and foreign keys should be generated using regex, not the copula model.
After this issue is solved, we should be ready to enforce data constraints on sampled data.
In order to implement them, they should be checked after data is sampled and reverse_transformed, but before it is returned. This should happen in sdv.Sampler.sample_rows, as it is the common entry point to the sampling process for the three public methods. The roadmap should be as follows:
1-. Create a method sdv.Sampler.check_constraints that gets a sampled and reverse-transformed dataframe and returns an array of indices corresponding to rows that fulfill the constraints.
2-. Modify the method sdv.Sampler.sample_rows, which handles the sampling process, so that before returning the result it checks that the data fulfills the constraints, discards the rows that fail, and samples again until it reaches the desired number of rows.
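The loop in step 2-. could be sketched like this. The names here are hypothetical: `_sample_raw` stands in for the existing sample-and-reverse-transform logic, and `check_constraints` is the method proposed in step 1-.:

```python
import pandas as pd

def sample_rows(sampler, table_name, num_rows):
    """Rejection-sampling sketch: resample until enough rows pass."""
    valid_chunks = []
    remaining = num_rows
    while remaining > 0:
        # Sample and reverse-transform as the current code already does.
        sampled = sampler._sample_raw(table_name, remaining)
        # check_constraints returns the indices of rows that pass.
        valid = sampled.loc[sampler.check_constraints(sampled)]
        valid_chunks.append(valid)
        remaining -= len(valid)
    return pd.concat(valid_chunks, ignore_index=True).head(num_rows)
```

In practice the loop would also need a retry cap, so that constraints which are almost never satisfied don't make it spin forever.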
Fix python standards violations in the project such as:
Also:
make test-all
pass without issues.
Using the models generated by the modeler, we want to sample rows for parents. Every time a new row is sampled, the primary key and row should be stored, so that children can generate models for the primary keys.
Currently, it is unclear how a user can access the dataframes or metadata for a specific table. Functions should be added to data_navigator to make this easier.
- def get_dataframe(table_name): returns the dataframe for the specified table
- def get_meta_data(table_name): returns the meta information for the specified table
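Assuming DataNavigator keeps per-table data and metadata in internal dicts (the attribute names below are guesses for illustration), the accessors could be as simple as:

```python
class DataNavigator:
    def __init__(self, tables, meta):
        self.tables = tables  # table_name -> dataframe
        self.meta = meta      # table_name -> metadata dict

    def get_dataframe(self, table_name):
        """Return the dataframe for the specified table."""
        return self.tables[table_name]

    def get_meta_data(self, table_name):
        """Return the meta information for the specified table."""
        return self.meta[table_name]
```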
Investigate whether the Git Large File Storage infrastructure is suitable to store the demo data.
In the master's thesis and old documentation, it is stated that users are able to sample from tables according to arbitrary conditions on certain features. In the current version, I can't find anything like this in the documentation.
I looked into the documentation and the source code.
I might be wrong but, probably, SDV team is waiting for a PR on this: sdv-dev/Copulas#47
After running the README step by step I found some issues that need to be fixed:
Format python snippets as python instead of bash.
On install instructions, replace the conda instructions with vanilla venv, if it's really needed (we can just put the normal install-from-sources instructions).
On the code examples, replace import * with the concrete modules to import.
When showing the values of a dataframe:
· Avoid using print, as it is redundant.
· Don't print the whole dataframe; a transposed head (df.head(3).T) will be more readable.
When users_meta (a nested dict) is obtained, it is displayed using print, which flattens it and makes its structure harder to understand. It would be better to call it without print or use pprint instead.
On save_model, create the models folder if it doesn't exist.
Even if it doesn't crash, warnings arise at some points in the execution; solving them would be a plus:
>>> modeler.model_database()
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/numpy/lib/function_base.py:3082: RuntimeWarning: invalid value encountered in subtract
X -= avg[:, None]
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/pandas/core/frame.py:5550: RuntimeWarning: Degrees of freedom <= 0 for slice
baseCov = np.cov(mat.T)
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/numpy/lib/function_base.py:3088: RuntimeWarning: divide by zero encountered in double_scalars
c *= 1. / np.float64(fact)
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/numpy/lib/function_base.py:3088: RuntimeWarning: invalid value encountered in multiply
c *= 1. / np.float64(fact)
/home/xino/Pythia/MIT/SDV/sdv/Modeler.py:83: RuntimeWarning: '>' not supported between instances of 'str' and 'int', sort order is undefined for incomparable objects
extended_table = extended_table.append(row, ignore_index=True)
>>> sampler.sample_all()
/home/xino/.virtualenvs/sdv_mit/src/copulas/copulas/multivariate/GaussianCopula.py:88: RuntimeWarning: covariance is not positive-semidefinite.
samples = np.random.multivariate_normal(clean_mean, clean_cov, size=s)
/home/xino/.virtualenvs/sdv_mit/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1907: RuntimeWarning: invalid value encountered in add
lower_bound = self.a * scale + loc
Currently SDV is not able to model or sample a table with multiple foreign_keys, whether they are from different tables or the same one repeated.
We should find a way to model and sample such tables.
Covariance matrices are being filled with NaNs. This is likely because the foreign key column is being modeled (all its values are the same, which causes that column to become all NaNs when creating a copula from it).
Running RCPA causes many of the copula models to get covariance matrices filled with NaNs
On sdv.sampler.sample_rows:
https://github.com/HDI-Project/SDV/blob/687d30a090bd2424abf675e66349bb516e4d6a5b/sdv/sampler.py#L52-L71
and https://github.com/HDI-Project/SDV/blob/687d30a090bd2424abf675e66349bb516e4d6a5b/sdv/sampler.py#L101-L117
are basically identical. It could be a good idea to move that code to a separate method sdv.sampler.transform_sampled_rows to reduce duplicated code.
Once the latest version of copulas is out, we should update our requirements in order to proceed with #71
In DataNavigator:
transformed -> transformed_data
_parse_data -> _parse_meta_data
In Modeler:
model_type -> tuple of the overall model type name and a list of parameters, i.e. ('GaussianCopula', ['GaussianUnivariate'])
sets -> conditional_data
Integrate with faker library for certain data types
Users should be able to generate rows for child tables. These tables should have foreign keys that refer to primary keys actually generated by parents.
RDT changed the method in hyper transformer from hyper_fit_transform to fit_transform. SDV still calls the old method on line 108 of DataNavigator.py. This should be changed to call the new method.
Add the ability to sample a table recursively, but using random values instead of having the model generate the values.
Currently there is no documentation about what a meta.json file should contain for a given dataset. A minimal documentation should contain:
When creating a copula model to get the conditional data, the column of the foreign key should be ignored. This is because all values will be the same and the copula model will be messed up by this
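A sketch of the proposed fix, assuming the foreign key column name is available from the table metadata (the function and parameter names here are illustrative):

```python
import pandas as pd

def conditional_data_without_fk(child_rows: pd.DataFrame, fk_column: str) -> pd.DataFrame:
    """Drop the foreign key column before fitting the copula.

    Every value in that column is identical (all rows share the same
    parent), so its variance is zero and it would corrupt the model.
    """
    return child_rows.drop(columns=[fk_column])
```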
Generate more than one row at a time.
The current dataset we are using for unittesting is quite simple, and I'm afraid some issues may arise when working with datasets with more complex relations.
My proposal is to add datasets with:
This would help us to catch edge cases we may have not considered.
https://github.com/HDI-Project/SDV/blob/a0da522a9e597b00be0a3a948cb962820528a82b/sdv/modeler.py#L181.
This line contains a check to ensure an extension is not None. However, this will allow an empty extension to be processed, which in turn may cause problems later.
As stated here, some dataset structures are not yet supported by SDV.
It would be useful to raise an exception at fit explaining the reasons.
During the modeling of the database in sdv.Modeler, extensions are created for each row of the parent tables containing the parameters to model the children tables.
At sampling time, these extensions are sampled too, and later the parameters are extracted and used to create the models to sample the children rows.
When creating new models from the sampled parameters, sometimes the models are created with inconsistent values. So far the following have been found:
The sampled covariance matrix may not be positive-semidefinite, which is a requirement for the copulas.multivariate.GaussianMultivariate copula, and raises this warning:
sdv_mit/lib/python3.6/site-packages/copulas/multivariate/gaussian.py:199: RuntimeWarning: covariance is not positive-semidefinite.
samples = np.random.multivariate_normal(means, clean_cov, size=size)
If by any chance the sampled value for the std of the copulas.univariate.GaussianUnivariate distribution is negative or zero, the generated samples will be np.nan.
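One common repair for both problems (an assumption on my part, not something SDV does today) is to clip the sampled parameters back into the valid region: project the covariance onto the positive-semidefinite cone by zeroing its negative eigenvalues, and floor the std at a small positive value:

```python
import numpy as np

def make_psd(cov, eps=1e-9):
    """Project a matrix onto the positive-semidefinite cone."""
    sym = (cov + cov.T) / 2                # enforce symmetry first
    eigvals, eigvecs = np.linalg.eigh(sym)
    eigvals = np.clip(eigvals, eps, None)  # clip negative eigenvalues
    return eigvecs @ np.diag(eigvals) @ eigvecs.T

def safe_std(sampled_std, eps=1e-9):
    """Floor the sampled std so GaussianUnivariate never sees std <= 0."""
    return max(sampled_std, eps)
```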
If you model the database, the parent models receive data with NaNs, and end up with covariance matrices that contain NaNs. This makes sampling impossible.
Two possible causes:
Repeated string values should be defined as module-level constants (i.e. 'GENERATED_PRIMARY_KEY' on sdv.Modeler).
On sdv.DataNavigator.DataNavigator: delete getter methods and access the attributes directly.
Change dict lookups to dict.get calls where possible.
Delete sdv.DataNavigator.DataNavigator.__init__: it simply calls super, so it does nothing by itself, and the call to super is done by inheritance.
Delete the repeated methods sdv.Modeler.get_model and sdv.Modeler._get_model.
On if statements, change comparison against an empty set to comparison against the object itself, like if self.attribute instead of if self.attribute == set().
This issue includes all the tasks that need to be done before the release of the 0.1.0 version:
make dist and installing the resulting tarball.
make test-all.
Add TravisCI to run builds after each commit, merge and PR.
Currently, primary keys are generated using the exrex module and the regex from the meta.json file. The way it's implemented, if we sample a single time, we are guaranteed that the primary keys will be unique; however, if we sample more than once, it's possible to obtain keys that have already been returned in a previous call.
Should we ensure uniqueness in this scenario?
Note that if we do this, we will only be able to sample as many rows as the regex allows different matches; afterwards we'll need a way to reset the database before sampling anything else.
For example, if we had a dataset consisting of a single table, with a single column, which is the primary key with regex [1-5]{1}, then the following could happen:
>>> ...
>>> first_samples = sampler.sample_all(num_rows=3)
>>> first_samples.T
primary_key
0 1
1 2
2 3
# Then it's not guaranteed that if we sample one more row, its primary key will be either 4 or 5
>>> second_sample = sampler.sample_all(num_rows=1)
>>> second_sample
primary_key
0 3
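One way to guarantee uniqueness across calls is to keep the set of keys already issued. This is a sketch: in SDV the matches iterable would come from something like exrex generating over the meta.json regex.

```python
class UniqueKeyGenerator:
    """Hand out primary keys from a pool of regex matches, never repeating."""

    def __init__(self, matches):
        # `matches` is an iterable over every string the regex can produce.
        self._pool = iter(matches)
        self._issued = set()

    def sample(self, num_rows):
        keys = []
        while len(keys) < num_rows:
            try:
                candidate = next(self._pool)
            except StopIteration:
                # Pool exhausted: the database must be reset before sampling more.
                raise ValueError('Regex exhausted: no unique keys left.')
            if candidate not in self._issued:
                self._issued.add(candidate)
                keys.append(candidate)
        return keys
```

This also makes the reset concern explicit: once the pool is exhausted, sampling fails until the issued set is cleared.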
The Copulas library needs to be a dependency. We should be able to use Copulas in SDV.
Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.
Right now, the functionality to save/load a model is invoked like this:
modeler.save_model('demo_model')
modeler = sdv.utils.load_model('sdv/models/demo_model.pkl')
Here we are saving and loading the same model.
A few problems arise:
1-. The input value of both functions should be the same, to avoid confusion.
2-. The saving is done by building a path relative to the modeler file, while the loading uses the path as it comes. This can cause unexpected behavior for the end-user. Could we take this value from a configuration file?
3-. It makes little sense to have a function that loads a class instance as a standalone function in a separate module, when it could be a classmethod on the Modeler class.
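A sketch of how point 3-. could look, with saving and loading taking the same bare model name. MODELS_DIR and the method names below are assumptions for illustration, not the current API:

```python
import os
import pickle

MODELS_DIR = 'models'  # single, configurable location (assumption)

class Modeler:
    # ... existing modeling code ...

    def save(self, name):
        """Save this instance under MODELS_DIR, creating the folder if needed."""
        os.makedirs(MODELS_DIR, exist_ok=True)
        with open(os.path.join(MODELS_DIR, name + '.pkl'), 'wb') as handle:
            pickle.dump(self, handle)

    @classmethod
    def load(cls, name):
        """Symmetric counterpart: takes the same name that was given to save()."""
        with open(os.path.join(MODELS_DIR, name + '.pkl'), 'rb') as handle:
            return pickle.load(handle)
```

Both calls now take 'demo_model', and the path logic lives in exactly one place.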
SDV currently requires that a primary key be defined for every table in the meta. This is not actually necessary, and the requirement should be removed.
There are two loops in CPA that might not be optimized. This should be further investigated by timing the loops and optimized if possible.
sklearn is not in the requirements.txt, but is a required dependency.
Given a Copula model, there should be a function that converts its parameters into an array (flattens the model).
This week two of the dependencies of the project have released new versions. We should check that everything works fine with the new versions and change the project dependencies to the newer versions.
SDV version: 0.1.0
Python version: 3.6
Operating System: Fedora release 28 (Twenty Eight)
I was trying to use the univariate KDE. To do that, I tried to set the distribution parameter in the sdv.Modeler constructor to copulas.univariate.KDEUnivariate. The fitted modeler still uses copulas.univariate.GaussianUnivariate.
I ran the following code:
import pandas as pd
import numpy as np
from copulas.univariate import KDEUnivariate
from copulas.univariate import GaussianUnivariate
from copulas.multivariate import VineCopula
from copulas.multivariate import GaussianMultivariate
from copulas.multivariate.tree import TreeTypes
from sdv import Sampler
from sdv import Modeler
from sdv import CSVDataLoader
from functools import partial
data_loader = CSVDataLoader('boston.json')
dn = data_loader.load_data()
dn.transform_data()
modeler = Modeler(dn, distribution=KDEUnivariate)
modeler.model_database()
sampler = Sampler(dn, modeler)
I checked the distribution of the TAX feature: in the synthetic data it follows a gaussian distribution, while in the original data it wasn't gaussian. To check that, I looked into both the modeler and the following KDE plots:
If you want to run the code, you can use the annexed CSV and JSON files.