sdv-dev / sdgym
Benchmarking synthetic data generation methods.
License: Other
Thank you very much for your prompt response for intrusion dataset. I was wondering if you have any script to generate the JSON structure file and npy file for any given dataset. Were the JSON files of the datasets used for benchmarking hand-annotated? It would be great if you can share the script if you used one, it will help in benchmarking various datasets. Thank you.
The demo PrivBNSynthesizer is looking for a file privbayes/privBayes.bin at https://github.com/DAI-Lab/SDGym/blob/aa2b82b2a68e9d0391ea67704b4b058cad867512/sdgym/synthesizers/privbn.py#L21, but in my local install (via pip install sdgym in a conda env on macOS), this file does not exist. Is it supposed to be generated during the install, or do I need to take an extra step to get this file?
SDGym's synthesizers all inherit from the Baseline class (or the BaseSynthesizer class in previous versions), while users provide custom synthesizers as plain functions. Inheritance is used for convenience throughout SDGym's code base and has other benefits as well. My suggestion would be to have user-provided synthesizers inherit from the same base class and implement a fit and a sample method. These changes provide consistency between SDGym's native and user-provided synthesizers and a clear distinction between fit and sample logic, at nearly no cost:
def synthesizer_function(real_data: dict[str, pandas.DataFrame],
                         metadata: sdv.Metadata) -> dict[str, pandas.DataFrame]:
    ...
    # do all necessary steps to learn from the real data
    # and produce new synthetic data that resembles it
    ...
    return synthetic_data
will become
from sdgym.synthesizers.base import Baseline


class MySynthesizer(Baseline):

    def fit(self, real_data: dict[str, pandas.DataFrame], metadata: sdv.Metadata) -> None:
        # ...
        # do all necessary steps to learn from the real data
        # ...
        ...

    def sample(self, n_samples: int) -> dict[str, pandas.DataFrame]:
        # and produce new synthetic data that resembles it
        return synthetic_data
More interestingly, this structure makes it possible to capture metrics that are currently out of reach, such as fit and sampling time and complexity (through time measurements or maybe even this package). SDGym would then be able to benchmark this aspect of a synthesizer as well, which can be an important criterion when deciding which synthesizer is best for a given use case: if the user expects to sample large quantities of data, a longer fitting time would be acceptable in exchange for a lower sampling complexity.
The code that needs to be changed for this is minimal, however I wanted to make sure you see value in this point before drafting a PR.
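The timing idea described in this proposal could look roughly like the sketch below. It is only an illustration: DummySynthesizer and timed_fit_sample are hypothetical names that are not part of SDGym, and the real base class and its signatures may differ.

```python
import time


class DummySynthesizer:
    """Hypothetical stand-in for a synthesizer exposing fit and sample."""

    def fit(self, real_data, metadata=None):
        # "Learn" by simply remembering the rows.
        self._rows = list(real_data)

    def sample(self, n_samples):
        # Repeat the remembered rows until n_samples rows are produced.
        return [self._rows[i % len(self._rows)] for i in range(n_samples)]


def timed_fit_sample(synthesizer, real_data, n_samples, metadata=None):
    """Run fit and sample separately and report the time spent in each."""
    start = time.perf_counter()
    synthesizer.fit(real_data, metadata)
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    synthetic = synthesizer.sample(n_samples)
    sample_time = time.perf_counter() - start

    return synthetic, {'fit_time': fit_time, 'sample_time': sample_time}


synthetic, timings = timed_fit_sample(DummySynthesizer(), [(1, 'a'), (2, 'b')], 5)
print(len(synthetic), sorted(timings))
```

With a single function that both fits and samples, only the total time can be observed. Splitting the two calls is what makes the separate measurements possible.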
I don't understand why this is happening, since I followed the instructions in the README file.
Any suggestions?
https://colab.research.google.com/drive/1sBFKvFy5D_ssGtIVMmm6QiKl8H7NXEBm
# evaluating performance of built-in synthesizer
synthesizer = IndependentSynthesizer()
benchmark(synthesizer.fit_sample)
/usr/local/lib/python3.6/dist-packages/sklearn/neural_network/multilayer_perceptron.py:566: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
% self.max_iter, ConvergenceWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/neural_network/multilayer_perceptron.py:566: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
% self.max_iter, ConvergenceWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/neural_network/multilayer_perceptron.py:566: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
% self.max_iter, ConvergenceWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/neural_network/multilayer_perceptron.py:566: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
% self.max_iter, ConvergenceWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
'precision', 'predicted', average, warn_for)
/usr/local/lib/python3.6/dist-packages/sklearn/neural_network/multilayer_perceptron.py:566: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
% self.max_iter, ConvergenceWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
'precision', 'predicted', average, warn_for)
/usr/local/lib/python3.6/dist-packages/sklearn/neural_network/multilayer_perceptron.py:566: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
% self.max_iter, ConvergenceWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/neural_network/multilayer_perceptron.py:566: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
% self.max_iter, ConvergenceWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
/usr/local/lib/python3.6/dist-packages/sklearn/neural_network/multilayer_perceptron.py:566: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
% self.max_iter, ConvergenceWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
/usr/local/lib/python3.6/dist-packages/sklearn/neural_network/multilayer_perceptron.py:566: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
% self.max_iter, ConvergenceWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
Hi, I am new to PyTorch. Some layers behave differently during training and evaluation (like BatchNorm and Dropout), but I can't see the model's train() or eval() mode being set in the synthesizers' fit or sample methods. Why aren't train mode and eval mode used?
Thanks very much for open-sourcing the code.
While running the script provided in the benchmarking documentation as follows:
from sdgym import benchmark
from sdgym.synthesizers import CTGANSynthesizer
leaderboard = benchmark(synthesizers=CTGANSynthesizer)
I encountered the following ConvergenceWarning:
ConvergenceWarning: Initialization 1 did not converge. Try different init parameters, or increase max_iter, tol or check for degenerate data.
% (init + 1), ConvergenceWarning)
Could you please tell me whether it is OK to ignore the ConvergenceWarning above, or give me some other suggestions?
Thank you in advance!
On a testing dataset of mine the loss_g keeps increasing.
Also loss_mean and loss_std do not seem to converge even after hundreds of epochs
I am trying to reproduce all results reported in the CTGAN paper.
However, I cannot fully reproduce them.
To reproduce them, I follow the demo:
import sdgym
from sdv.tabular import GaussianCopula, CTGAN
from sdgym.synthesizers import (
    CLBN, CopulaGAN, CTGAN, Identity, Independent,
    MedGAN)
scores = sdgym.run(synthesizers=CTGAN, datasets=['asia'])
scores = sdgym.run(synthesizers=Identity, datasets=['credit'])
When generating data, the class LegacySingleTableBaseline transforms labels into numbers. However, after the transformation, the columns are not rearranged into their original order. This makes the model apply one-hot encoding to continuous columns and Gaussian mixture models to categorical columns during training.
I think you could fix the bug at line 131 of synthesizers/base.py by changing model_data = ht.transform(real_data) to model_data = ht.transform(real_data)[columns].
That should solve the bug.
Thanks
The leaderboard could be extended to keep track of the execution times of the synthesizers and to report the variation of the measurements (across the three iterations that are run by default).
To some users, the execution time is an informative measure for evaluating the feasibility of using a synthesizer under resource limitations (and it is also interesting in general). Reporting the variation of the measurements is necessary to be able to compare synthesizers.
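A rough sketch of what reporting the time and its variation could look like is given below. The function names are hypothetical and this is not SDGym code. The sample standard deviation is used here as one possible measure of variation, which is an assumption.

```python
import statistics
import time


def time_iterations(func, iterations=3):
    """Call func repeatedly and return the elapsed seconds of each call."""
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        func()
        times.append(time.perf_counter() - start)
    return times


def summarize(times):
    """Return the mean and the sample standard deviation of the measurements."""
    spread = statistics.stdev(times) if len(times) > 1 else 0.0
    return {'mean': statistics.mean(times), 'std': spread}


# A cheap stand-in workload instead of a real synthesizer run.
times = time_iterations(lambda: sum(range(10000)), iterations=3)
print(summarize(times))
```

With a single iteration the standard deviation is undefined, so the sketch falls back to 0.0 in that case.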
I tried to import SDGym into a Jupyter notebook for the first time after installing it.
Error message: "AttributeError: module 'sdmetrics' has no attribute 'single_table'"
sdmetrics is fully installed at the latest version.
Hello DAI-Lab, I have recently been researching data synthesis and I found this amazing project.
I got an error when running the sdgym.benchmark() function.
After tracing through the code, I think the variable synthesized should be the synthesized data produced by synthesizer.sample().
So we can solve this problem by replacing line 37:
synthesized = synthesizer(train, categoricals, ordinals)
with
synthesizer.fit(train, categoricals, ordinals)
synthesized = synthesizer.sample(train.shape[0])
There are multiple synthesizers implemented based on some articles that are difficult to find. Is it possible to add a section with these references?
Traceback (most recent call last):
File "", line 1, in
File "/home/zhengxr/anaconda3/lib/python3.7/site-packages/sdgym/benchmark.py", line 38, in benchmark
scores = evaluate(train, test, synthesized, meta)
File "/home/zhengxr/anaconda3/lib/python3.7/site-packages/sdgym/evaluate.py", line 366, in evaluate
performance = evaluator(synthesized_data, test, metadata)
File "/home/zhengxr/anaconda3/lib/python3.7/site-packages/sdgym/evaluate.py", line 297, in _evaluate_bayesian_likelihood
structure_json = json.dumps(metadata['structure'])
KeyError: 'structure'
Add normalized score to benchmark results, in addition to the raw metric score
Currently our code is being validated only by 'vanilla' flake8 and just a few plugins. We would like to increase the code style checks by adding more add-ons that follow our code style and our standards.
Also, we would like to ensure that our docstrings are properly written and follow the rest of our format.
We have already performed this task on RDT, more precisely in the following issue:
sdv-dev/RDT#248 (comment)
We need to add the pydocstyle plugin with the following lines in our setup.cfg file, as we are following the Google convention.
[pydocstyle]
convention = google
add-ignore = D107, D407, D417
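For reference, a docstring written in the Google convention that this configuration selects could look like the sketch below. The function clip_score is a made-up example, not SDGym code, and the exact layout of the argument descriptions is one common variant rather than the project's confirmed style.

```python
def clip_score(score, lower=0.0, upper=1.0):
    """Clip a score into the closed interval [lower, upper].

    Args:
        score (float):
            Raw metric value to clip.
        lower (float):
            Smallest value allowed. Defaults to ``0.0``.
        upper (float):
            Largest value allowed. Defaults to ``1.0``.

    Returns:
        float:
            The score, limited to the given interval.

    Raises:
        ValueError:
            If ``lower`` is greater than ``upper``.
    """
    if lower > upper:
        raise ValueError('lower must not be greater than upper')
    return max(lower, min(upper, score))


print(clip_score(1.7))
```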
Flake8 comes with a lot of different add-ons that we can use to adapt it to our code style and checks. Here is a list of plugins that I found to be interesting for us:
flake8-builtins - Check for python builtins being used as variables or parameters.
flake8-comprehensions - Helps you write better list/set/dict comprehensions.
flake8-debugger - Debug statement checker.
flake8-variables-names - Extension that helps to make more readable variables names.
Dlint - Tool for encouraging best coding practices and helping ensure Python code is secure.
flake8-mock - Provides checking mock non-existent methods.
flake8-fixme - Check for FIXME, TODO and other temporary developer notes.
flake8-eradicate - Plugin to find commented out or dead code.
flake8-mutable - Extension for mutable default arguments.
flake8-print - Check for print statements in python files.
flake8-pytest-style - Plugin checking common style issues or inconsistencies.
flake8-quotes - Extension for checking quotes in python.
flake8-multiline-containers - Plugin to ensure a consistent format for multiline containers.
pandas-vet - Plugin that provides opinionated linting for pandas code.
pep8-naming - Check the PEP-8 naming conventions.
flake8-expression-complexity - Plugin to validate expressions complexity.
flake8-sfs - String formatting.
Hello, I see that the functions in sdgym/synthesizers are written in PyTorch.
I usually use TensorFlow.
Is this package also applicable to TensorFlow code?
And how can I save the learned model?
Thanks
There is no documentation on how to add new datasets in SDGym. Please add documentation for the method to add new datasets in SDGym.
Add a function and CLI command to list the available synthesizer names.
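One possible shape for such a listing function is sketched below. Everything in it is hypothetical: the class names are placeholders and SDGym's real class hierarchy and public API may differ. The sketch assumes synthesizers can be discovered as subclasses of a common base class.

```python
class BaseSynthesizer:
    """Hypothetical base class used only for this illustration."""


class IndependentSynthesizer(BaseSynthesizer):
    pass


class UniformSynthesizer(BaseSynthesizer):
    pass


class CustomUniform(UniformSynthesizer):
    pass


def list_synthesizer_names(base=BaseSynthesizer):
    """Return the sorted names of all direct and indirect subclasses of base."""
    names = set()
    pending = list(base.__subclasses__())
    while pending:
        cls = pending.pop()
        names.add(cls.__name__)
        # Also visit subclasses of subclasses.
        pending.extend(cls.__subclasses__())
    return sorted(names)


print(list_synthesizer_names())
```

A CLI command could then simply print one name per line from the returned list.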
Thanks a lot for sharing the code. I tried to run the benchmark for VEEGAN with the credit dataset, but I got the following error. I couldn't find where the in-place operation happens. Any ideas?
Error computing scores for VEEGANSynthesizer on dataset credit - iteration 0
Traceback (most recent call last):
File "<stdin>", line 8, in compute_benchmark
File "/Users/zhaozilong/Documents/SDGym/sdgym/synthesizers/base.py", line 17, in fit_sample
self.fit(data, categorical_columns, ordinal_columns)
File "/Users/zhaozilong/Documents/SDGym/sdgym/synthesizers/veegan.py", line 148, in fit
loss_g.backward(retain_graph=True)
File "/anaconda3/envs/pytorch0.3/lib/python3.6/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/anaconda3/envs/pytorch0.3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [128, 1]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
An error shows up when Tablegan is benchmarked on the intrusion dataset.
This is the traceback:
2020-05-11 22:09:26,240 - INFO - base - Sampling TableganSynthesizer
2020-05-11 22:09:28,757 - ERROR - benchmark - Error computing scores for TableganSynthesizer on dataset intrusion - iteration 0
Traceback (most recent call last):
File "/home/xals/Projects/SDGym/sdgym/benchmark.py", line 71, in compute_benchmark
scores = compute_scores(train, test, synthesized, meta)
File "/home/xals/Projects/SDGym/sdgym/evaluate.py", line 358, in compute_scores
scores = evaluator(synthesized_data, test, metadata)
File "/home/xals/Projects/SDGym/sdgym/evaluate.py", line 162, in _evaluate_multi_classification
x_train, y_train, x_test, y_test, classifiers = _prepare_ml_problem(train, test, metadata)
File "/home/xals/Projects/SDGym/sdgym/evaluate.py", line 143, in _prepare_ml_problem
x_train, y_train = fm.make_features(train)
File "/home/xals/Projects/SDGym/sdgym/evaluate.py", line 132, in make_features
feature = encoder.fit_transform(col)
File "/home/xals/.virtualenvs/SDGym/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 631, in fit_transform
return self.fit(X).transform(X)
File "/home/xals/.virtualenvs/SDGym/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 493, in fit
self._fit(X, handle_unknown=self.handle_unknown)
File "/home/xals/.virtualenvs/SDGym/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 80, in _fit
X_list, n_samples, n_features = self._check_X(X)
File "/home/xals/.virtualenvs/SDGym/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 49, in _check_X
X_temp = check_array(X, dtype=None)
File "/home/xals/.virtualenvs/SDGym/lib/python3.6/site-packages/sklearn/utils/validation.py", line 542, in check_array
allow_nan=force_all_finite == 'allow-nan')
File "/home/xals/.virtualenvs/SDGym/lib/python3.6/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
IndependentSynthesizer raises a ValueError with the intrusion dataset.
What I did:
In [1]: from sdgym import benchmark
In [2]: from sdgym.synthesizers import IndependentSynthesizer, MedganSynthesizer, VEEGANSynthesizer
In [3]: independent = IndependentSynthesizer()
In [4]: benchmark(independent.fit_sample, datasets=['intrusion'])
INFO - Evaluating dataset intrusion
INFO - Fitting IndependentSynthesizer
INFO - Sampling IndependentSynthesizer
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-00ab287a3ba6> in <module>
----> 1 benchmark(independent.fit_sample, datasets=['intrusion'])
~/Projects/SDGym/sdgym/benchmark.py in benchmark(synthesizer, datasets, repeat)
35
36 for iteration in range(repeat):
---> 37 synthesized = synthesizer(train, categoricals, ordinals)
38 scores = evaluate(train, test, synthesized, meta)
39 scores['dataset'] = name
~/Projects/SDGym/sdgym/synthesizers/base.py in fit_sample(self, data, categorical_columns, ordinal_columns)
18
19 LOGGER.info("Sampling %s", self.__class__.__name__)
---> 20 return self.sample(data.shape[0])
~/Projects/SDGym/sdgym/synthesizers/independent.py in sample(self, samples)
38 data[:, i] = data[:, i].clip(info['min'], info['max'])
39 else:
---> 40 data[:, i] = np.random.choice(np.arange(info['size']), samples, p=self.models[i])
41
42 return data
mtrand.pyx in numpy.random.mtrand.RandomState.choice()
ValueError: 'a' and 'p' must have same size
The problem is that sample is using the size stored in info instead of the actual size of the corresponding entry in models.
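The mismatch can be avoided by deriving the number of categories from the probability vector itself. The sketch below shows the idea with the standard library's random.choices standing in for np.random.choice. The function name and the example values are made up for illustration, and this is not necessarily how the problem was resolved in SDGym.

```python
import random


def sample_categorical(probabilities, n_samples, rng=None):
    """Draw category indices 0..len(probabilities)-1 with the given weights.

    The number of categories comes from the probability vector itself, so the
    population and the weights can never disagree in length.
    """
    rng = rng or random.Random()
    categories = range(len(probabilities))
    return rng.choices(categories, weights=probabilities, k=n_samples)


# The metadata claims 4 categories, but only 3 were seen when fitting.
info = {'size': 4}
model = [0.5, 0.3, 0.2]

draws = sample_categorical(model, 10, rng=random.Random(0))
print(len(model) == info['size'], draws)
```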
There are some typos in the docstring of the sdgym.run function.
This is the exact URL where the typos are -> Link
show_progress defaults to False, but the docstring says that it defaults to True.
iterations defaults to 1, but the docstring says that it defaults to 3.
Hi, I'm having trouble running the code because of the call gm.fit(data[:, id_].reshape([-1, 1])) in the fit method of the BGMTransformer class, line 310.
It seems that the indexing in data[:, id_].reshape([-1, 1]) leads to the error '(slice(None, None, None), 2)' is an invalid key. I tried other methods, like using loc or iloc, which work from the terminal, but I get the same error when I change the call in the script. I don't know what to do now.
I loaded my own dataset and tried to run it, but I can't because of this error. This is what I am doing:
from sdgym.synthesizers.tvae import TVAESynthesizer
tvae = TVAESynthesizer()
tvae.fit(my_data, my_discrete_columns, my_ordinal_columns)
Add the project scaffolding from DAI-Lab cookiecutter
Similarly to what is described in #80, it should be possible to store the cache contents in an S3 bucket.
The behavior would be similar to the results path, where one can specify the s3:// prefix in the cache_dir path specification to trigger the S3 storage.
Python
sdgym.run(..., cache_dir='s3://my-bucket/path/to/my/cache/dir')
CLI
sdgym run ... -c s3://my-bucket/path/to/my/cache/dir
The collect command introduced in PR #78 should also be adapted to read the cache contents from S3 and store the resulting CSV file in S3.
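Detecting the s3:// prefix and splitting the path into a bucket and a key could be done along these lines. This is a minimal sketch using only the standard library. parse_cache_dir is a hypothetical helper, not an existing SDGym function, and the actual upload and download logic is left out.

```python
from urllib.parse import urlparse


def parse_cache_dir(cache_dir):
    """Split a cache_dir into (storage, bucket, path).

    Paths starting with s3:// are treated as S3 locations. Anything else is
    treated as a local path and returned unchanged.
    """
    parsed = urlparse(cache_dir)
    if parsed.scheme == 's3':
        return 's3', parsed.netloc, parsed.path.lstrip('/')
    return 'local', None, cache_dir


print(parse_cache_dir('s3://my-bucket/path/to/my/cache/dir'))
print(parse_cache_dir('/tmp/sdgym-cache'))
```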
I ran my synthesizer within the benchmark and all datasets ran fine except for: census, covtype, credit, intrusion and mnist12 where I get 'segmentation fault'
This is my wrapper function:
def ReplicasSynthesizer(real_data, categorical_columns, ordinal_columns):
    print("categorical columns:")
    print(categorical_columns)
    print("ordinal columns:")
    print(ordinal_columns)
    print(real_data.shape)
    print(real_data[0])
    df = pd.DataFrame(real_data)
    print("the columns are:")
    print(df.columns)
    df.dropna(axis=0, how='any', inplace=True)
    syn_df = synthesis_lib.synthesize(df)
    syn_np = df.to_numpy()
    return syn_np


leaderboard = sdgym.run(synthesizers=ReplicasSynthesizer, datasets=['mnist28'])
print(leaderboard)
Running on mnist28 dataset:
sh-4.2$ python replicas_wrapper.py
categorical columns:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 
421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636, 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726, 727, 728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753, 754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766, 767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779, 780, 781, 782, 783, 784]
ordinal columns:
[]
(60000, 785)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 3]
the columns are:
RangeIndex(start=0, stop=785, step=1)
Segmentation fault
In issue #79 we added a way to collect results from a collection of intermediate cached results to a single scores CSV file.
It should also be possible to read the intermediate cached results from an S3 bucket, and to store the results CSV in an S3 bucket.
Python:
sdgym.collect.collect_results(input_path='s3://my-bucket/path/to/results', output_file='s3://my-bucket/path/to/my/results.csv')
CLI
sdgym collect ... -i s3://my-bucket/path/to/results -o s3://my-bucket/path/to/my/results.csv
If the bucket is private, the AWS key and secret introduced in PR #74 should be used.
MedganSynthesizer and VEEGANSynthesizer raise IndexError with the intrusion dataset.
What I did:
In [1]: from sdgym import benchmark
In [2]: from sdgym.synthesizers import MedganSynthesizer, VEEGANSynthesizer
In [3]: medgan = MedganSynthesizer()
In [4]: benchmark(medgan.fit_sample, datasets=['intrusion'])
INFO - Evaluating dataset intrusion
INFO - Fitting MedganSynthesizer
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-4-fe34e4587459> in <module>
----> 1 benchmark(medgan.fit_sample, datasets=['intrusion'])
~/Projects/SDGym/sdgym/benchmark.py in benchmark(synthesizer, datasets, repeat)
35
36 for iteration in range(repeat):
---> 37 synthesized = synthesizer(train, categoricals, ordinals)
38 scores = evaluate(train, test, synthesized, meta)
39 scores['dataset'] = name
~/Projects/SDGym/sdgym/synthesizers/base.py in fit_sample(self, data, categorical_columns, ordinal_columns)
15 def fit_sample(self, data, categorical_columns=tuple(), ordinal_columns=tuple()):
16 LOGGER.info("Fitting %s", self.__class__.__name__)
---> 17 self.fit(data, categorical_columns, ordinal_columns)
18
19 LOGGER.info("Sampling %s", self.__class__.__name__)
~/Projects/SDGym/sdgym/synthesizers/medgan.py in fit(self, data, categorical_columns, ordinal_columns)
155 self.transformer = GeneralTransformer()
156 self.transformer.fit(data, categorical_columns, ordinal_columns)
--> 157 data = self.transformer.transform(data)
158 dataset = TensorDataset(torch.from_numpy(data.astype('float32')).to(self.device))
159 loader = DataLoader(dataset, batch_size=self.batch_size, shuffle=True, drop_last=True)
~/Projects/SDGym/sdgym/synthesizers/utils.py in transform(self, data)
151 else:
152 col_t = np.zeros([len(data), info['size']])
--> 153 col_t[np.arange(len(data)), col.astype('int32')] = 1
154 data_t.append(col_t)
155 self.output_info.append((info['size'], 'softmax'))
IndexError: index 65 is out of bounds for axis 1 with size 64
In [5]: veegan = VEEGANSynthesizer()
In [6]: benchmark(veegan.fit_sample, datasets=['intrusion'])
INFO - Evaluating dataset intrusion
INFO - Fitting VEEGANSynthesizer
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-6-2c2f4cbda073> in <module>
----> 1 benchmark(veegan.fit_sample, datasets=['intrusion'])
~/Projects/SDGym/sdgym/benchmark.py in benchmark(synthesizer, datasets, repeat)
35
36 for iteration in range(repeat):
---> 37 synthesized = synthesizer(train, categoricals, ordinals)
38 scores = evaluate(train, test, synthesized, meta)
39 scores['dataset'] = name
~/Projects/SDGym/sdgym/synthesizers/base.py in fit_sample(self, data, categorical_columns, ordinal_columns)
15 def fit_sample(self, data, categorical_columns=tuple(), ordinal_columns=tuple()):
16 LOGGER.info("Fitting %s", self.__class__.__name__)
---> 17 self.fit(data, categorical_columns, ordinal_columns)
18
19 LOGGER.info("Sampling %s", self.__class__.__name__)
~/Projects/SDGym/sdgym/synthesizers/veegan.py in fit(self, train_data, categorical_columns, ordinal_columns)
106 self.transformer = GeneralTransformer(act='tanh')
107 self.transformer.fit(train_data, categorical_columns, ordinal_columns)
--> 108 train_data = self.transformer.transform(train_data)
109 dataset = TensorDataset(torch.from_numpy(train_data.astype('float32')).to(self.device))
110 loader = DataLoader(dataset, batch_size=self.batch_size, shuffle=True, drop_last=True)
~/Projects/SDGym/sdgym/synthesizers/utils.py in transform(self, data)
151 else:
152 col_t = np.zeros([len(data), info['size']])
--> 153 col_t[np.arange(len(data)), col.astype('int32')] = 1
154 data_t.append(col_t)
155 self.output_info.append((info['size'], 'softmax'))
IndexError: index 65 is out of bounds for axis 1 with size 64
col_t is being indexed by the raw values of the column instead of by the index of each value among the column's categories.
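One way to avoid indexing by raw values is to first map every distinct value to a contiguous position and then one-hot encode using that position. The pure-Python sketch below illustrates the idea. The helper names are made up, and this is not necessarily the fix that was applied in SDGym.

```python
def fit_categories(column):
    """Map each distinct value to a contiguous index, in sorted order."""
    return {value: index for index, value in enumerate(sorted(set(column)))}


def one_hot(column, categories):
    """One-hot encode a column using the value-to-index mapping."""
    encoded = []
    for value in column:
        row = [0] * len(categories)
        row[categories[value]] = 1
        encoded.append(row)
    return encoded


# The values are sparse codes: the largest value (65) exceeds the number of
# distinct categories (3), so indexing by raw value would go out of bounds.
column = [3, 65, 3, 10]
categories = fit_categories(column)
print(categories)
print(one_hot(column, categories))
```

The mapping has to be stored at fit time so that the same positions can be reused when transforming and when converting samples back to the original values.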
The benchmark function requires a my_synthesizer_function, which takes the real data and the categorical and ordinal features as input and outputs the synthesized data. However, the documentation provided is not sufficient for a novice like me, so I am having trouble implementing it. Moreover, the benchmark function appears to take its data from the predefined default datasets, each of which has its own metadata file stored on a server in JSON format. This does not allow me to benchmark on my own data, as I don't have metadata ready for my datasets; there are quite a few of them and they are large.
Any detailed documentation on how to use this benchmark function more efficiently would be helpful.
Thanks a lot for such a beautiful package.
I am new to this domain.
Docs build got broken after Sphinx 3.0.0 was released.
The main dependencies are all already capped to the next major (minor if 0.x) to improve robustness against API changes, but development and test versions are not.
Let's cap them as well.
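The capping rule mentioned above (next major, or next minor for 0.x versions) can be written down as a small helper. This is only a sketch: the function names are hypothetical and the package names and versions in the example calls are placeholders chosen for illustration, not the project's actual pins.

```python
def upper_cap(version):
    """Return the exclusive upper bound for a dependency version.

    Versions 1.x.y and above are capped at the next major. Versions 0.x.y are
    capped at the next minor.
    """
    parts = [int(part) for part in version.split('.')]
    major = parts[0]
    minor = parts[1] if len(parts) > 1 else 0
    if major == 0:
        return '0.{}'.format(minor + 1)
    return str(major + 1)


def capped_requirement(name, version):
    """Build a requirement string with a lower bound and an upper cap."""
    return '{}>={},<{}'.format(name, version, upper_cap(version))


# Illustrative inputs only.
print(capped_requirement('Sphinx', '2.4.4'))
print(capped_requirement('example-tool', '0.4.1'))
```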
Hi, CTGANSynthesizer is not working.
https://github.com/DAI-Lab/SDGym/blob/master/sdgym/synthesizers/ctgan.py
https://github.com/DAI-Lab/SDGym/blob/master/sdgym/synthesizers/utils.py
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-36-afe47608990b> in <module>
3 synthesizer.fit(train_data = data2.values ,
4 categorical_columns = obj_col2,
----> 5 ordinal_columns = ord_col2 , )
6 print(synthesizer.sample(100).shape)
~/anaconda3/envs/py36/lib/python3.6/site-packages/sdgym/synthesizers/ctgan.py in fit(self, train_data, categorical_columns, ordinal_columns)
291 self.transformer = BGMTransformer()
292 self.transformer.fit(train_data, categorical_columns, ordinal_columns)
--> 293 train_data = self.transformer.transform(train_data)
294
295 data_sampler = Sampler(train_data, self.transformer.output_info)
~/anaconda3/envs/py36/lib/python3.6/site-packages/sdgym/synthesizers/utils.py in transform(self, data)
351 else:
352 col_t = np.zeros([len(data), info['size']])
--> 353 col_t[np.arange(len(data)), current.astype('int32')] = 1
354 values.append(col_t)
355
IndexError: index 31 is out of bounds for axis 1 with size 31
So I checked this.
## id 2 / info size 31
col_t = np.zeros([len( data2.values), 31])
print(col_t.shape)
current = data2.values[:, 2]
print(current.astype("int32").shape)
col_t[np.arange(len(data2.values)), current.astype('int32')] = 1
(10269, 31)
(10269,)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-54-e6b983069c50> in <module>
3 current = data2.values[:, 2]
4 print(current.astype("int32").shape)
----> 5 col_t[np.arange(len(data2.values)), current.astype('int32')] = 1
IndexError: index 31 is out of bounds for axis 1 with size 31
The transformer does current = data[:, id_], so current is a 1-D array holding the raw column values, not the positions of those values among the unique values of data2[:, id].
That is why I get an IndexError.
How should I solve it?
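One possible workaround, sketched here under the assumption that the categorical column holds arbitrary integer codes rather than values in the range 0 to size - 1, is to remap the values to consecutive indices before passing the data in. This is an illustration with made-up values, not the fix adopted in SDGym.

```python
import numpy as np

# Hypothetical categorical column: 3 distinct categories whose codes are not 0..2.
current = np.array([1, 3, 31, 3, 1])

# np.unique returns the sorted categories and, for each row, the position
# of its value in that sorted list. Those positions are valid column indices.
categories, codes = np.unique(current, return_inverse=True)

col_t = np.zeros([len(current), len(categories)])
col_t[np.arange(len(current)), codes] = 1

# The original values can be recovered from the one-hot rows.
recovered = categories[col_t.argmax(axis=1)]
```

Using `current` directly as the column index would fail here for the same reason as in the traceback: the value 31 is not a valid index into an axis of size 3.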
The latest versions of compress-pickle and humanfriendly are not supported by SDGym:
Library | Upper bound (unsupported) | Latest release |
---|---|---|
pandas | 1.1.5 | 1.3.1 |
pomegranate | 0.14.2 | 0.14.5 |
compress-pickle | 2 | 2.01 |
humanfriendly | 9 | 9.2 |
We should investigate why and update the code if necessary to support them.
I've been using your nicely implemented TableGAN and TGAN for some training, but after 10-15 epochs my loss becomes NaN.
epoch 12 step 1200
tensor(0.0802, device='cuda:0', grad_fn=<SubBackward0>)
tensor(6.9029, device='cuda:0', grad_fn=<NegBackward>) None
epoch 12 step 1250
tensor(0.1213, device='cuda:0', grad_fn=<SubBackward0>)
tensor(5.7503, device='cuda:0', grad_fn=<NegBackward>) None
epoch 12 step 1300
tensor(0.2494, device='cuda:0', grad_fn=<SubBackward0>)
tensor(3.1711, device='cuda:0', grad_fn=<NegBackward>) None
epoch 12 step 1350
tensor(0.1131, device='cuda:0', grad_fn=<SubBackward0>)
tensor(8.1661, device='cuda:0', grad_fn=<NegBackward>) None
epoch 12 step 1400
tensor(0.1474, device='cuda:0', grad_fn=<SubBackward0>)
tensor(6.2397, device='cuda:0', grad_fn=<NegBackward>) None
epoch 12 step 1450
tensor(nan, device='cuda:0', grad_fn=<SubBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NegBackward>) None
epoch 12 step 1500
tensor(nan, device='cuda:0', grad_fn=<SubBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NegBackward>) None
It seems to be possibly related to the data definition given as input, where some of the target domains were not covered by the data, for example when taking a subset. That seems to trigger the effect earlier. But at the moment I am not doing that, and I still get the NaNs after about half an hour of training, which seems weird. It's not visible in the text, but up to the NaN point the convergence looks like a normal GAN curve, first dipping and then steadily rising.
I'm testing with your toy datasets now to see if the issue persists.
Is this something you have encountered in your testing? Any thoughts?
Current PrivBNSynthesizer implementation has two fixed arguments: the theta value passed to the underlying privbn binary and the maximum number of samples being used from the training data.
These two values should be passed as optional parameters.
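As a rough illustration of the request, the two values could become keyword arguments on the constructor. The class below is a hypothetical stand-in, not the SDGym class, and the default values are placeholders chosen for the example since the issue does not state the current fixed values.

```python
import random


class ConfigurablePrivBN:
    """Hypothetical stand-in showing theta and max_samples as optional parameters."""

    def __init__(self, theta=20, max_samples=25000):
        # Both defaults are placeholders, not the values hard-coded in SDGym.
        if theta <= 0:
            raise ValueError("theta must be positive")
        if max_samples <= 0:
            raise ValueError("max_samples must be positive")
        self.theta = theta
        self.max_samples = max_samples

    def subsample(self, rows, seed=0):
        """Return at most max_samples rows from the training data."""
        rows = list(rows)
        if len(rows) <= self.max_samples:
            return rows
        return random.Random(seed).sample(rows, self.max_samples)


synthesizer = ConfigurablePrivBN(theta=5, max_samples=3)
subset = synthesizer.subsample(range(10))
```

Existing callers that pass no arguments would keep the previous behavior as long as the defaults match the values that are fixed today.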
I think I found an error in the inverse transformation of tableGAN.
The transformation is done with
In the following lines the inverse transformation is defined
https://github.com/DAI-Lab/SDGym/blob/c327ed6edf3f65270d5b339b3090e6745c3d21b8/sdgym/synthesizers/utils.py#L426-L429
The formula in this line is
However, it should be
This is the error message that we get while using the evaluation function to get the scores. The command that we use is: scores = evaluate(train, test, sampled, meta)
Synthesizer being used: UniformSynthesizer
This error is for IndependentSynthesizer
I'm facing some issues when I use a custom dataset. Can you help me with this?
I am trying to execute this code :
import sdgym
from sdgym import benchmark
from sdgym.synthesizers import (
CTGANSynthesizer, TVAESynthesizer)
all_synthesizers = [
CTGANSynthesizer,
TVAESynthesizer
]
scores = benchmark(synthesizers=all_synthesizers)
Is it normal that this takes a lot of time? Do you have any tips to make it faster?
And how can I add another dataset? For example, if I have data.csv in my directory, how can I pass it to the benchmark function? Or how can I use only the real datasets (from your examples)?
How can I add another synthesizer to the benchmark function when the synthesizer is a Python package (e.g. DataSynthesizer)?
Thank you for your help!
Hi,
Thank you for making this repository available.
Can you please add a way to cite this repository, both the code and the paper, as the repo has moved on a bit from the Conditional Table GAN paper?
When benchmarking real datasets, sdgym now differentiates between continuous, categorical and ordinal data. The continuous data can be real or integer (e.g. Age). Currently, the synthesizers are allowed to produce real-valued features for integer-valued columns that are used as features to train the models. It would make sense to distinguish between these types and restrict the range that synthesized features for integer columns can take. @csala Would you agree that this change is an improvement to the benchmark? (If so, I can contribute a PR)
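One way to read this proposal is as a post-processing step that rounds synthesized values for integer columns and clips them to the range observed in the real data. The helper below is a sketch of that interpretation only; the function name is hypothetical and nothing here reflects an existing SDGym feature.

```python
def constrain_integer_column(synthetic_values, real_values):
    """Round synthesized values and clip them to the range seen in the real column."""
    low, high = min(real_values), max(real_values)
    return [min(max(int(round(value)), low), high) for value in synthetic_values]


# Hypothetical Age column: real ages span 18..90, the synthesizer emitted floats.
real_ages = [18, 25, 90]
synthetic_ages = [17.6, 42.2, 130.9, -3.0]
print(constrain_integer_column(synthetic_ages, real_ages))  # [18, 42, 90, 18]
```

Whether clipping to the observed range is desirable depends on the benchmark, since it prevents a synthesizer from producing plausible values outside the training range.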
The current SDGym implementation can produce a results table either as a DataFrame (when run from Python) or as a CSV file stored on the local disk.
It should also be possible to store the results in an S3 bucket, which would be triggered by passing an output_path that contains the S3 prefix:
Python:
sdgym.run(..., output_path='s3://my-bucket/path/to/my/results.csv')
CLI
sdgym run ... -o s3://my-bucket/path/to/my/results.csv
If the bucket is private, the AWS key and secret introduced in PR #74 should be used.
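Detecting the S3 prefix could be as simple as parsing the path. The helper below is a sketch using only the standard library; its name and return format are assumptions made for illustration, and the actual upload step is left out.

```python
from urllib.parse import urlparse


def parse_output_path(output_path):
    """Split an output path into an S3 bucket and key, or mark it as local."""
    parsed = urlparse(output_path)
    if parsed.scheme == "s3":
        return {
            "kind": "s3",
            "bucket": parsed.netloc,
            "key": parsed.path.lstrip("/"),
        }
    return {"kind": "local", "path": output_path}


target = parse_output_path("s3://my-bucket/path/to/my/results.csv")
print(target["bucket"], target["key"])
```

With the bucket and key separated, the results CSV could then be uploaded with whatever S3 client the project already uses, passing the AWS key and secret when the bucket is private.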
Hi guys, sorry for bothering you again!
I noticed the TGAN model was changed, removing the recurrent structure as well as the attention layer. I was wondering why these changes were made, since the original performed very well. Was it due to space/time limitations in this new setup? Or did you also see improved results with these changes?
The Travis CI tests are failing because of the following:
The job exceeded the maximum log length, and has been terminated.
Example: this job
Trying to just get started running benchmarks
import sdgym
from sdgym.synthesizers import (
CLBN, CopulaGAN, CTGAN, Identity, Independent,
MedGAN, PrivBN, TableGAN,
Uniform, VEEGAN)
all_synthesizers = [
CLBN,
CTGAN,
CopulaGAN,
Identity,
Independent,
MedGAN,
PrivBN,
TableGAN,
Uniform,
VEEGAN,
]
scores = sdgym.run(synthesizers=all_synthesizers)
Exception has occurred: ClientError (note: full exception trace is shown but execution is paused at: )
An error occurred (ExpiredToken) when calling the ListObjects operation: The provided token has expired.
File "C:\Users\crm0376\Projects\SDGym\sdgym\datasets.py", line 118, in get_available_datasets
response = s3.list_objects(Bucket=bucket or BUCKET)
File "C:\Users\crm0376\Projects\SDGym\sdgym\datasets.py", line 161, in get_dataset_paths
datasets = get_available_datasets()['name'].tolist()
File "C:\Users\crm0376\Projects\SDGym\sdgym\benchmark.py", line 321, in run
datasets = get_dataset_paths(datasets, datasets_path, bucket, aws_key, aws_secret)
File "C:\Users\crm0376\Projects\SDGym\testing_load.py", line 21, in (Current frame)
scores = sdgym.run(synthesizers=all_synthesizers)
When passing a json file as configuration for a multi-table synthesizer with more than one dataset, this ends up producing errors after evaluating the first dataset.
Having a json configuration file, named HMA1.json, with the following content:
{
    "name": "HMA1('gaussian', 'categorical_fuzzy')",
    "modalities": "multi-table",
    "synthesizer": "sdv.relational.HMA1",
    "init_kwargs": {
        "model_kwargs": {
            "default_distribution": "gaussian",
            "categorical_transformer": "categorical"
        },
        "metadata": "$metadata"
    },
    "fit_kwargs": {
        "tables": "$real_data"
    }
}
I run SDGym on multiple multi-table datasets:
sdgym run -s HMA1.json -d world_v1 trains_v1 -v
This produces the following error:
Traceback (most recent call last):
File "/home/work/Projects/SDV/SDGym/sdgym/benchmark.py", line 79, in _compute_scores
score = metric.compute(*metric_args)
File "/home/work/.virtualenvs/SDGym/lib/python3.8/site-packages/sdmetrics/multi_table/multi_single_table.py", line 102, in compute
return cls._compute(cls, real_data, synthetic_data, metadata, **kwargs)
File "/home/work/.virtualenvs/SDGym/lib/python3.8/site-packages/sdmetrics/multi_table/multi_single_table.py", line 62, in _compute
raise ValueError('`real_data` and `synthetic_data` must have the same tables')
The PrivBNSynthesizer requires an executable that needs to be compiled from C++ code.
The current code of PrivBNSynthesizer only checks for the existence of the binary file and raises an AssertionError if it is not found, which is hard for users to understand.
The code should be changed to provide a user-friendly message indicating that privbayes needs to be compiled before it can be used and pointing at the corresponding documentation.
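A sketch of what such a check might look like follows. The function name and the wording of the message are assumptions; the default path is the privbayes/privBayes.bin location mentioned in an earlier issue, and the pointer to the documentation is left generic because no specific page is named here.

```python
import os


def check_privbayes_binary(path="privbayes/privBayes.bin"):
    """Raise a descriptive error if the compiled privbayes binary is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"privbayes executable not found at '{path}'. "
            "PrivBNSynthesizer requires privbayes to be compiled from its C++ "
            "sources before it can be used. Please follow the compilation "
            "instructions in the SDGym documentation."
        )
    return path
```

Raising FileNotFoundError rather than relying on an assert also keeps the check active when Python runs with assertions disabled.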
SDGym can automatically determine how many workers to use, based on the available GPUs or CPUs on the current machine. If the machine has GPUs, use the number of GPUs as the workers value. Otherwise use the number of CPUs. The user can request to automatically detect the number of workers by passing in workers=-1.
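The detection rule described above could be sketched as follows. The function name is hypothetical, and using PyTorch to count GPUs is an assumption made for the example; if PyTorch is not installed, the sketch falls back to the CPU count.

```python
import os


def resolve_workers(workers):
    """Return the worker count, auto-detecting it when workers is -1."""
    if workers != -1:
        return workers

    try:
        import torch  # assumed GPU backend; optional in this sketch

        gpu_count = torch.cuda.device_count()
    except ImportError:
        gpu_count = 0

    if gpu_count > 0:
        return gpu_count

    # os.cpu_count() can return None on some platforms, so fall back to 1.
    return os.cpu_count() or 1


print(resolve_workers(-1))
```

Any explicit value other than -1 is returned unchanged, so existing calls keep their current behavior.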
Apart from producing a single dataframe or CSV file with the scores obtained by all the Synthesizers, SDGym has the option to store intermediate results, scores and error logs as the different tasks are run, which are kept inside the cache_dir. However, if the sdgym process is cut for some reason, there is no way to find all the intermediate results and put them together as a single CSV file again.
It would be interesting to have a collect_results function and an sdgym collect command that would do this job and allow producing a single scores CSV file from a collection of intermediate cached results.
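A minimal sketch of such a function is shown below. It assumes, purely for illustration, that each task leaves a CSV file whose name ends in _scores.csv somewhere under cache_dir; the real cache layout may differ, and the signature is hypothetical.

```python
import csv
from pathlib import Path


def collect_results(cache_dir, output_path):
    """Merge every cached *_scores.csv file under cache_dir into one CSV file."""
    rows, fieldnames = [], []
    for scores_file in sorted(Path(cache_dir).rglob("*_scores.csv")):
        with open(scores_file, newline="") as handle:
            reader = csv.DictReader(handle)
            # Keep the union of all columns, in first-seen order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)

    with open(output_path, "w", newline="") as handle:
        # Rows that lack a column present in other files get an empty cell.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

An sdgym collect command could then be a thin wrapper that passes the cache directory and output path from the command line to this function.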
Hi @csala, thank you very much for open-sourcing the package. The baseline-models are working on all the datasets except intrusion where the following error is raised-
IndexError: index 65 is out of bounds for axis 1 with size 64.
Can you please help me out with this error. Thank you.