Comments (4)
Hi @ThomasK1018, thanks for the detailed description. We were able to replicate your error. For future reference, I'm attaching the stack trace below.
Root Cause
We will investigate further, but in all likelihood this is related to #1691. The known bug is that HMA sometimes creates null values even when there aren't any in the real data. Usually this happens for float64 or string types in the child table. In this particular case, you have int64 columns (that are categorical). The int64 storage type does not support null values, hence the crash.
As a next step, the team can investigate more to confirm that the root cause is, indeed, #1691. If so, we will dedupe the issue. We can also do a better job at preventing the crash and instead just returning the null value (even though this is not quite correct). I've filed RDT issue 747 to keep track of this.
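To illustrate the int64 limitation mentioned above, here is a minimal sketch using plain pandas (not SDV internals). The helper name `cast_to_int64` is made up for this example; it only shows why a stray null value cannot be stored in an int64 column, and is not a reproduction of the actual HMA code path.

```python
import pandas as pd


def cast_to_int64(values):
    """Try to store values as int64; return (series, error_message)."""
    series = pd.Series(values, dtype="float64")
    try:
        return series.astype("int64"), None
    except ValueError as error:
        # pandas raises a ValueError (or a subclass of it) when a
        # missing value would have to be stored in an int64 column.
        return None, str(error)


# No nulls: the cast succeeds.
clean, clean_error = cast_to_int64([1.0, 2.0, 3.0])

# One unexpected null: the cast fails instead of producing a column.
with_null, null_error = cast_to_int64([1.0, float("nan"), 3.0])
print(null_error)

# The nullable extension dtype "Int64" (capital I) can hold missing values.
nullable = pd.Series([1.0, float("nan"), 3.0]).astype("Int64")
print(nullable)
```

The nullable `Int64` dtype is shown only as a contrast; whether SDV or RDT would adopt it for this case is not stated in this thread.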
Workarounds
Changing the default distribution for each table to 'norm' seems to avoid this issue, since it makes the error less likely. Unfortunately, this may impact the data quality (but at least it won't crash!)
from sdv.multi_table import HMASynthesizer

synthesizer = HMASynthesizer(metadata)

# Apply the same settings to every table in the schema
for table_name in ['unique_id', 'ads_dataset', 'feeds_dataset']:
    synthesizer.set_table_parameters(
        table_name=table_name,
        table_parameters={
            'enforce_min_max_values': True,
            'default_distribution': 'norm'
        }
    )

synthesizer.fit(datasets)
synthesizer.sample()
Do note that the HMA is only meant for smaller datasets though. So even with this workaround in place, you'll still see the performance alert:
PerformanceAlert: Using the HMASynthesizer on this metadata schema is not recommended. To model this data, HMA will generate a large number of columns. (1189 columns)
      Table Name  # Columns in Metadata  Est # Columns
0      unique_id                      1           1129
1  feeds_dataset                     26             26
2    ads_dataset                     34             34
We recommend simplifying your metadata schema by dropping columns that are not necessary. If this is not possible, contact us at [email protected] for enterprise solutions.
Ultimately, I'd recommend using the HSASynthesizer as it's designed to handle larger datasets more robustly. I've confirmed that it works without issue on this dataset (fitting in a few seconds).
from sdv.
Hi Neha,
Thanks for your reply. That change to the default distribution has certainly helped. May I also ask what choices are available for this parameter? I also tried truncnorm, and the output differs from leaving it blank, so I would like to know whether there are more options for the distribution. Thanks.
Best Regards,
Thomas
Hi @ThomasK1018, no problem. For more info, I'd recommend checking our docs here. Note that the default_distribution parameter sets the shape for all columns, but you can override individual columns by using the numerical_distributions parameter too.
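As a rough sketch of how the two parameters might be combined, the snippet below builds a table_parameters dictionary with a table-wide default and a per-column override. Several things here are assumptions rather than facts from this thread: the helper `build_table_parameters` is hypothetical (not part of SDV), the column name 'age' is invented, and the list of distribution names should be checked against the docs for your installed SDV version.

```python
# Distribution names assumed to be accepted by SDV's Gaussian copula based
# synthesizers; verify against the docs for your installed version.
ASSUMED_DISTRIBUTIONS = {
    "norm", "beta", "truncnorm", "uniform", "gamma", "gaussian_kde",
}


def build_table_parameters(default_distribution, column_overrides=None):
    """Build a table_parameters dict (hypothetical helper, not part of SDV)."""
    column_overrides = dict(column_overrides or {})
    for name in [default_distribution, *column_overrides.values()]:
        if name not in ASSUMED_DISTRIBUTIONS:
            raise ValueError(f"Unknown distribution: {name!r}")
    parameters = {
        "enforce_min_max_values": True,
        "default_distribution": default_distribution,
    }
    if column_overrides:
        # Per-column shapes take the place of the table-wide default.
        parameters["numerical_distributions"] = column_overrides
    return parameters


# 'age' is a made-up column name used only for illustration.
params = build_table_parameters("norm", {"age": "truncnorm"})
print(params)

# With a real synthesizer this would be passed as:
# synthesizer.set_table_parameters(
#     table_name="ads_dataset", table_parameters=params)
```

Validating the names up front is just a convenience so that a typo fails before a potentially long fit.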
Hi @ThomasK1018, I'm closing this issue as a duplicate since we now have more evidence that it is caused by #1691. We are actively looking into this root cause and hope to have a fix in an upcoming release.
Please feel free to reply directly to #1691 if there is anything more to discuss and we can always continue the conversation there.