Comments (2)
Hi @josalhor, thanks for filing this issue with all the details. Our investigation showed that this issue is probably not related to TVAE, since the same error can be replicated with a different synthesizer such as Gaussian Copula.
Root Cause
The BinaryDecisionTreeClassifier metric cannot be run on certain combinations of real/synthetic data.
The metric is designed to take the following steps:
- Train the ML model using the synthetic data
- Test the ML model using the real data
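The two steps above can be sketched in plain Python as follows. This is an illustrative stand-in and not the SDMetrics implementation: the `MajorityByCategory` class, the `accuracy` helper, and the toy rows are all hypothetical, a simple lookup-table classifier replaces the decision tree, and accuracy is assumed as the score only for the sake of the example.

```python
from collections import Counter, defaultdict


class MajorityByCategory:
    """Toy classifier (hypothetical): predicts the most common label seen for a category."""

    def fit(self, categories, labels):
        counts = defaultdict(Counter)
        for category, label in zip(categories, labels):
            counts[category][label] += 1
        # Keep the most frequent label for each category value.
        self.table_ = {c: ctr.most_common(1)[0][0] for c, ctr in counts.items()}
        return self

    def predict(self, categories):
        # Raises KeyError for a category that was never seen during fit.
        return [self.table_[c] for c in categories]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Step 1: train the model using the synthetic data.
synthetic_x = ["a", "a", "b", "b", "b"]
synthetic_y = [1, 1, 0, 0, 1]
model = MajorityByCategory().fit(synthetic_x, synthetic_y)

# Step 2: test the model using the real data.
real_x = ["a", "b", "b", "a"]
real_y = [1, 0, 1, 1]
score = accuracy(model.predict(real_x), real_y)
print(score)
```

In this toy setup every category in the real data also appears in the synthetic data, so the test step runs to completion.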
The problem is that the synthetic data may not have full coverage of all the possible categories. For example, assume only 0.1% of the real data had a particular category value such as 'supdup'. It's possible (due to random chance) that none of the synthetic data has this value. In this case, the Binary Classification algorithm fails because the value is seen for the first time during testing.
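As a rough way to check whether a dataset pair is affected, one could compare the category values of the two tables before running the metric. The sketch below is plain Python with a hypothetical helper name (`unseen_categories`) and made-up toy columns; it is not part of SDMetrics or SDGym.

```python
def unseen_categories(real_values, synthetic_values):
    """Return category values present in the real data but absent from the synthetic data."""
    return set(real_values) - set(synthetic_values)


# Toy data (assumed): 'supdup' makes up 0.1% of the real column
# and, by chance, never appears in the synthetic column.
real_column = ["common"] * 999 + ["supdup"]
synthetic_column = ["common"] * 1000

missing = unseen_categories(real_column, synthetic_column)
print(missing)
if missing:
    print("The metric may fail: the real data has categories never seen in training.")
```

An empty result means every real category is covered by the synthetic data, so this particular failure should not occur for that column.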
For more info about the metric, see the API docs.
Next Steps
I'm updating the title of this issue to reflect the findings.
I've also started a new feature request in the underlying SDMetrics library: sdv-dev/SDMetrics#515. We can continue our discussion there.
In the meantime, I wonder if any other metric will be suitable for your purposes? (The Binary Classification metrics are listed as "in Beta" by the SDMetrics docs.)
Your description of the problem makes a lot of sense and matches my findings.
> In the meantime, I wonder if any other metric will be suitable for your purposes? (The Binary Classification metrics are listed as "in Beta" by the SDMetrics docs.)
Actually, I was trying my best to replicate the CTGAN paper results, so I will take a look at the error and try to patch it if possible.
> I've also started a new feature request in the underlying SDMetrics library: sdv-dev/SDMetrics#515. We can continue our discussion there.
I'll write further comments in that issue.