Comments (5)
@csala Any thoughts on this proposal?
from sdgym.
@leix28 @katxiao would a PR be welcome?
Hi @sbrugman I think that for now the exact change that you are proposing is not within the current SDGym roadmap, but some variation is:
My suggestion would be to make the following changes:
- All synthesizers should inherit from a synthesizer base class (`Baseline`)
- All synthesizers should implement a separate `fit` and `sample` method
To add some more context, the reason the required input is a function instead of a class is that wrapping a class-based synthesizer that follows the `fit`/`sample` abstraction within a single function that receives real data, runs `fit`/`sample` internally and returns a synthetic clone is far easier than the opposite approach of trying to adapt a function-based synthesizer to this `fit`/`sample` abstraction. Also, the current implementation of SDGym already supports class-based synthesizers that inherit from the `Baseline` class, so making this a hard requirement does not really expand the support, it only narrows it.
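As a minimal sketch of the wrapping described above: the class `ShuffleSynthesizer` and the function `shuffle_synthesizer` are hypothetical names made up for illustration, the data is a plain list of dicts rather than a real table, and the exact function signature SDGym expects is not shown in this thread.

```python
import random


class ShuffleSynthesizer:
    """Toy class-based synthesizer (hypothetical, not part of SDGym)."""

    def fit(self, real_data):
        # "Learn" the observed values of each column independently.
        self._columns = {
            name: [row[name] for row in real_data] for name in real_data[0]
        }
        self._num_rows = len(real_data)

    def sample(self, num_rows=None, seed=0):
        # Draw each column independently from its observed values.
        rng = random.Random(seed)
        num_rows = self._num_rows if num_rows is None else num_rows
        return [
            {name: rng.choice(values) for name, values in self._columns.items()}
            for _ in range(num_rows)
        ]


def shuffle_synthesizer(real_data):
    """Function-style entry point: runs fit/sample internally."""
    model = ShuffleSynthesizer()
    model.fit(real_data)
    return model.sample()


real = [{"age": 31, "city": "Lyon"}, {"age": 47, "city": "Oslo"}]
synthetic = shuffle_synthesizer(real)
```

Going the other way, from an opaque function to separate `fit` and `sample` steps, has no such three-line adapter, which is the asymmetry the comment points at.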
On the other hand, it is true that this support for class-based synthesizers is not really documented, so adding it to the docs would be interesting!
More interestingly, this structure allows for capturing valuable metrics that are currently out of reach related to fit/sampling time and complexity (time measurements or maybe even this package). SDGym would this way be able to benchmark this aspect of a synthesizer as well, which can be an important decision criterion for which synthesizer is best for a given use case: if the user expects to sample large quantities of data then a longer fitting time would be acceptable at a lower sampling complexity.
This is another story, and it could actually be interesting too! An option that would be acceptable without sacrificing the functional input would be to modify the code to capture such metrics only when the provided synthesizer is a `Baseline` subclass. We could make it so that `model_time` stays the same and is always reported for all synthesizers, but for `Baseline` subclasses two new columns, `fit_time` and `sample_time`, are added to the output table.
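A rough sketch of that conditional timing could look like the following. The `Baseline` class here is a stand-in defined only for this example (SDGym's real class is not reproduced), and `Identity` and `run_and_time` are hypothetical names; only `model_time`, `fit_time` and `sample_time` come from the discussion.

```python
import time


class Baseline:
    """Stand-in base class for this sketch only."""

    def fit(self, real_data):
        raise NotImplementedError

    def sample(self):
        raise NotImplementedError


class Identity(Baseline):
    """Toy synthesizer that returns a copy of the training data."""

    def fit(self, real_data):
        self._data = list(real_data)

    def sample(self):
        return list(self._data)


def run_and_time(synthesizer, real_data):
    """Run a synthesizer; split the timing only for Baseline subclasses."""
    timings = {}
    start = time.perf_counter()
    if isinstance(synthesizer, type) and issubclass(synthesizer, Baseline):
        model = synthesizer()
        model.fit(real_data)
        fitted = time.perf_counter()
        synthetic = model.sample()
        end = time.perf_counter()
        timings["fit_time"] = fitted - start
        timings["sample_time"] = end - fitted
    else:
        # Plain function: only the total time is observable.
        synthetic = synthesizer(real_data)
        end = time.perf_counter()
    timings["model_time"] = end - start
    return synthetic, timings


_, class_timings = run_and_time(Identity, [1, 2, 3])
_, function_timings = run_and_time(lambda data: list(data), [1, 2, 3])
```

With this shape, `function_timings` holds only `model_time`, while `class_timings` also carries the two extra columns.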
Hi Carles, thanks for getting back to me on this. The clarification on why you would not like to impose the `fit`/`sample` abstraction on users is helpful. The backwards-compatibility argument also makes sense. Good to know that we can move forward by documenting the class-based synthesizers and by conditionally extending the benchmark with timing metrics, depending on whether the implementation is `Baseline`-based or not.
Hello, I'm jumping in here a few years after this initial conversation. Since 2021, we have made significant updates to the usage/API of SDGym as well as its functionality. And I believe that some key features that were discussed here have now been incorporated into the library.
- You can now supply a custom synthesizer by supplying 2 separate functions for `fit` and `sample`. For more information, see the custom synthesizer user guide. We will automatically use these functions to create a class for you.
- The benchmarking script now reports more granular results, such as time (fit time, sample time, evaluation time) and memory usage. See interpreting the results.
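To illustrate the idea of turning two functions into a class, here is a small sketch. It is not SDGym's implementation or API: `make_synthesizer_class`, `fit_mean` and `sample_mean` are hypothetical names invented for this example, and the custom synthesizer user guide remains the reference for the real interface.

```python
def make_synthesizer_class(name, fit_fn, sample_fn):
    """Build a class with fit/sample methods from two plain functions."""

    def fit(self, real_data):
        # Store whatever the user's fit function returns as the model.
        self._model = fit_fn(real_data)

    def sample(self, num_rows):
        return sample_fn(self._model, num_rows)

    return type(name, (), {"fit": fit, "sample": sample})


def fit_mean(real_data):
    # Toy "training": remember the mean of a numeric column.
    return sum(real_data) / len(real_data)


def sample_mean(model, num_rows):
    # Toy "sampling": repeat the learned mean.
    return [model] * num_rows


MeanSynthesizer = make_synthesizer_class("MeanSynthesizer", fit_mean, sample_mean)
synthesizer = MeanSynthesizer()
synthesizer.fit([1, 2, 3])
rows = synthesizer.sample(2)
```

Because `fit` and `sample` stay separate calls on the generated class, a benchmark can time each step on its own.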
So I'm going to mark this issue as (more-or-less) resolved.
Apologies if I've overlooked any of the finer points in the discussion. If there is more to add, I'd recommend filing a new issue and we can start a new discussion based on the vision and capabilities of the newest SDGym library. Thanks!