Comments (2)
Hi there @ankurpuri1981
The SDV makes a best-guess effort during automatic metadata detection for column types and table relationships, and then provides convenience methods for updating the metadata so you can tweak and customize it. We've found this approach to be the best way to balance reducing friction (through best-guess automatic detection) with giving users transparency and control over their metadata, which leads to higher quality synthetic data.
The sdtype is set to Unknown when the SDV can't cleanly assign a better sdtype. These fields are treated as PII (personally identifiable information) fields. There are a few common cases:
- If the field contains freeform text (like a vehicle's description), there isn't a clear matching sdtype because the SDV doesn't have an sdtype dedicated to arbitrary text.
- In other cases, the text field might represent domain-specific concepts like social security numbers, IP addresses, street addresses, or license plate numbers. In that case, I'd recommend updating the field to the relevant sdtype. You can see a list of supported PII sdtypes here.
- In other cases, the field may actually be numerical but contain some extraneous characters, so it can't be cleanly cast to a number. I'd recommend cleaning your data and then manually setting this column's sdtype!
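For the last case, here is a minimal sketch of what the cleaning step could look like before you set the sdtype. The sample values, the `to_number` helper, and the table and column names in the final comment are all hypothetical; adapt the parsing rules to whatever extraneous characters your data actually contains.

```python
import re

# Hypothetical raw values for a column that should be numerical but contains
# currency symbols, thousands separators, and unit suffixes.
raw_prices = ["$1,200", "950 USD", " 1300.50 ", "N/A", ""]

# Matches an optionally negative integer or decimal number.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def to_number(value):
    """Return the first number found in `value` as a float, or None if there is none."""
    cleaned = str(value).replace(",", "")  # drop thousands separators
    match = _NUMBER.search(cleaned)
    return float(match.group()) if match else None


clean_prices = [to_number(v) for v in raw_prices]
print(clean_prices)

# Once the column is clean, its sdtype can be set with the metadata update
# methods, for example (hypothetical table and column names):
# metadata.update_column(table_name="vehicles", column_name="price", sdtype="numerical")
```

Values that contain no number at all come back as `None`, so they can be handled as missing values rather than silently turned into zeros.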
It looks like you've already found the metadata updating methods, but I'm linking them here as well so you have them handy: https://docs.sdv.dev/sdv/multi-table-data/data-preparation/multi-table-metadata-api#update-api
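If you have many tables, it may help to list every column that was left as Unknown before deciding how to update each one. The sketch below works on a plain dictionary and assumes the layout `tables -> <table name> -> columns -> <column name> -> sdtype`, with the value stored as the lowercase string `unknown`. Both the layout and the example tables are assumptions for illustration, so check them against your own exported metadata.

```python
# Assumed layout of an exported multi-table metadata dictionary.
# The tables and columns below are made up for illustration.
metadata_dict = {
    "tables": {
        "vehicles": {
            "columns": {
                "vehicle_id": {"sdtype": "id"},
                "description": {"sdtype": "unknown", "pii": True},
                "price": {"sdtype": "unknown", "pii": True},
            }
        },
        "owners": {
            "columns": {
                "owner_id": {"sdtype": "id"},
                "ssn": {"sdtype": "unknown", "pii": True},
            }
        },
    }
}


def find_unknown_columns(metadata):
    """Return (table, column) pairs whose sdtype is 'unknown'."""
    return [
        (table_name, column_name)
        for table_name, table in metadata.get("tables", {}).items()
        for column_name, column in table.get("columns", {}).items()
        if column.get("sdtype") == "unknown"
    ]


unknown_columns = find_unknown_columns(metadata_dict)
for table_name, column_name in unknown_columns:
    print(f"{table_name}.{column_name} needs a manual sdtype")
```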
Out of curiosity, where does the source data that you're trying to feed into the SDV live? A database? An API endpoint? Flat files in a file store?
Hi there @ankurpuri1981, I hope my response was useful! I haven't heard from you in 2 weeks, so I'm going to move forward with closing this issue out.
Feel free to open a new issue if you have more questions!