Comments (7)
Hi @raaachli thanks for providing the details for this issue. We can look into it. To replicate, it would be helpful if you can provide the metadata of this dataset for us. Additionally, if this is a public dataset (or you don't mind sharing), would you also mind providing the data?
For context: The HMASynthesizer is designed to learn the full distribution for the foreign key IDs and replicate them to the best of its ability. At minimum, it is meant to guarantee the cardinality (# of children) falls at least within min/max of the original range. At best, it will create the full distribution correctly.
Some next steps to try:
- Can you run a diagnostic report to make sure that the basic min/max requirements are being met? Please refer to this documentation and let us know if the score is <1.0
- Can you create and share a relationship visualization for the connection between the joint table and the book table? You can refer to this documentation. The code should look something like this:
from sdv.evaluation.multi_table import get_cardinality_plot

fig = get_cardinality_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    child_table_name='joint',
    parent_table_name='book',
    child_foreign_key='book.id',
    metadata=metadata)
fig.show()
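If sharing the plot is not possible, the same information can be approximated by hand. Below is a minimal sketch in plain Python (not SDV's implementation) that counts how many child rows each parent has and then tallies how often each count occurs, which is roughly what the bar chart displays. The `cardinality_distribution` helper and the toy `book_ids` / `joint_book_fks` lists are made up for illustration.

```python
from collections import Counter


def cardinality_distribution(parent_ids, child_foreign_keys):
    """Map each cardinality (# of children) to the number of parents that have it."""
    children_per_parent = Counter(child_foreign_keys)
    # A Counter returns 0 for missing keys, so parents that never
    # appear as a foreign key are counted as having zero children.
    return Counter(children_per_parent[pid] for pid in parent_ids)


# Hypothetical toy data: 4 books and 5 rows in the joint table.
book_ids = [1, 2, 3, 4]
joint_book_fks = [1, 1, 2, 3, 3]

distribution = cardinality_distribution(book_ids, joint_book_fks)
for cardinality in sorted(distribution):
    print(f"{cardinality} children: {distribution[cardinality]} parent(s)")
```

Running this once on the real tables and once on the synthetic tables gives two small tallies that can be compared side by side without exposing any actual values.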
Hello! Thank you for your reply.
The book and author relationship is actually an example I created; my real data is private and I cannot share it.
However, I just checked the diagnostic score, and the relationship validity is not 100%. The cardinality boundary adherence of one parent table to the joint table is 0.63, while the other metrics are 100%.
Is there anything (or a link) I can refer to in order to resolve this?
Hi @raaachli thanks for running the diagnostic and getting back to us. All SDV synthesizers are supposed to guarantee a score of 100% so if it is below that, we will consider it a bug. I suggest we dig into this first, as it may help us identify the root cause.
Quick question: You mention the cardinality boundary adherence for one of the relationships (parent table to joint table) is 0.63. I wonder if it is the same parent table that the uniform foreign key is referring to? E.g. you have a uniform distribution for book.id in your example -- I'm curious if the 0.63 score is referring to that same relationship between tables joint and book?
Replication
Would you be able to share any data example -- even some fake data -- that may help us replicate this? Or at the very least, if you are able to share your metadata file (that contains only the column names and table names), that would be great.
Hi @raaachli one more thing -- could you please specify which version of SDV you are using? You can use the code below to print it out
import sdv
print(sdv.version.public)
Thank you!
Yes, 0.63 is referring to that same relationship between tables joint and book.
However, it seems I cannot replicate this result with fake data: the score is 100% when I run the same process on the fake data.
What does this 0.63 mean? Does it mean that some of the primary IDs in the synthetic book table do not exist in the synthetic joint table?
My metadata is:
{
    "tables": {
        "A": {
            "columns": {
                "AuthorID": {
                    "sdtype": "id"
                }
            },
            "primary_key": "AuthorID"
        },
        "B": {
            "columns": {
                "BookID": {
                    "sdtype": "id"
                }
            },
            "primary_key": "BookID"
        },
        "Joint": {
            "columns": {
                "JointID": {
                    "sdtype": "id"
                },
                "BookID": {
                    "sdtype": "id"
                },
                "AuthorID": {
                    "sdtype": "id"
                }
            },
            "primary_key": "JointID"
        }
    },
    "relationships": [
        {
            "parent_table_name": "A",
            "child_table_name": "Joint",
            "parent_primary_key": "AuthorID",
            "child_foreign_key": "AuthorID"
        },
        {
            "parent_table_name": "B",
            "child_table_name": "Joint",
            "parent_primary_key": "BookID",
            "child_foreign_key": "BookID"
        }
    ],
    "METADATA_SPEC_VERSION": "MULTI_TABLE_V1"
}
And my sdv version is 1.12.0
Hi @raaachli thanks for the details.
You can read more about the cardinality boundary adherence metric in these docs. A score of 0.63 here means that 63% of synthetic parent rows have a number of children that falls within the min/max range seen in the real data; the remaining 37% have too few or too many children (outside that range).
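To make the score concrete, here is a minimal sketch of that calculation in plain Python. It is an illustration of the idea described above, not the library's actual code; the `boundary_adherence` and `children_per_parent` helpers and the toy IDs are hypothetical, and the sketch assumes that parents with zero children are included when counting.

```python
from collections import Counter


def children_per_parent(parent_ids, child_foreign_keys):
    """Return the number of child rows for each parent, in parent order."""
    counts = Counter(child_foreign_keys)
    # Assumption: parents with no children count as cardinality 0.
    return [counts[pid] for pid in parent_ids]


def boundary_adherence(real_parents, real_fks, synth_parents, synth_fks):
    """Fraction of synthetic parents whose cardinality lies in the real min/max range."""
    real_counts = children_per_parent(real_parents, real_fks)
    low, high = min(real_counts), max(real_counts)
    synth_counts = children_per_parent(synth_parents, synth_fks)
    in_range = sum(low <= count <= high for count in synth_counts)
    return in_range / len(synth_counts)


# Hypothetical real data: every book has 1 or 2 rows in the joint table.
real_books = [1, 2, 3]
real_fks = [1, 2, 2, 3]

# Hypothetical synthetic data: cardinalities are 1, 3, 2 and 0.
synth_books = [10, 11, 12, 13]
synth_fks = [10, 11, 11, 11, 12, 12]

score = boundary_adherence(real_books, real_fks, synth_books, synth_fks)
print(score)  # 0.5: two of the four synthetic books fall inside [1, 2]
```

In this toy case, one synthetic book has too many children (3) and one has none at all, so half of the parents fall outside the real range.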
We used to have a known issue with this, but it was fixed in SDV 1.12.0 -- see #1834.
We can try to replicate further using the metadata you provided. For more insight, it would be great if you could render the cardinality visualization for your real dataset (the one that yielded the score of 0.63). It should create a bar chart. Copy-pasting the instructions below.
Instructions for cardinality visualization
Can you create and share a relationship visualization for the connection between the joint table and the book table? You can refer to this documentation. The code should look something like this:
from sdv.evaluation.multi_table import get_cardinality_plot

fig = get_cardinality_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    child_table_name='joint',
    parent_table_name='book',
    child_foreign_key='book.id',
    metadata=metadata)
fig.show()
Hi @raaachli are you still running into this problem? I'm closing out the issue since it has been inactive for a few weeks and I'm unable to replicate it. But please feel free to reply with more details and I can always reopen to continue the investigation.