Comments (7)

npatki commented on June 9, 2024

Hi @raaachli thanks for providing the details for this issue. We can look into it. To replicate, it would be helpful if you can provide the metadata of this dataset for us. Additionally, if this is a public dataset (or you don't mind sharing), would you also mind providing the data?

For context: The HMASynthesizer is designed to learn the full distribution for the foreign key IDs and replicate them to the best of its ability. At minimum, it is meant to guarantee the cardinality (# of children) falls at least within min/max of the original range. At best, it will create the full distribution correctly.

Some next steps to try:

  1. Can you run a diagnostic report to make sure that the basic min/max requirements are being met? Please refer to this documentation and let us know if the score is <1.0
  2. Can you create and share a relationship visualization for the connection between the joint table and the book table? You can refer to this documentation. The code should look something like this:
from sdv.evaluation.multi_table import get_cardinality_plot

fig = get_cardinality_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    child_table_name='joint',
    parent_table_name='book',
    child_foreign_key='book.id',
    metadata=metadata)
    
fig.show()
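For step 1, the diagnostic can be run along the following lines. This is a minimal sketch: the helper name relationship_validity_details is made up for illustration, and it assumes real_data and synthetic_data are dictionaries mapping table names to pandas DataFrames, with metadata being the multi-table metadata object already in use.

```python
def relationship_validity_details(real_data, synthetic_data, metadata):
    """Run the SDV diagnostic and return the per-relationship scores.

    Hypothetical helper (not part of SDV). Assumes real_data and
    synthetic_data are dicts of table name -> pandas DataFrame.
    """
    # Imported lazily so the helper can be defined without SDV installed.
    from sdv.evaluation.multi_table import run_diagnostic

    report = run_diagnostic(
        real_data=real_data,
        synthetic_data=synthetic_data,
        metadata=metadata,
    )
    # The overall score should be 1.0 if every basic check passes.
    print('Overall diagnostic score:', report.get_score())
    # Per-relationship breakdown, including cardinality boundary adherence.
    return report.get_details(property_name='Relationship Validity')
```

Calling relationship_validity_details(real_data, synthetic_data, metadata) should return a table with one row per relationship and metric, which makes it easier to see which relationship is below 1.0.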

from sdv.

raaachli commented on June 9, 2024

Hello! Thank you for your reply.

The book and author relation is actually an example I created. My real data is private and I cannot share it.

However, I just checked the diagnostic score, and the relationship validity is not 100%. The cardinality boundary adherence of one parent table to the joint table is 0.63, while the other metrics are 100%.

Is there a link or anything else I can refer to in order to resolve this?

npatki commented on June 9, 2024

Hi @raaachli thanks for running the diagnostic and getting back to us. All SDV synthesizers are supposed to guarantee a score of 100% so if it is below that, we will consider it a bug. I suggest we dig into this first, as it may help us identify the root cause.

Quick question: You mention the cardinality boundary adherence for one of the relationships (parent table to joint table) is 0.63. Is that the same parent table that the uniform foreign key refers to? E.g., you have a uniform distribution for book.id in your example -- I'm curious whether the 0.63 score refers to that same relationship between the joint and book tables.

Replication

Would you be able to share any data example -- even some fake data -- that may help us replicate this? Or at the very least, if you are able to share your metadata file (that contains only the column names and tables names), that would be great.

npatki commented on June 9, 2024

Hi @raaachli one more thing -- could you please specify which version of SDV you are using? You can use the code below to print it out:

import sdv

print(sdv.version.public)

raaachli commented on June 9, 2024

Thank you!

Yes, 0.63 refers to that same relationship between the joint and book tables.

However, I cannot replicate this result with fake data: the score is 100% when I run the same process on fake data.

What does this 0.63 mean? Does it mean that some of the primary IDs in the synthetic book table do not exist in the synthetic joint table?

My metadata is:
{
    "tables": {
        "A": {
            "columns": {
                "AuthorID": {
                    "sdtype": "id"
                }
            },
            "primary_key": "AuthorID"
        },
        "B": {
            "columns": {
                "BookID": {
                    "sdtype": "id"
                }
            },
            "primary_key": "BookID"
        },
        "Joint": {
            "columns": {
                "JointID": {
                    "sdtype": "id"
                },
                "BookID": {
                    "sdtype": "id"
                },
                "AuthorID": {
                    "sdtype": "id"
                }
            },
            "primary_key": "JointID"
        }
    },
    "relationships": [
        {
            "parent_table_name": "A",
            "child_table_name": "Joint",
            "parent_primary_key": "AuthorID",
            "child_foreign_key": "AuthorID"
        },
        {
            "parent_table_name": "B",
            "child_table_name": "Joint",
            "parent_primary_key": "BookID",
            "child_foreign_key": "BookID"
        }
    ],
    "METADATA_SPEC_VERSION": "MULTI_TABLE_V1"
}

And my SDV version is 1.12.0.

npatki commented on June 9, 2024

Hi @raaachli thanks for the details.

You can read more about the cardinality boundary adherence metric in these docs. A score of 0.63 here means that 63% of synthetic parent rows have a valid # of children -- the remaining 37% have too few or too many children (outside the min/max range observed in the real data).

We used to have a known issue with this, but it was fixed in SDV 1.12.0 -- see #1834.
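To make the definition concrete, here is a simplified sketch of how such a score could be computed on toy data. It is not the SDMetrics implementation: the function names are made up for illustration, and it assumes that a parent with no children counts as having 0 children.

```python
from collections import Counter


def child_counts(parent_ids, child_foreign_keys):
    """Count child rows per parent ID; parents without children count as 0."""
    counts = Counter(child_foreign_keys)
    return [counts.get(parent_id, 0) for parent_id in parent_ids]


def boundary_adherence(real_parents, real_fks, synth_parents, synth_fks):
    """Fraction of synthetic parents whose # of children lies in the real min/max range."""
    real_counts = child_counts(real_parents, real_fks)
    low, high = min(real_counts), max(real_counts)
    synth_counts = child_counts(synth_parents, synth_fks)
    in_range = sum(low <= count <= high for count in synth_counts)
    return in_range / len(synth_counts)


# Toy real data: books 1, 2, 3 have 1, 2, 3 joint rows, so the range is [1, 3].
real_books = [1, 2, 3]
real_joint_book_ids = [1, 2, 2, 3, 3, 3]

# Toy synthetic data: books 10, 11, 12, 13 have 2, 1, 4, 0 joint rows.
synth_books = [10, 11, 12, 13]
synth_joint_book_ids = [10, 10, 11, 12, 12, 12, 12]

score = boundary_adherence(
    real_books, real_joint_book_ids, synth_books, synth_joint_book_ids
)
print(score)  # 0.5: books 12 (too many) and 13 (none) fall outside [1, 3]
```

Under these assumptions, a synthetic book that never appears in the joint table only lowers the score when every real book has at least one joint row; a book with too many joint rows lowers it in the same way.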

We can try to replicate further using the metadata you provided. For more insight, it would be great if you could render the cardinality visualization for your real dataset (the one that yielded the score of 0.63). It should create a bar chart. Copy-pasting the instructions below.

Instructions for cardinality visualization

Can you create and share a relationship visualization for the connection between the joint table and the book table? You can refer to this documentation. The code should look something like this:

from sdv.evaluation.multi_table import get_cardinality_plot

fig = get_cardinality_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    child_table_name='joint',
    parent_table_name='book',
    child_foreign_key='book.id',
    metadata=metadata)
    
fig.show()

npatki commented on June 9, 2024

Hi @raaachli are you still running into this problem? I'm closing out the issue since it has been inactive for a few weeks and I'm unable to replicate it. But please feel free to reply with more details and I can always reopen to continue the investigation.
