Comments (9)
Thanks for filing this issue @Ng-ms. I'm a bit confused at the scenario you are describing.
The InvalidDataError indicates that there is a mismatch between the data and metadata. If you convert the data column to numerical, you would also need to update the metadata for that column to be numerical. The InvalidDataError means that your synthesizer has crashed, so you would be unable to fit PARSynthesizer and sample from it.
when converting the data back to DateTime in the synthetic data is gives a range of dates like 09-08-1768 and 10-03-1644 ..
I am confused because this sentence implies that you already have synthetic data. How are you able to get synthetic data if the synthesizer crashed (with the InvalidDataError)? Something doesn't seem to add up.
It would be helpful if you could share the Python code that you are using to load data, modify it, create metadata, create the synthesizer, sample from it, etc. And also if you could indicate where the crash is happening.
from sdv.
Sorry if my earlier messages were a bit unclear. Here's more info to explain better. I have two cases/tries here:
1. Using datetime columns as context without alteration: This leads to an InvalidDataError due to a mismatch between the data and the defined metadata, even though these columns are specified as datetime type in the metadata. This prevents fitting the PARSynthesizer.
from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer
# Detect the metadata from the dataframe, then mark the sequence key and sequence index
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)
metadata.update_column(column_name='ID_P', sdtype='id')
metadata.set_sequence_key(column_name='ID_P')
metadata.set_sequence_index(column_name='DATE_P')
#metadata.save_to_json(filepath='my_metadata_v2.json')
#metadata = SingleTableMetadata.load_from_json(
# filepath='my_metadata_v1.json')
print(metadata)
# Generate synthetic data
print('start')
synthesizer = PARSynthesizer(metadata, epochs=150, context_columns=['data_DM', 'DATE_DIS', 'date_HIP', 'date_DCM', 'date_DID'], verbose=True, enforce_min_max_values=True, enforce_rounding=True, cuda=True)
2. Converting datetime to numerical for synthesis: This results in synthetic data with unrealistic dates (e.g., 09-08-1768), indicating a problem in handling these numerical values or converting them back to datetime.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer
context_date_columns = ['data_DM', 'DATE_DIS', 'date_HIP', 'date_DCM', 'date_DID']
# Convert each datetime context column to integer nanoseconds since 1970-01-01
for col in context_date_columns:
    df[col] = pd.to_datetime(df[col], format='%d/%m/%Y').astype('int64')
# This snippet assumes `metadata` was created as in case 1
#metadata.save_to_json(filepath='my_metadata_v2.json')
#metadata = SingleTableMetadata.load_from_json(
# filepath='my_metadata_v1.json')
print(metadata)
# Generate synthetic data
print('start')
synthesizer = PARSynthesizer(metadata, epochs=150, context_columns=context_date_columns, verbose=True, enforce_min_max_values=True, enforce_rounding=True, cuda=True)
synthesizer.fit(df)
print('end')
synthetic_data = synthesizer.sample(num_sequences=100, sequence_length=None)
# Convert the synthetic integer values back to dates
for col in context_date_columns:
    synthetic_data[col] = pd.to_datetime(synthetic_data[col], unit='ns').dt.date
Hello @npatki, do you have any ideas on how to solve this?
Hi @Ng-ms,
Thanks for confirming. The errors indicate that there are mismatches between how you are converting the data from datetime to numerical, and how you're converting them back from numerical to datetime. If you are doing any conversions, you also need to update your metadata as the sdtype is no longer datetime but numerical.
Here is a code snippet that may help:
import pandas as pd
from sdv.sequential import PARSynthesizer
# convert datetime columns to numerical (integer nanoseconds since 1970-01-01)
data[COLUMN_NAME] = pd.to_datetime(data[COLUMN_NAME], format='%d/%m/%Y').astype('int64')
# update these columns to be sdtype 'numerical' in the metadata, as they are no longer datetime!
metadata.update_column(column_name=COLUMN_NAME, sdtype='numerical')
# save this version of metadata!
metadata.save_to_json(filepath='metadata_converted_context.json')
# now you can fit and sample
synthesizer = PARSynthesizer(metadata, epochs=150, context_columns=[COLUMN_NAME])
synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_sequences=100)
# synthetic data will have numerical values. convert them to datetime
synthetic_data[COLUMN_NAME] = pd.to_datetime(synthetic_data[COLUMN_NAME], unit='ns').dt.date
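To check the conversion on its own, without SDV involved, the round trip can be tried on a handful of values first. This is only a minimal sketch: the sample dates are made up, and the explicit cast to nanosecond resolution is an added precaution (an assumption, not part of the snippet above) so that the integers always match `unit='ns'` on the way back.

```python
import pandas as pd

# Hypothetical sample values in the thread's day/month/year format.
dates = pd.Series(['09/08/1968', '10/03/2004', '25/12/2019'])

# datetime -> integer nanoseconds since 1970-01-01.
# Pinning the resolution to nanoseconds keeps the integer unit unambiguous.
as_datetime = pd.to_datetime(dates, format='%d/%m/%Y')
as_int = as_datetime.astype('datetime64[ns]').astype('int64')

# integer nanoseconds -> datetime, using the same unit.
restored = pd.to_datetime(as_int, unit='ns')

# True when every value survives the round trip unchanged.
round_trip_ok = bool((restored == as_datetime).all())
print(as_int.tolist())
print(round_trip_ok)
```

Dates before 1970 become negative integers, which is expected. If the round trip holds on the real data, the conversion itself is probably not the source of the strange dates.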
Thank you @npatki.
I am actually updating the metadata; the problem is that I am getting very strange dates.
Hi @Ng-ms, that is unfortunate to hear. As I mentioned in my previous message, you may want to double check how you are doing your conversion from datetime --> numerical, and back from numerical --> datetime. Good practice will be to inspect your data every step of the way. What does the input data look like? What are the min/max values in the input data for fit? Etc.
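The kind of step-by-step inspection described above could look roughly like the sketch below. The helper names `date_range_report` and `out_of_range` are hypothetical, and the two series are made-up stand-ins for one converted column of the real and the synthetic table.

```python
import pandas as pd

def date_range_report(series, unit='ns'):
    """Return the (min, max) of an integer column, shown as dates."""
    as_dates = pd.to_datetime(series, unit=unit)
    return as_dates.min().date(), as_dates.max().date()

def out_of_range(real, synthetic):
    """Return the synthetic values that fall outside the real column's min/max."""
    mask = (synthetic < real.min()) | (synthetic > real.max())
    return synthetic[mask]

# Hypothetical real column, converted to integer nanoseconds as in the thread.
real = pd.to_datetime(
    pd.Series(['01/02/2015', '15/06/2018', '30/11/2020']), format='%d/%m/%Y'
).astype('datetime64[ns]').astype('int64')

# Hypothetical synthetic column: two plausible values and one large negative value.
synthetic = pd.Series(
    [real.min(), real.max(), -6_357_000_000_000_000_000], dtype='int64'
)

print(date_range_report(real))       # range the synthesizer saw during fit
print(date_range_report(synthetic))  # range it produced
flagged = out_of_range(real, synthetic)
print(flagged)                       # raw integers that fall outside the input range
```

Looking at the flagged raw integers, rather than only the converted dates, helps show whether the sampled numbers themselves are out of range or whether the conversion back to datetime is at fault.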
Unfortunately, there is only so much I can do with these screenshots. If you are able to provide access to your real data or metadata, as well as the full and complete code that you have currently in SDV, that will be helpful. If we are not able to replicate your issue, it is unlikely we will be able to provide any kinds of useful information. Please provide any other information you think will be helpful. Thanks.
Hi @Ng-ms are you still encountering this problem?
Since this issue has been inactive for a while, I'm closing it off. But please feel free to reply with any more info. We can always reopen the issue to continue investigation.
Hi @npatki, yes, unfortunately I am still having this problem. Even though my data conversion is right, I am still getting illogical dates (outside the min and max).
@npatki Hello, I am still having the same error. Sometimes just one column gives these unrealistic dates, and sometimes (if I train the model longer) more than one date column has unrealistic dates (with years like 1700 or 1898).