Comments (5)
The error is caused by malformed basenames of the files within the TARs. For example:
15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b.png
becomes 15_Cohen_1-s2 as the grouping __key__, and 0-S0929664620300449-gr3_lrg-b.png as the additional key added to the example, whereas the intended behavior was to use
15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b
as the grouping __key__, and png as the additional key added to the example.
To get the expected behavior, the basenames of the files within the TARs should be fixed so that they contain only a single dot, the one separating the file extension.
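The first-dot rule described above can be sketched as follows. This is a minimal illustration, not the actual datasets implementation, and the renamed filename at the end is a hypothetical fix (dots replaced with underscores):

```python
import os

def split_basename(path):
    """Split a tar member name the way the WebDataset format does:
    everything up to the FIRST dot in the basename is the grouping
    __key__; everything after that dot is the field name."""
    dirname, basename = os.path.split(path)
    key, _, field = basename.partition(".")
    return (os.path.join(dirname, key) if dirname else key), field

# The problematic filename from this issue: the extra dot shifts the split.
key, field = split_basename("15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b.png")
# key == "15_Cohen_1-s2", field == "0-S0929664620300449-gr3_lrg-b.png"

# A hypothetical rename so that only the extension dot remains:
key2, field2 = split_basename("15_Cohen_1-s2_0-S0929664620300449-gr3_lrg-b.png")
# key2 == "15_Cohen_1-s2_0-S0929664620300449-gr3_lrg-b", field2 == "png"
```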
from datasets.
I'm reopening this because I think we should give a clearer error message with a specific error code.
For now, it's hard for the user to understand where the error comes from (not everybody knows the subtleties of the WebDataset filename structure).
(We can transfer it to https://github.com/huggingface/dataset-viewer if it fits better there.)
same with .jpg -> https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions
Error code: DatasetGenerationError
Exception: DatasetGenerationError
Message: An error occurred while generating the dataset
Traceback: Traceback (most recent call last):
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1748, in _prepare_split_single
for key, record in generator:
File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 818, in wrapped
for item in generator(*args, **kwargs):
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/packaged_modules/webdataset/webdataset.py", line 109, in _generate_examples
example[field_name] = {"path": example["__key__"] + "." + field_name, "bytes": example[field_name]}
KeyError: 'jpg'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1316, in compute_config_parquet_and_info_response
parquet_operations, partial = stream_convert_to_parquet(
File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 909, in stream_convert_to_parquet
builder._prepare_split(
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1627, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1784, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
More details in the spec (https://docs.google.com/document/d/18OdLjruFNX74ILmgrdiCI9J1fQZuhzzRBCHV9URWto0/edit#heading=h.hkptaq2kct2s)
The prefix of a file is all directory components of the file plus the file name component up to the first “.” in the file name.
The last extension (i.e., the portion after the last “.”) in a file name determines the file type.
Example:
images17/image194.left.jpg
images17/image194.right.jpg
images17/image194.json
images17/image12.left.jpg
images17/image12.json
images17/image12.right.jpg
images3/image1459.left.jpg
…
When reading this with a WebDataset library, you would get the following two dictionaries back in sequence:
{ "__key__": "images17/image194", "left.jpg": b"...", "right.jpg": b"...", "json": b"..." }
{ "__key__": "images17/image12", "left.jpg": b"...", "right.jpg": b"...", "json": b"..." }
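The grouping behavior described in the spec can be sketched as follows. This is a simplified illustration of the prefix rule, not the actual WebDataset or datasets implementation:

```python
import itertools
import os

def group_by_prefix(filenames):
    """Group consecutive tar member names into WebDataset examples:
    the prefix (directory components plus the basename up to the first
    dot) is the __key__; the rest of the basename is the field name."""
    def prefix(name):
        dirname, basename = os.path.split(name)
        stem = basename.split(".", 1)[0]
        return os.path.join(dirname, stem) if dirname else stem

    for key, names in itertools.groupby(filenames, key=prefix):
        example = {"__key__": key}
        for name in names:
            field = os.path.basename(name).split(".", 1)[1]
            example[field] = b"..."  # placeholder for the member's bytes
        yield example

files = [
    "images17/image194.left.jpg",
    "images17/image194.right.jpg",
    "images17/image194.json",
    "images17/image12.left.jpg",
    "images17/image12.json",
    "images17/image12.right.jpg",
]
examples = list(group_by_prefix(files))
# examples[0]["__key__"] == "images17/image194"
# examples[1]["__key__"] == "images17/image12"
```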
OK, the issue is different in the latter case: some files are suffixed as .jpeg, and others as .jpg :)
Is it a limitation of the webdataset format, or of the datasets library @lhoestq? And could we be able to give a clearer error?
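A simplified sketch of why the mixed suffixes surface as KeyError: 'jpg': the loader infers the set of field names from the first example(s) and then indexes every example with those names, so a sample whose image is stored as .jpeg has no "jpg" field. The helper and field names below are illustrative, not datasets' actual code:

```python
def check_consistent_fields(examples):
    """Infer the field names from the first example and require every
    other example to provide the same fields, mimicking how a loader
    that fixed its schema up front would fail on a missing field."""
    inferred = [f for f in examples[0] if f != "__key__"]
    for example in examples:
        for field in inferred:
            if field not in example:
                raise KeyError(field)  # surfaces as KeyError: 'jpg'
    return inferred

examples = [
    {"__key__": "a", "jpg": b"...", "txt": b"caption"},
    {"__key__": "b", "jpeg": b"...", "txt": b"caption"},  # .jpeg, not .jpg
]
# check_consistent_fields(examples) raises KeyError('jpg') on the second example
```

Renaming the files so that all images share a single extension (all .jpg or all .jpeg) avoids the mismatch.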