Comments (5)
The error is caused by malformed basenames of the files within the TARs. For example:
15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b.png
becomes 15_Cohen_1-s2 as the grouping __key__, and 0-S0929664620300449-gr3_lrg-b.png as the additional key added to the example, whereas the intended behavior was to use
15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b
as the grouping __key__, and png as the additional key added to the example.
To get the expected behavior, the basenames of the files within the TARs should be fixed so that they contain only a single dot, the one separating the file extension.
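The first-dot rule described above can be sketched as follows. This is a minimal illustration, not the actual datasets implementation, and the renamed filename at the end is a hypothetical fix (dots replaced with underscores):

```python
import os

def split_basename(path):
    """Split a tar member name the way the WebDataset format does:
    everything up to the FIRST dot in the basename is the grouping
    __key__; everything after that dot is the field name."""
    dirname, basename = os.path.split(path)
    key, _, field = basename.partition(".")
    return (os.path.join(dirname, key) if dirname else key), field

# The problematic filename from this issue: the extra dot shifts the split.
key, field = split_basename("15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b.png")
# key == "15_Cohen_1-s2", field == "0-S0929664620300449-gr3_lrg-b.png"

# A hypothetical rename so that only the extension dot remains:
key2, field2 = split_basename("15_Cohen_1-s2_0-S0929664620300449-gr3_lrg-b.png")
# key2 == "15_Cohen_1-s2_0-S0929664620300449-gr3_lrg-b", field2 == "png"
```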
from datasets.
I'm reopening this because I think we should give a clearer error message with a specific error code.
For now, it's hard for the user to understand where the error comes from (not everybody knows the subtleties of the WebDataset filename structure).
(We can transfer it to https://github.com/huggingface/dataset-viewer if it fits better there.)
same with .jpg -> https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions
Error code: DatasetGenerationError
Exception: DatasetGenerationError
Message: An error occurred while generating the dataset
Traceback: Traceback (most recent call last):
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1748, in _prepare_split_single
for key, record in generator:
File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 818, in wrapped
for item in generator(*args, **kwargs):
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/packaged_modules/webdataset/webdataset.py", line 109, in _generate_examples
example[field_name] = {"path": example["__key__"] + "." + field_name, "bytes": example[field_name]}
KeyError: 'jpg'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1316, in compute_config_parquet_and_info_response
parquet_operations, partial = stream_convert_to_parquet(
File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 909, in stream_convert_to_parquet
builder._prepare_split(
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1627, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1784, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
More details in the spec (https://docs.google.com/document/d/18OdLjruFNX74ILmgrdiCI9J1fQZuhzzRBCHV9URWto0/edit#heading=h.hkptaq2kct2s)
The prefix of a file is all directory components of the file plus the file name component up to the first “.” in the file name.
The last extension (i.e., the portion after the last “.”) in a file name determines the file type.
Example:
images17/image194.left.jpg
images17/image194.right.jpg
images17/image194.json
images17/image12.left.jpg
images17/image12.json
images17/image12.right.jpg
images3/image1459.left.jpg
…
When reading this with a WebDataset library, you would get the following two dictionaries back in sequence:
{ "__key__": "images17/image194", "left.jpg": b"...", "right.jpg": b"...", "json": b"..." }
{ "__key__": "images17/image12", "left.jpg": b"...", "right.jpg": b"...", "json": b"..." }
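The grouping behavior described in the spec can be sketched as follows. This is a simplified illustration of the prefix rule, not the actual WebDataset or datasets implementation:

```python
import itertools
import os

def group_by_prefix(filenames):
    """Group consecutive tar member names into WebDataset examples:
    the prefix (directory components plus the basename up to the first
    dot) is the __key__; the rest of the basename is the field name."""
    def prefix(name):
        dirname, basename = os.path.split(name)
        stem = basename.split(".", 1)[0]
        return os.path.join(dirname, stem) if dirname else stem

    for key, names in itertools.groupby(filenames, key=prefix):
        example = {"__key__": key}
        for name in names:
            field = os.path.basename(name).split(".", 1)[1]
            example[field] = b"..."  # placeholder for the member's bytes
        yield example

files = [
    "images17/image194.left.jpg",
    "images17/image194.right.jpg",
    "images17/image194.json",
    "images17/image12.left.jpg",
    "images17/image12.json",
    "images17/image12.right.jpg",
]
examples = list(group_by_prefix(files))
# examples[0]["__key__"] == "images17/image194"
# examples[1]["__key__"] == "images17/image12"
```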
OK, the issue is different in the latter case: some files are suffixed as .jpeg, and others as .jpg :)
Is it a limitation of the webdataset format, or of the datasets library @lhoestq? And could we be able to give a clearer error?
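A simplified sketch of why the mixed suffixes surface as KeyError: 'jpg': the loader infers the set of field names from the first example(s) and then indexes every example with those names, so a sample whose image is stored as .jpeg has no "jpg" field. The helper and field names below are illustrative, not datasets' actual code:

```python
def check_consistent_fields(examples):
    """Infer the field names from the first example and require every
    other example to provide the same fields, mimicking how a loader
    that fixed its schema up front would fail on a missing field."""
    inferred = [f for f in examples[0] if f != "__key__"]
    for example in examples:
        for field in inferred:
            if field not in example:
                raise KeyError(field)  # surfaces as KeyError: 'jpg'
    return inferred

examples = [
    {"__key__": "a", "jpg": b"...", "txt": b"caption"},
    {"__key__": "b", "jpeg": b"...", "txt": b"caption"},  # .jpeg, not .jpg
]
# check_consistent_fields(examples) raises KeyError('jpg') on the second example
```

Renaming the files so that all images share a single extension (all .jpg or all .jpeg) avoids the mismatch.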