Comments (5)

albertvillanova commented on June 16, 2024

The error is caused by malformed basenames of the files within the TARs:

  • 15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b.png is split at the first dot, so 15_Cohen_1-s2 becomes the grouping __key__ and 0-S0929664620300449-gr3_lrg-b.png becomes the additional key added to the example
  • whereas the intended behavior was to use 15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b as the grouping __key__ and png as the additional key added to the example

To get the expected behavior, the basenames of the files within the TARs should be fixed so that they only contain a single dot, the one separating the file extension.
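A minimal sketch of the split described above, assuming the key is everything up to the first dot of the basename and the field is everything after it. The function name `split_webdataset_name` is hypothetical and is not part of the datasets or webdataset libraries:

```python
from typing import Tuple


def split_webdataset_name(path: str) -> Tuple[str, str]:
    """Split a TAR member path into (key, field) at the first dot of the basename.

    Illustrative helper only; it mimics the naming convention, not library code.
    """
    directory, _, basename = path.rpartition("/")
    stem, _, field = basename.partition(".")
    key = f"{directory}/{stem}" if directory else stem
    return key, field


name = "15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b.png"
print(split_webdataset_name(name))
# ('15_Cohen_1-s2', '0-S0929664620300449-gr3_lrg-b.png')
```

Renaming the file so that its only dot is the one before the extension (for example, replacing the inner dot with an underscore) would make the same function return the full stem as the key and png as the field.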

from datasets.

severo commented on June 16, 2024

I'm reopening this because I think we should try to give a clearer error message with a specific error code.

For now, it's hard for the user to understand where the error comes from (not everybody knows the subtleties of the webdataset filename structure).

(we can transfer it to https://github.com/huggingface/dataset-viewer if it fits better there)

severo commented on June 16, 2024

Same error with .jpg files: https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions

Error code:   DatasetGenerationError
Exception:    DatasetGenerationError
Message:      An error occurred while generating the dataset
Traceback:    Traceback (most recent call last):
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1748, in _prepare_split_single
                  for key, record in generator:
                File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 818, in wrapped
                  for item in generator(*args, **kwargs):
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/packaged_modules/webdataset/webdataset.py", line 109, in _generate_examples
                  example[field_name] = {"path": example["__key__"] + "." + field_name, "bytes": example[field_name]}
              KeyError: 'jpg'
              
              The above exception was the direct cause of the following exception:
              
              Traceback (most recent call last):
                File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1316, in compute_config_parquet_and_info_response
                  parquet_operations, partial = stream_convert_to_parquet(
                File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 909, in stream_convert_to_parquet
                  builder._prepare_split(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1627, in _prepare_split
                  for job_id, done, content in self._prepare_split_single(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1784, in _prepare_split_single
                  raise DatasetGenerationError("An error occurred while generating the dataset") from e
              datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

severo commented on June 16, 2024

More details in the spec (https://docs.google.com/document/d/18OdLjruFNX74ILmgrdiCI9J1fQZuhzzRBCHV9URWto0/edit#heading=h.hkptaq2kct2s)

The prefix of a file is all directory components of the file plus the file name component up to the first “.” in the file name.
The last extension (i.e., the portion after the last “.”) in a file name determines the file type.

Example:
images17/image194.left.jpg
images17/image194.right.jpg
images17/image194.json
images17/image12.left.jpg
images17/image12.json
images17/image12.right.jpg
images3/image1459.left.jpg

When reading this with a WebDataset library, you would get the following two dictionaries back in sequence:

    {"__key__": "images17/image194", "left.jpg": b"...", "right.jpg": b"...", "json": b"..."}
    {"__key__": "images17/image12", "left.jpg": b"...", "right.jpg": b"...", "json": b"..."}
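The grouping rule quoted from the spec can be sketched as follows. This is an illustration under stated assumptions, not the code of any WebDataset reader: `group_by_key` is a hypothetical name, the input is a plain list of (path, bytes) pairs rather than a real TAR file, and files belonging to one sample are assumed to be adjacent.

```python
from typing import Dict, Iterable, Iterator, Tuple


def group_by_key(members: Iterable[Tuple[str, bytes]]) -> Iterator[Dict[str, object]]:
    """Group consecutive (path, data) pairs that share the same prefix.

    The prefix is the directory plus the basename up to its first dot.
    """
    current = None
    for path, data in members:
        directory, _, basename = path.rpartition("/")
        stem, _, field = basename.partition(".")
        key = f"{directory}/{stem}" if directory else stem
        if current is None or current["__key__"] != key:
            if current is not None:
                yield current
            current = {"__key__": key}
        current[field] = data
    if current is not None:
        yield current


listing = [
    "images17/image194.left.jpg",
    "images17/image194.right.jpg",
    "images17/image194.json",
    "images17/image12.left.jpg",
    "images17/image12.json",
    "images17/image12.right.jpg",
    "images3/image1459.left.jpg",
]
samples = list(group_by_key((path, b"...") for path in listing))
for sample in samples:
    print(sample["__key__"], sorted(k for k in sample if k != "__key__"))
```

In this sketch the trailing images3/image1459.left.jpg entry starts a third sample of its own, in addition to the two complete dictionaries shown in the spec excerpt.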

severo commented on June 16, 2024

OK, the issue is different in the latter case: some files are suffixed as .jpeg, and others as .jpg :)

Is it a limitation of the webdataset format, or of the datasets library, @lhoestq? And could we give a clearer error?
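One way to spot this kind of mix before loading is to count the field names that appear across the member paths of a shard. The sketch below is only an illustration: `count_fields` is a hypothetical helper, and the paths are made-up examples, not files from the dataset linked above.

```python
from collections import Counter
from typing import Iterable


def count_fields(paths: Iterable[str]) -> Counter:
    """Count how often each field (text after the first dot of the basename) occurs."""
    counts: Counter = Counter()
    for path in paths:
        basename = path.rpartition("/")[2]
        _, dot, field = basename.partition(".")
        # Files without any dot are counted under the empty string.
        counts[field if dot else ""] += 1
    return counts


# Made-up member names mixing the two suffixes.
paths = ["a/0001.jpg", "a/0001.txt", "a/0002.jpeg", "a/0002.txt"]
counts = count_fields(paths)
print(counts)
```

Seeing both jpg and jpeg in the result would suggest that the samples do not all expose the same field, which is consistent with the KeyError: 'jpg' in the traceback above.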
