
Comments (12)

Dref360 commented on June 23, 2024

Hello,

Have you tried loading the dataset in streaming mode? See the documentation.

This way you wouldn't have to load it all. Also, let's be nice to Parquet; it's a really nice technology and we don't need to be mean :)
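For reference, a minimal streaming sketch (the repo id below is a placeholder, and this assumes a recent datasets version with IterableDataset.take):

from datasets import load_dataset

# Stream the dataset: rows are fetched lazily as you iterate,
# so nothing has to be downloaded up front.
ds = load_dataset("owner/llava-recap-cc3m", split="train", streaming=True)  # hypothetical repo id
for row in ds.take(5):
    print(row)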


lucasjinreal commented on June 23, 2024

I have downloaded part of it and just want to know how to load that part. Streaming mode does not work for me since my network (in China) is not stable, and I don't want to do it all over again and again.

Just curious, isn't there a way to load part of it?


Dref360 commented on June 23, 2024

Could you convert the IterableDataset to a Dataset after taking the first 100 rows with .take? This way, you would have a local copy of the first 100 rows on your system and thus won't need to download. Would that work?

Here is a SO question detailing how to do the conversion.
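A minimal sketch of that conversion (the repo id is again a placeholder; Dataset.from_generator materializes the streamed rows into a regular local Dataset):

from datasets import Dataset, load_dataset

# Stream the remote dataset and keep only the first 100 rows.
streamed = load_dataset("owner/llava-recap-cc3m", split="train", streaming=True)  # hypothetical repo id
head = streamed.take(100)

# Materialize the streamed rows into a regular (Arrow-backed) Dataset on disk.
ds = Dataset.from_generator(lambda: (row for row in head))
print(ds)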


lucasjinreal commented on June 23, 2024

I mean, the parquet files are like:

00000-0143554
00001-0143554
00002-0143554
...
00100-0143554
...
09100-0143554

I just downloaded the first 9900 parts of it.

I cannot load it with load_dataset; it throws an error saying my files don't match the full parquet set.

How can I load only the part I have?

(I really don't want to download them all, because I don't need all of it, and plus, it's huge....)

As I said, I have downloaded about 9999... It's not about streaming... I just want to know how to load the part I have offline...


albertvillanova commented on June 23, 2024

Hi, @lucasjinreal.

I am not sure I understand your issue. What is the error message and stack trace you get? What version of datasets are you using? Could you provide a reproducible example?

Without knowing all those details, I would naively say that you can load any number of Parquet files by using the "parquet" loader: https://huggingface.co/docs/datasets/loading#parquet

ds = load_dataset("parquet", data_files="data/train-001*-of-00314.parquet", split="train")


lucasjinreal commented on June 23, 2024

@albertvillanova Not sure whether you have tested this, but I have tried it.

The only error I got is that it still loads all the parquet files, with a progress bar whose maximum is the whole count (0143554); it loads my 0 - 000999 part and then throws an error.

It says the num info is not the same.

I am so confused.


albertvillanova commented on June 23, 2024

Yes, my code snippet works.

Could you copy-paste your code and the output? Otherwise we have no way of knowing what the issue is.


lucasjinreal commented on June 23, 2024

@albertvillanova Hi, thanks for looking into the issue.

This is the output:

python get_llava_recap_cc3m.py
Generating train split:   3%|███▋                                                                                                                | 101910/3199866 [00:16<08:30, 6065.67 examples/s]
Traceback (most recent call last):
  File "get_llava_recap_cc3m.py", line 31, in <module>
    dataset = load_dataset("llava-recap-cc3m/", data_files="data/train-0000*-of-00314.parquet")
  File "/usr/local/lib/python3.8/dist-packages/datasets/load.py", line 2582, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 1118, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/info_utils.py", line 101, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=156885281898.75, num_examples=3199866, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=4994080770, num_examples=101910, shard_lengths=[10191, 10291, 10291, 10291, 10291, 10191, 10191, 10291, 10291, 9591], dataset_name='llava-recap-cc3m')}]

This is my code:

dataset = load_dataset("llava-recap-cc3m/", data_files="data/train-0000*-of-00314.parquet")

My situation and requirements:

00314 is the total, but I downloaded about 150 files, half of it. As you can see, I used 0000*-of-00314, which should mean at most 99 files get loaded.

But it just fails.

Can you understand my issue now?

If so, then please do not suggest streaming. I just want to know: is there a way to load part of it...... And please don't say you can't replicate my issue when you have downloaded them all. My English is not good, but I think I have already described the whole situation and all the prerequisites.


albertvillanova commented on June 23, 2024

I see you did not use the "parquet" loader as I suggested in my code snippet above: #6979 (comment)
Please try passing "parquet" instead of "llava-recap-cc3m/" to load_dataset, and the complete path to the data files in data_files:

load_dataset("parquet", data_files="llava-recap-cc3m/data/train-001*-of-00314.parquet")


albertvillanova commented on June 23, 2024

Let me explain: you get the error because of this content within the dataset_info YAML tag in llava-recap-cc3m/README.md:

  - name: train
    num_bytes: 156885281898.75
    num_examples: 3199866

By default, if that content is present in the README file, load_dataset performs a basic check to verify that the generated number of examples matches the expected one, and raises a NonMatchingSplitsSizesError if it does not.

You can avoid this basic check by passing verification_mode="no_checks":

load_dataset("llava-recap-cc3m/", data_files="data/train-0000*-of-00314.parquet", verification_mode="no_checks")


albertvillanova commented on June 23, 2024

And please, next time you have an issue, fill in the bug report issue template with all the necessary information: https://github.com/huggingface/datasets/issues/new?assignees=&labels=&projects=&template=bug-report.yml

Otherwise it is very difficult for us to understand the underlying problem and to propose a pertinent solution.


lucasjinreal commented on June 23, 2024

Thank you, Albert!

It solved my issue!

