Comments (12)
Hello,
Have you tried loading the dataset in streaming mode? (See the documentation.)
This way you wouldn't have to load it all. Also, let's be nice to Parquet, it's a really nice technology and we don't need to be mean :)
from datasets.
I have already downloaded part of it; I just want to know how to load that part. Streaming mode doesn't work for me because my network (in China) is unstable, and I don't want to redo the download again and again.
Just curious, isn't there a way to load only part of it?
Could you convert the IterableDataset to a Dataset after taking the first 100 rows with .take? This way you would have a local copy of the first 100 rows on your system and wouldn't need to download them again. Would that work?
Here is a Stack Overflow question detailing how to do the conversion.
I mean, the Parquet files are named like:
00000-0143554
00001-0143554
00002-0143554
...
00100-0143554
...
09100-0143554
I just downloaded the first 9900 parts of it.
I cannot load them with load_dataset; it throws an error saying my files do not match the full set of Parquet files.
How can I load only the ones I have?
(I really don't want to download them all, because I don't need all of it, and plus, it's huge....)
As I said, I have downloaded about 9999... It's not about streaming... I just want to know how to load an offline... part....
Hi, @lucasjinreal.
I am not sure I understand your issue. What is the error message and stack trace you get? What version of datasets are you using? Could you provide a reproducible example?
Without knowing all those details, I would naively say that you can load whatever number of Parquet files by using the "parquet" loader: https://huggingface.co/docs/datasets/loading#parquet
ds = load_dataset("parquet", data_files="data/train-001*-of-00314.parquet", split="train")
@albertvillanova Not sure whether you have tested this, but I have tried it.
The only error I get is that it still tries to load all the Parquet files, with a progress bar whose maximum is the full count (014354); it loads my 0–000999 part and then throws an error
saying the num info does not match.
I am so confused.
Yes, my code snippet works.
Could you copy-paste your code and the output? Otherwise we are not able to know what the issue is.
@albertvillanova Hi, thanks for tracing the issue.
This is the output:
python get_llava_recap_cc3m.py
Generating train split: 3%|███▋ | 101910/3199866 [00:16<08:30, 6065.67 examples/s]
Traceback (most recent call last):
File "get_llava_recap_cc3m.py", line 31, in <module>
dataset = load_dataset("llava-recap-cc3m/", data_files="data/train-0000*-of-00314.parquet")
File "/usr/local/lib/python3.8/dist-packages/datasets/load.py", line 2582, in load_dataset
builder_instance.download_and_prepare(
File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 1005, in download_and_prepare
self._download_and_prepare(
File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 1118, in _download_and_prepare
verify_splits(self.info.splits, split_dict)
File "/usr/local/lib/python3.8/dist-packages/datasets/utils/info_utils.py", line 101, in verify_splits
raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=156885281898.75, num_examples=3199866, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=4994080770, num_examples=101910, shard_lengths=[10191, 10291, 10291, 10291, 10291, 10191, 10191, 10291, 10291, 9591], dataset_name='llava-recap-cc3m')}]
This is my code:
dataset = load_dataset("llava-recap-cc3m/", data_files="data/train-0000*-of-00314.parquet")
My situation and requirements:
00314 is the total, but I downloaded about 150 of them, half of it; as you can see, I used 0000*-of-00314,
which should match only the first shards.
But it just fails.
Can you understand my issue now?
If so, then please don't suggest streaming. I just want to know whether there is a way to load part of it...... And please don't say you cannot replicate my issue because you have downloaded them all. My English is not good, but I think I have already described the whole situation and all the prerequisites.
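As an aside, the glob in the traceback can be checked with a quick stdlib sketch: train-0000*-of-00314.parquet matches only shards 00000 through 00009, which lines up with the ten entries in shard_lengths in the error above.

```python
# Check which shard file names the glob pattern from the traceback matches.
from fnmatch import fnmatch

shards = [f"train-{i:05d}-of-00314.parquet" for i in range(314)]
matched = [s for s in shards if fnmatch(s, "train-0000*-of-00314.parquet")]

print(len(matched))              # 10
print(matched[0], matched[-1])   # train-00000-... train-00009-...
```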
I see you did not use the "parquet" loader as I suggested in my code snippet above: #6979 (comment)
Please try passing "parquet" instead of "llava-recap-cc3m/" to load_dataset, and the complete path to the data files in data_files:
load_dataset("parquet", data_files="llava-recap-cc3m/data/train-001*-of-00314.parquet")
Let me explain: you get the error because of this content within the dataset_info YAML tag in llava-recap-cc3m/README.md:
- name: train
num_bytes: 156885281898.75
num_examples: 3199866
By default, if that content is present in the README file, load_dataset performs a basic check to verify that the generated number of examples matches the expected one, and raises a NonMatchingSplitsSizesError if that is not the case.
You can avoid this basic check by passing verification_mode="no_checks":
load_dataset("llava-recap-cc3m/", data_files="data/train-0000*-of-00314.parquet", verification_mode="no_checks")
And please, next time you have an issue, fill in the bug report issue template with all the necessary information: https://github.com/huggingface/datasets/issues/new?assignees=&labels=&projects=&template=bug-report.yml
Otherwise it is very difficult for us to understand the underlying problem and to propose a pertinent solution.
Thank you, Albert!
It solved my issue!