Comments (12)
Hello,
Have you tried loading the dataset in streaming mode? (See the documentation.)
This way you wouldn't have to load it all. Also, let's be nice to Parquet, it's a really nice technology and we don't need to be mean :)
from datasets.
I have already downloaded part of it; I just want to know how to load that part. Streaming mode doesn't work for me because my network (in China) is unstable, and I don't want to redo the download again and again.
Just curious, isn't there a way to load only part of it?
Could you convert the IterableDataset to a Dataset after taking the first 100 rows with .take? This way you would have a local copy of the first 100 rows on your system and wouldn't need to download them again. Would that work?
Here is a Stack Overflow question detailing how to do the conversion.
I mean, the Parquet files are named like:
00000-0143554
00001-0143554
00002-0143554
...
00100-0143554
...
09100-0143554
I just downloaded the first 9900 parts of it.
I cannot load them with load_dataset; it throws an error saying my files do not match the full set of Parquet files.
How can I load only the ones I have?
(I really don't want to download them all, because I don't need all of it, and plus, it's huge....)
As I said, I have downloaded about 9999... It's not about streaming... I just want to know how to load an offline... part....
Hi, @lucasjinreal.
I am not sure I understand your issue. What is the error message and stack trace you get? What version of datasets are you using? Could you provide a reproducible example?
Without knowing all those details, I would naively say that you can load whatever number of Parquet files by using the "parquet" loader: https://huggingface.co/docs/datasets/loading#parquet
ds = load_dataset("parquet", data_files="data/train-001*-of-00314.parquet", split="train")
@albertvillanova Not sure whether you have tested this, but I have tried it.
The only error I get is that it still tries to load all the Parquet files, with a progress bar whose maximum is the full count (014354); it loads my 0–000999 part and then throws an error
saying the num info does not match.
I am so confused.
Yes, my code snippet works.
Could you copy-paste your code and the output? Otherwise we are not able to know what the issue is.
@albertvillanova Hi, thanks for tracing the issue.
This is the output:
python get_llava_recap_cc3m.py
Generating train split: 3%|███▋ | 101910/3199866 [00:16<08:30, 6065.67 examples/s]
Traceback (most recent call last):
File "get_llava_recap_cc3m.py", line 31, in <module>
dataset = load_dataset("llava-recap-cc3m/", data_files="data/train-0000*-of-00314.parquet")
File "/usr/local/lib/python3.8/dist-packages/datasets/load.py", line 2582, in load_dataset
builder_instance.download_and_prepare(
File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 1005, in download_and_prepare
self._download_and_prepare(
File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 1118, in _download_and_prepare
verify_splits(self.info.splits, split_dict)
File "/usr/local/lib/python3.8/dist-packages/datasets/utils/info_utils.py", line 101, in verify_splits
raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=156885281898.75, num_examples=3199866, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=4994080770, num_examples=101910, shard_lengths=[10191, 10291, 10291, 10291, 10291, 10191, 10191, 10291, 10291, 9591], dataset_name='llava-recap-cc3m')}]
This is my code:
dataset = load_dataset("llava-recap-cc3m/", data_files="data/train-0000*-of-00314.parquet")
My situation and requirements:
00314 is the total, but I downloaded about 150 of them, half of it; as you can see, I used 0000*-of-00314,
which should match only the first shards.
But it just fails.
Can you understand my issue now?
If so, then please don't suggest streaming. I just want to know whether there is a way to load part of it...... And please don't say you cannot replicate my issue because you have downloaded them all. My English is not good, but I think I have already described the whole situation and all the prerequisites.
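As an aside, the glob in the traceback can be checked with a quick stdlib sketch: train-0000*-of-00314.parquet matches only shards 00000 through 00009, which lines up with the ten entries in shard_lengths in the error above.

```python
# Check which shard file names the glob pattern from the traceback matches.
from fnmatch import fnmatch

shards = [f"train-{i:05d}-of-00314.parquet" for i in range(314)]
matched = [s for s in shards if fnmatch(s, "train-0000*-of-00314.parquet")]

print(len(matched))              # 10
print(matched[0], matched[-1])   # train-00000-... train-00009-...
```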
I see you did not use the "parquet" loader as I suggested in my code snippet above: #6979 (comment)
Please try passing "parquet" instead of "llava-recap-cc3m/" to load_dataset, and the complete path to the data files in data_files:
load_dataset("parquet", data_files="llava-recap-cc3m/data/train-001*-of-00314.parquet")
Let me explain: you get the error because of this content within the dataset_info YAML tag in llava-recap-cc3m/README.md:
- name: train
num_bytes: 156885281898.75
num_examples: 3199866
By default, if that content is present in the README file, load_dataset performs a basic check to verify that the generated number of examples matches the expected one, and raises a NonMatchingSplitsSizesError if that is not the case.
You can avoid this basic check by passing verification_mode="no_checks":
load_dataset("llava-recap-cc3m/", data_files="data/train-0000*-of-00314.parquet", verification_mode="no_checks")
And please, next time you have an issue, fill in the bug report issue template with all the necessary information: https://github.com/huggingface/datasets/issues/new?assignees=&labels=&projects=&template=bug-report.yml
Otherwise it is very difficult for us to understand the underlying problem and to propose a pertinent solution.
Thank you, Albert!
It solved my issue!