Comments (6)
Can you retry using datasets 2.19? We greatly improved the speed of downloading datasets with tons of small files.
pip install -U datasets
Now this takes 17 sec on my side instead of the 17 min @loicmagne mentioned :)
>>> %time ds = load_dataset("loicmagne/open-subtitles-250-bitext-mining", data_files="data/*.jsonl")
Downloading readme: 100%|█████████████████████████████████| 13.7k/13.7k [00:00<00:00, 5.47MB/s]
Resolving data files: 100%|█████████████████████████████████| 250/250 [00:00<00:00, 612.51it/s]
Downloading data: 100%|██████████████████████████████████| 250/250 [00:12<00:00, 19.68files/s]
Generating train split: 247809 examples [00:00, 1057071.08 examples/s]
CPU times: user 4.95 s, sys: 3.1 s, total: 8.05 s
Wall time: 17.4 s
It's the fastest way I think :)
Alternatively, you can download the dataset repository locally using huggingface_hub (either via the CLI or in Python) and load the subsets one by one locally using a for loop, as you were doing before (just pass the directory path to load_dataset instead of the dataset_id).
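For reference, a minimal sketch of that alternative, assuming the data/<lang>.jsonl layout of this repo (snapshot_download fetches the whole dataset repo once):

from pathlib import Path
from datasets import load_dataset
from huggingface_hub import snapshot_download

# Download the full dataset repo once, then load each language pair locally
local_dir = snapshot_download(
    repo_id="loicmagne/open-subtitles-250-bitext-mining",
    repo_type="dataset",
)
datasets_by_lang = {}
for path in sorted(Path(local_dir, "data").glob("*.jsonl")):
    lang = path.stem  # e.g. "ka-ml"
    datasets_by_lang[lang] = load_dataset("json", data_files=str(path), split="train")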
Hi!
It's possible to load multiple files at once:
data_files = "data/*.jsonl"
# Or pass a list of files
langs = ['ka-ml', 'br-sr', 'ka-pt', 'id-ko', ..., 'fi-ze_zh', 'he-kk', 'ka-tr']
data_files = [f"data/{lang}.jsonl" for lang in langs]
ds = load_dataset("loicmagne/open-subtitles-250-bitext-mining", data_files=data_files, split="train")
Also, maybe you could add a subset called "all" for people who want to load all the data without having to list all the languages?
configs:
- config_name: all
  data_files: data/*.jsonl
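With such a config in the README metadata, loading everything would become a one-liner (sketch, assuming the hypothetical config name "all"):

from datasets import load_dataset

# "all" is the hypothetical config suggested above
ds = load_dataset("loicmagne/open-subtitles-250-bitext-mining", "all", split="train")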
Thanks for your reply, it is indeed much faster. However, the result is a dataset where all the subsets are "merged" together and the language pair is lost:
DatasetDict({
train: Dataset({
features: ['sentence1', 'sentence2'],
num_rows: 247809
})
})
I guess I could add a 'lang' feature to each row in the dataset; is there a better way to do it?
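One possible sketch of that workaround: load each file separately and tag its rows before concatenating (the column name "lang" is an arbitrary choice here, and this keeps the slower one-load-per-file path):

from datasets import concatenate_datasets, load_dataset

langs = ["ka-ml", "br-sr", "ka-pt"]  # illustrative subset of the 250 pairs
parts = []
for lang in langs:
    part = load_dataset(
        "loicmagne/open-subtitles-250-bitext-mining",
        data_files=f"data/{lang}.jsonl",
        split="train",
    )
    # Tag every row with its language pair so the info survives merging
    parts.append(part.add_column("lang", [lang] * len(part)))
ds_all = concatenate_datasets(parts)  # features: sentence1, sentence2, lang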
Hi @lhoestq, over at embeddings-benchmark/mteb#530 we have started examining these issues and would love to make a PR for datasets if there is a way to improve the speed. As I assume you have a better overview than I do, would you be interested in a PR, and do you have an idea of where we should start working on it?
We see a speed comparison of:
- 15 minutes (for ~20% of the languages) when loading with a for loop
- 17 minutes using your suggestion
- ~30 seconds when using @loicmagne's "merged" method
Worth mentioning is that solution 2 loses the language information.
I was actually just noticing that, I bumped from 2.18 to 2.19 and got a massive speedup, amazing!
About the fact that subset names are lost when loading all files at once: currently my solution is to add a 'lang' feature to each row, convert to Polars, and use:
ds_split = ds.to_polars().group_by('lang')
It's fast, so I think it's an acceptable solution, but is there a better way to do it?
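For completeness, a sketch of that grouping step (it assumes a 'lang' column was already added to each row; note that recent Polars versions return tuple keys from group_by):

# Split the merged dataset back into one frame per language pair
df = ds.to_polars()
ds_split = {key[0]: group for key, group in df.group_by("lang")}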