Comments (6)

lhoestq commented on May 25, 2024

Can you retry using datasets 2.19? We significantly improved the speed of downloading datasets with tons of small files.

pip install -U datasets

Now this takes 17 seconds on my side instead of the 17 minutes @loicmagne mentioned :)

>>> %time ds = load_dataset("loicmagne/open-subtitles-250-bitext-mining", data_files="data/*.jsonl")
Downloading readme: 100%|█████████████████████████████████| 13.7k/13.7k [00:00<00:00, 5.47MB/s]
Resolving data files: 100%|█████████████████████████████████| 250/250 [00:00<00:00, 612.51it/s]
Downloading data: 100%|██████████████████████████████████| 250/250 [00:12<00:00, 19.68files/s]
Generating train split: 247809 examples [00:00, 1057071.08 examples/s]
CPU times: user 4.95 s, sys: 3.1 s, total: 8.05 s
Wall time: 17.4 s

lhoestq commented on May 25, 2024

It's the fastest way I think :)

Alternatively, you can download the dataset repository locally using huggingface_hub (either via the CLI or in Python) and load the subsets one by one with a for loop as you were doing before (just pass the directory path to load_dataset instead of the dataset_id).
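
A minimal sketch of that alternative (the language-pair list is illustrative, and it assumes data_files patterns are resolved relative to the downloaded directory):

from huggingface_hub import snapshot_download
from datasets import load_dataset

# Download the full dataset repository once
local_dir = snapshot_download(
    repo_id="loicmagne/open-subtitles-250-bitext-mining",
    repo_type="dataset",
)

# Then load the subsets one by one from the local copy
langs = ["ka-ml", "br-sr", "ka-pt"]  # illustrative subset of the language pairs
subsets = {
    lang: load_dataset(local_dir, data_files=f"data/{lang}.jsonl", split="train")
    for lang in langs
}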

lhoestq commented on May 25, 2024

Hi !

It's possible to load multiple files at once:

data_files = "data/*.jsonl"
# Or pass a list of files
langs = ['ka-ml', 'br-sr', 'ka-pt', 'id-ko', ..., 'fi-ze_zh', 'he-kk', 'ka-tr']
data_files = [f"data/{lang}.jsonl" for lang in langs]
ds = load_dataset("loicmagne/open-subtitles-250-bitext-mining", data_files=data_files, split="train")

Also, maybe you can add a subset called "all" for people who want to load all the data without having to list all the languages?

  - config_name: all
    data_files: data/*.jsonl
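
If a config like that is added under the configs key of the dataset card, loading everything should then reduce to something like:

from datasets import load_dataset

# Assumes the "all" config above has been added to the dataset card
ds = load_dataset("loicmagne/open-subtitles-250-bitext-mining", "all", split="train")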

loicmagne commented on May 25, 2024

Thanks for your reply, it is indeed much faster. However, the result is a dataset where all the subsets are "merged" together and the language pair is lost:

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2'],
        num_rows: 247809
    })
})

I guess I could add a 'lang' feature to each row in the dataset; is there a better way to do it?
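
One rough sketch of that idea (it goes back to a per-file loading loop, so it trades some speed for the extra column; the langs list is illustrative):

from datasets import load_dataset, concatenate_datasets

langs = ["ka-ml", "br-sr", "ka-pt"]  # illustrative subset of the language pairs
subsets = []
for lang in langs:
    ds = load_dataset(
        "loicmagne/open-subtitles-250-bitext-mining",
        data_files=f"data/{lang}.jsonl",
        split="train",
    )
    # Tag every row with its language pair before merging
    subsets.append(ds.add_column("lang", [lang] * len(ds)))

merged = concatenate_datasets(subsets)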

KennethEnevoldsen commented on May 25, 2024

Hi @lhoestq, over at embeddings-benchmark/mteb#530 we have started examining these issues and would love to make a PR for datasets if we believe there is a way to improve the speed. As I assume you have a better overview than me, would you be interested in a PR, and might you have an idea about where we would start working on it?

We see a speed comparison of:

  1. 15 minutes (for ~20% of the languages) when loaded using a for loop
  2. 17 minutes using your suggestion
  3. ~30 seconds when using @loicmagne's "merged" method.

Worth mentioning is that solution 2 loses the language information.

loicmagne commented on May 25, 2024

> Can you retry using datasets 2.19? We significantly improved the speed of downloading datasets with tons of small files.
>
> pip install -U datasets
>
> Now this takes 17 seconds on my side instead of the 17 minutes @loicmagne mentioned :)
>
> >>> %time ds = load_dataset("loicmagne/open-subtitles-250-bitext-mining", data_files="data/*.jsonl")
> Downloading readme: 100%|█████████████████████████████████| 13.7k/13.7k [00:00<00:00, 5.47MB/s]
> Resolving data files: 100%|█████████████████████████████████| 250/250 [00:00<00:00, 612.51it/s]
> Downloading data: 100%|██████████████████████████████████| 250/250 [00:12<00:00, 19.68files/s]
> Generating train split: 247809 examples [00:00, 1057071.08 examples/s]
> CPU times: user 4.95 s, sys: 3.1 s, total: 8.05 s
> Wall time: 17.4 s

I was actually just noticing that: I bumped from 2.18 to 2.19 and got a massive speedup, amazing!

About the fact that subset names are lost when loading all files at once: currently my solution is to add a 'lang' feature to each row, convert to Polars, and use:

ds_split = ds.to_polars().group_by('lang')

It's fast, so I think it's an acceptable solution, but is there a better way to do it?
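
For completeness, a sketch of turning that group-by back into one Dataset per language pair (assuming a datasets release that exposes Dataset.from_polars; the group key type depends on the Polars version):

from datasets import Dataset

per_lang = {}
for key, group in ds.to_polars().group_by("lang"):
    # group_by keys may be scalars or 1-tuples depending on the Polars version
    lang = key[0] if isinstance(key, tuple) else key
    per_lang[lang] = Dataset.from_polars(group.drop("lang"))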
