Comments (6)

lhoestq commented on May 25, 2024

Can you retry using datasets 2.19? We significantly improved the speed of downloading datasets with tons of small files.

pip install -U datasets

Now this takes 17 seconds on my side instead of the 17 minutes @loicmagne mentioned :)

>>> %time ds = load_dataset("loicmagne/open-subtitles-250-bitext-mining", data_files="data/*.jsonl")
Downloading readme: 100%|█████████████████████████████████| 13.7k/13.7k [00:00<00:00, 5.47MB/s]
Resolving data files: 100%|█████████████████████████████████| 250/250 [00:00<00:00, 612.51it/s]
Downloading data: 100%|██████████████████████████████████| 250/250 [00:12<00:00, 19.68files/s]
Generating train split: 247809 examples [00:00, 1057071.08 examples/s]
CPU times: user 4.95 s, sys: 3.1 s, total: 8.05 s
Wall time: 17.4 s

lhoestq commented on May 25, 2024

It's the fastest way I think :)

Alternatively, you can download the dataset repository locally using huggingface_hub (either via the CLI or in Python) and load the subsets one by one with a for loop as you were doing before (just pass the directory path to load_dataset instead of the dataset_id).
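
A minimal sketch of that alternative (the language-pair list is illustrative, and it assumes data_files patterns are resolved relative to the downloaded directory):

from huggingface_hub import snapshot_download
from datasets import load_dataset

# Download the full dataset repository once
local_dir = snapshot_download(
    repo_id="loicmagne/open-subtitles-250-bitext-mining",
    repo_type="dataset",
)

# Then load the subsets one by one from the local copy
langs = ["ka-ml", "br-sr", "ka-pt"]  # illustrative subset of the language pairs
subsets = {
    lang: load_dataset(local_dir, data_files=f"data/{lang}.jsonl", split="train")
    for lang in langs
}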

lhoestq commented on May 25, 2024

Hi !

It's possible to load multiple files at once:

data_files = "data/*.jsonl"
# Or pass a list of files
langs = ['ka-ml', 'br-sr', 'ka-pt', 'id-ko', ..., 'fi-ze_zh', 'he-kk', 'ka-tr']
data_files = [f"data/{lang}.jsonl" for lang in langs]
ds = load_dataset("loicmagne/open-subtitles-250-bitext-mining", data_files=data_files, split="train")

Also, maybe you can add a subset called "all" for people who want to load all the data without having to list all the languages?

  - config_name: all
    data_files: data/*.jsonl
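
If a config like that is added under the configs key of the dataset card, loading everything should then reduce to something like:

from datasets import load_dataset

# Assumes the "all" config above has been added to the dataset card
ds = load_dataset("loicmagne/open-subtitles-250-bitext-mining", "all", split="train")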

loicmagne commented on May 25, 2024

Thanks for your reply, it is indeed much faster. However, the result is a dataset where all the subsets are "merged" together and the language pair is lost:

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2'],
        num_rows: 247809
    })
})

I guess I could add a 'lang' feature to each row in the dataset; is there a better way to do it?
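
One rough sketch of that idea (it goes back to a per-file loading loop, so it trades some speed for the extra column; the langs list is illustrative):

from datasets import load_dataset, concatenate_datasets

langs = ["ka-ml", "br-sr", "ka-pt"]  # illustrative subset of the language pairs
subsets = []
for lang in langs:
    ds = load_dataset(
        "loicmagne/open-subtitles-250-bitext-mining",
        data_files=f"data/{lang}.jsonl",
        split="train",
    )
    # Tag every row with its language pair before merging
    subsets.append(ds.add_column("lang", [lang] * len(ds)))

merged = concatenate_datasets(subsets)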

KennethEnevoldsen commented on May 25, 2024

Hi @lhoestq, over at embeddings-benchmark/mteb#530 we have started examining these issues and would love to make a PR for datasets if we believe there is a way to improve the speed. As I assume you have a better overview than me, would you be interested in a PR, and might you have an idea about where we would start working on it?

We see a speed comparison of:

  1. 15 minutes (for ~20% of the languages) when loaded using a for loop
  2. 17 minutes using your suggestion
  3. ~30 seconds when using @loicmagne's "merged" method.

Worth mentioning is that solution 2 loses the language information.

loicmagne commented on May 25, 2024

> Can you retry using datasets 2.19? We significantly improved the speed of downloading datasets with tons of small files.
>
> pip install -U datasets
>
> Now this takes 17 seconds on my side instead of the 17 minutes @loicmagne mentioned :)
>
> >>> %time ds = load_dataset("loicmagne/open-subtitles-250-bitext-mining", data_files="data/*.jsonl")
> Downloading readme: 100%|█████████████████████████████████| 13.7k/13.7k [00:00<00:00, 5.47MB/s]
> Resolving data files: 100%|█████████████████████████████████| 250/250 [00:00<00:00, 612.51it/s]
> Downloading data: 100%|██████████████████████████████████| 250/250 [00:12<00:00, 19.68files/s]
> Generating train split: 247809 examples [00:00, 1057071.08 examples/s]
> CPU times: user 4.95 s, sys: 3.1 s, total: 8.05 s
> Wall time: 17.4 s

I was actually just noticing that: I bumped from 2.18 to 2.19 and got a massive speedup, amazing!

About the fact that subset names are lost when loading all files at once: currently my solution is to add a 'lang' feature to each row, convert to Polars, and use:

ds_split = ds.to_polars().group_by('lang')

It's fast, so I think it's an acceptable solution, but is there a better way to do it?
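
For completeness, a sketch of turning that group-by back into one Dataset per language pair (assuming a datasets release that exposes Dataset.from_polars; the group key type depends on the Polars version):

from datasets import Dataset

per_lang = {}
for key, group in ds.to_polars().group_by("lang"):
    # group_by keys may be scalars or 1-tuples depending on the Polars version
    lang = key[0] if isinstance(key, tuple) else key
    per_lang[lang] = Dataset.from_polars(group.drop("lang"))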
