Describe the bug Dataset is 10X slower when applying trivial trans


Super slow iteration with trivial custom transform (datasets issue, open)

xslittlegrass commented on June 17, 2024



Comments (2)

rangehow commented on June 17, 2024

I see a similar issue in text processing:

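from functools import partial

import datasets
from transformers import AutoTokenizer

# model_dir, dataset_dir, dname2func and args are defined elsewhere in the script
tokenizer = AutoTokenizer.from_pretrained(model_dir[args.model])
train_dataset = datasets.load_from_disk(dataset_dir[args.dataset], keep_in_memory=True)['train']
# Tokenize in batches across 50 worker processes and drop the original columns
train_dataset = train_dataset.map(
    partial(dname2func[args.dataset], tokenizer=tokenizer),
    batched=True,
    num_proc=50,
    remove_columns=train_dataset.features.keys(),
    desc='tokenize',
    keep_in_memory=True,
)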

After this, train_dataset looks like:

Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 51760
})

Both input_ids and labels are List[int].
However, each iteration over the dataset takes 7.412479639053345 s:

from tqdm import tqdm

for j in tqdm(range(len(train_dataset)), desc='first stage'):
    input_id, label = train_dataset['input_ids'][j], train_dataset['labels'][j]


lhoestq commented on June 17, 2024

The transform currently replaces the numpy formatting.

So you're back to copying data to long Python lists, which is super slow.

It would be cool for the transform to not remove the formatting in this case, but this requires a few changes in the library.
