Comments (2)
Similar issue in text processing:
from functools import partial
import datasets
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_dir[args.model])
train_dataset = datasets.load_from_disk(dataset_dir[args.dataset], keep_in_memory=True)['train']
train_dataset = train_dataset.map(partial(dname2func[args.dataset], tokenizer=tokenizer), batched=True, num_proc=50, remove_columns=train_dataset.features.keys(), desc='tokenize', keep_in_memory=True)
After this, train_dataset looks like this:
Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 51760
})
Here input_ids and labels are both List[int].
However, each iteration over the dataset costs 7.412479639053345 s. Why?
for j in tqdm(range(len(train_dataset)), desc='first stage'):
    input_id, label = train_dataset['input_ids'][j], train_dataset['labels'][j]
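One possible reason for the per-iteration cost, offered as a hedged sketch rather than a confirmed diagnosis: if indexing by column name rebuilds the whole column as a Python list, the loop above does that twice per row. The `ToyColumnTable` class below is a hypothetical stand-in written only for illustration (it is not the `datasets` API); it counts how often a column is materialized when the access sits inside the loop versus hoisted out of it.

```python
class ToyColumnTable:
    """Hypothetical stand-in for a column-oriented dataset (NOT the datasets API).

    Assumption for illustration: indexing by column name copies the whole
    column into a fresh Python list on every access.
    """

    def __init__(self, columns):
        self._columns = columns
        self.materializations = 0  # how many times a full column was copied

    def __len__(self):
        return len(next(iter(self._columns.values())))

    def __getitem__(self, name):
        self.materializations += 1
        return [list(row) for row in self._columns[name]]


def pairs_per_index(table):
    """Same access pattern as the loop above: column lookup inside the loop."""
    out = []
    for j in range(len(table)):
        out.append((table["input_ids"][j], table["labels"][j]))
    return out


def pairs_hoisted(table):
    """Fetch each column once, then iterate over the two lists together."""
    input_ids = table["input_ids"]
    labels = table["labels"]
    return list(zip(input_ids, labels))


def make_table():
    rows = [[1, 2], [3], [4, 5, 6]]
    return ToyColumnTable({"input_ids": rows, "labels": rows})


slow_table, fast_table = make_table(), make_table()
slow_pairs = pairs_per_index(slow_table)
fast_pairs = pairs_hoisted(fast_table)
print(slow_table.materializations, fast_table.materializations)
```

With a real dataset the same idea would be to read `train_dataset['input_ids']` and `train_dataset['labels']` once before the loop; whether that removes the slowdown depends on how the installed `datasets` version implements column access, so treat it as something to measure.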
from datasets.
The transform currently replaces the numpy formatting.
So you're back to copying data into long Python lists, which is super slow.
It would be cool for the transform not to remove the formatting in this case, but this requires a few changes in the lib.