I planned to use datatrove to apply my tokenizer so that data is ready to use with nan

How to load a dataset with the output of a tokenizer? (datatrove issue, closed)

Jeronymous commented on September 25, 2024

How to load a dataset with the output of a tokenizer?

from datatrove.

Comments (5)

guipenedo commented on September 25, 2024

Hi, I pinged the nanotron team internally and they are working on moving support of datatrove's .ds files to the public repo :)

guipenedo commented on September 25, 2024

Nanotron also has support for a custom dataloader now, where you can plug in datatrove's Dataset directly: huggingface/nanotron#162
Closing the issue

manuelbrack commented on September 25, 2024

Following up on this:

Are there some helpers to convert .ds files into parquet files (or something loadable with datasets) for a given context size?

I was wondering the same thing. I really like datatrove, but as of now I cannot use the tokenization pipeline, since there is no way to load the data :D

guipenedo commented on September 25, 2024

Hi! Two things:

1. You can always convert the text data to any format by using a writer. For instance, you can convert it to parquet using a ParquetWriter. Note, however, that this writes the data as text, not as the actual tokens.
2. I have just added a torch Dataset that can load tokens from a dataset tokenized by datatrove into .ds files here. Let me know if you would require some changes to this class for it to work with your setup.
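
To illustrate the idea behind loading tokens for a given context size, here is a minimal sketch in plain Python. It is not datatrove's actual Dataset class. It assumes the token file is a flat sequence of little-endian unsigned 16-bit token ids and that each sample is a window of `seq_len + 1` tokens; both are assumptions to check against your own tokenization settings. The helper names (`write_tokens`, `read_window`, `windows_to_rows`) and the file `demo.ds` are hypothetical and exist only for this example.

```python
import os
import struct
import tempfile

TOKEN_SIZE = 2  # bytes per token id; assumed uint16, adjust for larger vocabularies


def write_tokens(path, tokens):
    """Write token ids as a flat little-endian uint16 binary file (demo data only)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(tokens)}H", *tokens))


def num_windows(path, seq_len):
    """Number of complete (seq_len + 1)-token windows in the file."""
    return os.path.getsize(path) // ((seq_len + 1) * TOKEN_SIZE)


def read_window(path, index, seq_len):
    """Read window number `index`, i.e. seq_len + 1 consecutive token ids."""
    window = seq_len + 1
    with open(path, "rb") as f:
        f.seek(index * window * TOKEN_SIZE)
        raw = f.read(window * TOKEN_SIZE)
    if len(raw) < window * TOKEN_SIZE:
        raise IndexError("window index out of range")
    return list(struct.unpack(f"<{window}H", raw))


def windows_to_rows(path, seq_len):
    """Collect every complete window as a dict row, ready to hand to another format."""
    return [
        {"input_ids": read_window(path, i, seq_len)}
        for i in range(num_windows(path, seq_len))
    ]


# Demo: 10 fake token ids, context size 4, so each window holds 5 ids.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.ds")
write_tokens(demo_path, list(range(10)))
rows = windows_to_rows(demo_path, seq_len=4)
```

Rows shaped like this could then be written to parquet or wrapped in a torch Dataset; trailing tokens that do not fill a complete window are dropped in this sketch.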

Jeronymous commented on September 25, 2024

Awesome, thanks a lot @guipenedo
