Comments (12)
Are you on the latest main branch? If so, can you either:
- Upload your combined checkpoints to HF (rerun the script, but now with the target folder being your HF dataset name), or
- Add a screenshot/paste of the folder structure of your translate and combine outputs
from llama2lang.
Yes, I am on the main branch.
- I have rerun combining checkpoints and I still have the same issue. https://huggingface.co/datasets/chrystians/oasst1_pl_2_2
- I am not quite sure what you mean by a screenshot:
root@c7a6c8a800b6:/madladLLaMa2lang/LLaMa2lang# python3 combine_checkpoints.py checkpointMadlad oasst1_pl_2_2
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████| 82/82 [00:00<00:00, 228.16ba/s]
Uploading the dataset shards: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.86s/it]
Creating parquet from Arrow format: 100%|█████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 270.26ba/s]
Uploading the dataset shards: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.28it/s]
root@c7a6c8a800b6:/madladLLaMa2lang/LLaMa2lang# python3 create_thread_prompts.py chrystians/oasst1_pl_2_2 "Jestes polskim chatbotem ktory odpowiada tylko po polsku" oasst1_pl_2_2_threads
Downloading readme: 100%|███████████████████████████████████████████████████████████████████████| 1.52k/1.52k [00:00<00:00, 11.5MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████| 19.7M/19.7M [00:02<00:00, 6.87MB/s]
Downloading data: 100%|███████████████████████████████████████████████████████████████████████████| 723k/723k [00:00<00:00, 2.07MB/s]
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00, 1.61s/it]
Extracting data files: 100%|█████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2376.38it/s]
Generating train split: 100%|███████████████████████████████████████████████████████| 81037/81037 [00:00<00:00, 394031.81 examples/s]
Generating validation split: 100%|████████████████████████████████████████████████████| 3001/3001 [00:00<00:00, 302632.87 examples/s]
6%|█████ | 348/6264 [00:00<00:09, 600.71it/s]
Traceback (most recent call last):
File "/madladLLaMa2lang/LLaMa2lang/create_thread_prompts.py", line 100, in <module>
main()
File "/madladLLaMa2lang/LLaMa2lang/create_thread_prompts.py", line 90, in main
dataset[fold] = dataset[fold].rename_column('0', 'text')
AttributeError: 'list' object has no attribute 'rename_column'
root@c7a6c8a800b6:/madladLLaMa2lang/LLaMa2lang#
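The `AttributeError` in the traceback shows that `dataset[fold]` arrived as a plain Python list, while `rename_column` only exists on a `datasets.Dataset`. A hypothetical defensive guard (not the actual upstream fix) would fail early with a clearer message:

```python
def rename_fold(fold, old="0", new="text"):
    # rename_column exists on datasets.Dataset but not on a plain list,
    # which is exactly what the traceback above shows. Guard defensively.
    # (Hypothetical helper for illustration, not the code in the repo.)
    if not hasattr(fold, "rename_column"):
        raise TypeError(
            f"expected a datasets.Dataset, got {type(fold).__name__}"
        )
    return fold.rename_column(old, new)
```

With a list this raises a descriptive `TypeError` instead of the opaque `AttributeError` deep inside the loop.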
Thanks, this helped me debug; it's fixed now in commit 3fe474f.
Give it another go.
Now it works. https://huggingface.co/datasets/chrystians/oasst1_pl_2_threads
Great, we finally got there! Feel free to make a PR with this one if the quality is better. Or do you plan on training a model too?
Yes, I have to justify the expensive GPU. But I have now checked the threads dataset, and I think something is broken after that fix.
Practically all of the new one-thread chat prompts are empty, containing only the initial prompt (instruction_prompt):
"Jestes polskim chatbotem ktory odpowiada tylko po polsku" ("You are a Polish chatbot that answers only in Polish")
https://huggingface.co/datasets/chrystians/oasst1_pl_2_threads
Compared to the previous one:
https://huggingface.co/datasets/chrystians/Jestes
OK, let me know if you want me to do the training instead, in case you want to avoid the expense.
I will try to recreate your Polish dataset this week to see what I broke. For the past two weeks I have been coding mostly from a phone and in Colab because I was on vacation, so I hope to resolve it soon.
Relax, I can gladly do the training; it was a joke.
What I wanted to say is that, in my opinion, something is broken with the translations/dataset, because there are a lot of empty entries containing only the prompt. I don't quite know whether the issue is with the dataset or with the translations.
https://huggingface.co/datasets/chrystians/oasst1_pl_2_threads/viewer/default/train?p=77
From page 77 to page 94 there are practically only prompts.
Elsewhere there are proper conversations, for example at page 11, so the translation does work:
https://huggingface.co/datasets/chrystians/oasst1_pl_2_threads/viewer/default/train?p=11
Just an idea: there are also other datasets, like ShareGPT,
or this one: https://huggingface.co/datasets/lmsys/lmsys-chat-1m
My idea is a slightly different one: maybe write something to extract the conversations in a particular language from that large dataset (lmsys-chat-1m) and then train the model on those.
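That extraction idea could start as simply as filtering on a per-row language tag (a sketch only; it assumes each row exposes a field named `language` with values like "Polish", as described on the lmsys-chat-1m dataset card):

```python
def conversations_in_language(rows, language="Polish"):
    # Keep only the conversations whose detected language matches the target.
    return [row for row in rows if row.get("language") == language]

sample = [
    {"language": "Polish",  "conversation": [{"role": "user", "content": "Cześć"}]},
    {"language": "English", "conversation": [{"role": "user", "content": "Hi"}]},
]
assert len(conversations_in_language(sample)) == 1
```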
So, as it turns out, the default for madlad is to generate only 20 tokens for every translation, resulting in translations that are too short (not empty though; I wasn't able to find out why those occur). I have changed this to a maximum of 2k tokens, but that slows it down significantly (obviously): it is now an order of magnitude slower than Helsinki NLP. To remedy this, I added an option to load the model in 4 bits instead of 8 bits so you can increase the batch size, but I'm afraid it will still be a lot slower.
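The effect of the cap can be shown with a toy stand-in for the decoder (a sketch only; in the real script the limit is the generation-length argument passed to the model, counted in model tokens, not words):

```python
def generate_translation(source_tokens, max_new_tokens=20):
    # Toy stand-in for a decoder that stops after max_new_tokens:
    # whatever lies beyond the cap is simply never generated.
    # (Pretend the "translation" echoes the source token-for-token.)
    return source_tokens[:max_new_tokens]

sentence = [f"word{i}" for i in range(50)]  # a 50-token source sentence
assert len(generate_translation(sentence)) == 20            # truncated output
assert len(generate_translation(sentence, 2048)) == 50      # full translation
```

Raising the cap restores full-length output at the cost of more decoding steps per example, which is where the slowdown comes from.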
As for swapping out datasets: we plan to fully support that, but the creation of thread prompts is a bit involved in that case, so I am still working on it (translation already supports swapping in different datasets).
Your translated dataset at https://huggingface.co/datasets/chrystians/oasst1_pl_2_2 already contains a lot of empty texts, but I'm not sure why that happened. I have now added a check in the script itself that verifies whether translations are empty and throws an exception if they are.
EDIT: I hit the exception. It seems that madlad fails on specific characters; it died on some JSON/code inside a prompt.
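The empty-translation check described above can be sketched like this (hypothetical helper name; the real check lives inside the translation script):

```python
def assert_nonempty_translations(sources, translations):
    # Raise as soon as the model returns an empty translation for a
    # non-empty source, instead of silently writing empty rows.
    for src, tgt in zip(sources, translations):
        if src.strip() and not tgt.strip():
            raise ValueError(f"Empty translation for source: {src[:60]!r}")

assert_nonempty_translations(["Hello there"], ["Cześć"])  # non-empty: passes
```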
Madlad works now; it was quite broken before. Beware that it is a lot slower, though.
Let me know if it works for you too; then we can close this issue.
> Madlad works now, it was quite broken so far. Beware that it is a lot slower though.

I have run it, but I don't know what you mean by "now". I ran it with this commit:
0326e3f (Sun Jan 7 16:23:06 2024 +0100)
Here are the datasets:
https://huggingface.co/datasets/chrystians/oasst1_pl_3
https://huggingface.co/datasets/chrystians/oasst1_pl_3_threads
To be honest, I don't see much difference in the translations or their quality.
They also include empty threads; maybe the issue is in the dataset itself, not the program per se. And as for the quality of the translations, I am too unfamiliar to be of help there.
Should I run it again?
Yes, madlad crashes silently if there are newlines in the input text. I fixed that yesterday (in 25d75f2) by replacing newlines with spaces. If you want madlad-based translations, you have to rerun the translate_oasst.py script entirely, but I ran it briefly on Colab for PL and ZH and the estimated total time is now 150-200 hours (vs. 10-15 for OPUS)...
It might be worth it, but we will also be adding quite a few other translation models, so perhaps it's better to wait for those.
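The workaround can be sketched as follows (a minimal sketch of the idea; the exact code is in commit 25d75f2):

```python
def sanitize_for_madlad(texts):
    # madlad crashes silently on newlines in the input, so replace each
    # newline with a single space before handing the batch to the model.
    return [t.replace("\n", " ") for t in texts]

batch = ["First line\nSecond line", "No newline here"]
assert sanitize_for_madlad(batch) == ["First line Second line", "No newline here"]
```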