
Comments (12)

ErikTromp commented on June 12, 2024

Are you on the latest main branch? If so, can you either:

  • Upload your combined checkpoints to HF (rerun the script but now with the target folder being your HF dataset name; see the example command below),
    or
  • Add a screenshot/paste the folder structure of your translate and your combine outputs.
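
For reference, the invocation with a Hub dataset as the target looks roughly like this (the checkpoint folder and dataset name are placeholders):

```
python3 combine_checkpoints.py <checkpoint_folder> <your_hf_username>/<your_dataset_name>
```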

ChrystianSchutz commented on June 12, 2024

Yes, I am on the main branch.

  1. I have rerun combining the checkpoints and I still have the same issue. https://huggingface.co/datasets/chrystians/oasst1_pl_2_2
  2. I am not quite sure what you mean by a screenshot:
    (screenshots attached)
root@c7a6c8a800b6:/madladLLaMa2lang/LLaMa2lang#  python3 combine_checkpoints.py checkpointMadlad oasst1_pl_2_2
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████| 82/82 [00:00<00:00, 228.16ba/s]
Uploading the dataset shards: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.86s/it]
Creating parquet from Arrow format: 100%|█████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 270.26ba/s]
Uploading the dataset shards: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.28it/s]
root@c7a6c8a800b6:/madladLLaMa2lang/LLaMa2lang#  python3 create_thread_prompts.py chrystians/oasst1_pl_2_2 "Jestes polskim chatbotem ktory odpowiada tylko po polsku" oasst1_pl_2_2_threads
Downloading readme: 100%|███████████████████████████████████████████████████████████████████████| 1.52k/1.52k [00:00<00:00, 11.5MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████| 19.7M/19.7M [00:02<00:00, 6.87MB/s]
Downloading data: 100%|███████████████████████████████████████████████████████████████████████████| 723k/723k [00:00<00:00, 2.07MB/s]
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.61s/it]
Extracting data files: 100%|█████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2376.38it/s]
Generating train split: 100%|███████████████████████████████████████████████████████| 81037/81037 [00:00<00:00, 394031.81 examples/s]
Generating validation split: 100%|████████████████████████████████████████████████████| 3001/3001 [00:00<00:00, 302632.87 examples/s]
  6%|█████                                                                                       | 348/6264 [00:00<00:09, 600.71it/s]
Traceback (most recent call last):
  File "/madladLLaMa2lang/LLaMa2lang/create_thread_prompts.py", line 100, in <module>
    main()
  File "/madladLLaMa2lang/LLaMa2lang/create_thread_prompts.py", line 90, in main
    dataset[fold] = dataset[fold].rename_column('0', 'text')
AttributeError: 'list' object has no attribute 'rename_column'
root@c7a6c8a800b6:/madladLLaMa2lang/LLaMa2lang#

ErikTromp commented on June 12, 2024

Thanks, this helped me debug; it is fixed now in commit 3fe474f.

Give it another go
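
For context, the kind of fix this involves looks something like the sketch below (an assumption, not necessarily the exact change in 3fe474f): if a split has been materialized as a plain Python list, it has no rename_column, so it needs to be rebuilt as a datasets.Dataset first.

```python
from datasets import Dataset

# Assumption: `dataset` maps fold names to either a datasets.Dataset or,
# after the thread-building step, a plain Python list of prompt strings.
for fold in dataset:
    if isinstance(dataset[fold], list):
        # Rebuild the split as a Dataset with a single column named '0'.
        dataset[fold] = Dataset.from_dict({'0': dataset[fold]})
    dataset[fold] = dataset[fold].rename_column('0', 'text')
```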

ChrystianSchutz commented on June 12, 2024

Now it works. https://huggingface.co/datasets/chrystians/oasst1_pl_2_threads

ErikTromp commented on June 12, 2024

Great, we finally got there! Feel free to make a PR with this one if the quality is better, or do you plan on training a model too?

ChrystianSchutz commented on June 12, 2024

Yes, I have to justify the expensive GPU. But now that I have checked the threads dataset, I think something is broken after that fix.
Practically all of the new thread chat prompts are empty, containing only the initial prompt (instruction_prompt):
"Jestes polskim chatbotem ktory odpowiada tylko po polsku" ("You are a Polish chatbot that answers only in Polish")
https://huggingface.co/datasets/chrystians/oasst1_pl_2_threads

Compared to the previous one:
https://huggingface.co/datasets/chrystians/Jestes

ErikTromp commented on June 12, 2024

OK, let me know if you want me to do the training instead, if you want to avoid the expense.

I will try to recreate your Polish dataset this week to see what I wrecked. For the past 2 weeks I have been coding mostly from a phone and in Colab because I was on vacation, so I hope to resolve it soon.

ChrystianSchutz commented on June 12, 2024

Relax, it was a joke. I can gladly do the training.
What I wanted to say is only that, in my opinion, something is broken with the translations/dataset, because there are a lot of empty ones that include only the prompt. I don't quite know why, or whether the issue is with the dataset or the translations.

https://huggingface.co/datasets/chrystians/oasst1_pl_2_threads/viewer/default/train?p=77
From page 77 to 94, for example, there are practically only prompts.
There are some conversations at page 11, so the translation does work:
https://huggingface.co/datasets/chrystians/oasst1_pl_2_threads/viewer/default/train?p=11

Just an idea: there are also other datasets, like ShareGPT,
or this one: https://huggingface.co/datasets/lmsys/lmsys-chat-1m
My idea is a little bit different: maybe write something to extract the conversations in a particular language from that large dataset (lmsys-chat-1m), and then train the model on those.
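
A rough sketch of that idea, assuming the dataset is gated (you need to accept its terms and be logged in to the Hub) and exposes a per-conversation "language" column with values like "Polish":

```python
from datasets import load_dataset

# Load lmsys-chat-1m and keep only conversations detected as Polish.
ds = load_dataset("lmsys/lmsys-chat-1m", split="train")
polish = ds.filter(lambda row: row["language"] == "Polish")

# Push the filtered subset to a Hub dataset of your own (placeholder name).
polish.push_to_hub("your_username/lmsys-chat-1m-pl")
```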

ErikTromp commented on June 12, 2024

So as it turns out, the default for madlad is to generate only 20 tokens for every translation, resulting in translations that are too short (not empty though; I wasn't able to find out why those occur). I have now changed this to a max of 2k tokens, but that significantly slows it down (obviously) - an order of magnitude slower than Helsinki NLP now. To remedy this, I added an option to load in 4 bits instead of 8 bits so you can increase the batch size, but I am afraid it'll still be a lot slower.
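
For illustration, loading madlad in 4 bits and raising the generation budget looks roughly like this with the Hugging Face transformers/bitsandbytes APIs (a sketch; the checkpoint name, token limit, and function names are assumptions, not necessarily what the script uses):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

model_name = "google/madlad400-3b-mt"  # assumed madlad checkpoint
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, quantization_config=quant_config, device_map="auto"
)

def translate(texts, target_lang="pl", max_new_tokens=2000):
    # madlad expects the target language as a prefix token, e.g. "<2pl> ..."
    inputs = tokenizer(
        [f"<2{target_lang}> {t}" for t in texts],
        return_tensors="pt", padding=True, truncation=True,
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```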

As for swapping out datasets - we plan to fully support that, but the creation of thread prompts is a bit involved in that case, so we are still working on it (translation already supports swapping in different datasets).

Your translated dataset at https://huggingface.co/datasets/chrystians/oasst1_pl_2_2 already contains a lot of empty texts, but I am not sure why that happened. I have now added a check in the script itself to verify whether translations are empty, after which it throws an exception.
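
A minimal sketch of that kind of guard (assumed, not the exact code added to the script):

```python
def check_translation(original: str, translated: str) -> str:
    # Fail fast instead of silently writing an empty translation to the dataset.
    if original.strip() and not translated.strip():
        raise RuntimeError(f"Empty translation for input: {original[:80]!r}")
    return translated
```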

EDIT: I hit the exception - seems that madlad fails on specific characters. It died on some JSON/code inside a prompt.

ErikTromp commented on June 12, 2024

Madlad works now, it was quite broken so far. Beware that it is a lot slower though.

Let me know if it works for you too, then we can close this issue.

ChrystianSchutz commented on June 12, 2024

> Madlad works now, it was quite broken so far. Beware that it is a lot slower though.

I have run it, but I don't know what you mean by "now". I ran it with this commit:

Date: Sun Jan 7 16:23:06 2024 +0100
0326e3f

Here are datasets:
https://huggingface.co/datasets/chrystians/oasst1_pl_3
https://huggingface.co/datasets/chrystians/oasst1_pl_3_threads

To be honest, I don't see much difference in the translations or their quality.
They also include empty threads; maybe this is an issue that is in the dataset itself, not in the program per se. And as for the quality of the translation, I am too unfamiliar with it to help with that.

Should I run it again?

ErikTromp commented on June 12, 2024

Yes, madlad crashes silently if there are newlines in the input text. I fixed that yesterday (in 25d75f2) by replacing newlines with spaces. If you want madlad-based translations, you have to rerun the translate_oasst.py script entirely, but I ran it briefly on Colab for PL and ZH and the estimated total time is now 150-200 hours (vs 10-15 for opus)...
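
The workaround amounts to something like this (a sketch; the actual change in 25d75f2 may differ in its details):

```python
def preprocess_for_madlad(text: str) -> str:
    # madlad fails silently on newlines, so flatten them to spaces
    # before tokenization.
    return text.replace("\r", " ").replace("\n", " ")
```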

Might be worth it, but we will also be adding quite a few other translation models, so perhaps it is better to wait for those.
