Comments (12)
Are you on the latest main branch? If so, can you either:
- Upload your combined checkpoints to HF (rerun the script, but now with the target folder being your HF dataset name), or
- Add a screenshot/paste of the folder structure of your translate and combine outputs
from llama2lang.
Yes, I am on the main branch.
- I have rerun combining checkpoints and I still have the same issue. https://huggingface.co/datasets/chrystians/oasst1_pl_2_2
- I am not quite sure what you mean by a screenshot:
root@c7a6c8a800b6:/madladLLaMa2lang/LLaMa2lang# python3 combine_checkpoints.py checkpointMadlad oasst1_pl_2_2
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████| 82/82 [00:00<00:00, 228.16ba/s]
Uploading the dataset shards: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.86s/it]
Creating parquet from Arrow format: 100%|█████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 270.26ba/s]
Uploading the dataset shards: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.28it/s]
root@c7a6c8a800b6:/madladLLaMa2lang/LLaMa2lang# python3 create_thread_prompts.py chrystians/oasst1_pl_2_2 "Jestes polskim chatbotem ktory odpowiada tylko po polsku" oasst1_pl_2_2_threads
Downloading readme: 100%|███████████████████████████████████████████████████████████████████████| 1.52k/1.52k [00:00<00:00, 11.5MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████| 19.7M/19.7M [00:02<00:00, 6.87MB/s]
Downloading data: 100%|███████████████████████████████████████████████████████████████████████████| 723k/723k [00:00<00:00, 2.07MB/s]
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00, 1.61s/it]
Extracting data files: 100%|█████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2376.38it/s]
Generating train split: 100%|███████████████████████████████████████████████████████| 81037/81037 [00:00<00:00, 394031.81 examples/s]
Generating validation split: 100%|████████████████████████████████████████████████████| 3001/3001 [00:00<00:00, 302632.87 examples/s]
6%|█████ | 348/6264 [00:00<00:09, 600.71it/s]
Traceback (most recent call last):
File "/madladLLaMa2lang/LLaMa2lang/create_thread_prompts.py", line 100, in <module>
main()
File "/madladLLaMa2lang/LLaMa2lang/create_thread_prompts.py", line 90, in main
dataset[fold] = dataset[fold].rename_column('0', 'text')
AttributeError: 'list' object has no attribute 'rename_column'
root@c7a6c8a800b6:/madladLLaMa2lang/LLaMa2lang#
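The `AttributeError` in the traceback shows that `dataset[fold]` arrived as a plain Python list, while `rename_column` only exists on a `datasets.Dataset`. A hypothetical defensive guard (not the actual upstream fix) would fail early with a clearer message:

```python
def rename_fold(fold, old="0", new="text"):
    # rename_column exists on datasets.Dataset but not on a plain list,
    # which is exactly what the traceback above shows. Guard defensively.
    # (Hypothetical helper for illustration, not the code in the repo.)
    if not hasattr(fold, "rename_column"):
        raise TypeError(
            f"expected a datasets.Dataset, got {type(fold).__name__}"
        )
    return fold.rename_column(old, new)
```

With a list this raises a descriptive `TypeError` instead of the opaque `AttributeError` deep inside the loop.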
Thanks, this helped me debug; it's fixed now in commit 3fe474f.
Give it another go.
Now it works. https://huggingface.co/datasets/chrystians/oasst1_pl_2_threads
Great, we finally got there! Feel free to make a PR with this one if the quality is better. Or do you plan on training a model too?
Yes, I have to justify the expensive GPU. But I have now checked the threads dataset, and I think something is broken after that fix.
Practically all of the new one-thread chat prompts are empty, containing only the initial prompt (instruction_prompt):
"Jestes polskim chatbotem ktory odpowiada tylko po polsku" ("You are a Polish chatbot that answers only in Polish")
https://huggingface.co/datasets/chrystians/oasst1_pl_2_threads
Compared to the previous one:
https://huggingface.co/datasets/chrystians/Jestes
OK, let me know if you want me to do the training instead, in case you want to avoid the expense.
I will try to recreate your Polish dataset this week to see what I broke. For the past two weeks I have been coding mostly from a phone and in Colab because I was on vacation, so I hope to resolve it soon.
Relax, I can gladly do the training; it was a joke.
What I wanted to say is that, in my opinion, something is broken with the translations/dataset, because there are a lot of empty entries containing only the prompt. I don't quite know whether the issue is with the dataset or with the translations.
https://huggingface.co/datasets/chrystians/oasst1_pl_2_threads/viewer/default/train?p=77
From page 77 to page 94 there are practically only prompts.
Elsewhere there are proper conversations, for example at page 11, so the translation does work:
https://huggingface.co/datasets/chrystians/oasst1_pl_2_threads/viewer/default/train?p=11
Just an idea: there are also other datasets, like ShareGPT,
or this one: https://huggingface.co/datasets/lmsys/lmsys-chat-1m
My idea is a slightly different one: maybe write something to extract the conversations in a particular language from that large dataset (lmsys-chat-1m) and then train the model on those.
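That extraction idea could start as simply as filtering on a per-row language tag (a sketch only; it assumes each row exposes a field named `language` with values like "Polish", as described on the lmsys-chat-1m dataset card):

```python
def conversations_in_language(rows, language="Polish"):
    # Keep only the conversations whose detected language matches the target.
    return [row for row in rows if row.get("language") == language]

sample = [
    {"language": "Polish",  "conversation": [{"role": "user", "content": "Cześć"}]},
    {"language": "English", "conversation": [{"role": "user", "content": "Hi"}]},
]
assert len(conversations_in_language(sample)) == 1
```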
So, as it turns out, the default for madlad is to generate only 20 tokens for every translation, resulting in translations that are too short (not empty though; I wasn't able to find out why those occur). I have changed this to a maximum of 2k tokens, but that slows it down significantly (obviously): it is now an order of magnitude slower than Helsinki NLP. To remedy this, I added an option to load the model in 4 bits instead of 8 bits so you can increase the batch size, but I'm afraid it will still be a lot slower.
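The effect of the cap can be shown with a toy stand-in for the decoder (a sketch only; in the real script the limit is the generation-length argument passed to the model, counted in model tokens, not words):

```python
def generate_translation(source_tokens, max_new_tokens=20):
    # Toy stand-in for a decoder that stops after max_new_tokens:
    # whatever lies beyond the cap is simply never generated.
    # (Pretend the "translation" echoes the source token-for-token.)
    return source_tokens[:max_new_tokens]

sentence = [f"word{i}" for i in range(50)]  # a 50-token source sentence
assert len(generate_translation(sentence)) == 20            # truncated output
assert len(generate_translation(sentence, 2048)) == 50      # full translation
```

Raising the cap restores full-length output at the cost of more decoding steps per example, which is where the slowdown comes from.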
As for swapping out datasets: we plan to fully support that, but the creation of thread prompts is a bit involved in that case, so I am still working on it (translation already supports swapping in different datasets).
Your translated dataset at https://huggingface.co/datasets/chrystians/oasst1_pl_2_2 already contains a lot of empty texts, but I'm not sure why that happened. I have now added a check in the script itself that verifies whether translations are empty and throws an exception if they are.
EDIT: I hit the exception. It seems that madlad fails on specific characters; it died on some JSON/code inside a prompt.
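The empty-translation check described above can be sketched like this (hypothetical helper name; the real check lives inside the translation script):

```python
def assert_nonempty_translations(sources, translations):
    # Raise as soon as the model returns an empty translation for a
    # non-empty source, instead of silently writing empty rows.
    for src, tgt in zip(sources, translations):
        if src.strip() and not tgt.strip():
            raise ValueError(f"Empty translation for source: {src[:60]!r}")

assert_nonempty_translations(["Hello there"], ["Cześć"])  # non-empty: passes
```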
Madlad works now; it was quite broken before. Beware that it is a lot slower, though.
Let me know if it works for you too; then we can close this issue.
> Madlad works now, it was quite broken so far. Beware that it is a lot slower though.

I have run it, but I don't know what you mean by "now". I ran it with this commit:
0326e3f (Sun Jan 7 16:23:06 2024 +0100)
Here are the datasets:
https://huggingface.co/datasets/chrystians/oasst1_pl_3
https://huggingface.co/datasets/chrystians/oasst1_pl_3_threads
To be honest, I don't see much difference in the translations or their quality.
They also include empty threads; maybe the issue is in the dataset itself, not the program per se. And as for the quality of the translations, I am too unfamiliar to be of help there.
Should I run it again?
Yes, madlad crashes silently if there are newlines in the input text. I fixed that yesterday (in 25d75f2) by replacing newlines with spaces. If you want madlad-based translations, you have to rerun the translate_oasst.py script entirely, but I ran it briefly on Colab for PL and ZH and the estimated total time is now 150-200 hours (vs. 10-15 for OPUS)...
It might be worth it, but we will also be adding quite a few other translation models, so perhaps it's better to wait for those.
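The workaround can be sketched as follows (a minimal sketch of the idea; the exact code is in commit 25d75f2):

```python
def sanitize_for_madlad(texts):
    # madlad crashes silently on newlines in the input, so replace each
    # newline with a single space before handing the batch to the model.
    return [t.replace("\n", " ") for t in texts]

batch = ["First line\nSecond line", "No newline here"]
assert sanitize_for_madlad(batch) == ["First line Second line", "No newline here"]
```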