understandlingbv / llama2lang

Convenience scripts to finetune (chat-)LLaMa3 and other models for any language

License: Apache License 2.0

Python 100.00%
ai genai huggingface llama2 llama3 llm mistral

llama2lang's Introduction

🚀 Now with LLaMa3 support 🚀

LLaMa2lang v0.6

This repository contains convenience scripts to finetune LLaMa3-8B (or any other foundation model) for chat in any language other than English. The rationale behind this is that LLaMa3 is trained primarily on English data and, while it works to some extent for other languages, its performance is poor compared to English.

TL;DR

pip install -r requirements.txt

# Translate OASST1 to target language
python translate.py m2m target_lang checkpoint_location

# Combine the checkpoint files into a dataset
python combine_checkpoints.py input_folder output_location

# Finetune
python finetune.py tuned_model dataset_name instruction_prompt

# Optionally finetune with DPO (RLHF)
python finetune_dpo.py tuned_model dataset_name instruction_prompt

# Run inference
python run_inference.py model_name instruction_prompt input

What it does

The process we follow to tune a foundation model such as LLaMa3 for a specific language is as follows:

  1. Load a dataset that contains Q&A/instruction pairs.
  2. Translate the entire dataset to a given target language.
  3. Load the translated dataset and extract threads by recursively selecting each prompt together with only its highest-ranked answer, continuing through subsequent prompts, and so on.
  4. Turn the threads into prompts following a given (customizable) template - see the example right after this list.
  5. Use QLoRA and PEFT to finetune a base foundation model's instruct variant on this dataset.
    • Use QLoRA and PEFT to finetune with DPO to extend the model's capabilities even further and teach it to prefer chosen answers over rejected ones. Note that your base dataset must contain this information.
    • As an alternative to DPO, you can achieve the same with ORPO.
  6. Run inference using the newly trained model.
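
For illustration, a single-turn thread rendered with a LLaMa2-style template (the same format shown under Empirical performance below) would look as follows. The placeholders in braces are hypothetical; the exact layout depends on the thread template and the chat format of your base model:

<s>[INST] <<SYS>> {instruction_prompt} <</SYS>> {translated user prompt} [/INST] {highest-ranked translated answer}</s>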

Supported paradigms

Translation

  • OPUS
  • M2M
  • MADLAD
  • mBART
  • NLLB
  • Seamless (Large only)
  • Tower Instruct (Can correct spelling mistakes)

Base datasets

The following have been tested, but more may well work:

  • OASST1
  • OASST2

Supported foundation models

  • LLaMa3
  • LLaMa2
  • Mistral
  • (Unofficial) Mixtral 8x7B

Roadmap

  • [L2L-6] Investigate interoperability with other libraries (Axolotl, llamacpp, unsloth)
  • [L2L-7] Allow for different quantizations next to QLoRA (GGUF, GPTQ, AWQ)
  • [L2L-10] Support extending the tokenizer and vocabulary

Cost and runtime

The above process can be fully run on a free Google Colab T4 GPU. The last step, however, can only be run successfully with short enough context windows and a batch size of at most 2. In addition, the translation in step 2 takes about 36 hours in total for any given language, so it should be run in multiple steps if you want to stick with a free Google Colab GPU.
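
Because translate.py writes a checkpoint file every checkpoint_n records, splitting the work over multiple sessions is straightforward: interrupt the script and rerun the same command later, and it continues from the checkpoint files already in checkpoint_location (the FAQ below relies on the same mechanism when switching translation models mid-run). A sketch, reusing the Dutch example from the Usage section:

# First session, interrupted after a few hours
python translate.py m2m nl ./output_nl --checkpoint_n 400 --batch_size 20
# Later session: the same command picks up from the checkpoints in ./output_nl
python translate.py m2m nl ./output_nl --checkpoint_n 400 --batch_size 20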

Our fine-tunes for step 5 were trained on an A40 on vast.ai, cost us less than a dollar per model, and completed in about 1.5 hours each.

Usage

  1. Make sure pytorch is installed and working for your environment (use of CUDA preferable): https://pytorch.org/get-started/locally/
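
A quick, optional sanity check to verify that your pytorch install can see the GPU:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"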

  2. Clone the repo and install the requirements.

pip install -r requirements.txt

  3. Translate your base dataset to your designated target language.
usage: translate.py [-h] [--quant8] [--quant4] [--base_dataset BASE_DATASET] [--base_dataset_text_field BASE_DATASET_TEXT_FIELD] [--base_dataset_lang_field BASE_DATASET_LANG_FIELD]
                    [--checkpoint_n CHECKPOINT_N] [--batch_size BATCH_SIZE] [--max_length MAX_LENGTH] [--cpu] [--source_lang SOURCE_LANG]
                    {opus,mbart,madlad,m2m,nllb,seamless_m4t_v2,towerinstruct} ... target_lang checkpoint_location

Translate an instruct/RLHF dataset to a given target language using a variety of translation models

positional arguments:
  {opus,mbart,madlad,m2m,nllb,seamless_m4t_v2,towerinstruct}
                        The model/architecture used for translation.
    opus                Translate the dataset using HelsinkiNLP OPUS models.
    mbart               Translate the dataset using mBART.
    madlad              Translate the dataset using Google's MADLAD models.
    m2m                 Translate the dataset using Facebook's M2M models.
    nllb                Translate the dataset using Facebook's NLLB models.
    seamless_m4t_v2     Translate the dataset using Facebook's SeamlessM4T-v2 multimodal models.
    towerinstruct       Translate the dataset using Unbabel's Tower Instruct. Make sure your target language is in the 10 languages supported by the model.
  target_lang           The target language. Make sure you use language codes defined by the translation model you are using.
  checkpoint_location   The folder the script will write (JSONized) checkpoint files to. Folder will be created if it doesn't exist.

options:
  -h, --help            show this help message and exit
  --quant8              Optional flag to load the translation model in 8 bits. Decreases memory usage, increases running time
  --quant4              Optional flag to load the translation model in 4 bits. Decreases memory usage, increases running time
  --base_dataset BASE_DATASET
                        The base dataset to translate, defaults to OpenAssistant/oasst1
  --base_dataset_text_field BASE_DATASET_TEXT_FIELD
                        The base dataset's column name containing the actual text to translate. Defaults to text
  --base_dataset_lang_field BASE_DATASET_LANG_FIELD
                        The base dataset's column name containing the language the source text was written in. Defaults to lang
  --checkpoint_n CHECKPOINT_N
                        An integer representing how often a checkpoint file will be written out. To start off, 400 is a reasonable number.
  --batch_size BATCH_SIZE
                        The batch size for a single translation model. Adjust based on your GPU capacity. Default is 10.
  --max_length MAX_LENGTH
                        How many tokens to generate at most. More tokens might be more accurate for lengthy input but create a risk of running out of memory. Default is unlimited.
  --cpu                 Forces usage of CPU. By default GPU is taken if available.
  --source_lang SOURCE_LANG
                        Source language to select from OASST based on lang property of dataset

If you want more parameters for the different translation models, run:

python translate.py [MODEL] -h

Be sure to specify model-specific parameters (right after the model name) before the common parameters listed above. Example calls:

# Using M2M with 4-bit quantization and a different batch size to translate to Dutch
python translate.py m2m nl ./output_nl --quant4 --batch_size 20

# Using madlad 7B with 8bit quantization for German with different max_length
python translate.py madlad --model_size 7b de ./output_de --quant8 --batch_size 5 --max_length 512

# Be sure to use target language codes that the model you use understands
python translate.py mbart xh_ZA ./output_xhosa
python translate.py nllb nld_Latn ./output_nl

  4. Combine the JSON arrays from the checkpoint files into a Huggingface Dataset and then either write it to disk or publish it to Huggingface. The script will try to write to disk by default and falls back to publishing to Huggingface if the output folder does not exist on disk. For publishing to Huggingface, make sure you have your HF_TOKEN environment variable set up as per the documentation.
usage: combine_checkpoints.py [-h] input_folder output_location

Combine checkpoint files from translation.

positional arguments:
  input_folder     The checkpoint folder used in translation, with the target language appended.
                   Example: "./output_nl".
  output_location  Where to write the Huggingface Dataset. Can be a disk location or a Huggingface
                   Dataset repository.

options:
  -h, --help       show this help message and exit
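
Example call (the Huggingface dataset name is illustrative; pass a local folder path instead if you only want the dataset written to disk):

python combine_checkpoints.py ./output_nl UnderstandLing/oasst1_nl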

  5. Turn the translated messages into chat/instruct/prompt threads and finetune a foundation model's instruct variant using (Q)LoRA and PEFT.
usage: finetune.py [-h] [--base_model BASE_MODEL] [--base_dataset_text_field BASE_DATASET_TEXT_FIELD] [--base_dataset_rank_field BASE_DATASET_RANK_FIELD] [--base_dataset_id_field BASE_DATASET_ID_FIELD] [--base_dataset_parent_field BASE_DATASET_PARENT_FIELD]
                   [--base_dataset_role_field BASE_DATASET_ROLE_FIELD] [--quant8] [--noquant] [--max_seq_length MAX_SEQ_LENGTH] [--num_train_epochs NUM_TRAIN_EPOCHS] [--batch_size BATCH_SIZE] [--threads_output_name THREADS_OUTPUT_NAME] [--thread_template THREAD_TEMPLATE]
                   [--padding PADDING]
                   tuned_model dataset_name instruction_prompt

Finetune a base instruct/chat model using (Q)LoRA and PEFT

positional arguments:
  tuned_model           The name of the resulting tuned model.
  dataset_name          The name of the dataset to use for fine-tuning. This should be the output of the combine_checkpoints script.
  instruction_prompt    An instruction message added to every prompt given to the chatbot to force it to answer in the target language. Example: "You are a generic chatbot that always answers in English."

options:
  -h, --help            show this help message and exit
  --base_model BASE_MODEL
                        The base foundation model. Default is "NousResearch/Meta-Llama-3-8B-Instruct".
  --base_dataset_text_field BASE_DATASET_TEXT_FIELD
                        The dataset's column name containing the actual text to translate. Defaults to text
  --base_dataset_rank_field BASE_DATASET_RANK_FIELD
                        The dataset's column name containing the rank of an answer given to a prompt. Defaults to rank
  --base_dataset_id_field BASE_DATASET_ID_FIELD
                        The dataset's column name containing the id of a text. Defaults to message_id
  --base_dataset_parent_field BASE_DATASET_PARENT_FIELD
                        The dataset's column name containing the parent id of a text. Defaults to parent_id
  --base_dataset_role_field BASE_DATASET_ROLE_FIELD
                        The dataset's column name containing the role of the author of the text (eg. prompter, assistant). Defaults to role
  --quant8              Finetunes the model in 8 bits. Requires more memory than the default 4 bit.
  --noquant             Do not quantize the finetuning. Requires more memory than the default 4 bit and optional 8 bit.
  --max_seq_length MAX_SEQ_LENGTH
                        The maximum sequence length to use in finetuning. Should most likely line up with your base model's default max_seq_length. Default is 512.
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Number of epochs to use. 2 is default and has been shown to work well.
  --batch_size BATCH_SIZE
                        The batch size to use in finetuning. Adjust to fit in your GPU vRAM. Default is 4
  --threads_output_name THREADS_OUTPUT_NAME
                        If specified, the threads created in this script for finetuning will also be saved to disk or HuggingFace Hub.
  --thread_template THREAD_TEMPLATE
                        A file containing the thread template to use. Default is threads/template_default.txt
  --padding PADDING     What padding to use, can be either left or right.
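
A hypothetical example call (the tuned model name and threads output name are illustrative; the dataset is one of the published translations and the Dutch instruction prompt matches the Empirical performance section):

python finetune.py llama-3-8b-instruct-nl UnderstandLing/oasst1_nl "Je bent een generieke chatbot die altijd in het Nederlands antwoord geeft." --threads_output_name oasst1_nl_threads --batch_size 2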

5.1 [OPTIONAL] Finetune using DPO (similar to RLHF)

usage: finetune_dpo.py [-h] [--base_model BASE_MODEL] [--base_dataset_text_field BASE_DATASET_TEXT_FIELD] [--base_dataset_rank_field BASE_DATASET_RANK_FIELD] [--base_dataset_id_field BASE_DATASET_ID_FIELD] [--base_dataset_parent_field BASE_DATASET_PARENT_FIELD] [--quant8]
                       [--noquant] [--max_seq_length MAX_SEQ_LENGTH] [--max_prompt_length MAX_PROMPT_LENGTH] [--num_train_epochs NUM_TRAIN_EPOCHS] [--batch_size BATCH_SIZE] [--threads_output_name THREADS_OUTPUT_NAME] [--thread_template THREAD_TEMPLATE] [--max_steps MAX_STEPS]
                       [--padding PADDING]
                       tuned_model dataset_name instruction_prompt

Finetune a base instruct/chat model using (Q)LoRA and PEFT using DPO (RLHF)

positional arguments:
  tuned_model           The name of the resulting tuned model.
  dataset_name          The name of the dataset to use for fine-tuning. This should be the output of the combine_checkpoints script.
  instruction_prompt    An instruction message added to every prompt given to the chatbot to force it to answer in the target language. Example: "You are a generic chatbot that always answers in English."

options:
  -h, --help            show this help message and exit
  --base_model BASE_MODEL
                        The base foundation model. Default is "NousResearch/Meta-Llama-3-8B-Instruct".
  --base_dataset_text_field BASE_DATASET_TEXT_FIELD
                        The dataset's column name containing the actual text to translate. Defaults to text
  --base_dataset_rank_field BASE_DATASET_RANK_FIELD
                        The dataset's column name containing the rank of an answer given to a prompt. Defaults to rank
  --base_dataset_id_field BASE_DATASET_ID_FIELD
                        The dataset's column name containing the id of a text. Defaults to message_id
  --base_dataset_parent_field BASE_DATASET_PARENT_FIELD
                        The dataset's column name containing the parent id of a text. Defaults to parent_id
  --quant8              Finetunes the model in 8 bits. Requires more memory than the default 4 bit.
  --noquant             Do not quantize the finetuning. Requires more memory than the default 4 bit and optional 8 bit.
  --max_seq_length MAX_SEQ_LENGTH
                        The maximum sequence length to use in finetuning. Should most likely line up with your base model's default max_seq_length. Default is 512.
  --max_prompt_length MAX_PROMPT_LENGTH
                        The maximum length of the prompts to use. Default is 512.
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Number of epochs to use. 2 is default and has been shown to work well.
  --batch_size BATCH_SIZE
                        The batch size to use in finetuning. Adjust to fit in your GPU vRAM. Default is 4
  --threads_output_name THREADS_OUTPUT_NAME
                        If specified, the threads created in this script for finetuning will also be saved to disk or HuggingFace Hub.
  --thread_template THREAD_TEMPLATE
                        A file containing the thread template to use. Default is threads/template_default.txt
  --max_steps MAX_STEPS
                        The maximum number of steps to run DPO for. Default is -1 which will run the data through fully for the number of epochs but this will be very time-consuming.
  --padding PADDING     What padding to use, can be either left or right.
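
A hypothetical example call for DPO (the tuned model name is illustrative; the dataset and Dutch instruction prompt are as in the finetune example above). Remember that your dataset needs rank information so chosen answers can be contrasted with rejected ones:

python finetune_dpo.py llama-3-8b-instruct-nl-dpo UnderstandLing/oasst1_nl "Je bent een generieke chatbot die altijd in het Nederlands antwoord geeft." --max_steps 1000 --batch_size 2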

5.2 [OPTIONAL] Finetune using ORPO (similar to RLHF)

usage: finetune_orpo.py [-h] [--base_model BASE_MODEL] [--base_dataset_text_field BASE_DATASET_TEXT_FIELD] [--base_dataset_rank_field BASE_DATASET_RANK_FIELD] [--base_dataset_id_field BASE_DATASET_ID_FIELD] [--base_dataset_parent_field BASE_DATASET_PARENT_FIELD] [--quant8]
                        [--noquant] [--max_seq_length MAX_SEQ_LENGTH] [--max_prompt_length MAX_PROMPT_LENGTH] [--num_train_epochs NUM_TRAIN_EPOCHS] [--batch_size BATCH_SIZE] [--threads_output_name THREADS_OUTPUT_NAME] [--thread_template THREAD_TEMPLATE] [--max_steps MAX_STEPS]
                        [--padding PADDING]
                        tuned_model dataset_name instruction_prompt

Finetune a base instruct/chat model using (Q)LoRA and PEFT using ORPO (RLHF)

positional arguments:
  tuned_model           The name of the resulting tuned model.
  dataset_name          The name of the dataset to use for fine-tuning. This should be the output of the combine_checkpoints script.
  instruction_prompt    An instruction message added to every prompt given to the chatbot to force it to answer in the target language. Example: "You are a generic chatbot that always answers in English."

options:
  -h, --help            show this help message and exit
  --base_model BASE_MODEL
                        The base foundation model. Default is "NousResearch/Meta-Llama-3-8B-Instruct".
  --base_dataset_text_field BASE_DATASET_TEXT_FIELD
                        The dataset's column name containing the actual text to translate. Defaults to text
  --base_dataset_rank_field BASE_DATASET_RANK_FIELD
                        The dataset's column name containing the rank of an answer given to a prompt. Defaults to rank
  --base_dataset_id_field BASE_DATASET_ID_FIELD
                        The dataset's column name containing the id of a text. Defaults to message_id
  --base_dataset_parent_field BASE_DATASET_PARENT_FIELD
                        The dataset's column name containing the parent id of a text. Defaults to parent_id
  --quant8              Finetunes the model in 8 bits. Requires more memory than the default 4 bit.
  --noquant             Do not quantize the finetuning. Requires more memory than the default 4 bit and optional 8 bit.
  --max_seq_length MAX_SEQ_LENGTH
                        The maximum sequence length to use in finetuning. Should most likely line up with your base model's default max_seq_length. Default is 512.
  --max_prompt_length MAX_PROMPT_LENGTH
                        The maximum length of the prompts to use. Default is 512.
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Number of epochs to use. 2 is default and has been shown to work well.
  --batch_size BATCH_SIZE
                        The batch size to use in finetuning. Adjust to fit in your GPU vRAM. Default is 4
  --threads_output_name THREADS_OUTPUT_NAME
                        If specified, the threads created in this script for finetuning will also be saved to disk or HuggingFace Hub.
  --thread_template THREAD_TEMPLATE
                        A file containing the thread template to use. Default is threads/template_default.txt
  --max_steps MAX_STEPS
                        The maximum number of steps to run ORPO for. Default is -1 which will run the data through fully for the number of epochs but this will be very time-consuming.
  --padding PADDING     What padding to use, can be either left or right.

  6. Run inference using the newly created QLoRA model.
usage: run_inference.py [-h] model_name instruction_prompt input

Script to run inference on a tuned model.

positional arguments:
  model_name          The name of the tuned model that you pushed to Huggingface in the previous
                      step.
  instruction_prompt  An instruction message added to every prompt given to the chatbot to force
                      it to answer in the target language.
  input               The actual chat input prompt. The script is only meant for testing purposes
                      and exits after answering.

options:
  -h, --help          show this help message and exit
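
Example call, reusing one of the published adapters and the Dutch instruction prompt from the Empirical performance section:

python run_inference.py UnderstandLing/Llama-3-8B-Instruct-nl "Je bent een generieke chatbot die altijd in het Nederlands antwoord geeft." "Wat is de hoofdstad van Nederland?"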

Choosing the right translation model

How do I know which translation model to choose for my target language?

We've got you covered with our benchmark.py script, which helps you make a reasonably informed guess (the dataset we use is the same one the OPUS models were trained on, so the outcomes are always favorable towards OPUS). For usage, see the help of this script below. Models are loaded in 4-bit quantization and run on a small sample of the OPUS books subset.

Be sure to use the most commonly occurring languages in your base dataset as source_language and your target translation language as target_language. For OASST1, for example, be sure to at least run en and es as source languages.

usage: benchmark.py [-h] [--cpu] [--start START] [--n N] [--max_length MAX_LENGTH] source_language target_language included_models

Benchmark all the different translation models for a specific source and target language to find out which performs best. This uses 4bit quantization to limit GPU usage. Note:
the outcomes are indicative - you cannot assume correctness of the BLEU and CHRF scores, but you can compare models against each other relatively.

positional arguments:
  source_language       The source language you want to test for. Check your dataset to see which occur most prevalent or use English as a good start.
  target_language       The target language you want to test for. This should be the language you want to apply the translate script on. Note: in benchmark, we use 2-character
                        language codes, in contrast to translate.py where you need to specify whatever your model expects.
  included_models       Comma-separated list of models to include. Allowed values are: opus, m2m_418m, m2m_1.2b, madlad_3b, madlad_7b, madlad_10b, madlad_7bbt, mbart,
                        nllb_distilled600m, nllb_1.3b, nllb_distilled1.3b, nllb_3.3b, seamless

options:
  -h, --help            show this help message and exit
  --cpu                 Forces usage of CPU. By default GPU is taken if available.
  --start START         The starting offset to include sentences from the OPUS books dataset from. Defaults to 0.
  --n N                 The number of sentences to benchmark on. Defaults to 100.
  --max_length MAX_LENGTH
                        How many tokens to generate at most. More tokens might be more accurate for lengthy input but create a risk of running out of memory. Default is 512.
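
Example call comparing a few candidate models for English-to-Dutch translation on 50 sentences (the included_models values come from the allowed list above; pick the source and target languages that fit your dataset):

python benchmark.py en nl opus,m2m_418m,nllb_distilled600m --n 50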

Datasets and models

We have already created numerous datasets and models and will continue to do so. Want to help democratize LLMs? Clone the repo, create datasets and models for other languages, then create a PR.

Translated oasst1 datasets

  • Dutch: UnderstandLing/oasst1_nl
  • Spanish: UnderstandLing/oasst1_es
  • French: UnderstandLing/oasst1_fr
  • German: UnderstandLing/oasst1_de
  • Catalan: xaviviro/oasst1_ca
  • Portuguese: UnderstandLing/oasst1_pt
  • Arabic: HeshamHaroon/oasst-arabic
  • Italian: UnderstandLing/oasst1_it
  • Russian: UnderstandLing/oasst1_ru
  • Hindi: UnderstandLing/oasst1_hi
  • Chinese: UnderstandLing/oasst1_zh
  • Polish: chrystians/oasst1_pl
  • Japanese: UnderstandLing/oasst1_jap
  • Basque: xezpeleta/oasst1_eu
  • Bengali: UnderstandLing/oasst1_bn
  • Turkish: UnderstandLing/oasst1_tr

Language-specific ❗LLaMa3-8B❗ chat model adapters

Make sure you have access to Meta's LLaMa3-8B model and set your HF_TOKEN before using these models.

  • Dutch: UnderstandLing/Llama-3-8B-Instruct-nl
  • Spanish: UnderstandLing/Llama-3-8B-Instruct-es
  • French: UnderstandLing/Llama-3-8B-Instruct-fr
  • German: UnderstandLing/Llama-3-8B-Instruct-de
  • Portuguese: UnderstandLing/Llama-3-8B-Instruct-pt
  • Italian: UnderstandLing/Llama-3-8B-Instruct-it
  • Hindi: UnderstandLing/Llama-3-8B-Instruct-hi
  • Russian: UnderstandLing/Llama-3-8B-Instruct-ru

Translated LLaMa2 thread chat prompt datasets

  • Dutch: UnderstandLing/oasst1_nl_threads
  • Spanish: UnderstandLing/oasst1_es_threads
  • French: UnderstandLing/oasst1_fr_threads
  • German: UnderstandLing/oasst1_de_threads
  • Catalan: xaviviro/oasst1_ca_threads
  • Portuguese: UnderstandLing/oasst1_pt_threads
  • Arabic: HeshamHaroon/oasst-arabic_threads
  • Italian: UnderstandLing/oasst1_it_threads
  • Russian: UnderstandLing/oasst1_ru_threads
  • Hindi: UnderstandLing/oasst1_hi_threads
  • Chinese: UnderstandLing/oasst1_zh_threads
  • Polish: chrystians/oasst1_pl_threads
  • Japanese: UnderstandLing/oasst1_jap_threads
  • Basque: xezpeleta/oasst1_eu_threads
  • Bengali: UnderstandLing/oasst1_bn_threads
  • Turkish: UnderstandLing/oasst1_tr_threads

Language-specific LLaMa2-7B chat model adapters

  • Dutch: UnderstandLing/llama-2-7b-chat-nl
  • Spanish: UnderstandLing/llama-2-7b-chat-es
  • French: UnderstandLing/llama-2-7b-chat-fr
  • German: UnderstandLing/llama-2-7b-chat-de
  • Catalan: xaviviro/llama-2-7b-chat-ca
  • Portuguese: UnderstandLing/llama-2-7b-chat-pt
  • Arabic: HeshamHaroon/llama-2-7b-chat-ar
  • Italian: UnderstandLing/llama-2-7b-chat-it
  • Russian: UnderstandLing/llama-2-7b-chat-ru
  • Hindi: UnderstandLing/llama-2-7b-chat-hi
  • Chinese: UnderstandLing/llama-2-7b-chat-zh
  • Polish: chrystians/llama-2-7b-chat-pl-polish-polski
  • Basque: xezpeleta/llama-2-7b-chat-eu
  • Bengali: UnderstandLing/llama-2-7b-chat-bn
  • Turkish: UnderstandLing/llama-2-7b-chat-tr

Language-specific Mistral chat model adapters

  • Dutch: UnderstandLing/Mistral-7B-Instruct-v0.2-nl
  • Spanish: UnderstandLing/Mistral-7B-Instruct-v0.2-es
  • German: UnderstandLing/Mistral-7B-Instruct-v0.2-de

Language-specific LLaMa2-13B chat model adapters

  • Dutch: UnderstandLing/llama-2-13b-chat-nl
  • Spanish: UnderstandLing/llama-2-13b-chat-es
  • French: UnderstandLing/llama-2-13b-chat-fr

Language-specific Mixtral-8x7B chat model adapters

  • Dutch: UnderstandLing/Mixtral-8x7B-Instruct-nl

Empirical performance

Dutch

<s>[INST] <<SYS>> Je bent een generieke chatbot die altijd in het Nederlands antwoord geeft. <</SYS>> Wat is de hoofdstad van Nederland? [/INST] Amsterdam</s>

<s>[INST] <<SYS>> Je bent een generieke chatbot die altijd in het Nederlands antwoord geeft. <</SYS>> Wat is de hoofdstad van Nederland? [/INST] Amsterdam</s><s>[INST] Hoeveel inwoners heeft die stad? [/INST] 850 duizend inwoners (2023)</s>

<s>[INST] <<SYS>> Je bent een generieke chatbot die altijd in het Nederlands antwoord geeft. <</SYS>> Wat is de hoofdstad van Nederland? [/INST] Amsterdam</s><s>[INST] Hoeveel inwoners heeft die stad? [/INST] 850 duizend inwoners (2023)</s><s>[INST] In welke provincie ligt die stad? [/INST] In de provincie Noord-Holland</s>

<s>[INST] <<SYS>> Je bent een generieke chatbot die altijd in het Nederlands antwoord geeft. <</SYS>> Wie is de minister-president van Nederland? [/INST] Mark Rutte is sinds 2010 minister-president van Nederland. Hij is meerdere keren herkozen.</s>

FAQ

  • Q: Why do you translate the full OASST1/2 dataset first? Wouldn't it be faster to only translate highest ranked threads?

  • A: While you can gain quite a lot in terms of throughput time by first creating the threads and then translating them, we provide full OASST1/2 translations to the community as we believe they can be useful on their own.

  • Q: How well do the fine-tunes perform compared to vanilla LLaMa3?

  • A: While we do not have formal benchmarks, getting LLaMa3 to consistently speak a language other than English at all is challenging if not impossible. The non-English language it does produce is often grammatically broken. Our fine-tunes do not show this behavior.

  • Q: Can I use other frameworks for fine-tuning?

  • A: Yes you can, we use Axolotl for training on multi-GPU setups.

  • Q: Can I mix different translation models?

  • A: Absolutely, we think it might even increase performance to have translation done by multiple models. You can achieve this by early-stopping a translation and continuing from the checkpoints by rerunning the translate script with a different translation model.

Funding

We are based in the Netherlands and are actively looking for funding to democratize AI and advance its applications. Contact us at [email protected] if you want to invest.

llama2lang's People

Contributors

chrystianschutz, eriktromp, h9-tect, holycowmp3, mirekvink, mwzkhalil, xezpeleta


llama2lang's Issues

Dataset chat format independent

Hey guys,
So, have any of you thought about creating a dataset for fine-tuning that's chat template independent? Like, you know, one that works across the models?
Let me give you an example: I have used UnderstandLing/oasst1_pt_threads to fine-tune Llama, and it was awesome. But I can't do the same thing with phi-2.
Every model has its own way of handling chat format templates. It would be really cool if we could have a translated dataset that I could convert to a chat template afterwards.

Entire dataset in English

Is the entire dataset available in English so that the translation is easier? Doing it for a rare (South Asian) language is difficult from other source languages, as translation is available only from English.

Madlad: unrecognized arguments: --model_size 7b

Branch
Main

Environment
Colab

RAM/vRAM
16

Script with parameters
Using the translate.py

Data layout or HF dataset
Default dataset

Problem description/Question

It looks like MADLAD returns an error. If you use the instruction as given in the readme:

# Using madlad 7B with 8bit quantization for German with different max_length
python translate.py madlad de ./output_de --quant8 --batch_size 5 --max_length 512 --model_size 7b

You get :

output
usage: translate.py [-h] [--quant8] [--quant4] [--base_dataset BASE_DATASET]
                    [--base_dataset_text_field BASE_DATASET_TEXT_FIELD]
                    [--base_dataset_lang_field BASE_DATASET_LANG_FIELD]
                    [--checkpoint_n CHECKPOINT_N] [--batch_size BATCH_SIZE]
                    [--max_length MAX_LENGTH] [--cpu]
                    {opus,mbart,madlad,m2m} ... target_lang checkpoint_location
translate.py: error: unrecognized arguments: --model_size 7b

I tried

!python /content/LLaMa2lang/translate.py madlad -h

and noticed that the parameter looks supported

usage: translate.py madlad [-h] [--model_size {3b,7b,7b-bt}]

options:
  -h, --help            show this help message and exit
  --model_size {3b,7b,7b-bt}
                        The size of the MADLAD model to use. 7b-bt is the backtrained version
                        (best to avoid unless you know what you are doing).

Issue with THREAD_TEMPLATE

Branch
Main

Environment
RAM/vRAM
Colab

Script with parameters
It's the step 5
finetune_llama.py [--base_model BASE_MODEL] tuned_model dataset_name

Problem description/Question
Seems like at step 5 there is an issue with the default TEMPLATE.
An error is thrown :

Screenshot 2024-01-22 alle 11 22 43

Even if the file exists

Screenshot 2024-01-22 alle 11 28 36

Setting it manually as an option then works.

nllb.py and madlad.py point to the incorrect HF repositories

Branch Main

Environment Google Colab
RAM/vRAM 16gb vram

Script with parameters nllb.py; madlad.py

Data layout or HF dataset facebook/nllb-200-distilled-1.3B; google/madlad400-7b-mt-bt; nllb-200-distilled-600M

Problem description/Question
Hi, I was testing the benchmark script and I found that it's pointing to some wrong Huggingface repositories. I guess the actual problem is in the files nllb.py and madlad.py.

When executing them, the script tries to get the config.json from a wrong URL pointing to the Huggingface repositories. In each pair below, the first URL is the one the software tries to use and the second is the correct one that I checked works:

https://huggingface.co/facebook/nllb-200-DISTILLED600M/resolve/main/config.json
https://huggingface.co/facebook/nllb-200-distilled-600M/resolve/main/config.json

https://huggingface.co/google/madlad400-7b-bt-mt-bt/resolve/main/config.json
https://huggingface.co/google/madlad400-7b-mt-bt/resolve/main/config.json

https://huggingface.co/facebook/nllb-200-DISTILLED1.3B/resolve/main/config.json
https://huggingface.co/facebook/nllb-200-distilled-1.3B/resolve/main/config.json

This is an example of the error that I am getting:

Input: !python benchmark.py en eu nllb_distilled1.3b

Output:

/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
2024-02-12 07:37:14.626071: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-12 07:37:14.626122: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-12 07:37:14.628076: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-12 07:37:16.078312: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[---- LLaMa2Lang ----] Starting benchmarking from en to eu for models ['nllb_distilled1.3b'] on 100 records on device cuda:0
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/facebook/nllb-200-DISTILLED1.3B/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 389, in cached_file
    resolved_file = hf_hub_download(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1374, in hf_hub_download
    raise head_call_error
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1247, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1624, in get_hf_file_metadata
    r = _request_wrapper(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 402, in _request_wrapper
    response = _request_wrapper(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 426, in _request_wrapper
    hf_raise_for_status(response)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 320, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-65c9cab0-385f3047149c5c1844aa14bd;e916b38a-b643-499b-a742-f5c9d4a5454d)

Repository Not Found for url: https://huggingface.co/facebook/nllb-200-DISTILLED1.3B/resolve/main/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/content/LLaMa2lang/benchmark.py", line 109, in <module>
    main()
  File "/content/LLaMa2lang/benchmark.py", line 86, in main
    translator = NLLBTranslator(device, True, quant4_config, False, max_length, model_size)
  File "/content/LLaMa2lang/translators/nllb.py", line 40, in __init__
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map=device, quantization_config=self.quant4_config, load_in_4bit=True)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 488, in from_pretrained
    resolved_config_file = cached_file(
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 410, in cached_file
    raise EnvironmentError(
OSError: facebook/nllb-200-DISTILLED1.3B is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`

translate oasst error japanese

Hey there, not sure if it's a configuration issue on my end, but when trying to create a Japanese dataset it gets towards the end of the run, starts loading up all my vRAM, goes until it can't fit any more, then dumps and starts again. Not sure if it's normal behavior? Should I just leave it?

running command python translate_oasst.py ja ja 500 20

Screenshot of the behavior attached:

Screenshot 2024-01-03 104221

combined dataset too small?

I ran the translate_oasst.py pl script with batch_size=40 and it takes around 1.5h on an RTX 3090. It completes without errors, but after running the combine_checkpoints.py script I only get 27k records in my dataset:

https://huggingface.co/datasets/mpazdzioch/oasst1_pl2

I guess something is not right because all the other language datasets linked from the readme have 88k rows.
Any ideas on how to debug this?
I included the output from translate_oasst.py, combine_checkpoints.py and create_thread_prompts.py in the attachment.
output.txt

Feedback on the Hindi finetuned model

I just tried out the Hindi model; the outputs were very inconsistent and illogical. Do you think pretraining it with new syllables from a custom tokenizer would make it better? Are you planning to add that to the pipeline?

English folder is empty when translating the dataset to English

Pretty much the title. I am planning to use a different in-house translation model for the OASST dataset. After translating OASST to English I just found it's empty. What should I do to get the records back? Shall I strip them from the original dataset itself?

[IDEA] Include a better way to translate dataset?

I have used the default translation from step 2, but sadly a lot of those translations, at least from English to Polish, are gibberish and absolutely terrible. https://huggingface.co/datasets/chrystians/Jestes?row=3

I want to create a thread to start a discussion about possible alternatives; an obvious one would be something like AWS Translate or DeepL. To do that we would need to write a script for API integration. I also don't know how costly it is, or whether there are any better open-source alternatives.

There are currently around 10M (9,949,085) characters in the oasst1 dataset.

translate_oasst.py: IndexError: list index out of range

Hi,

I am executing it on Google Colab, with a V100. The non-batch version didn't have that error.

I saw that you published the batch update, so I tried, but I am getting this error:

This is the input:

# Translate the OASST1 dataset into your target language
!python translate_oasst.py en "{base_dir}/test02" 1000

This is the output:

2023-12-30 22:11:02.243645: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-30 22:11:02.243731: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-30 22:11:02.245221: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-30 22:11:03.327637: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/dist-packages/trl/trainer/ppo_config.py:141: UserWarning: The `optimize_cuda_cache` arguement will be deprecated soon, please use `optimize_device_cache` instead.
  warnings.warn(
Traceback (most recent call last):
  File "/content/llama2lang/translate_oasst.py", line 25, in <module>
    batch_size = int(sys.argv[4])
IndexError: list index out of range

Error when executing create_thread_prompts.py

Hi,

First of all, thanks for your work. :)

In the create_thread_prompts.py step I am getting this error using Google Colab.

Please explain what I am doing wrong, and sorry if it's obvious, but I am not really familiar with programming.

This is the input, after I have used my HuggingFace token for writing to my repository:
!python create_thread_prompts.py "{base_dir}/Eus02" "eu: Chatbot generikoa zara, beti euskaraz erantzuten duena." "elBlacksmith/Eus02"

This is the output:

Downloading data files: 100% 1/1 [00:00<00:00, 13443.28it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 23431.87it/s]
Generating train split: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1941, in _prepare_split_single
    num_examples, num_bytes = writer.finalize()
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_writer.py", line 599, in finalize
    raise SchemaInferenceError("Please pass `features` or at least one example when writing data")
datasets.arrow_writer.SchemaInferenceError: Please pass `features` or at least one example when writing data

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/content/llama2lang/create_thread_prompts.py", line 12, in <module>
    dataset = load_dataset('arrow', data_files=os.path.join(dataset_name, '*.arrow'))
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 2152, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 948, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1043, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1805, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1950, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

BTW, I am training the Basque/Euskera (eu) language and I am not sure if translate_oasst.py is executing correctly, as it's creating several folders inside the "train" and "validation" folders, each one for a different language (see attached screenshot). Maybe it's fine, but I wanted to point it out because at https://huggingface.co/Helsinki-NLP/opus-mt-eu-en there is a model for en to eu, so it doesn't look logical to use other languages. But once again, I have little knowledge of what I am actually doing and perhaps it is supposed to work like that.

Capture

Error with create_thread_prompts.py

Hello, I've managed to translate and combine the dataset (OASST1 or OASST2, doesn't make a difference) in Finnish, but my progress stops here.

Error messages:

Downloading data files: 100%|████████████████████████████████████████| 2/2 [00:00<00:00, 7025.63it/s]
Extracting data files: 100%|████████████████████████████████████████| 2/2 [00:00<00:00, 3221.43it/s]
Generating train split: 1 examples [00:00, 262.09 examples/s]
Generating validation split: 1 examples [00:00, 470.27 examples/s]
Traceback (most recent call last):
  File "/home/mkayhko/LLaMa2lang/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'rank'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mkayhko/LLaMa2lang/create_thread_prompts.py", line 40, in <module>
    min_rank = df['rank'].min()
  File "/home/mkayhko/LLaMa2lang/lib/python3.10/site-packages/pandas/core/frame.py", line 3807, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/mkayhko/LLaMa2lang/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: 'rank'

Error when trying to run create_thread_prompts.py

I am running the following command with this dataset:
python3 create_thread_prompts.py chrystians/oasst1_pl_2 "Jestes polskim chatbotem ktory odpowiada tylko po polsku" oasst1_pl_2threads

  6%|█████                                                                                       | 348/6264 [00:00<00:09, 602.31it/s]
Traceback (most recent call last):
  File "/madladLLaMa2lang/LLaMa2lang/create_thread_prompts.py", line 100, in <module>
    main()
  File "/madladLLaMa2lang/LLaMa2lang/create_thread_prompts.py", line 90, in main
    dataset[fold] = dataset[fold].rename_column('0', 'text')
AttributeError: 'list' object has no attribute 'rename_column'

I also tried to run it with a previously successful dataset that worked with this command, and the script also failed:

 python3 create_thread_prompts.py chrystians/oasst1_pl_2 "Jestes polskim chatbotem ktory odpowiada^Cylko po polsku" oasst1_pl_2threads
root@c7a6c8a800b6:/madladLLaMa2lang/LLaMa2lang#  python3 create_thread_prompts.py chrystians/oasst1_pl "test" testUsnac
  6%|█████                                                                                       | 201/3618 [00:00<00:04, 800.76it/s]
Traceback (most recent call last):
  File "/madladLLaMa2lang/LLaMa2lang/create_thread_prompts.py", line 100, in <module>
    main()
  File "/madladLLaMa2lang/LLaMa2lang/create_thread_prompts.py", line 90, in main
    dataset[fold] = dataset[fold].rename_column('0', 'text')
AttributeError: 'list' object has no attribute 'rename_column'

problem with run_inference.py

Branch main

Environment Google Colab Pro; GPU T4
RAM/vRAM 16 GB VRAM

Script with parameters !python run_inference.py UnderstandLing/llama-2-7b-chat-es "Hazme una lista de ciudades"

Data layout or HF dataset

Problem description/Question
Hi, I am trying to run run_inference.py to test my results but I am getting problems. I tried with a finetune of yours, and the problem is the same. When running the above script I get this output after downloading all the files:

Loading checkpoint shards: 100% 2/2 [01:03<00:00, 31.82s/it]
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:389: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:394: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
Enter your input, use ':n' for a new thread or ':q' to quit:

Thanks!

Sample Example for finetuning

Can you also provide a sample Jupyter notebook implementation for the finetuning part? I'm not able to figure out the structure of the dataset to be provided for the finetuning step.

Thanks

Question: translating a monolingual HF Dataset

Hey guys.
I was just wondering if the translation step also fits datasets with only one language. I am asking because I saw that there is a default parameter which specifies the column that contains the language.
If possible, how do I handle the scenario of having no column indicating the language, since the dataset is, let's say, all English?
Big thanks.

Question : training result model won't stop generating ?

Hi, I tried to train and the resulting model seems like it can't stop generating (using llamacpp). What do you use as the stop token? Is it </s> or [/INST]? [/INST] seems to work better, but when I look at the training data, it seems to be stopped by </s>?

And at what error level do you stop training? I cannot get under 1.3 now.

Thanks

AttributeError: 'Dataset' object has no attribute 'keys'

I get that error when running the create_thread_prompts part:
Traceback (most recent call last):
  File "/Users/admin/Downloads/LLaMa2lang-main/create_thread_prompts.py", line 100, in <module>
    main()
  File "/Users/admin/Downloads/LLaMa2lang-main/create_thread_prompts.py", line 59, in main
    folds = dataset.keys()
            ^^^^^^^^^^^^
AttributeError: 'Dataset' object has no attribute 'keys'

Bad request: Only regular characters and '-', '_', '.' are accepted. '--' and '..' are forbidden. '-' and '.' cannot start or end the name. The name cannot end with ".git". Max length is 96.

python3 create_thread_prompts.py HeshamHaroon/oasst-arabic أنت روبوت محادثة عام يجيب دائمًا باللغة العربي HeshamHaroon/oasst1-ar-threads
  6%|██                                   | 9845/177210 [01:45<29:58, 93.07it/s]
  6%|██▏                                    | 517/9306 [00:00<00:13, 638.50it/s]
Traceback (most recent call last):
  File "/home/hesham/.local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
    response.raise_for_status()
  File "/home/hesham/.local/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/repos/create

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/hesham/LLaMa2lang/create_thread_prompts.py", line 72, in <module>
    dataset.push_to_hub(output_location)
  File "/home/hesham/.local/lib/python3.10/site-packages/datasets/dataset_dict.py", line 1662, in push_to_hub
    repo_url = api.create_repo(
  File "/home/hesham/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/hesham/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2816, in create_repo
    hf_raise_for_status(r)
  File "/home/hesham/.local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 326, in hf_raise_for_status
    raise BadRequestError(message, response=response) from e
huggingface_hub.utils._errors.BadRequestError:  (Request ID: Root=1-658d3f14-4f2ba1054e506a0e297e5b8a;88f4178a-190f-4f22-a9e5-0982c0339c9b)

Bad request:
Only regular characters and '-', '_', '.' are accepted. '--' and '..' are forbidden. '-' and '.' cannot start or end the name. The name cannot end with ".git". Max length is 96.

Merging and quantization

Suggest adding a script for merging the base model and the QLoRA adapter, and for quantizing to GGUF or GPTQ.
Also, OASST2 was just released, maybe it's better?

Question or bug

Branch
Main branch

Environment
I am using Colab

RAM/vRAM
16Gb ram and V100

Script with parameters
Using the file translate_oasst.py with two arguments (in addition to target_lang and checkpoint_location):
--use_madlad --madlad_quant
in order to test the new MADLAD. I made no changes to the file translate_oasst.py.

Data layout or HF dataset
Dataset is OpenAssistant/oasst1

Problem description/Question
I am trying to create the translation by using the new Madlad.
After I start the script I get the following error message and it stops.

0% 0/88838 [00:00<?, ?it/s]Got 39283 records for source language en, skipping 0 0% 0/88838 [00:24<?, ?it/s]2024-01-08 11:57:10.576859: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-01-08 11:57:10.576969: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2024-01-08 11:57:10.708224: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2024-01-08 11:57:12.917922: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT 0% 20/88838 [02:32<158:08:22, 6.41s/it]["Can you clarify the analogy? I'm not following the notation or vocabulary used in the example.", 'Can you write a formal letter to introduce Jeff Bezos to a customer?', 'I asked about contrastive learning in machine learning which has nothing to do with Jeff Bezos. Contrastive learning is used to increase the performance of vision-based tasks using contrast. I want you to explain this technique to me in a way that anyone without machine learning knowledge would understand.', 'Can you explain why it is important to manage stakeholders and engagement actively for any projects or initiatives that you are involved in your workplace?', 'In simple terms, contrastive learning focuses on teaching an AI the points of similarity between different images (or other media) to indirectly aid it in spotting points of divergence when present. To anthropomorphize the process, a human engaged in contrastive learning and eating hundreds of apples in a week would be better equipped to recognize an orange when presented with one.', 'I want to start doing astrophotography as a hobby, any suggestions what could i do?', "Getting started in astrophotography can seem daunting, but with some patience and practice, you can become a master of the craft. To begin, you'll need a good camera and lens, a tripod, and a dark sky location free of light pollution. You will also need to learn about the basics of astrophotography, such as what camera settings to use, how to capture star trails, and the best techniques for tracking celestial objects. You can also purchase or rent different types of telescopes, depending on what types of objects you want to capture. Additionally, it's important to keep up with the latest astrophotography news and trends. Once you have the necessary equipment and knowledge, you can start shooting and experimenting with different techniques to get the best results.", 'Can you tell me more? What would you recommend as a basic set of equipment to get started with? How much will it cost?', "Astrophotography can be a fun and rewarding hobby, and here are some more in depth suggestions for you to get started:\n\n Equipment: As a beginner, you will need a camera that is capable of taking long exposures and manual control over the settings. A good starting point would be a digital SLR (DSLR) camera or a mirrorless camera. You will also need a sturdy tripod, a fast wide-angle lens (f/2.8 or faster), and a remote shutter release cable to minimize camera shake during long exposures.\n\n Location: Look for a location with minimal light pollution and a clear view of the night sky. 
Check online maps to find the best spots near you.\n\n Settings: Start with a wide-angle lens and set your camera to manual mode. Set your aperture to f/2.8 or wider, ISO to 800 or 1600, and experiment with different exposure times (30 seconds to 2 minutes).\n\n Focus: Focus the lens on the stars using live view and manual focus. You may also need to use a Bahtinov mask to achieve accurate focus.\n\n Editing: After you've taken your photos, use editing software such as Adobe Lightroom or Photoshop to adjust the exposure, contrast, and color balance to get the best results.\n\nThe cost of equipment will depend on the type of camera and lens you choose, but a basic setup can cost anywhere from $500 to $3000. However, you don't need to start with the most expensive equipment to get good results. As you improve, you can upgrade your equipment and add additional accessories such as filters and tracking mounts.\n\nRemember, astrophotography is a hobby that requires patience and practice, so don't be discouraged if your first few attempts don't turn out as well as you had hoped. Keep experimenting and learning, and you'll soon be able to capture stunning images of the night sky!", 'What is a good course of instruction to avoid years of wasted effort?', 'Here is a basic deep-sky astrophotography equipment setup typically includes an equatorial mount, telescope, camera, autoguiding system, and a slew of additional accessories required to run the camera throughout the night. \n\nMount: Sky-Watcher EQ6-R Pro\nTelescope: Sky-Watcher Esprit 100\nCamera: ZWO ASI2600MM Pro\nColor Camera Filters: Optolong L-eXtreme, Radian Triad Ultra\nGuide Scope: William Optics 50mm Guide Scope\nGuide Camera: ZWO ASI290mm Mini\n\nCosts unfortunately may vary drastically, what location on planet earth do you occupy?\nI can help you find average prices and retailers in your area.\n\nWould you like new or used Equipment?\nThis will hap narrow down your selection.', 'You want a camera, tripod, and tracking mount. Those run about 700, 150, and 300 to 2k, respectively. You may also add items like an intervalometer or light pollution filter.\n\n[1] [https://en.wikipedia.org/wiki/Astrophotography\nWhat](https://en.wikipedia.org/wiki/Astrophotography/nWhat) else do you need?', 'buy a telescope\ngo to the planetarium\nbuy a specificaly designed star catching camera', 'Do you have any suggestions for how I might improve my skill in capturing good photos?', 'β€’ Purchase the most ridiculously large telescope you can afford and install it in your backyard. This lets your neighbors know you have sky-high aspirations and a lens to match.\n\nβ€’ Get a camera that lets you crank up the ISO to unconscionably high levels. This\'ll produce lots of noise in your images but makes for a handy excuse for any less-than-stellar results: "It\'s not sensor noise, it\'s star noise!"\n\nβ€’ Spend long hours in the dark gazing upward and complaining about light pollution. Even if you live out in the boondocks, there\'s always some distant illuminated civilization marring your views and precision-focused complaints.\n\nβ€’ Develop an obsession with tracking down elusive deep-sky objects. "M104, M109, show yourselves! I know you\'re out there, hiding among the stars!" The hunt can take all night but that\'s ideal for this hobby.\n\nβ€’ Join online astrophotography forums and brag about your imaging rig and targets acquired. 
The only thing as vast as the cosmos is the astronomical egos of fellow \'astro-photogs.\' Outdo them all!\n\nβ€’ Most importantly, remember that astrophotography is less about taking pictures and more about embracing the night. So have fun and may the stars shine brightly on your new obsession!', 'I think that your reply is humorous/sarcastic, while I wanted a serious one. Please answer once again in a more serious fashion, with actual advice', "I apologize for my previous response. Here are some tips on how to get good pictures, starting out: \n- Get away from high population areas. Light pollution is a major factor in astrophotography. \n- Use what you can afford. You might want to get an expensive telescope and camera, but with good conditions, some phones can even create decent pictures of the night sky (For example, Google Pixel phones have an astrophotography mode) \n- Keep your camera stable. Astrophotography needs long exposures to gather enough light, so any movement will show up as a blur. \n- For very long exposures, consider a tripod of setup to track the object of interest through the sky\n- Keep an eye on the weather. Even if there is no precipitation or clouds, the atmosphere is turbulent, which will affect your pictures. \n- Start simple and keep reasonable expectations. Your pictures starting out will look amazing to you, but probably won't look as good as what you see online. Remember that other people may have better equipment, locations, and experience. You should seek advice from them, and improve yourself.", "Astrophotography is an interesting hobby and with increasingly better cameras and optics, entry to it becomes easier. If you don't have a telescope, you could see if you can find an observatory you can visit, they often have open doors for interested people. You could also try to find a community to help you get started. \nIt is very favorable to make photos of the stars in areas with low light emissions, so it is best to be in areas without city lights to take good pictures. Modern smartphones are very competent in enhancing low-light photography, and you can find cheap starter kits or even DIY projects for your first telescope. \nAs long as you are curious, you will find ways to look for the stars", 'Can you give me an example of a python script that opens an api point and serves a string?', 'Sure! Here\'s an example Python script that uses the Flask web framework to create a simple API endpoint that serves a string:\n\n``` \nfrom flask import Flask\n\napp = Flask(__name__)\n\[email protected](\'/\')\ndef hello_world():\n return \'Hello, world!\'\n\nif __name__ == \'__main__\':\n app.run()\n\n``` \n\nIn this script, we first import the Flask class from the flask module. Then we create a new instance of the Flask class, using the __name__ variable to specify the name of the application.\n\\\nNext, we define a new route using the @app.route() decorator. This decorator tells Flask to map requests to the root URL ("/") to the hello_world() function.\n\\\nFinally, we use the if __name__ == \'__main__\': block to start the Flask application when the script is executed. By default, the application will run on port 5000.\n\\\nYou can run this script and test the API by opening a web browser and navigating to http://localhost:5000/. You should see a simple "Hello, world!" 
message displayed in your browser.']
  0% 20/88838 [04:33<337:34:50, 13.68s/it]
Traceback (most recent call last):
  File "/content/drive/MyDrive/LLM_NewLanguage/translate_oasst.py", line 232, in <module>
    main()
  File "/content/drive/MyDrive/LLM_NewLanguage/translate_oasst.py", line 203, in main
    translated_batch = batch_translate_madlad(texts_to_translate, target_lang)
  File "/content/drive/MyDrive/LLM_NewLanguage/translate_oasst.py", line 101, in batch_translate_madlad
    raise Exception("Failed to translate properly")
Exception: Failed to translate properly
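One way to narrow this down while debugging (purely a sketch; `batch_translate_madlad` here stands in for whatever translation call your version of the script uses, and the fallback behaviour is an assumption, not the repository's actual logic) is to retry a failing batch item by item, so a single problematic record does not abort the whole translation run and the offending input can be identified:

```
# Hypothetical debugging helper -- not part of the repository.
# `translate_fn` stands in for the batch translation function from the traceback above.
def safe_batch_translate(texts, target_lang, translate_fn):
    """Try the whole batch first; on failure, retry items one at a time."""
    try:
        return translate_fn(texts, target_lang)
    except Exception:
        results = []
        for text in texts:
            try:
                results.extend(translate_fn([text], target_lang))
            except Exception:
                # Keep the untranslated text so record counts stay aligned,
                # and log it so the offending input can be inspected later.
                print(f"Failed to translate record: {text[:80]!r}")
                results.append(text)
        return results
```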

combine_checkpoints.py translated dataset path

It looks like the README should be updated regarding the combine_checkpoints.py path for the translated OASST dataset.
The instructions for the combine_checkpoints.py script describe the path to the translated dataset like this:

(screenshot from the README: Screenshot_20240103_092049)

but my checkpoints folder doesn't have the language part mentioned in the README. It looks like this:

(screenshot of my checkpoints folder: Screenshot_20240103_092726)

Now when I run `python3 combine_checkpoints.py /checkpoints/` it works fine, but when I add the language part (like nl) from the docs, it fails.
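For reference, both invocations look like this (a hedged example; the folder names are illustrative, and whether a language subfolder exists depends on the checkpoint_location you passed to translate.py):

```
# checkpoints written directly into ./checkpoints (no language subfolder)
python combine_checkpoints.py ./checkpoints ./combined_dataset

# only append a language folder if translate.py actually created one, e.g. ./checkpoints/nl
python combine_checkpoints.py ./checkpoints/nl ./combined_dataset
```

In short, pass the folder that actually contains the checkpoint files; the nl subfolder shown in the README only applies if translate.py created it on disk.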

Turkish Translation Error

Hello
First of all, thank you for this amazing tutorial.
I am unable to run the code for Turkish: the results show "null" for every "text" field.

Any idea why this is happening?
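One thing worth ruling out (a standalone sanity check, assuming the m2m translation paradigm; the model name and probe sentence are just examples) is whether the translation model can produce Turkish output at all. If this probe already fails or returns empty strings, the problem lies with the model/language pair rather than with the finetuning scripts:

```
# Standalone sanity check (not part of the LLaMa2lang scripts):
# translate one English sentence to Turkish with M2M100.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "en"
encoded = tokenizer("Hello, how are you?", return_tensors="pt")
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("tr"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```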

error running benchmark.py with seamless

Branch: main

Environment: Google Colab
RAM/vRAM: 16 GB vRAM

Script with parameters: !python benchmark.py en eu seamless

Data layout or HF dataset:

Problem description/Question:

Hi, I tried to run the script !python benchmark.py en eu seamless
and I am getting this error:

/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
2024-02-12 07:43:34.426861: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-12 07:43:34.426909: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-12 07:43:34.432369: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-12 07:43:36.424543: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[---- LLaMa2Lang ----] Starting benchmarking from en to eu for models ['seamless'] on 100 records on device cuda:0
Traceback (most recent call last):
  File "/content/LLaMa2lang/benchmark.py", line 109, in <module>
    main()
  File "/content/LLaMa2lang/benchmark.py", line 91, in main
    translator = Seamless_M4T_V2(device, True, quant4_config, False, max_length, model_size)
TypeError: Seamless_M4T_V2.__init__() takes 6 positional arguments but 7 were given

Unfortunately I don't understand this, so I don't think I can give you more information.
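For anyone hitting the same TypeError: Python counts self as one of the positional arguments, so "takes 6 positional arguments but 7 were given" means benchmark.py passes six arguments to a constructor that only declares five besides self. A minimal illustration of the mismatch (parameter names here are hypothetical, not the repository's actual Seamless_M4T_V2 signature):

```
# Illustrative only -- these parameter names are guesses, not the real Seamless_M4T_V2 API.
class Translator:
    def __init__(self, device, quant4, quant4_config, max_length, model_size):
        # 5 parameters + self = 6 positional arguments in total
        self.device = device

# Passing 6 arguments (7 including self) reproduces the error from the traceback:
# Translator("cuda:0", True, None, False, 512, "large")
#   -> TypeError: __init__() takes 6 positional arguments but 7 were given

# The fix is to bring the constructor and the call site in benchmark.py back in line,
# e.g. by adding the missing parameter:
class TranslatorFixed:
    def __init__(self, device, quant4, quant4_config, quant8, max_length, model_size):
        self.device = device
```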
