understandlingbv / llama2lang


Convenience scripts to finetune (chat-)LLaMa3 and other models for any language

License: Apache License 2.0

Python 99.69% Shell 0.31%
ai genai huggingface llama2 llama3 llm mistral

llama2lang's Issues

Question: translating a monolingual HF Dataset

Hey guys.
I was just wondering whether the translation step also fits datasets with one language only. I am asking because I saw that there is a default parameter which specifies the column that contains the language.
If so, how can I handle the scenario of having no column indicating the language, because the dataset is, let's say, all English?
Many thanks.
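One workaround, assuming translate.py's --base_dataset_lang_field can point at any column (see its usage output elsewhere in these issues), is to tag every record with a constant language code before translating. The field names below are illustrative, not the repo's defaults:

```python
# Tag a monolingual dataset with a constant language column so that
# a --base_dataset_lang_field-style option has something to point at.
def tag_language(records, lang="en"):
    return [{**row, "lang": lang} for row in records]

rows = tag_language([{"text": "Hello"}, {"text": "World"}])
print(rows[0])  # {'text': 'Hello', 'lang': 'en'}
```

With Hugging Face datasets the same idea is `ds.map(lambda row: {"lang": "en"})`, after which you could pass `--base_dataset_lang_field lang`.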

Merging and quantization

I suggest adding a script for merging the base model with its QLoRA adapter, and for quantizing to GGUF or GPTQ.
Also, OASST2 was just released; maybe it's a better base?
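Until such a script lands, the merge step can be sketched with peft's `merge_and_unload`. This is a hypothetical outline, not the repo's implementation: the model id and paths are placeholders, and it assumes `transformers` and `peft` are installed.

```python
def merge_qlora(base_model_id: str, adapter_dir: str, out_dir: str) -> str:
    """Merge a QLoRA adapter into its base model and save a plain
    Hugging Face checkpoint that GGUF/GPTQ converters can consume."""
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype="auto")
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(out_dir)
    return out_dir

# e.g. merge_qlora("meta-llama/Llama-2-7b-chat-hf", "./qlora_out", "./merged")
```

From the merged folder, llama.cpp's conversion script can then produce a GGUF file, and AutoGPTQ can quantize to GPTQ; consult those projects' documentation for the exact invocations.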

nllb.py and madlad.py point to incorrect HF repositories

Branch Main

Environment Google Colab
RAM/vRAM 16gb vram

Script with parameters nllb.py; madlad.py

Data layout or HF dataset facebook/nllb-200-distilled-1.3B; google/madlad400-7b-mt-bt; nllb-200-distilled-600M

Problem description/Question
Hi, I was testing the benchmark script and I found that it points to some wrong Hugging Face repositories. I guess the actual problem is in the files nllb.py and madlad.py.

When executing them, the code tries to get config.json from a wrong URL pointing to the Hugging Face repositories. For each pair below, the first URL is the one the software tries to use, and the second is the correct one that I checked works:

Wrong:   https://huggingface.co/facebook/nllb-200-DISTILLED600M/resolve/main/config.json
Correct: https://huggingface.co/facebook/nllb-200-distilled-600M/resolve/main/config.json

Wrong:   https://huggingface.co/google/madlad400-7b-bt-mt-bt/resolve/main/config.json
Correct: https://huggingface.co/google/madlad400-7b-mt-bt/resolve/main/config.json

Wrong:   https://huggingface.co/facebook/nllb-200-DISTILLED1.3B/resolve/main/config.json
Correct: https://huggingface.co/facebook/nllb-200-distilled-1.3B/resolve/main/config.json
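In other words, the repo ids are case-sensitive and must match exactly. A hedged sketch of the corrected lookup (the dict layout is illustrative, not the actual structure of nllb.py/madlad.py; the repo ids are the ones verified above):

```python
# Correct, case-sensitive Hugging Face repo ids per model size.
NLLB_REPOS = {
    "distilled600m": "facebook/nllb-200-distilled-600M",
    "distilled1.3b": "facebook/nllb-200-distilled-1.3B",
}
MADLAD_REPOS = {
    "7b-bt": "google/madlad400-7b-mt-bt",
}

print(NLLB_REPOS["distilled1.3b"])  # facebook/nllb-200-distilled-1.3B
```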

This is an example of the error that I am getting:

Input: !python benchmark.py en eu nllb_distilled1.3b

Output:

/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
2024-02-12 07:37:14.626071: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-12 07:37:14.626122: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-12 07:37:14.628076: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-12 07:37:16.078312: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[---- LLaMa2Lang ----] Starting benchmarking from en to eu for models ['nllb_distilled1.3b'] on 100 records on device cuda:0
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/facebook/nllb-200-DISTILLED1.3B/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 389, in cached_file
    resolved_file = hf_hub_download(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1374, in hf_hub_download
    raise head_call_error
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1247, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1624, in get_hf_file_metadata
    r = _request_wrapper(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 402, in _request_wrapper
    response = _request_wrapper(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 426, in _request_wrapper
    hf_raise_for_status(response)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 320, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-65c9cab0-385f3047149c5c1844aa14bd;e916b38a-b643-499b-a742-f5c9d4a5454d)

Repository Not Found for url: https://huggingface.co/facebook/nllb-200-DISTILLED1.3B/resolve/main/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/content/LLaMa2lang/benchmark.py", line 109, in <module>
    main()
  File "/content/LLaMa2lang/benchmark.py", line 86, in main
    translator = NLLBTranslator(device, True, quant4_config, False, max_length, model_size)
  File "/content/LLaMa2lang/translators/nllb.py", line 40, in __init__
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map=device, quantization_config=self.quant4_config, load_in_4bit=True)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 488, in from_pretrained
    resolved_config_file = cached_file(
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 410, in cached_file
    raise EnvironmentError(
OSError: facebook/nllb-200-DISTILLED1.3B is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`

English folder is empty after translating the dataset to English

Pretty much the title. I am planning to use a different in-house translation model for the OASST dataset. After translating OASST to English I found the English folder is empty. What should I do to get the records back? Should I strip them from the original dataset itself?

AttributeError: 'Dataset' object has no attribute 'keys'

I get this error when running the create_thread_prompts step:

Traceback (most recent call last):
  File "/Users/admin/Downloads/LLaMa2lang-main/create_thread_prompts.py", line 100, in <module>
    main()
  File "/Users/admin/Downloads/LLaMa2lang-main/create_thread_prompts.py", line 59, in main
    folds = dataset.keys()
            ^^^^^^^^^^^^
AttributeError: 'Dataset' object has no attribute 'keys'
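A likely cause is that `load_dataset` returns a `DatasetDict` (which has `.keys()` for its splits) in some code paths but a bare `Dataset` in others. A duck-typed guard, sketched here with illustrative names, handles both shapes:

```python
def get_folds(dataset):
    # DatasetDict behaves like a mapping of split name -> Dataset.
    if hasattr(dataset, "keys"):
        return list(dataset.keys())
    # A bare Dataset has no splits; treat it as a single one.
    return ["train"]

print(get_folds({"train": 1, "validation": 2}))  # ['train', 'validation']
```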

Turkish Translation Error

Hello
First of all thank you for this amazing tutorial.
I am unable to run the code for Turkish; the results show "null" for every "text" field.

Any idea why this is happening?

[Bug] Error with benchmarking: 'NoneType' object is not iterable

Branch main

Environment
RAM/vRAM

Script with parameters

python benchmark.py en sl "opus, m2m_418m, m2m_1.2b, madlad_3b, madlad_7b, madlad_10b, madlad_7bbt, mbart, nllb_distilled600m, nllb_1.3b, nllb_distilled1.3b, nllb_3.3b, seamless"  # Try to benchmark

Data layout or HF dataset opus-100

Problem description/Question
I'm getting an error when trying to benchmark the translator models. I ran the above command and got the following output:

[---- LLaMa2Lang ----] Starting benchmarking from en to sl for models ['opus'] on 100 records on device cuda:0
[---- LLaMa2Lang ----] No translation possible from en to sl
Traceback (most recent call last):
  File "benchmark.py", line 109, in <module>
    main()
  File "benchmark.py", line 98, in main
    translated += translator.translate([s[source_language]], source_language, model_target_language)
TypeError: 'NoneType' object is not iterable

I'm not sure why there is "No translation possible from en to sl", as the pair clearly exists in the dataset: https://huggingface.co/datasets/Helsinki-NLP/opus-100/viewer/en-sl
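A defensive patch would skip unsupported pairs instead of crashing mid-benchmark. The sketch below uses a stub translator; the call signature mirrors the traceback, but the names are illustrative, not the repo's actual classes:

```python
def safe_translate(translator, texts, source_lang, target_lang):
    result = translator.translate(texts, source_lang, target_lang)
    return [] if result is None else result  # None == "no translation possible"

class StubTranslator:
    """Stand-in for the repo's translator classes."""
    def translate(self, texts, src, tgt):
        return None  # simulate the unsupported en -> sl pair

print(safe_translate(StubTranslator(), ["hello"], "en", "sl"))  # []
```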

Madlad: unrecognized arguments: --model_size 7b

Branch
Main

Environment
Colab

RAM/vRAM
16

Script with parameters
Using the translate.py

Data layout or HF dataset
Default dataset

Problem description/Question

It looks like Madlad returns an error. If you use the instruction as given in the readme:

# Using madlad 7B with 8bit quantization for German with different max_length
python translate.py madlad de ./output_de --quant8 --batch_size 5 --max_length 512 --model_size 7b

You get:
usage: translate.py [-h] [--quant8] [--quant4] [--base_dataset BASE_DATASET]
                    [--base_dataset_text_field BASE_DATASET_TEXT_FIELD]
                    [--base_dataset_lang_field BASE_DATASET_LANG_FIELD]
                    [--checkpoint_n CHECKPOINT_N] [--batch_size BATCH_SIZE]
                    [--max_length MAX_LENGTH] [--cpu]
                    {opus,mbart,madlad,m2m} ... target_lang checkpoint_location
translate.py: error: unrecognized arguments: --model_size 7b

I tried

!python /content/LLaMa2lang/translate.py madlad -h

and noticed that the parameter appears to be supported:

usage: translate.py madlad [-h] [--model_size {3b,7b,7b-bt}]

options:
  -h, --help            show this help message and exit
  --model_size {3b,7b,7b-bt}
                        The size of the MADLAD model to use. 7b-bt is the backtrained version
                        (best to avoid unless you know what you are doing).
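This looks like an argument-ordering quirk of argparse sub-parsers. A minimal reconstruction (not the repo's actual parser code) reproduces both the failure and a workaround:

```python
import argparse

parser = argparse.ArgumentParser(prog="translate.py")
sub = parser.add_subparsers(dest="model")
madlad = sub.add_parser("madlad")
madlad.add_argument("--model_size", choices=["3b", "7b", "7b-bt"], default="3b")
parser.add_argument("target_lang")
parser.add_argument("checkpoint_location")

# A trailing --model_size is handed to the main parser, which doesn't know it:
#   parser.parse_args(["madlad", "de", "./output_de", "--model_size", "7b"])
#   -> error: unrecognized arguments: --model_size 7b

# Placing it right after the sub-command lets the sub-parser consume it:
args = parser.parse_args(["madlad", "--model_size", "7b", "de", "./output_de"])
print(args.model_size, args.target_lang)  # 7b de
```

If the same holds for translate.py, moving `--model_size 7b` directly after `madlad` (before the positional arguments) should avoid the error.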

Sample Example for finetuning

Can you also provide a sample Jupyter notebook implementation for the finetuning part? I'm not able to figure out the structure of the dataset to be provided for the finetuning step.

Thanks

Entire dataset in English

Is the entire dataset available in English, so that the translation is easier? Doing it for a rare (South Asian) language is difficult from other source languages, as translation models are available only from English.

Question: trained model won't stop generating?

Hi, I tried to train and the resulting model seems unable to stop generating (using llama.cpp). What do you use for the stop token? Is it or [/INST]?
[/INST] seems to work better,
but when I look at the training data, it seems to be stopped by ?

And at what loss level do you stop training? I cannot get under 1.3 right now.

Thanks

Dataset chat format independent

Hey guys,
So, have any of you thought about creating a dataset for fine-tuning that is chat-template independent, i.e. one that works across models?
Let me give you an example: I used UnderstandLing/oasst1_pt_threads to fine-tune Llama, and it was awesome. But I can't do the same thing with phi-2.
Every model has its own way of handling chat templates. It would be really cool to have a translated dataset that I could convert to a chat template afterwards.
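One way to get there, sketched with illustrative field names: keep the translated threads in a neutral role/content form, and render them per model at training time.

```python
def thread_to_messages(thread):
    # Neutral chat form that transformers' chat templates understand.
    return [{"role": turn["role"], "content": turn["text"]} for turn in thread]

messages = thread_to_messages([
    {"role": "user", "text": "Olá!"},
    {"role": "assistant", "text": "Olá, como posso ajudar?"},
])
print(messages[0]["role"])  # user
```

From there, `tokenizer.apply_chat_template(messages, tokenize=False)` in recent transformers versions renders the thread in whatever format the target model (Llama, phi-2, ...) expects, provided its tokenizer ships a chat template.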

Error with create_thread_prompts.py

Hello, I've managed to translate and combine the dataset (OASST1 or OASST2, it doesn't make a difference) in Finnish, but my progress stops here.

Error messages:

Downloading data files: 100%|███████████████████████████████████████████| 2/2 [00:00<00:00, 7025.63it/s]
Extracting data files: 100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 3221.43it/s]
Generating train split: 1 examples [00:00, 262.09 examples/s]
Generating validation split: 1 examples [00:00, 470.27 examples/s]
Traceback (most recent call last):
  File "/home/mkayhko/LLaMa2lang/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'rank'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mkayhko/LLaMa2lang/create_thread_prompts.py", line 40, in <module>
    min_rank = df['rank'].min()
  File "/home/mkayhko/LLaMa2lang/lib/python3.10/site-packages/pandas/core/frame.py", line 3807, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/mkayhko/LLaMa2lang/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: 'rank'
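If the translated dataset genuinely lacks a 'rank' column, a guard around the failing line (40, per the traceback) would let the script proceed. This is a hypothetical patch, and the None fallback is an assumption, not the repo's behavior:

```python
import pandas as pd

def min_rank(df: pd.DataFrame):
    # Some translated/combined datasets may lack the 'rank' column entirely.
    if "rank" in df.columns:
        return df["rank"].min()
    return None

print(min_rank(pd.DataFrame({"text": ["a", "b"]})))  # None
```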

Issue with THREAD_TEMPLATE

Branch
Main

Environment
RAM/vRAM
Colab

Script with parameters
It's step 5:
finetune_llama.py [--base_model BASE_MODEL] tuned_model dataset_name

Problem description/Question
At step 5 there seems to be an issue with the default THREAD_TEMPLATE.
An error is thrown:

[Screenshot 2024-01-22 at 11:22:43]

even though the file exists:

[Screenshot 2024-01-22 at 11:28:36]

Setting it manually as an option works, though.

[IDEA] Include a better way to translate dataset?

I have used the default translation from step 2, but sadly a lot of those translations, at least from English to Polish, are gibberish and absolutely terrible: https://huggingface.co/datasets/chrystians/Jestes?row=3

I want to create a thread to start a discussion about possible alternatives; obvious ones would be something like AWS Translate or DeepL. For that we would need to write a script for the API integration. I also don't know how costly that is, or whether there are better open-source alternatives.

There are currently 9,949,085 characters (roughly 10M) in the oasst1 dataset.
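For a rough sense of scale, a back-of-the-envelope cost estimate. The per-character rate below is a placeholder, not a real quote; check the provider's current pricing:

```python
CHARS = 9_949_085        # characters in oasst1, per the count above
RATE_PER_MILLION = 25.0  # hypothetical USD per 1M translated characters

cost = CHARS / 1_000_000 * RATE_PER_MILLION
print(f"${cost:.2f}")  # $248.73
```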

Feedback on the Hindi finetuned model

I just tried out the Hindi model; the outputs were very inconsistent and illogical. Do you think pretraining it with the new syllables from a custom tokenizer would make it better? Are you planning to add that to the pipeline?

Question or bug

Branch
Main branch

Environment
I am using Colab

RAM/vRAM
16Gb ram and V100

Script with parameters
Using the file translate_oasst.py with two arguments (in addition to target_lang and checkpoint_location):
--use_madlad --madlad_quant
in order to test the new Madlad. I made no changes to the file translate_oasst.py.

Data layout or HF dataset
Dataset is OpenAssistant/oasst1

Problem description/Question
I am trying to create the translation by using the new Madlad.
After I start the script I get the following error message and it stops.

0% 0/88838 [00:00<?, ?it/s]Got 39283 records for source language en, skipping 0
2024-01-08 11:57:10.576859: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-08 11:57:10.576969: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-08 11:57:10.708224: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-08 11:57:12.917922: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
0% 20/88838 [02:32<158:08:22, 6.41s/it]
[... the script then dumps the full batch of English source texts it was trying to translate; omitted here for brevity ...]
0% 20/88838 [04:33<337:34:50, 13.68s/it]
Traceback (most recent call last):
  File "/content/drive/MyDrive/LLM_NewLanguage/translate_oasst.py", line 232, in <module>
    main()
  File "/content/drive/MyDrive/LLM_NewLanguage/translate_oasst.py", line 203, in main
    translated_batch = batch_translate_madlad(texts_to_translate, target_lang)
  File "/content/drive/MyDrive/LLM_NewLanguage/translate_oasst.py", line 101, in batch_translate_madlad
    raise Exception("Failed to translate properly")
Exception: Failed to translate properly

Error when executing create_thread_prompts.py

Hi,

First of all, thanks for your work. :)

In the step of create_thread_prompts.py I am getting this error using Google Colab.

Please explain what I am doing wrong, and sorry if it's obvious, but I am not really familiar with programming.

This is the input, after using my Hugging Face token with write access to my repository:
!python create_thread_prompts.py "{base_dir}/Eus02" "eu: Chatbot generikoa zara, beti euskaraz erantzuten duena." "elBlacksmith/Eus02"

This is the output:

Downloading data files: 100% 1/1 [00:00<00:00, 13443.28it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 23431.87it/s]
Generating train split: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1941, in _prepare_split_single
    num_examples, num_bytes = writer.finalize()
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_writer.py", line 599, in finalize
    raise SchemaInferenceError("Please pass `features` or at least one example when writing data")
datasets.arrow_writer.SchemaInferenceError: Please pass `features` or at least one example when writing data

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/content/llama2lang/create_thread_prompts.py", line 12, in <module>
    dataset = load_dataset('arrow', data_files=os.path.join(dataset_name, '*.arrow'))
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 2152, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 948, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1043, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1805, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1950, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

By the way, I am training the Basque/Euskera (eu) language and I am not sure translate_oasst.py is executing correctly, as it creates several folders inside the "train" and "validation" folders, one per source language (see attached screenshot). Maybe it's fine, but I wanted to point it out because at https://huggingface.co/Helsinki-NLP/opus-mt-eu-en there is a Basque model, so it doesn't look logical to use other languages. But then again, I have little knowledge of what I am actually doing and perhaps it is supposed to work like that.

[screenshot]

Bad request: Only regular characters and '-', '_', '.' are accepted. '--' and '..' are forbidden. '-' and '.' cannot start or end the name. The name cannot end with ".git". Max length is 96.

python3 create_thread_prompts.py HeshamHaroon/oasst-arabic أنت روبوت محادثة عام يجيب دائمًا باللغة العربيHeshamHaroon/oasst1-ar-threads
(the Arabic text is the system-prompt argument and reads: "You are a general chatbot that always answers in Arabic"; note it runs straight into the output repo name, suggesting a missing space or quote in the invocation)
  6%|██                                   | 9845/177210 [01:45<29:58, 93.07it/s]
  6%|██▏                                    | 517/9306 [00:00<00:13, 638.50it/s]
Traceback (most recent call last):
  File "/home/hesham/.local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
    response.raise_for_status()
  File "/home/hesham/.local/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/repos/create

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/hesham/LLaMa2lang/create_thread_prompts.py", line 72, in <module>
    dataset.push_to_hub(output_location)
  File "/home/hesham/.local/lib/python3.10/site-packages/datasets/dataset_dict.py", line 1662, in push_to_hub
    repo_url = api.create_repo(
  File "/home/hesham/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/hesham/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2816, in create_repo
    hf_raise_for_status(r)
  File "/home/hesham/.local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 326, in hf_raise_for_status
    raise BadRequestError(message, response=response) from e
huggingface_hub.utils._errors.BadRequestError:  (Request ID: Root=1-658d3f14-4f2ba1054e506a0e297e5b8a;88f4178a-190f-4f22-a9e5-0982c0339c9b)

Bad request:
Only regular characters and '-', '_', '.' are accepted. '--' and '..' are forbidden. '-' and '.' cannot start or end the name. The name cannot end with ".git". Max length is 96.

translate_oasst error (Japanese)

Hey there, not sure if it's a configuration issue on my end, but while trying to create a Japanese dataset the run gets towards the end, starts loading up all my VRAM until nothing more fits, then dumps it and starts again. Not sure if that's normal behavior; should I just leave it?

running command python translate_oasst.py ja ja 500 20

Screenshot of the behavior:

[Screenshot 2024-01-03 10:42:21]

combined dataset too small?

I ran the translate_oasst.py pl script with batch_size=40 and it took around 1.5 h on an RTX 3090. It completed without errors, but after running the combine_checkpoints.py script I only get 27k records in my dataset:

https://huggingface.co/datasets/mpazdzioch/oasst1_pl2

I guess something is not right, because all the other language datasets linked from the readme have 88k rows.
Any ideas how to debug this?
I included the output from translate_oasst.py, combine_checkpoints.py and create_thread_prompts.py in the attachment.
output.txt
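One way to narrow a row-count discrepancy like this down is to count the records still sitting in the checkpoint files before combining: if the count is already low there, the loss happened during translation rather than in combine_checkpoints.py. A rough sketch, assuming each checkpoint is a JSON file holding a list of records (adjust the glob and loader to your actual checkpoint layout):

```python
import json
from pathlib import Path

def count_checkpoint_records(checkpoint_dir: str) -> int:
    """Sum record counts across checkpoint JSON files under a directory.

    Assumes each checkpoint file contains a JSON list of translated
    records; this is an assumption about the layout, not the repo's API.
    """
    total = 0
    for path in sorted(Path(checkpoint_dir).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        total += len(data)
    return total
```

Comparing that total to the 88k rows of the source dataset shows which stage dropped the records.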

translate_oasst.py: IndexError: list index out of range

Hi,

I am executing it on Google Colab with a V100. The non-batch version didn't have this error.

I saw that you published the batch update, so I tried it, but I am getting this error:

This is the input:

# Translate the OASST1 dataset into your target language
!python translate_oasst.py en "{base_dir}/test02" 1000

This is the output:

2023-12-30 22:11:02.243645: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-30 22:11:02.243731: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-30 22:11:02.245221: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-30 22:11:03.327637: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/dist-packages/trl/trainer/ppo_config.py:141: UserWarning: The `optimize_cuda_cache` arguement will be deprecated soon, please use `optimize_device_cache` instead.
  warnings.warn(
Traceback (most recent call last):
  File "/content/llama2lang/translate_oasst.py", line 25, in <module>
    batch_size = int(sys.argv[4])
IndexError: list index out of range
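The traceback is the whole story: the script reads `sys.argv[4]` unconditionally, so invoking it with only three arguments raises IndexError. A defensive sketch of the argument handling (`parse_args` is a hypothetical helper, and the default of 10 is an assumption for illustration, not the script's actual default):

```python
import sys

def parse_args(argv):
    """Parse translate_oasst-style CLI args with an optional batch size.

    Assumed positional layout: script, target_lang, checkpoint_location,
    checkpoint_n, [batch_size]; the last one falls back to a default
    when omitted instead of raising IndexError.
    """
    target_lang = argv[1]
    checkpoint_location = argv[2]
    checkpoint_n = int(argv[3])
    batch_size = int(argv[4]) if len(argv) > 4 else 10
    return target_lang, checkpoint_location, checkpoint_n, batch_size
```

With a guard like this, the three-argument invocation shown above would run with the default batch size instead of crashing.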

problem with run_inference.py

Branch main

Environment Google Colab Pro; GPU T4
RAM/vRAM 16 GB VRAM

Script with parameters !python run_inference.py UnderstandLing/llama-2-7b-chat-es "Hazme una lista de ciudades"

Data layout or HF dataset

Problem description/Question
Hi, I am trying to run run_inference.py to test my results but I am running into problems. I tried with one of your finetunes and the problem is the same. When running the above script I get this output after downloading all the files:

Loading checkpoint shards: 100% 2/2 [01:03<00:00, 31.82s/it]
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:389: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:394: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
Enter your input, use ':n' for a new thread or ':q' to quit:

Thanks!
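For what it's worth, the two UserWarnings above are harmless: the model's saved generation config sets `temperature` and `top_p` while `do_sample` is False, so those values are simply ignored. If you want them to take effect, a sketch (assuming the standard transformers `GenerationConfig` API) would be:

```python
from transformers import GenerationConfig

# Enable sampling so temperature/top_p are actually used during generation
generation_config = GenerationConfig(do_sample=True, temperature=0.9, top_p=0.6)
```

This is only needed to silence the warnings; the prompt `Enter your input...` at the end indicates the script itself loaded successfully and is waiting for input.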

error running benchmark.py with seamless

Branch main

Environment Google Colab
RAM/vRAM 16 gb vram

Script with parameters !python benchmark.py en eu seamless

Data layout or HF dataset

Problem description/Question

Hi, I tried to run the script !python benchmark.py en eu seamless
and I am getting this error:

/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
2024-02-12 07:43:34.426861: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-12 07:43:34.426909: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-12 07:43:34.432369: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-12 07:43:36.424543: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[---- LLaMa2Lang ----] Starting benchmarking from en to eu for models ['seamless'] on 100 records on device cuda:0
Traceback (most recent call last):
  File "/content/LLaMa2lang/benchmark.py", line 109, in <module>
    main()
  File "/content/LLaMa2lang/benchmark.py", line 91, in main
    translator = Seamless_M4T_V2(device, True, quant4_config, False, max_length, model_size)
TypeError: Seamless_M4T_V2.__init__() takes 6 positional arguments but 7 were given
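The TypeError means benchmark.py passes one more positional argument than `Seamless_M4T_V2.__init__` accepts (6 expected including `self`, 7 given), so either a flag was added at the call site or a parameter was removed from the constructor. A sketch of how to surface such a mismatch readably, using a hypothetical stub whose parameter names are assumed, not taken from the repo:

```python
import inspect

class Seamless_M4T_V2_stub:
    # Hypothetical 5-parameter constructor (6 positionals counting self),
    # mirroring the arity the traceback implies.
    def __init__(self, device, quant4, quant4_config, max_length, model_size):
        self.device = device

def safe_construct(cls, *args):
    """Instantiate cls only if the positional arg count matches,
    raising a message that names the expected parameters."""
    params = [p for p in inspect.signature(cls.__init__).parameters.values()
              if p.name != "self" and p.default is inspect.Parameter.empty]
    if len(args) != len(params):
        raise TypeError(
            f"{cls.__name__} expects {len(params)} positional args "
            f"({[p.name for p in params]}), got {len(args)}")
    return cls(*args)
```

Dropping the extra argument from the call in benchmark.py (or adding the missing parameter to the constructor) resolves the error.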

Unfortunately I don't understand this, so I don't think I can give you more information.

Error when trying to run create_thread_prompts.py

I am running the following command with this dataset:
python3 create_thread_prompts.py chrystians/oasst1_pl_2 "Jestes polskim chatbotem ktory odpowiada tylko po polsku" oasst1_pl_2threads

  6%|█████                                                                                       | 348/6264 [00:00<00:09, 602.31it/s]
Traceback (most recent call last):
  File "/madladLLaMa2lang/LLaMa2lang/create_thread_prompts.py", line 100, in <module>
    main()
  File "/madladLLaMa2lang/LLaMa2lang/create_thread_prompts.py", line 90, in main
    dataset[fold] = dataset[fold].rename_column('0', 'text')
AttributeError: 'list' object has no attribute 'rename_column'

I also tried running it with a previously successful dataset that had worked with this command, and the script failed there too:

 python3 create_thread_prompts.py chrystians/oasst1_pl_2 "Jestes polskim chatbotem ktory odpowiada^Cylko po polsku" oasst1_pl_2threads
root@c7a6c8a800b6:/madladLLaMa2lang/LLaMa2lang#  python3 create_thread_prompts.py chrystians/oasst1_pl "test" testUsnac
  6%|█████                                                                                       | 201/3618 [00:00<00:04, 800.76it/s]
Traceback (most recent call last):
  File "/madladLLaMa2lang/LLaMa2lang/create_thread_prompts.py", line 100, in <module>
    main()
  File "/madladLLaMa2lang/LLaMa2lang/create_thread_prompts.py", line 90, in main
    dataset[fold] = dataset[fold].rename_column('0', 'text')
AttributeError: 'list' object has no attribute 'rename_column'
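The AttributeError says `dataset[fold]` is a plain Python list (of dicts) at that point rather than a `datasets.Dataset`, so `rename_column` does not exist on it. A minimal sketch of the same rename on the list-of-dicts shape (`rename_key` is a hypothetical helper; with the `datasets` library installed you could instead convert via `Dataset.from_list(records).rename_column('0', 'text')`):

```python
def rename_key(records, old, new):
    """Rename a column in a list-of-dicts fold, the shape that triggers
    the AttributeError above when rename_column is called on it."""
    return [{(new if k == old else k): v for k, v in row.items()}
            for row in records]
```

Either converting the fold to a `Dataset` before the rename or renaming the keys directly avoids the crash.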

combine_checkpoints.py translated dataset path

It looks like the README should be updated regarding the combine_checkpoints.py path for the translated OASST dataset.
The instructions for the combine_checkpoints.py script describe the path to the translated dataset like this:

Screenshot_20240103_092049

but my checkpoints folder doesn't have the language part mentioned in the README. It looks like this:

Screenshot_20240103_092726

Now when I run python3 combine_checkpoints.py /checkpoints/ it works fine, but when I add the language part (e.g. nl) from the docs, it fails.
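A tolerant caller could accept both layouts by probing for the language subfolder first. A sketch (`resolve_checkpoint_dir` is a hypothetical helper, not part of the repo):

```python
import os

def resolve_checkpoint_dir(base: str, lang: str) -> str:
    """Return base/lang if that subfolder exists (the layout the README
    describes), otherwise base itself (the layout observed above)."""
    candidate = os.path.join(base, lang)
    return candidate if os.path.isdir(candidate) else base
```

Until the README and the script agree, pointing the script at whichever directory actually contains the checkpoint files is the workaround.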
