
We unified the interfaces of instruction-tuning data (e.g., CoT data), multiple LLMs and parameter-efficient methods (e.g., LoRA, P-tuning) for easy use. We welcome open-source enthusiasts to initiate any meaningful PR on this repo and integrate as many LLM-related technologies as possible. (We built a fine-tuning platform that makes it easy for researchers to get started with and use large models; open-source enthusiasts are welcome to initiate any meaningful PR!)

License: Apache License 2.0

chatglm llama llm lora chatgpt cot instruction-tuning alpaca moss p-tuning

alpaca-cot's Introduction

Chinese | English

Alpaca-CoT

Alpaca-CoT: An Instruction-Tuning Platform with Unified Interface for Instruction Collection, Parameter-efficient Methods, and Large Language Models


This is the repository for the Alpaca-CoT project, which aims to build an instruction finetuning (IFT) platform with extensive instruction collection (especially the CoT datasets) and a unified interface for various large language models and parameter-efficient methods. We are constantly expanding our instruction-tuning data collection, and integrating more LLMs and more parameter-efficient methods. In addition, we created a new branch tabular_llm to build a Tabular LLM for solving Table Intelligence Tasks.

You are warmly welcome to provide us with any non-collected instruction-tuning datasets (or their sources). We will uniformly format them, train the Alpaca model (and other LLMs in the near future) with these datasets, open-source the model checkpoints, and conduct extensive empirical studies. We hope that our project can make a modest contribution to the open-sourcing of large language models and lower the barrier for NLP researchers to get started.

You can also choose to join our group chat (WeChat) and communicate with more people with the same interests. At present, the group has too many members to be joined directly through the group QR code, so please contact me first and I will add you to the group.

News

  • ⚠ If you want to use methods other than LoRA, please install the edited PEFT version included in this project: pip install -e ./peft.

  • 🚀12.8: LLM InternLM was merged.

  • 🚀8.16: 4bit quantization is available for lora, qlora and adalora.

  • 🚀8.16: Parameter-efficient methods QLoRA, Sequential adapter and Parallel adapter were merged.

  • 🚀7.24: LLM ChatGLM v2 was merged.

  • 🚀7.20: LLM Baichuan was merged.

  • 6.25: Added model evaluation code, including Belle-eval and MMCU.

More:

  • 5.20: Fixed bugs in model saving and added wandb support.
  • 5.15: More datasets, such as GPT4Tools, Auto CoT and pCLUE, were added.
  • 🚀5.5: A new branch tabular_llm is created to build a Tabular LLM. We collect instruction fine-tuning data for table-related tasks like table question answering and use them to fine-tune LLMs in this repo.
  • 🚀5.4: All parameter-efficient methods in PEFT (e.g., p-tuning) were merged, which can be set by hyper-parameter directly.
  • 🚀5.4: LLM MOSS was merged.
  • 4.21: Datasets GAOKAO, camel, FLAN-Muffin, COIG are collected and formatted.
  • 4.15: Datasets webGPT, dolly, baize, hh-rlhf, OIG(part) are collected and formatted.
  • 4.12: Now you can try Alpaca-CoT on Google Colab.
  • 4.11: Added a multi-turn conversation function by @paulcx.
  • 4.9: Datasets firefly, instruct, Code Alpaca are collected and formatted, which can be found here.
  • 4.7: Added functions Parameter merging, Local chatting, Batch predicting and Web service building by @weberr.
  • 4.4: Datasets GPTeacher, Guanaco, HC3, prosocial-dialog, belle-chat&belle-math, xP3 and natural-instructions are collected and formatted.
  • 4.3: The Chinese CoT dataset CoT_CN_data.json can be found here.

Overview

img

LLaMA [1] is a great work that demonstrates impressive zero-shot and few-shot abilities. It significantly reduces the cost of training, finetuning, and using competitive large language models: LLaMA-13B outperforms GPT-3 (175B) and LLaMA-65B is competitive with PaLM-540B. Recently, to boost the instruction-following ability of LLaMA, Stanford Alpaca [2] finetuned LLaMA-7B on 52K instruction-following examples generated by the Self-Instruct [3] technique. However, at present, the LLM research community still faces three challenges: 1. even LLaMA-7B still has high computing-resource requirements; 2. there are few open-source datasets for instruction finetuning; and 3. there is a lack of empirical study on the impact of various types of instructions on model abilities, such as the ability to respond to Chinese instructions and to perform CoT reasoning.

To this end, we propose this project, which leverages various improvements that were subsequently proposed, with the following advantages:

    1. This repo contains code, modified from here and here, that can finetune LLaMA cheaply and efficiently (without performance degradation compared to Stanford Alpaca) by using low-rank adaptation (LoRA) [4], PEFT and bitsandbytes. The 7B, 13B and 30B versions of LLaMA can be easily trained on a single 80G A100 (a minimal LoRA sketch follows this list).
    2. The models published in this repo significantly improve the CoT (reasoning) capability.
    3. The models published in this repo significantly improve the ability to follow Chinese instructions.
    4. This repo contains a continuously growing collection of instruction-finetuning datasets, which so far include English, Chinese and CoT instructions. In addition, a collection of checkpoints trained with various instruction datasets is also provided.
    5. This repo integrates multiple LLMs and unifies their interfaces, which can be switched easily through hyperparameters. Currently, it includes LLaMA, ChatGLM [5], Bloom [6] and MOSS, and more will be added in the future so that researchers can easily invoke and compare different LLMs.
    6. This repo integrates multiple parameter-efficient methods and unifies their interfaces, which can be switched easily through hyperparameters. Currently, it includes LoRA, P-tuning [5], AdaLoRA and prefix tuning, and more will be added in the future so that researchers can easily invoke and compare different parameter-efficient methods.
    7. This repo contains extensive empirical studies and qualitative analysis, which may provide valuable findings and promote the exploration of LLMs in the future.
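
As a rough illustration of point 1, the following minimal sketch shows how a causal LM is wrapped with LoRA via PEFT so that only the low-rank adapter weights are trained; the model id and hyper-parameter values here are examples, not the exact defaults of uniform_finetune.py:

# Minimal sketch: wrap a causal LM with LoRA via PEFT.
# Model id and hyper-parameters are illustrative only.
from transformers import LlamaForCausalLM
from peft import LoraConfig, get_peft_model

model = LlamaForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")

config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the LoRA updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()        # only the adapter weights are trainable

Only the small LoRA adapter ends up trainable, which is what makes single-GPU finetuning of the 7B-30B models feasible.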

To the best of our knowledge, this work is the first to study CoT reasoning based on LLaMA and Alpaca. Therefore, we abbreviate our work to Alpaca-CoT.

Data Collection

The relative sizes of the collected datasets are shown in the following graph:

img

Referring to this (@yaodongC), we labeled each collected dataset according to the following rules:

(Lang)Lingual-Tags:

  • EN: Instruction datasets in English
  • CN: Instruction datasets in Chinese
  • ML: [Multi-lingual] Instruction datasets in multiple languages

(Task)Task-Tags:

  • MT: [Multi-task] Datasets containing multiple tasks
  • TS: [Task-specific] Datasets tailored for specific tasks

(Gen)Generation-method:

  • HG: [Human Generated Dataset] Datasets created by humans
  • SI: [Self-Instruct] Datasets generated using self-instruct methods
  • MIX: [Mixed Dataset] Dataset contains both human and machine generated data
  • COL: [Collection of Dataset] Dataset made from a collection of other datasets

Statistics

Dataset | Nums | Lang | Task | Gen | Type | Src | Url
Chain of Thought | 74771 | EN/CN | MT | HG | instruct with cot reasoning | annotating CoT on existing data | download
GPT4all | 806199 | EN | MT | COL | code, stories and dialogs | distillation from GPT-3.5-turbo | download
GPTeacher | 29013 | EN | MT | SI | general, roleplay, toolformer | GPT-4 & toolformer | download
Guanaco | 534610 | ML | MT | SI | various linguistic tasks | text-davinci-003 | download
HC3 | 37175 | EN/CN | TS | MIX | dialogue evaluation | human or ChatGPT | download
alpaca | 52002 | EN | MT | SI | general instruct | text-davinci-003 | download
Natural Instructions | 5040134 | ML | MT | COL | diverse nlp tasks | human annotated datasets collection | download
belle_cn | 1079517 | CN | TS/MT | SI | general, mathematical reasoning, dialogue | text-davinci-003 | download
instinwild | 52191 | EN/CN | MT | SI | generation, open-qa, mind-storm | text-davinci-003 | download
prosocial dialog | 165681 | EN | TS | MIX | dialogue | GPT-3 rewrites questions + humans feedback manually | download
finance_en | 68912 | EN | TS | COL | financial related qa | GPT3.5 | download
xP3 | 78883588 | ML | MT | COL | a collection of prompts & datasets across 46 languages & 16 NLP tasks | human annotated datasets collection | download
firefly | 1649398 | CN | MT | COL | 23 nlp tasks | human annotated datasets collection | download
instruct | 888969 | EN | MT | COL | augmentation of GPT4All, Alpaca, open-source Meta datasets | augmentation performed using the advanced NLP tools provided by AllenAI | download
Code Alpaca | 20022 | EN | TS | SI | code generation, editing, optimization | text-davinci-003 | download
Alpaca_GPT4 | 52002 | EN/CN | MT | SI | general instruct | generated by GPT-4 using Alpaca | download
webGPT | 18994 | EN | TS | MIX | information retrieval (IR) QA | fine-tuned GPT-3; each instruction has two outputs, the better one is selected | download
dolly 2.0 | 15015 | EN | TS | HG | closed QA, summarization, etc., with Wikipedia as references | human annotated | download
baize | 653699 | EN | MT | COL | a collection of Alpaca, Quora, StackOverFlow and MedQuAD questions | human annotated datasets collection | download
hh-rlhf | 284517 | EN | TS | MIX | dialogue | dialog between human and RLHF models | download
OIG(part) | 49237 | EN | MT | COL | created from various tasks, such as question answering | using data augmentation, human annotated datasets collection | download
GAOKAO | 2785 | CN | MT | COL | multiple-choice, fill-in-the-blank and open-ended questions from examinations | human annotated | download
camel | 760620 | EN | MT | SI | role-playing conversations in AI Society, Code, Math, Physics, Chemistry, Biology | gpt-3.5-turbo | download
FLAN-Muffin | 1764800 | EN | MT | COL | 60 nlp tasks | human annotated datasets collection | download
COIG(FlagInstruct) | 298428 | CN | MT | COL | collected from Exam, Translated, Human Value Alignment Instructions and Counterfactual Correction Multi-round Chat | using automatic tools and manual verification | download
GPT4Tools | 71446 | EN | MT | SI | a collection of tool-related instructions | gpt-3.5-turbo | download
ShareChat | 1663241 | EN | MT | MIX | general instruct | crowdsourcing to collect conversations between people and ChatGPT (ShareGPT) | download
Auto CoT | 5816 | EN | MT | COL | arithmetic, commonsense, symbolic, and other logical reasoning tasks | human annotated datasets collection | download
MOSS | 1583595 | EN/CN | TS | SI | general instruct | text-davinci-003 | download
ultrachat | 28247446 | EN | | | questions about the world, writing and creation, assistance on existent materials | two separate gpt-3.5-turbo | download
Chinese-medical | 792099 | CN | TS | COL | questions about medical advice | crawl | download
CSL | 396206 | CN | MT | COL | paper text generation, keyword extraction, text summarization and text classification | crawl | download
pCLUE | 1200705 | CN | MT | COL | general instruct | | download
news_commentary | 252776 | CN | TS | COL | translation | | download
StackLLaMA | todo | EN | | | | |

Download

You can download all the formatted data here. Then you should put them in the data folder.

You can download all checkpoints trained on various types of instruction data from here. Then, after setting LoRA_WEIGHTS (in generate.py) to the local path, you can directly execute the model inference.
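
For example, a minimal inference-loading sketch (paths are illustrative; see generate.py for the exact logic) might look like this:

# Minimal sketch (assumed paths): load a base LLaMA plus a downloaded LoRA
# checkpoint for inference, mirroring what generate.py does once
# LoRA_WEIGHTS points at the local path.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

BASE_MODEL = "decapoda-research/llama-7b-hf"
LORA_WEIGHTS = "./saved_models/llama-7b-hf_alpaca"   # hypothetical local checkpoint dir

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
model = LlamaForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, LORA_WEIGHTS, torch_dtype=torch.float16)
model.eval()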

Data Formatting

All data in our collection is formatted into the same templates, where each sample is as follows:

[
{"instruction": instruction string,
"input": input string, # (may be empty)
"output": output string}
]
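
For instance, a minimal sketch for loading one of the formatted files (the file name is an example from the collection) and checking it against this template:

# Minimal sketch: load a formatted file from the data folder and verify
# that each sample follows the unified instruction/input/output template.
import json

with open("./data/alpaca_data_cleaned.json", encoding="utf-8") as f:
    samples = json.load(f)

for sample in samples[:3]:
    assert set(sample.keys()) == {"instruction", "input", "output"}
    print(sample["instruction"][:60], "->", sample["output"][:60])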

Note that for CoT datasets, we first use the templates provided by FLAN to convert the original datasets into various Chain-of-Thought forms, and then convert them to the above format. The formatting script can be found here.

Multi-interface Unified Platform

Setup

pip install -r requirements.txt

Note: make sure Python >= 3.9 when finetuning ChatGLM.

PEFT

  • If you want to use methods other than LoRA, please install the edited PEFT version included in this project:
pip install -e ./peft

Instruction Finetuning

In order for researchers to conduct systematic IFT research on LLMs, we have collected different types of instruction data, integrated multiple LLMs, and unified their interfaces, making it easy to customize the desired combination:

  • --model_type : Set the LLM you want to use. Currently, [llama, chatglm, bloom, moss] are supported. The latter two have strong Chinese capabilities, and more LLMs will be integrated in the future.
  • --peft_type: Set the PEFT method you want to use. Currently, [lora, adalora, prefix tuning, p tuning, prompt] are supported.
  • --data: Set the data type used for IFT to flexibly tailor the desired instruction-following abilities. For example, for strong reasoning ability, set "alpaca-cot"; for strong Chinese ability, set "belle1.5m"; for coding and story-generation ability, set "gpt4all"; and for finance-related response ability, set "finance".
  • --model_name_or_path: Set this to load different versions of the model weights for the target LLM --model_type. For example, to load the 13B version of LLaMA weights, you can set decapoda-research/llama-13b-hf.

Single GPU

  • for LLaMA
python3 uniform_finetune.py --model_type llama --model_name_or_path decapoda-research/llama-7b-hf \
    --data alpaca-belle-cot --lora_target_modules q_proj v_proj \
    --per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1

Note: for multiple datasets, you can use --data like --data ./data/alpaca.json ./data/finance.json <path2yourdata_1>

  • for ChatGLM
python3 uniform_finetune.py   --model_type chatglm --model_name_or_path THUDM/chatglm-6b \
    --data alpaca-belle-cot --lora_target_modules query_key_value \
    --lora_r 32 --lora_alpha 32 --lora_dropout 0.1 --per_gpu_train_batch_size 2 \
    --learning_rate 2e-5 --epochs 1

Note that load_in_8bit is not yet suitable for ChatGLM, so its batch_size must be smaller than for the other models.

  • for BLOOM
python3 uniform_finetune.py   --model_type bloom --model_name_or_path bigscience/bloomz-7b1-mt \
    --data alpaca-belle-cot --lora_target_modules query_key_value \
    --per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1
  • for MOSS
python3 uniform_finetune.py   --model_type moss --model_name_or_path fnlp/moss-moon-003-sft  \
    --data alpaca --lora_target_modules q_proj v_proj --per_gpu_train_batch_size 1 \
    --learning_rate 3e-4 --epochs 3
  • for InternLM
python3 uniform_finetune.py   --model_type internlm --model_name_or_path internlm/internlm-7b \
    --data alpaca --lora_target_modules q_proj v_proj --lora_r 32 --lora_alpha 32 \
    --lora_dropout 0.1 --per_gpu_train_batch_size 1 --learning_rate 2e-5 --epochs 1 \
    --compute_dtype="fp32"

Note that you can also pass a local path (where the LLM weights are saved) to --model_name_or_path, and the data type --data can be freely set according to your interests.

Multiple GPUs

torchrun --nnodes 1 --nproc_per_node $ngpu uniform_finetune.py $args --data $data 
  • for LLaMA
python3 -m torch.distributed.launch --nproc_per_node 4  \
    --nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy uniform_finetune.py \
    --model_type llama --model_name_or_path decapoda-research/llama-7b-hf \
    --data alpaca-belle-cot --lora_target_modules q_proj v_proj \
    --per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1
  • for ChatGLM
python3 -m torch.distributed.launch --nproc_per_node 4  \
    --nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy \
    uniform_finetune.py   --model_type chatglm --model_name_or_path THUDM/chatglm-6b \
    --data alpaca-belle-cot --lora_target_modules query_key_value \
    --lora_r 32 --lora_alpha 32 --lora_dropout 0.1 --per_gpu_train_batch_size 2 \
    --learning_rate 2e-5 --epochs 1

Note that load_in_8bit is not yet suitable for ChatGLM, so its batch_size must be smaller than for the other models.

  • for BLOOM
python3 -m torch.distributed.launch --nproc_per_node 4  \
    --nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy \
    uniform_finetune.py   --model_type bloom --model_name_or_path bigscience/bloomz-7b1-mt \
    --data alpaca-belle-cot --lora_target_modules query_key_value \
    --per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1
  • for InternLM
python3 -m torch.distributed.launch --nproc_per_node 4  \
    --nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy \
    uniform_finetune.py   --model_type internlm --model_name_or_path internlm/internlm-7b \
    --data alpaca --lora_target_modules q_proj v_proj --lora_r 32 --lora_alpha 32 \
    --lora_dropout 0.1 --per_gpu_train_batch_size 1 --learning_rate 2e-5 --epochs 1 \
    --compute_dtype="fp32"

Inference

python3 generate.py  --data alpaca-belle-cot --model_type llama

python3 generate.py  --data alpaca-belle-cot --model_type chatglm

python3 generate.py  --data alpaca-belle-cot --model_type bloom

More details of instruction finetuning and inference can be found here, which our code was modified from. Note that the folders saved-xxx7b are the save paths for LoRA weights, and LLaMA weights are automatically downloaded from Hugging Face.

Inference Hyper-parameter Explanation

top_p=0.9,
        # Moderately increase the probability threshold of nucleus sampling to enlarge the candidate token set and increase generation diversity.

temperature=1.0,
        # A low temperature would severely polarize the probability distribution of generated words, which degenerates the generation strategy into greedy decoding.

do_sample=True,
        # do_sample is False by default; setting it to True switches generation from greedy decoding to a sampling-based decoding strategy.

no_repeat_ngram_size=6,
        # Set the probability of a repeating n-gram to 0, ensuring that no 6-gram appears twice. This setting is an empirical preliminary exploration.

repetition_penalty=1.8,
        # For words that have already appeared, reduce the probability of their reoccurrence in subsequent predictions via the repetition_penalty parameter. This setting is an empirical preliminary exploration.
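
Put together, these settings correspond to a Hugging Face generate() call roughly like the following sketch (model and tokenizer are assumed to be loaded as in the earlier inference sketch; the prompt is illustrative and omits the instruction template):

# Minimal sketch: plug the hyper-parameters above into generate().
inputs = tokenizer("Tell me about alpacas.", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,            # sample instead of greedy decoding
    top_p=0.9,                 # nucleus sampling threshold
    temperature=1.0,           # keep the distribution relatively flat
    no_repeat_ngram_size=6,    # forbid any 6-gram from repeating
    repetition_penalty=1.8,    # down-weight tokens that already appeared
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))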

Parameter merging

python3 merge.py --model_type llama --size 7b --lora_dir xxx --merged_dir yyy

Local chatting

python3 server.py --model_type chatglm --size 6b --lora_dir xxx

Batch predicting

python3 predict.py --model_type chatglm --size 6b --data for_dict_data --lora_dir xxx --result_dir yyy

Web service building

python3 web.py --model_type chatglm --size 6b --lora_dir xxx

Empirical Study of Instruction-tuning Open LLMs in Chinese (As of June 25th)

Note: The following experimental results are all obtained from "An Empirical Study of Instruction-tuning Large Language Models in Chinese".

1. Benchmarks

This paper selects two evaluation benchmarks, Belle-eval and MMCU, to comprehensively evaluate LLM competencies in Chinese.

Belle-eval is constructed by self-instruct with ChatGPT. It has 1,000 diverse instructions involving 10 categories that cover common NLP tasks (e.g., QA) and challenging tasks (e.g., code and math). We use ChatGPT to rate the model responses based on the golden answers. This benchmark can be considered an assessment of AGI (instruction-following) capability.

MMCU is a collection of Chinese multiple choice questions in four professional disciplines of medicine, law, psychology and education (e.g., Gaokao examination). It allows LLMs to take exams in human society in a multiple-choice test manner, making it suitable for evaluating the breadth and depth of knowledge of LLMs across multiple disciplines.

Data statistics of Belle-eval and MMCU are shown in the table above.

2. Main Factors

We conduct experiments to study the three main factors in instruction-tuning LLMs: LLM bases, Parameter-efficient Methods, Chinese Instruction Datasets.

2.1 LLM Bases

For open LLMs, we test existing LLMs and LLMs fine-tuned with LoRA on Alpaca-GPT4 on Belle-eval and MMCU, respectively.

Table 2 shows the scores of open LLMs on Belle-eval. Table 3 shows the accuracy of LLMs on MMCU. All the open LLMs are fine-tuned with the same parameter-efficient method (LoRA) and the same instruction dataset (Alpaca-GPT4).

Experimental Results:

  1. Evaluation of Existing LLMs

    Performance on Belle-eval

    (1) For base LLMs, Bloom performs the best.

    (2) For sft LLMs, ChatGLM outperforms others by large margins, thanks to the fact that it is trained with the most Chinese tokens and HFRL.

    (3) The Open QA, Math, CloseQA and Extract categories are still very challenging for existing open LLMs.

    (4) Vicuna and moss-sft have clear improvements compared to their bases, LLaMA and moss-base, respectively.

    (5) In contrast, the performance of sft models, Bloomz and Bloomz-mt, is reduced compared to the base model Bloom, because they tend to generate a shorter response.

    Performance on MMCU

    (1) All base LLMs perform poorly, because before fine-tuning they can hardly generate content in the specified format, e.g., outputting option numbers.

    (2) All sft LLMs outperform their corresponding base LLMs, respectively. In particular, Bloomz performs the best (even beating ChatGLM) because it can generate the option number directly as required without generating other irrelevant content, which is also due to the data characteristics of its supervised fine-tuning dataset xP3.

    (3) Among the four disciplines, law is the most challenging for LLMs.

The performance results of LLMs after instruction-tuning on Alpaca-GPT4-zh are shown in Figure 1.

  2. Instruction-tuning Different LLMs

    (1) On Belle-eval, the performance improvement of sft LLMs brought by instruction-tuning is not as significant as that of base LLMs, except for sft Bloomz and Bloomz-mt.

    (2) Vicuna and ChatGLM encounter performance drops after instruction-tuning, because Vicuna is trained on real human-ChatGPT conversations, which are of better quality than Alpaca-GPT4, and ChatGLM adopts HFRL, which may no longer be suitable for further instruction-tuning.

    (3) On MMCU, most LLMs achieve performance boosts after instruction-tuning, with the exception of Bloomz and Bloomz-mt, whose performance unexpectedly drops significantly.

    (4) After instruction-tuning, Bloom has significant improvements and performs well on both benchmarks. Although ChatGLM beats Bloom consistently, it suffers performance drop during instruction-tuning. Therefore, among all open LLMs, Bloom is most suitable as a foundation model in the subsequent experiments for Chinese instruction-tuning exploration.

2.2 Parameter-efficient Methods

For parameter-efficient methods other than LoRA, the paper collects a range of parameter-efficient methods to instruction-tune Bloom on the Alpaca-GPT4 dataset.

Experimental Results:

  1. Comparison of Parameter-efficient Methods

    (1) SadapterH performs the best among all parameter-efficient methods, which can be used as an alternative to LoRA.

    (2) P-tuning and prompt-tuning underperform others by large margins, indicating that adding trainable layers only at the embedding layer is not enough to support LLMs on generation tasks.

    (3) Although AdaLoRA is an improvement of LoRA, its performance has a clear drop, possibly because the LoRA's trainable parameters for LLMs are not suitable for further reduction.

    (4) Comparing the upper and lower parts, it can be seen that increasing the number of trainable parameters for sequential adapters (i.e., SadapterP and SadapterH) does not bring gains, while the opposite is observed for parallel adapters (i.e., P-adapter).

  2. Training Loss

    (1) Prompt-tuning and P-tuning converge the slowest and have the highest losses after convergence. This shows that embedding-only adapters are not suitable for instruction-tuning LLMs.

    (2) The initial loss of AdaLoRA is very high because it requires simultaneous learning of parameter budget allocation, which makes the model unable to fit the training data well.

    (3) The other methods can quickly converge on training data and fit it well.

2.3 Chinese instruction Datasets

For the impact of various types of Chinese instruction datasets, authors gather popular open Chinese instructions (as shown in Table 5) to fine-tune Bloom with LoRA.

Table 6 and Table 7 show Bloom's fine-tuning on different instruction datasets.

Experimental Results:

  1. Performance on Belle-eval

    (1) The instruction data constructed by ChatGPT (e.g., using self-instruct methods or collecting real human-ChatGPT conversations) consistently enhances the instruction-following ability, with score increases of 3.1 to 11 points.

    (2) Among these datasets, Belle has the best performance due to the largest amount of instruction data. However, the performance of models trained on moss-sft-data, containing more data built in a similar way, is unsatisfactory.

    (3) The performance brought by the Alpaca-GPT4 instructions is the second best, with only 49K samples being comparable to the 1.54M of Belle.

    (4) Instinwild brings the least performance gains among them, because the seed instructions it crawls from Twitter ("in the wild") are not as comprehensive as those (like Alpaca's) carefully designed by humans.

    (5) These ChatGPT-based data mainly have a significant improvement effect on open generation tasks such as Brain Storm and Generation, while there is a significant decrease in tasks that require high reading comprehension skills, such as Close QA and Extract.

    (6) These instruction datasets damage the model's instruction-following ability, because the form and intent of each NLP or examination dataset are uniform, which makes them easy to overfit.

    (7) Among them, COIG-trans performs the best because it involves over 2000 different tasks with a wide variety of task instructions. In contrast, xP3 and COIG-ccmc have the worst negative impact on model performance. Both of them only cover a few types of tasks (translation and QA for the former, counterfactual correction conversations for the latter), which hardly cover the popular instructions and tasks for humans.

  2. Performance on MMCU

    (1) Instruction-tuning on each dataset can always result in performance improvement.

    (2) Among the ChatGPT-based data shown in the upper part, ShareGPT-zh underperforms others by large margins. This may be due to the fact that real users rarely ask multiple choice questions about academic topics.

    (3) Among the dataset-collection data shown in the lower part, HC3 and COIG-ccmc result in the lowest accuracy, because HC3 has only 13K unique questions and the task format of COIG-ccmc differs significantly from MMCU.

    (4) COIG-exam brings the greatest accuracy improvement, benefiting from the similar task format as MMCU.

3. Other Factors

Four Other Factors: CoT, Expansion of Chinese Vocabulary, Language of Prompts and Human-value Alignment

3.1 CoT

For CoT, authors compare the performance before and after adding CoT data during instruction-tuning.

Experiment Settings:

We collect 9 CoT datasets and their prompts from FLAN, and then translate them into Chinese using Google Translate. We then compare the performance before and after adding the CoT data during instruction-tuning.

We denote simply adding the CoT data as "Alpaca-GPT4+CoT". In addition, we append the sentence "先思考,再决定" ("think step by step" in Chinese) to the end of each instruction, to induce the model to respond to instructions based on CoT reasoning, and denote this variant as "Alpaca-GPT4+CoT*".
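
As a rough sketch of the "Alpaca-GPT4+CoT*" preprocessing (file names are hypothetical; the actual scripts in the repo may differ), the trigger sentence is simply appended to every instruction in the unified format:

# Minimal sketch (hypothetical file names): append the CoT trigger sentence
# to every instruction to build the "Alpaca-GPT4+CoT*" variant.
import json

with open("./data/alpaca_gpt4_plus_cot.json", encoding="utf-8") as f:
    samples = json.load(f)

for sample in samples:
    # Append the trigger sentence to induce CoT-style responses.
    sample["instruction"] = sample["instruction"].rstrip() + " 先思考,再决定"

with open("./data/alpaca_gpt4_plus_cot_star.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)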

Experimental Results:

  1. "Alpaca-GPT4+CoT" outperforms "Alpaca-GPT4" in Code and Math tasks that require strong reasoning ability. Besides, there is also a significant improvement in the MMCU Education task.

  2. As shown in the line of "Alpaca-GPT4+CoT*", this simple sentence can further improve performance on the reasoning tasks Code and Education, while the Math performance is slightly inferior to "Alpaca-GPT4+CoT". This may require further exploration of more robust prompts.

3.2 Expansion of Chinese Vocabulary

For expansion of Chinese vocabulary, authors test the influence of the number of Chinese tokens in the tokenizer’s vocabulary on LLMs’ ability to express Chinese. For example, if a Chinese character is in the vocabulary, it can be represented by a single token, otherwise it may require multiple tokens to represent it.
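
A minimal sketch of this effect (assuming the two Hugging Face tokenizers below can be downloaded) compares the number of tokens each tokenizer needs for the same Chinese sentence:

# Minimal sketch: compare how many tokens LLaMA and Bloom need for the
# same Chinese sentence; model ids are assumed to be accessible.
from transformers import AutoTokenizer, LlamaTokenizer

llama_tok = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
bloom_tok = AutoTokenizer.from_pretrained("bigscience/bloomz-7b1-mt")

text = "今天天气很好"
print("LLaMA tokens:", len(llama_tok.tokenize(text)))   # typically several tokens per character
print("Bloom tokens:", len(bloom_tok.tokenize(text)))   # usually close to one token per character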

Experiment Settings: Authors mainly conduct experiments on LLaMA, whose SentencePiece vocabulary (32K) covers far fewer Chinese characters than Bloom's (250K).

Experimental Results:

  1. Pre-training on more Chinese corpus with expansion of Chinese vocabulary is consistently helpful for instruction-following ability.

  2. And counterintuitively, "llama-voc-pre-l" (100B) is inferior to "llama-voc-pre" (20B) on MMCU, which shows that pre-training on more data may not necessarily lead to higher performance for academic exams.

3.3 Language of Prompts

For the language of prompts, authors test the suitability of instruction fine-tuning for using Chinese prompts.

Figure 4 shows the results of using Chinese and English prompts based on LLaMA and Bloom. When instruction-tuning LLaMA, using Chinese prompts can improve the performance on both benchmarks compared to English prompts, while the opposite phenomenon can be observed on Bloom.

Experimental Results:

  1. For models with weaker Chinese abilities (e.g., LLaMA), using Chinese prompts can effectively help them respond in Chinese.

  2. For models with good Chinese abilities (e.g., Bloom), using prompts in English (the language they are better at) can better guide the model to understand the process of fine-tuning with instructions.

3.4 Human-value Alignment

To avoid LLMs generating toxic content, aligning them with human values is a crucial issue. We add human-value alignment data built by COIG into instruction-tuning to explore its impact.

Figure 5 compares the results of instruction-tuning with and without human-value alignment.

Experimental Results: The human-value alignment results in a slight performance drop. How to balance the harmlessness and performance of LLMs is a research direction worth exploring in the future.

Quantitative Analysis

Note: The following figure shows the statistics of the datasets collected as of March 26, which is only displayed to motivate the data collection. More datasets have since been collected, such as finance-related instruction datasets.

data collection statistics

The current collection of instruction-finetuning datasets consists mainly of three parts:

  • alpaca_data_cleaned.json: about 52K English instruction-following training samples.
  • CoT_data.json: 9 CoT datasets involving about 75k samples. (published by FLAN[7])
  • belle_data_cn.json: about 0.5M Chinese instruction-following training samples. (published by BELLE [8])

Ablation of CoT and Chinese Instructions

ablation-cot "w/o CoT" and "w/o CN" denote models that exclude CoT data and Chinese instructions from their instruction finetuning data, respectively.

The above table shows two examples (involving numerical calculations) that require a certain amount of reasoning ability to respond correctly. As shown in the middle column, Ours w/o CoT fails to generate the correct response, which shows that once the finetuning data does not contain CoT data, the model's reasoning ability decreases significantly. This further demonstrates that CoT data is essential for LLMs.

ablation-cot

The above table shows two examples that require the ability to respond to Chinese instructions. As shown in the right column, either the generated content of Ours w/o CN is unreasonable, or the Chinese instructions are answered in English by Ours w/o CN. This shows that removing Chinese data during finetuning will cause the model to be unable to handle Chinese instructions, and further demonstrates the need to collect Chinese instruction finetuning data.

ablation-cot

The above table shows a relatively difficult example, which requires both a certain accumulation of knowledge of Chinese history and a logical and complete ability to state historical events. As shown in this table, Ours w/o CN can only generate a short and erroneous response, because, due to the lack of Chinese finetuning data, the corresponding knowledge of Chinese history is naturally lacking. Although Ours w/o CoT lists some relevant Chinese historical events, its logic of expression is self-contradictory, which is caused by the lack of CoT data.

In summary, the models finetuned from our complete dataset (English, Chinese, and CoT instruction data) can significantly improve model reasoning and Chinese instruction following abilities.

The Effect of CoT Data

CoT-comparison

Samples in each odd-numbered row do not apply the CoT prompt (e.g., "step-by-step reasoning"). Both Ours (w/CoT) and Alpaca are based on LLaMA-7B, and the only difference between the two is that the instruction-finetuning data of Ours (w/CoT) contains extra CoT data compared to that of Alpaca.

From the above table, we find that:

  • Ours(w/CoT) always generates the correct rationale before the answer, while Alpaca fails to generate any reasonable rationale, as shown in the first 4 examples (commonsense questions). This shows that using CoT data for finetuning can significantly improve reasoning ability.
  • For Ours(w/CoT), the CoT prompt (e.g., concatenate 'step-by-step' with the input question) has little effect on easy examples (e.g., commonsense questions) and has an important effect on challenging questions (e.g., questions requiring reasoning, like the last four examples).
  • For Alpaca, the CoT prompt always has little effect or even a negative impact. For the last two examples, after adding the CoT prompt, Alpaca changes the correct generated answer to a wrong one. This may be due to the inconsistency between the input forms of finetuning and inference.

The Effect of Chinese Instruction Data

Quantitative comparison of responses to Chinese instructions. CN_compare_CN

Our model is finetuned from a 7B LLaMA on 52K English instructions and 0.5M Chinese instructions. Stanford Alpaca (our reimplementation) is finetuned from a 7B LLaMA on 52K English instructions. BELLE is finetuned from a 7B BLOOM on 2M Chinese instructions.

From the above table, several observations can be found:

  • Compared to Alpaca, ours (w/ CN) has a stronger ability to understand Chinese instructions. For the first example, Alpaca fails to distinguish between the instruction part and the input part, while ours does.
  • Chinese instruction finetuning data can significantly enhance the ability to interact in Chinese. For the second example, ours (w/ CN) not only provides the correct code, but also provides corresponding Chinese comments, while Alpaca does not. In addition, as shown in examples 3-5, Alpaca can only respond to Chinese instructions with English responses.
  • Compared to BELLE, ours (w/ CN) still needs to improve on instructions requiring an open response (as shown in the last two examples). BELLE's outstanding performance on such instructions is due to: 1. its BLOOM backbone model encountered much more multilingual data during pre-training; 2. it uses more Chinese instruction finetuning data than ours, i.e., 2M vs. 0.5M.

Quantitative comparison of responses to English instructions. The purpose of this subsection is to explore whether finetuning on Chinese instructions has a negative impact on Alpaca. CN_compare_EN

From the above table, we find that:

  • Finetuning with Chinese instruction data does not weaken the original English instruction-following ability; on the contrary, it also brings a certain enhancement in generating better responses to English instructions. The responses of ours (w/ CN) contain more detail than those of Alpaca; e.g., for the third example, ours (w/ CN) lists three more provinces than Alpaca.

Citation

Please cite the repo if you use the data collection, code, and experimental findings in this repo.

@misc{si2023empirical,
      title={An Empirical Study of Instruction-tuning Large Language Models in Chinese}, 
      author={Qingyi Si and Tong Wang and Zheng Lin and Xu Zhang and Yanan Cao and Weiping Wang},
      year={2023},
      eprint={2310.07328},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

For the data and models, please also cite the original data sources, parameter-efficient methods and LLMs.

We would like to express our special gratitude to APUS AilMe Lab for sponsoring the 8 A100 GPUs for the experiments.


All Thanks To Our Contributors

alpaca-cot's People

Contributors

0armaan025, abcdea, acgnnsj, dkqkxx, eltociear, gaoxiaojun, ggg-c, hicleo, iie-ycx, jackieli-tes, kris248, mayureshd-18, mohitd404, mon0l1th, phoebussi, prajjwalyd, re-burn, rs-labhub, shivam250702, shiweijiezero, songt96, spursgozmy, starfulllll, stevengrove, suravshresth, utensil, vatsalya-vyas, weberrr, xuzf-git, zsc


alpaca-cot's Issues

Issues regarding FastChat dataset

The repo marks FastChat as subsets from ShareGPT (shown in the table). However, I checked the FastChat repo and found that their released dataset is processed Alpaca data meant to demonstrate how to train Vicuna, rather than the ShareGPT data they actually used. You may want to fix this by adding a notice to the description of FastChat.

Feedback

Hi, in the GPT-4 dataset, the instruction and output fields are all identical. Is there something wrong?

Training time cost

I just want to ask how much time it takes to finetune 1 epoch on 1M instruction samples with one A100 GPU. I'm doing this and it seems to need 120 hours, which is so long!

Chinese vocabulary

Has the Chinese vocabulary been handled separately? The original LLaMA can encode and decode Chinese, but the vast majority of characters are encoded at the byte level.

AutoTokenizer raises an error after fine-tuning

The fine-tuning step ran successfully, with the belle1.5m instruction set replaced by my own instruction set.
python3 uniform_finetune.py --model_type llama --model_name_or_path decapoda-research/llama-7b-hf
--data belle1.5m --lora_target_modules q_proj v_proj
--per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1
A llama-7b-hf_belle1.5m folder was generated under saved_models, containing two files: config.json and pytorch_model.bin.
Then I load the model:
from transformers import AutoTokenizer, AutoModelForCausalLM
import sys
model_path = "./saved_models/llama-7b-hf_belle1.5m"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
When it reaches the tokenizer step, an exception is thrown:

OSError Traceback (most recent call last)
Cell In[5], line 1
----> 1 tokenizer = AutoTokenizer.from_pretrained(model_path)

File ~/.conda/envs/python39/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:715, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
713 else:
714 if tokenizer_class_py is not None:
--> 715 return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
716 else:
717 raise ValueError(
718 "This tokenizer cannot be instantiated. Please make sure you have sentencepiece installed "
719 "in order to use this tokenizer."
720 )

File ~/.conda/envs/python39/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:1795, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
1789 logger.info(
1790 f"Can't load following files from cache: {unresolved_files} and cannot check if these "
1791 "files are necessary for the tokenizer to operate."
1792 )
1794 if all(full_file_name is None for full_file_name in resolved_vocab_files.values()):
-> 1795 raise EnvironmentError(
1796 f"Can't load tokenizer for '{pretrained_model_name_or_path}'. If you were trying to load it from "
1797 "'https://huggingface.co/models', make sure you don't have a local directory with the same name. "
1798 f"Otherwise, make sure '{pretrained_model_name_or_path}' is the correct path to a directory "
1799 f"containing all relevant files for a {cls.name} tokenizer."
1800 )
1802 for file_id, file_path in vocab_files.items():
1803 if file_id not in resolved_vocab_files:

OSError: Can't load tokenizer for '/root/llm/Alpaca-CoT-main/saved_models/llama-7b-hf_belle1.5m'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/root/llm/Alpaca-CoT-main/saved_models/llama-7b-hf_belle1.5m' is the correct path to a directory containing all relevant files for a LlamaTokenizer tokenizer.

How should I load it afterwards? Or is something missing when the model is saved?

How to finetune without LoRA?

Thanks for your work.
I want to know how to disable the LoRA config, i.e., train all parameters.
I have viewed the code; can I train all parameters by commenting out the code below in uniform_finetune.py?

    # config = LoraConfig(
    #     r=args.lora_r,
    #     lora_alpha=args.lora_alpha,
    #     target_modules=args.lora_target_modules,
    #     lora_dropout=args.lora_dropout,
    #     bias="none",
    #     task_type="CAUSAL_LM",
    # )
    # model = get_peft_model(model, config)

    # # the size of trainable parameters for lora modules
    # model.print_trainable_parameters() 

Error on multiple GPUs

I can run on a single GPU, but multi-GPU training reports the following error. Has anyone encountered it?
"
uniform_finetune.py: error: unrecognized arguments: --local-rank=3
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 141382) of binary: /usr/local/conda/bin/python3
"


expected scalar type Half but found Float

─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /export/home/gth/alpaca_lora/uniform_finetune.py:294 in │
│ │
│ 291 │ args = parser.parse_args() │
│ 292 │ print(args) │
│ 293 │ │
│ ❱ 294 │ train(args) │
│ 295 │
│ │
│ /export/home/gth/alpaca_lora/uniform_finetune.py:263 in train │
│ │
│ 260 │ if torch.version >= "2" and sys.platform != "win32": │
│ 261 │ │ model = torch.compile(model) │
│ 262 │ │
│ ❱ 263 │ trainer.train() │
│ 264 │ │
│ 265 │ model.save_pretrained(output_dir) │
│ 266 │
│ │
│ /home/admin/anaconda3/lib/python3.9/site-packages/transformers/trainer.py:1644 in train │
│ │
│ 1641 │ │ inner_training_loop = find_executable_batch_size( │
│ 1642 │ │ │ self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size │
│ 1643 │ │ ) │
│ ❱ 1644 │ │ return inner_training_loop( │
│ 1645 │ │ │ args=args, │
│ 1646 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1647 │ │ │ trial=trial, │
│ │
│ /home/admin/anaconda3/lib/python3.9/site-packages/transformers/trainer.py:1909 in │
│ _inner_training_loop │
│ │
│ 1906 │ │ │ │ ): │
│ 1907 │ │ │ │ │ # Avoid unnecessary DDP synchronization since there will be no backw │
│ 1908 │ │ │ │ │ with model.no_sync(): │
│ ❱ 1909 │ │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1910 │ │ │ │ else: │
│ 1911 │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1912 │
│ │
│ /home/admin/anaconda3/lib/python3.9/site-packages/transformers/trainer.py:2667 in training_step │
│ │
│ 2664 │ │ │ loss = loss / self.args.gradient_accumulation_steps │
│ 2665 │ │ │
│ 2666 │ │ if self.do_grad_scaling: │
│ ❱ 2667 │ │ │ self.scaler.scale(loss).backward() │
│ 2668 │ │ elif self.use_apex: │
│ 2669 │ │ │ with amp.scale_loss(loss, self.optimizer) as scaled_loss: │
│ 2670 │ │ │ │ scaled_loss.backward() │
│ │
│ /home/admin/anaconda3/lib/python3.9/site-packages/torch/_tensor.py:488 in backward │
│ │
│ 485 │ │ │ │ create_graph=create_graph, │
│ 486 │ │ │ │ inputs=inputs, │
│ 487 │ │ │ ) │
│ ❱ 488 │ │ torch.autograd.backward( │
│ 489 │ │ │ self, gradient, retain_graph, create_graph, inputs=inputs │
│ 490 │ │ ) │
│ 491 │
│ │
│ /home/admin/anaconda3/lib/python3.9/site-packages/torch/autograd/init.py:197 in backward │
│ │
│ 194 │ # The reason we repeat same the comment below is that │
│ 195 │ # some Python versions print out the first line of a multi-line function │
│ 196 │ # calls in the traceback and some print out the last line │
│ ❱ 197 │ Variable.execution_engine.run_backward( # Calls into the C++ engine to run the bac │
│ 198 │ │ tensors, grad_tensors
, retain_graph, create_graph, inputs, │
│ 199 │ │ allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to ru │
│ 200 │
│ │
│ /home/admin/anaconda3/lib/python3.9/site-packages/torch/autograd/function.py:267 in apply │
│ │
│ 264 │ │ │ │ │ │ │ "Function is not allowed. You should only implement one " │
│ 265 │ │ │ │ │ │ │ "of them.") │
│ 266 │ │ user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn │
│ ❱ 267 │ │ return user_fn(self, *args) │
│ 268 │ │
│ 269 │ def apply_jvp(self, *args): │
│ 270 │ │ # _forward_cls is defined by derived class │
│ │
│ /home/admin/anaconda3/lib/python3.9/site-packages/torch/utils/checkpoint.py:157 in backward │
│ │
│ 154 │ │ │ raise RuntimeError( │
│ 155 │ │ │ │ "none of output has requires_grad=True," │
│ 156 │ │ │ │ " this checkpoint() is not necessary") │
│ ❱ 157 │ │ torch.autograd.backward(outputs_with_grad, args_with_grad) │
│ 158 │ │ grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else None │
│ 159 │ │ │ │ │ for inp in detached_inputs) │
│ 160 │
│ │
│ /home/admin/anaconda3/lib/python3.9/site-packages/torch/autograd/init.py:197 in backward │
│ │
│ 194 │ # The reason we repeat same the comment below is that │
│ 195 │ # some Python versions print out the first line of a multi-line function │
│ 196 │ # calls in the traceback and some print out the last line │
│ ❱ 197 │ Variable.execution_engine.run_backward( # Calls into the C++ engine to run the bac │
│ 198 │ │ tensors, grad_tensors
, retain_graph, create_graph, inputs, │
│ 199 │ │ allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to ru │
│ 200 │
│ │
│ /home/admin/anaconda3/lib/python3.9/site-packages/torch/autograd/function.py:267 in apply │
│ │
│ 264 │ │ │ │ │ │ │ "Function is not allowed. You should only implement one " │
│ 265 │ │ │ │ │ │ │ "of them.") │
│ 266 │ │ user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn │
│ ❱ 267 │ │ return user_fn(self, *args) │
│ 268 │ │
│ 269 │ def apply_jvp(self, *args): │
│ 270 │ │ # _forward_cls is defined by derived class │
│ │
│ /home/admin/anaconda3/lib/python3.9/site-packages/bitsandbytes/autograd/functions.py:456 in │
│ backward │
│ │
│ 453 │ │ │ │
│ 454 │ │ │ elif state.CB is not None: │
│ 455 │ │ │ │ CB = state.CB.to(ctx.dtype_A, copy=True).mul
(state.SCB.unsqueeze(1).mul │
│ ❱ 456 │ │ │ │ grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype │
│ 457 │ │ │ elif state.CxB is not None: │
│ 458 │ │ │ │ │
│ 459 │ │ │ │ if state.tile_indices is None: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: expected scalar type Half but found Float

How to show the loss during training?

Thanks for your work! It really benefits me.
Since I am new to PyTorch, I wonder how to add code for showing the loss change during training?
Would you mind giving me an example?
Looking forward to your reply.
Thanks.

Problems when training with multiple GPUs

Traceback (most recent call last):
  File "train.py", line 206, in <module>
    trainer.train()
  File "/home/anaconda3/envs/alpaca/lib/python3.8/site-packages/transformers/trainer.py", line 1644, in train
    return inner_training_loop(
  File "/home/anaconda3/envs/alpaca/lib/python3.8/site-packages/transformers/trainer.py", line 1911, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/anaconda3/envs/alpaca/lib/python3.8/site-packages/transformers/trainer.py", line 2657, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/anaconda3/envs/alpaca/lib/python3.8/site-packages/transformers/trainer.py", line 2689, in compute_loss
    outputs = model(**inputs)
  File "/home/anaconda3/envs/alpaca/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/anaconda3/envs/alpaca/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 157, in forward
    raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1

Training a 7B model with 4 V100s uses a lot of GPU memory

Hi, I'm training the 7B model on 4 V100s with the default batch size of 4. After the model is loaded, each GPU uses 13 GB of memory. Is this normal? Why is a single 24 GB card enough? Is there any way to further reduce the memory usage of multi-GPU training?

Has anyone compared the training speed of PyTorch 2.0 and PyTorch 1.x?

I ran multi-GPU finetuning with the Bloom model, but from what I observe, the training speed with PyTorch 1.13.1 and with PyTorch 2.0.0 is almost the same. I'd like to ask whether PyTorch 2.0 provides any speedup when training models with LoRA. Thanks.

Error when running chatglm after fine-tuning

Traceback (most recent call last):
File "/root/llm/Alpaca-CoT-main/app.py", line 15, in
from model_chatglm import ChatGLMForConditionalGeneration, ChatGLMTokenizer
ModuleNotFoundError: No module named 'model_chatglm'
Does this require installing some package via pip?

Argument problem when calling app.py after fine-tuning

get_model_class(args.model_type, args.model_name_or_path, args.lora_name_or_path) takes three arguments.
What should the third argument be? Suppose --model_type llama --model_name_or_path ./saved_models/llama-7b-hf_belle1.5m,
what should lora_name_or_path be?

Poor training results: every response is followed by a pile of repeated content

Hello, thanks for your open-source work. However, the model I trained never works well: every response is followed by a pile of repeated content, as in the examples below. Your released weights work fine, though, so could you give me some training advice?
`
input: hello

response: Hello! 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋 👋

input: 你吃饭了吗 ("Have you eaten?")
response: Yes, I have eaten. 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊
`

Some details of the training setup: I chose the llama-7b model and mainly changed train_batch_size to 32 (training felt slow so I increased it) and gradient accumulation to 4; everything else uses the default parameters in the code.

python3 uniform_finetune.py --model_type llama --model_name_or_path weight/llama-7b-hf \ --data alpaca --lora_target_modules q_proj v_proj \ --per_gpu_train_batch_size 32 --learning_rate 3e-4 --epochs 3 --output_dir test_output \ --gradient_accumulation_steps 4 \

The test code is app.py, with nothing in it modified.

There was an error when I ran finetune.py.

traceback (most recent call last):
File "finetune.py", line 231, in
trainer.train()
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/transformers/trainer.py", line 1648, in train
ignore_keys_for_eval=ignore_keys_for_eval,
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/transformers/trainer.py", line 1911, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/transformers/trainer.py", line 2657, in training_step
loss = self.compute_loss(model, inputs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/transformers/trainer.py", line 2689, in compute_loss
outputs = model(**inputs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/_utils.py", line 543, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 3 on device 3.
Original Traceback (most recent call last):
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/peft/peft_model.py", line 538, in forward
**kwargs,
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/transformers/models/llama/modeling_llama.py", line 714, in forward
return_dict=return_dict,
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/transformers/models/llama/modeling_llama.py", line 590, in forward
None,
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/transformers/models/llama/modeling_llama.py", line 581, in custom_forward
return module(*inputs, output_attentions, None)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/transformers/models/llama/modeling_llama.py", line 324, in forward
hidden_states = self.mlp(hidden_states)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/transformers/models/llama/modeling_llama.py", line 155, in forward
return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/bitsandbytes/autograd/_functions.py", line 397, in forward
output += torch.matmul(subA, state.subB)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (2048x3 and 4x4096)

How to further fine-tune on alpaca

The repo only provides code for fine-tuning LLaMA and the other base models. If I want to continue fine-tuning from an already well fine-tuned Alpaca checkpoint, how can I achieve that?
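One possible route, sketched below under assumptions rather than as the repo's official workflow, is to load the base LLaMA weights, attach the previously trained LoRA adapter in trainable mode, and then continue training on the new instruction data. The paths are placeholders, and `is_trainable` needs a reasonably recent peft version.

```python
# Minimal sketch (not the repo's official flow): resume LoRA fine-tuning from an
# existing adapter. Paths and model names below are placeholders.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

base_model = "decapoda-research/llama-7b-hf"     # assumed base LLaMA weights
adapter_dir = "./saved_models/llama_alpaca"      # hypothetical earlier LoRA output

tokenizer = LlamaTokenizer.from_pretrained(base_model)
model = LlamaForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)

# is_trainable=True keeps the LoRA parameters trainable so training can continue.
model = PeftModel.from_pretrained(model, adapter_dir, is_trainable=True)
model.print_trainable_parameters()
# `model` can then be passed to the usual Trainer / training loop on the new data.
```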

Have you ever tried fine-tuning ChatGLM?

It's an open-source project from Tsinghua University, offering performance comparable to Stanford's Alpaca while emphasizing stronger ability in Chinese dialogue.

Running generate.py: downloading the LoRA model config fails

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 259, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/saved-alpaca-belle-cot7b/resolve/main/adapter_config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/peft/utils/config.py", line 99, in from_pretrained
    config_file = hf_hub_download(pretrained_model_name_or_path, CONFIG_NAME)
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1134, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1475, in get_hf_file_metadata
    hf_raise_for_status(r)
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 291, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-6423b5fb-3dc6880b29a2556c43cb8c3d)

Repository Not Found for url: https://huggingface.co/saved-alpaca-belle-cot7b/resolve/main/adapter_config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/Alpaca-CoT/generate.py", line 47, in <module>
    model = PeftModel.from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/peft/peft_model.py", line 135, in from_pretrained
    config = PEFT_TYPE_TO_CONFIG_MAPPING[PeftConfig.from_pretrained(model_id).peft_type].from_pretrained(model_id)
  File "/usr/local/lib/python3.10/site-packages/peft/utils/config.py", line 101, in from_pretrained
    raise ValueError(f"Can't find config.json at '{pretrained_model_name_or_path}'")
ValueError: Can't find config.json at 'saved-alpaca-belle-cot7b'

I think the LORA_WEIGHTS value is missing the repo information (owner/name), so the Hub lookup fails.
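That matches the traceback: "saved-alpaca-belle-cot7b" is neither a local directory nor a full `owner/repo` Hub id, so peft falls back to the Hub and gets a 401. A minimal sanity check, with a hypothetical local path, might look like:

```python
# Sketch: verify LORA_WEIGHTS before generate.py hands it to PeftModel.
# The path below is a hypothetical local training output, not a repo default.
import os

LORA_WEIGHTS = "./saved_models/saved-alpaca-belle-cot7b"

adapter_config = os.path.join(LORA_WEIGHTS, "adapter_config.json")
if not os.path.isfile(adapter_config):
    raise FileNotFoundError(
        f"No adapter_config.json under '{LORA_WEIGHTS}'. Pass either the local "
        "directory written by training, or a full Hugging Face repo id of the "
        "form 'owner/repo'; a bare name is looked up on the Hub and returns 401."
    )
```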

Would you be willing to regularly release newly trained model checkpoints?

Hi,

I have just checked out the HF repo for this project. It seems most models have not been updated for a week.

I know fine-tuning takes time. Could you release a schedule for regularly updating the fine-tuned models based on the latest datasets included in this project? Or is releasing your own fine-tuned checkpoints no longer within the scope of this project?

Thanks!

Multi-GPU training completes, but single-GPU inference fails

Python 3.9.12, torch 2.0.0, peft 0.3.0.dev0, transformers 4.28.0.dev0
Training uses multiple GPUs:

torchrun --nproc_per_node 8 uniform_finetune.py --model_type llama --model_name_or_path ../llama_weights_converted/7B/ --data alpaca-gpt4-cot --lora_target_modules q_proj v_proj --per_gpu_train_batch_size 32 --gradient_accumulation_steps 2 --learning_rate 3e-4 --epochs 1

Training succeeds.
Inference uses a single GPU with the trained weights, and loading the LoRA weights fails:
LORA_WEIGHTS = "./saved_models/llama_alpaca-gpt4-cot"

CUDA_VISIBLE_DEVICES=0 python generate.py --size 7 --model llama
...
Loading checkpoint shards: 100%|█████████████████████████████████████████████| 3/3 [00:14<00:00,  4.84s/it]
Traceback (most recent call last):
  File "/generate.py", line 86, in <module>
    model = PeftModel.from_pretrained(
  File "/home/conda/llama/lib/python3.9/site-packages/peft/peft_model.py", line 164, in from_pretrained
    model = set_peft_model_state_dict(model, adapters_weights)
  File "/home/conda/llama/lib/python3.9/site-packages/peft/utils/save_and_load.py", line 74, in set_peft_model_state_dict
    model.load_state_dict(peft_model_state_dict, strict=False)
  File "/home/conda/llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
        size mismatch for base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight: copying a param with shape torch.Size([8, 4096]) from checkpoint, the shape in current model is torch.Size([1, 4096]).
        size mismatch for base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight: copying a param with shape torch.Size([4096, 8]) from checkpoint, the shape in current model is torch.Size([4096, 1]).
        size mismatch for base_model.model.model.layers.0.self_attn.v_proj.lora_A.weight: copying a param with shape torch.Size([8, 4096]) from checkpoint, the shape in current model is torch.Size([1, 4096]).

It looks like the problem occurs when peft loads the model.
I found suggestions to set map_location="cuda:0" in torch.load, but I get the same error.
Has anyone run into this problem, and how can it be solved?
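One thing worth checking, sketched below with a placeholder path, is whether the `adapter_config.json` written by the multi-GPU run still matches the training setup (r=8, target modules q_proj/v_proj); if the rank recorded there differs from the one used in training, the shapes mismatch exactly as in the error above.

```python
# Sketch: inspect the saved adapter config before single-GPU inference.
# The path mirrors the training command above but is still a placeholder.
import json
import os

LORA_WEIGHTS = "./saved_models/llama_alpaca-gpt4-cot"

with open(os.path.join(LORA_WEIGHTS, "adapter_config.json")) as f:
    cfg = json.load(f)

print("r              =", cfg.get("r"))
print("lora_alpha     =", cfg.get("lora_alpha"))
print("target_modules =", cfg.get("target_modules"))
# A rank of 8 here but a [1, 4096] LoRA shape in the freshly built model (or
# vice versa) reproduces the size-mismatch error when load_state_dict runs.
```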

load_in_8bit can default to False

By changing the load_in_8bit parameter, you can avoid installing the latest bitsandbytes and peft (which gets expensive, especially when you have to deal with CUDA and other environment dependencies).
I recommend load_in_8bit=False; it does not affect model training or loading, and it lets everyone get started faster.
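For reference, a minimal sketch of loading without 8-bit quantization (placeholder model path, fp16 assumed):

```python
# Sketch: load the base model with load_in_8bit=False so bitsandbytes is not needed.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_path = "decapoda-research/llama-7b-hf"  # placeholder base weights

tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=False,          # skip int8 quantization entirely
    torch_dtype=torch.float16,   # fp16 keeps GPU memory usage reasonable
    device_map="auto",           # requires accelerate, which the repo already uses
)
```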

Error when running generate.py

Running python generate.py --model_type chatglm --size 7 starts up normally; the ChatGLM lora_weights path inside is hard-coded.
After entering an instruction, it fails with:
Traceback (most recent call last):
  File "/root/llm/Alpaca-CoT-main/generate.py", line 258, in <module>
    response = evaluate(instruction)
  File "/root/llm/Alpaca-CoT-main/generate.py", line 212, in evaluate
    output = tokenizer.decode(s)
  File "/root/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/fdb7a601d8f8279806124542e11549bdd76f62f6/tokenization_chatglm.py", line 276, in decode
    if self.pad_token_id in token_ids:  # remove pad
RuntimeError: Boolean value of Tensor with more than one value is ambiguous

The s from the step right before output does have a value when printed:

Response:

The dtype of attention mask (torch.int64) is not bool
tensor([ 32313, 20107, 20125, 26054, 20109, 23384, 20104, 21833, 20007,
31121, 20104, 20532, 20109, 32475, 49321, 20100, 21029, 20007,
20004, 145875, 57010, 20012, 20004, 20150, 88230, 29668, 90663,
83831, 85119, 99903, 20004, 145875, 31034, 20012, 150001, 150004,
20483, 22739, 20142, 20372, 88230, 29668, 90663, 20103, 20142,
21224, 20006, 20120, 20134, 20236, 20103, 21008, 20208, 22095,
20012, 20004, 20004, 20009, 20007, 150009, 22999, 20142, 20372,
88230, 29668, 20102, 90085, 84121, 90663, 83823, 20004, 20010,
20007, 150009, 86246, 20058, 85119, 84052, 20062, 90959, 84140,
20006, 83984, 20058, 99903, 85119, 145907, 20004, 20013, 20007,
150009, 86977, 84121, 85119, 84086, 20006, 84111, 85964, 83824,
83995, 84015, 83824, 86299, 84015, 83835, 83823, 20004, 20016,
20007, 150009, 86246, 20058, 99903, 20062, 90997, 20006, 85749,
137200, 119854, 83966, 88230, 83823, 20004, 20004, 24400, 20120,
20127, 99903, 84192, 20006, 20142, 20372, 88230, 29668, 90663,
20134, 20113, 21554, 20103, 20142, 21224, 20102, 20120, 20134,
20113, 20477, 20103, 21506, 20142, 21224, 20207, 20142, 20372,
88230, 29668, 20007, 150005], device='cuda:0')
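The error comes from ChatGLM's custom tokenizer: its `decode` runs `if self.pad_token_id in token_ids`, which is ambiguous when `token_ids` is a multi-element tensor like the one printed above. A possible workaround, not necessarily the repo's fix, is to pass `decode` a plain Python list; `safe_decode` below is a hypothetical helper, not part of generate.py.

```python
# Sketch of a workaround: convert generated ids to a list before ChatGLM decodes them.
import torch

def safe_decode(tokenizer, token_ids):
    """Decode with ChatGLM's tokenizer, which expects a plain list of ints;
    a torch.Tensor makes its `pad_token_id in token_ids` check ambiguous."""
    if isinstance(token_ids, torch.Tensor):
        token_ids = token_ids.tolist()
    return tokenizer.decode(token_ids)

# In generate.py's evaluate(), the call would then be roughly:
#   output = safe_decode(tokenizer, s)
```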

How to pretrain the model and expand the vocabulary?

After downloading the 7B model, I tested a few Chinese questions and found many unrecognizable characters in the answers. Is the Chinese portion of the model's vocabulary particularly small? How can I expand the Chinese vocabulary and, on top of that, add Chinese pretraining corpora for continued pretraining?
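The usual Hugging Face path for this is to extend the tokenizer and resize the embedding matrix before continued pretraining; the sketch below uses a placeholder model path and a toy token list, whereas a real extension would normally merge a newly trained SentencePiece vocabulary.

```python
# Sketch: extend the vocabulary and resize embeddings before continued pretraining.
# Model path and token list are placeholders, not values from this repo.
from transformers import LlamaForCausalLM, LlamaTokenizer

model_path = "decapoda-research/llama-7b-hf"

tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path)

new_tokens = ["你好", "中文"]                      # toy examples of added Chinese tokens
num_added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))     # new embedding rows start random

print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
# Continued pretraining on Chinese corpora is then needed so the new embeddings
# actually get learned.
```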

ChatGLM fine-tuning error

Running uniform_finetune.py raises:

`
╭──────────────────────────────────────────────────────────────────────────────────────────────────╮
│  /home/inspur/.cache/huggingface/modules/transformers_modules/chatglm-6b/tokenization_chatglm.py │
│ :1                                                                                               │
│ <!DOCTYPE html>                                                                                  │
│ ▲                                                                                                │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
SyntaxError: invalid syntax
`
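The `<!DOCTYPE html>` at line 1 means the cached tokenization_chatglm.py is an HTML error page rather than Python source, i.e. the remote-code download failed. One hedged fix is to delete the cached module so it is fetched again on the next run; the cache path below is the default Hugging Face location on Linux and may differ on your machine.

```python
# Sketch: remove the corrupted cached remote-code module for chatglm-6b.
import shutil
from pathlib import Path

cache_dir = Path.home() / ".cache" / "huggingface" / "modules" / "transformers_modules" / "chatglm-6b"
if cache_dir.exists():
    shutil.rmtree(cache_dir)  # the next from_pretrained(..., trust_remote_code=True)
                              # call re-downloads a fresh tokenization_chatglm.py
```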
