
hiyouga / chatglm-efficient-tuning

3.6K stars · 32 watchers · 458 forks · 198.18 MB

Fine-tuning ChatGLM-6B with PEFT | Efficient ChatGLM fine-tuning based on PEFT

License: Apache License 2.0

Python 100.00%
chatglm chatgpt fine-tuning lora alpaca peft huggingface language-model transformers pytorch

chatglm-efficient-tuning's Introduction


hiyouga

Yaowei Zheng

Ph.D. Student

Beihang University

37 Xueyuan Rd., Haidian Dist.

Beijing, China, 100191

Education

  • 2022.09-Present School of Computer Science and Engineering, Beihang University Ph.D.
  • 2017.09-2021.06 Shen Yuan Honors College, Beihang University B.Eng.

Research Interests

  • Natural Language Processing
  • Large Language Models

Skills

  • Natural Language: Chinese (Native); English (CET-6); Japanese (JLPT-N2)
  • Programming Language: Python; C++; Java; JavaScript; PHP; Go; Verilog HDL; MATLAB
  • Typesetting Language: LaTeX; Markdown
  • Programming Framework: PyTorch; TensorFlow

Publications

  1. Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo and Yongqiang Ma: LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. Preprint. [arXiv]
  2. Junfan Chen, Richong Zhang, Yaowei Zheng, Qianben Chen, Chunming Hu and Yongyi Mao: DualCL: Principled Supervised Contrastive Learning as Mutual Information Maximization for Text Classification. WWW2024. [arXiv][Code]
  3. Richong Zhang, Qianben Chen, Yaowei Zheng, Samuel Mensah and Yongyi Mao: Aspect-level Sentiment Analysis via a Syntax-based Neural Network. IEEE/ACM Transactions on Audio, Speech, and Language Processing. [DOI]
  4. Xiaohui Guo, Richong Zhang, Yaowei Zheng and Yongyi Mao: Robust Regularization with Adversarial Labelling of Perturbed Samples. IJCAI2021. [DOI][arXiv]
  5. Yaowei Zheng, Richong Zhang and Yongyi Mao: Regularizing Neural Networks via Adversarial Model Perturbation. CVPR2021. [DOI][arXiv][Code][Poster][Video]
  6. Yaowei Zheng, Richong Zhang, Suyuchen Wang, Samuel Mensah and Yongyi Mao: Anchored Model Transfer and Soft Instance Transfer for Cross-Task Cross-Domain Learning: A Study Through Aspect-Level Sentiment Classification. WWW2020. [DOI]
  7. Yaowei Zheng, Richong Zhang, Samuel Mensah and Yongyi Mao: Replicate, Walk, and Stop on Syntax: an Effective Neural Network Model for Aspect-Level Sentiment Classification. AAAI2020. [DOI][Code]

Academic Service

  • Conference Reviewer: AAAI, EMNLP, NAACL, COLING
  • Journal Reviewer: Neural Computation

chatglm-efficient-tuning's People

Contributors

akamya997, codemayq, hiyouga, janglichao, jiahuanluo, jinsongpan, michaeloo0, mmrbun, netease-yanxuan, niu-dali, suprityoung, yang-hangwa, zhongpei

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

chatglm-efficient-tuning's Issues

Why can't I run LoRA fine-tuning without an internet connection?

I did not change the path of the chatglm-6b model in the config file. Training runs fine on a machine with internet access, but it stops working once I switch to offline mode. Why is that? What do I need to change in the code so that I can run LoRA training of ChatGLM offline? Any guidance would be appreciated.
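
A possible workaround, sketched below as an assumption rather than the project's official answer: mirror THUDM/chatglm-6b to a local directory once on a networked machine, then run training offline against that local copy (via the model path argument, assumed here to be --model_name_or_path) with the Hugging Face libraries forced into offline mode.

    import os

    from huggingface_hub import snapshot_download

    # Step 1 (on a networked machine): mirror the weights into the local cache.
    local_dir = snapshot_download("THUDM/chatglm-6b")
    print(local_dir)  # copy this directory to the offline machine if needed

    # Step 2 (on the offline machine): block hub lookups and point training at the
    # local directory instead of the hub id, e.g.
    #   python src/finetune.py --model_name_or_path /path/to/chatglm-6b ...
    os.environ["TRANSFORMERS_OFFLINE"] = "1"
    os.environ["HF_DATASETS_OFFLINE"] = "1"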

Error during RLHF training

File "D:\Software\anaconda3\envs\chatglm\lib\site-packages\transformers\generation\utils.py", line 924, in _merge_criteria_processor_list
if len(custom_list) == 0:
TypeError: object of type 'InvalidScoreLogitsProcessor' has no len()
Is this a transformers version problem? I tried several versions and none of them worked.
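
For reference, a hedged sketch of the fix usually applied for this error: recent transformers releases call len() on the logits_processor argument, so it must be wrapped in a LogitsProcessorList rather than passed as a bare processor instance (the import path of InvalidScoreLogitsProcessor below is an assumption about this repo's layout).

    from transformers import LogitsProcessorList

    from utils.other import InvalidScoreLogitsProcessor  # assumed location in this repo

    gen_kwargs = {
        # A bare InvalidScoreLogitsProcessor() here triggers the len() TypeError above.
        "logits_processor": LogitsProcessorList([InvalidScoreLogitsProcessor()]),
    }
    # response = model.generate(**inputs, **gen_kwargs)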

AssertionError: No inf checks were recorded for this optimizer.

When running:

    $ CUDA_VISIBLE_DEVICES=0 python src/finetune.py  --do_train  --dataset alpaca_gpt4_zh  --finetuning_type freeze  --output_dir path_to_checkpoint  --per_device_train_batch_size 2  --gradient_accumulation_steps 2  --lr_scheduler_type cosine  --logging_steps 10  --save_steps 1000  --learning_rate 5e-5  --num_train_epochs 1.0   --quantization_bit=8 --fp16

the following error occurs:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /ChatGLM-Efficient-Tuning/src/finetune.py:88 in <module>               │
│                                                                                                  │
│   85                                                                                             │
│   86                                                                                             │
│   87 if __name__ == "__main__":                                                                  │
│ ❱ 88 │   main()                                                                                  │
│   89                                                                                             │
│                                                                                                  │
│ /ChatGLM-Efficient-Tuning/src/finetune.py:60 in main                   │
│                                                                                                  │
│   57 │                                                                                           │
│   58 │   # Training                                                                              │
│   59 │   if training_args.do_train:                                                              │
│ ❱ 60 │   │   train_result = trainer.train()                                                      │
│   61 │   │   trainer.log_metrics("train", train_result.metrics)                                  │
│   62 │   │   trainer.save_metrics("train", train_result.metrics)                                 │
│   63 │   │   trainer.save_state() # along with the loss values                                   │
│                                                                                                  │
│ /anaconda3/envs/py38_chat_peft/lib/python3.8/site-packages/transformers/trainer.py:16 │
│ 62 in train                                                                                      │
│                                                                                                  │
│   1659 │   │   inner_training_loop = find_executable_batch_size(                                 │
│   1660 │   │   │   self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size  │
│   1661 │   │   )                                                                                 │
│ ❱ 1662 │   │   return inner_training_loop(                                                       │
│   1663 │   │   │   args=args,                                                                    │
│   1664 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │
│   1665 │   │   │   trial=trial,                                                                  │
│                                                                                                  │
│ /anaconda3/envs/py38_chat_peft/lib/python3.8/site-packages/transformers/trainer.py:19 │
│ 91 in _inner_training_loop                                                                       │
│                                                                                                  │
│   1988 │   │   │   │   │   │   │   xm.optimizer_step(self.optimizer)                             │
│   1989 │   │   │   │   │   elif self.do_grad_scaling:                                            │
│   1990 │   │   │   │   │   │   scale_before = self.scaler.get_scale()                            │
│ ❱ 1991 │   │   │   │   │   │   self.scaler.step(self.optimizer)                                  │
│   1992 │   │   │   │   │   │   self.scaler.update()                                              │
│   1993 │   │   │   │   │   │   scale_after = self.scaler.get_scale()                             │
│   1994 │   │   │   │   │   │   optimizer_was_run = scale_before <= scale_after                   │
│                                                                                                  │
│ /anaconda3/envs/py38_chat_peft/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler │
│ .py:368 in step                                                                                  │
│                                                                                                  │
│   365 │   │   if optimizer_state["stage"] is OptState.READY:                                     │
│   366 │   │   │   self.unscale_(optimizer)                                                       │
│   367 │   │                                                                                      │
│ ❱ 368 │   │   assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were rec   │
│   369 │   │                                                                                      │
│   370 │   │   retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)         │
│   371                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AssertionError: No inf checks were recorded for this optimizer.

What is causing this problem?
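
One way to narrow this down, sketched below on the assumption that the fp16 GradScaler simply found no parameters producing gradients it could scale (a possible side effect of combining --quantization_bit 8 with --fp16 and frozen layers): list the trainable parameters and their dtypes before training, and drop --fp16 if nothing trainable remains in a floating-point dtype.

    import torch

    def report_trainable(model: torch.nn.Module) -> None:
        # Print every parameter that would receive gradients, plus its dtype.
        for name, param in model.named_parameters():
            if param.requires_grad:
                print(f"{name}: dtype={param.dtype}, shape={tuple(param.shape)}")

    # If this prints nothing (or only non-float tensors), the AMP grad scaler has
    # nothing to check, which matches the assertion above.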

The latest code has some problems during training

Traceback (most recent call last):
File "/mnt/task_runtime/ChatGLM-Efficient-Tuning/src/train_ppo.py", line 125, in
main()
File "/mnt/task_runtime/ChatGLM-Efficient-Tuning/src/train_ppo.py", line 93, in main
responses_with_queries = ppo_trainer.generate(queries, length_sampler=output_length_sampler, **gen_kwargs)
File "/mnt/task_runtime/ChatGLM-Efficient-Tuning/src/utils/ppo.py", line 162, in generate
response = self.accelerator.unwrap_model(self.model).generate(
File "/mnt/miniconda/envs/py310/lib/python3.10/site-packages/trl/models/modeling_value_head.py", line 195, in generate
return self.pretrained_model.generate(*args, **kwargs)
File "/mnt/miniconda/envs/py310/lib/python3.10/site-packages/peft/peft_model.py", line 731, in generate
outputs = self.base_model.generate(**kwargs)
File "/mnt/miniconda/envs/py310/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/mnt/miniconda/envs/py310/lib/python3.10/site-packages/transformers/generation/utils.py", line 1454, in generate
logits_processor = self._get_logits_processor(
File "/mnt/miniconda/envs/py310/lib/python3.10/site-packages/transformers/generation/utils.py", line 935, in _get_logits_processor
processors = self._merge_criteria_processor_list(processors, logits_processor)
File "/mnt/miniconda/envs/py310/lib/python3.10/site-packages/transformers/generation/utils.py", line 957, in _merge_criteria_processor_list
if len(custom_list) == 0:
TypeError: object of type 'InvalidScoreLogitsProcessor' has no len()

Using Accelerate for training

Thank you for opening this repo.
In the example for DDP training:

accelerate config # configure the environment
accelerate launch src/finetune.py # arguments (same as above)

However, I cannot find any use of the accelerate library in your repo.

+ from accelerate import Accelerator
+ accelerator = Accelerator()

+ model, optimizer, training_dataloader, scheduler = accelerator.prepare(
+     model, optimizer, training_dataloader, scheduler
+ )
...
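
For comparison, a minimal self-contained Accelerate loop is sketched below. It is a generic example, not this repo's training code: src/finetune.py builds on transformers' Trainer, which already handles the distributed setup when the script is started with accelerate launch, so the explicit prepare() calls from the diff above should not be required there.

    import torch
    from accelerate import Accelerator

    def train(model, optimizer, dataloader, scheduler, num_epochs: int = 1) -> None:
        accelerator = Accelerator()
        # prepare() wraps the objects for the current (possibly distributed) setup.
        model, optimizer, dataloader, scheduler = accelerator.prepare(
            model, optimizer, dataloader, scheduler
        )
        model.train()
        for _ in range(num_epochs):
            for batch in dataloader:
                outputs = model(**batch)
                accelerator.backward(outputs.loss)  # replaces loss.backward()
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()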

Error at step 3: RuntimeError: probability tensor contains either inf, nan or element < 0

File "/output/ChatGLM-Efficient-Tuning/src/train_ppo.py", line 114, in
main()
File "/output/ChatGLM-Efficient-Tuning/src/train_ppo.py", line 82, in main
responses_with_queries = ppo_trainer.generate(queries, length_sampler=output_length_sampler, **gen_kwargs)
File "/output/ChatGLM-Efficient-Tuning/src/utils/ppo.py", line 162, in generate
response = self.accelerator.unwrap_model(self.model).generate(
File "/usr/local/envs/chatglm_etuning/lib/python3.10/site-packages/trl/models/modeling_value_head.py", line 195, in generate
return self.pretrained_model.generate(*args, **kwargs)
File "/usr/local/envs/chatglm_etuning/lib/python3.10/site-packages/peft/peft_model.py", line 731, in generate
outputs = self.base_model.generate(**kwargs)
File "/usr/local/envs/chatglm_etuning/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/generation/utils.py", line 1485, in generate
return self.sample(
File "/usr/local/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/generation/utils.py", line 2560, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either inf, nan or element < 0
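
A hedged sketch of the usual mitigation: sampling fails because some logits are nan or inf, and ChatGLM's own InvalidScoreLogitsProcessor exists precisely to sanitize such scores. The processor below mirrors that pattern (the class name is illustrative, and the fallback token id 5 follows ChatGLM's implementation and is an assumption here); attaching it to gen_kwargs keeps torch.multinomial from seeing invalid probabilities.

    import torch
    from transformers import LogitsProcessor, LogitsProcessorList

    class SafeScoreLogitsProcessor(LogitsProcessor):
        """Replace nan/inf logits with a distribution that puts all mass on one token."""

        def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
            if torch.isnan(scores).any() or torch.isinf(scores).any():
                scores.zero_()
                scores[..., 5] = 5e4  # fallback token id, following ChatGLM's processor
            return scores

    gen_kwargs = {"logits_processor": LogitsProcessorList([SafeScoreLogitsProcessor()])}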

With the latest code as of 4/14, running the demo raises a dataset checksum validation error

04/15/2023 22:48:30 - INFO - utils - Loading dataset DatasetInfo(load_from='file', dataset_name=None, file_name='alpaca_gpt4_data_zh.json', file_sha1='736d3a9d0fcbb252d1e8f902920961ecfd310e41')...
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ C:\Users\tians\PycharmProjects\ChatGLM-Efficient-Tuning\finetune_chatglm.py:67 in │
│ │
│ 64 │
│ 65 │
│ 66 if __name__ == "__main__": │
│ ❱ 67 │ main() │
│ 68 │
│ │
│ C:\Users\tians\PycharmProjects\ChatGLM-Efficient-Tuning\finetune_chatglm.py:22 in main │
│ │
│ 19 │ │
│ 20 │ # Prepare pretrained model and dataset │
│ 21 │ model_args, data_args, training_args, finetuning_args = prepare_args() │
│ ❱ 22 │ dataset = prepare_data(model_args, data_args, training_args) │
│ 23 │ model, tokenizer = load_pretrained(model_args, finetuning_args, is_trainable=trainin │
│ 24 │ dataset = preprocess_data(dataset, tokenizer, data_args, training_args) │
│ 25 │ data_collator = DataCollatorForChatGLM(tokenizer, model, data_args.ignore_pad_token_ │
│ │
│ C:\Users\tians\PycharmProjects\ChatGLM-Efficient-Tuning\utils.py:246 in prepare_data │
│ │
│ 243 │ │ │ data_file = os.path.join(data_args.dataset_dir, dataset_info.file_name) │
│ 244 │ │ │ extension = dataset_info.file_name.split(".")[-1] │
│ 245 │ │ │ if dataset_info.file_sha1 is not None: │
│ ❱ 246 │ │ │ │ checksum(data_file, dataset_info.file_sha1) │
│ 247 │ │ │ else: │
│ 248 │ │ │ │ logger.warning("Checksum failed: missing SHA-1 hash value in dataset_inf │
│ 249 │ │ │ raw_datasets = load_dataset( │
│ │
│ C:\Users\tians\PycharmProjects\ChatGLM-Efficient-Tuning\utils.py:228 in checksum │
│ │
│ 225 │ │ │ binary_data = datafile.read() │
│ 226 │ │ sha1 = hashlib.sha1(binary_data).hexdigest() │
│ 227 │ │ if sha1 != hash: │
│ ❱ 228 │ │ │ raise ValueError("Checksum failed for {}.".format(file_path)) │
│ 229 │ │
│ 230 │ max_samples = data_args.max_train_samples if training_args.do_train else data_args.m │
│ 231 │ all_datasets = [] # support multiple datasets │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Checksum failed for data\alpaca_gpt4_data_zh.json.


The log is shown above. How can I resolve this issue?
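
A small sketch for checking the mismatch: compute the SHA-1 of the local file and compare it against the expected hash logged above (and stored in the dataset registry, assumed here to be data/dataset_info.json). A partially downloaded or re-generated file is a common cause of this kind of mismatch.

    import hashlib

    def sha1_of(path: str) -> str:
        with open(path, "rb") as f:
            return hashlib.sha1(f.read()).hexdigest()

    print(sha1_of("data/alpaca_gpt4_data_zh.json"))
    # Expected, per the log above: 736d3a9d0fcbb252d1e8f902920961ecfd310e41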

Dataset doesn't exist.

FileNotFoundError: Couldn't find a dataset script at 
/content/JosephusCheung/GuanacoDataset/GuanacoDataset.py or any data file in the
same directory. Couldn't find 'JosephusCheung/GuanacoDataset' on the Hugging 
Face Hub either: FileNotFoundError: Dataset 'JosephusCheung/GuanacoDataset' 
doesn't exist on the Hub. If the repo is private or gated, make sure to log in 
with `huggingface-cli login`.
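
A quick check along the lines the error message suggests, sketched below: authenticate against the Hub and try loading the dataset directly to see whether it is reachable at all (the dataset id is taken verbatim from the error).

    from datasets import load_dataset
    from huggingface_hub import login

    login()  # paste a Hugging Face access token when prompted
    dataset = load_dataset("JosephusCheung/GuanacoDataset")
    print(dataset)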

Cannot find the arguments module

CUDA_VISIBLE_DEVICES=0 python3 infer.py --checkpoint_dir ../output/checkpoint-2000
After running this, the following error is reported:
ModuleNotFoundError: No module named 'arguments'
Is this arguments module something installed via pip3, or is it a local module?
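
A hedged guess at the cause, with a sketch: arguments is most likely a local module that sits next to infer.py in this repo rather than a pip package, so the import fails when the script is launched from a different working directory. One workaround is to make the script's own directory importable:

    import os
    import sys

    # Prepend the directory that contains infer.py (and, presumably, arguments.py).
    sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

    import arguments  # noqa: E402  # resolves only if the module really lives there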

I got an error during fine-tuning, could you help me take a look?

ImportError: cannot import name 'WEIGHTS_NAME' from 'peft.utils.other' (/root/miniconda3/lib/python3.10/site-packages/peft/utils/other.py)
./finetune.sh: line 7: --dataset: command not found
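
Two separate problems appear to be mixed here: the WEIGHTS_NAME ImportError usually means the installed peft release does not match the one this repo targets (check requirements.txt for the exact pin), and the "--dataset: command not found" line suggests a missing trailing backslash on the previous line of finetune.sh, so the shell treats --dataset as a new command. A quick version check, as a starting point:

    from importlib.metadata import version

    # Compare these against the versions pinned in the repo's requirements.txt.
    print("peft:", version("peft"))
    print("transformers:", version("transformers"))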

Hello, single-GPU training fails with RuntimeError: mixed dtype (CPU): expect input to have scalar type of BFloat16

File "/root/.cache/huggingface/modules/transformers_modules/models/modeling_chatglm.py", line 624, in forward
attention_input = self.input_layernorm(hidden_states)
File "/opt/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/anaconda3/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 190, in forward
return F.layer_norm(
File "/opt/anaconda3/lib/python3.9/site-packages/torch/nn/functional.py", line 2515, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: mixed dtype (CPU): expect input to have scalar type of BFloat16
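
A hedged reading of the message, with a sketch: the "(CPU)" part suggests at least one layer is still running on the CPU while the hidden states are BFloat16, i.e. the model is not on a single device with a single dtype. Making that explicit before training is one way to rule this out:

    import torch
    from transformers import AutoModel

    model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
    model = model.half().cuda()  # one consistent dtype on one device

    print({p.dtype for p in model.parameters()})  # expect a single dtype
    print(next(model.parameters()).device)        # expect cuda:0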

Training samples exceed 2048 tokens

When a training sample exceeds the length limit there is only a warning, but training then fails with an error. If I want to drop over-length samples instead, where should I change the code?
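
A generic sketch of such a filter (not a pointer to the exact line in this repo): over-length samples can be dropped with datasets' filter() before training. The field names below follow the Alpaca-style format and are assumptions.

    from datasets import load_dataset
    from transformers import AutoTokenizer

    MAX_LENGTH = 2048  # assumed context limit, from the issue title

    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
    dataset = load_dataset("json", data_files="data/alpaca_gpt4_data_zh.json")["train"]

    def within_limit(example) -> bool:
        # "instruction"/"input"/"output" are assumed Alpaca-style fields.
        text = example.get("instruction", "") + example.get("input", "") + example.get("output", "")
        return len(tokenizer.encode(text)) <= MAX_LENGTH

    dataset = dataset.filter(within_limit)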

Why doesn't LoRA fine-tuning reproduce the results from the README?

The evaluation results are:

***** eval metrics *****
eval_bleu-4 = 15.0234
eval_rouge-1 = 35.197
eval_rouge-2 = 15.4659
eval_rouge-l = 26.9888
eval_runtime = 0:02:57.77
eval_samples_per_second = 0.563
eval_steps_per_second = 0.073

The scores are almost the same as the original model, and BLEU even dropped?

The tuning achieved essentially nothing, although at least there was no catastrophic forgetting.

How do I add the --lora_rank 8 argument?

Here is an example:

CUDA_VISIBLE_DEVICES=0 python src/finetune.py \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --finetuning_type lora \
    --output_dir path_to_checkpoint \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --lora_rank 8 \
    --fp16

Is it OK to just add it as the second-to-last line, like above?
I'm new to programming, please bear with me. Thank you.

About training speed

I set the same parameters with two different datasets.

  1. Small dataset: 1.5 s/iter
  2. Large dataset: 5.6 s/iter

Why does this happen?

Some steps show loss = nan, is that expected?

I am using your example to train with PPO, and the logs show loss = nan at some steps. I am curious whether that is expected.

Some log excerpts:

{'loss': 0.3290, 'reward': -2.0304, 'learning_rate': 5e-05}
0%| | 1/13000 [00:10<36:34:36, 10.13s/it]{'loss': nan, 'reward': 5.7646, 'learning_rate': 5e-05}
0%| | 2/13000 [00:28<54:33:11, 15.11s/it{'loss': 0.2527, 'reward': 1.2237, 'learning_rate': 5e-05}
0%| | 3/13000 [00:39<47:01:51, 13.03s/it]{'loss': 0.1512, 'reward': 10.7681, 'learning_rate': 5e-05}
0%| | 4/13000 [01:01<60:36:15, 16.79s/it]{'loss': 0.0769, 'reward': 7.0280, 'learning_rate': 5e-05}
0%| | 5/13000 [01:20<63:06:18, 17.48s/it]{'loss': 0.1685, 'reward': 12.2049, 'learning_rate': 5e-05}
0%| | 6/13000 [01:28<51:51:58, 14.37s/it]

OOM Error when saving models trained in INT8 mode

A bug report from the WeChat group:

04/20/2023 11:04:39 - INFO - utils.common - Saving model checkpoint to path_to_checkpoint
Traceback (most recent call last):
  File "/src/finetune.py", line 73, in <module>
    main()
  File "/src/finetune.py", line 55, in main
    trainer.save_model()
  File "/site-packages/transformers/trainer.py", line 2830, in save_model
    self._save(output_dir)
  File "src/utils/common.py", line 462, in _save
    self.model.save_pretrained(output_dir) # only save peft weights with the built-in method
  File "/peft/src/peft/peft_model.py", line 116, in save_pretrained
    output_state_dict = get_peft_model_state_dict(
  File "/peft/src/peft/utils/save_and_load.py", line 32, in get_peft_model_state_dict
    state_dict = model.state_dict()
  File "/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/torch/nn/modules/module.py", line 1818, in state_dict
      [Previous line repeated 4 more times]
  File "/torch/nn/modules/module.py", line 1815, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "/bitsandbytes/nn/modules.py", line 268, in _save to_state_dict
    self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
  File "/bitsandbytes/autograd/_functions.py", line 96, in undo_layout
    outputs = torch.empty_like(tensor) # note: not using .index_copy because it was slower on cuda
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 14.76 GiB total capacity; 13.82 GiB already allocated; 47.75 MiB free; 14.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

A similar report can be found at: huggingface/peft#335

I suppose the failure is caused by state_dict = model.state_dict().
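
If that is the case, a workaround sketched below (an assumption, not a confirmed fix) is to avoid materializing the full int8 state dict and save only the trainable LoRA parameters, which is all that needs to be persisted for the adapter anyway:

    import torch

    def save_trainable_only(model: torch.nn.Module, path: str = "adapter_model.bin") -> None:
        # Collect just the parameters that were actually trained, moved to the CPU,
        # instead of calling model.state_dict() on the whole int8 model.
        trainable_state = {
            name: param.detach().cpu()
            for name, param in model.named_parameters()
            if param.requires_grad
        }
        torch.save(trainable_state, path)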

I get an error when trying LoRA

My server has four 32 GB GPUs, and LoRA training fails as follows:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument index in method wrapper__index_select)
Where do I need to change the code to restrict training to a single GPU?
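
A hedged sketch of the simplest fix: restrict the visible devices to a single card before CUDA is initialized, either by prefixing the command with CUDA_VISIBLE_DEVICES=0 (as the repo's examples do) or from Python:

    import os

    os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before torch initializes CUDA

    import torch  # noqa: E402

    print(torch.cuda.device_count())  # expected: 1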
