
chatglm-6b-qlora's People

Contributors

shuxueslpi


chatglm-6b-qlora's Issues

BrokenPipeError: [Errno 32] Broken pipe

Thank you, and please excuse my English.
When I run CUDA_VISIBLE_DEVICES=1 python3 train_qlora.py --train_args_json chatGLM_6B_QLoRA.json --model_name_or_path /T106/chatGLM-6B-QLoRA-main/chatGLM-6B-QLoRA-main/remote_scripts/chatglm-6b/ --train_data_path data/train.jsonl --eval_data_path data/dev.jsonl --lora_rank 4 --lora_dropout 0.05 --compute_dtype fp32,
I get this error:
(base) root@461jc47ml0du4-0:/T106/chatGLM-6B-QLoRA-main/chatGLM-6B-QLoRA-main# CUDA_VISIBLE_DEVICES=1 python3 train_qlora.py --train_args_json chatGLM_6B_QLoRA.json --model_name_or_path /T106/chatGLM-6B-QLoRA-main/chatGLM-6B-QLoRA-main/remote_scripts/chatglm-6b/ --train_data_path data/train.jsonl --eval_data_path data/dev.jsonl --lora_rank 4 --lora_dropout 0.05 --compute_dtype fp32
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:11<00:00, 1.38s/it]
trainable params: 1,835,008 || all params: 6,175,121,408 || trainable%: 0.029716144489446126
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-7be044e55537389d/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 425.26it/s]
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-7be044e55537389d/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-b1ad1cf49d010a09.arrow
Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/json/default-7be044e55537389d/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-7f22050519838b48.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-7be044e55537389d/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-82c52662d9060a3a.arrow
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-15aa3e3de12fc81f/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1132.07it/s]
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-15aa3e3de12fc81f/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-d76e708a4953afce.arrow
Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/json/default-15aa3e3de12fc81f/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-466cf4ff38bae650.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-15aa3e3de12fc81f/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-74e21cf8bae50992.arrow
wandb: Currently logged in as: 2315553823 (fky_hbj). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.15.10
wandb: Run data is saved locally in /T106/chatGLM-6B-QLoRA-main/chatGLM-6B-QLoRA-main/wandb/run-20230920_070236-anybjcsm
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run kind-sunset-17
wandb: ⭐️ View project at https://wandb.ai/fky_hbj/huggingface
wandb: 🚀 View run at https://wandb.ai/fky_hbj/huggingface/runs/anybjcsm
0%| | 0/3581 [00:00<?, ?it/s]use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
0%| | 2/3581 [00:25<12:28:26, 12.55s/it]Exception in thread NetStatThr:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 267, in check_network_status
self._loop_check_status(
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 223, in _loop_check_status
Exception in thread IntMsgThr:
local_handle = request()
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 735, in deliver_network_status
File "/opt/conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
return self._deliver_network_status(status)
self.run()
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 466, in _deliver_network_status
File "/opt/conda/lib/python3.8/threading.py", line 870, in run
return self._deliver_record(record)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 425, in _deliver_record
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 299, in check_internal_messages
handle = mailbox._deliver_record(record, interface=self)
self._loop_check_status(
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 223, in _loop_check_status
local_handle = request()
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 743, in deliver_internal_messages
return self._deliver_internal_messages(internal_message)
interface._publish(record)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 472, in _deliver_internal_messages
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
return self._deliver_record(record)
self.send_server_request(server_req)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 425, in _deliver_record
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
handle = mailbox._deliver_record(record, interface=self)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
self._send_message(msg)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
interface._publish(record)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
self._sock_client.send_record_publish(record)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
3%|███▍ | 100/3581 [18:37<10:43:05, 11.08s/it]Traceback (most recent call last):
File "train_qlora.py", line 206, in
train(args)
File "train_qlora.py", line 200, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1553, in train
return inner_training_loop(
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1927, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2240, in _maybe_log_save_evaluate
self.log(logs)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2595, in log
self.control = self.callback_handler.on_log(self.args, self.state, self.control, logs)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer_callback.py", line 399, in on_log
return self.call_event("on_log", args, state, control, logs=logs)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer_callback.py", line 406, in call_event
result = getattr(callback, event)(
File "/opt/conda/lib/python3.8/site-packages/transformers/integrations/integration_utils.py", line 803, in on_log
self._wandb.log({**logs, "train/global_step": state.global_step})
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 419, in wrapper
return func(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 370, in wrapper_fn
return func(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 360, in wrapper
return func(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1792, in log
self._log(data=data, step=step, commit=commit)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1567, in _log
self._partial_history_callback(data, step, commit)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1439, in _partial_history_callback
self._backend.interface.publish_partial_history(
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 546, in publish_partial_history
self._publish_partial_history(partial_history)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 89, in _publish_partial_history
self._publish(rec)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
wandb: While tearing down the service manager. The following error has occurred: [Errno 32] Broken pipe

I wonder if this is a wandb problem.
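A quick way to rule wandb out (my own workaround, not something from this repo) is to disable the wandb integration before the Trainer is created, either via the environment or via the transformers TrainingArguments:

import os

# Skip wandb entirely; older transformers versions honour this variable,
# newer ones prefer report_to.
os.environ["WANDB_DISABLED"] = "true"
# Or keep wandb but avoid the network sync threads that raised BrokenPipeError:
# os.environ["WANDB_MODE"] = "offline"

# If editing train_qlora.py instead, the equivalent TrainingArguments field is:
#   TrainingArguments(..., report_to="none")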

OOM when training on two GPUs (16 GB × 2)

Memory usage is around 12 GB once the model is loaded. When execution reaches the line model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True) it suddenly jumps to about 24 GB and the run crashes, before training even starts. fp16, batch_size set to 1.
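As far as I understand, prepare_model_for_kbit_training upcasts the non-quantized parameters (layer norms, output head) to fp32, so if the base model was loaded in plain fp16 the footprint roughly doubles right at that line. A sketch of loading the weights 4-bit quantized first, using standard transformers/peft/bitsandbytes APIs (the checkpoint path is a placeholder, and this is not copied from the repo's code):

import torch
from transformers import AutoModel, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

q_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModel.from_pretrained(
    "THUDM/chatglm2-6b",            # placeholder path
    quantization_config=q_config,
    trust_remote_code=True,
)
# Now the fp32 upcast only touches the small non-quantized layers.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)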

QLoRA fine-tuning quality

Hi, do you have QLoRA results on the ADGEN validation set, such as ROUGE-L or BLEU? Is it better than P-Tuning?

Adapting the framework to other models

I would like to use this framework to train other models, such as Vicuna-13B. Which files do I need to modify?
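Not an authoritative answer, but a rough sketch of the usual changes for a LLaMA-family model such as Vicuna-13B: load it with AutoModelForCausalLM instead of the ChatGLM remote-code AutoModel, and point the LoRA config at the LLaMA attention projection names (verify them with print(model) first, since these names are an assumption here):

from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=4,
    lora_alpha=32,
    lora_dropout=0.05,
    # Typical LLaMA/Vicuna attention projections; confirm against your checkpoint.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

The tokenize_func in train_qlora.py is also written around the ChatGLM prompt format, so the data preprocessing would need the same kind of adjustment.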

Fine-tuning ChatGLM3-6B fails with RuntimeError: CUDA error: invalid argument

I am working in the huggingface/transformers-pytorch-gpu:4.29.1 image. The chatglm-6b example fine-tunes and runs inference without problems.
However, fine-tuning ChatGLM3-6B fails. I downloaded the model from ModelScope: https://modelscope.cn/models/ZhipuAI/chatglm3-6b/summary
My training config JSON:

{
    "output_dir": "saved_files/chatglm3_6b_qlora_t32",
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "per_device_eval_batch_size": 4,
    "learning_rate": 1e-3,
    "num_train_epochs": 10,
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.1,
    "logging_steps": 1,
    "save_strategy": "steps",
    "save_steps": 500,
    "evaluation_strategy": "steps",
    "eval_steps": 500,
    "optim": "adamw_torch",
    "fp16": false,
    "remove_unused_columns": false,
    "ddp_find_unused_parameters": false,
    "seed": 42
}

My training command:

python3 train_qlora.py \
--train_args_json chatglm3_6b_qlora.json \
--model_name_or_path /share/public/huggingface_cache/ZhipuAI/chatglm3-6b \
--train_data_path data/futures_train.jsonl \
--eval_data_path data/futures_dev.jsonl \
--lora_rank 4 \
--lora_dropout 0.05 \
--compute_dtype fp32

Error log:

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
Traceback (most recent call last):
  File "train_qlora.py", line 209, in <module>
    train(args)
  File "train_qlora.py", line 203, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1938, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2770, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1821, in backward
    loss.backward(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Extremely slow inference after merging a QLoRA continued-pretraining adapter

I first did an SFT pass with QLoRA.
Without merging, inference speed was acceptable.
I then did a second, continued-pretraining pass with QLoRA, where the adapted layers included the dense layers in addition to QKV. After merging this second adapter into the ChatGLM2-6B base model, inference became extremely slow: a single question still had no answer after five minutes. Has anyone run into this?

Error after switching models

After switching the model to ChatGLM2-6B I get the following error. What is the cause?

python3 train_qlora.py --train_args_json chatGLM_6B_QLoRA.json --model_name_or_path THUDM/chatglm2-6b --train_data_path data/train.jsonl --eval_data_path data/dev.jsonl --lora_rank 4 --lora_dropout 0.05 --compute_dtype fp32

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda117.so
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')}
warn(msg)
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda117.so...
You are loading your model in 8bit or 4bit but no linear modules were found in your model. this can happen for some architectures such as gpt2 that uses Conv1D instead of Linear layers. Please double check your model architecture, or submit an issue on github if you think this is a bug.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 7/7 [00:07<00:00, 1.05s/it]
trainable params: 974,848 || all params: 6,244,558,848 || trainable%: 0.01561115883009451
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-d5629da83678d2e9/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 16.56it/s]
Traceback (most recent call last):
File "train_qlora.py", line 203, in
train(args)
File "train_qlora.py", line 181, in train
train_dataset = get_datset(global_args.train_data_path, tokenizer, global_args)
File "train_qlora.py", line 82, in get_datset
dataset = data['train'].map(lambda example: tokenize_func(example, tokenizer, global_args),
File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 578, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 543, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 3073, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 3427, in _map_single
example = apply_function_on_filtered_inputs(example, i, offset=offset)
File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 3330, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "train_qlora.py", line 82, in
dataset = data['train'].map(lambda example: tokenize_func(example, tokenizer, global_args),
File "train_qlora.py", line 73, in tokenize_func
question_length = input_ids.index(tokenizer.bos_token_id)
ValueError: None is not in list
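For context, the traceback shows tokenize_func splitting prompt and answer at tokenizer.bos_token_id, and the ChatGLM2 tokenizer reports bos_token_id as None, so index() fails. A hedged workaround (my own sketch, not the repo author's fix; the 'instruction'/'output' field names and the max_length default are assumptions) is to measure the prompt length by encoding the prompt on its own:

def tokenize_func(example, tokenizer, max_length=1024, ignore_label_id=-100):
    # Encode prompt and answer separately so no BOS marker is needed.
    prompt_ids = tokenizer.encode(example['instruction'], add_special_tokens=True)
    answer_ids = tokenizer.encode(example['output'], add_special_tokens=False)
    input_ids = (prompt_ids + answer_ids + [tokenizer.eos_token_id])[:max_length]
    # Compute the loss only on the answer tokens; mask out the prompt.
    labels = ([ignore_label_id] * len(prompt_ids)
              + answer_ids + [tokenizer.eos_token_id])[:max_length]
    return {'input_ids': input_ids, 'labels': labels}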

Merging the base model with the LoRA adapter fails

AttributeError: 'PeftModelForCausalLM' object has no attribute 'merge_and_upload'
During handling of the above exception, another exception occurred:
AttributeError: 'ChatGLMForConditionalGeneration' object has no attribute 'merge_and_upload'

All packages are already up to date. I have been debugging this for a whole day, so I am asking for help!
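The attribute name in the error looks like a typo: the PEFT method that folds LoRA weights back into the base model is merge_and_unload, not merge_and_upload. A minimal sketch (paths are placeholders; on older peft versions the method may live on model.base_model instead):

from transformers import AutoModel
from peft import PeftModel

base = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "saved_files/chatGLM_6B_QLoRA_t32")
merged = model.merge_and_unload()          # note the spelling
merged.save_pretrained("/tmp/merged_model")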

Merging ChatGLM2-6B fails with ValueError: We need an `offload_dir` to dispatch this model according to this `device_map`, the following submodules need to be offloaded

(venv) PS C:\MyFiles\AI\model\chatglm2> python merge_lora_and_quantize.py --lora_path saved_files/chatGLM_6B_QLoRA_t32 --output_path /tmp/merged_qlora_model_4bit --remote_scripts_dir remote_scripts/chatglm2-6b --qbits 4
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:07<00:00,  1.10s/it]
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
Traceback (most recent call last):
  File "C:\MyFiles\AI\model\chatglm2\merge_lora_and_quantize.py", line 80, in <module>
    main(lora_path=args.lora_path,
  File "C:\MyFiles\AI\model\chatglm2\merge_lora_and_quantize.py", line 54, in main
    merged_model, lora_config = merge_lora(lora_path, device_map)
  File "C:\MyFiles\AI\model\chatglm2\merge_lora_and_quantize.py", line 28, in merge_lora
    model = PeftModel.from_pretrained(base_model, lora_path, device_map=device_map)
  File "C:\MyFiles\AI\model\chatglm2\venv\lib\site-packages\peft\peft_model.py", line 181, in from_pretrained
    model.load_adapter(model_id, adapter_name, **kwargs)
  File "C:\MyFiles\AI\model\chatglm2\venv\lib\site-packages\peft\peft_model.py", line 406, in load_adapter
    dispatch_model(
  File "C:\MyFiles\AI\model\chatglm2\venv\lib\site-packages\accelerate\big_modeling.py", line 374, in dispatch_model
    raise ValueError(
ValueError: We need an `offload_dir` to dispatch this model according to this `device_map`, the following submodules need to be offloaded: base_model.model.transformer.encoder.layers.12, base_model.model.transformer.encoder.layers.13, base_model.model.transformer.encoder.layers.14, base_model.model.transformer.encoder.layers.15, base_model.model.transformer.encoder.layers.16, base_model.model.transformer.encoder.layers.17, base_model.model.transformer.encoder.layers.18, base_model.model.transformer.encoder.layers.19, base_model.model.transformer.encoder.layers.20, base_model.model.transformer.encoder.layers.21, base_model.model.transformer.encoder.layers.22, base_model.model.transformer.encoder.layers.23, base_model.model.transformer.encoder.layers.24, base_model.model.transformer.encoder.layers.25, base_model.model.transformer.encoder.layers.26, base_model.model.transformer.encoder.layers.27, base_model.model.transformer.encoder.final_layernorm, base_model.model.transformer.output_layer.
(venv) PS C:\MyFiles\AI\model\chatglm2> 

Base model

chatglm2-6b

GPU memory

16 GB

Relevant dependency versions:

accelerate==0.28.0
transformers==4.38.2
peft==0.3.0

What I tried:

  1. Switching dependency versions
    The result did not change.

  2. Adding an offload_dir argument in merge_lora_and_quantize.py (screenshot omitted)

    This produced the following error:

(venv) PS C:\MyFiles\AI\model\chatglm2> python merge_lora_and_quantize.py --lora_path saved_files/chatGLM_6B_QLoRA_t32 --output_path /tmp/merged_qlora_model_4bit --remote_scripts_dir remote_scripts/chatglm2-6b --qbits 4
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:07<00:00,  1.07s/it]
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
WARNING:root:Some parameters are on the meta device device because they were offloaded to the disk and cpu.
Traceback (most recent call last):
  File "C:\MyFiles\AI\model\chatglm2\merge_lora_and_quantize.py", line 80, in <module>
    main(lora_path=args.lora_path,
  File "C:\MyFiles\AI\model\chatglm2\merge_lora_and_quantize.py", line 56, in main
    quantized_model = quantize(merged_model, qbits)
  File "C:\MyFiles\AI\model\chatglm2\merge_lora_and_quantize.py", line 35, in quantize
    qmodel = model.quantize(qbits).half().cuda()
  File "C:\Users\71977\.cache\huggingface\modules\transformers_modules\chatglm2-6b\modeling_chatglm.py", line 1197, in quantize
    self.transformer.encoder = quantize(self.transformer.encoder, bits, empty_init=empty_init, device=device,
  File "C:\Users\71977\.cache\huggingface\modules\transformers_modules\chatglm2-6b\quantization.py", line 157, in quantize
    weight=layer.self_attention.query_key_value.weight.to(torch.cuda.current_device()),
NotImplementedError: Cannot copy out of meta tensor; no data!
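Both errors point to the same cause: with 16 GB of GPU memory, device_map offloads part of the model to CPU/disk, and the quantize step then hits weights that only exist on the meta device. A hedged workaround (my own sketch, not a confirmed fix) is to perform the merge entirely in CPU RAM by pinning the device map, and only move the merged model to the GPU for quantization as a separate step:

import torch
from transformers import AutoModel
from peft import PeftModel

# Keep every weight materialized in CPU RAM so nothing lands on the meta device.
base = AutoModel.from_pretrained(
    "THUDM/chatglm2-6b",                 # placeholder: your local ChatGLM2-6B path
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map={"": "cpu"},
)
model = PeftModel.from_pretrained(base, "saved_files/chatGLM_6B_QLoRA_t32",
                                  device_map={"": "cpu"})
merged = model.merge_and_unload()
merged.save_pretrained("merged_fp16")    # quantize afterwards, once it fits on the GPU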

How long does the example training run take on a single RTX 3060?

Hello, with a correct setup, roughly how long does one run of the training example from readme.md take?
I would like a reference point to confirm that my configuration is correct.

Single RTX 3060, CUDA 11.7

Example:
python3 train_qlora.py
--train_args_json chatGLM_6B_QLoRA.json
--model_name_or_path THUDM/chatglm-6b
--train_data_path data/train.jsonl
--eval_data_path data/dev.jsonl
--lora_rank 4
--lora_dropout 0.05
--compute_dtype fp32

How do I train on multiple GPUs?

I tried to launch parallel training with
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2
and it fails immediately with:

ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices in any distributed mode. In order to use 8-bit
models that have been loaded across multiple GPUs the solution is to use Naive Pipeline Parallelism. Therefore you should not specify that you
are under any distributed regime in your accelerate config.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 594501) of binary: /DaTa/mambaforge/bin/python3.10

The full log:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /DaTa/dl/MedicalGPT_lora_train/chatGLM-6B-QLoRA/train_qlora.py:206 in         │
│ <module>                                                                                         │
│                                                                                                  │
│   203                                                                                            │
│   204 if __name__ == "__main__":                                                                 │
│   205 │   args = parse_args()                                                                    │
│ ❱ 206 │   train(args)                                                                            │
│   207                                                                                            │
│   208                                                                                            │
│                                                                                                  │
│ /DaTa/dl/MedicalGPT_lora_train/chatGLM-6B-QLoRA/train_qlora.py:200 in train   │
│                                                                                                  │
│   197 │   │   data_collator=data_collator                                                        │
│   198 │   )                                                                                      │
│   199 │                                                                                          │
│ ❱ 200 │   trainer.train(resume_from_checkpoint=resume_from_checkpoint)                           │
│   201 │   trainer.model.save_pretrained(hf_train_args.output_dir)                                │
│   202                                                                                            │
│   203                                                                                            │
│                                                                                                  │
│ /DaTa/mambaforge/lib/python3.10/site-packages/transformers/trainer.py:1645 in │
│ train                                                                                            │
│                                                                                                  │
│   1642 │   │   inner_training_loop = find_executable_batch_size(                                 │
│   1643 │   │   │   self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size  │
│   1644 │   │   )                                                                                 │
│ ❱ 1645 │   │   return inner_training_loop(                                                       │
│   1646 │   │   │   args=args,                                                                    │
│   1647 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │
│   1648 │   │   │   trial=trial,                                                                  │
│                                                                                                  │
│ /DaTa/mambaforge/lib/python3.10/site-packages/transformers/trainer.py:1756 in │
│ _inner_training_loop                                                                             │
│                                                                                                  │
│   1753 │   │   │   │   if self.use_apex:                                                         │
│   1754 │   │   │   │   │   model = self.accelerator.prepare(self.model)                          │
│   1755 │   │   │   │   else:                                                                     │
│ ❱ 1756 │   │   │   │   │   model, self.optimizer = self.accelerator.prepare(self.model, self.op  │
│   1757 │   │   │   else:                                                                         │
│   1758 │   │   │   │   # to handle cases wherein we pass "DummyScheduler" such as when it is sp  │
│   1759 │   │   │   │   model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(      │
│                                                                                                  │
│ /DaTa/mambaforge/lib/python3.10/site-packages/accelerate/accelerator.py:1182  │
│ in prepare                                                                                       │
│                                                                                                  │
│   1179 │   │   elif self.distributed_type == DistributedType.MEGATRON_LM:                        │
│   1180 │   │   │   result = self._prepare_megatron_lm(*args)                                     │
│   1181 │   │   else:                                                                             │
│ ❱ 1182 │   │   │   result = tuple(                                                               │
│   1183 │   │   │   │   self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d i  │
│   1184 │   │   │   )                                                                             │
│   1185 │   │   │   result = tuple(self._prepare_one(obj, device_placement=d) for obj, d in zip(  │
│                                                                                                  │
│ /DaTa/mambaforge/lib/python3.10/site-packages/accelerate/accelerator.py:1183  │
│ in <genexpr>                                                                                     │
│                                                                                                  │
│   1180 │   │   │   result = self._prepare_megatron_lm(*args)                                     │
│   1181 │   │   else:                                                                             │
│   1182 │   │   │   result = tuple(                                                               │
│ ❱ 1183 │   │   │   │   self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d i  │
│   1184 │   │   │   )                                                                             │
│   1185 │   │   │   result = tuple(self._prepare_one(obj, device_placement=d) for obj, d in zip(  │
│   1186                                                                                           │
│                                                                                                  │
│ /DaTa/mambaforge/lib/python3.10/site-packages/accelerate/accelerator.py:1022  │
│ in _prepare_one                                                                                  │
│                                                                                                  │
│   1019 │   │   │   if isinstance(obj, torch.utils.data.DataLoader):                              │
│   1020 │   │   │   │   return self.prepare_data_loader(obj, device_placement=device_placement)   │
│   1021 │   │   │   elif isinstance(obj, torch.nn.Module):                                        │
│ ❱ 1022 │   │   │   │   return self.prepare_model(obj, device_placement=device_placement)         │
│   1023 │   │   │   elif isinstance(obj, torch.optim.Optimizer):                                  │
│   1024 │   │   │   │   optimizer = self.prepare_optimizer(obj, device_placement=device_placemen  │
│   1025 │   │   │   │   return optimizer                                                          │
│                                                                                                  │
│ /DaTa/mambaforge/lib/python3.10/site-packages/accelerate/accelerator.py:1247  │
│ in prepare_model                                                                                 │
│                                                                                                  │
│   1244 │   │   ):                                                                                │
│   1245 │   │   │   model_devices = set(model.hf_device_map.values())                             │
│   1246 │   │   │   if len(model_devices) > 1 and self.distributed_type != DistributedType.NO:    │
│ ❱ 1247 │   │   │   │   raise ValueError(                                                         │
│   1248 │   │   │   │   │   "You can't train a model that has been loaded in 8-bit precision on   │
│   1249 │   │   │   │   │   " In order to use 8-bit models that have been loaded across multiple  │
│   1250 │   │   │   │   │   " Therefore you should not specify that you are under any distribute  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices in any distributed mode. In order to use 8-bit
models that have been loaded across multiple GPUs the solution is to use Naive Pipeline Parallelism. Therefore you should not specify that you
are under any distributed regime in your accelerate config.
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /DaTa/dl/MedicalGPT_lora_train/chatGLM-6B-QLoRA/train_qlora.py:206 in         │
│ <module>                                                                                         │
│                                                                                                  │
│   203                                                                                            │
│   204 if __name__ == "__main__":                                                                 │
│   205 │   args = parse_args()                                                                    │
│ ❱ 206 │   train(args)                                                                            │
│   207                                                                                            │
│   208                                                                                            │
│                                                                                                  │
│ /DaTa/dl/MedicalGPT_lora_train/chatGLM-6B-QLoRA/train_qlora.py:200 in train   │
│                                                                                                  │
│   197 │   │   data_collator=data_collator                                                        │
│   198 │   )                                                                                      │
│   199 │                                                                                          │
│ ❱ 200 │   trainer.train(resume_from_checkpoint=resume_from_checkpoint)                           │
│   201 │   trainer.model.save_pretrained(hf_train_args.output_dir)                                │
│   202                                                                                            │
│   203                                                                                            │
│                                                                                                  │
│ /DaTa/mambaforge/lib/python3.10/site-packages/transformers/trainer.py:1645 in │
│ train                                                                                            │
│                                                                                                  │
│   1642 │   │   inner_training_loop = find_executable_batch_size(                                 │
│   1643 │   │   │   self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size  │
│   1644 │   │   )                                                                                 │
│ ❱ 1645 │   │   return inner_training_loop(                                                       │
│   1646 │   │   │   args=args,                                                                    │
│   1647 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │
│   1648 │   │   │   trial=trial,                                                                  │
│                                                                                                  │
│ /DaTa/mambaforge/lib/python3.10/site-packages/transformers/trainer.py:1756 in │
│ _inner_training_loop                                                                             │
│                                                                                                  │
│   1753 │   │   │   │   if self.use_apex:                                                         │
│   1754 │   │   │   │   │   model = self.accelerator.prepare(self.model)                          │
│   1755 │   │   │   │   else:                                                                     │
│ ❱ 1756 │   │   │   │   │   model, self.optimizer = self.accelerator.prepare(self.model, self.op  │
│   1757 │   │   │   else:                                                                         │
│   1758 │   │   │   │   # to handle cases wherein we pass "DummyScheduler" such as when it is sp  │
│   1759 │   │   │   │   model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(      │
│                                                                                                  │
│ /DaTa/mambaforge/lib/python3.10/site-packages/accelerate/accelerator.py:1182  │
│ in prepare                                                                                       │
│                                                                                                  │
│   1179 │   │   elif self.distributed_type == DistributedType.MEGATRON_LM:                        │
│   1180 │   │   │   result = self._prepare_megatron_lm(*args)                                     │
│   1181 │   │   else:                                                                             │
│ ❱ 1182 │   │   │   result = tuple(                                                               │
│   1183 │   │   │   │   self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d i  │
│   1184 │   │   │   )                                                                             │
│   1185 │   │   │   result = tuple(self._prepare_one(obj, device_placement=d) for obj, d in zip(  │
│                                                                                                  │
│ /DaTa/mambaforge/lib/python3.10/site-packages/accelerate/accelerator.py:1183  │
│ in <genexpr>                                                                                     │
│                                                                                                  │
│   1180 │   │   │   result = self._prepare_megatron_lm(*args)                                     │
│   1181 │   │   else:                                                                             │
│   1182 │   │   │   result = tuple(                                                               │
│ ❱ 1183 │   │   │   │   self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d i  │
│   1184 │   │   │   )                                                                             │
│   1185 │   │   │   result = tuple(self._prepare_one(obj, device_placement=d) for obj, d in zip(  │
│   1186                                                                                           │
│                                                                                                  │
│ /DaTa/mambaforge/lib/python3.10/site-packages/accelerate/accelerator.py:1022  │
│ in _prepare_one                                                                                  │
│                                                                                                  │
│   1019 │   │   │   if isinstance(obj, torch.utils.data.DataLoader):                              │
│   1020 │   │   │   │   return self.prepare_data_loader(obj, device_placement=device_placement)   │
│   1021 │   │   │   elif isinstance(obj, torch.nn.Module):                                        │
│ ❱ 1022 │   │   │   │   return self.prepare_model(obj, device_placement=device_placement)         │
│   1023 │   │   │   elif isinstance(obj, torch.optim.Optimizer):                                  │
│   1024 │   │   │   │   optimizer = self.prepare_optimizer(obj, device_placement=device_placemen  │
│   1025 │   │   │   │   return optimizer                                                          │
│                                                                                                  │
│ /DaTa/mambaforge/lib/python3.10/site-packages/accelerate/accelerator.py:1247  │
│ in prepare_model                                                                                 │
│                                                                                                  │
│   1244 │   │   ):                                                                                │
│   1245 │   │   │   model_devices = set(model.hf_device_map.values())                             │
│   1246 │   │   │   if len(model_devices) > 1 and self.distributed_type != DistributedType.NO:    │
│ ❱ 1247 │   │   │   │   raise ValueError(                                                         │
│   1248 │   │   │   │   │   "You can't train a model that has been loaded in 8-bit precision on   │
│   1249 │   │   │   │   │   " In order to use 8-bit models that have been loaded across multiple  │
│   1250 │   │   │   │   │   " Therefore you should not specify that you are under any distribute  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices in any distributed mode. In order to use 8-bit
models that have been loaded across multiple GPUs the solution is to use Naive Pipeline Parallelism. Therefore you should not specify that you
are under any distributed regime in your accelerate config.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 594501) of binary: /DaTa/mambaforge/bin/python3.10
Traceback (most recent call last):
  File "/DaTa/mambaforge/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/DaTa/mambaforge/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346
  , in wrapper
    return f(*args, **kwargs)
  File "/DaTa/mambaforge/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/DaTa/mambaforge/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/DaTa/mambaforge/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/DaTa/mambaforge/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_qlora.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-07-06_12:13:14
  host      : zhongshanyanke
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 594502)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-06_12:13:14
  host      : zhongshanyanke
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 594501)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
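The error comes from accelerate refusing distributed training when the quantized model has been spread over several GPUs by device_map="auto". A common workaround (not taken from this repo; sketch only) is to give each torchrun rank one full copy of the model on its own GPU by keying the device_map on LOCAL_RANK:

import os
import torch
from transformers import AutoModel, BitsAndBytesConfig

local_rank = int(os.environ.get("LOCAL_RANK", 0))

model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b",
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_compute_dtype=torch.float16),
    # One complete copy per process/GPU, so accelerate sees a single device
    # per rank and distributed training is allowed again.
    device_map={"": local_rank},
)

Launched as before with CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2 train_qlora.py ..., each 4-bit copy then has to fit on its own card.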

Some model parameters end up on the CPU during fine-tuning

What is going on here? I simply ran train_qlora.py and then got this error:
File "/home/jovyan/.cache/huggingface/modules/transformers_modules/chatglm2_6b/modeling_chatglm.py", line 588, in forward
hidden_states, kv_cache = layer(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jovyan/.local/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/jovyan/.cache/huggingface/modules/transformers_modules/chatglm2_6b/modeling_chatglm.py", line 510, in forward
attention_output, kv_cache = self.self_attention(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jovyan/.local/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/jovyan/.cache/huggingface/modules/transformers_modules/chatglm2_6b/modeling_chatglm.py", line 342, in forward
mixed_x_layer = self.query_key_value(hidden_states)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/peft/tuners/lora.py", line 456, in forward
after_A = self.lora_A(self.lora_dropout(x))
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
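The accelerate hooks in the traceback suggest that device_map="auto" offloaded part of the base model to the CPU (which it does when GPU memory is tight), while the LoRA layers were placed on cuda:0. A small check and a hedged fix, assuming the quantized model actually fits on one GPU:

# Inspect where accelerate placed each block of the loaded model.
print(model.hf_device_map)

# If some entries say 'cpu', force a single-GPU placement instead of 'auto'
# when loading (sketch; kwargs are standard transformers options):
#   model = AutoModel.from_pretrained(model_path, trust_remote_code=True,
#                                     load_in_4bit=True, device_map={"": 0})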

Incompatible protobuf and icetk versions

Installing collected packages: protobuf, icetk
Attempting uninstall: protobuf
Found existing installation: protobuf 3.20.2
Uninstalling protobuf-3.20.2:
Successfully uninstalled protobuf-3.20.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
onnx 1.13.1 requires protobuf<4,>=3.20.2, but you have protobuf 3.18.3 which is incompatible.
tensorboard 2.12.2 requires protobuf>=3.19.6, but you have protobuf 3.18.3 which is incompatible.
Successfully installed icetk-0.0.7 protobuf-3.18.3

In the provided Docker environment, tensorboard and onnx require protobuf>=3.19.6, while installing icetk pulls protobuf back down to 3.18.3 (see above). How should I resolve this?

CUDA out of memory when loading the model

Hi, I am using a 3090 with 20 GB of memory available, but loading the model fails with CUDA out of memory. Is there a way to optimize this? I recall that plain LoRA fine-tuning used to run in a bit over 10 GB.

Is something wrong with ChatGLM2-6B LoRA fine-tuning?

(screenshot omitted) LoRA fine-tuning produced only these two files, and the model shows no effect of the fine-tuning at all.
python train_qlora.py --train_args_json chatGLM_6B_QLoRA.json --model_name_or_path /rainbow/zhangjunfeng/bert_models/pytorch/chatglm2-6b --train_data_path /rainbow/zhangjunfeng/ChatGLM-Efficient-Tuning/data/rb.jsonl --lora_rank 4 --lora_dropout 0.05 --compute_dtype fp32

python=3.9
peft==0.4.0
bitsandbytes==0.41.0
Everything else installed per requirements.txt

LoRA fine-tuning of ChatGLM2 fails with CUDA error: invalid argument, please take a look

When LoRA fine-tuning ChatGLM2 I get CUDA error: invalid argument. Windows, Python 3.10, CUDA 11.8.
PS E:\Chatglm2-Qlora\chatGLM-6B-QLoRA-main> python train_qlora.py --train_args_json chatGLM_6B_QLoRA.json --model_name_or_path THUDM/chatglm2-6b --train_data_path data/train.jsonl --eval_data_path data/dev.jsonl --lora_rank 4 --lora_dropout 0.05 --compute_dtype fp16

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\bitsandbytes\libbitsandbytes_cuda118.dll
CUDA SETUP: CUDA runtime path found: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\cudart64_110.dll
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\bitsandbytes\libbitsandbytes_cuda118.dll...
The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████| 7/7 [00:06<00:00, 1.11it/s]
trainable params: 974,848 || all params: 3,389,286,400 || trainable%: 0.0287626327477076
Found cached dataset json (C:/Users/Administrator/.cache/huggingface/datasets/json/default-d642ff6439cea90e/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 243.11it/s]
Found cached dataset json (C:/Users/Administrator/.cache/huggingface/datasets/json/default-bf648ec70cbcb4a4/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.15.3
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
0%| | 0/3581 [00:00<?, ?it/s]use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
Traceback (most recent call last):
File "E:\Chatglm2-Qlora\chatGLM-6B-QLoRA-main\train_qlora.py", line 206, in
train(args)
File "E:\Chatglm2-Qlora\chatGLM-6B-QLoRA-main\train_qlora.py", line 200, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\trainer.py", line 1645, in train
return inner_training_loop(
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\trainer.py", line 1938, in inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\trainer.py", line 2770, in training_step
self.accelerator.backward(loss)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\accelerator.py", line 1821, in backward
loss.backward(**kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch_tensor.py", line 487, in backward
torch.autograd.backward(
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd_init
.py", line 200, in backward
Variable.execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\function.py", line 274, in apply
return user_fn(self, *args)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd_init
.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ E:\Chatglm2-Qlora\chatGLM-6B-QLoRA-main\train_qlora.py:206 in │
│ │
│ 203 │
│ 204 if __name__ == "__main__": │
│ 205 │ args = parse_args() │
│ ❱ 206 │ train(args) │
│ 207 │
│ 208 │
│ │
│ E:\Chatglm2-Qlora\chatGLM-6B-QLoRA-main\train_qlora.py:200 in train │
│ │
│ 197 │ │ data_collator=data_collator │
│ 198 │ ) │
│ 199 │ │
│ ❱ 200 │ trainer.train(resume_from_checkpoint=resume_from_checkpoint) │
│ 201 │ trainer.model.save_pretrained(hf_train_args.output_dir) │
│ 202 │
│ 203 │
│ │
│ C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\tr │
│ ainer.py:1645 in train │
│ │
│ 1642 │ │ inner_training_loop = find_executable_batch_size( │
│ 1643 │ │ │ self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size │
│ 1644 │ │ ) │
│ ❱ 1645 │ │ return inner_training_loop( │
│ 1646 │ │ │ args=args, │
│ 1647 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1648 │ │ │ trial=trial, │
│ │
│ C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\tr │
│ ainer.py:1938 in _inner_training_loop │
│ │
│ 1935 │ │ │ │ │ self.control = self.callback_handler.on_step_begin(args, self.state, │
│ 1936 │ │ │ │ │
│ 1937 │ │ │ │ with self.accelerator.accumulate(model): │
│ ❱ 1938 │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1939 │ │ │ │ │
│ 1940 │ │ │ │ if ( │
│ 1941 │ │ │ │ │ args.logging_nan_inf_filter │
│ │
│ C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\tr │
│ ainer.py:2770 in training_step │
│ │
│ 2767 │ │ │ with amp.scale_loss(loss, self.optimizer) as scaled_loss: │
│ 2768 │ │ │ │ scaled_loss.backward() │
│ 2769 │ │ else: │
│ ❱ 2770 │ │ │ self.accelerator.backward(loss) │
│ 2771 │ │ │
│ 2772 │ │ return loss.detach() / self.args.gradient_accumulation_steps │
│ 2773 │
│ │
│ C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\acce │
│ lerator.py:1821 in backward │
│ │
│ 1818 │ │ elif self.scaler is not None: │
│ 1819 │ │ │ self.scaler.scale(loss).backward(**kwargs) │
│ 1820 │ │ else: │
│ ❱ 1821 │ │ │ loss.backward(**kwargs) │
│ 1822 │ │
│ 1823 │ def unscale_gradients(self, optimizer=None): │
│ 1824 │ │ """ │
│ │
│ C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\_tensor.py:487 in backward │
│ │
│ 484 │ │ │ │ create_graph=create_graph, │
│ 485 │ │ │ │ inputs=inputs, │
│ 486 │ │ │ ) │
│ ❱ 487 │ │ torch.autograd.backward( │
│ 488 │ │ │ self, gradient, retain_graph, create_graph, inputs=inputs │
│ 489 │ │ ) │
│ 490 │
│ │
│ C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\__init__.py:200 in backward │
│ │
│ 197 │ # The reason we repeat same the comment below is that │
│ 198 │ # some Python versions print out the first line of a multi-line function │
│ 199 │ # calls in the traceback and some print out the last line │
│ ❱ 200 │ Variable._execution_engine.run_backward( # Calls into the C++ engine to run the bac │
│ 201 │ │ tensors, grad_tensors_, retain_graph, create_graph, inputs, │
│ 202 │ │ allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to ru │
│ 203 │
│ │
│ C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\ │
│ function.py:274 in apply │
│ │
│ 271 │ │ │ │ │ │ │ "Function is not allowed. You should only implement one " │
│ 272 │ │ │ │ │ │ │ "of them.") │
│ 273 │ │ user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn │
│ ❱ 274 │ │ return user_fn(self, *args) │
│ 275 │ │
│ 276 │ def apply_jvp(self, *args): │
│ 277 │ │ # _forward_cls is defined by derived class │
│ │
│ C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\che │
│ ckpoint.py:157 in backward │
│ │
│ 154 │ │ │ raise RuntimeError( │
│ 155 │ │ │ │ "none of output has requires_grad=True," │
│ 156 │ │ │ │ " this checkpoint() is not necessary") │
│ ❱ 157 │ │ torch.autograd.backward(outputs_with_grad, args_with_grad) │
│ 158 │ │ grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else None │
│ 159 │ │ │ │ │ for inp in detached_inputs) │
│ 160 │
│ │
│ C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\ │
│ __init__.py:200 in backward │
│ │
│ 197 │ # The reason we repeat same the comment below is that │
│ 198 │ # some Python versions print out the first line of a multi-line function │
│ 199 │ # calls in the traceback and some print out the last line │
│ ❱ 200 │ Variable.execution_engine.run_backward( # Calls into the C++ engine to run the bac │
│ 201 │ │ tensors, grad_tensors_, retain_graph, create_graph, inputs, │
│ 202 │ │ allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to ru │
│ 203 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Inference performance after fine-tuning

After fine-tuning, I merged the adapter and quantized the model to int4, but inference on this new model is noticeably slower than on the official int4 model.
However, if I instead replace the official pytorch_model.bin with the fine-tuned pytorch_model.bin and then run inference, the speed is about the same as the official model.
(screenshot attached)
Where does the problem come from? Do I also need to modify other files of the new model?

Model killed at "Loading checkpoint shards" after merging

For ChatGLM2-6B: after fine-tuning on my dataset, inference with the adapter works, but after merging, the official cli_demo dies at "Loading checkpoint shards: 0%| Killed". The fp32-merged model is 23.2 GB; converting it to fp16 gives 11.6 GB, but the process is still killed. (A loading sketch follows after the commands below.)
Training config:

{
    "output_dir": "saved_files/chatGLM_6B_QLoRA_t32",
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "per_device_eval_batch_size": 4,
    "learning_rate": 1e-3,
    "num_train_epochs": 1.0,
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.1,
    "logging_steps": 100,
    "save_strategy": "steps",
    "save_steps": 500,
    "evaluation_strategy": "steps",
    "eval_steps": 500,
    "optim": "adamw_torch",
    "fp16": false,
    "remove_unused_columns": false,
    "ddp_find_unused_parameters": false,
    "seed": 42
}

Training command:

python3 train_qlora.py --train_args_json chatGLM_6B_QLoRA.json --model_name_or_path chatglm2-6b --train_data_path data/train.jsonl --eval_data_path data/dev.jsonl --lora_rank 4 --lora_dropout 0.05 --compute_dtype fp32

Merge command:

python3 merge_lora_and_quantize.py --lora_path QLoRA_20230811_2500 --output_path output_merged/QLoRA_20230811_2500 --remote_scripts_dir remote_scripts/chatglm2-6b --device auto --qbits 0
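
A minimal loading sketch for the merged model, assuming the "Killed" comes from the OS out-of-memory killer while the 23 GB fp32 checkpoint is read into system RAM: loading in fp16 with low_cpu_mem_usage=True streams the shards instead of first building a full fp32 state dict, which roughly halves peak memory. The path below is the merged output directory from the command above.

import torch
from transformers import AutoModel, AutoTokenizer

merged_path = "output_merged/QLoRA_20230811_2500"  # merged-model directory (from the merge command)
tokenizer = AutoTokenizer.from_pretrained(merged_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    merged_path,
    trust_remote_code=True,
    torch_dtype=torch.float16,   # do not materialise fp32 weights
    low_cpu_mem_usage=True,      # load shard by shard
).cuda().eval()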

ValueError:

ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded
the model on the correct device using for example device_map={'': torch.cuda.current_device()} or device_map={'': torch.xpu.current_device()}.

I hit this error while running the QLoRA training code. Is it caused by how the training device is set up? How should I change it? Any help is appreciated. (A sketch follows below.)
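
A minimal sketch, assuming the error comes from device_map='auto' placing parts of the quantized model on a different GPU than the one the Trainer runs on. Pinning the whole model to the current device, as the error message itself suggests, is the usual fix; the model path is an assumption.

import torch
from transformers import AutoModel, BitsAndBytesConfig

q_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float32)
model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b",                             # assumed model path
    quantization_config=q_config,
    device_map={"": torch.cuda.current_device()},   # keep every module on one GPU
    trust_remote_code=True,
)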

Questions about training VRAM

1. VRAM usage: with the same config and parameters I see about 20 GB, while the author reports a bit over 9 GB. Is the batch size 4?
2. Why does VRAM usage keep growing during training?
3. Memory-wise there doesn't seem to be a clear advantage over other fine-tuning methods such as P-Tuning v2.

Problem encountered after training finishes

I ran the command given in the README from the project root: python3 train_qlora.py --train_args_json chatGLM_6B_QLoRA.json --model_name_or_path THUDM/chatglm-6b --train_data_path data/train.jsonl --eval_data_path data/dev.jsonl --lora_rank 4 --lora_dropout 0.05 --compute_dtype fp32. Training completed, but then I ran into the following. How can this be resolved?

100%|██████████████████████████████████████████████████████████| 3581/3581 [11:25:57<00:00, 11.49s/it]
wandb: Waiting for W&B process to finish... (success).
wandb: Network error (TransientError), entering retry loop.
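
A minimal sketch, assuming the hang comes from the W&B callback retrying a network upload after training has already finished: disable the reporting integration before the Trainer is created (or set "report_to": "none" in chatGLM_6B_QLoRA.json, since report_to is a standard TrainingArguments field).

import os
os.environ["WANDB_DISABLED"] = "true"   # set before transformers.Trainer is constructed
# or keep the logs locally without any upload:
# os.environ["WANDB_MODE"] = "offline"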

Metrics chart broken after training ends

Training with chatglm, the train-loss curve looks about the same as the one in the README.
Training with chatglm2 (the only change was setting lora_rank to 8), training finishes, but the train-loss chart shows a single point while the eval loss looks normal.
(screenshot 2023-08-09 125212)
The train loss in the logs is also normal; only the chart shows a single point. What could be the cause?
(screenshot 2023-08-09 130141)

Question about merging

Looking at the peft source, a model loaded in quantized form cannot be merged directly, right? So to merge you would have to load and train in full precision, merge, and only then quantize with the official GLM script. Since your training code loads the model quantized, wouldn't the LoRA layer weights have to be dequantized (or the base model reloaded unquantized) before merging? (A sketch follows below.)
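
A minimal sketch of the merge flow implied above (paths are assumptions): reload the base model in half precision rather than 4-bit, attach the trained adapter, merge, and quantize afterwards if needed.

import torch
from transformers import AutoModel
from peft import PeftModel

base = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "saved_files/chatGLM_6B_QLoRA_t32")   # adapter dir (assumed)
merged = model.merge_and_unload()   # folds the LoRA deltas into the base Linear weights
merged.save_pretrained("merged_fp16")
# int4 only after merging, e.g. via ChatGLM's own quantize() method:
# merged.quantize(4).save_pretrained("merged_int4")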

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward). How should the training code be adjusted to support multi-GPU training? (See the sketch after the traceback below.)

Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-bc6f7aa3b7d4f48f/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-8d547816b9814051.arrow
0%| | 0/3581 [00:00<?, ?it/s]use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /mnt/disk_data/LLM/chatGLM-6B-QLoRA-main/train_qlora.py:229 in │
│ │
│ 226 │
│ 227 if __name__ == "__main__": │
│ 228 │ args = parse_args() │
│ ❱ 229 │ train(args) │
│ 230 │
│ 231 │
│ │
│ /mnt/disk_data/LLM/chatGLM-6B-QLoRA-main/train_qlora.py:223 in train │
│ │
│ 220 │ │ data_collator=data_collator │
│ 221 │ ) │
│ 222 │ │
│ ❱ 223 │ trainer.train(resume_from_checkpoint=resume_from_checkpoint) │
│ 224 │ trainer.model.save_pretrained(hf_train_args.output_dir) │
│ 225 │
│ 226 │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/transform │
│ ers/trainer.py:1645 in train │
│ │
│ 1642 │ │ inner_training_loop = find_executable_batch_size( │
│ 1643 │ │ │ self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size │
│ 1644 │ │ ) │
│ ❱ 1645 │ │ return inner_training_loop( │
│ 1646 │ │ │ args=args, │
│ 1647 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1648 │ │ │ trial=trial, │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/transform │
│ ers/trainer.py:1938 in _inner_training_loop │
│ │
│ 1935 │ │ │ │ │ self.control = self.callback_handler.on_step_begin(args, self.state, │
│ 1936 │ │ │ │ │
│ 1937 │ │ │ │ with self.accelerator.accumulate(model): │
│ ❱ 1938 │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1939 │ │ │ │ │
│ 1940 │ │ │ │ if ( │
│ 1941 │ │ │ │ │ args.logging_nan_inf_filter │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/transform │
│ ers/trainer.py:2759 in training_step │
│ │
│ 2756 │ │ │ return loss_mb.reduce_mean().detach().to(self.args.device) │
│ 2757 │ │ │
│ 2758 │ │ with self.compute_loss_context_manager(): │
│ ❱ 2759 │ │ │ loss = self.compute_loss(model, inputs) │
│ 2760 │ │ │
│ 2761 │ │ if self.args.n_gpu > 1: │
│ 2762 │ │ │ loss = loss.mean() # mean() to average on multi-gpu parallel training │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/transform │
│ ers/trainer.py:2784 in compute_loss │
│ │
│ 2781 │ │ │ labels = inputs.pop("labels") │
│ 2782 │ │ else: │
│ 2783 │ │ │ labels = None │
│ ❱ 2784 │ │ outputs = model(**inputs) │
│ 2785 │ │ # Save past state if it exists │
│ 2786 │ │ # TODO: this needs to be fixed and made cleaner later. │
│ 2787 │ │ if self.args.past_index >= 0: │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/torch/nn/ │
│ modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/peft/peft │
│ _model.py:922 in forward │
│ │
│ 919 │ │ │ │ │ **kwargs, │
│ 920 │ │ │ │ ) │
│ 921 │ │ │ │
│ ❱ 922 │ │ │ return self.base_model( │
│ 923 │ │ │ │ input_ids=input_ids, │
│ 924 │ │ │ │ attention_mask=attention_mask, │
│ 925 │ │ │ │ inputs_embeds=inputs_embeds, │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/torch/nn/ │
│ modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/accelerat │
│ e/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /root/.cache/huggingface/modules/transformers_modules/ChatGLM2-6B/modeling_chatglm.py:960 in │
│ forward │
│ │
│ 957 │ │ │ shift_labels = labels[..., 1:].contiguous() │
│ 958 │ │ │ # Flatten the tokens │
│ 959 │ │ │ loss_fct = CrossEntropyLoss(ignore_index=-100) │
│ ❱ 960 │ │ │ loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.v │
│ 961 │ │ │ │
│ 962 │ │ │ lm_logits = lm_logits.to(hidden_states.dtype) │
│ 963 │ │ │ loss = loss.to(hidden_states.dtype) │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/torch/nn/ │
│ modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/torch/nn/ │
│ modules/loss.py:1174 in forward │
│ │
│ 1171 │ │ self.label_smoothing = label_smoothing │
│ 1172 │ │
│ 1173 │ def forward(self, input: Tensor, target: Tensor) -> Tensor: │
│ ❱ 1174 │ │ return F.cross_entropy(input, target, weight=self.weight, │
│ 1175 │ │ │ │ │ │ │ ignore_index=self.ignore_index, reduction=self.reduction, │
│ 1176 │ │ │ │ │ │ │ label_smoothing=self.label_smoothing) │
│ 1177 │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/torch/nn/ │
│ functional.py:3029 in cross_entropy │
│ │
│ 3026 │ │ ) │
│ 3027 │ if size_average is not None or reduce is not None: │
│ 3028 │ │ reduction = _Reduction.legacy_get_string(size_average, reduce) │
│ ❱ 3029 │ return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(re │
│ 3030 │
│ 3031 │
│ 3032 def binary_cross_entropy( │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking
argument for argument target in method wrapper_CUDA_nll_loss_forward)
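
A minimal sketch (an assumption about the cause, not the repo's official multi-GPU recipe): with device_map='auto' the model gets sharded across cuda:0 and cuda:1, so the labels can end up on a different card than the logits. For data-parallel training with torchrun, keeping each process's model on a single GPU avoids this.

import os
import torch
from transformers import AutoModel, BitsAndBytesConfig

local_rank = int(os.environ.get("LOCAL_RANK", 0))   # set by torchrun for each process

q_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float32,
)

model = AutoModel.from_pretrained(
    "THUDM/chatglm2-6b",                # assumed model path
    quantization_config=q_config,
    device_map={"": local_rank},        # whole model on this process's GPU
    trust_remote_code=True,
)
# launch: torchrun --nproc_per_node=2 train_qlora.py ...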

peft error

peft==0.3.0
The error is:
ImportError: cannot import name 'prepare_model_for_kbit_training' from 'peft' (/home/mlamp/miniconda3/envs/chatglm/lib/python3.9/site-packages/peft/__init__.py)
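
prepare_model_for_kbit_training only exists in newer peft releases (0.4.0 and later), so upgrading peft (pip install -U "peft>=0.4.0") is the simplest fix. If you have to stay on peft==0.3.0, a hedged fallback to its older helper looks like this:

try:
    from peft import prepare_model_for_kbit_training
except ImportError:   # peft==0.3.0 and older
    from peft import prepare_model_for_int8_training as prepare_model_for_kbit_training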

Shouldn't the chatglm2-6b inputs match the official format?

The official chatglm2 code builds inputs with the following build_inputs function:

def build_inputs(self, tokenizer, query: str, history: List[Tuple[str, str]] = None):
      prompt = ""
      for i, (old_query, response) in enumerate(history):
          prompt += "[Round {}]\n\n问:{}\n\n答:{}\n\n".format(i + 1, old_query, response)
      prompt += "[Round {}]\n\n问:{}\n\n答:".format(len(history) + 1, query)
      inputs = tokenizer([prompt], return_tensors="pt")
      inputs = inputs.to(self.device)
      return inputs

In other words, whether or not there is multi-turn history, the template

[Round {}]\n\n问:{}\n\n答:

is always present. But the training code processes the data like this:

def tokenize_func(example, tokenizer, global_args, ignore_label_id=-100):
    """单样本tokenize处理"""
    question = global_args.prompt_text + example['instruction']
    if example.get('input', None):
        if example['input'].strip():
            question += f'''\n{example['input']}'''
    answer = example['output']
    q_ids = tokenizer.encode(text=question, add_special_tokens=False)
    a_ids = tokenizer.encode(text=answer, add_special_tokens=False)
    if len(q_ids) > global_args.max_input_length - 2:  # 2 - gmask, bos
        q_ids = q_ids[: global_args.max_input_length - 2]
    if len(a_ids) > global_args.max_output_length - 1:  # 1 - eos
        a_ids = a_ids[: global_args.max_output_length - 1]
    input_ids = tokenizer.build_inputs_with_special_tokens(q_ids, a_ids)
    # question_length = input_ids.index(tokenizer.bos_token_id)
    question_length = len(q_ids) + 2  # chatglm1 - gmask, bos, chatglm2 - gmask, sop
    labels = [ignore_label_id] * question_length + input_ids[question_length:]
    return {'input_ids': input_ids, 'labels': labels}

The template above is not added here. Wouldn't it be more appropriate to add it so that training aligns with chatglm2's inference format? (A sketch follows below.)
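
A minimal sketch of that alignment (an assumption, not the repo's current behaviour): wrap each single-turn sample in the same "[Round N]\n\n问：…\n\n答：" template that chatglm2's build_inputs uses, so training prompts match the official inference format.

def build_chatglm2_prompt(instruction: str, model_input: str = "") -> str:
    """Format a single-turn sample the way chatglm2's build_inputs does."""
    query = instruction + (f"\n{model_input}" if model_input.strip() else "")
    return "[Round 1]\n\n问：{}\n\n答：".format(query)

# inside tokenize_func, instead of concatenating the raw instruction text:
# question = global_args.prompt_text + build_chatglm2_prompt(example['instruction'], example.get('input', '') or '')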

LoRA vs QLoRA comparison

Hi,
1. Have you compared the results of LoRA and QLoRA?
2. Another question: with several QLoRA adapters, can I load one base model (only once) plus multiple QLoRA adapters and combine them freely depending on the situation, without merging each time? (See the sketch below.)
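
A minimal sketch using peft's multi-adapter API (adapter names and paths are assumptions): load the base model once, register several adapters, and switch between them without merging.

from transformers import AutoModel
from peft import PeftModel

base = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).half().cuda()

model = PeftModel.from_pretrained(base, "saved_files/task_a", adapter_name="task_a")
model.load_adapter("saved_files/task_b", adapter_name="task_b")

model.set_adapter("task_a")   # route forward passes through adapter A
# ... run inference ...
model.set_adapter("task_b")   # switch to adapter B, no merge needed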

Out of VRAM when merging the model (RTX 4090, 24 GB)

Traceback (most recent call last):

File "/root/autodl-tmp/chatGLM-6B-QLoRA-main/merge_lora_and_quantize.py", line 82, in
main(lora_path=args.lora_path,
File "/root/autodl-tmp/chatGLM-6B-QLoRA-main/merge_lora_and_quantize.py", line 56, in main
merged_model, lora_config = merge_lora(lora_path, device_map)
File "/root/autodl-tmp/chatGLM-6B-QLoRA-main/merge_lora_and_quantize.py", line 30, in merge_lora
model = PeftModel.from_pretrained(base_model, lora_path, device_map=device_map)
File "/root/miniconda3/lib/python3.10/site-packages/peft/peft_model.py", line 181, in from_pretrained
model.load_adapter(model_id, adapter_name, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/peft/peft_model.py", line 372, in load_adapter
adapters_weights = torch.load(
File "/root/miniconda3/lib/python3.10/site-packages/torch/serialization.py", line 1014, in load
return _load(opened_zipfile,
File "/root/miniconda3/lib/python3.10/site-packages/torch/serialization.py", line 1422, in _load
result = unpickler.load()
File "/root/miniconda3/lib/python3.10/site-packages/torch/serialization.py", line 1392, in persistent_load
typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
File "/root/miniconda3/lib/python3.10/site-packages/torch/serialization.py", line 1366, in load_tensor
wrap_storage=restore_location(storage, location),
File "/root/miniconda3/lib/python3.10/site-packages/torch/serialization.py", line 1299, in restore_location
return default_restore_location(storage, str(map_location))
File "/root/miniconda3/lib/python3.10/site-packages/torch/serialization.py", line 381, in default_restore_location
result = fn(storage, location)
File "/root/miniconda3/lib/python3.10/site-packages/torch/serialization.py", line 279, in _cuda_deserialize
return obj.cuda(device)
File "/root/miniconda3/lib/python3.10/site-packages/torch/_utils.py", line 114, in _cuda
untyped_storage = torch.UntypedStorage(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacty of 23.65 GiB of which 2.06 MiB is free. Process 97728 has 23.64 GiB memory in use. Of the allocated memory 23.26 GiB is allocated by PyTorch, and 39.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
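
A minimal sketch, assuming the 24 GB card cannot hold the fp16 base model plus the adapter during the merge: run the merge entirely on CPU (it then uses system RAM instead of VRAM) and quantize or move the result afterwards. Paths are assumptions; if your peft version still loads the adapter weights onto the GPU, hiding the GPU with CUDA_VISIBLE_DEVICES="" has the same effect.

import torch
from transformers import AutoModel
from peft import PeftModel

base = AutoModel.from_pretrained(
    "THUDM/chatglm2-6b", trust_remote_code=True,
    torch_dtype=torch.float16, device_map={"": "cpu"},
)
model = PeftModel.from_pretrained(base, "saved_files/chatGLM_6B_QLoRA_t32", device_map={"": "cpu"})
merged = model.merge_and_unload()
merged.save_pretrained("output_merged/merged_fp16")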

Is batch inference supported?

The code seems to only generate output for one sample at a time. Is batch inference possible, or is there reference code for it? (A sketch follows below.)
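
A minimal sketch of batched generation with the plain Hugging Face API (not code from this repo); the model path and prompts are assumptions. ChatGLM's tokenizer typically pads on the left; if yours pads on the right, set tokenizer.padding_side = "left" before encoding so the continuation can be sliced off after the common padded prompt length.

import torch
from transformers import AutoModel, AutoTokenizer

model_path = "THUDM/chatglm2-6b"   # or a merged fine-tuned model (assumed)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda().eval()

prompts = ["类型#上衣*材质#牛仔布*风格#简约", "类型#裙*风格#清新"]   # example inputs
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

prompt_len = inputs["input_ids"].shape[1]
for out in outputs:
    print(tokenizer.decode(out[prompt_len:], skip_special_tokens=True))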

RuntimeError: mat1 and mat2 shapes cannot be multiplied (588x4096 and 1x9437184) while fine-tuning ChatGLM2-6B. How should the parameters be set?

Shell arguments:
python3 train_qlora.py \
--train_args_json chatGLM_6B_QLoRA.json \
--model_name_or_path /mnt/disk_data/soft/text-generation-webui-main/models/ChatGLM2-6B \
--train_data_path data/train.jsonl \
--eval_data_path data/dev.jsonl \
--lora_rank 4 \
--lora_dropout 0.05 \
--compute_dtype fp16 \
--max_input_length 64 \
--max_output_length 128

bin /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Using the WANDB_DISABLED environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████| 7/7 [00:08<00:00, 1.24s/it]
trainable params: 974,848 || all params: 3,287,312,384 || trainable%: 0.029654863491062736
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-e0d9f754b035be5e/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|██████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 321.06it/s]
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-e0d9f754b035be5e/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-55c45c43ff9cb6e7.arrow
Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/json/default-e0d9f754b035be5e/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-ac4d0aad0908daff.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-e0d9f754b035be5e/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-be803c5d4b114f15.arrow
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-bc6f7aa3b7d4f48f/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|██████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 938.95it/s]
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-bc6f7aa3b7d4f48f/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-d09ce473a7ac4a01.arrow
Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/json/default-bc6f7aa3b7d4f48f/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-ec63b687e0fed7f9.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-bc6f7aa3b7d4f48f/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-8d547816b9814051.arrow
0%| | 0/3581 [00:00<?, ?it/s]use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /mnt/disk_data/LLM/chatGLM-6B-QLoRA-main/train_qlora.py:229 in │
│ │
│ 226 │
│ 227 if __name__ == "__main__": │
│ 228 │ args = parse_args() │
│ ❱ 229 │ train(args) │
│ 230 │
│ 231 │
│ │
│ /mnt/disk_data/LLM/chatGLM-6B-QLoRA-main/train_qlora.py:223 in train │
│ │
│ 220 │ │ data_collator=data_collator │
│ 221 │ ) │
│ 222 │ │
│ ❱ 223 │ trainer.train(resume_from_checkpoint=resume_from_checkpoint) │
│ 224 │ trainer.model.save_pretrained(hf_train_args.output_dir) │
│ 225 │
│ 226 │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/transform │
│ ers/trainer.py:1645 in train │
│ │
│ 1642 │ │ inner_training_loop = find_executable_batch_size( │
│ 1643 │ │ │ self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size │
│ 1644 │ │ ) │
│ ❱ 1645 │ │ return inner_training_loop( │
│ 1646 │ │ │ args=args, │
│ 1647 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1648 │ │ │ trial=trial, │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/transform │
│ ers/trainer.py:1938 in _inner_training_loop │
│ │
│ 1935 │ │ │ │ │ self.control = self.callback_handler.on_step_begin(args, self.state, │
│ 1936 │ │ │ │ │
│ 1937 │ │ │ │ with self.accelerator.accumulate(model): │
│ ❱ 1938 │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1939 │ │ │ │ │
│ 1940 │ │ │ │ if ( │
│ 1941 │ │ │ │ │ args.logging_nan_inf_filter │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/transform │
│ ers/trainer.py:2759 in training_step │
│ │
│ 2756 │ │ │ return loss_mb.reduce_mean().detach().to(self.args.device) │
│ 2757 │ │ │
│ 2758 │ │ with self.compute_loss_context_manager(): │
│ ❱ 2759 │ │ │ loss = self.compute_loss(model, inputs) │
│ 2760 │ │ │
│ 2761 │ │ if self.args.n_gpu > 1: │
│ 2762 │ │ │ loss = loss.mean() # mean() to average on multi-gpu parallel training │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/transform │
│ ers/trainer.py:2784 in compute_loss │
│ │
│ 2781 │ │ │ labels = inputs.pop("labels") │
│ 2782 │ │ else: │
│ 2783 │ │ │ labels = None │
│ ❱ 2784 │ │ outputs = model(**inputs) │
│ 2785 │ │ # Save past state if it exists │
│ 2786 │ │ # TODO: this needs to be fixed and made cleaner later. │
│ 2787 │ │ if self.args.past_index >= 0: │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/torch/nn/ │
│ modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/accelerat │
│ e/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /root/.cache/huggingface/modules/transformers_modules/ChatGLM2-6B/modeling_chatglm.py:934 in │
│ forward │
│ │
│ 931 │ │ use_cache = use_cache if use_cache is not None else self.config.use_cache │
│ 932 │ │ return_dict = return_dict if return_dict is not None else self.config.use_return │
│ 933 │ │ │
│ ❱ 934 │ │ transformer_outputs = self.transformer( │
│ 935 │ │ │ input_ids=input_ids, │
│ 936 │ │ │ position_ids=position_ids, │
│ 937 │ │ │ attention_mask=attention_mask, │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/torch/nn/ │
│ modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /root/.cache/huggingface/modules/transformers_modules/ChatGLM2-6B/modeling_chatglm.py:830 in │
│ forward │
│ │
│ 827 │ │ rotary_pos_emb = rotary_pos_emb.transpose(0, 1).contiguous() │
│ 828 │ │ │
│ 829 │ │ # Run encoder. │
│ ❱ 830 │ │ hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder( │
│ 831 │ │ │ inputs_embeds, full_attention_mask, rotary_pos_emb=rotary_pos_emb, │
│ 832 │ │ │ kv_caches=past_key_values, use_cache=use_cache, output_hidden_states=output_ │
│ 833 │ │ ) │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/torch/nn/ │
│ modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /root/.cache/huggingface/modules/transformers_modules/ChatGLM2-6B/modeling_chatglm.py:631 in │
│ forward │
│ │
│ 628 │ │ │ │
│ 629 │ │ │ layer = self._get_layer(index) │
│ 630 │ │ │ if self.gradient_checkpointing and self.training: │
│ ❱ 631 │ │ │ │ layer_ret = torch.utils.checkpoint.checkpoint( │
│ 632 │ │ │ │ │ layer, │
│ 633 │ │ │ │ │ hidden_states, │
│ 634 │ │ │ │ │ attention_mask, │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/torch/uti │
│ ls/checkpoint.py:249 in checkpoint │
│ │
│ 246 │ │ raise ValueError("Unexpected keyword arguments: " + ",".join(arg for arg in kwar │
│ 247 │ │
│ 248 │ if use_reentrant: │
│ ❱ 249 │ │ return CheckpointFunction.apply(function, preserve, *args) │
│ 250 │ else: │
│ 251 │ │ return _checkpoint_without_reentrant( │
│ 252 │ │ │ function, │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/torch/aut │
│ ograd/function.py:506 in apply │
│ │
│ 503 │ │ if not torch._C._are_functorch_transforms_active(): │
│ 504 │ │ │ # See NOTE: [functorch vjp and autograd interaction] │
│ 505 │ │ │ args = _functorch.utils.unwrap_dead_wrappers(args) │
│ ❱ 506 │ │ │ return super().apply(*args, **kwargs) # type: ignore[misc] │
│ 507 │ │ │
│ 508 │ │ if cls.setup_context == _SingleLevelFunction.setup_context: │
│ 509 │ │ │ raise RuntimeError( │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/torch/uti │
│ ls/checkpoint.py:107 in forward │
│ │
│ 104 │ │ ctx.save_for_backward(*tensor_inputs) │
│ 105 │ │ │
│ 106 │ │ with torch.no_grad(): │
│ ❱ 107 │ │ │ outputs = run_function(*args) │
│ 108 │ │ return outputs │
│ 109 │ │
│ 110 │ @staticmethod
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/torch/nn/ │
│ modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/accelerat │
│ e/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /root/.cache/huggingface/modules/transformers_modules/ChatGLM2-6B/modeling_chatglm.py:544 in │
│ forward │
│ │
│ 541 │ │ # Layer norm at the beginning of the transformer layer. │
│ 542 │ │ layernorm_output = self.input_layernorm(hidden_states) │
│ 543 │ │ # Self attention. │
│ ❱ 544 │ │ attention_output, kv_cache = self.self_attention( │
│ 545 │ │ │ layernorm_output, │
│ 546 │ │ │ attention_mask, │
│ 547 │ │ │ rotary_pos_emb, │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/torch/nn/ │
│ modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/accelerat │
│ e/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /root/.cache/huggingface/modules/transformers_modules/ChatGLM2-6B/modeling_chatglm.py:376 in │
│ forward │
│ │
│ 373 │ │ # ===================== │
│ 374 │ │ │
│ 375 │ │ # Attention heads [sq, b, h] --> [sq, b, (np * 3 * hn)] │
│ ❱ 376 │ │ mixed_x_layer = self.query_key_value(hidden_states) │
│ 377 │ │ │
│ 378 │ │ if self.multi_query_attention: │
│ 379 │ │ │ (query_layer, key_layer, value_layer) = mixed_x_layer.split( │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/torch/nn/ │
│ modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/accelerat │
│ e/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/peft/tune │
│ rs/lora.py:1123 in forward │
│ │
│ 1120 │ │ │ │ self.active_adapter = adapter_name │
│ 1121 │ │ │ │
│ 1122 │ │ │ def forward(self, x: torch.Tensor): │
│ ❱ 1123 │ │ │ │ result = super().forward(x) │
│ 1124 │ │ │ │ │
│ 1125 │ │ │ │ if self.disable_adapters or self.active_adapter not in self.lora_A.keys( │
│ 1126 │ │ │ │ │ return result │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/bitsandby │
│ tes/nn/modules.py:219 in forward │
│ │
│ 216 │ │ │ x = x.to(self.compute_dtype) │
│ 217 │ │ │
│ 218 │ │ bias = None if self.bias is None else self.bias.to(self.compute_dtype) │
│ ❱ 219 │ │ out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.qua │
│ 220 │ │ │
│ 221 │ │ out = out.to(inp_dtype) │
│ 222 │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/bitsandby │
│ tes/autograd/_functions.py:564 in matmul_4bit │
│ │
│ 561 │
│ 562 def matmul_4bit(A: tensor, B: tensor, quant_state: List, out: tensor = None, bias=None): │
│ 563 │ assert quant_state is not None │
│ ❱ 564 │ return MatMul4Bit.apply(A, B, out, bias, quant_state) │
│ 565 │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/torch/aut │
│ ograd/function.py:506 in apply │
│ │
│ 503 │ │ if not torch._C._are_functorch_transforms_active(): │
│ 504 │ │ │ # See NOTE: [functorch vjp and autograd interaction] │
│ 505 │ │ │ args = _functorch.utils.unwrap_dead_wrappers(args) │
│ ❱ 506 │ │ │ return super().apply(*args, **kwargs) # type: ignore[misc] │
│ 507 │ │ │
│ 508 │ │ if cls.setup_context == _SingleLevelFunction.setup_context: │
│ 509 │ │ │ raise RuntimeError( │
│ │
│ /mnt/disk_data/soft/oobabooga_linux/installer_files/conda/lib/python3.10/site-packages/bitsandby │
│ tes/autograd/_functions.py:512 in forward │
│ │
│ 509 │ │ │
│ 510 │ │ # 1. Dequantize │
│ 511 │ │ # 2. MatmulnN │
│ ❱ 512 │ │ output = torch.nn.functional.linear(A, F.dequantize_fp4(B, state).to(A.dtype).t( │
│ 513 │ │ │
│ 514 │ │ # 3. Save state │
│ 515 │ │ ctx.state = state │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: mat1 and mat2 shapes cannot be multiplied (588x4096 and 1x9437184)

Randomness in QLoRA fine-tuning

Hi, have you ever run into QLoRA producing different training results on two machines with identical environments, code, and GPUs? Training with P-Tuning is very stable, but QLoRA gives a different result on every run. (A seeding sketch follows below.)
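
A minimal seeding sketch, to be run before the model and datasets are built. Note that some bitsandbytes 4-bit / CUDA kernels are not bit-wise deterministic, so small run-to-run differences can remain even with identical seeds.

import torch
from transformers import set_seed

set_seed(42)                                  # seeds python, numpy, torch and torch.cuda
torch.backends.cudnn.deterministic = True     # trade speed for reproducibility
torch.backends.cudnn.benchmark = False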

Merging a fine-tuned ChatGLM2-6B fails with: We need an `offload_dir` to dispatch this model according to this `device_map`, the following submodules need to be offloaded:

(error screenshots attached)
The problem occurs with both peft 0.3.0 and peft 0.4.0.

The command used is:
python3 merge_lora_and_quantize.py \
--lora_path saved_files \
--output_path merged_qlora_model_4bit \
--remote_scripts_dir remote_scripts/chatglm2-6b \
--qbits 4

The output folder used during fine-tuning was "output_dir": "saved_files", and lora_path is correct.
The folder only contains these files:
(screenshot attached)
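
A minimal sketch, assuming the error comes from device_map='auto' deciding to spill part of the model to disk: either give accelerate a scratch folder to offload into, or keep the whole merge on CPU so nothing needs to be offloaded at all.

from transformers import AutoModel

base = AutoModel.from_pretrained(
    "THUDM/chatglm2-6b", trust_remote_code=True,
    device_map="auto",
    offload_folder="offload",   # scratch dir for the submodules accelerate wants to offload
)
# alternatively: device_map={"": "cpu"} avoids offloading altogether (the merge then runs in system RAM)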

ChatGLM can be fine-tuned with QLoRA, but ChatGLM2 reports an OOM

GPU hardware: four 2080 Ti cards with 12 GB VRAM each, running on a single specified card.
Running the chatglm-6b fine-tune:
CUDA_VISIBLE_DEVICES=0 python train_qlora.py --train_args_json chatGLM_6B_QLoRA.json --model_name_or_path /data/chatglm-6b --train_data_path data/train.jsonl --eval_data_path data/eval.jsonl --lora_rank 4 --lora_dropout 0.05 --compute_dtype fp32
works fine and saves the result under ./saved_files.
But running the chatglm2-6b fine-tune (the chatglm2-6b files are confirmed to be the latest version):
CUDA_VISIBLE_DEVICES=0 python train_qlora.py --train_args_json chatGLM_6B_QLoRA.json --model_name_or_path /data/chatglm2-6b --train_data_path data/train.jsonl --eval_data_path data/eval.jsonl --lora_rank 4 --lora_dropout 0.05 --compute_dtype fp32
fails with:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /home/softwares/anaconda3/envs/langchain/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/home/softwares/anaconda3/envs/langchain/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /home/softwares/anaconda3/envs/langchain did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/softwares/anaconda3/envs/langchain/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
You are loading your model in 8bit or 4bit but no linear modules were found in your model. Please double check your model architecture, or submit an issue on github if you think this is a bug.
The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/data/FT_LLM/chatGLM-6B-QLoRA/train_qlora.py:214 in │
│ │
│ 211 │
│ 212 if __name__ == "__main__": │
│ 213 │ args = parse_args() │
│ ❱ 214 │ train(args) │
│ 215 │
│ 216 │
│ │
│ /data/data/FT_LLM/chatGLM-6B-QLoRA/train_qlora.py:153 in train │
│ │
│ 150 │ # "output_layer": "cpu", │
│ 151 │ # } │
│ 152 │ │
│ ❱ 153 │ model = AutoModel.from_pretrained(global_args.model_name_or_path, │
│ 154 │ │ │ │ │ │ │ │ │ quantization_config=q_config, │
│ 155 │ │ │ │ │ │ │ │ │ device_map='auto', │
│ 156 │ │ │ │ │ │ │ │ │ trust_remote_code=True) │
│ │
│ /home/softwares/anaconda3/envs/langchain/lib/python3.10/site-packages/transformers/models/ │
│ auto/auto_factory.py:488 in from_pretrained │
│ │
│ 485 │ │ │ │ model_class.register_for_auto_class(cls.name) │
│ 486 │ │ │ else: │
│ 487 │ │ │ │ cls.register(config.class, model_class, exist_ok=True) │
│ ❱ 488 │ │ │ return model_class.from_pretrained( │
│ 489 │ │ │ │ pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, │
│ 490 │ │ │ ) │
│ 491 │ │ elif type(config) in cls.model_mapping.keys(): │
│ │
│ /home/softwares/anaconda3/envs/langchain/lib/python3.10/site-packages/transformers/modelin │
│ g_utils.py:2842 in from_pretrained │
│ │
│ 2839 │ │ │ │ │ key: device_map[key] for key in device_map.keys() if key not in modu │
│ 2840 │ │ │ │ } │
│ 2841 │ │ │ │ if "cpu" in device_map_without_lm_head.values() or "disk" in device_map_ │
│ ❱ 2842 │ │ │ │ │ raise ValueError( │
│ 2843 │ │ │ │ │ │ """ │
│ 2844 │ │ │ │ │ │ Some modules are dispatched on the CPU or the disk. Make sure yo │
│ 2845 │ │ │ │ │ │ the quantized model. If you want to dispatch the model on the CP │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom
device_map to from_pretrained. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.
If I comment out device_map='auto' in train_qlora.py's model = AutoModel.from_pretrained(global_args.model_name_or_path, quantization_config=q_config, device_map='auto', trust_remote_code=True), I get an OOM error instead:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /home/softwares/anaconda3/envs/langchain/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/home/softwares/anaconda3/envs/langchain/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /home/softwares/anaconda3/envs/langchain did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/softwares/anaconda3/envs/langchain/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
You are loading your model in 8bit or 4bit but no linear modules were found in your model. Please double check your model architecture, or submit an issue on github if you think this is a bug.
Loading checkpoint shards: 71%|██████████████████████████████████████████████████████████████▊ | 5/7 [00:14<00:05, 2.91s/it]
OutOfMemoryError: CUDA out of memory. Tried to allocate 214.00 MiB (GPU 0; 10.75 GiB total capacity; 10.08 GiB already allocated; 142.50 MiB free;
10.09 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See
documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Is there a corresponding fix for chatglm2? (A quick check sketch follows below.)
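
A minimal check sketch (here model refers to the object returned by AutoModel.from_pretrained(..., quantization_config=q_config, ...) in train_qlora.py): the warning "no linear modules were found" suggests the 4-bit quantization never happened, which would explain the OOM once prepare_model_for_kbit_training casts the remaining fp16 weights to fp32. Counting bitsandbytes Linear4bit layers right after loading makes this visible.

import bitsandbytes as bnb

def count_4bit_layers(model) -> int:
    """Return how many Linear4bit modules the loaded model contains (0 means nothing was quantized)."""
    return sum(isinstance(m, bnb.nn.Linear4bit) for m in model.modules())

# print(count_4bit_layers(model))   # expect > 0 for a correctly quantized chatglm2-6b
# If it is 0, upgrading transformers / bitsandbytes / accelerate and re-pulling the latest
# chatglm2-6b modeling files is usually what restores the quantized load.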

How did you get QLoRA working?

You are loading your model in 8bit or 4bit but no linear modules were found in your model. this can happen for some architectures such as gpt2 that uses Conv1D instead of Linear layers. Please double check your model architecture, or submit an issue on github if you think this is a bug.

Then when prepare_model_for_kbit_training runs, it OOMs. Judging from the log, the chatglm2-6b weights were never actually quantized.
When prepare_model_for_kbit_training casts all the non-quantized weights to float32, it runs out of memory.

The model weights and code are the latest as of today.
