
Comments (15)

archwolf118 commented on July 17, 2024

@TccccD It looks like a compute-capability issue: the V100 doesn't support loading the large model in int8 for training, so you need to change load_in_8bit on line 60 of finetune.py to False.

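For readers following along, a minimal sketch of what that change amounts to. This is not the repo's exact finetune.py; the model name and keyword arguments below are the standard transformers API, shown here only for illustration:

import torch
from transformers import AutoModel

# Load ChatGLM-6B in fp16 rather than int8, since the int8 path is what
# fails on V100-class cards in this thread.
model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b",
    trust_remote_code=True,     # ChatGLM ships its own modeling code
    load_in_8bit=False,         # was True in the original script
    torch_dtype=torch.float16,  # keep weights in half precision so 6B fits in 32 GB
    device_map="auto",          # requires the accelerate package
)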

xyzanonymous666 commented on July 17, 2024

Single V100 32GB: with fp16 enabled and the model loaded with load_in_8bit=False, it runs with batch_size < 3:
{'loss': 2.6075, 'learning_rate': 9.78e-05, 'epoch': 0.0}
{'loss': 1.9953, 'learning_rate': 9.53e-05, 'epoch': 0.0}
{'loss': 1.9127, 'learning_rate': 9.28e-05, 'epoch': 0.01}
{'loss': 1.8311, 'learning_rate': 9.03e-05, 'epoch': 0.01}
{'loss': 1.7649, 'learning_rate': 8.78e-05, 'epoch': 0.01}

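For reference, a rough sketch of how that configuration maps onto transformers.TrainingArguments (the output path and step counts below are placeholders, not values from the repo):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output/chatglm-6b-lora",  # placeholder path
    per_device_train_batch_size=2,        # "batch_size < 3" on a single 32 GB V100
    gradient_accumulation_steps=1,
    learning_rate=1e-4,
    num_train_epochs=1,
    fp16=True,                            # half-precision training enabled
    logging_steps=50,
    remove_unused_columns=False,
)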

zhoujz10 commented on July 17, 2024

Single V100 32GB: with fp16 enabled and the model loaded with load_in_8bit=False, it runs with batch_size < 3: {'loss': 2.6075, 'learning_rate': 9.78e-05, 'epoch': 0.0} {'loss': 1.9953, 'learning_rate': 9.53e-05, 'epoch': 0.0} {'loss': 1.9127, 'learning_rate': 9.28e-05, 'epoch': 0.01} {'loss': 1.8311, 'learning_rate': 9.03e-05, 'epoch': 0.01} {'loss': 1.7649, 'learning_rate': 8.78e-05, 'epoch': 0.01}

Hi @xyzanonymous666, my configuration is the same as yours: V100 32G with fp16 enabled and load_in_8bit=False. But following the author's write-up, I set batch_size=32 on a single card and it also seems to run?

The author's command is as follows:

python finetune.py \
  --dataset_path /data/nfs/guodong.li/data/alpaca_tokenize \
  --lora_rank 8 \
  --per_device_train_batch_size 32 \
  --gradient_accumulation_steps 4 \
  --num_train_epochs 3 \
  --save_steps 1000 \
  --save_total_limit 2 \
  --learning_rate 1e-4 \
  --fp16 \
  --remove_unused_columns false \
  --logging_steps 50 \
  --output_dir /home/guodong.li/data/chatglm-6b-lora
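
Note that this command trains with an effective batch of 32 × 4 = 128 sequences per optimizer step (per-device batch size times gradient accumulation). With lora_rank 8 only the LoRA adapter weights receive gradients, which is part of why a per-device batch of 32 can even be attempted on a 32 GB card, though activation memory still grows with batch size and sequence length.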


mymusise commented on July 17, 2024

On my side the loss ends up around 1 after training; my bs is 4.


TccccD commented on July 17, 2024

On my side the loss ends up around 1 after training; my bs is 4.

I'm using the version where bs can only be 1. Did anything else change in the version that allows bs > 1?


TccccD commented on July 17, 2024

Also, about the --fp16 flag: if I add it, I get a half-precision error; if I remove it, training runs successfully. Could that be the reason loss=0?

Traceback (most recent call last):
  File "finetune.py", line 93, in <module>
    main()
  File "finetune.py", line 85, in main
    trainer.train()
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 2655, in training_step
    self.scaler.scale(loss).backward()
  File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/root/miniconda3/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 456, in backward
    grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype_A)
RuntimeError: expected scalar type Half but found Float


archwolf118 commented on July 17, 2024

Could it be the GPU? I get the half-precision error on a V100 but not on a 3090, which is odd; I suspect it's a compute-capability issue. What GPU model are you using?


TccccD commented on July 17, 2024

Could it be the GPU? I get the half-precision error on a V100 but not on a 3090, which is odd; I suspect it's a compute-capability issue. What GPU model are you using?

Mine is also a V100.


TccccD commented on July 17, 2024

@TccccD It looks like a compute-capability issue: the V100 doesn't support loading the large model in int8 for training, so you need to change load_in_8bit on line 60 of finetune.py to False.

[image: GPU spec table screenshot]

Is the INT8 Tensor Cores field the one to look at?

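If you would rather check this programmatically than read the spec sheet, a small sketch (the capability numbers come from NVIDIA's published specs; as far as I know, older bitsandbytes releases required Turing or newer, compute capability >= 7.5, for the int8 matmul, and all-architecture support only arrived around 0.37):

import torch
import bitsandbytes as bnb

# V100 is compute capability 7.0, P40 is 6.1, RTX 3090 is 8.6.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability {major}.{minor}")
print(f"bitsandbytes version: {bnb.__version__}")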

archwolf118 commented on July 17, 2024

It's the bitsandbytes library that's causing it. (TimDettmers/bitsandbytes#100)


TccccD commented on July 17, 2024

It's the bitsandbytes library that's causing it. (TimDettmers/bitsandbytes#100)

The comments there say all GPUs are supported? But isn't the latest bitsandbytes release 0.37.1... that's exactly the version I'm on.


Adherer commented on July 17, 2024

I'm training on a P40; with batch_size=1 the loss is also 0. Have you managed to solve this?
{"epoch": 0.0, "learning_rate": 1.9980769230769233e-05, "loss": 0.0, "step": 50},
{"epoch": 0.0, "learning_rate": 1.9961538461538464e-05, "loss": 0.0, "step": 100},
{"epoch": 0.0, "learning_rate": 1.9942307692307695e-05, "loss": 0.0, "step": 150}

Update: with batch_size=2 the loss is non-zero at step=50 but 0 from then on. Feels like a bug:
{"epoch":0.0,"learning_rate":1.9980769230769233e-05,"loss":1.6446,"step":50},
{"epoch":0.0,"learning_rate":1.9961538461538464e-05,"loss":0.0,"step":100},
{"epoch":0.01,"learning_rate":1.9942307692307695e-05,"loss":0.0,"step":150},
{"epoch":0.01,"learning_rate":1.9923076923076926e-05,"loss":0.0,"step":200}


Adherer commented on July 17, 2024

I'm training on a P40; with batch_size=1 the loss is also 0. Have you managed to solve this? {"epoch": 0.0, "learning_rate": 1.9980769230769233e-05, "loss": 0.0, "step": 50}, {"epoch": 0.0, "learning_rate": 1.9961538461538464e-05, "loss": 0.0, "step": 100}, {"epoch": 0.0, "learning_rate": 1.9942307692307695e-05, "loss": 0.0, "step": 150}

Update: with batch_size=2 the loss is non-zero at step=50 but 0 from then on. Feels like a bug: {"epoch":0.0,"learning_rate":1.9980769230769233e-05,"loss":1.6446,"step":50}, {"epoch":0.0,"learning_rate":1.9961538461538464e-05,"loss":0.0,"step":100}, {"epoch":0.01,"learning_rate":1.9942307692307695e-05,"loss":0.0,"step":150}, {"epoch":0.01,"learning_rate":1.9923076923076926e-05,"loss":0.0,"step":200}

Solved: as long as fp16 is enabled the loss comes out normal; with fp16=False the loss stays at 0.

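If anyone still hits the all-zero loss, one way to narrow it down is to run a single forward pass by hand and inspect the raw loss before the Trainer touches it. This is only a sketch: train_dataset, data_collator and model here stand for whatever your script has already built:

import torch
from torch.utils.data import DataLoader

# Pull one batch and check that the raw loss is non-zero and in the expected dtype.
loader = DataLoader(train_dataset, batch_size=1, collate_fn=data_collator)
batch = next(iter(loader))
with torch.no_grad():
    out = model(**{k: v.to(model.device) for k, v in batch.items()})
print(out.loss.item(), out.loss.dtype)
# A sensible value here (e.g. around 2-3) combined with 0.0 in the Trainer logs
# points at the precision settings rather than the data.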

chuckhope commented on July 17, 2024

Mine is also a V100, 16 GB. fp16 training won't run for me; with int8 enabled the GPU memory does come down, but the loss is just 0. bitsandbytes 0.37.1, and the corresponding issue does indeed say all GPUs are supported.


yuhp-zts commented on July 17, 2024

I'm training on a P40; with batch_size=1 the loss is also 0. Have you managed to solve this? {"epoch": 0.0, "learning_rate": 1.9980769230769233e-05, "loss": 0.0, "step": 50}, {"epoch": 0.0, "learning_rate": 1.9961538461538464e-05, "loss": 0.0, "step": 100}, {"epoch": 0.0, "learning_rate": 1.9942307692307695e-05, "loss": 0.0, "step": 150}
Update: with batch_size=2 the loss is non-zero at step=50 but 0 from then on. Feels like a bug: {"epoch":0.0,"learning_rate":1.9980769230769233e-05,"loss":1.6446,"step":50}, {"epoch":0.0,"learning_rate":1.9961538461538464e-05,"loss":0.0,"step":100}, {"epoch":0.01,"learning_rate":1.9942307692307695e-05,"loss":0.0,"step":150}, {"epoch":0.01,"learning_rate":1.9923076923076926e-05,"loss":0.0,"step":200}

Solved: as long as fp16 is enabled the loss comes out normal; with fp16=False the loss stays at 0.

Hi, I thought the P40 doesn't support fp16?

