
Comments (15)

archwolf118 commented on July 17, 2024

@TccccD It looks like a compute-capability issue: the V100 doesn't support loading the large model in int8 for training, so you need to change load_in_8bit on line 60 of finetune.py to False.

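For readers following along, a minimal sketch of what that change amounts to. This is not the repo's exact finetune.py; the model name and keyword arguments below are the standard transformers API, shown here only for illustration:

import torch
from transformers import AutoModel

# Load ChatGLM-6B in fp16 rather than int8, since the int8 path is what
# fails on V100-class cards in this thread.
model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b",
    trust_remote_code=True,     # ChatGLM ships its own modeling code
    load_in_8bit=False,         # was True in the original script
    torch_dtype=torch.float16,  # keep weights in half precision so 6B fits in 32 GB
    device_map="auto",          # requires the accelerate package
)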

xyzanonymous666 commented on July 17, 2024

Single V100 32GB: with fp16 enabled and the model loaded with load_in_8bit=False, it runs with batch_size < 3:
{'loss': 2.6075, 'learning_rate': 9.78e-05, 'epoch': 0.0}
{'loss': 1.9953, 'learning_rate': 9.53e-05, 'epoch': 0.0}
{'loss': 1.9127, 'learning_rate': 9.28e-05, 'epoch': 0.01}
{'loss': 1.8311, 'learning_rate': 9.03e-05, 'epoch': 0.01}
{'loss': 1.7649, 'learning_rate': 8.78e-05, 'epoch': 0.01}

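For reference, a rough sketch of how that configuration maps onto transformers.TrainingArguments (the output path and step counts below are placeholders, not values from the repo):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output/chatglm-6b-lora",  # placeholder path
    per_device_train_batch_size=2,        # "batch_size < 3" on a single 32 GB V100
    gradient_accumulation_steps=1,
    learning_rate=1e-4,
    num_train_epochs=1,
    fp16=True,                            # half-precision training enabled
    logging_steps=50,
    remove_unused_columns=False,
)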

zhoujz10 commented on July 17, 2024

Single V100 32GB: with fp16 enabled and the model loaded with load_in_8bit=False, it runs with batch_size < 3: {'loss': 2.6075, 'learning_rate': 9.78e-05, 'epoch': 0.0} {'loss': 1.9953, 'learning_rate': 9.53e-05, 'epoch': 0.0} {'loss': 1.9127, 'learning_rate': 9.28e-05, 'epoch': 0.01} {'loss': 1.8311, 'learning_rate': 9.03e-05, 'epoch': 0.01} {'loss': 1.7649, 'learning_rate': 8.78e-05, 'epoch': 0.01}

Hi @xyzanonymous666, my configuration is the same as yours: V100 32G with fp16 enabled and load_in_8bit=False. But following the author's write-up, I set batch_size=32 on a single card and it also seems to run?

The author's command is as follows:

python finetune.py \
  --dataset_path /data/nfs/guodong.li/data/alpaca_tokenize \
  --lora_rank 8 \
  --per_device_train_batch_size 32 \
  --gradient_accumulation_steps 4 \
  --num_train_epochs 3 \
  --save_steps 1000 \
  --save_total_limit 2 \
  --learning_rate 1e-4 \
  --fp16 \
  --remove_unused_columns false \
  --logging_steps 50 \
  --output_dir /home/guodong.li/data/chatglm-6b-lora
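
Note that this command trains with an effective batch of 32 × 4 = 128 sequences per optimizer step (per-device batch size times gradient accumulation). With lora_rank 8 only the LoRA adapter weights receive gradients, which is part of why a per-device batch of 32 can even be attempted on a 32 GB card, though activation memory still grows with batch size and sequence length.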


mymusise commented on July 17, 2024

On my side the loss ends up around 1 after training; my bs is 4.


TccccD commented on July 17, 2024

On my side the loss ends up around 1 after training; my bs is 4.

I'm using the version where bs can only be 1. Did anything else change in the version that allows bs > 1?


TccccD commented on July 17, 2024

Also, about the --fp16 flag: if I add it, I get a half-precision error; if I remove it, training runs successfully. Could that be the reason loss=0?

Traceback (most recent call last):
  File "finetune.py", line 93, in <module>
    main()
  File "finetune.py", line 85, in main
    trainer.train()
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 2655, in training_step
    self.scaler.scale(loss).backward()
  File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/root/miniconda3/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 456, in backward
    grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype_A)
RuntimeError: expected scalar type Half but found Float


archwolf118 commented on July 17, 2024

Could it be the GPU? I get the half-precision error on a V100 but not on a 3090, which is odd; I suspect it's a compute-capability issue. What GPU model are you using?


TccccD commented on July 17, 2024

Could it be the GPU? I get the half-precision error on a V100 but not on a 3090, which is odd; I suspect it's a compute-capability issue. What GPU model are you using?

Mine is also a V100.


TccccD commented on July 17, 2024

@TccccD It looks like a compute-capability issue: the V100 doesn't support loading the large model in int8 for training, so you need to change load_in_8bit on line 60 of finetune.py to False.

[image: GPU spec table screenshot]

Is the INT8 Tensor Cores field the one to look at?

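If you would rather check this programmatically than read the spec sheet, a small sketch (the capability numbers come from NVIDIA's published specs; as far as I know, older bitsandbytes releases required Turing or newer, compute capability >= 7.5, for the int8 matmul, and all-architecture support only arrived around 0.37):

import torch
import bitsandbytes as bnb

# V100 is compute capability 7.0, P40 is 6.1, RTX 3090 is 8.6.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability {major}.{minor}")
print(f"bitsandbytes version: {bnb.__version__}")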

archwolf118 commented on July 17, 2024

It's the bitsandbytes library that's causing it. (TimDettmers/bitsandbytes#100)


TccccD commented on July 17, 2024

It's the bitsandbytes library that's causing it. (TimDettmers/bitsandbytes#100)

The comments there say all GPUs are supported? But isn't the latest bitsandbytes release 0.37.1... that's exactly the version I'm on.


Adherer commented on July 17, 2024

I'm training on a P40; with batch_size=1 the loss is also 0. Have you managed to solve this?
{"epoch": 0.0, "learning_rate": 1.9980769230769233e-05, "loss": 0.0, "step": 50},
{"epoch": 0.0, "learning_rate": 1.9961538461538464e-05, "loss": 0.0, "step": 100},
{"epoch": 0.0, "learning_rate": 1.9942307692307695e-05, "loss": 0.0, "step": 150}

Update: with batch_size=2 the loss is non-zero at step=50 but 0 from then on. Feels like a bug:
{"epoch":0.0,"learning_rate":1.9980769230769233e-05,"loss":1.6446,"step":50},
{"epoch":0.0,"learning_rate":1.9961538461538464e-05,"loss":0.0,"step":100},
{"epoch":0.01,"learning_rate":1.9942307692307695e-05,"loss":0.0,"step":150},
{"epoch":0.01,"learning_rate":1.9923076923076926e-05,"loss":0.0,"step":200}


Adherer commented on July 17, 2024

I'm training on a P40; with batch_size=1 the loss is also 0. Have you managed to solve this? {"epoch": 0.0, "learning_rate": 1.9980769230769233e-05, "loss": 0.0, "step": 50}, {"epoch": 0.0, "learning_rate": 1.9961538461538464e-05, "loss": 0.0, "step": 100}, {"epoch": 0.0, "learning_rate": 1.9942307692307695e-05, "loss": 0.0, "step": 150}

Update: with batch_size=2 the loss is non-zero at step=50 but 0 from then on. Feels like a bug: {"epoch":0.0,"learning_rate":1.9980769230769233e-05,"loss":1.6446,"step":50}, {"epoch":0.0,"learning_rate":1.9961538461538464e-05,"loss":0.0,"step":100}, {"epoch":0.01,"learning_rate":1.9942307692307695e-05,"loss":0.0,"step":150}, {"epoch":0.01,"learning_rate":1.9923076923076926e-05,"loss":0.0,"step":200}

Solved: as long as fp16 is enabled the loss comes out normal; with fp16=False the loss stays at 0.

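If anyone still hits the all-zero loss, one way to narrow it down is to run a single forward pass by hand and inspect the raw loss before the Trainer touches it. This is only a sketch: train_dataset, data_collator and model here stand for whatever your script has already built:

import torch
from torch.utils.data import DataLoader

# Pull one batch and check that the raw loss is non-zero and in the expected dtype.
loader = DataLoader(train_dataset, batch_size=1, collate_fn=data_collator)
batch = next(iter(loader))
with torch.no_grad():
    out = model(**{k: v.to(model.device) for k, v in batch.items()})
print(out.loss.item(), out.loss.dtype)
# A sensible value here (e.g. around 2-3) combined with 0.0 in the Trainer logs
# points at the precision settings rather than the data.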

chuckhope commented on July 17, 2024

Mine is also a V100, 16 GB. fp16 training won't run for me; with int8 enabled the GPU memory does come down, but the loss is just 0. bitsandbytes 0.37.1, and the corresponding issue does indeed say all GPUs are supported.


yuhp-zts commented on July 17, 2024

I'm training on a P40; with batch_size=1 the loss is also 0. Have you managed to solve this? {"epoch": 0.0, "learning_rate": 1.9980769230769233e-05, "loss": 0.0, "step": 50}, {"epoch": 0.0, "learning_rate": 1.9961538461538464e-05, "loss": 0.0, "step": 100}, {"epoch": 0.0, "learning_rate": 1.9942307692307695e-05, "loss": 0.0, "step": 150}
Update: with batch_size=2 the loss is non-zero at step=50 but 0 from then on. Feels like a bug: {"epoch":0.0,"learning_rate":1.9980769230769233e-05,"loss":1.6446,"step":50}, {"epoch":0.0,"learning_rate":1.9961538461538464e-05,"loss":0.0,"step":100}, {"epoch":0.01,"learning_rate":1.9942307692307695e-05,"loss":0.0,"step":150}, {"epoch":0.01,"learning_rate":1.9923076923076926e-05,"loss":0.0,"step":200}

Solved: as long as fp16 is enabled the loss comes out normal; with fp16=False the loss stays at 0.

Hi, I thought the P40 doesn't support fp16?

