Comments (15)
@TccccD It seems to be a compute-capability issue: the V100 doesn't support loading the large model in int8 for training, so you need to change load_in_8bit to False on line 60 of finetune.py.
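For reference, a minimal sketch of what that change might look like (the model name and surrounding arguments are my assumptions for illustration, not the repo's exact finetune.py):

from transformers import AutoModel

# Load ChatGLM-6B without bitsandbytes int8 quantization; on a V100 the int8
# path is the suspected problem, so fall back to fp16 weights instead.
model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b",
    load_in_8bit=False,       # was True
    trust_remote_code=True,
    device_map="auto",
)
model = model.half()          # fp16 so the 6B model fits in 32 GB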
from chatglm-tuning.
Single V100 32GB: with fp16 enabled and load_in_8bit=False when loading the model, it runs with batch_size < 3.
{'loss': 2.6075, 'learning_rate': 9.78e-05, 'epoch': 0.0}
{'loss': 1.9953, 'learning_rate': 9.53e-05, 'epoch': 0.0}
{'loss': 1.9127, 'learning_rate': 9.28e-05, 'epoch': 0.01}
{'loss': 1.8311, 'learning_rate': 9.03e-05, 'epoch': 0.01}
{'loss': 1.7649, 'learning_rate': 8.78e-05, 'epoch': 0.01}
from chatglm-tuning.
Single V100 32GB: with fp16 enabled and load_in_8bit=False when loading the model, it runs with batch_size < 3.
Hi @xyzanonymous666, my setup is the same as yours (V100 32G, fp16 enabled, load_in_8bit=False), but following the author's article I set batch_size=32 on a single card and it seems to run fine?
The author's command is as follows:
python finetune.py \
--dataset_path /data/nfs/guodong.li/data/alpaca_tokenize \
--lora_rank 8 \
--per_device_train_batch_size 32 \
--gradient_accumulation_steps 4 \
--num_train_epochs 3 \
--save_steps 1000 \
--save_total_limit 2 \
--learning_rate 1e-4 \
--fp16 \
--remove_unused_columns false \
--logging_steps 50 \
--output_dir /home/guodong.li/data/chatglm-6b-lora
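Note that with those flags the effective batch per optimizer step is per_device_train_batch_size × gradient_accumulation_steps = 32 × 4 = 128, while only 32 samples' activations are held in memory at a time, which is presumably why it still fits on a single 32 GB card.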
from chatglm-tuning.
My loss after training is around 1; my bs is 4.
from chatglm-tuning.
My loss after training is around 1; my bs is 4.
I'm using the version where bs can only be 1. Does the version that allows bs > 1 change anything else?
from chatglm-tuning.
Also, about the --fp16 flag: with it the run fails with a half-precision error, and without it training succeeds. Could the loss = 0 be caused by this?
Traceback (most recent call last):
File "finetune.py", line 93, in
main()
File "finetune.py", line 85, in main
trainer.train()
File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 2655, in training_step
self.scaler.scale(loss).backward()
File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/init.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
File "/root/miniconda3/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 456, in backward
grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype_A)
RuntimeError: expected scalar type Half but found Float
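Before changing anything, a quick diagnostic I'd suggest (my addition, not from this thread) is to dump the parameter dtypes after the model is loaded and wrapped with LoRA, since the traceback says an fp32 tensor reached a bitsandbytes matmul that expects fp16. Here model is assumed to be whatever finetune.py has built:

from collections import Counter
import torch

# How many parameters ended up in each dtype after load_in_8bit + LoRA wrapping?
print(Counter(p.dtype for p in model.parameters()))

# The trainable (LoRA) parameters are the usual suspects for the Half/Float mismatch.
for name, p in model.named_parameters():
    if p.requires_grad:
        print(name, p.dtype)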
from chatglm-tuning.
Could it be the GPU? I get the half-precision error on a V100 but not on a 3090, which is strange; I suspect it's a compute-capability issue. What GPU model are you using?
from chatglm-tuning.
Could it be the GPU? I get the half-precision error on a V100 but not on a 3090, which is strange; I suspect it's a compute-capability issue. What GPU model are you using?
Mine is also a V100.
from chatglm-tuning.
@TccccD It seems to be a compute-capability issue: the V100 doesn't support loading the large model in int8 for training, so you need to change load_in_8bit to False on line 60 of finetune.py.
Is "INT8 Tensor Cores" the field I should be looking at?
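You can also just ask PyTorch what the card reports; this is a generic check, not anything specific to this repo. As far as I understand, bitsandbytes' fast int8 matmul kernels target compute capability 7.5+ (Turing/Ampere), with a slower fallback for older cards added around 0.37, which would line up with the V100-vs-3090 difference above:

import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")   # V100 reports 7.0, RTX 3090 reports 8.6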
from chatglm-tuning.
It's caused by the bitsandbytes library. (TimDettmers/bitsandbytes#100)
from chatglm-tuning.
It's caused by the bitsandbytes library. (TimDettmers/bitsandbytes#100)
The comments there say all GPUs are supported? But the latest bitsandbytes is 0.37.1, isn't it... and that's exactly the version I'm on.
from chatglm-tuning.
I'm training on a P40, and with batch_size = 1 the loss is also 0. Did you manage to solve it?
{"epoch": 0.0, "learning_rate": 1.9980769230769233e-05, "loss": 0.0, "step": 50},
{"epoch": 0.0, "learning_rate": 1.9961538461538464e-05, "loss": 0.0, "step": 100},
{"epoch": 0.0, "learning_rate": 1.9942307692307695e-05, "loss": 0.0, "step": 150}
Update: with batch_size = 2 the loss is non-zero at step 50, but 0 for every step after that; it feels like a bug.
{"epoch":0.0,"learning_rate":1.9980769230769233e-05,"loss":1.6446,"step":50},
{"epoch":0.0,"learning_rate":1.9961538461538464e-05,"loss":0.0,"step":100},
{"epoch":0.01,"learning_rate":1.9942307692307695e-05,"loss":0.0,"step":150},
{"epoch":0.01,"learning_rate":1.9923076923076926e-05,"loss":0.0,"step":200}
from chatglm-tuning.
I'm training on a P40, and with batch_size = 1 the loss is also 0. Did you manage to solve it? ... Update: with batch_size = 2 the loss is non-zero at step 50, but 0 for every step after that; it feels like a bug.
Solved: as long as fp16 is enabled, the loss is normal; with fp16=False the loss stays at 0.
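For anyone hitting the same thing, a minimal sketch of the relevant TrainingArguments (the argument names follow the command earlier in this thread; the values are illustrative, not a verified config):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=3,
    fp16=True,                    # with fp16=False the loss was logged as 0.0 above
    logging_steps=50,
    remove_unused_columns=False,
    save_steps=1000,
    save_total_limit=2,
)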
from chatglm-tuning.
Mine is also a V100 (16 GB). It can't train with fp16; with int8 enabled the memory usage does come down, but the loss is just 0. I'm on bitsandbytes 0.37.1, and the corresponding issue indeed says all GPUs are supported.
from chatglm-tuning.
I'm training on a P40, and with batch_size = 1 the loss is also 0 ... Update: with batch_size = 2 the loss is non-zero at step 50 but 0 afterwards ... Solved: as long as fp16 is enabled, the loss is normal; with fp16=False the loss stays at 0.
Hi, I thought the P40 doesn't support fp16?
from chatglm-tuning.