
Comments (13)

Adherer commented on July 17, 2024

I ran into this problem too; my GPU is a V100. So far I have traced it to this line in modeling_chatGLM::SelfAttention::forward(): output = self.dense(context_layer)

The output contains inf and -inf values, and self.dense is of type <class 'bitsandbytes.nn.modules.Linear8bitLt'>, so at first glance it looks like an int8 quantization problem. The Linear8bitLt implementation is too complex for me to follow, so I still don't know the real cause. It has been bothering me for two days.
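The inf/-inf values are consistent with a numeric overflow: float16 can only represent finite values up to 65504, so any intermediate activation beyond that becomes inf. A minimal, torch-free sketch of the check (using NumPy's float16 as a stand-in for the activation tensor; with torch you would call torch.isinf on the output of self.dense):

```python
# Minimal sketch (not the repo's code): float16 overflows to inf past 65504.
# This mimics the kind of check one could add after self.dense(...) to
# confirm where the inf/-inf values first appear.
import numpy as np

def has_inf(x: np.ndarray) -> bool:
    """Return True if any element overflowed to +/-inf."""
    return bool(np.isinf(x).any())

acts = np.array([1.0, 60000.0], dtype=np.float16)
print(has_inf(acts))      # → False (still within float16 range)
print(has_inf(acts * 2))  # → True (120000 exceeds float16 max -> inf)
```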


I solved this: enable fp16 in the training script and set load_in_8bit to False when loading the model, and training works normally. I haven't figured out exactly why either.
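The workaround above can be sketched as follows. This is a sketch, not the repo's exact script: the model id `THUDM/chatglm-6b` is assumed, and the argument names follow the Hugging Face transformers API of that period.

```python
# Sketch of the workaround: skip bitsandbytes int8 quantization and load
# the weights in fp16 instead. Assumes the THUDM/chatglm-6b checkpoint.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b",
    trust_remote_code=True,
    load_in_8bit=False,          # disable bitsandbytes Linear8bitLt layers
    torch_dtype=torch.float16,   # load weights in fp16 instead
).cuda()
# ...and pass fp16=True in the TrainingArguments of the training script.
```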

from chatglm-tuning.

TccccD commented on July 17, 2024

@phantommlin hi, did you manage to solve the loss-is-0 problem?

phantommlin commented on July 17, 2024

Not yet.

mymusise commented on July 17, 2024

It may be hardware-related: #19

Adherer commented on July 17, 2024

I'm training on a P40, and with batch_size 1 the loss is also 0. Did you solve it?
{"epoch": 0.0, "learning_rate": 1.9980769230769233e-05, "loss": 0.0, "step": 50},
{"epoch": 0.0, "learning_rate": 1.9961538461538464e-05, "loss": 0.0, "step": 100},
{"epoch": 0.0, "learning_rate": 1.9942307692307695e-05, "loss": 0.0, "step": 150}

Update: with batch_size 2, the loss is non-zero at step 50 but 0 from then on. Feels like a bug.
{"epoch": 0.0, "learning_rate": 1.9980769230769233e-05, "loss": 1.6446, "step": 50},
{"epoch": 0.0, "learning_rate": 1.9961538461538464e-05, "loss": 0.0, "step": 100},
{"epoch": 0.01, "learning_rate": 1.9942307692307695e-05, "loss": 0.0, "step": 150},
{"epoch": 0.01, "learning_rate": 1.9923076923076926e-05, "loss": 0.0, "step": 200}


zhangzhenhu commented on July 17, 2024

I ran into this problem too; my GPU is a V100. So far I have traced it to this line in
modeling_chatGLM::SelfAttention::forward():
output = self.dense(context_layer)

The output contains inf and -inf values, and self.dense is of type <class 'bitsandbytes.nn.modules.Linear8bitLt'>, so at first glance it looks like an int8 quantization problem. The Linear8bitLt implementation is too complex for me to follow, so I still don't know the real cause. It has been bothering me for two days.



SizhaoXu commented on July 17, 2024

With fp16 enabled and load_in_8bit set to False, I get the following error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

With fp16 disabled and load_in_8bit set to True, it runs normally, but the loss stays at 0.
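The "two devices" error means at least one module's weights were never moved to the GPU before a matmul. A minimal diagnostic sketch (a pure-Python stand-in with hypothetical parameter names; in practice you would build the dict from `model.named_parameters()` and `p.device` with torch):

```python
# Hypothetical stand-in for auditing which parameters live on which device.
def find_stray_params(param_devices: dict, expected: str = "cuda:0") -> list:
    """Return the names of parameters that are not on the expected device."""
    return [name for name, dev in param_devices.items() if dev != expected]

devices = {
    "transformer.layers.0.attention.dense.weight": "cuda:0",
    "lm_head.weight": "cpu",  # a stray module like this triggers the error
}
print(find_stray_params(devices))  # → ['lm_head.weight']
```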


SizhaoXu commented on July 17, 2024

> With fp16 enabled and load_in_8bit set to False, I get the following error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
>
> With fp16 disabled and load_in_8bit set to True, it runs normally, but the loss stays at 0.

Problem solved: updating peft made it work.
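The fix amounts to upgrading peft in the training environment, for example (a sketch; version 0.3.0.dev0 was the unreleased development version at the time of this thread):

```shell
# Upgrade peft to the latest release:
pip install -U peft
# or install from source to pick up unreleased fixes:
pip install git+https://github.com/huggingface/peft.git
```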


chuckhope commented on July 17, 2024

@SizhaoXu bro, about "with fp16 disabled and load_in_8bit set to True, it runs normally, but the loss stays at 0": does that setup actually work for you?


dominicqi commented on July 17, 2024

> With fp16 enabled and load_in_8bit set to False, I get the following error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
> With fp16 disabled and load_in_8bit set to True, it runs normally, but the loss stays at 0.
>
> Problem solved: updating peft made it work.

Hi, which version of peft did you update to? I'm already on v0.2.0.


iMountTai commented on July 17, 2024

> With fp16 enabled and load_in_8bit set to False, I get the following error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
> With fp16 disabled and load_in_8bit set to True, it runs normally, but the loss stays at 0.
>
> Problem solved: updating peft made it work.

Which version of peft did you update to?


moseshu commented on July 17, 2024

Do you mean updating peft to the latest version, 0.3.0.dev0?


guotong1988 commented on July 17, 2024

> With fp16 enabled and load_in_8bit set to False, I get the following error:
> RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

How do I fix this error?

