
Comments (7)

shuxueslpi commented on September 11, 2024

Single-GPU should be fine.
For multi-GPU, I don't have a machine on hand to test with yet...

shenmadouyaowen commented on September 11, 2024

Single-GPU should be fine. For multi-GPU, I don't have a machine on hand to test with yet...

OK. Could you take a look at the following error and advise how to fix it?

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
Traceback (most recent call last):
  File "/media/ubuntu/chat/chatGLM-6B-QLoRA-main/train_qlora.py", line 206, in <module>
    train(args)
  File "/media/ubuntu/chat/chatGLM-6B-QLoRA-main/train_qlora.py", line 200, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/anaconda3/envs/ql/lib/python3.9/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/root/anaconda3/envs/ql/lib/python3.9/site-packages/transformers/trainer.py", line 1938, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/anaconda3/envs/ql/lib/python3.9/site-packages/transformers/trainer.py", line 2759, in training_step
    loss = self.compute_loss(model, inputs)
  File "/root/anaconda3/envs/ql/lib/python3.9/site-packages/transformers/trainer.py", line 2784, in compute_loss
    outputs = model(**inputs)
  File "/root/anaconda3/envs/ql/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/ql/lib/python3.9/site-packages/peft/peft_model.py", line 857, in forward
    return self.base_model(
  File "/root/anaconda3/envs/ql/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/ql/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 954, in forward
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
  File "/root/anaconda3/envs/ql/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/ql/lib/python3.9/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/root/anaconda3/envs/ql/lib/python3.9/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward)

shuxueslpi commented on September 11, 2024

I took a look at someone else's code: https://github.com/beyondguo/LLM-Tuning/blob/master/chatglm2_lora_tuning.py
The comments around lines 102-106 seem to describe the same problem you are hitting; see whether that approach fixes it, though I expect the official model code will eventually be updated to resolve this anyway.
Once I get hold of a multi-GPU machine, I'll try it myself...
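A minimal sketch of that style of workaround, assuming the mismatch comes from the labels staying on cuda:0 while the logits land on another GPU under device_map="auto" (the `DeviceSafeTrainer` name is hypothetical and this is an illustration, not the exact code from the linked file):

```python
import torch
from torch.nn import CrossEntropyLoss
from transformers import Trainer

class DeviceSafeTrainer(Trainer):  # hypothetical name, not from the linked repo
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)      # logits may live on the last pipeline GPU
        logits = outputs.logits
        # Shift as in modeling_chatglm.py: tokens < n predict token n.
        shift_logits = logits[..., :-1, :].contiguous()
        # Key fix: move the labels onto the same device as the logits.
        shift_labels = labels[..., 1:].contiguous().to(shift_logits.device)
        loss_fct = CrossEntropyLoss(ignore_index=-100)
        loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
                        shift_labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```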

shenmadouyaowen commented on September 11, 2024

I took a look at someone else's code: https://github.com/beyondguo/LLM-Tuning/blob/master/chatglm2_lora_tuning.py The comments around lines 102-106 seem to describe the same problem you are hitting; see whether that approach fixes it, though I expect the official model code will eventually be updated to resolve this anyway. Once I get hold of a multi-GPU machine, I'll try it myself...

Thanks for your help! I got his version running....
Keep up the good work!

shenmadouyaowen commented on September 11, 2024

I took a look at someone else's code: https://github.com/beyondguo/LLM-Tuning/blob/master/chatglm2_lora_tuning.py The comments around lines 102-106 seem to describe the same problem you are hitting; see whether that approach fixes it, though I expect the official model code will eventually be updated to resolve this anyway. Once I get hold of a multi-GPU machine, I'll try it myself...

Even with per_device_train_batch_size set to 2 it runs out of GPU memory. Do you have any suggestions for reducing memory usage?

shuxueslpi commented on September 11, 2024

@shenmadouyaowen His code looks like plain LoRA with 8-bit training, while mine is QLoRA with 4-bit quantization, so in theory mine should use less GPU memory than his. A few suggestions:
1. Pull the latest official chatglm2-6b model code; the earlier code did not implement activation checkpointing, but the latest code does.
2. If you still run out of GPU memory after step 1, try adding the code from those commented-out sections into my code. I expect to get a multi-GPU machine tomorrow and will test it then as well.
3. Conversely, you could also add my QLoRA configuration to his code (see the sketch below).
Honestly, for a single machine with 24 GB of GPU memory, my QLoRA setup should handle a fairly large batch size. On a 12 GB card I can train on the ADGEN dataset with a batch size of 8 without even filling the memory.
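As a rough sketch of what "my QLoRA configuration" means here: loading the model in 4-bit NF4 quantization plus gradient (activation) checkpointing, which is what keeps the footprint below an 8-bit LoRA setup. The parameter values are illustrative, assuming a recent transformers/bitsandbytes:

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization -- the core of QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModel.from_pretrained(
    "THUDM/chatglm2-6b",               # pull the latest model code first (suggestion 1)
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)
model.gradient_checkpointing_enable()  # relies on the updated modeling code
model.enable_input_require_grads()     # needed so checkpointing works with PEFT adapters
model.config.use_cache = False         # use_cache is incompatible with checkpointing
```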

shenmadouyaowen commented on September 11, 2024

Yes, I merged his code into yours and it still ran out of GPU memory; after changing per_device_train_batch_size=1 it runs. I've already pulled the latest files and am looking forward to your optimizations.
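One way to recover the effective batch size after dropping to per_device_train_batch_size=1 is gradient accumulation; a minimal sketch (the values below are illustrative, not the repo's actual defaults):

```python
from transformers import TrainingArguments

# Effective batch size = 1 per device x 8 accumulation steps = 8,
# at roughly the memory cost of batch size 1.
training_args = TrainingArguments(
    output_dir="output",             # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,     # trades compute for memory
    learning_rate=1e-4,              # illustrative
    fp16=True,
)
```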
