
Comments (15)

Youggls commented on August 17, 2024

My guess is that this problem is caused by skip_init not allocating memory for the relevant weights. After I changed every module initialized with skip_init back to the original initialization method, the error no longer appeared.

@Youggls Could you explain in detail how you changed it? Thanks.

@ray075hl
In modeling_chatglm.py, many modules are initialized with skip_init; all of them need to be changed to the regular initialization method, as in the screenshot below:
image
For example, the initialization of the self.dense_h_to_4h module should be changed to:
image

However, this may slow down model loading. Also, my current machine is 2x Titan XP with 12 GB of VRAM per card, and even with ZeRO stage 3 enabled I still run out of memory.
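A minimal sketch of the change described above, using the word_embeddings layer from modeling_chatglm.py as the example; the concrete sizes are placeholders (assumptions), read them from the model config in practice:

import torch
from torch import nn
from torch.nn.utils import skip_init

vocab_size, hidden_size = 130528, 4096   # placeholder values (assumption)
params_dtype = torch.float16

# Before: skip_init builds the module on the 'meta' device and only gives the
# weights empty storage via to_empty(), which is what ZeRO-3 later trips over.
word_embeddings = skip_init(
    nn.Embedding,
    num_embeddings=vocab_size, embedding_dim=hidden_size,
    dtype=params_dtype,
)

# After: construct the module directly so its parameters are materialized
# (slower to build, but no meta tensors are involved).
word_embeddings = nn.Embedding(
    num_embeddings=vocab_size, embedding_dim=hidden_size,
    dtype=params_dtype,
)

# The same change applies to the Linear layers, e.g. dense_h_to_4h:
dense_h_to_4h = nn.Linear(hidden_size, 4 * hidden_size, dtype=params_dtype)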


liangwq commented on August 17, 2024

@Youggls Please help me out 0.0

I ran into the error below and have no idea what went wrong:

 File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select)
Using /home/la/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.000453948974609375 seconds
  0%|                                                                                                                                                                               | 0/10000 [00:00<?, ?it/s]/home2/la/chatgml-tuning/modeling_chatglm.py:266: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at  /opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:1239.)
  attention_scores.masked_fill_(attention_mask.byte(), -10000.0)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 116311 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 116312) of binary: /home/la/anaconda3/envs/chatglm-tuning/bin/python
Traceback (most recent call last):
  File "/home/la/anaconda3/envs/chatglm-tuning/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

This looks like a multi-GPU device mismatch. Try specifying the CUDA device explicitly:
model = ChatGLMForConditionalGeneration.from_pretrained(model_id, cache_dir='./', trust_remote_code=True, torch_dtype=torch.float16).cuda()
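If the mismatch is between the input batch and the model rather than between two model shards, moving the batch onto the model's device usually resolves this particular RuntimeError. A minimal sketch; model and batch are assumed to come from the existing training loop:

device = next(model.parameters()).device               # the device the weights ended up on
batch = {k: v.to(device) for k, v in batch.items()}    # move every input tensor to that device
outputs = model(**batch)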


liangwq commented on August 17, 2024

With the code as it stands, you need at least 32 GB; you will have to reduce the batch size.
Getting it to train in less memory with quantization would probably take quite a while to implement.
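For reference, a minimal sketch of where the batch size lives in a DeepSpeed configuration; the key names are standard DeepSpeed config keys, while the concrete values here are assumptions to adapt to your setup:

# Minimal DeepSpeed configuration sketch (values are assumptions).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # lower this first when you run out of memory
    "gradient_accumulation_steps": 8,      # keep the effective batch size via accumulation
    "zero_optimization": {"stage": 3},
    "fp16": {"enabled": True},
}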


llplay commented on August 17, 2024

After switching DeepSpeed to stage 3, I get the error below. @liangwq have you run into this? It seems to be caused by skip_init, and I don't know how to fix it.

NotImplementedError: Cannot copy out of meta tensor; no data!
tensor(..., device='meta', size=(308281344,), dtype=torch.float16, grad_fn=)

/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py:355 in wrapper
    f(module, *args, **kwargs)
/HanLP3/Chatglm_lora_multi-gpu/modeling_chatglm.py:726 in __init__
    self.word_embeddings = skip_init(
        torch.nn.Embedding,
        num_embeddings=self.vocab_size, embedding_dim=self.hidden_size,
        dtype=self.params_dtype
/usr/local/lib/python3.8/dist-packages/torch/nn/utils/init.py:52 in skip_init
    return module_cls(*args, **kwargs).to_empty(device=final_device)
/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py:363 in wrapper
    self._post_init_method(module)
/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py:760 in _post_init_method
    param.partition()
/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py:894 in partition
    self._partition(param_list, has_been_updated=has_been_updated)
/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py:1038 in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py:11 in wrapped_fn
    ret_val = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py:1128 in _partition_param
    param.ds_tensor.copy_(src_tensor)
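A minimal standalone reproduction of the final failure, just to illustrate that a tensor left on the 'meta' device has no data to copy out of, which is what DeepSpeed's _partition_param runs into here (illustration only, not code from the repo):

import torch

meta_param = torch.empty(4, device="meta")   # shape and dtype only, no storage
dst = torch.empty(4)
try:
    dst.copy_(meta_param)
except NotImplementedError as err:
    print(err)   # "Cannot copy out of meta tensor; no data!"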


liangwq commented on August 17, 2024

After switching DeepSpeed to stage 3, I get this error. @liangwq have you run into this? It seems to be caused by skip_init, and I don't know how to fix it.

NotImplementedError: Cannot copy out of meta tensor; no data! tensor(..., device='meta', size=(308281344,), dtype=torch.float16, grad_fn=)


This looks like a problem when DeepSpeed is setting up across the devices. How many GPUs do you have right now?
Please also post your hardware configuration.


llplay commented on August 17, 2024

image


liangwq commented on August 17, 2024

image

You have 4 cards available. Try setting local_rank, or see whether you can use device_map. For the specifics, search for how to set the local_rank environment variable with multiple GPU cards.
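A minimal sketch of the two options above, assuming the script is launched with torchrun so LOCAL_RANK is set, and reusing the class and arguments from the snippet earlier in the thread; the import path and model_id are assumptions, and the device_map variant needs accelerate installed:

import os
import torch
from modeling_chatglm import ChatGLMForConditionalGeneration  # local file from the repo (assumption)

model_id = "THUDM/chatglm-6b"   # assumption: substitute the local checkpoint path
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# Option 1: pin the whole model to this rank's GPU.
model = ChatGLMForConditionalGeneration.from_pretrained(
    model_id, cache_dir='./', trust_remote_code=True, torch_dtype=torch.float16
).to(f"cuda:{local_rank}")

# Option 2: let accelerate spread the layers across the visible GPUs instead.
# model = ChatGLMForConditionalGeneration.from_pretrained(
#     model_id, cache_dir='./', trust_remote_code=True,
#     torch_dtype=torch.float16, device_map="auto",
# )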


Youggls commented on August 17, 2024

My guess is that this problem is caused by skip_init not allocating memory for the relevant weights. After I changed every module initialized with skip_init back to the original initialization method, the error no longer appeared.


ray075hl commented on August 17, 2024

My guess is that this problem is caused by skip_init not allocating memory for the relevant weights. After I changed every module initialized with skip_init back to the original initialization method, the error no longer appeared.

@Youggls Could you explain in detail how you changed it? Thanks.


ray075hl commented on August 17, 2024

My guess is that this problem is caused by skip_init not allocating memory for the relevant weights. After I changed every module initialized with skip_init back to the original initialization method, the error no longer appeared.

@Youggls Could you explain in detail how you changed it? Thanks.

@ray075hl In modeling_chatglm.py, many modules are initialized with skip_init; all of them need to be changed to the regular initialization method, as in the screenshot: image. For example, the initialization of the self.dense_h_to_4h module should be changed to: image

However, this may slow down model loading. Also, my current machine is 2x Titan XP with 12 GB of VRAM per card, and even with ZeRO stage 3 enabled I still run out of memory.

@Youggls Thanks for the reply, I got it working. In my case, with 4x P40 (22 GB) and stage 3 enabled, I can fine-tune with batch_size=1. Memory consumption is somewhat higher than for llama-7b; could it be that the vocabulary is too large?


Youggls commented on August 17, 2024

My guess is that this problem is caused by skip_init not allocating memory for the relevant weights. After I changed every module initialized with skip_init back to the original initialization method, the error no longer appeared.

@Youggls Could you explain in detail how you changed it? Thanks.

@ray075hl In modeling_chatglm.py, many modules are initialized with skip_init; all of them need to be changed to the regular initialization method, as in the screenshot: image. For example, the initialization of the self.dense_h_to_4h module should be changed to: image
However, this may slow down model loading. Also, my current machine is 2x Titan XP with 12 GB of VRAM per card, and even with ZeRO stage 3 enabled I still run out of memory.

@Youggls Thanks for the reply, I got it working. In my case, with 4x P40 (22 GB) and stage 3 enabled, I can fine-tune with batch_size=1. Memory consumption is somewhat higher than for llama-7b; could it be that the vocabulary is too large?

That could be the reason, but I haven't studied the two architectures closely; the embedding matrix is indeed the largest single chunk of the model's parameters.
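A rough back-of-the-envelope comparison; the vocabulary and hidden sizes below are assumptions based on the commonly published configs, so check the local config.json for the exact numbers:

# Embedding-matrix size comparison (sizes are assumptions).
hidden_size = 4096
for name, vocab in [("ChatGLM-6B", 130528), ("LLaMA-7B", 32000)]:
    params = vocab * hidden_size
    print(f"{name}: {params / 1e6:.0f}M embedding parameters, "
          f"~{params * 2 / 1024**3:.2f} GiB in fp16")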


xiaoweiweixiao commented on August 17, 2024

@Youggls Please help me out 0.0

I ran into the error below and have no idea what went wrong:

 File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select)
Using /home/la/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.000453948974609375 seconds
  0%|                                                                                                                                                                               | 0/10000 [00:00<?, ?it/s]/home2/la/chatgml-tuning/modeling_chatglm.py:266: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at  /opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:1239.)
  attention_scores.masked_fill_(attention_mask.byte(), -10000.0)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 116311 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 116312) of binary: /home/la/anaconda3/envs/chatglm-tuning/bin/python
Traceback (most recent call last):
  File "/home/la/anaconda3/envs/chatglm-tuning/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


littlerookie commented on August 17, 2024

Hello, after making the skip_init changes described above, running it now reports the following error:
image
Do you know how this should be fixed?


littlerookie commented on August 17, 2024

My guess is that this problem is caused by skip_init not allocating memory for the relevant weights. After I changed every module initialized with skip_init back to the original initialization method, the error no longer appeared.

@Youggls Could you explain in detail how you changed it? Thanks.

@ray075hl In modeling_chatglm.py, many modules are initialized with skip_init; all of them need to be changed to the regular initialization method, as in the screenshot: image. For example, the initialization of the self.dense_h_to_4h module should be changed to: image

However, this may slow down model loading. Also, my current machine is 2x Titan XP with 12 GB of VRAM per card, and even with ZeRO stage 3 enabled I still run out of memory.

Did you manage to get it working with 12 GB of VRAM? If so, could you point out what needs to be changed?


Youggls commented on August 17, 2024

Hello, after making the skip_init changes described above, running it now reports the following error: image. Do you know how this should be fixed?

It looks like the tokenizer is raising the error; check the tokenizer you are loading and try again.
Also, I have not gotten it working with 12 GB of VRAM myself; based on others' experience, it may take 4x 12 GB to run.
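A minimal sanity check for the tokenizer, assuming it is loaded through transformers; the model_id and arguments are assumptions consistent with the snippets earlier in the thread:

from transformers import AutoTokenizer

model_id = "THUDM/chatglm-6b"   # assumption: substitute the local checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir='./', trust_remote_code=True)
print(type(tokenizer))                  # should be the ChatGLM tokenizer class, not a generic fallback
print(tokenizer("测试")["input_ids"])    # quick round trip to confirm it loads and encodes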

