
Comments (15)

Youggls commented on August 17, 2024

My guess is that this problem is caused by skip_init not allocating memory for the relevant weights. After I changed every module initialized with skip_init back to the original initialization method, the error no longer appeared.

@Youggls Could you explain in detail how you changed it? Thanks.

@ray075hl
In modeling_chatglm.py, many modules are initialized with skip_init; all of them need to be changed to the regular initialization method, as in the screenshot below:
image
For example, the initialization of the self.dense_h_to_4h module should be changed to:
image

However, this may slow down model loading. Also, my current machine is 2x Titan XP with 12 GB of VRAM per card, and even with ZeRO stage 3 enabled I still run out of memory.
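A minimal sketch of the change described above, using the word_embeddings layer from modeling_chatglm.py as the example; the concrete sizes are placeholders (assumptions), read them from the model config in practice:

import torch
from torch import nn
from torch.nn.utils import skip_init

vocab_size, hidden_size = 130528, 4096   # placeholder values (assumption)
params_dtype = torch.float16

# Before: skip_init builds the module on the 'meta' device and only gives the
# weights empty storage via to_empty(), which is what ZeRO-3 later trips over.
word_embeddings = skip_init(
    nn.Embedding,
    num_embeddings=vocab_size, embedding_dim=hidden_size,
    dtype=params_dtype,
)

# After: construct the module directly so its parameters are materialized
# (slower to build, but no meta tensors are involved).
word_embeddings = nn.Embedding(
    num_embeddings=vocab_size, embedding_dim=hidden_size,
    dtype=params_dtype,
)

# The same change applies to the Linear layers, e.g. dense_h_to_4h:
dense_h_to_4h = nn.Linear(hidden_size, 4 * hidden_size, dtype=params_dtype)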


liangwq commented on August 17, 2024

@Youggls Please help me out 0.0

I ran into the error below and have no idea what went wrong:

 File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select)
Using /home/la/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.000453948974609375 seconds
  0%|                                                                                                                                                                               | 0/10000 [00:00<?, ?it/s]/home2/la/chatgml-tuning/modeling_chatglm.py:266: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at  /opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:1239.)
  attention_scores.masked_fill_(attention_mask.byte(), -10000.0)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 116311 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 116312) of binary: /home/la/anaconda3/envs/chatglm-tuning/bin/python
Traceback (most recent call last):
  File "/home/la/anaconda3/envs/chatglm-tuning/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

This looks like a multi-GPU device mismatch. Try specifying the CUDA device explicitly:
model = ChatGLMForConditionalGeneration.from_pretrained(model_id, cache_dir='./', trust_remote_code=True, torch_dtype=torch.float16).cuda()
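If the mismatch is between the input batch and the model rather than between two model shards, moving the batch onto the model's device usually resolves this particular RuntimeError. A minimal sketch; model and batch are assumed to come from the existing training loop:

device = next(model.parameters()).device               # the device the weights ended up on
batch = {k: v.to(device) for k, v in batch.items()}    # move every input tensor to that device
outputs = model(**batch)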


liangwq commented on August 17, 2024

With the code as it stands, you need at least 32 GB; you will have to reduce the batch size.
Getting it to train in less memory with quantization would probably take quite a while to implement.
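For reference, a minimal sketch of where the batch size lives in a DeepSpeed configuration; the key names are standard DeepSpeed config keys, while the concrete values here are assumptions to adapt to your setup:

# Minimal DeepSpeed configuration sketch (values are assumptions).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # lower this first when you run out of memory
    "gradient_accumulation_steps": 8,      # keep the effective batch size via accumulation
    "zero_optimization": {"stage": 3},
    "fp16": {"enabled": True},
}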


llplay commented on August 17, 2024

After switching DeepSpeed to stage 3, I get the error below. @liangwq have you run into this? It seems to be caused by skip_init, and I don't know how to fix it.

NotImplementedError: Cannot copy out of meta tensor; no data!
tensor(..., device='meta', size=(308281344,), dtype=torch.float16, grad_fn=)

/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py:355 in wrapper
    f(module, *args, **kwargs)
/HanLP3/Chatglm_lora_multi-gpu/modeling_chatglm.py:726 in __init__
    self.word_embeddings = skip_init(
        torch.nn.Embedding,
        num_embeddings=self.vocab_size, embedding_dim=self.hidden_size,
        dtype=self.params_dtype
/usr/local/lib/python3.8/dist-packages/torch/nn/utils/init.py:52 in skip_init
    return module_cls(*args, **kwargs).to_empty(device=final_device)
/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py:363 in wrapper
    self._post_init_method(module)
/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py:760 in _post_init_method
    param.partition()
/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py:894 in partition
    self._partition(param_list, has_been_updated=has_been_updated)
/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py:1038 in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py:11 in wrapped_fn
    ret_val = func(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py:1128 in _partition_param
    param.ds_tensor.copy_(src_tensor)
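A minimal standalone reproduction of the final failure, just to illustrate that a tensor left on the 'meta' device has no data to copy out of, which is what DeepSpeed's _partition_param runs into here (illustration only, not code from the repo):

import torch

meta_param = torch.empty(4, device="meta")   # shape and dtype only, no storage
dst = torch.empty(4)
try:
    dst.copy_(meta_param)
except NotImplementedError as err:
    print(err)   # "Cannot copy out of meta tensor; no data!"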


liangwq commented on August 17, 2024

After switching DeepSpeed to stage 3, I get this error. @liangwq have you run into this? It seems to be caused by skip_init, and I don't know how to fix it.

NotImplementedError: Cannot copy out of meta tensor; no data! tensor(..., device='meta', size=(308281344,), dtype=torch.float16, grad_fn=)


This looks like a problem when DeepSpeed is setting up across the devices. How many GPUs do you have right now?
Please also post your hardware configuration.


llplay commented on August 17, 2024

image


liangwq commented on August 17, 2024

image

You have 4 cards available. Try setting local_rank, or see whether you can use device_map. For the specifics, search for how to set the local_rank environment variable with multiple GPU cards.
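A minimal sketch of the two options above, assuming the script is launched with torchrun so LOCAL_RANK is set, and reusing the class and arguments from the snippet earlier in the thread; the import path and model_id are assumptions, and the device_map variant needs accelerate installed:

import os
import torch
from modeling_chatglm import ChatGLMForConditionalGeneration  # local file from the repo (assumption)

model_id = "THUDM/chatglm-6b"   # assumption: substitute the local checkpoint path
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# Option 1: pin the whole model to this rank's GPU.
model = ChatGLMForConditionalGeneration.from_pretrained(
    model_id, cache_dir='./', trust_remote_code=True, torch_dtype=torch.float16
).to(f"cuda:{local_rank}")

# Option 2: let accelerate spread the layers across the visible GPUs instead.
# model = ChatGLMForConditionalGeneration.from_pretrained(
#     model_id, cache_dir='./', trust_remote_code=True,
#     torch_dtype=torch.float16, device_map="auto",
# )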


Youggls commented on August 17, 2024

My guess is that this problem is caused by skip_init not allocating memory for the relevant weights. After I changed every module initialized with skip_init back to the original initialization method, the error no longer appeared.


ray075hl commented on August 17, 2024

My guess is that this problem is caused by skip_init not allocating memory for the relevant weights. After I changed every module initialized with skip_init back to the original initialization method, the error no longer appeared.

@Youggls Could you explain in detail how you changed it? Thanks.


ray075hl commented on August 17, 2024

My guess is that this problem is caused by skip_init not allocating memory for the relevant weights. After I changed every module initialized with skip_init back to the original initialization method, the error no longer appeared.

@Youggls Could you explain in detail how you changed it? Thanks.

@ray075hl In modeling_chatglm.py, many modules are initialized with skip_init; all of them need to be changed to the regular initialization method, as in the screenshot: image. For example, the initialization of the self.dense_h_to_4h module should be changed to: image

However, this may slow down model loading. Also, my current machine is 2x Titan XP with 12 GB of VRAM per card, and even with ZeRO stage 3 enabled I still run out of memory.

@Youggls Thanks for the reply, I got it working. In my case, with 4x P40 (22 GB) and stage 3 enabled, I can fine-tune with batch_size=1. Memory consumption is somewhat higher than for llama-7b; could it be that the vocabulary is too large?


Youggls commented on August 17, 2024

My guess is that this problem is caused by skip_init not allocating memory for the relevant weights. After I changed every module initialized with skip_init back to the original initialization method, the error no longer appeared.

@Youggls Could you explain in detail how you changed it? Thanks.

@ray075hl In modeling_chatglm.py, many modules are initialized with skip_init; all of them need to be changed to the regular initialization method, as in the screenshot: image. For example, the initialization of the self.dense_h_to_4h module should be changed to: image
However, this may slow down model loading. Also, my current machine is 2x Titan XP with 12 GB of VRAM per card, and even with ZeRO stage 3 enabled I still run out of memory.

@Youggls Thanks for the reply, I got it working. In my case, with 4x P40 (22 GB) and stage 3 enabled, I can fine-tune with batch_size=1. Memory consumption is somewhat higher than for llama-7b; could it be that the vocabulary is too large?

That could be the reason, but I haven't studied the two architectures closely; the embedding matrix is indeed the largest single chunk of the model's parameters.
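A rough back-of-the-envelope comparison; the vocabulary and hidden sizes below are assumptions based on the commonly published configs, so check the local config.json for the exact numbers:

# Embedding-matrix size comparison (sizes are assumptions).
hidden_size = 4096
for name, vocab in [("ChatGLM-6B", 130528), ("LLaMA-7B", 32000)]:
    params = vocab * hidden_size
    print(f"{name}: {params / 1e6:.0f}M embedding parameters, "
          f"~{params * 2 / 1024**3:.2f} GiB in fp16")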


xiaoweiweixiao commented on August 17, 2024

@Youggls Please help me out 0.0

I ran into the error below and have no idea what went wrong:

 File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select)
Using /home/la/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.000453948974609375 seconds
  0%|                                                                                                                                                                               | 0/10000 [00:00<?, ?it/s]/home2/la/chatgml-tuning/modeling_chatglm.py:266: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at  /opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:1239.)
  attention_scores.masked_fill_(attention_mask.byte(), -10000.0)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 116311 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 116312) of binary: /home/la/anaconda3/envs/chatglm-tuning/bin/python
Traceback (most recent call last):
  File "/home/la/anaconda3/envs/chatglm-tuning/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


littlerookie commented on August 17, 2024

Hello, after making the skip_init changes described above, running it now reports the following error:
image
Do you know how this should be fixed?


littlerookie commented on August 17, 2024

My guess is that this problem is caused by skip_init not allocating memory for the relevant weights. After I changed every module initialized with skip_init back to the original initialization method, the error no longer appeared.

@Youggls Could you explain in detail how you changed it? Thanks.

@ray075hl In modeling_chatglm.py, many modules are initialized with skip_init; all of them need to be changed to the regular initialization method, as in the screenshot: image. For example, the initialization of the self.dense_h_to_4h module should be changed to: image

However, this may slow down model loading. Also, my current machine is 2x Titan XP with 12 GB of VRAM per card, and even with ZeRO stage 3 enabled I still run out of memory.

Did you manage to get it working with 12 GB of VRAM? If so, could you point out what needs to be changed?


Youggls commented on August 17, 2024

Hello, after making the skip_init changes described above, running it now reports the following error: image. Do you know how this should be fixed?

It looks like the tokenizer is raising the error; check the tokenizer you are loading and try again.
Also, I have not gotten it working with 12 GB of VRAM myself; based on others' experience, it may take 4x 12 GB to run.
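A minimal sanity check for the tokenizer, assuming it is loaded through transformers; the model_id and arguments are assumptions consistent with the snippets earlier in the thread:

from transformers import AutoTokenizer

model_id = "THUDM/chatglm-6b"   # assumption: substitute the local checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir='./', trust_remote_code=True)
print(type(tokenizer))                  # should be the ChatGLM tokenizer class, not a generic fallback
print(tokenizer("测试")["input_ids"])    # quick round trip to confirm it loads and encodes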

