
Train on Multi GPU (diffusion-lm, OPEN, 3 comments)

xiangli1999 commented on July 24, 2024
Train on Multi GPU


Comments (3)

henrydylan commented on July 24, 2024

By the way, this is the traceback I get when directly running the training script:

Traceback (most recent call last):
  File "scripts/train.py", line 208, in <module>
    main()
  File "scripts/train.py", line 143, in main
    TrainLoop(
  File "/mnt/clam/hhchen/Diffusion-LM/improved-diffusion/improved_diffusion/train_util.py", line 100, in __init__
    self.ema_params = [
  File "/mnt/clam/hhchen/Diffusion-LM/improved-diffusion/improved_diffusion/train_util.py", line 101, in <listcomp>
    copy.deepcopy(self.master_params) for _ in range(len(self.ema_rate))
  File "/home/hhchen/miniconda3/envs/Diffusion/lib/python3.8/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/home/hhchen/miniconda3/envs/Diffusion/lib/python3.8/copy.py", line 205, in _deepcopy_list
    append(deepcopy(a, memo))
  File "/home/hhchen/miniconda3/envs/Diffusion/lib/python3.8/copy.py", line 153, in deepcopy
    y = copier(memo)
  File "/home/hhchen/miniconda3/envs/Diffusion/lib/python3.8/site-packages/torch/nn/parameter.py", line 32, in __deepcopy__
    result = type(self)(self.data.clone(memory_format=torch.preserve_format), self.requires_grad)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.70 GiB total capacity; 475.48 MiB already allocated; 3.81 MiB free; 522.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF


henrydylan commented on July 24, 2024

Sorry... my bad. It turns out someone else had been using up all the GPU memory without my noticing. Now I can run the code! But I'd still like to know the answer to this question anyway...
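For anyone who hits the same OOM: it's worth checking that the target GPU is actually free, and pinning the job to an idle one, before launching. A minimal sketch; the GPU index here is hypothetical, and torch.cuda.mem_get_info needs a reasonably recent PyTorch:

import os

# Pin this process to one specific GPU *before* any CUDA context is created.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # hypothetical: an index nvidia-smi shows as idle

import torch

free, total = torch.cuda.mem_get_info(0)  # device 0 = the single visible GPU
print(f"free {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")

A reading like the one in the traceback above (3.81 MiB free on a 23.70 GiB card, with only ~500 MiB allocated by this process) almost always means another process is occupying the device.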


Junyi42 commented on July 24, 2024

Hi, I was also trying to train the model on multiple GPUs a few weeks ago. I just followed the setup from the iDDPM repo; you can do it by modifying the code in scripts/run_train.py.

On line 100:
f"python scripts/train.py " \

change it to:
f"mpiexec -n 2 python scripts/train.py " \

Here, 2 is the number of GPUs you want to parallelize across (note the trailing space before the closing quote, so the next flag isn't concatenated onto train.py). It works fine on my cluster; hope this helps.
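For reference, here is the change above as a self-contained sketch. Only the mpiexec prefix is the actual edit; the flag and output directory are placeholders, since the real scripts/run_train.py assembles a much longer command:

import subprocess

NUM_GPUS = 2  # one MPI rank per GPU
folder_name = "diffusion_models/my_run"  # hypothetical output dir

COMMANDLINE = (
    f"mpiexec -n {NUM_GPUS} python scripts/train.py "  # was: f"python scripts/train.py "
    f"--checkpoint_path {folder_name} "                # remaining flags unchanged
)
print(COMMANDLINE)  # inspect the command first
# subprocess.run(COMMANDLINE, shell=True, check=True)  # then launch

This works because improved-diffusion sets up torch.distributed from the MPI environment (its dist_util derives rank and world size from MPI), so each mpiexec rank becomes one training process on its own GPU.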

