
Train on Multi GPU (diffusion-lm, OPEN, 3 comments)

xiangli1999 commented on July 24, 2024
Train on Multi GPU


Comments (3)

henrydylan commented on July 24, 2024

By the way, this is the traceback I get when directly running the training script:

Traceback (most recent call last):
  File "scripts/train.py", line 208, in <module>
    main()
  File "scripts/train.py", line 143, in main
    TrainLoop(
  File "/mnt/clam/hhchen/Diffusion-LM/improved-diffusion/improved_diffusion/train_util.py", line 100, in __init__
    self.ema_params = [
  File "/mnt/clam/hhchen/Diffusion-LM/improved-diffusion/improved_diffusion/train_util.py", line 101, in <listcomp>
    copy.deepcopy(self.master_params) for _ in range(len(self.ema_rate))
  File "/home/hhchen/miniconda3/envs/Diffusion/lib/python3.8/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/home/hhchen/miniconda3/envs/Diffusion/lib/python3.8/copy.py", line 205, in _deepcopy_list
    append(deepcopy(a, memo))
  File "/home/hhchen/miniconda3/envs/Diffusion/lib/python3.8/copy.py", line 153, in deepcopy
    y = copier(memo)
  File "/home/hhchen/miniconda3/envs/Diffusion/lib/python3.8/site-packages/torch/nn/parameter.py", line 32, in __deepcopy__
    result = type(self)(self.data.clone(memory_format=torch.preserve_format), self.requires_grad)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.70 GiB total capacity; 475.48 MiB already allocated; 3.81 MiB free; 522.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF


henrydylan commented on July 24, 2024

Sorry... my bad. It turns out someone else had been using up all the GPU memory without my noticing. Now I can run the code! But I'd still like to know the answer to this question anyway...
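For anyone who hits the same OOM: it's worth checking that the target GPU is actually free, and pinning the job to an idle one, before launching. A minimal sketch; the GPU index here is hypothetical, and torch.cuda.mem_get_info needs a reasonably recent PyTorch:

import os

# Pin this process to one specific GPU *before* any CUDA context is created.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # hypothetical: an index nvidia-smi shows as idle

import torch

free, total = torch.cuda.mem_get_info(0)  # device 0 = the single visible GPU
print(f"free {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")

A reading like the one in the traceback above (3.81 MiB free on a 23.70 GiB card, with only ~500 MiB allocated by this process) almost always means another process is occupying the device.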


Junyi42 commented on July 24, 2024

Hi, I was also trying to train the model on multiple GPUs a few weeks ago. I just followed the setup from the iDDPM repo; you can do it by modifying the code in scripts/run_train.py.

On line 100:
f"python scripts/train.py " \

change it to:
f"mpiexec -n 2 python scripts/train.py " \

Here, 2 is the number of GPUs you want to parallelize across (note the trailing space before the closing quote, so the next flag isn't concatenated onto train.py). It works fine on my cluster; hope this helps.
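For reference, here is the change above as a self-contained sketch. Only the mpiexec prefix is the actual edit; the flag and output directory are placeholders, since the real scripts/run_train.py assembles a much longer command:

import subprocess

NUM_GPUS = 2  # one MPI rank per GPU
folder_name = "diffusion_models/my_run"  # hypothetical output dir

COMMANDLINE = (
    f"mpiexec -n {NUM_GPUS} python scripts/train.py "  # was: f"python scripts/train.py "
    f"--checkpoint_path {folder_name} "                # remaining flags unchanged
)
print(COMMANDLINE)  # inspect the command first
# subprocess.run(COMMANDLINE, shell=True, check=True)  # then launch

This works because improved-diffusion sets up torch.distributed from the MPI environment (its dist_util derives rank and world size from MPI), so each mpiexec rank becomes one training process on its own GPU.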

