Comments (3)
By the way, this is the traceback from directly running the training script:
Traceback (most recent call last):
  File "scripts/train.py", line 208, in <module>
    main()
  File "scripts/train.py", line 143, in main
    TrainLoop(
  File "/mnt/clam/hhchen/Diffusion-LM/improved-diffusion/improved_diffusion/train_util.py", line 100, in __init__
    self.ema_params = [
  File "/mnt/clam/hhchen/Diffusion-LM/improved-diffusion/improved_diffusion/train_util.py", line 101, in <listcomp>
    copy.deepcopy(self.master_params) for _ in range(len(self.ema_rate))
  File "/home/hhchen/miniconda3/envs/Diffusion/lib/python3.8/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/home/hhchen/miniconda3/envs/Diffusion/lib/python3.8/copy.py", line 205, in _deepcopy_list
    append(deepcopy(a, memo))
  File "/home/hhchen/miniconda3/envs/Diffusion/lib/python3.8/copy.py", line 153, in deepcopy
    y = copier(memo)
  File "/home/hhchen/miniconda3/envs/Diffusion/lib/python3.8/site-packages/torch/nn/parameter.py", line 32, in __deepcopy__
    result = type(self)(self.data.clone(memory_format=torch.preserve_format), self.requires_grad)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.70 GiB total capacity; 475.48 MiB already allocated; 3.81 MiB free; 522.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
from diffusion-lm.
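For context on where the allocation fails: TrainLoop.__init__ deep-copies the full parameter list once per EMA rate, so initialization needs an extra full-size copy of the model weights on the GPU for every rate. A minimal pure-Python sketch of that pattern (plain lists stand in for CUDA tensors here; the variable names mirror train_util.py but this is not the repo's code):

```python
import copy

# Stand-in for the model's parameter list; in the real code these are
# CUDA tensors, and deepcopy clones each one on the GPU.
master_params = [[1.0, 2.0], [3.0]]
ema_rate = [0.9999]  # one EMA rate -> one extra full copy of the weights

# This mirrors the list comprehension at train_util.py line 101.
ema_params = [copy.deepcopy(master_params) for _ in range(len(ema_rate))]

print(len(ema_params))                     # one copy per EMA rate
print(ema_params[0] is not master_params)  # a genuinely new object, not an alias
```

So even a model that fits for training can fail at this line when the GPU is nearly full, because the EMA copies are allocated up front.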
Sorry... my bad. It turns out someone else had been using up all the GPU memory without my noticing. Now I can run the code! But I'd still like to know the answer to this question anyway...
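The numbers in the traceback already hint at this: only ~475 MiB allocated by this process and 3.81 MiB free on a 23.70 GiB card means something outside the process holds the memory. A small hypothetical helper (not part of the repo) that flags that situation from the allocator's own figures:

```python
def memory_looks_external(free_mib, total_mib, reserved_mib):
    """Heuristic: almost nothing is free, yet this process reserves only a
    small fraction of the card -> another process likely owns the GPU.
    Thresholds (1% free, 10% reserved) are illustrative assumptions."""
    return free_mib < 0.01 * total_mib and reserved_mib < 0.1 * total_mib

# Figures taken from the traceback above:
# 3.81 MiB free, 23.70 GiB total, 522.00 MiB reserved by PyTorch.
print(memory_looks_external(3.81, 23.70 * 1024, 522.0))
```

In practice, checking `nvidia-smi` before launching shows which processes hold the memory and makes this kind of failure obvious immediately.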
Hi, I was also trying to train the model on multiple GPUs a few weeks ago. I simply followed the settings in the iDDPM repo; you can do it by modifying scripts/run_train.py.
In line 100:
f"python scripts/train.py " \
change it to:
f"mpiexec -n 2 python scripts/train.py " \
Here, 2 is the number of GPUs you want to parallelize over. It works fine on my cluster; hope this helps.
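For clarity, the edit above just prepends the MPI launcher to the command string that run_train.py assembles. A hedged sketch of the construction (variable names are illustrative, not the repo's; note the trailing space, which matters because further arguments are concatenated after this fragment):

```python
num_gpus = 2  # number of GPUs to parallelize over

# run_train.py builds the training command as concatenated f-string pieces;
# the multi-GPU variant only changes the leading fragment.
cmd = f"mpiexec -n {num_gpus} python scripts/train.py "

print(cmd.strip())
```

Each MPI rank then runs one copy of scripts/train.py, so n should not exceed the number of visible GPUs.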
Related Issues (20)
- I wonder where to find the model in the predictability
- Training on A100
- Separate weights for word embedding and lm-head?
- Questions about the result of success rate of PPLM?
- Why not directly use Emb(W) as X_0?
- Error when running training script on Google Colab
- Fail to load GPT2 pretrained model for attribute controlled generation
- Reproducing Table 5: Sentence Infilling - CIDEr / BLEU-4 metrics
- Baseline reproduction
- Error when running: Exception in thread Thread-4: ... ValueError: signal number 32 out of range
- Which classifier to use in custom_trainer.py for controllable generation?
- About the tT_loss
- The difference between this code and the paper "IDDPM" in the run_loop function in train_util.py
- In the Controllable Text Generation section, after the model trained for 6 epochs and started evaluating, it raised a KeyError: 'eval_loss'
- Questions about the NLL loss
- E2E training procedure
- Issue while generating controllable text generation
- How to Execute the Semantic Content Subtask with infill.py
- Seq2Seq tasks with Diffusion LM
- Difficulty in running code