kvablack / ddpo-pytorch
DDPO for finetuning diffusion models, implemented in PyTorch with LoRA support
License: MIT License
For the reproducibility experiments, the script currently has use_lora=True
in dgx.py
. I just want to double-check that this is indeed intended, because the README.md seems a bit unclear on this point.
Hello, when I trained an aesthetic model using the default configuration on 8 A800 cards, the training process got stuck after completing one epoch, but it worked fine on a single A800 card. What could be the cause of this?
This may be a noob question, but can you explain this line, where you cast the model to fp16 or bf16 only when using LoRA?
https://github.com/kvablack/ddpo-pytorch/blob/173b2bb6e0e3b2feb7587c98bb54f63b1d3867d5/scripts/train.py#L117C22-L117C22
I am running the prompt-alignment experiment with LLaVA-server, although I am using BLIP-2 instead of LLaVA.
I wanted to see the VLM's caption of each image alongside the prompt, image, and reward, so I added this extra logging to wandb. To pass the caption strings from the server back to the main training process, I converted the fixed-length strings into ASCII integers with ord()
, so they could be packed into a torch.tensor
before calling accelerator.gather
at this line, and then converted back to strings with chr()
. As shown in the image below, the prompts and the VLM captions I received from the server do not match.
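The ord()/chr() round-trip I describe can be sketched like this (plain Python lists stand in for the torch tensors; MAX_LEN and PAD are illustrative names, not from the repo):

```python
MAX_LEN = 64  # fixed length so every encoded caption has the same shape
PAD = 0       # padding value, dropped again on decode

def encode(s: str) -> list[int]:
    """Convert a string to a fixed-length list of code points via ord()."""
    codes = [ord(c) for c in s[:MAX_LEN]]
    return codes + [PAD] * (MAX_LEN - len(codes))

def decode(codes: list[int]) -> str:
    """Convert code points back to a string via chr(), dropping padding."""
    return "".join(chr(c) for c in codes if c != PAD)
```

In the actual script the encoded lists would be stacked into a torch.tensor before accelerator.gather and decoded afterwards.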
Then I tried a trick to match the client-side input prompts with the server's responses. For each prompt generated with prompt_fn
, I generate a random 5-digit id. This id is passed to the server and prepended to the VLM's output, and I then use the prompts' ids to retrieve the corresponding captions. As shown below, the prompts and captions now match after using this "id" trick. I also appended the computed rewards to the captions on the server side before sending the response back to the client. However, the rewards appended to the captions do not match the rewards on the client side (code). It seems that the server's responses do not preserve the order of the queries it receives.
Could you verify whether the current code has this problem, where the order of the server's responses doesn't match that of the client's queries? I am seeing clear training progress, which shouldn't be the case if the rewards' order is scrambled.
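The id trick described above can be sketched roughly like this (all names are illustrative, not the actual server API):

```python
import random

def tag_prompts(prompts):
    """Prepend a unique 5-digit id to each prompt before sending it to the server."""
    ids = [str(n) for n in random.sample(range(10000, 100000), len(prompts))]
    tagged = [f"{i}|{p}" for i, p in zip(ids, prompts)]
    return ids, tagged

def match_responses(ids, responses):
    """Reorder server responses (each prefixed "id|text") back to query order."""
    by_id = dict(r.split("|", 1) for r in responses)
    return [by_id[i] for i in ids]
```

Because each response carries its query's id, the client can recover the original order even if the server replies out of order.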
Hi,
I am experiencing out-of-memory errors with this codebase even at the start of epoch 0, despite using an A100 with 80 GB of VRAM and 128 GB of RAM :/
I am using the following changes in the config file:
config.sample.batch_size = 1
config.sample.num_batches_per_epoch = 256
config.train.batch_size = 1
config.train.gradient_accumulation_steps = 128
Any chance for Stable Diffusion XL models support in the near future?
Hi,
Thanks for sharing the code!
I am using your code to fine-tune the model stabilityai/stable-diffusion-2-1
with the aesthetic
reward, and I have set use_lora=True as well. But training is very memory-intensive: on an 80 GB A100 it cannot even fit a batch size of 2 per GPU, and I always get an OOM error. Below are my settings:
config = compressibility()
config.project_name = "ddpo-aesthetic"
config.pretrained.model = "stabilityai/stable-diffusion-2-1"
config.num_epochs = 20000
config.reward_fn = "aesthetic_score"
# the DGX machine I used had 8 GPUs, so this corresponds to 8 * 8 * 4 = 256 samples per epoch.
config.sample.batch_size = 2
config.sample.num_batches_per_epoch = 1
# this corresponds to (8 * 4) / (4 * 2) = 4 gradient updates per epoch.
config.train.batch_size = 2
config.train.gradient_accumulation_steps = 1
config.prompt_fn = "simple_animals"
config.per_prompt_stat_tracking = {
"buffer_size": 32,
"min_count": 16,
}
Any suggestions regarding this? I appreciate your help!
Many thanks for conducting this excellent work!
I ran into two questions while trying to reproduce the experiments.
Many thanks if you can give me some help :-)
I am using the DDPO logic to fine-tune my own model.
However, I found that the example reward function (LLaVA BERTScore) uses a fixed batch size.
After reading the source code in this repo and the TRL DDPOTrainer class, I found that this batch size may be related to sample_batch_size.
I recommend replacing the hard-coded batch size with the one in the config, or at least leaving a comment on it. That way, people who want to design their own reward function would have a more sensible guide.
Below is the example reward in this repo that I mentioned above.
def llava_bertscore():
    """Submits images to LLaVA and computes a reward by comparing the responses to the prompts using BERTScore. See
    https://github.com/kvablack/LLaVA-server for server-side code.
    """
    import requests
    from requests.adapters import HTTPAdapter, Retry
    from io import BytesIO
    import pickle

    import numpy as np  # needed below; imported at module level in the repo
    import torch

    batch_size = 16
    url = "http://127.0.0.1:8085"
    sess = requests.Session()
    retries = Retry(
        total=1000, backoff_factor=1, status_forcelist=[500], allowed_methods=False
    )
    sess.mount("http://", HTTPAdapter(max_retries=retries))

    def _fn(images, prompts, metadata):
        del metadata
        if isinstance(images, torch.Tensor):
            images = (images * 255).round().clamp(0, 255).to(torch.uint8).cpu().numpy()
            images = images.transpose(0, 2, 3, 1)  # NCHW -> NHWC
        images_batched = np.array_split(images, np.ceil(len(images) / batch_size))
        prompts_batched = np.array_split(prompts, np.ceil(len(prompts) / batch_size))
        ...
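The change I am suggesting is roughly this (a minimal sketch; the factory name and the idea of passing the batch size in are mine, not the repo's actual API):

```python
import numpy as np

def make_reward_fn(batch_size: int = 16):
    """Sketch: expose the reward batch size as a parameter instead of a
    hidden constant, so callers can align it with config.sample.batch_size."""
    def _fn(images, prompts):
        # split into ceil(N / batch_size) near-equal chunks, as in the original
        n_chunks = int(np.ceil(len(images) / batch_size))
        images_batched = np.array_split(images, n_chunks)
        prompts_batched = np.array_split(np.array(prompts), n_chunks)
        return images_batched, prompts_batched
    return _fn
```

A comment on the hard-coded 16 would serve the same purpose if changing the signature is undesirable.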
And this is the code that uses compute_rewards() in the DDPOTrainer class in the TRL repo:
def step(self, epoch: int, global_step: int):
    """
    Perform a single step of training.

    Args:
        epoch (int): The current epoch.
        global_step (int): The current global step.

    Side Effects:
        - Model weights are updated.
        - Logs the statistics to the accelerator trackers.
        - If `self.image_samples_callback` is not None, it will be called with the
          prompt_image_pairs, global_step, and the accelerator tracker.

    Returns:
        global_step (int): The updated global step.
    """
    samples, prompt_image_data = self._generate_samples(
        iterations=self.config.sample_num_batches_per_epoch,
        batch_size=self.config.sample_batch_size,
    )

    # collate samples into a dict where each entry has shape
    # (num_batches_per_epoch * sample.batch_size, ...)
    samples = {k: torch.cat([s[k] for s in samples]) for k in samples[0].keys()}

    rewards, rewards_metadata = self.compute_rewards(
        prompt_image_data, is_async=self.config.async_reward_computation
    )
    ...
I am a bit confused by the logic in the train script.
A "new" unet is defined as pipeline.unet, unet.parameters() is then passed to the optimizer, and finally the loss is computed from unet. So can I understand that this new unet is the one being updated?
However, pipeline.unet is what should be updated, and I can observe that the unet inside the pipeline is indeed updated, not just the new unet.
Can anybody tell me why this new unet needs to be defined? Could we not just use something like:
optimizer(pipeline.unet.parameters(), ...)
noise_pred = pipeline.unet(...)
Thank you very much.
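On the aliasing point: `unet = pipeline.unet` binds a second name to the same module object, so optimizer updates are visible through both names. A toy sketch (plain Python objects stand in for the pipeline and UNet; no diffusers involved):

```python
class ToyUNet:
    def __init__(self):
        self.weight = 0.0  # stand-in for the module's parameters

class ToyPipeline:
    def __init__(self):
        self.unet = ToyUNet()

pipeline = ToyPipeline()
unet = pipeline.unet   # same object, second name -- no copy is made

unet.weight += 1.0     # "optimizer step" through the local name
assert pipeline.unet.weight == 1.0  # visible through the pipeline too
assert unet is pipeline.unet
```

One plausible reason for keeping a dedicated variable in the actual script is that accelerate's prepare() can return a wrapped module, which you would not want to assign back into the pipeline; that is an assumption about intent, not something the repo states.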
I am trying to reproduce the aesthetic experiment on a single GPU. I made the following changes to the config:
config.sample.batch_size = 1
config.sample.num_batches_per_epoch = 256
config.train.batch_size = 1
config.train.gradient_accumulation_steps = 128
My results are summarized in the following figure: [figure omitted]
A few questions I have regarding the results:
1. Does ddpo-pytorch show this behavior?
2. Is there a reference curve (like teaser.jpg) that we could compare with for all four experiments?
Many thanks for conducting this excellent work. While reading this repo, I raised two questions about the optimized objective.
advantages
corresponds to r(x_0, c)
and ratio
corresponds to the importance-sampling term. But where is the gradient term \nabla_\theta \log p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{c}, t, \mathbf{x}_t\right)
? Is it because taking the gradient of ratio implicitly generates this term?
# ppo logic
advantages = torch.clamp(
sample["advantages"], -config.train.adv_clip_max, config.train.adv_clip_max
)
ratio = torch.exp(log_prob - sample["log_probs"][:, j])
unclipped_loss = -advantages * ratio
clipped_loss = -advantages * torch.clamp(
ratio, 1.0 - config.train.clip_range, 1.0 + config.train.clip_range
)
loss = torch.mean(torch.maximum(unclipped_loss, clipped_loss))
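For reference, the log-derivative identity makes the connection explicit (standard PPO reasoning, not something specific to this repo). Writing ratio = p_\theta(\mathbf{x}_{t-1} \mid \mathbf{c}, t, \mathbf{x}_t) / p_{\theta_{old}}(\mathbf{x}_{t-1} \mid \mathbf{c}, t, \mathbf{x}_t),

\nabla_\theta \, \text{ratio} = \text{ratio} \cdot \nabla_\theta \log p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{c}, t, \mathbf{x}_t\right),

so differentiating -advantages * ratio does implicitly produce the \nabla_\theta \log p_\theta term; at \theta = \theta_{old} the ratio equals 1 and the gradient reduces exactly to the REINFORCE-style term.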
clip_range
is set to 1e-4 by default, which makes clipped_loss
close to -advantages
. The subsequent torch.maximum(unclipped_loss, clipped_loss)
then bounds loss
to be approximately no less than -advantages
. What is the aim of setting such a small value of clip_range
?
Hi,
thanks for the reimplementation! Your GIF visualization made with iceberg is super nice! Could you maybe also share the code for it?
Thanks a lot!
I tried this code (compressibility finetuning) on Colab but ran into GPU memory overflow. I even reduced the batch size and sample size, but that did not solve the problem. I should mention that I used the free GPU, a T4 with 15 GB of GPU RAM (it is claimed that this code can run on 10 GB).
Any suggestions or help?
This code currently only supports DDIM. In the recently released SD-XL, the default scheduler is EulerDiscrete. From the paper and the code, it seems that prev_sample
is no longer sampled from a Gaussian distribution but is instead an ODE solution (correct me if I am wrong). How can the log_prob of prev_sample
be calculated given noise_pred
in this case?
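For context on why the DDIM case is tractable: with eta > 0, prev_sample is drawn from a Gaussian whose mean and std are derived from noise_pred, so its log-probability is just the diagonal-Gaussian log-density. A minimal sketch (plain Python; names are illustrative):

```python
import math

def gaussian_log_prob(sample, mean, std):
    """Log-density of a diagonal Gaussian, summed over dimensions --
    the per-step quantity DDPO needs for a stochastic (eta > 0) DDIM step."""
    return sum(
        -((x - m) ** 2) / (2 * s ** 2) - math.log(s) - 0.5 * math.log(2 * math.pi)
        for x, m, s in zip(sample, mean, std)
    )
```

For a deterministic ODE step there is no such density to evaluate, which is exactly the difficulty raised above.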
Hi! I've a couple of questions on the LLaVa alignment:
I saw you mentioned a prompt-dependent value function at #7 (comment). By chance, I happen to be using DDPO for related optimizations. Consider the ideal situation where there is only one prompt and its corresponding reward function. I still found that in the early stages of training the reward mean fluctuates a lot, even if I increase the training batch size or reduce the learning rate, although the overall reward mean does rise in the end. Are there any optimization techniques to make the optimization of a single prompt stable? Any suggestions or insights would be greatly appreciated.
For the same total batch size, it is recommended to use a larger number of gradient accumulation steps on a single GPU instead of multiple GPUs, considering huggingface/diffusers#4046; multi-GPU runs may otherwise lead to fluctuations in the reward.
The result of training with llava_bertscore reward is:
But the result of training with aesthetic reward is:
This shows that the reinforcement learning works.
The configuration files, i.e. dgx.py
and base.py
, were not modified.
The hardware used for training was 8 A800s. The llava-server ran on GPU0 and GPU1; the training program ran on all 8 GPUs.
GPU0 and GPU1 each used 59 GB of memory, and the others used 30 GB.
The LLaVA version is liuhaotian/LLaVA-Lightning-MPT-7B-preview.
Thanks for your great work!
I have repeated the LoRA training and got the same results with the supplied prompts. But I have some issues with large datasets and full-UNet training.
When I train the LoRA with 400 prompts, I find the reward easily overfits to the maximum within 4K steps. Even if I increase the prompts to 20K, the same problem happens.
So, does it work for large datasets? How should DDPO be trained on a large dataset?
I set "config.use_lora = False" in config/base.py to train the UNet, but the reward dropped to zero within tens of steps, and the SD model generated black images.