showlab / boxdiff Goto Github PK


[ICCV 2023] BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion

Python 100.00%

diffusion-models text-to-image-synthesis

boxdiff's Issues

Problem with Tkinter library

I'm trying to replicate the setup and the examples shown in the README, and I'm having problems with the Tkinter library used in utils/drawer. The code crashes when it tries to import the library. I worked around the problem by commenting out the calls to draw_rectangle and DashedImageDraw and not importing anything from utils.drawer, but I'm wondering whether something else needs to be set up to make Tkinter available. Thanks.
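
A minimal sketch of a less invasive workaround than commenting out code: guard the import so the rest of the script still runs when Tkinter is absent. The helper name `tkinter_available` and the `HAS_TK` flag are hypothetical and not part of the repository. Note that on many Linux distributions Tkinter is a separate system package (for example `python3-tk` on Debian/Ubuntu) and cannot be installed through pip.

```python
import importlib


def tkinter_available() -> bool:
    """Return True if the tkinter module can be imported in this interpreter."""
    try:
        importlib.import_module("tkinter")
    except ImportError:
        return False
    return True


# Hypothetical guard: only use the interactive drawing helpers when Tkinter exists.
HAS_TK = tkinter_available()
if not HAS_TK:
    print("Tkinter not found; skipping interactive box drawing.")
```

Code that depends on utils.drawer could then be wrapped in `if HAS_TK:` instead of being commented out.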

About the degraded result

Hi, thanks for your interesting work. I gave it a try, but the quality of the generated images is unsatisfactory.

Given the prompt "A dog plays a ball, a cat is sleeping" and a layout box for each subject, the generated results do not follow the layouts.

Could you share the weights of the YOLOv4 model pretrained with COCO-stuff dataset?

First, thank you for your wonderful research.
It seems that you conducted your research with a YOLOv4 model trained on the COCO-stuff dataset.
I'm aiming to reproduce your paper's YOLO mAP scores on the COCO-stuff dataset.
I'd like to know where you downloaded the pretrained weights. Or, if you trained the model yourself, could you please provide the pretrained weights so that I can test your model with them? I've been struggling to find a YOLO model trained on the COCO-stuff dataset (not the one trained on COCO, with 81 classes).
Thank you in advance.

RuntimeError: CUDA out of memory (Tesla V100-SXM2 with 16 G memory)

Is there any solution for this?
I changed the resolution from 16 to 8, but for many examples I still get the same error.

GLIGEN vs BoxDiff

Hi. In the paper you mention that BoxDiff can work as a plug-and-play component with GLIGEN. Could you provide more details? Don't the two projects do the same thing?

I integrated BoxDiff into diffusers

Feel free to check out https://github.com/huggingface/diffusers/tree/main/examples/community#stable-diffusion-boxdiff

Example use case:

import torch
from PIL import Image, ImageDraw
from copy import deepcopy

from examples.community.pipeline_stable_diffusion_boxdiff import StableDiffusionBoxDiffPipeline

def draw_box_with_text(img, boxes, names):
    colors = ["red", "olive", "blue", "green", "orange", "brown", "cyan", "purple"]
    img_new = deepcopy(img)
    draw = ImageDraw.Draw(img_new)

    W, H = img.size
    for bid, box in enumerate(boxes):
        draw.rectangle([box[0] * W, box[1] * H, box[2] * W, box[3] * H], outline=colors[bid % len(colors)], width=4)
        draw.text((box[0] * W, box[1] * H), names[bid], fill=colors[bid % len(colors)])
    return img_new

pipe = StableDiffusionBoxDiffPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# example 1
prompt = "as the aurora lights up the sky, a herd of reindeer leisurely wanders on the grassy meadow, admiring the breathtaking view, a serene lake quietly reflects the magnificent display, and in the distance, a snow-capped mountain stands majestically, fantasy, 8k, highly detailed"
phrases = [
    "aurora",
    "reindeer",
    "meadow",
    "lake",
    "mountain"
]
boxes = [[1,3,512,202], [75,344,421,495], [1,327,508,507], [2,217,507,341], [1,135,509,242]]

# example 2
# prompt = "A rabbit wearing sunglasses looks very proud"
# phrases = ["rabbit", "sunglasses"]
# boxes = [[67,87,366,512], [66,130,364,262]]

boxes = [[x / 512 for x in box] for box in boxes]

images = pipe(
    prompt,
    boxdiff_phrases=phrases,
    boxdiff_boxes=boxes,
    boxdiff_kwargs={
        "attention_res": 16,
        "normalize_eot": True
    },
    num_inference_steps=50,
    guidance_scale=7.5,
    generator=torch.manual_seed(42),
    safety_checker=None
).images

draw_box_with_text(images[0], boxes, phrases).save("output.png")
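
The example above divides every coordinate by 512, which assumes a 512x512 output image. Below is a minimal sketch of the same normalization step with a bounds check added; `normalize_boxes` is a hypothetical helper and not part of the pipeline.

```python
def normalize_boxes(boxes, width=512, height=512):
    """Scale [x0, y0, x1, y1] pixel boxes to the 0..1 range."""
    normalized = []
    for x0, y0, x1, y1 in boxes:
        # Reject boxes that are empty, inverted, or outside the image.
        if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
            raise ValueError(
                f"box {[x0, y0, x1, y1]} does not fit a {width}x{height} image"
            )
        normalized.append([x0 / width, y0 / height, x1 / width, y1 / height])
    return normalized


# Boxes from "example 2" above.
rabbit_boxes = normalize_boxes([[67, 87, 366, 512], [66, 130, 364, 262]])
```

Failing early on an out-of-range box is a design choice for this sketch; the original one-liner performs no validation.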

cross_attention.py is deprecated; attention_processor.py should be used instead

ptp_utils.py imports CrossAttention from diffusers/models/cross_attention.py:

from diffusers.models.cross_attention import CrossAttention

But as of July 26 2023, cross_attention.py is deprecated; attention_processor.py should be used instead:

huggingface/diffusers#4299
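
One way to keep ptp_utils.py working across diffusers versions is to try the new module first and fall back to the old one. The sketch below is a generic helper for that pattern; `import_first_available` is a hypothetical name, and the commented-out usage assumes that newer diffusers releases expose the class as `Attention` in attention_processor.py, which should be verified against the installed version.

```python
import importlib


def import_first_available(candidates):
    """Return the first attribute importable from a list of (module, name) pairs."""
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module missing in this environment, try the next one
        if hasattr(module, attr_name):
            return getattr(module, attr_name)
    raise ImportError(f"none of {candidates} could be imported")


# Intended use in ptp_utils.py (assumption, see the note above):
# CrossAttention = import_first_available([
#     ("diffusers.models.attention_processor", "Attention"),
#     ("diffusers.models.cross_attention", "CrossAttention"),
# ])
```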

python version

I use Python 3.8.18 (conda create -n boxdiff python=3.8).
After running pip3 install -r requirements.txt and installing diffusers with pip3 install -e .,

I ran CUDA_VISIBLE_DEVICES=0 python3 run_sd_boxdiff.py --prompt "A rabbit wearing sunglasses looks very proud" --P 0.2 --L 1 --seeds [1,2,3,4,5,6,7,8,9] --token_indices [2,4] --bbox [[67,87,366,512],[66,130,364,262]] and got the following errors:

envs/boxdiff/lib/python3.8/site-packages/torch/serialization.py", line 242, in __init__
super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

envs/boxdiff/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

BoxDiff/diffusers/src/diffusers/models/modeling_utils.py", line 119, in load_state_dict
raise OSError(
OSError: Unable to load weights from checkpoint file for '../stable-diffusion-v1-4/unet/diffusion_pytorch_model.bin' at '../stable-diffusion-v1-4/unet/diffusion_pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
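
The "failed finding central directory" message means PyTorch could not read the file as a zip archive, so the Python version may not be the cause. One possibility (an assumption, not confirmed in this thread) is that the checkpoint was downloaded incompletely or is a Git LFS pointer file, not the real weights. A minimal sketch for checking this without loading the model; `inspect_checkpoint` is a hypothetical helper:

```python
import os
import zipfile

# First line of a Git LFS pointer file.
LFS_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def inspect_checkpoint(path):
    """Report basic facts about a checkpoint file without loading it."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_PREFIX))
    return {
        "size_bytes": os.path.getsize(path),
        "is_zip_archive": zipfile.is_zipfile(path),
        "is_git_lfs_pointer": head == LFS_PREFIX,
    }
```

For example, `inspect_checkpoint("../stable-diffusion-v1-4/unet/diffusion_pytorch_model.bin")` returning a size of a few hundred bytes with `is_git_lfs_pointer` set to True would suggest that the weights need to be fetched again. A checkpoint saved in PyTorch's older non-zip format would also report `is_zip_archive` as False, so treat the result as a hint only.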

How about the performance on SDXL?

Will you add support for the SDXL model?

Code License

Dear author, first of all, thank you for your work.
Could you please mention under what license you have released this code?

Can you release the newly created evaluation dataset?

Can you release the newly created evaluation dataset?

Thanks a lot!

Results look the same as the baseline

Hello, I tried to compare the results by setting the scale factor to 0, but it seems like the results don't vary much from scale_factor=20.
Did I do something wrong?

run_sd_boxdiff.py --prompt "as the aurora lights up the sky, a herd of reindeer leisurely wanders on the grassy meadow, admiring the breathtaking view, a serene lake quietly reflects the magnificent display, and in the distance, a snow-capped mountain stands majestically, fantasy, 8k, highly detailed" --P 0.2 --L 1 --seeds [2] --token_indices [3,12,21,30,46] --bbox [[1,3,512,202],[75,344,421,495],[1,327,508,507],[2,217,507,341],[1,135,509,242]] --refine False

May I ask which GPU the model runs inference on? Thank you.

How did you compute the T2I-similarity for the evaluation?

First of all, thank you for your wonderful research.
How is the T2I similarity computed between two embedded vectors? Could I ask if it was computed using the cosine similarity between them?
Thank you.
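
For reference, this is what the cosine similarity between two embedding vectors computes. The sketch only illustrates the metric named in the question; it does not confirm which metric the authors used, and the function name is illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

The value ranges from -1 to 1 and depends only on the direction of the vectors, not on their length.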

What does argument "normalize_eot" imply?

Hi,

I've recently been working on adapting BoxDiff to the latest diffusers library, including integration for both SD and SDXL. I came across the argument normalize_eot here:

BoxDiff/pipeline/sd_pipeline_boxdiff.py

Lines 194 to 198 in 9e90000

    
    if normalize_eot:
        prompt = self.prompt
        if isinstance(self.prompt, list):
            prompt = self.prompt[0]
        last_idx = len(self.tokenizer(prompt)['input_ids']) - 1

It is set to True for SD2.1 and False for SD1.5. I'm not very familiar with the details of the different versions, so would you mind clarifying what the purpose of this argument is? Thank you in advance.
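
The quoted lines compute `last_idx` as the position of the final token id of the prompt. Below is a minimal sketch of what that index points at, assuming the tokenizer wraps the prompt in start and end markers, as CLIP-style tokenizers do. `toy_tokenize` is a hypothetical stand-in and not the pipeline's tokenizer; real tokenizers may split one word into several tokens. The sketch does not explain why the flag differs between SD versions.

```python
def toy_tokenize(prompt):
    """Stand-in tokenizer: one token per whitespace-separated word,
    wrapped in start/end markers."""
    return {"input_ids": ["<start>"] + prompt.lower().split() + ["<end>"]}


prompt = "A rabbit wearing sunglasses looks very proud"
input_ids = toy_tokenize(prompt)["input_ids"]
last_idx = len(input_ids) - 1  # same arithmetic as the quoted snippet

print(last_idx, input_ids[last_idx])
```

Under these assumptions `last_idx` is the index of the end-of-text marker, and indices 2 and 4 fall on "rabbit" and "sunglasses", which matches the `--token_indices [2,4]` used with the same prompt in another issue on this page.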
