
yuval-alaluf / attend-and-excite


Official Implementation for "Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models" (SIGGRAPH 2023)

Home Page: https://yuval-alaluf.github.io/Attend-and-Excite/

License: MIT License

Languages: Jupyter Notebook 98.84%, Python 1.16%
Topics: diffusion-models, stable-diffusion, text-to-image

attend-and-excite's People

Contributors

attendandexcite, eltociear, evinpinar, hila-chefer, hysts, jagilley, yuval-alaluf


attend-and-excite's Issues

WebUI?

Hey, I'm sure you're aware of Automatic1111's WebUI and I was wondering if there is any way to integrate this into it?
I'm not sure how you feel about that program, but this looks very useful and I'm sure the community would really love this piece of technology.

Also, a suggestion: if you ever implement this in a UI, a more intuitive way to select tokens would be to delimit the tokenized words.
Example: "an .elephant in a .crown" would yield the token indices [2, 5] by placing a full stop before each target word; a rough sketch of this parsing idea follows.
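A minimal sketch of that parsing idea (my own illustration, not part of the repo): strip the leading full stops and map each marked word to its position in the CLIP tokenization, offset by one for the start-of-text token.

    # Hypothetical helper: turn "an .elephant in a .crown" into a cleaned prompt
    # plus the token indices Attend-and-Excite expects.
    from transformers import CLIPTokenizer

    def parse_marked_prompt(marked_prompt: str, tokenizer: CLIPTokenizer):
        words = marked_prompt.split()
        clean_words = [w.lstrip(".") for w in words]
        marked = {i for i, w in enumerate(words) if w.startswith(".")}

        indices, position = [], 1  # position 0 is the <sot> token
        for i, word in enumerate(clean_words):
            n_subtokens = len(tokenizer.encode(word, add_special_tokens=False))
            if i in marked:
                indices.extend(range(position, position + n_subtokens))
            position += n_subtokens
        return " ".join(clean_words), indices

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    prompt, token_indices = parse_marked_prompt("an .elephant in a .crown", tokenizer)
    print(prompt, token_indices)  # "an elephant in a crown" [2, 5]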

Dreambooth and working with new tokens

Hi @AttendAndExcite, I am trying to use your method with a fine-tuned model, but it does not seem to work: the results with and without your method are identical, and the metric logs do not change from iteration to iteration.
The model works fine on the examples from your notebook; the losses change and so do the results.
The only problem I can see is that the tokenizer does not know about the new words introduced by the fine-tuned DreamBooth model. For example, I am using the token word "polevakatyaaa" and it is decomposed into ['pole', 'vak', 'at', 'yaa']. I select all of the indices belonging to that word, but it seems to have no effect. What can I do?
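For reference, a small sketch (my own, not from the repo) of how one might collect every sub-token index of a rare identifier so they can all be passed as indices_to_alter; the tokenizer checkpoint and the offset for the <sot> token are assumptions based on the issue above.

    # Locate all sub-token positions of a DreamBooth identifier inside the prompt.
    from transformers import CLIPTokenizer

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

    def subtoken_indices(prompt: str, word: str):
        word_ids = tokenizer.encode(word, add_special_tokens=False)
        prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
        for start in range(len(prompt_ids) - len(word_ids) + 1):
            if prompt_ids[start:start + len(word_ids)] == word_ids:
                # +1 skips the <sot> token that the full encoding prepends
                return [start + 1 + k for k in range(len(word_ids))]
        return []

    print(subtoken_indices("a photo of polevakatyaaa wearing a crown", "polevakatyaaa"))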

NotImplementedError: Module [ModuleList] is missing the required "forward" function

I am facing the following error while trying to run the provided notebooks. What could be the possible reasons for this?

NotImplementedError: Module [ModuleList] is missing the required "forward" function

Detailed traceback:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[7], line 5
      3 prompts = [prompt]
      4 controller = AttentionStore()
----> 5 image = run_and_display(prompts=prompts,
      6                         controller=controller,
      7                         indices_to_alter=token_indices,
      8                         generator=g,
      9                         run_standard_sd=True,
     10                         display_output=True)
     11 vis_utils.show_cross_attention(attention_store=controller,
     12                                prompt=prompt,
     13                                tokenizer=tokenizer,
   (...)
     16                                indices_to_alter=token_indices,
     17                                orig_image=image)

Cell In[4], line 19, in run_and_display(prompts, controller, indices_to_alter, generator, run_standard_sd, scale_factor, thresholds, max_iter_to_alter, display_output)
      5 def run_and_display(prompts: List[str],
      6                     controller: AttentionStore,
      7                     indices_to_alter: List[int],
   (...)
     12                     max_iter_to_alter: int = 25,
     13                     display_output: bool = False):
     14     config = RunConfig(prompt=prompts[0],
     15                        run_standard_sd=run_standard_sd,
     16                        scale_factor=scale_factor,
     17                        thresholds=thresholds,
     18                        max_iter_to_alter=max_iter_to_alter)
---> 19     image = run_on_prompt(model=stable,
     20                           prompt=prompts,
     21                           controller=controller,
     22                           token_indices=indices_to_alter,
     23                           seed=generator,
     24                           config=config)
     25     if display_output:
     26         display(image)

File ~/ptp_sd_exps/Attend-and-Excite/notebooks/../run.py:45, in run_on_prompt(prompt, model, controller, token_indices, seed, config)
     42     ptp_utils.register_attention_control(model, controller)
     44 print("Inside run_on_prompt function")
---> 45 outputs = model(prompt=prompt,
     46                 attention_store=controller,
     47                 indices_to_alter=token_indices,
     48                 attention_res=config.attention_res,
     49                 guidance_scale=config.guidance_scale,
     50                 generator=seed,
     51                 num_inference_steps=config.n_inference_steps,
     52                 max_iter_to_alter=config.max_iter_to_alter,
     53                 run_standard_sd=config.run_standard_sd,
     54                 thresholds=config.thresholds,
     55                 scale_factor=config.scale_factor,
     56                 scale_range=config.scale_range,
     57                 smooth_attentions=config.smooth_attentions,
     58                 sigma=config.sigma,
     59                 kernel_size=config.kernel_size)
     60 image = outputs.images[0]
     61 return image

File /opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File ~/ptp_sd_exps/Attend-and-Excite/notebooks/../pipeline_attend_and_excite.py:207, in AttendAndExcitePipeline.__call__(self, prompt, attention_store, indices_to_alter, attention_res, height, width, num_inference_steps, guidance_scale, eta, generator, latents, output_type, return_dict, max_iter_to_alter, run_standard_sd, thresholds, scale_factor, scale_range, smooth_attentions, sigma, kernel_size, **kwargs)
    205 # Forward pass of denoising with text conditioning
    206 print("calling unet inside __call__")
--> 207 noise_pred_text = self.unet(latents, t, encoder_hidden_states=text_embeddings[1].unsqueeze(0)).sample
    208 print("exiting unet")
    209 self.unet.zero_grad()

File /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.8/site-packages/diffusers/models/unet_2d_condition.py:381, in UNet2DConditionModel.forward(self, sample, timestep, encoder_hidden_states, class_labels, return_dict)
    379 for downsample_block in self.down_blocks:
    380     if hasattr(downsample_block, "has_cross_attention") and downsample_block.has_cross_attention:
--> 381         sample, res_samples = downsample_block(
    382             hidden_states=sample,
    383             temb=emb,
    384             encoder_hidden_states=encoder_hidden_states,
    385         )
    386     else:
    387         sample, res_samples = downsample_block(hidden_states=sample, temb=emb)

File /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.8/site-packages/diffusers/models/unet_2d_blocks.py:612, in CrossAttnDownBlock2D.forward(self, hidden_states, temb, encoder_hidden_states)
    610     else:
    611         hidden_states = resnet(hidden_states, temb)
--> 612         hidden_states = attn(hidden_states, encoder_hidden_states=encoder_hidden_states).sample
    614     output_states += (hidden_states,)
    616 if self.downsamplers is not None:

File /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.8/site-packages/diffusers/models/attention.py:217, in Transformer2DModel.forward(self, hidden_states, encoder_hidden_states, timestep, return_dict)
    215 # 2. Blocks
    216 for block in self.transformer_blocks:
--> 217     hidden_states = block(hidden_states, context=encoder_hidden_states, timestep=timestep)
    219 # 3. Output
    220 if self.is_input_continuous:

File /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.8/site-packages/diffusers/models/attention.py:495, in BasicTransformerBlock.forward(self, hidden_states, context, timestep)
    493     hidden_states = self.attn1(norm_hidden_states, context) + hidden_states
    494 else:
--> 495     hidden_states = self.attn1(norm_hidden_states) + hidden_states
    497 if self.attn2 is not None:
    498     # 2. Cross-Attention
    499     norm_hidden_states = (
    500         self.norm2(hidden_states, timestep) if self.use_ada_layer_norm else self.norm2(hidden_states)
    501     )

File /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/ptp_sd_exps/Attend-and-Excite/notebooks/../utils/ptp_utils.py:85, in register_attention_control.<locals>.ca_forward.<locals>.forward(x, context, mask)
     83 out = torch.einsum("b i j, b j d -> b i d", attn, v)
     84 out = self.reshape_batch_dim_to_heads(out)
---> 85 return self.to_out(out)

File /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:246, in _forward_unimplemented(self, *input)
    235 def _forward_unimplemented(self, *input: Any) -> None:
    236     r"""Defines the computation performed at every call.
    237 
    238     Should be overridden by all subclasses.
   (...)
    244         registered hooks while the latter silently ignores them.
    245     """
--> 246     raise NotImplementedError(f"Module [{type(self).__name__}] is missing the required \"forward\" function")

NotImplementedError: Module [ModuleList] is missing the required "forward" function
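For context (my reading of the stack trace, not an official statement): the hooked attention forward in utils/ptp_utils.py ends with return self.to_out(out), and in newer diffusers versions to_out is an nn.ModuleList (a Linear followed by a Dropout) with no forward of its own, hence this error. A hedged sketch of the usual workaround, meant to sit inside that same hooked forward (so torch and self come from the surrounding code):

    out = torch.einsum("b i j, b j d -> b i d", attn, v)
    out = self.reshape_batch_dim_to_heads(out)
    if isinstance(self.to_out, torch.nn.ModuleList):
        out = self.to_out[0](out)   # linear projection
        out = self.to_out[1](out)   # dropout
        return out
    return self.to_out(out)         # older diffusers: to_out is a single module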

Compatibility with latest Diffusers versions

Hello, thanks for your great work!

Currently, this repository is built off a Diffusers version from September 8th, 2022. This breaks compatibility with other implementations when building custom scripts. It would be much appreciated if this repository were upgraded to support the new CrossAttention modules in current versions of Diffusers.
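Not speaking for the maintainers, but as a stop-gap for newer diffusers: recent releases ship their own port of the method as StableDiffusionAttendAndExcitePipeline, built on the new attention-processor API. A rough usage sketch (the token_indices and max_iter_to_alter argument names are my reading of the upstream pipeline):

    import torch
    from diffusers import StableDiffusionAttendAndExcitePipeline

    pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")

    image = pipe(
        prompt="a cat and a frog",
        token_indices=[2, 5],          # "cat" and "frog"
        guidance_scale=7.5,
        num_inference_steps=50,
        max_iter_to_alter=25,
        generator=torch.Generator("cuda").manual_seed(0),
    ).images[0]
    image.save("cat_and_frog.png")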

Intuition behind choosing step size

Hi,

First of all, thanks for your excellent research work. It is quite innovative and motivated me to conduct further experiments on your codebase. I would love to ask about choosing a practical step size when updating the latent. How did you come up with your strategy for choosing the step size, i.e., a linear schedule that decreases from 1 to 0.5? What do you think would be a practical step size if an additional loss is attached to the existing Attend-and-Excite loss?

I tried adding an additional loss to mimic the behavior of my expected generation result, but it does not seem to converge very well. It would be great if we could discuss this further. Many thanks for your time and effort.

Best.
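For concreteness, my reading of the released code (treat the exact constants and the square root as assumptions about this particular version) is that the latent update at denoising step i uses a step size of scale_factor multiplied by a value linearly spaced from 1.0 down to 0.5:

    import numpy as np

    num_inference_steps = 50
    scale_factor = 20                                   # base step size from the config
    scale_range = np.linspace(1.0, 0.5, num_inference_steps)

    # latents <- latents - step_sizes[i] * grad(attend_and_excite_loss) at step i
    step_sizes = scale_factor * np.sqrt(scale_range)
    print(step_sizes[0], step_sizes[-1])                # 20.0 down to roughly 14.1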

KeyError: 'up' when generating images

I am trying to run the StableDiffusionAttendAndExcitePipeline and keep running into the following error:

Traceback (most recent call last):
  File "/home/ubuntu/imakeimages.py", line 31, in <module>
    result = pipe(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_attend_and_excite.py", line 876, in __call__
    max_attention_per_index = self._aggregate_and_get_max_attention_per_token(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_attend_and_excite.py", line 565, in _aggregate_and_get_max_attention_per_token
    attention_maps = self.attention_store.aggregate_attention(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_attend_and_excite.py", line 99, in aggregate_attention
    for item in attention_maps[location]:
KeyError: 'up'

Any idea what is going wrong? Happy to share more on my setup if needed

How about global properties?

Hi, thank you for your amazing work. I have a question about your research rather than the code.

The paper focuses on object-centric words such as "elephant" or "crown".
However, words indicating a global property, e.g., "at night", are also frequently omitted from the outputs of T2I diffusion models. So I wonder whether your method also works for words representing a global property.
If not, I would also like to hear your opinion on why that is.

Thank you.

KeyError: 'up_cross'

Hello, thanks for your work.

Recently I ran the Attend-and-Excite code and hit this bug; I don't know how to solve it:
"self.attention_store = {}"

I read the issues in prompt-to-prompt, and they say this problem is related to the diffusers version: the code in register_attention_control needs to be revised to adapt to the newer diffusers. But your code has already revised register_attention_control and AttendExciteCrossAttnProcessor, so I don't know how to solve this problem.

I am looking forward to your answer very much!
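Not an answer from the maintainers, but since the prompt-to-prompt threads point at the diffusers version, a first step is simply to confirm what is installed and pin it to whatever the repo's environment file specifies (I am assuming the mismatch is version-related, as those threads suggest):

    # Check the installed diffusers version before debugging further.
    import diffusers
    print(diffusers.__version__)
    # Then pin it, e.g.:  pip install diffusers==<version from the repo's environment file>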

How to save the attention maps

Hey, this is a very interesting work!
I tried to save the cross-attention maps by setting save_cross_attention_maps in config.py to True, but sadly no attention map was saved. Can you show how to save the attention maps?
Thanks!
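In case it helps while waiting for an answer, here is a minimal sketch (mine, not the repo's API) that aggregates the stored cross-attention maps after a run and writes one grayscale image per selected token. It assumes the controller, token_indices, and import setup from the notebooks, and the aggregate_attention signature visible in utils/ptp_utils.py:

    import numpy as np
    from PIL import Image
    from utils import ptp_utils

    # Average cross-attention maps stored during generation; roughly (res, res, 77).
    attention_maps = ptp_utils.aggregate_attention(
        controller, 16, ("up", "down", "mid"), True, 0)

    for idx in token_indices:  # e.g. [2, 5]
        heatmap = attention_maps[:, :, idx].detach().cpu().numpy()
        heatmap = (255 * heatmap / heatmap.max()).astype(np.uint8)
        Image.fromarray(heatmap).resize((256, 256)).save(f"attention_token_{idx}.png")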

About your paper.

You write in Section 4, "The pre-trained CLIP text encoder prepends a specialized token 〈sot〉 to P indicating the start of the text. During the text encoding process, this token receives global information about the prompt. This leads to 〈sot〉 obtaining a high probability in the token distribution defined in At."

However, due to the causal mask used in CLIP training, the 〈sot〉 token should not receive any information from the rest of the prompt. Why does this token receive global information about the prompt?

About stable-diffusion-2-1

Hi,

I applied your method to stable-diffusion-1.5 and it works.
But when I load the pre-trained stable-diffusion-2-1 model, I get the error below; it seems the pipeline does not work.

Seed: 0
0%| | 0/50 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/Attend-and-Excite-diffusers/run.py", line 90, in <module>
    main()
  File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/Attend-and-Excite-diffusers/run.py", line 73, in main
    image = run_on_prompt(prompt=config.prompt,
  File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/Attend-and-Excite-diffusers/run.py", line 44, in run_on_prompt
    outputs = model(prompt=prompt,
  File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/Attend-and-Excite-diffusers/pipeline_attend_and_excite.py", line 506, in __call__
    max_attention_per_index = self._aggregate_and_get_max_attention_per_token(
  File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/Attend-and-Excite-diffusers/pipeline_attend_and_excite.py", line 224, in _aggregate_and_get_max_attention_per_token
    attention_maps = aggregate_attention(
  File "/home/ubuntu/12T_1/szs/code/stable_diffusion/stable-diffusion-webui/Attend-and-Excite-diffusers/utils/ptp_utils.py", line 232, in aggregate_attention
    out = torch.cat(out, dim=0)
RuntimeError: torch.cat(): expected a non-empty list of Tensors
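A speculative guess rather than a confirmed diagnosis: attention_res=16 corresponds to 512x512 generation (64x64 latents), and with the 768x768 stable-diffusion-2-1 checkpoint no 16x16 cross-attention maps are ever produced, which would leave aggregate_attention with an empty list. One thing to try (the RunConfig field name and the scaling rule are assumptions on my part):

    from config import RunConfig

    # 768x768 images give 96x96 latents, so the coarsest cross-attention maps are 24x24.
    height = width = 768
    config = RunConfig(
        prompt="a cat and a frog",
        attention_res=height // 32,   # 24 for 768x768, 16 for 512x512
    )
    # height/width also need to be passed through to the pipeline call.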

Latent not updated on first iteration of iterative refinement

Hi,

Very interesting work!
While going through the codebase, I noticed that when performing iterative refinement, the latents are passed through the UNet twice without an update in between. Is this the expected behavior?

Also, this is nitpicking, but the paper states "We set the iterations to 𝑡1 = 0, 𝑡2 = 10, and 𝑡3 = 20"; shouldn't this be 𝑡1 = 50, 𝑡2 = 40, and 𝑡3 = 30 if we follow the notation used in the rest of the paper?

Best,

Fine tuning with new subjects

Hi,

Congratulations on the project, this work is truly exciting!

I have a question about fine-tuning: is it possible to fine-tune your model with my own images and prompts? Do you have a training script I can use (similar to what DreamBooth does)?

Another question: do you have a variant of this for the image-to-image setting as well (Stable Diffusion)?

About the Memory usage

Very interesting work! However, I noticed that generating 512-resolution images requires about 16 GB of memory, and generating 1024-resolution images exceeds 80 GB, so it cannot run even on an A100. I checked the pipeline code and found that calling the UNet multiple times in the denoising loop causes a lot of the memory consumption. Can this be further optimized?
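Not an answer from the authors, but for reference, diffusers pipelines expose generic memory levers such as attention slicing and xformers attention. A caveat (my own): register_attention_control replaces the attention forward in this repo, so these options may not apply to the hooked layers or may bypass the attention store; it is worth verifying that the Attend-and-Excite loss still changes across iterations after enabling them.

    # Generic diffusers memory reducers (availability depends on the diffusers version;
    # `stable` is the pipeline object used in the notebooks).
    stable.enable_attention_slicing()                        # compute attention in slices
    # stable.enable_xformers_memory_efficient_attention()    # optional, needs xformers installed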

Run on 8GB GPU?

Hi,

The results in the paper look promising and I'd like to try out your work on my system. However, I can't get the example to run on my 8GB GPU:

CUDA memory: 5630853120
Seed: 42
  0%|                                                                                                                                                                                | 0/51 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run.py", line 90, in <module>
    main()
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "run.py", line 73, in main
    image = run_on_prompt(prompt=config.prompt,
  File "run.py", line 44, in run_on_prompt
    outputs = model(prompt=prompt,
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/cs673/cs673/Attend-and-Excite/pipeline_attend_and_excite.py", line 205, in __call__
    noise_pred_text = self.unet(latents, t, encoder_hidden_states=text_embeddings[1].unsqueeze(0)).sample
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/diffusers/models/unet_2d_condition.py", line 234, in forward
    sample, res_samples = downsample_block(
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/diffusers/models/unet_blocks.py", line 537, in forward
    hidden_states = attn(hidden_states, context=encoder_hidden_states)
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/diffusers/models/attention.py", line 148, in forward
    x = block(x, context=context)
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/diffusers/models/attention.py", line 197, in forward
    x = self.attn1(self.norm1(x)) + x
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cs673/cs673/Attend-and-Excite/utils/ptp_utils.py", line 71, in forward
    sim = torch.einsum("b i d, b j d -> b i j", q, k) * self.scale
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/torch/functional.py", line 360, in einsum
    return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 7.79 GiB total capacity; 5.89 GiB already allocated; 386.25 MiB free; 6.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I've set max_split_size_mb to 128 MB and it seems that setting isn't being respected.

I've also tried setting torch_dtype=torch.float16 when loading the model. This doesn't have the out-of-memory error and instead gives:

CUDA memory: 2824863744
Seed: 42
  0%|                                                                                                                                                                                | 0/51 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run.py", line 90, in <module>
    main()
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "run.py", line 73, in main
    image = run_on_prompt(prompt=config.prompt,
  File "run.py", line 44, in run_on_prompt
    outputs = model(prompt=prompt,
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/cs673/cs673/Attend-and-Excite/pipeline_attend_and_excite.py", line 205, in __call__
    noise_pred_text = self.unet(latents, t, encoder_hidden_states=text_embeddings[1].unsqueeze(0)).sample
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/diffusers/models/unet_2d_condition.py", line 225, in forward
    emb = self.time_embedding(t_emb)
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/diffusers/models/embeddings.py", line 73, in forward
    sample = self.linear_1(sample)
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cs673/cs673/attend/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: expected scalar type Half but found Float

Is it possible to get this running on an 8GB GPU?
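One observation, hedged since I have not reproduced this exact setup: the pinned September 2022 diffusers builds its timestep embedding and initial latents in float32 regardless of the weight dtype, so loading with torch_dtype=torch.float16 produces exactly this Half/Float mismatch. In practice that means half precision probably needs either explicit casts or a newer diffusers (e.g. its StableDiffusionAttendAndExcitePipeline port). A quick way to see the mismatch (the stable variable name follows the notebooks):

    # Spot-check dtypes after loading the model with torch_dtype=torch.float16.
    import torch

    print("unet weights:", next(stable.unet.parameters()).dtype)          # torch.float16
    print("text encoder:", next(stable.text_encoder.parameters()).dtype)  # torch.float16
    # The pinned pipeline creates latents without a dtype argument, so they default to
    # float32, and the same happens to the timestep embedding inside the UNet forward.
    print("default latents:", torch.randn(1, 4, 64, 64).dtype)            # torch.float32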

Licensing?

Can you tell me what licensing is this repository under?

Memory Requirements

I have been running the notebooks on a 16 GB T4. In some cases it runs well, but in other cases (typically ones where the loss values do not seem to decrease much over the iterations) it throws a CUDA out-of-memory error partway through the iterations. Why does it work well for the same number of iterations with certain prompts but not with others?

For instance, it works fine for a prompt like "a cow and dog standing with a cat". But for prompts like "a boy cutting a birthday cake wearing a hat" and "an elephant with a crown standing next to a man", it throws memory errors within around 10 iterations. (The italics mark the subject words.)

Am I missing something? What could be the likely reasons for behavior like this?

ComfyUI

Do you plan to support ComfyUI?

Explainability

Can the explain notebook work for real images that weren't created by Attend and Excite or Stable Diffusion?

I tried:

image = Image.open("turtle.jpg")
vis_utils.show_cross_attention(attention_store=controller,
                               prompt=prompt,
                               tokenizer=tokenizer,
                               res=16,
                               from_where=("up", "down", "mid"),
                               indices_to_alter=token_indices,
                               orig_image=image)

And got error:

  in <cell line: 5>:5

  /content/Attend-and-Excite/utils/vis_utils.py:22 in show_cross_attention

      19                        orig_image=None):
      20     tokens = tokenizer.encode(prompt)
      21     decoder = tokenizer.decode
  ❱   22     attention_maps = aggregate_attention(attention_store, res, from_where, True, select)
      23     images = []
      24
      25     # show spatial attention for indices of tokens to strengthen

  /content/Attend-and-Excite/utils/ptp_utils.py:228 in aggregate_attention

     225     attention_maps = attention_store.get_average_attention()
     226     num_pixels = res ** 2
     227     for location in from_where:
  ❱  228         for item in attention_maps[f"{location}_{'cross' if is_cross else 'self'}"]:
     229             if item.shape[1] == num_pixels:
     230                 cross_maps = item.reshape(1, -1, res, res, item.shape[-1])[select]
     231                 out.append(cross_maps)

KeyError: 'up_cross'

Evaluation pipeline

Hi,
Thanks for the interesting work. I'm wondering if you could please share the code for your evaluation, e.g., the different similarity evaluations. Thanks a lot!

KeyError: 0

I get this error when inferring.

Iteration 0 | Loss: 0.2740
Iteration 1 | Loss: 0.0576
Iteration 2 | Loss: 0.1228
Iteration 3 | Loss: 0.0293
Iteration 4 | Loss: 0.0415
Iteration 5 | Loss: 0.0324
Iteration 6 | Loss: 0.0325
Iteration 7 | Loss: 0.0340
Iteration 8 | Loss: 0.0391
Iteration 9 | Loss: 0.0454
Iteration 10 | Loss: 0.0406
Iteration 11 | Loss: 0.0393
Iteration 12 | Loss: 0.0386
Iteration 13 | Loss: 0.0374
Iteration 14 | Loss: 0.0359
Iteration 15 | Loss: 0.0335
Iteration 16 | Loss: 0.0307
Iteration 17 | Loss: 0.0289
Iteration 18 | Loss: 0.0283
Iteration 19 | Loss: 0.0278
Iteration 20 | Loss: 0.0281
Iteration 21 | Loss: 0.0292
Iteration 22 | Loss: 0.0311
Iteration 23 | Loss: 0.0328
Iteration 24 | Loss: 0.0339

KeyError                                  Traceback (most recent call last)
in
----> 1 generate_images_for_method(
      2     prompt="a cat and a frog",
      3     seeds=[6141],
      4     indices_to_alter=[2,5],
      5     is_attend_and_excite=True

8 frames
in generate_images_for_method(prompt, seeds, indices_to_alter, is_attend_and_excite)
     10     controller = AttentionStore()
     11     run_standard_sd = False if is_attend_and_excite else True
---> 12     image = run_and_display(prompts=prompts,
     13                             controller=controller,
     14                             indices_to_alter=token_indices,

in run_and_display(prompts, controller, indices_to_alter, generator, run_standard_sd, scale_factor, thresholds, max_iter_to_alter, display_output)
     17                        thresholds=thresholds,
     18                        max_iter_to_alter=max_iter_to_alter)
---> 19     image = run_on_prompt(model=stable,
     20                           prompt=prompts,
     21                           controller=controller,

/content/excite/run.py in run_on_prompt(prompt, model, controller, token_indices, seed, config)
     41     if controller is not None:
     42         ptp_utils.register_attention_control(model, controller)
---> 43     outputs = model(prompt=prompt,
     44                     attention_store=controller,
     45                     indices_to_alter=token_indices,

/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     25     def decorate_context(*args, **kwargs):
     26         with self.clone():
---> 27             return func(*args, **kwargs)
     28     return cast(F, decorate_context)
     29

/content/excite/pipeline_attend_and_excite.py in __call__(self, prompt, attention_store, indices_to_alter, attention_res, height, width, num_inference_steps, guidance_scale, eta, generator, latents, output_type, return_dict, max_iter_to_alter, run_standard_sd, thresholds, scale_factor, scale_range, smooth_attentions, sigma, kernel_size, **kwargs)
    259         latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs).prev_sample
    260
--> 261     outputs = self._prepare_output(latents, output_type, return_dict)
    262     return outputs

/content/excite/pipeline_stable_diffusion.py in _prepare_output(self, latents, output_type, return_dict)
    160
    161     # run safety checker
--> 162     safety_cheker_input = self.feature_extractor(self.numpy_to_pil(image), return_tensors="pt").to(self.device)
    163     image, has_nsfw_concept = self.safety_checker(images=image, clip_input=safety_cheker_input.pixel_values)
    164

/usr/local/lib/python3.8/dist-packages/transformers/models/clip/feature_extraction_clip.py in __call__(self, images, return_tensors, **kwargs)
    150     images = [self.convert_rgb(image) for image in images]
    151     if self.do_resize and self.size is not None and self.resample is not None:
--> 152         images = [
    153             self.resize(image=image, size=self.size, resample=self.resample, default_to_square=False)
    154             for image in images

/usr/local/lib/python3.8/dist-packages/transformers/models/clip/feature_extraction_clip.py in <listcomp>(.0)
    151     if self.do_resize and self.size is not None and self.resample is not None:
    152         images = [
--> 153             self.resize(image=image, size=self.size, resample=self.resample, default_to_square=False)
    154             for image in images
    155         ]

/usr/local/lib/python3.8/dist-packages/transformers/image_utils.py in resize(self, image, size, resample, default_to_square, max_size)
    282     # specified size only for the smallest edge
    283     short, long = (width, height) if width <= height else (height, width)
--> 284     requested_new_short = size if isinstance(size, int) else size[0]
    285
    286     if short == requested_new_short:

KeyError: 0
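A guess at the cause rather than a verified fix: KeyError: 0 at size[0] suggests the CLIP feature extractor's size is a dict (as in newer transformers releases) while this older code path expects an int or a tuple. Checking, and if needed pinning, the transformers version is one way to test that hypothesis (the stable variable name follows the notebooks):

    # Inspect the feature extractor configuration that the safety checker uses.
    import transformers
    print(transformers.__version__)
    print(stable.feature_extractor.size)   # an int in older releases, a dict in newer ones
    # If it is a dict, try:  pip install transformers==<version from the repo's environment file>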
