
gligen's Introduction

GLIGEN: Open-Set Grounded Text-to-Image Generation (CVPR 2023)

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li*, Yong Jae Lee* (*Co-senior authors)

[Project Page] [Paper] [Demo] [YouTube Video]

  • Go beyond the text prompt with GLIGEN: it adds new capabilities to frozen text-to-image generation models, grounding generation on various inputs, including boxes, keypoints, and reference images.
  • GLIGEN’s zero-shot performance on COCO and LVIS outperforms that of existing supervised layout-to-image baselines by a large margin.

🔥 News

  • [2023.11.2] GLIGEN is integrated into LLaVA-Interactive: an all-in-one demo for Image Chat, Segmentation, Generation and Editing. Experience the future of interactive image editing with visual chat. [Project Page] [Demo] [Code] [Paper]

  • [2023.04.18] We have updated our arXiv paper. We explain the difference between GLIGEN and ControlNet here to help researchers better understand the two approaches.

  • [2023.04.08] GLIGEN is combined with Grounding DINO, which frees users from annotating bounding boxes and their concepts. Given a language prompt, Grounding DINO localizes the concepts with boxes: image $\rightarrow$ (box, concept); then GLIGEN inpaints the image: (box, concept) $\rightarrow$ image.

  • [2023.03.22] Our fork of diffusers with support for box- and text-conditioned generation and inpainting is released. It is now faster, more flexible, and automatically downloads and loads models from the Hugging Face Hub. Try it out!
  • [2023.03.20] Stay up-to-date on the line of research on grounded image generation such as GLIGEN, by checking out Computer Vision in the Wild (CVinW) Reading List.
  • [2023.03.19] GLIGEN is covered by Yannic Kilcher in his YouTube video on the biggest week in AI.
  • [2023.03.05] Gradio demo code is released at GLIGEN/demo.
  • [2023.03.03] Code base and checkpoints are released.
  • [2023.02.28] Paper is accepted to CVPR 2023.
  • [2023.01.17] GLIGEN paper and demo are released.

Requirements

We provide a Dockerfile to set up the environment.
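
For reference, a typical way to use it is to build the image and start a GPU container; the image tag and mount path below are arbitrary placeholders, not names defined by the repo:

docker build -t gligen .
docker run --gpus all -it --rm -v $(pwd):/workspace/GLIGEN gligen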

Download GLIGEN models

We provide ten checkpoints for different usage scenarios. All models here are based on SD v1.4.

Mode         Modality          Download
Generation   Box+Text          HF Hub
Generation   Box+Text+Image    HF Hub
Generation   Keypoint          HF Hub
Inpainting   Box+Text          HF Hub
Inpainting   Box+Text+Image    HF Hub
Generation   Hed map           HF Hub
Generation   Canny map         HF Hub
Generation   Depth map         HF Hub
Generation   Semantic map      HF Hub
Generation   Normal map        HF Hub

Note that the provided semantic-map checkpoint is trained only on the ADE20K dataset, and the normal-map checkpoint only on the DIODE dataset.
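
If one prefers to fetch a checkpoint programmatically, the snippet below is a minimal sketch using huggingface_hub; the repo_id and filename are placeholders, so replace them with the actual HF Hub links from the table above.

# Hedged sketch: download one checkpoint into gligen_checkpoints/.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="gligen/<checkpoint-repo>",        # placeholder: use the HF Hub repo from the table
    filename="diffusion_pytorch_model.bin",    # placeholder: use the actual file name in that repo
    local_dir="gligen_checkpoints",
)
print(path)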

Inference: Generate images with GLIGEN

We provide a script to generate images using the provided checkpoints. First download the models and put them in gligen_checkpoints, then run

python gligen_inference.py

Example samples for each checkpoint will be saved in generation_samples. See gligen_inference.py for more details about the interface; a rough sketch of one sample entry is shown below.
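
For orientation, each sample in gligen_inference.py is described by a meta dictionary roughly like the hedged sketch below; the keys mirror the script's examples, while the concrete values here are placeholders.

# Hedged sketch of one meta entry consumed by gligen_inference.py.
meta = dict(
    ckpt="gligen_checkpoints/checkpoint_generation_text.pth",   # path to a downloaded checkpoint
    prompt="a teddy bear sitting next to a bird",
    phrases=["a teddy bear", "a bird"],
    locations=[[0.0, 0.1, 0.3, 0.8], [0.6, 0.1, 1.0, 0.8]],     # normalized xyxy boxes, one per phrase
    alpha_type=[0.3, 0.0, 0.7],                                 # schedule for applying grounding tokens
    save_folder_name="generation_box_text",
)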

Training

Grounded generation training

One first needs to prepare data for the different grounding modality conditions; refer to DATA for the data we used for the different GLIGEN models. Once the data is ready, the following command is used to train GLIGEN (multi-GPU training is supported):

python main.py --name=your_experiment_name  --yaml_file=path_to_your_yaml_config

--yaml_file is the most important argument; below we use one example to explain its key components so that one can become familiar with our code and customize training for new grounding modalities. The other arguments are self-explanatory from their names. The experiment will be saved in OUTPUT_ROOT/name.

One can refer to configs/flicker_text.yaml as an example. This yaml is defined by the following components: diffusion, model, autoencoder, text_encoder, train_dataset_names and grounding_tokenizer_input. Typically, diffusion, autoencoder and text_encoder should not be changed, as they are defined by Stable Diffusion. One should pay attention to the following:

  • Within model we add a new argument, grounding_tokenizer, which defines a network producing grounding tokens. This network will be instantiated inside the model. One can refer to ldm/modules/diffusionmodules/grounding_net_example.py for more details about defining this network (a minimal sketch follows this list).
  • grounding_tokenizer_input defines a class that takes batch data from the dataloader and produces the input for the grounding_tokenizer. In other words, it is an intermediate class between the dataloader and the grounding_tokenizer. One can refer to grounding_input/__init__.py for details about defining this class.
  • train_dataset_names should list a series of dataset names (all datasets are concatenated internally, which makes it easy to combine datasets for training). Each dataset name must first be registered in dataset/catalog.py. We have listed all the datasets we used; if one needs to train GLIGEN on a new modality dataset, please remember to register its name there first.
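
To make the grounding_tokenizer idea concrete, here is a minimal, hypothetical sketch in the spirit of ldm/modules/diffusionmodules/grounding_net_example.py. The class name, dimensions, and forward signature are illustrative assumptions; the repo file defines the actual interface.

import torch
import torch.nn as nn

class SimpleBoxTextNet(nn.Module):
    # Maps (normalized xyxy box, per-phrase text embedding) pairs to grounding tokens.
    def __init__(self, text_dim=768, out_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 + text_dim, 512),
            nn.SiLU(),
            nn.Linear(512, out_dim),
        )
        # Learned token used for padded (inactive) box slots.
        self.null_token = nn.Parameter(torch.zeros(out_dim))

    def forward(self, boxes, masks, text_embeddings):
        # boxes: (B, N, 4) in [0, 1]; masks: (B, N), 1 for real boxes, 0 for padding;
        # text_embeddings: (B, N, text_dim) text features for each grounded phrase.
        tokens = self.mlp(torch.cat([boxes, text_embeddings], dim=-1))
        masks = masks.float().unsqueeze(-1)
        return tokens * masks + self.null_token * (1 - masks)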

Grounded inpainting training

GLIGEN also supports inpainting training. The following command can be used:

python main.py --name=your_experiment_name  --yaml_file=path_to_your_yaml_config --inpaint_mode=True  --ckpt=path_to_an_adapted_model

Typically, we first train GLIGEN on a generation task (e.g., text-grounded generation), whose input conv has 4 channels (the latent space of Stable Diffusion); we then modify the saved checkpoint to 9 channels, with the additional 5 channels initialized to 0. Continuing training from this adapted checkpoint leads to faster convergence and better results. path_to_an_adapted_model refers to this modified checkpoint; convert_ckpt.py can be used to modify the checkpoint (a hedged sketch of the idea is shown below). NOTE: the yaml file is the same for generation and inpainting training; one only needs to add --inpaint_mode.
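
The actual logic lives in convert_ckpt.py; the snippet below is only a hedged sketch of the channel expansion. The checkpoint layout (a "model" entry holding the UNet state dict) and the key name of the first convolution are assumptions and may differ from the real files.

import torch

# Hypothetical adaptation of a 4-channel generation checkpoint for 9-channel inpainting input.
ckpt = torch.load("generation_checkpoint.pth", map_location="cpu")
state = ckpt["model"]                                   # assumption: UNet weights live here

key = "input_blocks.0.0.weight"                         # assumption: first UNet conv, shape (C_out, 4, 3, 3)
old_w = state[key]
new_w = torch.zeros(old_w.shape[0], 9, *old_w.shape[2:], dtype=old_w.dtype)
new_w[:, :4] = old_w                                    # keep pretrained weights for the 4 latent channels
state[key] = new_w                                      # the extra 5 channels (masked latent + mask) start at zero

torch.save(ckpt, "adapted_for_inpainting.pth")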

Citation

@article{li2023gligen,
  title={GLIGEN: Open-Set Grounded Text-to-Image Generation},
  author={Li, Yuheng and Liu, Haotian and Wu, Qingyang and Mu, Fangzhou and Yang, Jianwei and Gao, Jianfeng and Li, Chunyuan and Lee, Yong Jae},
  journal={CVPR},
  year={2023}
}

Disclaimer

The original GLIGEN was partly implemented during a part-time internship at Microsoft while the first author was working at The University of Wisconsin-Madison. This repo re-implements GLIGEN in PyTorch with university GPUs. Despite the minor implementation differences, this repo aims to reproduce the results and observations in the paper for research purposes.

Terms and Conditions

We have strict terms and conditions for using the model checkpoints and the demo; it is restricted to uses that follow the license agreement of Latent Diffusion Model and Stable Diffusion.

Broader Impact

It is important to note that our model GLIGEN is designed for open-world grounded text-to-image generation with caption and various condition inputs (e.g. bounding box). However, we also recognize the importance of responsible AI considerations and the need to clearly communicate the capabilities and limitations of our research. While the grounding ability generalizes well to novel spatial configuration and concepts, our model may not perform well in scenarios that are out of scope or beyond the intended use case. We strongly discourage the misuse of our model in scenarios, where our technology could be used to generate misleading or malicious images. We also acknowledge the potential biases that may be present in the data used to train our model, and the need for ongoing evaluation and improvement to address these concerns. To ensure transparency and accountability, we have included a model card that describes the intended use cases, limitations, and potential biases of our model. We encourage users to refer to this model card and exercise caution when applying our technology in new contexts. We hope that our work will inspire further research and discussion on the ethical implications of AI and the importance of transparency and accountability in the development of new technologies.

gligen's People

Contributors

chunyuanli, galdude33, haotian-liu, stared, yuheng-li


gligen's Issues

GLIGEN with Canny Grounding

Thanks for your work!
I am wondering if there is a way to convert canny grounding model in huggingface to ckpt format.
I was trying to convert it with the code below.

https://github.com/huggingface/diffusers/blob/main/scripts/convert_diffusers_to_original_stable_diffusion.py

But this requires a unet folder inside, with its own diffusion_pytorch_model.bin file:
FileNotFoundError: [Errno 2] No such file or directory: './gligen_checkpoints/diffusion_pytorch_model.bin\unet\diffusion_pytorch_model.bin'

More of a thought than an issue

If I'm understanding correctly, you're suggesting training a new model with an added layer or conditional NN on top of a pretrained ancestor model. What I'm wondering is: why use the pretrained model? If you're training a model anyway with new or the same data, why not start fresh with just a bounding-box layer?

Another thing that I was wondering is do you think it may be possible to have a layer in which you give almost explicit general rules for things, such as humans have only five fingers, or similar. I had considered doing this but instead of using bounding boxes, I'd have explicitly stated rules such as pay attention to nouns and adjectives, or follow sentence structure to determine the directive of the prompt. I even wrote up a list of the types of rules that may be applied in such a type of model.

How many bounding boxes does it generally use? For example could you get it to put bounding boxes around pretty much everything down to individual fingers or eyes?

Or is that the reason for the pretrained model, that it has knowledge of what certain things are and it can then label the bounding boxes instead of having to manually do it with millions of images. If that's the case it'd be interesting if it could be applied using a similar technique to create a service that annotates images for any dataset to augment training other people's models.

This is really fascinating work though, I'm excited to see where it can go, and thanks for letting me read about it with the paper and rant a little.

Release of Code

So impressed by your work!
When do you think the code for this project will be released?
Thank you!

COCO-based training weights

Thanks for sharing your work, which is very helpful and interesting. I wanted to ask if you could share the COCO-trained weights without any of the large-scale training on the bigger datasets. I see you discuss the COCO-trained results in the paper, but I could not find them in the GitHub repository.

Multi-GPU training error

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.

Parameter at index 925 with name output_blocks.11.0.skip_connection.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 34718) of binary: /opt/conda/bin/python

License

Hi, thanks for the great work. Could you please provide the license for this repo? Thanks a lot!

Training Dataset of HF Hub Checkpoint vs. Paper Models

Hello, the paper mentions several models, one trained on COCO, another on LVIS, a third on GoldG, O365, SBU and CC3M. As far as I understand, without retraining the model, you can download one of the ten checkpoints to use with the gligen_inference.py script.

My question is: on which dataset were these checkpoints trained? In particular, for "Box+Text+Image" modality with Generation and Inpainting mode.

Bbox format

Hello, I need to ask: in what format are the locations in gligen_inference.py? The generated images are completely outside the bounding box, so I'm wondering what the format could be. Here is the code snippet, thanks in advance :)
def gen_location():
    min_size = 0.075

    x1 = round(random.uniform(0, 1 - min_size), 8)
    x2 = round(random.uniform(x1 + min_size, 1), 8)

    y1 = round(random.uniform(0, 1 - min_size), 8)
    y2 = round(random.uniform(y1 + min_size, 1), 8)

    return [x1, y1, x2, y2]

if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument("--folder", type=str,  default="generation_samples_2", help="root folder for >
    parser.add_argument("--batch_size", type=int, default=1, help="")
    parser.add_argument("--no_plms", action='store_true', help="use DDIM instead. WARNING: I did not >
    parser.add_argument("--guidance_scale", type=float,  default=7.5, help="")
    parser.add_argument("--negative_prompt", type=str,  default='longbody, lowres, bad anatomy, bad h>
    #parser.add_argument("--negative_prompt", type=str,  default=None, help="")
    args = parser.parse_args()

    phrases = load_phrases('DATA/phrases.txt')
    dict_list = []

    for phrase in phrases:
        random_locations = gen_location()  # Generate random locations
        x = dict(
            ckpt="/home/paperspace/GLIGEN/gligen_checkpoints/diffusion_pytorch_model.bin",
            prompt=phrase,
            locations= random_locations,
            phrases = ["strawberry"],
            alpha_type=[0.3, 0.0, 0.7],
            save_folder_name="generation_box_text_v3"
        )
        dict_list.append(x)
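
For reference, the sample meta dictionaries in gligen_inference.py pair normalized [x1, y1, x2, y2] boxes (values in [0, 1], relative to image width and height) with phrases, and locations is a list of such boxes with one entry per phrase. The snippet below is a hedged illustration of that pairing, not an official specification; the box values are placeholders.

# Hedged example: one normalized xyxy box per phrase, passed as a list of lists.
phrases = ["strawberry"]
locations = [[0.25, 0.30, 0.60, 0.75]]   # list of boxes, same length as phrases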

Release of code

Hello! Very impressive paper and work.

We are planning to use this in a master thesis. Are you planning on releasing the code soon?

Kind regards

Support For Diffusers?

Thank you very much for your awesome work! Do you guys have any plans to integrate the gatedSelfAttention & PositionNet into diffusers? I am working on it recently and am willing to help.

Problems on trainer, the unexpected_keys

Since you use the

if level != len(channel_mult) - 1: # will not go to this downsample branch in the last feature

in the UNet, when initializing the trainer, the missing Downsample module will be reported as an unexpected key.

missing_keys, unexpected_keys = self.model.load_state_dict( state_dict["model"], strict=False  )
assert unexpected_keys == []

This assertion will then fail.
Please correct me if my understanding is wrong.

Error in parsed sketch

I deployed the demo on my own server, but in the inpainting task the parsed sketch keeps loading forever.

Great Work!!! Few queries regarding the dataset preparation

When I try to merge the TSV files for Flickr, I get the following error:

File "/scratch/project_462000189/anwer/mn/GLIGEN/tsv_split_merge.py", line 324, in
merge(args.merge_in_folder, args.merge_out_folder)
File "/scratch/project_462000189/anwer/mn/GLIGEN/tsv_split_merge.py", line 288, in merge
for idx in range(len(reader)):
File "/scratch/project_462000189/anwer/mn/GLIGEN/tsv_split_merge.py", line 137, in len
return self.num_rows()
File "/scratch/project_462000189/anwer/mn/GLIGEN/tsv_split_merge.py", line 102, in num_rows
self._ensure_lineidx_loaded()
File "/scratch/project_462000189/anwer/mn/GLIGEN/tsv_split_merge.py", line 145, in _ensure_lineidx_loaded
self._lineidx = [int(line) for line in lines]
File "/scratch/project_462000189/anwer/mn/GLIGEN/tsv_split_merge.py", line 145, in
self._lineidx = [int(line) for line in lines]
ValueError: invalid literal for int() with base 10: '8760\t{"data_id": 8760, "image": "/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQgJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyM

Can you please help with this?

Data generation at different image sizes

Does GLIGEN support data generation at different image sizes, like 640*640 / 640*384?

I tried to generate data with checkpoint_generation_text.pth at 640*384, but the results cannot match.


I'm working on this issue and would like to help. :-D

Thanks!

Parsed Sketch Pad field does not parse the Sketch Pad

On this page https://huggingface.co/spaces/gligen/demo
even before I give any Grounding Instructions, the Parsed Sketch Pad already has a bounding box marked (almost the entire area, with some margin, called Obj. 1).

And when I try to give some Grounding instructions, it names the frame after the first object.
Drawing things in the Sketch Pad will not change the Parsed Sketch Pad at all, no matter how many lines I draw.


Thus I can only generate images with this demo if I ask for a single grounding instruction and am happy with the default bounding box. Otherwise it gives me an error, as it doesn't register multiple sketch instructions.

Data for training

Great work!

Can you release the DATA section that describes how the data for training was prepared?

Thanks.

About the AutoEncoder

Dear authors,

Thank you for your excellent work, which inspires me a lot.

I'm wondering why you chose AutoencoderKL instead of a VQ model. Would using a VQ model make any difference?

Custom dataset

Hello, this work is awesome. Thanks for sharing!
Do you have some instructions for fine-tuning on own dataset?

FileNotFoundError: [Errno 2] No such file or directory: 'gligen_checkpoints/checkpoint_generation_text.pth'

Thanks for your great work.
I downloaded the model diffusion_pytorch_model.bin from the HF Hub.
I unzipped it and placed it in the gligen_checkpoints directory, but this error appeared.

Traceback (most recent call last):
  File "gligen_inference.py", line 434, in <module>
    run(meta, args, starting_noise)
  File "/home/lambdasix/anaconda3/envs/GLIGEN/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "gligen_inference.py", line 219, in run
    model, autoencoder, text_encoder, diffusion, config = load_ckpt(meta["ckpt"])
  File "gligen_inference.py", line 70, in load_ckpt
    saved_ckpt = torch.load(ckpt_path)
  File "/home/lambdasix/anaconda3/envs/GLIGEN/lib/python3.8/site-packages/torch/serialization.py", line 771, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/lambdasix/anaconda3/envs/GLIGEN/lib/python3.8/site-packages/torch/serialization.py", line 270, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/lambdasix/anaconda3/envs/GLIGEN/lib/python3.8/site-packages/torch/serialization.py", line 251, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'gligen_checkpoints/checkpoint_generation_text.pth'

Is it possible to use pretrained model now?

How to combine 2d box and canny edge to control the image generation together?

Thank you for the subsequent updates on more controllable methods, including edge, depth, etc. So fast.

But I have a question: when I want to combine a 2D box and a Canny edge map to control image generation together, how should the UNet structure be redesigned?

For example, by roughly stacking two gated self-attention layers, one for fusing the 2D box embedding and the other for fusing the edge embedding? Any recommendations from your experience?

I would like to get your answer!

Questions about the implementation:

Hi, thanks for your good work.
A few small questions about the implementation:

  1. How long did your text-grounded experiment take to converge? How do I convert the 100K iterations you mentioned into a number of epochs? I don't know how to estimate how long convergence takes on my custom dataset.

  2. why did you set max_boxes_per_data=30? Does it not work well when it is bigger? Any experience here?

  3. Why do you have to go through the "scale to 0-1" operation first when processing the box coordinates? If the aspect ratio of my image is not 1:1, will the normalization operation cause the coordinates to not correspond to the original image?

I would like to get your answer!

Using the "Image grounded generation"

Hey,
How can I use your code for "Image grounded generation" as described in the paper? How should I provide the grounding image, such as you show in Figure 9 of the paper?

Can you please share some instructions?
Thanks!

tsv_split_merge

Can you provide the merged Flickr TSV dataset? The file I merged using tsv_split_merge.py does not work.

CUDA out of memory

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.00 GiB total capacity; 10.15 GiB already allocated; 0 bytes free; 10.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

From looking at this issue, it seems like the model needs ~16 GB of VRAM to run; my GTX 1080 Ti only has 11 GB, with very little else consuming it.

Sun Oct 8 13:38:46 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.50 Driver Version: 531.79 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| 0 NVIDIA GeForce GTX 1080 Ti On | 00000000:01:00.0 On | N/A |
| 19% 58C P0 65W / 280W| 644MiB / 11264MiB | 0% Default |

As suggested when googling the issue, I have tried reducing the batch size.
python gligen_inference.py --batch_size 1
But I am still running out of memory. Does anyone have any ideas, or is what I am attempting not possible?

Unable to load my trained checkpoints

First, Thank you for sharing Great work!

I trained GLIGEN on my custom dataset and obtained weights.

But I can't run gligen_inference.py using my custom weights; it fails with the error below.

The code with the problem is here.

def load_ckpt(ckpt_path):
    saved_ckpt = torch.load(ckpt_path)
    config = saved_ckpt["config_dict"]["_content"]
    model = instantiate_from_config(config['model']).to(device).eval()
    autoencoder = instantiate_from_config(config['autoencoder']).to(device).eval()
    text_encoder = instantiate_from_config(config['text_encoder']).to(device).eval()
    diffusion = instantiate_from_config(config['diffusion']).to(device)
    # do not need to load official_ckpt for self.model here, since we will load from our ckpt
    model.load_state_dict( saved_ckpt['model'] )
    autoencoder.load_state_dict( saved_ckpt["autoencoder"] )
    text_encoder.load_state_dict( saved_ckpt["text_encoder"] )
    diffusion.load_state_dict( saved_ckpt["diffusion"] )
    return model, autoencoder, text_encoder, diffusion, config

How can I solve it?

Evaluation code

The evaluation code for calculating FID and the YOLO score (AP) is not provided. Can you publish it on GitHub?

How to generate same style for one object in several view directions ?

Thanks for your great contribution! This project is convenient for producing a single image with a desired style. It would be great if you had some suggestions for this situation:

I have a toy on a table and I take several pictures of it from different directions. If I run each image through GLIGEN separately, the toy looks quite different in each image. How can I edit the appearance of this toy so that it keeps the same style across the different views?

Grounded Inpainting with Image entity

Amazing paper with a lot of potential for really great applications!
I wanted to ask if there is any demo of the feature I saw in the video and paper where GLIGEN is used for image editing, such as:

  1. Fill in the gaps of the image as shown in video
  2. change the location of the desired object in the image.

Also, if it can be done in the same Hugging Face demo linked above, can you please let me know how?
Any help in this regard is deeply appreciated.

Image size

Is it possible to adjust the image output size?

with stablediffusion i can do
image = pipe(prompt, negative_prompt=negative_prompt, height=536, width=768, generator=generator).images[0]

Is something like that possible in the inference script?

gligen + dreambooth

Hello, how can we use DreamBooth with GLIGEN?

I want to generate a specific subject at a specific location (given by a box).

Convert checkpoint to Diffusers checkpoint

Great work!
I see that the text-box generation checkpoint and the text-box inpainting checkpoint can be used with the Diffusers library, and I want to use the other checkpoints with Diffusers too. How can I convert your original checkpoints to Diffusers checkpoints?
Thank you!

Semantic map inpainting

Hi there,
This is a great tool! thanks for sharing it!
Is there support for semantic-map inpainting, or do you plan to add it? The scenario is simple: a specific contour/segment needs to be replaced by another, e.g., a wooden table by a steel table.
Thanks!

Support for custom dataset training

Thanks for your great effort and the results you have provided.

I wonder if you have any plans to provide a guideline on how to create custom dataset TSV files.
I am interested in reproducing your COCO dataset training results, but I am facing difficulties due to a lack of resources for creating TSV datasets.

I attempted to recreate the 'image_embedding_after' and 'image_embedding_before' data from the raw 'image' data in the flickr30k's TSV dataset, but unfortunately, I was unsuccessful in doing so.

For instance, I tried using the data you provided in the Hugging Face flickr_tsv dataset, which can be found here.

from PIL import Image
from torchvision import transforms
from transformers import AutoProcessor, CLIPVisionModel

from ldm.modules.encoders.modules import FrozenClipImageEmbedder

# First solution: use the repo's frozen CLIP image embedder
image = Image.open(image_path)
image = transforms.ToTensor()(image).unsqueeze(0).cuda()
clip_image_embedder = FrozenClipImageEmbedder("ViT-L/14", device="cuda", jit=False, antialias=False)
image_processed = clip_image_embedder.preprocess(image)
outputs = clip_image_embedder(image_processed)

# Second solution: use the Hugging Face CLIP vision model directly
model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")
inputs = processor(images=image, return_tensors="pt", padding=True, max_length=77, truncation=True)
outputs = model(**inputs)

The outputs I obtained are different from the "image_embedding_before" and "image_embedding_after" in the flickr_tsv dataset.
If you are unable to provide a solution directly at the moment, I would greatly appreciate any advice or guidance you can offer to help me resolve this issue.

How to load another safetensors model instead of SD 1.5

When I tried to change the base model from SD 1.4 to SD 1.5, it worked successfully.
Furthermore, I want to load another model such as RealisticV51, but I do not want to convert it to ckpt format, so I load it in safetensors format.
After that, I found the safetensors data lacks something that the ckpt-format model has: the safetensors file is missing the data under the key named diffusion.
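
As a hedged note: the .pth checkpoints loaded by load_ckpt() (see the issue above) are nested dictionaries with top-level keys such as "model", "autoencoder", "text_encoder" and "diffusion", whereas a .safetensors file is a flat {name: tensor} mapping, so those top-level entries will not be present as-is and the keys would need remapping. The snippet below only inspects a safetensors file; the file name is a placeholder and the remapping itself is not shown.

# Hedged sketch: inspect what a flat safetensors state dict actually contains.
from safetensors.torch import load_file

flat_state = load_file("RealisticV51.safetensors")   # placeholder file name
print(list(flat_state.keys())[:10])                  # check which prefixes/keys are present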

Trouble getting started

I have tested your model in your provided demo and find it very good!
Nevertheless, I am not able to set it up to run on my local computer. (I'm quite new in this and has e.g no experience with docker.)

Is it possible for you to create a more thorough guide to get the model up and running on a local system?

GLIGEN for SD v2 or XL

The released weights are based on SD v1.4. Are there any plans for v2 and XL support?
