[NeurIPS 2023] Customize spatial layouts for conditional image synthesis models, e.g., ControlNet, using GPT

License: MIT License


VisorGPT 🎨 (NeurIPS 2023)

Learning Visual Prior via Generative Pre-Training

Jinheng Xie1  Kai Ye2  Yudong Li2  Yuexiang Li3  Yefeng Zheng3  Linlin Shen2  Mike Zheng Shou1

1 National University of Singapore  2 Shenzhen University  3 Jarvis Research Center, Tencent YouTu Lab

arXiv | demo video | webpage

Updates

  • [2023/05/23] Paper is available.
  • [2023/05/28] Gradio demo is available.
  • [2023/05/30] Hugging Face demo is available.
  • [2023/06/13] Training code and data are available.
  • [2023/09/22] VisorGPT is accepted to NeurIPS 2023.

Quick Start

Step 1

# clone the repo
git clone https://github.com/Sierkinhane/VisorGPT.git

# go to directory
cd VisorGPT

# create a new environment
conda create -n visorgpt python=3.8

# activate the new environment
conda activate visorgpt

# install the base requirements
pip3 install -r requirements.txt

# install ControlNet and GLIGEN
cd demo/ControlNet
pip3 install -v -e .
cd ../GLIGEN
pip3 install -v -e .
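
Optionally, a quick sanity check that the environment is ready (an illustrative snippet; it only assumes the torch and gradio packages the demo itself imports):

# sanity_check.py -- verify the core dependencies installed above (illustrative)
import torch
import gradio

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("gradio:", gradio.__version__)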

Step 2 - Download pre-trained weights

Download visorgpt, controlnet-pose2img, controlnet-sd, gligen-bbox2img, and put them as follows:

├── demo/
│   ├── ckpts
│   │   ├── controlnet
│   │   │   ├── control_v11p_sd15_openpose.pth
│   │   │   ├── v1-5-pruned-emaonly.safetensors
│   │   ├── gligen
│   │   │   ├── diffusion_pytorch_model_box.bin
│   │   ├── visorgpt
│   │   │   ├── visorgpt_dagger_ta_tb.pt
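
Before launching the demo, a small script can confirm the weights are where the demo expects them (a minimal sketch; the paths simply mirror the tree above):

# check_ckpts.py -- verify the checkpoint layout shown above (illustrative)
from pathlib import Path

expected = [
    "demo/ckpts/controlnet/control_v11p_sd15_openpose.pth",
    "demo/ckpts/controlnet/v1-5-pruned-emaonly.safetensors",
    "demo/ckpts/gligen/diffusion_pytorch_model_box.bin",
    "demo/ckpts/visorgpt/visorgpt_dagger_ta_tb.pt",
]
for path in expected:
    status = "ok" if Path(path).is_file() else "MISSING"
    print(f"{status:7s} {path}")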

Step 3 - Run demo

CUDA_VISIBLE_DEVICES=0 python3 gradio_demo.py

Training

  1. Download the preprocessed json files from here.
  2. Process them into text corpora (an illustrative sketch of the resulting sequence format appears at the end of this section), e.g.,
# box type
python3 preprocess_coord.py --input_path path/to/coco_train.json --data_type box --output_dir txt_train
# keypoint type
python3 preprocess_coord.py --input_path path/to/cocokeypoints_train.json --data_type keypoint --output_dir txt_train
# mask type
python3 preprocess_coord.py --input_path path/to/coco_train.json --data_type mask --output_dir txt_train
  3. If you have processed several .txt files, you can merge them into a single .txt file, e.g.,
python3 utiles/merge_files.py --file_dir txt_train --output_file_path train.txt
  4. Tokenize the text corpora.
cd train/
python3 preprocess.py --corpus_path ../train.txt \
                      --vocab_path models/google_uncased_en_coord_vocab.txt \
                      --dataset_path train.pt --processes_num 8 \
                      --seq_length 1024 --tgt_seq_length 1024 --data_processor lm
  5. Train the GPT-2-based model. Training requires 8 V100 GPUs (32 GB each).
deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
                    --dataset_path train.pt \
                    --vocab_path models/google_uncased_en_coord_vocab.txt \
                    --config_path models/gpt2/config.json \
                    --output_model_path train.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 200000 --save_checkpoint_steps 5000 --report_steps 100 \
                    --learning_rate 5e-5 --batch_size 16

Or you can directly download the tokenized data (around 340K sequences) from here and put it in the train/ directory.

deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
                    --dataset_path visorgpt_dagger_train_seq.pt \
                    --vocab_path models/google_uncased_en_coord_vocab.txt \
                    --config_path models/gpt2/config.json \
                    --output_model_path models/visorgpt_dagger_train_seq.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 200000 --save_checkpoint_steps 10000 --report_steps 100 \
                    --learning_rate 5e-5 --batch_size 16
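
For intuition about what the corpus sequences look like, below is an illustrative encoder for the box type. The field layout is inferred from a sample sequence quoted in the issues further down ("box; multiple instances; medium; 4; 0; apple, apple, cake, knife; ..." followed by per-instance coordinate tokens); it is a sketch only, and preprocess_coord.py remains the authoritative definition of the format.

# serialize_box.py -- illustrative encoder for the box-type corpus format.
# The layout is inferred from a sample sequence in the issues below and is
# NOT guaranteed to match preprocess_coord.py exactly.

def serialize_boxes(classes, boxes, size="medium"):
    """classes: instance names; boxes: (xmin, ymin, xmax, ymax) in pixels."""
    count_desc = "multiple instances" if len(classes) > 1 else "single instance"
    header = f"box; {count_desc}; {size}; {len(classes)}; 0; {', '.join(classes)};"
    coords = " ".join(
        f"[ {name} xmin {x0} ymin {y0} xmax {x1} ymax {y1} ]"
        for name, (x0, y0, x1, y1) in zip(classes, boxes)
    )
    return f"{header} {coords}"

print(serialize_boxes(["apple", "apple", "cake", "knife"],
                      [(30, 176, 236, 426), (112, 181, 167, 429),
                       (138, 189, 180, 427), (83, 197, 143, 448)]))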

Inference

CUDA_VISIBLE_DEVICES=0 python3 scripts/generate_lm_multiple.py --load_model_path models/visorgpt_dagger_train_seq.bin/200000/mp_rank_00_model_states.pt \
                               --vocab_path models/google_uncased_en_coord_vocab.txt \
                               --test_path beginning.txt --prediction_path generated_sentence.txt \
                               --config_path models/gpt2/config.json --seq_length 512
                               
or 
CUDA_VISIBLE_DEVICES=0 python3 scripts/generate_lm_multiple.py --load_model_path models/visorgpt_dagger_train_seq.bin \
                               --vocab_path models/google_uncased_en_coord_vocab.txt \
                               --test_path beginning.txt --prediction_path generated_sentence.txt \
                               --config_path models/gpt2/config.json --seq_length 512

Visualization

cd ../
python utils/seq2coord.py --file_path path/to/your/inference/txt --visualize

The visualization results will be saved in ./debug.
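
For reference, the parsing step that utils/seq2coord.py performs can be sketched as follows. The token pattern ("<name> xmin A ymin B xmax C ymax D") is assumed from sample outputs quoted in the issues below; the repository script is the authoritative parser.

# parse_seq.py -- illustrative decoder for generated box sequences.
import re

PATTERN = re.compile(
    r"(\w[\w\s]*?)\s+xmin\s+(\d+)\s+ymin\s+(\d+)\s+xmax\s+(\d+)\s+ymax\s+(\d+)"
)

def parse_boxes(sequence):
    """Extract (name, xmin, ymin, xmax, ymax) tuples from one generated line."""
    return [(name.strip(), int(x0), int(y0), int(x1), int(y1))
            for name, x0, y0, x1, y1 in PATTERN.findall(sequence)]

print(parse_boxes("banana xmin 112 ymin 181 xmax 167 ymax 429 ] [SEP]"))
# -> [('banana', 112, 181, 167, 429)]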

If you use our code, please consider citing our paper:

@inproceedings{xie2023learning,
  title={Learning Visual Prior via Generative Pre-Training},
  author={Jinheng Xie and Kai Ye and Yudong Li and Yuexiang Li and Kevin Qinghong Lin and Yefeng Zheng and Linlin Shen and Mike Zheng Shou},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
}


Issues

Error while loading GLIGEN

Hello,

While loading GLIGEN, I got the following error. Are these the right weights?

  File "gradio_demo.py", line 35, in <module>
    g_config, g_grounding_tokenizer_input = build_gligen_model(ckpt=gligen_model_path)
  File "/home/VisorGPT/demo/GLIGEN/gligen/gligen_inference_box.py", line 229, in build_gligen_model
    model, autoencoder, text_encoder, diffusion, config = load_ckpt(ckpt)
  File "/home/VisorGPT/demo/GLIGEN/gligen/gligen_inference_box.py", line 99, in load_ckpt
    text_encoder.load_state_dict( saved_ckpt["text_encoder"]  )
  File "/opt/conda/envs/visorgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for FrozenCLIPEmbedder:
        Unexpected key(s) in state_dict: "transformer.text_model.embeddings.position_ids".
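
This usually indicates a transformers version mismatch: in newer transformers releases the CLIP text encoder no longer registers position_ids as a persistent buffer, so checkpoints saved under an older version carry one extra key. A common workaround (a sketch, not an official fix from the authors) is to drop the stale key before loading, inside load_ckpt() where the names below already exist:

# workaround sketch: strip the stale buffer key before load_state_dict.
# saved_ckpt and text_encoder are the variables visible in load_ckpt()
# per the traceback above.
sd = saved_ckpt["text_encoder"]
sd.pop("transformer.text_model.embeddings.position_ids", None)
text_encoder.load_state_dict(sd)
# alternatively: text_encoder.load_state_dict(sd, strict=False)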

Questions about Kullback-Leibler divergence calculation in Table 4.

Really appreciate your impressive work~

I wonder how the KL divergence in Table 4 is calculated. Is it an average of the per-category KL, or is it computed across all categories as a whole? For COCO, only 6400 generated samples are used, am I right?

Thanks in advance for your kind response.
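
For concreteness, the two aggregation choices the question contrasts can be written down directly (an illustrative sketch only; P and Q stand for discretized spatial distributions per category, and nothing here restates the paper's actual protocol):

# kl_variants.py -- the two aggregation schemes contrasted above (illustrative)
import numpy as np

def kl(p, q, eps=1e-10):
    """KL(p || q) between two histograms, normalized to sum to 1."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def kl_per_category_mean(P, Q):
    """Average the KL computed separately for each category."""
    return float(np.mean([kl(P[c], Q[c]) for c in P]))

def kl_pooled(P, Q):
    """Treat all categories as one distribution, then compute a single KL."""
    p_all = np.concatenate([P[c] for c in sorted(P)])
    q_all = np.concatenate([Q[c] for c in sorted(P)])
    return kl(p_all, q_all)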

Some issues in training

Thank you for your excellent work. While trying to reproduce the training process, I encountered the following problem; have you seen it before?

Traceback (most recent call last):
  File "/storage/zhaoliuqing/code/VisorGPT/train/pretrain.py", line 121, in <module>
    main()
  File "/storage/zhaoliuqing/code/VisorGPT/train/pretrain.py", line 117, in main
    trainer.train_and_validate(args)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/trainer.py", line 56, in train_and_validate
    worker(args.local_rank, None, args, model_for_training, model_for_dataloader)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/trainer.py", line 638, in worker
    trainer.train(args, gpu_id, rank, train_loader, model_for_training, optimizer, scheduler)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/trainer.py", line 110, in train
    loss = self.forward_propagation(batch, model)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/trainer.py", line 160, in forward_propagation
    loss_info = model(src, tgt, seg)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/models/model.py", line 33, in forward
    emb = self.embedding(src, seg)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/embeddings/embedding.py", line 27, in forward
    emb = embedding(src, seg)
  File "/storage/zhaoliuqing/code/VisorGPT/train/tencentpretrain/embeddings/word_embedding.py", line 27, in forward
    emb = self.embedding(src)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

Looking forward to your reply!
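
Generic PyTorch advice for this class of error (not a confirmed fix for this repository): make sure the batch tensors reach the model's device before the forward pass, e.g. in forward_propagation:

# sketch: move inputs to the model's device before the forward pass.
# model, src, tgt, seg follow the names in the traceback above.
device = next(model.parameters()).device
src, tgt, seg = src.to(device), tgt.to(device), seg.to(device)
loss_info = model(src, tgt, seg)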

The original training data file

Hi Sierkinhane,
Very nice work. Can you provide the original training data file so we can understand how the data is organized? And how is it processed into visorgpt_dagger_train_seq.bin?

Thanks.

What does the generated_sentence.txt generated after training represent?

Hi, I followed the steps you provided and trained for 200,000 steps. When I run inference, the generated_sentence.txt I get is different from the output sequence shown in the paper. When I write "box; multiple instances; medium; 4; 0; apple, apple, cake, knife;" in beginning.txt, I get "[CLS] box; multiple instances; medium; 4; 0; apple, apple, cake, knife; [ ] 176 ymin 188 xmax 236 ymax 426 ] [SEP] banana xmin 112 ymin 181 xmax 167 ymax 429 ] [SEP] ##r xmin 138 ymin 189 xmax 180 ymax 427 ] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] cell phone xmin 83 ymin 197 xmax 143 ymax 448 ] [SEP] [SEP] [SEP] 94 ymin 202 xmax 139 ymax 422 ] [SEP] [SEP] [SEP] [SEP] [SEP] [ SEP] xmin 144 ymin 182 xmax 230 ymax 420 ] [SEP] [SEP] [SEP] 185 ] [SEP] [SEP] [ xmin [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] xmin . ...". What does [SEP] mean here?
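
For context: [CLS] and [SEP] are special tokens from the BERT-style vocabulary used for training (models/google_uncased_en_coord_vocab.txt). In the sample above, [SEP] appears to close each instance sub-sequence, while long runs of [SEP] and stray "##" word pieces are degenerate decoding artifacts. A simple post-processing sketch (illustrative, not the repository's own cleanup):

# clean_seq.py -- strip special tokens and word-piece debris (illustrative)
import re

def clean(seq):
    seq = seq.replace("[CLS]", "").replace("[SEP]", "")
    seq = re.sub(r"##\w+\s*", "", seq)  # drop stray word-piece fragments
    return re.sub(r"\s+", " ", seq).strip()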

.cache/torch_extensions/py38_cu117/utils/utils.so

Thanks for sharing your excellent work!

When I run the training command, I encounter the following error.

Loading extension module utils...
Traceback (most recent call last):
  File "pretrain.py", line 121, in <module>
    main()
  File "pretrain.py", line 117, in main
    trainer.train_and_validate(args)
  File "/mnt/data-1/data/jiagang.zhu/VisorGPT/train/tencentpretrain/trainer.py", line 56, in train_and_validate
    worker(args.local_rank, None, args, model_for_training, model_for_dataloader)
  File "/mnt/data-1/data/jiagang.zhu/VisorGPT/train/tencentpretrain/trainer.py", line 593, in worker
    model_for_training, optimizer, _, scheduler = deepspeed.initialize(
  File "/mnt/data-1/data/jiagang.zhu/miniconda3/envs/visorgpt/lib/python3.8/site-packages/deepspeed/__init__.py", line 125, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/data-1/data/jiagang.zhu/miniconda3/envs/visorgpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 336, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/mnt/data-1/data/jiagang.zhu/miniconda3/envs/visorgpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1284, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/mnt/data-1/data/jiagang.zhu/miniconda3/envs/visorgpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1533, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/mnt/data-1/data/jiagang.zhu/miniconda3/envs/visorgpt/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 165, in __init__
    util_ops = UtilsBuilder().load()
  File "/mnt/data-1/data/jiagang.zhu/miniconda3/envs/visorgpt/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 485, in load
    return self.jit_load(verbose)
  File "/mnt/data-1/data/jiagang.zhu/miniconda3/envs/visorgpt/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 520, in jit_load
    op_module = load(
  File "/mnt/data-1/data/jiagang.zhu/miniconda3/envs/visorgpt/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/mnt/data-1/data/jiagang.zhu/miniconda3/envs/visorgpt/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1534, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/mnt/data-1/data/jiagang.zhu/miniconda3/envs/visorgpt/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1936, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1166, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /home/user001/.cache/torch_extensions/py38_cu117/utils/utils.so: cannot open shared object file: No such file or directory

Have you met this problem before? Thank you.
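
A frequent cause is a stale or partially-built DeepSpeed JIT extension. Clearing the torch extensions cache forces a clean rebuild on the next run (a generic workaround, not specific to this repository):

# clear_ext_cache.py -- remove cached DeepSpeed JIT builds so they recompile
import shutil
from pathlib import Path

cache = Path.home() / ".cache" / "torch_extensions"
shutil.rmtree(cache, ignore_errors=True)
print("cleared:", cache)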
