iigroup / tedigan Goto Github PK

View Code? Open in Web Editor NEW

371.0 10.0 59.0 21.32 MB

[CVPR 2021] Pytorch implementation for TediGAN: Text-Guided Diverse Face Image Generation and Manipulation

Home Page: https://arxiv.org/abs/2012.03308

License: MIT License

Python 99.69% C++ 0.04% Cuda 0.27%

tedigan's Introduction

`TediGAN`

Preprint | Extended Version | Dataset | Video | Colab | Replicate

Implementation for the paper W. Xia, Y. Yang, J.-H. Xue, and B. Wu. TediGAN: Text-Guided Diverse Face Image Generation and Manipulation and Towards Open-World Text-Guided Face Image Generation and Manipulation in PyTorch.

Contact: weihaox AT outlook dot com

Update

[2021/8/28] add an online demo implemented by @bfirsh. This demo uses an open source tool called Cog.

[2021/4/20] add extended paper.

[2021/3/12] add support for high-resolution and multi-modality.

[2021/2/20] add Colab Demo for image editing using StyleGAN and CLIP.

[2021/2/16] add codes for image editing using StyleGAN and CLIP.

TediGAN Framework

We have proposed a novel method (abbreviated as TediGAN) for image synthesis using textual descriptions, which unifies two different tasks (text-guided image generation and manipulation) into the same framework and achieves high accessibility, diversity, controllability, and accurateness for facial image generation and manipulation. Through the proposed multi-modal GAN inversion and large-scale multi-modal dataset, our method can effectively synthesize images with unprecedented quality.

Train the StyleGAN Generator

We use the training scripts from genforce. You should prepare the required dataset to train StyleGAN generator (FFHQ for faces or LSUN Bird for birds).

Train on FFHQ dataset: GPUS=8 CONFIG=configs/stylegan_ffhq256.py WORK_DIR=work_dirs/stylegan_ffhq256_train ./scripts/dist_train.sh ${GPUS} ${CONFIG} ${WORK_DIR}
Train on LSUN Bird dataset: GPUS=8 CONFIG=configs/stylegan_lsun_bird256.py WORK_DIR=work_dirs/stylegan_lsun_bird256_train ./scripts/dist_train.sh ${GPUS} ${CONFIG} ${WORK_DIR}

Or you can directly use a pretrained StyleGAN generator for ffhq_face_1024, ffhq_face_256, cub_bird_256, or lsun_bird_256.

Invert the StyleGAN Generator

This step is to find the matching latent codes of given images in the latent space of a pretrained GAN model, e.g. StyleGAN, StyleGAN2, StyleGAN2Ada (should be the same model in the former step). We ~~will include~~ have included the inverted codes in our Multi-Modal-CelebA-HQ Dataset, which are inverted using idinvert.

Our original method is based on idinvert (including StyleGAN training and GAN inversion). To generate 1024 resolution images and show the scalability of our framework, we also learn the visual-linguistic similarity based on pSp.

Due to the scalability of our framework, there are two general ways that can be used to invert a pretrained StyleGAN.

Train an image encoder like in idinvert or other GAN inversion methods like pSp or e4e.
Project images to latent space directly like in StyleGAN2Ada.

Train the Text Encoder

This step is to learn visual-linguistic similarity, which aims to learn the text-image matching by mapping the image and text into a common embedding space. Compared with the previous methods, the main difference is that they learn the text-image relations by training from scratch on the paired texts and images, while ours forces the text embedding to match an already existing latent space learned from only images.

Using a Pretrained Text Encoder

We can also use some powerful pretrained language models, e.g., CLIP, to replace the visual-linguistic learning module. CLIP (Contrastive Language-Image Pre-Training) is a recent a neural network trained on 400 million image and text pairs.

In this case, we have the pretrained image model StyleGAN (or StyleGAN2, StyleGAN2Ada) and the pretrained text encoder CLIP. The inversion process is still necessary. Given the obtained inverted codes of a given image, the desired manipulation or generation result can be simply obtained using the instance-level optimization with an additional CLIP term.

The first step is to install CLIP by running the following commands:

pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git

The pretrained model will be downloaded automatically from the OpenAI website (RN50 or ViT-B/32).

The manipulated or generated results can be obtained by simply running:

python invert.py --mode='man'               # 'man' for manipulation, 'gen' for generation
	--image_path='examples/142.jpg' # path of the input image
	--description='he is old'       # a textual description, e.g., he is old.
	--loss_weight_clip='1.0'        # weight for the CLIP loss.
	--num_iterations=200            # number of optimization iterations

or you can try the online demo:

streamlit run streamlit_app.py

The diverse and high-resolution results from sketch or label can be obtained by running:

cd ext/
python inference.py 
	--exp_dir=experiment                                 # path of logs and results
	--checkpoint_path=pretrained_models/{model_name}.pt  # path of pretrained models
	--data_path=experiment/images/{dir}                  # path of input images
python demo.py --description='he is old' 
	--mode='man' --f_oom=False                           # set as True if OOM error.
	--step=500   --loss_clip_weight=200

The pretrained models can be downloaded here.

Text-to-image Benchmark

Datasets

Multi-Modal-CelebA-HQ Dataset [Link]
CUB Bird Dataset [Link]
COCO Dataset [Link]

Publications

Below is a curated list of related publications with codes (The full list can be found here).

Text-to-image Generation

[DALL-E] Zero-Shot Text-to-Image Generation (2021) [paper] [code] [dVAE] [blog]
[DF-GAN] Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis (2020) [paper] [code]
[ControlGAN] Controllable Text-to-Image Generation (NeurIPS 2019) [paper] [code]
[DM-GAN] Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis (CVPR 2019) [paper] [code]
[MirrorGAN] Learning Text-to-image Generation by Redescription (CVPR 2019) [paper] [code]
[Obj-GAN] Object-driven Text-to-Image Synthesis via Adversarial Training (CVPR 2019) [paper] [code]
[SD-GAN] Semantics Disentangling for Text-to-Image Generation (CVPR 2019) [paper] [code]
[HD-GAN] Photographic Text-to-Image Synthesis with a Hierarchically-nested Adversarial Network (CVPR 2018) [paper] [code]
[AttnGAN] Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks (CVPR 2018) [paper] [code]
[StackGAN++] Realistic Image Synthesis with Stacked Generative Adversarial Networks (TPAMI 2018) [paper] [code]
[StackGAN] Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks (ICCV 2017) [paper] [code]
[GAN-INT-CLS] Generative Adversarial Text to Image Synthesis (ICML 2016) [paper] [code]

Text-guided Image Manipulation

[ManiGAN] ManiGAN: Text-Guided Image Manipulation (CVPR 2020) [paper] [code]
[Lightweight-Manipulation] Lightweight Generative Adversarial Networks for Text-Guided Image Manipulation (NeurIPS 2020) [paper] [code]
[SISGAN] Semantic Image Synthesis via Adversarial Learning (ICCV 2017) [paper] [code]
[TAGAN] Text-Adaptive Generative Adversarial Networks: Manipulating Images with Natural Language (NeurIPS 2018) [paper] [code]

Metrics

FID ([paper] [code] )
Inception-Score ([paper] [code])
LIPIPS ([paper] [code])

Acknowledgments

The GAN inversion codes borrow heavily from idinvert and pSp. The StyleGAN implementation is from genforce and StyleGAN2 from Kim Seonghyeon.

Citation

If you find our work, code, or the benchmark helpful for your research, please consider to cite:

@inproceedings{xia2021tedigan,
  title={TediGAN: Text-Guided Diverse Face Image Generation and Manipulation},
  author={Xia, Weihao and Yang, Yujiu and Xue, Jing-Hao and Wu, Baoyuan},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}

@article{xia2021open,
  title={Towards Open-World Text-Guided Face Image Generation and Manipulation},
  author={Xia, Weihao and Yang, Yujiu and Xue, Jing-Hao and Wu, Baoyuan},
  journal={arxiv preprint arxiv: 2104.08910},
  year={2021}
}

tedigan's People

Contributors

Stargazers

Watchers

Forkers

cv-ip gaoxiang513 drwq ross-liu mattc95 zebrajack peterzhousz etri-visualcommonsense orange1999 templeblock cordob yangxs peterzs wysokistudent nivc jacobwjs anthonyyuan lx97 bfirsh liugengia milkigit sclbd cosmicbhejafry jamescoupe maxylee ji-in zengyh1900 nelsontseng0704 zeke celsopitta kashik0i free-elf visionu wewan iomrla mohamedsaeed9 ashahawy guoqi0531 mrk1992 excurl ayankumarbhunia gongshuai1 ingeniousfrog leyrio eswar540 dengjl-hub yws2017 cybersys abhishektandon-github thedhruvrawat wangzf-zz sunmeng7 anparashar molecular-medicine hafizhauliansyah sasi591 12wxy mohammadalikakavand gowthamkumar02

tedigan's Issues

How to use this with stylegan ADA

Hello!

Thanks for sharing your work, really interesting!

I was curious if you could offer a small write up, or advice, on how to use this approach with a stylegan-ada model. You mention briefly that we could project the image into the space, which is something used in other guiding approaches, but how can that happen within this code?

Thank you so much!

Pretrained StyleGAN generator links

Hello, thank you for your contribution.

The links you have provided for the pretrained StyleGAN generator for cub_bird_256 and lsun_bird_256 seem to be broken, giving a "page not found" error. Could you please update those links?

How the "control mechanism" chooses corresponding wi ?

style mix

about style mixing，how to select the layer to be replaced according to the text content

File "invert_v2.py", line 32 parser.add_argument('--learning_rate', type=float, default=0.01, ^ SyntaxError: invalid syntax

Getting this kind of error, Please Help me Sir/Madam

LPIPS

Is lower LPIPS better or higher better?

LPIPS measures diversity, so high diversity should be good?

SDG (https://github.com/xh-liu/SDG_code) uses generated paired images to calculate LPIPS, and they note that higher LPIPS is better.

Which is correct?

How much time do you need to train the model?

How to evaluate the results?

The dataset is divided by .pickle file and how can I put them into the fold and use the fid or lpips to earn the result?

Semantic labels

Hi!
Where do you get such hardly visible input semantic labels like this one:
https://github.com/IIGROUP/TediGAN/blob/main/ext/experiment/images/lab/input_label.png

Multi-Modal CelebA-HQ Dataset

Hello, I'm excited about your excellent work!
I want to download Multi-Modal CelebA-HQ Dataset described in your paper, according to the URL provided by https://drive.google.com/drive/folders/1eVrGKfkbw7bh9xPcX8HJa-qWQTD9aWvf.
Sadly, I can't access it. Is the URL correct? How can I download this dataset?
I look forward to your reply!

Do you have any recommended values for the hyperparameters in the demo.py section?

Hello, I've recently been learning about your TediGAN work and you've written it very well and I'm very interested! I would like to know what is your official recommended parameter value for this part of the code below?

weights file not accesible

The link https://mycuhk-my.sharepoint.com/:u:/g/personal/1155082926_link_cuhk_edu_hk/EdfMxgb0hU9BoXwiR3dqYDEBowCSEF1IcsW3n4kwfoZ9OQ?e=VwIV58&download=1 is irresponsive. Please fix this!

inference script for both text and label

Hi, thanks for open sourcing this work.

I wonder whether this work supports simultaneous control from both text and semantic parsing (i.e. label / segmentation mask) ? If so, is it possible to share the inference script?
Thank you very much.

Text-to-image generation

How can i use the released codes to generate image for texts? I run the code invert.py with mode of 'gen', but got worse results. And I found that the code of invert.py directely inits the latent code z and applies the instance-level optimization with the CLIP loss upon it. This is different from procedure annotated in the paper, which considers the texts input as w_c and random noise as w_s and then mix them. Could you help me to overcome this? Thanks！

Using my dataset to retrain

I follow the genforce 'config/stylegan_ffhq256.py' to train using my own dataset, and get this file.

Its structure and parameters are similar to 'stylegan_ffhq256.pth'.
How can I get the 'styleganinv_ffhq256_encoder.pth' and 'styleganinv_ffhq256_generator.pth' using my dataset?

Is training code available for new data?

Hi, great work there!!

I am wondering if there's training code available where I can try on my own data instead of faces?

Some question of the metric method---FID

Thank you for your work! I try to evaluate the FID and I used the code (fid.py)in the link you provided. But there always gets an error. I try to run the code on colb and my own computer and the errors are shown below. I don't know why this is happening. Can I use the PyTorch version of FID instead of the TensorFlow version of FID?

2021-06-30 13:57:01.685915: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using _XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
tcmalloc: large alloc 4718592000 bytes == 0x5586091f8000 @  0x7f26b949d1e7 0x7f26b6fdd46e 0x7f26b702dc7b 0x7f26b7030e83 0x7f26b703107b 0x7f26b70d2761 0x5584d699ccc0 0x5584d699ca50 0x5584d6a10be0 0x5584d6a0b4ae 0x5584d699e3ea 0x5584d6a0d32a 0x5584d6a0b4ae 0x5584d699e3ea 0x5584d6a0d32a 0x5584d6a0b4ae 0x5584d6a0b1b3 0x5584d6ad5182 0x5584d6ad54fd 0x5584d6ad53a6 0x5584d6aac723 0x5584d6aac3cc 0x7f26b8287bf7 0x5584d6aac2aa
tcmalloc: large alloc 1244758016 bytes == 0x55851527c000 @  0x7f26b949d1e7 0x7f26a01bbc05 0x7f26a0c908fe 0x7f26a306ca20 0x7f26a30d4b90 0x7f26a33328e8 0x7f26a3358926 0x7f26a335909a 0x7f26a335a00a 0x7f269ccc2136 0x7f269ccb25c5 0x7f269cd5c3ae 0x7f269cd59228 0x7f26b7d7fa50 0x7f26b92526db 0x7f26b838771f_
^C

Can you upload the train_vls.py

How to generate images from the given description?

Hello, I had a doubt. How do we generate images directly from text descriptions. I executed the invert_v2.py code and it seems that it manipulates an input image.

cub_bird_256

The link of pretrained model on cub_bird_256 is not valid now. Can you reload it? or provide a new link?

Training the text encoder

Hi, I want to train a text encoder but the code is unavailable now. Could you please upload the code if possible?
Thanks!

OSError: No pre-trained weights found for perceptual model!

I got this error in the Playground CoLab notebook. Also, can you help me with training and executing this model in CoLab as i dont have a GPU

class Predictor(cog.Predictor): AttributeError: module 'cog' has no attribute 'Predictor'

can't run project because this error, please i want answer

Missing key(s) in state_dict

i use genforce as you mentioned to train a stylegan with my data. but the pretrained model couldn't load. Help me,please
genforce link https://github.com/genforce/genforce

inverted_codes.pt not exists

Hi, I try to run some high-res result by

python demo.py --description='he is old' 
	--mode='man' --f_oom=False                           # set as True if OOM error.
	--step=500   --loss_clip_weight=200

but I get error as cant find the pt: FileNotFoundError: [Errno 2] No such file or directory: 'experiment/inference_results/inverted_codes.pt'

Where can I download or generate this inverted_codes.pt? Thanks!

where is the code for training and testing the text encoder?

Congratulations！That's a nice work!
I've skimmed through your code, most of it comes from psp(cvpr2021) and idivert(eccv2020), so I'd like to know where is your main contribution code, i.e., code for training and testing the text encoder.

instance-level optimization

Great paper! Sorry for my english. Can you explain why (5) formula help identity preservation. I also ask some in (5)

initialization z is inverted latent code of mixing text and visual latent code ?
x is origin visual input image ?
2 first terms help preserve pixel and semantic and last term help regularize the latent is in region visual encoder work.

Inverted stylegan

I trained stylegan for my data. How can I get inverted stylegan? Could you add the scripts for learning inverted stylegan to the README?

what is the epoch you set for comparison methods that mentioned in your paper when you retrain them on the Multi-Modal CelebA-HQ dataset?

"All methods are retrained with the default settings on the proposed Multi-Modal CelebA-HQ dataset." in your paper.
So what is the epoch you set for each of these methods?

Thanks a ton !

colab errors

!python demo.py --description='he is a young woman' --mode='woman' --step=500 --f_oom=False

Traceback (most recent call last):
  File "demo.py", line 15, in <module>
    from idinvert_pytorch.models.perceptual_model import PerceptualModel
ModuleNotFoundError: No module named 'idinvert_pytorch.models.perceptual_model'