CompSci_274E

This repo uses Dreambooth to teach a diffusion model our pictures so that it can generate images of us from text prompts. We fine-tune the stable-diffusion-xl (SDXL) model from Hugging Face (over 10GB in size) on a single Turing T4 GPU (16GB) on Google Colab, using LoRA and Accelerate from Hugging Face. The repo also looks at merging different LoRA adapters in order to combine styles.

Motivation

  • Can low-rank adapters (LoRA) work well for training Dreambooth?
  • Dreambooth works well on pictures of objects; can it learn to represent human faces well?
  • How many images do we need to teach the model a new subject?
  • What is prior preservation?
  • How will the model recognize a subject from the text prompt?
  • Can we merge different adapters to combine different styles?
  • What are some difficulties when training on human faces, and how can we offset them?
  • On which text prompts does the model do well, and where does it fail?

Project Structure

  • The data directory contains subfolders named after each team member, each holding 6 high-resolution images of the subject. This folder also contains prior.zip, which holds 197 images of human faces (excluding our own). These images are used to train the model with prior preservation.
  • The train directory contains similar subfolders, each holding a team member's training notebooks.
  • The inference directory contains similar subfolders, each holding structured inference results for the models trained with and without prior preservation. This covers all the images generated from text prompts after training.
  • The dream_booth.py file contains the model and the script to train it. The script is a simplified adaptation of the official Dreambooth training script from Hugging Face.
  • The eval directory contains a notebook that uses the official evaluation script from the original Dreambooth repo (Google Research - Dreambooth) to evaluate our trained models across various timesteps and schedulers.

Dataset

The dataset contains 6 high-resolution images of each team member. For Dreambooth, it is important that these images cover different angles and clearly display the face. In our experiments, 5-6 images are enough to fine-tune stable-diffusion-xl (SDXL) with LoRA. For prior preservation, we also use 197 images of other human faces to increase diversity and reduce language drift. These prior images are generated by the diffusion model itself.
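
Concretely, the class-prior images can be generated with the base pipeline before training starts. Below is an illustrative sketch; the model ID, prompt, step count, and output paths are our assumptions rather than values fixed by the training script.

```python
# Illustrative: generate class-prior images with the base SDXL model itself.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

for i in range(197):
    # The class prompt carries no rare token, only the class prior "person".
    image = pipe("a photo of a person", num_inference_steps=30).images[0]
    image.save(f"data/prior/{i:03d}.png")
```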

Prior-Preservation

Fine-tuning layers that are conditioned on the text embeddings gives rise to the problem of language drift, where a model that is pre-trained on a large text corpus and later fine-tuned for a specific task progressively loses syntactic and semantic knowledge of the language. This phenomenon also affects diffusion models, where the model slowly forgets how to generate subjects of the same class as the target subject.

Another problem is the possibility of reduced output diversity. Text-to-image diffusion models naturally possess high output diversity. When fine-tuning on a small set of images, we would like to be able to generate the subject in novel viewpoints, poses, and articulations, yet there is a risk of collapsing the variability in the output poses and views of the subject. To mitigate both issues, the Dreambooth paper [2] proposes an autogenous class-specific prior-preservation loss that encourages diversity and counters language drift. The idea is to supervise the model with its own generated samples so that it retains the prior once few-shot fine-tuning begins. This allows it to generate diverse images of the class prior, and to retain knowledge about the class that it can use in conjunction with knowledge about the subject instance.
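
A minimal sketch of this loss in the noise-prediction setting, assuming the instance and class-prior samples are stacked in a single batch (as the Diffusers Dreambooth scripts do); prior_weight is the weighting term for the prior loss, and the function name is ours:

```python
import torch
import torch.nn.functional as F

def prior_preservation_loss(noise_pred, noise, prior_weight=1.0):
    # The batch stacks instance samples first, then class-prior samples;
    # split both the prediction and the target back into the two halves.
    pred_inst, pred_prior = torch.chunk(noise_pred, 2, dim=0)
    target_inst, target_prior = torch.chunk(noise, 2, dim=0)

    instance_loss = F.mse_loss(pred_inst, target_inst)  # learn the subject
    prior_loss = F.mse_loss(pred_prior, target_prior)   # retain the class prior
    return instance_loss + prior_weight * prior_loss
```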

*(image: prior-preservation overview)*

Training

To accommodate such a large model on a 16GB Turing T4 GPU, we make use of gradient accumulation, gradient checkpointing, and 8-bit fused Adam (instead of regular Adam). Training on 6 images for 1000 steps is conducted both with and without the prior-preservation loss, to verify that prior preservation actually helps.
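
A rough sketch of how these pieces fit together with Diffusers, PEFT, Accelerate, and bitsandbytes; the LoRA rank, learning rate, and accumulation steps below are illustrative choices, not the exact values from our runs.

```python
import bitsandbytes as bnb
from accelerate import Accelerator
from diffusers import UNet2DConditionModel
from peft import LoraConfig

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
unet.requires_grad_(False)            # freeze the base weights
unet.enable_gradient_checkpointing()  # recompute activations to save memory

# Attach low-rank adapters to the attention projections only.
unet.add_adapter(
    LoraConfig(r=4, lora_alpha=4,
               target_modules=["to_q", "to_k", "to_v", "to_out.0"])
)

params = [p for p in unet.parameters() if p.requires_grad]  # LoRA params only
optimizer = bnb.optim.AdamW8bit(params, lr=1e-4)            # 8-bit Adam

# Accumulate gradients over micro-batches to emulate a larger batch size.
accelerator = Accelerator(gradient_accumulation_steps=4, mixed_precision="fp16")
unet, optimizer = accelerator.prepare(unet, optimizer)
```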

Training Prompt

In order to teach the model a mapping between text and a subject, Dreambooth proposes taking a rare token from the model's vocabulary and combining it with the subject's class prior. For instance, to train on my face, I use the prompt

A photo of rraj person

Here rraj is the rare vocabulary token and person is the class prior for the subject.
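
In code, the two training prompts look roughly like this (the variable names are ours); the class prompt deliberately omits the rare token, so the prior images keep supervising the generic class:

```python
rare_token, class_noun = "rraj", "person"

# Instance prompt: used for the 6 subject photos.
instance_prompt = f"a photo of {rare_token} {class_noun}"

# Class prompt: used for the 197 prior-preservation images.
class_prompt = f"a photo of a {class_noun}"
```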

Results

It turns out that LoRA + Dreambooth with 1000 steps works decently well on human faces too. Prior preservation clearly improves the model, as seen in the images below. In our runs, the PNDM scheduler works well with just 50 timesteps, while DDIM needs about 80.
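
These inference settings translate to Diffusers roughly as follows, assuming a pipeline with our trained LoRA weights attached (the LoRA path is illustrative):

```python
import torch
from diffusers import DDIMScheduler, PNDMScheduler, StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/rraj_lora")  # illustrative path

# PNDM worked well with just 50 steps.
pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)
img_pndm = pipe("a photo of rraj person", num_inference_steps=50).images[0]

# DDIM needed around 80 steps.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
img_ddim = pipe("a photo of rraj person", num_inference_steps=80).images[0]
```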

*(image: generated samples a-l, referenced in the table below)*
| Images | Prompt | Category |
| --- | --- | --- |
| b, h | a photo of [SUBJECT] person with sunglasses | Accessorization |
| a, g | pixel, a photo of [SUBJECT] person wearing sunglasses | Merging LoRA Adapters |
| c, i | a photo of [SUBJECT] person at Oktoberfest | Recontextualization |
| d, j | a painting of [SUBJECT] person in the style of Van Gogh | Art Rendition |
| e, k | a photo of [SUBJECT] person with blonde hair | Property Modification |
| f, l | a side view photo of [SUBJECT] person | Novel-View Synthesis |

CLIP-I Score

CLIP-I is the average pairwise cosine similarity between CLIP embeddings of generated and real images.
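
For the numbers below we rely on the official Dreambooth evaluation script; the following is only a sketch of how the metric can be computed with the transformers CLIP model (the checkpoint choice is our assumption):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_i(generated_images, real_images):
    """Average pairwise cosine similarity between CLIP image embeddings."""
    def embed(images):
        inputs = processor(images=images, return_tensors="pt")
        feats = model.get_image_features(**inputs)
        return feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize

    gen, real = embed(generated_images), embed(real_images)
    return (gen @ real.T).mean().item()  # mean over all generated/real pairs
```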

| Scheduler | Steps | Prior Preservation | CLIP-I |
| --- | --- | --- | --- |
| DDIM | 50 | No | 0.9580 |
| DDIM | 50 | Yes | 0.9760 |
| DDIM | 80 | No | 0.9663 |
| DDIM | 80 | Yes | 0.9683 |
| PNDM | 50 | No | 0.9761 |
| PNDM | 50 | Yes | 0.9702 |
| PNDM | 80 | No | 0.9751 |
| PNDM | 80 | Yes | 0.9688 |

Merging Adapters

We experiment with merging two LoRA adapters: our Dreambooth adapter for the subject and a Pixel Art style adapter, in order to generate pictures of the subject in pixel-art style.

prompt = "pixel, a photo of rraj person wearing sunglasses"
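
A sketch of the merge using Diffusers' multi-adapter support; the adapter paths and blend weights are illustrative, and we assume the pixel-art adapter's trigger word is "pixel":

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

pipe.load_lora_weights("path/to/rraj_lora", adapter_name="subject")
pipe.load_lora_weights("path/to/pixel_art_lora", adapter_name="pixel")

# Blend the two adapters; the weights control each style's contribution.
pipe.set_adapters(["subject", "pixel"], adapter_weights=[1.0, 0.8])

image = pipe("pixel, a photo of rraj person wearing sunglasses").images[0]
```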

*(image: pixel-art generations)*

Limitations

Generating faces is hard: sometimes the eyes and teeth are not rendered properly, or are mismatched with the training images. For instance, below, the subject's eyes are rendered green even though they are black in the training images.

*(image: failure case with mis-rendered eye color)*

Compute Limits

The GPU did not allow us to fine-tune the text encoder (two text encoders in the case of SDXL). Fine-tuning the text encoders is known to further improve image generation quality.

References

[1] Hugging Face Blog

[2] Ruiz et al., "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation", CVPR 2023

[3] Dreambooth Diffusers Training Script
