CompSci_274E

This repo uses Dreambooth to teach a diffusion model our pictures so that it can generate images of us from text prompts. We fine-tune the stable-diffusion-xl (SDXL) model from Hugging Face (over 10GB in size) on a single Turing T4 GPU (16GB) on Google Colab, using LoRA and Accelerate from Hugging Face. The repo also looks at merging different LoRA adapters in order to combine styles.

Motivation

  • Can low-rank adapters (LoRA) work well for training Dreambooth?
  • Dreambooth works well on pictures of objects; can it learn to represent human faces well?
  • How many images do we need to teach the model a new subject?
  • What is prior preservation?
  • How will the model recognize a subject from the text prompt?
  • Can we merge different adapters to combine different styles?
  • What are some difficulties when training on human faces, and how can we offset them?
  • On which text prompts does the model do well, and where does it fail?

Project Structure

  • The data directory contains subfolders named after each team member, each holding 6 high-resolution images of the subject. This folder also contains prior.zip, which holds 197 images of human faces (excluding our own). These images are used to train the model with prior preservation.
  • The train directory contains similar subfolders, each holding a team member's training notebooks.
  • The inference directory contains similar subfolders, each holding structured inference results for the models trained with and without prior preservation. This covers all the images generated from text prompts after training.
  • The dream_booth.py file contains the model and the script to train it. The script is a simplified adaptation of the official Dreambooth training script from Hugging Face.
  • The eval directory contains a notebook that uses the official evaluation script from the original Dreambooth repo (Google Research - Dreambooth) to evaluate our trained models across various timesteps and schedulers.

Dataset

The dataset contains 6 high-resolution images of each team member. For Dreambooth, it is important that these images cover different angles and clearly display the face. In our experiments, 5-6 images are enough to fine-tune stable-diffusion-xl (SDXL) with LoRA. For prior preservation, we also use 197 images of other human faces to increase diversity and reduce language drift. These prior images are generated by the diffusion model itself.
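
Concretely, the class-prior images can be generated with the base pipeline before training starts. Below is an illustrative sketch; the model ID, prompt, step count, and output paths are our assumptions rather than values fixed by the training script.

```python
# Illustrative: generate class-prior images with the base SDXL model itself.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

for i in range(197):
    # The class prompt carries no rare token, only the class prior "person".
    image = pipe("a photo of a person", num_inference_steps=30).images[0]
    image.save(f"data/prior/{i:03d}.png")
```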

Prior-Preservation

Fine-tuning layers that are conditioned on the text embeddings gives rise to the problem of language drift, where a model that is pre-trained on a large text corpus and later fine-tuned for a specific task progressively loses syntactic and semantic knowledge of the language. This phenomenon also affects diffusion models, where the model slowly forgets how to generate subjects of the same class as the target subject.

Another problem is the possibility of reduced output diversity. Text-to-image diffusion models naturally possess high output diversity. When fine-tuning on a small set of images, we would like to be able to generate the subject in novel viewpoints, poses, and articulations, yet there is a risk of collapsing the variability in the output poses and views of the subject. To mitigate both issues, the Dreambooth paper [2] proposes an autogenous class-specific prior-preservation loss that encourages diversity and counters language drift. The idea is to supervise the model with its own generated samples so that it retains the prior once few-shot fine-tuning begins. This allows it to generate diverse images of the class prior, and to retain knowledge about the class that it can use in conjunction with knowledge about the subject instance.
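
A minimal sketch of this loss in the noise-prediction setting, assuming the instance and class-prior samples are stacked in a single batch (as the Diffusers Dreambooth scripts do); prior_weight is the weighting term for the prior loss, and the function name is ours:

```python
import torch
import torch.nn.functional as F

def prior_preservation_loss(noise_pred, noise, prior_weight=1.0):
    # The batch stacks instance samples first, then class-prior samples;
    # split both the prediction and the target back into the two halves.
    pred_inst, pred_prior = torch.chunk(noise_pred, 2, dim=0)
    target_inst, target_prior = torch.chunk(noise, 2, dim=0)

    instance_loss = F.mse_loss(pred_inst, target_inst)  # learn the subject
    prior_loss = F.mse_loss(pred_prior, target_prior)   # retain the class prior
    return instance_loss + prior_weight * prior_loss
```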

*(image: prior-preservation overview)*

Training

To accommodate such a large model on a 16GB Turing T4 GPU, we make use of gradient accumulation, gradient checkpointing, and 8-bit fused Adam (instead of regular Adam). Training on 6 images for 1000 steps is conducted both with and without the prior-preservation loss, to verify that prior preservation actually helps.
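
A rough sketch of how these pieces fit together with Diffusers, PEFT, Accelerate, and bitsandbytes; the LoRA rank, learning rate, and accumulation steps below are illustrative choices, not the exact values from our runs.

```python
import bitsandbytes as bnb
from accelerate import Accelerator
from diffusers import UNet2DConditionModel
from peft import LoraConfig

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
unet.requires_grad_(False)            # freeze the base weights
unet.enable_gradient_checkpointing()  # recompute activations to save memory

# Attach low-rank adapters to the attention projections only.
unet.add_adapter(
    LoraConfig(r=4, lora_alpha=4,
               target_modules=["to_q", "to_k", "to_v", "to_out.0"])
)

params = [p for p in unet.parameters() if p.requires_grad]  # LoRA params only
optimizer = bnb.optim.AdamW8bit(params, lr=1e-4)            # 8-bit Adam

# Accumulate gradients over micro-batches to emulate a larger batch size.
accelerator = Accelerator(gradient_accumulation_steps=4, mixed_precision="fp16")
unet, optimizer = accelerator.prepare(unet, optimizer)
```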

Training Prompt

In order to teach the model a mapping between text and a subject, Dreambooth proposes taking a rare token from the model's vocabulary and combining it with the subject's class prior. For instance, to train on my face, I use the prompt

A photo of rraj person

Here rraj is the rare vocabulary token and person is the class prior for the subject.
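
In code, the two training prompts look roughly like this (the variable names are ours); the class prompt deliberately omits the rare token, so the prior images keep supervising the generic class:

```python
rare_token, class_noun = "rraj", "person"

# Instance prompt: used for the 6 subject photos.
instance_prompt = f"a photo of {rare_token} {class_noun}"

# Class prompt: used for the 197 prior-preservation images.
class_prompt = f"a photo of a {class_noun}"
```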

Results

It turns out that LoRA + Dreambooth with 1000 steps works decently well on human faces too. Prior preservation clearly improves the model, as seen in the images below. In our runs, the PNDM scheduler works well with just 50 timesteps, while DDIM needs about 80.
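
These inference settings translate to Diffusers roughly as follows, assuming a pipeline with our trained LoRA weights attached (the LoRA path is illustrative):

```python
import torch
from diffusers import DDIMScheduler, PNDMScheduler, StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/rraj_lora")  # illustrative path

# PNDM worked well with just 50 steps.
pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)
img_pndm = pipe("a photo of rraj person", num_inference_steps=50).images[0]

# DDIM needed around 80 steps.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
img_ddim = pipe("a photo of rraj person", num_inference_steps=80).images[0]
```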

*(image: generated samples a-l, referenced in the table below)*
| Images | Prompt | Category |
| --- | --- | --- |
| b, h | a photo of [SUBJECT] person with sunglasses | Accessorization |
| a, g | pixel, a photo of [SUBJECT] person wearing sunglasses | Merging LoRA Adapters |
| c, i | a photo of [SUBJECT] person at Oktoberfest | Recontextualization |
| d, j | a painting of [SUBJECT] person in the style of Van Gogh | Art Rendition |
| e, k | a photo of [SUBJECT] person with blonde hair | Property Modification |
| f, l | a side view photo of [SUBJECT] person | Novel-View Synthesis |

CLIP-I Score

CLIP-I is the average pairwise cosine similarity between CLIP embeddings of generated and real images.
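
For the numbers below we rely on the official Dreambooth evaluation script; the following is only a sketch of how the metric can be computed with the transformers CLIP model (the checkpoint choice is our assumption):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_i(generated_images, real_images):
    """Average pairwise cosine similarity between CLIP image embeddings."""
    def embed(images):
        inputs = processor(images=images, return_tensors="pt")
        feats = model.get_image_features(**inputs)
        return feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize

    gen, real = embed(generated_images), embed(real_images)
    return (gen @ real.T).mean().item()  # mean over all generated/real pairs
```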

| Scheduler | Steps | Prior Preservation | CLIP-I |
| --- | --- | --- | --- |
| DDIM | 50 | No | 0.9580 |
| DDIM | 50 | Yes | 0.9760 |
| DDIM | 80 | No | 0.9663 |
| DDIM | 80 | Yes | 0.9683 |
| PNDM | 50 | No | 0.9761 |
| PNDM | 50 | Yes | 0.9702 |
| PNDM | 80 | No | 0.9751 |
| PNDM | 80 | Yes | 0.9688 |

Merging Adapters

We experiment with merging two LoRA adapters: our Dreambooth adapter for the subject and a Pixel Art style adapter, in order to generate pictures of the subject in pixel-art style.

prompt = "pixel, a photo of rraj person wearing sunglasses"
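
A sketch of the merge using Diffusers' multi-adapter support; the adapter paths and blend weights are illustrative, and we assume the pixel-art adapter's trigger word is "pixel":

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

pipe.load_lora_weights("path/to/rraj_lora", adapter_name="subject")
pipe.load_lora_weights("path/to/pixel_art_lora", adapter_name="pixel")

# Blend the two adapters; the weights control each style's contribution.
pipe.set_adapters(["subject", "pixel"], adapter_weights=[1.0, 0.8])

image = pipe("pixel, a photo of rraj person wearing sunglasses").images[0]
```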

*(image: pixel-art generations)*

Limitations

Generating faces is hard: sometimes the eyes and teeth are not rendered properly, or are mismatched with the training images. For instance, below, the subject's eyes are rendered green even though they are black in the training images.

*(image: failure case with mis-rendered eye color)*

Compute Limits

The GPU did not allow us to fine-tune the text encoder (two text encoders in the case of SDXL). Fine-tuning the text encoders is known to further improve image generation quality.

References

[1] Hugging Face Blog

[2] Ruiz et al., "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation", CVPR 2023

[3] Dreambooth Diffusers Training Script
