Pin a prompt to a visual target!
An extension for AUTOMATIC1111's Stable Diffusion Web UI, based on:
- Variation in prompts is hard to “pin down”: it can be difficult to tell which parts of a prompt are “locking in” a particular result. For example, a highly-specified prompt can produce results with little variation, even at lower CFG scales.
- Why is this useful?
  - Analyze larger prompts to tell which parts are “tighter” or “looser,” relative to a particular model, VAE, etc.
  - Refine precise prompts by eliminating certain variations.
  - Build “prompt pieces” for specifying particular behavior, e.g. prompt-based “bad hand” or “tarot card” embeddings.
- Advanced:
  - Target images provide a simple way to pin to a particular image (e.g. for animation)
- Unimplemented (at time of writing):
  - Target images that ignore an image mask, e.g. fix parts of an image for animation, solely using the prompt!
  - CLIP-based analysis to allow pinning a result to a particular (set of) goal tag(s)
CMA (covariance matrix adaptation) is an efficient, automatic evolutionary optimization method.
- It's a good fit for problems where the input is a real-valued vector and the objective is reasonably smooth.
- In practice, it converges exponentially fast.
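As a rough mental model, a stripped-down evolution strategy in the same family can be sketched in a few lines. This is a simplified sketch (isotropic sampling with geometric step-size decay instead of CMA's full covariance and step-size adaptation), not the DEAP implementation this extension uses; the quadratic objective stands in for the visual loss.

```python
import numpy as np

def simple_es(objective, x0, sigma=0.5, lam=8, generations=50, seed=0):
    """Minimal (mu, lambda) evolution strategy: sample children around a
    centroid, keep the best half, and recombine them into the new centroid.
    Real CMA-ES also adapts a full covariance matrix and step size; here a
    simple geometric step-size decay stands in for that adaptation."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(x0, dtype=float)
    mu = lam // 2
    for _ in range(generations):
        # Sample lambda children from a Gaussian around the centroid
        children = mean + sigma * rng.standard_normal((lam, mean.size))
        losses = np.array([objective(c) for c in children])
        # Keep the mu best children and average them into the new centroid
        elite = children[np.argsort(losses)[:mu]]
        mean = elite.mean(axis=0)
        sigma *= 0.9  # crude stand-in for CMA's step-size adaptation
    return mean

# A quadratic loss stands in for the visual (e.g. FLIP) loss.
target = np.array([0.3, -0.7, 1.2])
best = simple_es(lambda x: np.sum((x - target) ** 2), x0=np.zeros(3))
```

The exponential convergence mentioned above shows up here as the loss shrinking by a roughly constant factor per generation once the centroid is near the optimum.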
Example run with 50 generations:
- Steps: "16"
- Size: "768x512"
- Duration: ~30 minutes
- Optimized for size because the original was 334.76 MB
Final result (full quality):
- Hyperbatch samplers: generate batches more quickly and with better variance distribution for pinning!
  - DPM++ 2M Karras - Hyperbatch
  - DPM++ SDE Karras - Hyperbatch
  - DPM++ SDE - Hyperbatch
- Pinning targets
  - Visually pin to a generated batch
  - Visually pin to a fixed image or batch
  - (Coming soon?) Pin to a set of tags
  - (Coming soon?) Pin to a local map of weights using UMAP + HDBSCAN
- Generation settings
  - Multi-objective size limiter: limit the distance explored from the original prompt
  - Hyperbatch-specific weights:
    - Geometric
    - Exponential
    - Polynomial
- Analysis
  - Per-individual:
    - Statistics JSON
    - Loss plots (histogram, etc.)
    - Summary GIF
  - Per-generation:
    - Image loss distribution (and histogram)
    - Individual loss distribution (and histogram)
  - Per-run:
    - HTML summary pages for each (see here for examples)
    - Evolution convergence plots
This extension optionally depends on picobyte/stable-diffusion-webui-wd14-tagger for image tagging. (In progress: it may make more sense to use a different extension.)
Parameter | Default | Details |
---|---|---|
Target Images | None | Use the provided image(s) as a target instead of the first generated batch |
CMA Logging | True | Log CMA info to CLI (stdout) |
CMA Seed | [calculated from seed, subseed] | Numpy seed, used for CMA sampling |
Number of generations | `int(16 * floor(log(N)))` | Number of generations |
Initial population STD | 0.05 | CMA initial population STD |
Initial population radius | 0.25 | Radius of the uniform distribution for the CMA initial population |
Multi-objective size limiter | 0 | Disabled when `0`. Apply a penalty using a multi-objective CMA when more than this distance from the original prompt |
Size limit error | [size limiter] / 100 | Error for the multi-objective size limiter: vectors within this distance are "close" |
Size limit weight | [size limiter] * 10 | Weight for the multi-objective size limiter penalty |
lambda_ | `int(4 + 3 * log(N))` | Number of children to produce at each generation; `N` is the individual's size (integer). |
mu | `int(lambda_ / 2)` | The number of parents to keep from the `lambda_` children (integer). |
cmatrix | `identity(N)` | The initial covariance matrix of the distribution that will be sampled. |
weights | `"superlinear"` | Decrease speed; can be `"superlinear"`, `"linear"`, or `"equal"`. |
cs | `(mueff + 2) / (N + mueff + 3)` | Cumulation constant for step-size. |
damps | `1 + 2 * max(0, sqrt((mueff - 1) / (N + 1)) - 1) + cs` | Damping for step-size. |
ccum | `4 / (N + 4)` | Cumulation constant for covariance matrix. |
ccov1 | `2 / ((N + 1.3)^2 + mueff)` | Learning rate for rank-one update. |
ccovmu | `2 * (mueff - 2 + 1 / mueff) / ((N + 2)^2 + mueff)` | Learning rate for rank-mu update. |
Hyperbatch Weights Enabled | True | Enable 'Hyperbatch' weights |
Hyperbatch Weights Force Allowed | True | Allow enabling 'Hyperbatch' weights when not using a 'Hyperbatch' sampler |
Hyperbatch Weight Type | Geometric | Type of weights used for 'Hyperbatches'. See the Hyperbatch section for more detail. |
Hyperbatch Weight Scale | 1.0 | Weight scaling factor for 'Hyperbatches'. See the Hyperbatch section for more detail. |
NOTE: Some parameters may not work when multi-objective size limiting is enabled. The CMA parameters used for multi-objective optimization are as follows (some options may not yet be available in the UI):
Parameter | Default | Details |
---|---|---|
mu | `len(population)` | The number of parents to use in the evolution. |
lambda_ | 1 | Number of children to produce at each generation |
d | `1.0 + N / 2.0` | Damping for step-size. |
ptarg | `1.0 / (5 + 1.0 / 2.0)` | Target success rate. |
cp | `ptarg / (2.0 + ptarg)` | Step size learning rate. |
cc | `2.0 / (N + 2.0)` | Cumulation time horizon. |
ccov | `2.0 / (N**2 + 6.0)` | Covariance matrix learning rate. |
pthresh | 0.44 | Threshold success rate. |
Ref. Hansen and Ostermeier, 2001. Completely Derandomized Self-Adaptation in Evolution Strategies. Evolutionary Computation
Because `Number of generations` lacks a default in the original implementation, the default was picked from the following observations:
- When the algorithm is efficient, the number of generations is proportional to `log(N)`
- From the example `cma_minfct`, and by eyeballing other examples, the multiplicative overhead is approximately `16` (when the algorithm is efficient)
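Putting those observations together, the default can be computed directly. This helper is illustrative only (not part of the extension's API); `N` is the individual's size, and `log` is the natural logarithm, matching the parameter table above.

```python
from math import floor, log

def default_generations(n):
    """Default 'Number of generations': int(16 * floor(log(N))),
    combining the log(N) growth with the ~16x multiplicative overhead."""
    return int(16 * floor(log(n)))

# Grows slowly with the individual's size, e.g.
# default_generations(8) = 32 and default_generations(77) = 64.
```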
Hyperbatches are an experimental feature that no longer requires a fork of AUTOMATIC1111 Web UI!
The following samplers have been implemented in this extension:
- `DPM++ 2M Karras - Hyperbatch`
- `DPM++ SDE Karras - Hyperbatch`
- `DPM++ SDE - Hyperbatch`
Why?
- These samplers can be several times faster than their non-hyperbatch equivalents, depending on the batch size and step count used. See the plot below for more details.
- These samplers are specifically designed to:
  - Expose more of the generation (in)stability to the evolutionary algorithm's metric
  - Take advantage of GPUs with `>= 8 GB` of VRAM
    - E.g. a `g4dn.xlarge` has `16 GiB` of VRAM and can generate `71` `512x512` images in a single batch
    - However, generating larger batches provides an unpredictable amount of certainty when calculating e.g. the ꟻLIP loss
- In practice, it provides an approx. `10-20x` speedup on batch size `8` with `20` steps
This is achieved by modifying k-diffusion samplers as follows:
1. Start with a single image
2. Perform several sampling steps
3. Double the batch size and assign different seeds to the copies
4. Repeat from step (2) until the `Batch size` set in the UI is reached
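The doubling schedule above can be sketched as follows. This is a hypothetical helper, not the extension's actual code; in particular, the even split of steps across phases is an assumption for illustration.

```python
from math import floor, log2

def hyperbatch_schedule(batch_size, steps):
    """Sketch of the hyperbatch doubling schedule: the batch starts at 1
    image and doubles after each phase of sampling steps until the UI
    batch size is reached. Returns a list of (batch, steps) phases."""
    doublings = floor(log2(batch_size))
    if steps <= doublings:
        return None  # too few steps: hyperbatches are disabled (see usage notes)
    # Assumption: the steps are split evenly across the phases.
    steps_per_phase = steps // (doublings + 1)
    schedule = []
    batch = 1
    for _ in range(doublings + 1):
        schedule.append((batch, steps_per_phase))
        batch = min(batch * 2, batch_size)
    return schedule
```

For a batch size of `8` with `20` steps, this yields phases at batch sizes `1, 2, 4, 8`.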
NOTE: The estimates below assume that `8` images take exactly `8x` as long as one image. This isn't quite true, so some of the benefit is reduced for batch sizes smaller than `8`. However, this potentially provides a larger benefit for "very-large" batches (i.e. `>= 64`) than is lost from having `Batch size > 8`.

See `HyperbatchEfficiencyPlots.ipynb` for the plot-generation code.
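Under that linearity assumption (cost proportional to images times steps), the saving can be estimated as below. The even split of steps across doubling phases is again an assumption for illustration, so treat these numbers as indicative only; the notebook has the real plots.

```python
from math import floor, log2

def normal_cost(batch_size, steps):
    # The note's assumption: a batch of B images costs B times one image.
    return batch_size * steps

def hyperbatch_cost(batch_size, steps):
    # Each doubling phase runs a share of the steps at the current
    # (smaller) batch size; assume the steps are split evenly.
    phases = floor(log2(batch_size)) + 1
    steps_per_phase = steps / phases
    return sum((2 ** k) * steps_per_phase for k in range(phases))
```

For batch size `8` with `20` steps this gives `(1 + 2 + 4 + 8) * 5 = 75` image-steps versus `8 * 20 = 160` for a normal batch; the measured speedups depend on how the extension actually splits the steps.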
Usage notes:
- Pick a batch size that's a power of 2 for best results, i.e. `8, 16, 32, 64, 128, 256, ..`
- Hyperbatches are disabled when the number of steps is `<= floor(log2(Batch size))`.
- Hyperbatches mess up the progress bar library's time estimates: expect the estimate to be `2-5x` too high, depending on batch size and number of steps. See the efficiency plots for more detail.
In this section, `K` is the distance from the root of the binary tree, starting from `1`. E.g. the root of the tree has `K=1`, its leaves have `K=2`, their leaves `K=3`, etc.

If the `Hyperbatch Weights` option is enabled, the following options for `Hyperbatch Weight Type` become available:
- `Geometric` (default): `X ^ (hyperbatch_weight_scale / K)`
- `Exponential`: `X * (0.5 + hyperbatch_weight_scale) ^ K`
- `Polynomial`: `(1 + X)^(hyperbatch_weight_scale * K)`
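The three formulas can be written out directly. Here `x` is the weight being rescaled and `k` is the tree depth as defined above; the function names are illustrative, not the extension's API.

```python
def geometric_weight(x, k, scale=1.0):
    # Default: X ^ (hyperbatch_weight_scale / K)
    return x ** (scale / k)

def exponential_weight(x, k, scale=1.0):
    # X * (0.5 + hyperbatch_weight_scale) ^ K
    return x * (0.5 + scale) ** k

def polynomial_weight(x, k, scale=1.0):
    # (1 + X) ^ (hyperbatch_weight_scale * K)
    return (1 + x) ** (scale * k)
```

With the default scale of `1.0`, the geometric weight at the root (`K=1`) is just `X`, and the exponent shrinks toward `0` at deeper levels.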
Assuming `txt2img` (it works similarly for `img2img`):
- `outputs/txt2img-images/prompt_pins`: all prompt pin runs
- `outputs/txt2img-images/prompt_pins/00prompt_pin_number`: a particular prompt pin run
- `outputs/txt2img-images/prompt_pins/00prompt_pin_number/cma_plot.png`: CMA algorithm stats
- `outputs/txt2img-images/prompt_pins/00prompt_pin_number/index.html`: summary webpage for all stats
- `outputs/txt2img-images/prompt_pins/00prompt_pin_number/00generation_number`: a particular generation within the prompt pin run
- `outputs/txt2img-images/prompt_pins/00prompt_pin_number/00generation_number/ijdwfemknbidwjo..`: a particular individual (attempt) within a generation
- `outputs/txt2img-images/prompt_pins/00prompt_pin_number/00generation_number/ijdwfemknbidwjo../20..-..-..`: a particular individual's image output
- `outputs/txt2img-images/prompt_pins/00prompt_pin_number/00generation_number/ijdwfemknbidwjo../batch_stats.json`: JSON of the batch's config and visual errors
- `outputs/txt2img-images/prompt_pins/00prompt_pin_number/00generation_number/ijdwfemknbidwjo../loss_plot.png`: PNG plot of visual errors
- `outputs/txt2img-images/prompt_pins/00prompt_pin_number/00generation_number/ijdwfemknbidwjo../summary.gif`: GIF of all images from the individual
The simplest technique that I've found to be effective is to:
- Experiment with a prompt and its batch size to capture the amount of variation you want.
  - For example, if you want to pin down a character pose, find a batch size that's large enough to have several results that appear "close" to your goal.
  - Alternatively, pick a batch size large enough to reliably contain examples you want to avoid, and use the result to identify which parts of the prompt are producing those effects.
- Pick a seed and use one of the results from that batch as the target image.
- Set the initial population STD and centroid radius fairly small (`< 0.1`).
- Set the multi-objective size limiter also fairly small, but at most `1`, to ensure the prompts stay relatively close to the original.
If no target image is used:
- Find a target prompt
  - Use X/Y/Z plot to home in on the number of steps, CFG scale, sampler, VAE, etc.
  - Plot larger batches until one has the desired amount of variation:
    - We want a batch size with many "good" results and visible variation (to "pin" down)
- Use the prompt pinning script
  - Ideally, it works with default arguments
  - If not, look at the `cma_plot.png` in the `txt2img-images/prompt_pins` folder for your run
    - If the upper left graph is branching out (like a `<` shape):
      - The generation size (i.e. batch count times batch size) is too small for how much error you have.
      - This could be because your CFG scale is too small, the model is having trouble matching your prompt, or you otherwise have too much variation in your prompt
    - If the upper left graph is converging (like a `>` shape):
      - Try running for a larger number of generations, unless the `>` shape is followed by a long line (like `>----` or similar), in which case you've achieved convergence!
      - Try lowering the initial population centroid radius and STD: it could be that the CMA algorithm is searching too far from your prompt
- Use the discovered "pinned" prompt in other prompts!
If a target image is used, a similar approach may be effective, but it's likely that the generated images will need to be fairly close to the target image(s) provided for good results. Otherwise, it may end up finding color or image arrangement similarities to optimize.
Likewise, if visually-distinct target images are used, the algorithm is effectively finding the "visual average," which is likely to be blurry, distorted, or otherwise indistinct.
- Keep `Batch count` at `1` for best results: increasing `lambda_` has a similar effect (it's the number of batches per individual in a generation) and lets the CMA algorithm see more data points.
- A sufficiently-large sample is required per attempt. In many cases, `8-16` images are likely sufficient, but assuming the efficiency of "perfect" binary search, convergence will require around `3*num_tokens` steps, or `3*num_tokens*batch_size` images.
  - By the way, binary search is about as efficient as Stable Diffusion: a few manual experiments showed that `2^steps` is approximately `bits_of_output` for "good" convergence, at least with `DPM++ SDE Karras`.
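The back-of-the-envelope estimate above can be spelled out directly (illustrative only):

```python
def convergence_estimate(num_tokens, batch_size):
    """Rough estimate, assuming binary-search-like efficiency:
    ~3 * num_tokens steps (generations) to converge,
    i.e. ~3 * num_tokens * batch_size images overall."""
    steps = 3 * num_tokens
    images = steps * batch_size
    return steps, images
```

For example, a `25`-token prompt at batch size `8` gives roughly `75` steps and `600` images.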
- The upper right graph of `cma_plot.png` shows divergence
  - It's likely that the sampling is either not "wide" enough or far too wide:
    - If far too wide, try lowering the initial population radius and STD
    - If not wide enough, try increasing the CFG scale, batch size, or `lambda_`
- Targeting a large batch results in blurry or "wallpaper"-type patterns
  - This is expected when using ꟻLIP to target many images: you're calculating a sort of "visual average" of all of the images.
  - By "visual average," I mean an image that's approximately visually equidistant to an ensemble of images, according to the ꟻLIP loss function.
- Because it's difficult to specify a "small" distance from the original prompt, the current approach is to limit the `L2`-norm from the original weights.
  - This means that certain tokens could get "washed out" with larger allowed distances; try lowering the allowed distance if so.
sd-prompt-pinning-test-cases (GitHub repo)
- DEAP
- Félix-Antoine Fortin, François-Michel De Rainville, Marc-André Gardner, Marc Parizeau and Christian Gagné, "DEAP: Evolutionary Algorithms Made Easy", Journal of Machine Learning Research, vol. 13, pp. 2171-2175, July 2012.
- LDR ꟻLIP
- k-diffusion