Pedro Cuenca, Patrick von Platen, Suraj Patil, Jeremy Howard
Chances are you'll have seen examples on Twitter (and elsewhere) of images generated by typing a short description of the scene you want to create. This is the culmination of years of work in generative models. This notebook introduces Stable Diffusion, the highest-quality open-source text-to-image model available as of this writing. It's also small enough to run on consumer GPUs rather than requiring a datacenter. We use the 🤗 Hugging Face 🧨 Diffusers library, which is currently our recommended library for working with diffusion models.
As we'll see during the course, understanding state-of-the-art generative models requires a deep understanding of many of the fundamental building blocks of modern machine learning models. This notebook shows what Stable Diffusion can do and gives a glimpse of its main components.
If you open this notebook in Colab, or if you get type errors when generating your first image, please uncomment and run the following cell.
# !pip install -Uq diffusers transformers fastcore
To run Stable Diffusion on your computer you have to accept the model license. It's an open CreativeML Open RAIL-M license that claims no rights on the outputs you generate and prohibits you from deliberately producing illegal or harmful content. The model card provides more details. If you do accept the license, you need to be a registered user on the 🤗 Hugging Face Hub and use an access token for the code to work. You have two options to provide your access token:
- Use the huggingface-cli login command-line tool in your terminal and paste your token when prompted. It will be saved in a file on your computer.
- Call notebook_login() in a notebook, which does the same thing.
import logging
from pathlib import Path
import matplotlib.pyplot as plt
import torch
from diffusers import StableDiffusionPipeline
from fastcore.all import concat
from huggingface_hub import notebook_login
from PIL import Image
logging.disable(logging.WARNING)
torch.manual_seed(1)
if not (Path.home()/'.cache/huggingface'/'token').exists(): notebook_login()
StableDiffusionPipeline is an end-to-end diffusion inference pipeline that allows you to start generating images with just a few lines of code. Many Hugging Face libraries (along with other libraries such as scikit-learn) use the concept of a "pipeline" to indicate a sequence of steps that when combined complete some task. We'll look at the individual steps of the pipeline later -- for now though, let's just use it to see what it can do.
When we say "inference" we're referring to using an existing model to generate samples (in this case, images), as opposed to "training" (or fine-tuning) models using new data.
We use from_pretrained to create the pipeline and download the pretrained weights. We pass variant="fp16" to request the half-precision version of the weights, and torch_dtype=torch.float16 to tell diffusers to load them in that format. This allows us to perform much faster inference with almost no discernible difference in quality. The string passed to from_pretrained in this case (CompVis/stable-diffusion-v1-4) is the repo id of a pretrained pipeline hosted on the Hugging Face Hub; it can also be a path to a directory containing pipeline weights. The weights for all the models in the pipeline will be downloaded and cached the first time you run this cell.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", variant="fp16", torch_dtype=torch.float16).to("cuda")
The weights are cached in your home directory by default.
!ls ~/.cache/huggingface/hub
models--CompVis--stable-diffusion-v1-4 models--pcuenq--jh_dreambooth_1000 models--google--ddpm-celebahq-256 models--runwayml--stable-diffusion-v1-5 models--google--ddpm-church-256
We are now ready to use the pipeline to start creating images.
If your GPU does not have enough memory to run pipe, uncomment the cell below and run pipe.enable_attention_slicing().
As described in the docs:
When this option is enabled, the attention module will split the input tensor in slices, to compute attention in several steps. This is useful to save some memory in exchange for a small speed decrease.
#pipe.enable_attention_slicing()
prompt = "a photograph of an astronaut riding a horse"
pipe(prompt).images[0]
torch.manual_seed(1024)
pipe(prompt).images[0]
You will have noticed that running the pipeline shows a progress bar with a certain number of steps. This is because Stable Diffusion is based on a progressive denoising algorithm that is able to create a convincing image starting from pure random noise. Models in this family are known as diffusion models. Here's an example of the process (from random noise at top to progressively improved images towards the bottom) of a model drawing handwritten digits, which we'll build from scratch ourselves later in the course.
torch.manual_seed(1024)
pipe(prompt, num_inference_steps=3).images[0]
torch.manual_seed(1024)
pipe(prompt, num_inference_steps=16).images[0]
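Because the image is built up by progressive denoising, we can also peek at the partially-denoised latents while the pipeline runs. Below is a minimal sketch (not part of the original notebook) that assumes a diffusers version that still accepts the now-deprecated callback and callback_steps arguments; newer releases use callback_on_step_end instead. The constant 0.18215 is the scaling factor Stable Diffusion's VAE applies to its latents.
intermediate_latents = []

def save_latents(step, timestep, latents):
    # Keep a copy of the latents every `callback_steps` steps (deprecated API; see note above).
    intermediate_latents.append(latents.detach().clone())

torch.manual_seed(1024)
pipe(prompt, callback=save_latents, callback_steps=10)

# Decode one of the saved latents back to image space with the pipeline's VAE.
with torch.no_grad():
    decoded = pipe.vae.decode(intermediate_latents[2] / 0.18215).sample
plt.imshow((decoded[0] / 2 + 0.5).clamp(0, 1).permute(1, 2, 0).float().cpu());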
def image_grid(imgs, rows, cols):
    # Paste a list of equally-sized PIL images into a rows x cols grid (row-major order).
    w,h = imgs[0].size
    grid = Image.new('RGB', size=(cols*w, rows*h))
    for i, img in enumerate(imgs): grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid
Classifier-Free Guidance is a method to increase how closely the output adheres to the conditioning signal we used (the text prompt).
Roughly speaking, the larger the guidance scale, the more closely the model tries to follow the text prompt. However, large values tend to produce less diversity. The default is 7.5, which represents a good compromise between variety and fidelity. This blog post goes into more detail about how it works.
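To make this concrete, here is a minimal sketch (not the pipeline's actual code) of how classifier-free guidance combines the model's two noise predictions at each denoising step: one conditioned on the text prompt and one made with an empty prompt. The function name and the random stand-in tensors are only for illustration.
def apply_cfg(noise_pred_uncond, noise_pred_text, guidance_scale=7.5):
    # Start from the unconditional prediction and push it towards the
    # text-conditioned one; larger scales follow the prompt more closely.
    return noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

# Random tensors standing in for the UNet's two noise predictions (batch, channels, height, width).
uncond, text = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
guided = apply_cfg(uncond, text, guidance_scale=7.5)
With guidance_scale=1 this simply returns the text-conditioned prediction; larger values exaggerate the difference between the two predictions, pushing the image to follow the prompt more strongly.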
We can generate multiple images for the same prompt by simply passing a list of prompts instead of a string.
num_rows,num_cols = 4,4
prompts = [prompt] * num_cols
images = concat(pipe(prompts, guidance_scale=g).images for g in [1.1,3,7,14])
image_grid(images, rows=num_rows, cols=num_cols)