Under the hood of DALL-E 2
In April 2022, OpenAI released DALL-E 2, considered by some to be the most capable image generation model to date. It surpassed the previous state of the art by building on top of the DALL-E and GLIDE models, generating highly realistic images conditioned on text captions, with some very useful image manipulation capabilities that arise naturally from the model architecture. Our panda mad scientist below was generated by DALL-E 2 solely from its caption - panda mad scientist mixing sparkling chemicals, artstation. Pretty amazing, right? Let's dive a bit into the history to see what inspired this great model.
DALL-E
In January 2021, OpenAI released the original DALL-E, which was a game-changer in itself. Its training process consisted of two important stages:
- VQ-VAE: a discrete variational autoencoder that transforms a 256x256 pixel image into a 32x32 grid of latent codes, each of which can take one of 8192 possible values. This is necessary from an engineering standpoint, as the latent space is greatly reduced compared to the information content of the whole image: 256x256x3 pixel values become 1024 (32x32) tokens, compressing the image by a factor of 192 without a large degradation in visual quality.
- A transformer that combines the caption's text tokens and the image tokens from VQ-VAE into a single stream and tries to predict that same stream, with the exception that the image tokens in the output are shifted one position to the left, overriding the start-of-sequence token; in other words, each position learns to predict the next token.
What is the purpose of shifting the tokens to the left?
Well, transformers work iteratively, one token at a time for each forward pass through the network. At inference time, we simply input the text tokens with an SOS (start-of-sequence) token appended at the end, and the model outputs the same sequence with the first of the image tokens at the end. We then take that last token, append it to the previous input, and elongate the sequence by one token on each iteration until we have all 1024 (32x32) image tokens, which the VQ-VAE decoder converts into an image.
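To make the loop concrete, here is a minimal sketch of that autoregressive generation procedure. `transformer` and `vqvae` are hypothetical stand-ins for DALL-E's trained components (not OpenAI's actual code), and greedy argmax decoding is used instead of sampling for simplicity.

```python
import torch

NUM_IMAGE_TOKENS = 32 * 32  # 1024 latent-grid positions

def generate_image_tokens(transformer, text_tokens, sos_token_id):
    """Greedy next-token generation: append one image token per forward pass."""
    # Start with the caption tokens followed by the start-of-sequence token.
    sequence = torch.cat([text_tokens, torch.tensor([sos_token_id])])
    for _ in range(NUM_IMAGE_TOKENS):
        logits = transformer(sequence.unsqueeze(0))   # (1, seq_len, vocab_size)
        next_token = logits[0, -1].argmax()           # last position predicts the next token
        sequence = torch.cat([sequence, next_token.view(1)])
    return sequence[-NUM_IMAGE_TOKENS:]               # the 1024 image tokens

# The resulting 32x32 grid of discrete codes is then decoded to pixels, e.g.:
# image = vqvae.decode(image_tokens.view(1, 32, 32))
```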
In short, DALL-E generated images that were very realistic for its time, though they pale in comparison with DALL-E 2's output.
https://arxiv.org/pdf/2102.12092.pdf
GLIDE
GLIDE is a fairly new model, introduced by OpenAI in December 2021. It utilizes something called diffusion models. A diffusion model takes an image, progressively adds noise to it until the image can no longer be recognized, and learns to reverse that process, reconstructing the image from progressively larger amounts of noise. By forcing the model to learn to undo the added noise, we effectively obtain an image generator that starts from purely random noise. This sounds great, but note that we have no control over the output image with this approach alone. It could just as well be a cat, a dog, or your grandma.
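The training objective can be sketched in a few lines. This follows the standard DDPM recipe (see the denoising diffusion paper in the further reading) rather than GLIDE's exact code; `model` is any network that predicts the added noise, and `alphas_cumprod` is the precomputed noise schedule.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, x0, alphas_cumprod):
    """One DDPM-style training step on a batch of clean images x0 of shape (N, C, H, W)."""
    batch_size = x0.shape[0]
    num_steps = alphas_cumprod.shape[0]

    # Pick a random timestep for each image in the batch.
    t = torch.randint(0, num_steps, (batch_size,))
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)

    # Forward process: corrupt the clean image with Gaussian noise.
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

    # The model learns to predict the noise that was added; at sampling time,
    # repeatedly removing the predicted noise turns pure noise into an image.
    predicted_noise = model(x_t, t)
    return F.mse_loss(predicted_noise, noise)
```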
GLIDE solves this by using a transformer to produce text embeddings of an image caption and conditioning the diffusion model on those embeddings. How exactly this works is beyond the scope of this article, but check out the original paper if you want to learn more.
https://arxiv.org/pdf/2112.10741.pdf
All roads lead to DALL-E 2
DALL-E 2, unlike its predecessor, trains in three stages instead of two:
- CLIP, a training process in which text and image embeddings are learned simultaneously using contrastive learning: given a batch of images and their corresponding captions, we build a similarity matrix whose rows correspond to image embeddings and columns to text embeddings, and we try to push the cosine similarity of the true pairs toward 1 while pushing the other, mismatched pairs toward 0 (see the sketch after this list).
- a prior that generates a CLIP image embedding given a text caption
- a decoder that generates an image conditioned on the image embedding
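Here is a minimal sketch of the CLIP-style contrastive objective mentioned in the first stage. `image_encoder` and `text_encoder` are placeholders for the two learned encoders; the temperature value is an illustrative assumption, not CLIP's exact setting.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_encoder, text_encoder, images, captions, temperature=0.07):
    """Contrastive loss over a batch of N matching image/caption pairs."""
    # Embed both modalities and project onto the unit sphere so that the
    # dot product equals cosine similarity.
    img_emb = F.normalize(image_encoder(images), dim=-1)    # (N, d)
    txt_emb = F.normalize(text_encoder(captions), dim=-1)   # (N, d)

    # N x N similarity matrix: entry (i, j) compares image i with caption j.
    logits = img_emb @ txt_emb.T / temperature

    # True pairs sit on the diagonal; cross-entropy pulls them toward 1
    # and pushes the mismatched pairs toward 0, symmetrically over rows and columns.
    targets = torch.arange(img_emb.shape[0])
    loss_images = F.cross_entropy(logits, targets)      # match each image to its caption
    loss_texts = F.cross_entropy(logits.T, targets)     # match each caption to its image
    return (loss_images + loss_texts) / 2
```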
While the prior can be either autoregressive or diffusion-based, the decoder is always a diffusion model (see GLIDE). This has several important consequences, which can be exploited to generate more customized images.
1. Variations
Diffusion models allow for greater non-essential variability in the output while preserving the most important details.
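In practice this means the same CLIP image embedding can be decoded several times with fresh noise to get variations of an image. The sketch below is hypothetical: `clip_model` and `decoder` stand in for DALL-E 2's trained components and are not a public API.

```python
import torch

def generate_variations(clip_model, decoder, source_image, num_variations=4):
    """Decode one image embedding repeatedly; each run starts from different noise."""
    with torch.no_grad():
        image_embedding = clip_model.encode_image(source_image)
    # Conditioning is identical every time, so semantics are preserved while
    # non-essential details (pose, background texture, lighting) vary.
    return [decoder.sample(image_embedding) for _ in range(num_variations)]
```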
2. Interpolations
It is also possible to blend two images together. If we embed two images in the CLIP latent space and spherically interpolate between them, we can generate intermediate images by sweeping the theta parameter from 0 to 1.
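A sketch of that spherical interpolation (slerp) between two embedding vectors is shown below; sweeping theta from 0 to 1 and decoding each intermediate embedding yields the blended images. The helper is a generic slerp implementation, assumed here rather than taken from OpenAI's code.

```python
import torch

def slerp(z1, z2, theta):
    """Spherically interpolate between embeddings z1 and z2, with theta in [0, 1]."""
    z1, z2 = z1 / z1.norm(), z2 / z2.norm()
    omega = torch.acos((z1 * z2).sum().clamp(-1.0, 1.0))  # angle between the embeddings
    sin_omega = torch.sin(omega)
    return (torch.sin((1.0 - theta) * omega) / sin_omega) * z1 \
         + (torch.sin(theta * omega) / sin_omega) * z2

# intermediates = [slerp(z_image_a, z_image_b, t) for t in torch.linspace(0, 1, 8)]
# Each intermediate embedding is then fed to the diffusion decoder.
```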
3. Text diffs
DALL-E 2 utilizes CLIP, which embeds images and text into the same latent space, allowing for language-guided manipulation of outputs: the difference between the embeddings of two captions describes a semantic change, and that direction can be mixed into an image embedding before decoding.
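A rough sketch of that idea, following the description in the DALL-E 2 paper rather than any public API: `clip_model.encode_text` / `encode_image` are placeholders, and the `slerp` helper from the interpolation example above is reused.

```python
def text_diff_embedding(clip_model, image, old_caption, new_caption, strength):
    """Shift an image embedding along the direction between two caption embeddings."""
    z_image = clip_model.encode_image(image)
    diff = clip_model.encode_text(new_caption) - clip_model.encode_text(old_caption)
    diff = diff / diff.norm()                 # direction of the semantic change
    # Rotate the image embedding toward the text diff; strength in [0, 1]
    # controls how strongly the edit is applied before decoding.
    return slerp(z_image, diff, strength)
```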
Further reading
If you found this article insightful, make sure to subscribe and check out the additional material.
https://arxiv.org/pdf/2006.11239.pdf (Denoising Diffusion Probabilistic Models)
https://arxiv.org/pdf/2204.06125.pdf (DALL-E 2 paper)