Understanding LDMs

Understanding latent diffusion models - The brains behind Stable Diffusion

In a previous post i discussed stable diffusion, a software that uses latent diffusion models to generate images from text. In this post we take a look at how these LDMs work.

To grasp the concept of latent diffusion models, it's essential to first understand the workings of diffusion models.

Initially, we start with ordinary images, such as pictures of cats for this example. These images are fed into the model, which gradually introduces noise over several steps. This results in our original image being overlaid with substantial noise. The transition from a structured state (our original image) to an unstructured state (our noisy image) is termed diffusion.

The subsequent phase involves supplying these noisy images to a neural network, where the model endeavors to eliminate the introduced noise progressively. By comparing the reconstructed data with the original input (our picture), the model is trained, ultimately yielding a model capable of predicting the noise to be removed from a noisy image.

Having developed a model adept at noise prediction and removal, we can proceed to generate new images.

We initiate this by providing a textual description of a desired new cat image, referred to as conditioning. Depending on the model's architecture, this conditioning may either assist in generating the initial noisy image or guide the denoising steps mentioned earlier. The conditioning leaves us with a starting point - an image imbued with random noise.

This image is then processed by the model, which removes the noise, resulting in a final image of a cat.

The distinction with latent models lies in their handling of smaller, encoded representations of images rather than the full-resolution versions. This is achieved by passing the noisy image through an encoder, which downsamples and condenses the image, preserving essential information while reducing size. The process concludes with the output image being upscaled by a decoder to produce a high-quality image.

This approach is not only crucial for speeding up the process but also enables the incorporation of multiple conditioning parameters, as these too are encoded. Consequently, it allows for the integration of both images and textual descriptions in the model's process