Chapter 5 • 8 min read • Last reviewed: May 2026

Diffusion & Generative Media

Generative AI for images and video has undergone a massive transformation. Early image generators built on GANs (Generative Adversarial Networks) were notoriously difficult to train and often failed to produce coherent pictures. Today, most prominent image and video generators (Stable Diffusion, Midjourney, Sora, Flux) rely on a mathematical technique called diffusion.

The Diffusion Paradigm

Instead of trying to draw an image from scratch, a diffusion model is trained to do one thing: remove noise from an image. The process is split into two phases: the forward process and the reverse process.

1. The Forward Process (Destroying Information)

We take a clean photograph (say, of a golden retriever) and add a small amount of random Gaussian noise. We repeat this step by step, perhaps 1,000 times, until the original dog is completely obliterated, leaving nothing but pure static. This process needs no neural network; it is pure math.
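The forward process can be sketched in a few lines. This is a minimal sketch, assuming a linear noise schedule (betas rising from 1e-4 to 0.02 over 1,000 steps, one common choice) and using the standard closed-form shortcut that jumps straight to step t rather than looping. The names `linear_alpha_bar` and `add_noise` are illustrative and do not come from any library.

```python
import numpy as np

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta); sqrt(alpha_bar[t]) scales the clean image at step t.

    Assumption: a linear beta schedule. Other schedules exist.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Jump directly to step t: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))    # stand-in for a clean image scaled to [-1, 1]
x_early, _ = add_noise(x0, 10, alpha_bar, rng)    # still almost entirely image
x_late, _ = add_noise(x0, 999, alpha_bar, rng)    # almost entirely noise
```

Under this schedule, the last entry of `alpha_bar` is close to zero, which is the numerical version of "the original dog is completely obliterated."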

2. The Reverse Process (Cooking up Information)

This is where the neural network lives. We show the model a noisy image and ask it: "Can you predict the noise that was added to this image?"
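In code, that question becomes a simple training objective. The sketch below assumes the common noise-prediction setup with a mean-squared-error loss and a linear noise schedule; `lazy_model` is a hypothetical stand-in for a real neural network, and `noise_prediction_loss` is an illustrative name.

```python
import numpy as np

def noise_prediction_loss(model, x0, alpha_bar, rng):
    """One training example: noise the image, ask the model for the noise, score with MSE."""
    t = int(rng.integers(0, len(alpha_bar)))         # pick a random noise level
    eps = rng.standard_normal(x0.shape)              # the noise we actually add
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_hat = model(x_t, t)                          # the model's guess at that noise
    return float(np.mean((eps_hat - eps) ** 2))

# Assumed linear schedule; a real setup would define this once and reuse it.
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 16, 16))       # stand-in for a clean image

def lazy_model(x_t, t):
    return np.zeros_like(x_t)                        # always guesses "no noise was added"

loss = noise_prediction_loss(lazy_model, x0, alpha_bar, rng)
```

A model that guesses "no noise" scores a loss near 1 (the variance of the added noise); training pushes a real network's loss below that.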

Trained on millions of pairs of clean and noisy images, the model learns to recognize subtle structure within noise. When we want to generate a new image, we feed the model a block of pure random noise and a text prompt (e.g., "A golden retriever playing in the grass"). The model estimates the noise, and we subtract a portion of that estimate. We repeat this loop, typically 20 to 50 times. Bit by bit, structures appear, and a new, high-resolution image emerges.
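The generation loop might look roughly like this. It is a sketch under stated assumptions: it uses deterministic DDIM-style updates (one of several possible samplers), a linear noise schedule, and a hypothetical `model(x, t)` callable that returns the estimated noise. A real text-to-image model would also receive the prompt; that is omitted here to keep the loop readable.

```python
import numpy as np

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Assumed linear noise schedule; alpha_bar[t] shrinks toward 0 as t grows.
    return np.cumprod(1.0 - np.linspace(beta_start, beta_end, num_steps))

def sample(model, shape, alpha_bar, num_steps=50, seed=0):
    """Start from pure noise and denoise in `num_steps` deterministic DDIM-style updates."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                               # pure random noise
    # Visit an evenly spaced subset of noise levels, from noisiest to cleanest.
    timesteps = np.linspace(len(alpha_bar) - 1, 0, num_steps).round().astype(int)
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        # After the final step there is no noise left, so the target level is 1.0.
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps_hat = model(x, t)                                    # estimated noise
        x0_hat = (x - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)       # implied clean image
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat   # move to a lower noise level
    return x
```

Each pass removes only part of the noise: the update re-adds the estimated noise at the next, lower level, which is why the image sharpens gradually rather than in one jump.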

Key Concept: Latent Diffusion

Early diffusion models operated in "Pixel Space." Generating a 1024x1024 pixel image meant calculating noise values for over a million pixels at every step. This made early models incredibly slow and memory-intensive.

The breakthrough was Latent Diffusion (popularized by Stable Diffusion). It uses a Variational Autoencoder (VAE) to compress the image into a compact representation in a "latent space" (for example, shrinking a 512x512 image down to a 64x64 grid, which has 64 times fewer spatial positions). The diffusion model does all its heavy lifting in this low-resolution space, and the VAE decodes the final latents back into pixels at the very end. This cuts the compute per denoising step dramatically, making image generation practical on consumer hardware.
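A quick back-of-the-envelope calculation shows the size of the saving. The sketch assumes an RGB input and a latent with 4 channels (as in Stable Diffusion v1); the channel count is an assumption, not something stated above, and the true compute saving also depends on the network architecture.

```python
def count_values(height, width, channels):
    """Number of individual numbers the denoiser must handle per step."""
    return height * width * channels

pixel_values = count_values(512, 512, 3)      # RGB image in pixel space
latent_values = count_values(64, 64, 4)       # assumption: 4 latent channels
spatial_ratio = (512 * 512) // (64 * 64)      # reduction in spatial positions
value_ratio = pixel_values // latent_values   # reduction in total values

print(pixel_values, latent_values, spatial_ratio, value_ratio)
# prints: 786432 16384 64 48
```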

Classifier-Free Guidance (CFG)

How does the model make sure the image it generates actually matches your prompt, instead of wandering off on its own? This is controlled by Classifier-Free Guidance (CFG).

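During training, the text prompt is randomly dropped for a fraction of the examples, so the same model also learns to denoise without any prompt (unconditioned). During generation, the model makes two predictions at each step: what the noise removal should look like with the prompt, and what it should look like without it. The CFG scale decides how much weight to give to the difference between the two: a higher scale pushes the image to follow the prompt more closely, usually at some cost to variety.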
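The combination step itself is a one-line formula. This sketch uses the usual form, unconditional prediction plus scale times the difference; the function name `apply_cfg` and the toy two-element "predictions" are illustrative only, and 7.5 is just an example scale.

```python
import numpy as np

def apply_cfg(eps_uncond, eps_cond, scale):
    """Blend the two noise predictions; a scale above 1 extrapolates toward the prompt."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])   # toy prediction without the prompt
eps_cond = np.array([1.0, 1.0])     # toy prediction with the prompt

no_guidance = apply_cfg(eps_uncond, eps_cond, 0.0)   # identical to the unconditional prediction
plain_cond = apply_cfg(eps_uncond, eps_cond, 1.0)    # identical to the prompted prediction
strong = apply_cfg(eps_uncond, eps_cond, 7.5)        # pushed past the prompted prediction
```

Only the component where the two predictions disagree gets amplified; where they agree, the scale has no effect.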

The Shift to Diffusion Transformers (DiT)

Traditional diffusion models used a convolutional network backbone called a U-Net to predict noise. However, U-Nets struggled to scale efficiently with massive datasets and compute budgets.

In late 2022, researchers introduced the Diffusion Transformer (DiT), in a paper published the following year. DiT replaces the U-Net with a standard Transformer backbone. By dividing the latent image into patches (similar to how an LLM divides text into tokens), DiT models scale predictably: adding more parameters and compute tends to yield better image and video fidelity. This architecture underpins recent frontier models like OpenAI's Sora, Stable Diffusion 3, and Flux.
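The patching step can be illustrated with plain array reshaping. This is a minimal sketch, assuming a latent of shape (channels, height, width) = (4, 64, 64) and a patch size of 2; `patchify` is an illustrative helper, not a library function, and a real DiT would follow it with a learned linear projection and positional embeddings.

```python
import numpy as np

def patchify(latent, patch_size):
    """Cut a (C, H, W) latent into non-overlapping patches, one flattened token per patch."""
    c, h, w = latent.shape
    assert h % patch_size == 0 and w % patch_size == 0, "patch size must divide H and W"
    rows, cols = h // patch_size, w // patch_size
    x = latent.reshape(c, rows, patch_size, cols, patch_size)
    x = x.transpose(1, 3, 0, 2, 4)                 # (rows, cols, C, patch, patch)
    return x.reshape(rows * cols, c * patch_size * patch_size)

latent = np.arange(4 * 64 * 64, dtype=np.float32).reshape(4, 64, 64)
tokens = patchify(latent, patch_size=2)
print(tokens.shape)
# prints: (1024, 16)
```

The Transformer then treats these 1,024 tokens much like a sequence of word tokens, which is what makes the scaling behavior of language models carry over.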
