Chapter 5 • 8 min read • Last reviewed: June 2026

Diffusion & Generative Media

Generative AI for images and video has had a huge glow-up. Earlier image generators built on GANs (generative adversarial networks) were famously hard to train: training was unstable, and the generator could collapse into producing a narrow range of outputs. Today, most leading image and video generators, including Stable Diffusion, Midjourney, Sora, and Flux, rely on diffusion.

The diffusion playbook

Rather than drawing from scratch, a diffusion model is trained for one job: removing noise. The process has two phases: the forward process and the reverse process.

1. Forward process: destroy the signal

Take a clean photo, say a golden retriever, and add a small amount of random Gaussian noise. Repeat that maybe 1,000 times until the original image is gone and only static remains. No neural network needed here; it is pure math.
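The forward process can be sketched in a few lines. This is a minimal illustration, assuming the linear noise schedule from the DDPM paper (1,000 steps, with per-step noise variance rising from 0.0001 to 0.02) and a flattened image stored as a plain list of floats. The function names are made up for this example, and real code would use tensors.

```python
import math
import random

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly from beta_start to beta_end."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_signal(betas):
    """alpha_bars[t]: fraction of the original signal variance left after t + 1 steps."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * noise."""
    a_bar = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
             for x, e in zip(pixels, noise)]
    return noisy, noise

alpha_bars = cumulative_signal(linear_beta_schedule())
image = [0.5] * 16  # stand-in for a flattened image
noisy, noise = add_noise(image, t=999, alpha_bars=alpha_bars, rng=random.Random(0))
print(f"signal left after the last step: {alpha_bars[-1]:.6f}")
```

Because Gaussian noise adds up cleanly, there is no need to loop through every step: the noisy image at any step can be computed in one shot from the clean image. By the last step, well under 0.1% of the original signal variance remains.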

2. Reverse process: rebuild the signal

This is where the neural network enters. We show it a noisy image, tell it which step the image came from, and ask: "Can you predict the noise that was added to this image?"
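The question above turns into a simple training signal. A minimal sketch, assuming the standard noise-prediction objective from the DDPM paper: `predict_noise` is a hypothetical stand-in for the network, and `a_bar` (the fraction of signal left at the chosen step) is passed in directly to keep the example short.

```python
import math
import random

def noise_prediction_loss(predict_noise, clean, a_bar, rng):
    """One training example: noise the image, then score the model's guess with MSE."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    noisy = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
             for x, e in zip(clean, noise)]
    guess = predict_noise(noisy, a_bar)  # the network only sees the noisy image and the step
    # Mean squared error between the guessed noise and the noise actually added.
    return sum((g - e) ** 2 for g, e in zip(guess, noise)) / len(noise)
```

Training repeats this over many images and randomly chosen steps, nudging the network's weights to make the loss smaller each time.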

By training on millions of clean/noisy image pairs, the model learns to find structure inside noise. To generate a new image, we feed it pure random noise plus a text prompt, like "A golden retriever playing in the grass." The model estimates the noise, subtracts a little of it, and repeats. Modern samplers take bigger jumps than the forward process did, so this loop typically runs 20 to 50 times rather than 1,000. Bit by bit, structure appears and a unique, high-resolution image emerges.
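The generation loop can be sketched as follows. The text does not name a sampler, so this example assumes a deterministic DDIM-style update, one common way to cover 1,000 training steps in about 50 jumps. `predict_noise` is a hypothetical stand-in for the trained network, and prompt conditioning is left out to keep the loop readable. To make the sketch checkable, the demo uses an "oracle" predictor that already knows the target image, which a real model of course does not.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction per training step, from a linear noise schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def sample(predict_noise, size, alpha_bars, num_steps=50, seed=0):
    """Start from pure noise and denoise in num_steps jumps (num_steps >= 2)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    total = len(alpha_bars)
    # Evenly spaced subset of the training steps, visited from noisiest to cleanest.
    timesteps = [round(i * (total - 1) / (num_steps - 1)) for i in range(num_steps)][::-1]
    for i, t in enumerate(timesteps):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps = predict_noise(x, t)  # the network's job
        # Clean image implied by the current noise estimate.
        x0_hat = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for xi, e in zip(x, eps)]
        # Move to the next, less noisy level, reusing the same noise estimate.
        x = [math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e
             for x0, e in zip(x0_hat, eps)]
    return x

# Demo with an oracle that knows the target, to show the loop converges.
target = [0.1, -0.4, 0.8, 0.0]
alpha_bars = make_alpha_bars()

def oracle(x, t):
    a = alpha_bars[t]
    return [(xi - math.sqrt(a) * x0) / math.sqrt(1.0 - a) for xi, x0 in zip(x, target)]

result = sample(oracle, len(target), alpha_bars, num_steps=50)
```

With a perfect noise predictor the loop lands on the target image. A trained network only approximates that prediction, which is why the starting noise and the step count both affect the final picture.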

Key Concept: Latent diffusion

Early diffusion models worked in "pixel space." Generating a 1024x1024 image meant predicting noise for more than a million pixels, each with three color channels, at every step. As a result, these models were slow and memory-hungry.

The glow-up was Latent Diffusion (popularized by Stable Diffusion). It uses a Variational Autoencoder (VAE) to compress the image into a compact "latent space," like shrinking a 512x512 image into a 64x64 grid of latent values. The diffusion model does the heavy work in that smaller space, and the VAE decodes the final latents back into pixels. That cuts the number of values the diffusion model has to process at each step by well over 90% and lets image generation run on consumer hardware.
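A quick back-of-the-envelope check of that compression. The 3 color channels and 4 latent channels are assumptions here (they match the original Stable Diffusion setup); the text above only gives the 512x512 and 64x64 grid sizes.

```python
def num_values(height, width, channels):
    """Count of numbers the model has to handle for one image."""
    return height * width * channels

pixel_values = num_values(512, 512, 3)   # RGB image in pixel space
latent_values = num_values(64, 64, 4)    # 8x smaller per side, 4 latent channels
ratio = pixel_values / latent_values

print(pixel_values, latent_values, ratio)  # 786432 16384 48.0
```

Fewer values per step does not translate one-to-one into compute savings, since the cost also depends on the network architecture, but it shows why working in latent space is so much cheaper.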

Classifier-Free Guidance (CFG)

How does the model keep the image aligned with your prompt instead of wandering off? That is controlled by Classifier-Free Guidance (CFG).

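During training, the prompt is randomly dropped for a fraction of examples, so the same model learns to denoise both with and without a prompt. During generation, it makes two noise predictions at every step: one with the prompt and one without. The CFG scale controls how far to push the result away from the unprompted prediction and toward the prompted one. The ranges below are rules of thumb from Stable Diffusion-style models, and the best value varies by model and sampler.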

Low CFG (1 to 3): More creative freedom. The image may look artistic but ignore parts of the prompt.
Medium CFG (7 to 9): Usually the sweet spot for high-quality images that follow the prompt.
High CFG (15+): Forces strict prompt adherence, but can make the image oversaturated or fake-looking.
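Under the hood, the CFG scale is a simple extrapolation between the two predictions. A minimal sketch, assuming the convention used by many open-source pipelines, where a scale of 1 reproduces the prompted prediction and larger values overshoot it (the original paper expresses the same idea with a slightly different weight). The function name and toy numbers are illustrative.

```python
def apply_cfg(eps_uncond, eps_cond, scale):
    """Blend the unprompted and prompted noise predictions.

    scale = 0 -> ignore the prompt entirely
    scale = 1 -> use the prompted prediction as-is
    scale > 1 -> exaggerate the difference the prompt makes
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]  # toy prediction without the prompt
eps_cond = [1.0, 3.0]    # toy prediction with the prompt

print(apply_cfg(eps_uncond, eps_cond, 1.0))  # [1.0, 3.0]
print(apply_cfg(eps_uncond, eps_cond, 7.5))  # [7.5, 16.0]
```

At high scales the combined prediction moves far outside the range of either input, which is one intuition for why very high CFG values can produce oversaturated images.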

The shift to Diffusion Transformers (DiT)

Traditional diffusion models used a convolutional backbone called a U-Net to predict noise. But U-Nets did not scale with huge datasets and compute budgets as cleanly as Transformers had in language modeling.

In late 2022, researchers introduced the Diffusion Transformer (DiT), in a paper later published at ICCV 2023. DiT replaces the U-Net with a Transformer backbone. It splits the latent image into patches and treats each patch as a token, much like text tokens in an LLM. In the paper's experiments, adding parameters and compute improved sample quality steadily. Variants of this design underpin frontier models like OpenAI's Sora, Stable Diffusion 3, and Flux.
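The patch-splitting step is easy to picture in code. A minimal sketch, assuming a latent stored as nested lists in channels x height x width order and a patch size of 2 (one of the sizes explored in the DiT paper). The `patchify` name and the toy 4x8x8 latent are illustrative.

```python
def patchify(latent, patch_size):
    """Turn a channels x height x width latent into a list of flat patch tokens."""
    channels = len(latent)
    height = len(latent[0])
    width = len(latent[0][0])
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    tokens = []
    # Walk the grid patch by patch, left to right, top to bottom.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = [latent[c][top + dy][left + dx]
                     for c in range(channels)
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            tokens.append(token)
    return tokens

# Toy latent: 4 channels, 8x8 grid, with values that encode their position.
latent = [[[c * 100 + y * 8 + x for x in range(8)] for y in range(8)] for c in range(4)]
tokens = patchify(latent, patch_size=2)
print(len(tokens), len(tokens[0]))  # 16 16
```

In a full model, each token would then be projected to the Transformer's hidden width and given a position embedding. With a 64x64 latent and a patch size of 2, the same function would yield 1,024 tokens, so smaller patches mean longer sequences and more compute.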

What matters in media products

Real generative-media tools are rarely one prompt and one output. They combine text prompts with reference images, masks, control signals, style constraints, safety filters, and editing loops. The user may generate a rough image, inpaint one region, extend the canvas, upscale the result, then use a separate model to caption or moderate it.

The same idea extends to video and design workflows: the valuable product feature is often control, not raw generation. Teams need predictable character identity, readable text, brand-safe style, provenance metadata, and review tools for rights, likeness, and safety. Diffusion explains the engine; product constraints decide whether the output is usable.

Sources

Denoising Diffusion Probabilistic Models — Ho et al.
High-Resolution Image Synthesis with Latent Diffusion Models — Rombach et al.
Classifier-Free Diffusion Guidance — Ho and Salimans
Scalable Diffusion Models with Transformers — Peebles and Xie