Chapter 1 • 8 min read • Last reviewed: June 2026

The Transformer Core

Before 2017, natural language processing (NLP) models read text the slow way: one word at a time, left to right. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks kept a running "mental state" that updated with each new word. Sounds sensible, but the bottleneck was brutal: it could not parallelize. Because you needed the state of word n-1 to compute word n, GPUs could not process whole text blocks at the same time.

Then the landmark paper landed: "Attention Is All You Need", introducing the Transformer architecture. The Transformer dropped recurrence entirely and processed all words in a sequence at the same time. The trick was a mathematical shortcut called Self-Attention.

The Core Glow-Up: Self-Attention

Self-attention lets a model look at one word and figure out which other words matter most, even if they are far apart. Example:

"The bank robber ran to the river bank because he saw the police."

How does the model know the first "bank" means a financial institution, while the second means the edge of a river? In an RNN, the "bank robber" context might fade by the end. In a Transformer, the first "bank" gets compared with every other word at once and links strongly to "robber." The second "bank" links to "river." The model gives every word context from its surroundings in real time.

Key Concept: Query-Key-Value (QKV), no mystery

To compute attention, the Transformer gives every word three vectors:

Query (Q): What the word is looking for (for example, "I am a pronoun; where is my noun?").
Key (K): What the word offers (for example, "I am a noun, I describe a person").
Value (V): The actual content of the word: its meaning.

The model compares a word's Query vector against every other word's Key vectors. Higher score means more attention. The final representation is a weighted mix of Value vectors based on those scores.

Multi-Head Attention: more angles at once

Instead of doing the attention calculation once, the Transformer runs it several times in parallel. Each run is an Attention Head. That is Multi-Head Attention: more angles at once.

With multiple heads, the model can inspect different parts of meaning at the same time:

Head 1 might track grammar relationships, like which verb belongs to which noun.
Head 2 might resolve references, like matching "he" or "it" to the right entity.
Head 3 might watch local details, like nearby adjectives.

Together, the heads build a richer, more accurate language representation.

Positional Encoding: giving words coordinates

Because a Transformer processes every word at once, it does not automatically know order. To pure attention, "The dog bit the man" and "The man bit the dog" look the same because the words match.

The fix is Positional Encodings: mathematical values added to each word embedding that act like coordinates. They tell the model where each word sits, preserving the sentence structure.

Encoder vs. Decoder Architectures

The original Transformer had two halves: an Encoder (reads and understands text) and a Decoder (which writes new text). Depending on the task, modern glow-ups have split these into three variants:

Encoder-Only Models (e.g., BERT): Great for understanding, classifying, and extracting information from text. They look both left and right at the same time.
Decoder-Only Models (e.g., GPT, LLaMA): Great for generating text. They are autoregressive, meaning they produce one word at a time and only look backward to predict the next token.
Encoder-Decoder Models (e.g., T5, BART): Often used for translation or summarization: read the whole input, then generate a new output.

Many frontier text-first LLMs use decoder-only architectures, optimized for generating text by predicting the next token efficiently.

Why This Still Matters in Products

Attention is the reason a model can connect a bug report to a stack trace, a legal question to a clause thirty pages earlier, or a chart caption to the data it describes. It is also the reason long prompts get expensive: every token competes for attention with many other tokens.

When a system fails, the root cause is often not "the model is dumb." It may be that the relevant tokens were missing, buried under noisy context, truncated by the context window, or split across modalities the model cannot read together. Good AI products treat context as a scarce design surface: include what matters, remove what does not, and make the big-deal evidence easy for attention to find.

Sources

Attention Is All You Need — Vaswani et al.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — Devlin et al.

Chapter 2 • 9 min read • Last reviewed: June 2026

LLM Training & Alignment

Cooking up a modern AI assistant like ChatGPT or Gemini is not one step. It starts with chaotic web-scale data, then refines it through several training stages. The arc from raw math to useful assistant has three big milestones: Pre-training, Supervised Fine-Tuning (SFT), and Alignment.

Phase 1: Pre-training, aka the base model era

Every LLM starts with a pre-trained base model. At this stage, the model gets petabytes of raw text from books, articles, code repositories, and web pages. The training objective is simple: predict the next word (token) in a sentence.

For example, given the text:

"The cat sat on the..."

The model calculates probabilities across its vocabulary and predicts "mat" (or "sofa", "bed", etc.). Do that trillions of times on supercomputer clusters, and the model builds an internal map of language, grammar, reasoning patterns, and facts. But a base model is not an assistant; it is a text completer. If you ask it "Write a recipe for chocolate cake," it might continue the pattern with another prompt: "And write a recipe for apple pie," because it is mimicking recipe lists from the internet.

Phase 2: Supervised Fine-Tuning, aka making it follow directions

To turn a text completer into an assistant, engineers run Supervised Fine-Tuning (SFT). In this phase, the base model trains on curated prompt-response examples written by human experts.

A typical training sample looks like:

Prompt: Explain photosynthesis in one sentence.
Response: Photosynthesis is the process by which plants use sunlight, water, and carbon dioxide to create oxygen and energy in the form of sugar.

After tens of thousands of these examples, the model learns "instruct" behavior: recognize the user request, answer directly, and keep the tone conversational.

Phase 3: Alignment with RLHF and DPO

Even after SFT, a model can still produce toxic, biased, wrong, or useless output. SFT teaches imitation; alignment teaches preference. Engineers align models with human preferences using two main techniques:

1. Reinforcement Learning from Human Feedback (RLHF)

RLHF is basically a grading loop with three steps:

Generate Options: The model generates several possible answers to a prompt.
Train a Reward Model: Human evaluators rank those answers from best to worst. A separate neural network, the Reward Model, learns to predict what score a human would give.
Reinforce: Using an RL algorithm, usually PPO, the LLM updates its parameters to maximize the Reward Model score. Human-approved answers get boosted; disliked answers get pushed down.

2. Direct Preference Optimization (DPO)

RLHF works, but it is famously unstable, expensive, and messy because you have to juggle the LLM, Reward Model, and reference models at the same time.

In 2023, researchers introduced Direct Preference Optimization (DPO). DPO skips the reward model entirely. It shows you can optimize the LLM directly from paired choices: a prompt, a preferred (chosen) response, and a disliked (rejected) response. DPO adjusts the weights so the chosen response becomes more likely than the rejected one. Same alignment goal, cleaner loop.

Key Concept: Kaplan vs. Chinchilla scaling laws

How do models get smarter? For a long time, the industry followed Kaplan's scaling laws (2020), which made parameter count look like the main lever. That pushed teams toward bigger models, even when they did not have enough data to train them properly.

In 2022, DeepMind published the Chinchilla scaling laws. The result: for optimal performance, parameter count and training tokens should scale together. Many models were actually under-trained on too little data. The industry shifted toward smaller, stronger models trained longer on high-quality tokens, like LLaMA and Mistral, which are much cheaper to run on standard hardware.

Choosing the Right Adaptation Method

Most product teams do not train foundation models from scratch. They choose among smaller adaptation levers:

Prompting: Best for behavior that can be described in instructions and examples.
RAG: Best when the answer depends on changing, private, or auditable knowledge.
Fine-tuning: Best when the model needs a consistent style, format, domain vocabulary, or task habit that is hard to teach in every prompt.
Preference tuning: Best when several answers are plausible but the product has a clear preference for one kind of response.
Guardrails and evals: Necessary when mistakes are expensive, no matter which training method is used.

A useful rule: do not fine-tune just to add facts. Facts change and should usually live in retrieval, tools, or databases. Fine-tune when you want the model to behave differently even when the same facts are already present.

Sources

Scaling Laws for Neural Language Models — Kaplan et al.
Training Compute-Optimal Large Language Models — Hoffmann et al.
Training Language Models to Follow Instructions with Human Feedback — Ouyang et al.
Direct Preference Optimization — Rafailov et al.

Chapter 3 • 9 min read • Last reviewed: June 2026

RAG & Context Windows

An AI model has two kinds of memory. First: Parametric Memory, information baked into model weights during training. Second: Working Memory, the space available in the current prompt, also called the Context Window. Reliable systems use both together through RAG and long-context architecture instead of letting the model freestyle.

Why parametric memory gets exposed

Only trusting what the model memorized has three huge problems:

Knowledge Cutoff: The model only knows what existed before training finished.
Hallucinations: For obscure facts, models can confidently guess and produce plausible-sounding nonsense.
No Access to Private Data: Models cannot see your local PDFs, company emails, or secure databases unless you provide them.

Retrieval-Augmented Generation (RAG)

RAG solves this by making the model an open-book test taker. Instead of answering from memory, the system searches an external database, inserts the relevant docs into the context window, and asks the model to answer from that evidence.

How the RAG pipeline actually works

Chunking: Large docs, like a 100-page manual, get split into small chunks.
Embeddings: An Embedding Model turns each chunk into a vector: numbers that represent meaning.
Vector Database: Those vectors go into a specialized database like Pinecone, Chroma, or pgvector.
Retrieval (Semantic Search): When the user asks a question, the system vectorizes it and finds the chunks closest in meaning.
Augmentation & Generation: The system fetches those chunks, puts them beside the user question, and sends the package to the LLM: "Here is the context: [chunks]. Answer this question based on that context: [query]."

Context windows leveled up

If RAG is that useful, why not dump the whole database into the model? Historically, attention made that impossible.

Standard self-attention memory and compute scale quadratically ($O(N^2)$) with input length. Double the input, and you need four times the compute and memory. Early models were capped around 2,048 tokens, roughly 1,500 words.

Recent architecture and serving breakthroughs broke that wall. Frontier systems now commonly offer hundreds of thousands to millions of tokens of working memory, making it possible to analyze long PDFs, codebases, transcripts, and multi-file projects in one request. The big levers are:

1. FlashAttention

Introduced by Tri Dao, FlashAttention is a software optimization. It does not change attention math; it changes GPU memory movement. Standard attention writes huge intermediate tables between slower HBM and fast on-chip SRAM. FlashAttention computes in blocks and keeps data in SRAM as much as possible, reducing memory traffic by up to 20x and letting context windows scale.

2. Rotary Position Embeddings (RoPE)

Older absolute position systems struggled with contexts longer than training length. RoPE represents positions by rotating word vectors in a multi-dimensional space. Because the rotation is relative, the model can track word distances even when the total text is much longer than training length. That lets teams extend context windows after training with minimal fine-tuning.

The "Needle in a Haystack" test

Just because a model can accept a million tokens does not mean it is actually reading them. Researchers test this with the Needle in a Haystack (NIAH) test.

A random fact, the "needle," gets hidden inside a massive document dump, the "haystack." The model must answer a question that depends on that exact fact. Modern models need near-perfect accuracy no matter where the needle appears.

Long context still is not a free RAG replacement. Million-token prompts can be slower, pricier, and harder to audit than a clean RAG pipeline. In real systems, engineers often combine both: retrieval grabs the best evidence, then long context handles cross-document synthesis, codebase-wide reasoning, or comparisons across many artifacts.

RAG vs. Long Context: How to Choose

Use RAG when the corpus is large, frequently changing, permissioned, or needs precise citations. Retrieval keeps prompts smaller, makes source selection auditable, and lets the application enforce access control before the model sees anything.

Use long context when the task needs comparing many pieces at once: reviewing a pull request across files, reconciling a contract with its exhibits, summarizing a full transcript, or finding contradictions across a small document set.

The most reliable pattern is often hybrid: retrieve the best candidates first, rerank them, then give a long-context model enough surrounding material to synthesize rather than quote isolated snippets. The eval should check both steps: did retrieval find the right evidence, and did generation stay faithful to it?

Sources

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Lewis et al.
FlashAttention — Dao et al.
RoFormer: Enhanced Transformer with Rotary Position Embedding — Su et al.
Gemini 1.5 and long-context model capabilities — Google

Chapter 4 • 9 min read • Last reviewed: June 2026

Scaling Efficiency: MoE & Quantization

As AI models get bigger, running them gets wildly expensive. A dense 175-billion-parameter model needs multiple enterprise GPUs just to output one word at a time. To make these models usable in products and on smaller hardware, engineers lean on two major efficiency plays: Mixture of Experts (MoE) and Quantization.

Mixture of Experts (MoE)

In a standard "Dense" model, every parameter activates for every word. That is wasteful; the model does not need its entire math brain for a comma or a pronoun.

An MoE architecture turns a dense model into a "Sparse" model by splitting it into specialized compartments called Experts (usually inside feed-forward layers). Instead of sending every word through every path, a dynamic Gating Network (Router) decides which experts should handle each token.

Sparse routing in action

Imagine a model with 8 separate experts. When a token comes in:

If the token is Python code, the router sends it to Expert 3 (Code specialist) and Expert 5 (Logic specialist).
If the token is French, the router sends it to Expert 1 (Translation specialist).

Usually, the router selects only the Top-2 Experts for each token. If the model has 8x 7B experts, or 56B total parameters, it might activate only about 12B per token. You get huge total capacity with per-token compute closer to a smaller model.

Where MoE gets messy

MoE is not free. It brings real engineering headaches:

RAM Overhead: Even if only 12B parameters are active at a moment, the whole 56B model still has to sit in GPU memory. MoE can need much more VRAM than a dense model with similar active compute.
Routing Collapse: Early in training, the router can overuse one expert. That expert gets better, so the router sends it even more traffic. Engineers need load-balancing tricks so every expert learns.

Quantization

Neural networks store learned weights as high-precision decimals called floating-point numbers. During training, these usually use 16-bit precision (FP16 or BF16).

At 16-bit precision, each parameter needs 2 bytes of GPU memory. A 70B model needs at least 140GB of VRAM just to load, which is way beyond most consumer GPUs.

Quantization compresses weights by lowering numerical precision, mapping them to smaller formats like 8-bit integers (INT8), 4-bit integers (INT4), or custom formats like FP4.

Quantization intuition

Quantization is like lowering the color depth of a photo. Convert 24-bit true color to an 8-bit palette and the file shrinks hard. It looks a bit less smooth, but the shapes and meaning are still obvious.

Similarly, quantizing a model from 16-bit to 4-bit cuts size by 75%. A 70B model that needed 140GB of VRAM can fit around 35GB. Because neural networks have lots of redundancy, the reasoning hit can be surprisingly small.

Modern quantization formats

Several standard formats run compressed models:

GGUF (formerly GGML): Optimized for CPU execution, so large models can run on consumer laptops like Apple Silicon MacBooks using system RAM instead of GPU VRAM.
GPTQ / AWQ: GPU-focused quantized formats that keep compressed models generating quickly on standard desktop graphics cards.

Serving Efficiency in Real Systems

MoE and quantization are only part of the deployment story. Production inference stacks also rely on KV-cache reuse, batching, speculative decoding, model distillation, and careful routing between small and large models. A customer-support bot might use a small fast model for classification, a retrieval model for evidence, and a larger reasoning model only when the case is complex.

The practical question is not "what is the biggest model we can run?" It is "what is the cheapest system that meets the quality, latency, privacy, and reliability target?" Efficient AI products mix model sizes, precision levels, retrieval, caching, and fallback paths instead of sending every request to the same expensive endpoint.

Sources

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — Fedus et al.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — Frantar et al.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Lin et al.
GGUF format documentation — ggml

Chapter 5 • 8 min read • Last reviewed: June 2026

Diffusion & Generative Media

Generative AI for images and videos has had a huge glow-up. Early image generators, GANs, were famously hard to train and often failed to produce coherent pictures. Today, most modern image and video generators, including Stable Diffusion, Midjourney, Sora, and Flux, rely on Diffusion.

The diffusion playbook

Rather than drawing from scratch, a diffusion model trains on one job: remove static noise. The process is split into two phases: the forward process and the reverse process.

1. Forward process: destroy the signal

Take a clean photo, say a golden retriever, and add a tiny layer of random mathematical noise. Repeat that maybe 1,000 times until the original image is gone and only gray static remains. No neural network needed here; it is pure math.

2. Reverse process: rebuild the signal

This is where the neural network enters. We show it a noisy image and ask: "Can you predict exactly how much noise was added in this step?"

By training on millions of clean/noisy image pairs, the model learns structure inside noise. To generate a new image, we feed it pure, random noise plus a text prompt, like "A golden retriever playing in the grass." The model subtracts a little estimated noise, then repeats that loop 20 to 50 times. Bit by bit, structure appears and a unique high-resolution image emerges.

Key Concept: Latent diffusion

Early diffusion models worked in "Pixel Space." Generating a 1024x1024 image meant calculating noise for more than a million pixels every step. Early models were slow and memory-hungry.

The glow-up was Latent Diffusion (popularized by Stable Diffusion). It uses a Variational Autoencoder (VAE) to compress the image into dense "latent space," like shrinking a 512x512 image into a 64x64 grid. The diffusion model does the heavy work in that smaller space, and the VAE decodes final latents back into pixels. That saves 90%+ of the compute and lets image generation run on consumer laptops.

Classifier-Free Guidance (CFG)

How does the model keep the image aligned with your prompt instead of wandering off? That is controlled by Classifier-Free Guidance (CFG).

During training, the model sometimes sees images without prompts. During generation, it predicts two versions of denoising: one with the prompt, and one without it. CFG scale controls how strongly to push toward the prompt.

Low CFG (1 to 3): More creative freedom. The image may look artistic but ignore parts of the prompt.
Medium CFG (7 to 9): Usually the sweet spot for high-quality images that follow the prompt.
High CFG (15+): Forces strict prompt adherence, but can make the image oversaturated or fake-looking.

The shift to Diffusion Transformers (DiT)

Traditional diffusion models used a convolutional backbone called a U-Net to predict noise. But U-Nets did not scale as cleanly with huge datasets and compute budgets.

In 2023, researchers introduced the Diffusion Transformer (DiT). DiT replaces the U-Net with a Transformer backbone. It splits the latent image into patches, similar to text tokens in an LLM. Add more parameters and compute, and quality scales predictably. This pattern underpins frontier models like OpenAI's Sora, Stable Diffusion 3, and Flux.

What Matters in Media Products

Real generative-media tools are rarely one prompt and one output. They combine text prompts with reference images, masks, control signals, style constraints, safety filters, and editing loops. The user may generate a rough image, inpaint one region, extend the canvas, upscale the result, then use a separate model to caption or moderate it.

The same idea extends to video and design workflows: the valuable product feature is often control, not raw generation. Teams need predictable character identity, readable text, brand-safe style, provenance metadata, and review tools for rights, likeness, and safety. Diffusion explains the engine; product constraints decide whether the output is usable.

Sources

Denoising Diffusion Probabilistic Models — Ho et al.
High-Resolution Image Synthesis with Latent Diffusion Models — Rombach et al.
Classifier-Free Diffusion Guidance — Ho and Salimans
Scalable Diffusion Models with Transformers — Peebles and Xie

Chapter 6 • 9 min read • Last reviewed: June 2026

Agentic AI & Reasoning

For the first few years of the LLM boom, AI models were treated like passive chatbots: write a prompt, get an instant answer. Now the frontier has shifted toward Agentic AI. Instead of one static answer, agentic systems can plan, use tools, inspect their output, and loop through multi-step work.

Tool use and function calling

LLMs are famously bad at exact math, like multiplying two 8-digit numbers, and they cannot fetch live data or touch the physical world by themselves. They are word-prediction engines.

Tool Use (or Function Calling) overcomes this limitation by letting the host application expose specific capabilities. The model is provided with a list of available tools, described in plain text. For example:

Available Tool: calculate_weather(location, date)
- Returns the temperature forecast for a location.

If the user asks: "Should I wear a coat in Chicago tomorrow?", the LLM should realize memory is not enough. Instead of guessing, it emits a structured instruction:

{
  "call": "calculate_weather",
  "arguments": { "location": "Chicago", "date": "tomorrow" }
}

The host app intercepts the JSON, calls the real weather API, gets the result, and adds it back to the chat history. The LLM reads the result and finishes: "Yes, you should wear a coat. Chicago will be 41°F and raining tomorrow."

Reasoning loops: ReAct and reflection

For complex tasks, agents use structured loops instead of one-shot answers.

1. ReAct: reason, then act

ReAct makes the model reason before taking action. The loop goes like this:

Thought: The model states a plan, like "find France's population, then multiply by 0.12."
Action: The model calls a search engine, calculator, or other tool.
Observation: The model reads the tool output, updates the plan, and loops until the task is done.

2. Reflection and self-correction

If a model writes code, the first draft may be buggy. A reflection agent does not ship it immediately. It runs the code in an isolated environment, reads errors, feeds those errors back into the model, and rewrites the code. That feedback loop boosts task success.

What Makes an Agent Safe Enough to Ship

An agent is more than a prompt with tools. The product around it needs boundaries:

Scoped tools: Each tool should do one clear thing with the least permission needed.
Typed arguments: The host application validates tool inputs before execution.
Approval gates: Irreversible actions such as sending emails, charging cards, deleting data, or changing permissions should require confirmation.
State and memory rules: The system should decide what is saved, what expires, and what the model may read later.
Trace logs: Operators need to see prompts, tool calls, observations, errors, and final answers when debugging failures.

The model can decide which tool to request, but the application must decide whether that request is allowed. Instructions are not access control.

System 1 vs. System 2 thinking in AI

Cognitive psychologist Daniel Kahneman famously split human thinking into two modes:

System 1 (Fast): Fast, intuitive, automatic actions, like answering "2+2=?" or reading a familiar road sign.
System 2 (Slow): Slow, deliberate reasoning, like solving "17 x 24" or filling out a tax form.

Standard LLMs mostly act like System 1. They output the next token immediately, with limited room to plan, test, or revise. If the answer starts badly, they cannot truly rewind.

Modern System 2 Reasoning Models spend extra inference-time compute before giving the final answer. Current reasoning systems point to the same shift: models plan, use tools, check intermediate work, and keep going across longer workflows. Some expose controllable reasoning effort; others keep it internal and return a concise answer.

Sources

ReAct: Synergizing Reasoning and Acting in Language Models — Yao et al.
Toolformer: Language Models Can Teach Themselves to Use Tools — Schick et al.
Reflexion: Language Agents with Verbal Reinforcement Learning — Shinn et al.
Learning to Reason with LLMs — OpenAI
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning — Nature

Chapter 7 • 9 min read • Last reviewed: June 2026

Future Frontiers & Physical AI

AI is entering a new era. The frontier is no longer just "make the model bigger." Researchers are pushing models into the physical world, giving them native multimodal senses, and shifting from one-shot answers toward agents that act over time.

Native multimodality

Early multimodal systems were "Stitched" together. To let AI "see" an image, engineers would caption the image as text, then feed that caption to the LLM. That loses a lot: spatial layout, facial expression, sound pitch, and other details get flattened.

Modern frontier models are increasingly Natively Multimodal. They use a unified architecture or tightly integrated model system. Text, pixels, audio, video frames, and tool outputs become shared embeddings that models can reason across.

That lets the model reason across modalities at the same time. A native multimodal model can watch a video, hear sarcasm, read background slides, and produce one real-time analysis, catching details stitched systems miss.

The data wall and synthetic data

For a decade, AI progress came from feeding models more data. But the industry is hitting a Data Wall: LLMs have already consumed most high-quality public human-written text on the internet.

To keep training, researchers are turning to Synthetic Data: data generated by AI models to train other AI models.

Synthetic data: promise and risk

If models train on unverified synthetic data, they risk Model Collapse: errors, biases, and weird language quirks compound over generations until the model drifts away from reality.

The fix is Verified Synthesis: external environments validate the AI-generated data before training. For example:

AI generates code, then it runs through a compiler to verify it works. Only passing code gets used for training.
AI solves a math problem, then the solution gets checked by formal math verifiers.
AI reasons about physics, then the scenario runs through a physics engine to make sure it follows real-world laws.

Robotics and physical grounding

For AI to understand the world, it has to interact with it. By combining multimodal LLMs with robotic control, researchers have developed Vision-Language-Action (VLA) models like Google's RT-2 and Gemini Robotics.

A VLA model does not just output text; it outputs physical actions for robot joints and grippers. Tell a VLA robot arm: "Pick up the yellow banana and put it in the basket," and the model processes the camera feed, matches words to objects, calculates the path, and controls the motors. The LLM becomes the planning layer, giving the robot common-sense reasoning without custom programming for every object.

The next paradigm: test-time compute

Pre-training scaling laws, meaning more parameters and GPUs during training, are no longer the only progress axis. The newer lever is Test-Time Compute (scaling at inference time).

Instead of forcing an answer in a fraction of a second, test-time compute lets the model spend extra work planning, checking, searching, or coordinating tools. That is why frontier releases increasingly emphasize agentic coding, computer use, document work, and scientific workflows, not just chat benchmarks. The practical question becomes: how much thinking should this task buy?

What Is Relevant Now

The frontier is becoming less about one chatbot box and more about systems that coordinate perception, memory, tools, and verification. The most relevant product questions are practical:

Can the model read the actual modality the user cares about, or is information lost in conversion?
Can synthetic data be checked by compilers, tests, simulators, formal verifiers, humans, or trusted datasets?
Can a robot or agent fail safely when perception is uncertain?
Is extra test-time compute buying real accuracy, or just slower answers?
Can the team observe and evaluate the full workflow rather than only the final response?

These questions make the chapter more concrete: multimodality, robotics, and reasoning are not separate trends. They are ways of giving AI systems better inputs, better actions, and better checks.

Sources

RT-2: New model translates vision and language into action — Google DeepMind
Gemini Robotics: Bringing AI into the Physical World — Google DeepMind
AI models collapse when trained on recursively generated data — Nature
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters — Snell et al.
Learning to Reason with LLMs — OpenAI

Chapter 8 • 10 min read • Last reviewed: June 2026

Evaluation, Safety & Production AI

Building an AI demo is mostly about capability: can the model answer, retrieve, reason, or act? Shipping an AI product is about repeatability: can the system keep doing the right thing when prompts change, documents drift, users attack it, models upgrade, and costs spike?

Production AI teams treat the model as one component in a larger system. The surrounding product needs evals, guardrails, observability, rollback plans, and human review for cases where model confidence is not enough.

Eval-Driven Development

An eval is a repeatable test for behavior that matters. Instead of asking "does the answer look good?", an eval asks a narrower question: did the assistant cite the right policy section, refuse the unsafe request, preserve the JSON schema, choose the right tool, or solve the task within the latency budget?

Useful eval suites mix several test types:

Golden examples: known prompts with expected answers, labels, or rubrics.
Regression cases: failures from production that must not come back after a prompt, retrieval, or model change.
Adversarial cases: inputs designed to trigger jailbreaks, prompt injection, data leakage, or unsafe tool calls.
Performance cases: examples that measure cost, latency, refusal rate, and answer length, not just correctness.

The big-deal habit is to run evals before changing models, prompts, retrieval settings, or tool permissions. In AI systems, a "minor" prompt edit can behave like a code change across thousands of hidden branches.

The Production Eval Loop

The loop is simple: collect examples, run evals, block risky releases, monitor live behavior, review failures, and add those failures back to the suite.

Groundedness and Source Verification

For RAG systems, the most common failure is not a totally random answer. It is an answer that sounds plausible but is only partly supported by the retrieved evidence. A groundedness check compares each big-deal claim against the source passages the system provided.

Good groundedness evaluation asks:

Does every factual claim have supporting evidence in the retrieved context?
Did the answer cite the specific source that supports the claim?
Did the model ignore conflicting evidence or overstate uncertainty?
Should the system answer, ask a clarifying question, retrieve again, or refuse?

This is why citations are not just decoration. A citation should be a checkable pointer to the evidence that justifies the answer. If the pointer is wrong, the system is teaching users to trust the wrong thing.

Prompt Injection and Tool Safety

Prompt injection happens when untrusted text tries to override the system's instructions. In a RAG app, the attack might live inside a PDF. In an agent, it might appear on a web page the agent browses. The dangerous pattern is the same: the model reads attacker-controlled text and treats it like an instruction from the product owner.

Tool use makes this risk sharper. A model that can only write text can mislead a user; a model with tools can email customers, change records, run code, or expose private data. Production systems reduce that risk with least-privilege tool scopes, allowlists, confirmation steps, output validation, and audit logs.

A strong rule of thumb: model instructions are not access control. The host application must enforce permissions outside the model.

Observability for AI Apps

Traditional logs often show an HTTP request, a status code, and a response time. AI observability needs more: prompt versions, retrieved chunks, model names, tool calls, token usage, evaluator scores, refusals, user feedback, and traces across the full agent loop.

Without traces, teams cannot answer basic production questions: Did retrieval fail? Did the model ignore good evidence? Did a tool return bad data? Did a prompt change increase cost? Did a new model improve benchmark scores but hurt real support tickets?

Human Review and Launch Gates

Human review is not a failure of automation; it is a control surface. High-impact workflows often need human approval for irreversible actions, sensitive domains, edge cases, and low-confidence answers. The product should make review efficient by showing the prompt, evidence, model answer, tool actions, and eval signals in one place.

Before launch, teams usually define gates: minimum eval scores, maximum hallucination rate, maximum latency, acceptable cost per task, security test coverage, and rollback criteria. After launch, sampled production traces become new tests so the system gets harder to break over time.

A Practical Launch Checklist

Before putting an AI feature in front of users, a team should be able to answer these questions:

What are the top user tasks, and which eval cases represent them?
What failures are unacceptable, and how are they detected before release?
Which sources, tools, and permissions can the model access?
What does the system do when retrieval is empty, sources conflict, tools fail, or confidence is low?
Who reviews risky outputs, and what information do they see?
How quickly can the team roll back a prompt, model, retrieval index, or tool permission change?

This checklist matters because AI quality is distributed across the full stack. The model, prompt, retrieval index, tools, UI, logs, evals, and review process all decide whether the product is trustworthy.

Sources

Working with evals — OpenAI API docs
OpenAI Evals — OpenAI
AI Risk Management Framework — NIST
AI RMF Generative AI Profile — NIST
OWASP Top 10 for LLM Applications — OWASP Foundation
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng et al.
RAGTruth: A Hallucination Corpus for Retrieval-Augmented Language Models — Niu et al.

AI 101 Guide

The Transformer Core

The Core Glow-Up: Self-Attention

Key Concept: Query-Key-Value (QKV), no mystery

Multi-Head Attention: more angles at once

Positional Encoding: giving words coordinates

Encoder vs. Decoder Architectures

Why This Still Matters in Products

Sources

LLM Training & Alignment

Phase 1: Pre-training, aka the base model era

Phase 2: Supervised Fine-Tuning, aka making it follow directions

Phase 3: Alignment with RLHF and DPO

1. Reinforcement Learning from Human Feedback (RLHF)

2. Direct Preference Optimization (DPO)

Key Concept: Kaplan vs. Chinchilla scaling laws

Choosing the Right Adaptation Method

Sources

RAG & Context Windows

Why parametric memory gets exposed

Retrieval-Augmented Generation (RAG)

How the RAG pipeline actually works

Context windows leveled up

1. FlashAttention

2. Rotary Position Embeddings (RoPE)

The "Needle in a Haystack" test

RAG vs. Long Context: How to Choose

Sources

Scaling Efficiency: MoE & Quantization

Mixture of Experts (MoE)

Sparse routing in action

Where MoE gets messy

Quantization

Quantization intuition

Modern quantization formats

Serving Efficiency in Real Systems

Sources

Diffusion & Generative Media

The diffusion playbook

1. Forward process: destroy the signal

2. Reverse process: rebuild the signal

Key Concept: Latent diffusion

Classifier-Free Guidance (CFG)

The shift to Diffusion Transformers (DiT)

What Matters in Media Products

Sources

Agentic AI & Reasoning

Tool use and function calling

Reasoning loops: ReAct and reflection

1. ReAct: reason, then act

2. Reflection and self-correction

What Makes an Agent Safe Enough to Ship

System 1 vs. System 2 thinking in AI

Sources

Future Frontiers & Physical AI

Native multimodality

The data wall and synthetic data

Synthetic data: promise and risk

Robotics and physical grounding

The next paradigm: test-time compute

What Is Relevant Now

Sources

Evaluation, Safety & Production AI

Eval-Driven Development

The Production Eval Loop

Groundedness and Source Verification

Prompt Injection and Tool Safety

Observability for AI Apps

Human Review and Launch Gates

A Practical Launch Checklist

Sources