Printable edition
AI 101 Guide
A complete single-page edition assembled from the canonical chapters. Use browser print, save as PDF, or read it as one continuous Kindle-friendly document.
The Transformer Core
Before 2017, natural language processing (NLP) models read text the slow way: one word at a time, left to right. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks kept a running "mental state" that updated with each new word. Sounds sensible, but the bottleneck was brutal: it could not parallelize. Because you needed the state of word n-1 to compute word n, GPUs could not process whole text blocks at the same time.
Then the landmark paper landed: "Attention Is All You Need", introducing the Transformer architecture. The Transformer dropped recurrence entirely and processed all words in a sequence at the same time. The trick was a mathematical shortcut called Self-Attention.
The Core Glow-Up: Self-Attention
Self-attention lets a model look at one word and figure out which other words matter most, even if they are far apart. Example:
"The bank robber ran to the river bank because he saw the police."
How does the model know the first "bank" means a financial institution, while the second means the edge of a river? In an RNN, the "bank robber" context might fade by the end. In a Transformer, the first "bank" gets compared with every other word at once and links strongly to "robber." The second "bank" links to "river." The model gives every word context from its surroundings in real time.
Key Concept: Query-Key-Value (QKV), no mystery
To compute attention, the Transformer gives every word three vectors:
- Query (Q): What the word is looking for (for example, "I am a pronoun; where is my noun?").
- Key (K): What the word offers (for example, "I am a noun, I describe a person").
- Value (V): The actual content of the word: its meaning.
The model compares a word's Query vector against every other word's Key vectors. Higher score means more attention. The final representation is a weighted mix of Value vectors based on those scores.
Multi-Head Attention: more angles at once
Instead of doing the attention calculation once, the Transformer runs it several times in parallel. Each run is an Attention Head. That is Multi-Head Attention: more angles at once.
With multiple heads, the model can inspect different parts of meaning at the same time:
- Head 1 might track grammar relationships, like which verb belongs to which noun.
- Head 2 might resolve references, like matching "he" or "it" to the right entity.
- Head 3 might watch local details, like nearby adjectives.
Together, the heads build a richer, more accurate language representation.
Positional Encoding: giving words coordinates
Because a Transformer processes every word at once, it does not automatically know order. To pure attention, "The dog bit the man" and "The man bit the dog" look the same because the words match.
The fix is Positional Encodings: mathematical values added to each word embedding that act like coordinates. They tell the model where each word sits, preserving the sentence structure.
Encoder vs. Decoder Architectures
The original Transformer had two halves: an Encoder (reads and understands text) and a Decoder (which writes new text). Depending on the task, modern glow-ups have split these into three variants:
- Encoder-Only Models (e.g., BERT): Great for understanding, classifying, and extracting information from text. They look both left and right at the same time.
- Decoder-Only Models (e.g., GPT, LLaMA): Great for generating text. They are autoregressive, meaning they produce one word at a time and only look backward to predict the next token.
- Encoder-Decoder Models (e.g., T5, BART): Often used for translation or summarization: read the whole input, then generate a new output.
Many frontier text-first LLMs use decoder-only architectures, optimized for generating text by predicting the next token efficiently.
Sources
- Attention Is All You Need — Vaswani et al.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — Devlin et al.
LLM Training & Alignment
Cooking up a modern AI assistant like ChatGPT or Gemini is not one step. It starts with chaotic web-scale data, then refines it through several training stages. The arc from raw math to useful assistant has three big milestones: Pre-training, Supervised Fine-Tuning (SFT), and Alignment.
Phase 1: Pre-training, aka the base model era
Every LLM starts with a pre-trained base model. At this stage, the model gets petabytes of raw text from books, articles, code repositories, and web pages. The training objective is simple: predict the next word (token) in a sentence.
For example, given the text:
"The cat sat on the..."
The model calculates probabilities across its vocabulary and predicts "mat" (or "sofa", "bed", etc.). Do that trillions of times on supercomputer clusters, and the model builds an internal map of language, grammar, reasoning patterns, and facts. But a base model is not an assistant; it is a text completer. If you ask it "Write a recipe for chocolate cake," it might continue the pattern with another prompt: "And write a recipe for apple pie," because it is mimicking recipe lists from the internet.
Phase 2: Supervised Fine-Tuning, aka making it follow directions
To turn a text completer into an assistant, engineers run Supervised Fine-Tuning (SFT). In this phase, the base model trains on curated prompt-response examples written by human experts.
A typical training sample looks like:
Prompt: Explain photosynthesis in one sentence.
Response: Photosynthesis is the process by which plants use sunlight, water, and carbon dioxide to create oxygen and energy in the form of sugar.
After tens of thousands of these examples, the model learns "instruct" behavior: recognize the user request, answer directly, and keep the tone conversational.
Phase 3: Alignment with RLHF and DPO
Even after SFT, a model can still produce toxic, biased, wrong, or useless output. SFT teaches imitation; alignment teaches preference. Engineers align models with human preferences using two main techniques:
1. Reinforcement Learning from Human Feedback (RLHF)
RLHF is basically a grading loop with three steps:
- Generate Options: The model generates several possible answers to a prompt.
- Train a Reward Model: Human evaluators rank those answers from best to worst. A separate neural network, the Reward Model, learns to predict what score a human would give.
- Reinforce: Using an RL algorithm, usually PPO, the LLM updates its parameters to maximize the Reward Model score. Human-approved answers get boosted; disliked answers get pushed down.
2. Direct Preference Optimization (DPO)
RLHF works, but it is famously unstable, expensive, and messy because you have to juggle the LLM, Reward Model, and reference models at the same time.
In 2023, researchers introduced Direct Preference Optimization (DPO). DPO skips the reward model entirely. It shows you can optimize the LLM directly from paired choices: a prompt, a preferred (chosen) response, and a disliked (rejected) response. DPO adjusts the weights so the chosen response becomes more likely than the rejected one. Same alignment goal, cleaner loop.
Key Concept: Kaplan vs. Chinchilla scaling laws
How do models get smarter? For a long time, the industry followed Kaplan's scaling laws (2020), which made parameter count look like the main lever. That pushed teams toward bigger models, even when they did not have enough data to train them properly.
In 2022, DeepMind published the Chinchilla scaling laws. The result: for optimal performance, parameter count and training tokens should scale together. Many models were actually under-trained on too little data. The industry shifted toward smaller, stronger models trained longer on high-quality tokens, like LLaMA and Mistral, which are much cheaper to run on standard hardware.
Sources
- Scaling Laws for Neural Language Models — Kaplan et al.
- Training Compute-Optimal Large Language Models — Hoffmann et al.
- Training Language Models to Follow Instructions with Human Feedback — Ouyang et al.
- Direct Preference Optimization — Rafailov et al.
RAG & Context Windows
An AI model has two kinds of memory. First: Parametric Memory, information baked into model weights during training. Second: Working Memory, the space available in the current prompt, also called the Context Window. Reliable systems use both together through RAG and long-context architecture instead of letting the model freestyle.
Why parametric memory gets exposed
Only trusting what the model memorized has three huge problems:
- Knowledge Cutoff: The model only knows what existed before training finished.
- Hallucinations: For obscure facts, models can confidently guess and produce plausible-sounding nonsense.
- No Access to Private Data: Models cannot see your local PDFs, company emails, or secure databases unless you provide them.
Retrieval-Augmented Generation (RAG)
RAG solves this by making the model an open-book test taker. Instead of answering from memory, the system searches an external database, inserts the relevant docs into the context window, and asks the model to answer from that evidence.
How the RAG pipeline actually works
- Chunking: Large docs, like a 100-page manual, get split into small chunks.
- Embeddings: An Embedding Model turns each chunk into a vector: numbers that represent meaning.
- Vector Database: Those vectors go into a specialized database like Pinecone, Chroma, or pgvector.
- Retrieval (Semantic Search): When the user asks a question, the system vectorizes it and finds the chunks closest in meaning.
- Augmentation & Generation: The system fetches those chunks, puts them beside the user question, and sends the package to the LLM: "Here is the context: [chunks]. Answer this question based on that context: [query]."
Context windows leveled up
If RAG is that useful, why not dump the whole database into the model? Historically, attention made that impossible.
Standard self-attention memory and compute scale quadratically ($O(N^2)$) with input length. Double the input, and you need four times the compute and memory. Early models were capped around 2,048 tokens, roughly 1,500 words.
Recent architecture and serving breakthroughs broke that wall. By 2026, frontier systems commonly offer hundreds of thousands to millions of tokens of working memory; OpenAI lists a 1 million token API context window for GPT-5.5, while Google's Gemini line has pushed long-context reasoning into mainstream multimodal products. The big levers are:
1. FlashAttention
Introduced by Tri Dao, FlashAttention is a software optimization. It does not change attention math; it changes GPU memory movement. Standard attention writes huge intermediate tables between slower HBM and fast on-chip SRAM. FlashAttention computes in blocks and keeps data in SRAM as much as possible, reducing memory traffic by up to 20x and letting context windows scale.
2. Rotary Position Embeddings (RoPE)
Older absolute position systems struggled with contexts longer than training length. RoPE represents positions by rotating word vectors in a multi-dimensional space. Because the rotation is relative, the model can track word distances even when the total text is much longer than training length. That lets teams extend context windows after training with minimal fine-tuning.
The "Needle in a Haystack" test
Just because a model can accept a million tokens does not mean it is actually reading them. Researchers test this with the Needle in a Haystack (NIAH) test.
A random fact, the "needle," gets hidden inside a massive document dump, the "haystack." The model must answer a question that depends on that exact fact. Modern models need near-perfect accuracy no matter where the needle appears.
Long context still is not a free RAG replacement. Million-token prompts can be slower, pricier, and harder to audit than a clean RAG pipeline. In real systems, engineers often combine both: retrieval grabs the best evidence, then long context handles cross-document synthesis, codebase-wide reasoning, or comparisons across many artifacts.
Sources
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Lewis et al.
- FlashAttention — Dao et al.
- RoFormer: Enhanced Transformer with Rotary Position Embedding — Su et al.
- Gemini 1.5 and long-context model capabilities — Google
- Introducing GPT-5.5 — OpenAI
- Gemini 3.5: frontier intelligence with action — Google
Scaling Efficiency: MoE & Quantization
As AI models get bigger, running them gets wildly expensive. A dense 175-billion-parameter model needs multiple enterprise GPUs just to output one word at a time. To make these models usable in products and on smaller hardware, engineers lean on two major efficiency plays: Mixture of Experts (MoE) and Quantization.
Mixture of Experts (MoE)
In a standard "Dense" model, every parameter activates for every word. That is wasteful; the model does not need its entire math brain for a comma or a pronoun.
An MoE architecture turns a dense model into a "Sparse" model by splitting it into specialized compartments called Experts (usually inside feed-forward layers). Instead of sending every word through every path, a dynamic Gating Network (Router) decides which experts should handle each token.
Sparse routing in action
Imagine a model with 8 separate experts. When a token comes in:
- If the token is Python code, the router sends it to Expert 3 (Code specialist) and Expert 5 (Logic specialist).
- If the token is French, the router sends it to Expert 1 (Translation specialist).
Usually, the router selects only the Top-2 Experts for each token. If the model has 8x 7B experts, or 56B total parameters, it might activate only about 12B per token. You get huge total capacity with per-token compute closer to a smaller model.
Where MoE gets messy
MoE is not free. It brings real engineering headaches:
- RAM Overhead: Even if only 12B parameters are active at a moment, the whole 56B model still has to sit in GPU memory. MoE can need much more VRAM than a dense model with similar active compute.
- Routing Collapse: Early in training, the router can overuse one expert. That expert gets better, so the router sends it even more traffic. Engineers need load-balancing tricks so every expert learns.
Quantization
Neural networks store learned weights as high-precision decimals called floating-point numbers. During training, these usually use 16-bit precision (FP16 or BF16).
At 16-bit precision, each parameter needs 2 bytes of GPU memory. A 70B model needs at least 140GB of VRAM just to load, which is way beyond most consumer GPUs.
Quantization compresses weights by lowering numerical precision, mapping them to smaller formats like 8-bit integers (INT8), 4-bit integers (INT4), or custom formats like FP4.
Quantization intuition
Quantization is like lowering the color depth of a photo. Convert 24-bit true color to an 8-bit palette and the file shrinks hard. It looks a bit less smooth, but the shapes and meaning are still obvious.
Similarly, quantizing a model from 16-bit to 4-bit cuts size by 75%. A 70B model that needed 140GB of VRAM can fit around 35GB. Because neural networks have lots of redundancy, the reasoning hit can be surprisingly small.
Modern quantization formats
Several standard formats run compressed models:
- GGUF (formerly GGML): Optimized for CPU execution, so large models can run on consumer laptops like Apple Silicon MacBooks using system RAM instead of GPU VRAM.
- GPTQ / AWQ: GPU-focused quantized formats that keep compressed models generating quickly on standard desktop graphics cards.
The 2025-2026 open-weight wave made the efficiency story real. OpenAI's gpt-oss models use Transformer MoE: the 117B model activates only 5.1B parameters per token, and the 21B model activates 3.6B. That gives builders large total capacity while keeping per-token compute closer to a smaller dense model.
Sources
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — Fedus et al.
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — Frantar et al.
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Lin et al.
- GGUF format documentation — ggml
- Introducing gpt-oss — OpenAI
Diffusion & Generative Media
Generative AI for images and videos has had a huge glow-up. Early image generators, GANs, were famously hard to train and often failed to produce coherent pictures. Today, most modern image and video generators, including Stable Diffusion, Midjourney, Sora, and Flux, rely on Diffusion.
The diffusion playbook
Rather than drawing from scratch, a diffusion model trains on one job: remove static noise. The process is split into two phases: the forward process and the reverse process.
1. Forward process: destroy the signal
Take a clean photo, say a golden retriever, and add a tiny layer of random mathematical noise. Repeat that maybe 1,000 times until the original image is gone and only gray static remains. No neural network needed here; it is pure math.
2. Reverse process: rebuild the signal
This is where the neural network enters. We show it a noisy image and ask: "Can you predict exactly how much noise was added in this step?"
By training on millions of clean/noisy image pairs, the model learns structure inside noise. To generate a new image, we feed it pure, random noise plus a text prompt, like "A golden retriever playing in the grass." The model subtracts a little estimated noise, then repeats that loop 20 to 50 times. Bit by bit, structure appears and a unique high-resolution image emerges.
Key Concept: Latent diffusion
Early diffusion models worked in "Pixel Space." Generating a 1024x1024 image meant calculating noise for more than a million pixels every step. Early models were slow and memory-hungry.
The glow-up was Latent Diffusion (popularized by Stable Diffusion). It uses a Variational Autoencoder (VAE) to compress the image into dense "latent space," like shrinking a 512x512 image into a 64x64 grid. The diffusion model does the heavy work in that smaller space, and the VAE decodes final latents back into pixels. That saves 90%+ of the compute and lets image generation run on consumer laptops.
Classifier-Free Guidance (CFG)
How does the model keep the image aligned with your prompt instead of wandering off? That is controlled by Classifier-Free Guidance (CFG).
During training, the model sometimes sees images without prompts. During generation, it predicts two versions of denoising: one with the prompt, and one without it. CFG scale controls how strongly to push toward the prompt.
- Low CFG (1 to 3): More creative freedom. The image may look artistic but ignore parts of the prompt.
- Medium CFG (7 to 9): Usually the sweet spot for high-quality images that follow the prompt.
- High CFG (15+): Forces strict prompt adherence, but can make the image oversaturated or fake-looking.
The shift to Diffusion Transformers (DiT)
Traditional diffusion models used a convolutional backbone called a U-Net to predict noise. But U-Nets did not scale as cleanly with huge datasets and compute budgets.
In 2023, researchers introduced the Diffusion Transformer (DiT). DiT replaces the U-Net with a Transformer backbone. It splits the latent image into patches, similar to text tokens in an LLM. Add more parameters and compute, and quality scales predictably. This pattern underpins frontier models like OpenAI's Sora, Stable Diffusion 3, and Flux.
Sources
- Denoising Diffusion Probabilistic Models — Ho et al.
- High-Resolution Image Synthesis with Latent Diffusion Models — Rombach et al.
- Classifier-Free Diffusion Guidance — Ho and Salimans
- Scalable Diffusion Models with Transformers — Peebles and Xie
Agentic AI & Reasoning
For the first few years of the LLM boom, AI models were treated like passive chatbots: write a prompt, get an instant answer. Now the frontier has shifted toward Agentic AI. Instead of one static answer, agentic systems can plan, use tools, inspect their output, and loop through multi-step work.
Tool use and function calling
LLMs are famously bad at exact math, like multiplying two 8-digit numbers, and they cannot fetch live data or touch the physical world by themselves. They are word-prediction engines.
Tool Use (or Function Calling) overcomes this limitation by giving models hands. The model is provided with a list of available tools, described in plain text. For example:
Available Tool: calculate_weather(location, date)
- Returns the temperature forecast for a location.
If the user asks: "Should I wear a coat in Chicago tomorrow?", the LLM should realize memory is not enough. Instead of guessing, it emits a structured instruction:
{
"call": "calculate_weather",
"arguments": { "location": "Chicago", "date": "tomorrow" }
}
The host app intercepts the JSON, calls the real weather API, gets the result, and adds it back to the chat history. The LLM reads the result and finishes: "Yes, you should wear a coat. Chicago will be 41°F and raining tomorrow."
Reasoning loops: ReAct and reflection
For complex tasks, agents use structured loops instead of one-shot answers.
1. ReAct: reason, then act
ReAct makes the model reason before taking action. The loop goes like this:
- Thought: The model states a plan, like "find France's population, then multiply by 0.12."
- Action: The model calls a search engine, calculator, or other tool.
- Observation: The model reads the tool output, updates the plan, and loops until the task is done.
2. Reflection and self-correction
If a model writes code, the first draft may be buggy. A reflection agent does not ship it immediately. It runs the code in an isolated environment, reads errors, feeds those errors back into the model, and rewrites the code. That feedback loop boosts task success.
System 1 vs. System 2 thinking in AI
Cognitive psychologist Daniel Kahneman famously split human thinking into two modes:
- System 1 (Fast): Fast, intuitive, automatic actions, like answering "2+2=?" or reading a familiar road sign.
- System 2 (Slow): Slow, deliberate reasoning, like solving "17 x 24" or filling out a tax form.
Standard LLMs mostly act like System 1. They output the next token immediately, with limited room to plan, test, or revise. If the answer starts badly, they cannot truly rewind.
Modern System 2 Reasoning Models spend extra inference-time compute before giving the final answer. OpenAI's GPT-5.5, Google's Gemini 3.5 Flash, DeepSeek-R1, and related reasoning systems all point to the same shift: models plan, use tools, check intermediate work, and keep going across longer workflows. Some show controllable "thinking" effort; others keep it internal and return a concise answer.
Sources
- ReAct: Synergizing Reasoning and Acting in Language Models — Yao et al.
- Toolformer: Language Models Can Teach Themselves to Use Tools — Schick et al.
- Reflexion: Language Agents with Verbal Reinforcement Learning — Shinn et al.
- Learning to Reason with LLMs — OpenAI
- Introducing GPT-5.5 — OpenAI
- Gemini 3.5: frontier intelligence with action — Google
- DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning — Nature
Future Frontiers & Physical AI
AI is entering a new era. The frontier is no longer just "make the model bigger." Researchers are pushing models into the physical world, giving them native multimodal senses, and shifting from one-shot answers toward agents that act over time.
Native multimodality
Early multimodal systems were "Stitched" together. To let AI "see" an image, engineers would caption the image as text, then feed that caption to the LLM. That loses a lot: spatial layout, facial expression, sound pitch, and other details get flattened.
Modern frontier models, like Gemini and GPT-5.5, are Natively Multimodal. They use a unified architecture or tightly integrated model system. Text, pixels, audio, video frames, and tool outputs become shared embeddings that models can reason across.
That lets the model reason across modalities at the same time. A native multimodal model can watch a video, hear sarcasm, read background slides, and produce one real-time analysis, catching details stitched systems miss.
The data wall and synthetic data
For a decade, AI progress came from feeding models more data. But the industry is hitting a Data Wall: LLMs have already consumed most high-quality public human-written text on the internet.
To keep training, researchers are turning to Synthetic Data: data generated by AI models to train other AI models.
Synthetic data: promise and risk
If models train on unverified synthetic data, they risk Model Collapse: errors, biases, and weird language quirks compound over generations until the model drifts away from reality.
The fix is Verified Synthesis: external environments validate the AI-generated data before training. For example:
- AI generates code, then it runs through a compiler to verify it works. Only passing code gets used for training.
- AI solves a math problem, then the solution gets checked by formal math verifiers.
- AI reasons about physics, then the scenario runs through a physics engine to make sure it follows real-world laws.
Robotics and physical grounding
For AI to understand the world, it has to interact with it. By combining multimodal LLMs with robotic control, researchers have developed Vision-Language-Action (VLA) models like Google's RT-2 and Gemini Robotics.
A VLA model does not just output text; it outputs physical actions for robot joints and grippers. Tell a VLA robot arm: "Pick up the yellow banana and put it in the basket," and the model processes the camera feed, matches words to objects, calculates the path, and controls the motors. The LLM becomes the planning layer, giving the robot common-sense reasoning without custom programming for every object.
The next paradigm: test-time compute
Pre-training scaling laws, meaning more parameters and GPUs during training, are no longer the only progress axis. The newer lever is Test-Time Compute (scaling at inference time).
Instead of forcing an answer in a fraction of a second, test-time compute lets the model spend extra work planning, checking, searching, or coordinating tools. That is why frontier releases increasingly emphasize agentic coding, computer use, document work, and scientific workflows, not just chat benchmarks. The practical question becomes: how much thinking should this task buy?
Sources
- RT-2: New model translates vision and language into action — Google DeepMind
- Gemini Robotics: Bringing AI into the Physical World — Google DeepMind
- AI models collapse when trained on recursively generated data — Nature
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters — Snell et al.
- Learning to Reason with LLMs — OpenAI
- Introducing GPT-5.5 — OpenAI
Evaluation, Safety & Production AI
Building an AI demo is mostly about capability: can the model answer, retrieve, reason, or act? Shipping an AI product is about repeatability: can the system keep doing the right thing when prompts change, documents drift, users attack it, models upgrade, and costs spike?
Production AI teams treat the model as one component in a larger system. The surrounding product needs evals, guardrails, observability, rollback plans, and human review for cases where model confidence is not enough.
Eval-Driven Development
An eval is a repeatable test for behavior that matters. Instead of asking "does the answer look good?", an eval asks a narrower question: did the assistant cite the right policy section, refuse the unsafe request, preserve the JSON schema, choose the right tool, or solve the task within the latency budget?
Useful eval suites mix several test types:
- Golden examples: known prompts with expected answers, labels, or rubrics.
- Regression cases: failures from production that must not come back after a prompt, retrieval, or model change.
- Adversarial cases: inputs designed to trigger jailbreaks, prompt injection, data leakage, or unsafe tool calls.
- Performance cases: examples that measure cost, latency, refusal rate, and answer length, not just correctness.
The big-deal habit is to run evals before changing models, prompts, retrieval settings, or tool permissions. In AI systems, a "minor" prompt edit can behave like a code change across thousands of hidden branches.
The Production Eval Loop
The loop is simple: collect examples, run evals, block risky releases, monitor live behavior, review failures, and add those failures back to the suite.
Groundedness and Source Verification
For RAG systems, the most common failure is not a totally random answer. It is an answer that sounds plausible but is only partly supported by the retrieved evidence. A groundedness check compares each big-deal claim against the source passages the system provided.
Good groundedness evaluation asks:
- Does every factual claim have supporting evidence in the retrieved context?
- Did the answer cite the specific source that supports the claim?
- Did the model ignore conflicting evidence or overstate uncertainty?
- Should the system answer, ask a clarifying question, retrieve again, or refuse?
This is why citations are not just decoration. A citation should be a checkable pointer to the evidence that justifies the answer. If the pointer is wrong, the system is teaching users to trust the wrong thing.
Prompt Injection and Tool Safety
Prompt injection happens when untrusted text tries to override the system's instructions. In a RAG app, the attack might live inside a PDF. In an agent, it might appear on a web page the agent browses. The dangerous pattern is the same: the model reads attacker-controlled text and treats it like an instruction from the product owner.
Tool use makes this risk sharper. A model that can only write text can mislead a user; a model with tools can email customers, change records, run code, or expose private data. Production systems reduce that risk with least-privilege tool scopes, allowlists, confirmation steps, output validation, and audit logs.
A strong rule of thumb: model instructions are not access control. The host application must enforce permissions outside the model.
Observability for AI Apps
Traditional logs often show an HTTP request, a status code, and a response time. AI observability needs more: prompt versions, retrieved chunks, model names, tool calls, token usage, evaluator scores, refusals, user feedback, and traces across the full agent loop.
Without traces, teams cannot answer basic production questions: Did retrieval fail? Did the model ignore good evidence? Did a tool return bad data? Did a prompt change increase cost? Did a new model improve benchmark scores but hurt real support tickets?
Human Review and Launch Gates
Human review is not a failure of automation; it is a control surface. High-impact workflows often need human approval for irreversible actions, sensitive domains, edge cases, and low-confidence answers. The product should make review efficient by showing the prompt, evidence, model answer, tool actions, and eval signals in one place.
Before launch, teams usually define gates: minimum eval scores, maximum hallucination rate, maximum latency, acceptable cost per task, security test coverage, and rollback criteria. After launch, sampled production traces become new tests so the system gets harder to break over time.
Sources
- Working with evals — OpenAI API docs
- OpenAI Evals — OpenAI
- AI Risk Management Framework — NIST
- AI RMF Generative AI Profile — NIST
- OWASP Top 10 for LLM Applications — OWASP Foundation
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng et al.
- RAGTruth: A Hallucination Corpus for Retrieval-Augmented Language Models — Niu et al.