Chapter 2 • 9 min read • Last reviewed: June 2026

LLM Training & Alignment

Creating a modern AI assistant like ChatGPT or Gemini is not a single-step process. It requires taking raw, chaotic web data and refining a model on it through multiple training stages. The journey from raw text to a helpful assistant is divided into three major milestones: Pre-training, Supervised Fine-Tuning (SFT), and Alignment.

Phase 1: Pre-training (Creating the "Base Model")

The foundation of any LLM is the pre-trained base model. During this stage, the model is trained on an enormous corpus of raw text, typically trillions of tokens drawn from books, articles, code repositories, and web pages. The training objective is simple: predict the next token (a word or word fragment) given the text so far.

For example, given the text:

"The cat sat on the..."

The model computes a probability distribution over its entire vocabulary and assigns high probability to plausible continuations such as "mat", "sofa", or "bed". By repeating this trillions of times across large compute clusters, the model builds a rich internal map of language, grammar, reasoning patterns, and encyclopedic facts. However, a base model is not an assistant; it is a text completer. If you give a base model the prompt "Write a recipe for chocolate cake," it might continue with a second request, "And write a recipe for apple pie," because it is mimicking lists of recipe requests found on the internet.
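The next-token objective can be sketched in a few lines. This is a toy illustration, not a real training loop: the four-word vocabulary and the scores (logits) are made up, and a real model would produce logits over tens of thousands of tokens at every position.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy for one position: -log p(correct next token)."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical vocabulary and scores for the context "The cat sat on the"
vocab = ["mat", "sofa", "bed", "car"]
logits = [2.0, 1.0, 0.5, -1.0]

probs = softmax(logits)
prediction = vocab[probs.index(max(probs))]         # most likely next token
loss = next_token_loss(logits, vocab.index("mat"))  # low when "mat" is likely
```

Pre-training adjusts the model's weights so that this loss, averaged over every position in the corpus, goes down.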

Phase 2: Supervised Fine-Tuning (Creating the "Instruct Model")

To turn a text completer into an interactive assistant, engineers perform Supervised Fine-Tuning (SFT). In this phase, the base model is trained on a curated dataset of high-quality conversational prompts and responses, written by human experts.

A typical training sample looks like:

Prompt: Explain photosynthesis in one sentence.
Response: Photosynthesis is the process by which plants use sunlight, water, and carbon dioxide to create oxygen and energy in the form of sugar.

By training on tens of thousands of these conversational examples, the model learns the "instruct" behavior: it recognizes when it is being asked a question and understands that it must respond with helpful answers, adopting a conversational and polite tone.
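As a rough sketch of how such a sample might be prepared for training, the prompt and response are joined into one sequence and the loss is computed only on the response tokens. The whitespace tokenizer, the "Prompt:" / "Response:" template, and the function names are all hypothetical; masking the prompt is a common practice rather than something every pipeline does.

```python
def toy_tokenize(text):
    """Hypothetical whitespace tokenizer; real systems use subword tokenizers."""
    return text.split()

def build_sft_example(prompt, response):
    """Join prompt and response into one sequence and mark which tokens are trained on."""
    prompt_tokens = toy_tokenize("Prompt: " + prompt)
    response_tokens = toy_tokenize("Response: " + response)
    tokens = prompt_tokens + response_tokens
    # 0 = context only (no loss), 1 = token the model is trained to produce.
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

tokens, loss_mask = build_sft_example(
    "Explain photosynthesis in one sentence.",
    "Photosynthesis is the process by which plants use sunlight, water, and "
    "carbon dioxide to create oxygen and energy in the form of sugar.",
)
```

The training objective is still next-token prediction; SFT only changes which data the model sees and which positions count toward the loss.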

Phase 3: Alignment (RLHF and DPO)

Even after SFT, a model can still produce toxic, biased, incorrect, or unhelpful output. SFT only teaches the model to imitate the training dialogues. To ensure the model is helpful, honest, and harmless, engineers "align" it with human preferences using two primary techniques:

1. Reinforcement Learning from Human Feedback (RLHF)

RLHF works by using a grading system. The process involves three steps:

Generate Options: The model generates multiple candidate answers to a prompt.
Train a Reward Model: Human evaluators rank these candidate answers from best to worst. A separate neural network, the Reward Model, is trained on these rankings to predict what score a human would give to any given response.
Reinforce: Using an RL algorithm (typically PPO), the LLM's parameters are updated to maximize the score predicted by the Reward Model. Responses that humans like are rewarded, and disliked responses are penalized. A penalty for drifting too far from a frozen reference copy of the SFT model keeps the LLM from exploiting flaws in the Reward Model.
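The reward-model step can be illustrated with a pairwise ranking loss, the formulation used by Ouyang et al.: for every pair of answers, the loss is small when the answer humans preferred receives the higher score. The scores below are hypothetical, and a real reward model would compute them with a neural network rather than take them as inputs.

```python
import math
from itertools import combinations

def pairwise_loss(reward_preferred, reward_other):
    """-log(sigmoid(r_preferred - r_other)): small when the preferred answer scores higher."""
    margin = reward_preferred - reward_other
    return math.log(1.0 + math.exp(-margin))

def ranking_loss(rewards_best_to_worst):
    """Average the pairwise loss over every (better, worse) pair in a human ranking."""
    pairs = list(combinations(rewards_best_to_worst, 2))
    return sum(pairwise_loss(better, worse) for better, worse in pairs) / len(pairs)

# Hypothetical reward-model scores for three answers, listed in the human's order (best first).
agrees_with_humans = ranking_loss([2.0, 0.5, -1.0])     # scores match the ranking
disagrees_with_humans = ranking_loss([-1.0, 0.5, 2.0])  # scores invert the ranking
```

Training the reward model means lowering this loss, so that its scores end up ordering answers the way the human evaluators did.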

2. Direct Preference Optimization (DPO)

While RLHF is highly effective, it can be unstable, expensive, and complex to train because it requires maintaining multiple models simultaneously (the LLM being trained, the Reward Model, a frozen reference model, and usually a value model for PPO).

In 2023, researchers introduced Direct Preference Optimization (DPO). DPO removes the separate reward model and the RL loop. The paper shows mathematically that the same preference objective can be optimized directly on the LLM policy using a dataset of paired choices: a prompt, a preferred (chosen) response, and a disliked (rejected) response. Only a frozen reference copy of the model is still needed. DPO adjusts the weights so that the probability of the chosen response increases relative to the rejected response, measured against that reference, which gives a simpler and typically more stable alignment loop.
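A minimal sketch of the DPO loss for a single preference pair follows. The inputs are the summed log-probabilities each model assigns to the full chosen and rejected responses; the numbers and the value of beta are illustrative assumptions, and a real implementation would compute these log-probabilities from the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple.

    beta controls how strongly the policy is tied to the reference model;
    0.1 is only an illustrative value.
    """
    # How much more (or less) likely each response became relative to the reference.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    return math.log(1.0 + math.exp(-margin))  # equals -log(sigmoid(margin))

# Hypothetical log-probabilities: the policy starts identical to the reference...
start = dpo_loss(-12.0, -11.0, ref_chosen_logp=-12.0, ref_rejected_logp=-11.0)
# ...then shifts probability toward the chosen response and away from the rejected one.
improved = dpo_loss(-10.0, -14.0, ref_chosen_logp=-12.0, ref_rejected_logp=-11.0)
```

The loss depends only on how the policy has moved relative to the reference, so it falls as the chosen response gains probability relative to the rejected one.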

Key Concept: Kaplan vs. Chinchilla Scaling Laws

How do we make models smarter? For a time, the industry followed Kaplan's scaling laws (2020), which suggested that, for a fixed compute budget, most of the extra compute should go into a larger parameter count, with training data growing much more slowly. This pushed engineers to build ever larger models, even when those models were trained on relatively little data for their size.

In 2022, DeepMind published the Chinchilla scaling laws (Hoffmann et al.). They showed empirically that, for a fixed compute budget, parameter count and training tokens should scale in roughly equal proportion, which works out to about 20 training tokens per parameter. Many large models of the time were under-trained on too little data for their size. This shifted the industry toward training smaller models (like LLaMA or Mistral) for much longer on high-quality tokens, making them far cheaper to run.
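The Chinchilla result can be turned into a back-of-the-envelope calculator. This sketch rests on two widely used approximations that are assumptions here, not exact laws: training compute is about 6 FLOPs per parameter per token, and the compute-optimal ratio is about 20 tokens per parameter.

```python
import math

FLOPS_PER_PARAM_PER_TOKEN = 6  # common approximation: compute C ~ 6 * N * D
TOKENS_PER_PARAM = 20          # rough Chinchilla rule of thumb: D ~ 20 * N

def compute_optimal(compute_flops):
    """Split a FLOP budget into parameters N and tokens D, with D = 20 * N."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N^2  =>  N = sqrt(C / 120)
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# A budget of about 5.88e23 FLOPs gives roughly 70 billion parameters
# and 1.4 trillion tokens, matching the published Chinchilla configuration.
params, tokens = compute_optimal(5.88e23)
```

Under these assumptions, quadrupling the compute budget doubles both the parameter count and the token count, which is what scaling "in equal proportion" means in practice.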

Choosing the Right Adaptation Method

Most product teams do not train foundation models from scratch. They choose among smaller adaptation levers:

Prompting: Best for behavior that can be described in instructions and examples.
RAG: Best when the answer depends on changing, private, or auditable knowledge.
Fine-tuning: Best when the model needs a consistent style, format, domain vocabulary, or task habit that is hard to teach in every prompt.
Preference tuning: Best when several answers are plausible but the product has a clear preference for one kind of response.
Guardrails and evals: Necessary when mistakes are expensive, no matter which training method is used.

A useful rule: do not fine-tune just to add facts. Facts change and should usually live in retrieval, tools, or databases. Fine-tune when you want the model to behave differently even when the same facts are already present.
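The guidance above can be written down as a small checklist function. This is only an illustrative sketch: the function name, the flags, and the choice to always start with prompting are assumptions, and real decisions also weigh cost, latency, and available data.

```python
def suggest_levers(needs_changing_or_private_facts=False,
                   needs_consistent_style_or_format=False,
                   has_clear_preference_among_valid_answers=False,
                   mistakes_are_expensive=False):
    """Map product needs to the adaptation levers listed above (hypothetical helper)."""
    levers = ["prompting"]  # assumed starting point: the cheapest lever to try
    if needs_changing_or_private_facts:
        levers.append("RAG")  # facts live in retrieval, not in the weights
    if needs_consistent_style_or_format:
        levers.append("fine-tuning")
    if has_clear_preference_among_valid_answers:
        levers.append("preference tuning")
    if mistakes_are_expensive:
        levers.append("guardrails and evals")
    return levers

# A support bot that must cite a changing internal knowledge base and cannot afford errors.
plan = suggest_levers(needs_changing_or_private_facts=True, mistakes_are_expensive=True)
```

Note that needing new facts never selects fine-tuning here, which mirrors the rule above.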

Sources

Scaling Laws for Neural Language Models — Kaplan et al.
Training Compute-Optimal Large Language Models — Hoffmann et al.
Training Language Models to Follow Instructions with Human Feedback — Ouyang et al.
Direct Preference Optimization — Rafailov et al.