Gen Z
Chapter 2 • 9 min read • Last reviewed: May 2026

LLM Training & Alignment

Creating a modern AI assistant like ChatGPT or Gemini is not a single-step process. It requires taking raw, chaotic web data and refining it through multiple training stages. The journey from raw text to a helpful assistant is divided into three major milestones: Pre-training, Supervised Fine-Tuning (SFT), and Alignment.

Phase 1: Pre-training (Creating the "Base Model")

The foundation of any LLM is the pre-trained base model. During this stage, the model is trained on an enormous corpus of raw text (typically trillions of tokens) drawn from books, articles, code repositories, and web pages. The training objective is simple: predict the next token (roughly, the next word or word fragment) given the text so far.

For example, given the text:

"The cat sat on the..."

The model computes a probability distribution over its entire vocabulary and assigns high probability to plausible continuations such as "mat", "sofa", or "bed". By repeating this trillions of times across vast supercomputer clusters, the model builds a rich internal map of language, grammar, reasoning patterns, and encyclopedic facts. However, a base model is not an assistant; it is a text completer. If you give a base model the prompt "Write a recipe for chocolate cake," it might continue with a second request, such as "And write a recipe for apple pie," because it is mimicking lists of recipe requests found on the internet.
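To make the next-token objective concrete, here is a minimal sketch in Python. The four-word vocabulary and the raw scores (logits) are invented for illustration; a real model has a vocabulary of tens of thousands of tokens and produces its scores with a neural network. The sketch turns scores into probabilities with softmax and computes the cross-entropy loss that pre-training minimizes.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy for one position: -log p(correct next token)."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical vocabulary and made-up scores for "The cat sat on the..."
vocab = ["mat", "sofa", "bed", "banana"]
logits = [3.0, 2.0, 1.5, -1.0]

probs = softmax(logits)
prediction = vocab[probs.index(max(probs))]          # most likely token
loss = next_token_loss(logits, vocab.index("mat"))   # loss if "mat" was the true next token
```

The loss is low when the model puts high probability on the token that actually came next, and high when it does not. Training nudges the weights to lower this loss, averaged over every position in the corpus.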

Phase 2: Supervised Fine-Tuning (Creating the "Instruct Model")

To turn a text completer into an interactive assistant, engineers perform Supervised Fine-Tuning (SFT). In this phase, the base model is trained on a curated dataset of high-quality conversational prompts and responses, written by human experts.

A typical training sample looks like:

Prompt: Explain photosynthesis in one sentence.
Response: Photosynthesis is the process by which plants use sunlight, water, and carbon dioxide to create oxygen and energy in the form of sugar.

By training on tens of thousands of these conversational examples, the model learns the "instruct" behavior: it recognizes when it is being asked a question and understands that it must respond with helpful answers, adopting a conversational and polite tone.
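As a rough sketch of how one SFT sample might be prepared, the Python below joins a prompt and a response into a single token sequence and builds a loss mask. Two things here are assumptions rather than anything stated above: the whitespace "tokenizer" is a toy stand-in for a real subword tokenizer, and masking out the prompt tokens (so the model is only trained to produce the response) is a common practice, not a universal rule. The function name `build_sft_example` is hypothetical.

```python
def build_sft_example(prompt, response):
    """Join prompt and response, and mark which tokens count toward the loss."""
    # Toy whitespace tokenizer (assumption); real systems use subword tokenizers.
    prompt_tokens = f"Prompt: {prompt}".split()
    response_tokens = f"Response: {response}".split()
    tokens = prompt_tokens + response_tokens
    # 0 = ignore this token (prompt), 1 = train on this token (response)
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

tokens, loss_mask = build_sft_example(
    "Explain photosynthesis in one sentence.",
    "Photosynthesis is the process by which plants use sunlight.",
)
# Only the response tokens contribute to the training signal.
trained_tokens = [t for t, m in zip(tokens, loss_mask) if m == 1]
```

The training objective is still next-token prediction, as in pre-training. What changes is the data: every example now has the shape of a request followed by a good answer.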

Phase 3: Alignment (RLHF and DPO)

Even after SFT, a model can still produce toxic, biased, incorrect, or unhelpful output. SFT only teaches the model to imitate the training dialogues. To ensure the model is helpful, honest, and harmless, engineers "align" it with human preferences using two primary techniques:

1. Reinforcement Learning from Human Feedback (RLHF)

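RLHF works by using a grading system. The process typically involves three steps: first, human annotators compare or rank several model responses to the same prompt; second, a separate reward model is trained to predict those human preferences; third, the LLM is optimized with reinforcement learning (commonly PPO) to produce responses that the reward model scores highly, usually with a penalty that keeps it close to the SFT model.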

2. Direct Preference Optimization (DPO)

While RLHF is highly effective, it can be unstable, expensive, and complex to train because it requires keeping several models in memory at once (the LLM being trained, the reward model, a frozen reference model, and often a separate value model).
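To illustrate what the reward model in RLHF learns, here is a minimal sketch of the pairwise preference loss that is commonly used to train it. This specific loss is an assumption on our part (the text above does not name one), and the reward values are made-up numbers rather than outputs of a real model.

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when chosen outscores rejected."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) is algebraically equal to -log(sigmoid(margin))
    return math.log1p(math.exp(-margin))

# Reward model agrees with the human label (chosen response scored higher).
good_ranking = pairwise_reward_loss(2.0, -1.0)
# Reward model disagrees with the human label (rejected response scored higher).
bad_ranking = pairwise_reward_loss(-1.0, 2.0)
```

Minimizing this loss pushes the reward model to score human-preferred responses above rejected ones. The LLM is then trained to maximize that learned score.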

In 2023, researchers introduced Direct Preference Optimization (DPO). DPO removes the need for a separately trained reward model. Its authors show mathematically that you can optimize the LLM policy directly using a dataset of paired choices: a prompt, a preferred (chosen) response, and a disliked (rejected) response. DPO adjusts the weights so that the probability of the chosen response increases relative to the rejected response, which typically gives a simpler, cheaper, and more stable alignment loop than RLHF.
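The DPO objective for a single preference pair can be sketched in a few lines of Python. The log-probabilities below are invented numbers; in practice they are the summed token log-probabilities of each response under the model being trained (the policy) and under a frozen reference copy, usually the SFT model. The parameter `beta` controls how far the policy may drift from the reference, and the default of 0.1 is an illustrative choice.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given sequence log-probabilities."""
    # How much more likely each response became, relative to the reference model.
    chosen_gain = policy_chosen - ref_chosen
    rejected_gain = policy_rejected - ref_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written in a numerically convenient form
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: no preference learned yet.
untrained = dpo_loss(-12.0, -11.0, -12.0, -11.0)
# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-10.0, -14.0, -12.0, -11.0)
```

The loss falls only when the chosen response gains probability relative to the rejected one, measured against the reference model. No reward model and no reinforcement learning loop are involved.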

Key Concept: Kaplan vs. Chinchilla Scaling Laws

How do we make models smarter? For a time, the industry followed the scaling laws of Kaplan et al. (2020), which suggested that, as compute budgets grow, most of the extra compute should go into parameter count rather than training data. This encouraged engineers to build ever larger models without increasing the training data in proportion.

In 2022, DeepMind published the Chinchilla scaling laws. Their experiments showed that, for a fixed compute budget, parameter count and training data (tokens) should scale in roughly equal proportion, and that many large models of the time were under-trained on too little data. This shifted the industry toward training smaller models (like LLaMA or Mistral) on far more high-quality tokens, making them much cheaper to run, in some cases on commodity hardware.
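A back-of-the-envelope calculation shows what "equal proportion" means in practice. The sketch below relies on two widely quoted rules of thumb that are not stated above and should be read as approximations: training compute is roughly 6 FLOPs per parameter per token, and the Chinchilla-optimal ratio is roughly 20 training tokens per parameter. The function name and the budgets are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20       # rule-of-thumb Chinchilla ratio (assumption)
FLOPS_PER_PARAM_TOKEN = 6   # compute approx. 6 * params * tokens (assumption)

def compute_optimal_split(flops_budget):
    """Split a training compute budget into a parameter count and a token count."""
    # Solve flops = 6 * N * D with D = 20 * N for N.
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens

small_params, small_tokens = compute_optimal_split(1e21)
large_params, large_tokens = compute_optimal_split(1e23)  # 100x the compute
```

Under these assumptions, a 100x larger compute budget calls for about 10x more parameters and 10x more tokens, rather than spending almost all of the increase on model size.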
