Chapter 1 • 8 min read • Last reviewed: June 2026

The Transformer Core

Before 2017, the dominant natural language processing (NLP) models read text the slow way: one word at a time, left to right. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks kept a running "mental state" that updated with each new word. Sounds sensible, but the bottleneck was brutal: the computation could not be parallelized across a sequence. Because the model needed the state from word n-1 to process word n, GPUs could not work on every word of a text block at the same time.

Then, in 2017, the landmark paper landed: "Attention Is All You Need" (Vaswani et al.), introducing the Transformer architecture. The Transformer dropped recurrence entirely and processed all words in a sequence at the same time. The trick was a mechanism called Self-Attention.

The Core Glow-Up: Self-Attention

Self-attention lets a model look at one word and figure out which other words matter most, even if they are far apart. Example:

"The bank robber ran to the river bank because he saw the police."

How does the model know the first "bank" means a financial institution, while the second means the edge of a river? In an RNN, the "bank robber" context might fade by the end of the sentence. In a Transformer, the first "bank" gets compared with every other word at once and can link strongly to "robber." The second "bank" links to "river." Every word picks up context from its surroundings in a single pass, no matter how far apart the related words are.

Key Concept: Query-Key-Value (QKV), no mystery

To compute attention, the Transformer derives three vectors for every word (strictly, every token) by multiplying the word's embedding by three learned weight matrices:

Query (Q): What the word is looking for (for example, "I am a pronoun; where is my noun?").
Key (K): What the word offers (for example, "I am a noun; I describe a person").
Value (V): The actual content of the word: its meaning.

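The model compares a word's Query vector against the Key vector of every word in the sequence, itself included, using a dot product. The scores are divided by the square root of the key dimension and passed through a softmax, so they become positive weights that sum to 1. A higher score means more attention. The word's final representation is a weighted mix of the Value vectors based on those weights.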
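The score-then-mix recipe can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names (softmax, attend) are made up for this example, and the 2-dimensional vectors are hand-picked toy values rather than anything a real model learned.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d_k = len(query)
    # 1. Compare the query with every key (dot product), scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    # 2. Softmax turns the scores into attention weights.
    weights = softmax(scores)
    # 3. The output is the weighted mix of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return output, weights

# Hypothetical toy vectors, chosen by hand for illustration only.
words = ["bank", "robber", "police"]
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_for_bank = [1.0, 0.0]

output, weights = attend(query_for_bank, keys, values)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

With these toy numbers, the keys that point in the same direction as the query get the largest weights, so their Value vectors dominate the output. Real models do the same thing for every word at once using matrix multiplication.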

Multi-Head Attention: more angles at once

Instead of doing the attention calculation once, the Transformer runs it several times in parallel, each time with its own learned Query, Key, and Value projections. Each run is an Attention Head, and the heads' outputs are concatenated and combined into one representation.

With multiple heads, the model can inspect different parts of meaning at the same time:

Head 1 might track grammar relationships, like which verb belongs to which noun.
Head 2 might resolve references, like matching "he" or "it" to the right entity.
Head 3 might watch local details, like nearby adjectives.

Together, the heads build a richer, more accurate language representation.
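Here is a rough sketch of how several heads fit together, assuming the layout from the original paper: each head projects the input down to a smaller size, runs attention, and the results are concatenated and passed through one output projection. The helper names are invented for this example, and the projection matrices are random stand-ins for weights a real model would learn, so the numbers it prints carry no meaning.

```python
import math
import random

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention; q, k, v are lists of row vectors."""
    d_k = len(k[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (assumption: random values)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, num_heads, rng):
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Each head has its own Query, Key, and Value projections.
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        head_outputs.append(attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)))
    # Concatenate the heads' outputs word by word, then mix them together.
    concatenated = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    w_o = random_matrix(d_model, d_model, rng)
    return matmul(concatenated, w_o)

rng = random.Random(0)
# Four hypothetical word embeddings, four dimensions each.
embeddings = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(4)]
result = multi_head_attention(embeddings, num_heads=2, rng=rng)
print(len(result), "words x", len(result[0]), "dimensions")
```

Note that the output has the same shape as the input: splitting the work across heads changes what each head can focus on, not the size of the representation.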

Positional Encoding: giving words coordinates

Because a Transformer processes every word at once, it does not automatically know order. To pure attention, "The dog bit the man" and "The man bit the dog" look the same: both contain the same words, and nothing in the attention scores depends on where a word sits.

The fix is Positional Encodings: vectors added to each word embedding that act like coordinates. They tell the model where each word sits, so word order survives even though all words are processed together.
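One concrete scheme is the sinusoidal encoding from the original Transformer paper, where even dimensions use a sine and odd dimensions use a cosine of the position at different frequencies. The sketch below follows that formula. The function name and the embedding for "dog" are made up for illustration, and many newer models use learned or rotary position schemes instead.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal encodings: PE(pos, 2k) = sin(angle), PE(pos, 2k+1) = cos(angle),
    where angle = pos / 10000^(2k / d_model)."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

encodings = positional_encoding(num_positions=5, d_model=4)

# Hypothetical embedding for the word "dog" (toy values).
dog = [0.5, 0.1, -0.3, 0.8]

# "The dog bit the man": dog is at position 1.
# "The man bit the dog": dog is at position 4.
dog_as_biter = [e + p for e, p in zip(dog, encodings[1])]
dog_as_bitten = [e + p for e, p in zip(dog, encodings[4])]

print(dog_as_biter)
print(dog_as_bitten)
```

The same word now produces a different input vector depending on where it appears, which gives attention something to tell the two sentences apart.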

Encoder vs. Decoder Architectures

The original Transformer had two halves: an Encoder (which reads and understands text) and a Decoder (which writes new text). Depending on the task, modern models keep one half or both, which gives three variants:

Encoder-Only Models (e.g., BERT): Great for understanding, classifying, and extracting information from text. They look both left and right at the same time.
Decoder-Only Models (e.g., GPT, LLaMA): Great for generating text. They are autoregressive, meaning they produce one token at a time and only look backward, at the tokens already written, to predict the next one.
Encoder-Decoder Models (e.g., T5, BART): Often used for translation or summarization: read the whole input, then generate a new output.
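The difference between "looks both left and right" and "only looks backward" usually comes down to a mask applied to the attention scores before the softmax. The sketch below assumes the common approach of setting scores for future positions to negative infinity. The function names are invented for this example, and the scores are all zero so that the effect of the mask is easy to see.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal):
    """Row i holds how much position i attends to each position j."""
    weights = []
    for i, row in enumerate(scores):
        if causal:
            # Hide future positions (j > i): exp(-inf) is 0, so they get zero weight.
            row = [s if j <= i else -math.inf for j, s in enumerate(row)]
        weights.append(softmax(row))
    return weights

# Hypothetical raw scores for a 3-token sequence (all equal, for clarity).
scores = [[0.0, 0.0, 0.0] for _ in range(3)]

encoder_style = attention_weights(scores, causal=False)  # sees left and right
decoder_style = attention_weights(scores, causal=True)   # sees only the past

for row in encoder_style:
    print("encoder:", [round(w, 2) for w in row])
for row in decoder_style:
    print("decoder:", [round(w, 2) for w in row])
```

In the decoder-style output, the first token can attend only to itself, the second to the first two, and so on, which is what lets the model be trained to predict each next token without peeking at it.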

Many frontier text-first LLMs use decoder-only architectures, optimized for generating text by predicting the next token efficiently.

Why This Still Matters in Products

Attention is the reason a model can connect a bug report to a stack trace, a legal question to a clause thirty pages earlier, or a chart caption to the data it describes. It is also the reason long prompts get expensive: in standard self-attention every token is scored against every other token, so the work grows roughly with the square of the prompt length.
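A back-of-the-envelope sketch shows why prompt length matters so much. It assumes plain, full self-attention, where each token is scored against every token, and counts scores for a single head in a single layer. The function name is made up, and real systems often use optimizations that change the constants or the memory footprint.

```python
def attention_score_count(num_tokens):
    """Query-key scores computed by one attention head in one layer."""
    return num_tokens * num_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} tokens -> {attention_score_count(n):,} scores")
```

Making the prompt ten times longer means a hundred times as many scores, which is one reason trimming noisy context pays off.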

When a system fails, the root cause is often not "the model is dumb." It may be that the relevant tokens were missing, buried under noisy context, truncated by the context window, or split across modalities the model cannot read together. Good AI products treat context as a scarce design surface: include what matters, remove what does not, and make the key evidence easy for attention to find.

Sources

Attention Is All You Need — Vaswani et al.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — Devlin et al.