Chapter 1 • 8 min read • Last reviewed: May 2026

The Transformer Core

Before 2017, most leading natural language processing (NLP) models read text the way people do: one word at a time, from left to right. These models, chiefly Recurrent Neural Networks (RNNs) and their Long Short-Term Memory (LSTM) variants, kept a running "mental state" that updated with each new word. While intuitive, this approach had a serious bottleneck: it could not be parallelized across the sequence. Because the state of word n-1 is needed to compute the state of word n, GPUs could not process all the words in a block of text simultaneously.
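To make the sequential dependency concrete, here is a minimal sketch of a toy recurrent update. The scalar state, the weights `w_x` and `w_h`, and the function name `rnn_states` are illustrative assumptions, not details from any particular RNN implementation.

```python
import math

def rnn_states(inputs, w_x=0.5, w_h=0.8):
    """Return the hidden state after each step of a toy scalar RNN."""
    h = 0.0  # initial state before any word is read
    states = []
    for x in inputs:  # strictly sequential: step n needs h from step n-1
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

# Hypothetical scalar "word" inputs.
states = rnn_states([1.0, -2.0, 0.5])
```

The loop body cannot start on word n until it has finished word n-1, which is the dependency that prevents processing the whole sequence at once.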

Then came the landmark 2017 paper, "Attention Is All You Need", which introduced the Transformer architecture. The Transformer discarded recurrence entirely and instead processes all the words in a sequence at the same time. To do this, it relies on a mechanism called Self-Attention.

The Core Breakthrough: Self-Attention

Self-attention allows a model to look at a specific word and determine which other words in the sentence are most relevant to it, regardless of how far apart they are. Consider this example:

"The bank robber ran to the river bank because he saw the police."

How does the model know that the first "bank" refers to a financial institution, while the second "bank" refers to the edge of a river? In an RNN, by the time the model reached the end of the sentence, the context of "bank robber" might have faded. In a Transformer, the first "bank" is compared to all other words in the sentence concurrently, finding a strong mathematical connection to "robber." The second "bank" finds a connection to "river." The model dynamically contextualizes each word based on its surroundings.

Key Concept: The Query-Key-Value (QKV) Analogy

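[Figure: a Query (Q, "What I seek") is compared against Key 1 (K₁, high match, 90%) and Key 2 (K₂, low match, 10%); the corresponding Value 1 (V₁, Content 1) and Value 2 (V₂, Content 2) are then combined in a weighted sum (Σ).]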

To compute attention, the Transformer derives three vectors for every word, each produced from the word's embedding by a learned projection:

  • Query (Q): What the word is looking for (e.g., "I am a pronoun, where is my noun?").
  • Key (K): What the word represents or offers (e.g., "I am a noun, I describe a person").
  • Value (V): The actual content of the word (the semantic meaning).

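The model takes the dot product of a word's Query vector with the Key vector of every word in the sequence, itself included. These raw scores are scaled and passed through a softmax, which turns them into attention weights that sum to 1. The higher a word's weight, the more attention it gets. The final representation is a weighted sum of the Value vectors using these attention weights.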
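The calculation described above can be sketched in a few lines of plain Python. This is a minimal illustration of scaled dot-product attention, softmax(QKᵀ / √d_k)·V, assuming the Query, Key, and Value vectors are already given; the 2-dimensional example vectors and the function names are hypothetical, and a real model would compute Q, K, and V with learned projections and use a tensor library.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # The output is the weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional vectors for a three-word sequence.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = attention(Q, K, V)
```

Each row of `weights` shows how one word distributes its attention over all three words, and each row of `outputs` is the corresponding blend of the Value vectors.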

Multi-Head Attention

Instead of doing this attention calculation once, the Transformer does it several times in parallel. Each calculation is called an Attention Head, and the overall mechanism is known as Multi-Head Attention.

By using several heads, the model can look at different aspects of the text at the same time. For example, one head might track grammatical relationships while another tracks which noun a pronoun refers to.

Combined, these heads build a richer, multi-faceted representation of language.
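As a rough sketch of the idea, the code below splits each word vector into equal slices, runs an independent attention calculation on each slice, and concatenates the results. It is deliberately simplified: the function names and example vectors are hypothetical, and the learned per-head projections and final output projection used in real Transformers are omitted.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Single-head scaled dot-product attention."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale
                           for k in keys])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def multi_head_attention(x, num_heads):
    """Split each vector into num_heads slices, attend per slice, concatenate."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        part = [vec[lo:hi] for vec in x]  # this head's view of every word
        # Simplification: the slice serves as query, key, and value.
        head_outputs.append(attend(part, part, part))
    # Concatenate the heads' outputs back into d_model-sized vectors.
    return [sum((head[i] for head in head_outputs), [])
            for i in range(len(x))]

# Hypothetical 4-dimensional vectors for a three-word sequence.
x = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(x, num_heads=2)
```

Because each head sees a different slice of the vectors, the heads generally produce different attention patterns over the same words.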

Positional Encoding

Since a Transformer processes all words simultaneously, it has no built-in sense of order. To a pure attention mechanism, "The dog bit the man" and "The man bit the dog" look identical because they contain exactly the same words.

To fix this, the Transformer uses Positional Encodings: a set of numerical values added to each word's embedding that act as a coordinate. These coordinates tell the model where each word sits in the sentence, allowing it to take word order into account.
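One concrete scheme is the sinusoidal encoding from the original Transformer paper; many later models use other schemes, such as learned position embeddings. The sketch below assumes that sinusoidal form, and the function names and 4-dimensional word embeddings are hypothetical values chosen for illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the positional encoding to each word embedding, element by element."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, row)]
            for emb, row in zip(embeddings, pe)]

# Hypothetical embeddings; "dog" and "man" swap places in the second sentence.
the, bit = [0.1] * 4, [0.3] * 4
dog, man = [0.9, 0.2, 0.0, 0.4], [0.5, 0.7, 0.1, 0.0]
a = add_positions([the, dog, bit, the, man])  # "The dog bit the man"
b = add_positions([the, man, bit, the, dog])  # "The man bit the dog"
```

After the encodings are added, the two sentences no longer produce the same set of input vectors, so the model can tell them apart.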

Encoder vs. Decoder Architectures

The original Transformer consisted of two halves: an Encoder (which reads and understands text) and a Decoder (which writes new text). Depending on the task, modern models fall into three variants:

  1. Encoder-Only Models (e.g., BERT): Well suited to understanding, classifying, and extracting information from text. They look in both directions (left and right) simultaneously.
  2. Decoder-Only Models (e.g., GPT, LLaMA): Well suited to generating text. They are autoregressive, meaning they generate one word at a time, looking only at past words (left-to-right, or causal, masking) to predict the next word.
  3. Encoder-Decoder Models (e.g., T5, BART): Often used for translation or summarization, where an input sequence is processed entirely, and a brand new output sequence is generated.
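The difference between looking in both directions and looking only at past words comes down to a mask on the attention weights. The sketch below illustrates the decoder-style (causal) case, under the assumption that raw attention scores are already available; the function name and score values are hypothetical. Implementations typically set future scores to negative infinity before the softmax, which gives the same result as the slicing used here.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax in which position i may attend only to positions <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # scores for future positions are ignored
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive a weight of exactly 0.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Hypothetical raw attention scores for a 3-token sequence.
scores = [[0.2, 1.5, 0.3], [0.7, 0.1, 2.0], [1.0, 1.0, 1.0]]
weights = causal_attention_weights(scores)
```

The first token can attend only to itself, while the last token sees the whole sequence. An encoder-only model would skip the mask and apply the softmax to each full row.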

Many frontier text-first large language models (LLMs) use decoder-only architectures, which are optimized for generating text by predicting the next token.

Sources