Chapter 4 • 9 min read • Last reviewed: May 2026

Scaling Efficiency: MoE & Quantization

As AI models grow larger, running them becomes increasingly expensive. A dense 175-billion-parameter model needs several high-end enterprise GPUs working together just to output a single word. To make these models practical for commercial use and deployable on smaller hardware, engineers rely on two major efficiency techniques: Mixture of Experts (MoE) and Quantization.

Mixture of Experts (MoE)

In a standard "Dense" model, every single parameter (the neural connections) is activated for every single word processed. This is highly inefficient; a model doesn't need to invoke its entire mathematical knowledge base to process a simple punctuation mark or pronoun.

An MoE architecture turns a dense model into a "Sparse" model by breaking it up into specialized compartments called Experts (typically inside the Feed-Forward Network layers). Instead of passing a word through all pathways, a dynamic Gating Network (Router) decides which experts should handle which word.

Sparse Routing in Action

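Imagine a model with 8 distinct "Experts." The specialist labels below are a simplification for intuition; experts in trained models rarely map onto such clean, human-readable topics. When a token is processed: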

  • If the token comes from a line of Python code, the Router sends it to Expert 3 (Code specialist) and Expert 5 (Logic specialist).
  • If the token is a word in French, the Router sends it to Expert 1 (Translation specialist).

Typically, the Router selects only the Top-2 Experts for each token. If a model has 8 experts of 7B parameters each (56B total parameters), it activates only roughly 12B parameters per token. (In practice the experts usually share the attention layers, so the headline 8 x 7B figure somewhat overstates the true total.) This gives the model the knowledge capacity of a 56B model, but with the generation speed and compute cost of a much smaller 12B model.
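The Top-2 selection described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular model implements it: the function names (`route_top_k`, `moe_layer`), the toy experts, and the router scores are all hypothetical, and renormalizing the two selected weights so they sum to 1 is an assumption about the design.

```python
import math

def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    The weights of the chosen experts are renormalized to sum to 1.
    """
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(token_vector, router_scores, experts, k=2):
    """Run only the selected experts and blend their outputs by router weight."""
    output = [0.0] * len(token_vector)
    for index, weight in route_top_k(router_scores, k):
        expert_out = experts[index](token_vector)
        output = [acc + weight * value for acc, value in zip(output, expert_out)]
    return output

# Hypothetical toy experts: expert i simply scales its input by (i + 1).
experts = [lambda vec, scale=i + 1: [scale * x for x in vec] for i in range(8)]

# Hypothetical router scores for one token; a real router is a learned layer.
scores = [0.1, 0.3, 0.2, 2.0, 0.0, 2.0, 0.1, 0.4]
selected = route_top_k(scores)
mixed = moe_layer([1.0, 2.0], scores, experts)
```

With these toy scores, experts 3 and 5 tie for the highest score, so each receives a weight of 0.5 and the other six experts are never executed. That skipped work is where the compute savings come from.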

The Challenges of MoE

MoE is not a free lunch. It introduces several hard engineering hurdles:

  1. Memory Overhead: Although only about 12B parameters are active for any given token, the entire 56B-parameter model must still be loaded into the GPU's memory (VRAM), because any expert may be needed for the next token. This means an MoE model needs significantly more memory than a dense model of equivalent speed.
  2. Routing Collapse: During early training, the router might favor one expert. That expert then receives more training signal and improves, which causes the router to send even more traffic to it while the other experts stay undertrained. Engineers must add explicit load-balancing mechanisms, such as an auxiliary loss term, so that all experts are trained evenly.
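One common way to counter routing collapse is to add a penalty to the training loss that grows when traffic is uneven. The sketch below follows the formulation popularized by the Switch Transformer work (number of experts times the sum of traffic fraction times mean router probability). The text does not say which method is intended, so treat this as one plausible option; the function name and the toy batch are hypothetical, and for simplicity each token is assigned to a single expert.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Auxiliary loss: num_experts * sum_i(f_i * p_i).

    router_probs: per-token list of router probabilities over the experts.
    assignments:  per-token index of the expert that token was sent to.
    f_i is the fraction of tokens sent to expert i; p_i is the mean router
    probability given to expert i. Perfectly uniform routing gives 1.0.
    """
    num_tokens = len(assignments)
    fraction = [0.0] * num_experts
    for expert in assignments:
        fraction[expert] += 1.0 / num_tokens
    mean_prob = [
        sum(token_probs[i] for token_probs in router_probs) / num_tokens
        for i in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(fraction, mean_prob))

# Hypothetical batch of 4 tokens routed across 4 experts.
balanced = load_balancing_loss(
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    assignments=[0, 1, 2, 3],
    num_experts=4,
)
collapsed = load_balancing_loss(
    router_probs=[[0.85, 0.05, 0.05, 0.05]] * 4,
    assignments=[0, 0, 0, 0],
    num_experts=4,
)
```

In this toy batch the balanced case evaluates to 1.0, while the collapsed case, where every token goes to expert 0, is several times larger. In training, a term like this would be scaled by a small coefficient and added to the main loss so the router is nudged toward spreading tokens out.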

Quantization

Neural networks represent their learned weights as high-precision decimals called floating-point numbers. During training, these are typically represented in 16-bit precision (FP16 or BF16).

Storing weights in 16-bit precision means every single parameter needs 2 bytes of GPU memory. A 70-billion-parameter model needs at least 140 gigabytes of VRAM just to load, which exceeds the capacity of almost all consumer GPUs.

Quantization is the process of compressing these weights by reducing their numerical precision, mapping them to smaller formats like 8-bit integers (INT8), 4-bit integers (INT4), or even low-bit floating-point formats like FP4.
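To make the idea of "mapping to a smaller format" concrete, here is a minimal sketch of symmetric INT8 quantization with a single shared scale factor. It is one simple scheme chosen for illustration, not the method used by any specific library; production formats typically quantize in small groups with one scale per group. The function names and sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

# Hypothetical sample of weights from one layer.
weights = [0.12, -0.53, 0.98, -1.27, 0.004]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The rounding step bounds the error at half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is now stored as a small integer plus one shared `scale`, and the restored values differ from the originals by at most half a quantization step. Note that the smallest weight (0.004) rounds to zero, which is the kind of precision loss quantization trades for memory.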

The Intuition Behind Quantization

Think of quantization like reducing the color depth of a digital photo. If you convert a photo from 24-bit true color to an 8-bit color palette, the raw pixel data shrinks by about two-thirds. The image looks slightly less smooth, but the shapes, objects, and overall context are still perfectly recognizable.

Similarly, when we quantize a model from 16-bit to 4-bit, we decrease its size by 75%. A 70B model that once required 140 GB of VRAM can now fit into roughly 35 GB of VRAM. Remarkably, due to the high redundancy in neural network weights, this compression often results in only a small degradation in reasoning capability, though the exact loss depends on the model and the quantization method.
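The memory figures above follow from simple arithmetic: parameters times bits per weight, divided by 8 to get bytes. The sketch below reproduces them under the assumption that 1 GB means 10^9 bytes and that only the weights are counted. The helper name is hypothetical, and real deployments also need room for activations, the KV cache, and the scale factors stored alongside quantized weights.

```python
def weight_memory_gb(num_parameters, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9

params_70b = 70e9
sizes = {bits: weight_memory_gb(params_70b, bits) for bits in (16, 8, 4)}

# Fraction of memory saved when going from 16-bit to 4-bit weights.
saving = 1 - sizes[4] / sizes[16]
```

Here `sizes` maps 16, 8, and 4 bits to roughly 140, 70, and 35 GB respectively, and `saving` comes out to 0.75, matching the 75% reduction described in the text.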

Modern Quantization Formats

Several standard file formats are used to run these compressed models:
