Mixture of experts: How AI models got smarter without getting slower

The secret architecture behind trillion-parameter models that stay fast and affordable.

Mixture of Experts (MoE) is the architectural trick that lets modern AI models grow enormous without becoming impossibly expensive to run. Instead of one massive network where every weight fires on every input, MoE splits part of the model into specialized subnetworks (called experts) and adds a learned gating mechanism (the router) that sends each token to only the most relevant few. The result is a model with the total capacity of something huge, but a per-token compute cost closer to that of a much smaller model.
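To make the routing idea concrete, here is a minimal sketch in plain Python. It assumes one common variant, top-k routing with a softmax over only the selected experts; real models differ in the details. The names `moe_forward`, `make_expert`, and `router_weights` are hypothetical, and each "expert" is a toy scaling function standing in for a real feed-forward network.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a small list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(scale):
    # Hypothetical stand-in: a real expert is a feed-forward network,
    # here it just multiplies the token vector by a constant.
    return lambda vec: [scale * v for v in vec]

def moe_forward(token, router_weights, experts, top_k=2):
    # Router: one score (logit) per expert, a dot product with the token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Keep only the top_k highest-scoring experts; the rest never run.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Turn the chosen experts' scores into mixing weights that sum to 1.
    gates = softmax([logits[i] for i in chosen])
    # Output is the gate-weighted sum of the chosen experts' outputs.
    output = [0.0] * len(token)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](token)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen, gates

# Toy setup (all values are illustrative): 4 experts, 2-dimensional tokens.
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.1, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
output, chosen, gates = moe_forward([1.0, 0.0], router_weights, experts, top_k=2)
print(chosen)  # indices of the experts that actually ran
```

With four experts and `top_k=2`, only half of the expert computation runs for this token, yet all four experts remain available for other tokens that the router scores differently.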

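Mixtral is an openly documented MoE model, GPT-4 is widely reported (though not officially confirmed) to be one, and many other frontier models use MoE variants. The catch is memory: all experts must stay loaded in memory (typically GPU memory) even though only a fraction activates per token. That makes MoE models expensive to host, even if each individual inference is cheap in compute.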
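The gap between hosting cost and per-token cost comes down to simple arithmetic. The sketch below is a rough back-of-the-envelope calculation, not a description of any real model: the function name `moe_footprint` and every number in the example are hypothetical. It also ignores activations, caches, and other runtime overhead, and it lumps all non-expert weights into one "shared" figure.

```python
def moe_footprint(num_experts, params_per_expert, shared_params, top_k, bytes_per_param=2):
    # Every expert must be resident in memory, so memory scales with ALL experts.
    total_params = shared_params + num_experts * params_per_expert
    # Only top_k experts run per token, so compute scales with the active subset.
    active_params = shared_params + top_k * params_per_expert
    return {
        "total_params": total_params,
        "active_params": active_params,
        "memory_gb": total_params * bytes_per_param / 1e9,
        "active_fraction": active_params / total_params,
    }

# Hypothetical model: 8 experts of 5B parameters each, 2B shared parameters,
# 2 experts active per token, 2 bytes per parameter (16-bit weights).
stats = moe_footprint(
    num_experts=8,
    params_per_expert=5_000_000_000,
    shared_params=2_000_000_000,
    top_k=2,
)
print(stats)
```

In this made-up configuration the host has to hold 42B parameters (about 84 GB at 16-bit precision), while each token only touches 12B of them, under 30% of the total.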
