Steering vectors: the DJ mixer inside every LLM


Prompting talks to the model. Steering vectors operate inside it.

Prompting an AI is a surface-level interaction. You write words, the model reads them, and it responds. The prompt shapes what the model says. It doesn't shape how the model thinks.

Steering vectors go one level deeper.

**What happens inside an LLM**

As a large language model processes text, each layer of the neural network computes what researchers call activations. These are high-dimensional vectors. Lists of thousands of numbers. Each one encodes the model's internal state at that point in computation.

The critical insight: concepts generally aren't stored in single neurons. Many of them appear to be encoded as _directions_ in this high-dimensional space. "Bridge" isn't one neuron. It's a vector. A specific direction in a space with 1024 or more dimensions.
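To make "a concept is a direction" concrete, here is a minimal sketch in plain Python. It assumes a toy 4-dimensional activation space, and the names `bridge_direction`, `activation_a`, and `activation_b` and all the numbers are made up for illustration. The idea it shows: projecting an activation onto a unit-length concept direction gives a single number for how strongly that concept is present.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale a vector to length 1 so projections are comparable."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Hypothetical concept direction in a toy 4-dimensional space.
# Real models use thousands of dimensions.
bridge_direction = unit([3.0, 0.0, 4.0, 0.0])

# Two made-up activation vectors.
activation_a = [6.0, 1.0, 8.0, -2.0]  # points largely along the direction
activation_b = [0.0, 5.0, 0.0, 1.0]   # orthogonal to the direction

# The projection is the "amount" of the concept in each activation.
score_a = dot(activation_a, bridge_direction)
score_b = dot(activation_b, bridge_direction)
```

In this toy setup `score_a` is large and `score_b` is zero, even though no single coordinate of either vector means "bridge" on its own.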

**How a steering vector is built**

You start by collecting two sets of prompts: ones that strongly express a target concept, and ones that don't. You run both sets through the model and record the average activations at a specific layer. The difference between those two averages is the steering vector for that concept.
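The recipe above is a difference of means, which can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the activations are hard-coded toy lists rather than values recorded from a real model, and `mean_vector` and `steering_vector` are hypothetical helper names.

```python
def mean_vector(vectors):
    """Average a list of equal-length vectors, coordinate by coordinate."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_vector(positive_acts, negative_acts):
    """Mean activation with the concept minus mean activation without it."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]

# Toy stand-ins for activations recorded at one layer (one vector per prompt).
positive_acts = [[2.0, 1.0, 0.0], [4.0, 1.0, 2.0]]  # prompts expressing the concept
negative_acts = [[1.0, 1.0, 1.0], [1.0, 3.0, 1.0]]  # prompts that don't

vec = steering_vector(positive_acts, negative_acts)
```

Averaging over many prompts is what lets prompt-specific details cancel out, leaving mostly the part of the activation that the two sets differ on.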

To use it, you inject this vector directly into the model's activations while it generates text. You add it to the hidden states at runtime, at the chosen layer, scaled by a coefficient. The model generates from an altered internal state, as if that concept had been strongly activated from the start.
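The injection step amounts to `hidden = hidden + coeff * vector` at one layer. The sketch below assumes a toy "model" made of three simple functions standing in for real transformer layers. The names `run_with_steering`, `double`, and `add_one` are hypothetical, and a real implementation would typically hook into the framework's forward pass instead.

```python
def run_with_steering(layers, hidden, steer_layer, vector, coeff):
    """Run hidden state through the layers, adding coeff * vector
    to the output of the layer at index steer_layer."""
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        if i == steer_layer:
            hidden = [h + coeff * v for h, v in zip(hidden, vector)]
    return hidden

# Toy stand-ins for layers; real layers are attention and MLP blocks.
def double(h):
    return [2.0 * x for x in h]

def add_one(h):
    return [x + 1.0 for x in h]

layers = [double, add_one, double]
h0 = [1.0, 2.0]
concept = [1.0, 0.0]  # made-up steering vector

# coeff=0.0 leaves the model untouched; larger values steer harder.
baseline = run_with_steering(layers, h0, steer_layer=1, vector=concept, coeff=0.0)
steered = run_with_steering(layers, h0, steer_layer=1, vector=concept, coeff=4.0)
```

Note that the layer functions themselves never change between the two runs. Only the intermediate state is nudged, and every later layer then computes from that nudged state.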

**The Golden Gate Bridge experiment**

In 2024, Anthropic researchers demonstrated a closely related intervention with Claude. Using a sparse autoencoder, they isolated a "Golden Gate Bridge" feature in the model's middle layers and clamped its activation far above its normal range.

The results were striking. Asked about almost anything - emotions, self-perception, unrelated topics - the model connected it back to the Golden Gate Bridge. At high enough amplification, it described itself as the bridge.

This wasn't a jailbreak. The weights were unchanged. Only the runtime activations were modified.

**What this means**

Steering vectors are a tool for mechanistic interpretability. They let researchers test whether a concept is represented inside a model, at which layers it shows up, and how strongly it influences behavior.

They also suggest a new layer of AI control. Prompting asks the model to respond differently. Steering vectors make it think differently. Below the level of language. At the level of computation.
