Prompt Caching

6 MIN READ

Ask a million times. Pay for one.

The cost of an AI coding session is almost never the model. It is the tokens.

Every time you chat with a long document or big codebase, you send the whole thing. Every message. Every request. The model re-reads it from scratch each time. That is like hiring a consultant, having them memorize your entire company wiki, then firing them and re-hiring them before each question.

**Prompt caching changes the math.**

The first time you send a long prefix, the model computes its Key-Value matrices. These are the mathematical states that represent what the model "understood." Caching stores those states. The second time you send the same prefix, the model skips the reading phase entirely. It goes straight to your question.

The speed wins are real. Time to First Token drops from 20+ seconds to under a second. The first query primes the cache. Every query after is instant.

The cost wins are bigger. Cached tokens on Claude cost roughly one-tenth of standard input tokens. Send a 100K-token codebase 20 times in a session, and you have spent maybe 2K tokens worth of compute on the prefix instead of 2 million. That is the difference between a $20 experiment and a $0.20 one.

What does this actually unlock? Full-repo coding where the AI holds your entire codebase in context without re-reading it. Company wikis embedded directly in your chat interface. Agentic workflows where a massive system prompt gets loaded once and reused across hundreds of calls.

The irony is that larger context windows made caching more essential, not less. The bigger the window, the more expensive each uncached call. Cache it once. Ask forever.

Prompt Caching

Related Reads

The Art of Loop Engineering

The LangChain stack

Heretic: the abliterated AI models