DwarfStar: run frontier AI on your own machine
A hyper-focused, native inference engine built for one job: running DeepSeek V4 locally, even on hardware that technically can't fit it.
There is a quiet category of software that does one thing and does it with total conviction. DwarfStar belongs to it.
Built by Antirez (creator of Redis) and published under an open-source license, DwarfStar is a native inference engine designed for a single purpose: running DeepSeek V4 Flash and DeepSeek V4 PRO on your own hardware. Not a wrapper around llama.cpp. Not a generic runtime that happens to support DeepSeek. A complete, self-contained stack built from the ground up for this specific model family.
**What makes it different from everything else.**
Generic model runners are optimized for breadth. They support dozens of models, expose general APIs, and make reasonable trade-offs for each. DwarfStar makes no such compromise. Because it only targets DeepSeek V4, it can hard-code assumptions about the model architecture, tune the memory management for its exact parameter layout, and skip all the abstraction layers that a general runner needs. The result is a leaner, faster engine that handles everything in one place: prompt rendering, tool calling, streaming output, and a built-in HTTP server.
The target hardware is deliberately high-end: Apple MacBooks with 96 GB or more of unified memory, Mac Studio configurations, and heavy NVIDIA systems like the DGX Spark. If your machine falls below these specs, DwarfStar still has an answer, but it involves a clever trick with your storage drive.
**The SSD as extended RAM.**
DeepSeek V4 is a Mixture-of-Experts (MoE) model. This architecture is enormous in total parameter count but only activates a small fraction of its "expert" sub-networks for any given token. Most of the model sits dormant at any moment. DwarfStar exploits this by keeping inactive experts on your NVMe SSD and streaming them into RAM only when the router selects them for a specific forward pass.
Modern NVMe drives are fast enough to make this practical. The penalty is latency, not impossibility. Instead of a hard binary outcome where the model either fits in RAM or does not run at all, SSD streaming creates a continuous spectrum. More RAM means faster inference. Less RAM means slower inference. But the model always runs.
This matters. It means a machine with 32 GB of unified memory can run a frontier-class, 671-billion-parameter model. Slowly, but genuinely. The experience is no longer gated by whether your hardware can hold the whole thing at once.
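To make the trade-off concrete, here is a minimal sketch of the idea, not DwarfStar's actual code. It assumes a least-recently-used eviction policy, and the names `ExpertCache`, `load_from_disk`, and `simulate` are hypothetical. The loader stands in for an NVMe read, and `capacity` stands in for how many experts fit in RAM.

```python
from collections import OrderedDict


class ExpertCache:
    """Keeps up to `capacity` experts in memory; misses go to slow storage."""

    def __init__(self, capacity, load_from_disk):
        self.capacity = capacity              # number of experts that fit in RAM
        self.load_from_disk = load_from_disk  # hypothetical loader: expert_id -> weights
        self.resident = OrderedDict()         # expert_id -> weights, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.resident:
            # Already in RAM: mark as most recently used.
            self.resident.move_to_end(expert_id)
            self.hits += 1
            return self.resident[expert_id]
        # Not in RAM: pay the storage read, then evict the oldest if over budget.
        self.misses += 1
        weights = self.load_from_disk(expert_id)
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)
        return weights


def simulate(capacity, routed_experts):
    """Replay a sequence of router choices and count RAM hits and disk reads."""
    cache = ExpertCache(capacity, load_from_disk=lambda eid: f"weights-{eid}")
    for eid in routed_experts:
        cache.get(eid)
    return cache.hits, cache.misses
```

Replaying the same routing sequence with a larger `capacity` turns disk reads into RAM hits, which is the "more RAM means faster inference" spectrum in miniature.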
**Splitting across machines.**
For users who want the full experience without a single machine large enough to hold it, DwarfStar supports distributed inference over a local network or Thunderbolt cable. You connect two machines, and the model's layers are partitioned across both. Machine A handles the first half of the transformer stack. Machine B handles the second. Activations travel between them at each forward pass.
The tradeoff is well understood. Prompt processing (ingesting your input) gets faster because both machines can be kept busy at the same time. Token generation (producing output) gets slightly slower because each new token has to pass through both machines in sequence, paying the round-trip latency between them. But the usable memory is the sum of both machines. Two 96 GB machines can run a configuration that neither could handle alone.
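The split described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not DwarfStar's implementation: layers are plain Python callables, the "network" is just a counter, and `partition_layers` and `forward` are hypothetical names.

```python
def partition_layers(num_layers, num_machines):
    """Split layer indices into contiguous, near-equal ranges, one per machine."""
    base, extra = divmod(num_layers, num_machines)
    ranges, start = [], 0
    for m in range(num_machines):
        size = base + (1 if m < extra else 0)  # earlier machines absorb the remainder
        ranges.append(range(start, start + size))
        start += size
    return ranges


def forward(x, layers, ranges):
    """Run x through every layer in order, counting machine-to-machine handoffs."""
    hops = 0
    for i, machine_range in enumerate(ranges):
        for idx in machine_range:
            x = layers[idx](x)
        if i < len(ranges) - 1:
            hops += 1  # activations cross the link to the next machine
    return x, hops
```

With two machines there is one handoff per forward pass in each direction of the pipeline, which is where the extra per-token latency comes from.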
**A coding agent that lives inside the engine.**
DwarfStar ships with a built-in coding agent that is unlike the typical agentic setup you get from third-party clients. Because the agent has direct access to the inference engine rather than going through an HTTP API, tool invocations and code generation happen with essentially no round-trip overhead. The only wait left is the model itself, not a request cycle wrapped around it.
The architecture eliminates an entire abstraction layer. Most agentic applications call a model API, receive a JSON response, parse tool calls, execute them, and send results back. DwarfStar collapses this into a single tight loop. For iterative coding tasks where you are bouncing back and forth with the model dozens of times, this difference adds up quickly.
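As a rough sketch of what a single tight loop looks like, the function below calls the model and the tools as ordinary in-process functions, with no serialization step between them. Everything here is hypothetical: `run_agent`, the `generate` callback, and the tuple protocol it returns are assumptions for illustration, not DwarfStar's API.

```python
def run_agent(generate, tools, prompt, max_steps=8):
    """Drive a model/tool loop in one process.

    generate(context) must return either ("tool", name, arg) or ("final", text).
    """
    context = [prompt]
    for _ in range(max_steps):
        step = generate(context)
        if step[0] == "final":
            return step[1], context
        _, name, arg = step
        result = tools[name](arg)  # direct function call, no HTTP or JSON parsing
        context.append(f"{name}({arg!r}) -> {result!r}")
    raise RuntimeError("agent did not finish within max_steps")
```

The `max_steps` bound is a design choice for the sketch: it keeps a misbehaving model from looping forever.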
**Persistent conversation state.**
One of the less obvious costs of running long coding sessions with a language model is KV cache reconstruction. Every time you start a fresh session, the model has to re-read and re-encode your entire context from scratch. For long conversations, this can take seconds.
DwarfStar's disk KV cache solves this by serializing the full attention cache state to your SSD at the end of a session. When you resume, the model loads this cache directly instead of recomputing it. The conversation picks up exactly where it left off, with no re-ingestion delay. For multi-hour coding sessions, this is a meaningful quality-of-life improvement.
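The save-and-resume pattern can be sketched as follows. This is a toy under stated assumptions, not DwarfStar's on-disk format: the cache is any picklable Python object, files are keyed by a hash of the token prefix they were computed for, and `cache_path`, `save_kv`, and `load_kv` are hypothetical names.

```python
import hashlib
import pickle
from pathlib import Path


def cache_path(cache_dir, token_ids):
    """Derive a file name from the exact token prefix the cache belongs to."""
    digest = hashlib.sha256(repr(list(token_ids)).encode()).hexdigest()
    return Path(cache_dir) / f"{digest}.kv"


def save_kv(cache_dir, token_ids, kv_state):
    """Write the attention cache for this prefix to disk."""
    path = cache_path(cache_dir, token_ids)
    path.write_bytes(pickle.dumps(kv_state))
    return path


def load_kv(cache_dir, token_ids):
    """Return the saved cache for this prefix, or None if it must be recomputed."""
    path = cache_path(cache_dir, token_ids)
    if not path.exists():
        return None
    return pickle.loads(path.read_bytes())
```

Keying by the token prefix matters because a cached state is only valid for the exact context that produced it. A different prefix simply misses and falls back to recomputation.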
**Steering without fine-tuning.**
DwarfStar exposes activation steering as a first-class feature. You can inject mathematical vectors directly into the model's internal activations at inference time, altering its behavior in real time without changing any weights. Want the model to respond more concisely? Apply a vector that nudges it in that direction. Want to discourage certain topics? There is a vector for that too.
This is the same technique used in mechanistic interpretability research, now available as a practical runtime control. The alternative would be fine-tuning: a process that takes hours, requires a GPU cluster, and produces a new checkpoint you have to manage. Steering is instantaneous and reversible.
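At its core, steering is vector addition on a hidden state. The sketch below shows that arithmetic on plain Python lists. It is not DwarfStar's interface: the function name `steer` and the `strength` parameter are assumptions, and a real engine would apply this to tensors at a chosen layer.

```python
def steer(hidden, direction, strength):
    """Return hidden + strength * direction, element-wise."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and steering vector must have the same size")
    return [h + strength * d for h, d in zip(hidden, direction)]


hidden = [1.0, 2.0, 3.0]       # toy activation vector
concise = [0.5, 0.0, -1.0]     # hypothetical "be concise" direction

steered = steer(hidden, concise, 2.0)
restored = steer(steered, concise, -2.0)  # applying the opposite strength undoes it
```

A strength of zero leaves the model untouched, and the weights are never modified, which is what makes the technique instant and reversible.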
**Built with the tools it is named for.**
The project notes that DwarfStar was developed by humans working alongside advanced AI coding assistants, using the kind of local AI tooling it is now designed to run. There is a satisfying circularity to that. The engine was, in a meaningful sense, built with the kind of tool it has become.
It also stands on serious foundations. Although DwarfStar is not a wrapper around llama.cpp, the llama.cpp ecosystem provides the lower-level primitives it builds on: quantization, GGUF format support, and the base inference routines that the project extends and optimizes for DeepSeek V4 specifically.
**A narrow bet, openly made.**
DwarfStar is beta software. It runs one model family. It does not try to be everything. The README is direct about this: as newer and more capable models arrive, the project will shift focus to follow them, leaving older support behind.
That focus is the point. General-purpose tools spread their attention across dozens of models and use cases. DwarfStar concentrates entirely on giving you the best possible experience with the most capable open-weight model available today, on hardware you already own. For people who want frontier-level AI locally, right now, it is the most direct path there.