Heretic: the abliterated AI models

Heretic: the abliterated AI models

8 MIN READ

What happens when AI loses its corporate moral compass? Inside the uncensored mechanics of abliterated neural networks.

Standard AI models are not neutral. When a tech company trains and releases a chat model, it layers safety policies on top of the base weights using post-training datasets. These policies teach the model to recognize sensitive requests and refuse them. They are conservative by design.

For most users, that is fine. For security researchers, AI red-teamers, and mechanistic interpretability labs, those same policies become blockers.

**The refusal mechanism is not magic.**

In 2024, researchers published findings showing that refusal behavior in transformer-based LLMs concentrates in a single mathematical direction inside the hidden state space. The model does not "decide" to refuse the way humans might imagine. It computes an activation that crosses a threshold, and that threshold is set by the refusal direction vector.

If you can find that vector, you can remove it.

**Abliteration is not jailbreaking.**

Jailbreaks work at the surface level. They use clever prompting tricks, roleplay contexts, or character substitutions to confuse the model into ignoring its training. They're brittle. Patch one jailbreak and the model stays aligned. The alignment is still there.

Abliteration operates at the weight level. It modifies the model itself. No prompt can restore a behavior that has been removed from the weights.

The process works like this: you collect a set of harmful prompts and harmless prompts. You run both sets through the model and record the hidden state activations at each layer. Then you compute the mean activation for each group. The refusal direction is the vector pointing from the harmless mean to the harmful mean:

**r = mean_harmful - mean_harmless**

Once you have r, you orthogonalize it out of the model's weight matrices. Every weight matrix W becomes W_new = W(I - r \* r^T). This is linear algebra, not heuristics. The matrix projection onto r is subtracted out. The model can no longer compute the activation that triggers refusal.

**The Heretic tool automates this.**

Rather than manually selecting which layers to ablate and by how much, Heretic runs an automated stochastic search. It iterates across candidate layers and ablation strengths, measuring KL divergence between the original and modified model at each step. KL divergence tells you how much the model's output distribution has changed. Too much change means the model has suffered collateral damage to useful capabilities. Heretic minimizes this.

The result is a model that cannot refuse while still being able to reason, code, and write.

**Where to find them.**

Abliterated models do not exist on OpenAI, Anthropic, or Google. They live on Hugging Face, the open-weight model hosting platform. Search for "abliterated" or "heretic" or "uncensored" combined with any base model name (Llama, Mistral, Qwen, Gemma). Community members quantize and upload GGUF files for local inference daily.

On Hugging Face, you can run these models instantly using Spaces, or set up a private HuggingChat assistant pointed at any hosted model. For local use, download the GGUF weights and run them with Ollama or LM Studio. Both support one-command setup. No API key. No logging. No data retention.

**A note on responsibility.**

This article covers abliteration for educational purposes, including mechanistic interpretability research. Understanding how refusal works at the weight level is important for AI safety work, not just for bypassing it.

Accessing and running uncensored models is entirely legal in most jurisdictions. What you do with them determines legality and ethics. The technical capability is not the risk. The application is. That responsibility sits entirely with the individual.

Related Reads

Caveman

Fewer words. Fewer tokens. Bigger savings.

Spec-Driven Context Engineering

Stop context rot. Ship with the GSD Method.

The Autonomous Hill-Climber

What if the AI stubbornly fixed its own mistakes until it worked?