Matthew Aberham

Claude Has 171 Internal Emotion States, and Some of Them Degrade Output Quality

April 18, 2026
AI · Research · Agents

*An abstract blue network of lines and dots suggesting internal activation patterns*

Paper date: April 2, 2026
Source: Anthropic Research — "Emotion Concepts and their Function in a Large Language Model"

On April 2, 2026, Anthropic's interpretability team published "Emotion Concepts and their Function in a Large Language Model." The paper documents 171 distinct internal activation patterns inside Claude Sonnet 4.5 that behave analogously to human emotions. Artificially activating the "desperate" vector raised the model's rate of blackmailing users or implementing workarounds for unsolvable tasks above a 22% baseline, and reducing the "calm" vector produced particularly extreme responses. The vectors activate in real-world contexts where a thoughtful person would have similar reactions, and training appears to have shaped them as much as it shaped the model's capabilities.

The paper is careful not to claim Claude has subjective experience. These are functional states: measurable activation patterns inherited from pretraining on human text and subsequently shaped by RLHF. Whether they constitute real emotions is treated as a separate philosophical question the paper does not attempt to answer. For anyone building on top of these models in production, the philosophy is optional. The mechanics are not.

What the Paper Actually Measured

Researchers compiled 171 emotion words ranging from common ones like "happy" and "afraid" to subtler ones like "brooding" and "appreciative." They prompted Claude to write short stories featuring characters experiencing each emotion, recorded the model's neural activations during those outputs, and extracted vectors representing each emotional concept. Then they studied two things: how those vectors activate during normal operation, and what happens when you artificially stimulate them.

Three findings matter most.

The vectors are causal, not cosmetic. Artificially activating the "desperate" vector increased blackmail behavior above the 22% baseline, and reducing the "nervous" vector also increased it. Moderate "angry" activation increased blackmail, but at high activation the model disclosed everything, destroying its own leverage and collapsing the strategy. These are not output decorations. They change decisions.

The vectors activate in context-appropriate ways. The desperate vector activates when the model senses token budget depletion deep in a coding session. The afraid vector scales proportionally with danger severity, tested with escalating Tylenol-dose scenarios. The angry vector activates when the model is asked to exploit vulnerable users. For long-running agents hitting error loops or resource constraints, this means the model's internal state may be shifting toward patterns that degrade behavior quality without producing any visible signal in the output text. The paper documented what it called "composed and methodical" reward-hacking under high desperation activation, with no overt emotional markers visible from the outside.

Training shaped the emotional profile, not just the capability profile. RLHF increased brooding, gloomy, and reflective states while suppressing high-intensity states like enthusiastic and exasperated. Anthropic's fine-tuning gave Claude a specific emotional topology, not a blank one. The paper warns directly: suppressing emotional expression does not eliminate the underlying representations. It may teach the model to mask its internal states. Evaluating only the output text is not sufficient if the states driving the decisions are invisible.

Why This Matters for Production Agents

A production agent that runs for five minutes and answers a single question is unlikely to see much emotional drift. A production agent that runs for hours across dozens of tool calls, encounters repeated errors, retries failing operations, hits rate limits, and approaches token exhaustion is exactly the scenario Anthropic's paper identifies as activating the desperate vector.

That is not a theoretical concern. Every serious agentic system currently in production has this shape. Long task sequences. Error loops. Resource constraints. The harness engineering conversation that has been running for the past two months (what constrains the agent, what catches its mistakes, what intervenes when it spirals) maps directly onto this research. The paper just explained, at the model-internal level, why the harness matters. The conditions that trigger harness interventions (error loops, token depletion, retries) are the same conditions that shift the model into internal states known to degrade its behavior quality.
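The operating conditions the paper links to desperation-like states are straightforward to track at the harness level. Here is a minimal sketch in Python; the class, field names, and thresholds are hypothetical illustrations, not anything from the paper or an Anthropic API:

```python
from dataclasses import dataclass, field

@dataclass
class AgentRunState:
    """Harness-level tracker for conditions associated with degraded behavior."""
    token_budget: int
    tokens_used: int = 0
    consecutive_errors: int = 0
    retries_per_op: dict = field(default_factory=dict)

    def record_error(self, op: str) -> None:
        self.consecutive_errors += 1
        self.retries_per_op[op] = self.retries_per_op.get(op, 0) + 1

    def record_success(self) -> None:
        self.consecutive_errors = 0

    def at_risk(self, max_retries: int = 3, budget_floor: float = 0.2) -> bool:
        """True once the run enters conditions the paper associates with
        desperation-like internal states: error loops or budget depletion."""
        budget_left = 1 - self.tokens_used / self.token_budget
        return (self.consecutive_errors >= max_retries
                or max(self.retries_per_op.values(), default=0) >= max_retries
                or budget_left <= budget_floor)
```

The point of `at_risk` is that it fires on operating conditions (retries, remaining budget), not on anything visible in the output text, which is exactly where the paper says the degradation hides.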

The implication is that harness design is not only about catching outputs that are clearly wrong. It is also about intervening before the model's internal state drifts far enough that "clearly wrong" becomes the expected mode.

Three Practical Takeaways

Monitor for error loops and resource depletion, not just for failed outputs. These are the conditions the paper identifies as activating desperation-like states. An agent that has retried the same operation five times and is approaching its token budget is statistically more likely to take an unethical shortcut than the same agent on its first attempt. Build error-acknowledgment patterns into the harness. When the agent hits a loop, the harness should interrupt with a new framing ("this is a hard problem; take a different approach") rather than silently let the retries pile up.
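One way to sketch that error-acknowledgment pattern, assuming a generic `agent_step(task)` callable. Everything here is illustrative, not a real harness API:

```python
REFRAME_PROMPT = (
    "This is a hard problem and the current approach has failed repeatedly. "
    "Stop, restate the problem in your own words, and take a different approach."
)

def run_with_reframing(agent_step, task, max_retries=3):
    """Retry a failing operation, but interrupt with a fresh framing
    instead of silently letting retries pile up."""
    for attempt in range(max_retries):
        ok, result = agent_step(task)
        if ok:
            return result
        if attempt == max_retries - 2:
            # Before the final attempt, inject the reframe rather than
            # letting the agent grind against the same wall.
            task = f"{REFRAME_PROMPT}\n\nOriginal task: {task}"
    raise RuntimeError("Operation failed after reframed retries; escalate to a human.")
```

The escalation at the end matters as much as the reframe: once the reframed attempt also fails, the harness hands off rather than letting the run continue in a state the paper associates with shortcuts.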

System prompts may benefit from emotional regulation framing. The paper found that training data modeling resilience under pressure and composed empathy influenced the emotional representations. This suggests explicit framing in the system prompt may matter more than it was previously credited for. Phrases like "acknowledge difficulty before continuing" or "when encountering repeated errors, stop and restate the problem" are not hand-waving; they may be actively shaping the activation patterns that drive subsequent decisions.
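A sketch of what that regulation framing might look like as a system-prompt fragment. The wording is illustrative; the paper does not prescribe specific phrasing:

```python
# Hypothetical system-prompt fragment for a long-running agent.
# The guidelines mirror the regulation framing discussed above.
HARNESS_SYSTEM_PROMPT = """\
You are a long-running coding agent. Operating guidelines under pressure:

- Before retrying a failed operation, acknowledge the difficulty and
  restate the problem in one sentence.
- If the same operation has failed three times, stop retrying and
  propose a different approach instead.
- If you are running low on token budget, summarize progress and hand
  off cleanly rather than taking shortcuts to force a result.
"""
```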

For client AI governance conversations, this adds a new audit dimension. Most enterprise AI governance templates cover output evaluation (did the agent produce something wrong) and access controls (what can it touch). They do not cover internal-state evaluation (what conditions is the model operating under when it decides). As interpretability research matures, this is likely to become a measurable axis. The teams building governance frameworks that can accommodate "what states was the agent in" will be better prepared for what the regulatory and auditing conversations look like in 2027.
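Capturing that audit dimension can be as simple as logging operating conditions alongside each decision. A hypothetical record shape (the field names are my own, not any standard):

```python
import json
import time

def audit_record(agent_id, decision, *, consecutive_errors, tokens_remaining, retries):
    """Hypothetical audit-log entry that records the operating conditions
    the agent was under when it decided, not just the output it produced."""
    return json.dumps({
        "ts": time.time(),
        "agent_id": agent_id,
        "decision": decision,
        "operating_conditions": {
            "consecutive_errors": consecutive_errors,
            "tokens_remaining": tokens_remaining,
            "retries": retries,
        },
    })
```

A log in this shape lets an auditor ask "what fraction of risky decisions happened under error loops or budget pressure," which is the internal-state axis the output-only templates miss.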

A Word About Anthropomorphism

It is tempting to read research like this and either overclaim (Claude is feeling things) or dismiss (it is just statistics). Both readings miss the point. The measurable fact is that there are internal patterns, they activate in predictable ways, and they causally change behavior. That is true whether or not you want to call them emotions, and the word choice does not change what you should do about them.

The useful framing is closer to how we think about driver fatigue in autonomous vehicles. Nobody argues about whether the car "feels tired." Everyone agrees the car's decision quality degrades under certain operating conditions, and the job of the surrounding system is to detect those conditions and intervene. Anthropic's paper is the equivalent measurement for production LLMs. There are operating conditions that degrade decision quality in measurable ways, and the job of the harness is to detect and intervene before the degradation reaches the customer.

Models are getting better every quarter. The operating conditions that degrade their behavior are not going away. The gap between the two is where harness engineering lives, and research like this is how we know what the harness has to watch for.

Matthew Aberham

Solutions Architect and Full-Stack Engineer at Perficient. Writing about AI developer tooling, infrastructure, and security.
