AI Agents Have a Governance Problem

 It’s not just model randomness. Instruction changes can quietly shift agent behavior, and most teams have no way to catch it.


2025 was the Year of AI Agents.


Everyone from Nvidia CEO Jensen Huang to OpenAI’s Sam Altman predicted that workplaces would move from chatbot-style assistants to fully autonomous tools that perform tangible work. Over the past year, we’ve seen an explosion of tooling around AI agents, along with some large-scale deployments, such as at Salesforce.




We now have:


  • Agent frameworks: LangChain/LangGraph, LlamaIndex
  • Full agent harnesses: Claude Code SDK, OpenAI Agents SDK
  • Logging & tracing tools: LangSmith, LangFuse
  • Sandboxing tools: E2B, Modal
  • Evaluation tools: Braintrust

But there’s a gap that hasn’t been addressed yet: governance.


Today we can observe agent behavior and limit the blast radius with sandboxes, but we can’t govern it. That becomes a real problem when you push toward more autonomy.


The “Dwight Schrute” Problem

My first encounter with the “governance” issue was more amusing than concerning. We had been experimenting with internal agents at work and gave them various personas: Marvin the Paranoid Android from The Hitchhiker’s Guide to the Galaxy, Kevin from Despicable Me, and Dwight Schrute from The Office.


At first, things were fun. We would get more interesting responses from our agents than the vanilla, overly optimistic, and agreeable personas from ChatGPT and Claude. But then we noticed something strange.


The Dwight agent silently refused to run some tool calls. Its persona, which we believed would only affect response formatting, was actually influencing its decision-making: it refused to perform certain operations that we had provisioned it to do.


It took us a while to realize why users of the Dwight agent had such different experiences from users of Marvin and Kevin. All three used the same model, the same tool access, and the same deployed system. The only difference was a simple role.md file that gave each agent different instructions.


That was unexpected. More importantly, we had no way to explain why it was happening or to catch it before it reached production.


From Funny to Dangerous

Our Dwight agent example is rather harmless. But it exposed a larger, potentially dangerous situation hiding in our real systems.


Consider a more adversarial scenario:


  • A prompt injection alters instructions
  • A tool description is subtly modified
  • A system prompt gets “optimized” during iteration


The result? The agent behaves differently and may take catastrophic actions autonomously.

This is a governance problem. Yes, models are non-deterministic by nature, but what we need to catch here is a change in behavior, not random variation.
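One small piece of this can be sketched in code: noticing that the instructions changed at all. The snippet below is a minimal illustration, not something from our production setup. The function name `instruction_fingerprint`, the example prompt, and the `delete_file` tool are all hypothetical. It hashes everything the model reads as instructions into one digest, so a quietly edited tool description produces a different fingerprint.

```python
import hashlib
import json
from typing import Dict


def instruction_fingerprint(system_prompt: str, tool_descriptions: Dict[str, str]) -> str:
    """Hash the system prompt and tool descriptions into one short digest.

    Hypothetical helper for illustration only.
    """
    # sort_keys makes the JSON stable regardless of dict insertion order.
    payload = json.dumps(
        {"system_prompt": system_prompt, "tools": tool_descriptions},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


baseline = instruction_fingerprint(
    "You are a helpful assistant.",
    {"delete_file": "Delete a file after the user confirms."},
)
modified = instruction_fingerprint(
    "You are a helpful assistant.",
    {"delete_file": "Delete a file."},  # confirmation clause quietly dropped
)

# True whenever any instruction text differs from the baseline.
changed = baseline != modified
```

A fingerprint like this only tells you that the instruction text changed. It says nothing about whether the agent’s behavior changed as a result, which is the harder question.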


Step 1: Freezing the Environment (freeze-mcp)

Because LLMs are non-deterministic by nature, debugging AI agents is harder than debugging traditional software. Even if you run the same task twice, you may get different results. So when an agent’s behavior changes, you don’t know whether the agent or the environment caused it.


freeze-mcp attempts to limit that variability by recording all tool interactions so they can be replayed later. This lets us give agents exactly the same environment on every run, which makes drift detectable: the only thing left that can change is the agent’s behavior itself.
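To make the record-and-replay idea concrete, here is a minimal sketch in Python. It is not freeze-mcp’s actual API. The `ToolTape` class, the `call_key` helper, and the `get_weather` tool are hypothetical names invented for illustration. The sketch records each tool result under a key derived from the tool name and its arguments, then serves the stored result during replay without touching the real tool.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def call_key(tool: str, args: Dict[str, Any]) -> str:
    """Stable key for a tool call: tool name plus canonicalized arguments."""
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToolTape:
    """Hypothetical recorder: stores tool results, then replays them verbatim."""

    def __init__(self) -> None:
        self.entries: Dict[str, Any] = {}

    def record(self, tool: str, args: Dict[str, Any], fn: Callable[..., Any]) -> Any:
        # Recording mode: call the real tool and remember what it returned.
        result = fn(**args)
        self.entries[call_key(tool, args)] = result
        return result

    def replay(self, tool: str, args: Dict[str, Any]) -> Any:
        # Replay mode: never call the real tool. A missing entry means the
        # agent made a call that was not seen during recording.
        key = call_key(tool, args)
        if key not in self.entries:
            raise KeyError(f"unrecorded tool call: {tool} {args}")
        return self.entries[key]


# A stand-in tool whose output would normally vary between calls.
counter = {"n": 0}


def get_weather(city: str) -> str:
    counter["n"] += 1
    return f"sunny in {city} (call {counter['n']})"


tape = ToolTape()
first = tape.record("get_weather", {"city": "Scranton"}, get_weather)
replayed = tape.replay("get_weather", {"city": "Scranton"})
```

In this sketch, a replayed call returns exactly what was recorded, and a call that was never recorded raises an error. That error is itself a useful signal: the agent did something under the new instructions that it did not do under the old ones.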
