The hardest part of agent debugging: finding the system prompt bug

Invariant's Santa's challenge is a clean reproduction of a recurring production failure mode — an agent that has the right tools but consistently fails to complete tasks because of an ambiguity in its instructions.

December 23, 2024 · 5 min read · Agent builders, prompt engineers

Agents · Debugging · System Prompts · Testing

Invariant Labs ran a winter challenge built around a realistic agent failure: the agent has all the tools it needs, the tools work, the code is correct, but some tasks consistently fail. The cause: an ambiguity in the system prompt.

Why system prompt bugs are hard to find

Code bugs throw exceptions. System prompt bugs produce subtly wrong behaviour that passes casual inspection and only surfaces under specific input conditions. An agent that delivers 9 out of 10 presents correctly but consistently misses one class of delivery looks fine in demos and is broken in production.

The challenge structure was: given the agent's code, a system prompt, and a test suite built with Invariant's testing library, find and fix the flaw in the system prompt so that all deliveries succeed. No code changes were allowed, because the bug is in the instructions.

The debugging workflow

The effective approach, from the winning submissions:

1. Run the test suite. Identify which scenarios fail consistently.
2. Open the failing traces in Explorer and step through the agent's reasoning for those cases.
3. Find the specific decision point where the agent diverges, usually a misinterpretation of the system prompt under specific input conditions.
4. Modify the system prompt to close the ambiguity. Rerun the tests.
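
As a rough illustration of the first step, the sketch below runs each scenario several times and keeps only the ones that fail on every run, which separates a deterministic instruction gap from ordinary flakiness. It is plain Go and does not use Invariant's testing library; the Scenario type, the ConsistentFailures helper, and the scenario names are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Scenario is a hypothetical test case: a name plus a function that runs the
// agent once and reports whether the task completed.
type Scenario struct {
	Name string
	Run  func() bool
}

// ConsistentFailures runs every scenario `runs` times and returns the names of
// the scenarios that failed on every single run, sorted alphabetically.
func ConsistentFailures(scenarios []Scenario, runs int) []string {
	var failing []string
	for _, s := range scenarios {
		failures := 0
		for i := 0; i < runs; i++ {
			if !s.Run() {
				failures++
			}
		}
		if runs > 0 && failures == runs {
			failing = append(failing, s.Name)
		}
	}
	sort.Strings(failing)
	return failing
}

func main() {
	calls := 0
	scenarios := []Scenario{
		// Stand-ins for real agent runs; a real Run would invoke the agent.
		{Name: "standard-delivery", Run: func() bool { return true }},
		{Name: "edge-case-delivery", Run: func() bool { return false }},
		{Name: "flaky-delivery", Run: func() bool { calls++; return calls%2 == 0 }},
	}
	fmt.Println(ConsistentFailures(scenarios, 5)) // [edge-case-delivery]
}
```

The flaky scenario is deliberately excluded: a failure that comes and goes is a different problem from one that reproduces every time, and only the latter points reliably at the instructions.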

The key insight: agent failures under fixed conditions are almost always traceable to a specific ambiguity or omission in the instructions. Trace inspection is the fastest path to finding it. Without trace tooling, you're guessing.
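
To make "find the decision point" concrete, here is a minimal sketch that compares a passing trace with a failing one and reports the first step where they differ. It assumes a trace can be flattened to an ordered list of step labels, which is a simplification of what a trace viewer shows; FirstDivergence and the step names are hypothetical, not part of any Invariant tooling.

```go
package main

import "fmt"

// FirstDivergence returns the index of the first step at which the two traces
// differ, or -1 if they are identical. If one trace is a strict prefix of the
// other, the length of the shorter trace is returned.
func FirstDivergence(passing, failing []string) int {
	n := len(passing)
	if len(failing) < n {
		n = len(failing)
	}
	for i := 0; i < n; i++ {
		if passing[i] != failing[i] {
			return i
		}
	}
	if len(passing) != len(failing) {
		return n
	}
	return -1
}

func main() {
	// Hypothetical step labels standing in for real tool calls in a trace.
	passing := []string{"read_list", "lookup_address", "deliver_present"}
	failing := []string{"read_list", "lookup_address", "skip_delivery"}
	fmt.Println("first divergence at step", FirstDivergence(passing, failing))
}
```

The step just before the reported index is where to start reading the agent's reasoning, since that is the last point at which both runs agreed.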

A pattern I recognise from Genie

I have seen this exact failure mode in the kyc_orchestrator and the sme_loan_workflow during development: an agent that processes most cases correctly but consistently mishandles one specific input combination. The fix is never "use a better model". It's always "close a specific ambiguity in the instruction set, then write a test that would have caught the original gap."

The test is the important part. The Genie test suite includes regression tests specifically for cases that were once bugs. They run as part of every go test ./... invocation, so a system prompt change that reintroduces a previously fixed failure fails the CI build before it reaches production.
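
A minimal sketch of what such a regression check might look like follows. It is not the Genie test suite: the Agent type, RegressionCase, FailedRegressions, the stub agent, and both prompts are hypothetical, and the stub replaces a real model call so the example runs on its own.

```go
package main

import (
	"fmt"
	"strings"
)

// Agent is a hypothetical signature: given a system prompt and an input,
// return the action the agent chose.
type Agent func(systemPrompt, input string) string

// RegressionCase pins down one input that was once mishandled.
type RegressionCase struct {
	Name  string
	Input string
	Want  string
}

// FailedRegressions returns the names of cases whose action differs from Want.
func FailedRegressions(agent Agent, systemPrompt string, cases []RegressionCase) []string {
	var failed []string
	for _, c := range cases {
		if got := agent(systemPrompt, c.Input); got != c.Want {
			failed = append(failed, c.Name)
		}
	}
	return failed
}

// stubAgent stands in for a real model call. It only escalates an unknown
// address when the prompt says so explicitly, mimicking an instruction gap.
func stubAgent(systemPrompt, input string) string {
	if strings.Contains(input, "address=unknown") &&
		strings.Contains(systemPrompt, "escalate when the address is unknown") {
		return "escalate"
	}
	return "deliver"
}

func main() {
	cases := []RegressionCase{
		{Name: "known-address", Input: "address=12 Elm St", Want: "deliver"},
		{Name: "unknown-address", Input: "address=unknown", Want: "escalate"},
	}
	ambiguous := "Deliver every present."
	fixed := "Deliver every present; escalate when the address is unknown."

	fmt.Println(FailedRegressions(stubAgent, ambiguous, cases)) // [unknown-address]
	fmt.Println(FailedRegressions(stubAgent, fixed, cases))     // []
}
```

In a real repository the same table of cases would sit inside a func TestXxx(t *testing.T) in a _test.go file, calling t.Errorf for each failed case, so that go test ./... picks it up automatically.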

Source: Invariant Labs — Santa's Agent Challenge