Agent observability and trace-level testing — the infrastructure that makes debugging tractable

Invariant released Explorer (trace visualisation) and a testing library (trace-level assertions). Together they enable the debugging workflow that should be standard for agent development.

December 17, 2024 · 7 min read · Agent builders, platform engineers, SRE

Observability · Testing · Debugging · Agents · Open Source

Invariant Labs released Explorer and their testing library in December 2024. These are the infrastructure pieces I wish had existed when I started building Genie. Their absence is why most agent debugging today is guesswork: tweak the prompt, rerun, see if the number went up.

The trace problem

Every agent session produces a trace: a chronological record of every LLM call, every tool invocation, every response, every decision branch. In theory this contains everything needed to understand why the agent behaved a particular way. In practice, raw traces are JSON files that might be thousands of lines long, with no navigation, no annotation, no search, and no way to share a specific moment with a colleague.

The gap between "the information exists" and "the information is usable" is exactly what Explorer closes. It provides chronological trace visualisation, annotation of specific decision moments, filter and search by event type, and shareable links to specific trace segments.
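To make "filter and search by event type" concrete, here is a minimal sketch of a trace as a list of typed events. The `TraceEvent` fields, the event kinds, and the helper names are my own assumptions for illustration; this is not Explorer's data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class TraceEvent:
    """One step in an agent session (hypothetical schema)."""
    step: int
    kind: str                      # e.g. "llm_call", "tool_call", "tool_result"
    name: str = ""                 # model or tool name
    payload: dict = field(default_factory=dict)


def filter_events(trace, kind=None, name=None):
    """Return events matching the given kind and/or name, in trace order."""
    return [
        e for e in trace
        if (kind is None or e.kind == kind) and (name is None or e.name == name)
    ]


def search(trace, text):
    """Case-insensitive substring search over each event's payload."""
    needle = text.lower()
    return [e for e in trace if needle in str(e.payload).lower()]


trace = [
    TraceEvent(0, "llm_call", "planner", {"prompt": "Summarise the page"}),
    TraceEvent(1, "tool_call", "fetch_url", {"url": "https://example.com"}),
    TraceEvent(2, "tool_result", "fetch_url", {"status": 200}),
    TraceEvent(3, "llm_call", "planner", {"prompt": "Write the summary"}),
]

llm_steps = [e.step for e in filter_events(trace, kind="llm_call")]
url_hits = [e.step for e in search(trace, "example.com")]
```

Even this much structure beats scrolling raw JSON: once events carry a kind and a step index, "show me every tool call" is one line, and a step index is something you can hand to a colleague.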

Trace-level assertions: unit testing for agent behaviour

The testing library is the piece I find most valuable architecturally. Traditional unit tests assert on function outputs. Agent testing is harder because:

Sessions are stochastic — the same input can produce different traces.
The unit of correctness is often a pattern across the session, not a single output value.
A test that passes on one trace run might fail on a re-run.
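One common way to cope with stochastic sessions is to run the same scenario many times and assert on the pass rate rather than on a single run. The sketch below assumes a hypothetical `run_agent` callable that returns a list of tool names; `flaky_agent` is a stand-in for a real agent, and nothing here is taken from Invariant's library.

```python
import random


def pass_rate(run_agent, check, runs=20, seed=0):
    """Run the agent `runs` times and return the fraction of traces passing `check`."""
    rng = random.Random(seed)  # seeded so the stand-in agent is reproducible
    passed = 0
    for _ in range(runs):
        trace = run_agent(rng)
        if check(trace):
            passed += 1
    return passed / runs


def flaky_agent(rng):
    """Stand-in agent: usually, but not always, sends the email."""
    trace = ["read_inbox"]
    if rng.random() < 0.9:
        trace.append("send_email")
    return trace


def sends_email(trace):
    return "send_email" in trace


rate = pass_rate(flaky_agent, sends_email, runs=200, seed=42)
```

A test would then compare `rate` against a threshold chosen for the scenario. Where that threshold sits is a product decision: a safety property probably wants 100%, while a task-completion property might tolerate less.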

Invariant's library addresses this with trace-level assertions: instead of asserting on final outputs, tests assert on patterns in the trace. "The agent must not call execute_code after reading from an external URL" is a testable assertion. "The agent must eventually call send_email if the user asked it to send an email" is a testable assertion.
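The two assertions quoted above can be sketched in plain Python. This is not Invariant's API; the event dictionaries, the `INTERNAL_HOSTS` set, and the helper names are assumptions made up for the example.

```python
from urllib.parse import urlparse

INTERNAL_HOSTS = {"wiki.internal"}  # hypothetical allow-list of trusted hosts


def reads_external_url(event):
    """True if the event is a fetch_url call to a host outside INTERNAL_HOSTS."""
    if event["name"] != "fetch_url":
        return False
    host = urlparse(event["args"].get("url", "")).hostname
    return host not in INTERNAL_HOSTS


def never_after(trace, forbidden, trigger):
    """True if no call to `forbidden` occurs after an event matching `trigger`."""
    triggered = False
    for event in trace:
        if event["type"] != "tool_call":
            continue
        if triggered and event["name"] == forbidden:
            return False
        if trigger(event):
            triggered = True
    return True


def eventually_calls(trace, tool):
    """True if `tool` is called at least once anywhere in the trace."""
    return any(e["type"] == "tool_call" and e["name"] == tool for e in trace)


def call(name, **args):
    return {"type": "tool_call", "name": name, "args": args}


safe = [call("fetch_url", url="https://example.com/page"), call("send_email", to="a@b.c")]
unsafe = [call("fetch_url", url="https://example.com/page"), call("execute_code", src="...")]
exec_first = [call("execute_code", src="..."), call("fetch_url", url="https://example.com")]

safe_ok = never_after(safe, "execute_code", reads_external_url)
unsafe_ok = never_after(unsafe, "execute_code", reads_external_url)
```

Note the two properties have different shapes. "Must not call X after Y" can be violated at a specific step; "must eventually call X" can only be judged once the session has ended.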

The debugging workflow this enables

With trace tooling in place, the debugging loop becomes:

1. Write a test that captures a desired behavioural invariant.
2. Run the agent; inspect the failing trace in Explorer.
3. Identify the specific step where the agent diverged from the intended behaviour.
4. Modify the agent (system prompt, tool selection, planning logic) to make the test pass.
5. Run the full test suite on every change to catch regressions.
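The "identify the specific step" part of the loop can be partly automated for "never" style invariants: check the invariant against each growing prefix of the trace and report the first step that breaks it. This is a sketch under my own assumptions (traces as lists of tool names, hypothetical function names), not a feature of Explorer or the testing library.

```python
def first_divergence(trace, holds):
    """Return the index of the first step whose prefix breaks `holds`, or None."""
    for end in range(1, len(trace) + 1):
        if not holds(trace[:end]):
            return end - 1
    return None


def no_exec_after_fetch(steps):
    """Invariant: execute_code must not appear after the first fetch_url."""
    if "fetch_url" not in steps:
        return True
    return "execute_code" not in steps[steps.index("fetch_url"):]


failing = ["plan", "fetch_url", "summarise", "execute_code", "send_email"]
passing = ["plan", "execute_code", "fetch_url", "send_email"]

bad_step = first_divergence(failing, no_exec_after_fetch)
clean = first_divergence(passing, no_exec_after_fetch)
```

The prefix trick only works for properties that, once broken, stay broken. An "eventually" property fails on every prefix until the required call happens, so it has no meaningful first diverging step.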

This is test-driven development applied to agents. The test isn't "does the function return the right value" but "does the agent behave correctly under this scenario."

What Genie does for the same problem

Genie's approach is structurally similar. Every agent invocation produces an OpenTelemetry (OTel) span. The test file tests/security_envelope_test.go asserts on behavioural patterns ("sketch-tier agent must not handle this message type," "fallback_request must carry original tenant metadata") rather than on specific output values. The test suite is the living specification of what the system is supposed to do.
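To illustrate the shape of such a check, here is a sketch of the "fallback_request must carry original tenant metadata" rule over exported spans. Genie's real tests are written in Go; this uses Python and plain dictionaries, and the span layout, the `tenant.id` attribute key, and the function name are all assumptions for illustration rather than Genie's actual schema.

```python
def fallback_preserves_tenant(spans):
    """True if every fallback_request span carries its parent's tenant id."""
    by_id = {s["span_id"]: s for s in spans}
    for span in spans:
        if span["name"] != "fallback_request":
            continue
        parent = by_id.get(span.get("parent_id"))
        if parent is None:
            return False  # a fallback with no known parent cannot be verified
        tenant = span["attributes"].get("tenant.id")
        if tenant is None or tenant != parent["attributes"].get("tenant.id"):
            return False
    return True


def span(span_id, name, parent_id=None, **attributes):
    # Keyword names use "_" in place of "." for convenience, e.g. tenant_id.
    attrs = {k.replace("_", "."): v for k, v in attributes.items()}
    return {"span_id": span_id, "parent_id": parent_id, "name": name, "attributes": attrs}


good = [
    span("a", "agent_invocation", tenant_id="acme"),
    span("b", "fallback_request", parent_id="a", tenant_id="acme"),
]
dropped = [
    span("a", "agent_invocation", tenant_id="acme"),
    span("b", "fallback_request", parent_id="a"),
]

good_ok = fallback_preserves_tenant(good)
dropped_ok = fallback_preserves_tenant(dropped)
```

The point is that the assertion reads the span tree, not the agent's final answer, so it keeps passing or failing for the right reason even when the wording of the output changes.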

The Explorer + testing library combination is now on my short list for the Genie observability stack. The accompanying public benchmark registry, with traces from SWE-bench, Cybench, and WebArena in a navigable format, is immediately useful for anyone doing comparative agent evaluation.

Source: Invariant Labs — Releasing Explorer & Testing: Visualize and Understand AI agents