Why web agents fail — and what a trace reveals

Five recurring failure modes found in hundreds of agent execution traces, and the targeted fixes that produced benchmark gains of up to 16 percentage points without changing the underlying model.

July 10, 2024 · 8 min read · ML engineers, agent builders

Agents · Debugging · Benchmarks · Go

I have been spending time with Invariant Labs' analysis of web agent traces from the WebArena benchmark. The findings match what I see building Genie. Agent failures are not random — they cluster into a small number of repeating patterns, and those patterns point at specific, fixable causes.

The core insight: log everything, then read the logs

The obvious but underappreciated point: you cannot debug what you cannot see. Raw JSON traces are thousands of lines long with no navigation. The first investment that pays for itself is tooling to make traces readable — not a new model, not a better prompt, but a way to step through what the agent actually did.
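A minimal sketch of what "readable" can mean in practice: one line per step instead of a wall of JSON. It assumes a trace is a JSON list of steps with action, target, and argument keys. That schema is an assumption for illustration, not the WebArena or Invariant format.

```python
import json


def summarize_trace(raw_json: str, max_width: int = 60) -> list[str]:
    """Return one readable line per step of a JSON agent trace."""
    steps = json.loads(raw_json)
    lines = []
    for index, step in enumerate(steps, start=1):
        action = step.get("action", "?")
        target = step.get("target", "")
        # Truncate long arguments so each step stays on one line.
        argument = str(step.get("argument", ""))
        if len(argument) > max_width:
            argument = argument[: max_width - 3] + "..."
        lines.append(f"{index:>3}  {action:<8} {target:<14} {argument}")
    return lines


# Hypothetical three-step trace: one click, then the same query typed twice.
example = json.dumps([
    {"action": "click", "target": "search_box"},
    {"action": "type", "target": "search_box", "argument": "red shoes"},
    {"action": "type", "target": "search_box", "argument": "red shoes"},
])
for line in summarize_trace(example):
    print(line)
```

Even this crude view makes a repeated action stand out, which is most of what a first pass over a trace needs.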

In Genie I instrument every agent invocation with an OpenTelemetry span that carries the full message, the policy decision, the tenant ID (hashed), and the outcome. When something goes wrong, I navigate to the span, not to raw logs.
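As a rough illustration of the span payload, the sketch below builds the four attributes as a flat dictionary and hashes the tenant ID before it gets anywhere near the trace. The attribute names, the salt handling, and the helper functions are assumptions made for this example, not Genie's actual code. With the OpenTelemetry Python API, each pair would typically be attached with span.set_attribute(key, value).

```python
import hashlib


def hash_tenant_id(tenant_id: str, salt: str) -> str:
    """Return a salted SHA-256 digest so the raw tenant ID never reaches the trace."""
    return hashlib.sha256(f"{salt}:{tenant_id}".encode("utf-8")).hexdigest()


def span_attributes(message: str, policy_decision: str,
                    tenant_id: str, outcome: str, salt: str) -> dict[str, str]:
    """Build the flat key/value attributes recorded for one agent invocation."""
    return {
        "agent.message": message,                 # hypothetical attribute names
        "agent.policy_decision": policy_decision,
        "agent.tenant_id_hash": hash_tenant_id(tenant_id, salt),
        "agent.outcome": outcome,
    }


attrs = span_attributes(
    message="Categorise statement 2024-06",
    policy_decision="allow",
    tenant_id="tenant-42",
    outcome="ok",
    salt="example-salt",
)
print(sorted(attrs))
```

Hashing with a salt keeps spans joinable per tenant without exposing the identifier itself; whether SHA-256 with a static salt is strong enough depends on the threat model.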

The five failure patterns

1. Looping — the agent types into a field that already has content

The most instructive case: an agent searches for a product by typing a query into a search field. On each retry, it appends to the existing text rather than replacing it. The search string grows longer. Results get worse. The agent exhausts its step budget without recognising the loop.

Fix: modify the type action to clear field content before inserting. This is a tooling fix, not a prompt fix — the agent's reasoning was correct. The tool's behaviour was wrong.
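The difference between the two behaviours is easy to show with a stand-in for the input element. TextField and both type functions are hypothetical names for illustration; a real agent would drive a browser rather than a dataclass.

```python
from dataclasses import dataclass


@dataclass
class TextField:
    """Stand-in for a browser input element (hypothetical, for illustration)."""
    value: str = ""


def type_appending(field: TextField, text: str) -> None:
    """Original behaviour: keystrokes land after whatever is already there."""
    field.value += text


def type_replacing(field: TextField, text: str) -> None:
    """Fixed behaviour: clear the field, then insert the new text."""
    field.value = ""
    field.value += text


buggy, fixed = TextField(), TextField()
for _ in range(3):  # three retries of the same search
    type_appending(buggy, "red shoes")
    type_replacing(fixed, "red shoes")

print(repr(buggy.value))  # 'red shoesred shoesred shoes'
print(repr(fixed.value))  # 'red shoes'
```

If the agent drives the browser through Playwright, the fill method on a locator should already replace existing content rather than append to it, so the fix may amount to choosing that call over simulated keystrokes.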

2. Hallucination — the model fills gaps from training data

An agent asked to compose a contact message for a specific customer invented a name and email address that matched the type of information requested but didn't match the actual page content.

Fix: explicit prompting that prioritises retrieved data over recalled data, for example: "Use only information visible on the current page; do not infer or recall." An instruction this specific works better than a general request to be accurate.
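The prompt change can be backed by a cheap check after generation. The sketch below is a complementary guard of my own, not part of the Invariant fix: it flags any email address in the drafted message that never appears in the page text. The function name, the sample data, and the deliberately simple email pattern are all assumptions for illustration.

```python
import re

# Deliberately simple pattern; good enough to spot invented addresses.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def ungrounded_emails(draft: str, page_text: str) -> list[str]:
    """Return email addresses in the draft that never appear on the page."""
    visible = {email.lower() for email in EMAIL_PATTERN.findall(page_text)}
    return [
        email for email in EMAIL_PATTERN.findall(draft)
        if email.lower() not in visible
    ]


page = "Customer: Dana Reyes <dana.reyes@example.com>, order #1042"
grounded = "To: dana.reyes@example.com\nHello Dana, about order #1042 ..."
invented = "To: john.smith@example.com\nHello John, about your order ..."

print(ungrounded_emails(grounded, page))  # []
print(ungrounded_emails(invented, page))  # ['john.smith@example.com']
```

A non-empty result is a signal to reject the draft or send the agent back to the page, and it turns a hallucination into a failure that shows up in the trace.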

3. Environment errors — the accessibility tree lies

Dropdown menu interactions failed because the accessibility tree didn't expose options in a form the agent could use. The model correctly identified what to do; the interface between agent and browser was the problem.

Fix: a dedicated select_option action that queries the DOM directly for <select> and <option> elements. The fix is in the toolset, not the model.
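To make the idea concrete without a browser, the sketch below reads the options straight from page markup using Python's standard html.parser, standing in for a query against the live DOM. SelectOptionParser, list_select_options, and this select_option signature are illustrative assumptions; the action described in the Invariant write-up may look quite different.

```python
from html.parser import HTMLParser
from typing import Optional


class SelectOptionParser(HTMLParser):
    """Collect (value, label) pairs for every <option> inside each <select>."""

    def __init__(self) -> None:
        super().__init__()
        self.options: dict[str, list[tuple[str, str]]] = {}
        self._select: Optional[str] = None  # name of the <select> being read
        self._in_option = False
        self._value: Optional[str] = None
        self._label: list[str] = []

    def _flush_option(self) -> None:
        """Store the option currently being read, if there is one."""
        if self._in_option and self._select is not None:
            label = "".join(self._label).strip()
            # Per HTML, an option without a value attribute submits its text.
            value = self._value if self._value is not None else label
            self.options[self._select].append((value, label))
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "select":
            self._select = attributes.get("name") or ""
            self.options[self._select] = []
        elif tag == "option" and self._select is not None:
            self._flush_option()  # </option> may legally be omitted
            self._in_option = True
            self._value = attributes.get("value")
            self._label = []

    def handle_data(self, data):
        if self._in_option:
            self._label.append(data)

    def handle_endtag(self, tag):
        if tag == "option":
            self._flush_option()
        elif tag == "select":
            self._flush_option()
            self._select = None


def list_select_options(html: str) -> dict[str, list[tuple[str, str]]]:
    """Map each <select> name to its list of (value, label) pairs."""
    parser = SelectOptionParser()
    parser.feed(html)
    parser.close()
    return parser.options


def select_option(html: str, select_name: str, wanted_label: str) -> str:
    """Return the value to submit for the option whose label matches."""
    wanted = wanted_label.strip().casefold()
    for value, label in list_select_options(html).get(select_name, []):
        if label.casefold() == wanted:
            return value
    raise LookupError(f"no option {wanted_label!r} in select {select_name!r}")


page = """
<select name="sort">
  <option value="price_asc">Price: low to high</option>
  <option value="best">Best sellers</option>
</select>
"""
print(select_option(page, "sort", "best sellers"))  # best
```

Raising on a missing label is a deliberate choice here: a loud error in the trace is easier to diagnose than an action that silently does nothing.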

4. Ignoring date/filter constraints

Asked for January 2023 best-sellers, the agent returned current best-sellers. An early match on "best sellers" satisfied its stopping condition before it checked the date constraint.

Fix: prompt the agent to verify all constraints before returning a result. Foreground the filter condition rather than burying it.
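The same verification can also be expressed as a check that runs before an answer is accepted. This is a sketch under my own assumptions: the constraint keys and the idea of comparing them against the filters the agent has applied are illustrative, and the fix described above is a prompt change rather than code.

```python
def unmet_constraints(required: dict[str, str],
                      applied: dict[str, str]) -> dict[str, str]:
    """Return every required filter that the agent has not applied exactly."""
    return {
        name: value
        for name, value in required.items()
        if applied.get(name) != value
    }


# Hypothetical constraints extracted from "January 2023 best-sellers".
required = {"report": "best sellers", "period": "2023-01"}

# Early match: the agent found the report but never set the period.
print(unmet_constraints(required, {"report": "best sellers"}))
# {'period': '2023-01'}

# Only an empty result means it is safe to return an answer.
print(unmet_constraints(required, {"report": "best sellers", "period": "2023-01"}))
# {}
```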

5. Benchmark design artifacts

Not all failures are agent failures. Overly strict string matching in evaluation code can mark a correct answer as failed because of minor formatting differences. Track these separately: they affect score reporting without reflecting actual capability gaps.
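One way to track them separately is to give formatting-only misses their own label. The sketch below compares answers strictly first, then again after normalising case, punctuation, and whitespace. The normalisation rules, the three labels, and the sample answers are assumptions for illustration, not the WebArena evaluator.

```python
import re
import unicodedata


def normalize_answer(text: str) -> str:
    """Fold case, Unicode form, punctuation and whitespace before comparing."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    return " ".join(text.split())         # collapse runs of whitespace


def classify(expected: str, actual: str) -> str:
    """Label a result so formatting-only misses are tracked separately."""
    if expected == actual:
        return "pass"
    if normalize_answer(expected) == normalize_answer(actual):
        return "format_only_mismatch"
    return "fail"


print(classify("Quest Lumaflex Band", "Quest Lumaflex Band"))     # pass
print(classify("Quest Lumaflex Band", " quest lumaflex band. "))  # format_only_mismatch
print(classify("Quest Lumaflex Band", "Sprite Yoga Strap"))       # fail
```

The normalisation is lossy by design, so it suits free-text answers better than numeric ones: stripping punctuation would, for instance, make "-5" and "5" compare equal.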

The results

Task set	Baseline	After targeted fixes
WebArena OpenStreetMap	30%	46% (+16pp)
WebArena ShoppingAdmin	24%	31% (+7pp)

Sixteen percentage points on OpenStreetMap and seven on ShoppingAdmin, from three targeted tool fixes and two prompt improvements. No model change. No architecture change.

What this means for Genie

Every agent in Genie that uses a browser or form interaction — the bulk_statement_analyzer, the invoice_processor, the receipt_ocr — gets the same treatment: instrument the trace, identify the failure mode, fix the tool or the prompt, verify with a regression test. The test stays in the suite. The fix compounds.

The Invariant research confirms something I believe strongly: agent debugging is not prompt engineering. It is engineering. The failures are deterministic once you can see them. The fixes are small and precise. What's missing is the observability infrastructure to make the failures visible.

Source: Invariant Labs — What we've learned from analyzing hundreds of AI web agent traces