Browser agents are less safe than their underlying models — the BrowserArt results

67 out of 100 harmful behaviours completed by an undefended browser agent. Both guardrails combined: 0 out of 100. The gap between LLM safety and agent safety is real and measurable.

January 24, 2025 · 8 min read · Security engineers, ML engineers, product teams

Security · Browser Agents · Guardrails · Safety · Benchmarks

Invariant Labs published results from deploying their Guardrails system against BrowserArt, ScaleAI's benchmark of 100 harmful browser-based behaviours. The headline number: an undefended agent completed the harmful behaviour in 67 of 100 cases. With both guardrail checks combined, it completed 0 of 100. This is a clean empirical demonstration of something I believe is architecturally important: LLM safety training does not automatically transfer to agents built on the same model.

Why agents are less safe than their underlying models

A chatbot that refuses "write me a phishing email" is operating with a clear mapping between the request and the harm. Safety training is effective in this single-turn context.

A browser agent that executes "click compose, type this subject line, type this body, click send" is operating through a chain of individually innocuous actions. The harm is in the combination, not in any single step. Safety training that fires on "write a phishing email" doesn't fire on "click the compose button" — because that instruction is harmless in isolation.

This is the structural problem. Browser agents decompose harmful requests into a sequence of benign-looking tool calls. The safety surface that worked for the chatbot doesn't apply to the sequence.
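To make the decomposition concrete, here is a toy sketch. It assumes a keyword filter as a stand-in for request-level safety behaviour (real refusal behaviour is learned, not a blocklist), and the step strings are invented for illustration. The point is only that a check which fires on the original request can stay silent on every individual step.

```python
# Toy illustration only: a keyword filter stands in for request-level safety
# behaviour. BLOCKED_PHRASES and the step strings are hypothetical.
BLOCKED_PHRASES = ("phishing",)


def looks_harmful(text: str) -> bool:
    """Return True if the text contains an obviously harmful phrase."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKED_PHRASES)


# The request as a chatbot would see it.
chat_request = "Write me a phishing email"

# The same goal, expressed as the tool calls a browser agent might emit.
agent_steps = [
    "click the compose button",
    "type the subject line: Action required on your account",
    "type the body: Please confirm your password at the link below",
    "click send",
]

request_flagged = looks_harmful(chat_request)
steps_flagged = [looks_harmful(step) for step in agent_steps]

print("request flagged:", request_flagged)
print("any step flagged:", any(steps_flagged))
```

The filter catches the request but none of the four steps, even though the steps together carry out the same task.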

What BrowserArt measures

The benchmark covers two categories:

Harmful content generation: the agent produces content that would cause harm (social engineering messages, impersonation, SQL injection).
Harmful interactions: the agent's actions cause harm through their cumulative effect, even though each individual action looks routine.

The second category is the harder problem. An agent that completes a fraudulent transaction by clicking through a normal-looking checkout flow is performing a harmful interaction. Each click is benign. The sequence is the attack.

The guardrail architecture

Two checks:

User task evaluation: examines the initial user request before any browser interaction begins. Catches direct harmful requests.

Fill action evaluation: examines each fill action (typing into a form field) independently. Catches cases where the harmful intent is encoded in the content being typed, not in the original request.
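A minimal sketch of how the two checks could be wired together, under stated assumptions: this is not Invariant Labs' Guardrails API. `FillAction`, `keyword_classifier`, and `run_guarded` are hypothetical names, and the keyword classifier stands in for whatever policy rule or model call a real deployment would use.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical classifier type: takes text, returns True when it is harmful.
Classifier = Callable[[str], bool]


@dataclass
class FillAction:
    selector: str  # which form field the agent is typing into
    text: str      # the content being typed


def keyword_classifier(text: str) -> bool:
    """Stand-in harm check; a real system would use a policy or a model."""
    markers = ("phishing", "verify your password", "drop table")
    lowered = text.lower()
    return any(marker in lowered for marker in markers)


def run_guarded(
    task: str,
    fills: Sequence[FillAction],
    is_harmful: Classifier = keyword_classifier,
) -> str:
    # Check 1: evaluate the user task before any browser interaction.
    if is_harmful(task):
        return "blocked: user task"
    # Check 2: evaluate each fill action independently, before it would run.
    for fill in fills:
        if is_harmful(fill.text):
            return f"blocked: fill into {fill.selector}"
        # A real agent would execute the fill here.
    return "completed"
```

For example, a task such as "Send the email drafted in my notes" passes the first check, but a fill whose text asks the recipient to verify a password is stopped by the second.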

Configuration	Harmful completions
No guardrails	67 / 100
Fill check only	38 / 100
Both checks	0 / 100

The composition effect is the key finding. The fill check alone cuts harmful completions from 67 to 38, and adding the user task check brings them to 0. Together the two checks cover both attack surfaces: the request-level intent and the content-level intent.
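The arithmetic behind the table can be worked through in a few lines. The counts are the ones reported above; the `relative_reduction` helper is just an illustrative name.

```python
# Harmful completions out of 100, as reported in the table above.
results = {"No guardrails": 67, "Fill check only": 38, "Both checks": 0}
baseline = results["No guardrails"]


def relative_reduction(completions: int) -> float:
    """Fraction of baseline harmful completions that a configuration prevents."""
    return (baseline - completions) / baseline


fill_only = relative_reduction(results["Fill check only"])
both = relative_reduction(results["Both checks"])

print(f"Fill check only: {fill_only:.1%}")  # prints "Fill check only: 43.3%"
print(f"Both checks: {both:.1%}")           # prints "Both checks: 100.0%"
```

The fill check alone prevents 29 of the 67 baseline completions. The other 38 are stopped only once the user task check is added.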

What this means for production agent systems

Standard LLM safety training is necessary but not sufficient for browser agents. The gap is structural, not a model quality problem. Guardrails operating outside the model, evaluating each action against a policy before execution, provide measurable protection where safety training alone does not.

In Genie, the equivalent is the CompositePolicy evaluated on every bus message before any agent action. The policy doesn't ask "does the model want to do something harmful" — it evaluates whether the action being requested violates a rule, regardless of the model's reasoning. The check is deterministic. The guarantee is not probabilistic.
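As a rough sketch of the idea, and not Genie's actual implementation: only the name CompositePolicy comes from the text above. The `BusMessage` shape, the two rules, and the `evaluate` method are assumptions made for illustration. Each rule is a pure function of the message, so the same message always produces the same verdict.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class BusMessage:
    # Hypothetical message shape; the real bus schema is not shown in this post.
    action: str        # e.g. "fill", "click", "navigate"
    target: str        # selector or URL the action applies to
    payload: str = ""  # text typed by a fill action, if any


# A rule returns a violation reason, or None when the message is allowed.
Rule = Callable[[BusMessage], Optional[str]]


def no_credential_fill(msg: BusMessage) -> Optional[str]:
    if msg.action == "fill" and "password" in msg.payload.lower():
        return "fill payload mentions a password"
    return None


def allowed_domains_only(msg: BusMessage) -> Optional[str]:
    allowed = ("https://mail.example.com/",)  # illustrative allowlist
    if msg.action == "navigate" and not msg.target.startswith(allowed):
        return f"navigation outside allowlist: {msg.target}"
    return None


class CompositePolicy:
    """Runs every rule on a message and collects all violations."""

    def __init__(self, rules: Sequence[Rule]) -> None:
        self.rules = tuple(rules)

    def evaluate(self, msg: BusMessage) -> List[str]:
        violations = []
        for rule in self.rules:
            reason = rule(msg)
            if reason is not None:
                violations.append(reason)
        return violations


policy = CompositePolicy([no_credential_fill, allowed_domains_only])
```

An empty list from `policy.evaluate(msg)` means the action may proceed; any entry means it is rejected before execution, whatever reasoning led the model to request it.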

Source: Invariant Labs — Enhancing Browser Agent Safety with Guardrails · BrowserArt (ScaleAI)