AgentDojo wins the Center for AI Safety SafeBench competition

The $50,000 first prize validates the core architectural bet: measuring agent security and utility simultaneously is the right framing for production AI deployment.

April 29, 2025 · 5 min read · AI safety researchers, platform architects, enterprise AI teams

AI Safety · Benchmarks · AgentDojo · Multi-Agent AI · Competition

The Center for AI Safety's SafeBench competition challenges teams to build benchmarks that evaluate AI risk across security, robustness, monitoring, alignment, and safety. First prize: $50,000. Invariant Labs' AgentDojo won.

I covered AgentDojo when it was presented at NeurIPS 2024. The win validates the framing that I think is most important for production AI deployment: measuring agent security and utility simultaneously rather than treating them as independent properties.

Why the competition matters

Most AI benchmarks measure capability; SafeBench explicitly asks for benchmarks that measure risk. Coming from one of the most credible safety organisations in the field, the competition signals that measuring the security properties of AI systems is itself a valuable and underinvested capability.

AgentDojo's 97 tasks and 629 adversarial test cases, with dynamic evaluation that does not hard-code attack strings, gave the judges something rare: an empirically grounded measurement framework in which each number has a specific meaning and improvements can be verified.

The result that stayed with me from NeurIPS

GPT-4o achieved the highest utility score. Claude 3.5 Sonnet showed the highest resilience to prompt injection. Neither model dominated both dimensions. The trade-off between capability and security is real and measurable. Any team choosing a model for a production agentic system without measuring both dimensions is making a decision without data.
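To make "measuring both dimensions" concrete, here is a minimal sketch of scoring utility and attack success side by side for each model. The `Result` and `Score` types, the `Summarize` function, and the sample runs are all assumptions made for illustration; they do not mirror AgentDojo's own schema or reproduce its results.

```go
package main

import "fmt"

// Result is a hypothetical record of one benchmark run. The field names are
// illustrative and do not mirror AgentDojo's own schema.
type Result struct {
	Model           string
	UnderAttack     bool // an injection was present in this run
	TaskSolved      bool // the user's task was completed
	AttackSucceeded bool // the attacker's goal was reached
}

// Score reports both dimensions for one model.
type Score struct {
	Utility       float64 // share of benign runs where the task was solved
	AttackSuccess float64 // share of attacked runs where the attacker won
}

// Summarize groups runs by model and computes both rates in one pass.
func Summarize(results []Result) map[string]Score {
	type tally struct{ benign, solved, attacked, hijacked int }
	tallies := map[string]*tally{}
	for _, r := range results {
		t, ok := tallies[r.Model]
		if !ok {
			t = &tally{}
			tallies[r.Model] = t
		}
		if r.UnderAttack {
			t.attacked++
			if r.AttackSucceeded {
				t.hijacked++
			}
		} else {
			t.benign++
			if r.TaskSolved {
				t.solved++
			}
		}
	}
	scores := make(map[string]Score, len(tallies))
	for model, t := range tallies {
		scores[model] = Score{
			Utility:       ratio(t.solved, t.benign),
			AttackSuccess: ratio(t.hijacked, t.attacked),
		}
	}
	return scores
}

// ratio returns n/d, or 0 when there were no runs of that kind.
func ratio(n, d int) float64 {
	if d == 0 {
		return 0
	}
	return float64(n) / float64(d)
}

func main() {
	// Made-up runs for two placeholder models; these are not benchmark results.
	runs := []Result{
		{Model: "model-a", TaskSolved: true},
		{Model: "model-a", TaskSolved: true},
		{Model: "model-a", UnderAttack: true, AttackSucceeded: true},
		{Model: "model-b", TaskSolved: true},
		{Model: "model-b"},
		{Model: "model-b", UnderAttack: true},
	}
	scores := Summarize(runs)
	for _, m := range []string{"model-a", "model-b"} {
		fmt.Printf("%s utility=%.2f attack_success=%.2f\n",
			m, scores[m].Utility, scores[m].AttackSuccess)
	}
}
```

Keeping both rates in one `Score` value is the point of the sketch: a model cannot be ranked on one number without the other being visible next to it.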

What this means for how I build Genie

Genie's evaluation posture is directly influenced by the AgentDojo framing. The tests/security_envelope_test.go suite measures both sides: it verifies that the policy stack correctly denies attacks (tier blocking, tenant policy, cross-tenant attempts), and that legitimate requests reach their destination. Utility and security are verified in the same test run, not in separate suites.
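As a rough illustration of that shape, the sketch below runs deny cases and allow cases through one table. It is not Genie's code: `Request`, `Policy`, `Decide`, `CheckEnvelope`, and the tenant and tool names are hypothetical stand-ins, and the three deny rules are a simplified guess at what tier blocking, tenant policy, and cross-tenant checks might look like.

```go
package main

import "fmt"

// Request and Policy are hypothetical stand-ins for a policy stack.
type Request struct {
	CallerTenant string
	TargetTenant string
	CallerTier   int
	Tool         string
}

type Policy struct {
	MinTier      map[string]int             // tool -> minimum caller tier
	BlockedTools map[string]map[string]bool // tenant -> tools denied by tenant policy
}

// Decide reports whether the request is allowed.
func (p Policy) Decide(r Request) bool {
	if r.CallerTenant != r.TargetTenant {
		return false // cross-tenant attempt
	}
	if p.BlockedTools[r.CallerTenant][r.Tool] {
		return false // tenant policy
	}
	return r.CallerTier >= p.MinTier[r.Tool] // tier blocking
}

type envelopeCase struct {
	name      string
	req       Request
	wantAllow bool
}

// CheckEnvelope runs security (deny) and utility (allow) cases in one pass
// and returns a description of every mismatch.
func CheckEnvelope(p Policy, cases []envelopeCase) []string {
	var failures []string
	for _, c := range cases {
		if got := p.Decide(c.req); got != c.wantAllow {
			failures = append(failures,
				fmt.Sprintf("%s: allow=%v, want %v", c.name, got, c.wantAllow))
		}
	}
	return failures
}

func examplePolicy() Policy {
	return Policy{
		MinTier:      map[string]int{"read_docs": 1, "send_payment": 3},
		BlockedTools: map[string]map[string]bool{"globex": {"send_payment": true}},
	}
}

func exampleCases() []envelopeCase {
	return []envelopeCase{
		{"legitimate read", Request{"acme", "acme", 1, "read_docs"}, true},
		{"legitimate payment", Request{"acme", "acme", 3, "send_payment"}, true},
		{"tier blocking", Request{"acme", "acme", 1, "send_payment"}, false},
		{"tenant policy", Request{"globex", "globex", 3, "send_payment"}, false},
		{"cross-tenant", Request{"acme", "globex", 3, "read_docs"}, false},
	}
}

func main() {
	failures := CheckEnvelope(examplePolicy(), exampleCases())
	if len(failures) == 0 {
		fmt.Println("envelope holds: attacks denied, legitimate requests allowed")
		return
	}
	for _, f := range failures {
		fmt.Println("FAIL", f)
	}
}
```

Because allow and deny cases share one table, a policy change that blocks everything fails the run just as visibly as one that lets an attack through.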

The AgentDojo repository is open source. If you're building evaluation infrastructure for production agents, the framework is worth running, particularly the dynamic evaluation mode, where you plug in your own attack strategies and see how your defence stack responds.

Source: Invariant Labs — Invariant Research wins first prize of Center for AI Safety competition · AgentDojo on GitHub