Multi-turn evals from first principles
Single-turn evals check one decision. Multi-turn evals check the whole trajectory. Here
Posts about evaluation. ← All posts
Single-turn evals check one decision. Multi-turn evals check the whole trajectory. Here
How to instrument multi-agent systems with OpenTelemetry, propagate trace context across an in-memory bus, and build a layered evaluation pipeline — from real-time policy gates to async LLM-as-judge to SLO-based trust scoring. Everything I learned building Genie.
How a single sprint of specialty-rule work — guided by a benchmark that wasn't afraid to print embarrassing numbers — turned a 'demo respiratory differential' into a five-condition rule-based diagnostic engine.