Evaluation — Blog — Pratik Dhanave

Jun 06, 2026 · Engineering

Multi-turn evals from first principles

Single-turn evals check one decision. Multi-turn evals check the whole trajectory. Here

MAF · Evaluation · LLM-as-Judge · Python

May 29, 2026 · Engineering

OpenTelemetry and Evaluation in Multi-Agent Workflows — the full production stack

How to instrument multi-agent systems with OpenTelemetry, propagate trace context across an in-memory bus, and build a layered evaluation pipeline — from real-time policy gates to async LLM-as-judge to SLO-based trust scoring. Everything I learned building Genie.

OpenTelemetry · Evaluation · Multi-Agent AI · Observability · Genie

May 20, 2026 · ML engineers, AI medicine

Moving Diagnostic Accuracy 42.9% → 85.7% by Changing Two Files

How a single sprint of specialty-rule work — guided by a benchmark that wasn't afraid to print embarrassing numbers — turned a 'demo respiratory differential' into a five-condition rule-based diagnostic engine.

ML Engineering · Benchmarks · Go · Evaluation
