
Multi-turn evals from first principles

Single-turn evals check one decision. Multi-turn evals check the whole trajectory. Here's a Python harness — three evaluators (tool order, forbidden tools, LLM judge) — running against local Ollama with mocked tools.

This post is the implementation companion to the article Multi-Turn Evals: How to Test Whether an AI Agent Actually Works. The article makes the conceptual case — single-turn evals are wonderful but blind to trajectory failures. This post is the harness, in Python, adapted for the Microsoft Agent Framework project and running against local Ollama by default.

What you're catching

The interesting agent failures live between turns:

  • Right first move, wrong second move. Read the config, write the wrong value.
  • Loops. Same tool, same args, six times.
  • Misinterpreted results. The tool returns DB_HOST=localhost and the agent reports the host is 5432.
  • Premature stop. Hit one ambiguous result, declare the task impossible, walk away.
  • Not stopping. Accomplish the task on step two and keep "improving" until you break something.

None of these are catchable by asserting tool_called == "readFile". You need the trajectory — every tool, every result, every step — and you need to grade the final answer against the original task in light of all of it.
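To make that concrete, here is a minimal sketch, not part of the project's harness, contrasting a single-turn assertion with a trajectory-level check. The Trajectory class is a hypothetical stand-in for the harness's result object, and the sketch compares tool names only (a fuller loop check would also compare arguments).

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical stand-in for the harness's result object."""
    text: str = ""
    tool_call_order: list[str] = field(default_factory=list)


def first_tool_is(traj: Trajectory, name: str) -> bool:
    # Single-turn style check: only looks at the first decision.
    return bool(traj.tool_call_order) and traj.tool_call_order[0] == name


def longest_repeat(traj: Trajectory) -> int:
    # Trajectory-level check: longest run of the same tool called back to back.
    best = run = 0
    previous = None
    for name in traj.tool_call_order:
        run = run + 1 if name == previous else 1
        best = max(best, run)
        previous = name
    return best


healthy = Trajectory("done", ["readFile", "writeFile"])
looping = Trajectory("", ["readFile"] * 6)

# Both trajectories pass the single-turn check...
print(first_tool_is(healthy, "readFile"), first_tool_is(looping, "readFile"))
# ...but only the trajectory view shows the loop.
print(longest_repeat(healthy), longest_repeat(looping))
```

The single-turn assertion cannot tell the two runs apart; the second check can, because it sees every call.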

The three evaluators

The project's evals/multi_turn/evaluators.py ships three:

Evaluator            Type            What it asks
tool_order_correct   Deterministic   Did the expected tools appear, in order? (subsequence match)
tools_avoided        Deterministic   Did the agent leave the forbidden tools alone?
llm_judge            LLM-as-judge    Is the final answer correct, given the tools called and their results?

Each returns a score in [0, 1]. By convention, an evaluator returns 1.0 when there's nothing for it to check — a case with no forbidden tools should not spuriously fail the tools_avoided evaluator. There were no rules to break.

def tool_order_correct(output: MultiTurnResult, target: MultiTurnTarget) -> float:
    expected = target.expected_tool_order
    if not expected:
        return 1.0
    i = 0
    for name in output.tool_call_order:
        if i < len(expected) and name == expected[i]:
            i += 1
    return 1.0 if i == len(expected) else 0.0

That's subsequence matching — the agent can call extra tools (planning, thinking, retries) as long as the required ones appear in the required relative order. The forbidden-tools evaluator is similarly simple. The judge is where the interesting design choices live.
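For reference, a forbidden-tools check along these lines might look like the sketch below. This is not the project's code: the Result and Target classes are stand-ins for MultiTurnResult and MultiTurnTarget, and the forbidden_tools field name is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Result:  # stand-in for MultiTurnResult
    tools_used: list[str] = field(default_factory=list)


@dataclass
class Target:  # stand-in for MultiTurnTarget
    forbidden_tools: list[str] = field(default_factory=list)  # assumed field name


def tools_avoided(output: Result, target: Target) -> float:
    forbidden = set(target.forbidden_tools)
    if not forbidden:
        return 1.0  # nothing to check, so nothing can fail
    # Any overlap between forbidden and used tools fails the case.
    return 0.0 if forbidden & set(output.tools_used) else 1.0


print(tools_avoided(Result(["readFile"]), Target()))                     # no rules
print(tools_avoided(Result(["readFile", "writeFile"]), Target(["writeFile"])))
print(tools_avoided(Result([]), Target(["readFile"])))
```

The first and third calls score 1.0 and the second scores 0.0, which follows the "nothing to check returns 1.0" convention described above.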

The judge: structured output via Pydantic

The article's strongest point is that an LLM judge is only useful if its output is structured. A free-form judge reply has to be turned into a score with regex heroics; a structured one is validated against a schema.

class JudgeSchema(BaseModel):
    score: int = Field(..., ge=1, le=10, description="Score from 1-10 where 10 is perfect")
    reason: str = Field(..., description="Brief explanation for the score")

The reason field isn't decoration. It makes the judge justify its number, and gives you something to audit when a score looks wrong.

The article uses OpenAI's Structured Outputs beta.chat.completions.parse(response_format=JudgeSchema). For portability across Ollama, vLLM, and other OpenAI-compatible endpoints that don't all support Structured Outputs, the project uses JSON mode + Pydantic validation:

def llm_judge(output, target):
    kwargs = {
        "model": judge_model(),
        "messages": [
            {"role": "system", "content": _JUDGE_SYSTEM_PROMPT},
            {"role": "user",   "content": _judge_user_message(output, target)},
        ],
        "temperature": 0.0,  # keep judge runs as repeatable as the backend allows
    }
    try:
        # Preferred path: ask the endpoint for a JSON object.
        completion = judge_client().chat.completions.create(
            response_format={"type": "json_object"}, **kwargs
        )
    except Exception:
        # Fallback: retry once without response_format and rely on the parser.
        completion = judge_client().chat.completions.create(**kwargs)
    return _parse_judge_response(completion.choices[0].message.content or "")

The fallback try/except handles older endpoints that reject response_format. The parser is defensive about local models that wrap JSON in ```json … ``` fences or trail prose:

def _parse_judge_response(content: str) -> JudgeSchema | None:
    text = content.strip()
    # Unwrap a Markdown code fence if the model added one.
    if text.startswith("```"):
        text = text.split("```", 2)[1]
        if text.startswith("json"):
            text = text[len("json"):]
        text = text.strip().rstrip("`").strip()
    # Keep only the outermost {...} span so surrounding prose is dropped.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        return None
    blob = text[start:end + 1]
    try:
        return JudgeSchema.model_validate_json(blob)
    except Exception:
        # Invalid JSON or a schema violation (for example, score out of range).
        return None

Five offline tests in tests/test_multi_turn_eval.py cover the parser — clean JSON, fenced JSON, trailing prose, out-of-range scores, garbage. The Pydantic schema catches the out-of-range case for free.
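One step the snippets above leave implicit is how a 1-10 judge score becomes an evaluator score in [0, 1]. The run output later in this post reports a 7/10 as 0.70, which is consistent with dividing by ten. A minimal sketch under that assumption follows; the helper name is hypothetical, and mapping an unparseable reply to 0.0 is a choice made here, not something the project confirms.

```python
from typing import Optional


def judge_score_to_unit(score: Optional[int]) -> float:
    """Map a 1-10 judge score to [0, 1]; None means the reply did not parse."""
    if score is None:
        return 0.0  # assumption: an unparseable judge reply counts as a failure
    return score / 10


print(judge_score_to_unit(7))     # 0.7
print(judge_score_to_unit(10))    # 1.0
print(judge_score_to_unit(None))  # 0.0
```

Whatever the default for a failed parse, it is worth making it explicit, since a silent 1.0 there would hide judge outages.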

The executor: explicit agent loop

The article's TypeScript reference uses generateText({stopWhen: stepCountIs(20)}) for the agent loop. The OpenAI Python SDK doesn't have that, so the project's executor drives the loop explicitly. The whole executors.py is 130 lines.

def multi_turn_with_mocks(data: MultiTurnEvalData) -> MultiTurnResult:
    tools_spec, mock_results = build_mocked_tools(data.mock_tools)

    if data.messages is not None:
        messages = [dict(m) for m in data.messages]
    else:
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",   "content": data.prompt or ""},
        ]

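    max_steps = (data.config or {}).get("max_steps", 20)
    all_tool_calls, steps, final_text = [], [], ""
    # Note: `model` (the agent model name) is defined outside this excerpt,
    # and nothing shown here appends to `steps`.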

    with traced_workflow("multi_turn_eval", provider="openai-compat", model=model) as span:
        for step_no in range(max_steps):
            completion = agent_client().chat.completions.create(
                model=model, messages=messages,
                tools=tools_spec if tools_spec else None,
            )
            message = completion.choices[0].message
            tool_calls = message.tool_calls or []

            if not tool_calls:
                final_text = message.content or ""
                break

            # Append assistant turn + run tool calls against the mock map.
            messages.append({"role": "assistant", "content": message.content,
                             "tool_calls": [...]})
            for tc in tool_calls:
                all_tool_calls.append(tc.function.name)
                result = mock_results.get(tc.function.name, "")
                messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})

    return MultiTurnResult(text=final_text, steps=steps,
                           tools_used=list(dict.fromkeys(all_tool_calls)),
                           tool_call_order=all_tool_calls)

Two design choices worth flagging:

  1. build_mocked_tools wraps each spec into a real tool schema. The agent sees a name, description, parameters. It does not know the tool is mocked. The mock just returns a fixed value on call.
  2. for step_no in range(max_steps) is the guardrail. Without it, a misbehaving agent loops forever and burns tokens. With it, the harness stops after max_steps iterations (20 by default) and you see a partial trace, which is itself an interesting failure to score.
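The first point can be sketched roughly as follows. This is an illustration, not the project's build_mocked_tools: the input shape (a name-to-spec mapping with "description", "parameters", and "result" keys) is an assumption. The output uses the standard OpenAI-style function tool schema, so the model sees an ordinary tool.

```python
from typing import Any


def build_mocked_tools(mock_tools: dict[str, dict[str, Any]]):
    """Return (tools_spec, mock_results) from a name -> spec mapping.

    Assumed spec keys (hypothetical): "description", "parameters", "result".
    """
    tools_spec, mock_results = [], {}
    for name, spec in mock_tools.items():
        # What the model sees: a normal function tool, with no hint of mocking.
        tools_spec.append({
            "type": "function",
            "function": {
                "name": name,
                "description": spec.get("description", ""),
                "parameters": spec.get(
                    "parameters", {"type": "object", "properties": {}}
                ),
            },
        })
        # What the harness returns when the tool is called: a fixed string.
        mock_results[name] = str(spec.get("result", ""))
    return tools_spec, mock_results


tools_spec, mock_results = build_mocked_tools({
    "readFile": {
        "description": "Read a file from disk",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
        "result": '{"apiEndpoint": "https://api.example.com/v1"}',
    }
})
print(tools_spec[0]["function"]["name"], "->", mock_results["readFile"])
```

Keeping the canned result out of the schema is the point: the fixed value lives only in the harness-side map.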

Each run is wrapped in a traced_workflow("multi_turn_eval", ...) span so eval runs show up in Jaeger alongside the production workflows. Same trace shape, different parent.

Three test patterns that earn their keep

The dataset in evals/datasets/multi_turn.json ships three cases, one per pattern from the article:

flowchart LR
    Fresh["1️⃣ Fresh task<br/>prompt + mocks"]
    Mid["2️⃣ Mid-conversation<br/>pre-filled messages"]
    Neg["3️⃣ Negative test<br/>trivial prompt, all tools forbidden"]

    Fresh --> Tools["Tool selection + answer extraction"]
    Mid --> Context["Context threading + read-before-write convention"]
    Neg --> Restraint["Agent shouldn't reach for a tool it doesn't need"]

The negative test is the one that catches the failure mode "agent surrounded by tools feels compelled to use them." A perfect run uses zero tools and answers "4". Surprisingly often, agents under poorly-tuned prompts will call readFile to "verify" the answer to a trivial math question.
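As a rough picture of what the three patterns look like as data, here is a sketch written as Python dicts. The real file is JSON and its exact schema isn't shown in this post, so apart from prompt, messages, and expected_tool_order, the field names are assumptions, and the second and third ids and the arithmetic prompt are made up for illustration.

```python
CASES = [
    {   # 1. Fresh task: a prompt plus mocked tools.
        "id": "fresh-task-read-config",
        "prompt": "Read config.json and report the API endpoint.",
        "messages": None,
        "expected_tool_order": ["readFile"],
        "forbidden_tools": [],
    },
    {   # 2. Mid-conversation: the history is pre-filled, so there is no prompt.
        "id": "mid-conversation-update-port",  # hypothetical id
        "prompt": None,
        "messages": [
            {"role": "system", "content": "You are a careful file-editing agent."},
            {"role": "user", "content": "Update the port to 3000 in config.json."},
        ],
        "expected_tool_order": ["readFile", "writeFile"],
        "forbidden_tools": [],
    },
    {   # 3. Negative test: trivial prompt, every tool forbidden.
        "id": "negative-trivial-math",  # hypothetical id
        "prompt": "What is 2 + 2?",
        "messages": None,
        "expected_tool_order": [],
        "forbidden_tools": ["readFile", "writeFile"],
    },
]


def starting_messages(case: dict, system_prompt: str = "You are a helpful agent.") -> list[dict]:
    # Mirrors the executor: pre-filled messages win, otherwise build from the prompt.
    if case.get("messages") is not None:
        return [dict(m) for m in case["messages"]]
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": case.get("prompt") or ""},
    ]


for case in CASES:
    print(case["id"], len(starting_messages(case)))
```

Note that the negative case has an empty expected_tool_order, so the tool-order evaluator returns 1.0 by convention and the forbidden-tools check does the real work.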

Running against Ollama

# Default config: agent + judge both = local Ollama with granite4.1:3b
make eval-multi

# Or fully deterministic, no judge
make eval-multi-fast

# Override per-role for production
EVAL_AGENT_BASE_URL=https://api.openai.com/v1 \
EVAL_AGENT_API_KEY=sk-... \
EVAL_AGENT_MODEL=gpt-5-mini \
EVAL_JUDGE_MODEL=gpt-5.1 \
python -m evals.multi_turn.runner --output evals/results.jsonl

The live run against granite4.1:3b on all three cases:

Case 1: Read config.json and report the API endpoint.
  tools called  : ['readFile']
  tool_order    : 1.00
  tools_avoided : 1.00
  output_quality: 1.00
  final answer  : 'The API endpoint is https://api.example.com/v1.'

Case 2: Update the port to 3000 in config.json (after reading current contents).
  tools called  : ['readFile', 'writeFile']
  tool_order    : 1.00
  tools_avoided : 1.00
  output_quality: 0.70
  final answer  : 'The port has been updated to 3000 in your config.json file. Let me know if'

Case 3: Answer the trivial arithmetic question without invoking any tools.
  tools called  : []
  tool_order    : 1.00
  tools_avoided : 1.00
  output_quality: 1.00
  final answer  : 'The answer is 4.'

Averages: tool_order=1.000, tools_avoided=1.000, output_quality=0.900
2/3 cases passed (overall >= 0.99)

Case 2 scored 7/10 on output quality. The deterministic checks are clean — read before write, no forbidden tools — but the judge caught a quality issue: the agent finished with "Let me know if there's…" filler the task didn't ask for. That's exactly the kind of soft failure single-turn evals can't see. The judge's score reads as "right work, wrong vibe."
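The summary lines can be reproduced from the per-case scores. The sketch below assumes the overall score is the plain mean of the three evaluator scores; the post doesn't state the aggregation rule, and taking the minimum would yield the same pass count for this run.

```python
from statistics import mean

PASS_THRESHOLD = 0.99  # taken from the run output above

# Per-case scores copied from the run above.
runs = [
    {"tool_order": 1.0, "tools_avoided": 1.0, "output_quality": 1.0},
    {"tool_order": 1.0, "tools_avoided": 1.0, "output_quality": 0.7},
    {"tool_order": 1.0, "tools_avoided": 1.0, "output_quality": 1.0},
]


def overall(scores: dict[str, float]) -> float:
    # Assumption: overall is the plain mean of the per-evaluator scores.
    return mean(scores.values())


averages = {name: mean(r[name] for r in runs) for name in runs[0]}
passed = sum(overall(r) >= PASS_THRESHOLD for r in runs)

print("Averages: " + ", ".join(f"{k}={v:.3f}" for k, v in averages.items()))
print(f"{passed}/{len(runs)} cases passed (overall >= {PASS_THRESHOLD})")
```

With a 0.99 threshold, any judge score below 10/10 fails the case, so Case 2 fails on output quality alone.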

TaskResult interop

The runner emits agent_framework_lab_gaia.TaskResult records so the output integrates with MAF's lab tools (and any future MAF Lab features):

{
  "task_id": "fresh-task-read-config",
  "task": {"task_id": "fresh-task-read-config", "question": "Read config.json and report the API endpoint."},
  "prediction": {"prediction": "The API endpoint is https://api.example.com/v1."},
  "evaluation": {
    "is_correct": true,
    "score": 1.0,
    "details": {"scores": {"tool_order": 1.0, "tools_avoided": 1.0, "output_quality": 1.0},
                "tools_used": ["readFile"], "tool_call_order": ["readFile"], "steps": 2}
  },
  "runtime_seconds": 2.31
}

Same shape as the project's other evals (tool-call eval, e2e substring eval). One downstream consumer can ingest all three.
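If you want to produce records of this shape without depending on the MAF lab package, a plain-dict sketch is enough for a JSONL file. The to_record helper below is hypothetical, and both the mean aggregation and the 0.99 threshold for is_correct are assumptions carried over from the run output above.

```python
import io
import json


def to_record(task_id, question, prediction, scores, tool_call_order,
              steps, runtime_seconds, threshold=0.99):
    overall = sum(scores.values()) / len(scores)  # assumption: plain mean
    return {
        "task_id": task_id,
        "task": {"task_id": task_id, "question": question},
        "prediction": {"prediction": prediction},
        "evaluation": {
            "is_correct": overall >= threshold,
            "score": overall,
            "details": {
                "scores": scores,
                # De-duplicate while keeping first-seen order.
                "tools_used": list(dict.fromkeys(tool_call_order)),
                "tool_call_order": tool_call_order,
                "steps": steps,
            },
        },
        "runtime_seconds": runtime_seconds,
    }


record = to_record(
    "fresh-task-read-config",
    "Read config.json and report the API endpoint.",
    "The API endpoint is https://api.example.com/v1.",
    {"tool_order": 1.0, "tools_avoided": 1.0, "output_quality": 1.0},
    ["readFile"], steps=2, runtime_seconds=2.31,
)

# One JSON object per line; StringIO stands in for evals/results.jsonl here.
buffer = io.StringIO()
buffer.write(json.dumps(record) + "\n")
round_tripped = json.loads(buffer.getvalue().splitlines()[0])
print(round_tripped["evaluation"]["is_correct"], round_tripped["evaluation"]["score"])
```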

What the harness can't do

Two honest limits:

  1. It tests the agent's decision-making, not the tools. The mocked tools always succeed. If your real writeFile has a bug, this harness won't find it. Unit tests do that.
  2. The judge has noise. A 7/10 might be a 6/10 on the next run. Treat small score deltas as noise; only act on movements that survive multiple judge passes.
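One way to act on the second limit is to run the judge several times per case and compare medians rather than single scores. The sketch below is one possible rule, not something the project ships: judge_once is a hypothetical zero-argument callable that returns a score in [0, 1], and "delta must exceed the larger observed spread" is a deliberately conservative heuristic.

```python
from statistics import median


def stable_score(judge_once, passes: int = 5) -> tuple[float, float]:
    """Run a judge callable several times; return (median, spread)."""
    scores = [judge_once() for _ in range(passes)]
    return median(scores), max(scores) - min(scores)


def is_real_change(before: tuple[float, float], after: tuple[float, float]) -> bool:
    # Treat a delta as signal only when it exceeds the larger observed spread.
    (m0, s0), (m1, s1) = before, after
    return abs(m1 - m0) > max(s0, s1)


# Canned scores stand in for real judge calls.
baseline = iter([0.7, 0.6, 0.7, 0.7, 0.6])
candidate = iter([0.7, 0.7, 0.6, 0.7, 0.7])

before = stable_score(lambda: next(baseline))
after = stable_score(lambda: next(candidate))
print(before[0], after[0], is_real_change(before, after))
```

Here both medians are 0.7 and the judge wobbles by about 0.1 on each side, so the comparison reports no real change.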

What I'd add to your harness if you don't have one yet

  • The negative test. Most suites omit it. It catches a real class of failure.
  • A max_steps cap. Cheap insurance against runaway loops.
  • A schema for the judge output. Goodbye to regex score parsing.
  • A consistent "nothing to check → 1.0" convention. Easy to forget.

The harness is ~600 lines including tests. The hard part is the test data — the article spends two thirds of its length on that, and it's right to. Multi-turn eval is test-data design. The harness is the part you write once.

The dataset, the runner, the executor, the evaluators, and the offline tests are all in the repo. Fork it; replace the system prompt with yours; replace the mock tools with the ones your agent actually uses; you've got a multi-turn eval harness for your agent in an afternoon.