Observability for multi-agent: traces and metrics

OpenTelemetry through MAF's `configure_otel_providers`, custom workflow spans, custom metrics for runs / duration / agent invocations / policy decisions, OTel Collector → Prometheus + Jaeger, and a Grafana dashboard you can actually use.

June 4, 2026 · Engineers running multi-agent systems in production

MAF · Observability · OpenTelemetry · Grafana

The reference architecture's Chapter 7 on Observability calls out four areas worth measuring in a multi-agent system: agent communication, performance, errors, and security/compliance. This post is about how each of those falls out of wiring MAF's OpenTelemetry support correctly.

The headline: MAF's auto-instrumentation gives you most of it. The project's observability.py adds three things on top — a traced_workflow context manager, a WorkflowMetrics recorder, and test helpers for an in-memory exporter — and the result is a span + metric set that survives the round-trip through an OTel Collector to Jaeger and Prometheus.

What MAF emits for free

configure_otel_providers() is one call. It picks up standard OTEL_* env vars (endpoint, protocol, service name, headers) and sets up trace, metric, and log providers globally.

def setup_observability(settings=None, *, extra_exporters=None, enable_console_exporters=None) -> bool:
    from agent_framework.observability import configure_otel_providers
    # Fall back to the project's settings loader so settings=None does not raise AttributeError.
    settings = settings or load_settings()
    configure_otel_providers(
        enable_sensitive_data=settings.enable_sensitive_data,
        enable_console_exporters=enable_console_exporters,
        exporters=extra_exporters,
    )
    return True
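The variables that configure_otel_providers() picks up are the standard OpenTelemetry ones. Here is a minimal sketch of filling in local-development defaults before calling setup_observability. The endpoint and service name are assumptions that match the local collector setup described later, and apply_otel_env_defaults is a hypothetical helper, not part of MAF.

```python
import os

# Assumed local-development defaults; real deployments should set these externally.
OTEL_ENV_DEFAULTS = {
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
    "OTEL_SERVICE_NAME": "multi-agent-maf",
}


def apply_otel_env_defaults(env=None):
    """Fill in missing OTEL_* variables without overwriting existing values.

    Returns only the entries that were actually added.
    """
    env = os.environ if env is None else env
    applied = {}
    for key, value in OTEL_ENV_DEFAULTS.items():
        if key not in env:
            env[key] = value
            applied[key] = value
    return applied
```

Because existing values are never overwritten, the same call is harmless in an environment where the variables are already set.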

After that, every MAF agent run automatically emits telemetry that follows the OpenTelemetry GenAI semantic conventions: the gen_ai.client.operation.duration and gen_ai.client.token.usage metrics, plus spans carrying the gen_ai.provider.name and gen_ai.request.model attributes. Every workflow executor emits an executor.process <name> span. Every tool call emits a span tagged with the tool name.

What it doesn't emit is a workflow-level parent span that ties an entire workflow.run(prompt) together with project-specific attributes. That's what traced_workflow is for.

A workflow-level parent span

traced_workflow is a context manager that wraps create_workflow_span from MAF, attaches OTel GenAI attributes, records workflow-level metrics, and captures exceptions:

import time
from contextlib import contextmanager

@contextmanager
def traced_workflow(name, *, provider=None, model=None, extra=None):
    from agent_framework.observability import capture_exception, create_workflow_span
    # Span attributes: the project-local workflow name plus optional GenAI attributes.
    attrs = {"multi_agent.workflow.name": name}
    if provider:
        attrs["gen_ai.provider.name"] = provider
    if model:
        attrs["gen_ai.request.model"] = model
    if extra:
        attrs.update(extra)

    metrics = get_workflow_metrics()
    metric_attrs = {"workflow": name, "provider": provider or "unknown"}
    started = time.perf_counter()
    status = "ok"
    with create_workflow_span(f"workflow.{name}", attributes=attrs) as span:
        try:
            yield span
        except Exception as exc:
            # Record the failure on the span, then re-raise so callers still see it.
            status = "error"
            capture_exception(span, exc)
            raise
        finally:
            # Runs on both success and failure, so every run is counted once.
            duration = time.perf_counter() - started
            metrics.runs.add(1, {**metric_attrs, "status": status})
            metrics.duration.record(duration, metric_attrs)

Each workflow's run_* function wraps its body in it:

async def run_sequential(prompt: str, *, client=None) -> list[Message]:
    settings = load_settings()
    model = settings.ollama_model if settings.provider == "ollama" else ...
    get_workflow_metrics().prompt_chars.record(
        len(prompt), {"workflow": "sequential", "provider": settings.provider}
    )
    with traced_workflow(
        "sequential", provider=settings.provider, model=model,
        extra={"multi_agent.prompt.length": len(prompt)},
    ) as span:
        ...

After one run you get a trace with 18 spans in it (verified live): your workflow.sequential parent + MAF's 17 auto-instrumented children.
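The status-and-duration bookkeeping inside traced_workflow can be exercised without MAF. The sketch below is a stdlib-only stand-in: RecordingMetrics and timed_run are hypothetical names that replace the real WorkflowMetrics and span creation, so only the try/except/finally pattern is demonstrated.

```python
import time
from contextlib import contextmanager


class RecordingMetrics:
    """Hypothetical stand-in for WorkflowMetrics that stores calls in lists."""

    def __init__(self):
        self.runs = []
        self.durations = []


@contextmanager
def timed_run(name, metrics):
    started = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise  # the caller still sees the original exception
    finally:
        # Executes on both paths, so each run is recorded exactly once.
        metrics.durations.append((name, time.perf_counter() - started))
        metrics.runs.append((name, status))


metrics = RecordingMetrics()

with timed_run("sequential", metrics):
    pass  # a successful run

try:
    with timed_run("sequential", metrics):
        raise RuntimeError("model unavailable")
except RuntimeError:
    pass  # the failure propagated, and was still counted

print(metrics.runs)  # [('sequential', 'ok'), ('sequential', 'error')]
```

Setting the status in the except branch and recording in finally is what keeps the run counter and the duration histogram in step with each other.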

Custom metrics worth capturing

WorkflowMetrics is a lazy singleton over MAF's get_meter(). Five instruments:

Metric	Type	Tags
`multi_agent.workflow.runs`	counter	workflow, provider, status
`multi_agent.workflow.duration`	histogram (s)	workflow, provider
`multi_agent.agent.invocations`	counter	agent, capability
`multi_agent.prompt.length`	histogram (char)	workflow, provider
`multi_agent.policy.decisions`	counter	tool, action, rule

The last one is bridged from the Agent Governance Toolkit — there's a separate post on that.
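One way to keep the instrument table and the code from drifting apart is to declare the instruments as data and create them in a loop. This is a sketch, not the project's WorkflowMetrics: InstrumentSpec, create_instruments, and get_instruments are hypothetical names, the empty unit on the counters is an assumption, and the meter only needs the create_counter / create_histogram methods that OpenTelemetry's Meter exposes. FakeMeter is a test double so the example runs without any dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstrumentSpec:
    name: str
    kind: str   # "counter" or "histogram"
    unit: str   # "" means no unit (assumed for the counters)
    tags: tuple


INSTRUMENTS = (
    InstrumentSpec("multi_agent.workflow.runs", "counter", "", ("workflow", "provider", "status")),
    InstrumentSpec("multi_agent.workflow.duration", "histogram", "s", ("workflow", "provider")),
    InstrumentSpec("multi_agent.agent.invocations", "counter", "", ("agent", "capability")),
    InstrumentSpec("multi_agent.prompt.length", "histogram", "char", ("workflow", "provider")),
    InstrumentSpec("multi_agent.policy.decisions", "counter", "", ("tool", "action", "rule")),
)


def create_instruments(meter):
    """Create every declared instrument on a meter-like object."""
    created = {}
    for spec in INSTRUMENTS:
        factory = meter.create_counter if spec.kind == "counter" else meter.create_histogram
        created[spec.name] = factory(spec.name, unit=spec.unit)
    return created


_cached = None


def get_instruments(meter_factory):
    """Lazy singleton: the meter is only requested on first use."""
    global _cached
    if _cached is None:
        _cached = create_instruments(meter_factory())
    return _cached


class FakeMeter:
    """Test double exposing the two factory methods the sketch relies on."""

    def create_counter(self, name, unit="", description=""):
        return ("counter", name, unit)

    def create_histogram(self, name, unit="", description=""):
        return ("histogram", name, unit)


instruments = get_instruments(FakeMeter)
```

Deferring the meter lookup until first use matters because the meter provider has to be configured before instruments are created against it.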

Wiring: app → collector → Prometheus + Jaeger

flowchart LR
    App[multi-agent app] -->|OTLP gRPC :4317| Coll[OTel Collector]
    Coll -->|otlp/jaeger| Jaeger[(Jaeger)]
    Coll -->|prometheus exporter :8889| Prom[(Prometheus)]
    Prom --> Graf[Grafana]
    Jaeger --> Graf
    Browser[👤] --> Graf

The collector config in docs/otel-collector.yaml declares an OTLP receiver, a batch processor, and three exporters wired into two pipelines (debug + otlp/jaeger for traces, debug + prometheus for metrics). Two real configuration choices were needed:

Jaeger 1.57 accepts OTLP natively on port 4317. The legacy 14250 endpoint isn't OTLP; pointing the collector there gives you an rpc error: code = Unimplemented once per batch, and the spans are dropped. Use endpoint: jaeger:4317 and set COLLECTOR_OTLP_ENABLED=true on the Jaeger container.
Don't double the metric prefix. My instrument names already start with multi_agent.*, so the Prometheus exporter's namespace: multi_agent setting gives you series like multi_agent_multi_agent_workflow_runs_total. Remove the namespace.
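The doubled prefix is easier to see with the name translation written out. The function below is a simplified approximation of how an OTel instrument name becomes a Prometheus series name (dots to underscores, a unit suffix, _total for counters). It is not the collector's actual implementation, and prometheus_name and UNIT_SUFFIXES are hypothetical names.

```python
# Assumed mapping for a few common unit abbreviations; unknown units pass through as-is.
UNIT_SUFFIXES = {"s": "seconds", "ms": "milliseconds", "By": "bytes"}


def prometheus_name(otel_name, kind, unit="", namespace=""):
    """Approximate the Prometheus series name for an OTel instrument."""
    base = otel_name.replace(".", "_")
    if namespace:
        base = f"{namespace}_{base}"
    suffix = UNIT_SUFFIXES.get(unit, unit)
    if suffix and not base.endswith(f"_{suffix}"):
        base = f"{base}_{suffix}"
    if kind == "counter" and not base.endswith("_total"):
        base = f"{base}_total"
    return base


# Without a namespace the instrument name already carries the prefix.
print(prometheus_name("multi_agent.workflow.runs", "counter"))
# multi_agent_workflow_runs_total

# With namespace: multi_agent the prefix is applied a second time.
print(prometheus_name("multi_agent.workflow.runs", "counter", namespace="multi_agent"))
# multi_agent_multi_agent_workflow_runs_total
```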

Verified end-to-end

A live run of make custom_graph PROMPT="hello otel" flows through to:

=== Jaeger ===
services: ['agent_framework', 'multi-agent-maf']
traces: 1 — trace f1ddb9c302a7a009...: 18 spans
  - executor.process classify
  - executor.process finalize
  - executor.process long_path
  - executor.process normalize
  - executor.process short_path
  - message.send
  - workflow.build
  - workflow.custom_graph         ← my parent span
  - workflow.run                  ← MAF's auto span

=== Prometheus ===
multi_agent_workflow_runs_total{workflow="custom_graph", status="ok", provider="unknown"} = 1
multi_agent_workflow_duration_seconds_count{workflow="custom_graph"} = 1
multi_agent_prompt_length_char_sum{workflow="custom_graph"} = 10

Custom attributes survive the collector → Jaeger pipeline:

workflow.custom_graph attributes:
  multi_agent.workflow.name   = custom_graph
  multi_agent.prompt.length   = 10
  multi_agent.outputs.count   = 1
  span.kind                   = internal
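A check like the Prometheus half of this verification can be scripted against Prometheus's HTTP API. The sketch below only builds the query URL and parses an instant-query response; the base URL with port 9090 and the sample payload are assumptions for illustration, and both function names are hypothetical.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for Prometheus's instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload):
    """Return (labels, value) pairs from an instant-query JSON response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {...labels}, "value": [timestamp, "value-as-string"]}.
    return [(sample["metric"], float(sample["value"][1])) for sample in data["result"]]


# Illustrative payload in the shape the API returns; not captured from a real run.
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"workflow": "custom_graph", "status": "ok", "provider": "unknown"},
                "value": [1717500000.0, "1"],
            }
        ],
    },
}

url = instant_query_url("http://localhost:9090", "multi_agent_workflow_runs_total")
series = parse_instant_vector(sample_payload)
```

Fetching the URL with any HTTP client and passing the decoded JSON to parse_instant_vector gives values that can be asserted on in a smoke test.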

The Grafana dashboard

docs/grafana/dashboards/multi-agent.json ships with 10 panels:

Workflow runs per second by workflow + status
Workflow duration p50 / p95
Agent invocations per second
Prompt-length distribution
Total runs (15m)
Error rate (15m) with green/orange/red thresholds at 5% and 20%
AGT policy decisions per second by tool + action
Policy denies table by rule + tool
Policy deny rate (15m) with green/orange/red thresholds at 1% and 10%
Total policy decisions (15m)

The Grafana container is provisioned with the dashboard and with Prometheus and Jaeger datasources via the docs/grafana/provisioning/ folder, so a make docker-up gives you the dashboard at http://localhost:3001 with no clicks.
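For a sense of what sits behind panels like these, here are PromQL expressions of the kind they could use, alongside the threshold logic for the error-rate panel. The queries are assumptions written against the series names shown earlier, not copied from the shipped dashboard JSON, and threshold_color is a hypothetical helper that mirrors Grafana's "value at or above the step" rule.

```python
# Assumed queries; the 5m rate window is an illustrative choice.
PANEL_QUERIES = {
    "runs_per_second": (
        "sum by (workflow, status) (rate(multi_agent_workflow_runs_total[5m]))"
    ),
    "duration_p95": (
        "histogram_quantile(0.95, sum by (le, workflow) "
        "(rate(multi_agent_workflow_duration_seconds_bucket[5m])))"
    ),
    "error_rate_15m": (
        'sum(increase(multi_agent_workflow_runs_total{status="error"}[15m])) '
        "/ sum(increase(multi_agent_workflow_runs_total[15m]))"
    ),
}


def threshold_color(ratio, warn, crit):
    """Map a ratio to green/orange/red using two ascending thresholds."""
    if ratio >= crit:
        return "red"
    if ratio >= warn:
        return "orange"
    return "green"


# Error-rate panel: thresholds at 5% and 20%.
print(threshold_color(0.03, warn=0.05, crit=0.20))  # green
print(threshold_color(0.12, warn=0.05, crit=0.20))  # orange
print(threshold_color(0.25, warn=0.05, crit=0.20))  # red
```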

Testing all of this offline

You don't need a collector to test instrumentation. MAF accepts span/metric exporters into configure_otel_providers, so the project's tests use OpenTelemetry's in-memory exporters and assert on captured spans:

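# Assumed import; the test module's own import block is not shown in this excerpt.
from opentelemetry.trace import get_tracer_provider

def test_traced_workflow_emits_span_with_attributes(exporter) -> None: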
    exporter.clear()
    with traced_workflow("unit-test-wf", provider="ollama", model="llama3.2") as span:
        span.set_attribute("multi_agent.foo", "bar")
    get_tracer_provider().force_flush(5000)

    spans = [s for s in exporter.get_finished_spans() if s.name == "workflow.unit-test-wf"]
    assert spans
    span = spans[-1]
    assert span.attributes.get("multi_agent.workflow.name") == "unit-test-wf"
    assert span.attributes.get("gen_ai.provider.name") == "ollama"
    assert span.attributes.get("gen_ai.request.model") == "llama3.2"
    assert span.attributes.get("multi_agent.foo") == "bar"

There are seven of these — three for spans, four for metrics — and they catch instrumentation regressions in ~150ms.

The set-once latch nobody mentions

One real gotcha that took an afternoon: OpenTelemetry's global meter provider has a set-once latch (_METER_PROVIDER_SET_ONCE). Tests that swap providers between modules silently fail to install the second one. The latch lives in opentelemetry.metrics._internal, not on the public opentelemetry.metrics facade. The fix is only a few lines, but you have to know where to look:

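from opentelemetry import metrics as _metrics
from opentelemetry.metrics import _internal as _otel_metrics_internal

# Reset the private set-once latch so the next set_meter_provider call takes effect.
once = getattr(_otel_metrics_internal, "_METER_PROVIDER_SET_ONCE", None)
if once is not None and hasattr(once, "_done"):
    once._done = False
# provider is the new MeterProvider the test wants installed.
_metrics.set_meter_provider(provider)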

This is only safe in test code, because it reaches into a private attribute that can change between releases. Don't reset the latch in production.

Two things worth tagging your spans with

Project-local attributes that have actually paid for themselves:

multi_agent.workflow.name — lets Jaeger filter to one workflow type in one click.
multi_agent.prompt.length — lets you correlate big spikes in latency or token usage with input size, on the same trace.

Pick a small set of project-local attributes and use them consistently. Don't try to enrich every span with every interesting attribute; an attribute only pays off when the same key appears on every span you'll later filter or group by.
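One way to enforce that consistency is to define the keys once and reject anything unregistered. This is a sketch of that design choice rather than something the project is known to do; project_attributes and ALLOWED_KEYS are hypothetical names, and the allowlist holds the three multi_agent.* keys that appear on the parent span earlier in this post.

```python
WORKFLOW_NAME = "multi_agent.workflow.name"
PROMPT_LENGTH = "multi_agent.prompt.length"
OUTPUTS_COUNT = "multi_agent.outputs.count"

# The only project-local keys spans are allowed to carry in this sketch.
ALLOWED_KEYS = frozenset({WORKFLOW_NAME, PROMPT_LENGTH, OUTPUTS_COUNT})


def project_attributes(workflow, prompt, extra=None):
    """Build span attributes, refusing keys outside the registered set."""
    attrs = {WORKFLOW_NAME: workflow, PROMPT_LENGTH: len(prompt)}
    for key, value in (extra or {}).items():
        if key not in ALLOWED_KEYS:
            raise KeyError(f"unregistered span attribute: {key}")
        attrs[key] = value
    return attrs


attrs = project_attributes("custom_graph", "hello otel", extra={OUTPUTS_COUNT: 1})
```

A typo such as multi_agent.workflow_name then fails loudly at the call site instead of producing a span that no Jaeger filter will ever match.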

What's next

The next post in the series is about the Agent Governance Toolkit. The policy_decisions counter mentioned here is bridged from AGT's govern_tool() wrapper — every allow/deny gets a metric increment, and the Grafana dashboard's policy panel shows you the deny rate over time.