Observability for multi-agent: traces and metrics
OpenTelemetry through MAF's `configure_otel_providers`, custom workflow spans, custom metrics for runs / duration / agent invocations / policy decisions, OTel Collector → Prometheus + Jaeger, and a Grafana dashboard you can actually use.
The reference architecture's Chapter 7 on Observability calls out four areas worth measuring in a multi-agent system: agent communication, performance, errors, and security/compliance. This post is about how each of those falls out of wiring MAF's OpenTelemetry support correctly.
The headline: MAF's auto-instrumentation gives you most of it. The project's observability.py adds three things on top — a traced_workflow context manager, a WorkflowMetrics recorder, and test helpers for an in-memory exporter — and the result is a span + metric set that survives the round-trip through an OTel Collector to Jaeger and Prometheus.
What MAF emits for free
configure_otel_providers() is one call. It picks up standard OTEL_* env vars (endpoint, protocol, service name, headers) and sets up trace, metric, and log providers globally.
def setup_observability(settings=None, *, extra_exporters=None, enable_console_exporters=None) -> bool:
    from agent_framework.observability import configure_otel_providers

    # Fall back to the project's settings loader so settings=None does not crash below.
    settings = settings or load_settings()
    configure_otel_providers(
        enable_sensitive_data=settings.enable_sensitive_data,
        enable_console_exporters=enable_console_exporters,
        exporters=extra_exporters,
    )
    return True
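For a local stack, the standard variables can be set before `setup_observability()` runs. This is a minimal sketch, assuming a collector listening on `localhost:4317` and the `multi-agent-maf` service name that shows up in the Jaeger output later in this post. `otel_env_defaults` and `apply_otel_env` are hypothetical helper names, not part of MAF.

```python
import os


def otel_env_defaults(
    endpoint: str = "http://localhost:4317",  # assumed local collector address
    service_name: str = "multi-agent-maf",
) -> dict:
    """Return standard OpenTelemetry env vars for an OTLP/gRPC exporter."""
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
        "OTEL_SERVICE_NAME": service_name,
    }


def apply_otel_env(defaults: dict, environ=os.environ) -> None:
    """Apply defaults without overriding values the operator already set."""
    for key, value in defaults.items():
        environ.setdefault(key, value)
```

Using `setdefault` keeps anything already exported in the shell or container in charge, so the helper only fills gaps.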
After that, every MAF agent run emits OpenTelemetry GenAI semantic-convention telemetry automatically: spans carrying attributes such as gen_ai.provider.name and gen_ai.request.model, plus the gen_ai.client.operation.duration and gen_ai.client.token.usage metrics. Every executor in a workflow emits an executor.process <name> span. Every tool call emits a span tagged with the tool name.
What it doesn't emit is a workflow-level parent span that ties an entire workflow.run(prompt) together with project-specific attributes. That's what traced_workflow is for.
A workflow-level parent span
traced_workflow is a context manager that wraps create_workflow_span from MAF, attaches OTel GenAI attributes, records workflow-level metrics, and captures exceptions:
import time
from contextlib import contextmanager

@contextmanager
def traced_workflow(name, *, provider=None, model=None, extra=None):
    from agent_framework.observability import capture_exception, create_workflow_span

    # Span attributes: the project-local name plus OTel GenAI fields when known.
    attrs = {"multi_agent.workflow.name": name}
    if provider:
        attrs["gen_ai.provider.name"] = provider
    if model:
        attrs["gen_ai.request.model"] = model
    if extra:
        attrs.update(extra)
    metrics = get_workflow_metrics()
    metric_attrs = {"workflow": name, "provider": provider or "unknown"}
    started = time.perf_counter()
    status = "ok"
    with create_workflow_span(f"workflow.{name}", attributes=attrs) as span:
        try:
            yield span
        except Exception as exc:
            status = "error"
            capture_exception(span, exc)
            raise
        finally:
            # Recorded on success and failure alike.
            duration = time.perf_counter() - started
            metrics.runs.add(1, {**metric_attrs, "status": status})
            metrics.duration.record(duration, metric_attrs)
Each workflow's run_* function wraps its body in it:
async def run_sequential(prompt: str, *, client=None) -> list[Message]:
    settings = load_settings()
    model = settings.ollama_model if settings.provider == "ollama" else ...
    get_workflow_metrics().prompt_chars.record(
        len(prompt), {"workflow": "sequential", "provider": settings.provider}
    )
    with traced_workflow(
        "sequential", provider=settings.provider, model=model,
        extra={"multi_agent.prompt.length": len(prompt)},
    ) as span:
        ...
After one run you get a trace with 18 spans in it (verified live): your workflow.sequential parent + MAF's 17 auto-instrumented children.
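The part of traced_workflow that is easy to get wrong is the try/except/finally ordering: the run counter and duration are recorded whether the body succeeds or raises, and the exception still propagates. Here is a dependency-free sketch of just that control flow. `FakeMetrics` and `timed_workflow` are hypothetical stand-ins for the MAF span and meter, not project code.

```python
import time
from contextlib import contextmanager


class FakeMetrics:
    """Hypothetical stand-in that stores recordings in plain lists."""

    def __init__(self):
        self.runs = []       # (count, attributes)
        self.durations = []  # (seconds, attributes)


@contextmanager
def timed_workflow(name, metrics, provider=None):
    attrs = {"workflow": name, "provider": provider or "unknown"}
    started = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"  # flag the failure, then let it propagate
        raise
    finally:
        # Runs on both paths, so failed workflows are counted too.
        metrics.runs.append((1, {**attrs, "status": status}))
        metrics.durations.append((time.perf_counter() - started, attrs))


metrics = FakeMetrics()
with timed_workflow("sequential", metrics, provider="ollama"):
    pass
try:
    with timed_workflow("sequential", metrics):
        raise ValueError("boom")
except ValueError:
    pass

print([attrs["status"] for _, attrs in metrics.runs])  # prints ['ok', 'error']
```

Keeping `status` out of the duration attributes, as the real helper does, means the histogram has one series per workflow and provider, not one per outcome.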
Custom metrics worth capturing
WorkflowMetrics is a lazy singleton over MAF's get_meter(). Five instruments:
| Metric | Type | Tags |
|---|---|---|
| multi_agent.workflow.runs | counter | workflow, provider, status |
| multi_agent.workflow.duration | histogram (s) | workflow, provider |
| multi_agent.agent.invocations | counter | agent, capability |
| multi_agent.prompt.length | histogram (char) | workflow, provider |
| multi_agent.policy.decisions | counter | tool, action, rule |
The last one is bridged from the Agent Governance Toolkit — there's a separate post on that.
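Prometheus creates one series per distinct label set, so it pays to keep each instrument's tags exactly as listed. One way to enforce that is to declare the table as data and check attributes before recording. This sketch is not how the project's WorkflowMetrics is implemented; `InstrumentSpec`, `SPECS` and `check_tags` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstrumentSpec:
    name: str
    kind: str        # "counter" or "histogram"
    tags: frozenset  # attribute keys every recording must carry


SPECS = {
    spec.name: spec
    for spec in (
        InstrumentSpec("multi_agent.workflow.runs", "counter",
                       frozenset({"workflow", "provider", "status"})),
        InstrumentSpec("multi_agent.workflow.duration", "histogram",
                       frozenset({"workflow", "provider"})),
        InstrumentSpec("multi_agent.agent.invocations", "counter",
                       frozenset({"agent", "capability"})),
        InstrumentSpec("multi_agent.prompt.length", "histogram",
                       frozenset({"workflow", "provider"})),
        InstrumentSpec("multi_agent.policy.decisions", "counter",
                       frozenset({"tool", "action", "rule"})),
    )
}


def check_tags(metric: str, attributes: dict) -> None:
    """Raise ValueError unless the attribute keys match the declared tags exactly."""
    expected = SPECS[metric].tags
    actual = frozenset(attributes)
    if actual != expected:
        missing = sorted(expected - actual)
        extra = sorted(actual - expected)
        raise ValueError(f"{metric}: missing={missing} extra={extra}")
```

A check like this is cheap enough to run in unit tests, where a stray tag fails loudly before it shows up as a surprise series in Prometheus.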
Wiring: app → collector → Prometheus + Jaeger
flowchart LR
App[multi-agent app] -->|OTLP gRPC :4317| Coll[OTel Collector]
Coll -->|otlp/jaeger| Jaeger[(Jaeger)]
Coll -->|prometheus exporter :8889| Prom[(Prometheus)]
Prom --> Graf[Grafana]
Jaeger --> Graf
Browser[👤] --> Graf
The collector config in docs/otel-collector.yaml declares an OTLP receiver, a batch processor, and two exporters (debug + otlp/jaeger for traces, debug + prometheus for metrics). Two real configuration choices were needed:
- Jaeger 1.57 accepts OTLP natively on port 4317. The legacy `14250` endpoint isn't OTLP; pointing the collector there gives you an `rpc error: code = Unimplemented` once per batch and dropped spans. Use `endpoint: jaeger:4317` and set `COLLECTOR_OTLP_ENABLED=true` on the Jaeger container.
- Don't double the metric prefix. My instrument names already start with `multi_agent.*`, so the Prometheus exporter's `namespace: multi_agent` setting gives you series like `multi_agent_multi_agent_workflow_runs_total`. Remove the namespace.
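The doubled prefix follows from how OTel metric names are translated for Prometheus: dots become underscores, the unit is appended, counters gain `_total`, and any configured namespace is prepended. Below is a simplified sketch of that mapping. The real exporter handles more units and edge cases, and `prometheus_name` and `UNIT_SUFFIX` are illustrative names, not collector code.

```python
# Only the unit this post needs is expanded; other units pass through unchanged.
UNIT_SUFFIX = {"s": "seconds"}


def prometheus_name(otel_name, kind, unit="", namespace=""):
    """Approximate the Prometheus series base name for an OTel instrument."""
    parts = []
    if namespace:
        parts.append(namespace)
    parts.append(otel_name.replace(".", "_"))
    if unit:
        parts.append(UNIT_SUFFIX.get(unit, unit))
    if kind == "counter":
        parts.append("total")
    return "_".join(parts)


print(prometheus_name("multi_agent.workflow.runs", "counter", namespace="multi_agent"))
# multi_agent_multi_agent_workflow_runs_total
print(prometheus_name("multi_agent.workflow.runs", "counter"))
# multi_agent_workflow_runs_total
```

With the namespace removed, the same function yields `multi_agent_workflow_duration_seconds` and `multi_agent_prompt_length_char` for the two histograms, which is the base name of the series shown in the verified output.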
Verified end-to-end
A live run of make custom_graph PROMPT="hello otel" flows through to:
=== Jaeger ===
services: ['agent_framework', 'multi-agent-maf']
traces: 1 — trace f1ddb9c302a7a009...: 18 spans
- executor.process classify
- executor.process finalize
- executor.process long_path
- executor.process normalize
- executor.process short_path
- message.send
- workflow.build
- workflow.custom_graph ← my parent span
- workflow.run ← MAF's auto span
=== Prometheus ===
multi_agent_workflow_runs_total{workflow="custom_graph", status="ok", provider="unknown"} = 1
multi_agent_workflow_duration_seconds_count{workflow="custom_graph"} = 1
multi_agent_prompt_length_char_sum{workflow="custom_graph"} = 8
Custom attributes survive the collector → Jaeger pipeline:
workflow.custom_graph attributes:
multi_agent.workflow.name = custom_graph
multi_agent.prompt.length = 10
multi_agent.outputs.count = 1
span.kind = internal
The Grafana dashboard
docs/grafana/dashboards/multi-agent.json ships with 10 panels:
- Workflow runs per second by workflow + status
- Workflow duration p50 / p95
- Agent invocations per second
- Prompt-length distribution
- Total runs (15m)
- Error rate (15m) with green/orange/red thresholds at 5% and 20%
- AGT policy decisions per second by tool + action
- Policy denies table by rule + tool
- Policy deny rate (15m) with green/orange/red thresholds at 1% and 10%
- Total policy decisions (15m)
The Grafana container is provisioned with the dashboard and with Prometheus and Jaeger datasources via the docs/grafana/provisioning/ folder, so make docker-up gives you the dashboard at http://localhost:3001 with no clicks.
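The dashboard JSON isn't reproduced here, but the error-rate and latency panels can be expressed roughly as follows. The PromQL strings are assumptions built from the series names shown in the verified output (the `_bucket` series is the standard Prometheus histogram layout), and `threshold_color` is a hypothetical helper that mirrors the threshold behaviour of the stat panels.

```python
# Assumed queries; the shipped dashboard may phrase them differently.
ERROR_RATE_15M = (
    'sum(increase(multi_agent_workflow_runs_total{status="error"}[15m]))'
    " / sum(increase(multi_agent_workflow_runs_total[15m]))"
)
DURATION_P95 = (
    "histogram_quantile(0.95, sum by (le, workflow) "
    "(rate(multi_agent_workflow_duration_seconds_bucket[5m])))"
)


def threshold_color(rate, warn=0.05, crit=0.20):
    """Map a ratio to the panel colour: green below warn, red at or above crit."""
    if rate >= crit:
        return "red"
    if rate >= warn:
        return "orange"
    return "green"


print(threshold_color(0.02))                        # green (error-rate panel)
print(threshold_color(0.02, warn=0.01, crit=0.10))  # orange (policy deny-rate panel)
```

The same 2% reads green on the error-rate panel and orange on the deny-rate panel, which is the point of giving policy denies tighter thresholds.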
Testing all of this offline
You don't need a collector to test instrumentation. MAF accepts span/metric exporters into configure_otel_providers, so the project's tests use OpenTelemetry's in-memory exporters and assert on captured spans:
def test_traced_workflow_emits_span_with_attributes(exporter) -> None:
    exporter.clear()
    with traced_workflow("unit-test-wf", provider="ollama", model="llama3.2") as span:
        span.set_attribute("multi_agent.foo", "bar")
    get_tracer_provider().force_flush(5000)
    spans = [s for s in exporter.get_finished_spans() if s.name == "workflow.unit-test-wf"]
    assert spans
    span = spans[-1]
    assert span.attributes.get("multi_agent.workflow.name") == "unit-test-wf"
    assert span.attributes.get("gen_ai.provider.name") == "ollama"
    assert span.attributes.get("gen_ai.request.model") == "llama3.2"
    assert span.attributes.get("multi_agent.foo") == "bar"
There are seven of these — three for spans, four for metrics — and they catch instrumentation regressions in ~150ms.
The set-once latch nobody mentions
One real gotcha that took an afternoon: OpenTelemetry's global meter provider has a set-once latch (_METER_PROVIDER_SET_ONCE). Tests that swap providers between modules silently fail to install the second one. The latch lives in opentelemetry.metrics._internal, not on the public opentelemetry.metrics facade. The fix is only a few lines, but you have to know where to look:
from opentelemetry import metrics as _metrics
from opentelemetry.metrics import _internal as _otel_metrics_internal

# Re-arm the private set-once latch so a second provider can be installed.
once = getattr(_otel_metrics_internal, "_METER_PROVIDER_SET_ONCE", None)
if once is not None and hasattr(once, "_done"):
    once._done = False
_metrics.set_meter_provider(provider)
That's only safe in test code. Don't do that in production.
Two things worth tagging your spans with
Project-local attributes that have actually paid for themselves:
- multi_agent.workflow.name: lets Jaeger filter to one workflow type in one click.
- multi_agent.prompt.length: lets you correlate big spikes in latency or token usage with input size, on the same trace.
Pick a small set of project-local labels and use them consistently. Don't try to enrich every span with every interesting attribute; an attribute earns its keep only when the same field appears on all the spans you'll group or filter by later.
What's next
The next post in the series is about the Agent Governance Toolkit. The policy_decisions counter mentioned here is bridged from AGT's govern_tool() wrapper — every allow/deny gets a metric increment, and the Grafana dashboard's policy panel shows you the deny rate over time.