What Multi-Agent Architecture Actually Looks Like in Production
Why single-LLM “AI assistants” don’t survive contact with real systems — and the seven load-bearing pieces of an architecture that does.
The pattern that keeps failing
Every week, another “AI assistant” demo appears, built on the same shape: one HTTP endpoint, one mega-prompt, one LLM call, a thin retry loop. It demos beautifully. It dies in production.
Six months in, the team is rewriting from scratch — and rediscovering, the hard way, a pattern the distributed systems community has known for two decades. The pattern has a name. Multi-agent architecture. Microsoft formalised it as the Multi-Agent Reference Architecture (MARA), and it’s what production-grade LLM systems converge on regardless of language or framework.
This article is a tour of the seven pieces that actually do load-bearing work, with examples from an open-source Go implementation called Genie. The implementation language doesn’t matter. The pattern does.
Why the monolith breaks
A single-LLM-call assistant has five structural problems that no amount of prompt engineering fixes:
- It cannot be audited. A 4,000-token system prompt that branches on input is not a specification. The compliance team’s question — “show me what this does with a PAN number” — has no answer except “depends.” Regulators won’t accept that.
- It cannot be partially upgraded. Every edit touches every capability. Every regression touches every capability.
- It cannot be differentially controlled. Different operations need different caching, rate limits, and budgets. One LLM call gives you one set of controls.
- Failure is all-or-nothing. Provider returns a 500, the whole assistant is down. There’s no degraded path because there’s nothing else in the system.
- Latency is the slowest path. You cannot parallelise what you cannot decompose.
These are properties of the architecture. Polishing won’t fix them.
The seven load-bearing pieces
Skip any one of these and you re-create the failure modes in a new disguise.
1. The protocol
A typed message envelope. Not function calls — messages.
type Message struct {
    From           string
    To             string
    Type           string
    Payload        []byte
    Classification Classification // public | internal | pii | secret
    Metadata       map[string]string
}
Type makes the system inspectable. Classification makes governance possible. Metadata carries W3C traceparent so async hops still show up as one distributed trace.
Anti-pattern: direct method calls between agents (recommender.Recommend(analysis)). You’ve just rebuilt the monolith with extra files.
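To make the envelope concrete, here is a minimal sketch of building one and serialising it. It assumes `Classification` is a string type and that JSON is the wire format; the article states neither. The `NewMessage` helper, the JSON tags, and the `analysis.completed` type name are hypothetical. The traceparent value is the example string from the W3C Trace Context specification.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Classification is assumed to be a string enum; the article only lists its values.
type Classification string

const (
	Public   Classification = "public"
	Internal Classification = "internal"
	PII      Classification = "pii"
	Secret   Classification = "secret"
)

// Message mirrors the envelope above. The JSON tags are an assumption.
type Message struct {
	From           string            `json:"from"`
	To             string            `json:"to"`
	Type           string            `json:"type"`
	Payload        []byte            `json:"payload"`
	Classification Classification    `json:"classification"`
	Metadata       map[string]string `json:"metadata"`
}

// NewMessage (hypothetical helper) builds an envelope and stores the caller's
// traceparent in Metadata so the next hop can continue the same trace.
func NewMessage(from, to, typ string, payload []byte, c Classification, traceparent string) Message {
	return Message{
		From:           from,
		To:             to,
		Type:           typ,
		Payload:        payload,
		Classification: c,
		Metadata:       map[string]string{"traceparent": traceparent},
	}
}

func main() {
	msg := NewMessage("analyzer", "recommender", "analysis.completed",
		[]byte(`{"score":0.7}`), Internal,
		"00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")

	wire, err := json.Marshal(msg)
	if err != nil {
		panic(err)
	}
	// encoding/json base64-encodes the []byte payload on the wire.
	fmt.Println(string(wire))
}
```

Because every hop is a value of this one type, a policy or a trace exporter can inspect `Type`, `Classification`, and `Metadata` without knowing anything about the agent that produced it.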
2. The registry
Agents declare capabilities at registration:
type Agent interface {
    Name() string
    Capabilities() []string
    RiskLevel() RiskClass
    HandleMessage(ctx context.Context, msg Message) ([]Message, error)
}
The orchestrator queries the registry to discover who handles what. Adding an agent is a one-line registration, not an orchestrator refactor. This is also what makes a live AI inventory endpoint real — built from the registry, it cannot drift from what’s actually running.
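A registry along these lines can be sketched in a few dozen lines. This is an illustration under assumptions, not Genie's actual code: the `Registry` type, its `Register`, `Lookup`, and `Inventory` methods, the `RiskLow` and `RiskMedium` constants, and the `echoAgent` test double are all hypothetical. Only the `Agent` interface comes from the article.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"sync"
)

type RiskClass int

const (
	RiskLow RiskClass = iota // RiskLow and RiskMedium are assumed; the article names only RiskHigh
	RiskMedium
	RiskHigh
)

// Message is trimmed to the fields this sketch needs.
type Message struct {
	Type    string
	Payload []byte
}

type Agent interface {
	Name() string
	Capabilities() []string
	RiskLevel() RiskClass
	HandleMessage(ctx context.Context, msg Message) ([]Message, error)
}

// Registry (hypothetical) indexes agents by name and by declared capability.
type Registry struct {
	mu           sync.RWMutex
	byName       map[string]Agent
	byCapability map[string][]Agent
}

func NewRegistry() *Registry {
	return &Registry{byName: map[string]Agent{}, byCapability: map[string][]Agent{}}
}

// Register adds an agent and rejects duplicate names.
func (r *Registry) Register(a Agent) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, dup := r.byName[a.Name()]; dup {
		return fmt.Errorf("agent %q already registered", a.Name())
	}
	r.byName[a.Name()] = a
	for _, c := range a.Capabilities() {
		r.byCapability[c] = append(r.byCapability[c], a)
	}
	return nil
}

// Lookup returns a copy of the agents that declared the capability.
func (r *Registry) Lookup(capability string) []Agent {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return append([]Agent(nil), r.byCapability[capability]...)
}

// Inventory lists registered agent names in sorted order.
func (r *Registry) Inventory() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	names := make([]string, 0, len(r.byName))
	for n := range r.byName {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// echoAgent is a stand-in agent used only for this example.
type echoAgent struct {
	name string
	caps []string
}

func (e echoAgent) Name() string           { return e.name }
func (e echoAgent) Capabilities() []string { return e.caps }
func (e echoAgent) RiskLevel() RiskClass   { return RiskLow }
func (e echoAgent) HandleMessage(_ context.Context, m Message) ([]Message, error) {
	return []Message{m}, nil
}

func main() {
	reg := NewRegistry()
	_ = reg.Register(echoAgent{name: "forecaster", caps: []string{"forecast"}})
	_ = reg.Register(echoAgent{name: "anomaly_detector", caps: []string{"anomaly"}})
	fmt.Println(reg.Inventory(), len(reg.Lookup("forecast")))
}
```

An inventory endpoint would simply serialise `Inventory()` (plus capabilities and risk levels), which is why it cannot drift from what is registered.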
3. The bus
Pub/sub transport so agents are decoupled in time and space. Start in-memory (a channel-backed pub/sub is ~200 lines). Swap for Kafka or NATS when you outgrow one process. The agents don’t change.
The decoupling matters more than the technology. When the analyzer fans out to forecaster, anomaly_detector, and recommender in parallel (three messages, three goroutines, results aggregated by correlation ID), latency becomes max(stages) instead of sum(stages). With three stages of similar duration, that is up to a 3× speedup with zero algorithmic work; if one stage dominates, the gain is smaller.
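The fan-out step can be sketched without a full bus. The example below is a simplification under stated assumptions: `FanOut`, `Stage`, `Result`, and the `sleeper` stand-in are hypothetical names, and a buffered channel plays the role of the reply topic. Each stage runs in its own goroutine and replies are collected by correlation ID.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Result is a hypothetical reply envelope tagged with the request's correlation ID.
type Result struct {
	CorrelationID string
	Stage         string
	Output        string
	Err           error
}

// Stage stands in for an agent's handler.
type Stage func(ctx context.Context) (string, error)

// FanOut starts every stage in its own goroutine, waits for all of them,
// and returns the replies that carry correlationID, keyed by stage name.
func FanOut(ctx context.Context, correlationID string, stages map[string]Stage) map[string]Result {
	replies := make(chan Result, len(stages)) // buffered so no sender blocks
	var wg sync.WaitGroup
	for name, stage := range stages {
		wg.Add(1)
		go func(name string, stage Stage) {
			defer wg.Done()
			out, err := stage(ctx)
			replies <- Result{CorrelationID: correlationID, Stage: name, Output: out, Err: err}
		}(name, stage)
	}
	wg.Wait()
	close(replies)

	collected := make(map[string]Result, len(stages))
	for r := range replies {
		if r.CorrelationID == correlationID {
			collected[r.Stage] = r
		}
	}
	return collected
}

// sleeper simulates an agent that takes d to answer, or stops early on cancellation.
func sleeper(d time.Duration, out string) Stage {
	return func(ctx context.Context) (string, error) {
		select {
		case <-time.After(d):
			return out, nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

// runDemo fans out to three simulated agents with different latencies.
func runDemo() map[string]Result {
	stages := map[string]Stage{
		"forecaster":       sleeper(30*time.Millisecond, "forecast"),
		"anomaly_detector": sleeper(20*time.Millisecond, "anomalies"),
		"recommender":      sleeper(10*time.Millisecond, "recommendations"),
	}
	return FanOut(context.Background(), "req-1", stages)
}

func main() {
	start := time.Now()
	results := runDemo()
	// Elapsed time tracks the slowest stage, not the sum of all three.
	fmt.Printf("%d results in roughly %v\n", len(results), time.Since(start).Round(10*time.Millisecond))
}
```

In a real deployment the replies would arrive on a bus topic rather than a local channel, which is where the correlation ID earns its keep: it is the only thing tying a reply back to the request that caused it.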
4. The orchestrator
The orchestrator subscribes agents to message types, enforces policy before dispatch, and traces every hop.
The critical word is before. Policy inside a handler can be skipped by a buggy or compromised handler. Policy in the orchestrator cannot be bypassed.
func (o *Orchestrator) dispatch(ctx context.Context, msg Message) error {
    if err := o.policy.Evaluate(ctx, msg); err != nil {
        o.incidents.Record(msg, err)
        return err
    }
    return o.bus.Publish(ctx, msg)
}
5. Governance as middleware
Every message passes through a composite policy before any agent runs: length, required metadata, RBAC, classification ceiling, residency, consent, explainability, PII regex, prompt injection, JSON schema. Each is small, independently testable, composable. Loaded from a board-approved YAML.
Why middleware, not handler code: middleware is a single audit surface. The compliance team reads one composite, not 40 handlers. The red-team corpus runs against one composite. The denial counter is one metric. Put policy inside handlers and you re-verify it every time you add an agent.
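A composite of small policies might look like the sketch below. It covers three of the checks listed above (length, required metadata, PII regex) and is not Genie's implementation: `Policy`, `Composite`, `MaxLength`, `RequireMetadata`, `NoPII`, and `check` are hypothetical names, the 4096-byte limit is arbitrary, and the PII pattern assumes that "PAN" means an identifier of five letters, four digits, and one letter.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"regexp"
)

// Message is trimmed to the fields these policies inspect.
type Message struct {
	Type     string
	Payload  []byte
	Metadata map[string]string
}

// Policy is one small, independently testable check.
type Policy interface {
	Name() string
	Evaluate(ctx context.Context, msg Message) error
}

// policyFunc adapts a plain function to the Policy interface.
type policyFunc struct {
	name string
	fn   func(context.Context, Message) error
}

func (p policyFunc) Name() string                                  { return p.name }
func (p policyFunc) Evaluate(ctx context.Context, m Message) error { return p.fn(ctx, m) }

// Composite runs policies in order and stops at the first denial.
type Composite []Policy

func (c Composite) Evaluate(ctx context.Context, msg Message) error {
	for _, p := range c {
		if err := p.Evaluate(ctx, msg); err != nil {
			return fmt.Errorf("policy %s denied message: %w", p.Name(), err)
		}
	}
	return nil
}

func MaxLength(n int) Policy {
	return policyFunc{"max_length", func(_ context.Context, m Message) error {
		if len(m.Payload) > n {
			return fmt.Errorf("payload is %d bytes, limit is %d", len(m.Payload), n)
		}
		return nil
	}}
}

func RequireMetadata(keys ...string) Policy {
	return policyFunc{"required_metadata", func(_ context.Context, m Message) error {
		for _, k := range keys {
			if m.Metadata[k] == "" {
				return fmt.Errorf("missing metadata key %q", k)
			}
		}
		return nil
	}}
}

// panLike is an assumed pattern: five letters, four digits, one letter.
var panLike = regexp.MustCompile(`\b[A-Z]{5}[0-9]{4}[A-Z]\b`)

func NoPII() Policy {
	return policyFunc{"pii_regex", func(_ context.Context, m Message) error {
		if panLike.Match(m.Payload) {
			return errors.New("payload matches a PII pattern")
		}
		return nil
	}}
}

// defaultPolicy is the single audit surface; the limits here are illustrative.
var defaultPolicy = Composite{MaxLength(4096), RequireMetadata("traceparent"), NoPII()}

// check evaluates a message against the composite.
func check(msg Message) error { return defaultPolicy.Evaluate(context.Background(), msg) }

func main() {
	msg := Message{Type: "chat.request", Payload: []byte("my id is ABCDE1234F"),
		Metadata: map[string]string{"traceparent": "set-by-gateway"}}
	fmt.Println(check(msg))
}
```

Adding a policy is one more element in the `Composite` slice, and the compliance question about a PAN has a concrete answer: the `pii_regex` policy denies the message before any agent sees it.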
6. Per-agent risk class
func (a *AMLMonitor) RiskLevel() RiskClass { return RiskHigh }
The orchestrator enforces ceilings: a RiskHigh agent cannot execute on a message that does not carry an advisor or admin role. Autonomous loops (ReAct, Reflexion) are bounded by deadline, circuit-breaker, and budget wrappers on the LLM provider, so an agent cannot accidentally run away or DoS a downstream service.
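The ceiling check itself can be very small. The sketch below assumes the caller's role reaches the orchestrator as a plain string (for example from message metadata) and that only RiskHigh is restricted; `CheckRiskCeiling`, `allowedRoles`, `ErrRiskCeiling`, and the `customer` role are hypothetical and not taken from Genie.

```go
package main

import (
	"errors"
	"fmt"
)

type RiskClass int

const (
	RiskLow RiskClass = iota // RiskLow and RiskMedium are assumed; the article names only RiskHigh
	RiskMedium
	RiskHigh
)

// allowedRoles lists, per restricted risk class, the roles permitted to invoke
// agents of that class. Classes absent from the map are unrestricted.
var allowedRoles = map[RiskClass]map[string]bool{
	RiskHigh: {"advisor": true, "admin": true},
}

// ErrRiskCeiling lets callers detect this denial with errors.Is.
var ErrRiskCeiling = errors.New("caller role below agent risk ceiling")

// CheckRiskCeiling is meant to run in the orchestrator before dispatch,
// so a handler never sees a message it is not cleared for.
func CheckRiskCeiling(agentRisk RiskClass, role string) error {
	roles, restricted := allowedRoles[agentRisk]
	if !restricted || roles[role] {
		return nil
	}
	return fmt.Errorf("%w: role %q, risk class %d", ErrRiskCeiling, role, agentRisk)
}

func main() {
	fmt.Println(CheckRiskCeiling(RiskHigh, "customer"))
	fmt.Println(CheckRiskCeiling(RiskHigh, "advisor"))
}
```

Keeping the role table as data rather than code means it could be loaded from the same approved configuration as the rest of the policy composite.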
7. Fallback agents
The piece that turns a research project into a production system.
orchestrator.SetFallback("portfolio_advisor", deterministic.PortfolioFallback{})
When the primary times out or panics, the orchestrator dispatches a deterministic fallback that needs neither LLM nor network. The user gets a degraded but truthful answer — “live analytics unavailable, here’s the cached snapshot from 14:00 IST.” Audit log records the fallback. On-call gets paged. System stays up.
CI proves it works: a drill target forces a failure and asserts the fallback fires.
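One way to express "primary with a deadline, deterministic fallback otherwise" is sketched below, together with the kind of drill a CI target could run. `CallWithFallback`, `Handler`, `drill`, and `panicDrill` are hypothetical names, and the 20 ms deadline is only there to keep the example fast; this is not Genie's `SetFallback` implementation.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Handler stands in for an agent's entry point.
type Handler func(ctx context.Context, input string) (string, error)

// CallWithFallback runs primary under a deadline. If primary errors, panics,
// or misses the deadline, it runs fallback and reports that it did so.
func CallWithFallback(ctx context.Context, timeout time.Duration, primary, fallback Handler, input string) (string, bool, error) {
	pctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	type reply struct {
		out string
		err error
	}
	done := make(chan reply, 1) // buffered: a late primary never blocks forever
	go func() {
		defer func() {
			if r := recover(); r != nil {
				done <- reply{err: fmt.Errorf("primary panicked: %v", r)}
			}
		}()
		out, err := primary(pctx, input)
		done <- reply{out, err}
	}()

	select {
	case r := <-done:
		if r.err == nil {
			return r.out, false, nil
		}
	case <-pctx.Done():
	}
	// The fallback gets the parent context, not the expired one.
	out, err := fallback(ctx, input)
	return out, true, err
}

// cached is a deterministic fallback that needs neither LLM nor network.
func cached(_ context.Context, _ string) (string, error) {
	return "live analytics unavailable; serving cached snapshot", nil
}

// drill forces a timeout and reports the answer and whether the fallback fired.
func drill() (string, bool) {
	slow := func(ctx context.Context, _ string) (string, error) {
		<-ctx.Done() // simulates a provider that never answers in time
		return "", ctx.Err()
	}
	out, used, _ := CallWithFallback(context.Background(), 20*time.Millisecond, slow, cached, "portfolio")
	return out, used
}

// panicDrill forces a panic in the primary and reports whether the fallback fired.
func panicDrill() bool {
	boom := func(context.Context, string) (string, error) { panic("boom") }
	_, used, _ := CallWithFallback(context.Background(), time.Second, boom, cached, "portfolio")
	return used
}

func main() {
	out, used := drill()
	fmt.Println(used, out)
	fmt.Println(panicDrill())
}
```

A production version would also write the audit record and page on-call where this sketch merely returns `true`.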
What this buys you, concretely
- Parallelism for free — fan-out to multiple agents, aggregate by correlation ID. Latency = max(stages), not sum.
- Surgical caching — educator caches 6h, rate_watcher caches 5m, recommender doesn’t cache. Each agent owns its policy.
- Differential observability — per-agent SLOs in Tempo. “p99 latency of anomaly_detector for enriched_transactions over 7 days” is a query, not a project.
- Composable safety — drop a new policy into the composite, it applies everywhere.
- Surgical upgrades — swap the model behind one agent; the other 39 are unaffected.
- Honest failure stories — uptime isn’t hostage to the LLM provider’s status page.
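As an illustration of the surgical caching point, here is a minimal per-agent TTL cache. The TTLs are the ones quoted in the list; everything else (`AgentCache`, `ttlByAgent`, `FakeClock`, the injected clock) is a hypothetical sketch rather than Genie's cache wrapper.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ttlByAgent holds each agent's cache lifetime. Zero, or an unknown agent, means "do not cache".
var ttlByAgent = map[string]time.Duration{
	"educator":     6 * time.Hour,
	"rate_watcher": 5 * time.Minute,
	"recommender":  0,
}

type entry struct {
	value   string
	expires time.Time
}

// AgentCache keys entries by agent and request key and expires them per agent.
type AgentCache struct {
	mu      sync.Mutex
	now     func() time.Time // injected so tests can control time
	entries map[string]entry
}

func NewAgentCache(now func() time.Time) *AgentCache {
	return &AgentCache{now: now, entries: map[string]entry{}}
}

// Put stores a value only if the agent has a positive TTL.
func (c *AgentCache) Put(agent, key, value string) {
	ttl := ttlByAgent[agent]
	if ttl <= 0 {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[agent+"/"+key] = entry{value: value, expires: c.now().Add(ttl)}
}

// Get returns a value if present and not yet expired, evicting stale entries.
func (c *AgentCache) Get(agent, key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := agent + "/" + key
	e, ok := c.entries[k]
	if !ok || !c.now().Before(e.expires) {
		delete(c.entries, k)
		return "", false
	}
	return e.value, true
}

// FakeClock is a manually advanced clock for deterministic examples and tests.
type FakeClock struct{ t time.Time }

func (f *FakeClock) Now() time.Time        { return f.t }
func (f *FakeClock) AdvanceMinutes(n int) { f.t = f.t.Add(time.Duration(n) * time.Minute) }

func main() {
	clk := &FakeClock{}
	cache := NewAgentCache(clk.Now)
	cache.Put("rate_watcher", "latest", "cached-value")
	_, fresh := cache.Get("rate_watcher", "latest")
	clk.AdvanceMinutes(6)
	_, stale := cache.Get("rate_watcher", "latest")
	fmt.Println(fresh, stale)
}
```

The table of TTLs is the whole policy: changing how long one agent's answers live touches one line and no other agent.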
What Genie ships
Genie is the open-source Go implementation. The architecture is the deliverable; the financial-assistant domain is the demonstration.
- 40+ specialist agents — canonical pipeline (ingestor → normalizer → enricher → analyzer → forecaster → anomaly → recommender → reporter) plus 26 domain specialists (fraud, AML, LCR, VaR, ALM, tax, lending, complaint triage…)
- 5 LLM providers behind one interface, with cost/cache/router/shadow/circuit/deadline/budget wrappers
- Hybrid RAG (vector + BM25 + RRF + cross-encoder rerank + HyDE + Self-RAG + Corrective RAG) plus GraphRAG entity walks
- Reasoning patterns — CoT, ReAct, Reflexion, Chain-of-Verification, Step-Back, Semantic Router
- Workflow runtime — DAG + Saga compensation + HITL approval + event-sourced log
- MCP + A2A for interop and federation
- OpenTelemetry end-to-end with OpenInference semantics
57 test packages, all green. The sandbox is the production code:
git clone https://github.com/c2siorg/genie.git
cd genie
go test ./... # 57 packages green
go run ./cmd/genie # full pipeline, in-process, no Postgres, no network
make compose-up # full stack with Postgres + Tempo + Grafana + Ollama
The takeaway
The protocol and the bus are the architecture. Everything else — the agents, the LLM, the RAG strategy, the prompts — is replaceable. The shape of how messages flow determines whether you can audit, upgrade, observe, and keep the system running.
Most “multi-agent” projects skip this part. They have multiple agents, but the agents call each other directly. They have governance, but it’s inside the agents. They have observability, but it’s per-agent and doesn’t compose. They have fallbacks, but only for the LLM call itself.
That’s not multi-agent architecture. That’s a monolith with extra files.
The seven pieces are not optional.
References
- Microsoft Multi-Agent Reference Architecture
- Anthropic Model Context Protocol
- Google Agent2Agent Protocol
- Genie — Go reference implementation (MIT)
If you’re building multi-agent systems, what’s the piece you skipped first — and what did it cost you?