Twelve months of Genie in production — what survived, what we rewrote, what we deleted

The shape of the year

Genie started as a reference implementation of the MARA pattern in Go. Twelve months in, it has:

~30K LoC of Go
100+ test packages
15-20 active agents
7+ platform packages (RAG, safety, policy DSL, audit, etc.)
Documentation that’s actually maintained
A handful of production deployments (open-source means anyone can run it; we know about a few)

Here’s what held up and what didn’t.

What survived unchanged

The protocol package. pkg/protocol defines Message and the role constants. It hasn’t changed in 12 months because the contract is the only stable thing in a multi-agent system. Every agent depends on it; every package speaks it. Changing it would be a cross-cutting refactor; not changing it has been the right call.

The orchestrator’s hook surface. OnPolicyDeny, OnAgentError. Two hooks. We were tempted to add more (OnRetry, OnTimeout, OnFallback). Each time we resisted because the existing hooks composed to express the same thing. Twelve months later, two hooks still suffice.

The risk class taxonomy. RiskLow, RiskMedium, RiskHigh. Three levels; clear semantics; we were never tempted to extend it. Compare that with projects that drift to 7-tier or 10-tier risk schemes.
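
A three-level taxonomy fits in a few lines. The sketch below is one plausible way to model it: the constant names are Genie's, while the int-backed type, the ordering, and the NeedsApproval policy are assumptions for illustration only.

```go
package main

import "fmt"

// RiskClass is an ordered enum. The constant names come from the post;
// the underlying type and the ordering are assumptions.
type RiskClass int

const (
	RiskLow RiskClass = iota
	RiskMedium
	RiskHigh
)

// String gives each level a readable name for logs and audit records.
func (r RiskClass) String() string {
	switch r {
	case RiskLow:
		return "low"
	case RiskMedium:
		return "medium"
	case RiskHigh:
		return "high"
	}
	return fmt.Sprintf("RiskClass(%d)", int(r))
}

// NeedsApproval is a hypothetical policy: any action above the caller's
// ceiling requires a human sign-off.
func NeedsApproval(action, ceiling RiskClass) bool {
	return action > ceiling
}

func main() {
	fmt.Println(RiskHigh, NeedsApproval(RiskHigh, RiskMedium))
	fmt.Println(RiskLow, NeedsApproval(RiskLow, RiskMedium))
}
```

With only three ordered values, a policy can be a single comparison. That stops being true once a scheme grows to seven or ten tiers.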

The fallback pattern. orchestrator.SetFallback(primary, fallback). One-line API; production-tested via make bcp-drill. Boring; works.

What we rewrote

The first RAG implementation. Single embedding model; one vector store; naive retrieval. Replaced with hybrid (vector + BM25 + RRF + rerank). The first version worked but quality plateaued; the second version generalised across customer use cases.
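
The fusion step is the easiest part of the hybrid pipeline to show. Below is a minimal sketch of reciprocal rank fusion (RRF) over two ranked lists; the function name, the document IDs, and the choice of k=60 (the commonly used default) are assumptions, and the rerank stage is left out entirely.

```go
package main

import (
	"fmt"
	"sort"
)

// fuseRRF merges several ranked lists of document IDs with reciprocal
// rank fusion: each list contributes 1/(k+rank) to a document's score,
// with rank starting at 1.
func fuseRRF(k float64, rankings ...[]string) []string {
	scores := map[string]float64{}
	for _, ranking := range rankings {
		for i, id := range ranking {
			scores[id] += 1.0 / (k + float64(i+1))
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	// Sort by descending score; break ties by ID so output is deterministic.
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b]
	})
	return ids
}

func main() {
	vector := []string{"doc-a", "doc-b", "doc-c"} // hypothetical vector-search hits
	bm25 := []string{"doc-b", "doc-d", "doc-a"}   // hypothetical BM25 hits
	fmt.Println(fuseRRF(60, vector, bm25))
}
```

RRF only looks at ranks, not raw scores, so the vector and BM25 scores never have to be normalised against each other. Here doc-b comes out on top because it ranks high in both lists.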

The first prompt-injection policy. Hand-rolled regex against a list of suspicious patterns. Replaced with a small classifier behind the pkg/safety plugin chain. The classifier handles paraphrased attacks the regex missed.

The first audit log. Wrote to a flat table; no chain; no tampering detection. Replaced with the hash-chained pkg/compliance/audit.go. The old log was useful for debugging; not useful for compliance. The new one serves both.
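
The core idea of a hash-chained log fits in a short sketch. This is not the code in pkg/compliance/audit.go; the Entry fields, the Append and Verify functions, and the choice of SHA-256 are assumptions picked to illustrate the technique.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Entry is a hypothetical audit record; field names are assumptions.
type Entry struct {
	Payload  string
	PrevHash string // hash of the previous entry, "" for the first
	Hash     string // SHA-256 over PrevHash and Payload
}

// hashEntry binds a payload to the hash of the entry before it.
func hashEntry(prev, payload string) string {
	sum := sha256.Sum256([]byte(prev + "\n" + payload))
	return hex.EncodeToString(sum[:])
}

// Append adds a record linked to the current tail of the chain.
func Append(chain []Entry, payload string) []Entry {
	prev := ""
	if n := len(chain); n > 0 {
		prev = chain[n-1].Hash
	}
	return append(chain, Entry{Payload: payload, PrevHash: prev, Hash: hashEntry(prev, payload)})
}

// Verify returns the index of the first broken entry, or -1 if intact.
func Verify(chain []Entry) int {
	prev := ""
	for i, e := range chain {
		if e.PrevHash != prev || e.Hash != hashEntry(prev, e.Payload) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var chain []Entry
	chain = Append(chain, "agent=triage action=read")
	chain = Append(chain, "agent=billing action=refund")
	fmt.Println(Verify(chain)) // chain is intact

	chain[0].Payload = "agent=triage action=delete" // simulate tampering
	fmt.Println(Verify(chain))
}
```

Editing any entry changes its hash, which breaks the link to every entry after it. A chain like this only detects tampering if the verifier runs regularly and the latest hash is also kept somewhere the writer cannot modify.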

The first observability layer. Logs only. Replaced with OTel spans + metrics + structured logs (slog). Three signals; same shape across services; standard tooling reads them.

The HTTP middleware stack. First version was custom for each route. Replaced with chi router + composable middleware. The diff was bigger than it sounded; the consistency was worth it.

What we deleted

A custom workflow DSL. Built one to express agent orchestration declaratively. Used by no one because Go was already expressive enough. Deleted six months in.

A graph database integration. Added Neo4j support for a hypothetical use case. No customer asked for it; the code drifted; deleted.

A configuration hot-reload subsystem. Built for an ops scenario we never actually hit. The complexity outweighed the rare use case. Deleted; restart on config change is fine.

Three half-finished agents. Speculative agents (a “supply chain optimiser,” a “regulatory news scanner,” a “customer-satisfaction predictor”) that we built without a clear customer. Each had ~500 LoC. Deleted; nothing was lost, because the patterns they relied on already live in the agents we shipped.

What I’d carry to the next project

Three meta-patterns that crystallised:

Be ruthless about deletion. Code that doesn’t serve a real use case is a liability. Twelve months in, the codebase is smaller in some areas than it was at month 6. That’s a feature, not a bug.
Two layers of defence beat one layer of cleverness. Row-level security (RLS) plus a bus-level tenant policy beats any single “smarter” tenant check. A hash-chained audit log plus a verifier beats any single “more tamper-resistant” log format. The boring layered approach holds up.
The hook surface should be minimal. Every time we resisted adding a hook, we found the existing hooks composed to cover the case. Every time we added one anyway, we regretted it within a quarter.

What still nags

The agent registration model is too coupled to the boot sequence. A hot-add at runtime would be nice; haven’t built it.
The fallback story doesn’t handle cascading failures. Fallback fails → ?. The current answer is “incident + page”; a graceful degradation chain would be better.
The cost-attribution per agent is approximate. Tokens get attributed to the agent that called the LLM, not to the agent whose business logic triggered the call. For shared-utility agents this over-attributes cost to the utility and under-attributes it to the agents that invoked it.

Roadmap items, all of them. Not blockers for the current use cases.

The big takeaway

The thing that survived 12 months of production was the discipline, not any specific code. The discipline:

Test pyramid that’s actually a pyramid (more unit tests than integration tests, and more integration tests than e2e tests).
Documentation that’s a contract, not an afterthought.
Defence-in-depth on every security concern.
Honest “what doesn’t work yet” lists in every doc.

The code is replaceable. The discipline isn’t. For year two, the plan is more discipline, less code.