Refactor: from rolled-my-own to MAF-native
I built memory, communication, security, governance, and evals from scratch first. Then I deleted most of it and used the official MAF packages. Here's the audit, the deletions, and the result.
The first commit of this project's MAF integration had five hand-rolled modules. None of them were bad code. All of them were unnecessary code, because the official MAF packages do the same thing with a maintained API that future MAF releases will keep upgrading. This post is the audit of what I deleted, what I kept, and what I'd warn future-me about.
Where I started
After the first pass through the reference architecture I had:
src/multi_agent/
├── memory/
│ ├── base.py # Memory ABC: write/read/list/delete
│ ├── short_term.py # InMemorySTM (dict per namespace)
│ └── long_term.py # FileLTM (newline-delimited JSON)
├── communication/
│ ├── request_based.py # OrchestratorClient (sync/stream helpers)
│ └── message_driven.py # InProcessBroker (asyncio.Queue pub/sub)
├── security/
│ └── (regex guardrails, identity, RBAC, audit log)
├── governance/
│ └── (lifecycle state machine, RAI checklist)
├── evals/
│ ├── run_eval.py # bespoke substring scorer
│ └── tool_call.py # bespoke tool-call scorer
Eight files of code I wrote because I thought MAF didn't ship a thing. Five of them I deleted on the second pass.
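For a sense of what "hand-rolled" meant here, a minimal sketch of the memory interface follows. It is a reconstruction from the file comments in the tree above (an ABC with write/read/list/delete, and a dict per namespace), not the deleted code; the method signatures and the namespace/key layout are assumptions.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class Memory(ABC):
    """Hypothetical shape of the hand-rolled interface: write/read/list/delete."""

    @abstractmethod
    def write(self, namespace: str, key: str, value: str) -> None: ...

    @abstractmethod
    def read(self, namespace: str, key: str) -> "str | None": ...

    @abstractmethod
    def delete(self, namespace: str, key: str) -> bool: ...

    @abstractmethod
    def list(self, namespace: str) -> "list[str]": ...


class InMemorySTM(Memory):
    """Short-term memory: one plain dict per namespace, lost on process exit."""

    def __init__(self) -> None:
        self._data = defaultdict(dict)

    def write(self, namespace: str, key: str, value: str) -> None:
        self._data[namespace][key] = value

    def read(self, namespace: str, key: str) -> "str | None":
        # .get on the outer dict avoids creating empty namespaces on read.
        return self._data.get(namespace, {}).get(key)

    def delete(self, namespace: str, key: str) -> bool:
        # Returns True only if the key existed and was removed.
        return self._data.get(namespace, {}).pop(key, None) is not None

    def list(self, namespace: str) -> "list[str]":
        return sorted(self._data.get(namespace, {}))
```

Nothing in it is wrong, which is the point: it is plausible code that duplicates what a maintained package already provides.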
The audit
When I sat down to do the audit, I made one table. Module → does MAF ship a real package for this → if yes, can I delete mine.
| My module | MAF package | Delete? |
|---|---|---|
| memory/base.py, short_term.py, long_term.py | MemoryContextProvider, MemoryFileStore, AgentSession, agent-framework-mem0 | ✅ delete |
| communication/request_based.py, message_driven.py | agent-framework-a2a for distributed; MAF workflow runtime for in-process | ✅ delete |
| security/ regex guardrails | agent-framework-purview for prod; keep regex as no-cloud fallback | ⚠️ partial |
| governance/lifecycle.py, responsible_ai.py | (nothing in MAF) | ❌ keep |
| registry/ | (nothing in MAF; Foundry has a catalog but only for hosted agents) | ❌ keep |
| evals/run_eval.py, tool_call.py (bespoke records) | agent-framework-lab-gaia ships Task, Evaluation, Prediction, TaskResult | ⚠️ partial (emit MAF types) |
Two deletions, two refactors, two keepers. Let me walk through each.
Delete: memory
The old memory/ was 200 lines of code and four files. The new one is 100 lines and two files:
# memory/factory.py: that's the whole module
# DEFAULT_MEMORY_DIR (the default storage path) is not shown in this excerpt.
from agent_framework import AgentSession, MemoryContextProvider, MemoryFileStore


def new_session(session_id: str | None = None) -> AgentSession:
    return AgentSession(session_id=session_id)


def make_memory_context_provider(*, owner_state_key, base_path=DEFAULT_MEMORY_DIR,
                                 recent_turns=0, consolidation_client=None):
    store = MemoryFileStore(base_path=base_path, owner_state_key=owner_state_key)
    return MemoryContextProvider(
        store=store, recent_turns=recent_turns,
        consolidation_client=consolidation_client,
    )


def make_mem0_context_provider(*, user_id, api_key=None, agent_id=None, application_id=None):
    # Imported lazily so the Mem0 package is only needed when this factory is used.
    from agent_framework_mem0 import Mem0ContextProvider
    return Mem0ContextProvider(
        user_id=user_id, api_key=api_key,
        agent_id=agent_id, application_id=application_id,
    )
The agent wires it the way MAF expects:
# `memory` is a context provider built by one of the factories above.
agent = Agent(
    client=build_chat_client(),
    name="researcher",
    instructions="...",
    context_providers=[memory],
)
session = new_session("chat-1")
response = await agent.run("...", session=session)
What I lost by deleting: nothing. What I gained: topic consolidation, durable extraction with a well-tuned default prompt, and the ability to drop in Mem0ContextProvider for hosted LTM without changing agent code.
There's a separate post on memory in this series.
Delete: communication
The old communication/ had:
- OrchestratorClient (60 lines) wrapping non-streaming and streaming handlers.
- InProcessBroker (90 lines): asyncio.Queue topics with a request_response helper using correlation IDs.
The new one has:
# communication/a2a.py
from agent_framework_a2a import A2AAgent, A2AExecutor


def wrap_remote_agent(*, url, name=None, id=None, description=None) -> A2AAgent:
    return A2AAgent(url=url, name=name, id=id, description=description)


def as_workflow_executor(agent, *, stream: bool = False) -> A2AExecutor:
    return A2AExecutor(agent=agent, stream=stream)
The realisation: for in-process orchestration MAF's workflow runtime is the broker. Routing, back-pressure, error propagation, OpenTelemetry trace context — all already done. The only valuable thing a separate communication module brings is the distributed case, and agent-framework-a2a is what that needs.
The 150 lines of pub/sub code I'd written did not survive the audit.
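To make concrete what kind of code that was, here is a minimal sketch of an asyncio.Queue broker with a correlation-ID request/response helper. It is an illustration written for this post, not the deleted module: the method names, the message shape, and the `reply.<id>` topic convention are all assumptions.

```python
import asyncio
import uuid
from collections import defaultdict


class InProcessBroker:
    """Illustrative asyncio.Queue pub/sub; a reconstruction, not the deleted module."""

    def __init__(self) -> None:
        # One list of subscriber queues per topic.
        self._topics = defaultdict(list)

    def subscribe(self, topic: str) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self._topics[topic].append(queue)
        return queue

    async def publish(self, topic: str, message: dict) -> None:
        # Fan out to every queue currently subscribed to the topic.
        for queue in self._topics.get(topic, []):
            await queue.put(message)

    async def request_response(self, topic: str, payload: str, timeout: float = 1.0) -> dict:
        # A fresh correlation ID names a private reply topic for this request.
        correlation_id = uuid.uuid4().hex
        reply_topic = f"reply.{correlation_id}"
        replies = self.subscribe(reply_topic)
        try:
            await self.publish(topic, {
                "correlation_id": correlation_id,
                "reply_to": reply_topic,
                "payload": payload,
            })
            return await asyncio.wait_for(replies.get(), timeout)
        finally:
            # Drop the one-shot reply topic whether or not a reply arrived.
            self._topics.pop(reply_topic, None)


async def echo_worker(broker: InProcessBroker, inbox: asyncio.Queue) -> None:
    # Handles a single request and replies on the topic the requester chose.
    request = await inbox.get()
    await broker.publish(request["reply_to"], {
        "correlation_id": request["correlation_id"],
        "payload": request["payload"].upper(),
    })


async def demo() -> dict:
    broker = InProcessBroker()
    inbox = broker.subscribe("echo")
    worker = asyncio.create_task(echo_worker(broker, inbox))
    reply = await broker.request_response("echo", "hello")
    await worker
    return reply
```

Running `asyncio.run(demo())` returns the worker's reply. Note what the sketch does not do: back-pressure, error propagation, or trace-context propagation, which is exactly the work the workflow runtime already covers.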
Partial: security
The Purview tier is the production answer. The integration is one factory:
# security/purview.py
def make_purview_middleware(*, credential=None, settings=None, cache_provider=None):
    from agent_framework_purview import PurviewChatPolicyMiddleware, PurviewSettings
    if credential is None:
        from azure.identity import DefaultAzureCredential
        credential = DefaultAzureCredential()
    if settings is None:
        settings = PurviewSettings()
    return PurviewChatPolicyMiddleware(
        credential=credential, settings=settings, cache_provider=cache_provider,
    )
Wiring it: Agent(..., middleware=[make_purview_middleware()]). Purview enforces tenant-level chat and DLP policies on every model call.
I kept the regex-based PII redaction, secret scanning, and in-process audit log because they cover the no-cloud fallback. The chapter doc is explicit:
Tier choice. Production: make_purview_middleware(); requires Azure tenant. No-cloud fallback: regex stubs + in-process AuditLog. Use the second for dev/CI; do not pretend it substitutes for Purview in production.
That sentence is the whole policy. The fallback code stays small (about 80 lines total) because its purpose is to be obviously not Purview.
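A minimal sketch of what such a fallback can look like, using only the standard library. The patterns, the placeholder tokens, and the `guard` helper are illustrative assumptions, not the project's actual rules; only the `AuditLog` name comes from the chapter doc, and its interface here is invented.

```python
import re
from datetime import datetime, timezone

# Deliberately narrow, illustrative patterns; real PII detection needs far more.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
SECRET_PATTERNS = {
    # AWS access key IDs start with AKIA followed by 16 uppercase alphanumerics.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def redact_pii(text: str) -> str:
    """Replace matched emails and SSN-shaped numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return US_SSN.sub("[SSN]", text)


def find_secrets(text: str) -> list:
    """Return the names of every secret pattern that matches the text."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]


class AuditLog:
    """In-process, in-memory audit trail (hypothetical interface)."""

    def __init__(self) -> None:
        self.events = []

    def record(self, action: str, detail: str) -> None:
        self.events.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "detail": detail,
        })


def guard(text: str, audit: AuditLog) -> str:
    """Block text containing secrets; otherwise return it with PII redacted."""
    secrets = find_secrets(text)
    if secrets:
        audit.record("blocked", ", ".join(secrets))
        raise ValueError(f"secret detected: {', '.join(secrets)}")
    redacted = redact_pii(text)
    if redacted != text:
        audit.record("redacted", "pii placeholders applied")
    return redacted
```

A handful of regexes and a list in memory make it hard to mistake this for tenant-level policy enforcement, which matches the stated purpose of the fallback.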
Partial: evals
The bespoke MultiTurnScore dataclass and JSON shape from my first eval harness got deleted. The harnesses now emit agent_framework_lab_gaia.TaskResult:
results.append(
    TaskResult(
        task_id=case_id,
        task=Task(task_id=case_id, question=label),
        prediction=Prediction(prediction=result.text),
        evaluation=Evaluation(
            is_correct=is_correct,
            score=overall,
            details={"scores": scores, "tools_used": result.tools_used,
                     "tool_call_order": result.tool_call_order,
                     "steps": len(result.steps)},
        ),
        runtime_seconds=time() - started,
    )
)
Same scorer logic. Different record type. Now any tool that ingests MAF-Lab outputs ingests mine for free.
What I learned writing this: agent-framework-lab-gaia ships a benchmark runner for the GAIA dataset specifically. But its types (Task, Evaluation, Prediction, TaskResult, Evaluator) are perfectly general. You can use them as the interchange format for any eval harness even if you never run a GAIA benchmark.
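The scorer side stays independent of the record type. As a rough sketch, assuming a substring scorer and a tool-call-order scorer along the lines of the two files in the original tree, the values that feed `Evaluation(is_correct=..., score=..., details=...)` could be computed like this. The function names, the averaging, and the 0.5 threshold are assumptions, not the project's code.

```python
def substring_score(answer: str, expected: list) -> float:
    """Fraction of expected substrings found in the answer (case-insensitive)."""
    if not expected:
        return 1.0
    lowered = answer.lower()
    return sum(s.lower() in lowered for s in expected) / len(expected)


def tool_order_score(actual: list, expected: list) -> float:
    """1.0 if the expected tools appear in order within the actual calls, else 0.0."""
    remaining = iter(actual)
    # `tool in remaining` consumes the iterator, so order is enforced.
    return 1.0 if all(tool in remaining for tool in expected) else 0.0


def score_case(answer: str, tool_call_order: list,
               expected_substrings: list, expected_tool_order: list,
               threshold: float = 0.5) -> tuple:
    """Return (is_correct, overall, details) ready to pass into an evaluation record."""
    scores = {
        "substring": substring_score(answer, expected_substrings),
        "tool_order": tool_order_score(tool_call_order, expected_tool_order),
    }
    overall = sum(scores.values()) / len(scores)
    details = {"scores": scores, "tool_call_order": list(tool_call_order)}
    return overall >= threshold, overall, details
```

Because the scorers return plain values, switching the record type touches only the line that constructs the record.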
Keep: registry
MAF has no general registry. Foundry has AIProjectClient.agents.list() but it sees only Foundry-managed agents.
I kept the project's registry/ — 4 files, 200 lines — and flagged it in the chapter doc:
Heads-up: MAF does not ship a built-in agent registry. This module is the project-level convention. If you're 100% on Foundry hosted agents, you can drop this module and use the catalog instead.
If a future MAF release ships a general agent registry primitive, I delete this module and use the official one. That's the contract.
Keep: governance lifecycle + RAI checklist
The DRAFT → STAGING → PRODUCTION → RETIRED state machine and the six-field RAI checklist (fairness / reliability / privacy / inclusiveness / transparency / accountability) are project-local. They tie into AGT and Purview but they're not in MAF.
I kept them for two reasons:
- They're tiny. Lifecycle is 30 lines, RAI checklist is 30 lines. The cost of keeping them in-tree is approximately zero.
- They sit above MAF. Lifecycle gates project promotion, not framework primitives. The RAI checklist is a deployment gate, not an SDK feature. These are application-level concerns the framework intentionally leaves to me.
If MAF eventually publishes opinionated primitives for lifecycle and RAI, I'll port. Until then this is the right place for them.
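To show how little code this is, here is a minimal sketch of a forward-only lifecycle with the six-field checklist as a gate. The stage names and checklist fields come from the description above; the strictly linear transitions, the rule that the checklist gates only the move into PRODUCTION, and the `promote` helper are assumptions.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Stage(Enum):
    DRAFT = "draft"
    STAGING = "staging"
    PRODUCTION = "production"
    RETIRED = "retired"


# Assumed to be strictly linear and forward-only; RETIRED is terminal.
NEXT_STAGE = {
    Stage.DRAFT: Stage.STAGING,
    Stage.STAGING: Stage.PRODUCTION,
    Stage.PRODUCTION: Stage.RETIRED,
}


@dataclass
class RAIChecklist:
    fairness: bool = False
    reliability: bool = False
    privacy: bool = False
    inclusiveness: bool = False
    transparency: bool = False
    accountability: bool = False

    def missing(self) -> list:
        """Names of the checklist items not yet signed off, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


def promote(current: Stage, checklist: RAIChecklist) -> Stage:
    """Advance one stage; entering PRODUCTION requires a complete checklist."""
    if current not in NEXT_STAGE:
        raise ValueError(f"{current.name} is terminal")
    target = NEXT_STAGE[current]
    if target is Stage.PRODUCTION and checklist.missing():
        raise ValueError(f"RAI checklist incomplete: {checklist.missing()}")
    return target
```

For example, `promote(Stage.STAGING, RAIChecklist())` raises because no item is signed off, while a fully ticked checklist returns `Stage.PRODUCTION`.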
What I'd warn against
Three honest things, from a week of building this:
- Don't roll your own context-provider interface. MAF's ContextProvider is the abstraction. Don't fight it. Don't reinvent it. Don't wrap it in your own ABC. Plug in.
- Don't build a pub/sub in front of the workflow. The workflow IS the pub/sub for in-process orchestration. Bridging trace context through a custom broker is more work than it pays for.
- Don't write your own eval record type. Use Task / Evaluation / Prediction / TaskResult from agent-framework-lab-gaia. Even if you never run GAIA, the types are right.
What I'd reach for next
The two modules I kept (registry/, governance/) are the modules I'd build more carefully if the project scaled. A real registry needs a distributed backing store, a health probe, and a deprecation API. A real governance layer needs an attestation log, a model card store, and integration with whatever evaluation pipeline you'd use for production drift detection.
But that's future work for a real deployment. For the reference architecture, the current shape is the right shape. The thing I value most after the refactor is the symmetry between what the architecture asks for and what the code does. Less code. Same coverage. Easier to read.
The line count, before and after
| Module | Before | After | Diff |
|---|---|---|---|
| memory/ | 220 | 100 | −120 |
| communication/ | 230 | 50 | −180 |
| security/ | 280 | 280 | 0 (kept fallback; added Purview factory) |
| governance/ | 110 | 150 | +40 (added AGT integration) |
| evals/ (legacy) | 200 | 220 | +20 (TaskResult emission + multi-turn) |
| Net | 1040 | 800 | −240 |
About 240 fewer lines after the refactor. More functionality covered. Easier to test, because the surfaces I depend on are maintained by Microsoft and Mem0 rather than by me.
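The totals can be checked in a few lines. The figures are the ones in the table; the dictionary and function names are just for this check.

```python
# (before, after) line counts per module, copied from the table.
LINE_COUNTS = {
    "memory/": (220, 100),
    "communication/": (230, 50),
    "security/": (280, 280),
    "governance/": (110, 150),
    "evals/": (200, 220),
}


def per_module_diff(counts: dict) -> dict:
    """After minus before, per module."""
    return {module: after - before for module, (before, after) in counts.items()}


def net_change(counts: dict) -> tuple:
    """Return (total_before, total_after, total_after - total_before)."""
    total_before = sum(before for before, _ in counts.values())
    total_after = sum(after for _, after in counts.values())
    return total_before, total_after, total_after - total_before
```

Two modules grew and one stayed flat; the net reduction comes entirely from memory/ and communication/.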
The lesson generalises: when you're integrating with a framework that has separate optional packages, audit each module against the package index before you write a line. Most of what you think you need is already there, under a name you haven't met yet.
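A small local approximation of that audit, as a sketch: it checks which optional packages are importable in the current environment. It does not query the package index, so it only tells you what is already installed. The import names are the ones used in the snippets in this post; the mapping to areas and the `audit` helper are assumptions.

```python
from importlib.util import find_spec

# Area of the codebase -> top-level import name of the optional package.
OPTIONAL_PACKAGES = {
    "memory": "agent_framework_mem0",
    "communication": "agent_framework_a2a",
    "security": "agent_framework_purview",
    "evals": "agent_framework_lab_gaia",
}


def audit(packages: dict) -> dict:
    """Map each area to True if its package can be imported here, else False."""
    # find_spec returns None for a missing top-level module without importing it.
    return {area: find_spec(module) is not None for area, module in packages.items()}


def report(packages: dict) -> str:
    """One line per area, suitable for printing before starting new work."""
    status = audit(packages)
    return "\n".join(
        f"{area}: {packages[area]} ({'installed' if ok else 'not installed'})"
        for area, ok in status.items()
    )
```

Calling `print(report(OPTIONAL_PACKAGES))` before writing a new module is a cheap prompt to go and read the package's documentation first.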