The pattern
For some classes of query, multiple agents could answer. They differ on:
- Latency: small/fast agent vs large/thorough agent.
- Confidence: a quick approximate answer vs a slow precise one.
- Cost: $0.001/query vs $0.05/query.
Latency-aware dispatch picks the agent based on the user’s latency SLO, not the agent’s “quality.”
The dispatcher
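import (
    "context"
    "time"
)
type SLOBudget struct {
    MaxLatency    time.Duration
    MinConfidence float64
}
// dispatchBySLO tries agents fastest-first under one shared deadline.
// Query, Response, agentsByLatency and ErrNoAgentMeetsSLO are assumed to be defined elsewhere.
func dispatchBySLO(ctx context.Context, query Query, slo SLOBudget) Response {
    // Cap the whole dispatch, not each attempt, at the SLO.
    ctx, cancel := context.WithTimeout(ctx, slo.MaxLatency)
    defer cancel()
    deadline, _ := ctx.Deadline() // always set after WithTimeout
    for _, agent := range agentsByLatency { // sorted fastest first
        // Skip agents unlikely to finish in the time that remains.
        if agent.P99Latency > time.Until(deadline) {
            continue
        }
        resp := agent.Run(ctx, query)
        if resp.Confidence >= slo.MinConfidence {
            return resp
        }
    }
    return Response{Error: ErrNoAgentMeetsSLO}
}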
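Iterate fastest-first and return the first response that meets the confidence floor. The SLO has to cap the whole dispatch, not each attempt: time spent on a fast agent that falls short comes out of what the next agent may use.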
Why this matters more for agentic systems than for microservices
In microservices, latency-aware routing is about replicas and locality. Same code; pick the closest instance.
With agents, the candidates differ in capability, not just placement: a small model can answer some queries, while others need a large one. The dispatcher picks the right model, not the right replica.
For Genie’s agents/financial_supervisor, the dispatch decision factors in:
- The user’s session SLO (interactive: 2s; batch: 60s).
- The query class (simple lookup vs domain analysis).
- The current cost budget for the user’s tier.
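One way to picture how these three inputs combine is a small function that folds them into a single budget. The sketch below is illustrative only: `SessionKind`, `QueryClass`, `DispatchBudget` and `budgetFor` are hypothetical names, not part of Genie. The 2s and 60s figures come from the list above; the confidence floors are placeholder values.

```go
package main

import "time"

// SessionKind and QueryClass are hypothetical enums for illustration.
type SessionKind int

const (
	Interactive SessionKind = iota
	Batch
)

type QueryClass int

const (
	SimpleLookup QueryClass = iota
	DomainAnalysis
)

// DispatchBudget is an assumed shape: the latency/confidence budget
// plus a per-query cost ceiling derived from the user's tier.
type DispatchBudget struct {
	MaxLatency    time.Duration
	MinConfidence float64
	MaxCostUSD    float64
}

// budgetFor folds session kind, query class and tier cost ceiling into
// one budget. Latency values follow the text; the confidence floors
// (0.7 and 0.9) are made-up placeholders.
func budgetFor(s SessionKind, q QueryClass, tierCostCeilingUSD float64) DispatchBudget {
	b := DispatchBudget{
		MaxLatency:    2 * time.Second,
		MinConfidence: 0.7,
		MaxCostUSD:    tierCostCeilingUSD,
	}
	if s == Batch {
		b.MaxLatency = 60 * time.Second
	}
	if q == DomainAnalysis {
		b.MinConfidence = 0.9
	}
	return b
}
```

Calling `budgetFor(Interactive, SimpleLookup, 0.001)` would yield a 2-second budget with the lower confidence floor. A real deployment would more likely load these thresholds from configuration than hard-code them.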
The SLO as a contract
The SLO is the user-facing promise. The dispatcher’s job is to honour it.
What happens when no agent meets the SLO:
- Option A: degrade gracefully — return the fastest agent’s result even if confidence is low, plus a disclaimer.
- Option B: defer — return “we’re working on it; come back in 30 seconds.” Show progress in the UI.
- Option C: surface — fail the request with an explicit error so the user knows the system can’t do this within the time budget.
For Genie, option A is the default. The disclaimer (FREE-AI Rec 25) carries the confidence level so the user can decide whether to trust it.
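The three options can be expressed as an explicit policy so the choice is visible in code and not buried in error handling. This is a minimal sketch under stated assumptions: `FallbackPolicy`, `Outcome` and `applyFallback` are hypothetical names, and the disclaimer wording is a placeholder, not the text required by FREE-AI Rec 25.

```go
package main

import (
	"errors"
	"fmt"
)

// FallbackPolicy names the three options from the text.
// The type and its constants are hypothetical.
type FallbackPolicy int

const (
	Degrade FallbackPolicy = iota // option A: low-confidence answer plus disclaimer
	Defer                         // option B: ask the user to come back
	Surface                       // option C: explicit error
)

var ErrNoAgentMeetsSLO = errors.New("no agent meets the SLO")

// Outcome is an assumed result shape; only the fields relevant to the
// chosen policy are populated.
type Outcome struct {
	Answer        string
	Confidence    float64
	Disclaimer    string
	RetryAfterSec int
	Err           error
}

// applyFallback decides what to return once no agent met the confidence
// floor in time. fastest is the best-effort answer from the fastest agent.
func applyFallback(p FallbackPolicy, fastest string, confidence float64) Outcome {
	switch p {
	case Degrade:
		return Outcome{
			Answer:     fastest,
			Confidence: confidence,
			// The disclaimer carries the confidence level to the user.
			Disclaimer: fmt.Sprintf("Low-confidence answer (%.0f%%). Please verify before acting on it.", confidence*100),
		}
	case Defer:
		return Outcome{RetryAfterSec: 30} // the "come back in 30 seconds" path
	default:
		return Outcome{Err: ErrNoAgentMeetsSLO}
	}
}
```

With `Degrade` as the default, the caller always receives an answer, and the disclaimer tells the user how much weight to give it.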
Where this shows up in practice
For the conversational call-centre agent (a Pratik-Kinetic-style voice deployment):
- User says “what’s my balance” → simple lookup path → 80ms → high confidence.
- User says “explain this charge” → small LLM → 500ms → medium confidence.
- User says “why does my interest rate vary?” → production agent → 3s → high confidence with full audit.
The user perceives a consistently responsive system. The team doesn’t pay full-production-agent cost on every query. Win/win.
What to monitor
Per-class latency histograms. Per-class confidence distributions. SLO-met rate (the share of queries served within their latency budget).
When the SLO-met rate drops, you have two levers: add capacity to the slow path, or move more queries to the fast path (re-tune the classifier). Watch the rate weekly; adjust quarterly.
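The SLO-met rate is simple to compute from request logs. The sketch below assumes each logged request records its query class, observed latency and the budget it was served under; `Sample` and `sloMetRate` are hypothetical names.

```go
package main

import "time"

// Sample is an assumed log record for one served query.
type Sample struct {
	Class   string        // query class, e.g. "lookup"
	Latency time.Duration // observed end-to-end latency
	Budget  time.Duration // latency budget the query was served under
}

// sloMetRate returns, per query class, the fraction of samples whose
// latency was within budget.
func sloMetRate(samples []Sample) map[string]float64 {
	met := map[string]int{}
	total := map[string]int{}
	for _, s := range samples {
		total[s.Class]++
		if s.Latency <= s.Budget {
			met[s.Class]++
		}
	}
	rates := make(map[string]float64, len(total))
	for class, n := range total {
		rates[class] = float64(met[class]) / float64(n)
	}
	return rates
}
```

Tracking the rate per class, not as one global number, shows which path is slipping, which in turn indicates which of the two levers to pull.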