Latency-aware agent dispatch — picking by SLO, not by capability

The pattern

For some classes of query, multiple agents could answer. They differ on:

Latency: small/fast agent vs large/thorough agent.
Confidence: a quick approximate answer vs a slow precise one.
Cost: $0.001/query vs $0.05/query.

Latency-aware dispatch picks the agent based on the user’s latency SLO, not the agent’s “quality.”

The dispatcher

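import (
    "context"
    "time"
)

// SLOBudget is the per-request contract the dispatcher must honour.
type SLOBudget struct {
    MaxLatency    time.Duration // end-to-end budget across all attempts
    MinConfidence float64       // lowest confidence the caller accepts
}

// dispatchBySLO tries agents fastest-first and returns the first answer that
// meets the confidence floor within the latency budget.
func dispatchBySLO(ctx context.Context, query Query, slo SLOBudget) Response {
    // One deadline covers every attempt, so fall-through cannot overrun the SLO.
    ctx, cancel := context.WithTimeout(ctx, slo.MaxLatency)
    defer cancel()
    deadline, _ := ctx.Deadline()
    for _, agent := range agentsByLatency { // sorted fastest first
        // Skip agents whose p99 no longer fits in the remaining budget.
        if agent.P99Latency > time.Until(deadline) {
            continue
        }
        resp := agent.Run(ctx, query)
        if resp.Confidence >= slo.MinConfidence {
            return resp
        }
    }
    return Response{Error: ErrNoAgentMeetsSLO}
}

Iterate fastest-first and return the first response that meets the confidence floor. A single deadline derived from MaxLatency covers all attempts, and an agent is skipped once its p99 latency no longer fits in the time remaining. Checking each agent's p99 against the full budget is not enough: several sequential attempts that each fit the budget can together exceed it.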
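
The dispatcher leaves Query, Response, and agentsByLatency undefined. The following is a minimal runnable sketch with hypothetical stand-ins for those types: the Agent struct, stubAgent, demoDispatch, and the confidence values are all assumptions made for illustration, and the agent list is passed in as a parameter rather than read from a package variable. It also shows one way the fastest-first ordering might be built, by sorting on p99 latency.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sort"
	"time"
)

// Hypothetical stand-ins for types the text leaves undefined.
type Query struct{ Text string }

type Response struct {
	Agent      string
	Confidence float64
	Error      error
}

type Agent struct {
	Name       string
	P99Latency time.Duration
	Run        func(ctx context.Context, q Query) Response
}

type SLOBudget struct {
	MaxLatency    time.Duration
	MinConfidence float64
}

var ErrNoAgentMeetsSLO = errors.New("no agent meets SLO")

// stubAgent answers instantly with a fixed confidence (illustrative only).
func stubAgent(name string, p99 time.Duration, conf float64) Agent {
	run := func(ctx context.Context, q Query) Response {
		return Response{Agent: name, Confidence: conf}
	}
	return Agent{Name: name, P99Latency: p99, Run: run}
}

// dispatchBySLO expects agents sorted fastest first and applies one shared deadline.
func dispatchBySLO(ctx context.Context, agents []Agent, q Query, slo SLOBudget) Response {
	ctx, cancel := context.WithTimeout(ctx, slo.MaxLatency)
	defer cancel()
	deadline, _ := ctx.Deadline()
	for _, a := range agents {
		if a.P99Latency > time.Until(deadline) {
			continue // p99 does not fit in the remaining budget
		}
		if resp := a.Run(ctx, q); resp.Error == nil && resp.Confidence >= slo.MinConfidence {
			return resp
		}
	}
	return Response{Error: ErrNoAgentMeetsSLO}
}

// demoDispatch runs a fixed query against three stub agents with a 2s budget.
func demoDispatch(minConfidence float64) Response {
	agents := []Agent{
		stubAgent("production", 3*time.Second, 0.95),
		stubAgent("lookup", 80*time.Millisecond, 0.55),
		stubAgent("small-llm", 500*time.Millisecond, 0.75),
	}
	// Build the fastest-first ordering the dispatcher expects.
	sort.Slice(agents, func(i, j int) bool { return agents[i].P99Latency < agents[j].P99Latency })
	slo := SLOBudget{MaxLatency: 2 * time.Second, MinConfidence: minConfidence}
	return dispatchBySLO(context.Background(), agents, Query{Text: "explain this charge"}, slo)
}

func main() {
	fmt.Println(demoDispatch(0.7).Agent)        // first stub that clears a 0.7 floor
	fmt.Println(demoDispatch(0.9).Error != nil) // only the 3s stub clears 0.9, and it does not fit 2s
}
```

With a 0.7 confidence floor the lookup stub runs first and falls short, so the request falls through to the small model. With a 0.9 floor only the 3s stub would qualify, and it is never tried because it cannot fit a 2s budget.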

Why this matters more for agentic systems than for microservices

In microservices, latency-aware routing is about replicas and locality. Same code; pick the closest instance.

In agentic systems, the candidates differ in kind: a small model can answer some queries, while others need a large model. The dispatcher picks the right model, not the right replica.

For Genie’s agents/financial_supervisor, the dispatch decision factors in:

The user’s session SLO (interactive: 2s; batch: 60s).
The query class (simple lookup vs domain analysis).
The current cost budget for the user’s tier.
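
The three factors above can be sketched as a filter that runs before dispatch. This is an illustration only, not Genie's actual code: SessionKind, QueryClass, AgentProfile, Request, eligibleAgents, and demoAgents are hypothetical names, and the per-agent latencies, costs, and class coverage are assumed values. Only the 2s and 60s session budgets come from the text.

```go
package main

import (
	"fmt"
	"time"
)

type SessionKind int

const (
	Interactive SessionKind = iota
	Batch
)

type QueryClass int

const (
	SimpleLookup QueryClass = iota
	DomainAnalysis
)

// AgentProfile is a hypothetical description of one candidate agent.
type AgentProfile struct {
	Name         string
	P99Latency   time.Duration
	CostPerQuery float64 // USD
	Handles      map[QueryClass]bool
}

// Request carries the three inputs to the dispatch decision.
type Request struct {
	Session         SessionKind
	Class           QueryClass
	MaxCostPerQuery float64 // USD, derived from the user's tier
}

// latencyBudget maps the session kind to its SLO (interactive: 2s; batch: 60s).
func latencyBudget(s SessionKind) time.Duration {
	if s == Batch {
		return 60 * time.Second
	}
	return 2 * time.Second
}

// eligibleAgents returns the names of agents that handle the query class,
// fit the session's latency budget, and fit the cost cap. Input order is kept.
func eligibleAgents(req Request, agents []AgentProfile) []string {
	budget := latencyBudget(req.Session)
	var names []string
	for _, a := range agents {
		if a.Handles[req.Class] && a.P99Latency <= budget && a.CostPerQuery <= req.MaxCostPerQuery {
			names = append(names, a.Name)
		}
	}
	return names
}

// demoAgents returns three illustrative profiles, fastest first.
func demoAgents() []AgentProfile {
	both := map[QueryClass]bool{SimpleLookup: true, DomainAnalysis: true}
	return []AgentProfile{
		{"lookup", 80 * time.Millisecond, 0.001, map[QueryClass]bool{SimpleLookup: true}},
		{"small-llm", 500 * time.Millisecond, 0.005, both},
		{"production", 3 * time.Second, 0.05, both},
	}
}

func main() {
	interactive := Request{Session: Interactive, Class: DomainAnalysis, MaxCostPerQuery: 0.05}
	batch := Request{Session: Batch, Class: DomainAnalysis, MaxCostPerQuery: 0.05}
	fmt.Println(eligibleAgents(interactive, demoAgents()))
	fmt.Println(eligibleAgents(batch, demoAgents()))
}
```

Filtering first keeps the dispatcher loop simple: it only ever sees agents that are allowed for this request, so the fastest-first iteration does not need to know about tiers or query classes.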

The SLO as a contract

The SLO is the user-facing promise. The dispatcher’s job is to honour it.

What happens when no agent meets the SLO:

Option A: degrade gracefully — return the fastest agent’s result even if confidence is low, plus a disclaimer.
Option B: defer — return “we’re working on it; come back in 30 seconds.” Show progress in the UI.
Option C: surface — fail the request with an explicit error so the user knows the system can’t do this within the time budget.

For Genie, option A is the default. The disclaimer (FREE-AI Rec 25) carries the confidence level so the user can decide whether to trust it.
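
The three options can be sketched as a fallback policy applied when the dispatcher finds no agent that meets the SLO. FallbackPolicy, applyFallback, the Response fields, and the disclaimer wording are all hypothetical; the text does not specify the actual disclaimer format, only that it carries the confidence level.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type FallbackPolicy int

const (
	Degrade FallbackPolicy = iota // Option A: low-confidence answer plus disclaimer
	Defer                         // Option B: ask the user to come back
	Surface                       // Option C: explicit error
)

// Response is a hypothetical shape for what the user-facing layer receives.
type Response struct {
	Text       string
	Confidence float64
	Disclaimer string
	RetryAfter time.Duration
	Error      error
}

var ErrNoAgentMeetsSLO = errors.New("no agent meets SLO")

// applyFallback decides what to return when no agent met the SLO.
// fastest is the fastest agent's below-threshold result, or nil if none ran.
func applyFallback(policy FallbackPolicy, fastest *Response) Response {
	switch policy {
	case Degrade:
		if fastest == nil {
			// Nothing to degrade to, so surface the failure instead.
			return Response{Error: ErrNoAgentMeetsSLO}
		}
		out := *fastest
		// Placeholder wording: the disclaimer carries the confidence level.
		out.Disclaimer = fmt.Sprintf("Low-confidence answer (confidence %.2f).", out.Confidence)
		return out
	case Defer:
		return Response{
			Text:       "We're working on it; come back in 30 seconds.",
			RetryAfter: 30 * time.Second,
		}
	default: // Surface
		return Response{Error: ErrNoAgentMeetsSLO}
	}
}

func main() {
	quick := &Response{Text: "Your rate is variable.", Confidence: 0.4}
	fmt.Println(applyFallback(Degrade, quick).Disclaimer)
	fmt.Println(applyFallback(Defer, nil).RetryAfter)
	fmt.Println(applyFallback(Surface, nil).Error)
}
```

Keeping the policy as an explicit value makes the default (option A here) a configuration choice per deployment rather than something buried in the dispatcher loop.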

Where this shows up in practice

For the conversational call-centre agent (a Pratik-Kinetic-style voice deployment):

User says “what’s my balance” → simple lookup path → 80ms → high confidence.
User says “explain this charge” → small LLM → 500ms → medium confidence.
User says “why does my interest rate vary?” → production agent → 3s → high confidence with full audit.

The user perceives a consistently responsive system. The team doesn’t pay full-production-agent cost on every query. Win/win.

What to monitor

Per-class latency histograms. Per-class confidence distributions. SLO-met rate (the fraction of queries served within the budget), tracked per class as well as overall.

When the SLO-met rate drops, you have two levers: add capacity to the slow path, or move more queries to the fast path (re-tune the classifier). Watch the rate weekly; adjust quarterly.
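
A minimal sketch of tracking the SLO-met rate per query class follows. Observation, SLOTracker, and their methods are hypothetical names; a real deployment would more likely export these counters to its existing metrics system than hold them in memory, and this version is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"time"
)

// Observation is one completed request: its class, observed latency, and budget.
type Observation struct {
	Class   string
	Latency time.Duration
	Budget  time.Duration
}

type classStats struct{ met, total int }

// SLOTracker counts, per query class, how many requests finished within budget.
type SLOTracker struct {
	byClass map[string]*classStats
}

func NewSLOTracker() *SLOTracker {
	return &SLOTracker{byClass: map[string]*classStats{}}
}

// Record adds one observation to its class's counters.
func (t *SLOTracker) Record(o Observation) {
	s, ok := t.byClass[o.Class]
	if !ok {
		s = &classStats{}
		t.byClass[o.Class] = s
	}
	s.total++
	if o.Latency <= o.Budget {
		s.met++
	}
}

// MetRate returns the fraction of the class's requests served within budget.
// The second result is false when the class has no observations yet.
func (t *SLOTracker) MetRate(class string) (float64, bool) {
	s, ok := t.byClass[class]
	if !ok || s.total == 0 {
		return 0, false
	}
	return float64(s.met) / float64(s.total), true
}

func main() {
	t := NewSLOTracker()
	budget := 2 * time.Second
	for _, latency := range []time.Duration{
		400 * time.Millisecond, 900 * time.Millisecond, 1500 * time.Millisecond, 3 * time.Second,
	} {
		t.Record(Observation{Class: "analysis", Latency: latency, Budget: budget})
	}
	if rate, ok := t.MetRate("analysis"); ok {
		fmt.Printf("analysis SLO-met rate: %.2f\n", rate)
	}
}
```

Recording the budget alongside each latency lets interactive and batch requests of the same class share a tracker while still being judged against their own SLOs.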