Ardan Ultimate AI #22 — Cascading model router (cheap first, expensive on miss)

Field notes from working through example 22 of Ardan Labs’ Ultimate AI course by Bill Kennedy and Florin Pățan (Apache 2.0). My fork: PratikDhanave/ai-training. Thank you Bill and Florin for teaching this material — the patterns in this post are derived from the course; the production reflections at the end are mine.

What the example teaches

A cascade:

1. Try the small model (3B params, sub-second, costs basically nothing).
2. If its confidence is high and its output validates → return.
3. Otherwise, escalate to the large model (70B+, slower, expensive).

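Most queries are settled by the small model and return at step 2. The large model only runs for the hard ones. Every query pays for the small model, but only the escalated ones pay for the large model too, so average cost per query drops 5-10× with quality holding steady.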
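The cost claim is easy to sanity-check with arithmetic. Below is a minimal sketch, assuming made-up prices and a made-up escalation rate (none of these numbers come from the course): every query pays for the small model, and the escalated fraction pays for the large one on top.

```go
package main

import "fmt"

// expectedCost returns the average per-query cost of a cascade.
// Every query pays for the small model; the escalated fraction
// also pays for the large model.
func expectedCost(smallCost, largeCost, escalationRate float64) float64 {
	return smallCost + escalationRate*largeCost
}

func main() {
	// Illustrative numbers only (assumptions, not measurements).
	const (
		smallCost      = 0.01 // cost units per query on the small model
		largeCost      = 1.00 // cost units per query on the large model
		escalationRate = 0.10 // fraction of queries the small model cannot settle
	)

	cascade := expectedCost(smallCost, largeCost, escalationRate)
	fmt.Printf("cascade: %.2f per query, large-only: %.2f, saving: %.1fx\n",
		cascade, largeCost, largeCost/cascade)
}
```

With these assumed inputs the cascade averages 0.11 units per query against 1.00 for large-only, roughly a 9× saving. The saving shrinks quickly as the escalation rate rises, which is why the threshold matters so much.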

What it looks like

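// Small model first: cheap and fast.
resp := small.Generate(ctx, prompt)

// Accept only if the model is confident AND the output passes the schema check.
if resp.Confidence > 0.85 && schema.Validates(resp.Output) {
    return resp.Output, nil
}

// Escalate: low confidence or invalid output, so pay for the large model.
resp = large.Generate(ctx, prompt)
return resp.Output, nil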

What I learned

The confidence threshold is the only knob that matters. Too high and everything escalates (no savings). Too low and quality drops. 0.80-0.85 is the working range for most workloads; tune per use case.
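One way to tune the threshold per use case is to replay a labelled evaluation set and see, for each candidate value, how much traffic escalates and how many wrong small-model answers slip through. The sketch below assumes you have such a set; the `Sample` type, the `Sweep` function, and the data are all invented for illustration.

```go
package main

import "fmt"

// Sample is one labelled query from a hypothetical evaluation set.
type Sample struct {
	Confidence   float64 // confidence the small model reported
	SmallCorrect bool    // whether the small model's answer was actually right
}

// Sweep reports, for one threshold, the fraction of queries escalated and
// the fraction where a wrong small-model answer was kept.
func Sweep(samples []Sample, threshold float64) (escalated, wrongKept float64) {
	if len(samples) == 0 {
		return 0, 0
	}
	var esc, bad int
	for _, s := range samples {
		switch {
		case s.Confidence <= threshold:
			esc++ // mirrors the cascade: keep only when confidence > threshold
		case !s.SmallCorrect:
			bad++ // kept the small answer, but it was wrong
		}
	}
	n := float64(len(samples))
	return float64(esc) / n, float64(bad) / n
}

func main() {
	// Made-up data for illustration only.
	samples := []Sample{
		{0.95, true}, {0.90, true}, {0.88, false}, {0.82, true},
		{0.75, true}, {0.70, false}, {0.60, false}, {0.40, false},
	}
	for _, t := range []float64{0.50, 0.80, 0.85, 0.92} {
		esc, bad := Sweep(samples, t)
		fmt.Printf("threshold %.2f: escalated %.0f%%, wrong answers kept %.0f%%\n",
			t, esc*100, bad*100)
	}
}
```

The two numbers pull in opposite directions: raising the threshold lowers the wrong-kept rate and raises the escalation rate. This sketch ignores schema validation, which would catch some of the wrong-kept cases independently.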

Schema validation is the cheapest second check. Confidence alone is unreliable; a small model can be confidently wrong. Validating the output against an expected schema catches structural failures without another LLM call.
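As an illustration of how cheap that check can be, here is a minimal sketch using only the standard library's `encoding/json`. The `Ticket` schema, the `validTicket` function, and the 1 to 5 priority range are assumptions made up for this example; the course's `schema.Validates` may work quite differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Ticket is a hypothetical schema the model is asked to fill in.
type Ticket struct {
	Category string `json:"category"`
	Priority int    `json:"priority"`
}

// validTicket reports whether output is exactly one JSON object that
// matches Ticket: no unknown fields, no trailing text, sane values.
func validTicket(output string) bool {
	dec := json.NewDecoder(strings.NewReader(output))
	dec.DisallowUnknownFields() // reject keys the schema does not define

	var t Ticket
	if err := dec.Decode(&t); err != nil {
		return false // not JSON, wrong types, or unknown field
	}
	// Anything other than EOF here means there was trailing content.
	if _, err := dec.Token(); err != io.EOF {
		return false
	}
	// Range checks are assumptions for this example.
	return t.Category != "" && t.Priority >= 1 && t.Priority <= 5
}

func main() {
	outputs := []string{
		`{"category":"billing","priority":2}`,
		`{"category":"billing","priority":"high"}`,
		`Sure! Here is the JSON: {"category":"billing","priority":2}`,
	}
	for _, o := range outputs {
		fmt.Printf("%v  %s\n", validTicket(o), o)
	}
}
```

A check like this catches structural failures (chatty preambles, wrong types, hallucinated fields) but says nothing about whether a well-formed answer is correct, so it complements the confidence score rather than replacing it.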

Production connection

Genie’s pkg/llm/router.go does cascade routing alongside sovereignty (which provider, which region) and budget (per-principal token cap). The cascade is the cost lever; sovereignty is the compliance lever. Both run on every call. The example’s clean isolation of the cascade pattern made the design easier when we wired all three together.
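To make the interplay concrete, here is a rough sketch of how the three concerns might compose. It is not Genie's router: `Provider`, `Plan`, the error values, and the ordering (budget check, then region filter, then small-before-large) are all hypothetical choices for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Provider is a hypothetical description of one model endpoint.
type Provider struct {
	Name   string
	Region string
	Large  bool
}

var (
	ErrOverBudget = errors.New("token budget exhausted")
	ErrNoProvider = errors.New("no provider allowed in region")
)

// Plan returns the providers to try, in cascade order, for one call.
func Plan(providers []Provider, region string, tokensUsed, tokenCap int) ([]Provider, error) {
	// Budget: refuse before spending anything.
	if tokensUsed >= tokenCap {
		return nil, ErrOverBudget
	}
	// Sovereignty: keep only providers in the permitted region.
	var small, large []Provider
	for _, p := range providers {
		if p.Region != region {
			continue
		}
		if p.Large {
			large = append(large, p)
		} else {
			small = append(small, p)
		}
	}
	// Cascade: cheap models first, expensive ones as the fallback.
	plan := append(small, large...)
	if len(plan) == 0 {
		return nil, ErrNoProvider
	}
	return plan, nil
}

func main() {
	providers := []Provider{
		{Name: "big-eu", Region: "eu", Large: true},
		{Name: "small-eu", Region: "eu", Large: false},
		{Name: "small-us", Region: "us", Large: false},
	}
	plan, err := Plan(providers, "eu", 1200, 10000)
	fmt.Println(plan, err)
}
```

Filtering by region before ordering by size means the cascade can only ever escalate within the set of compliant providers, so the cost lever never overrides the compliance lever.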

Credit & reference. This post is field notes on example 22 from Ardan Labs’ Ultimate AI by Bill Kennedy + Florin Pățan, licensed Apache 2.0. The original example: cmd/examples/example22-cascade/. My fork with notes: PratikDhanave/ai-training. Highly recommend the course for anyone building AI applications in Go — the material is rigorous and the Kronk + yzma + llama.cpp pipeline gives you hardware-accelerated local inference end-to-end. Thank you, Bill and Florin.