Field notes from working through example 22 of Ardan Labs’ Ultimate AI course by Bill Kennedy and Florin Pățan (Apache 2.0). My fork: PratikDhanave/ai-training. Thank you Bill and Florin for teaching this material — the patterns in this post are derived from the course; the production reflections at the end are mine.
What the example teaches
A cascade:
- Try the small model (3B params, sub-second, costs basically nothing).
- If its confidence is high and its output validates → return.
- Otherwise, escalate to the large model (70B+, slower, expensive).
Most queries are settled by the small model in the first two steps. The large model runs only for the hard ones. Average cost per query drops 5-10× while quality holds steady.
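The saving comes down to simple arithmetic: every query pays for the small model, and only the escalated fraction also pays for the large one. Here is a rough sketch of that calculation. The prices and escalation rates are made-up placeholders, not measurements from the course or from production.

```go
package main

import "fmt"

// expectedCost returns the average cost per query for a two-tier cascade.
// Every query pays for the small model; only the escalated fraction also
// pays for the large model.
func expectedCost(smallCost, largeCost, escalationRate float64) float64 {
	return smallCost + escalationRate*largeCost
}

func main() {
	// Hypothetical relative prices: small model = 1 unit, large model = 50 units.
	const smallCost, largeCost = 1.0, 50.0

	for _, rate := range []float64{0.05, 0.10, 0.20} {
		cascade := expectedCost(smallCost, largeCost, rate)
		fmt.Printf("escalation %.0f%%: cascade %.1f vs large-only %.1f (%.1fx cheaper)\n",
			rate*100, cascade, largeCost, largeCost/cascade)
	}
}
```

With these placeholder prices, a 10% escalation rate costs 6 units per query against 50 for large-only. The actual multiple depends on the real price ratio and on how often the small model has to hand off.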
What it looks like
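// Try the small model first.
resp := small.Generate(ctx, prompt)
// Accept its answer only if it is confident and the output validates.
if resp.Confidence > 0.85 && schema.Validates(resp.Output) {
	return resp.Output, nil
}
// Otherwise escalate to the large model and return its answer.
resp = large.Generate(ctx, prompt)
return resp.Output, nil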
What I learned
The confidence threshold is the only knob that matters. Too high and everything escalates (no savings). Too low and quality drops. 0.80-0.85 is the working range for most workloads; tune per use case.
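One way to tune the threshold per use case is to replay a labelled evaluation set at several thresholds and look at both sides of the trade-off. The sketch below assumes you have such a set; the `Sample` type and the data are invented for illustration.

```go
package main

import "fmt"

// Sample is one hypothetical labelled evaluation record: the small model's
// confidence on a prompt and whether its answer was actually correct.
type Sample struct {
	Confidence float64
	Correct    bool
}

// Sweep reports, for a given threshold, the fraction of samples that would
// escalate and the fraction of accepted small-model answers that were wrong.
func Sweep(samples []Sample, threshold float64) (escalationRate, acceptedErrorRate float64) {
	var escalated, accepted, acceptedWrong int
	for _, s := range samples {
		if s.Confidence > threshold {
			accepted++
			if !s.Correct {
				acceptedWrong++
			}
		} else {
			escalated++
		}
	}
	if len(samples) > 0 {
		escalationRate = float64(escalated) / float64(len(samples))
	}
	if accepted > 0 {
		acceptedErrorRate = float64(acceptedWrong) / float64(accepted)
	}
	return escalationRate, acceptedErrorRate
}

func main() {
	// Made-up evaluation data, for illustration only.
	samples := []Sample{
		{0.95, true}, {0.91, true}, {0.88, false}, {0.86, true},
		{0.83, true}, {0.81, false}, {0.74, false}, {0.60, false},
	}
	for _, t := range []float64{0.70, 0.80, 0.85, 0.90} {
		esc, wrong := Sweep(samples, t)
		fmt.Printf("threshold %.2f: escalate %.0f%%, accepted-but-wrong %.0f%%\n", t, esc*100, wrong*100)
	}
}
```

The escalation rate is the cost side and the accepted-but-wrong rate is the quality side. The threshold to pick is the lowest one whose error rate you can live with.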
Schema validation is the cheapest second check. Confidence alone is unreliable; a small model can be confidently wrong. Validating the output against an expected schema catches structural failures without another LLM call.
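As a sketch of what that structural check can look like in Go, the standard library's `encoding/json` decoder is enough for simple cases. The `Ticket` schema, its allowed categories, and the priority range are hypothetical examples, not something taken from the course.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

// Ticket is a hypothetical output schema the model is asked to produce.
type Ticket struct {
	Category string `json:"category"`
	Priority int    `json:"priority"`
}

// ValidateTicket accepts raw only if it is exactly one JSON object with no
// unknown fields, a known category, and a priority between 1 and 5.
func ValidateTicket(raw string) (Ticket, error) {
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields()

	var t Ticket
	if err := dec.Decode(&t); err != nil {
		return Ticket{}, fmt.Errorf("decode: %w", err)
	}
	// A second Decode must hit EOF; anything else means trailing content,
	// such as prose the model appended after the JSON.
	if err := dec.Decode(&struct{}{}); !errors.Is(err, io.EOF) {
		return Ticket{}, errors.New("trailing data after JSON object")
	}

	switch t.Category {
	case "billing", "bug", "other":
	default:
		return Ticket{}, fmt.Errorf("unknown category %q", t.Category)
	}
	if t.Priority < 1 || t.Priority > 5 {
		return Ticket{}, fmt.Errorf("priority %d out of range 1-5", t.Priority)
	}
	return t, nil
}

func main() {
	outputs := []string{
		`{"category": "bug", "priority": 2}`,
		`{"category": "bug", "priority": 9}`,
		`{"category": "bug", "priority": 2} Hope this helps!`,
		`Sure! The category is bug.`,
	}
	for _, raw := range outputs {
		if _, err := ValidateTicket(raw); err != nil {
			fmt.Println("escalate:", err)
			continue
		}
		fmt.Println("accept")
	}
}
```

This only catches structural failures. A well-formed ticket with the wrong category still passes, which is why the check complements the confidence score rather than replacing it.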
Production connection
Genie’s pkg/llm/router.go does cascade routing alongside sovereignty (which provider, which region) and budget (a per-principal token cap). The cascade is the cost lever; sovereignty is the compliance lever. Both run on every call. Seeing the cascade pattern cleanly isolated in the example made the design easier when we wired all three together.
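To make the three-way split concrete, here is a toy sketch of the two gates that might sit in front of a cascade. It is not Genie's router: every type, field, and error name is hypothetical, and the ordering (region check, then budget, then cascade) is an assumption made for the illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// All names below are hypothetical; this is not Genie's router.

// Request carries what the gates need to know about a call.
type Request struct {
	Principal string
	Region    string // region the data must stay in
	Tokens    int    // estimated tokens for this call
}

// Provider is an inference backend pinned to a region.
type Provider struct {
	Name   string
	Region string
}

// Router holds the provider list and the remaining tokens per principal.
type Router struct {
	Providers []Provider
	Budget    map[string]int
}

var (
	ErrNoProvider = errors.New("no provider in required region")
	ErrOverBudget = errors.New("principal over token budget")
)

// Admit runs the sovereignty gate, then the budget gate. On success it
// reserves the tokens and returns the provider the cascade should use.
func (r *Router) Admit(req Request) (Provider, error) {
	var chosen *Provider
	for i := range r.Providers {
		if r.Providers[i].Region == req.Region {
			chosen = &r.Providers[i]
			break
		}
	}
	if chosen == nil {
		return Provider{}, ErrNoProvider
	}
	if r.Budget[req.Principal] < req.Tokens {
		return Provider{}, ErrOverBudget
	}
	r.Budget[req.Principal] -= req.Tokens
	return *chosen, nil
}

func main() {
	r := &Router{
		Providers: []Provider{
			{Name: "local-eu", Region: "eu"},
			{Name: "local-us", Region: "us"},
		},
		Budget: map[string]int{"alice": 1000},
	}
	p, err := r.Admit(Request{Principal: "alice", Region: "eu", Tokens: 400})
	fmt.Println(p.Name, err)
	// The cascade (small model first, large on escalation) would run here,
	// against the admitted provider.
}
```

Keeping the gates separate from the cascade mirrors the point above: compliance and budget decide whether and where a call may run, and the cascade decides how much model to spend on it.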
Credit & reference. This post is field notes on example 22 from Ardan Labs’ Ultimate AI by Bill Kennedy + Florin Pățan, licensed Apache 2.0. The original example: cmd/examples/example22-cascade/. My fork with notes: PratikDhanave/ai-training. Highly recommend the course for anyone building AI applications in Go — the material is rigorous and the Kronk + yzma + llama.cpp pipeline gives you hardware-accelerated local inference end-to-end. Thank you, Bill and Florin.