Ardan Ultimate AI #19 — Speculative decoding with a draft model

Field notes from working through example 19 of Ardan Labs’ Ultimate AI course by Bill Kennedy and Florin Pățan (Apache 2.0). My fork: PratikDhanave/ai-training. Thank you Bill and Florin for teaching this material — the patterns in this post are derived from the course; the production reflections at the end are mine.

What the example teaches

The large model generates one token at a time, and each token costs a full forward pass, so it is slow. A small draft model can cheaply propose a run of 4-8 tokens: fast, but lower quality. Speculative decoding combines the two:

Draft model emits N candidate tokens.
Large model verifies all N in one parallel pass.
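Accept the longest prefix that matches what the large model would have produced; at the first mismatch, keep the large model's token, discard the rest of the draft, and start the next round from there.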

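End result: the same output as running the large model alone (identical tokens under greedy decoding, the same distribution under sampling), typically with 2-3× lower latency when the draft model's guesses are accepted often. The speedup depends on the acceptance rate and on how cheap the draft model is; it is not guaranteed.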

What it looks like

The example uses Kronk (Ardan’s llama.cpp wrapper) which exposes speculative decoding as a configuration:

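client := kronk.New(kronk.Config{
    Model:          "llama-70b",
    DraftModel:     "llama-3b",
    Speculative:    true,
    NumDraftTokens: 6, // tokens the draft model proposes per round
})

resp, err := client.Generate(ctx, prompt)
if err != nil {
    log.Fatal(err) // surface the failure instead of discarding it
}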

The math is in the verification step; the Go code just turns it on.
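For sampled (non-greedy) decoding, "matches" becomes a probabilistic test. A commonly described form of the rule is sketched below; the notation is introduced here and is not from the course: $p$ is the large model's next-token distribution, $q$ is the draft model's, and $x$ is a drafted token. Whether Kronk or llama.cpp apply exactly this rule is not something these notes confirm.

```latex
\Pr[\text{accept } x] = \min\!\left(1,\ \frac{p(x)}{q(x)}\right),
\qquad
\text{on rejection, resample } x' \sim
\frac{\max\bigl(0,\ p(\cdot) - q(\cdot)\bigr)}{\sum_{y} \max\bigl(0,\ p(y) - q(y)\bigr)}
```

Accepting with this probability and resampling from the normalised residual is what keeps the output distributed as $p$. Under greedy decoding it reduces to the prefix match described above: accept the drafted token if it is the large model's top choice, otherwise take the large model's token.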

What I learned

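Speculative decoding works because most tokens are predictable. The draft model’s hit rate on common patterns (“the”, “of”, end-of-sentence) is high, so most rounds accept several tokens at once. The large model still scores every token, but it checks a whole draft in one batched pass, so its expensive sequential steps are spent only where the draft guesses wrong. Production serving stacks such as vLLM and TensorRT-LLM support the technique, and a hosted provider can enable it behind its API without announcing it.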

It’s invisible to the application. The API is the same; the optimisation is server-side. Worth knowing the technique exists when you’re explaining “why is the same model faster now.”

Production connection

When we evaluated Bedrock vs Vertex vs Ollama for a Searce client, the latency comparisons were apples-to-oranges because some endpoints had speculative decoding enabled and others didn’t. Knowing the trick exists made the comparison honest.

Credit & reference. This post is field notes on example 19 from Ardan Labs’ Ultimate AI by Bill Kennedy + Florin Pățan, licensed Apache 2.0. The original example: cmd/examples/example19-speculative/. My fork with notes: PratikDhanave/ai-training. Highly recommend the course for anyone building AI applications in Go — the material is rigorous and the Kronk + yzma + llama.cpp pipeline gives you hardware-accelerated local inference end-to-end. Thank you, Bill and Florin.