Field notes from working through example 19 of Ardan Labs’ Ultimate AI course by Bill Kennedy and Florin Pățan (Apache 2.0). My fork: PratikDhanave/ai-training. Thank you Bill and Florin for teaching this material — the patterns in this post are derived from the course; the production reflections at the end are mine.
What the example teaches
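The large model generates one token at a time, and every token costs a full forward pass, so it is slow. A small draft model also generates token by token, but each step is cheap enough that it can run 4-8 tokens ahead; its guesses are fast but lower quality. Speculative decoding combines the two: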
- Draft model emits N candidate tokens.
- Large model verifies all N in one parallel pass.
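- Accept the longest prefix the large model agrees with; at the first mismatch, keep the large model's own token and discard the rest of the draft.
End result: output equivalent to running the large model alone (identical under greedy decoding, the same distribution under sampling), with latency often 2-3× lower when the draft's guesses are accepted at a high rate.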
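The three steps above can be sketched as a toy loop. This is a minimal sketch under stated assumptions, not Kronk's or llama.cpp's implementation: `Model`, `SpeculativeStep`, `Generate`, and `fromScript` are hypothetical names, the "models" are scripted functions that pick one token greedily, and the verification loop only simulates what a real engine would do in a single batched pass.

```go
package main

import "fmt"

// Model is a hypothetical stand-in for an LLM: given the tokens so far,
// it returns the single next token it would pick under greedy decoding.
type Model func(context []string) string

// SpeculativeStep runs one draft-then-verify round and returns the tokens
// to append. It always returns at least one token.
func SpeculativeStep(target, draft Model, context []string, n int) []string {
	// 1. The draft model proposes n tokens, one after another.
	proposal := make([]string, 0, n)
	ctx := append([]string{}, context...)
	for i := 0; i < n; i++ {
		tok := draft(ctx)
		proposal = append(proposal, tok)
		ctx = append(ctx, tok)
	}

	// 2. The target model checks every drafted position. A real engine does
	// this in one batched forward pass; the loop here only simulates it.
	accepted := make([]string, 0, n+1)
	ctx = append([]string{}, context...)
	for _, tok := range proposal {
		want := target(ctx)
		if want != tok {
			// 3. First mismatch: keep the target's token, drop the rest.
			return append(accepted, want)
		}
		accepted = append(accepted, tok)
		ctx = append(ctx, tok)
	}
	// Every draft matched, so the target's next token comes along as a bonus.
	return append(accepted, target(ctx))
}

// Generate repeats rounds until maxTokens new tokens exist and reports
// how many verification rounds were needed.
func Generate(target, draft Model, prompt []string, n, maxTokens int) ([]string, int) {
	out := append([]string{}, prompt...)
	rounds := 0
	for len(out)-len(prompt) < maxTokens {
		out = append(out, SpeculativeStep(target, draft, out, n)...)
		rounds++
	}
	return out[:len(prompt)+maxTokens], rounds
}

// fromScript builds a toy model that recites a fixed script by position.
func fromScript(script []string) Model {
	return func(context []string) string {
		if len(context) < len(script) {
			return script[len(context)]
		}
		return "<eos>"
	}
}

func main() {
	target := fromScript([]string{"the", "cat", "sat", "on", "the", "mat"})
	// The draft disagrees with the target at the third token only.
	draft := fromScript([]string{"the", "cat", "ran", "on", "the", "mat"})

	out, rounds := Generate(target, draft, nil, 4, 6)
	fmt.Println(out, rounds) // [the cat sat on the mat] 2
}
```

Six tokens come out of two verification rounds instead of six sequential large-model steps, and the text is exactly what the target alone would have produced.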
What it looks like
The example uses Kronk (Ardan’s llama.cpp wrapper) which exposes speculative decoding as a configuration:
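client := kronk.New(kronk.Config{
	Model:          "llama-70b", // large target model
	DraftModel:     "llama-3b",  // small draft model
	Speculative:    true,        // turn on draft-and-verify
	NumDraftTokens: 6,           // tokens drafted per round (N)
})
resp, err := client.Generate(ctx, prompt)
if err != nil {
	log.Fatal(err)
}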
The math is in the verification step; the Go code just turns it on.
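For a sense of what that verification math looks like when the models sample rather than decode greedily, here is a minimal sketch of the acceptance rule described in the speculative sampling literature: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the large model's distribution and q is the draft's; on rejection, resample from the normalised positive part of p - q. I have not checked which variant Kronk or llama.cpp applies, and `Verify`, `Residual`, `sample`, and `Empirical` are hypothetical names with made-up three-token distributions.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Residual returns max(0, p-q) normalised to sum to 1: the distribution
// to resample from after a drafted token is rejected.
func Residual(p, q []float64) []float64 {
	r := make([]float64, len(p))
	sum := 0.0
	for i := range p {
		if d := p[i] - q[i]; d > 0 {
			r[i] = d
			sum += d
		}
	}
	if sum == 0 { // p equals q, so a rejection cannot happen; fall back to p
		copy(r, p)
		return r
	}
	for i := range r {
		r[i] /= sum
	}
	return r
}

// sample draws an index from a discrete distribution.
func sample(dist []float64, rng *rand.Rand) int {
	u, acc := rng.Float64(), 0.0
	for i, w := range dist {
		acc += w
		if u < acc {
			return i
		}
	}
	return len(dist) - 1
}

// Verify decides the fate of one drafted token x (assumed drawn from q).
// u*q[x] < p[x] is the same test as u < min(1, p[x]/q[x]) without dividing.
func Verify(p, q []float64, x int, rng *rand.Rand) (token int, accepted bool) {
	if rng.Float64()*q[x] < p[x] {
		return x, true
	}
	return sample(Residual(p, q), rng), false
}

// Empirical drafts from q, verifies against p, and returns the observed
// frequency of each final token.
func Empirical(p, q []float64, trials int, seed int64) []float64 {
	rng := rand.New(rand.NewSource(seed))
	freq := make([]float64, len(p))
	for i := 0; i < trials; i++ {
		tok, _ := Verify(p, q, sample(q, rng), rng)
		freq[tok] += 1 / float64(trials)
	}
	return freq
}

func main() {
	p := []float64{0.6, 0.3, 0.1} // large model (illustrative numbers)
	q := []float64{0.3, 0.3, 0.4} // draft model (illustrative numbers)
	fmt.Printf("target %.2f\n", p)
	fmt.Printf("output %.2f\n", Empirical(p, q, 200000, 1))
}
```

The observed frequencies land on the large model's distribution even though every candidate came from the draft, which is why the technique can be turned on without changing what the model says.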
What I learned
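Speculative decoding works because most tokens are predictable. The draft model’s hit rate on common patterns (“the”, “of”, end-of-sentence) is high, so several drafted tokens are often accepted per verification pass. The large model still scores every token, but it does so in batches; its slow sequential steps are spent mainly where the draft guesses wrong. Production serving stacks such as vLLM and TensorRT-LLM support the technique; hosted providers may use it too, but they rarely document their serving internals.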
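How much the hit rate matters can be estimated with a back-of-the-envelope model from the speculative decoding literature. Assuming each drafted token is accepted independently with probability α, a round with N drafts yields (1 - α^(N+1)) / (1 - α) tokens on average, and if one draft step costs a fraction c of a large-model step, the speedup is roughly that figure divided by (N·c + 1). The sketch below just evaluates those formulas; `ExpectedTokens` and `Speedup` are hypothetical names, and the independence assumption and the value of c are simplifications, not measurements.

```go
package main

import (
	"fmt"
	"math"
)

// ExpectedTokens is the average number of tokens produced per large-model
// pass when each of n drafted tokens is accepted with probability alpha.
func ExpectedTokens(alpha float64, n int) float64 {
	if alpha >= 1 {
		return float64(n + 1) // every draft accepted, plus the bonus token
	}
	return (1 - math.Pow(alpha, float64(n+1))) / (1 - alpha)
}

// Speedup estimates the latency gain over plain decoding when one draft
// step costs c times a large-model step (c is an assumed ratio).
func Speedup(alpha float64, n int, c float64) float64 {
	return ExpectedTokens(alpha, n) / (float64(n)*c + 1)
}

func main() {
	const n, c = 6, 0.05 // 6 drafts per round; draft step at 5% of target cost
	for _, alpha := range []float64{0.5, 0.6, 0.7, 0.8, 0.9} {
		fmt.Printf("alpha=%.1f tokens/pass=%.2f speedup=%.2fx\n",
			alpha, ExpectedTokens(alpha, n), Speedup(alpha, n, c))
	}
}
```

Under these assumptions the gain falls off quickly as the acceptance rate drops, which is one reason a quoted speedup only holds for the workload it was measured on.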
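It’s invisible to the application. The request and response look the same; the optimisation lives in the inference layer (server-side for hosted endpoints, a config flag in a local setup like the Kronk one). Worth knowing the technique exists when you’re explaining “why is the same model faster now.”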
Production connection
When we evaluated Bedrock vs Vertex vs Ollama for a Searce client, the latency comparisons were apples-to-oranges because some endpoints had speculative decoding enabled and others didn’t. Knowing the trick exists made the comparison honest.
Credit & reference. This post is field notes on example 19 from Ardan Labs’ Ultimate AI by Bill Kennedy + Florin Pățan, licensed Apache 2.0. The original example: cmd/examples/example19-speculative/. My fork with notes: PratikDhanave/ai-training. Highly recommend the course for anyone building AI applications in Go — the material is rigorous and the Kronk + yzma + llama.cpp pipeline gives you hardware-accelerated local inference end-to-end. Thank you, Bill and Florin.