Ardan Labs Go LLM Ops Performance

Field notes from working through example 19 of Ardan Labs’ Ultimate AI course by Bill Kennedy and Florin Pățan (Apache 2.0). My fork: PratikDhanave/ai-training. Thank you Bill and Florin for teaching this material — the patterns in this post are derived from the course; the production reflections at the end are mine.

What the example teaches

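The large model generates one token at a time, and every token costs a full forward pass, so it is slow. A small draft model also generates one token at a time, but each step is cheap, so it can propose 4-8 tokens quickly, at lower quality. Speculative decoding combines the two: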

  1. Draft model emits N candidate tokens.
  2. Large model verifies all N in one parallel pass.
  3. Accept the longest prefix the large model agrees with; at the first mismatch, keep the large model's own token and discard the rest of the draft.

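End result: the same output as running the large model alone (token-for-token under greedy decoding, the same distribution under sampling), with latency that is often 2-3× lower when the draft model guesses well.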

What it looks like

The example uses Kronk (Ardan’s llama.cpp wrapper) which exposes speculative decoding as a configuration:

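client := kronk.New(kronk.Config{
    Model:          "llama-70b", // large target model
    DraftModel:     "llama-3b",  // small draft model
    Speculative:    true,        // turn on draft-then-verify decoding
    NumDraftTokens: 6,           // tokens the draft proposes per round
})

// Check the error instead of discarding it.
resp, err := client.Generate(ctx, prompt)
if err != nil {
    log.Fatalf("generate: %v", err)
}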

The math is in the verification step; the Go code just turns it on.

What I learned

Speculative decoding works because most tokens are predictable. The draft model’s hit rate on common patterns (“the”, “of”, end-of-sentence) is high. The large model still scores every token, but it checks a whole run of drafted tokens in one pass, so its slow sequential steps are spent only where the draft guesses wrong. Serving stacks such as vLLM and TensorRT-LLM support the technique, and hosted providers may use it without saying so.

It’s invisible to the application. The API is the same; the optimisation lives in the inference layer, whether that is a local engine or a hosted endpoint. Worth knowing the technique exists when you’re explaining “why is the same model faster now.”

Production connection

When we evaluated Bedrock vs Vertex vs Ollama for a Searce client, the latency comparisons were apples-to-oranges because some endpoints had speculative decoding enabled and others didn’t. Knowing the trick exists made the comparison honest.


Credit & reference. This post is field notes on example 19 from Ardan Labs’ Ultimate AI by Bill Kennedy + Florin Pățan, licensed Apache 2.0. The original example: cmd/examples/example19-speculative/. My fork with notes: PratikDhanave/ai-training. Highly recommend the course for anyone building AI applications in Go — the material is rigorous and the Kronk + yzma + llama.cpp pipeline gives you hardware-accelerated local inference end-to-end. Thank you, Bill and Florin.
