Ardan Ultimate AI #18 — Incremental message caching (IMC) for chat

Field notes from working through example 18 of Ardan Labs’ Ultimate AI course by Bill Kennedy and Florin Pățan (Apache 2.0). My fork: PratikDhanave/ai-training. Thank you Bill and Florin for teaching this material — the patterns in this post are derived from the course; the production reflections at the end are mine.

What the example teaches

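Without prefix caching, an LLM server handling a 20-turn conversation re-encodes all 20 earlier turns when the 21st arrives, because the KV cache built for turn 20 was discarded once that response finished. That is wasted work: the prefix has not changed.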

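With prefix caching, the server keeps the KV cache for the shared prefix, so turn 21 only needs to encode the new user message. Prompt-processing (prefill) cost per turn goes from “scales with the whole conversation” to “scales with the new message.” Generating the reply still attends over the full context, so the saving shows up mainly in time to first token.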
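To make the saving concrete, here is a minimal sketch of the arithmetic, assuming the server compares the incoming token sequence against the cached one and only prefills whatever follows the shared prefix. The function names and the toy token IDs are mine, not from the course example or from Kronk.

```go
package main

import "fmt"

// commonPrefixLen returns how many leading tokens the two sequences share.
func commonPrefixLen(cached, incoming []int) int {
	n := 0
	for n < len(cached) && n < len(incoming) && cached[n] == incoming[n] {
		n++
	}
	return n
}

// tokensToEncode returns how many tokens of incoming still need a prefill
// pass, given the tokens whose KV entries are already cached.
func tokensToEncode(cached, incoming []int) int {
	return len(incoming) - commonPrefixLen(cached, incoming)
}

func main() {
	cached := []int{1, 2, 3, 4, 5, 6}         // toy IDs: everything through turn 20
	incoming := []int{1, 2, 3, 4, 5, 6, 7, 8} // same history plus the new user message

	fmt.Println(tokensToEncode(nil, incoming))    // no cache: 8
	fmt.Println(tokensToEncode(cached, incoming)) // with cache: 2
}
```

The second number depends only on the length of the new message, which is the whole point: the history can grow without the per-turn prefill growing with it.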

What it looks like

Application code structures the message history in a stable order so the cache prefix matches:

// good — system prompt + conversation grows append-only
messages := []Message{
    {Role: "system", Content: systemPrompt},   // stable
    {Role: "user", Content: turn1User},        // stable after turn 1
    {Role: "assistant", Content: turn1AI},     // stable after turn 1
    {Role: "user", Content: turn2User},        // stable after turn 2
    // ... grows by appending
}

// bad — RAG context inserted between system and history shifts the prefix
//       on every turn, invalidating the cache.

The Kronk client passes a prefix_id header; the server matches the prefix against its cache.
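I have not read how the server does its matching, so the following is only a sketch of one way a prefix lookup could work, not Kronk's implementation. It assumes the cache is keyed by a chained hash over the messages, so that a change anywhere in the history changes every later key. `prefixKeys` and `longestCachedPrefix` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Message is a stand-in type for this sketch.
type Message struct {
	Role    string
	Content string
}

// prefixKeys returns one key per prefix: keys[i] identifies messages[:i+1].
// Each key folds in the previous hash, so editing or inserting a message
// changes that key and every key after it.
func prefixKeys(messages []Message) []string {
	keys := make([]string, 0, len(messages))
	var prev [sha256.Size]byte
	for _, m := range messages {
		h := sha256.New()
		h.Write(prev[:])
		h.Write([]byte(m.Role))
		h.Write([]byte{0}) // separator between role and content
		h.Write([]byte(m.Content))
		copy(prev[:], h.Sum(nil))
		keys = append(keys, hex.EncodeToString(prev[:]))
	}
	return keys
}

// longestCachedPrefix returns how many leading messages are covered by a
// cache entry, checking the longest prefix first.
func longestCachedPrefix(cache map[string]bool, messages []Message) int {
	keys := prefixKeys(messages)
	for i := len(keys) - 1; i >= 0; i-- {
		if cache[keys[i]] {
			return i + 1
		}
	}
	return 0
}

func main() {
	history := []Message{
		{Role: "system", Content: "You are helpful."},
		{Role: "user", Content: "q1"},
		{Role: "assistant", Content: "a1"},
	}
	// Pretend the server cached every prefix of the previous request.
	cache := map[string]bool{}
	for _, k := range prefixKeys(history) {
		cache[k] = true
	}

	appended := append(append([]Message{}, history...), Message{Role: "user", Content: "q2"})
	fmt.Println(longestCachedPrefix(cache, appended)) // 3: only q2 is new

	shifted := []Message{history[0], {Role: "system", Content: "retrieved context"}, history[1], history[2]}
	fmt.Println(longestCachedPrefix(cache, shifted)) // 1: everything after the system prompt misses
}
```

A real server matches at the token level rather than per message, but the behaviour is the same in spirit: one inserted message near the top turns almost the whole request into a miss.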

What I learned

Message-history shape is the cache key. If you insert RAG context between the system prompt and the conversation history, the prefix shifts every turn and the cache never hits. Append-only history with RAG context at the end (or out-of-band) preserves the prefix.
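The two layouts can be compared directly. This is a small sketch, assuming retrieved context is either injected as a second system message or attached to the newest user turn; `ragInMiddle`, `ragAtEnd`, and `sharedPrefix` are illustrative names of mine, and the `Message` type is a stand-in for whatever your client uses.

```go
package main

import "fmt"

// Message is a stand-in type for this sketch.
type Message struct {
	Role    string
	Content string
}

// ragInMiddle places retrieved context right after the system prompt.
func ragInMiddle(system string, history []Message, rag, user string) []Message {
	out := []Message{{Role: "system", Content: system}, {Role: "system", Content: rag}}
	out = append(out, history...)
	return append(out, Message{Role: "user", Content: user})
}

// ragAtEnd keeps the history append-only and attaches the context to the
// newest user turn.
func ragAtEnd(system string, history []Message, rag, user string) []Message {
	out := []Message{{Role: "system", Content: system}}
	out = append(out, history...)
	return append(out, Message{Role: "user", Content: rag + "\n\n" + user})
}

// sharedPrefix counts the leading messages that are identical in both requests.
func sharedPrefix(a, b []Message) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

func main() {
	h2 := []Message{{Role: "user", Content: "q1"}, {Role: "assistant", Content: "a1"}}

	// Context in the middle: the second message changes on every turn.
	mid2 := ragInMiddle("sys", h2, "ctx A", "q2")
	h3 := append(append([]Message{}, h2...),
		Message{Role: "user", Content: "q2"}, Message{Role: "assistant", Content: "a2"})
	mid3 := ragInMiddle("sys", h3, "ctx B", "q3")
	fmt.Println(sharedPrefix(mid2, mid3)) // 1: only the system prompt is reusable

	// Context at the end: the previous request is carried forward verbatim.
	end2 := ragAtEnd("sys", h2, "ctx A", "q2")
	h3end := append(append([]Message{}, end2[1:]...), Message{Role: "assistant", Content: "a2"})
	end3 := ragAtEnd("sys", h3end, "ctx B", "q3")
	fmt.Println(sharedPrefix(end2, end3)) // 4: the whole previous request is reusable
}
```

One detail the sketch makes visible: with context at the end, the earlier user turn has to be stored exactly as it was sent, context included. If you strip the old context out of the history on the next turn, you have edited the prefix and the cache misses again.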

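Production serving relies on this. OpenAI, Anthropic, and Bedrock all offer server-side prefix caching, though the mechanics differ: OpenAI applies it automatically to sufficiently long prompts, while Anthropic and Bedrock have you mark cache breakpoints in the request. Either way, the cache only hits when the leading part of the request is unchanged, so message structure still decides whether you benefit. Most application engineers don’t think about it; the ones who do can see something like 2-5× lower latency on long conversations, depending on prompt length and provider.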

Production connection

Genie’s chat handler appends to the history rather than inserting in the middle. Now I know that’s not just a code style — it’s a cache-friendliness decision. The example’s framing made the reason explicit.

Credit & reference. This post is field notes on example 18 from Ardan Labs’ Ultimate AI by Bill Kennedy + Florin Pățan, licensed Apache 2.0. The original example: cmd/examples/example18-prefix-cache/. My fork with notes: PratikDhanave/ai-training. Highly recommend the course for anyone building AI applications in Go — the material is rigorous and the Kronk + yzma + llama.cpp pipeline gives you hardware-accelerated local inference end-to-end. Thank you, Bill and Florin.