Ardan Ultimate AI #20 — Embedding-based semantic cache

Field notes from working through example 20 of Ardan Labs’ Ultimate AI course by Bill Kennedy and Florin Pățan (Apache 2.0). My fork: PratikDhanave/ai-training. Thank you Bill and Florin for teaching this material — the patterns in this post are derived from the course; the production reflections at the end are mine.

What the example teaches

Traditional cache: key on the literal prompt string. Miss on every paraphrase.

Semantic cache: embed the prompt, run a nearest-neighbour search against the cached embeddings, and return the cached answer if the similarity exceeds a threshold.

What it looks like

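// threshold is the minimum similarity for a cache hit; tune per workload.
const threshold = 0.92

// cached returns a stored response when a previously seen prompt is
// similar enough to this one. embed, cache and llm are placeholders for
// the embedding client, the cache store and the chat client.
func cached(ctx context.Context, prompt string) (string, bool) {
    queryEmb := embed.Generate(prompt)
    hit := cache.NearestNeighbour(queryEmb, threshold)
    if hit != nil {
        return hit.Response, true
    }
    return "", false
}

// generate serves from the cache when it can; otherwise it calls the
// model and stores the new response for later paraphrases.
func generate(ctx context.Context, prompt string) string {
    if resp, ok := cached(ctx, prompt); ok {
        return resp
    }
    resp := llm.Chat(ctx, prompt)
    cache.Insert(embed.Generate(prompt), resp)
    return resp
}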
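
The snippet above leaves the store abstract. As a minimal sketch of what a nearest-neighbour lookup could look like, assuming cosine similarity and a plain linear scan over an in-memory slice: the names semanticCache, insert, nearest and cosine are mine for illustration, not the course's API, and the three-element vectors are toy stand-ins for real embeddings.

```go
package main

import (
	"fmt"
	"math"
)

// entry pairs a prompt embedding with the response generated for it.
type entry struct {
	embedding []float64
	response  string
}

// semanticCache is a hypothetical in-memory store. A linear scan is
// fine for a small cache; a vector index would replace it at scale.
type semanticCache struct {
	entries []entry
}

// cosine returns the cosine similarity of a and b, or 0 when the
// vectors differ in length or either has zero magnitude.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// insert appends a new embedding/response pair.
func (c *semanticCache) insert(emb []float64, resp string) {
	c.entries = append(c.entries, entry{embedding: emb, response: resp})
}

// nearest returns the response of the most similar entry, but only
// when its similarity is strictly above threshold.
func (c *semanticCache) nearest(q []float64, threshold float64) (string, float64, bool) {
	best, bestSim := -1, -1.0
	for i, e := range c.entries {
		if s := cosine(q, e.embedding); s > bestSim {
			best, bestSim = i, s
		}
	}
	if best < 0 || bestSim <= threshold {
		return "", bestSim, false
	}
	return c.entries[best].response, bestSim, true
}

func main() {
	c := &semanticCache{}
	c.insert([]float64{1, 0, 0}, "reset your password from the login page")

	// A nearby vector stands in for a paraphrase of the cached prompt.
	resp, sim, ok := c.nearest([]float64{0.98, 0.1, 0}, 0.92)
	fmt.Println(resp, sim, ok)
}
```

The scan is O(n) per lookup, which is acceptable for a sketch; the point is only that a "hit" is the best-scoring entry and that the threshold check happens after the search, not during it.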

What I learned

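The threshold is the entire game. Too high and you barely cache anything. Too low and the cache returns “close enough” answers that are actually wrong. Treat 0.90-0.95 as a starting range and tune per workload; similarity scores are not comparable across embedding models, so a threshold that works with one model will not carry over to another.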
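
One way to tune the threshold is to sweep it over a small labelled set and watch the trade-off directly. The sketch below assumes you have already recorded, for each test query, the similarity to its nearest cached prompt and a judgement on whether the cached answer would have been correct. The pair type, the sweep function and all the numbers are made up for illustration.

```go
package main

import "fmt"

// pair is one labelled sample: the similarity between a query and its
// nearest cached prompt, and whether reusing that cached answer would
// have been correct (judged by hand or against a test set).
type pair struct {
	similarity float64
	sameAnswer bool
}

// sweep reports, for one threshold, the fraction of queries served
// from the cache and the fraction of those hits that were wrong.
func sweep(pairs []pair, threshold float64) (hitRate, wrongRate float64) {
	var hits, wrong int
	for _, p := range pairs {
		if p.similarity > threshold {
			hits++
			if !p.sameAnswer {
				wrong++
			}
		}
	}
	if len(pairs) > 0 {
		hitRate = float64(hits) / float64(len(pairs))
	}
	if hits > 0 {
		wrongRate = float64(wrong) / float64(hits)
	}
	return hitRate, wrongRate
}

func main() {
	// Illustrative values only, not measurements.
	pairs := []pair{
		{0.97, true}, {0.94, true}, {0.93, false},
		{0.91, true}, {0.88, false}, {0.80, false},
	}
	for _, t := range []float64{0.85, 0.90, 0.92, 0.95} {
		h, w := sweep(pairs, t)
		fmt.Printf("threshold=%.2f hit=%.2f wrong=%.2f\n", t, h, w)
	}
}
```

Picking the threshold then becomes a question of how many wrong hits the workload can tolerate, rather than a guess.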

TTL matters because answers go stale. Static FAQ → long TTL. Recent-data queries → short TTL or no cache. Add a per-entry expiry on insert.
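
A per-entry expiry can be sketched as follows. To keep it short the cache here is keyed by a plain string; in a semantic cache the expiresAt field would sit next to the embedding on each entry. The types ttlCache and manualClock are hypothetical names of mine, and the hand-advanced clock exists only so expiry can be demonstrated without sleeping.

```go
package main

import (
	"fmt"
	"time"
)

// manualClock is a hand-advanced clock for demonstrating expiry.
type manualClock struct{ t time.Time }

func (m *manualClock) now() time.Time          { return m.t }
func (m *manualClock) advance(d time.Duration) { m.t = m.t.Add(d) }

// ttlEntry stores a response with the instant at which it goes stale.
type ttlEntry struct {
	response  string
	expiresAt time.Time
}

// ttlCache takes its clock as a function so callers can pass time.Now
// in production and a fake clock elsewhere.
type ttlCache struct {
	entries map[string]ttlEntry
	now     func() time.Time
}

func newTTLCache(now func() time.Time) *ttlCache {
	return &ttlCache{entries: make(map[string]ttlEntry), now: now}
}

// insert stores resp with its own expiry. A ttl of zero or less means
// "do not cache", which suits queries about recent data.
func (c *ttlCache) insert(key, resp string, ttl time.Duration) {
	if ttl <= 0 {
		return
	}
	c.entries[key] = ttlEntry{response: resp, expiresAt: c.now().Add(ttl)}
}

// get returns a response only while it is fresh and evicts it otherwise.
func (c *ttlCache) get(key string) (string, bool) {
	e, ok := c.entries[key]
	if !ok {
		return "", false
	}
	if !c.now().Before(e.expiresAt) {
		delete(c.entries, key)
		return "", false
	}
	return e.response, true
}

func main() {
	clk := &manualClock{}
	c := newTTLCache(clk.now)
	c.insert("refund policy", "30 days", 24*time.Hour) // static FAQ: long TTL
	c.insert("order status", "shipped", 0)             // volatile: never cached

	_, ok := c.get("refund policy")
	fmt.Println("fresh hit:", ok)

	clk.advance(25 * time.Hour)
	_, ok = c.get("refund policy")
	fmt.Println("hit after expiry:", ok)
}
```

Checking expiry on read keeps the sketch simple; a long-running service would probably also want a periodic sweep so entries that are never read again do not accumulate.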

Production connection

Genie’s pkg/llm has a per-prompt response cache (exact match). Adding the semantic layer is on the roadmap; for a customer-support workload I expect it to land a 25-35% additional hit rate, which is an estimate rather than a measurement. The example is the cleanest implementation I’ve found in Go.
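
For illustration, layering a semantic lookup behind an exact-match cache might look like the sketch below. This is not Genie's code: layeredCache and semanticLookup are hypothetical names, and the semantic layer in main is a stub that recognises one hard-coded paraphrase instead of calling an embedding model.

```go
package main

import "fmt"

// semanticLookup is a hypothetical hook that returns a cached response
// for a prompt similar enough to an earlier one.
type semanticLookup func(prompt string) (string, bool)

// layeredCache checks an exact-match map before the semantic layer.
type layeredCache struct {
	exact    map[string]string
	semantic semanticLookup
}

// get tries the cheap exact match first and falls back to the semantic
// layer, which costs an embedding call, only on a miss. The second
// return value names the layer that answered.
func (c *layeredCache) get(prompt string) (string, string, bool) {
	if r, ok := c.exact[prompt]; ok {
		return r, "exact", true
	}
	if c.semantic != nil {
		if r, ok := c.semantic(prompt); ok {
			return r, "semantic", true
		}
	}
	return "", "", false
}

func main() {
	c := &layeredCache{
		exact: map[string]string{
			"How do I reset my password?": "Use the link on the login page.",
		},
		// Stub standing in for an embedding-based lookup.
		semantic: func(p string) (string, bool) {
			if p == "I forgot my password, what now?" {
				return "Use the link on the login page.", true
			}
			return "", false
		},
	}

	for _, p := range []string{
		"How do I reset my password?",
		"I forgot my password, what now?",
		"What is your refund policy?",
	} {
		_, layer, ok := c.get(p)
		fmt.Printf("%q hit=%v layer=%q\n", p, ok, layer)
	}
}
```

Returning the layer name makes it straightforward to count exact and semantic hits separately, which is what you would need in order to check a hit-rate estimate against real traffic.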

Credit & reference. This post is field notes on example 20 from Ardan Labs’ Ultimate AI by Bill Kennedy + Florin Pățan, licensed Apache 2.0. The original example: cmd/examples/example20-semantic-cache/. My fork with notes: PratikDhanave/ai-training. Highly recommend the course for anyone building AI applications in Go — the material is rigorous and the Kronk + yzma + llama.cpp pipeline gives you hardware-accelerated local inference end-to-end. Thank you, Bill and Florin.