Ardan Labs Go Caching Cost Optimisation

Field notes from working through example 20 of Ardan Labs’ Ultimate AI course by Bill Kennedy and Florin Pățan (Apache 2.0). My fork: PratikDhanave/ai-training. Thank you Bill and Florin for teaching this material — the patterns in this post are derived from the course; the production reflections at the end are mine.

What the example teaches

Traditional cache: key on the literal prompt string. Miss on every paraphrase.
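
To make the paraphrase problem concrete, here is a minimal sketch of an exact-match cache. The `exactCache` type and the sample prompts are my own illustrative assumptions, not code from the course.

```go
package main

import "fmt"

// exactCache is a hypothetical exact-match cache keyed on the raw prompt string.
type exactCache map[string]string

// get returns the cached response only when the prompt matches byte for byte.
func (c exactCache) get(prompt string) (string, bool) {
	resp, ok := c[prompt]
	return resp, ok
}

func main() {
	c := exactCache{"How do I reset my password?": "Use the reset link on the login page."}

	_, same := c.get("How do I reset my password?")
	_, paraphrase := c.get("How can I change my password?")
	fmt.Println(same, paraphrase) // prints "true false"
}
```

The second lookup asks for the same thing in different words, and the string key gives the cache no way to notice.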

Semantic cache: embed the prompt, nearest-neighbour against cached embeddings, return the cached answer if similarity > threshold.

What it looks like

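// Sketch only: embed, cache and llm are stand-ins that are not defined in this post.
const threshold = 0.92 // minimum similarity for a cache hit

func cached(ctx context.Context, prompt string) (string, bool) {
    queryEmb := embed.Generate(prompt)
    hit := cache.NearestNeighbour(queryEmb, threshold)
    if hit != nil {
        return hit.Response, true
    }
    return "", false
}

func generate(ctx context.Context, prompt string) string {
    if resp, ok := cached(ctx, prompt); ok {
        return resp
    }
    // Miss: call the model, then store the answer under the prompt's embedding.
    resp := llm.Chat(ctx, prompt)
    cache.Insert(embed.Generate(prompt), resp)
    return resp
}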
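
The snippet above leans on packages it does not define. As a rough, runnable stand-in, the sketch below assumes cosine similarity as the metric and a linear scan as the nearest-neighbour search. The `semanticCache` type, its methods, and the toy three-dimensional vectors are hypothetical and are not taken from the course code. A real setup would use model-generated embeddings and probably a vector index.

```go
package main

import (
	"fmt"
	"math"
)

// entry pairs a stored embedding with the response generated for it.
type entry struct {
	emb      []float64
	response string
}

// semanticCache is a hypothetical in-memory store searched by linear scan.
type semanticCache struct {
	entries []entry
}

// cosine returns the cosine similarity of a and b, or 0 if the lengths
// differ or either vector has zero norm.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// nearest returns the most similar entry whose similarity exceeds threshold,
// or nil when nothing qualifies.
func (c *semanticCache) nearest(query []float64, threshold float64) *entry {
	var best *entry
	bestSim := threshold
	for i := range c.entries {
		if sim := cosine(query, c.entries[i].emb); sim > bestSim {
			best, bestSim = &c.entries[i], sim
		}
	}
	return best
}

// insert stores a response under its embedding.
func (c *semanticCache) insert(emb []float64, response string) {
	c.entries = append(c.entries, entry{emb: emb, response: response})
}

func main() {
	c := &semanticCache{}
	c.insert([]float64{1, 0, 0}, "reset link answer")

	// Toy vectors standing in for real embeddings.
	if hit := c.nearest([]float64{0.9, 0.1, 0}, 0.92); hit != nil {
		fmt.Println("hit:", hit.response)
	}
	if hit := c.nearest([]float64{0, 1, 0}, 0.92); hit == nil {
		fmt.Println("miss")
	}
}
```

A linear scan is fine for a few thousand entries. Beyond that, an approximate nearest-neighbour index is the usual next step.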

What I learned

The threshold is the entire game. Too high and you barely cache anything. Too low and the cache returns “close enough” answers that are actually wrong. 0.90-0.95 is the working range; tune per workload.
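
One way to tune per workload is to replay labelled queries and watch both numbers move. The sketch below assumes you have, for each query, the similarity of its nearest cached entry and a human judgement of whether that cached answer was acceptable. The `sample` type, the `evaluate` function, and the data are made up for illustration.

```go
package main

import "fmt"

// sample is one labelled query: the similarity of its nearest cached entry
// and whether that cached answer would actually have been correct.
type sample struct {
	similarity float64
	correct    bool
}

// evaluate returns the share of samples served from cache at the given
// threshold, and the share of those hits that were wrong.
func evaluate(samples []sample, threshold float64) (hitRate, wrongRate float64) {
	var hits, wrong int
	for _, s := range samples {
		if s.similarity > threshold {
			hits++
			if !s.correct {
				wrong++
			}
		}
	}
	if len(samples) > 0 {
		hitRate = float64(hits) / float64(len(samples))
	}
	if hits > 0 {
		wrongRate = float64(wrong) / float64(hits)
	}
	return hitRate, wrongRate
}

func main() {
	// Made-up data for illustration only.
	samples := []sample{
		{0.97, true}, {0.94, true}, {0.93, false},
		{0.91, true}, {0.88, false}, {0.80, false},
	}
	for _, t := range []float64{0.85, 0.90, 0.92, 0.95} {
		hr, wr := evaluate(samples, t)
		fmt.Printf("threshold=%.2f hit=%.2f wrong=%.2f\n", t, hr, wr)
	}
}
```

Pick the lowest threshold whose wrong-hit rate you can live with. That is the point where the cache saves the most without serving bad answers.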

TTL matters because answers go stale. Static FAQ → long TTL. Recent-data queries → short TTL or no cache. Add a per-entry expiry on insert.
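
A per-entry expiry can be as small as a timestamp checked at lookup time. This is a minimal sketch under my own assumptions: the `ttlEntry` type, `newEntry`, and `fresh` are hypothetical names, and the TTL values are placeholders rather than recommendations.

```go
package main

import (
	"fmt"
	"time"
)

// ttlEntry is a hypothetical cache entry carrying its own expiry time.
type ttlEntry struct {
	response  string
	expiresAt time.Time
}

// newEntry stamps the entry with an expiry ttl after now.
func newEntry(response string, ttl time.Duration, now time.Time) ttlEntry {
	return ttlEntry{response: response, expiresAt: now.Add(ttl)}
}

// fresh reports whether the entry may still be served at time now.
func (e ttlEntry) fresh(now time.Time) bool {
	return now.Before(e.expiresAt)
}

func main() {
	now := time.Now()
	faq := newEntry("Static FAQ answer", 24*time.Hour, now)
	news := newEntry("Answer about recent data", time.Minute, now)

	later := now.Add(10 * time.Minute)
	fmt.Println(faq.fresh(later), news.fresh(later)) // prints "true false"
}
```

Passing `now` in explicitly keeps the expiry logic testable without sleeping. On lookup, a stale nearest neighbour should be treated as a miss.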

Production connection

Genie’s pkg/llm has a per-prompt response cache (exact match). Adding the semantic layer is on the roadmap; for a customer-support workload I expect it to add a further 25-35% to the hit rate. The example is the cleanest implementation I’ve found in Go.


Credit & reference. This post is field notes on example 20 from Ardan Labs’ Ultimate AI by Bill Kennedy + Florin Pățan, licensed Apache 2.0. The original example: cmd/examples/example20-semantic-cache/. My fork with notes: PratikDhanave/ai-training. Highly recommend the course for anyone building AI applications in Go — the material is rigorous and the Kronk + yzma + llama.cpp pipeline gives you hardware-accelerated local inference end-to-end. Thank you, Bill and Florin.
