Ardan Ultimate AI #11 — RAG performance: parallel and batched embeddings, response cache

Field notes from working through example 11 of Ardan Labs’ Ultimate AI course by Bill Kennedy and Florin Pățan (Apache 2.0). My fork: PratikDhanave/ai-training. Thank you Bill and Florin for teaching this material — the patterns in this post are derived from the course; the production reflections at the end are mine.

What the example teaches

Three optimisations stack:

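Batch embeddings. Instead of one API call per chunk, send 64-128 chunks per call. Each call carries fixed overhead (a network round trip, a slot in the request-rate limit), and the model processes a batch more efficiently than the same chunks one at a time, so batching is close to free throughput.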
Parallel chunk processing. A document with 200 chunks shouldn’t wait for chunk 1 before starting chunk 2. Goroutines + a worker pool fan out the work.
Response cache. If the same query has been asked recently, return the cached answer. Combine with the semantic cache (post #20) for fuzzy hits.
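The batching step can be sketched as below. This is a minimal illustration, not the course's code: splitBatches, embedAll and the embedFunc type are hypothetical names, and embedFunc stands in for whatever batch-embedding call your client exposes. The fake embedder in main exists only to count calls.

```go
package main

import (
	"context"
	"fmt"
)

// embedFunc is a hypothetical stand-in for a batch-embedding call:
// one request in, one vector per input text out.
type embedFunc func(ctx context.Context, texts []string) ([][]float32, error)

// splitBatches groups texts into consecutive slices of at most size items.
// The batches share the backing array of texts; nothing is copied.
func splitBatches(texts []string, size int) [][]string {
	if size <= 0 {
		size = 1
	}
	batches := make([][]string, 0, (len(texts)+size-1)/size)
	for start := 0; start < len(texts); start += size {
		end := start + size
		if end > len(texts) {
			end = len(texts)
		}
		batches = append(batches, texts[start:end])
	}
	return batches
}

// embedAll embeds every text using one call per batch and returns the
// vectors in the same order as the input.
func embedAll(ctx context.Context, texts []string, size int, embed embedFunc) ([][]float32, error) {
	out := make([][]float32, 0, len(texts))
	for _, batch := range splitBatches(texts, size) {
		vecs, err := embed(ctx, batch)
		if err != nil {
			return nil, fmt.Errorf("embed batch of %d: %w", len(batch), err)
		}
		if len(vecs) != len(batch) {
			return nil, fmt.Errorf("got %d vectors for %d texts", len(vecs), len(batch))
		}
		out = append(out, vecs...)
	}
	return out, nil
}

func main() {
	texts := make([]string, 200)
	for i := range texts {
		texts[i] = fmt.Sprintf("chunk %d", i)
	}

	calls := 0
	// Fake embedder (assumption for the demo): one 1-dimensional vector per text.
	fake := func(_ context.Context, batch []string) ([][]float32, error) {
		calls++
		vecs := make([][]float32, len(batch))
		for i := range vecs {
			vecs[i] = []float32{float32(len(batch[i]))}
		}
		return vecs, nil
	}

	vecs, err := embedAll(context.Background(), texts, 64, fake)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(vecs), "vectors in", calls, "calls") // prints "200 vectors in 4 calls"
}
```

With a batch size of 64, the 200-chunk document from the list above costs 4 calls instead of 200. The length check guards against a backend that silently drops inputs, which would otherwise misalign vectors and chunks.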

What it looks like

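// Batched embeddings via Kronk
embs, err := kronk.EmbedBatch(ctx, chunkTexts) // one call, N results
if err != nil {
    return fmt.Errorf("embed batch: %w", err)
}

// Parallel via errgroup (golang.org/x/sync/errgroup)
g, ctx := errgroup.WithContext(ctx)
g.SetLimit(8) // at most 8 documents in flight; g.Go blocks while the limit is reached
for _, doc := range docs {
    doc := doc // per-iteration copy; no longer needed from Go 1.22
    g.Go(func() error {
        // ctx is cancelled as soon as any document fails
        return processDocument(ctx, doc)
    })
}
// Wait returns the first non-nil error from any goroutine
if err := g.Wait(); err != nil {
    return err
}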

What I learned

The bottleneck moves. Once embeddings are batched and parallel, the bottleneck shifts to the embedding API’s rate limit, then to your pgvector insert speed, then to your network bandwidth. Profile after every optimisation; the next bottleneck is rarely where you expect.
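One lightweight way to see where the time goes before reaching for Go's pprof is to total the wall-clock time per stage. The sketch below is an illustration under assumptions: StageTimer and its methods are hypothetical names, and the sleeps stand in for the real embed and insert calls.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// StageTimer is a hypothetical helper that sums wall-clock time per stage.
type StageTimer struct {
	mu     sync.Mutex
	totals map[string]time.Duration
}

func NewStageTimer() *StageTimer {
	return &StageTimer{totals: make(map[string]time.Duration)}
}

// Add records d against stage. Safe to call from many goroutines.
func (s *StageTimer) Add(stage string, d time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.totals[stage] += d
}

// Time runs fn, records how long it took, and returns fn's error unchanged.
func (s *StageTimer) Time(stage string, fn func() error) error {
	start := time.Now()
	err := fn()
	s.Add(stage, time.Since(start))
	return err
}

// Slowest returns the stage with the largest total. Ties go to the
// alphabetically first name so the result is deterministic.
func (s *StageTimer) Slowest() (string, time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	var name string
	var best time.Duration
	first := true
	for stage, d := range s.totals {
		if first || d > best || (d == best && stage < name) {
			name, best, first = stage, d, false
		}
	}
	return name, best
}

func main() {
	timer := NewStageTimer()

	// Sleeps stand in for the real pipeline stages (assumption for the demo).
	_ = timer.Time("embed", func() error { time.Sleep(2 * time.Millisecond); return nil })
	_ = timer.Time("insert", func() error { time.Sleep(5 * time.Millisecond); return nil })

	stage, total := timer.Slowest()
	fmt.Printf("next bottleneck candidate: %s (%s)\n", stage, total)
}
```

When stages run in parallel the totals are summed across goroutines, so they show relative cost rather than elapsed time. That is enough to tell which stage to look at next.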

The cache TTL is workload-specific. A docs-RAG with stable content can cache for hours; a news-RAG should cache for minutes at most. Set the TTL too long and you serve stale answers; set it too short and the cache rarely hits.
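A TTL-bound response cache can be sketched in a few lines. This is a minimal exact-match version under assumptions, not the course's implementation: ResponseCache, fakeClock and their methods are hypothetical names, and the clock is injected only so the expiry can be shown without sleeping.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry is one cached answer plus the moment it stops being valid.
type entry struct {
	answer  string
	expires time.Time
}

// ResponseCache is a hypothetical exact-match cache keyed by the query text.
type ResponseCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	now   func() time.Time // injectable clock; use time.Now in real code
	items map[string]entry
}

func NewResponseCache(ttl time.Duration, now func() time.Time) *ResponseCache {
	return &ResponseCache{ttl: ttl, now: now, items: make(map[string]entry)}
}

// Get returns the cached answer if one exists and has not expired.
func (c *ResponseCache) Get(query string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[query]
	if !ok {
		return "", false
	}
	if !c.now().Before(e.expires) {
		delete(c.items, query) // expired: evict lazily on read
		return "", false
	}
	return e.answer, true
}

// Put stores answer for query and stamps it with the current time plus the TTL.
func (c *ResponseCache) Put(query, answer string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[query] = entry{answer: answer, expires: c.now().Add(c.ttl)}
}

// fakeClock is a manually advanced clock used only for the demonstration.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

func main() {
	clock := &fakeClock{t: time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)}
	cache := NewResponseCache(5*time.Minute, clock.Now)

	cache.Put("what is pgvector?", "a Postgres extension for vector search")

	_, hit := cache.Get("what is pgvector?")
	fmt.Println("fresh:", hit) // prints "fresh: true"

	clock.Advance(6 * time.Minute)
	_, hit = cache.Get("what is pgvector?")
	fmt.Println("after TTL:", hit) // prints "after TTL: false"
}
```

Expired entries are only evicted when they are read, so a long-running service with many distinct queries would also need a size bound or a periodic sweep. In real code the TTL passed to NewResponseCache is the single knob to tune per workload.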

Production connection

For one Searce client we replaced a 45-minute nightly ingestion with a 4-minute one purely via batching + parallelism. Same code; same data; different concurrency shape. The Ardan example is the cleanest template I’ve seen for the pattern.

Credit & reference. This post is field notes on example 11 from Ardan Labs’ Ultimate AI by Bill Kennedy + Florin Pățan, licensed Apache 2.0. The original example: cmd/examples/example11-rag-perf/. My fork with notes: PratikDhanave/ai-training. Highly recommend the course for anyone building AI applications in Go — the material is rigorous and the Kronk + yzma + llama.cpp pipeline gives you hardware-accelerated local inference end-to-end. Thank you, Bill and Florin.