Field notes from working through example 28 of Ardan Labs’ Ultimate AI course by Bill Kennedy and Florin Pățan (Apache 2.0). My fork: PratikDhanave/ai-training. Thank you Bill and Florin for teaching this material — the patterns in this post are derived from the course; the production reflections at the end are mine.
What the example teaches
Image search without training a custom model. For each image: ask a vision LLM to describe it (5-10 sentences), embed the description, and store the description and its embedding in pgvector. To query: embed the search text and run a nearest-neighbour search over the stored description embeddings.
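In pgvector terms, the store and query steps could look roughly like the sketch below. The table name `image_descriptions`, its columns, and the 768-dimension vector are assumptions for illustration (the dimension has to match whatever embedding model is used); `<=>` is pgvector's cosine-distance operator. The `vectorLiteral` helper is hypothetical: it renders a Go slice in pgvector's text form so it can be passed as a query parameter and cast with `::vector`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Hypothetical schema, meant to be run once (for example from a migration).
// The table name, column names, and the 768 dimension are assumptions.
const createTable = `
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS image_descriptions (
    id          TEXT PRIMARY KEY,
    url         TEXT NOT NULL,
    description TEXT NOT NULL,
    embedding   vector(768) NOT NULL
);`

// insertRow stores one described image; $4 is the embedding in text form.
const insertRow = `
INSERT INTO image_descriptions (id, url, description, embedding)
VALUES ($1, $2, $3, $4::vector);`

// nearest returns the $2 rows closest to the query embedding in $1.
// <=> is pgvector's cosine distance operator (smaller means closer).
const nearest = `
SELECT id, url, description
FROM image_descriptions
ORDER BY embedding <=> $1::vector
LIMIT $2;`

// vectorLiteral renders an embedding in pgvector's text format, e.g. "[0.25,-1,3]".
func vectorLiteral(v []float32) string {
	parts := make([]string, len(v))
	for i, x := range v {
		parts[i] = strconv.FormatFloat(float64(x), 'f', -1, 32)
	}
	return "[" + strings.Join(parts, ",") + "]"
}

func main() {
	// Show the parameter value and the query it would be bound to.
	fmt.Println(vectorLiteral([]float32{0.25, -1, 3}))
	fmt.Println(nearest)
}
```

With a real driver, `vectorLiteral(queryEmbedding)` would be bound as `$1` and `k` as `$2`; the statements themselves are plain SQL and do not depend on a particular Go client.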
What it looks like
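// Index: describe each image, embed the description, store both.
// Sketch only: vision, embed, and db are placeholder clients; error handling is omitted.
for _, img := range images {
	description := vision.Describe(ctx, img,
		"Describe this image in 5-10 sentences. Include objects, "+
			"colors, mood, and any text visible.")
	emb := embed.Generate(description)
	db.Insert(img.ID, img.URL, description, emb)
}

// Query: embed the search text and fetch the 10 nearest descriptions.
hits := db.Nearest(embed.Generate("a red car parked at sunset"), 10)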
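To make the query step concrete, here is a small in-memory stand-in for `db.Nearest` that ranks stored descriptions by cosine similarity to a query vector. The `record` type, the `nearest` function, and the three-dimensional toy vectors are made up for illustration; real embeddings come from the embedding model and have far more dimensions, and pgvector would do this ranking inside the database.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// record is a hypothetical row: an image ID, its description, and the embedding.
type record struct {
	ID          string
	Description string
	Embedding   []float64
}

// cosine returns the cosine similarity of a and b.
// It returns 0 for empty, mismatched-length, or zero-norm vectors.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// nearest returns up to k records ordered by descending similarity to query.
// It sorts a copy, so the caller's slice is left untouched.
func nearest(records []record, query []float64, k int) []record {
	if k < 0 {
		k = 0
	}
	out := make([]record, len(records))
	copy(out, records)
	sort.SliceStable(out, func(i, j int) bool {
		return cosine(out[i].Embedding, query) > cosine(out[j].Embedding, query)
	})
	if k < len(out) {
		out = out[:k]
	}
	return out
}

func main() {
	records := []record{
		{"img-1", "A red car parked at sunset.", []float64{0.9, 0.1, 0.0}},
		{"img-2", "A bowl of green apples on a table.", []float64{0.0, 0.2, 0.9}},
		{"img-3", "A red bicycle leaning on a wall.", []float64{0.6, 0.6, 0.1}},
	}
	// Toy vector standing in for the embedded search text.
	query := []float64{1, 0, 0}
	for _, r := range nearest(records, query, 2) {
		fmt.Println(r.ID, r.Description)
	}
}
```

A brute-force scan like this is fine for a few thousand images; beyond that, an indexed store is the more sensible place for the ranking.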
What I learned
Description quality is the whole game. A vision LLM that describes “a man in a hat” loses to one that describes “a Black man in his 30s wearing a charcoal fedora, mid-laugh, against a brick wall under late-afternoon light.” The prompt for the description step is more important than the embedding model.
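One way to act on this is to make the description prompt explicit about the attributes you want back. The sketch below builds such a prompt from a list of aspects. `buildDescribePrompt`, the aspect list, and the closing "only describe what is visible" instruction are hypothetical and not taken from the course example; whether the extra detail helps for a given archive is something to check against real queries.

```go
package main

import (
	"fmt"
	"strings"
)

// buildDescribePrompt is a hypothetical helper that asks the vision model for
// a sentence range and names each aspect the description should cover.
func buildDescribePrompt(minSentences, maxSentences int, aspects []string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "Describe this image in %d-%d sentences.", minSentences, maxSentences)
	if len(aspects) > 0 {
		b.WriteString(" Be specific about: ")
		b.WriteString(strings.Join(aspects, ", "))
		b.WriteString(".")
	}
	// Assumption: discouraging guesses keeps invented details out of the index.
	b.WriteString(" Only describe what is visible; do not guess at details you cannot see.")
	return b.String()
}

func main() {
	prompt := buildDescribePrompt(5, 10, []string{
		"objects and people", "clothing and colors", "setting and lighting",
		"mood", "any visible text",
	})
	fmt.Println(prompt)
}
```

Keeping the prompt in one function also makes it easy to version: if the prompt changes, the stored descriptions and embeddings were produced by a different prompt and may need regenerating.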
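This beats CLIP for many use cases, mainly on simplicity. CLIP relies on a dual encoder that maps images and text into one shared embedding space; this approach gets you roughly 80% of the value with an off-the-shelf vision model plus a text embedding model. The remaining 20% gap matters at scale; it doesn’t matter for a prototype.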
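Whether that gap matters for a given collection can be measured rather than guessed. A minimal sketch, assuming you have a small hand-labelled set of queries with the image IDs you expect back: compute recall@k for each search approach and compare the numbers. The `labelledQuery` type and the `searchFunc` signature are hypothetical, and the `fake` search in `main` is a stand-in for a real pipeline.

```go
package main

import "fmt"

// labelledQuery pairs a search string with the image IDs a human judged relevant.
type labelledQuery struct {
	Text     string
	Relevant []string
}

// searchFunc is a hypothetical signature: return the top-k image IDs for a query.
type searchFunc func(query string, k int) []string

// recallAtK averages, over all queries, the fraction of relevant IDs that
// appear in the top k results. Queries with no relevant IDs are skipped.
func recallAtK(queries []labelledQuery, search searchFunc, k int) float64 {
	var sum float64
	var n int
	for _, q := range queries {
		if len(q.Relevant) == 0 {
			continue
		}
		got := make(map[string]bool)
		for _, id := range search(q.Text, k) {
			got[id] = true
		}
		found := 0
		for _, id := range q.Relevant {
			if got[id] {
				found++
			}
		}
		sum += float64(found) / float64(len(q.Relevant))
		n++
	}
	if n == 0 {
		return 0
	}
	return sum / float64(n)
}

func main() {
	queries := []labelledQuery{
		{Text: "a red car parked at sunset", Relevant: []string{"img-1"}},
		{Text: "fruit on a table", Relevant: []string{"img-2", "img-4"}},
	}
	// Stand-in search that always returns the same IDs; swap in the real pipeline.
	fake := func(query string, k int) []string { return []string{"img-1", "img-2"} }
	fmt.Printf("recall@2 = %.2f\n", recallAtK(queries, fake, 2))
}
```

Running the same labelled queries through the description-based search and through a CLIP-based one gives two comparable numbers, which is a firmer basis for the decision than a rule of thumb.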
Production connection
For a media archive search use case, I’d reach for this first. The Docling work (post #30) is the document analogue; this is the image one. Combine them and you have a “search across everything” pipeline that doesn’t need bespoke training.
Credit & reference. This post is field notes on example 28 from Ardan Labs’ Ultimate AI by Bill Kennedy + Florin Pățan, licensed Apache 2.0. The original example: cmd/examples/example28-image-vision-rag/. My fork with notes: PratikDhanave/ai-training. Highly recommend the course for anyone building AI applications in Go — the material is rigorous and the Kronk + yzma + llama.cpp pipeline gives you hardware-accelerated local inference end-to-end. Thank you, Bill and Florin.