Self-RAG and CRAG — when to retrieve, when to skip, when to correct

Three RAG philosophies

Naive RAG. Every query → retrieve → answer. Simple; uniform cost; sometimes retrieves when it shouldn’t (chitchat) and sometimes doesn’t retrieve enough (the first hits weren’t great).

Self-RAG. The LLM is trained (or prompted) to emit special tokens that signal “I should retrieve now” or “I have what I need.” Retrieval becomes a runtime decision the LLM makes.
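The prompted variant of that decision can be sketched as a small gate in front of the retriever. This is a minimal sketch, not a definitive implementation: the token strings, the `Generator` interface, and `stubLLM` are hypothetical names introduced here, and a trained Self-RAG model would define its own token vocabulary.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical control tokens; a real Self-RAG model defines its own.
const (
	tokenRetrieve   = "[Retrieve]"
	tokenNoRetrieve = "[No Retrieval]"
)

// Generator is an assumed interface over whatever LLM client is in use.
type Generator interface {
	Generate(prompt string) string
}

// needsRetrieval asks the model for a control token and parses the reply.
// Anything other than an explicit "no retrieval" token counts as "retrieve",
// so a malformed reply degrades to naive RAG rather than an ungrounded answer.
func needsRetrieval(g Generator, query string) bool {
	prompt := "Reply with " + tokenRetrieve + " if answering needs external documents, " +
		"otherwise " + tokenNoRetrieve + ".\nQuery: " + query
	reply := strings.TrimSpace(g.Generate(prompt))
	return !strings.HasPrefix(reply, tokenNoRetrieve)
}

// stubLLM stands in for a real model so the sketch runs on its own.
type stubLLM struct{}

func (stubLLM) Generate(prompt string) string {
	// Pull the query back out of the prompt's last line.
	q := prompt[strings.LastIndex(prompt, "Query: ")+len("Query: "):]
	if strings.EqualFold(strings.TrimSpace(q), "hi") {
		return tokenNoRetrieve
	}
	return tokenRetrieve
}

func main() {
	for _, q := range []string{"hi", "what's the refund policy"} {
		fmt.Printf("%q -> retrieve=%v\n", q, needsRetrieval(stubLLM{}, q))
	}
}
```

Defaulting to retrieval on an unparseable reply is a design choice: it costs a retrieval call but avoids answering a substantive question without context.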

CRAG (Corrective RAG). Retrieve normally, then have a small classifier grade the retrieved content as “correct”, “ambiguous”, or “incorrect”. If incorrect, fall back to a web search (or an alternate corpus) to correct. If ambiguous, mix the original hits with the fallback results.
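One way to do the "mix" step for the ambiguous case is to interleave the two result lists and drop duplicates. The sketch below assumes a hypothetical `Hit` type with a stable `ID`; interleaving is just one reasonable merge policy, not the one the source prescribes.

```go
package main

import "fmt"

// Hit is an assumed shape for a retrieved passage.
type Hit struct {
	ID   string
	Text string
}

// mergeHits interleaves primary and fallback results and drops duplicate IDs,
// so neither source crowds the other out of the context window.
func mergeHits(primary, fallback []Hit) []Hit {
	seen := make(map[string]bool)
	merged := make([]Hit, 0, len(primary)+len(fallback))
	add := func(h Hit) {
		if !seen[h.ID] {
			seen[h.ID] = true
			merged = append(merged, h)
		}
	}
	for i := 0; i < len(primary) || i < len(fallback); i++ {
		if i < len(primary) {
			add(primary[i])
		}
		if i < len(fallback) {
			add(fallback[i])
		}
	}
	return merged
}

func main() {
	kb := []Hit{{ID: "kb-1", Text: "Refunds within 30 days."}, {ID: "kb-2", Text: "Shipping policy."}}
	web := []Hit{{ID: "web-1", Text: "Refund FAQ."}, {ID: "kb-1", Text: "Refunds within 30 days."}}
	for _, h := range mergeHits(kb, web) {
		fmt.Println(h.ID) // kb-1, web-1, kb-2
	}
}
```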

When each pays back

Self-RAG is the right pick when:
- Query distribution mixes chitchat with substantive questions (“hi” doesn’t need retrieval; “what’s the refund policy” does).
- LLM cost dwarfs retrieval cost; saving retrieval is cheap, saving LLM calls is expensive.
- You have a model that supports the special tokens or can be prompted into the pattern reliably.

CRAG is the right pick when:
- Retrieval quality is variable. Some queries get great hits; others get nonsense.
- You have a fallback corpus (web search, second-tier KB) that’s worth the cost on the bad-retrieval cases.
- You can run a small classifier cheaply (a fine-tuned small model, or a prompted small model).

What CRAG looks like

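// Sketch of the body of an answer function; error handling omitted for brevity.
hits := retriever.Search(ctx, query, 5) // k = 5
score := classifier.Score(ctx, query, hits)

switch score.Class {
case "correct":
    // Retrieval is good: answer from the retrieved hits.
    return llm.Answer(ctx, query, hits)
case "incorrect":
    // Retrieval missed: discard the hits and answer from the fallback corpus.
    fallback := webSearch.Search(ctx, query)
    return llm.Answer(ctx, query, fallback)
default:
    // "ambiguous" (or any unexpected class): answer from both sources.
    merged := mergeHits(hits, webSearch.Search(ctx, query))
    return llm.Answer(ctx, query, merged)
}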

The classifier is the load-bearing part. It has to be cheap to run, precise enough that its verdicts can be trusted, and tunable. We used a fine-tuned BERT on one engagement and a prompted Gemini Flash on another; both worked.

Production numbers

For a customer-support copilot:

Naive RAG: 78% correct answers
Self-RAG: 81% correct, 22% reduction in LLM calls (chitchat skipped retrieval)
CRAG: 87% correct, 31% increase in cost (web search on misses)

The right choice depends on what you optimise for. Cost-sensitive deployment → Self-RAG. Quality-sensitive → CRAG. Cheapest baseline that works → Naive.
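One way to put quality and cost on the same axis is cost per correct answer. The sketch below uses only the figures above and assumes the 31% cost increase is measured against naive RAG (normalised to 1.0). Self-RAG is left out because its figure is a reduction in LLM calls, which does not convert to a cost number without further assumptions.

```go
package main

import "fmt"

// costPerCorrect returns the relative cost of one correct answer, given the
// relative cost per query (naive RAG = 1.0) and the fraction answered correctly.
func costPerCorrect(relativeCost, accuracy float64) float64 {
	return relativeCost / accuracy
}

func main() {
	naive := costPerCorrect(1.00, 0.78)
	crag := costPerCorrect(1.31, 0.87) // assumes the 31% increase is relative to naive RAG
	fmt.Printf("naive: %.2f\n", naive) // naive: 1.28
	fmt.Printf("CRAG:  %.2f\n", crag)  // CRAG:  1.51
	fmt.Printf("CRAG costs %.0f%% more per correct answer\n", (crag/naive-1)*100) // 17%
}
```

Under that assumption CRAG is roughly 17% more expensive per correct answer, so it pays back only where a wrong answer costs more than that premium.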

Combine them

The “Adaptive RAG” pattern combines both: Self-RAG for the retrieve-or-not decision, then CRAG for grading the retrieved content. That makes three classifier runs in total (whether to retrieve, whether the retrieved content is good, whether the answer is supported by it), all of them small.
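The combined flow can be sketched as one pipeline with three pluggable checks. Every name here (`Pipeline`, its fields, `newStubPipeline`) is an assumption made for illustration; this is not the API of Genie's pkg/rag, and the stubs stand in for real models.

```go
package main

import "fmt"

// Pipeline wires the three small checks around retrieval and generation.
type Pipeline struct {
	NeedsRetrieval func(query string) bool                  // check 1: retrieve or not
	Retrieve       func(query string) []string              // primary corpus
	Grade          func(query string, hits []string) string // check 2: correct / ambiguous / incorrect
	Fallback       func(query string) []string              // web search or second-tier KB
	Generate       func(query string, docs []string) string
	Supported      func(answer string, docs []string) bool // check 3: answer grounded in docs
}

// Answer runs the checks in order and reports whether the answer passed the
// support check. Queries that skip retrieval have nothing to ground against,
// so they are reported as supported.
func (p Pipeline) Answer(query string) (answer string, supported bool) {
	if !p.NeedsRetrieval(query) {
		return p.Generate(query, nil), true
	}
	hits := p.Retrieve(query)
	var docs []string
	switch p.Grade(query, hits) {
	case "correct":
		docs = hits
	case "incorrect":
		docs = p.Fallback(query)
	default: // "ambiguous": use both sources
		docs = append(append(docs, hits...), p.Fallback(query)...)
	}
	answer = p.Generate(query, docs)
	return answer, p.Supported(answer, docs)
}

// newStubPipeline returns a pipeline with canned behaviour so the sketch runs on its own.
func newStubPipeline() Pipeline {
	return Pipeline{
		NeedsRetrieval: func(q string) bool { return q != "hi" },
		Retrieve:       func(q string) []string { return []string{"kb: refunds within 30 days"} },
		Grade:          func(q string, hits []string) string { return "ambiguous" },
		Fallback:       func(q string) []string { return []string{"web: refund FAQ"} },
		Generate:       func(q string, docs []string) string { return fmt.Sprintf("%d docs", len(docs)) },
		Supported:      func(answer string, docs []string) bool { return len(docs) > 0 },
	}
}

func main() {
	p := newStubPipeline()
	fmt.Println(p.Answer("hi"))                       // 0 docs true
	fmt.Println(p.Answer("what's the refund policy")) // 2 docs true
}
```

What to do when the support check fails (retry with the fallback corpus, refuse, or escalate to a human) is left to the caller, since that depends on how costly a wrong answer is.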

Genie’s pkg/rag ships naive + hybrid + reranking. CRAG-style corrective retrieval is on the roadmap for the high-stakes use cases (Bancnet, KYC) where wrong retrieval has downstream cost.