Ardan Ultimate AI #32 — Embedded React chat over RAG (Go backend + bundled UI)
A complete chat application: Go backend with RAG, React frontend, single binary. Showed me how to ship a full-stack AI demo without a separate frontend deployment.
Posts about rag. ← All posts
A complete chat application: Go backend with RAG, React frontend, single binary. Showed me how to ship a full-stack AI demo without a separate frontend deployment.
PDFs are the format that breaks every RAG pipeline. Docling is the IBM-research extractor that handles layout, tables, and figures. The example wires Docling + LLM to make PDFs usable.
Transcribe a video, chunk by timestamp, embed each chunk, RAG-style chat over the result. The shape that powers "ask questions about this meeting recording."
Generate a text description of an image with a vision LLM, embed the description, store in pgvector. Search becomes "find images that match this query" — works surprisingly well.
A RAG pipeline that ingests user-supplied documents is a prompt-injection vector. An attacker uploads a document with hidden instructions; the LLM retrieves it and follows them. Defense: input filtering, content classification, output verification.
Not every question needs retrieval. A classifier gates RAG: chat or general knowledge questions skip it; factual or document-grounded questions trigger it. Saves latency and tokens on the simple half of queries.
A simple RAG pipeline embeds documents one at a time. The performant version batches the embeddings, parallelises the chunks, and caches the responses. Throughput goes up 5-10×.
Tie all the RAG pieces together into one interactive REPL. Type a question, see the retrieval, see the answer, ask follow-ups. The shape of every "chat with your docs" demo.
When RAG gives wrong answers, the problem is usually retrieval, not the LLM. The example isolates the retrieval step so you can see exactly what chunks come back for a given query, with what scores, and tune K and the similarity threshold accordingly.
Ingest → embed → store → retrieve → answer. The full pipeline applied to Bill Kennedy's Go notebook. The result: a system that answers "how do channels work?" with quotes from the source material.
The ingestion step that turns a corpus into a vector database. Chunk the source, embed each chunk, store with metadata. The pre-work without which RAG is impossible.
pgvector adds vector similarity to Postgres. The example shows the schema, the indexes, the query, and what an ANN index buys you over a brute-force scan.
Side-by-side comparison: ask the LLM a domain question with no context, then ask with retrieved context. The without-RAG answer is plausible nonsense. The with-RAG answer is correct. The example that motivates everything else in the course.
Vector search treats every chunk as independent. GraphRAG models the relationships between entities, communities, and concepts. For corpus-spanning questions ("what's the relationship between X and Y"), graph wins.
Embedding a question and embedding an answer often produce different vectors. HyDE generates a hypothetical answer to the question, embeds *that*, and retrieves on it. Retrieval quality goes up disproportionately.
Naive RAG retrieves on every query. Self-RAG decides whether to retrieve. CRAG decides whether the retrieved content is good enough or needs corrective retrieval. Two papers; both worth implementing.
An Indian banking deployment needs to handle Hindi, Marathi, Tamil, Bengali, and English in the same retrieval pipeline. Bhashini (the government's language stack) plus cross-lingual embeddings make it tractable.