Ardan Ultimate AI #04 — Streaming chat completions via SSE

Field notes from working through example 04 of Ardan Labs’ Ultimate AI course by Bill Kennedy and Florin Pățan (Apache 2.0). My fork: PratikDhanave/ai-training. Thank you Bill and Florin for teaching this material — the patterns in this post are derived from the course; the production reflections at the end are mine.

What the example teaches

A non-streaming chat handler returns the entire response only after the LLM has finished generating. The user waits in silence for 10-30 seconds.

A streaming handler flushes each token as it arrives, so the user sees the response forming in real time. Total generation time is unchanged; what improves is the time until the first token is visible, and with it the perceived experience.

What it looks like

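func chatStream(w http.ResponseWriter, r *http.Request) {
    // Without http.Flusher there is no way to push tokens early, so fail fast.
    flusher, ok := w.(http.Flusher)
    if !ok {
        http.Error(w, "streaming unsupported", http.StatusInternalServerError)
        return
    }
    // prompt is assumed to have been parsed from the request (omitted here).
    stream, err := llm.GenerateStream(r.Context(), prompt)
    if err != nil {
        http.Error(w, "generation failed", http.StatusBadGateway)
        return
    }
    w.Header().Set("Content-Type", "text/event-stream")
    w.Header().Set("Cache-Control", "no-cache")

    for chunk := range stream {
        fmt.Fprintf(w, "data: %s\n\n", chunk.Token)
        flusher.Flush()
    }
}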

What I learned

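http.Flusher is the load-bearing part. Without flusher.Flush(), net/http holds writes in its response buffer and sends them only when the buffer fills or the handler returns, so a short reply arrives all at once at the end anyway. It is easy to forget, and the chat feels broken when it happens.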
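
A quick way to convince yourself of this is to read an event on the client while the handler is still running. The sketch below is my own and uses only the standard library. The function name firstLineWhileHandlerBlocked and the release channel are made up for the illustration. The handler sends one event, flushes, and then deliberately blocks; the client can only read the line because the flush pushed it out.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// firstLineWhileHandlerBlocked starts a local test server whose handler sends
// one event, flushes, and then blocks until release is closed. It returns the
// first line the client reads while the handler has not yet returned.
func firstLineWhileHandlerBlocked() (string, error) {
	release := make(chan struct{})
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "text/event-stream")
		fmt.Fprint(w, "data: first\n\n")
		flusher.Flush() // push the buffered event to the client now

		// Stay inside the handler until the client has read the event.
		select {
		case <-release:
		case <-r.Context().Done():
		}
	}))
	defer srv.Close()
	defer close(release) // runs before srv.Close, letting the handler return

	resp, err := http.Get(srv.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	line, err := bufio.NewReader(resp.Body).ReadString('\n')
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(line), nil
}

func main() {
	line, err := firstLineWhileHandlerBlocked()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(line)
}
```

If the flusher.Flush() call is removed, the client never receives anything while the handler is blocked, which is the "chat feels broken" failure in miniature.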

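SSE beats WebSockets for this use case. SSE is an ordinary long-lived HTTP response, so it passes through most CDNs and proxies without protocol-specific support, although an intermediary that buffers responses can still delay events until buffering is disabled for that route. WebSockets need infrastructure that understands the HTTP upgrade handshake. Pick SSE unless you need bidirectional messaging.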

Production connection

Genie’s /v1/ask/stream is this pattern, plus named SSE events (ai_disclosure, agent.handle, report) so the UI can render different stream sections differently. The base streaming handler came straight from this example.

Credit & reference. This post is field notes on example 04 from Ardan Labs’ Ultimate AI by Bill Kennedy + Florin Pățan, licensed Apache 2.0. The original example: cmd/examples/example04-chat-streaming/. My fork with notes: PratikDhanave/ai-training. Highly recommend the course for anyone building AI applications in Go — the material is rigorous and the Kronk + yzma + llama.cpp pipeline gives you hardware-accelerated local inference end-to-end. Thank you, Bill and Florin.