LLM Ops — Blog — Pratik Dhanave

Mar 29, 2026 · Engineering

Ardan Ultimate AI #22 — Cascading model router (cheap first, expensive on miss)

Most queries are simple. A cascading router tries a small, fast, cheap model first; if confidence is low or the task is hard, it escalates to a larger one. The cheap model absorbs most of the traffic, so cost drops sharply, while escalation protects quality on the hard queries.
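A minimal sketch of the cascade described above, not the post's actual code. The `Answer`, `Model`, `Tier`, and `Route` names are hypothetical, and it assumes each model can report a confidence score in [0, 1], which real APIs may only expose indirectly (log-probabilities, a judge model, or a heuristic).

```go
package main

import "errors"

// Answer is a model reply plus a confidence score in [0, 1].
// How confidence is obtained is an assumption left to the caller.
type Answer struct {
	Text       string
	Confidence float64
}

// Model is a hypothetical call signature for any backend.
type Model func(prompt string) (Answer, error)

// Tier pairs a model with the minimum confidence needed to accept its answer.
type Tier struct {
	Name          string
	Call          Model
	MinConfidence float64
}

// ErrNoTiers is returned when the cascade has nothing to call.
var ErrNoTiers = errors.New("cascade: no tiers configured")

// Route tries tiers from cheapest to most expensive and returns the first
// answer that meets its tier's threshold. The last tier's answer is always
// accepted, since there is nothing left to escalate to.
func Route(tiers []Tier, prompt string) (Answer, string, error) {
	if len(tiers) == 0 {
		return Answer{}, "", ErrNoTiers
	}
	var lastErr error
	for i, t := range tiers {
		ans, err := t.Call(prompt)
		if err != nil {
			lastErr = err // treat a failure as a miss and escalate
			continue
		}
		if i == len(tiers)-1 || ans.Confidence >= t.MinConfidence {
			return ans, t.Name, nil
		}
	}
	// Only reachable when the final tier returned an error.
	return Answer{}, "", lastErr
}
```

Ordering the tiers by cost keeps the policy in data rather than code, so adding a middle tier or tuning a threshold does not change `Route`.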

Ardan Labs · Go · LLM Ops · Cost Optimisation

Mar 26, 2026 · Engineering

Ardan Ultimate AI #19 — Speculative decoding with a draft model

Run a small draft model to propose several tokens ahead, then verify them in a single pass with the large model. Latency drops without a drop in quality. Production LLM serving stacks rely on this technique, but most application engineers never see it.
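A rough sketch of one draft-and-verify step, under simplifying assumptions: both models decode greedily, and `NextToken` and `SpeculativeStep` are hypothetical names. Production systems typically use a probabilistic accept/reject rule over token distributions and score all drafted positions in one batched forward pass; here the verification is simulated one position at a time.

```go
package main

// NextToken is a hypothetical greedy model: given the context, it returns
// the id of the single most likely next token.
type NextToken func(ctx []int) int

// SpeculativeStep drafts k tokens with the small model, then keeps the
// longest prefix the large model agrees with. It always returns at least
// one token, and the result matches what greedy decoding with the target
// alone would have produced.
func SpeculativeStep(ctx []int, draft, target NextToken, k int) []int {
	// 1. Draft: the small model proposes k tokens autoregressively.
	proposal := make([]int, 0, k)
	work := append([]int(nil), ctx...)
	for i := 0; i < k; i++ {
		t := draft(work)
		proposal = append(proposal, t)
		work = append(work, t)
	}

	// 2. Verify: compare each drafted token with the target's own choice.
	// A real server would score all k positions in a single forward pass.
	accepted := make([]int, 0, k+1)
	work = append([]int(nil), ctx...)
	for _, t := range proposal {
		want := target(work)
		if want != t {
			// First disagreement: keep the target's token and stop.
			return append(accepted, want)
		}
		accepted = append(accepted, t)
		work = append(work, t)
	}

	// 3. Every draft token was accepted, so the target adds one more.
	return append(accepted, target(work))
}
```

The speedup comes from step 2: when the draft model is usually right, one pass of the large model yields several tokens instead of one.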

Ardan Labs · Go · LLM Ops · Performance

Mar 25, 2026 · Engineering

Ardan Ultimate AI #18 — Incremental message caching (IMC) for chat

Without caching, a long chat reprocesses the entire history on every turn. Prefix caching lets the server reuse the KV cache for the prefix it already processed on the previous turn and compute only the new suffix. The latency win grows with the length of the conversation.
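The bookkeeping behind this idea can be sketched as a longest-common-prefix check over token ids. This is an illustration only: `PrefixCache`, `Plan`, and `Commit` are hypothetical names, and the actual KV tensors, which live inside the inference engine, are not modelled.

```go
package main

// PrefixCache remembers the token sequence whose KV state is assumed to be
// held by the inference engine from the previous turn.
type PrefixCache struct {
	tokens []int
}

// Plan reports how many leading tokens of prompt match the cached sequence
// and returns the suffix that still has to be computed.
func (c *PrefixCache) Plan(prompt []int) (reused int, suffix []int) {
	n := 0
	for n < len(c.tokens) && n < len(prompt) && c.tokens[n] == prompt[n] {
		n++
	}
	return n, prompt[n:]
}

// Commit records prompt as the sequence now covered by the cache.
func (c *PrefixCache) Commit(prompt []int) {
	c.tokens = append([]int(nil), prompt...)
}
```

In a chat, each turn's prompt is the previous prompt plus the new messages, so `Plan` usually reuses everything except the latest turn. An edit to an earlier message shortens the match and forces recomputation from that point.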

Ardan Labs · Go · LLM Ops · Performance

Feb 17, 2026 · Engineering

Cost-aware agent dispatch — when the cheap agent is enough

Not every query needs the production agent. A cost-aware dispatcher decides whether to route to the cheap-and-fast agent or the expensive-and-thorough one. Same UX, dramatically lower bill.

Agents · Cost Optimisation · LLM Ops
