The Fallback Is the Contract: Reliability Patterns for Clinical AI

Why every LLM-backed agent in my medical platform has a deterministic rule-based fallback — and why "the case always finalises" is the only contract that matters for clinical work.

The thing nobody warns you about

A research-grade LLM agent has a really nice failure profile in dev: it occasionally returns malformed JSON, occasionally hallucinates a diagnosis, occasionally times out on a slow API call. You catch those in eval, add retries, move on.

A production-grade LLM agent in a clinical system has a much worse failure profile. Same failure modes, but now every one of them is a case that doesn’t finalise. A clinician opened a chart, expected a recommendation, and gets a spinner. Or a 500. Or worse: a partial result that looks like a recommendation but is actually a JSON parse fragment from a model that ran out of tokens.

The reflex is to add more retries, more timeouts, more error handling. That doesn’t fix the underlying issue. The underlying issue is that failure is the default behaviour of any system whose primary path involves a model that can occasionally just not work.

For Bodh, the open-source medical multi-agent platform I’ve been building, I went the other direction. Every LLM-backed agent has a deterministic rule-based fallback. The fallback isn’t a workaround — it’s the contract.

Here’s what that looks like, why it matters, and where it doesn’t apply.


The shape

Every LLM-backed diagnostician looks like this:

type LLMDiagnostician struct {
    Client        llm.Client            // Anthropic / OpenAI / Ollama / Voyage
    Fallback      *DiagnosticianAgent   // always present
    MinConfidence float64               // default 0.6
    Timeout       time.Duration         // default 30s
    MaxRetries    int                   // default 1 transient retry
}

func (a *LLMDiagnostician) HandleMessage(
    ctx context.Context, msg agent.Message, env agent.Environment,
) ([]agent.Message, error) {
    // Try LLM path with timeout
    llmCtx, cancel := context.WithTimeout(ctx, a.Timeout)
    defer cancel()

    resp, err := a.Client.Complete(llmCtx, a.buildRequest(msg))
    if err != nil {
        env.Logf("[diagnostician/llm] error: %v — falling back to rule-based", err)
        return a.Fallback.HandleMessage(ctx, msg, env)
    }

    proposal, err := medical.DecodeProposal(resp.Content)
    if err != nil {
        env.Logf("[diagnostician/llm] malformed JSON — falling back")
        return a.Fallback.HandleMessage(ctx, msg, env)
    }

    if proposal.Confidence < a.MinConfidence {
        env.Logf("[diagnostician/llm] confidence %.2f < threshold — falling back",
            proposal.Confidence)
        return a.Fallback.HandleMessage(ctx, msg, env)
    }

    if !safety.IsGroundedInCitations(proposal.Rationale, proposal.Citations) {
        env.Logf("[diagnostician/llm] hallucination check failed — falling back")
        return a.Fallback.HandleMessage(ctx, msg, env)
    }

    if safety.ContainsPHIPattern(proposal.Rationale) {
        env.Logf("[diagnostician/llm] PHI pattern in output — falling back")
        return a.Fallback.HandleMessage(ctx, msg, env)
    }

    return []agent.Message{a.emit(proposal, msg)}, nil
}

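Six triggers lead to the fallback. The handler above has five branches because a transport error and an exceeded latency budget both surface as an error from Complete. Each trigger is named, logged, and recorded in the [llm-trace] JSON line for that call: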

| Trigger | What it means |
|---|---|
| Network timeout / transport error | Provider is down or slow; degrade to deterministic path |
| Malformed JSON from model | Model didn’t follow the schema; deterministic path catches it |
| Confidence below MinConfidence | Model isn’t sure; rule-based path’s confidence is at least known and bounded |
| Hallucination check (RAG grounding) | Rationale doesn’t reference the supplied citations; can’t trust the reasoning |
| Safety guardrail trips (refusal phrase, PHI pattern in output) | Output is malformed for a different reason; deterministic path is safer |
| Latency budget exceeded | Even if the LLM eventually returns, the user has already moved on |

The case always finalises. That’s the contract.
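To make the two output checks concrete, here is a minimal sketch of what a grounding check and a PHI pattern check could look like. It is not Bodh's implementation: isGrounded, containsPHIPattern, the Citation type, the "[id]" citation-marker convention, and both regular expressions are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// Citation is a hypothetical shape for a source passage handed to the model.
type Citation struct {
	ID string // e.g. "c1"; the rationale is assumed to cite it as "[c1]"
}

// citeMarker matches bracketed citation markers such as "[c1]" (assumed convention).
var citeMarker = regexp.MustCompile(`\[([A-Za-z0-9_-]+)\]`)

// isGrounded reports whether the rationale cites at least one supplied
// citation and cites nothing that was not supplied.
func isGrounded(rationale string, citations []Citation) bool {
	known := make(map[string]bool, len(citations))
	for _, c := range citations {
		known[c.ID] = true
	}
	cited := 0
	for _, m := range citeMarker.FindAllStringSubmatch(rationale, -1) {
		if !known[m[1]] {
			return false // cites a source the model was never given
		}
		cited++
	}
	return cited > 0
}

// phiPatterns are illustrative only: SSN-like numbers and "MRN" followed by digits.
var phiPatterns = []*regexp.Regexp{
	regexp.MustCompile(`\b\d{3}-\d{2}-\d{4}\b`),
	regexp.MustCompile(`(?i)\bMRN[:#\s]*\d{6,}\b`),
}

// containsPHIPattern reports whether any illustrative PHI pattern matches.
func containsPHIPattern(text string) bool {
	for _, p := range phiPatterns {
		if p.MatchString(text) {
			return true
		}
	}
	return false
}

func main() {
	cites := []Citation{{ID: "c1"}, {ID: "c2"}}
	fmt.Println(isGrounded("Lobar consolidation [c1] with fever [c2].", cites)) // true
	fmt.Println(isGrounded("Classic presentation [c7].", cites))               // false: c7 was never supplied
	fmt.Println(containsPHIPattern("Patient MRN: 00123456 has CAP."))          // true
}
```

Both checks are deliberately strict: anything they cannot verify routes to the rule-based path rather than to the clinician.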


Why this matters for clinical work

A diagnostic case in Bodh runs 12 messages through 8 agents. Twelve Publish calls. Twelve subscribers picking them up. Each message can be governance-gated, audit-logged, HITL-queued, or routed onward. The orchestrator is small (~30 lines for the critical closure); the real complexity is in the state machine the supervisor runs.

If the LLM diagnostician errors mid-case, the supervisor is sitting at incoming msg.Type == "case_state", expecting a diagnosis_proposal. The downstream reasoning_verifier is waiting too. The HITL gate is waiting on the reasoning_verifier. The clinician is staring at a spinner.

Add a fallback, and:

- The clinician never sees the underlying LLM hiccup.
- The audit log shows the fallback path was taken (so operators can graph it, alert on it, debug it).
- The case finalises.

That’s the difference between “LLM eventually works most of the time” and “this is a clinical system.”


“Degrade gracefully” — what graceful actually means

The rule-based fallback isn’t going to be as good as the LLM on hard cases. That’s fine. The question is what graceful degradation looks like for clinical reasoning.

Bodh’s rule-based diagnostician, after PR #11, hits 85.7% accuracy (6 of 7 cases) on a 7-case synthetic bench, up from ~42.9% (3 of 7) before specialty rules landed. The LLM-backed diagnosticians should beat it on hard cases; that’s the whole point of having them. But on easy cases, the rule provider is fast (~50µs), cheap ($0), and confident in a bounded way (we know the rule).

The graceful degradation curve looks like this:

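LLM available + confident → LLM diagnosis (~2-5s, $0.01-0.05 per call, ~90% accuracy)
LLM available + low conf  → Rule fallback (LLM call already spent, then ~50µs and $0 for the rules, ~85.7% accuracy)
LLM unavailable           → Rule fallback (after the error or timeout, then ~50µs and $0, ~85.7% accuracy)
LLM hallucinating         → Rule fallback (LLM call already spent, then ~50µs and $0 for the rules, ~85.7% accuracy)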

The floor isn’t “the system breaks.” The floor is “85.7% on this bench, with the cost and rationale audit-trailed.” That’s a number a compliance team can plan against.


Three things the fallback contract enables

1. Operational metrics that matter

Every call to Complete emits an [llm-trace] JSON line with provider, model, agent_id, case_id (hashed for privacy), duration_ms, token counts, stop_reason, and an explicit outcome field: success | fallback_timeout | fallback_malformed | fallback_low_confidence | fallback_hallucination | fallback_phi_pattern | fallback_safety.

Aggregate the outcome field:

bodh_llm_calls_total{provider="anthropic", outcome="success"}             142
bodh_llm_calls_total{provider="anthropic", outcome="fallback_timeout"}    7
bodh_llm_calls_total{provider="anthropic", outcome="fallback_malformed"}  2

That’s a Service Level Indicator. The SLO is “fallback rate ≤ 5%, alert at > 10% over 15 min.” When the SLO trips, you know exactly which failure mode is driving it and which provider.

Without the fallback contract, the metric is “success rate” — a binary that hides the distribution of failure modes you actually care about.
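As a worked example, the three counters above can be turned into the fallback-rate SLI directly. This is a sketch with hypothetical helper names (fallbackRate, sloStatus); the 5% and 10% thresholds are the ones quoted above, and the 15-minute window is not modelled.

```go
package main

import "fmt"

// fallbackRate returns the share of calls whose outcome is not "success",
// plus each fallback outcome's share of all calls.
func fallbackRate(counts map[string]int) (float64, map[string]float64) {
	byOutcome := make(map[string]float64)
	total := 0
	for _, n := range counts {
		total += n
	}
	if total == 0 {
		return 0, byOutcome
	}
	fallbacks := 0
	for outcome, n := range counts {
		if outcome == "success" {
			continue
		}
		fallbacks += n
		byOutcome[outcome] = float64(n) / float64(total)
	}
	return float64(fallbacks) / float64(total), byOutcome
}

// sloStatus applies the thresholds from the text: SLO at 5%, alert above 10%.
func sloStatus(rate float64) string {
	switch {
	case rate > 0.10:
		return "alert"
	case rate > 0.05:
		return "slo_breached"
	default:
		return "ok"
	}
}

func main() {
	counts := map[string]int{
		"success":            142,
		"fallback_timeout":   7,
		"fallback_malformed": 2,
	}
	rate, by := fallbackRate(counts)
	fmt.Printf("fallback rate: %.2f%%\n", rate*100) // 9 of 151 calls: 5.96%
	fmt.Printf("timeout share: %.2f%%\n", by["fallback_timeout"]*100)
	fmt.Println("status:", sloStatus(rate)) // slo_breached
}
```

With these sample counts the rate is 9/151, just over the 5% SLO and below the 10% alert threshold, and timeouts account for most of it.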

2. Cross-provider failover gets simple

For multi-provider robustness (Anthropic outage → OpenAI), the pattern composes:

secondary := &LLMDiagnostician{Client: openai,    Fallback: ruleBased}
primary   := &LLMDiagnostician{Client: anthropic, Fallback: secondary}

primary tries Anthropic first; on failure it falls back to secondary, which tries OpenAI, which falls back to rule-based. Nesting like this needs Fallback to be an interface that both *DiagnosticianAgent and *LLMDiagnostician satisfy, rather than the concrete *DiagnosticianAgent shown in the struct earlier. Each layer has its own confidence threshold, its own timeout, its own safety checks. The case still finalises if every layer fails, at the rule provider’s floor.

This is the same shape as a fallback chain (often paired with circuit breakers) in any distributed system. It’s not novel. But clinical AI systems are routinely shipped without it, and the result is that a degraded LLM provider becomes a degraded clinical workflow.

3. Eval and bench numbers stay honest

The bench (cmd/bench) runs against every provider including the rule provider. When an LLM provider drops below the rule provider’s accuracy on a per-condition basis, that’s a signal — maybe the model is wrong, maybe the prompt is wrong, maybe the RAG corpus is missing the right citations. The bench surfaces the regression because you have the floor to compare against.

Without the fallback as a measurable baseline, every LLM result is its own island. “Is the new model better?” is a hard question without a deterministic comparison point.
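The per-condition comparison against the floor can be sketched as a small check. The function name regressions, the map-based input shape, and the sample numbers are made up for illustration and are not bench results; the 5-percentage-point tolerance matches the per-provider SLO suggested later in this article.

```go
package main

import (
	"fmt"
	"sort"
)

// regressions returns the conditions where the candidate provider scores
// more than tolerance below the rule provider's accuracy.
func regressions(rule, candidate map[string]float64, tolerance float64) []string {
	var out []string
	for cond, floor := range rule {
		got, ok := candidate[cond]
		if !ok {
			continue // condition not benchmarked for the candidate
		}
		if floor-got > tolerance {
			out = append(out, cond)
		}
	}
	sort.Strings(out) // stable output regardless of map iteration order
	return out
}

func main() {
	// Illustrative accuracies only, keyed by hypothetical condition names.
	rule := map[string]float64{"condition_a": 1.0, "condition_b": 0.5, "condition_c": 1.0}
	llm := map[string]float64{"condition_a": 1.0, "condition_b": 1.0, "condition_c": 0.5}

	fmt.Println(regressions(rule, llm, 0.05)) // [condition_c]
}
```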


What this doesn’t mean

A few things the fallback contract is not:

Not “the LLM is unreliable, so don’t trust it”

The LLM is fine. Production LLMs from Anthropic, OpenAI, and (with appropriate hardware) Ollama are operationally sound. Bodh’s measured LLM-path success rate in dev with reasonable prompts is ≥95%.

The fallback contract is for the ≤5% that — across thousands of cases — will fail in some specific way at some specific time. For clinical work, the 5% matters as much as the 95%.

Not “use rules instead of LLMs”

The rule provider exists to be the fallback. It’s not the primary path for production. It’s the floor under the primary path.

For most clinical reasoning, LLMs are dramatically better than my hand-coded rules. The rule provider exists because there’s always going to be a 5% — and because having a measurable floor improves the whole system, not just the failure cases.

Not “fallback is free”

Maintaining the rule provider has a cost. PR #11 was specifically about lifting the rule provider from 42.9% to 85.7% accuracy — that took rule-tuning, fixture-tuning, and bug-hunting in test result parsing. The rule provider needs ongoing care as the bench grows, the same way a unit test suite needs care.

The investment pays back when the LLM provider has a bad day. The rule provider’s bench accuracy is the SLA you can promise.


Where the pattern doesn’t apply

Three places the fallback contract doesn’t shape the architecture:

1. Reads with no decision

Pre-visit chart prep is read-only. It summarises the patient’s open gaps, recent labs, last visit notes. If the LLM-backed summariser fails, the clinician gets the raw chart — same as if Bodh wasn’t running. No fallback needed because there’s no decision to fall back from.

2. Pure observability outputs

The OpenTelemetry exporter (PR #7) is fire-and-forget. If it fails to ship a span to the collector, the span is dropped and the application continues. There’s no fallback because the alternative, making the application block on observability, is the worse outcome.

This is the same shape as audit recording (PR #12): fail-open on observability, fail-closed on safety. Audit write failure logs an error and continues; policy denial blocks the message.

3. Inputs that can be retried explicitly

The HL7 v2 ADT endpoint (PR #8) returns AE ACK on parse failure with HTTP 400. The caller knows to retry with a corrected message. There’s no fallback because the caller has the context to fix the input.

The pattern is: fall back when the system can’t usefully tell the caller about the failure. When the caller can fix it, surface the error. When the system has to keep going without the caller’s help, fall back.


Five SLOs the fallback contract makes possible

| SLI | Suggested SLO | What you do when breached |
|---|---|---|
| LLM call success rate (no fallback) | ≥ 95% | Check provider status; check API key; check prompt rev |
| Fallback rate per agent | ≤ 5% | Identify which trigger; tune prompt or rule path accordingly |
| Case completion p95 (incl. fallback latency) | < 60 s | Profile the slow case; look at agent hops |
| Per-provider Pareto degradation | < 5pp accuracy regression vs rule | Bench regression alert; investigate model / prompt |
| Hallucination check fail rate | < 1% with RAG enabled | Check corpus drift; expand RAG knowledge base |

Each one has a defined response. That’s the difference between “the AI is having a bad day” and an actionable operational metric.


Try it

The pattern lives in agents/diagnostician_llm.go, with the factory in agents/diagnostician_factory.go doing the LLM-client construction. The Debate / Reflexion / Cascade variants (diagnostician_debate.go, _reflexion.go, _cascade.go) all wrap the same fallback contract.

git clone https://github.com/PratikDhanave/bodh.git
cd bodh
export ANTHROPIC_API_KEY=sk-ant-...

# Side-by-side: rule vs LLM, with the rule provider as the visible floor
go run ./cmd/bench -providers=rule,llm-anthropic -manifest=data/cases/manifest.json

# Force LLM failures to test the fallback (use a bogus key)
ANTHROPIC_API_KEY=sk-bogus go run ./cmd/demo -diagnostician=llm-anthropic
# Logs:
#   [diagnostician/llm] error: 401 — falling back to rule-based
#   [diagnostician] CAP @ 0.87
#   [reasoning_verifier] VERIFIED
#   [supervisor] gold standard: Community-acquired pneumonia | match=true

Repo: github.com/PratikDhanave/bodh

If you’re building production AI systems that touch clinical workflows and want to compare reliability patterns — issues, PRs, and DMs all welcome.


Bodh is a research and engineering reference. Not a medical device. Not approved for clinical use. The reliability patterns described here are architectural; clinical AI in production additionally requires regulatory review, clinical validation, and operational tooling I have not built.

#SRE #Reliability #ClinicalAI #LLM #ProductionEngineering #SoftwareEngineering #HealthTech #Go #OpenSource