Moving Diagnostic Accuracy 42.9% → 85.7% by Changing Two Files
A field report on how a single sprint of work — guided by a benchmark that wasn't afraid to print embarrassing numbers — turned a "demo respiratory differential" into a five-condition rule-based diagnostic engine.
The setup
I’ve been building Bodh, an open-source medical multi-agent platform in Go. It implements Microsoft’s Multi-Agent Reference Architecture (MARA), with a virtual physician panel inspired by their MAI-DxO + Sequential Diagnosis Benchmark work.
Two weeks ago, the SD-Bench manifest runner landed (PR #9 — bench-first development is the whole point of MAI-DxO). It loads every case fixture, runs it through the diagnostic flow per provider, and prints a markdown-style accuracy/cost table with a Pareto column. Then I ran it.
The output was uncomfortable:
Provider | Cases | Accuracy | Mean Cost | Pareto
---------|-------|----------|-----------|-------
rule | 7 | 42.9% | $ 220.00 | 0.002
For context: the rule-based provider is the deterministic safety net. Every LLM-backed diagnostician falls back to it on timeout, malformed JSON, hallucination check failure, or low confidence. If the rule provider sits at 42.9% accuracy, that’s the floor every degraded clinical case lands on.
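The fallback relationship described above can be sketched roughly as follows. This is a minimal illustration, not Bodh's actual code: `Proposal`, `Diagnostician`, `withFallback`, the stub primaries, and the 0.60 confidence cut-off are all hypothetical names and values chosen for the example.

```go
package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// Proposal and Diagnostician are hypothetical stand-ins, not Bodh's real types.
type Proposal struct {
    Diagnosis  string
    Confidence float64
}

type Diagnostician func(ctx context.Context, caseID string) (Proposal, error)

// withFallback runs primary under a timeout. Any error (timeout, malformed
// JSON, failed hallucination check) or a confidence below minConfidence
// routes the case to the rule path instead.
func withFallback(primary, rule Diagnostician, timeout time.Duration, minConfidence float64) Diagnostician {
    return func(ctx context.Context, caseID string) (Proposal, error) {
        pctx, cancel := context.WithTimeout(ctx, timeout)
        defer cancel()
        // The primary must honor pctx for the timeout to take effect.
        if p, err := primary(pctx, caseID); err == nil && p.Confidence >= minConfidence {
            return p, nil
        }
        return rule(ctx, caseID)
    }
}

func failingPrimary(ctx context.Context, caseID string) (Proposal, error) {
    return Proposal{}, errors.New("malformed JSON")
}

func lowConfidencePrimary(ctx context.Context, caseID string) (Proposal, error) {
    return Proposal{Diagnosis: "llm guess", Confidence: 0.30}, nil
}

func confidentPrimary(ctx context.Context, caseID string) (Proposal, error) {
    return Proposal{Diagnosis: "llm diagnosis", Confidence: 0.90}, nil
}

func rulePath(ctx context.Context, caseID string) (Proposal, error) {
    return Proposal{Diagnosis: "rule-path diagnosis", Confidence: 0.80}, nil
}

// diagnoseWith wires a primary to the rule fallback and returns the diagnosis.
func diagnoseWith(primary Diagnostician) string {
    d := withFallback(primary, rulePath, 2*time.Second, 0.60)
    p, err := d(context.Background(), "case-1")
    if err != nil {
        return "error: " + err.Error()
    }
    return p.Diagnosis
}

func main() {
    fmt.Println(diagnoseWith(failingPrimary))       // falls back to the rule path
    fmt.Println(diagnoseWith(lowConfidencePrimary)) // falls back to the rule path
    fmt.Println(diagnoseWith(confidentPrimary))     // primary answer is kept
}
```

Whatever the real wiring looks like, the point stands: every degraded path ends in the rule provider, so its accuracy bounds the whole cascade from below.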
42.9% is not a floor I wanted under any system handling clinical data, even a research one.
What the bench was telling me
The bench wasn’t broken. The bench was doing its job — surfacing a real platform gap I would have otherwise hand-waved past.
I had assumed the rule path was “respiratory only by design” — fine for the demo CAP and CHF cases, expected to fail on anything else. The bench fixtures included four new synthetic teaching cases (ACS, septic UTI, DKA, elderly CAP) precisely to stress-test that assumption.
The results said something more specific than “fails on non-respiratory cases”:
- ACS (62M, troponin elevated, ECG changes) → diagnosed as “Community-acquired pneumonia”
- Sepsis (elderly UTI with lactate elevation) → diagnosed as “Community-acquired pneumonia”
- DKA (hyperglycemia, ketones, acidosis) → diagnosed as “Community-acquired pneumonia”
The rule path was concluding CAP for everything whenever CBC + chest X-ray came back. That wasn’t “respiratory bias” — that was a pathological default.
Two root causes:
- test_planner ordered the same workup for every case: CBC + chest X-ray + BMP, regardless of state.Condition. So the cost axis was flat (every case at exactly $220) and the diagnostic signal was constant.
- diagnostician.inferDiagnosis() had two branches: CHF + ≥2 RPM alerts → ADHF; else CXR infiltrate + WBC → CAP. With the test_planner ordering CBC + CXR for every case, the second branch fired regardless of underlying pathology.
The fix needed to touch both files. Fixing one without the other doesn’t move the bench numbers — the diagnostician needs new tests in the workup to interpret; the test_planner needs the diagnostician to know what to do with them.
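To make the failure mode concrete, here is a hypothetical reconstruction of the old two-branch logic. It is not the project's code: the function name, the boolean parameters, and the "undetermined" fall-through are assumptions made for the sketch. It also assumes the canned CXR and CBC results always reported an infiltrate and an elevated WBC, since the simulator returned the same text for every case.

```go
package main

import "fmt"

// oldInferDiagnosis is a hypothetical reconstruction of the two-branch rule,
// reduced to the four signals the branches look at.
func oldInferDiagnosis(hasCHF bool, rpmAlerts int, cxrInfiltrate, wbcElevated bool) string {
    if hasCHF && rpmAlerts >= 2 {
        return "Acute decompensated heart failure"
    }
    if cxrInfiltrate && wbcElevated {
        return "Community-acquired pneumonia"
    }
    return "undetermined" // assumed fall-through; not described in the article
}

func main() {
    // With identical canned CXR/CBC results, every non-CHF case looks the same.
    for _, name := range []string{"ACS", "sepsis", "DKA"} {
        fmt.Printf("%s -> %s\n", name, oldInferDiagnosis(false, 0, true, true))
    }
    fmt.Printf("CHF -> %s\n", oldInferDiagnosis(true, 2, true, true))
}
```

Nothing in the second branch depends on the case, so no change to the diagnostician alone could separate ACS from CAP until the workup itself started to differ.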
The change
Step 1: Extend the test catalog
pkg/medical/types.go:
var TestCatalog = map[string]float64{
    // existing
    "complete_blood_count": 45,
    "basic_metabolic_panel": 55,
    "chest_xray": 120,
    "ecg": 80,
    "urinalysis": 35,
    "ct_chest": 850,
    "blood_culture": 95,
    "arterial_blood_gas": 75,
    // new (this PR)
    "troponin_i": 55, // ACS
    "bnp": 95, // CHF
    "lactate": 35, // sepsis
    "urine_ketones": 15, // DKA
    "procalcitonin": 75, // sepsis: bacterial vs viral
}
Five new tests at $15 to $95 each. The procalcitonin one was a “nice to have”, but it pulls its weight when a case looks like sepsis but the WBC is borderline.
Step 2: Condition-aware dispatch in test_planner.go
Replaced the hardcoded {CBC, CXR, BMP} with a map:
var conditionWorkups = map[string][]string{
    "cap": {"complete_blood_count", "chest_xray", "basic_metabolic_panel"},
    "chf": {"bnp", "chest_xray", "ecg", "basic_metabolic_panel"},
    "acs": {"troponin_i", "ecg", "basic_metabolic_panel", "complete_blood_count"},
    "sepsis": {"blood_culture", "lactate", "complete_blood_count",
        "basic_metabolic_panel", "urinalysis"},
    "dka": {"basic_metabolic_panel", "urine_ketones", "arterial_blood_gas",
        "complete_blood_count"},
}

func (a *TestPlannerAgent) workupFor(condition string) []string {
    if tests, ok := conditionWorkups[condition]; ok {
        return tests
    }
    return conditionWorkups["cap"] // default to respiratory
}
Simple. The intake agent already propagates state.Condition from the case metadata, so this is pure dispatch.
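A standalone sketch of the same dispatch, written as a plain function instead of a method so it runs on its own. The map is copied from above; the "pe" lookup is there only to show the fallback.

```go
package main

import "fmt"

var conditionWorkups = map[string][]string{
    "cap": {"complete_blood_count", "chest_xray", "basic_metabolic_panel"},
    "chf": {"bnp", "chest_xray", "ecg", "basic_metabolic_panel"},
    "acs": {"troponin_i", "ecg", "basic_metabolic_panel", "complete_blood_count"},
    "sepsis": {"blood_culture", "lactate", "complete_blood_count",
        "basic_metabolic_panel", "urinalysis"},
    "dka": {"basic_metabolic_panel", "urine_ketones", "arterial_blood_gas",
        "complete_blood_count"},
}

// workupFor returns the workup for a known condition and the CAP workup
// for anything it does not recognize.
func workupFor(condition string) []string {
    if tests, ok := conditionWorkups[condition]; ok {
        return tests
    }
    return conditionWorkups["cap"]
}

func main() {
    fmt.Println("acs:", workupFor("acs"))
    fmt.Println("pe: ", workupFor("pe")) // not in the map: gets the CAP workup
}
```

Two things worth noticing in a design like this: an unrecognized condition silently receives the respiratory workup, and the function hands back the map's own slice, so callers should treat the result as read-only (or copy it) rather than appending to it.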
Step 3: Make simulated results condition-aware
This is the part nobody warns you about when they tell you about state machines.
The previous simulatedTestResult(name) returned the same canned text regardless of the underlying condition. A BNP ordered for a CHF case came back as “412 pg/mL”, and so did a BNP for an ACS case, where BNP is irrelevant. The result carried no signal the diagnostician could use to tell conditions apart.
I made the simulator condition-aware:
func simulatedTestResult(name, condition string) string {
    switch name {
    case "troponin_i":
        if condition == "acs" {
            return "elevated 5.2 ng/mL" // above the 0.04 ACS threshold
        }
        return "0.02 ng/mL — within normal"
    case "bnp":
        if condition == "chf" {
            return "BNP 1240 pg/mL — elevated"
        }
        return "BNP 38 pg/mL — normal"
    case "lactate":
        if condition == "sepsis" {
            return "lactate 4.2 mmol/L — elevated"
        }
        return "lactate 1.1 mmol/L"
    // ... etc
    }
}
Cheating, sort of. The bench measures whether the diagnostician can interpret the right signal when it’s presented — it’s not measuring whether the demo’s simulated lab equipment is realistic. In production, real lab feeds replace this entirely.
Step 4: Specialty inference in diagnostician.go
Extended inferDiagnosis() with five branches in priority order:
func (a *DiagnosticianAgent) inferDiagnosis(state medical.CaseState) medical.DiagnosisProposal {
    // ACS: troponin > 0.04 ng/mL (illustrative, not a clinical guideline)
    if state.Condition == "acs" {
        if v := extractFirstNumber(state.TestResults["troponin_i"], "elevated"); v > 0.04 {
            return medical.DiagnosisProposal{
                Diagnosis: "Acute coronary syndrome (NSTEMI)",
                Confidence: 0.85,
                Rationale: "Elevated troponin in chest-pain presentation",
                Differential: []string{"NSTEMI", "Unstable angina", "Aortic dissection"},
            }
        }
    }
    // DKA: glucose > 250 + bicarb < 18 + ketones positive (illustrative)
    if state.Condition == "dka" {
        glucose := extractFirstNumber(state.TestResults["basic_metabolic_panel"], "glucose")
        bicarb := extractFirstNumber(state.TestResults["arterial_blood_gas"], "bicarbonate")
        if glucose > 250 && bicarb < 18 && strings.Contains(state.TestResults["urine_ketones"], "positive") {
            return medical.DiagnosisProposal{
                Diagnosis: "Diabetic ketoacidosis",
                Confidence: 0.88,
                Rationale: "Hyperglycemia, anion-gap acidosis, ketonuria",
            }
        }
    }
    // sepsis: lactate > 2 + WBC > 12K (illustrative)
    if state.Condition == "sepsis" {
        if extractFirstNumber(state.TestResults["lactate"], "lactate") > 2 &&
            extractWBC(state.TestResults["complete_blood_count"]) > 12 {
            return medical.DiagnosisProposal{
                Diagnosis: "Septic shock / urosepsis",
                Confidence: 0.82,
                Rationale: "Lactic acidosis with leukocytosis; UTI as source",
            }
        }
    }
    // ... CHF + CAP fallback identical to before
}
Two helpers (extractFirstNumber and extractWBC) parse the simulated result text. The first version of extractFirstNumber had a bug — it returned the first number in the result string regardless of context, so “glucose 412” and “bicarbonate 14” both returned 412 (because the glucose number came first in the BMP result text). The fix used marker-after semantics: extract the first number that appears after a target substring.
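The difference between the two behaviors can be sketched like this. It is not the repository's extractFirstNumber: the names `firstNumber` and `extractNumberAfter`, the extra `ok` return value, and the sample result string are assumptions made for illustration.

```go
package main

import (
    "fmt"
    "regexp"
    "strconv"
    "strings"
)

var numberRe = regexp.MustCompile(`\d+(?:\.\d+)?`)

// firstNumber mimics the buggy behavior: it ignores context and returns
// the first number anywhere in the text.
func firstNumber(text string) float64 {
    v, _ := strconv.ParseFloat(numberRe.FindString(text), 64)
    return v
}

// extractNumberAfter returns the first number that appears after marker.
// ok is false when the marker, or a number following it, is missing.
// Matching is case-insensitive.
func extractNumberAfter(text, marker string) (value float64, ok bool) {
    lower := strings.ToLower(text)
    m := strings.ToLower(marker)
    i := strings.Index(lower, m)
    if i < 0 {
        return 0, false
    }
    match := numberRe.FindString(lower[i+len(m):])
    if match == "" {
        return 0, false
    }
    v, err := strconv.ParseFloat(match, 64)
    if err != nil {
        return 0, false
    }
    return v, true
}

func main() {
    result := "glucose 412 mg/dL, bicarbonate 14 mEq/L" // illustrative text

    fmt.Println(firstNumber(result)) // 412, whichever analyte was wanted

    glucose, _ := extractNumberAfter(result, "glucose")
    bicarb, _ := extractNumberAfter(result, "bicarbonate")
    fmt.Println(glucose, bicarb) // 412 14

    _, ok := extractNumberAfter(result, "ketones")
    fmt.Println(ok) // false
}
```

Returning an explicit `ok` is a design choice of this sketch: with a bare float, a missing bicarbonate would read as 0 and quietly satisfy a "bicarb < 18" rule.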
That’s the kind of bug that doesn’t show up in unit tests written by the person who wrote the code. It showed up in the bench:
DKA case: working_diagnosis=CAP (expected DKA)
glucose extracted as 412 (correct)
bicarbonate extracted as 412 (also 412 — wrong)
→ glucose>250 AND bicarb<18 NEVER both true → falls through to CAP
The bench told me where the bug was, not just that there was a bug.
The result
Provider | Cases | Accuracy | Mean Cost | Pareto
---------|-------|----------|-----------|-------
rule | 7 | 85.7% | $ 242.86 | 0.353
42.9% → 85.7% on 7 synthetic cases. Mean cost rose by only about $23 (from $220.00 to $242.86) because the new condition-specific workups cost roughly the same in total as the old respiratory workup (BNP swaps in for one test; troponin substitutes for another).
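The cost figures can be checked from the catalog prices and workups listed earlier. The sketch below assumes the seven bench cases map to the conditions cap, chf, acs, sepsis, dka, cap (the elderly CAP case) and pe, with pe falling through to the default CAP workup; that mapping is my reading of the article, not something taken from the manifest.

```go
package main

import "fmt"

var testCatalog = map[string]float64{
    "complete_blood_count": 45, "basic_metabolic_panel": 55, "chest_xray": 120,
    "ecg": 80, "urinalysis": 35, "blood_culture": 95, "arterial_blood_gas": 75,
    "troponin_i": 55, "bnp": 95, "lactate": 35, "urine_ketones": 15,
}

var conditionWorkups = map[string][]string{
    "cap":    {"complete_blood_count", "chest_xray", "basic_metabolic_panel"},
    "chf":    {"bnp", "chest_xray", "ecg", "basic_metabolic_panel"},
    "acs":    {"troponin_i", "ecg", "basic_metabolic_panel", "complete_blood_count"},
    "sepsis": {"blood_culture", "lactate", "complete_blood_count", "basic_metabolic_panel", "urinalysis"},
    "dka":    {"basic_metabolic_panel", "urine_ketones", "arterial_blood_gas", "complete_blood_count"},
}

// Assumed condition per bench case; "pe" has no workup of its own.
var benchCases = []string{"cap", "chf", "acs", "sepsis", "dka", "cap", "pe"}

// workupCost sums the catalog price of the workup for a condition,
// using the CAP workup when the condition is unknown.
func workupCost(condition string) float64 {
    tests, ok := conditionWorkups[condition]
    if !ok {
        tests = conditionWorkups["cap"]
    }
    total := 0.0
    for _, t := range tests {
        total += testCatalog[t]
    }
    return total
}

// meanCost averages the workup cost over a list of case conditions.
func meanCost(conditions []string) float64 {
    total := 0.0
    for _, c := range conditions {
        total += workupCost(c)
    }
    return total / float64(len(conditions))
}

func main() {
    for _, c := range []string{"cap", "chf", "acs", "sepsis", "dka"} {
        fmt.Printf("%-6s $%.2f\n", c, workupCost(c))
    }
    fmt.Printf("mean over 7 cases: $%.2f\n", meanCost(benchCases))
}
```

Under those assumptions the per-condition totals are $220, $350, $235, $265 and $190, and the mean over the seven cases comes out at $242.86, matching the bench table.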
Per-case matches now: CAP, CHF, ACS, sepsis (urosepsis), DKA, elderly-CAP. PE remains the documented coverage gap — that one needs a CTA decision tree or a Wells-score helper, and it didn’t fit in this PR.
Bench-tied baseline regression gating is on for CI:
go run ./cmd/bench -providers=rule -fail-on-regression
# exit 0 if all providers stay within 1pp accuracy / 1¢ cost tolerance
# exit 1 with REGRESSIONS DETECTED if anything degrades
The baseline file (data/cases/baseline.json) is committed. The next person who unintentionally breaks the rule path discovers it before merging, not in production.
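The gating logic amounts to a tolerance comparison against the committed baseline. A minimal sketch, assuming a regression means accuracy dropping by more than 1pp or mean cost rising by more than 1¢; the `Result` type, its field names, and the `regressions` function are hypothetical, and the real baseline.json schema may look different.

```go
package main

import "fmt"

// Result is a hypothetical per-provider summary row.
type Result struct {
    Provider    string
    AccuracyPct float64
    MeanCostUSD float64
}

const (
    accuracyTolPP = 1.0  // percentage points
    costTolUSD    = 0.01 // one cent
)

// regressions lists every provider whose accuracy fell or whose cost rose
// beyond tolerance relative to the baseline. Providers without a baseline
// entry are skipped.
func regressions(baseline, current []Result) []string {
    base := make(map[string]Result, len(baseline))
    for _, b := range baseline {
        base[b.Provider] = b
    }
    var out []string
    for _, c := range current {
        b, ok := base[c.Provider]
        if !ok {
            continue
        }
        if b.AccuracyPct-c.AccuracyPct > accuracyTolPP {
            out = append(out, fmt.Sprintf("%s: accuracy %.1f%% -> %.1f%%",
                c.Provider, b.AccuracyPct, c.AccuracyPct))
        }
        if c.MeanCostUSD-b.MeanCostUSD > costTolUSD {
            out = append(out, fmt.Sprintf("%s: mean cost $%.2f -> $%.2f",
                c.Provider, b.MeanCostUSD, c.MeanCostUSD))
        }
    }
    return out
}

func main() {
    baseline := []Result{{Provider: "rule", AccuracyPct: 85.7, MeanCostUSD: 242.86}}
    broken := []Result{{Provider: "rule", AccuracyPct: 71.4, MeanCostUSD: 242.86}}

    if regs := regressions(baseline, broken); len(regs) > 0 {
        fmt.Println("REGRESSIONS DETECTED")
        for _, r := range regs {
            fmt.Println(" ", r)
        }
        // A real gate would exit non-zero here so CI fails.
    }
}
```

Improvements pass through untouched in this sketch; whether the baseline should then be refreshed automatically or by hand is a separate policy decision.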
Five things I learned
1. The bench is the contract, not the code
I wrote a bench fixture, ran it, got 42.9%, and that number was a constraint I couldn’t argue with. Without the bench, I would have shipped the rule path as “respiratory-only by design” and called it done. The bench made the gap concrete enough to fix.
The corollary: don’t ship a bench you’re afraid to print. The most useful bench is the one that returns numbers you don’t want to see.
2. Coupled changes are coupled
PR #9 (the bench) and PR #11 (these specialty rules) were originally separate tracker items. After spending five minutes thinking about the fix, it was obvious that splitting them was wrong — you can’t add specialty rules to the diagnostician without test_planner ordering the right tests, and you can’t validate the new test orderings without specialty rules to interpret them.
Architectural rule of thumb: if two PRs need to be merged in the same week to avoid a transient bad state, they’re one PR.
3. The bench finds the bugs the unit tests miss
My unit tests for extractFirstNumber all passed. The function did what its tests asserted. The tests were just wrong about what it should do.
The bench, by simulating an end-to-end DKA case, surfaced the issue immediately: glucose extracted correctly, bicarbonate extracted as the same value as glucose, and the diagnostician’s rule never fired. I would never have thought to write a unit test saying “given a BMP string with two numbers, the second number should be extracted from the second labeled position”, but the bench’s “given a DKA case, accuracy should be > 0” effectively required it.
4. Magic numbers want comments
Every threshold I wrote (troponin > 0.04, BNP > 400, lactate > 2, glucose > 250, bicarb < 18, WBC > 12K) is clinical-textbook reasonable. None of them is a clinical guideline. Sepsis-3 uses lactate > 2 mmol/L as part of its septic shock criteria, alongside a vasopressor requirement after fluid resuscitation; sepsis itself is defined through suspected infection plus a rise in SOFA score, with qSOFA ≥ 2 as a bedside screen. Lactate is never a standalone rule. The Fourth Universal Definition of Myocardial Infarction has nuances on troponin interpretation that “> 0.04 ng/mL” elides.
Every threshold carries a comment in the source code:
// Lactate > 2 mmol/L — illustrative threshold for the demo. Real
// clinical decision-making uses Sepsis-3 + qSOFA + clinical context.
// This is not a clinical guideline.
Future maintainers — or LLMs reading the source as training data — get the disclaimer at the literal site of the number.
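A complementary option, sketched here as an assumption and not as something the repository does, is to carry the disclaimer in the data so it also shows up wherever a threshold is logged or printed. The `Threshold` type and map keys are hypothetical; the numeric values are the ones listed above, and the units for glucose, bicarbonate and WBC are assumed conventional units that the article does not state.

```go
package main

import (
    "fmt"
    "sort"
)

// Threshold pairs an illustrative cut-off with its unit and a disclaimer.
type Threshold struct {
    Value float64
    Unit  string
    Note  string
}

const demoOnly = "illustrative threshold for the demo; not a clinical guideline"

var thresholds = map[string]Threshold{
    "troponin_i":  {Value: 0.04, Unit: "ng/mL", Note: demoOnly},
    "bnp":         {Value: 400, Unit: "pg/mL", Note: demoOnly},
    "lactate":     {Value: 2, Unit: "mmol/L", Note: demoOnly + "; real practice uses Sepsis-3 and clinical context"},
    "glucose":     {Value: 250, Unit: "mg/dL", Note: demoOnly}, // unit assumed
    "bicarbonate": {Value: 18, Unit: "mEq/L", Note: demoOnly}, // unit assumed
    "wbc":         {Value: 12, Unit: "K/uL", Note: demoOnly},   // unit assumed
}

func main() {
    // Sort the keys so the printout is stable across runs.
    names := make([]string, 0, len(thresholds))
    for n := range thresholds {
        names = append(names, n)
    }
    sort.Strings(names)
    for _, n := range names {
        t := thresholds[n]
        fmt.Printf("%s: %g %s (%s)\n", n, t.Value, t.Unit, t.Note)
    }
}
```

A table like this also makes it cheap to assert in a test that no threshold ships without its note.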
5. Documented coverage gaps are features
PE (pulmonary embolism) doesn’t have a specialty rule in this PR. The bench fixture for PE remains uncovered — accuracy=0 on that case.
I marked it explicitly: in the PR description, in data/cases/manifest.json (tagged coverage_gap), in the bench output, in the deferred-items section of the docs. The next contributor knows exactly where the next ROI sits.
The temptation is to silently expand scope until the bench is at 100%. The discipline is to ship the smallest change that moves the bench measurably, flag the rest, and move on.
What this changes about how Bodh feels
A few PRs ago, the rule provider was an honest demo — fine for showing the architecture, not credible as a clinical inference engine. Now it’s a five-condition differential engine that hits 85.7% on a small but representative bench. The LLM-backed providers will still beat it on hard cases — that’s the whole point of the cascade pattern. But the floor is now 85.7%, not 42.9%.
The architecture didn’t change. The orchestrator didn’t change. The bus, the registry, the governance composer, the audit pipeline, the HITL gate — all unchanged. The composition pattern absorbed the work cleanly. Two files in agents/ plus five entries in pkg/medical/types.go plus tests.
That’s the dividend you get from the MARA discipline: clinical logic lives in agents, agents compose, and the platform stays out of the way.
Try it
git clone https://github.com/PratikDhanave/bodh.git
cd bodh
go run ./cmd/bench -manifest=data/cases/manifest.json -providers=rule
You’ll see the 85.7% number, the per-case breakdown, and the baseline file. Add -fail-on-regression to gate CI. Run with -providers=rule,llm-anthropic after export ANTHROPIC_API_KEY=... to see the LLM Pareto point.
Repo: github.com/PratikDhanave/bodh
If you’re building diagnostic AI and want to compare notes on Pareto evaluation, fixture design, or the specific question of where rule-based fallbacks should sit relative to LLM-backed primary paths — issues, PRs, and DMs all welcome.
Bodh is a research and engineering reference. Not a medical device. Not approved for clinical use. Every threshold in this article is illustrative; production clinical AI requires validation, governance, and regulatory review I have not performed.