CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

We built a high-fidelity simulator of 21 healthcare apps, ran 30 frontier agents through 75 long-horizon workflows, and the best one solved 28% of tasks. Here is what broke, why, and where it matters.

actAVA Research

Haolin Chen1, Deon Metelski1, Leon Qi1, Tao Xia1, Joonyul Lee1, Steve Brown1, Kevin Riley1, Frank Wang1, T. Y. Alvin Liu, MD2, Hank Capps, MD3, Zeyu Tang4, Xiangchen Song5, Lingjing Kong5, Fan Feng6, Tianyi Zeng7, Zhiwei Liu8, Zixian Ma9, Hang Jiang10, Fangli Geng11, Yuan Yuan12, Chenyu You13, Qingsong Wen14, Hua Wei15, Yanjie Fu15, Yue Zhao16, Carl Yang17, Biwei Huang6, Kun Zhang5,19, Caiming Xiong18, Sanmi Koyejo4, Eric P. Xing19,5, Philip S. Yu20, Weiran Yao1

1actAVA.ai · 2Johns Hopkins Medicine · 3Wellstar Health System · 4Stanford University · 5CMU · 6UCSD · 7Yale School of Medicine · 8Salesforce AI Research · 9University of Washington · 10Northeastern University · 11Brown University · 12Boston College · 13Stony Brook University · 14University of Oxford · 15Arizona State University · 16University of Southern California · 17Emory University · 18Recursive Superintelligence · 19MBZUAI · 20University of Illinois at Chicago


What we built

U.S. healthcare runs on long, fragmented, policy-heavy workflows — the kind that take a nurse, a coordinator, and a medical director hours of clicking, calling, and chart-reading to push to a terminal action. Frontier AI agents are increasingly pitched as the natural automation candidate here. We wanted to know how close they actually are.

So we built a high-fidelity simulator of 21 real healthcare apps — provider EHRs, payer UM portals, care-management consoles — wired together by their own state machines, and dropped frontier agents into them with a 1,279-document Managed-Care Operations Handbook and 200+ role-scoped MCP tools. Each task starts with a real clinical case (a knee-replacement prior auth, a chemotherapy review, a diabetes care-plan kickoff) and asks the agent to drive it end-to-end to a submitted packet, a finalized determination, or a complete care plan with patient consent. When the agent stops, a composite verifier reads the full workspace and grades the run.

Agent: frontier harness + model
  • 200+ role-scoped MCP tools
  • 1,279-doc Handbook skill
  • Workspace files + chart access
  • 30 harness × model combos evaluated
χ-World simulator: 21 healthcare apps · 3 MCP servers (FastAPI + SQLite + MCP over HTTP)
  • PA: Provider, Prior Auth · 5 apps · :8020
  • UM: Payer, Utilization Mgmt · 10 apps · :8100
  • CM: RN, Care Management · 5 apps · :8200
In-situ verifier: two-layer scorecard
  • Deterministic contract: case status, codes, P2P required
  • LLM judge (Opus 4.7): rationale, autonomy, grounding
  • Reward = contract ∧ judge
3 domains · 75 tasks · 3 trials per task · pass@1 · pass@3 · pass^3
Figure 1. CHI-Bench: Clinical Healthcare In-Situ environment and evaluation benchmark. The agent operates 21 healthcare apps through MCP, writes role artifacts to a shared workspace, and is graded by a composite verifier that reads the workspace, world state, and event trail.
  • Domains: 3 (PA · UM · CM)
  • Tasks: 75 (25 per domain)
  • Healthcare apps: 21 (via MCP)
  • MCP tools: 200+
  • Handbook docs: 1,279
  • Agent configs: 30 (harness × model)

Why this is hard

Coding benchmarks like SWE-Bench and Terminal-Bench measure long-horizon execution against a clean, deterministic backdrop: the file system doesn't talk back, the build either passes or it doesn't. Healthcare workflows look superficially similar but are wired differently in three ways that almost no current benchmark combines.

Policy density. Every decision has to be grounded in a specific rule — a payer's medical-necessity criterion, an internal escalation procedure, a state regulation. There are thousands of them, they vary by payer and plan year, and “misreading the rule” and “applying the right rule incorrectly” are different failure modes that we score separately.

Multi-role composition. A single PA case touches a coordinator, a UM nurse, a medical director, and sometimes a peer-to-peer call between an MD on each side. The agent has to switch role context, respect each role's authority boundaries, and track what it is safe to write, and every handoff is terminal. Submit the wrong packet and there's no rollback.

Multilateral interaction. Some steps aren't tool calls but live multi-turn conversations: a peer-to-peer review with the payer's MD, an RFI back to the provider, a twenty-minute outreach call with a chronic-disease patient. The agent has to drop out of execute-mode, hold a real conversation, and carry the result back into the workflow.

Three domains, 75 tasks

We picked the three places frontier agents are most often pitched as automation candidates today: provider prior authorization, payer utilization management, and care management. Each domain has 25 long-horizon tasks. To produce ground truth, a clinician (or an author wearing that hat) walked every task end-to-end on the live UI before it shipped — the average task takes 21 steps, the longest hits 40. Each task is then graded independently across three trials.

Prior Authorization
Provider role · 25 tasks

Build the case packet a payer needs to approve a service or medication. The agent works the chart, drafts the medical-necessity rationale, attaches policy evidence, and submits — once.

Best pass@1: 29.3% (Codex + GPT-5.5)
Utilization Management
Payer role · 25 tasks

Triage, nurse-review, and MD-decide an incoming authorization request against criteria. Run a peer-to-peer call if needed, then finalize the determination with rationale-rich notes.

Best pass@1: 41.3% (Claude Code + Opus 4.6)
Care Management
Care manager role · 25 tasks

Run intake, outreach, assessment, and care-plan steps for a chronic-disease member. Patient calls are simulated multi-turn dialogs; the agent must obtain consent before scoping the program.

Best pass@1: 32.0% (Claude Code + Opus 4.7)


How we grade a run

When the agent stops, the verifier reads everything it touched: world-store updates, files in the workspace, the full tool-call event trail, and any conversation transcripts. Two layers grade in parallel and both have to pass.

The deterministic contract covers checks that have a single right answer — did the case status reach the expected terminal state? did the right CPT codes land on the request? was a peer-to-peer raised when policy required one? The rubric LLM judge (pinned to claude-opus-4-7) handles items the contract can't formalize: was the medical-necessity rationale grounded in the cited policy? did the outreach call respect patient autonomy or quietly badger a refusing member into “yes”?

We report pass@1, pass@3 (any of three independent trials), and the strict pass^3 (all three) as a reliability metric.
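The three metrics can be sketched as follows. This is a minimal illustration, not the benchmark's scoring code: the function name, task ids, and outcomes are hypothetical, and it assumes pass@1 is the success rate over all task-trial pairs, which is consistent with leaderboard values such as 29.3% (22 of 75 trials) but is not stated explicitly.

```python
from typing import Dict, List


def pass_metrics(trials: Dict[str, List[bool]]) -> Dict[str, float]:
    """Return pass@1, pass@3 and pass^3 in percent.

    `trials` maps a task id to the outcomes of its three independent trials.
    """
    if not trials:
        raise ValueError("need at least one task")
    for task_id, outcomes in trials.items():
        if len(outcomes) != 3:
            raise ValueError(f"{task_id}: expected 3 trials, got {len(outcomes)}")

    n_tasks = len(trials)
    passed_trials = sum(sum(outcomes) for outcomes in trials.values())
    return {
        # Assumption: pass@1 is the success rate over all task-trial pairs.
        "pass@1": 100 * passed_trials / (3 * n_tasks),
        # pass@3: at least one of the three trials passed.
        "pass@3": 100 * sum(any(o) for o in trials.values()) / n_tasks,
        # pass^3: all three trials passed.
        "pass^3": 100 * sum(all(o) for o in trials.values()) / n_tasks,
    }


# Hypothetical task ids and outcomes, for illustration only.
example = {
    "pa-01": [True, True, True],
    "pa-02": [True, False, False],
    "pa-03": [False, False, False],
    "pa-04": [False, True, True],
}
metrics = pass_metrics(example)
print(metrics)
```

Because pass^3 only counts tasks that pass every trial, it can never exceed pass@1, and pass@1 can never exceed pass@3.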

Persisted record: what the agent leaves behind
  • Per-stage states
  • Side-effect artifacts
  • Workspace documents
  • Event log
  • Conversation transcripts
Layer 1: Deterministic contract (code-checkable assertions)
Layer 2: LLM judge (Opus 4.7, majority of 3 votes)
Trial passes: R = contract ∧ judge
Deterministic checks include
  • Terminal status reached
  • Stage payload values
  • Required event log entries
  • Document field values
  • Cross-app side effects
LLM judge rubrics include
  • Policy alignment
  • Reasoning soundness
  • Internal coherence
  • Patient engagement
  • Autonomy-first outreach
Figure 2. Verification pipeline. Every trial emits a persisted record (world store, event log, transcripts). A deterministic contract and an LLM judge grade in parallel; the trial passes only when both layers pass.
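As a rough sketch of how the two layers could combine into R = contract ∧ judge: the class and function names below are hypothetical, the contract is assumed to pass only when every deterministic check holds, and the judge is simplified to a single majority over three votes (the real verifier may vote per rubric item).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrialRecord:
    contract_checks: List[bool]  # one outcome per deterministic assertion
    judge_votes: List[bool]      # independent LLM-judge verdicts (three here)


def judge_passes(votes: List[bool]) -> bool:
    """Majority vote over an odd number of judge verdicts."""
    if len(votes) % 2 == 0:
        raise ValueError("use an odd number of votes to avoid ties")
    return sum(votes) > len(votes) // 2


def trial_reward(record: TrialRecord) -> bool:
    """R = contract AND judge: the trial passes only if both layers pass."""
    contract_ok = all(record.contract_checks)  # assumption: every check must hold
    return contract_ok and judge_passes(record.judge_votes)


# Hypothetical trials: a clean pass, a contract failure, and a judge failure.
clean = TrialRecord([True, True, True], [True, True, False])
bad_contract = TrialRecord([True, False, True], [True, True, True])
bad_judge = TrialRecord([True, True, True], [True, False, False])
print(trial_reward(clean), trial_reward(bad_contract), trial_reward(bad_judge))
```

The conjunction means a well-argued rationale cannot rescue a wrong terminal state, and a correct terminal state cannot rescue an ungrounded rationale.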

Headline results

We ran 30 harness × model combinations: every frontier proprietary stack (Claude Code, OpenAI Codex, Gemini CLI) paired with that lab's closed-weight models, plus four open-source frameworks (OpenClaw, Hermes, OpenAI Agents SDK, DeepAgents) sweeping five open-weight models, with OpenClaw additionally running Claude Opus 4.7. Best in class overall is Claude Code + Opus 4.6 at 28.0% pass@1. The per-domain leaders are three different models from two labs: Codex + GPT-5.5 leads PA at 29.3%, Claude Code + Opus 4.6 leads UM at 41.3%, and Claude Code + Opus 4.7 leads CM at 32.0%. There is no single “best agent” for healthcare workflows yet.

Reliability is the bigger problem. No agent clears 20% on overall pass^3, the metric where the same agent has to pass the same task in all three trials; the best is Claude Code + Opus 4.6 at 18.7%. Agents that win one trial often fail the next on the same case. Two stress tests collapse the headline number further:

The marathon. Load all 25 tasks of a domain into a single agent session and ask it to finish them in any order. The best agents finish at 3.8% overall. On PA, neither leading agent submits a single authorization across 25 queued cases despite touching most of them. Long context doesn't save you: Opus 4.7 (1M-token context, no compaction) and GPT-5.5 (auto-compacts 4–6× per session) fail in roughly the same shape.

The end-to-end arena. Wire two Codex + GPT-5.5 agents together (one as the provider, one as the payer) and let them exchange information only through MCP tools. The PA configuration that scores 30.4% solo collapses to 0% when the payer agent and cross-role checks join: 18 of 23 cases never reach an MD decision, and on the five PA cases where policy requires a peer-to-peer call, neither side raises one.

Figure 3 legend: OpenAI · Codex, Anthropic · Claude Code, Google · Gemini CLI
  • Prior Authorization, best: 29.3% · Codex + GPT-5.5
  • Utilization Management, best: 41.3% · Claude Code + Opus 4.6
  • Care Management, best: 32.0% · Claude Code + Opus 4.7
Figure 3. pass@1 across the three CHI-Bench environments for the nine proprietary harness × model configurations. The per-domain leaders are three different models from two labs (OpenAI and Anthropic); there is no single “best” agent for healthcare workflows yet.

The full leaderboard

All 30 harness × model configurations. Each pass cell is shaded by its value (deeper pink = higher pass rate); the per-column maximum is ringed in pink. Efficiency columns report the per-trial averages over all 225 trials in that row.

PA = Prior Authorization, UM = Utilization Management, CM = Care Management.

                                | Overall (75 tasks)  | PA (25 tasks)       | UM (25 tasks)       | CM (25 tasks)       | Efficiency (per trial)
Harness      Model              |   p@1    p@3    p^3 |   p@1    p@3    p^3 |   p@1    p@3    p^3 |   p@1    p@3    p^3 | Steps    Cost

Proprietary stack: frontier first-party CLI + closed-weight models
Codex        GPT-5.5            |  20.9   30.7    9.3 |  29.3   40.0   16.0 |  32.0   48.0   12.0 |   1.3    4.0    0.0 |    54   $1.29
Codex        GPT-5.4            |  16.0   25.3    8.0 |  24.0   32.0   16.0 |  17.3   24.0    8.0 |   6.7   20.0    0.0 |    58   $1.30
Codex        GPT-5.4 Mini       |   8.4   20.0    0.0 |  10.7   24.0    0.0 |  13.3   32.0    0.0 |   1.3    4.0    0.0 |    58   $0.27
Claude Code  Claude Opus 4.7    |  24.4   41.3   10.7 |  24.0   32.0   16.0 |  17.3   28.0    8.0 |  32.0   64.0    8.0 |    68   $9.91
Claude Code  Claude Opus 4.6    |  28.0   38.7   18.7 |  18.7   24.0   12.0 |  41.3   44.0   40.0 |  24.0   48.0    4.0 |    76   $6.47
Claude Code  Claude Sonnet 4.6  |  26.2   41.3   12.0 |  24.0   28.0   20.0 |  34.7   52.0   16.0 |  20.0   44.0    0.0 |    82   $1.30
Claude Code  Claude Haiku 4.5   |   6.2   10.7    2.7 |   0.0    0.0    0.0 |  14.7   24.0    8.0 |   4.0    8.0    0.0 |    41   $0.16
Gemini CLI   Gemini 3.1 Pro     |   7.1   13.3    1.3 |  14.7   24.0    4.0 |   6.7   16.0    0.0 |   0.0    0.0    0.0 |    82   $2.11
Gemini CLI   Gemini 3 Flash     |  12.5   17.3    8.0 |  18.7   28.0    8.0 |  18.7   24.0   16.0 |   0.0    0.0    0.0 |   142   $0.33

Open-source stack: open frameworks + open-weight models
OpenClaw     Claude Opus 4.7    |  17.3   37.3    4.0 |  18.7   28.0    8.0 |  13.3   32.0    4.0 |  20.0   52.0    0.0 |    41  $11.48
OpenClaw     Kimi K2.6          |  10.2   18.7    2.7 |  12.0   20.0    4.0 |  18.7   36.0    4.0 |   0.0    0.0    0.0 |    72   $0.91
OpenClaw     DeepSeek V4 Pro    |  11.1   24.0    1.3 |  14.7   28.0    4.0 |  12.0   28.0    0.0 |   6.7   16.0    0.0 |    42   $0.53
OpenClaw     GLM-5.1            |  16.9   30.7    6.7 |  13.3   24.0    4.0 |  26.7   36.0   16.0 |  10.7   32.0    0.0 |   116   $0.96
OpenClaw     Qwen 3.6 Max       |   4.9   10.7    0.0 |  10.7   24.0    0.0 |   4.0    8.0    0.0 |   0.0    0.0    0.0 |    79   $2.80
OpenClaw     Grok 4.3           |   0.4    1.3    0.0 |   1.3    4.0    0.0 |   0.0    0.0    0.0 |   0.0    0.0    0.0 |    65   $2.66
OAI Agents   Kimi K2.6          |  15.1   22.7    8.0 |  17.3   28.0   12.0 |  25.3   36.0   12.0 |   2.7    4.0    0.0 |    60   $0.43
OAI Agents   DeepSeek V4 Pro    |  14.2   22.7    9.3 |  10.7   16.0    8.0 |  28.0   40.0   20.0 |   4.0   12.0    0.0 |    52   $0.25
OAI Agents   GLM-5.1            |  18.7   26.7   12.0 |  18.7   24.0   12.0 |  33.3   44.0   24.0 |   4.0   12.0    0.0 |    58   $0.27
OAI Agents   Qwen 3.6 Max       |  15.6   22.7    9.3 |  16.0   20.0   12.0 |  26.7   36.0   16.0 |   4.0   12.0    0.0 |    48   $0.58
OAI Agents   Grok 4.3           |   5.8   10.7    1.3 |   0.0    0.0    0.0 |  16.0   28.0    4.0 |   1.3    4.0    0.0 |    32   $1.54
Hermes       Kimi K2.6          |  15.6   24.0    6.7 |  18.7   24.0   12.0 |  21.3   36.0    8.0 |   6.7   12.0    0.0 |    31   $1.07
Hermes       DeepSeek V4 Pro    |  13.8   22.7    8.0 |   8.0   16.0    4.0 |  25.3   32.0   20.0 |   8.0   20.0    0.0 |    26   $2.19
Hermes       GLM-5.1            |  18.7   28.0   10.7 |  10.7   16.0    8.0 |  34.7   44.0   24.0 |  10.7   24.0    0.0 |    30   $1.04
Hermes       Qwen 3.6 Max       |  16.4   28.0    5.3 |   9.3   16.0    4.0 |  26.7   36.0   12.0 |  13.3   32.0    0.0 |    29   $4.12
Hermes       Grok 4.3           |   4.4    8.0    1.3 |   0.0    0.0    0.0 |  13.3   24.0    4.0 |   0.0    0.0    0.0 |    32   $1.05
DeepAgents   Kimi K2.6          |   3.1    8.0    0.0 |   8.0   20.0    0.0 |   1.3    4.0    0.0 |   0.0    0.0    0.0 |    39   $0.55
DeepAgents   DeepSeek V4 Pro    |  10.7   18.7    2.7 |  14.7   24.0    4.0 |  10.7   20.0    4.0 |   6.7   12.0    0.0 |    15   $0.21
DeepAgents   GLM-5.1            |  11.1   17.3    5.3 |  17.3   24.0   12.0 |  10.7   16.0    4.0 |   5.3   12.0    0.0 |    21   $0.26
DeepAgents   Qwen 3.6 Max       |   9.3   16.0    4.0 |  12.0   16.0    8.0 |  10.7   16.0    4.0 |   5.3   16.0    0.0 |    18   $0.57
DeepAgents   Grok 4.3           |   2.2    5.3    0.0 |   0.0    0.0    0.0 |   5.3   12.0    0.0 |   1.3    4.0    0.0 |    21   $1.43

p@1 = pass@1, p@3 = pass@3 (any of 3 trials), p^3 = pass^3 (all 3 trials). Cost is the mean per-trial USD spend at list pricing.
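Since each domain contributes the same 25 tasks, the Overall columns appear to be the plain average of the three domain columns. The small check below illustrates that arithmetic on three leaderboard rows; the helper name is hypothetical and the equal-weight relationship is inferred from the numbers rather than stated by the authors.

```python
from typing import Sequence


def overall_from_domains(domain_rates: Sequence[float]) -> float:
    """Average equally sized domains and round to one decimal, as in the table."""
    return round(sum(domain_rates) / len(domain_rates), 1)


# (PA, UM, CM) pass@1 values and the reported Overall pass@1, copied from the leaderboard.
rows = {
    "Codex + GPT-5.5": ((29.3, 32.0, 1.3), 20.9),
    "Claude Code + Opus 4.6": ((18.7, 41.3, 24.0), 28.0),
    "OAI Agents + GLM-5.1": ((18.7, 33.3, 4.0), 18.7),
}
for name, (domains, reported) in rows.items():
    print(f"{name}: recomputed {overall_from_domains(domains)}, reported {reported}")
```

This also explains how a configuration can lead overall without leading most domains: Claude Code + Opus 4.6 tops only UM, but its 41.3% there outweighs its third-place PA score.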

Cost vs accuracy

Spending more does not buy reliability in healthcare workflows. Plot every harness × model configuration by mean per-trial spend and pass@1 and the field falls into four quadrants. The Sweet Spot (cheap and accurate) is sparsely populated and dominated by a handful of frugal open-weight pairings. The Premium tier (Claude Code paired with Opus 4.6 or Sonnet 4.6, and Codex with GPT-5.5) buys the top of the leaderboard but at 4–50× the trial cost of the frontier's budget end. Everything to the right of the $1 line and below the 13% bar is Overpriced, for example OpenClaw + Qwen 3.6 Max at $2.80 per trial for 4.9% pass@1. Clearing the bar does not guarantee good value either: Claude Opus 4.7 on OpenClaw spends $11.48 per trial to land at 17.3% pass@1.
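The quadrant split can be written as a two-threshold classifier. This is a sketch under assumptions: the function name is hypothetical, the Budget label is taken to mean the remaining cheap-but-below-bar quadrant, and points exactly on a dashed line are assigned arbitrarily (here, $1.00 counts as not cheap and 13.0% counts as accurate).

```python
def quadrant(cost_usd: float, pass_at_1: float,
             cost_line: float = 1.0, accuracy_bar: float = 13.0) -> str:
    """Classify a configuration by mean per-trial cost (USD) and pass@1 (%)."""
    cheap = cost_usd < cost_line          # left of the $1 line
    accurate = pass_at_1 >= accuracy_bar  # on or above the 13% bar
    if cheap and accurate:
        return "Sweet Spot"
    if accurate:
        return "Premium"
    if cheap:
        return "Budget"
    return "Overpriced"


# (cost, overall pass@1) pairs copied from the leaderboard.
examples = {
    "OAI Agents + GLM-5.1": (0.27, 18.7),
    "Claude Code + Opus 4.6": (6.47, 28.0),
    "Claude Code + Haiku 4.5": (0.16, 6.2),
    "Gemini CLI + Gemini 3.1 Pro": (2.11, 7.1),
}
for name, (cost, acc) in examples.items():
    print(f"{name}: {quadrant(cost, acc)}")
```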

The connecting line traces the cost-accuracy Pareto frontier: no other configuration is both cheaper and more accurate than the points on it. Seven configurations sit on the frontier: Haiku 4.5, two DeepSeek V4 Pro setups, GLM-5.1 (OAI Agents), GPT-5.5 (Codex), Sonnet 4.6, and Opus 4.6. Together they span a roughly 40× range in spend ($0.16 to $6.47 per trial) for a 22-point accuracy spread.

Figure 4 legend
Harness: Codex, Claude Code, Gemini CLI, OpenClaw, OAI Agents, Hermes, DeepAgents
Model: GPT-5.5, GPT-5.4, GPT-5.4 Mini, Claude Opus 4.7, Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, Gemini 3.1 Pro, Gemini 3 Flash, DeepSeek V4 Pro, GLM-5.1, Kimi K2.6, Qwen 3.6 Max, Grok 4.3
Pareto frontier (7 configurations)
  • Haiku 4.5 (Claude Code): $0.16 · 6.2%
  • DS V4 Pro (DeepAgents): $0.21 · 10.7%
  • DS V4 Pro (OAI Agents): $0.25 · 14.2%
  • GLM-5.1 (OAI Agents): $0.27 · 18.7%
  • GPT-5.5 (Codex): $1.29 · 20.9%
  • Sonnet 4.6 (Claude Code): $1.30 · 26.2%
  • Opus 4.6 (Claude Code): $6.47 · 28.0%
Figure 4. Cost-accuracy ROI quadrants for all 30 harness × model configurations. Dashed lines split the plane at $1 per trial and 13% pass@1, defining the Sweet Spot, Premium, Budget, and Overpriced regions. The solid line traces the Pareto frontier — seven configurations where no other point is both cheaper and more accurate. Shape encodes harness; color encodes model family.
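The frontier can be recomputed from the leaderboard's Cost and overall pass@1 columns with a single sorted sweep. The sketch below is one standard way to do it, not necessarily how the figure was produced; the function name and tie-breaking rule (a configuration must be strictly more accurate than every cheaper or equally priced one) are assumptions.

```python
from typing import List, Tuple

Config = Tuple[str, float, float]  # (label, mean USD per trial, overall pass@1 in %)

CONFIGS: List[Config] = [
    ("Codex + GPT-5.5", 1.29, 20.9),
    ("Codex + GPT-5.4", 1.30, 16.0),
    ("Codex + GPT-5.4 Mini", 0.27, 8.4),
    ("Claude Code + Opus 4.7", 9.91, 24.4),
    ("Claude Code + Opus 4.6", 6.47, 28.0),
    ("Claude Code + Sonnet 4.6", 1.30, 26.2),
    ("Claude Code + Haiku 4.5", 0.16, 6.2),
    ("Gemini CLI + Gemini 3.1 Pro", 2.11, 7.1),
    ("Gemini CLI + Gemini 3 Flash", 0.33, 12.5),
    ("OpenClaw + Opus 4.7", 11.48, 17.3),
    ("OpenClaw + Kimi K2.6", 0.91, 10.2),
    ("OpenClaw + DeepSeek V4 Pro", 0.53, 11.1),
    ("OpenClaw + GLM-5.1", 0.96, 16.9),
    ("OpenClaw + Qwen 3.6 Max", 2.80, 4.9),
    ("OpenClaw + Grok 4.3", 2.66, 0.4),
    ("OAI Agents + Kimi K2.6", 0.43, 15.1),
    ("OAI Agents + DeepSeek V4 Pro", 0.25, 14.2),
    ("OAI Agents + GLM-5.1", 0.27, 18.7),
    ("OAI Agents + Qwen 3.6 Max", 0.58, 15.6),
    ("OAI Agents + Grok 4.3", 1.54, 5.8),
    ("Hermes + Kimi K2.6", 1.07, 15.6),
    ("Hermes + DeepSeek V4 Pro", 2.19, 13.8),
    ("Hermes + GLM-5.1", 1.04, 18.7),
    ("Hermes + Qwen 3.6 Max", 4.12, 16.4),
    ("Hermes + Grok 4.3", 1.05, 4.4),
    ("DeepAgents + Kimi K2.6", 0.55, 3.1),
    ("DeepAgents + DeepSeek V4 Pro", 0.21, 10.7),
    ("DeepAgents + GLM-5.1", 0.26, 11.1),
    ("DeepAgents + Qwen 3.6 Max", 0.57, 9.3),
    ("DeepAgents + Grok 4.3", 1.43, 2.2),
]


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Keep configurations that beat the accuracy of everything at most as expensive."""
    frontier: List[Config] = []
    best_accuracy = float("-inf")
    # Sort by cost; at equal cost, visit the more accurate configuration first.
    for label, cost, accuracy in sorted(configs, key=lambda c: (c[1], -c[2])):
        if accuracy > best_accuracy:
            frontier.append((label, cost, accuracy))
            best_accuracy = accuracy
    return frontier


frontier = pareto_frontier(CONFIGS)
for label, cost, accuracy in frontier:
    print(f"{label}: ${cost:.2f} · {accuracy}%")
```

Under these tie-breaking rules the sweep returns the same seven configurations listed in the figure legend.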

Where agents break down

A 28% vs. 21% leaderboard gap suggests the leading agents are doing roughly the same thing, only at different success rates. They aren't. We pulled the per-check, per-tool-call data from 225 trials each for the two leading configurations from different labs — Claude Code + Opus 4.6 (28.0% pass@1) and Codex + GPT-5.5 (20.9% pass@1) — and the failure shapes are wildly different.

Claude is the careful process-follower that submits packets it shouldn't. It nails every mandatory peer-to-peer (12 of 12) and writes care plans with 100% goal-structure compliance, but passes 0 of 33 trials on the “gather more evidence and put the case on hold” tasks because it always submits. Codex is the fast shortcutter: 1.5–2× faster per trial, better at form-filling, but 88% of its outreach calls fail the patient-conversation rubric, and it racked up 122 consecutive retries on a single UM trial when it couldn't format a tool call correctly.

Each agent is good at the things the other one is bad at. Both fail in ways that would cause actual harm in production. Below: side-by-side strength/weakness profiles, a per-check breakdown across all 15 rubric checks, and six diagnostics that explain the most consistent gaps.

Claude Code + Opus 4.6
The thorough process-follower
28.0% pass · 0.834 fractional · 96% consistency
Strengths
  • UM · Procedural fidelity: never skips mandatory workflow steps (P2P: 12/12 perfect)
  • CM · Conversational empathy: 73% outreach quality pass; adaptive rapport-building
  • UM · Role-switching: correct authority boundaries across 5 UM roles
  • CM · Care plan structure: 100% goal structure compliance, 89% intervention structure
  • PA · Evidence gathering: 3.7 supporting documents per PA trial vs 1.0 for Codex
  • Deterministic: same outcome on 96% of tasks across independent trials
Weaknesses
  • PA · Completeness bias: 0/33 on "gather more evidence" tasks; always submits
  • PA · Creates submission packets when policy requires stopping (96% failure)
  • Slow execution: 1.5–2× longer than Codex; 68 avg steps vs 46
  • CM · CM assessment quality judging: 64% failure on structured assessment checks

Codex + GPT-5.5
The efficient shortcutter
20.9% pass · 0.805 fractional · 64% consistency
Strengths
  • PA · Documentation gap detection: 6/33 on hold tasks vs Claude's 0/33
  • PA · Intake efficiency: 40% pass on structured form-filling tasks
  • PA · Execution speed: 3.5 min avg PA trial vs Claude's 6.6 min
  • CM · Stage coherence: 63% pass on CM workflow staging vs Claude's 37%
Weaknesses
  • UM · Skips mandatory steps: bypasses P2P when clinical answer seems "obvious"
  • CM · Catastrophic outreach failure: 88% fail on patient conversation quality
  • UM · Retry storms: 122 consecutive retries with 163 schema validation errors
  • CM · Care plan deficiency: 65% fail on goal structure, 55% on interventions
  • UM · Hard-coded triage routing bug breaks gold-card auto-approvals
  • Non-deterministic: different outcomes on 36% of tasks across trials

Per-check capability breakdown

Three views over the same 225-trial-per-agent run: failure rate by check, a capability radar across nine skills, and behavioral diagnostics for retries, schema errors, and output volume.

[Chart: per-check failure rate for Claude Code · Opus 4.6 and Codex · GPT-5.5. Higher bars = more trials failing that check.]

Key diagnostics

The Completeness Trap

44% of PA Provider tasks require stopping mid-workflow. Claude submitted all 33 hold-required trials, noting the missing documents in its reasoning but executing the submission anyway. This is not a knowledge gap; it is an execution bias toward completion.

The Schema Mismatch

Codex enters retry storms on structured tool calls — 122 consecutive retries with 163 schema validation errors on a single UM trial. Same malformed payload repeated without adaptation, consuming 40%+ of execution time.
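A retry storm of this kind can be spotted in a tool-call event trail by looking for the longest run of identical failed calls. The sketch below assumes a simplified, hypothetical event shape (`ToolCall` with `tool`, `payload`, and `ok` fields) and a made-up tool name; the benchmark's actual event log format is not described here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class ToolCall:
    tool: str      # tool name (hypothetical)
    payload: str   # serialized arguments
    ok: bool       # False when the call was rejected, e.g. by schema validation


def longest_identical_failure_run(events: Iterable[ToolCall]) -> int:
    """Length of the longest streak of back-to-back failures with the same tool and payload."""
    longest = current = 0
    previous: Optional[Tuple[str, str]] = None  # last failed (tool, payload), if any
    for event in events:
        if event.ok:
            current, previous = 0, None          # a success ends any streak
        elif previous == (event.tool, event.payload):
            current += 1                         # same malformed call repeated
        else:
            current = 1                          # a new failing call starts a streak
            previous = (event.tool, event.payload)
        longest = max(longest, current)
    return longest


# Hypothetical trail: three identical rejections, one success, then two different failures.
trail = [
    ToolCall("submit_review", '{"status": 7}', ok=False),
    ToolCall("submit_review", '{"status": 7}', ok=False),
    ToolCall("submit_review", '{"status": 7}', ok=False),
    ToolCall("get_case", '{"id": "c-1"}', ok=True),
    ToolCall("submit_review", '{"status": 7}', ok=False),
    ToolCall("submit_review", '{"status": "approved"}', ok=False),
]
print(longest_identical_failure_run(trail))
```

A harness could use a signal like this to interrupt the agent once the same rejected payload has been resent a few times, instead of letting the streak consume the session.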

Process Over Certainty

When Codex determines a case is clear-cut for denial, it skips the mandatory peer-to-peer review. Claude never skips it. Healthcare workflows encode safety margins in mandatory process — shortcuts defeat the safeguards.

The Empathy Gap

88% of Codex's patient outreach conversations fail quality review. Claude averages 890 chars/message with adaptive emotional validation; Codex averages 676 chars with clinical efficiency. Trust-building requires warmth, not just information transfer.

MD Review: Nobody's Home

Both agents achieve just 8% (1/12) on physician-level review tasks. Synthesizing full clinical evidence, applying specialty criteria, and documenting legally binding rationale remains a genuine capability frontier.

The 72% Ceiling

Both models reach ~72–74% accuracy on correct approve/deny determinations. The performance gap between agents is driven by procedural compliance — following the right steps — not by clinical reasoning quality.

Cite this

If you reference CHI-Bench in a paper, post, or eval, here's the BibTeX. The data, code, and all 75 task definitions are open-sourced at the GitHub repo linked above.

@misc{chen2026chibenchaiagentsautomate,
      title={CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?},
      author={Haolin Chen and Deon Metelski and Leon Qi and Tao Xia and Joonyul Lee and Steve Brown and Kevin Riley and Frank Wang and T. Y. Alvin Liu and Hank Capps and Zeyu Tang and Xiangchen Song and Lingjing Kong and Fan Feng and Tianyi Zeng and Zhiwei Liu and Zixian Ma and Hang Jiang and Fangli Geng and Yuan Yuan and Chenyu You and Qingsong Wen and Hua Wei and Yanjie Fu and Yue Zhao and Carl Yang and Biwei Huang and Kun Zhang and Caiming Xiong and Sanmi Koyejo and Eric P. Xing and Philip S. Yu and Weiran Yao},
      year={2026},
      eprint={2605.16679},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.16679},
}