Why Healthcare AI Needs a Better Benchmark
The numbers are out, and they are staggering. U.S. healthcare spending has surpassed $5.3 trillion, accounting for 18% of GDP. But here is the kicker: roughly one in five dollars never actually reaches a patient. Instead, it is swallowed whole by a $1 trillion administrative machinery of billing, credentialing, and the infamous prior authorization (PA) process. While AI agents are being pitched as the ultimate savior for healthcare’s back office, a massive gap remains between tech-vendor promises and real-world execution. Here is a summary of where healthcare administration stands, why current AI solutions are stalling, and how the industry is trying to fix its measurement problem.
By Weiran Yao
The United States spends more than $1 trillion every year on healthcare administration. Not care. Not drugs. Not devices. Administration. Billing departments, prior authorization queues, claims clerks, compliance reviewers — an entire shadow economy layered on top of medicine. The question healthcare technology has struggled to answer honestly is: where exactly does AI actually help, and how do you prove it?
This article maps the burden by the numbers, traces where current AI pilots stall, and explains why measurement — not more models — is the missing piece.
Prior Auth: Where Physicians Spend Their Days
Prior authorization has become the defining friction point in US healthcare delivery. The AMA's 2023 survey of more than 1,000 physicians found that 94% reported that prior authorization delayed access to necessary care for their patients, and 33% said the delays led patients to abandon treatment entirely. [5] Practices complete an average of 45 prior authorization requests per physician per week, consuming nearly two full business days of staff time. [5]
KFF's analysis of Medicare Advantage found that 1 in 7 prior authorization requests was denied on initial submission, with denial rates varying widely by plan and some exceeding 35%. [6] Premier Healthcare research estimates that 75% of denied claims that are appealed are ultimately overturned, which suggests that most of those initial denials were not justified. Providers nonetheless incur the full cost of the appeals process. [7]
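To put the survey figure in per-request terms, here is a back-of-envelope sketch. It assumes an 8-hour business day and treats "nearly two full business days" as exactly two; both are illustrative assumptions rather than figures from the AMA survey, and the function name is hypothetical.

```python
def minutes_per_request(requests_per_week: float,
                        staff_days_per_week: float,
                        hours_per_day: float = 8.0) -> float:
    """Average staff minutes spent on each prior authorization request."""
    if requests_per_week <= 0:
        raise ValueError("requests_per_week must be positive")
    total_minutes = staff_days_per_week * hours_per_day * 60
    return total_minutes / requests_per_week


# 45 requests per physician per week is the survey figure cited above.
# Two 8-hour days is an assumed upper bound for "nearly two business days".
print(round(minutes_per_request(45, 2), 1))  # about 21.3 minutes per request
```

Under those assumptions, each request absorbs roughly twenty minutes of staff time, every week, for every physician in the practice.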
Clinical Staff Are Paying the Hidden Tax
The $1 trillion headline figure tends to focus attention on billing and claims. But administrative burden exacts a parallel cost on clinical staff — particularly nurses and care managers — that manifests as burnout, turnover, and reduced patient contact time rather than a line item on the income statement.
A time-and-motion study published in the Annals of Family Medicine found that for every hour physicians spend in direct patient care, they spend nearly two hours on EHR and administrative tasks. [9] For nurses, the picture is similar: only 37% of nursing shift time is spent on direct patient care, with documentation, care coordination paperwork, and administrative follow-up consuming the rest. [10]
Utilization management workflows are a particular offender. A care manager handling concurrent UM reviews for a panel of Medicare Advantage patients may touch the same case three to five times across multiple platforms before a decision is rendered — repeating data entry, chasing clinical notes, and resubmitting forms that could have been automated end-to-end.
The Pilot-to-Production Gap
Every major health system and payer has run AI pilots. Most of them work in the pilot. The failure occurs during the handoff to production, where the complexity of real clinical environments overwhelms models evaluated on curated datasets.
The pattern is consistent across workflow categories: strong pilot performance, sharp falloff at production deployment. Provider–payer handoff workflows — the most administratively costly — remain almost entirely undeployed at scale, because no existing benchmark has adequately measured end-to-end completion across real multi-app, multi-system environments.
Why Existing Benchmarks Don't Predict Production Performance
The healthcare AI evaluation landscape has significant gaps. MedQA, MedMCQA, and similar benchmarks test factual clinical knowledge — useful for gauging general medical reasoning, but orthogonal to whether an agent can actually complete a prior authorization workflow inside Availity or navigate a payer portal to retrieve a remittance advice.
MedHELM, one of the more comprehensive recent evaluation frameworks, reports task-completion scores in the 0.53–0.63 range for clinical NLP tasks, but it evaluates language models, not agents executing multi-step administrative workflows. [11] A 2024 JMIR Medical Informatics review found zero FDA-cleared agentic AI systems for payer operations workflows, reflecting how early the field remains in translating benchmark performance into regulated deployment. [12]
Task-level question answering is not a proxy for end-to-end workflow completion. An agent that scores 0.85 on MedQA may fail when asked to submit a prior authorization through a live portal, extract a denial reason from an EOB, and escalate to a peer-to-peer review — a sequence any experienced UM coordinator handles dozens of times per week.
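One way to see why a strong single-task score does not carry over to a workflow is to look at how per-step reliability compounds. The sketch below is illustrative only. It assumes each step succeeds independently and reuses 0.85 as a hypothetical per-step success rate; a MedQA score is not a per-step success rate, and real workflow steps are rarely independent.

```python
from math import prod
from typing import Iterable


def end_to_end_success(step_success_rates: Iterable[float]) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    rates = list(step_success_rates)
    if not rates:
        raise ValueError("at least one step is required")
    if any(not 0.0 <= r <= 1.0 for r in rates):
        raise ValueError("each rate must be between 0 and 1")
    return prod(rates)


# Hypothetical three-step sequence: submit the request, extract the
# denial reason from the EOB, escalate to peer-to-peer review.
three_step = end_to_end_success([0.85, 0.85, 0.85])
print(round(three_step, 3))  # 0.614
```

Even under these generous assumptions, an agent that is right 85% of the time at each step finishes the three-step sequence only about 61% of the time, and longer workflows fall off faster.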
χ-Bench: Measuring What Actually Matters
actAVA built χ-Bench to fill the gap that all existing healthcare AI benchmarks leave open: end-to-end agentic task completion across real payer operations workflows. Rather than testing what a model knows, χ-Bench tests what an agent can accomplish — inside real applications, navigating real UI, completing tasks a human UM coordinator would actually be assigned.
χ-Bench evaluates 30 distinct agent configurations across 75 tasks, using a 1,279-document clinical handbook and more than 200 MCP tools that mirror real operational environments. The best pass@1 score is 28%. That is not a failure; it is a calibration. It tells the industry where AI agents actually are, rather than where vendor demos suggest they are.
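For readers unfamiliar with the metric, pass@1 is the share of tasks an agent completes on a single attempt. The sketch below uses the general pass@k estimator that is common in code-generation evaluation. Whether χ-Bench computes its scores exactly this way is not stated here, so treat it as an assumption; the function names are hypothetical and the per-task counts are made up so that the result lands on 28%.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n attempts with c successes."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: any k attempts must include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: dict[str, tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over tasks; results maps task id -> (attempts, successes)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results.values()) / len(results)


# Illustrative data only: 75 tasks, one attempt each, 21 completed.
results = {f"task_{i}": (1, 1 if i < 21 else 0) for i in range(75)}
print(f"{benchmark_pass_at_k(results, k=1):.2f}")  # 0.28
```

With one attempt per task, pass@1 reduces to the plain completion rate, which is why a 28% score reads directly as "roughly one task in four finished end to end."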
The near-zero rate on end-to-end provider–payer handoff tasks is the finding that matters most for the $1 trillion administrative burden story. The workflows that cost the most — multi-system prior authorization spanning the provider EHR, the payer portal, and clinical review — are precisely the ones where current agents are not yet performing well. That is the honest picture. It is also the roadmap.
χ-Bench is open-sourced under Apache 2.0. Benchmark results, methodology, and the full task suite are available at actava.ai/benchmarks and arXiv 2605.16679.
Sources
1. Tseng P, Kaplan RS, Richman BD, et al. "Administrative Costs Associated With Physician Billing and Insurance-Related Activities at an Academic Health Care System." JAMA. 2018;319(7):691–697. jamanetwork.com. Updated estimates in 2022–2023 literature place the total at over $1 trillion annually.
2. Dieleman JL, et al. "US Health Care Spending by Payer and Health Condition, 1996–2016." JAMA. 2020;323(9):863–884. Payer operations overhead analysis. jamanetwork.com
3. Shrank WH, Rogstad TL, Parekh N. "Waste in the US Health Care System: Estimated Costs and Potential for Savings." JAMA. 2019;322(15):1501–1509. jamanetwork.com
4. CAQH. 2023 CAQH Index: Conducting Electronic Business Transactions. Council for Affordable Quality Healthcare. caqh.org. Manual PA transaction cost $12–$40; electronic $3–$4.
5. American Medical Association. 2023 AMA Prior Authorization Physician Survey. ama-assn.org. 94% delay care; 33% lead to abandonment; 45 requests/physician/week.
6. KFF. "Medicare Advantage Prior Authorization and Access to Care." 2023. kff.org. 1 in 7 requests denied on initial submission.
7. Premier Inc. "Addressing the Administrative Burden: How Automation Can Help." Premier Healthcare Research. premierinc.com. 75% of appealed denials overturned.
8. Centers for Medicare & Medicaid Services. CMS-0057-F: Interoperability and Prior Authorization Final Rule. Effective January 1, 2026. cms.gov
9. Arndt BG, et al. "Tethered to the EHR: Primary Care Physician Workload Assessment Using EHR Event Log Data and Time-Motion Observations." Ann Fam Med. 2017;15(5):419–426. annfammed.org. 2 hours admin per 1 hour patient care.
10. Hendrich A, et al. "A 36-Hospital Time and Motion Study: How Do Medical-Surgical Nurses Spend Their Time?" Perm J. 2008;12(3):25–34. thepermanentejournal.org. 37% of nursing shift time is on direct patient care.
11. Harrington SG, et al. "MedHELM: A Comprehensive Benchmark for Evaluating Clinical NLP Systems." Stanford Center for Biomedical Informatics Research. 2023. Scores in the 0.53–0.63 range for task-completion on clinical NLP. stanfordmlgroup.github.io
12. Li K, et al. "Regulatory Landscape for AI in Healthcare: Barriers to Clinical Deployment." JMIR Med Inform. 2024;12:e52010. Zero FDA-cleared agentic AI systems for payer operations. medinform.jmir.org
All statistics are sourced from peer-reviewed literature, government data, or named industry research as cited. actAVA χ-Bench results reflect internal evaluation as of May 2025 across 30 agent configurations. For full methodology, see actava.ai/benchmarks and arXiv 2605.16679.

Written by
Weiran Yao
CAIO & Co-Founder


