actAVA Publishes χ-Bench, Exposing Agentic AI Reliability Gap in Healthcare Administration
Frontier AI agents completed only 28% of complex workflows on the first try, raising questions about readiness to automate healthcare’s $200 billion transaction layer.

The best-performing AI agent in our new benchmark completed only 28% of complex healthcare administrative tasks on the first try. No agent exceeded 8% in consistency testing. And in a realistic provider-payer workflow — the kind that governs prior authorization and utilization management for millions of patients — performance dropped to 0% at the handoff.
Those are the opening numbers from χ-Bench, the benchmark we published today. They are not a criticism of frontier models, which are remarkable technology. They measure the gap between what the industry is assuming and what the data actually shows.
A $1 Trillion Problem That Demands an Honest Measurement
U.S. healthcare administrative spending runs to approximately $1 trillion annually, with roughly $200 billion tied specifically to financial transactions — claims processing, payments, patient collections, and prior authorization.[1] That is the layer of the system AI agents are being rushed toward.
The pressure to automate is real. Deloitte research shows that 61% of healthcare organizations are already building or implementing agentic AI, with 85% planning to increase investment over the next two to three years. Vendors are making promises. Executives are approving budgets. Pilots are launching.
What has been missing is a benchmark that honestly tests whether agents can actually do this work — not answer a clinical question, not navigate a demo environment, but complete the real administrative transactions that run between providers and payers every day.
— Kevin Riley, Co-founder & CEO, actAVA
What χ-Bench Actually Tests
Most healthcare AI benchmarks test narrow tasks: a clinical question-answering set, a FHIR API call sequence, a website navigation challenge, or a customer service exchange. These are real skills. They are not the full job.
χ-Bench evaluates end-to-end administrative transactions across three domains: provider prior authorization, payer utilization management, and care management. Each task places an agent inside a high-fidelity simulator of 20 healthcare applications, exposed via 87 MCP tools, and hands it a clinical case it must drive to a terminal status — guided by a 1,290+ document managed-care operations handbook.
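As a rough illustration of what "driving a case to a terminal status" could mean for scoring, the sketch below models one agent attempt as an episode and checks whether it ended in the expected terminal status. Every name here (Episode, TERMINAL_STATUSES, the status strings, the tool names, the case IDs) is a hypothetical assumption made for illustration. It is not χ-Bench's actual schema or grading logic, which the paper describes.

```python
from dataclasses import dataclass, field

# Hypothetical terminal statuses; the benchmark's real status vocabulary may differ.
TERMINAL_STATUSES = {"approved", "denied", "withdrawn"}


@dataclass
class Episode:
    """One agent attempt at one case (illustrative structure only)."""
    case_id: str
    expected_status: str
    tool_calls: list[str] = field(default_factory=list)
    final_status: str = "pending"


def is_success(ep: Episode) -> bool:
    """Pass only if the case reached a terminal status that matches the expected one."""
    return ep.final_status in TERMINAL_STATUSES and ep.final_status == ep.expected_status


# Two made-up episodes: one stalls before a decision, one reaches the expected outcome.
stalled = Episode("case-001", "approved", ["lookup_member", "fetch_policy"], "pending")
finished = Episode(
    "case-002", "denied", ["lookup_member", "fetch_policy", "submit_decision"], "denied"
)

print(is_success(stalled), is_success(finished))
```

Under this assumed scoring rule, an agent that makes plausible tool calls but leaves the case pending gets no credit, which is the difference between answering a question and completing a transaction.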
[Table: What χ-Bench Measures vs. Other Benchmarks]
The benchmark is built around three capabilities that every real administrative workflow demands but no prior benchmark has combined:
Policy Density
Every agent decision must be grounded in policy — medical guidelines, insurance rules, operational procedures that vary across providers and payers and shift over time. Agents must navigate a large policy library, interpret conditions correctly, and adhere to them across long tool-call chains.
Multi-Role Composition
An end-to-end workflow is divided among roles: clinician, coordinator, UM nurse, medical director, RN care manager. An agent must possess the domain knowledge of each, switching context and goals as the case progresses through the system.
Multilateral Interaction
Intermediate workflow steps are multi-turn dialogs — peer-to-peer review, patient outreach, provider escalations. The agent must navigate these exchanges, not just execute a pre-defined tool sequence.
Hidden State & Handoffs
The most revealing test: a realistic provider-payer handoff where the full workflow must transfer across organizational boundaries. This is where performance dropped to 0% for every model-agent combination tested.
The Hard Numbers
The results are not a reason to abandon AI. They are a reason to measure it correctly before deploying it into workflows that affect patient care and organizational finances.
[Table: χ-Bench Results: Agent Performance Across Task Types]
On nearly one-third of tasks, every model-agent combination failed — no combination of frontier model and agent architecture found a path to completion. That is the category the industry needs to talk about honestly before expanding automation into high-stakes workflows.
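To make the difference between first-try success, consistency, and "no combination solved it" concrete, the sketch below computes all three from repeated trial outcomes. It assumes consistency means a task succeeds on every one of k repeated attempts, which is one common way to score it; the paper defines the metric χ-Bench actually uses. The function names, agent names, and sample outcomes are invented for illustration and are not benchmark results.

```python
def first_try_rate(trials: dict[str, list[bool]]) -> float:
    """Fraction of tasks whose first attempt succeeded."""
    return sum(runs[0] for runs in trials.values()) / len(trials)


def all_k_rate(trials: dict[str, list[bool]], k: int) -> float:
    """Fraction of tasks that succeeded on each of the first k attempts
    (an assumed consistency metric, not necessarily the paper's)."""
    passed = sum(len(runs) >= k and all(runs[:k]) for runs in trials.values())
    return passed / len(trials)


def never_solved(results_by_agent: dict[str, dict[str, list[bool]]]) -> set[str]:
    """Tasks on which no agent succeeded in any attempt."""
    task_ids = set().union(*(r.keys() for r in results_by_agent.values()))
    return {
        task
        for task in task_ids
        if not any(any(r.get(task, [])) for r in results_by_agent.values())
    }


# Made-up outcomes: task id -> success flag for each repeated attempt.
agent_1 = {
    "task-a": [True, True, True],
    "task-b": [True, False, True],
    "task-c": [False, False, False],
    "task-d": [False, True, False],
}
agent_2 = {
    "task-a": [False, False, False],
    "task-b": [True, True, True],
    "task-c": [False, False, False],
    "task-d": [False, False, False],
}

print(first_try_rate(agent_1))   # 0.5
print(all_k_rate(agent_1, k=3))  # 0.25
print(never_solved({"agent-1": agent_1, "agent-2": agent_2}))  # {'task-c'}
```

In this toy data the same agent scores 50% on the first try but only 25% when every attempt must succeed, which shows how a headline success rate can overstate how dependable an agent is on repeated runs.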
Reviewed Across Institutions
The research was reviewed by clinical and academic researchers at Stanford, Johns Hopkins Medicine, Wellstar Health System, Carnegie Mellon University, the University of California San Diego, Yale, Salesforce AI Research, the University of Washington, Oxford, Brown, Emory, the University of Southern California, and other institutions.
The multi-institutional review matters because χ-Bench is not a product benchmark — it is a field benchmark. Its purpose is to give every organization in healthcare — providers, payers, healthtech builders, and investors — a shared, rigorous standard for evaluating AI before deploying it into administrative operations.
The Question Has Shifted
The question is no longer whether agentic AI will enter administrative operations; that decision has already been made across the industry. The question is whether organizations can reliably detect when it fails before that failure propagates through a prior authorization, a claims adjudication, or a care management handoff.
Organizations currently deploying AI agents without a formal measurement framework are betting that their vendor's internal evals are sufficient. χ-Bench suggests those evals give an incomplete picture.
χ-Bench shows the hard truth. Frontier models are amazing — no doubt. But they are not yet reliable enough to run the administrative machinery of healthcare. The next era will belong to organizations that can measure failure, contain it, and build agentic infrastructure around AI before it touches mission-critical workflows.
That is exactly what actAVA KORA is built for: not just deploying agents, but governing them — measuring their performance, containing failure, and providing the RED layer that tests and remediates agents before they touch production workflows. χ-Bench is the measuring stick. The platform is the infrastructure that acts on those measurements.
Read the Full Paper
The full χ-Bench paper is available now on arXiv and HuggingFace. Healthcare leaders, researchers, analysts, and journalists can review the benchmark methodology, task design, and full results to understand where frontier agentic AI succeeds, where it fails, and what must be measured before deployment in real administrative workflows.
Full benchmark results and methodology at actava.ai/benchmarks
Sources
[1] Tseng et al. "Active steps to reduce administrative spending associated with financial transactions in US health care." Health Affairs Scholar, 2023. Estimates $1 trillion in annual U.S. healthcare administrative spending; ~$200 billion in financial transaction costs.
[2] Stanford Medicine. "The $1 Trillion Problem AI Still Can't Yet Solve." April 2026. Confirms administrative spending figures and notes the gap between benchmark performance and real-world readiness.
[3] actAVA. "χ-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?" arXiv, May 2026. Primary source for all benchmark results and methodology.
[4] Stanford HAI. "Stanford Develops Real-World Benchmarks for Healthcare AI Agents." Corroborating context on the gap between existing benchmarks and real administrative workflow demands.