Architecture

A single Python package hosts a FastAPI server, three MCP servers, and the workspace judge. Each trial runs in a fresh Docker container that bundles all of the above plus the per-task fixtures.

Trial flow

  1. Harbor — cb experiment run -f <config> shells out to Harbor, which spawns one container per trial via ChiBenchDockerEnvironment (local) or ChiBenchModalEnvironment (Modal sandbox).
  2. Container start — docker/entrypoint.sh reads CHI_BENCH_TASK_ID, wires /opt/chi-bench/tasks/<id>/fixtures to /fixtures, starts the unified server (HTTP + 3 MCP threads on fixed ports), and waits for all four endpoints to accept traffic before exec'ing the agent harness CLI.
  3. Agent — the harness drives the agent against the role-scoped MCP tools. Some steps are multi-turn dialogs (peer-to-peer, patient outreach) that the harness mediates as natural-language conversations.
  4. Verifier — after the agent stops, Harbor invokes WorkspaceJudge on claude-opus-4-7 in the same container; it reads /fixtures/expectations.json (hidden from the agent) and the full workspace, then writes verifier/scorecard.json + verifier/verdicts.json.
  5. Reward — Harbor writes result.json. Trial reward is the AND of rubric verdicts (or a continuous score for Care Management).
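
The reward rule in step 5 can be sketched in a few lines. This is a minimal illustration, not the project's code: the function name trial_reward, the verdict keys, and the flat name-to-boolean mapping are all assumptions, since the schema of verifier/verdicts.json is not described here. The continuous Care Management score is not covered.

```python
from typing import Mapping


def trial_reward(verdicts: Mapping[str, bool]) -> float:
    """Return 1.0 only if every rubric verdict passed, otherwise 0.0."""
    # Assumption: an empty verdict set counts as a failure, not a vacuous pass.
    if not verdicts:
        return 0.0
    return 1.0 if all(verdicts.values()) else 0.0


# Hypothetical verdict names, for illustration only.
example = {"terminal_status_reached": True, "policy_alignment": False}
print(trial_reward(example))  # 0.0
```

A single failed rubric item zeroes the reward, which is what makes the AND rule strict.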

Module layout

Source lives under src/chi_bench/:

  • core/ — domain models (PriorAuthCase, CMOutreachTask, …), state machines, world store.
  • services/ — ~29 HTTP/MCP-backed domain services (chart, coverage, intake, p2p, …).
  • server/ — FastAPI app exposing the services as REST endpoints under /api/....
  • mcp/ — three MCP servers wrapping the services; see mcp/{server,payer_server,cm_server}.py.
  • conversation/ — patient simulator and peer-to-peer session orchestration.
  • experiment/ — Harbor-driven trial runner + agents/ (seven harnesses) + dual_pa_e2e_*.
  • verifier/ — pluggable judge (default WorkspaceJudge), rubric stages, and rejudge runner.

Service ports

Service              Port   Role
FastAPI backend      :8010  HTTP REST surface for all services.
Provider MCP         :8020  Tools scoped to the provider/PA-author role.
Payer MCP            :8100  Tools scoped to the UM nurse / medical director.
Care Management MCP  :8200  Tools scoped to the RN care manager.
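
For local debugging it can help to check that all four ports accept TCP connections, similar in spirit to the wait the entrypoint performs before starting the agent. The sketch below is an assumption-laden illustration and not the logic in docker/entrypoint.sh: the helper names, the 127.0.0.1 host, and the timeout values are hypothetical; only the port numbers come from the table above.

```python
import socket
import time
from typing import Mapping

# Port numbers from the table above; the dictionary itself is illustrative.
PORTS = {
    "FastAPI backend": 8010,
    "Provider MCP": 8020,
    "Payer MCP": 8100,
    "Care Management MCP": 8200,
}


def port_is_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_services(
    host: str = "127.0.0.1",
    ports: Mapping[str, int] = PORTS,
    deadline_s: float = 60.0,
    poll_s: float = 0.5,
) -> bool:
    """Poll until every port accepts connections or the deadline passes."""
    deadline = time.monotonic() + deadline_s
    pending = dict(ports)
    while pending and time.monotonic() < deadline:
        # Keep only the services that still refuse connections.
        pending = {n: p for n, p in pending.items() if not port_is_open(host, p)}
        if pending:
            time.sleep(poll_s)
    return not pending
```

Calling wait_for_services() returns True once all four endpoints are reachable and False on timeout. A successful TCP connect only shows that a listener is bound; it does not prove the service behind it is healthy.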

The verifier (workspace judge)

The verifier is a composite: deterministic rubric checks (file exists, payload field equals X, terminal status reached) combined with a rubric-based LLM judge that scores reasoning soundness, policy alignment, and patient-engagement quality.
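
The deterministic half of the verifier can be pictured as small predicates over the workspace. The sketch below shows two of the check types named above (file exists, payload field equals X). It is illustrative only: the function names, the JSON payload assumption, and the file layout are hypothetical and are not taken from the verifier/ module.

```python
import json
from pathlib import Path
from typing import Any


def file_exists(workspace: Path, relative_path: str) -> bool:
    """Check that an expected artifact is present in the workspace."""
    return (workspace / relative_path).is_file()


def field_equals(workspace: Path, relative_path: str, field: str, expected: Any) -> bool:
    """Check that a top-level field of a JSON payload has the expected value."""
    path = workspace / relative_path
    if not path.is_file():
        return False
    try:
        payload = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        # A malformed payload fails the check instead of raising.
        return False
    return isinstance(payload, dict) and payload.get(field) == expected
```

Checks like these are cheap and reproducible, which leaves the LLM judge to score only the qualities that cannot be tested mechanically.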

  • Judge model: pinned to claude-opus-4-7. You can override it with CHI_BENCH_JUDGE_MODEL, but results will then deviate from the paper protocol.
  • Voting: set CHI_BENCH_JUDGE_NUM_VOTES to a value greater than 1 to enable majority-voted judging.
  • Re-judge: cb verifier rejudge re-scores trials without re-running the agent.
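
Majority-voted judging can be sketched as follows. Only the CHI_BENCH_JUDGE_NUM_VOTES variable name comes from this page; the function names, the default of one vote, and the choice to treat ties as a fail are assumptions, and the real judge may aggregate votes differently.

```python
import os
from collections import Counter
from typing import Sequence


def num_votes(default: int = 1) -> int:
    """Read the vote count from the environment, never going below 1."""
    raw = os.environ.get("CHI_BENCH_JUDGE_NUM_VOTES", str(default))
    return max(1, int(raw))  # raises ValueError on a non-integer value


def majority_verdict(votes: Sequence[bool]) -> bool:
    """Return True when strictly more votes pass than fail."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    # Assumption: a tie counts as a fail, so an odd vote count avoids ambiguity.
    return counts[True] > counts[False]


print(majority_verdict([True, False, True]))  # True
```

An odd number of votes guarantees that a strict majority exists for a binary verdict.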

Key invariants

  • /fixtures is not exposed to the agent — expectations, scoring contracts, and manifests are reserved for the verifier.
  • cb serve starts the payer in agent mode by setting CHI_BENCH_PAYER_MODE=agent if unset.
  • The /opt/chi-bench/tasks/<id>/fixtures directory is mounted read-only.
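
The "set if unset" behavior described for CHI_BENCH_PAYER_MODE matches the semantics of a dictionary setdefault. The sketch below is an illustration of that pattern, not the code behind cb serve; the helper name ensure_payer_mode is hypothetical.

```python
import os
from typing import MutableMapping


def ensure_payer_mode(env: MutableMapping[str, str] = os.environ) -> str:
    """Default the payer mode to "agent" without overriding an existing value."""
    # setdefault writes the key only when it is absent, so an explicit
    # setting made by the caller is preserved.
    return env.setdefault("CHI_BENCH_PAYER_MODE", "agent")
```

Under this pattern a variable that is set to an empty string still counts as set, so it would not be replaced by the default.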

Further reading