Quickstart: run one task

As a smoke test, confirm that everything is wired up by running a single Utilization Management (UM) medical-director-review task.

One trial, one command

uv run cb experiment run \
  --dataset data/prior_auth_um/tasks/pa_t008_t008_o002_p01_mdreview_payer \
  --agent codex --model openai/gpt-5.5

Trial output lands under logs/experiments/.../trial_*/. Two files give you everything you need at a glance:

  • result.json — the verifier reward and agent metadata.
  • verifier/scorecard.json — per-check verdicts (deterministic + LLM judge).
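To inspect those two files without opening them by hand, something like the following may help. It is a minimal sketch, not part of the cb tooling: the helper names latest_trial_dir and load_trial_files are hypothetical, the recursive search under logs/experiments is an assumption about the directory layout, and no JSON field names are assumed (only top-level keys are printed).

```python
from __future__ import annotations

import json
from pathlib import Path


def latest_trial_dir(root: Path) -> Path | None:
    """Return the most recently modified trial_* directory under root, if any."""
    if not root.is_dir():
        return None
    trials = [p for p in root.rglob("trial_*") if p.is_dir()]
    return max(trials, key=lambda p: p.stat().st_mtime, default=None)


def load_trial_files(trial: Path) -> dict[str, object]:
    """Load the two summary files from a trial directory, skipping missing ones."""
    loaded: dict[str, object] = {}
    for rel in ("result.json", "verifier/scorecard.json"):
        path = trial / rel
        if path.is_file():
            loaded[rel] = json.loads(path.read_text(encoding="utf-8"))
    return loaded


if __name__ == "__main__":
    trial = latest_trial_dir(Path("logs/experiments"))
    if trial is None:
        print("no trial_* directory found under logs/experiments")
    else:
        for name, data in load_trial_files(trial).items():
            # Print only top-level keys; the file schema is not assumed here.
            keys = sorted(data) if isinstance(data, dict) else type(data).__name__
            print(f"{trial / name}: {keys}")
```

Run it from the repository root after a trial finishes. The printed keys are a starting point for the scorecard walkthrough referenced below.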

See Read the scorecard for the full field-by-field walkthrough: what binary vs. fractional reward means, the stage structure, the deterministic-check namespaces, and how Care Management's two-axis rubric differs from PA/UM.

What just happened

  1. Harbor spawned a fresh Docker container for the trial.
  2. The container started the unified FastAPI server and three MCP servers, then waited for them to accept traffic.
  3. The agent harness drove the agent against the role-scoped MCP tools.
  4. After the agent stopped, the workspace judge (pinned to claude-opus-4-7) read the workspace, world state, and event trail and wrote the scorecard.
  5. Harbor wrote result.json, using the logical AND of the rubric verdicts as the final reward.
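
The reward rule in step 5 can be illustrated with a few lines of Python. This is only a sketch of the AND semantics, not Harbor's actual code: the function name binary_reward is hypothetical, verdicts are assumed to be plain booleans, and returning 0.0 for an empty rubric is an assumption made here.

```python
from typing import Iterable


def binary_reward(verdicts: Iterable[bool]) -> float:
    """Return 1.0 only if every rubric verdict passed, else 0.0."""
    verdicts = list(verdicts)
    # Assumption: an empty rubric earns no reward; all([]) would otherwise be True.
    if not verdicts:
        return 0.0
    return 1.0 if all(verdicts) else 0.0


print(binary_reward([True, True, True]))   # 1.0
print(binary_reward([True, False, True]))  # 0.0
```

Under this rule a single failed rubric item zeroes the trial, which is why the per-check verdicts in verifier/scorecard.json are the place to look when the reward is 0.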

See Architecture for the full picture of how the pieces fit together.

Trying other agents

Replace --agent and --model with any supported pair:

# Claude Code with Opus 4.6
uv run cb experiment run \
  --dataset data/prior_auth_provider/tasks/pa_t001_t001_o001_p01_new_referral_provider \
  --agent claude-code --model anthropic/claude-opus-4-6

# Open-stack: Hermes harness on GLM-5.1 via OpenRouter
uv run cb experiment run \
  --dataset data/prior_auth_um/tasks/pa_t008_t008_o002_p01_mdreview_payer \
  --agent hermes --model openrouter/z-ai/glm-5.1

The full 30-row matrix lives in configs/experiments/table1_main_matrix.yaml. See Run experiments for paper-table reproduction.
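
To try the three agent/model pairs shown on this page in one go, the commands can be assembled in a small script. The sketch below is a dry run that only prints each command. The RUNS list and build_command helper are hypothetical names, and the flags are limited to the ones used above (--dataset, --agent, --model).

```python
import shlex

# (agent, model, dataset) triples copied from the commands on this page.
RUNS = [
    ("codex", "openai/gpt-5.5",
     "data/prior_auth_um/tasks/pa_t008_t008_o002_p01_mdreview_payer"),
    ("claude-code", "anthropic/claude-opus-4-6",
     "data/prior_auth_provider/tasks/pa_t001_t001_o001_p01_new_referral_provider"),
    ("hermes", "openrouter/z-ai/glm-5.1",
     "data/prior_auth_um/tasks/pa_t008_t008_o002_p01_mdreview_payer"),
]


def build_command(agent: str, model: str, dataset: str) -> list[str]:
    """Build the argument list for one single-task run."""
    return [
        "uv", "run", "cb", "experiment", "run",
        "--dataset", dataset,
        "--agent", agent,
        "--model", model,
    ]


for agent, model, dataset in RUNS:
    # Dry run: print the shell-quoted command instead of executing it.
    print(shlex.join(build_command(agent, model, dataset)))
```

To execute rather than print, each list could be passed to subprocess.run(cmd, check=True). For anything beyond a handful of runs, the matrix config mentioned above is the better entry point.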

Next steps