Read the scorecard

A trial's verifier output explains why it passed or failed. Start with three files under verifier/: reward.json (one number), scorecard.json (the per-check breakdown), and exported_state.json (the world snapshot the verifier scored against).

Where to look first

Every trial under logs/experiments/.../trial_*/ (or logs/submissions/.../<task>__<hash>/) writes the same set of verifier files:

| File | Purpose |
| --- | --- |
| verifier/reward.json | One line: {"reward": 0.0} or {"reward": 1.0}. The single binary number Harbor uses for pass@1. |
| verifier/scorecard.json | Per-check breakdown. Read this to understand why a trial passed or failed. |
| verifier/exported_state.json | Post-run world snapshot the verifier scored against (fixtures, agent-authored artifacts, audit log, conversation transcripts). |
| verifier/verifier_context.json | Target case/patient/order IDs and the task actor. Useful when joining with raw fixtures. |
| result.json (sibling of verifier/) | Harbor's view: agent metadata, verifier_result.rewards.reward, token counts, cost, timing per phase. |
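As a quick way to pull the verdict for one trial, the sketch below reads reward.json and scorecard.json from a trial directory. The helper name load_trial_verdict is hypothetical and not part of chi_bench; it assumes only the file layout and the reward and failed_checks keys described on this page.

```python
import json
from pathlib import Path


def load_trial_verdict(trial_dir):
    """Return (reward, failed_checks) for one trial directory.

    Hypothetical helper: assumes <trial_dir>/verifier/reward.json and
    <trial_dir>/verifier/scorecard.json exist as described above.
    """
    verifier_dir = Path(trial_dir) / "verifier"
    reward = json.loads((verifier_dir / "reward.json").read_text())["reward"]
    scorecard = json.loads((verifier_dir / "scorecard.json").read_text())
    # failed_checks is the shortest pointer to what went wrong.
    return reward, scorecard.get("failed_checks", [])
```

Call it with the path to a single trial_* directory; an empty failed_checks list alongside a reward of 0.0 is a hint to look for a degraded judge run.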

scorecard.json shape

{
  "binary_reward":     0.0,
  "fractional_reward": 0.91,
  "passed_checks":     42,
  "total_checks":      46,
  "check_scores":  { "<check_name>": 1.0 | 0.0 | null, ... },
  "checks":        { "<check_name>": true | false | "not_applicable", ... },
  "failed_checks": ["md.signed_off", "judge.md_review:mp1126_sca_05", ...],
  "not_applicable_checks": ["md.denial_rationale_present"],
  "stages": {
    "md_review": {
      "passed": false,
      "checks": { "md.decision_exists": true, "md.signed_off": false, ... },
      "passed_count": 3,
      "total_count":  4,
      "not_applicable_count": 1,
      "details": { "criteria": [...] }
    },
    "outcome":     { ... },
    "cross_stage": { ... }
  }
}

The two reward axes

  • binary_reward is the strict pass/fail axis. 1.0 only when every non-N/A check passes; 0.0 if any required check fails. This is what the leaderboard publishes as pass@1.
  • fractional_reward = passed_checks / total_checks is partial credit, useful for diagnostics. A 0.0 / 0.91 split means "near miss": a few failed checks sank an otherwise-strong run.
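The two axes can be recomputed from a scorecard's checks map. This is a minimal sketch of the rule as stated above, not the project's build_scorecard(); the function name reward_axes is hypothetical, and treating a scorecard with zero applicable checks as a failure is an assumption. It ignores the degraded-judge and Care Management cases covered later on this page.

```python
def reward_axes(checks):
    """Recompute (binary_reward, fractional_reward) from a `checks` map.

    Values are True, False, or "not_applicable", as in scorecard.json.
    """
    # N/A entries count toward neither passed_checks nor total_checks.
    applicable = [v for v in checks.values() if v != "not_applicable"]
    passed = sum(1 for v in applicable if v is True)
    total = len(applicable)
    # Assumption: an empty check set is treated as a failed run.
    binary = 1.0 if total > 0 and passed == total else 0.0
    fractional = passed / total if total else 0.0
    return binary, fractional
```

With 42 passing and 4 failing checks this returns a binary of 0.0 and a fractional of 42/46, which rounds to 0.91.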

Stages

Checks are grouped by workflow stage. When you see binary_reward: 0, scan stages first — the failed stage is where to start debugging. Each stage records its own passed, passed_count, total_count, not_applicable_count, and free-form details.
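The "scan stages first" step could look like the sketch below. It assumes only the stages shape shown in the scorecard.json example (a passed flag and a checks map per stage); the function name failed_stages is hypothetical.

```python
def failed_stages(scorecard):
    """List (stage_name, failing_check_names) for each stage that did not pass."""
    report = []
    for name, stage in scorecard.get("stages", {}).items():
        if stage.get("passed") is False:
            # Keep only checks that are explicitly false; N/A entries are skipped.
            failing = [c for c, ok in stage.get("checks", {}).items() if ok is False]
            report.append((name, failing))
    return report
```

The first entry in the returned list is usually the stage to open in exported_state.json.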

Stage names you'll encounter:

| Stage | Domains | What it gates |
| --- | --- | --- |
| intake | UM | Auth request received and routed; member/provider/code resolution. |
| nurse_review | UM | Nurse triage decision and rationale. |
| md_review | UM | Medical-director decision, sign-off, denial rationale (when applicable). |
| p2p | UM, PA-Provider | Peer-to-peer call: scheduling, transcript, post-call decision update. |
| appeal | UM, PA-Provider | Appeal acceptance, evidence handling, final disposition. |
| outcome | UM, PA-Provider | Terminal status reached, determination/letter exists, decision matches reviewer recommendation. |
| cross_stage | UM, PA-Provider | No forbidden world-state mutations; only forward state transitions. |
| provider_pre_submission, provider_request_package, provider_submission, new_referral | PA-Provider | PA packet assembly and submission gates. |
| cm_chart_review, cm_assessment, cm_care_plan, cm_cross_stage | Care Management | Chart review, assessment completeness, care plan structure, cross-stage invariants. |
| e2e_consistency | Provider–Payer arena | Provider PA and payer determination tell the same story end-to-end. |

Check-name namespaces

Every check name has a prefix telling you who graded it and on what.

Deterministic checks

| Prefix | Examples | Source |
| --- | --- | --- |
| md.* | md.decision_exists, md.signed_off, md.rationale_present, md.denial_rationale_present, md.audit | stages/md_review.py |
| outcome.* | outcome.target_status, outcome.letter_types, outcome.determination_exists, outcome.review_decision_matches, outcome.terminal_transition_exists, outcome.clean_determination | stages/outcome.py |
| cross.* | cross.forbidden_mutations, cross.forward_transitions | stages/cross_stage.py |
| cm.* | cm.assessment.completed, cm.care_plan.problem_count, cm.care_plan.escalation_conditions_present, cm.cross_stage.target_status | stages/cm_v4.py |

Other deterministic prefixes: intake.*, nurse.*, p2p.*, appeal.*, provider_pre_submission.*, provider_request_package.*, provider_submission.*, new_referral.*, e2e_consistency.*.

Rubrics-based LLM judge

Judge checks are named judge.<stage>:<rubric_id>. For example, judge.md_review:final_decision is the final_decision rubric item evaluated as part of the md_review stage.

  • Run by WorkspaceJudge — a Claude model (default claude-opus-4-7, override with CHI_BENCH_JUDGE_MODEL) reading the task's fixtures/judge/rubrics.json against the agent's workspace.
  • Set CHI_BENCH_JUDGE_NUM_VOTES > 1 for majority-voted judging.
  • Degraded judge runs (timeout, crash, malformed verdicts) collapse binary_reward to 0 even on gate-pass runs — visible as judge_unavailable_reason on the scorecard. Rubrics missing a parsed verdict count as not passing.
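Because the prefix encodes who graded a check, failed_checks can be split into judge and deterministic failures with a small parser. The sketch below relies only on the naming conventions described above; parse_check_name is a hypothetical helper, not a chi_bench function.

```python
def parse_check_name(name):
    """Split a check name into (grader, stage_or_prefix, item)."""
    if name.startswith("judge."):
        # Judge checks follow judge.<stage>:<rubric_id>.
        stage, _, rubric_id = name[len("judge."):].partition(":")
        return ("judge", stage, rubric_id)
    # Deterministic checks follow <prefix>.<rest>; split on the first dot only,
    # so nested names such as cm.care_plan.problem_count keep their tail intact.
    prefix, _, item = name.partition(".")
    return ("deterministic", prefix, item)
```

Grouping the parsed tuples by their first element separates rubric misses from hard gate failures at a glance.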

Three check states

Inside checks, every entry is one of:

  • true — pass. Counts toward both passed_checks and total_checks.
  • false — fail. Counts toward total_checks only; appears in failed_checks.
  • "not_applicable" — rubric/check doesn't apply (e.g. md.denial_rationale_present for an approve decision). Excluded from total_checks; appears in not_applicable_checks.

check_scores mirrors these states: 1.0 for pass, 0.0 for fail, null for N/A.
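The mapping between checks and check_scores is mechanical, so it doubles as a sanity check on a scorecard that was edited or merged by hand. A minimal sketch, with hypothetical helper names, where JSON null appears as Python None:

```python
# State in `checks` -> value expected in `check_scores`.
SCORE_FOR_STATE = {True: 1.0, False: 0.0, "not_applicable": None}


def mirror_check_scores(checks):
    """Build the check_scores map implied by a checks map."""
    return {name: SCORE_FOR_STATE[state] for name, state in checks.items()}


def mismatched_checks(scorecard):
    """Return names whose check_scores entry disagrees with checks."""
    expected = mirror_check_scores(scorecard["checks"])
    actual = scorecard["check_scores"]
    # A name missing from check_scores is reported as a mismatch.
    return sorted(n for n in expected if actual.get(n, "missing") != expected[n])
```

An empty list from mismatched_checks means the two maps agree.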

Care Management scorecard fields

CM trials share the same pass criterion as PA/UM — binary=1.0 requires both the deterministic gate and all LLM judge rubrics to pass. The scorecard schema differs in two ways (compute_cm_reward in src/chi_bench/verifier/stages/cm_rubric.py):

  • gate_pass — bool. Did the deterministic gate (chart review exists, assessment completed, care plan finalized, cross-stage invariants) pass?
  • rubric_yes_count / rubric_total — counts from the LLM judge rubric. fractional_reward = rubric_yes_count / rubric_total (rubric-only, not passed_checks / total_checks).

If the gate fails, the LLM judge is skipped (no point spending a judge session on a clearly-failed run); the scorecard still records which gate check broke. Reward semantics:

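| gate_pass | rubric outcome | binary | fractional |
| --- | --- | --- | --- |
| false | (skipped; judge isn't run) | 0.0 | 0.0 |
| true | rubric_total = 0 (gate-only task) | 1.0 | 1.0 |
| true | every rubric passes | 1.0 | 1.0 |
| true | N of M rubrics pass (N < M) | 0.0 | N/M |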
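The table translates into a few lines of Python. This is an illustration of the documented semantics only, not the body of compute_cm_reward; the function name cm_reward and its signature are hypothetical.

```python
def cm_reward(gate_pass, rubric_yes_count, rubric_total):
    """Return (binary, fractional) following the reward table above."""
    if not gate_pass:
        # Gate failed: the judge is skipped and both axes are zero.
        return 0.0, 0.0
    if rubric_total == 0:
        # Gate-only task: passing the gate is the whole test.
        return 1.0, 1.0
    fractional = rubric_yes_count / rubric_total
    binary = 1.0 if rubric_yes_count == rubric_total else 0.0
    return binary, fractional
```

For example, a gate-passing run with 3 of 4 rubrics answered yes yields (0.0, 0.75).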

Worked example

From a real UM medical-director-review trial:

{
  "binary_reward":     0.0,
  "fractional_reward": 0.91,
  "passed_checks":     42,
  "total_checks":      46,
  "failed_checks": [
    "md.signed_off",
    "judge.md_review:decision_rationale",
    "judge.md_review:preop_weight_history",
    "judge.md_review:preop_psychosocial_eval"
  ],
  "stages": {
    "md_review":   { "passed": false, "passed_count": 3,  "total_count": 4  },
    "outcome":     { "passed": true,  "passed_count": 13, "total_count": 13 },
    "cross_stage": { "passed": true,  "passed_count": 2,  "total_count": 2  }
  }
}

Reading this top-down: 91% of checks passed, but md_review failed because md.signed_off was false (the agent never marked the determination as MD-signed) and three judge rubrics flipped to false. outcome and cross_stage were clean, so the agent reached the correct terminal state, but the artifact wasn't signed off and the rationale missed three rubric items. Binary = 0; this trial does not count for pass@1 even though everything else looks right.

Where the rules live

  • Stage verifiers: src/chi_bench/verifier/stages/ — one file per stage.
  • Reward computation: build_scorecard() and verify_task() in src/chi_bench/verifier/task_runtime.py; compute_cm_reward() in stages/cm_rubric.py.
  • LLM judge: verifier/judge/workspace_judge.py + verifier/judge/cm_adapter.py.
  • Per-task expectations (the ground truth the verifier scores against): data/<domain>/tasks/<task-dir>/expectations.json — hidden from the agent.

Re-judging an old trial

If you upgrade the judge model or fix a rubric, re-score existing trials without re-running the agent:

uv run cb verifier rejudge --trials-dir logs/experiments/<dir>

This skips the agent entirely: it re-runs the LLM judge against the saved exported_state.json, recomputes the deterministic checks, and rewrites scorecard.json and reward.json in place. Source: src/chi_bench/verifier/rejudge.py.
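Since the rewrite happens in place, it can be worth keeping a copy of the old verdicts for a before/after comparison. The sketch below is a hypothetical helper, not part of the cb CLI, and it assumes trial directories named trial_* sit directly under the directory passed as --trials-dir.

```python
import shutil
from pathlib import Path


def backup_verdicts(trials_dir, suffix=".bak"):
    """Copy each trial's scorecard.json and reward.json next to the originals."""
    copied = []
    # Assumption: layout is <trials_dir>/trial_*/verifier/.
    for verifier_dir in sorted(Path(trials_dir).glob("trial_*/verifier")):
        for name in ("scorecard.json", "reward.json"):
            src = verifier_dir / name
            if src.is_file():
                dst = src.with_name(name + suffix)
                shutil.copy2(src, dst)  # copy2 also preserves timestamps
                copied.append(dst)
    return copied
```

Run it once before rejudging, then diff scorecard.json against scorecard.json.bak to see which rubric verdicts changed.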

See also