Run experiments

Single trials, full submission runs, and the paper-table matrices — all driven by one CLI.

Single trial

uv run cb experiment run \
  --dataset data/<domain>/tasks/<task-dir> \
  --agent <agent-id> --model <model-id>

Output lands under logs/experiments/.../trial_*/ with result.json and verifier/scorecard.json. See Quickstart for a worked example.
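To inspect finished trials programmatically, a minimal Python sketch follows. It assumes only the layout described above: directories matching trial_* somewhere under logs/experiments/, each holding result.json and verifier/scorecard.json. The helper name load_trial_outputs is hypothetical, and the JSON schemas are not documented here, so the sketch parses the files without interpreting any fields.

```python
import json
from pathlib import Path


def load_trial_outputs(root="logs/experiments"):
    """Yield (trial_dir, result, scorecard) for each trial_* directory under root.

    Hypothetical helper: it makes no assumption about the JSON schemas.
    """
    for trial_dir in sorted(Path(root).glob("**/trial_*")):
        result_path = trial_dir / "result.json"
        scorecard_path = trial_dir / "verifier" / "scorecard.json"
        if not result_path.is_file():
            continue  # no result.json yet, so skip this trial
        result = json.loads(result_path.read_text())
        # The scorecard is treated as optional; None means it was not found.
        scorecard = (
            json.loads(scorecard_path.read_text())
            if scorecard_path.is_file()
            else None
        )
        yield trial_dir, result, scorecard


# Example usage:
# for trial_dir, result, scorecard in load_trial_outputs():
#     print(trial_dir, scorecard is not None)
```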

Submission lifecycle

One YAML drives all four steps:

uv run cb submission validate -f configs/submissions/<id>.yaml
uv run cb submission run      -f configs/submissions/<id>.yaml
uv run cb submission status   -f configs/submissions/<id>.yaml
uv run cb submission prepare  -f configs/submissions/<id>.yaml

  • validate — schema + preflight: dataset pin, Modal/Docker readiness, agent name resolution.
  • run — runs all 3 domains. Default: one trial per task (pass@1).
  • status — progress check; safe to run while run is in flight.
  • prepare — curates the leaderboard-ready packet at logs/submissions/<id>/packet/YYYY-MM-DD-<id>/.
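The four steps above can be chained from a small Python driver. This is a sketch, not part of the CLI: the function names are hypothetical, and it assumes each cb submission subcommand exits with a nonzero status on failure, so that a failed validate stops the sequence before run starts.

```python
import subprocess

# Order taken from the lifecycle above.
STEPS = ("validate", "run", "status", "prepare")


def submission_commands(config_path):
    """Build the argv list for each lifecycle step (hypothetical helper)."""
    return [
        ["uv", "run", "cb", "submission", step, "-f", config_path]
        for step in STEPS
    ]


def run_submission(config_path):
    """Run the steps in order, stopping at the first failure."""
    for argv in submission_commands(config_path):
        print("+", " ".join(argv))
        # Assumption: a failing step returns a nonzero exit code,
        # which makes check=True raise CalledProcessError.
        subprocess.run(argv, check=True)


# Example usage:
# run_submission("configs/submissions/<id>.yaml")
```

Since status is safe to run while run is in flight, it can also be called separately from another shell rather than only after run completes.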

For the leaderboard PR step, see the Leaderboard guide.

Reproduce paper tables

Each table maps to a matrix YAML and a single command:

Paper                     Config                        Command
Table 1 (Main matrix)     table1_main_matrix.yaml       ./scripts/run_table.sh table1
Table 2 (E2E arena)       table2_e2e_arena.yaml         ./scripts/run_table.sh table2
Table 3 (Marathon)        table3_marathon.yaml          ./scripts/run_table.sh table3
Fig. 4 (Skill ablation)   table4_skill_ablation.yaml    ./scripts/run_table.sh table4
Table 5 (MCP vs CLI)      table5_mcp_vs_cli.yaml        ./scripts/run_table.sh table5

After all slices finish, aggregate:

uv run python scripts/aggregate.py \
  --trials-dir logs/experiments/table1_main_matrix \
  --prices configs/prices.yaml \
  --out-csv logs/table1.csv

CSV columns: agent, model, n_trials, n_tasks, pass_at_1, pass_at_1_lo, pass_at_1_hi, pass_at_3, ..., pass_pow_3, pass_pow_3_hi, mean_cost_usd, mean_walltime_s. The _lo and _hi columns hold the bounds of Wilson 95% confidence intervals.
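For reference, the Wilson score interval for a pass rate can be computed as in the sketch below. It uses the standard textbook formula. Whether scripts/aggregate.py computes it in exactly this way (for example, whether n counts trials or tasks) is not stated here, so the function name and its inputs are illustrative assumptions.

```python
import math


def wilson_interval(successes, trials, z=1.959963984540054):
    """Return the (lo, hi) Wilson score interval for a binomial proportion.

    The default z is the two-sided 95% normal quantile.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Example: 5 passes out of 10 trials gives roughly (0.237, 0.763).
lo, hi = wilson_interval(5, 10)
print(f"pass_at_1=0.500 lo={lo:.3f} hi={hi:.3f}")
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and remains informative when the pass rate is 0 or 1, which matters for small trial counts.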

Modal vs Docker

Local Docker is fine for a handful of trials. Matrix reproduction on a single host takes days, so use Modal for parallel execution:

./scripts/run_table.sh table1 --modal

With -e modal, cb experiment run defaults to the Modal profile actava. Pass --modal-profile '' to skip the Modal preflight, or set MODAL_PROFILE=<name> to use a named profile.

Test markers

The default uv run pytest skips judge-hitting and slow suites. Opt in:

uv run pytest -m requires_anthropic_key   # hits the live judge
uv run pytest -m slow                     # includes docker-build smoke
CHI_BENCH_SKIP_DOCKER_BUILD=1 uv run pytest tests/smoke -v -m slow   # slow smoke tests, skipping the docker build

Useful environment variables

  • ANTHROPIC_API_KEY — always required (judge).
  • CHI_BENCH_JUDGE_MODEL — override the pinned judge model (deviates from paper protocol).
  • CHI_BENCH_JUDGE_NUM_VOTES — > 1 enables majority-voted judging.
  • CHI_BENCH_PAYER_MODE — set to agent for the local server (auto-set by cb serve).
  • MODAL_PROFILE — named Modal profile for parallel execution.
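As an illustration of what majority-voted judging means, the sketch below reads CHI_BENCH_JUDGE_NUM_VOTES and reduces a list of boolean verdicts to one decision. This is not the benchmark's implementation: the function names, the default of 1 when the variable is unset, and the rule that ties count as a fail are all assumptions made for the example.

```python
import os
from collections import Counter


def num_votes(env=os.environ):
    """Read the vote count from the environment (assumed default: 1)."""
    n = int(env.get("CHI_BENCH_JUDGE_NUM_VOTES", "1"))
    if n < 1:
        raise ValueError("CHI_BENCH_JUDGE_NUM_VOTES must be >= 1")
    return n


def majority_verdict(verdicts):
    """Return True if strictly more than half of the verdicts pass.

    Assumption: a tie (possible with an even vote count) counts as a fail.
    """
    counts = Counter(bool(v) for v in verdicts)
    return counts[True] > counts[False]


# Example: three votes, two of which pass.
print(majority_verdict([True, False, True]))
```

An odd vote count avoids ties altogether, which sidesteps the tie-breaking assumption.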