Submit
Submit a run to the leaderboard.
chi-bench is open source. A submission spans two repos with one handoff: you produce a packet with cb submission prepare in chi-bench, then open a PR adding that packet to actava-ai/leaderboard. CI validates schema and trial integrity; a maintainer reviews and merges.
Producer
In your fork of actava-ai/chi-bench

- 01 Install and download the dataset
  Clone actava-ai/chi-bench, install with uv sync --extra dev, then fetch the gated dataset from Hugging Face:

      uv run huggingface-cli login
      REV=chi-bench-v1.0.0
      uv run huggingface-cli download actava/chi-bench \
        --repo-type dataset --revision "$REV" --local-dir data/
      echo "$REV" > data/.chi-bench-version

  The pin in data/.chi-bench-version is what submission preflight verifies against your config's dataset.version.

- 02 Set API keys and build the Docker image
  Copy .env.example → .env and fill in keys for the providers you intend to run. ANTHROPIC_API_KEY is always required: the workspace judge is pinned to claude-opus-4-7. Then build the runtime image (~5 min, one-time):

      uv run cb docker build

- 03 Write a submission YAML
  Copy configs/submission_example.yaml → configs/submissions/<your-id>.yaml and fill in id, team, contact, agent, and model. Optionally set notes and run.*.
  Bringing a custom agent harness or model endpoint? See docs/extending.md.

- 04 Run trials and prepare the packet
  Four commands: preflight, run, monitor, package.

      # Schema + preflight: dataset pin, Modal/Docker, agent name
      uv run cb submission validate -f configs/submissions/<your-id>.yaml
      # Run all 3 domains. Default: one trial per task (pass@1).
      uv run cb submission run -f configs/submissions/<your-id>.yaml
      # Check progress; safe to run while `submission run` is in flight.
      uv run cb submission status -f configs/submissions/<your-id>.yaml
      # Curate the leaderboard-ready packet.
      uv run cb submission prepare -f configs/submissions/<your-id>.yaml

  The packet lands at logs/submissions/<id>/packet/YYYY-MM-DD-<id>/, typically <100 MB. Workspace artifacts and Harbor scratch are excluded by design.

      packet/2026-05-13-<id>/
      ├── submission.json    # manifest: agent, model, results, provenance
      ├── results.csv        # leaderboard rows (one per domain + overall)
      ├── sub.yaml           # frozen copy of your config
      ├── provenance.json    # git SHA, image digest, timestamps
      ├── README.md          # auto-generated headline summary
      └── trials/<domain>/<trial_id>/
          ├── result.json
          ├── verifier/scorecard.json
          ├── verifier/reward.json
          └── agent/trajectory.jsonl.zst
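The dataset pin from step 01 can be sanity-checked before running preflight. The following is a minimal sketch, not the logic that cb itself uses: the function name pin_matches is hypothetical, and it assumes the check amounts to comparing the pin file's text with the dataset.version string from your config.

```python
from pathlib import Path


def pin_matches(pin_file: Path, config_version: str) -> bool:
    """Return True when the pin file's content equals the configured version.

    Hypothetical helper: the real preflight in cb may apply other rules.
    Surrounding whitespace is ignored because `echo` appends a newline.
    """
    if not pin_file.is_file():
        # A missing pin file can never match.
        return False
    pinned = pin_file.read_text(encoding="utf-8").strip()
    return pinned == config_version.strip()
```

For example, pin_matches(Path("data/.chi-bench-version"), "chi-bench-v1.0.0") should hold after the download commands in step 01, provided your config sets the same version.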
Leaderboard
In a fork of actava-ai/leaderboard

- 05 Fork the leaderboard repo (one-time)

      gh auth login
      gh repo fork actava-ai/leaderboard --clone=false
      git clone https://github.com/<you>/leaderboard && cd leaderboard

  Subsequent submissions reuse this same fork.

- 06 Open a PR adding the packet
  Three equivalent paths; pick one. They all run the same CI validator (.github/workflows/validate.yml).

  A. Helper script (Recommended)
  The helper validates locally, copies the packet to benchmarks/<bench>/submissions/<dir>/, creates branch sub/<bench>/<dir>, pushes to your fork, and opens the PR.

      python scripts/submit.py /path/to/packet/2026-05-13-<slug>/

  Useful flags: --no-fork, --no-open-pr, --on-conflict abandon|replace|bump-date.

  B. Claude Code / Codex
  The leaderboard repo ships a submit-to-leaderboard skill (AGENTS.md points Codex at the same file). Open the repo and ask:

      /submit-to-leaderboard /abs/path/to/packet/2026-05-13-<slug>/

  The skill wraps the helper with preflight checks, partial-failure recovery, and pointers to producer-side fixes when the validator complains.

  C. Manual
  From your fork clone, the underlying flow is a short sequence of commands:

      # No trailing slash on the source, so the directory itself is copied on both GNU and BSD cp.
      cp -r /path/to/packet/2026-05-13-<slug> benchmarks/<bench>/submissions/
      python scripts/validate.py benchmarks/<bench>/submissions/2026-05-13-<slug>/
      git checkout -b sub/<bench>/2026-05-13-<slug>
      # The packet files are new and untracked, so stage them explicitly; commit -a would skip them.
      git add benchmarks/<bench>/submissions/2026-05-13-<slug>/
      git commit -m "<bench>: <team> · <agent> · <model>"
      git push origin sub/<bench>/2026-05-13-<slug>
      gh pr create -R actava-ai/leaderboard --base main

- 07 CI labels the PR; a maintainer reviews
  CI runs schema and trial-integrity checks and labels the PR valid-submission, invalid-submission, or needs-review, with a sticky comment summarizing each check. A maintainer spot-inspects a trajectory or two and merges if everything looks plausible.
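Before opening the PR, it can help to confirm that the branch touches only one packet directory, as the PR-scope policy requires. The following is a minimal sketch, assuming the changed paths come from something like git diff --name-only against main. The function name is hypothetical, and the CI validator remains the authority.

```python
from pathlib import PurePosixPath


def submission_dirs_touched(changed_paths: list[str]) -> tuple[set[str], list[str]]:
    """Split changed paths into touched submission dirs and out-of-scope paths.

    Hypothetical helper; assumes POSIX-style paths as printed by git.
    """
    dirs: set[str] = set()
    out_of_scope: list[str] = []
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        # Expected shape: benchmarks/<bench>/submissions/<dir>/<file...>
        if len(parts) >= 5 and parts[0] == "benchmarks" and parts[2] == "submissions":
            dirs.add("/".join(parts[:4]))
        else:
            out_of_scope.append(raw)
    return dirs, out_of_scope
```

Under these assumptions, a branch is in scope when exactly one directory is returned and the out-of-scope list is empty.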
What CI checks
- Directory name = YYYY-MM-DD-<slug>
- Required files present: submission.json, results.csv, sub.yaml, provenance.json, README.md, ≥1 trial result
- No unexpected files (.zip, .bak, hidden files except .gitkeep)
- submission.json matches the JSON Schema
- results.csv rows match the manifest
- Per-trial integrity: required files, valid zstd, valid JSONL per line
- Trial counts match results.per_domain.<domain>.n_trials
- Per-file and total size limits
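A few of these checks are easy to approximate locally before pushing. The sketch below is not the CI validator (scripts/validate.py in the leaderboard repo is the authority). The function name, the date regex, and the reading of "unexpected files" are assumptions drawn from the checklist, and the schema, zstd, JSONL, trial-count, and size checks are left out.

```python
import re
from pathlib import Path

# Assumed from the checklist; the real validator may be stricter.
DIR_NAME = re.compile(r"^\d{4}-\d{2}-\d{2}-.+$")
REQUIRED = ("submission.json", "results.csv", "sub.yaml", "provenance.json", "README.md")
BANNED_SUFFIXES = (".zip", ".bak")


def precheck_packet(packet: Path) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    if not DIR_NAME.match(packet.name):
        problems.append(f"directory name {packet.name!r} is not YYYY-MM-DD-<slug>")
    for name in REQUIRED:
        if not (packet / name).is_file():
            problems.append(f"missing required file: {name}")
    # Layout from the packet tree: trials/<domain>/<trial_id>/result.json
    if not any(packet.glob("trials/*/*/result.json")):
        problems.append("no trial result found under trials/<domain>/<trial_id>/")
    for path in packet.rglob("*"):
        if not path.is_file():
            continue
        hidden = path.name.startswith(".") and path.name != ".gitkeep"
        if hidden or path.suffix in BANNED_SUFFIXES:
            problems.append(f"unexpected file: {path.relative_to(packet)}")
    return problems
```

Calling precheck_packet(Path("logs/submissions/<id>/packet/2026-05-13-<id>")) with your own id filled in returns an empty list when none of these particular problems are present.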
Policy notes
- Leaderboard is pass@1. Set run.n_attempts: 3 to keep extra trials on disk for your own pass@3 / pass^3 analysis; the manifest still publishes pass@1.
- Partial submissions (--domain pa | um | cm on submission run) are accepted but flagged as partial on the leaderboard.
- Resubmissions with a fresh date prefix are always acceptable. Old submissions are kept for historical record; mention in the PR body if you want one removed.
- PR scope. Each PR touches exactly one new directory under benchmarks/<bench>/submissions/. Schema or workflow changes go in separate PRs.
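For the pass@3 / pass^3 analysis mentioned in the first policy note, a minimal sketch follows. It assumes the common empirical definitions: pass@k counts a task as solved if any of its first k attempts succeeds, and pass^k counts it only if all k succeed. How chi-bench itself defines or aggregates these is not stated here, and the function names and sample outcomes are hypothetical.

```python
def pass_at_k(outcomes: dict[str, list[bool]], k: int) -> float:
    """Fraction of tasks with at least one success in the first k attempts."""
    if not outcomes:
        raise ValueError("no tasks given")
    return sum(any(o[:k]) for o in outcomes.values()) / len(outcomes)


def pass_hat_k(outcomes: dict[str, list[bool]], k: int) -> float:
    """Fraction of tasks whose first k attempts all succeed (pass^k)."""
    if not outcomes:
        raise ValueError("no tasks given")
    return sum(len(o) >= k and all(o[:k]) for o in outcomes.values()) / len(outcomes)


# Hypothetical per-task outcomes from a run with run.n_attempts: 3
sample = {
    "t1": [True, True, True],
    "t2": [False, True, False],
    "t3": [False, False, False],
    "t4": [True, False, True],
}
print(pass_at_k(sample, 1))   # 0.5
print(pass_at_k(sample, 3))   # 0.75
print(pass_hat_k(sample, 3))  # 0.25
```

In this sample, pass@3 is higher than pass@1 because t2 succeeds only on a retry, while pass^3 is lower because only t1 succeeds on every attempt.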
Packet contract for benchmark authors building their own producers: docs/submission-packet.md.