Submit

Submit a run to the leaderboard.

chi-bench is open source. A submission spans two repos with one handoff: you produce a packet with cb submission prepare in your chi-bench fork, then open a PR adding that packet to actava-ai/leaderboard. CI validates schema and trial integrity; a maintainer reviews and merges.

Producer: in your fork of actava-ai/chi-bench

01
Install and download the dataset
Clone actava-ai/chi-bench, install with uv sync --extra dev, then fetch the gated dataset from Hugging Face:
```
uv run huggingface-cli login

REV=chi-bench-v1.0.0
uv run huggingface-cli download actava/chi-bench \
  --repo-type dataset --revision "$REV" --local-dir data/
echo "$REV" > data/.chi-bench-version
```
The pin in data/.chi-bench-version is what submission preflight verifies against your config's dataset.version.
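As a rough illustration of that check, the sketch below compares the pin file with a version string. It is not the actual preflight code: the function name `pin_matches` is hypothetical, and the caller is assumed to have already read `dataset.version` from the submission config.

```python
from pathlib import Path


def pin_matches(data_dir: str, expected_version: str) -> bool:
    """Return True if <data_dir>/.chi-bench-version holds the expected revision."""
    pin_file = Path(data_dir) / ".chi-bench-version"
    if not pin_file.is_file():
        return False  # dataset was never downloaded, or the pin was never written
    # `echo` appends a trailing newline, so strip whitespace before comparing.
    return pin_file.read_text(encoding="utf-8").strip() == expected_version
```

If a check like this fails, re-downloading with the intended `REV` and rewriting the pin file is the likely fix.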
02
Set API keys and build the Docker image
Copy .env.example → .env and fill in keys for the providers you intend to run. ANTHROPIC_API_KEY is always required — the workspace judge is pinned to claude-opus-4-7. Then build the runtime image (~5 min, one-time):
```
uv run cb docker build
```
03
Write a submission YAML
Copy configs/submission_example.yaml → configs/submissions/<your-id>.yaml and fill in id, team, contact, agent, and model. Optionally set notes and run.*.
Bringing a custom agent harness or model endpoint? See docs/extending.md.
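Before running the real validator, a quick self-check of the required fields can catch obvious omissions. The sketch below assumes the five required fields are flat top-level keys of the parsed config; the actual layout in configs/submission_example.yaml may differ. The helper name `missing_fields` and all example values are hypothetical.

```python
# Required fields named in this step; treating them as flat keys is an assumption.
REQUIRED_FIELDS = ("id", "team", "contact", "agent", "model")


def missing_fields(config: dict) -> list[str]:
    """List required keys that are absent or empty in a parsed config."""
    return [key for key in REQUIRED_FIELDS if not config.get(key)]


# Placeholder values for illustration only, not a real submission.
example = {
    "id": "my-team-my-agent",
    "team": "My Team",
    "contact": "me@example.com",
    "agent": "my-agent",
    "model": "my-model",
    "notes": "optional free text",
}
print(missing_fields(example))  # []
```

The authoritative check remains cb submission validate, which also covers the preflight conditions.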

04
Run trials and prepare the packet

Four commands — preflight, run, monitor, package:

# Schema + preflight: dataset pin, Modal/Docker, agent name
uv run cb submission validate -f configs/submissions/<your-id>.yaml

# Run all 3 domains. Default: one trial per task (pass@1).
uv run cb submission run      -f configs/submissions/<your-id>.yaml

# Check progress; safe to run while `submission run` is in flight.
uv run cb submission status   -f configs/submissions/<your-id>.yaml

# Curate the leaderboard-ready packet.
uv run cb submission prepare  -f configs/submissions/<your-id>.yaml

The packet lands at logs/submissions/<id>/packet/YYYY-MM-DD-<id>/ — typically <100 MB. Workspace artifacts and Harbor scratch are excluded by design.

packet/2026-05-13-<id>/
├── submission.json          # manifest: agent, model, results, provenance
├── results.csv              # leaderboard rows (one per domain + overall)
├── sub.yaml                 # frozen copy of your config
├── provenance.json          # git SHA, image digest, timestamps
├── README.md                # auto-generated headline summary
└── trials/<domain>/<trial_id>/
    ├── result.json
    ├── verifier/scorecard.json
    ├── verifier/reward.json
    └── agent/trajectory.jsonl.zst
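
To sanity-check a packet against this layout before opening a PR, something like the following sketch can list trials with missing files. It mirrors only the tree shown here, not the real validator; the function name `incomplete_trials` is hypothetical.

```python
from pathlib import Path

# Per-trial files, relative to trials/<domain>/<trial_id>/, as in the tree above.
TRIAL_FILES = (
    "result.json",
    "verifier/scorecard.json",
    "verifier/reward.json",
    "agent/trajectory.jsonl.zst",
)


def incomplete_trials(packet_dir: str) -> dict[str, list[str]]:
    """Map 'domain/trial_id' to the per-trial files that trial is missing."""
    problems: dict[str, list[str]] = {}
    trials_root = Path(packet_dir) / "trials"
    for trial in sorted(trials_root.glob("*/*")):
        if not trial.is_dir():
            continue  # ignore stray files at the trial level
        missing = [f for f in TRIAL_FILES if not (trial / f).is_file()]
        if missing:
            problems[f"{trial.parent.name}/{trial.name}"] = missing
    return problems
```

An empty result means every trial directory has the four expected files. It does not verify that the contents are valid.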

Leaderboard: in a fork of actava-ai/leaderboard

05
Fork the leaderboard repo (one-time)

gh auth login
gh repo fork actava-ai/leaderboard --clone=false
git clone https://github.com/<you>/leaderboard && cd leaderboard

Subsequent submissions reuse this same fork.

06
Open a PR adding the packet
Three equivalent paths — pick one. They all run the same CI validator (.github/workflows/validate.yml).
A. Helper script (recommended)
The helper validates locally, copies the packet to benchmarks/<bench>/submissions/<dir>/, creates branch sub/<bench>/<dir>, pushes to your fork, and opens the PR.
python scripts/submit.py /path/to/packet/2026-05-13-<slug>/
Useful flags: --no-fork, --no-open-pr, --on-conflict abandon|replace|bump-date.
B. Claude Code / Codex
The leaderboard repo ships a submit-to-leaderboard skill (AGENTS.md points Codex at the same file). Open the repo and ask:
/submit-to-leaderboard /abs/path/to/packet/2026-05-13-<slug>/
The skill wraps the helper with preflight checks, partial-failure recovery, and pointers to producer-side fixes when the validator complains.
C. Manual
From your fork clone, the underlying flow is a handful of commands:
cp -r /path/to/packet/2026-05-13-<slug> benchmarks/<bench>/submissions/
python scripts/validate.py benchmarks/<bench>/submissions/2026-05-13-<slug>/
git checkout -b sub/<bench>/2026-05-13-<slug>
# The packet files are new and untracked, so stage them explicitly.
git add benchmarks/<bench>/submissions/2026-05-13-<slug>/
git commit -m "<bench>: <team> · <agent> · <model>"
git push origin sub/<bench>/2026-05-13-<slug>
gh pr create -R actava-ai/leaderboard --base main
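All three paths rely on the same naming convention: the packet directory name is reused for both the destination directory and the branch. A minimal sketch of that derivation follows. The function name `submission_targets` is hypothetical, and the `"chi-bench"` value passed for `<bench>` in the usage line is an assumption.

```python
from pathlib import PurePosixPath


def submission_targets(packet_dir: str, bench: str) -> tuple[str, str]:
    """Return (destination directory, branch name) for a packet directory."""
    # .name ignores a trailing slash, so both ".../x" and ".../x/" give "x".
    dir_name = PurePosixPath(packet_dir).name
    dest = f"benchmarks/{bench}/submissions/{dir_name}/"
    branch = f"sub/{bench}/{dir_name}"
    return dest, branch


dest, branch = submission_targets("/path/to/packet/2026-05-13-my-run/", "chi-bench")
print(dest)    # benchmarks/chi-bench/submissions/2026-05-13-my-run/
print(branch)  # sub/chi-bench/2026-05-13-my-run
```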
07
CI labels the PR; a maintainer reviews
CI runs schema and trial-integrity checks and labels the PR valid-submission, invalid-submission, or needs-review, with a sticky comment summarizing each check. A maintainer spot-inspects a trajectory or two and merges if everything looks plausible.
What CI checks
- Directory name = YYYY-MM-DD-<slug>
- Required files present: submission.json, results.csv, sub.yaml, provenance.json, README.md, ≥1 trial result
- No unexpected files (.zip, .bak, hidden files except .gitkeep)
- submission.json matches the JSON Schema
- results.csv rows match the manifest
- Per-trial integrity: required files, valid zstd, valid JSONL per line
- Trial counts match results.per_domain.<domain>.n_trials
- Per-file and total size limits
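
The first three checks in the list above can be approximated locally with a short script. This is a sketch only, not the code in scripts/validate.py: the function name `top_level_problems` is hypothetical, the allowed slug characters in the regex are a guess, and the schema, zstd, and size checks are left out.

```python
import re
from pathlib import Path

REQUIRED = ("submission.json", "results.csv", "sub.yaml", "provenance.json", "README.md")
# YYYY-MM-DD-<slug>; the slug character set here is an assumption.
DIR_NAME = re.compile(r"\d{4}-\d{2}-\d{2}-[A-Za-z0-9][A-Za-z0-9._-]*")
BAD_SUFFIXES = (".zip", ".bak")


def top_level_problems(packet_dir: str) -> list[str]:
    """Return human-readable problems found in a packet directory."""
    packet = Path(packet_dir)
    problems = []
    if not DIR_NAME.fullmatch(packet.name):
        problems.append(f"bad directory name: {packet.name}")
    for name in REQUIRED:
        if not (packet / name).is_file():
            problems.append(f"missing file: {name}")
    # At least one trial result must exist somewhere under trials/<domain>/<trial_id>/.
    if not any(packet.glob("trials/*/*/result.json")):
        problems.append("no trial result found")
    for path in packet.rglob("*"):
        hidden = path.name.startswith(".") and path.name != ".gitkeep"
        if path.is_file() and (hidden or path.suffix in BAD_SUFFIXES):
            problems.append(f"unexpected file: {path.relative_to(packet)}")
    return problems
```

An empty list only means these surface checks pass. The CI validator remains the source of truth.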

Policy notes

Leaderboard is pass@1. Set run.n_attempts: 3 to keep extra trials on disk for your own pass@3 / pass^3 analysis — the manifest still publishes pass@1.
Partial submissions (--domain pa | um | cm on submission run) are accepted but flagged as partial on the leaderboard.
Resubmissions with a fresh date prefix are always acceptable. Old submissions are kept for historical record; mention in the PR body if you want one removed.
PR scope. Each PR touches exactly one new directory under benchmarks/<bench>/submissions/. Schema or workflow changes go in separate PRs.
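
For the private pass@3 / pass^3 analysis mentioned in the first policy note, the three metrics can be computed from per-task outcomes roughly as follows. The sketch assumes the common definitions (pass@3: at least one of three attempts succeeds; pass^3: all three succeed) and assumes pass@1 is taken from the first attempt. How the manifest actually selects the pass@1 trial is not specified here. The function name and the outcome data are hypothetical.

```python
def summarize(outcomes: dict[str, list[bool]]) -> dict[str, float]:
    """Average per-task metrics; each task maps to its 3 attempt outcomes."""
    n = len(outcomes)
    attempts = list(outcomes.values())
    return {
        "pass@1": sum(a[0] for a in attempts) / n,        # first attempt only
        "pass@3": sum(any(a[:3]) for a in attempts) / n,  # any attempt succeeds
        "pass^3": sum(all(a[:3]) for a in attempts) / n,  # every attempt succeeds
    }


# Made-up outcomes for four tasks, three attempts each.
demo = {
    "task-1": [True, True, True],
    "task-2": [False, True, False],
    "task-3": [False, False, False],
    "task-4": [True, False, True],
}
print(summarize(demo))  # {'pass@1': 0.5, 'pass@3': 0.75, 'pass^3': 0.25}
```

By construction pass^3 ≤ pass@1 ≤ pass@3 for any task set, so the gap between the outer two gives a rough sense of run-to-run variance.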


Packet contract for benchmark authors building their own producers: docs/submission-packet.md.