The leaderboard repo
actava-ai/leaderboard is a public, data-only record of benchmark submissions. The full audit packet lives in git so reviewers can inspect any submission directly from the PR diff. This page covers the PR workflow, the CI validator, and the resubmission policy.
Two repos, one handoff
- Producer: run trials and produce a packet by running cb submission prepare in actava-ai/chi-bench. See Run experiments.
- Leaderboard: fork actava-ai/leaderboard and open a PR adding that packet under benchmarks/<bench>/submissions/<YYYY-MM-DD>-<slug>/.
- CI labels the PR valid-submission, invalid-submission, or needs-review and posts a sticky report. A maintainer reviews and merges.
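The destination path in the second step follows a fixed pattern, so it can be computed rather than typed by hand. The following is a minimal sketch, not code from either repo: `submission_dir` is a hypothetical helper, and the example slug `my-team` is made up for illustration.

```python
from datetime import date


def submission_dir(bench: str, day: date, slug: str) -> str:
    """Return the repo-relative directory a packet is expected to land in.

    Hypothetical helper for illustration; not part of the leaderboard repo.
    """
    if not bench or not slug or "/" in bench or "/" in slug:
        raise ValueError("bench and slug must be non-empty and contain no '/'")
    # date.isoformat() yields YYYY-MM-DD, matching the directory prefix.
    return f"benchmarks/{bench}/submissions/{day.isoformat()}-{slug}/"


if __name__ == "__main__":
    # "my-team" is an invented slug used only for this example.
    print(submission_dir("chi-bench", date(2026, 5, 13), "my-team"))
```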
One-time setup
Fork the leaderboard repo once; subsequent submissions reuse the same fork.
gh auth login # authenticate to GitHub
gh repo fork actava-ai/leaderboard --clone=false
git config --global user.email "you@example.com" # if not set
git clone https://github.com/<you>/leaderboard && cd leaderboard
Three submission paths
All three run the same CI validator (.github/workflows/validate.yml).
A. Helper script (recommended)
The helper validates locally, copies the packet, branches, pushes to your fork, and opens the PR.
python scripts/submit.py /path/to/packet/2026-05-13-<slug>/
Useful flags: --no-fork, --no-open-pr, --on-conflict abandon|replace|bump-date, --leaderboard-repo <slug>.
B. Claude Code / Codex
The repo ships a submit-to-leaderboard skill (AGENTS.md points Codex at the same file). Open the repo and ask:
/submit-to-leaderboard /abs/path/to/packet/2026-05-13-<slug>/
The skill wraps the helper with preflight checks, partial-failure recovery, and pointers to producer-side fixes.
C. Manual
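# Copy the packet directory itself. No trailing slash on the source: BSD cp would otherwise copy only its contents.
cp -r /path/to/packet/2026-05-13-<slug> benchmarks/<bench>/submissions/
python scripts/validate.py benchmarks/<bench>/submissions/2026-05-13-<slug>/
git checkout -b sub/<bench>/2026-05-13-<slug>
# The packet files are new and untracked, so stage them explicitly; commit -a alone would skip them.
git add benchmarks/<bench>/submissions/2026-05-13-<slug>
git commit -m "<bench>: <team> · <agent> · <model>"
git push origin sub/<bench>/2026-05-13-<slug>
gh pr create -R actava-ai/leaderboard --base main
Pre-PR sanity check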
scripts/validate.py is a thin shim around the CI validator — same code path. Run it locally before opening a PR:
python scripts/validate.py benchmarks/chi-bench/submissions/2026-05-13-<slug>/
Exit 0 = passed. Exit 1 = errors printed; fix and rerun.
The validator depends on jsonschema, zstandard, and pyyaml. These are CI-only and not in any pyproject.toml. Inject them with:
uv run --with jsonschema --with zstandard --with pyyaml python scripts/validate.py ...
What CI catches
- Directory name = YYYY-MM-DD-<slug>
- Required files: submission.json, results.csv, sub.yaml, provenance.json, README.md, ≥1 trial result
- No unexpected files (.zip, .bak, hidden files except .gitkeep)
- submission.json matches the JSON Schema (per benchmarks/<bench>/schema/submission-v<N>.json)
- results.csv rows match the manifest exactly
- provenance.json has required keys
- Per-trial integrity: required files, valid zstd, valid JSONL per line
- Trial counts match results.per_domain.<domain>.n_trials
- Per-file and total size limits
Soft warnings (not failures): unknown dataset version, duplicate submission.id.
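To make the first three checks concrete, here is a rough sketch of how a layout check of this kind could look. It is not the code in scripts/validate.py. The function name `check_packet_layout`, the error wording, and the assumption that the required files sit at the packet root are all illustrative. The trial-result, schema, and size checks are left out.

```python
import re
from datetime import datetime
from pathlib import Path

# Required top-level files, as listed above (assumed to live at the packet root).
REQUIRED_FILES = ("submission.json", "results.csv", "sub.yaml", "provenance.json", "README.md")
DIR_NAME_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})-(.+)$")
BANNED_SUFFIXES = {".zip", ".bak"}


def check_packet_layout(packet: Path) -> list:
    """Return a list of human-readable errors; an empty list means the layout looks fine."""
    errors = []

    # 1. Directory name must be YYYY-MM-DD-<slug> with a real calendar date.
    match = DIR_NAME_RE.match(packet.name)
    if match is None:
        errors.append(f"directory name is not YYYY-MM-DD-<slug>: {packet.name}")
    else:
        try:
            datetime.strptime(match.group(1), "%Y-%m-%d")
        except ValueError:
            errors.append(f"directory name has an invalid date: {match.group(1)}")

    # 2. Required files must exist.
    for name in REQUIRED_FILES:
        if not (packet / name).is_file():
            errors.append(f"missing required file: {name}")

    # 3. No archives, backups, or hidden files other than .gitkeep.
    for path in sorted(packet.rglob("*")):
        if not path.is_file():
            continue
        hidden = path.name.startswith(".") and path.name != ".gitkeep"
        if hidden or path.suffix in BANNED_SUFFIXES:
            errors.append(f"unexpected file: {path.relative_to(packet).as_posix()}")

    return errors
```

Returning a list of errors instead of raising on the first one mirrors the "errors printed; fix and rerun" behaviour described earlier, since a submitter sees every problem in one pass.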
What reviewers do beyond CI
- Sanity-check headline metrics (a 99% pass@1 on a benchmark where state-of-the-art is 30% warrants a closer look).
- Spot-inspect one or two trajectories: zstdcat trials/<dom>/<id>/agent/trajectory.jsonl.zst | jq .
- Confirm the producer repo and dataset version look right.
- For resubmissions: decide whether to keep the old submission alongside the new one.
CI does not re-judge submissions in v1 (trust-the-evidence model). Maintainers may manually re-judge a random trial via the producer's tooling if a submission looks suspicious.
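Choosing which trials to spot-inspect or re-judge can be scripted so the choice is random and reproducible. The sketch below assumes the trials/<domain>/<trial_id>/ layout shown in the zstdcat command above. `sample_trials` and `inspect_command` are hypothetical helpers, not part of either repo.

```python
import random
from pathlib import Path
from typing import List, Optional


def sample_trials(packet: Path, k: int = 2, seed: Optional[int] = None) -> List[Path]:
    """Pick up to k trial directories at random from trials/<domain>/<trial_id>/."""
    # Sort first so that a fixed seed gives the same picks on every machine.
    trials = sorted(p for p in (packet / "trials").glob("*/*") if p.is_dir())
    rng = random.Random(seed)
    return rng.sample(trials, min(k, len(trials)))


def inspect_command(trial: Path) -> str:
    """Build the shell command a reviewer would run to read one trajectory."""
    return f"zstdcat {trial / 'agent' / 'trajectory.jsonl.zst'} | jq ."
```

A reviewer could record the seed in the PR thread so that another maintainer can reproduce the same sample.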
Resubmission policy
- A new submission with a fresh date prefix is always acceptable, even if the slug is identical to an existing submission.
- Old submissions are kept by default for historical record. If you want an old run removed, say so in the PR body of your new submission.
PR scope
Submission PRs must touch only files under benchmarks/*/submissions/*/. Changes to schemas, READMEs, workflows, or other benchmarks require a separate PR (or the maintainer-applied meta: label, which bypasses the CI scope check).
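The scope rule amounts to a path-prefix test on every changed file. As an illustration only (the real CI check is not shown in this page, and the meta: label bypass is not modelled), a sketch with a hypothetical `out_of_scope` function:

```python
from pathlib import PurePosixPath
from typing import List


def out_of_scope(changed_paths: List[str]) -> List[str]:
    """Return the changed paths that fall outside benchmarks/*/submissions/*/."""
    offenders = []
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        # In scope: benchmarks/<bench>/submissions/<dir>/<file...>, i.e. at least 5 parts.
        in_scope = (
            len(parts) >= 5
            and parts[0] == "benchmarks"
            and parts[2] == "submissions"
        )
        if not in_scope:
            offenders.append(raw)
    return offenders
```

The input could come from `git diff --name-only` against the base branch. An empty result means the PR touches only submission files.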
Inspecting submissions
Submission directories are plain files. Click into any one on GitHub to see the manifest, headline metrics (in the auto-generated README), and the per-trial tree. Trajectories are zstd-compressed JSONL:
zstdcat benchmarks/chi-bench/submissions/<dir>/trials/<domain>/<trial_id>/agent/trajectory.jsonl.zst | jq .
Authoritative docs
- leaderboard / README — the user-facing submission workflow.
- leaderboard / CONTRIBUTING — what CI catches, reviewer protocol, resubmission policy.
- chi-bench / docs/submission-packet.md — packet contract for benchmark authors building their own producers.
- Or jump to the in-app Submit page for the same flow with collapsible step UI.