Installation
One-time setup: clone the repo, install Python deps, fetch the dataset, set API keys, and build the Docker image.
Prerequisites
- Python 3.12+
- Docker (for the trial sandbox)
- uv — Python package manager
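A quick way to confirm the tools above are on your PATH is sketched below. The `check_tool` helper is a hypothetical name used only for illustration, and the check does not verify that Python is 3.12 or newer.

```shell
# Hypothetical helper: report whether a prerequisite is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

# Presence only; versions are not checked here.
check_tool python3
check_tool docker
check_tool uv
```

Any line starting with `missing:` points at a tool to install before continuing.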
1. Clone and install
git clone https://github.com/actava-ai/chi-bench && cd chi-bench
uv sync --extra dev
2. API keys
Copy .env.example to .env and fill in keys for the providers you intend to run. ANTHROPIC_API_KEY is always required — the workspace judge is pinned to claude-opus-4-7.
- ANTHROPIC_API_KEY: required (judge + Claude Code harness default).
- OPENAI_API_KEY: required for Codex and OAI Agents rows.
- GEMINI_API_KEY: required for Gemini CLI rows.
- OPENROUTER_API_KEY: required for the open-stack rows on open-weight models.
- CLAUDE_CODE_OAUTH_TOKEN: optional, cheaper for smoke-testing the Claude Code harness.
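Before a run, it can help to confirm the keys you need are actually set. The sketch below assumes .env contains only plain KEY=value lines that a POSIX shell can source. `missing_keys` is a hypothetical helper, not part of the chi-bench CLI.

```shell
# Hypothetical helper: print the names of variables that are unset or empty.
missing_keys() {
  missing=""
  for name in "$@"; do
    # Indirect lookup in POSIX sh: expand the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  # Strip the leading space before printing.
  echo "${missing# }"
}

# Load .env into the environment if present (assumes simple KEY=value lines).
if [ -f .env ]; then
  set -a
  . ./.env
  set +a
fi

# ANTHROPIC_API_KEY is always required; add provider keys for the rows you run.
missing_keys ANTHROPIC_API_KEY
```

Empty output means every named key is set. Otherwise the missing names are printed on one line.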
3. Task fixtures from Hugging Face
The dataset is gated. Authenticate once with the CLI, then download a pinned revision:
uv run huggingface-cli login
REV=chi-bench-v1.0.0
uv run huggingface-cli download actava/chi-bench \
--repo-type dataset --revision "$REV" --local-dir data/
echo "$REV" > data/.chi-bench-version
The pin in data/.chi-bench-version is what submission preflight verifies against your config's dataset.version. Always rewrite it when changing revisions.
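A stale pin is easy to miss after switching revisions. As a minimal sketch, the hypothetical `pin_matches` helper below compares the pin file against the revision you expect. It does not read your config's dataset.version, whose file format is not described here.

```shell
# Hypothetical helper: succeed only if the pin file records the expected revision.
pin_matches() {
  pin_file="$1"
  expected="$2"
  [ -f "$pin_file" ] && [ "$(cat "$pin_file")" = "$expected" ]
}

REV=chi-bench-v1.0.0
if pin_matches data/.chi-bench-version "$REV"; then
  echo "pin OK: $REV"
else
  echo "pin missing or stale; rewrite data/.chi-bench-version" >&2
fi
```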
4. Managed-Care Operations Handbook
The handbook (1,279 markdown documents) is hosted outside Hugging Face because of its size and the provenance of its clinical-collaborator curation. Download the tarball from the share URL in your invitation email, then extract:
mkdir -p data/skills
tar -xzf managed-care-operations-handbook.tar.gz -C data/skills/
5. Build the Docker image
~5 min, one-time. The image bundles the FastAPI server, the workspace judge, the agent harness, and per-task fixtures.
uv run cb docker build
cb is the short alias for chi-bench; both commands resolve to the same CLI. If your shell already aliases cb, use chi-bench.
Verify setup
uv run cb data verify
A clean run means you're ready for the quickstart.
Optional: Modal for parallel execution
Modal parallelizes trials across remote sandboxes — strongly recommended for matrix runs.
# default profile, or:
uv run modal setup
# named profile:
uv run modal token set --profile chi-bench
If you use a named profile, export MODAL_PROFILE=chi-bench in your shell before running the matrix.
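For the current shell session, the export described above might look like the sketch below. To make it persist, the same line could go in your shell startup file.

```shell
# Select the named Modal profile for this shell session.
export MODAL_PROFILE=chi-bench

# Confirm the value that later commands will see.
echo "MODAL_PROFILE=$MODAL_PROFILE"
```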