Bring your own agent
The same submission flow works for built-in and custom agents — only the harness wiring differs.
Built-in agents
| --agent | Example --model | Paper rows |
|---|---|---|
| claude-code | anthropic/claude-opus-4-7 | Claude Code |
| codex | openai/gpt-5.5 | Codex |
| gemini-cli | gemini/gemini-3-pro-preview | Gemini CLI |
| openclaw | anthropic/claude-opus-4-7 | OpenClaw |
| hermes | openrouter/z-ai/glm-5.1 | Hermes |
| openai-agents | deepseek/deepseek-v4-pro | OAI Agents |
| deepagents | openrouter/x-ai/grok-4.3 | DeepAgents |
The full 30-row matrix lives in configs/experiments/table1_main_matrix.yaml.
What an agent harness needs to provide
A harness is a Python class under src/chi_bench/experiment/agents/ that implements three things:
- A constructor that receives the per-task instruction.md, the role-scoped MCP server URL, the model identifier, and any provider credentials.
- An async run() that drives the agent loop until it terminates (success, failure, or budget exhausted).
- A trajectory writer that emits one JSONL record per step into agent/trajectory.jsonl, following the ATIF schema (Agent Trajectory Interchange Format).
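The three responsibilities above can be sketched roughly as follows. This is a minimal, standalone illustration rather than the project's actual base class: `EchoHarness`, the `out_dir` argument, the `write_step` helper, and the record fields (`step`, `model`, `note`) are all hypothetical, and the real ATIF schema defines its own fields. A real harness would subclass AgentHarness instead.

```python
import asyncio
import json
import tempfile
from pathlib import Path


class EchoHarness:
    """Hypothetical stand-in; a real harness would subclass AgentHarness."""

    def __init__(self, instruction: str, mcp_url: str, model: str,
                 credentials: dict, out_dir: Path) -> None:
        # The four inputs named in the text, plus an assumed output directory.
        self.instruction = instruction
        self.mcp_url = mcp_url
        self.model = model
        self.credentials = credentials
        self.trajectory_path = out_dir / "agent" / "trajectory.jsonl"
        self.trajectory_path.parent.mkdir(parents=True, exist_ok=True)

    def write_step(self, step: int, payload: dict) -> None:
        # One JSON object per line (JSONL). Field names are placeholders,
        # NOT the real ATIF schema.
        record = {"step": step, **payload}
        with self.trajectory_path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")

    async def run(self, max_steps: int = 3) -> str:
        # Drive the loop until the (assumed) step budget is exhausted.
        for step in range(max_steps):
            await asyncio.sleep(0)  # placeholder for a model or tool call
            self.write_step(step, {"model": self.model, "note": "placeholder"})
        return "budget_exhausted"


def demo() -> tuple:
    """Run the sketch in a temporary directory and read back the trajectory."""
    with tempfile.TemporaryDirectory() as tmp:
        harness = EchoHarness(
            instruction="contents of instruction.md",
            mcp_url="http://localhost:0/mcp",  # dummy URL, never contacted
            model="anthropic/claude-opus-4-7",
            credentials={},
            out_dir=Path(tmp),
        )
        status = asyncio.run(harness.run())
        lines = harness.trajectory_path.read_text(encoding="utf-8").splitlines()
        return status, [json.loads(line) for line in lines]
```

Appending one line per step, rather than writing the file once at the end, means a partial trajectory survives if the agent loop crashes midway.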
Wiring a new harness
- Create src/chi_bench/experiment/agents/<your_agent>.py, subclassing the base AgentHarness. Point it at the MCP server using the URL passed in via the constructor.
- Register it in src/chi_bench/experiment/agents/__init__.py so that --agent <your_agent> resolves it on the CLI.
- Smoke-test with cb experiment run --agent <your_agent> --model ... on a single task; confirm that verifier/scorecard.json can be read.
- Run a full submission via cb submission run -f configs/submissions/<id>.yaml.
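How the registration step resolves a name to a class is not shown here, so the following is only one plausible shape for it: a name-to-class registry filled in by a decorator. `AGENT_REGISTRY`, `register_agent`, `resolve_agent`, and `MyAgentHarness` are all hypothetical names; check agents/__init__.py for the mechanism the project actually uses.

```python
from typing import Callable, Dict

# Hypothetical registry mapping the --agent value to a harness class.
AGENT_REGISTRY: Dict[str, type] = {}


def register_agent(name: str) -> Callable[[type], type]:
    """Class decorator that records a harness under its CLI name."""
    def decorator(cls: type) -> type:
        if name in AGENT_REGISTRY:
            raise ValueError(f"agent {name!r} is already registered")
        AGENT_REGISTRY[name] = cls
        return cls
    return decorator


def resolve_agent(name: str) -> type:
    """Look up a harness class, listing the known names on failure."""
    try:
        return AGENT_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(AGENT_REGISTRY)) or "<none>"
        raise ValueError(f"unknown --agent {name!r}; known agents: {known}") from None


@register_agent("my-agent")
class MyAgentHarness:
    """Placeholder; the real class would subclass AgentHarness."""
```

Failing loudly on a duplicate or unknown name makes a typo in --agent show up at startup instead of partway through a run.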
Custom model endpoints
Most harnesses route through provider SDKs (Anthropic, OpenAI, Google, OpenRouter) keyed by the --model prefix. To add a new provider:
- Add a model-resolver entry that maps <provider>/<model-id> to a client construction.
- Add the provider's API key handling to .env.example and document it in the README.
- If the endpoint is OpenAI-compatible, you can usually reuse the existing codex or openai-agents harnesses by setting OPENAI_BASE_URL appropriately. The judge subprocess is the exception: it always uses the real Anthropic API.
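A model-resolver entry of the kind described above might look like the sketch below. The names `ClientConfig`, `PROVIDER_KEY_ENV`, and `resolve_model` are hypothetical, as are the environment-variable names and the `myprovider` entry; the project's real resolver may be structured differently. The one detail worth noting is that the provider is everything before the first slash, since model ids such as openrouter/z-ai/glm-5.1 contain further slashes.

```python
from dataclasses import dataclass

# Assumed mapping from --model prefix to the env var holding its API key.
PROVIDER_KEY_ENV = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "myprovider": "MYPROVIDER_API_KEY",  # hypothetical new provider
}


@dataclass(frozen=True)
class ClientConfig:
    provider: str
    model_id: str
    api_key_env: str


def resolve_model(spec: str) -> ClientConfig:
    """Split '<provider>/<model-id>' on the FIRST slash only."""
    provider, sep, model_id = spec.partition("/")
    if not sep or not provider or not model_id:
        raise ValueError(f"expected '<provider>/<model-id>', got {spec!r}")
    if provider not in PROVIDER_KEY_ENV:
        raise ValueError(f"unknown provider prefix {provider!r}")
    return ClientConfig(provider, model_id, PROVIDER_KEY_ENV[provider])
```

For example, `resolve_model("openrouter/z-ai/glm-5.1")` yields provider `openrouter` and model id `z-ai/glm-5.1`, which a client factory could then use to build the actual SDK client.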
Authoritative docs
- docs/extending.md — full walkthrough with code examples.
- docs/cli.md — every CLI flag and exit-code convention.
Up next: Submit your agent to the leaderboard
The packet shape is identical regardless of whether you submit a built-in agent or a custom one. The leaderboard PR flow doesn't care how the trials were produced, only that the manifest, results CSV, and per-trial evidence pass the validator.