This directory drives a side-by-side comparison of clawcodex (this repo) and openclaude (the TypeScript reference) against SWE-bench. The goal is to confirm parity: when both agents are pointed at the same backing model, the same dataset, and the same prompts, do they resolve the same set of GitHub issues?
The agent-side wrappers and prediction generation live with the SWE-bench
harness in SWE-bench-dev/scripts/. The comparison logic and the driver that
ties everything together live here.

    SWE-bench-dev/scripts/              # in the SWE-bench-dev repo
    ├── clawcodex_api_server.py         # FastAPI wrapper around `clawcodex -p ...`
    ├── openclaude_api_server.py        # FastAPI wrapper around `node dist/cli.mjs -p ...`
    └── run_custom_api.py               # generic dataset → HTTP → predictions.jsonl

    eval/                               # in this repo (committed on feat/eval)
    ├── run_compare.py                  # one-command driver: prepare → run → compare
    ├── compare_results.py              # standalone summary diff
    ├── README.md                       # this file
    └── runs/                           # gitignored output (per-run)
        └── compare-YYYYMMDD-HHMMSS/
            ├── clawcodex_preds.jsonl
            ├── openclaude_preds.jsonl
            ├── clawcodex_server.log
            ├── openclaude_server.log
            ├── clawcodex_harness.log
            ├── openclaude_harness.log
            ├── only_clawcodex.txt      # instance ids only clawcodex solved
            ├── only_openclaude.txt     # instance ids only openclaude solved
            ├── both_solved.txt
            └── comparison.md           # the headline report

- Sibling repos. Clone `openclaude` and `SWE-bench-dev` next to `clawcodex`:

      git clone https://github.com/Gitlawb/openclaude.git
      git clone https://github.com/swe-bench/SWE-bench.git SWE-bench-dev

  (Or set `OPENCLAUDE_REPO` / `SWEBENCH_REPO` to wherever you keep them.)

- Docker is installed and `docker ps` works.

- SWE-bench venv (per `SWE-bench-dev/clawcodex_test.md` §2.1):

      cd SWE-bench-dev
      python3 -m venv .venv && source .venv/bin/activate
      pip install -U pip
      pip install -e .
      pip install fastapi uvicorn tiktoken transformers

  Point the driver at this interpreter via `SWEBENCH_PYTHON`:

      export SWEBENCH_PYTHON=/abs/path/to/SWE-bench-dev/.venv/bin/python

- clawcodex configured for the model you want to compare (`clawcodex login`).

- openclaude provider env for the same backing model. For `gpt-4o`:

      export CLAUDE_CODE_USE_OPENAI=1
      export OPENAI_API_KEY=sk-...
      export OPENAI_MODEL=gpt-4o

  For DeepSeek through the OpenAI-compatible path:

      export CLAUDE_CODE_USE_OPENAI=1
      export OPENAI_API_KEY=sk-deepseek-...
      export OPENAI_BASE_URL=https://api.deepseek.com/v1   # the driver also sets this
      export OPENAI_MODEL=deepseek-v4-pro

  The driver's `--provider deepseek` preset will pass `OPENAI_BASE_URL` and `OPENAI_MODEL` per request, so the env exports for those two are optional; `OPENAI_API_KEY` always travels via the environment.

`prepare` runs `python -m swebench.inference.make_datasets.create_text_dataset`, so the
interpreter chosen by `SWEBENCH_PYTHON` (or your default `python3` / `python`) must
have SWE-bench installed. Typical setups:

Option A: install into the clawcodex venv (one interpreter for everything):

    cd clawcodex
    # Prefer the venv binary so you are not using the Windows Store `python3` shim.
    uv pip install -e ./SWE-bench-dev fastapi uvicorn tiktoken transformers
    # Optional: `run_compare` defaults to sys.executable, so after `activate` you
    # can omit SWEBENCH_PYTHON when you launch with that same `python`.

If you see `cannot import swebench` and the error mentions
`...\WindowsApps\python3.EXE`, you are running the Windows Store stub, which
never had swebench installed into it. Use
`.venv/Scripts/python.exe eval/run_compare.py ...` or set
`export SWEBENCH_PYTHON="$PWD/.venv/Scripts/python.exe"`.

Option B: separate SWE-bench venv (matches `clawcodex_test.md`):

    cd SWE-bench-dev
    python -m venv .venv
    # Windows Git Bash:
    source .venv/Scripts/activate
    pip install -U pip
    pip install -e .
    pip install fastapi uvicorn tiktoken transformers
    export SWEBENCH_PYTHON="$PWD/.venv/Scripts/python.exe"

Without this, `prepare` will fail with `No module named 'swebench'`.

OpenClaude’s build script expects Bun. Either install it, or build only the SWE-bench dataset for now:

    python eval/run_compare.py prepare --skip-openclaude-build

Then install Bun and run `bun install && bun run build` inside `openclaude/`, or
re-run `prepare` without `--skip-openclaude-build` once `bun` is on your `PATH`.

On Windows, Git Bash sometimes does not see Bun until you add
`~/.bun/bin` (or the path the installer prints) to `PATH`.

On Chinese (and some other) Windows locales, the default text encoding is GBK.
SWE-bench’s dataset builder opened cloned source files with the system default
encoding, which breaks on UTF-8 files. This repo’s `SWE-bench-dev` fork reads
those paths as UTF-8. Re-run `prepare` after pulling the latest `create_instance.py` changes.
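
A minimal sketch of the underlying issue, not the fork's actual patch: reading text without an explicit `encoding=` uses the locale's preferred encoding, so pinning it to UTF-8 makes the read locale-independent. The helper name `read_source_text` is hypothetical, and `errors="replace"` is an assumption about the desired behaviour.

```python
from pathlib import Path


def read_source_text(path: Path) -> str:
    """Read a source file as UTF-8 regardless of the system locale.

    Hypothetical helper for illustration; not the code in create_instance.py.
    """
    # errors="replace" (an assumption) substitutes U+FFFD for undecodable
    # bytes instead of raising UnicodeDecodeError.
    return path.read_text(encoding="utf-8", errors="replace")
```

With this in place, a file containing non-ASCII characters decodes the same way on a GBK-default Windows machine as on Linux.
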

The stdlib `resource` module exists only on Unix. Upstream SWE-bench imported it
unconditionally in `prepare_images.py`, which breaks `import swebench` on Windows.
This repo’s `SWE-bench-dev` fork patches that import so dataset prep and imports
work on Windows. Docker-based evaluation still requires Docker Desktop; if you hit
other POSIX-only code paths, use WSL2 for the harness.
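
One common way to make such an import safe is a guarded import. This is a sketch of the general pattern, not a copy of the fork's patch; the function `open_file_limit` is hypothetical and exists only to show how callers handle the missing module.

```python
from typing import Optional

try:
    import resource  # Unix-only stdlib module
except ImportError:  # raised on Windows, where the module does not exist
    resource = None


def open_file_limit() -> Optional[int]:
    """Return the soft limit on open files, or None where `resource` is unavailable."""
    if resource is None:
        return None
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft
```

Code that needs the module checks for `None` first, so importing the package no longer fails on Windows.
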

    # 1. Build openclaude and the SWE-bench text dataset (only needed once).
    python eval/run_compare.py prepare

    # 2. Smoke run: 1 known instance, both agents, gpt-4o, full Docker harness.
    python eval/run_compare.py run --scope smoke

    # 3. Open the report.
    ls eval/runs/                        # find the latest compare-* directory
    cat eval/runs/compare-*/comparison.md

A smoke run takes a few minutes per instance. Scaling up:

    # Pick your own instances:
    python eval/run_compare.py run \
        --scope instances \
        --instance-ids astropy__astropy-12907,django__django-11099

    # Or the full 300-instance Lite split (takes hours, costs real money):
    python eval/run_compare.py run --scope all

The `--provider` preset sets the model and per-agent provider routing in one
flag. Available presets:

| Preset | Model | clawcodex routing | openclaude routing |
|---|---|---|---|
| `openai` (default) | `gpt-4o` | `--provider openai` | OpenAI native |
| `deepseek` | `deepseek-v4-pro` | `--provider deepseek` | OpenAI-compatible (`https://api.deepseek.com/v1`) |
| `anthropic` | `claude-sonnet-4-6` | `--provider anthropic` | Anthropic native |
| `glm` | `zai/glm-5` | `--provider glm` | OpenAI-compatible (`https://open.bigmodel.cn/api/paas/v4`) |

Examples:

    # DeepSeek v4-pro on both:
    python eval/run_compare.py run --scope smoke --provider deepseek

    # DeepSeek but a different model name:
    python eval/run_compare.py run --scope smoke --provider deepseek --model deepseek-coder-v4

    # Custom OpenAI-compatible endpoint:
    python eval/run_compare.py run --scope smoke \
        --provider openai --model my-finetune \
        --openclaude-base-url https://my-gateway.example.com/v1

    # Run only one of the two agents (sometimes useful when iterating):
    python eval/run_compare.py run --agents openclaude --provider deepseek

Per-field overrides (`--model`, `--clawcodex-provider`,
`--openclaude-provider`, `--openclaude-base-url`) layer on top of the preset.
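
The layering can be pictured as "start from the preset, then replace only the fields the user supplied". The sketch below is an illustration under that assumption, not the driver's code: the `Preset` class, the `PRESETS` dict and `resolve` are hypothetical names, and only the values are taken from the preset table.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Preset:
    model: str
    clawcodex_provider: str
    # None stands for "use the provider's native endpoint" (an assumption).
    openclaude_base_url: Optional[str] = None


# Values copied from the preset table; the structure itself is hypothetical.
PRESETS = {
    "openai": Preset("gpt-4o", "openai"),
    "deepseek": Preset("deepseek-v4-pro", "deepseek", "https://api.deepseek.com/v1"),
    "anthropic": Preset("claude-sonnet-4-6", "anthropic"),
    "glm": Preset("zai/glm-5", "glm", "https://open.bigmodel.cn/api/paas/v4"),
}


def resolve(provider: str, **overrides: Optional[str]) -> Preset:
    """Start from the preset, then apply only the overrides that were given."""
    given = {key: value for key, value in overrides.items() if value is not None}
    return replace(PRESETS[provider], **given)
```

For example, `resolve("deepseek", model="deepseek-coder-v4")` keeps the DeepSeek base URL and changes only the model name, which mirrors the second command-line example.
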

For each agent in `--agents` (default `clawcodex,openclaude`):

- Spawn its API server (`uvicorn scripts.<agent>_api_server:app`) on its port (8000 for clawcodex, 8001 for openclaude). Logs go to `eval/runs/<id>/<agent>_server.log`.
- Wait for `/health` to come up. Falls back to a TCP-only liveness check if the wrapper doesn't expose `/health` yet.
- Generate predictions by invoking `SWE-bench-dev/scripts/run_custom_api.py` against the local `/generate` endpoint. Writes `<agent>_preds.jsonl`.
- Stop the server.
- Run the Docker harness (`swebench.harness.run_evaluation`) on those predictions. Writes the harness summary into the SWE-bench repo as `<agent>-local.<run-id>.json`.

After both agents are done:

- Diff the two summary jsons via `compare_results.py` and write `comparison.md` plus `only_<agent>.txt` triage lists.
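
The health wait with its TCP fallback could look roughly like the sketch below. It is not the driver's implementation: the function names are hypothetical, and treating an HTTP 404 as "the wrapper has no `/health` route" is an assumption made for illustration.

```python
import http.client
import socket
import time
import urllib.error
import urllib.request

# Local servers only, so bypass any HTTP proxy configured in the environment.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if /health answers 200, or if /health is missing (404) but TCP connects."""
    try:
        with _OPENER.open(f"http://{host}:{port}/health", timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as exc:
        if exc.code != 404:
            return False  # /health exists but reports a problem
    except (OSError, http.client.HTTPException):
        return False  # nothing usable is listening yet
    # Only a 404 reaches this point: fall back to a bare TCP connect.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_up(host: str, port: int, deadline_s: float = 60.0,
                  interval_s: float = 0.5) -> bool:
    """Poll is_up() until it succeeds or the deadline passes."""
    deadline = time.monotonic() + deadline_s
    while time.monotonic() < deadline:
        if is_up(host, port):
            return True
        time.sleep(interval_s)
    return False
```

A caller would use something like `wait_until_up("127.0.0.1", 8000)` before sending the first `/generate` request, and give up on the agent if it returns `False`.
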

If you already have two harness summary jsons lying around:

    python eval/compare_results.py \
        --left  /path/to/clawcodex-local.run-001.json  --left-label  clawcodex \
        --right /path/to/openclaude-local.run-001.json --right-label openclaude \
        --out   eval/runs/manual/comparison.md

`run_compare.py compare` is a thin alias for the same call.
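
At its core the comparison is a set difference over resolved instance ids. The sketch below assumes each summary JSON lists them under a `resolved_ids` key; that key name is an assumption here, and `SWE-bench-dev/swebench/harness/reporting.py` is the authority on the real shape. The function names are hypothetical and this is not the code in `compare_results.py`.

```python
import json
from pathlib import Path
from typing import Dict, List, Set


def load_resolved_ids(summary_path: Path) -> Set[str]:
    """Load the ids a run resolved (assumes a top-level `resolved_ids` list)."""
    data = json.loads(summary_path.read_text(encoding="utf-8"))
    return set(data.get("resolved_ids", []))


def diff_resolved(left: Set[str], right: Set[str]) -> Dict[str, List[str]]:
    """Split the ids into solved-by-both and solved-by-one-side-only."""
    return {
        "both_solved": sorted(left & right),
        "only_left": sorted(left - right),
        "only_right": sorted(right - left),
    }
```

The three lists correspond to `both_solved.txt` and the two `only_<agent>.txt` files in a run directory.
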

| Flag | Default | Notes |
|---|---|---|
| `--scope` | `smoke` | `smoke` / `instances` / `all` (full split) |
| `--provider` | `openai` | Preset: `openai` / `deepseek` / `anthropic` / `glm` |
| `--model` | (preset's default) | Override the preset's model name |
| `--clawcodex-provider` | (preset) | Override clawcodex `--provider` flag |
| `--openclaude-provider` | (preset) | Override openclaude routing hint |
| `--openclaude-base-url` | (preset) | Override `OPENAI_BASE_URL` for openclaude |
| `--max-turns` | `30` | Per-instance agent turn cap |
| `--request-timeout` | `1800` | HTTP timeout per instance, seconds |
| `--max-patch-retries` | `2` | Re-prompt when extracted diff is invalid |
| `--max-workers` | `1` | Docker harness parallelism (raise carefully) |
| `--skip-harness` | off | Generate predictions only; useful when iterating on prompts |
| `--agents` | `clawcodex,openclaude` | Drop one to run only the other |

- `Text dataset not found`: you skipped `prepare`, or it never finished. Run `python eval/run_compare.py prepare` (from the clawcodex repo) after installing `swebench` into the interpreter `SWEBENCH_PYTHON` points at (see the notes above on a Python that can import `swebench`). On Windows Git Bash you can use `.venv/Scripts/python.exe eval/run_compare.py prepare`.
- `text dataset not found at .../datasets/SWE-bench__SWE-bench_Lite__style-3__fs-oracle`: same as above; run `prepare` first, or pass `--dataset-local` if you keep the dataset elsewhere.
- `OPENCLAUDE_REPO set but dist/cli.mjs not found`: `prepare` should build it for you. If it doesn't, `cd openclaude && bun install && bun run build`.
- `No module named 'src'` from clawcodex: the wrapper's fallback path needs `CLAWCODEX_REPO` set. The driver passes it automatically; if you're running the wrapper by hand, follow `SWE-bench-dev/clawcodex_test.md` §2.2.
- One agent blew up but the other was fine: the run still produces a half-finished comparison. Check `eval/runs/<id>/<agent>_server.log` and `<agent>_harness.log`.
- Patch apply errors (`Only garbage was found in the patch input`): usually the model returned prose instead of a unified diff. The patch-retry loop in `run_custom_api.py` already handles this; if it persists, lower `--max-turns` or pick a model with stronger tool/diff fidelity.
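
To tell prose from a patch before handing it to the harness, a cheap structural check is often enough. The sketch below is an illustration only and makes no claim about how the retry loop in `run_custom_api.py` validates output; the function name and the exact patterns are assumptions.

```python
import re

# A hunk header such as "@@ -12,7 +12,8 @@" (line counts are optional).
_HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@", re.MULTILINE)
_OLD_FILE = re.compile(r"^--- \S", re.MULTILINE)
_NEW_FILE = re.compile(r"^\+\+\+ \S", re.MULTILINE)


def looks_like_unified_diff(text: str) -> bool:
    """Cheap check: both file headers plus at least one hunk header are present."""
    return (
        _OLD_FILE.search(text) is not None
        and _NEW_FILE.search(text) is not None
        and _HUNK_HEADER.search(text) is not None
    )
```

A check like this catches the "model answered in prose" case early, though passing it does not guarantee that the patch applies cleanly.
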

- `SWE-bench-dev/clawcodex_test.md`: manual reproduction of every step the driver automates, plus the original error-recovery cookbook.
- `SWE-bench-dev/swebench/harness/reporting.py`: defines the summary JSON shape that `compare_results.py` reads.