Eval — clawcodex vs openclaude on SWE-bench

This directory drives a side-by-side comparison of clawcodex (this repo) and openclaude (the TypeScript reference) against SWE-bench. The goal is to confirm parity: when both agents are pointed at the same backing model, the same dataset, and the same prompts, do they resolve the same set of GitHub issues?

The agent-side wrappers and prediction generation live with the SWE-bench harness in SWE-bench-dev/scripts/. The comparison logic and the driver that ties everything together live here.

Layout

SWE-bench-dev/scripts/                   # in the SWE-bench-dev repo
├── clawcodex_api_server.py              # FastAPI wrapper around `clawcodex -p ...`
├── openclaude_api_server.py             # FastAPI wrapper around `node dist/cli.mjs -p ...`
└── run_custom_api.py                    # generic dataset → HTTP → predictions.jsonl

eval/                                    # in this repo (committed on feat/eval)
├── run_compare.py                       # one-command driver: prepare → run → compare
├── compare_results.py                   # standalone summary diff
├── README.md                            # this file
└── runs/                                # gitignored output (per-run)
    └── compare-YYYYMMDD-HHMMSS/
        ├── clawcodex_preds.jsonl
        ├── openclaude_preds.jsonl
        ├── clawcodex_server.log
        ├── openclaude_server.log
        ├── clawcodex_harness.log
        ├── openclaude_harness.log
        ├── only_clawcodex.txt           # instance ids only clawcodex solved
        ├── only_openclaude.txt          # instance ids only openclaude solved
        ├── both_solved.txt
        └── comparison.md                # the headline report
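
The <agent>_preds.jsonl files are JSON Lines, so a quick sanity check before the harness runs can save a Docker round trip. The sketch below assumes each line carries the standard SWE-bench prediction keys `instance_id` and `model_patch`; the helper name `summarize_preds` is hypothetical and not part of this repo.

```python
import json
from pathlib import Path


def summarize_preds(path):
    """Count predictions and list instance ids whose patch is empty.

    Assumes the standard SWE-bench prediction keys `instance_id` and
    `model_patch`; adjust if run_custom_api.py writes something different.
    """
    total, empty_ids = 0, []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        total += 1
        if not (record.get("model_patch") or "").strip():
            empty_ids.append(record.get("instance_id"))
    return {"total": total, "empty": empty_ids}
```

Point it at eval/runs/<id>/clawcodex_preds.jsonl after a `--skip-harness` run to see how many instances produced no patch at all.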

Prerequisites (one-time)

Sibling repos. Clone openclaude and SWE-bench-dev next to clawcodex:

git clone https://github.com/Gitlawb/openclaude.git
git clone https://github.com/swe-bench/SWE-bench.git SWE-bench-dev

(Or set OPENCLAUDE_REPO / SWEBENCH_REPO to wherever you keep them.)

Docker is installed and docker ps works.

SWE-bench venv (per SWE-bench-dev/clawcodex_test.md §2.1):

cd SWE-bench-dev
python3 -m venv .venv && source .venv/bin/activate
pip install -U pip
pip install -e .
pip install fastapi uvicorn tiktoken transformers

Point the driver at this interpreter via SWEBENCH_PYTHON:

export SWEBENCH_PYTHON=/abs/path/to/SWE-bench-dev/.venv/bin/python

clawcodex is configured for the model you want to compare (clawcodex login).

openclaude has the provider env for the same backing model. For gpt-4o:
```
export CLAUDE_CODE_USE_OPENAI=1
export OPENAI_API_KEY=sk-...
export OPENAI_MODEL=gpt-4o
```
For DeepSeek through the OpenAI-compatible path:
```
export CLAUDE_CODE_USE_OPENAI=1
export OPENAI_API_KEY=sk-deepseek-...
export OPENAI_BASE_URL=https://api.deepseek.com/v1   # the driver also sets this
export OPENAI_MODEL=deepseek-v4-pro
```
The driver's --provider deepseek preset will pass OPENAI_BASE_URL and OPENAI_MODEL per request, so the env exports for those two are optional; OPENAI_API_KEY always travels via the environment.
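
A preflight check along these lines can catch a missing key before any server is spawned. It is only a sketch: the variable names come from the exports above, but treating `CLAUDE_CODE_USE_OPENAI` and `OPENAI_API_KEY` as the required pair for both presets is an assumption, and `missing_env` is a hypothetical helper, not something the driver provides.

```python
import os

# Assumption: these are the variables openclaude needs from the environment.
# OPENAI_BASE_URL and OPENAI_MODEL are left out because the driver can pass
# them per request.
REQUIRED_ENV = {
    "openai": ["CLAUDE_CODE_USE_OPENAI", "OPENAI_API_KEY"],
    "deepseek": ["CLAUDE_CODE_USE_OPENAI", "OPENAI_API_KEY"],
}


def missing_env(provider, env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV.get(provider, []) if not env.get(name)]
```

An empty list means the environment looks complete for that preset; anything else names what still needs exporting.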

Python that can import `swebench`

prepare runs python -m swebench.inference.make_datasets.create_text_dataset, so the interpreter chosen by SWEBENCH_PYTHON (or your default python3 / python) must have SWE-bench installed. Typical setups:

Option A — install into the clawcodex venv (one interpreter for everything):

cd clawcodex
# Prefer the venv binary so you are not using the Windows Store `python3` shim.
uv pip install -e ./SWE-bench-dev fastapi uvicorn tiktoken transformers
# Optional: `run_compare` defaults to sys.executable, so after `activate` you
# can omit SWEBENCH_PYTHON when you launch with that same `python`.

If you see cannot import swebench and the error points at ...\WindowsApps\python3.EXE, the driver picked up the Windows Store stub, which does not have swebench installed. Launch with .venv/Scripts/python.exe eval/run_compare.py ... instead, or run export SWEBENCH_PYTHON="$PWD/.venv/Scripts/python.exe".

Option B — separate SWE-bench venv (matches clawcodex_test.md):

cd SWE-bench-dev
python -m venv .venv
# Windows Git Bash:
source .venv/Scripts/activate
pip install -U pip
pip install -e .
pip install fastapi uvicorn tiktoken transformers
export SWEBENCH_PYTHON="$PWD/.venv/Scripts/python.exe"

Without this, prepare will fail with No module named 'swebench'.
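
To confirm which interpreter can actually see the package, a check like the following can be run with the same binary SWEBENCH_PYTHON points at. It is a minimal sketch using only the standard library; `can_import` and `describe` are illustrative names.

```python
import importlib.util
import sys


def can_import(module_name):
    """True if this interpreter can locate the module, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def describe(module_name="swebench"):
    """One-line report naming the interpreter and whether the module is visible."""
    state = "found" if can_import(module_name) else "MISSING"
    return f"{sys.executable}: {module_name} {state}"
```

Calling `print(describe())` under each candidate interpreter shows at a glance whether you are on the venv binary or the Windows Store stub.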

`bun not found on PATH` during `prepare`

OpenClaude’s build script expects Bun. Either install it, or only build the SWE-bench dataset for now:

python eval/run_compare.py prepare --skip-openclaude-build

Then install Bun and run bun install && bun run build inside openclaude/, or re-run prepare without --skip-openclaude-build once bun is on your PATH. On Windows, Git Bash sometimes does not see Bun until you add ~/.bun/bin (or the path the installer prints) to PATH.

`UnicodeDecodeError: 'gbk' codec can't decode...` during `prepare`

On Chinese (and some other) Windows locales, the default text encoding is GBK. SWE-bench’s dataset builder was opening cloned source as “system default”, which breaks on UTF-8 files. This repo’s SWE-bench-dev fork reads those paths as UTF-8. Re-run prepare after pulling the latest create_instance.py changes.
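
The underlying fix is to pass an explicit encoding rather than rely on the locale default. The sketch below illustrates the idea only; it is not the fork's actual patch, `read_source` is a hypothetical name, and `errors="replace"` is an added assumption so that a stray non-UTF-8 byte does not abort the whole run.

```python
def read_source(path):
    """Read a source file as UTF-8 regardless of the system locale."""
    # Without encoding=..., open() uses the locale default, which is GBK on
    # some Windows installs and fails on UTF-8 bytes.
    with open(path, encoding="utf-8", errors="replace") as fh:
        return fh.read()
```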

Windows: `No module named 'resource'`

The stdlib resource module exists only on Unix. Upstream SWE-bench imported it unconditionally in prepare_images.py, which breaks import swebench on Windows. This repo’s SWE-bench-dev fork patches that import so dataset prep and imports work on Windows. Docker-based evaluation still requires Docker Desktop; if you hit other POSIX-only code paths, use WSL2 for the harness.
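
A guarded import is the usual way to keep a module importable on both platforms. The following is a sketch of that pattern, not the fork's actual change; `open_file_limit` is a hypothetical helper used here only to show a caller coping with the module being absent.

```python
try:
    import resource  # Unix-only stdlib module
except ImportError:
    resource = None  # Windows: callers must handle the missing module


def open_file_limit():
    """Return the soft open-file limit, or None where `resource` is unavailable."""
    if resource is None:
        return None
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft
```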

Quickest path to a result

# 1. Build openclaude and the SWE-bench text dataset (only needed once).
python eval/run_compare.py prepare

# 2. Smoke run: 1 known instance, both agents, gpt-4o, full Docker harness.
python eval/run_compare.py run --scope smoke

# 3. Open the report.
ls eval/runs/                      # find the latest compare-* directory
cat eval/runs/compare-*/comparison.md
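
If several runs have piled up, the newest one can be picked programmatically. Because run directories are named compare-YYYYMMDD-HHMMSS, sorting by name is also sorting by time. A minimal sketch (`latest_run` is a hypothetical helper, not part of the driver):

```python
from pathlib import Path


def latest_run(runs_dir="eval/runs"):
    """Return the newest compare-* directory, or None if there is none."""
    # Timestamped names sort lexicographically in chronological order.
    candidates = sorted(p for p in Path(runs_dir).glob("compare-*") if p.is_dir())
    return candidates[-1] if candidates else None
```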

A smoke run takes a few minutes per instance. Scaling up:

# Pick your own instances:
python eval/run_compare.py run \
    --scope instances \
    --instance-ids astropy__astropy-12907,django__django-11099

# Or the full 300-instance Lite split (takes hours, costs real money):
python eval/run_compare.py run --scope all

Picking the model for both agents

The --provider preset sets the model and per-agent provider routing in one flag. Available presets:

| Preset | Model | clawcodex routing | openclaude routing |
| --- | --- | --- | --- |
| `openai` (default) | `gpt-4o` | `--provider openai` | OpenAI native |
| `deepseek` | `deepseek-v4-pro` | `--provider deepseek` | OpenAI-compatible (`https://api.deepseek.com/v1`) |
| `anthropic` | `claude-sonnet-4-6` | `--provider anthropic` | Anthropic native |
| `glm` | `zai/glm-5` | `--provider glm` | OpenAI-compatible (`https://open.bigmodel.cn/api/paas/v4`) |

Examples:

# DeepSeek v4-pro on both:
python eval/run_compare.py run --scope smoke --provider deepseek

# DeepSeek but a different model name:
python eval/run_compare.py run --scope smoke --provider deepseek --model deepseek-coder-v4

# Custom OpenAI-compatible endpoint:
python eval/run_compare.py run --scope smoke \
    --provider openai --model my-finetune \
    --openclaude-base-url https://my-gateway.example.com/v1

# Run only one of the two agents (sometimes useful when iterating):
python eval/run_compare.py run --agents openclaude --provider deepseek

Per-field overrides (--model, --clawcodex-provider, --openclaude-provider, --openclaude-base-url) layer on top of the preset.
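
Conceptually, the layering is a dictionary merge: start from the preset and let any explicitly passed flag win. The sketch below mirrors the preset table, but the dictionary keys and the `resolve_settings` function are illustrative assumptions, not the driver's real internals.

```python
# Values copied from the preset table; key names are hypothetical.
PRESETS = {
    "openai": {"model": "gpt-4o", "clawcodex_provider": "openai",
               "openclaude_base_url": None},
    "deepseek": {"model": "deepseek-v4-pro", "clawcodex_provider": "deepseek",
                 "openclaude_base_url": "https://api.deepseek.com/v1"},
    "anthropic": {"model": "claude-sonnet-4-6", "clawcodex_provider": "anthropic",
                  "openclaude_base_url": None},
    "glm": {"model": "zai/glm-5", "clawcodex_provider": "glm",
            "openclaude_base_url": "https://open.bigmodel.cn/api/paas/v4"},
}


def resolve_settings(provider="openai", **overrides):
    """Start from the preset, then apply every override that was actually given."""
    settings = dict(PRESETS[provider])
    settings.update({k: v for k, v in overrides.items() if v is not None})
    return settings
```

For example, `resolve_settings("deepseek", model="deepseek-coder-v4")` keeps the DeepSeek base URL and swaps only the model name.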

What `run` actually does (sequentially per agent)

For each agent in --agents (default clawcodex,openclaude):

1. Spawn its API server (uvicorn scripts.<agent>_api_server:app) on its port (8000 for clawcodex, 8001 for openclaude). Logs go to eval/runs/<id>/<agent>_server.log.
2. Wait for /health to come up. The driver falls back to a TCP-only liveness check if the wrapper doesn't expose /health yet.
3. Generate predictions by invoking SWE-bench-dev/scripts/run_custom_api.py against the local /generate endpoint. Writes <agent>_preds.jsonl.
4. Stop the server.
5. Run the Docker harness (swebench.harness.run_evaluation) on those predictions. Writes the harness summary into the SWE-bench repo as <agent>-local.<run-id>.json.
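
The readiness wait in the second step can be sketched roughly as follows. This is an illustration under assumptions, not the driver's code: the function name, the timeouts, and the choice to try TCP in the same polling iteration are all hypothetical.

```python
import http.client
import socket
import time
import urllib.error
import urllib.request

# Bypass any configured HTTP proxy: the servers are local.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_server(host, port, timeout=60.0, interval=0.5, probe_timeout=2.0):
    """Return True once /health answers 200 or the port accepts a TCP connection."""
    deadline = time.monotonic() + timeout
    url = f"http://{host}:{port}/health"
    while time.monotonic() < deadline:
        try:
            with _OPENER.open(url, timeout=probe_timeout) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            pass  # no /health route yet, or not HTTP at all: try plain TCP
        try:
            with socket.create_connection((host, port), timeout=probe_timeout):
                return True
        except OSError:
            time.sleep(interval)  # nothing is listening yet
    return False
```

With the ports listed above, `wait_for_server("127.0.0.1", 8000)` would cover clawcodex and port 8001 would cover openclaude.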

After both agents are done:

Diff the two summary jsons via compare_results.py and write comparison.md plus only_<agent>.txt triage lists.

Just compare two existing runs

If you already have two harness summary jsons lying around:

python eval/compare_results.py \
    --left  /path/to/clawcodex-local.run-001.json  --left-label  clawcodex \
    --right /path/to/openclaude-local.run-001.json --right-label openclaude \
    --out   eval/runs/manual/comparison.md

run_compare.py compare is a thin alias for the same call.
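
At its core the comparison is set arithmetic over the resolved instance ids in each summary. The sketch below assumes the harness summary exposes a `resolved_ids` list, as recent SWE-bench reports do; check your json if the key differs. `load_resolved` and `diff_resolved` are illustrative names and do not describe how compare_results.py is actually written.

```python
import json
from pathlib import Path


def load_resolved(summary_path):
    """Read the set of resolved instance ids from a harness summary json."""
    data = json.loads(Path(summary_path).read_text(encoding="utf-8"))
    return set(data.get("resolved_ids", []))  # assumed key name


def diff_resolved(left, right):
    """Split two resolved-id sets into shared and one-sided wins."""
    return {
        "both": sorted(left & right),
        "only_left": sorted(left - right),
        "only_right": sorted(right - left),
    }
```

The three lists correspond to both_solved.txt and the two only_<agent>.txt triage files.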

Tunables worth knowing

| Flag | Default | Notes |
| --- | --- | --- |
| `--scope` | `smoke` | `smoke` / `instances` / `all` (full split) |
| `--provider` | `openai` | Preset: `openai` / `deepseek` / `anthropic` / `glm` |
| `--model` | (preset's default) | Override the preset's model name |
| `--clawcodex-provider` | (preset) | Override clawcodex `--provider` flag |
| `--openclaude-provider` | (preset) | Override openclaude routing hint |
| `--openclaude-base-url` | (preset) | Override `OPENAI_BASE_URL` for openclaude |
| `--max-turns` | `30` | Per-instance agent turn cap |
| `--request-timeout` | `1800` | HTTP timeout per instance, seconds |
| `--max-patch-retries` | `2` | Re-prompt when extracted diff is invalid |
| `--max-workers` | `1` | Docker harness parallelism (raise carefully) |
| `--skip-harness` | off | Generate predictions only; useful when iterating on prompts |
| `--agents` | `clawcodex,openclaude` | Drop one to run only the other |

Troubleshooting

- Text dataset not found: you skipped prepare, or it never finished. Run python eval/run_compare.py prepare (from the clawcodex repo) after installing swebench into the interpreter SWEBENCH_PYTHON points at (see Python that can import swebench above). On Windows Git Bash you can use .venv/Scripts/python.exe eval/run_compare.py prepare.
- text dataset not found at .../datasets/SWE-bench__SWE-bench_Lite__style-3__fs-oracle: same as above. Run prepare first, or pass --dataset-local if you keep the dataset elsewhere.
- OPENCLAUDE_REPO set but dist/cli.mjs not found: prepare should build it for you. If it doesn't, cd openclaude && bun install && bun run build.
- No module named 'src' from clawcodex: the wrapper's fallback path needs CLAWCODEX_REPO set. The driver passes it automatically; if you're running the wrapper by hand, follow SWE-bench-dev/clawcodex_test.md §2.2.
- One agent blew up but the other was fine: the run still produces a half-finished comparison. Check eval/runs/<id>/<agent>_server.log and <agent>_harness.log.
- Patch apply errors (Only garbage was found in the patch input): usually the model returned prose instead of a unified diff. The patch-retry loop in run_custom_api.py already handles this; if it persists, lower --max-turns or pick a model with stronger tool/diff fidelity.
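
When triaging the last case, it helps to check by hand whether a `model_patch` even resembles a unified diff. The heuristic below is a rough sketch for that purpose; it is not the validation that run_custom_api.py performs, and `looks_like_unified_diff` is a hypothetical name.

```python
import re

# A hunk header such as "@@ -12,7 +12,8 @@".
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@", re.MULTILINE)


def looks_like_unified_diff(text):
    """Heuristic: file headers (--- / +++) plus at least one hunk header."""
    has_old = re.search(r"^--- \S", text, re.MULTILINE) is not None
    has_new = re.search(r"^\+\+\+ \S", text, re.MULTILINE) is not None
    return has_old and has_new and HUNK_RE.search(text) is not None
```

A patch that fails this check is almost certainly prose or a truncated answer; one that passes can still fail to apply, so treat it as a first filter only.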
