Build Repair Agent: initial evaluation harness#5762
Build Repair Agent: initial evaluation harness#5762evgenyrp wants to merge 20 commits intomozilla:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds an initial evaluation harness for the Build Repair Agent, integrating Weave tracing/scoring and a Docker-based workflow to run multi-trial build-repair evaluations against a local Firefox checkout.
Changes:
- Introduces a standalone CLI (
scripts/build_repair_eval.py) that runs a WeaveEvaluationover a published dataset with configurable trials/parallelism. - Adds core build-repair components (agent, prompts, sandbox config, scorers, try/local build verification, git worktree management).
- Adds dev Docker/Docker Compose setup plus a dataset-creation notebook and minor repo hygiene updates.
Reviewed changes
Copilot reviewed 11 out of 13 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/build_repair_eval.py | New CLI harness to run Weave evaluation, manage worktrees per example, and trace LLM stages/costs. |
| requirements.txt | Adds claude-agent-sdk dependency for the new agent implementation. |
| notebooks/build_repair_create_dataset.ipynb | Notebook to build/publish a Weave dataset of one-commit failures and fixes. |
| docker/build_repair/Dockerfile | Dev image for running the evaluation in a container with Firefox repo mounted. |
| docker-compose.dev.yml | Dev compose service for local evaluation runs under Docker. |
| bugbug/tools/build_repair/worktree.py | New helper for creating/cleaning git worktrees for parallel runs. |
| bugbug/tools/build_repair/try_server.py | Implements local build verification and optional try push + Treeherder polling. |
| bugbug/tools/build_repair/scorer.py | Adds Weave scorers for basic metrics and build pass rates (plus LLM-judge scaffold). |
| bugbug/tools/build_repair/prompts.py | Prompt templates for analysis and fix stages (with eval constraints). |
| bugbug/tools/build_repair/config.py | Centralizes model IDs, cutoff dates, sandbox/tool allowlist, and try settings. |
| bugbug/tools/build_repair/agent.py | Core two-stage agent implementation using claude-agent-sdk, producing diffs and build verification results. |
| bugbug/tools/build_repair/init.py | Exposes build-repair public API objects. |
| .gitignore | Ignores JetBrains .idea directory. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| finally: | ||
| if worktree_created: | ||
| logger.info(f"Bug {bug_id}: cleaning up worktree {wt_name}") | ||
| self.worktree_mgr.cleanup(wt_name) | ||
|
|
There was a problem hiding this comment.
Cleanup runs in a finally, but WorktreeManager.cleanup() uses check=True and any failure here will raise and potentially mask the original error/result from the evaluation run. Consider catching and logging cleanup errors (or using check=False) so a failed cleanup doesn’t abort the evaluation process.
| diff_result = subprocess.run( | ||
| ["git", "diff", "HEAD"], | ||
| cwd=worktree_path, | ||
| capture_output=True, | ||
| text=True, | ||
| ) | ||
| diff = diff_result.stdout | ||
| logger.info(f"Bug {failure.bug_id}: git diff produced {len(diff)} chars") | ||
|
|
||
| if not diff.strip(): | ||
| logger.warning(f"Bug {failure.bug_id}: no diff produced, returning early") | ||
| return AgentResponse( | ||
| summary=summary, | ||
| analysis=analysis, | ||
| diff=diff, | ||
| cost_usd=total_cost, | ||
| num_turns=total_turns, | ||
| **self._usage_fields(total_usage), | ||
| stage1_transcript=stage1_transcript, | ||
| stage2_transcript=stage2_transcript, | ||
| ) |
There was a problem hiding this comment.
The diff is collected with git diff HEAD, which does not include untracked files. If the agent creates a new source file, diff will be empty and the run will incorrectly return early without build verification. Consider using git status --porcelain to detect any changes and/or using a diff method that includes untracked files (e.g., staging first, or git diff --stat --no-ext-diff -- plus explicit handling of untracked paths).
| "source": "### Ger bugzilla comments before the fix" | ||
| }, |
There was a problem hiding this comment.
Typo in markdown heading: "Ger bugzilla comments before the fix" should be "Get bugzilla comments before the fix".
| if datetime.fromisoformat(fix_commit_date).date() < cutoff: | ||
| logger.warning( | ||
| f"Skipping bug {bug_id}: fix date {fix_commit_date} " | ||
| f"is before model cutoff {cutoff}" | ||
| ) | ||
| raise ValueError("skipped_data_contamination") |
There was a problem hiding this comment.
fix_commit_date values from Bugzilla commonly end with Z (UTC). datetime.fromisoformat() does not parse the Z suffix, so this will raise ValueError and break evaluation runs. Normalize the timestamp (e.g., replace Z with +00:00) or use a parser that supports Z before comparing to the cutoff date.
| - .:/app # live code editing | ||
| - ${FIREFOX_REPO}:/workspace/firefox # Firefox repo | ||
| - build-repair-tmp:/tmp/build_repair_worktrees | ||
| environment: | ||
| - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY} | ||
| - WANDB_API_KEY=${WANDB_API_KEY} # for weave | ||
| - FIREFOX_GIT_REPO=/workspace/firefox |
There was a problem hiding this comment.
The volume mount uses ${FIREFOX_REPO} but the rest of the tooling (and container env) refers to FIREFOX_GIT_REPO. This mismatch can cause docker-compose to fail if FIREFOX_REPO isn’t set, and is easy to misconfigure. Consider using a single env var name consistently (or providing a default).
| "def _get_fix_commit_date(bug, fix_commit):\n", | ||
| " for comment in bug._metadata[\"comments\"]:\n", | ||
| " if (\n", | ||
| " comment[\"creator\"] == \"pulsebot@bmo.tld\"\n", | ||
| " and fix_commit[:12] in comment[\"raw_text\"]\n", | ||
| " ):\n", | ||
| " return comment[\"time\"]\n", | ||
| " raise None\n", | ||
| "\n", |
There was a problem hiding this comment.
raise None is invalid in Python (exceptions must derive from BaseException) and will raise a TypeError, breaking dataset creation if no matching pulsebot comment is found. Raise an appropriate exception type (or return None) instead.
| ENV PATH="/opt/venv/bin:$PATH" | ||
|
|
||
| RUN apt-get install -y git nodejs npm && rm -rf /var/lib/apt/lists/* | ||
| RUN pip install weave>=0.52.29 pydantic claude-agent-sdk requests |
There was a problem hiding this comment.
RUN pip install weave>=0.52.29 pydantic claude-agent-sdk requests installs multiple third-party Python packages without pinning them to immutable versions, which creates a supply-chain risk: if any of these packages or their PyPI releases are compromised, a rebuilt image could execute attacker-controlled code and exfiltrate secrets or tamper with build artifacts. To reduce this risk, pin these dependencies to specific versions (and ideally hashes) and update them only via reviewed dependency bumps.
There was a problem hiding this comment.
I'd want to have separate requirements + lock files but wasn't sure where to put it.
There was a problem hiding this comment.
These dependencies are already installed by bugbug — the only exception is claude-agent-sdk, which you've already added.
There was a problem hiding this comment.
I noticed it was very slow to rebuild the Docker image with all the deps together. But yeah, probably it's better to keep everything in one file, since the build repair tool can access some other modules from bugbug.
suhaibmujahid
left a comment
There was a problem hiding this comment.
I only had a cursory look because it's too much new code, but it LGTM overall.
| from bugbug.tools.build_repair.agent import AgentResponse, BuildFailure, BuildRepairTool | ||
|
|
||
| __all__ = ["AgentResponse", "BuildFailure", "BuildRepairTool"] |
There was a problem hiding this comment.
This is not needed since we already import from the submodules directly.
There was a problem hiding this comment.
I would move the Dockerfile to services/build_repair/Dockerfile to be part of a service.
bugbug/tools/ contains agents and tools; services/ contains deployment logic — Dockerfiles, API applications, etc.
|
|
||
| # JetBrains IDEs | ||
| .idea |
There was a problem hiding this comment.
This one falls a bit outside our scope, but I think the cleanest solution here is a user-scoped global gitignore file — that way it applies across all your repos without any per-project configuration.
I personally keep a ~/.gitignore_global with entries like:
*~
.DS_Store
.idea/*
.vscode/*
cmake-build-debug/*
.claude/settings.local.json
Just don't forget to register it with Git:
git config --global core.excludesfile ~/.gitignore_globalThere was a problem hiding this comment.
There was a problem hiding this comment.
It's not clear where this link leads
There was a problem hiding this comment.
Ok, it seems it's the comment about Dockerfile
| } | ||
| if self.num_trials > 1: | ||
| summary.update(_pass_at_k(score_rows, self.num_trials, "successful")) | ||
| logger.info(f"BasicMetrics summary: {summary}") |
There was a problem hiding this comment.
Avoid f-string interpolation in logging calls — the string is eagerly evaluated even when the log level is disabled. Use %s lazy formatting instead. See: https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/logging-fstring-interpolation.html
Here's an example of running it on a dataset of 85 examples. An outage at Anthropic contributed to some fails but it works overall.
The baseline agent is very simple; most of the complexity is about integration with Weave and building Firefox.
What works so far:
Out of scope for this PR:
./mach bootstrapon every example (I have the code, but it didn't work well, so I kept the simple approach)./mach load-taskalso didn't work for me)