Build Repair Agent: initial evaluation harness by evgenyrp · Pull Request #5762 · mozilla/bugbug

evgenyrp · 2026-03-03T01:10:49Z

Here's an example of running it on a dataset of 85 examples. An outage at Anthropic contributed to some of the failures, but it works overall.

The baseline agent is very simple; most of the complexity is about integration with Weave and building Firefox.

What works so far:

Creating a dataset of one commit build failures (with one commit fixes)
Running evaluation from a local machine under docker-compose
Creating and cleaning worktrees for each example
Basic sandboxing
Tracing agent's steps in Weave
Running multiple trials per example
Separate stages for analysis and fixing (allows deploying analysis only)
Building Firefox locally inside Docker to verify the fix
Basic scorers
Cost tracking
Verifying data contamination (meaning the LLM is trained before the example date)

Out of scope for this PR:

Reusing a pool of worktrees to avoid running ./mach bootstrap on every example (I have the code, but it didn't work well, so I kept the simple approach)
Pushing to TRY to verify a specific build: the main issue is propagating credentials to Docker (an alternative of using ./mach load-task also didn't work for me)
LLM-as-a-judge scorers
Better integration with Weave (hacky implementation of tracing, verify metrics when all scorers work, verify error handling)
Production deployment-related work
Improving the agent itself (reading through the Claude Agents SDK docs and applying best practices, making sure the Firefox MCP and skills work, maybe doing another iteration if the build is still failing etc.)
Improving sandboxing

Copilot

Pull request overview

Adds an initial evaluation harness for the Build Repair Agent, integrating Weave tracing/scoring and a Docker-based workflow to run multi-trial build-repair evaluations against a local Firefox checkout.

Changes:

Introduces a standalone CLI (scripts/build_repair_eval.py) that runs a Weave Evaluation over a published dataset with configurable trials/parallelism.
Adds core build-repair components (agent, prompts, sandbox config, scorers, try/local build verification, git worktree management).
Adds dev Docker/Docker Compose setup plus a dataset-creation notebook and minor repo hygiene updates.

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 11 comments.

Show a summary per file

File	Description
scripts/build_repair_eval.py	New CLI harness to run Weave evaluation, manage worktrees per example, and trace LLM stages/costs.
requirements.txt	Adds `claude-agent-sdk` dependency for the new agent implementation.
notebooks/build_repair_create_dataset.ipynb	Notebook to build/publish a Weave dataset of one-commit failures and fixes.
docker/build_repair/Dockerfile	Dev image for running the evaluation in a container with Firefox repo mounted.
docker-compose.dev.yml	Dev compose service for local evaluation runs under Docker.
bugbug/tools/build_repair/worktree.py	New helper for creating/cleaning git worktrees for parallel runs.
bugbug/tools/build_repair/try_server.py	Implements local build verification and optional try push + Treeherder polling.
bugbug/tools/build_repair/scorer.py	Adds Weave scorers for basic metrics and build pass rates (plus LLM-judge scaffold).
bugbug/tools/build_repair/prompts.py	Prompt templates for analysis and fix stages (with eval constraints).
bugbug/tools/build_repair/config.py	Centralizes model IDs, cutoff dates, sandbox/tool allowlist, and try settings.
bugbug/tools/build_repair/agent.py	Core two-stage agent implementation using `claude-agent-sdk`, producing diffs and build verification results.
bugbug/tools/build_repair/__init__.py	Exposes build-repair public API objects.
.gitignore	Ignores JetBrains `.idea` directory.


scripts/build_repair_eval.py

Copilot · 2026-03-09T18:31:08Z

scripts/build_repair_eval.py

+        finally:
+            if worktree_created:
+                logger.info(f"Bug {bug_id}: cleaning up worktree {wt_name}")
+                self.worktree_mgr.cleanup(wt_name)
+


Cleanup runs in a finally, but WorktreeManager.cleanup() uses check=True and any failure here will raise and potentially mask the original error/result from the evaluation run. Consider catching and logging cleanup errors (or using check=False) so a failed cleanup doesn’t abort the evaluation process.
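One way to act on this suggestion is to wrap the cleanup call so that a failure is logged rather than raised. The sketch below is illustrative only: `safe_cleanup` is a hypothetical helper, not part of the PR, and it assumes the cleanup callable raises `subprocess.CalledProcessError` (as a `check=True` subprocess call would) or `OSError`.

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def safe_cleanup(cleanup, wt_name):
    """Run cleanup(wt_name), logging failures instead of raising.

    `cleanup` stands in for something like WorktreeManager.cleanup.
    Returns True if cleanup succeeded, False otherwise.
    """
    try:
        cleanup(wt_name)
        return True
    except (subprocess.CalledProcessError, OSError) as exc:
        # Log and continue, so the original exception or result from the
        # evaluation run is what the caller sees.
        logger.warning("Failed to clean up worktree %s: %s", wt_name, exc)
        return False
```

Called from the `finally` block, this would keep a failed `git worktree remove` from replacing the exception (or return value) produced by the evaluation itself.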

bugbug/tools/build_repair/worktree.py

Copilot · 2026-03-09T18:31:09Z

bugbug/tools/build_repair/agent.py

+        diff_result = subprocess.run(
+            ["git", "diff", "HEAD"],
+            cwd=worktree_path,
+            capture_output=True,
+            text=True,
+        )
+        diff = diff_result.stdout
+        logger.info(f"Bug {failure.bug_id}: git diff produced {len(diff)} chars")
+
+        if not diff.strip():
+            logger.warning(f"Bug {failure.bug_id}: no diff produced, returning early")
+            return AgentResponse(
+                summary=summary,
+                analysis=analysis,
+                diff=diff,
+                cost_usd=total_cost,
+                num_turns=total_turns,
+                **self._usage_fields(total_usage),
+                stage1_transcript=stage1_transcript,
+                stage2_transcript=stage2_transcript,
+            )


The diff is collected with git diff HEAD, which does not include untracked files. If the agent creates a new source file, diff will be empty and the run will incorrectly return early without build verification. Consider using git status --porcelain to detect any changes and/or using a diff method that includes untracked files (e.g., staging first, or git diff --stat --no-ext-diff -- plus explicit handling of untracked paths).
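A minimal sketch of the "stage first" option mentioned above, assuming the worktree is disposable so modifying its index is acceptable. `parse_porcelain` and `collect_diff` are hypothetical names, not functions from the PR. The parser handles the simple `XY path` form of `git status --porcelain` output; rename entries (`old -> new`) and quoted paths are kept as raw strings.

```python
import subprocess


def parse_porcelain(output):
    """Split `git status --porcelain` output into (tracked, untracked) paths."""
    tracked, untracked = [], []
    for line in output.splitlines():
        if len(line) < 4:
            continue  # not a valid "XY path" entry
        status, path = line[:2], line[3:]
        # "??" marks untracked files, which `git diff HEAD` does not show.
        (untracked if status == "??" else tracked).append(path)
    return tracked, untracked


def collect_diff(worktree_path):
    """Return a diff against HEAD that also covers newly created files."""

    def git(*args):
        result = subprocess.run(
            ["git", *args],
            cwd=worktree_path,
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout

    tracked, untracked = parse_porcelain(git("status", "--porcelain"))
    if not tracked and not untracked:
        return ""  # nothing changed at all, safe to return early
    # Stage everything (respecting .gitignore) so new files appear in the diff.
    git("add", "-A")
    return git("diff", "--cached", "HEAD")
```

With this shape, the early return only triggers when the status output is empty, so a fix that consists solely of a new file still reaches build verification.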

Copilot · 2026-03-09T18:31:09Z

notebooks/build_repair_create_dataset.ipynb

+   "source": "### Ger bugzilla comments before the fix"
+  },


Typo in markdown heading: "Ger bugzilla comments before the fix" should be "Get bugzilla comments before the fix".

Copilot · 2026-03-09T18:31:09Z

scripts/build_repair_eval.py

+            if datetime.fromisoformat(fix_commit_date).date() < cutoff:
+                logger.warning(
+                    f"Skipping bug {bug_id}: fix date {fix_commit_date} "
+                    f"is before model cutoff {cutoff}"
+                )
+                raise ValueError("skipped_data_contamination")


fix_commit_date values from Bugzilla commonly end with Z (UTC). On Python versions before 3.11, datetime.fromisoformat() does not parse the Z suffix, so this will raise ValueError and break evaluation runs there. Normalize the timestamp (e.g., replace Z with +00:00) or use a parser that supports Z before comparing to the cutoff date.
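The normalization can be sketched as follows. The helper names `parse_utc_timestamp` and `is_before_cutoff` are hypothetical and not taken from the PR. Replacing a trailing `Z` with `+00:00` keeps the code working on interpreters older than Python 3.11, where `fromisoformat()` rejects the suffix, and is harmless on newer ones.

```python
from datetime import date, datetime, timezone


def parse_utc_timestamp(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    if value.endswith("Z"):
        # "+00:00" is the offset form that fromisoformat() has always accepted.
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def is_before_cutoff(fix_commit_date, cutoff):
    """Return True if the fix landed before the model's training cutoff date."""
    return parse_utc_timestamp(fix_commit_date).date() < cutoff
```

For example, `is_before_cutoff("2025-06-01T12:30:00Z", date(2025, 7, 1))` evaluates to `True`, so that example would be skipped as potentially contaminated.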

Copilot · 2026-03-09T18:31:10Z

docker-compose.dev.yml

+      - .:/app # live code editing
+      - ${FIREFOX_REPO}:/workspace/firefox # Firefox repo
+      - build-repair-tmp:/tmp/build_repair_worktrees
+    environment:
+      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
+      - WANDB_API_KEY=${WANDB_API_KEY} # for weave
+      - FIREFOX_GIT_REPO=/workspace/firefox


The volume mount uses ${FIREFOX_REPO} but the rest of the tooling (and container env) refers to FIREFOX_GIT_REPO. This mismatch can cause docker-compose to fail if FIREFOX_REPO isn’t set, and is easy to misconfigure. Consider using a single env var name consistently (or providing a default).

Copilot · 2026-03-09T18:31:10Z

notebooks/build_repair_create_dataset.ipynb

+    "def _get_fix_commit_date(bug, fix_commit):\n",
+    "    for comment in bug._metadata[\"comments\"]:\n",
+    "        if (\n",
+    "            comment[\"creator\"] == \"pulsebot@bmo.tld\"\n",
+    "            and fix_commit[:12] in comment[\"raw_text\"]\n",
+    "        ):\n",
+    "            return comment[\"time\"]\n",
+    "    raise None\n",
+    "\n",


raise None is invalid in Python (exceptions must derive from BaseException) and will raise a TypeError, breaking dataset creation if no matching pulsebot comment is found. Raise an appropriate exception type (or return None) instead.
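A sketch of the "return None" variant, written against a plain list of comment dictionaries rather than the notebook's `bug._metadata` so that it is self-contained. The function name and signature here are assumptions for illustration; the caller is expected to handle a `None` result (for instance by dropping the example).

```python
PULSEBOT = "pulsebot@bmo.tld"


def get_fix_commit_date(comments, fix_commit):
    """Return the time of the pulsebot comment that mentions the fix commit.

    Returns None when no such comment exists, instead of `raise None`,
    which would itself fail with a TypeError.
    """
    short_hash = fix_commit[:12]
    for comment in comments:
        if comment["creator"] == PULSEBOT and short_hash in comment["raw_text"]:
            return comment["time"]
    return None
```

If a missing comment should stop dataset creation, raising a concrete exception such as `LookupError` with the commit hash in the message would be the alternative.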

Copilot · 2026-03-09T18:31:10Z

docker/build_repair/Dockerfile

+ENV PATH="/opt/venv/bin:$PATH"
+
+RUN apt-get install -y git nodejs npm && rm -rf /var/lib/apt/lists/*
+RUN pip install weave>=0.52.29 pydantic claude-agent-sdk requests


RUN pip install weave>=0.52.29 pydantic claude-agent-sdk requests installs multiple third-party Python packages without pinning them to immutable versions, which creates a supply-chain risk: if any of these packages or their PyPI releases are compromised, a rebuilt image could execute attacker-controlled code and exfiltrate secrets or tamper with build artifacts. To reduce this risk, pin these dependencies to specific versions (and ideally hashes) and update them only via reviewed dependency bumps.

I'd want to have separate requirements + lock files but wasn't sure where to put it.

These dependencies are already installed by bugbug — the only exception is claude-agent-sdk, which you've already added.

I noticed it was very slow to rebuild the Docker image with all the deps together. But yeah, probably it's better to keep everything in one file, since the build repair tool can access some other modules from bugbug.

suhaibmujahid

I only had a cursory look because it's too much new code, but it LGTM overall.

suhaibmujahid · 2026-03-09T18:23:21Z

bugbug/tools/build_repair/__init__.py

+from bugbug.tools.build_repair.agent import AgentResponse, BuildFailure, BuildRepairTool
+
+__all__ = ["AgentResponse", "BuildFailure", "BuildRepairTool"]


This is not needed since we already import from the submodules directly.

suhaibmujahid · 2026-03-09T18:32:56Z

docker/build_repair/Dockerfile

I would move the Dockerfile to services/build_repair/Dockerfile to be part of a service.

bugbug/tools/ contains agents and tools; services/ contains deployment logic — Dockerfiles, API applications, etc.

suhaibmujahid · 2026-03-09T18:44:04Z

.gitignore

+
+# JetBrains IDEs
+.idea


This one falls a bit outside our scope, but I think the cleanest solution here is a user-scoped global gitignore file — that way it applies across all your repos without any per-project configuration.

I personally keep a ~/.gitignore_global with entries like:

*~
.DS_Store
.idea/*
.vscode/*
cmake-build-debug/*
.claude/settings.local.json

Just don't forget to register it with Git:

git config --global core.excludesfile ~/.gitignore_global

suhaibmujahid · 2026-03-09T18:45:18Z

docker-compose.dev.yml

Similar to https://github.com/mozilla/bugbug/pull/5762/changes#r2907229399

It's not clear where this link leads

Ok, it seems it's the comment about Dockerfile

suhaibmujahid · 2026-03-10T03:10:01Z

bugbug/tools/build_repair/scorer.py

+        }
+        if self.num_trials > 1:
+            summary.update(_pass_at_k(score_rows, self.num_trials, "successful"))
+        logger.info(f"BasicMetrics summary: {summary}")


Avoid f-string interpolation in logging calls — the string is eagerly evaluated even when the log level is disabled. Use %s lazy formatting instead. See: https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/logging-fstring-interpolation.html
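The difference can be illustrated with a small sketch. `ExpensiveSummary` and the logger name are made up for the example. The class counts how often it is rendered to a string, which makes the eager evaluation of an f-string visible.

```python
import logging

logger = logging.getLogger("build_repair.scorer_example")


class ExpensiveSummary:
    """Wraps a dict and counts how many times it is rendered as a string."""

    def __init__(self, data):
        self.data = data
        self.render_count = 0

    def __str__(self):
        self.render_count += 1
        return str(self.data)


def log_summary_lazy(summary):
    # %s formatting is deferred: the argument is only rendered if the
    # record is actually emitted by a handler.
    logger.info("BasicMetrics summary: %s", summary)


def log_summary_eager(summary):
    # The f-string is built before logger.info() is even called.
    logger.info(f"BasicMetrics summary: {summary}")
```

With the logger level set to `WARNING`, `log_summary_lazy` never calls `__str__`, while `log_summary_eager` calls it once per log statement even though nothing is written.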

evgenyrp added 20 commits February 17, 2026 12:09

Add a dataset notebook

22a1ab5

Add JetBrains IDE

77aeb2e

Initial implementation

7a95a7f

Prevent data contamination

a1af703

Adjust agent configuration and add logs

305d7c9

Improve tracing

5cb55a5

Add eval mode

79871bc

Fix local building

19addba

Improve cost tracing

7a32858

Think more on analysis

6e497df

Increase parallelizm

2fa7c62

Use default system prompt

2f2832e

Update Weave

983643e

Fix dataset

1cf9351

Log errors in weave

ffe5685

Make logging less verbose

915a2b8

Support multiple trials

ab4fce1

Fix scoring

955e048

Add docker compose

33dc52c

Change todo

2a26a6c

evgenyrp requested a review from suhaibmujahid March 3, 2026 01:10

suhaibmujahid requested a review from Copilot March 9, 2026 18:25

Copilot started reviewing on behalf of suhaibmujahid March 9, 2026 18:25

Copilot AI reviewed Mar 9, 2026

suhaibmujahid reviewed Mar 9, 2026

suhaibmujahid reviewed Mar 10, 2026


3 participants
