Build Repair Agent: initial evaluation harness #5762

Open
evgenyrp wants to merge 20 commits into mozilla:master from evgenyrp:build_repair_agent

@evgenyrp evgenyrp commented Mar 3, 2026

Here's an example of running it on a dataset of 85 examples. An outage at Anthropic contributed to some failures, but it works overall.

The baseline agent is very simple; most of the complexity is about integration with Weave and building Firefox.

What works so far:

  • Creating a dataset of one-commit build failures (with one-commit fixes)
  • Running evaluation from a local machine under docker-compose
  • Creating and cleaning worktrees for each example
  • Basic sandboxing
  • Tracing agent's steps in Weave
  • Running multiple trials per example
  • Separate stages for analysis and fixing (allows deploying analysis only)
  • Building Firefox locally inside Docker to verify the fix
  • Basic scorers
  • Cost tracking
  • Checking for data contamination (verifying that the LLM was trained before the example date)

Out of scope for this PR:

  • Reusing a pool of worktrees to avoid running ./mach bootstrap on every example (I have the code, but it didn't work well, so I kept the simple approach)
  • Pushing to TRY to verify a specific build: the main issue is propagating credentials to Docker (an alternative of using ./mach load-task also didn't work for me)
  • LLM-as-a-judge scorers
  • Better integration with Weave (hacky implementation of tracing, verify metrics when all scorers work, verify error handling)
  • Production deployment-related work
  • Improving the agent itself (reading through the Claude Agent SDK docs and applying best practices, making sure the Firefox MCP and skills work, maybe doing another iteration if the build is still failing, etc.)
  • Improving sandboxing

Copilot AI left a comment

Pull request overview

Adds an initial evaluation harness for the Build Repair Agent, integrating Weave tracing/scoring and a Docker-based workflow to run multi-trial build-repair evaluations against a local Firefox checkout.

Changes:

  • Introduces a standalone CLI (scripts/build_repair_eval.py) that runs a Weave Evaluation over a published dataset with configurable trials/parallelism.
  • Adds core build-repair components (agent, prompts, sandbox config, scorers, try/local build verification, git worktree management).
  • Adds dev Docker/Docker Compose setup plus a dataset-creation notebook and minor repo hygiene updates.

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| scripts/build_repair_eval.py | New CLI harness to run Weave evaluation, manage worktrees per example, and trace LLM stages/costs. |
| requirements.txt | Adds claude-agent-sdk dependency for the new agent implementation. |
| notebooks/build_repair_create_dataset.ipynb | Notebook to build/publish a Weave dataset of one-commit failures and fixes. |
| docker/build_repair/Dockerfile | Dev image for running the evaluation in a container with Firefox repo mounted. |
| docker-compose.dev.yml | Dev compose service for local evaluation runs under Docker. |
| bugbug/tools/build_repair/worktree.py | New helper for creating/cleaning git worktrees for parallel runs. |
| bugbug/tools/build_repair/try_server.py | Implements local build verification and optional try push + Treeherder polling. |
| bugbug/tools/build_repair/scorer.py | Adds Weave scorers for basic metrics and build pass rates (plus LLM-judge scaffold). |
| bugbug/tools/build_repair/prompts.py | Prompt templates for analysis and fix stages (with eval constraints). |
| bugbug/tools/build_repair/config.py | Centralizes model IDs, cutoff dates, sandbox/tool allowlist, and try settings. |
| bugbug/tools/build_repair/agent.py | Core two-stage agent implementation using claude-agent-sdk, producing diffs and build verification results. |
| bugbug/tools/build_repair/__init__.py | Exposes build-repair public API objects. |
| .gitignore | Ignores JetBrains .idea directory. |

Comment on lines +285 to +289
finally:
    if worktree_created:
        logger.info(f"Bug {bug_id}: cleaning up worktree {wt_name}")
        self.worktree_mgr.cleanup(wt_name)

Copilot AI Mar 9, 2026

Cleanup runs in a finally, but WorktreeManager.cleanup() uses check=True and any failure here will raise and potentially mask the original error/result from the evaluation run. Consider catching and logging cleanup errors (or using check=False) so a failed cleanup doesn’t abort the evaluation process.
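
One way to keep a failed cleanup from masking the original error is a small wrapper like the sketch below. `safe_cleanup` is a hypothetical helper, not part of this PR, and it assumes the cleanup callable fails with `subprocess.CalledProcessError` (as a `check=True` subprocess call would) or `OSError`.

```python
import logging
import subprocess
from typing import Callable

logger = logging.getLogger(__name__)


def safe_cleanup(cleanup: Callable[[str], None], wt_name: str) -> bool:
    """Run cleanup(wt_name); log and swallow failures.

    Hypothetical helper: returns True on success and False if cleanup failed,
    so a cleanup error never replaces an exception that is already propagating.
    """
    try:
        cleanup(wt_name)
        return True
    except (subprocess.CalledProcessError, OSError) as exc:
        logger.warning("Failed to clean up worktree %s: %s", wt_name, exc)
        return False
```

In the `finally` block, the direct call could then become something like `safe_cleanup(self.worktree_mgr.cleanup, wt_name)`.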

Comment on lines +349 to +369
diff_result = subprocess.run(
    ["git", "diff", "HEAD"],
    cwd=worktree_path,
    capture_output=True,
    text=True,
)
diff = diff_result.stdout
logger.info(f"Bug {failure.bug_id}: git diff produced {len(diff)} chars")

if not diff.strip():
    logger.warning(f"Bug {failure.bug_id}: no diff produced, returning early")
    return AgentResponse(
        summary=summary,
        analysis=analysis,
        diff=diff,
        cost_usd=total_cost,
        num_turns=total_turns,
        **self._usage_fields(total_usage),
        stage1_transcript=stage1_transcript,
        stage2_transcript=stage2_transcript,
    )
Copilot AI Mar 9, 2026

The diff is collected with git diff HEAD, which does not include untracked files. If the agent creates a new source file, diff will be empty and the run will incorrectly return early without build verification. Consider using git status --porcelain to detect any changes and/or using a diff method that includes untracked files (e.g., staging first, or git diff --stat --no-ext-diff -- plus explicit handling of untracked paths).
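
A minimal sketch of the "staging first" option, assuming the worktree is disposable so modifying its index is acceptable. Both function names are hypothetical; `changed_paths` parses `git status --porcelain` (v1) output and `collect_diff` stages everything before diffing so that new files are included.

```python
import subprocess
from pathlib import Path


def changed_paths(porcelain: str) -> list[str]:
    """Parse `git status --porcelain` (v1) output into a list of paths.

    Each line looks like "XY <path>"; untracked files appear as "?? <path>".
    Renames ("old -> new") are returned as-is in this sketch.
    """
    return [line[3:] for line in porcelain.splitlines() if len(line) > 3]


def collect_diff(worktree_path: Path) -> str:
    """Return a diff against HEAD that also covers untracked files."""
    # Stage all changes, including new files, so they show up in the diff.
    subprocess.run(["git", "add", "-A"], cwd=worktree_path, check=True)
    result = subprocess.run(
        ["git", "diff", "--cached", "HEAD"],
        cwd=worktree_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

With this approach an empty diff means the agent changed nothing at all, rather than "nothing in tracked files".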

Comment on lines +110 to +111
"source": "### Ger bugzilla comments before the fix"
},
Copilot AI Mar 9, 2026

Typo in markdown heading: "Ger bugzilla comments before the fix" should be "Get bugzilla comments before the fix".

Comment on lines +250 to +255
if datetime.fromisoformat(fix_commit_date).date() < cutoff:
    logger.warning(
        f"Skipping bug {bug_id}: fix date {fix_commit_date} "
        f"is before model cutoff {cutoff}"
    )
    raise ValueError("skipped_data_contamination")
Copilot AI Mar 9, 2026

fix_commit_date values from Bugzilla commonly end with Z (UTC). On Python versions before 3.11, datetime.fromisoformat() does not parse the Z suffix, so this will raise ValueError and break evaluation runs. Normalize the timestamp (e.g., replace Z with +00:00) or use a parser that supports Z before comparing to the cutoff date.
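
The normalization could look like the sketch below. `parse_iso_timestamp` and `is_before_cutoff` are hypothetical names, and the sketch assumes the timestamps are ISO 8601 strings that differ from what `fromisoformat()` accepts only by a trailing `Z`.

```python
from datetime import date, datetime


def parse_iso_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    if value.endswith("Z"):
        # Older fromisoformat() versions reject 'Z'; use an explicit UTC offset.
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def is_before_cutoff(fix_commit_date: str, cutoff: date) -> bool:
    """True if the fix landed before the model cutoff (possible contamination)."""
    return parse_iso_timestamp(fix_commit_date).date() < cutoff
```

The same helper works unchanged on newer Python versions, where the replacement is simply redundant.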

Comment on lines +10 to +16
      - .:/app # live code editing
      - ${FIREFOX_REPO}:/workspace/firefox # Firefox repo
      - build-repair-tmp:/tmp/build_repair_worktrees
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - WANDB_API_KEY=${WANDB_API_KEY} # for weave
      - FIREFOX_GIT_REPO=/workspace/firefox
Copilot AI Mar 9, 2026

The volume mount uses ${FIREFOX_REPO} but the rest of the tooling (and container env) refers to FIREFOX_GIT_REPO. This mismatch can cause docker-compose to fail if FIREFOX_REPO isn’t set, and is easy to misconfigure. Consider using a single env var name consistently (or providing a default).

Comment on lines +139 to +147
"def _get_fix_commit_date(bug, fix_commit):\n",
" for comment in bug._metadata[\"comments\"]:\n",
" if (\n",
" comment[\"creator\"] == \"pulsebot@bmo.tld\"\n",
" and fix_commit[:12] in comment[\"raw_text\"]\n",
" ):\n",
" return comment[\"time\"]\n",
" raise None\n",
"\n",
Copilot AI Mar 9, 2026

raise None is invalid in Python (exceptions must derive from BaseException) and will raise a TypeError, breaking dataset creation if no matching pulsebot comment is found. Raise an appropriate exception type (or return None) instead.
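
A sketch of the "raise an appropriate exception" option follows. It is a simplified, hypothetical variant of the notebook function: it takes the list of comment dicts directly instead of a bug object, and the choice of `LookupError` is an assumption, not something the PR specifies.

```python
def get_fix_commit_date(comments: list[dict], fix_commit: str) -> str:
    """Return the timestamp of the pulsebot comment that mentions fix_commit.

    Raises LookupError (instead of the invalid `raise None`) when no
    matching comment exists, so callers can decide whether to skip the bug.
    """
    short_hash = fix_commit[:12]
    for comment in comments:
        if comment["creator"] == "pulsebot@bmo.tld" and short_hash in comment["raw_text"]:
            return comment["time"]
    raise LookupError(f"no pulsebot comment found for commit {short_hash}")
```

Returning `None` instead would also work, as long as every caller checks for it before parsing the date.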

ENV PATH="/opt/venv/bin:$PATH"

RUN apt-get install -y git nodejs npm && rm -rf /var/lib/apt/lists/*
RUN pip install weave>=0.52.29 pydantic claude-agent-sdk requests
Copilot AI Mar 9, 2026

RUN pip install weave>=0.52.29 pydantic claude-agent-sdk requests installs multiple third-party Python packages without pinning them to immutable versions, which creates a supply-chain risk: if any of these packages or their PyPI releases are compromised, a rebuilt image could execute attacker-controlled code and exfiltrate secrets or tamper with build artifacts. To reduce this risk, pin these dependencies to specific versions (and ideally hashes) and update them only via reviewed dependency bumps.

Author

@evgenyrp evgenyrp Mar 9, 2026

I'd want to have separate requirements + lock files but wasn't sure where to put it.

Member

These dependencies are already installed by bugbug — the only exception is claude-agent-sdk, which you've already added.

Author

I noticed it was very slow to rebuild the Docker image with all the deps together. But yeah, probably it's better to keep everything in one file, since the build repair tool can access some other modules from bugbug.

Member

@suhaibmujahid suhaibmujahid left a comment

I only had a cursory look because it's too much new code, but it LGTM overall.

Comment on lines +6 to +8
from bugbug.tools.build_repair.agent import AgentResponse, BuildFailure, BuildRepairTool

__all__ = ["AgentResponse", "BuildFailure", "BuildRepairTool"]
Member

This is not needed since we already import from the submodules directly.

Member

I would move the Dockerfile to services/build_repair/Dockerfile to be part of a service.

bugbug/tools/ contains agents and tools; services/ contains deployment logic — Dockerfiles, API applications, etc.

Comment on lines +52 to +54

# JetBrains IDEs
.idea
Member

This one falls a bit outside our scope, but I think the cleanest solution here is a user-scoped global gitignore file — that way it applies across all your repos without any per-project configuration.

I personally keep a ~/.gitignore_global with entries like:

*~
.DS_Store
.idea/*
.vscode/*
cmake-build-debug/*
.claude/settings.local.json

Just don't forget to register it with Git:

git config --global core.excludesfile ~/.gitignore_global

Member

Author

It's not clear where this link leads

Author

Ok, it seems it's the comment about Dockerfile

}
if self.num_trials > 1:
    summary.update(_pass_at_k(score_rows, self.num_trials, "successful"))
logger.info(f"BasicMetrics summary: {summary}")
Member

Avoid f-string interpolation in logging calls — the string is eagerly evaluated even when the log level is disabled. Use %s lazy formatting instead. See: https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/logging-fstring-interpolation.html
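
The difference can be seen in a small self-contained sketch. The logger name and the `CountingSummary` class are made up for illustration; the class only counts how often it is converted to a string.

```python
import logging

logger = logging.getLogger("build_repair_logging_example")
logger.setLevel(logging.WARNING)  # INFO records are disabled for this logger


class CountingSummary:
    """Stand-in for an expensive-to-format object; counts __str__ calls."""

    def __init__(self) -> None:
        self.str_calls = 0

    def __str__(self) -> str:
        self.str_calls += 1
        return "{'successful': 0.5}"


lazy = CountingSummary()
# %s formatting is deferred; the record is dropped before str() is ever called.
logger.info("BasicMetrics summary: %s", lazy)

eager = CountingSummary()
# The f-string is built before logger.info() runs, even though INFO is disabled.
logger.info(f"BasicMetrics summary: {eager}")
```

After running this, `lazy.str_calls` stays at 0 while `eager.str_calls` is 1, which is the cost the pylint warning is about.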
