RHOAIENG-50709: add ADR for CI/CD on firewalled clusters #793

Open
ktdreyer wants to merge 1 commit into main from kdreyer/github-ci-runner-adr

@ktdreyer ktdreyer commented Mar 4, 2026

Present a phased approach for running GitHub Actions inside firewalled OpenShift.

Phase 1 deploys a standalone runner with no CRDs or cluster-level permissions.

Phase 2 upgrades to Actions Runner Controller (ARC) if IT approves CRDs.

https://issues.redhat.com/browse/RHOAIENG-50709

@ktdreyer ktdreyer requested a review from kahowell March 4, 2026 16:51
@ktdreyer ktdreyer changed the title add ADR for CI/CD on firewalled clusters RHOAIENG-51655: add ADR for CI/CD on firewalled clusters Mar 4, 2026
@ktdreyer ktdreyer changed the title RHOAIENG-51655: add ADR for CI/CD on firewalled clusters RHOAIENG-50709: add ADR for CI/CD on firewalled clusters Mar 4, 2026
@ktdreyer ktdreyer force-pushed the kdreyer/github-ci-runner-adr branch 2 times, most recently from d761513 to 27a7abe Compare March 4, 2026 17:06
@ambient-code ambient-code bot added this to the Merge Queue milestone Mar 4, 2026

ktdreyer commented Mar 4, 2026

Re: "Missing Status field", I'm addressing that in #797

@ambient-code ambient-code bot removed this from the Merge Queue milestone Mar 5, 2026
@ktdreyer ktdreyer force-pushed the kdreyer/github-ci-runner-adr branch 2 times, most recently from ef47bf7 to 974cfc1 Compare March 5, 2026 21:06

github-actions bot commented Mar 5, 2026

Claude Code Review

Summary

PR 793 adds ADR-0007 documenting the CI/CD deployment strategy for firewalled OpenShift clusters. The content is well-structured and clearly explains the phased approach (standalone runner to ARC). However, the document deviates from the project ADR template and README-defined workflow in several ways that reduce navigability and completeness.


Issues by Severity

Blocker Issues: None.

Critical Issues: None.


Major Issues

M1. Missing Status field (docs/internal/adr/0007-cicd-deployment-strategy.md, lines 1–6)

The ADR README defines an explicit lifecycle: Proposed → Accepted → Deprecated → Superseded. Every accepted ADR (0001–0006) includes a Status line in its header. ADR-0007 omits this field entirely, making it impossible to determine whether this is a proposal or a ratified decision.

Fix: add **Status:** Proposed (or Accepted if already ratified) to the header block.

M2. Considered Options placed after Decision (docs/internal/adr/0007-cicd-deployment-strategy.md, lines 55–76 vs. 16–54)

The template and all accepted ADRs (0001–0005) list Considered Options before the Decision Outcome so readers understand the option space before seeing the rationale. Here the decision is presented first and rejected options follow — a structural inversion of the established pattern.

Fix: move the Considered Options section above Decision.

M3. README index not updated (docs/internal/adr/README.md)

The README table ends at ADR-0005 (ADR-0006 is also absent). Per the ADR workflow defined in the README, the table should be updated when an ADR is proposed.

Fix: append both ADR-0006 (Ambient Runner SDK Architecture, 2026-02-10) and ADR-0007 (this ADR, 2026-03-04) to the table.


Minor Issues

m4. Missing Technical Story field (docs/internal/adr/0007-cicd-deployment-strategy.md, lines 1–6)

The template requires a Technical Story field. RHOAIENG-50709 is in the PR description but absent from the ADR header — the link is lost after merge.

Fix: add **Technical Story:** https://issues.redhat.com/browse/RHOAIENG-50709

m5. Missing Decision Drivers section (docs/internal/adr/0007-cicd-deployment-strategy.md)

The two driving properties (Immediacy and Visibility) are described in the Context prose but not extracted into a dedicated Decision Drivers section, making it harder to evaluate future alternatives against the same criteria.

m6. Consequences is a top-level section (docs/internal/adr/0007-cicd-deployment-strategy.md, line 62)

Per the template, Consequences (Positive/Negative/Risks) should be a subsection under Decision Outcome, not a top-level sibling.

m7. Missing Implementation Notes and Validation sections (docs/internal/adr/0007-cicd-deployment-strategy.md)

These template sections are valuable post-rollout. Add them with [TBD] placeholders now so they get filled in after the runner is deployed.


Positive Highlights

  • Content is excellent. The phased approach (Phase 1 to Phase 2 gated on IT approval) is clearly motivated and trade-offs are honestly assessed. The standalone runner SPOF risk and ARC concurrency problem are both surfaced proactively.
  • The concurrency key recommendation for serializing deploy jobs is a concrete, actionable mitigation — not just an abstract concern flagged and left open.
  • Rejected options are documented with reasons. ArgoCD and VPN/SSH are dismissed with explicit rationale (developer UX, firewall complexity), which is exactly what ADRs are for.
  • Risks of Remaining on Phase 1 section is a thoughtful addition — useful for future decision-makers evaluating whether to invest in Phase 2.

Recommendations

  1. Add **Status:** Proposed to the header (Major — fixes ADR lifecycle tracking gap).
  2. Move Considered Options above Decision (Major — restores logical reading order per template).
  3. Update docs/internal/adr/README.md to include ADR-0006 and ADR-0007 (Major — keeps the index accurate).
  4. Add **Technical Story:** with the RHOAIENG Jira link (Minor — preserves traceability post-merge).
  5. Add stub Implementation Notes and Validation sections with [TBD] placeholders (Minor — follows template).

Reviewed by Claude Code using repository standards from .claude/context/ and .claude/patterns/

@ktdreyer ktdreyer force-pushed the kdreyer/github-ci-runner-adr branch from 974cfc1 to 59f3703 Compare March 5, 2026 21:08

github-actions bot commented Mar 5, 2026

Summary

PR adds ADR-0007, a phased CI/CD deployment strategy for firewalled OpenShift clusters using self-hosted GitHub Actions runners. The architectural reasoning is sound and the trade-off analysis is honest. The document is well-written, but deviates from the project ADR template structure and has a security guidance gap around credential storage.

No blocker or critical issues.

Major Issues:

  1. Significant deviations from ADR template structure - docs/internal/adr/template.md requires sections that are entirely missing: Technical Story (should link to RHOAIENG-50709, present in PR body but absent from the doc), Decision Drivers, Implementation Notes, and Validation. ADR-0006 also includes a Status field already flagged by @MPRIC and tracked in issue 797.

  2. No guidance on GitHub App private key storage - the document mentions GitHub App credentials (App ID + private key) but gives no guidance on how the private key will be stored. Per this repo's security standards, secrets must be stored in Kubernetes Secrets - not baked into the Deployment manifest, ConfigMaps, or source control. Suggested addition: "The GitHub App private key is stored as a Kubernetes Secret in the runner namespace and mounted into the runner pod. It is never checked into source control or embedded in image layers."

  3. Serial execution framed as both a Pro and a Risk without disambiguation - Phase 1 Pros lists "Serialized deploys" as a positive, while Risks of Remaining on Phase 1 lists "No concurrency" as a negative. These describe the same property. A brief note - e.g. "This is the same serialization listed as a Pro in Phase 1; the risk here is throughput, not correctness" - would remove the ambiguity.

Minor Issues:

  1. Section heading References vs Links - the template and ADR-0006 use Links. Minor inconsistency worth aligning.

  2. "with Gemini 3 Pro" in Authors field - no other ADR credits an AI tool as co-author. Worth a deliberate team decision.

  3. Monitoring gap without a follow-up action - "We need monitoring to detect this" has no owner, issue reference, or timeline.

Positive Highlights:

  • Honest risk documentation. The "Risks of Remaining on Phase 1" section is unusually candid - single point of failure, manual recovery, no HA - making this a genuinely useful decision record.
  • Phase gate is well-structured. The "Phase 2 depends on IT approving CRDs" condition is explicit and the fallback path is clear.
  • The concurrency: YAML example is exactly right. cancel-in-progress: false and framing it as "a safe starting point, not a long-term solution" prevents a footgun for whoever implements Phase 2.
  • Options rejected with clear reasoning. Both rejected options (ArgoCD, VPN overlay) have concrete reasons tied to the stated requirements.
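
For readers without the ADR open, a `concurrency` block of the kind praised above might look like the following. This is a minimal sketch under stated assumptions, not the ADR's exact text: the workflow name, the `deploy-prod` group name, the trigger, and the `manifests/` path are hypothetical.

```yaml
name: deploy                 # hypothetical workflow name
on:
  push:
    branches: [main]

# Runs that share a group execute one at a time. With cancel-in-progress: false,
# a deploy that is already running is left to finish when a newer run is queued.
concurrency:
  group: deploy-prod         # hypothetical group name
  cancel-in-progress: false

jobs:
  deploy:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: oc apply -f manifests/   # placeholder path
```

One caveat worth keeping in mind: GitHub holds at most one pending run per concurrency group, so if several runs queue behind an in-progress deploy, only the most recent pending run is kept. That fits the "safe starting point" framing, since it serializes deploys without guaranteeing that every queued run executes.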

Recommendations:

  1. Add missing template sections (Decision Drivers, Technical Story, Status, Implementation Notes, Validation). Coordinate the Status field with issue 797.
  2. Add a sentence on GitHub App private key storage referencing Kubernetes Secrets.
  3. Disambiguate the serial execution framing in the Risks section.
  4. Rename the References section to Links for template consistency.


github-actions bot commented Mar 5, 2026

Full review below - Part 1 of 3

BLOCKER ISSUES: None.

CRITICAL ISSUES: None.

MAJOR ISSUES:

  1. Missing secure credential storage guidance (docs/internal/adr/0007-cicd-deployment-strategy.md, line 31). The document describes GitHub App credentials (App ID + private key) but does not specify how they must be stored. In Kubernetes context they must be stored as Kubernetes Secrets. Without this guidance an implementer could store the key insecurely (ConfigMap, baked into image, etc). Suggested addition: The GitHub App private key and App ID MUST be stored as a Kubernetes Secret and mounted into the runner pod. They must NOT be stored in a ConfigMap, baked into the container image, or committed to source control. Violates: security-standards.md.
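
As a rough illustration of the storage guidance above, the mount might look like the following. This is a sketch only, under stated assumptions: the names `github-app-credentials`, `ci-runner`, and `github-runner`, the mount path, and the image tag are hypothetical and do not come from the ADR, and the startup logic that registers the runner with GitHub is not shown.

```yaml
# Assumes a Secret named github-app-credentials already exists in the namespace,
# created out of band (for example with `oc create secret generic`) so that the
# private key never appears in this manifest or in source control.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: github-runner            # hypothetical name
  namespace: ci-runner           # hypothetical namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: github-runner
  template:
    metadata:
      labels:
        app: github-runner
    spec:
      containers:
        - name: runner
          image: ghcr.io/actions/actions-runner:latest   # tag is an assumption
          volumeMounts:
            - name: github-app
              mountPath: /etc/github-app   # hypothetical path
              readOnly: true
      volumes:
        - name: github-app
          secret:
            secretName: github-app-credentials
```

Mounting the Secret as a read-only volume keeps the key out of the pod spec's environment variables and out of the image layers, which matches the "must NOT" list in the suggested addition.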


github-actions bot commented Mar 5, 2026

Part 2 of 3 - MINOR ISSUES:

  1. Missing Status field (line 3): Every other ADR (0001-0006) includes Status: Accepted after the title block. Suggested: add Status: Accepted.
  2. Missing Technical Story link: RHOAIENG-50709 is in the PR description but not the document. Add: Technical Story: https://issues.redhat.com/browse/RHOAIENG-50709.
  3. Temporal language (line 23): 'This week I have already moved forward with trialing Phase 1' will confuse future readers. Replace with a date-anchored statement e.g. 'As of 2026-03-04, Phase 1 has been trialled in the target environment.'
  4. Inconsistent bullet formatting (lines 97-98): Two bullets have a leading space before the asterisk, unlike every other bullet.
  5. AI co-authorship non-standard (line 5): Authors: Ken Dreyer (with Gemini 3 Pro) - no other ADR attributes an AI assistant. Consider removing the parenthetical.


github-actions bot commented Mar 5, 2026

Part 3 of 3

POSITIVE HIGHLIGHTS:

  • Pragmatic phased approach: Phase 1 deploys with zero IT dependencies - exactly the right call.
  • Proactive risk identification: The 'Risks of Remaining on Phase 1' section goes beyond template requirements and candidly enumerates single-point-of-failure, no-alerting, and manual recovery concerns - genuinely useful for future operators.
  • Concurrency addressed: Proactively identifies that ARC reintroduces concurrent deploy races and proposes the GitHub Actions concurrency key, with honest acknowledgment it negates most of ARC's benefit.
  • Security-conscious design: Outbound-only model, namespace-scoped ServiceAccount, and short-lived GitHub App tokens align with project security standards.
  • Clear option rejection: ArgoCD and VPN options rejected with specific technical reasons.

RECOMMENDATIONS:

  1. (Major) Add guidance that GitHub App credentials must be stored as Kubernetes Secrets.
  2. (Minor) Add Status and Technical Story fields to the header.
  3. (Minor) Replace 'This week...' with a date-anchored statement.
  4. (Minor) Fix leading-space bullets on lines 97-98.
  5. (Minor) Consider removing '(with Gemini 3 Pro)' from Authors.

---

Reviewed with Claude Code (claude-sonnet-4-6) using amber.review

@ktdreyer ktdreyer force-pushed the kdreyer/github-ci-runner-adr branch from 59f3703 to c5a74c3 Compare March 11, 2026 13:11

coderabbitai bot commented Mar 11, 2026

Walkthrough

Adds ADR-0007 documenting CI/CD deployment strategy for firewalled OpenShift clusters, describing Phase 1 standalone GitHub Actions runner deployment and Phase 2 migration to Actions Runner Controller (ARC) pending IT approval.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CI/CD Deployment Strategy ADR (docs/internal/adr/0007-cicd-deployment-strategy.md) | New ADR outlining a two-phase approach: Phase 1 deploys a namespace-scoped standalone GitHub Actions runner Deployment with GitHub App credentials and minimal ServiceAccount permissions; Phase 2 proposes upgrading to Actions Runner Controller (ARC) with CRDs for auto-scaling if approved. Covers requirements, pros/cons, rejected alternatives (ArgoCD, VPN/SSH), concurrency handling, risks, and upgrade criteria. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title directly and clearly describes the main change: adding an ADR document for CI/CD deployment on firewalled clusters, which matches the changeset. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the phased approach (Phase 1 with standalone runner, Phase 2 with ARC) documented in the ADR. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

mergify bot commented Mar 11, 2026

🧪 CI Insights

Here's what we observed from your CI run for c5a74c3.

🟢 All jobs passed!

But CI Insights is watching 👀


ambient-code bot commented Mar 11, 2026

Review Queue — Blockers Found

| Check | Status | Detail |
|---|---|---|
| CI | pass | |
| Merge conflicts | pass | |
| Review comments | FAIL | 2 inline threads on docs/internal/adr/0007-cicd-deployment-strategy.md |
| Jira hygiene | pass | |
| Fork PR | pass | --- |
| Staleness | pass | |

This comment is auto-generated by the Review Queue workflow and will be updated when the PR changes.

@ambient-code ambient-code bot modified the milestones: Merge Queue, Review Queue Mar 11, 2026
@ktdreyer ktdreyer force-pushed the kdreyer/github-ci-runner-adr branch from c5a74c3 to 833a5a7 Compare March 12, 2026 17:30
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/internal/adr/0007-cicd-deployment-strategy.md`:
- Around line 18-32: The ADR is missing a clear trust-boundary and guardrails
for the in-cluster GitHub Actions runners—update the "Phase 1: Standalone
Runner" section to explicitly document that any workflow running on these
self-hosted runners inherits cluster network reachability and the ServiceAccount
permissions (e.g., oc apply), and add concrete constraints: only allow trusted
branches/workflows to target the Deployment-backed runner, disallow PR/fork jobs
from using these runners, require protected environments/approved reviewers or
environment approvals before workflows can perform cluster actions, and note
that ServiceAccount RBAC must be scoped to the minimum required permissions;
reference the Deployment, ServiceAccount, and oc usage in the text so reviewers
can find and verify the guardrails.
- Around line 95-109: Update the ADR to make runner health monitoring a
mandatory Phase 1 requirement: modify the Phase 1 decision/risk text around the
"standalone runner" and "Risks of Remaining on Phase 1" sections to require
observability and alerting (not just list it as a risk) and add explicit
acceptance criteria describing alerts for runner offline/registration failures,
long queued-job age, and pod unavailability; ensure the ADR states these
monitoring checks must be implemented before adopting Phase 1 and reference the
standalone runner and registration/queued-job/pod availability terms used in the
document.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3c0cc31a-40c9-459d-a568-3af1d7250234

📥 Commits

Reviewing files that changed from the base of the PR and between 373db39 and 833a5a7.

📒 Files selected for processing (1)
  • docs/internal/adr/0007-cicd-deployment-strategy.md

Comment on lines +18 to +32
We will run self-hosted GitHub Actions runners inside our firewalled OpenShift cluster. The runners make outbound connections to GitHub, pick up jobs, and execute them locally. Because they live inside the cluster and firewall, they can talk directly to the OpenShift API via `oc`.

We are rolling this out in two phases.

This week I have already moved forward with trialing Phase 1.

### Phase 1: Standalone Runner

Deploy a standalone GitHub Actions runner as a regular `Deployment` — no CRDs, no `ClusterRoles`, no operator.

**How it works:**

* A `Deployment` with `replicas: 1` runs the [GitHub Actions runner agent](https://github.com/actions/runner/pkgs/container/actions-runner).
* At startup, the runner uses GitHub App credentials (App ID + private key) to generate a short-lived registration token and register itself with GitHub.
* The runner's `ServiceAccount` only needs the permissions our CI jobs require (e.g. `oc apply` to target namespaces). It does not need any cluster-level permissions.

⚠️ Potential issue | 🟠 Major

Document the trust boundary for these in-cluster runners.

Because the repo is public, any workflow that can land on this self-hosted runner inherits cluster reachability plus the ServiceAccount permissions from Line 32. The ADR should make the guardrails part of the decision: only trusted deploy workflows/branches may target these runners, PR/fork jobs must stay off them, and protected environments/approvals should gate cluster access. Without that, this design creates a direct path from untrusted workflow code into the firewalled cluster.
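
To make the guardrails concrete, a deploy workflow restricted in this way might look like the sketch below. Everything named here is an assumption for illustration: the `in-cluster-runner` label, the `production` environment, the `main` branch, and the `manifests/` path are hypothetical and are not taken from the ADR.

```yaml
name: deploy                 # hypothetical workflow name
on:
  push:
    branches: [main]         # no pull_request trigger in this workflow

jobs:
  deploy:
    # Extra guard in case more triggers are added to this file later.
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: [self-hosted, in-cluster-runner]   # hypothetical runner label
    environment: production                      # hypothetical protected environment
    steps:
      - uses: actions/checkout@v4
      - run: oc apply -f manifests/              # placeholder path
```

A workflow file alone probably cannot enforce the boundary, because a pull request can propose its own workflow that names the same runner label. Repository or organization settings that restrict which workflows may use the runner, and that require approval before fork-origin workflows run, would likely need to back this up.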

Suggested ADR addition
 **How it works:**
 
 * A `Deployment` with `replicas: 1` runs the [GitHub Actions runner agent](https://github.com/actions/runner/pkgs/container/actions-runner).
 * At startup, the runner uses GitHub App credentials (App ID + private key) to generate a short-lived registration token and register itself with GitHub.
 * The runner's `ServiceAccount` only needs the permissions our CI jobs require (e.g. `oc apply` to target namespaces). It does not need any cluster-level permissions.
+* **Security guardrails:** Only trusted deploy workflows on protected branches/environments may target this runner. PRs, fork-based workflows, and other untrusted jobs must use GitHub-hosted runners and must not receive cluster credentials.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
We will run self-hosted GitHub Actions runners inside our firewalled OpenShift cluster. The runners make outbound connections to GitHub, pick up jobs, and execute them locally. Because they live inside the cluster and firewall, they can talk directly to the OpenShift API via `oc`.
We are rolling this out in two phases.
This week I have already moved forward with trialing Phase 1.
### Phase 1: Standalone Runner
Deploy a standalone GitHub Actions runner as a regular `Deployment` — no CRDs, no `ClusterRoles`, no operator.
**How it works:**
* A `Deployment` with `replicas: 1` runs the [GitHub Actions runner agent](https://github.com/actions/runner/pkgs/container/actions-runner).
* At startup, the runner uses GitHub App credentials (App ID + private key) to generate a short-lived registration token and register itself with GitHub.
* The runner's `ServiceAccount` only needs the permissions our CI jobs require (e.g. `oc apply` to target namespaces). It does not need any cluster-level permissions.
We will run self-hosted GitHub Actions runners inside our firewalled OpenShift cluster. The runners make outbound connections to GitHub, pick up jobs, and execute them locally. Because they live inside the cluster and firewall, they can talk directly to the OpenShift API via `oc`.
We are rolling this out in two phases.
This week I have already moved forward with trialing Phase 1.
### Phase 1: Standalone Runner
Deploy a standalone GitHub Actions runner as a regular `Deployment` — no CRDs, no `ClusterRoles`, no operator.
**How it works:**
* A `Deployment` with `replicas: 1` runs the [GitHub Actions runner agent](https://github.com/actions/runner/pkgs/container/actions-runner).
* At startup, the runner uses GitHub App credentials (App ID + private key) to generate a short-lived registration token and register itself with GitHub.
* The runner's `ServiceAccount` only needs the permissions our CI jobs require (e.g. `oc apply` to target namespaces). It does not need any cluster-level permissions.
* **Security guardrails:** Only trusted deploy workflows on protected branches/environments may target this runner. PRs, fork-based workflows, and other untrusted jobs must use GitHub-hosted runners and must not receive cluster credentials.

Comment on lines +95 to +109
**Risks:**

* No alerting on standalone runner failure. If the pod crashes, deployments silently stop and GitHub Actions jobs queue instead of running. We need monitoring to detect this.
* IT may not approve CRDs for Phase 2, leaving us on Phase 1 permanently.
* Moving to Phase 2 introduces concurrent deploy jobs. We would need to investigate serialization or locking to prevent jobs from trampling each other in prod.

## Risks of Remaining on Phase 1

If we cannot move to Phase 2, the standalone runner carries ongoing operational risks:

* **Single point of failure.** One pod handles all CI jobs. If it crashes or is evicted, no jobs run until the pod restarts, without notifying a person or agentic process.
* **No concurrency.** Jobs run sequentially, which prevents deploy races but increases latency when multiple PRs merge quickly.
* **No auto-scaling.** The runner pod runs continuously regardless of load — wasting resources when idle, unable to scale during bursts.
* **Manual recovery.** If the runner loses its GitHub registration (e.g. after a credential rotation or a prolonged outage), someone must re-register it manually.
* **No built-in high availability.** Running multiple replicas of the standalone runner may cause conflicts with job pickup and GitHub registration. A high-availability solution would require further investigation.

⚠️ Potential issue | 🟠 Major

Make runner health monitoring a Phase 1 requirement, not just a listed risk.

The ADR currently accepts a failure mode where deployments stop silently. For the selected Phase 1 design, minimum observability should be part of the decision itself: alert on runner offline/registration failure, queued-job age, and pod unavailability. Otherwise releases can stall indefinitely without anyone noticing.

Suggested ADR addition
 **Risks:**
 
-* No alerting on standalone runner failure. If the pod crashes, deployments silently stop and GitHub Actions jobs queue instead of running. We need monitoring to detect this.
+* Standalone runner failure must be monitored explicitly. Before rollout, we need alerting for runner pod availability, runner registration/online status, and queued-job age so that deployment outages are detected quickly.
 * IT may not approve CRDs for Phase 2, leaving us on Phase 1 permanently.
 * Moving to Phase 2 introduces concurrent deploy jobs. We would need to investigate serialization or locking to prevent jobs from trampling each other in prod.

Contributor Author

We're not at a point in this project's lifetime where we have monitoring infra into which this information could flow, nor a staff of AI devops agents or human engineers to act upon it.

It's a good idea, but we don't have it presently.

We probably need a separate ADR for this.

@ambient-code ambient-code bot removed this from the Review Queue milestone Mar 13, 2026
Present a phased approach for running GitHub Actions inside firewalled
OpenShift.

Phase 1 deploys a standalone runner with no CRDs or cluster-level
permissions.

Phase 2 replaces that with an upgrade to Actions Runner Controller
(ARC), if IT approves CRDs.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@ktdreyer ktdreyer force-pushed the kdreyer/github-ci-runner-adr branch from 833a5a7 to cd02e76 Compare March 16, 2026 13:28
@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (2)
docs/internal/adr/0007-cicd-deployment-strategy.md (2)

95-109: ⚠️ Potential issue | 🟠 Major

Make monitoring a Phase 1 adoption requirement, not an optional risk note.

The ADR currently accepts silent deployment outages as a known risk instead of defining mandatory observability gates for the selected architecture.

Suggested ADR patch
 **Risks:**
 
-* No alerting on standalone runner failure. If the pod crashes, deployments silently stop and GitHub Actions jobs queue instead of running. We need monitoring to detect this.
+* Phase 1 requires monitoring before rollout: alert on runner pod availability, runner registration/online status, and queued-job age so deployment outages are detected quickly.
 * IT may not approve CRDs for Phase 2, leaving us on Phase 1 permanently.
 * Moving to Phase 2 introduces concurrent deploy jobs. We would need to investigate serialization or locking to prevent jobs from trampling each other in prod.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/internal/adr/0007-cicd-deployment-strategy.md` around lines 95 - 109,
Change the ADR so monitoring/observability is a required Phase 1 adoption
criterion rather than an optional risk note: update the "Risks:" and "Risks of
Remaining on Phase 1" sections to include a mandatory monitoring checklist for
Phase 1 that specifies alerting on standalone runner pod crashes and GitHub
registration loss, health/readiness probes, key metrics to track (runner
availability, job queue length, registration state), escalation/SLAs, and a
deployment gate that prevents Phase 1 adoption without these observability
controls; ensure the text clearly labels these items as required for Phase 1
(not merely recommended) and reference the Phase 1/Phase 2 decision points.

30-33: ⚠️ Potential issue | 🟠 Major

Add explicit trust-boundary guardrails for the in-cluster runner.

This section still does not state which workflows are allowed to run on this runner. In a public repo, that leaves a security gap: untrusted workflow code could inherit in-cluster network reachability and ServiceAccount permissions.

Suggested ADR patch
 * A `Deployment` with `replicas: 1` runs the [GitHub Actions runner agent](https://github.com/actions/runner/pkgs/container/actions-runner).
 * At startup, the runner uses GitHub App credentials (App ID + private key) to generate a short-lived registration token and register itself with GitHub.
 * The runner's `ServiceAccount` only needs the permissions our CI jobs require (e.g. `oc apply` to target namespaces). It does not need any cluster-level permissions.
+* **Trust boundary / guardrails:** Only trusted deploy workflows from protected branches/environments may target this runner. PR and fork-origin workflows must not use this runner and must remain on GitHub-hosted runners. Environment approvals are required before cluster actions run.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/internal/adr/0007-cicd-deployment-strategy.md` around lines 30 - 33, Add
a short explicit "Trust boundary & allowed workflows" section to the ADR that
names the in-cluster runner (the Deployment with replicas: 1 running the
actions-runner) and the runner's ServiceAccount, then prescribe concrete
guardrails: (1) require all workflows that may run on this runner to include a
dedicated self-hosted label (e.g. runs-on: [self-hosted, in-cluster-runner]) and
document the approved workflow list or repo(s)/org(s) allowed to use that label;
(2) lock down GitHub App permissions and registration token usage to the minimum
scopes and rotation policy used by the runner registration flow; (3) restrict
the runner's ServiceAccount RBAC to only the exact verbs/namespaces needed and
record those scopes in the ADR; and (4) add network-level controls (namespace
NetworkPolicy or Pod egress controls) and an enforcement mechanism (e.g.
admission policy or GitHub repository settings) to prevent untrusted workflows
from inheriting cluster reachability—reference the Deployment, runner, and
ServiceAccount symbols so reviewers can find where to apply these controls.
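
Guardrails (3) and (4) could be sketched as the manifests below. Every name, namespace, and resource list here is hypothetical; the real verbs and namespaces would come from what the CI jobs actually deploy. Egress rules are left out because the allow-list (GitHub endpoints, cluster DNS, API server) depends on the cluster.

```yaml
# Sketch only: all names, namespaces, and resource lists are hypothetical.
# Namespaced Role: grants only what the deploy job needs, nothing cluster-wide.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ci-deployer
  namespace: target-namespace
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "create", "patch", "update"]
---
# Binds the Role to the runner's ServiceAccount from the runner namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-deployer
  namespace: target-namespace
subjects:
  - kind: ServiceAccount
    name: github-runner
    namespace: ci-runner
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ci-deployer
---
# The runner polls GitHub over outbound HTTPS, so it needs no inbound traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: github-runner-deny-ingress
  namespace: ci-runner
spec:
  podSelector:
    matchLabels:
      app: github-runner
  policyTypes: [Ingress]
  ingress: []
```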

📥 Commits

Reviewing files that changed from the base of the PR and between 833a5a7 and cd02e76.

📒 Files selected for processing (1)
  • docs/internal/adr/0007-cicd-deployment-strategy.md

@Gkrumbach07 Gkrumbach07 added this to the Review Queue milestone Mar 16, 2026
**Cons:**

* No auto-scaling — the runner pod is always running regardless of job queue depth
* Single point of failure — if the pod crashes, jobs silently queue instead of running
Contributor Author

  • No job isolation. Every job reuses the same environment (pod) as all previous jobs until something or someone outside GitHub restarts that pod.
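
One possible mitigation within Phase 1, sketched here as an assumption rather than something the ADR proposes, is to register the runner with the `--ephemeral` flag so it exits after a single job and the container restarts with a fresh filesystem. The names, the `REGISTRATION_TOKEN` and `GITHUB_REPOSITORY` variables, and the Secret are hypothetical; in the ADR's design the token is generated at startup from the GitHub App credentials.

```yaml
# Sketch only: names, env vars, and the Secret are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: github-runner
  namespace: ci-runner
spec:
  replicas: 1
  selector:
    matchLabels:
      app: github-runner
  template:
    metadata:
      labels:
        app: github-runner
    spec:
      serviceAccountName: github-runner
      containers:
        - name: runner
          image: ghcr.io/actions/actions-runner:latest
          envFrom:
            - secretRef:
                name: github-runner-registration   # hypothetical Secret
          command: ["/bin/bash", "-c"]
          args:
            - |
              # --ephemeral: accept one job, then deregister and exit.
              ./config.sh --unattended --ephemeral \
                --url "https://github.com/${GITHUB_REPOSITORY}" \
                --token "${REGISTRATION_TOKEN}" \
                --labels in-cluster-runner
              exec ./run.sh
```

Because the container exits after every job, the kubelet's restart back-off may delay the next registration, and any mounted volumes would still persist across jobs.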

@ambient-code ambient-code bot modified the milestone: Review Queue Mar 18, 2026
@ambient-code ambient-code deleted a comment from github-actions bot Mar 18, 2026
@ambient-code ambient-code deleted a comment from github-actions bot Mar 18, 2026
@ambient-code ambient-code deleted a comment from github-actions bot Mar 18, 2026
@ambient-code ambient-code bot removed this from the Review Queue milestone Mar 26, 2026
@ambient-code ambient-code bot added this to the Review Queue milestone Mar 30, 2026
3 participants