RHOAIENG-50709: add ADR for CI/CD on firewalled clusters#793
Re: "Missing Status field", I'm addressing that in #797
Claude Code Review Summary

PR 793 adds ADR-0007 documenting the CI/CD deployment strategy for firewalled OpenShift clusters. The content is well-structured and clearly explains the phased approach (standalone runner to ARC). However, the document deviates from the project ADR template and README-defined workflow in several ways that reduce navigability and completeness.

Issues by Severity

Blocker Issues: None.

Critical Issues: None.

Major Issues

* M1. Missing Status field: The ADR README defines an explicit lifecycle: Proposed → Accepted → Deprecated → Superseded. Every accepted ADR (0001–0006) includes a Fix: add
* M2. Considered Options placed after Decision: The template and all accepted ADRs (0001–0005) list Considered Options before the Decision Outcome so readers understand the option space before seeing the rationale. Here the decision is presented first and rejected options follow, a structural inversion of the established pattern. Fix: move the Considered Options section above Decision.
* M3. README index not updated: The README table ends at ADR-0005 (ADR-0006 is also absent). Per the ADR workflow defined in the README, the table should be updated when an ADR is proposed. Fix: append both ADR-0006 (Ambient Runner SDK Architecture, 2026-02-10) and ADR-0007 (this ADR, 2026-03-04) to the table.

Minor Issues

* m4. Missing Technical Story field: The template requires a Fix: add
* m5. Missing Decision Drivers section: The two driving properties (Immediacy and Visibility) are described in the Context prose but not extracted into a dedicated Decision Drivers section, making it harder to evaluate future alternatives against the same criteria.
* m6. Consequences is a top-level section: Per the template, Consequences (Positive/Negative/Risks) should be a subsection under Decision Outcome, not a top-level sibling.
* m7. Missing Implementation Notes and Validation sections: These template sections are valuable post-rollout. Add them with [TBD] placeholders now so they get filled in after the runner is deployed.

Positive Highlights
Recommendations
Reviewed by Claude Code using repository standards from |
Summary
PR adds ADR-0007, a phased CI/CD deployment strategy for firewalled OpenShift clusters using self-hosted GitHub Actions runners. The architectural reasoning is sound and the trade-off analysis is honest. The document is well-written, but deviates from the project ADR template structure and has a security guidance gap around credential storage. No blocker or critical issues.
Major Issues:
Minor Issues:
Positive Highlights:
Recommendations:
Full review below - Part 1 of 3:

BLOCKER ISSUES: None.

CRITICAL ISSUES: None.

MAJOR ISSUES:

1. Missing secure credential storage guidance (docs/internal/adr/0007-cicd-deployment-strategy.md, line 31). The document describes GitHub App credentials (App ID + private key) but does not specify how they must be stored. In a Kubernetes context they must be stored as Kubernetes Secrets. Without this guidance an implementer could store the key insecurely (in a ConfigMap, baked into an image, etc.). Suggested addition: The GitHub App private key and App ID MUST be stored as a Kubernetes Secret and mounted into the runner pod. They must NOT be stored in a ConfigMap, baked into the container image, or committed to source control. Violates: security-standards.md.
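To make the suggested addition concrete, here is a minimal sketch of a runner `Deployment` that mounts the GitHub App credentials from a Kubernetes Secret. Everything named here is an assumption for illustration and does not come from the ADR: the `actions-runner` names, the `github-app-credentials` Secret, and the `/etc/github-app` mount path. How the runner's entrypoint reads the mounted files is also left open.

```yaml
# Sketch only: all names (actions-runner, github-app-credentials,
# /etc/github-app) are hypothetical and not taken from the ADR.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: actions-runner
spec:
  replicas: 1
  selector:
    matchLabels:
      app: actions-runner
  template:
    metadata:
      labels:
        app: actions-runner
    spec:
      # Namespace-scoped ServiceAccount; no cluster-level permissions.
      serviceAccountName: actions-runner
      containers:
        - name: runner
          image: ghcr.io/actions/actions-runner:latest
          volumeMounts:
            # App ID and private key appear as read-only files in the pod.
            - name: github-app
              mountPath: /etc/github-app
              readOnly: true
      volumes:
        - name: github-app
          secret:
            # Created out of band, never committed to source control.
            secretName: github-app-credentials
```

Under these assumptions the Secret itself would be created directly against the cluster (for example with `oc create secret generic`), so the private key never appears in a manifest, a ConfigMap, or the image.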
Part 2 of 3 - MINOR ISSUES:

1. Missing Status field (line 3): Every other ADR (0001-0006) includes Status: Accepted after the title block. Suggested: add Status: Accepted.
2. Missing Technical Story link: RHOAIENG-50709 is in the PR description but not the document. Add: Technical Story: https://issues.redhat.com/browse/RHOAIENG-50709.
3. Temporal language (line 23): 'This week I have already moved forward with trialing Phase 1' will confuse future readers. Replace with a date-anchored statement, e.g. 'As of 2026-03-04, Phase 1 has been trialled in the target environment.'
4. Inconsistent bullet formatting (lines 97-98): Two bullets have a leading space before the asterisk, unlike every other bullet.
5. AI co-authorship non-standard (line 5): Authors: Ken Dreyer (with Gemini 3 Pro) - no other ADR attributes an AI assistant. Consider removing the parenthetical.
Part 3 of 3 - POSITIVE HIGHLIGHTS:

* Pragmatic phased approach: Phase 1 deploys with zero IT dependencies - exactly the right call.
* Proactive risk identification: The 'Risks of Remaining on Phase 1' section goes beyond template requirements and candidly enumerates single-point-of-failure, no-alerting, and manual recovery concerns - genuinely useful for future operators.
* Concurrency addressed: Proactively identifies that ARC reintroduces concurrent deploy races and proposes the GitHub Actions concurrency key, with honest acknowledgment it negates most of ARC's benefit.
* Security-conscious design: Outbound-only model, namespace-scoped ServiceAccount, and short-lived GitHub App tokens align with project security standards.
* Clear option rejection: ArgoCD and VPN options rejected with specific technical reasons.

RECOMMENDATIONS:

1. (Major) Add guidance that GitHub App credentials must be stored as Kubernetes Secrets.
2. (Minor) Add Status and Technical Story fields to the header.
3. (Minor) Replace 'This week...' with a date-anchored statement.
4. (Minor) Fix leading-space bullets on lines 97-98.
5. (Minor) Consider removing '(with Gemini 3 Pro)' from Authors.

Reviewed with Claude Code (claude-sonnet-4-6) using amber.review
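For readers unfamiliar with the GitHub Actions concurrency key mentioned under "Concurrency addressed", the following is a rough sketch of how it serializes deploy jobs. The workflow name, the `deploy-prod` group name, and the `manifests/` path are hypothetical and not taken from the ADR.

```yaml
# Sketch only: workflow name, group name, and manifest path are illustrative.
name: deploy
on:
  push:
    branches: [main]
# All runs sharing this group execute one at a time.
concurrency:
  group: deploy-prod
  # Let an in-flight deploy finish instead of cancelling it mid-apply.
  cancel-in-progress: false
jobs:
  deploy:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: oc apply -f manifests/
```

With a single shared group, only one deploy runs at a time no matter how many runners ARC provides, which is why the review notes that this negates most of ARC's benefit for deploy jobs. GitHub also keeps at most one pending run per group, so an older queued deploy is cancelled when a newer one arrives.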
Walkthrough
Adds ADR-0007 documenting CI/CD deployment strategy for firewalled OpenShift clusters, describing Phase 1 standalone GitHub Actions runner deployment and Phase 2 migration to Actions Runner Controller (ARC) pending IT approval.
Changes
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/internal/adr/0007-cicd-deployment-strategy.md`:
- Around line 18-32: The ADR is missing a clear trust-boundary and guardrails
for the in-cluster GitHub Actions runners—update the "Phase 1: Standalone
Runner" section to explicitly document that any workflow running on these
self-hosted runners inherits cluster network reachability and the ServiceAccount
permissions (e.g., oc apply), and add concrete constraints: only allow trusted
branches/workflows to target the Deployment-backed runner, disallow PR/fork jobs
from using these runners, require protected environments/approved reviewers or
environment approvals before workflows can perform cluster actions, and note
that ServiceAccount RBAC must be scoped to the minimum required permissions;
reference the Deployment, ServiceAccount, and oc usage in the text so reviewers
can find and verify the guardrails.
- Around line 95-109: Update the ADR to make runner health monitoring a
mandatory Phase 1 requirement: modify the Phase 1 decision/risk text around the
"standalone runner" and "Risks of Remaining on Phase 1" sections to require
observability and alerting (not just list it as a risk) and add explicit
acceptance criteria describing alerts for runner offline/registration failures,
long queued-job age, and pod unavailability; ensure the ADR states these
monitoring checks must be implemented before adopting Phase 1 and reference the
standalone runner and registration/queued-job/pod availability terms used in the
document.
📒 Files selected for processing (1)
docs/internal/adr/0007-cicd-deployment-strategy.md
We will run self-hosted GitHub Actions runners inside our firewalled OpenShift cluster. The runners make outbound connections to GitHub, pick up jobs, and execute them locally. Because they live inside the cluster and firewall, they can talk directly to the OpenShift API via `oc`.

We are rolling this out in two phases.

This week I have already moved forward with trialing Phase 1.

### Phase 1: Standalone Runner

Deploy a standalone GitHub Actions runner as a regular `Deployment`: no CRDs, no `ClusterRoles`, no operator.

**How it works:**

* A `Deployment` with `replicas: 1` runs the [GitHub Actions runner agent](https://github.com/actions/runner/pkgs/container/actions-runner).
* At startup, the runner uses GitHub App credentials (App ID + private key) to generate a short-lived registration token and register itself with GitHub.
* The runner's `ServiceAccount` only needs the permissions our CI jobs require (e.g. `oc apply` to target namespaces). It does not need any cluster-level permissions.
Document the trust boundary for these in-cluster runners.
Because the repo is public, any workflow that can land on this self-hosted runner inherits cluster reachability plus the ServiceAccount permissions from Line 32. The ADR should make the guardrails part of the decision: only trusted deploy workflows/branches may target these runners, PR/fork jobs must stay off them, and protected environments/approvals should gate cluster access. Without that, this design creates a direct path from untrusted workflow code into the firewalled cluster.
Suggested ADR addition
**How it works:**
* A `Deployment` with `replicas: 1` runs the [GitHub Actions runner agent](https://github.com/actions/runner/pkgs/container/actions-runner).
* At startup, the runner uses GitHub App credentials (App ID + private key) to generate a short-lived registration token and register itself with GitHub.
* The runner's `ServiceAccount` only needs the permissions our CI jobs require (e.g. `oc apply` to target namespaces). It does not need any cluster-level permissions.
+* **Security guardrails:** Only trusted deploy workflows on protected branches/environments may target this runner. PRs, fork-based workflows, and other untrusted jobs must use GitHub-hosted runners and must not receive cluster credentials.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
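One way the guardrails in this suggestion might look on the workflow side is sketched below. It is an assumption-laden illustration, not the project's actual workflow: the `in-cluster-runner` label, the `production` environment, the `example-org/example-repo` name, and the `manifests/` path are all hypothetical.

```yaml
# Sketch only: labels, environment, repository name, and paths are hypothetical.
name: deploy-to-cluster
on:
  push:
    branches: [main]   # no pull_request trigger, so PR and fork jobs never start here
jobs:
  deploy:
    # Defense in depth: skip unless this is the canonical repo on the protected branch.
    if: github.repository == 'example-org/example-repo' && github.ref == 'refs/heads/main'
    # Only jobs that ask for the dedicated label land on the in-cluster runner.
    runs-on: [self-hosted, in-cluster-runner]
    # Required reviewers on this environment (set in repository settings) gate the job.
    environment: production
    steps:
      - uses: actions/checkout@v4
      - run: oc apply -f manifests/
```

The workflow file alone is not an enforcement boundary, because anyone who can change a workflow can also request the label. The repository-side settings (branch protection, environment approvals, and restrictions on which workflows may use self-hosted runners) still have to back it up.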
**Risks:**

* No alerting on standalone runner failure. If the pod crashes, deployments silently stop and GitHub Actions jobs queue instead of running. We need monitoring to detect this.
* IT may not approve CRDs for Phase 2, leaving us on Phase 1 permanently.
* Moving to Phase 2 introduces concurrent deploy jobs. We would need to investigate serialization or locking to prevent jobs from trampling each other in prod.

## Risks of Remaining on Phase 1

If we cannot move to Phase 2, the standalone runner carries ongoing operational risks:

* **Single point of failure.** One pod handles all CI jobs. If it crashes or is evicted, no jobs run until the pod restarts, without notifying a person or agentic process.
* **No concurrency.** Jobs run sequentially, which prevents deploy races but increases latency when multiple PRs merge quickly.
* **No auto-scaling.** The runner pod runs continuously regardless of load: wasting resources when idle, unable to scale during bursts.
* **Manual recovery.** If the runner loses its GitHub registration (e.g. after a credential rotation or a prolonged outage), someone must re-register it manually.
* **No built-in high availability.** Running multiple replicas of the standalone runner may cause conflicts with job pickup and GitHub registration. A high-availability solution would require further investigation.
Make runner health monitoring a Phase 1 requirement, not just a listed risk.
The ADR currently accepts a failure mode where deployments stop silently. For the selected Phase 1 design, minimum observability should be part of the decision itself: alert on runner offline/registration failure, queued-job age, and pod unavailability. Otherwise releases can stall indefinitely without anyone noticing.
Suggested ADR addition
**Risks:**
-* No alerting on standalone runner failure. If the pod crashes, deployments silently stop and GitHub Actions jobs queue instead of running. We need monitoring to detect this.
+* Standalone runner failure must be monitored explicitly. Before rollout, we need alerting for runner pod availability, runner registration/online status, and queued-job age so that deployment outages are detected quickly.
* IT may not approve CRDs for Phase 2, leaving us on Phase 1 permanently.
* Moving to Phase 2 introduces concurrent deploy jobs. We would need to investigate serialization or locking to prevent jobs from trampling each other in prod.
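As a rough illustration of the lightest form such a check could take, the sketch below runs a scheduled workflow on a GitHub-hosted runner and fails when no self-hosted runner reports as online. It assumes the runner is registered at the repository level and that a token allowed to list runners is stored in a secret; the secret name `RUNNER_ADMIN_TOKEN` and the 15-minute schedule are hypothetical. It covers only runner online status, not queued-job age or pod availability.

```yaml
# Sketch only: secret name and schedule are hypothetical.
name: runner-health
on:
  schedule:
    - cron: "*/15 * * * *"
jobs:
  check:
    # GitHub-hosted, so the check still runs when the in-cluster runner is down.
    runs-on: ubuntu-latest
    steps:
      - name: Fail if no self-hosted runner is online
        env:
          GH_TOKEN: ${{ secrets.RUNNER_ADMIN_TOKEN }}
        run: |
          # List the repository's self-hosted runners and count the online ones.
          online=$(gh api "repos/${{ github.repository }}/actions/runners" \
            --jq '[.runners[] | select(.status == "online")] | length')
          echo "online self-hosted runners: ${online}"
          test "${online}" -gt 0
```

A failed run surfaces through GitHub's normal workflow-failure notifications, so this approach needs no separate monitoring stack. It does not replace real alerting.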
We're not at a point in this project's lifetime where we have monitoring infra into which this information could flow, nor a staff of AI devops agents or human engineers to act upon it.
It's a good idea, but we don't have it presently.
We probably need a separate ADR for this.
Present a phased approach for running GitHub Actions inside firewalled OpenShift. Phase 1 deploys a standalone runner with no CRDs or cluster-level permissions. Phase 2 replaces that with an upgrade to Actions Runner Controller (ARC), if IT approves CRDs.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
♻️ Duplicate comments (2)
docs/internal/adr/0007-cicd-deployment-strategy.md (2)
95-109: ⚠️ Potential issue | 🟠 Major
Make monitoring a Phase 1 adoption requirement, not an optional risk note.
The ADR currently accepts silent deployment outages as a known risk instead of defining mandatory observability gates for the selected architecture.
Suggested ADR patch
**Risks:**
-* No alerting on standalone runner failure. If the pod crashes, deployments silently stop and GitHub Actions jobs queue instead of running. We need monitoring to detect this.
+* Phase 1 requires monitoring before rollout: alert on runner pod availability, runner registration/online status, and queued-job age so deployment outages are detected quickly.
* IT may not approve CRDs for Phase 2, leaving us on Phase 1 permanently.
* Moving to Phase 2 introduces concurrent deploy jobs. We would need to investigate serialization or locking to prevent jobs from trampling each other in prod.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/internal/adr/0007-cicd-deployment-strategy.md` around lines 95 - 109, Change the ADR so monitoring/observability is a required Phase 1 adoption criterion rather than an optional risk note: update the "Risks:" and "Risks of Remaining on Phase 1" sections to include a mandatory monitoring checklist for Phase 1 that specifies alerting on standalone runner pod crashes and GitHub registration loss, health/readiness probes, key metrics to track (runner availability, job queue length, registration state), escalation/SLAs, and a deployment gate that prevents Phase 1 adoption without these observability controls; ensure the text clearly labels these items as required for Phase 1 (not merely recommended) and reference the Phase 1/Phase 2 decision points.
30-33: ⚠️ Potential issue | 🟠 Major
Add explicit trust-boundary guardrails for the in-cluster runner.
This section still does not state which workflows are allowed to run on this runner. In a public repo, that leaves a security gap: untrusted workflow code could inherit in-cluster network reachability and `ServiceAccount` permissions.
Suggested ADR patch
* A `Deployment` with `replicas: 1` runs the [GitHub Actions runner agent](https://github.com/actions/runner/pkgs/container/actions-runner).
* At startup, the runner uses GitHub App credentials (App ID + private key) to generate a short-lived registration token and register itself with GitHub.
* The runner's `ServiceAccount` only needs the permissions our CI jobs require (e.g. `oc apply` to target namespaces). It does not need any cluster-level permissions.
+* **Trust boundary / guardrails:** Only trusted deploy workflows from protected branches/environments may target this runner. PR and fork-origin workflows must not use this runner and must remain on GitHub-hosted runners. Environment approvals are required before cluster actions run.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/internal/adr/0007-cicd-deployment-strategy.md` around lines 30 - 33, Add a short explicit "Trust boundary & allowed workflows" section to the ADR that names the in-cluster runner (the Deployment with replicas: 1 running the actions-runner) and the runner's ServiceAccount, then prescribe concrete guardrails: (1) require all workflows that may run on this runner to include a dedicated self-hosted label (e.g. runs-on: [self-hosted, in-cluster-runner]) and document the approved workflow list or repo(s)/org(s) allowed to use that label; (2) lock down GitHub App permissions and registration token usage to the minimum scopes and rotation policy used by the runner registration flow; (3) restrict the runner's ServiceAccount RBAC to only the exact verbs/namespaces needed and record those scopes in the ADR; and (4) add network-level controls (namespace NetworkPolicy or Pod egress controls) and an enforcement mechanism (e.g. admission policy or GitHub repository settings) to prevent untrusted workflows from inheriting cluster reachability—reference the Deployment, runner, and ServiceAccount symbols so reviewers can find where to apply these controls.
📒 Files selected for processing (1)
docs/internal/adr/0007-cicd-deployment-strategy.md
**Cons:**

* No auto-scaling: the runner pod is always running regardless of job queue depth
* Single point of failure: if the pod crashes, jobs silently queue instead of running
- No job isolation. Every job reuses the same environment (pod), including whatever state earlier jobs left behind, until something or someone outside GitHub restarts that pod.
Present a phased approach for running GitHub Actions inside firewalled OpenShift.
Phase 1 deploys a standalone runner with no CRDs or cluster-level permissions.
Phase 2 upgrades to Actions Runner Controller (ARC) if IT approves CRDs.
https://issues.redhat.com/browse/RHOAIENG-50709