
ci-operator: Fix Release All Leases Unreachable Code#4988

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from danilo-gemoli:fix/ci-operator/release-all-leases
Mar 5, 2026

Conversation

@danilo-gemoli
Contributor

@danilo-gemoli danilo-gemoli commented Mar 5, 2026

The ReleaseAll() safety net was, indeed, unreachable:

t := time.NewTicker(30 * time.Second)
for range t.C {
  if err := o.leaseClient.Heartbeat(); err != nil {
    logrus.WithError(err).Warn("Failed to update leases.")
  }
}
o.leaseClient.ReleaseAll()

This is an infinite loop: the ticker channel is never closed, so o.leaseClient.ReleaseAll() is never executed.
This is an attempt to fix it by:

  • Running the heartbeat function in its own stoppable goroutine.
  • Stopping the heartbeat goroutine upon test completion and then releasing any remaining leases.
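
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: fakeLeaseClient, startHeartbeat, runWithLeases and the lease names are hypothetical stand-ins, and the real client in pkg/lease has a different shape.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fakeLeaseClient is a hypothetical stand-in for the real lease client.
type fakeLeaseClient struct {
	mu         sync.Mutex
	leases     []string
	heartbeats int
}

// Heartbeat pretends to refresh every held lease.
func (c *fakeLeaseClient) Heartbeat() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.heartbeats++
	return nil
}

// ReleaseAll drops every held lease and returns the names it released.
func (c *fakeLeaseClient) ReleaseAll() ([]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	released := c.leases
	c.leases = nil
	return released, nil
}

// startHeartbeat calls Heartbeat on every tick until the returned channel is closed.
func startHeartbeat(c *fakeLeaseClient, interval time.Duration) chan struct{} {
	stop := make(chan struct{})
	go func() {
		t := time.NewTicker(interval)
		defer t.Stop() // free the ticker when the goroutine exits
		for {
			select {
			case <-t.C:
				if err := c.Heartbeat(); err != nil {
					fmt.Println("failed to update leases:", err)
				}
			case <-stop:
				return
			}
		}
	}()
	return stop
}

// runWithLeases starts the heartbeat, runs the work, then stops the heartbeat
// and releases whatever leases are still held.
func runWithLeases(c *fakeLeaseClient, work func()) (released []string) {
	stop := startHeartbeat(c, 10*time.Millisecond)
	defer func() {
		close(stop)
		released, _ = c.ReleaseAll()
	}()
	work()
	return
}

func main() {
	c := &fakeLeaseClient{leases: []string{"lease-a"}}
	released := runWithLeases(c, func() { time.Sleep(50 * time.Millisecond) })
	fmt.Println("released on exit:", released)
}
```

Because the cleanup lives in a defer, it runs on every return path of the enclosing function, which is what makes the ReleaseAll() call reachable again.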

Summary by CodeRabbit

  • Refactor
    • Added dedicated heartbeat control for lease management to improve reliability of background lease maintenance.
    • Centralized and enhanced cleanup to ensure all leased resources are released on exit, preventing potential resource leaks.
    • Streamlined lease startup/shutdown flow for clearer lifecycle handling and more predictable shutdown behavior.

@openshift-ci-robot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai

coderabbitai bot commented Mar 5, 2026

Walkthrough

Adds a lease heartbeat mechanism via startLeaseHearthbeating() and refactors Run() to start the heartbeat and defer cleanup (stop heartbeat and release leaked leases); also adds an explicit return nil in initializeLeaseClient().

Changes

| Cohort / File(s) | Summary |
|---|---|
| Lease Heartbeat Refactoring: cmd/ci-operator/main.go | Added startLeaseHearthbeating() to run a lease-client heartbeat and return a stop channel; Run() now starts the heartbeat and defers stopping it and releasing leaked leases; removed inline ReleaseAll(...) from the heartbeat path; added explicit return nil in initializeLeaseClient(). |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main fix: addressing an unreachable ReleaseAll() call caused by an infinite heartbeat loop, which is the core objective of the PR. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Stable And Deterministic Test Names | ✅ Passed | The custom check for stable and deterministic test names is not applicable to this PR. The PR exclusively modifies production code in cmd/ci-operator/main.go and does not introduce, modify, or contain any test code. |
| Test Structure And Quality | ✅ Passed | No test files were added or modified in this PR; only production code changes to cmd/ci-operator/main.go were made. |


@openshift-ci openshift-ci bot requested review from hector-vido and liangxia March 5, 2026 16:40
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 5, 2026

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
cmd/ci-operator/main.go (1)

2172-2172: Typo in function name: "Hearthbeating" should be "Heartbeating".

The function name startLeaseHearthbeating contains a typo - "Hearth" instead of "Heart".

✏️ Proposed fix
-func (o *options) startLeaseHearthbeating() chan struct{} {
+func (o *options) startLeaseHeartbeating() chan struct{} {

Also update the call site at line 1172:

-			stopHeartbeating := o.startLeaseHearthbeating()
+			stopHeartbeating := o.startLeaseHeartbeating()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/ci-operator/main.go` at line 2172, Rename the function
startLeaseHearthbeating to startLeaseHeartbeating and update its declaration
accordingly, then find and update all call sites (e.g., any invocation like
o.startLeaseHearthbeating()) to use o.startLeaseHeartbeating() so the identifier
matches; ensure any related comments or documentation are updated to reflect the
corrected spelling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/ci-operator/main.go`:
- Around line 2172-2189: The ticker created in startLeaseHearthbeating (variable
t) is never stopped causing a resource leak; modify the goroutine in
startLeaseHearthbeating so the ticker is stopped when the goroutine exits
(either by adding defer t.Stop() at the top of the goroutine or by calling
t.Stop() in the case <-stopChan branch before breaking the loop) to ensure the
ticker is cleaned up when stopChan is closed; reference the function
startLeaseHearthbeating, the ticker variable t, and stopChan when making the
change.

---

Nitpick comments:
In `@cmd/ci-operator/main.go`:
- Line 2172: Rename the function startLeaseHearthbeating to
startLeaseHeartbeating and update its declaration accordingly, then find and
update all call sites (e.g., any invocation like o.startLeaseHearthbeating()) to
use o.startLeaseHeartbeating() so the identifier matches; ensure any related
comments or documentation are updated to reflect the corrected spelling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d1c3c8bc-c682-45be-a3a6-591d25d3c861

📥 Commits

Reviewing files that changed from the base of the PR and between 3ab1d87 and d356acc.

📒 Files selected for processing (1)
  • cmd/ci-operator/main.go

@danilo-gemoli danilo-gemoli force-pushed the fix/ci-operator/release-all-leases branch from d356acc to 7f826be Compare March 5, 2026 16:56

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/ci-operator/main.go`:
- Around line 1172-1181: The deferred cleanup currently closes the
stopHeartbeating channel and then immediately calls o.leaseClient.ReleaseAll(),
which can race with the heartbeat goroutine; change startLeaseHearthbeating() to
return a stop-and-wait function instead of chan struct{} so callers can signal
shutdown and block until the heartbeat goroutine has fully exited before calling
ReleaseAll(); update the call site (where stopHeartbeating :=
o.startLeaseHearthbeating()) to assign the returned function and invoke it in
the defer (instead of close(stopHeartbeating)), and ensure the heartbeat
goroutine signals completion (e.g., via sync.WaitGroup or a done channel) inside
startLeaseHearthbeating() so ReleaseAll() runs only after Heartbeat() calls have
finished.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c2f7b126-fbd7-4b54-aef9-e018de970b6b

📥 Commits

Reviewing files that changed from the base of the PR and between d356acc and 7f826be.

📒 Files selected for processing (1)
  • cmd/ci-operator/main.go

Comment on lines +1172 to +1181
stopHeartbeating := o.startLeaseHearthbeating()
defer func() {
	close(stopHeartbeating)
	if l, err := o.leaseClient.ReleaseAll(); err != nil {
		logrus.WithError(err).Errorf("Failed to release leaked leases (%v)", l)
	} else if len(l) != 0 {
		o.metricsAgent.Record(metrics.NewInsightsEvent(metrics.InsightLeaseReleased, metrics.Context{"released_count": len(l)}))
		logrus.Warnf("Would leak leases: %v", l)
	}
}()

@coderabbitai coderabbitai bot Mar 5, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd cmd/ci-operator && rg -n 'startLeaseHearthbeating' main.go

Repository: openshift/ci-tools

Length of output: 182


🏁 Script executed:

cd cmd/ci-operator && rg -n -B5 -A20 'func.*startLeaseHearthbeating' main.go

Repository: openshift/ci-tools

Length of output: 782


🏁 Script executed:

cd cmd/ci-operator && sed -n '1172,1181p' main.go

Repository: openshift/ci-tools

Length of output: 506


Heartbeat goroutine can overlap with ReleaseAll() due to missing completion synchronization.

The cleanup defers a call that closes stopHeartbeating and immediately invokes ReleaseAll(), but closing the stop channel only signals the heartbeat goroutine to exit—it does not wait for completion. The goroutine may still be executing a Heartbeat() call when ReleaseAll() runs, creating non-deterministic lease cleanup behavior.

The startLeaseHearthbeating() function (lines 2172-2190) returns a channel, and its goroutine exits only after stopping the ticker and breaking the loop, both of which happen asynchronously after the channel is closed. Change the return type from chan struct{} to a function that both signals and waits for goroutine completion before allowing ReleaseAll() to proceed.
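
For illustration only, a stop-and-wait variant of the kind suggested here could look like the sketch below. The names startHeartbeat, stopAndWait and beatsAfterStop are hypothetical, and the beat callback stands in for the real Heartbeat() call; this is not the code in cmd/ci-operator/main.go.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// startHeartbeat calls beat on every tick. The returned function signals the
// goroutine to stop and blocks until it has exited. It must be called only
// once, since closing a channel twice panics.
func startHeartbeat(beat func(), interval time.Duration) (stopAndWait func()) {
	stop := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(done) // runs last: tells the caller the goroutine is gone
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				beat()
			case <-stop:
				return
			}
		}
	}()
	return func() {
		close(stop)
		<-done
	}
}

// beatsAfterStop reports how many heartbeats ran after stopAndWait returned.
func beatsAfterStop() int64 {
	var beats int64
	stopAndWait := startHeartbeat(func() { atomic.AddInt64(&beats, 1) }, time.Millisecond)
	time.Sleep(20 * time.Millisecond)
	stopAndWait()
	atStop := atomic.LoadInt64(&beats)
	time.Sleep(20 * time.Millisecond)
	return atomic.LoadInt64(&beats) - atStop
}

func main() {
	fmt.Println("heartbeats after stop:", beatsAfterStop())
}
```

The extra done channel only matters when the cleanup that follows is not already safe to run concurrently with a heartbeat.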

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/ci-operator/main.go` around lines 1172 - 1181, The deferred cleanup
currently closes the stopHeartbeating channel and then immediately calls
o.leaseClient.ReleaseAll(), which can race with the heartbeat goroutine; change
startLeaseHearthbeating() to return a stop-and-wait function instead of chan
struct{} so callers can signal shutdown and block until the heartbeat goroutine
has fully exited before calling ReleaseAll(); update the call site (where
stopHeartbeating := o.startLeaseHearthbeating()) to assign the returned function
and invoke it in the defer (instead of close(stopHeartbeating)), and ensure the
heartbeat goroutine signals completion (e.g., via sync.WaitGroup or a done
channel) inside startLeaseHearthbeating() so ReleaseAll() runs only after
Heartbeat() calls have finished.

Contributor Author

ReleaseAll and Heartbeat are two functions synchronized by the same mutex. As far as I understand, one of these two scenarios can happen:

Case 1:
ReleaseAll gets executed before Heartbeat does. In this case the client releases all the leases, therefore Heartbeat does nothing.

Case 2:
Heartbeat gets executed before ReleaseAll does. In this case the client updates all the leases, that will then be released by ReleaseAll.

I guess both of them are safe. Am I correct?

🧩 Analysis chain

Let me check the lease.Client implementation to verify the mutex synchronization.


🏁 Script executed:

#!/bin/bash
# Find lease client implementation
rg -n -A 30 'func.*ReleaseAll' pkg/lease/

Length of output: 1557


🏁 Script executed:

#!/bin/bash
# Find Heartbeat implementation
rg -n -A 20 'func.*Heartbeat' pkg/lease/

Length of output: 3324


@danilo-gemoli, you are correct. After checking the implementation in pkg/lease/client.go:

  • Heartbeat() (Line 159–179) acquires c.Lock() and holds it for its entire execution.
  • ReleaseAll() (Line 191–205) does the same — acquires c.Lock() and holds it throughout.

Since both methods are fully protected by the same mutex for their complete duration, they are mutually exclusive by construction. Your two scenarios are both safe:

| Scenario | What happens |
|---|---|
| ReleaseAll runs first | Empties c.leases; when Heartbeat later acquires the lock it iterates an empty map and is a no-op. |
| Heartbeat runs first | Updates all leases; ReleaseAll then acquires the lock and cleanly releases them all. |

My original concern about a race was wrong — I had missed that the mutex serializes these two calls inside the client itself. Apologies for the noise, the current implementation is correct.
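
The serialization argument can be sketched with a toy client. This is a simplified model under stated assumptions, not the code in pkg/lease/client.go: the client type, its leases map and the lease names are hypothetical, and the only property carried over from the discussion is that both methods hold the same mutex for their whole body.

```go
package main

import (
	"fmt"
	"sync"
)

// client is a hypothetical, simplified lease client.
type client struct {
	sync.Mutex
	leases map[string]int // lease name -> number of heartbeats received
}

// Heartbeat refreshes every held lease while holding the lock.
func (c *client) Heartbeat() {
	c.Lock()
	defer c.Unlock()
	for name := range c.leases {
		c.leases[name]++
	}
}

// ReleaseAll drops every held lease while holding the same lock.
func (c *client) ReleaseAll() []string {
	c.Lock()
	defer c.Unlock()
	var released []string
	for name := range c.leases {
		released = append(released, name)
		delete(c.leases, name)
	}
	return released
}

// remainingAfterRace runs Heartbeat and ReleaseAll concurrently and returns
// how many leases are still held once both have finished.
func remainingAfterRace() int {
	c := &client{leases: map[string]int{"lease-a": 0, "lease-b": 0}}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); c.Heartbeat() }()
	go func() { defer wg.Done(); c.ReleaseAll() }()
	wg.Wait()
	return len(c.leases)
}

func main() {
	fmt.Println("leases left:", remainingAfterRace())
}
```

Whichever goroutine takes the lock first, the map is empty once both have finished: either Heartbeat iterates an empty map, or ReleaseAll removes leases that were just refreshed.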

😄

  (\_/)
  (•_•)
  />🐇 you were right all along!


@danilo-gemoli
Contributor Author

/test e2e

@deepsm007
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 5, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 5, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danilo-gemoli, deepsm007

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [danilo-gemoli,deepsm007]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@deepsm007
Contributor

/override ci/prow/integration

Failures are not related to the PR

@openshift-ci
Contributor

openshift-ci bot commented Mar 5, 2026

@deepsm007: Overrode contexts on behalf of deepsm007: ci/prow/integration

Details

In response to this:

/override ci/prow/integration

Failures are not related to the PR

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@danilo-gemoli
Contributor Author

/test images

@openshift-ci
Contributor

openshift-ci bot commented Mar 5, 2026

@danilo-gemoli: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/breaking-changes | 7f826be | link | false | /test breaking-changes |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 71d74df into openshift:main Mar 5, 2026
13 of 14 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.


3 participants