Skip to content

refactor: use monotonic clock for circuit breaker and auto-recovery#617

Open
unsafePtr wants to merge 1 commit into
ZiggyCreatures:mainfrom
unsafePtr:fix/monotonic-time-circuit-breaker-autorecovery
Open

refactor: use monotonic clock for circuit breaker and auto-recovery#617
unsafePtr wants to merge 1 commit into
ZiggyCreatures:mainfrom
unsafePtr:fix/monotonic-time-circuit-breaker-autorecovery

Conversation

@unsafePtr
Copy link
Copy Markdown

Switch SimpleCircuitBreaker and AutoRecoveryService from DateTimeOffset.UtcNow based deadlines to Stopwatch.GetElapsedTime. Both are in-process deadline gates; their elapsed-time math should be unaffected by wall-clock adjustments (NTP correction, manual clock change).

Same change was recently added by me to dotnet/runtime#127303. I am flagging this as a refactor, since in-process deadline gates should be measured monotonically, even though using DateTime.UtcNow in this use-case will probably never lead to problems in practice. This mirrors open-telemetry/opentelemetry-dotnet#7193 (sweep) and open-telemetry/opentelemetry-dotnet#7253 (in-process freshness check).

Wall-clock usage is preserved for serialized timestamps (LogicalExpirationTimestamp, BackplaneMessage.Timestamp, AutoRecoveryItem expiration) since those cross process boundaries.

@unsafePtr
Copy link
Copy Markdown
Author

Disclaimer: this comment was written by Claude Code on my behalf, summarising follow-up notes from the same audit that produced this PR. Posting here for future-reference rather than as a request for this PR.

While auditing for the changes in this PR, the in-process L1 freshness check came up as a separate candidate — same shape as open-telemetry/opentelemetry-dotnet#7253, but with a tighter design constraint that kept it out of the present PR.

Sites (all in Internals/FusionCacheInternalUtils.cs):

  • IsLogicallyExpired (:146)
  • ShouldEagerlyRefresh (:162)
  • GetCurrentTimestamp (:80) — central "now"
  • GetNormalizedAbsoluteExpirationTimestamp (:437)
  • GetNormalizedEagerExpirationTimestamp (:449)

Why it's not in this PR: LogicalExpirationTimestamp is serialized to L2 via FusionCacheDistributedEntry and read by other nodes / after process restart, so it must remain wall-clock at that boundary. The fix would need a dual-source design — keep wall-clock for the L2 contract, add a parallel Stopwatch anchor + duration on the in-memory entry for L1-only freshness checks. Roughly 30–50 lines plus tests, scoped to the entry types.

Why it's worth doing eventually: cache TTL windows are minutes-to-hours rather than seconds, so the bug-class exposure window is much larger than the deadline gates in this PR; the failure mode (cache stampede on forward NTP step, stale data on backward step) is also more visible than a circuit-breaker blip. Same Tier 1 case OTel addressed in #7253.

Confirmed-skip (no fix possible/needed because they cross a process boundary): BackplaneMessage.Timestamp, AutoRecoveryItem.ExpirationTicks, GeneratorUtils._lastId (uniqueness seed, not a duration), ToLogString_Expiration (formatting).

Happy to open a separate PR for the L1 freshness piece if there's interest — wanted to flag the audit notes here first rather than doing it as drive-by scope-creep.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant