Skip to content

Fix CounterGroup timer to use Stopwatch instead of DateTime.UtcNow#127303

Merged
tannergooding merged 4 commits into
dotnet:mainfrom
unsafePtr:fix/countergroup-stopwatch
May 5, 2026
Merged

Fix CounterGroup timer to use Stopwatch instead of DateTime.UtcNow#127303
tannergooding merged 4 commits into
dotnet:mainfrom
unsafePtr:fix/countergroup-stopwatch

Conversation

@unsafePtr
Copy link
Copy Markdown
Contributor

Background

DateTime.UtcNow can jump due to NTP sync, causing elapsed time reported to EventCounter subscribers to be incorrect for that interval — affecting rate calculations like requests/sec in monitoring dashboards. Stopwatch is monotonic and not subject to clock adjustments.

CounterGroup is only directly referenced from DiagnosticCounter. DiagnosticCounter is the base class of EventCounter, PollingCounter, IncrementingPollingCounter, IncrementingEventCounter, so they all are affected.

@dotnet-policy-service dotnet-policy-service Bot added the community-contribution Indicates that the PR has been added by a community member label Apr 22, 2026
Copy link
Copy Markdown
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable to me. Any concerns @tarekgh?

The previous truncation behavior for the initial nextPollingTimestamp does look like a bug and I'm fine to see it fixed. Sampling with fractional second periods is possible, but I expect it was rarely done in practice so its likely the fix will go unobserved.

@tarekgh
Copy link
Copy Markdown
Member

tarekgh commented Apr 29, 2026

🤖 Copilot Code Review — PR #127303

Note

This review was AI/Copilot-generated.

Holistic Assessment

Motivation: Replacing DateTime.UtcNow with Stopwatch for monotonic timing in CounterGroup is well-motivated — NTP clock jumps can corrupt rate calculations (requests/sec) reported by EventCounters.

Approach: The core approach is sound. However, the manual conversion between TimeSpan ticks and Stopwatch ticks introduces a formula that is repeated 3 times and is vulnerable to integer overflow on Linux, as detailed below.

Summary: ⚠️ Needs Changes. The overflow in OnTimer() is a real correctness bug on Linux (confirmed by arithmetic analysis). The behavioral change for sub-second intervals should be explicitly documented and approved by the area owners.


Detailed Findings

❌ Integer Overflow in Stopwatch-Tick Conversion (flagged by 2/2 models)

The expression (long)Math.Ceiling(delta / _pollingInterval) * _pollingInterval.Ticks * Stopwatch.Frequency / TimeSpan.TicksPerSecond evaluates left-to-right in long arithmetic.

Concrete numbers (Linux, Stopwatch.Frequency = 1,000,000,000):

Polling Interval Missed Cycles to Overflow
1 second 922
5 seconds 184
10 seconds 92
60 seconds 15

With a 60-second polling interval, a laptop sleep, VM suspension, or long GC pause causing just 15 missed cycles (~15 minutes) silently overflows long, corrupting _nextPollingTimeStamp and causing the timer to either fire continuously or never fire again.

On Windows (Stopwatch.Frequency ≈ 10,000,000), the threshold is 100× higher (~1537 missed cycles for 60s), so it's less likely but still possible under extreme conditions.

Suggested fix: Cache the per-interval Stopwatch-tick value once in EnableTimer() to avoid the 3-way multiplication:

private long _pollingIntervalSwTicks;

// In EnableTimer:
_pollingIntervalSwTicks = _pollingInterval.Ticks * Stopwatch.Frequency / TimeSpan.TicksPerSecond;
_nextPollingTimeStamp = now + _pollingIntervalSwTicks;

// In OnTimer:
_nextPollingTimeStamp += (long)Math.Ceiling(delta / _pollingInterval) * _pollingIntervalSwTicks;

This keeps the multiplicand bounded and eliminates 2 of the 3 copies of the fragile formula.


⚠️ Behavioral Change for Sub-Second Intervals (flagged by 2/2 models, matches @tannergooding's comment)

Old code: new TimeSpan(0, 0, (int)pollingIntervalInSeconds) truncated 0.9f → 0, so the first poll fired immediately.

New code: TimeSpan.FromSeconds(0.9f) correctly waits 900ms before the first poll.

This is arguably a bug fix (the old code produced garbage data on the first data point for sub-second intervals), but it IS an observable behavioral change. The PR description should explicitly acknowledge this, and the area-System.Diagnostics.Tracing team should confirm it's the desired behavior.


⚠️ No Tests Added (flagged by 2/2 models)

This PR changes timing behavior in a critical diagnostics path but adds no regression tests. At minimum:

  1. A test verifying sub-second intervals poll at the correct cadence (not immediately).
  2. A test (or at least a documented manual verification) that the timer survives a multi-cycle catch-up after a long delay.

Existing tests in src/libraries/System.Diagnostics.Tracing/tests/BasicEventSourceTest/TestEventCounter.cs could be extended.


💡 Hoist Stopwatch.GetTimestamp() Out of foreach in PollForValues

long now = Stopwatch.GetTimestamp(); // currently inside the foreach

This is called per CounterGroup in the polling loop. Hoisting it above the foreach is a minor perf improvement and was also the pattern of the original code (though the original also called DateTime.UtcNow inside the loop, so this isn't a regression).


✅ Thread Safety Preserved

All reads/writes of _nextPollingTimeStamp, _timeStampSinceCollectionStarted, and _pollingInterval remain under s_counterGroupLock. Lock discipline is identical to the original code. Verified by both models.


✅ PollForValues Negative TimeSpan Handling

When timeUntilNextPoll is negative (poll overdue), the < 1ms check correctly triggers the timer, and Math.Max(1, (int)negative) = 1 prevents negative sleep. This matches the original behavior.


✅ OnTimer Advancement Logic (modulo overflow)

The new if (delta < _pollingInterval) delta = _pollingInterval + Math.Ceiling(delta / _pollingInterval) is mathematically equivalent to the old ternary + Math.Ceiling pattern for the common path. Negative delta (early fire) is correctly normalized.


💡 Minor: Floating-Point Precision in Math.Ceiling(delta / _pollingInterval)

TimeSpan / TimeSpan returns double. If delta is exactly N * interval, FP rounding could push the quotient to N + ε, making Math.Ceiling return N+1. This is a pre-existing issue (the old code had the same Math.Ceiling(double / int) pattern), so not a regression.


Model Agreement Summary

Finding Opus 4.7 Sonnet 4.5 Consensus
Overflow in OnTimer chain Strong
Sub-second behavioral change ⚠️ ⚠️ Strong
No tests ⚠️ ⚠️ Strong
Cache SW ticks field 💡 💡 Strong
Thread safety OK Strong
PollForValues correct Strong
FP precision (pre-existing) ⚠️ noted Moderate

@tarekgh tarekgh added this to the 11.0.0 milestone Apr 29, 2026
Copy link
Copy Markdown
Member

@tarekgh tarekgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modulo the copilot comments, LGTM!

@unsafePtr
Copy link
Copy Markdown
Contributor Author

  1. Fixed overflow problem storing it in long. _pollingInterval.Ticks * Stopwatch.Frequency / TimeSpan.TicksPerSecond; can overflow only if interval is beyond 15minutes on Linux, which is not a valid interval. I can add a defensive Debug.Assert if you think it's worth it.
  2. For Math.Ceiling(delta / _pollingInterval), Claude thinks following:

TimeSpan / TimeSpan is t1.Ticks / (double)t2.Ticks. Tick magnitudes here (≤ 10^12) are well below 2^53, so IEEE 754 guarantees exact representation, and correctly-rounded division means (N*x)/x = N exactly when N is also representable. No spurious bump from N to N+1. The only path to off-by-one is Stopwatch.GetElapsedTime's (long) truncation introducing ±1 TimeSpan tick on delta, which can cause the catch-up to skip a single interval — and that self-corrects on the next poll via the delta < _pollingInterval floor. This was strictly worse with DateTime.UtcNow (system-clock jitter ≫ 100 ns), so the new code improves on it.

  1. Hoist Stopwatch.GetTimestamp(). Think it would be wrong to do it, as between invocations it's shifting, so if it's hoisted it will worsen the precision. Guess for the same reason DateTime.UtcNow was used before these changes

@tannergooding
Copy link
Copy Markdown
Member

The expression (long)Math.Ceiling(delta / _pollingInterval) * _pollingInterval.Ticks * Stopwatch.Frequency / TimeSpan.TicksPerSecond evaluates left-to-right in long arithmetic.
...

The expression should be cleaned up here, but the math given by @copilot is all wrong.

At a frequency of 1 billion ticks per second (1 tick per nanosecond), you have ~8.5 billion seconds before overflow (~272.4 years due to it being tracked as long)

In fact, you'd need a frequency of nearly 2^59 to have a 4 second interval result in an overflow. A frequency of nearly 2^58 then doubles that to 8 seconds and it continues doubling for every power of two less that the timer frequency is. This would be many orders of magnitude higher than the fastest CPUs in the world, which are around 2^32 for CPU clock frequency and around 2^30 for hardware timer frequency.

The key really is in that we're only scaling the pollingIntervalTicks (which are at a frequency of 100ns per tick) by the number of polling intervals elapsed rounded up to the nearest whole integer. So the multiplication shouldn't really ever overflow, at least not within the lifetime of this code.

@tarekgh
Copy link
Copy Markdown
Member

tarekgh commented Apr 29, 2026

The expression (long)Math.Ceiling(delta / _pollingInterval) * _pollingInterval.Ticks * Stopwatch.Frequency / TimeSpan.TicksPerSecond evaluates left-to-right in long arithmetic.
...

The expression should be cleaned up here, but the math given by @copilot is all wrong.

At a frequency of 1 billion ticks per second (1 tick per nanosecond), you have ~8.5 billion seconds before overflow (~272.4 years due to it being tracked as long)

In fact, you'd need a frequency of nearly 2^59 to have a 4 second interval result in an overflow. A frequency of nearly 2^58 then doubles that to 8 seconds and it continues doubling for every power of two less that the timer frequency is. This would be many orders of magnitude higher than the fastest CPUs in the world, which are around 2^32 for CPU clock frequency and around 2^30 for hardware timer frequency.

The key really is in that we're only scaling the pollingIntervalTicks (which are at a frequency of 100ns per tick) by the number of polling intervals elapsed rounded up to the nearest whole integer. So the multiplication shouldn't really ever overflow, at least not within the lifetime of this code.

Agree with your assessment. But does it hurt to check the overflow anyway (in correct way)? or you think this will add complication to the code?

@unsafePtr
Copy link
Copy Markdown
Contributor Author

@tarekgh , @tannergooding do you mean keeping the scheduling math entirely in stopwatch ticks (drop Stopwatch.GetElapsedTime + TimeSpan division for delta), and only converting to TimeSpan where it actually leaves the timer (i.e. for WritePayload)? Want to confirm before doing any changes. Also, do you want me to add a defensive Debug.Assert(_pollingInterval.Ticks <= long.MaxValue / Stopwatch.Frequency)

@tannergooding
Copy link
Copy Markdown
Member

I was thinking the inverse rather, do it entirely in TimeSpan units instead, as that's the "canonical" form for .NET

The general point being that mixing two formats can be error prone, so keeping it as one format (ideally the canonical one) helps limit bugs/risk.

@unsafePtr unsafePtr requested a review from tannergooding May 4, 2026 23:02
Copy link
Copy Markdown
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. CC. @tarekgh for the changes

Copy link
Copy Markdown
Member

@tarekgh tarekgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a minor comment. Otherwise, the latest changes LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-System.Diagnostics.Tracing community-contribution Indicates that the PR has been added by a community member

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants