Fix to stabilize fast calo hit sorting #1750

Merged: oksuzian merged 2 commits into Mu2e:main from bechenard:FastClusterFix on Mar 7, 2026

@bechenard

Contributor

Addresses issue 1086

@FNALbuild

Collaborator

Hi @bechenard,
You have proposed changes to files in these packages:

  • CaloCluster

which require these tests: build.

@Mu2e/fnalbuild-users, @Mu2e/write have access to CI actions on main.

⌛ The following tests have been triggered for 4756208: build (Build queue - API unavailable)

About FNALbuild. Code review on Mu2e/Offline.

@FNALbuild

Collaborator

☀️ The build tests passed at 4756208.

Test Result Details
test with Command did not list any other PRs to include
merge Merged 4756208 at aa12407
build (prof) Log file. Build time: 04 min 06 sec
ceSimReco Log file.
g4test_03MT Log file.
transportOnly Log file.
POT Log file.
g4study Log file.
cosmicSimReco Log file.
cosmicOffSpill Log file.
ceSteps Log file.
ceDigi Log file.
muDauSteps Log file.
ceMix Log file.
rootOverlaps Log file.
g4surfaceCheck Log file.
trigger Log file. Return Code 1.
check_cmake Log file.
FIXME, TODO TODO (0) FIXME (0) in 1 files
clang-tidy ➡️ 2 errors 6 warnings
whitespace check no whitespace errors found

N.B. These results were obtained from a build of this Pull Request at 4756208 after being merged into the base branch at aa12407.

For more information, please check the job page here.
Build artifacts are deleted after 5 days. If this is not desired, select Keep this build forever on the job page.

@oksuzian

oksuzian commented Mar 5, 2026

Collaborator

I'll review this pull request by first loading the relevant ability and then examining the PR details and code changes. Let me also look at the full file for more context around the changed code:

Review of PR #1750: Fix to stabilize fast calo hit sorting

Author: @bechenard | Branch: FastClusterFix → main | Status: Open
Description: Addresses issue 1086

Change Summary

This PR modifies a single file — CaloCluster/src/CaloClusterFast_module.cc — with a 3-line addition / 1-line deletion to stabilize the sorting comparator used for calorimeter hits.

The Diff

// BEFORE:
auto functorTime = [&caloHits,&hits](auto a, auto b) {return caloHits[a].time() < caloHits[b].time();};

// AFTER:
auto functorTime = [&caloHits,&hits](auto a, auto b) {return caloHits[a].time() < caloHits[b].time() ||
                                                           (fabs(caloHits[a].time() - caloHits[b].time())<1e-3 &&
                                                            caloHits[a].crystalID() <caloHits[b].crystalID());};

Purpose

The original comparator sorted hits by time() alone. When two hits have identical times, the comparator gives them no defined relative order, which can lead to unstable, non-reproducible output. The fix adds a secondary sort key (crystalID()) for hits whose times differ by less than 1e-3, in whatever unit time() returns (the diff does not state it).


Issues Found

🔴 1. Comparator violates strict weak ordering — critical correctness bug

This is the most important issue. The comparator used with std::stable_sort must satisfy strict weak ordering. The new comparator does not.

Consider two hits a and b where:

  • caloHits[a].time() = 10.0000
  • caloHits[b].time() = 10.0004 (difference = 0.0004, which is < 1e-3)

The comparator evaluates:

  • functorTime(a, b): 10.0 < 10.0004 → true (first condition fires, returns true)
  • functorTime(b, a): 10.0004 < 10.0 → false, then checks fabs(10.0004 - 10.0) < 1e-3 → true, and crystalID_b < crystalID_a. If crystalID_b < crystalID_a, this also returns true

So it's possible that both functorTime(a, b) and functorTime(b, a) return true. This violates the asymmetry requirement of strict weak ordering (if a < b holds, b < a must not). Using such a comparator with std::sort or std::stable_sort is undefined behavior in C++: it can cause crashes, infinite loops, or corrupted output.

Root cause: The first condition (time_a < time_b) already says a < b, but the second condition using a fuzzy tolerance can independently say b < a when the times are close but not exactly equal.
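
The asymmetry failure described above can be reproduced in isolation. The following is a minimal sketch, not the module's code: Hit is a hypothetical stand-in that models only time() and crystalID(), and prLess is the AFTER comparator from the diff rewritten as a free function over indices.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for the calorimeter hit type; only the two
// accessors used by the comparator are modeled.
struct Hit {
    double t;
    int id;
    double time() const { return t; }
    int crystalID() const { return id; }
};

// The AFTER comparator from the diff, expressed over indices into h.
bool prLess(const std::vector<Hit>& h, int a, int b) {
    return h[a].time() < h[b].time() ||
           (std::fabs(h[a].time() - h[b].time()) < 1e-3 &&
            h[a].crystalID() < h[b].crystalID());
}

// True when the comparator claims both a < b and b < a,
// which a strict weak ordering never allows.
bool violatesAsymmetry(const std::vector<Hit>& h, int a, int b) {
    return prLess(h, a, b) && prLess(h, b, a);
}

// The values from the review: times 0.0004 apart, and the later hit
// has the smaller crystal ID (the IDs 7 and 3 are made up).
void checkReviewExample() {
    const std::vector<Hit> hits = {{10.0000, 7}, {10.0004, 3}};
    assert(prLess(hits, 0, 1));  // earlier time wins via the first clause
    assert(prLess(hits, 1, 0));  // close times + smaller ID wins via the second
}
```

Calling violatesAsymmetry on that pair returns true, which is exactly the contradiction the review describes.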

Suggested fix: The comparator should use a consistent "equivalence bucket" approach. For example:

auto functorTime = [&caloHits](auto a, auto b) {
    double dt = caloHits[a].time() - caloHits[b].time();
    if (fabs(dt) < 1e-3) {
        return caloHits[a].crystalID() < caloHits[b].crystalID();
    }
    return dt < 0;  // i.e., caloHits[a].time() < caloHits[b].time()
};

⚠️ However, even this approach is not fully safe — fuzzy floating-point comparison with a tolerance is inherently non-transitive (A ≈ B and B ≈ C does not imply A ≈ C). A truly correct solution would either:

  • Quantize times to a grid (e.g., round(time / 1e-3)) and use the quantized value as the primary key, or
  • Use exact < for time and fall back to crystalID only when the times are exactly equal (a lexicographic comparison). If the input order is itself deterministic, std::stable_sort's stability guarantee already preserves the relative order of equal-time hits.
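
To make the non-transitivity concrete, and to show what the two alternatives might look like, here is a sketch under stated assumptions. Hit is a hypothetical stand-in for the real hit type, the 1e-3 grid width simply mirrors the tolerance in the PR, and timeBin, quantizedLess and exactLess are illustrative names, not functions from the Mu2e code base. Both alternatives assume no NaN times.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical stand-in for the calorimeter hit type.
struct Hit {
    double t;
    int id;
    double time() const { return t; }
    int crystalID() const { return id; }
};

// The "equivalence bucket" comparator suggested in the review.
bool bucketLess(const std::vector<Hit>& h, int a, int b) {
    const double dt = h[a].time() - h[b].time();
    if (std::fabs(dt) < 1e-3) return h[a].crystalID() < h[b].crystalID();
    return dt < 0;
}

// Option A: snap each time onto a fixed grid so that "same bin"
// is a true equivalence relation (assumed grid width: 1e-3).
std::int64_t timeBin(double t, double width = 1e-3) {
    return static_cast<std::int64_t>(std::llround(t / width));
}

bool quantizedLess(const std::vector<Hit>& h, int a, int b) {
    return std::make_tuple(timeBin(h[a].time()), h[a].crystalID()) <
           std::make_tuple(timeBin(h[b].time()), h[b].crystalID());
}

// Option B: exact lexicographic order on (time, crystalID).
bool exactLess(const std::vector<Hit>& h, int a, int b) {
    return std::make_tuple(h[a].time(), h[a].crystalID()) <
           std::make_tuple(h[b].time(), h[b].crystalID());
}

// Three hits spaced 0.0006 apart with decreasing IDs: neighbours are
// "close", the outer pair is not, so bucketLess orders them in a cycle.
void checkBucketCycle() {
    const std::vector<Hit> h = {{10.0000, 3}, {10.0006, 2}, {10.0012, 1}};
    assert(bucketLess(h, 2, 1));   // C < B by crystal ID
    assert(bucketLess(h, 1, 0));   // B < A by crystal ID
    assert(!bucketLess(h, 2, 0));  // but C < A fails: not transitive
    assert(bucketLess(h, 0, 2));   // instead A < C by time, closing the cycle
}
```

The trade-off is that quantization still separates two hits that straddle a bin edge however close they are, while the exact comparison only breaks ties for bit-identical times. Either way the comparator is a valid strict weak ordering, which the tolerance-based versions are not.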

🟡 2. &hits is captured in the lambda but never used

auto functorTime = [&caloHits,&hits](auto a, auto b) { ... };

The &hits capture is unnecessary — the comparator never references hits. This was already present in the original code, but this PR would be a good opportunity to clean it up:

auto functorTime = [&caloHits](auto a, auto b) { ... };

🟡 3. Missing <cmath> or equivalent include for fabs

The new code uses fabs() but it's not clear if <cmath> is included in this translation unit. If it's pulled in transitively, this will work, but it's fragile. It would be safer to explicitly ensure <cmath> is included, or use std::fabs() for the C++ version.
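
As a small illustration of the explicit-include suggestion, a tolerance check can be written with std::fabs and its own <cmath> include. timesClose is a hypothetical helper name, and the default tolerance just mirrors the 1e-3 used in the PR.

```cpp
#include <cassert>
#include <cmath>  // declares std::fabs explicitly, no reliance on transitive includes

// Hypothetical helper: true when two times differ by less than tol.
inline bool timesClose(double ta, double tb, double tol = 1e-3) {
    return std::fabs(ta - tb) < tol;
}

// Quick self-check using the time values from the review's example.
inline void checkTimesClose() {
    assert(timesClose(10.0000, 10.0004));
    assert(!timesClose(10.0000, 10.0020));
}
```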

⚪ 4. Minor style: missing space after <

caloHits[a].crystalID() <caloHits[b].crystalID()

Should be:

caloHits[a].crystalID() < caloHits[b].crystalID()

Summary

Severity | Issue | Description
🔴 Critical | Strict weak ordering violation | Comparator can return true for both (a,b) and (b,a) → undefined behavior with std::stable_sort
🟡 Minor | Unused capture &hits | Lambda captures hits but doesn't use it
🟡 Minor | fabs without explicit include | Relies on transitive <cmath> inclusion
⚪ Nit | Missing space | crystalID() <caloHits formatting

The critical issue (#1) should be resolved before merging. The fuzzy time tolerance combined with the strict less-than on time creates contradictory orderings that violate std::stable_sort's contract.

@FNALbuild

Collaborator

📝 The HEAD of main has changed to 9ffc000. Tests are now out of date.

@bechenard

Contributor Author

@FNALbuild run build test

@FNALbuild

Collaborator

⌛ The following tests have been triggered for 863a01d: build (Build queue - API unavailable)

@FNALbuild

Collaborator

☀️ The build tests passed at 863a01d.

Test Result Details
test with Command did not list any other PRs to include
merge Merged 863a01d at 9ffc000
build (prof) Log file. Build time: 04 min 06 sec
ceSimReco Log file.
g4test_03MT Log file.
transportOnly Log file.
POT Log file.
g4study Log file.
cosmicSimReco Log file.
cosmicOffSpill Log file.
ceSteps Log file.
ceDigi Log file.
muDauSteps Log file.
ceMix Log file.
rootOverlaps Log file.
g4surfaceCheck Log file.
trigger Log file. Return Code 1.
check_cmake Log file.
FIXME, TODO TODO (0) FIXME (0) in 1 files
clang-tidy ➡️ 2 errors 4 warnings
whitespace check no whitespace errors found

N.B. These results were obtained from a build of this Pull Request at 863a01d after being merged into the base branch at 9ffc000.

For more information, please check the job page here.
Build artifacts are deleted after 5 days. If this is not desired, select Keep this build forever on the job page.

@oksuzian oksuzian merged commit 72e134d into Mu2e:main Mar 7, 2026
14 checks passed