Skip to content

fix(thinc): restore ln2/f_log_cosh to keep AMD omptarget module-data layout#1416

Open
sbryngelson wants to merge 2 commits intoMFlowCode:masterfrom
sbryngelson:fix-amd-omp-thinc-declare-target
Open

fix(thinc): restore ln2/f_log_cosh to keep AMD omptarget module-data layout#1416
sbryngelson wants to merge 2 commits intoMFlowCode:masterfrom
sbryngelson:fix-amd-omp-thinc-declare-target

Conversation

@sbryngelson
Copy link
Copy Markdown
Member

@sbryngelson sbryngelson commented May 10, 2026

Summary

Restores ln2 and f_log_cosh to m_thinc.fpp (along with GPU_DECLARE(create=[gq3_pts, gq3_wts, ln2])) to fix spurious AMD gpu-omp CI failures introduced by #1401. The MTHINC normals math fix from #1401 (the actual feature work) stays untouched.

Background — first attempt was wrong

My initial diagnosis blamed the create=copyin= macro-form swap in #1401. CI on the first revision of this PR (with create= instead of copyin=) failed with the identical omptarget error pattern, so that wasn't it.

Revised diagnosis

The only other module-data change in #1401 was the deletion of:

  • the ln2 = 0.6931471805599453_wp static constant, and
  • f_log_cosh (the only consumer of ln2).

Both were already dead code on master before #1401 (no callers of f_log_cosh), so removing them was a reasonable cleanup. But it shrank m_thinc's .data segment by 8 bytes, which shifted the layout of subsequent declare-target entities (mthinc_nhat, mthinc_d descriptors) and adjacent module declares.

The AMD/ROCm omptarget runtime is sensitive to that layout: post-#1401 it now flags a 96-byte mapping whose tail extends 48 bytes into a sibling 2496/14784/5056-byte mapping with omptarget message: explicit extension not allowed: ..., and aborts with omptarget fatal error 1: failure of target construct while offloading is mandatory. Pre-#1401, with ln2 present, the layout doesn't trigger that overlap.

Why bring back dead code

  • ln2 and f_log_cosh are load-bearing for AMD's runtime layout, even though MFC's own code doesn't currently reference f_log_cosh. Keeping the function alive makes ln2 a non-unused module variable so compilers don't warn / DCE it.
  • This is the most surgical possible change: the entire delta of this PR vs master is restoring the three deletions from fix: correct MTHINC normals on non-uniform grids #1401's m_thinc.fpp module-static + helper section. Everything else from fix: correct MTHINC normals on non-uniform grids #1401 (the s_compute_mthinc_normals non-uniform-grid math fix, the int_comp reconstruction routing in m_rhs.fpp, the use m_nvtx in m_muscl.fpp) stays.
  • If we ever want to re-remove the dead code, it should land in a PR that's tested against AMD CI before merge.

Caveat

This is a hypothesis-driven fix. I cannot reproduce on Frontier AMD locally, so the proof will be in this PR's CI. If Oak Ridge | Frontier (AMD) (gpu-omp [1/2]) and [2/2] go green, the layout-shift theory is confirmed and we should land this. If they fail again, the cause is elsewhere in #1401 and the fallback is reverting #1401 entirely and re-doing the math fix in a separately-tested PR.

Failing-test list (for verification)

13 1D tests per shard, all of which use m_muscl → use m_thinc and so activate the broken declare-target mapping at module init, even though they don't invoke MTHINC at runtime: 0879E062, 8E3D99E6, 1CCA82F5, 02748F0F, 9DAC4DDC, 3A8359F6, 461DCB09, F1CF01C4, F5512823, 0288BDAD, B2EFB9F7, 76C663B7, 4A759316 (shard 1) and similar for shard 2.

Test plan

  • ./mfc.sh precheck -j 8 passes.
  • AMD gpu-omp [1/2] and [2/2] shards return to green (the actual proof).
  • Cray gpu-omp [1/2] and [2/2] shards stay green (no regression on the working backend).
  • MTHINC golden-file tests (5126B21F, 4E4FECA9, 4F3722DB, 4C4F339C, 7A1719C6) pass.

@qodo-code-review
Copy link
Copy Markdown
Contributor

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

@github-actions
Copy link
Copy Markdown

Claude Code Review

Head SHA: 7be7cf0

Files changed:

  • 1
  • src/simulation/m_thinc.fpp

Findings

[Correctness] GPU_DECLARE(create=...) leaves quadrature constants uninitialized on device

File: src/simulation/m_thinc.fpp, line 33

The changed hunk replaces:

$:GPU_DECLARE(copyin='[gq3_pts, gq3_wts]')

with:

$:GPU_DECLARE(create='[gq3_pts, gq3_wts]')

gq3_pts and gq3_wts are module-level arrays initialized at declaration with compile-time constants (Gauss quadrature points and weights). They are never written after initialization — there is no GPU_UPDATE(device=...) or explicit copy for them anywhere in s_initialize_thinc_module.

  • copyin on a GPU_DECLARE copies the host-side initialized values to the device when the mapping is established.
  • create allocates device storage but does not copy anything — device memory contains garbage.

Both arrays are read inside GPU_PARALLEL_LOOP kernels (multiple callsites throughout the file for quadrature loops), so on any GPU build (--gpu acc or --gpu mp) every THINC Gauss-quadrature integration will silently read uninitialized device memory, producing numerically wrong results without any runtime error.

The original copyin was correct for read-only, pre-initialized constants. This change should be reverted.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.94%. Comparing base (3ac6f68) to head (7be7cf0).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1416   +/-   ##
=======================================
  Coverage   64.94%   64.94%           
=======================================
  Files          72       72           
  Lines       18861    18861           
  Branches     1570     1570           
=======================================
  Hits        12250    12250           
  Misses       5638     5638           
  Partials      973      973           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sbryngelson sbryngelson changed the title fix(thinc): use create= for static-init GPU_DECLARE to unbreak AMD gpu-omp CI fix(thinc): restore ln2/f_log_cosh to keep AMD omptarget module-data layout May 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant