add hook for building GROMACS on NVIDIA Grace CPUs with hwloc support #225

bedroge wants to merge 8 commits into
Conversation
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

Ah, need to wait for a dirty frag mitigation to be deployed.

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

New job on instance
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

New job on instance
Hmm, it now runs it with … cc @al42and
Huh, fun. The GPU NUMA node seems to have gone away (processorsinNumaNudes went down from 9*72 to 8*72; need to make the name less lewd), but there are still seven mystery NUMA nodes left. Any chance you can get hwloc XML from the machine?
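(For context: one way to capture that XML with stock hwloc tools is sketched below; the filename is an arbitrary example, not anything prescribed by this PR.)

```
# Dump the full topology to XML; lstopo infers the format from
# the .xml extension (grace-topology.xml is an arbitrary name).
lstopo grace-topology.xml

# Or force XML output on stdout:
lstopo --of xml -
```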
Sure, here it is. I have to confirm it, but it looked like it worked with …
Let me try it here with:

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

New job on instance
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

New job on instance
Hmm, same error now. The other main difference between my manual attempt to build it and the builds done by the bot is that the latter uses a container. I can give it a try in the same container.
Have you generated the XML in the container or not? The XML does not have any of the "phantom" NUMA nodes GROMACS is complaining about. So if it was done bare-metal, then everything points to a side effect of containerization.
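(A minimal sketch of such a comparison, assuming the bot's build container is run via Apptainer; build.sif is a placeholder image name, not the actual image used by the bot.)

```
# On the bare-metal host:
lstopo --of xml - > topo-host.xml

# Inside the build container (Apptainer assumed; build.sif is a placeholder):
apptainer exec build.sif lstopo --of xml - > topo-container.xml

# Any phantom NUMA nodes should show up in the diff:
diff topo-host.xml topo-container.xml
```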
@bedroge Can you simply add a call to …
Co-authored-by: Andrey Alekseenko <al42and@gmail.com>
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

New job on instance
That one was indeed generated without the container. The bot always uses the container for doing the software builds, so I'm now trying your approach.

@al42and Here it is:
I have a GH200 node, and there is a similar pattern if I look at … However, bare-metal hwloc correctly filters them out, unless HWLOC_ALLOW=all is set:

```
$ hwloc-ls
Machine (856GB total)
  Package L#0
    NUMANode L#0 (P#0 119GB)
    NUMANode(GPUMemory) L#1 (P#4 95GB)
    L3 L#0 (114MB)
...
```
```
$ HWLOC_ALLOW=all hwloc-ls
Machine (856GB total)
  Package L#0
    NUMANode L#0 (P#0 119GB)
    NUMANode(GPUMemory) L#1 (P#4 95GB)
    NUMANode L#2 (P#5)
    NUMANode L#3 (P#6)
    NUMANode L#4 (P#7)
    NUMANode L#5 (P#8)
    NUMANode L#6 (P#9)
    NUMANode L#7 (P#10)
    NUMANode L#8 (P#11)
    L3 L#0 (114MB)
...
```

There does not seem to be a clean way to filter them out except by the lack of memory. I guess this means that either …

Skipping this test seems like the right solution for now; in the next GROMACS version, we can filter invalid NUMA nodes ourselves. To me, the hwloc behavior seems wrong here, but I don't know to what extent you can/want to debug the containerization. We don't have any containers on our nodes, so I don't think I could dig any deeper. But at least I can reproduce the thing on my end, so I won't pester you about testing the workaround :)
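(A quick way to check the "lack of memory" criterion directly from sysfs is sketched below; this only illustrates the idea, not how GROMACS would implement the filter.)

```
# Print MemTotal for every NUMA node the kernel exposes;
# the phantom nodes should report 0 kB.
for n in /sys/devices/system/node/node*; do
  printf '%s: ' "${n##*/}"
  grep MemTotal "$n/meminfo"
done
```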
Trying the suggestion from EESSI/software-layer#1497 (comment).