Low GPU utilization when a3fe enters the ensemble equilibration stage #50

@gkxiao

Description

Environment

• OS: Ubuntu 24.04.2 LTS
• Hardware: Dual NVIDIA GeForce RTX 4090 D GPUs (24 GB VRAM each), 64+ CPU cores
• Software:
a3fe: 0.33
GROMACS (compiled with CUDA)

GROMACS version:     2025.1
Precision:           mixed
Memory model:        64 bit
MPI library:         thread_mpi
OpenMP support:      enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support:         CUDA
NBNxM GPU setup:     super-cluster 2x2x2 / cluster 8 (cluster-pair splitting on)
SIMD instructions:   AVX2_256
CPU FFT library:     fftw-3.3.10-sse2-avx-avx2-avx2_128
GPU FFT library:     cuFFT
Multi-GPU FFT:       none
RDTSCP usage:        enabled
TNG support:         enabled
Hwloc support:       disabled
Tracing support:     disabled
C compiler:          /usr/bin/cc GNU 13.3.0
C compiler flags:    -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler:        /usr/bin/c++ GNU 13.3.0
C++ compiler flags:  -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict SHELL:-fopenmp -O3 -DNDEBUG
BLAS library:        Internal
LAPACK library:      Internal
CUDA compiler:       /usr/local/cuda-12.6/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2024 NVIDIA Corporation;Built on Tue_Oct_29_23:50:19_PDT_2024;Cuda compilation tools, release 12.6, V12.6.85;Build cuda_12.6.r12.6/compiler.35059454_0
CUDA compiler flags: -O3 -DNDEBUG
CUDA driver:         12.60
CUDA runtime:        12.60

• Slurm submission script: run_somd.sh

#!/bin/bash
#SBATCH -o somd-array-gpu-%A.%a.out
#SBATCH -n 1
#SBATCH --time 24:00:00
#SBATCH --gres=gpu:1

lam=$1
echo "lambda is: " $lam

srun somd-freenrg -C somd.cfg -l $lam -p CUDA

• a3fe script: run_a3fe.py

import a3fe as a3
calc = a3.Calculation(ensemble_size = 5)
calc.setup()
# Get optimised lambda schedule with thermodynamic speed
# of 2 kcal mol-1
calc.get_optimal_lam_vals(delta_er = 2)
# Run adaptively with a runtime constant of 0.0005 kcal**2 mol-2 ns**-1
# Note that automatic equilibration detection with the paired t-test
# method will also be carried out.
calc.run(adaptive=True, runtime_constant = 0.0005)
calc.wait()
calc.analyse()
calc.save()

Observed Behavior

• When a3fe enters the ensemble equilibration stage, the GPU load drops sharply.
The Slurm job, as reported by scontrol:

scontrol show jobs 46054
JobId=46054 JobName=ensemble_equil_bound.sh
   UserId=gkxiao(997) GroupId=gkxiao(984) MCS_label=N/A
   Priority=1 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:36:28 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2025-06-12T15:14:11 EligibleTime=2025-06-12T15:14:11
   AccrueTime=2025-06-12T15:14:11
   StartTime=2025-06-12T15:14:12 EndTime=2025-06-13T15:14:12 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-06-12T15:14:12 Scheduler=Main
   Partition=batch AllocNode:Sid=master:250144
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=master
   BatchHost=master
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=1M,node=1,billing=1,gres/gpu=1
   AllocTRES=cpu=1,node=1,billing=1,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/public/gkxiao/software/a3fe/4zlz_gaff2/bound/ensemble_equilibration_2/ensemble_equil_bound.sh
   WorkDir=/public/gkxiao/software/a3fe/4zlz_gaff2/bound/ensemble_equilibration_2
   StdErr=/public/gkxiao/software/a3fe/4zlz_gaff2/bound/ensemble_equilibration_2/somd-array-gpu-46054.4294967294.out
   StdIn=/dev/null
   StdOut=/public/gkxiao/software/a3fe/4zlz_gaff2/bound/ensemble_equilibration_2/somd-array-gpu-46054.4294967294.out
   Power=
   TresPerNode=gres/gpu:1

The batch script submitted for this job:

cat bound/ensemble_equilibration_2/ensemble_equil_bound.sh
#!/bin/bash
#SBATCH -o somd-array-gpu-%A.%a.out
#SBATCH -n 1
#SBATCH --time 24:00:00
#SBATCH --gres=gpu:1

python -c 'from a3fe.run.system_prep import slurm_ensemble_equilibration_bound; slurm_ensemble_equilibration_bound()'

• Two gmx mdrun processes are each using ~32 CPU cores (about 3245% CPU in top), although the Slurm job shown above was allocated only one CPU (NumCPUs=1).

Tasks: 1511 total,   4 running, 1504 sleeping,   0 stopped,   3 zombie
%Cpu(s): 50.9 us,  0.1 sy,  0.0 ni, 49.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  6.1/257752.1 [||||||                                                                                              ]
MiB Swap:  0.0/8192.0   [                                                                                                    ]

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2909717 gkxiao    20   0 9953.0m 372272 145244 R  3245   0.1      6,09 /usr/local/gromacs/bin/gmx mdrun -deffnm gromacs -c /public/gkxiao/software/a3fe/4zlz_gaff2/bound/ensemble_equilibration_2/gromacs_out.gro
2909711 gkxiao    20   0 9941.9m 292484 142192 R  3242   0.1      6,21 /usr/local/gromacs/bin/gmx mdrun -deffnm gromacs -c /public/gkxiao/software/a3fe/4zlz_gaff2/bound/ensemble_equilibration_1/gromacs_out.gro
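
The gap between what Slurm allocated and what top reports can be made explicit with a few lines of Python. This is only a sketch: the helper names parse_scontrol and busy_cores are hypothetical, and it assumes the whitespace-separated key=value layout of the scontrol output shown above.

```python
def parse_scontrol(text: str) -> dict:
    """Collect the whitespace-separated key=value tokens printed by scontrol."""
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:
            # Keep the first occurrence of a key if it appears more than once.
            fields.setdefault(key, value)
    return fields


def busy_cores(cpu_percent: float) -> float:
    """top counts 100% per fully used core, so divide by 100."""
    return cpu_percent / 100.0


# One line copied from the scontrol output of job 46054.
sample = "NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*"
job = parse_scontrol(sample)

allocated = int(job["NumCPUs"])
used = busy_cores(3245)  # %CPU of one gmx mdrun process in top
oversubscription = used / allocated

print(f"allocated CPUs: {allocated}")
print(f"cores in use by one mdrun: about {used:.0f}")
print(f"oversubscription factor: about {oversubscription:.0f}x")
```

With the numbers from this report, each mdrun process uses roughly 32 times the CPU count that Slurm accounted for.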

• Both GPUs sit at 1% utilization with minimal VRAM usage (each gmx process holds 392 MiB of the 24564 MiB available, per nvidia-smi).

Thu Jun 12 15:20:33 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 D      On  |   00000000:01:00.0 Off |                  Off |
| 30%   48C    P0             64W /  425W |     415MiB /  24564MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090 D      On  |   00000000:41:00.0 Off |                  Off |
| 30%   53C    P0             61W /  425W |     415MiB /  24564MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      4636      G   /usr/lib/xorg/Xorg                              4MiB |
|    0   N/A  N/A   2909711      C   /usr/local/gromacs/bin/gmx                    392MiB |
|    1   N/A  N/A      4636      G   /usr/lib/xorg/Xorg                              4MiB |
|    1   N/A  N/A   2909717      C   /usr/local/gromacs/bin/gmx                    392MiB |
+-----------------------------------------------------------------------------------------+
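
To track this over the course of a run rather than from a single snapshot, the same numbers can be read from nvidia-smi in CSV form. The sketch below is a hypothetical helper, not part of a3fe: the function names and the 10% threshold are assumptions, and the sample string simply restates the values from the table above.

```python
import subprocess

# Standard nvidia-smi query options; nounits strips the "%" and "MiB" suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text: str) -> list:
    """Turn 'index, util, used, total' CSV rows into dictionaries."""
    rows = []
    for line in text.strip().splitlines():
        index, util, used, total = (int(x.strip()) for x in line.split(","))
        rows.append(
            {"index": index, "util_pct": util,
             "mem_used_mib": used, "mem_total_mib": total}
        )
    return rows


def idle_gpus(rows: list, threshold_pct: int = 10) -> list:
    """Return the indices of GPUs whose utilization is below the threshold."""
    return [r["index"] for r in rows if r["util_pct"] < threshold_pct]


def query_gpus() -> list:
    """Run nvidia-smi on the current node (requires the NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Values taken from the nvidia-smi table in this report.
sample = "0, 1, 415, 24564\n1, 1, 415, 24564\n"
rows = parse_gpu_csv(sample)
print("idle GPUs:", idle_gpus(rows))
```

On a node with the driver installed, calling query_gpus() in a loop would show whether utilization stays near 1% for the whole equilibration or only during part of it.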

Expected Outcome

If the equilibration runs offloaded their work to the GPUs, I would expect:
• GPU utilization above 90%.
• CPU usage below 16 cores per process, balancing the workload.
• Simulation throughput improved by roughly 5–10×, based on GROMACS benchmarks.
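
For illustration, this is roughly what an mdrun invocation with explicit thread counts and GPU offload could look like. It is a sketch under several assumptions: build_mdrun_command is a hypothetical helper, I have not confirmed that a3fe exposes a way to pass extra mdrun arguments, and I have not verified that these options remove the bottleneck for this system. The options themselves (-ntmpi, -ntomp, -nb, -pme, -bonded) are standard gmx mdrun flags, and the Slurm script would also need a matching CPU request (for example #SBATCH -c 8).

```python
import shlex


def build_mdrun_command(deffnm: str, out_gro: str, omp_threads: int = 8) -> list:
    """Assemble a gmx mdrun command line with explicit GPU offload.

    omp_threads is an assumed value; it should match the CPUs that
    Slurm actually allocates to the job.
    """
    return [
        "gmx", "mdrun",
        "-deffnm", deffnm,
        "-c", out_gro,
        "-ntmpi", "1",                # one thread-MPI rank per job
        "-ntomp", str(omp_threads),   # cap OpenMP threads instead of using every core
        "-nb", "gpu",                 # short-range nonbonded interactions on the GPU
        "-pme", "gpu",                # PME long-range electrostatics on the GPU
        "-bonded", "gpu",             # bonded interactions on the GPU
    ]


cmd = build_mdrun_command("gromacs", "gromacs_out.gro")
print(shlex.join(cmd))
```

Requesting GPU offload explicitly makes mdrun stop with an error when a task cannot run on the GPU, instead of silently running it on the CPU, which would at least make the cause visible in the log.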
