Skip to content

avnikonenko/hpc-stats-scripts

Repository files navigation

hpc-stats-scripts

Utilities for working with high-performance computing (HPC) environments. The scripts help inspect PBS/Slurm job efficiency and monitor CPU/GPU and memory usage on a running system or process tree.

Made with Codex :)

Dependencies

Install the required Python packages with pip:

Feature Packages Install command
Core utilities psutil pip install psutil
Plotting for psutil-monitor matplotlib, numpy pip install matplotlib numpy
GPU monitoring for psutil-monitor --gpu nvidia-ml-py3 (pynvml) pip install nvidia-ml-py3
Plot + GPU combo psutil, matplotlib, numpy, nvidia-ml-py3 pip install psutil matplotlib numpy nvidia-ml-py3
All extras via pip extras plot + GPU pip install .[all]

The pbs-bulk-user-stats command also expects the PBS qstat utility to be available in your environment. The slurm-bulk-user-stats command expects Slurm's sacct utility to be available in your environment.

Installation

Clone the repository and install with pip:

Install from PyPI (once published):
```bash
pip install hpc-stats-scripts
pip install hpc-stats-scripts[plot]
pip install hpc-stats-scripts[gpu]
pip install hpc-stats-scripts[all]

Or install directly from GitHub (the latest version):

pip install "hpc-stats-scripts @ git+https://github.com/avnikonenko/hpc-stats-scripts.git"
pip install "hpc-stats-scripts[plot] @ git+https://github.com/avnikonenko/hpc-stats-scripts.git"
pip install "hpc-stats-scripts[gpu] @ git+https://github.com/avnikonenko/hpc-stats-scripts.git"
pip install "hpc-stats-scripts[all] @ git+https://github.com/avnikonenko/hpc-stats-scripts.git"

The base installation depends on psutil. The plot extra pulls in matplotlib and numpy for the --plot feature of psutil-monitor. The gpu extra installs nvidia-ml-py3 to enable --gpu.

CLI tools

pbs-bulk-user-stats

Summarize CPU and memory usage for PBS jobs and show which nodes the jobs are allocated to. The command relies on qstat being available in your PATH. The table now includes NGPUS (requested GPUs) when present.

Examples:

# Summarize a specific job and write CSV output
pbs-bulk-user-stats --job 12345 --csv stats.csv

# Summarize multiple specific jobs
pbs-bulk-user-stats --job 12345 12346

# Summarize all jobs for the current user (default)
pbs-bulk-user-stats --include-finished

# Summarize all jobs for a specific user
pbs-bulk-user-stats --user myuser --include-finished

# Include finished jobs but only show the 20 most recent
pbs-bulk-user-stats --user myuser --include-finished 20

When invoked with no --user or --job options:

  • On a login node (no $PBS_JOBID present), it summarizes all jobs for the current user.
  • Inside a running PBS job (where $PBS_JOBID is set), it automatically summarizes that specific job.
pbs-bulk-user-stats

State codes (PBS):

  • R running, Q queued/waiting, X finished (requires --include-finished), other codes are printed under “other” in the summary.

Expected output (CPU/RAM only):

$ pbs-bulk-user-stats

JOBID    STATE   NAME       NODES    NCPUS  WALL(h)  CPUT(h)  avgCPU  CPUeff  memUsed   memReq   memEff
-------------------------------------------------------------------------------------------------------
0001      R      run1		pbs-1    176    38.55    3632.12  163.6  93.53%  207.4 GiB 256.00 GiB 81.10%
0002      R      run2		pbs-2    176    38.59    3589.72  93.13  52.91%  50.02 GiB 256.00 GiB 19.54%
...
Summary:
  jobs:         5
  unique nodes: 3
  states:       R=4  Q=1  X=0  other=0
  mean CPUeff:  75.20%
  mean avgCPU:  132.35
  mean memEff:  82.50%
  max memUsed:  230.16 GiB

or if run inside a running PBS:

JOBID  STATE  NAME   NODES  NCPUS  WALL(h)  CPUT(h)  avgCPU  CPUeff  memUsed     memReq     memEff
-----------------------------------------------------------------------------------------------------
0001   R      STDIN  pbs-5  100    0.03     0.01     0.22    0.22%   666.58 MiB  30.00 GiB  2.17% 

Summary:
  jobs:        1
  mean CPUeff: 0.22%
  mean avgCPU: 0.22
  mean memEff: 2.17%
  max memUsed: 666.58 MiB

After the table, a summary reports the job count, mean CPU efficiency, mean average CPU usage, mean memory efficiency, and the peak memory used across all listed jobs.

psutil-monitor

Real-time CPU and memory monitor for the system or a process tree. Use --gpu to also report aggregate GPU utilization and memory via NVML (requires nvidia-ml-py3). When --csv/--plot are used, metrics stream live to the terminal during the run; CSV/PNG files are written when the monitor exits (Ctrl+C, duration reached, or proc tree ends).

Example output files (generated with --plot and --csv):

  • Plot (CPU + GPU stacked):

    psutil-monitor example plot

  • CSV: docs/psutil-monitor-example.csv

GPU output fields (when --gpu is used):

  • GPU util: Average utilization across visible GPUs.
  • busyGPUs: Sum of utilization fractions (e.g., two GPUs at 50% each → 1.0).
  • GPU mem %: Aggregate GPU memory usage percentage.
  • Per-GPU (CSV gpu_pergpu): index:util%/used/total for each device.

Examples:

# System-wide (by default) monitoring with console output only
psutil-monitor

# System-wide monitoring with CSV and PNG output
psutil-monitor --mode system --csv node.csv --plot node.png

# Monitor the current process tree (useful inside a PBS job)
psutil-monitor --mode proc --pid $$ --include-children --csv job.csv

# For script.py resources monitoring:
python script.py &                   # launch the workload
target=$!                            # PID of script.py
echo $target
# psutil-monitor watches that PID and exits when the process tree is gone
psutil-monitor --mode proc --pid "$target" --include-children --csv stat.csv --plot plot.png

Expected output:

$ psutil-monitor

CPUs available (affinity): 384
Total memory available: 754.76 GiB
CPU basis for %: 384
Memory basis for %: 754.76 GiB
2025-08-14T15:20:14  CPU  79.67%  busyCPUs 305.93  (provided 384)  MEM   9.93%  used 74.96 GiB / total 754.76 GiB
2025-08-14T15:20:16  CPU  69.30%  busyCPUs 266.13  (provided 384)  MEM   9.95%  used 75.12 GiB / total 754.76 GiB
2025-08-14T15:20:18  CPU  61.34%  busyCPUs 235.53  (provided 384)  MEM  10.05%  used 75.82 GiB / total 754.76 GiB
2025-08-14T15:20:20  CPU  61.32%  busyCPUs 235.47  (provided 384)  MEM  10.09%  used 76.15 GiB / total 754.76 GiB
2025-08-14T15:20:22  CPU  74.57%  busyCPUs 286.33  (provided 384)  MEM   9.94%  used 74.99 GiB / total 754.76 GiB
2025-08-14T15:20:24  CPU  85.94%  busyCPUs 330.01  (provided 384)  MEM   9.86%  used 74.44 GiB / total 754.76 GiB
Average busy CPUs over run: 276.570
Peak memory (system): 76.15 GiB

With GPUs (--gpu and NVIDIA GPUs present):

$ psutil-monitor --gpu

CPUs available (affinity): 96
Total memory available: 503.70 GiB
CPU basis for %: 96
Memory basis for %: 503.70 GiB
GPUs detected (NVML): 4
2026-02-03T10:00:14  CPU  45.12%  busyCPUs 43.32  (provided 96)  MEM   8.10%  used 40.80 GiB / total 503.70 GiB  GPU util  57.5% busyGPUs 2.30 mem  42.0%
2026-02-03T10:00:16  CPU  48.33%  busyCPUs 46.39  (provided 96)  MEM   8.20%  used 41.30 GiB / total 503.70 GiB  GPU util  63.0% busyGPUs 2.52 mem  44.1%
2026-02-03T10:00:18  CPU  52.10%  busyCPUs 49.99  (provided 96)  MEM   8.25%  used 41.60 GiB / total 503.70 GiB  GPU util  68.7% busyGPUs 2.75 mem  45.3%
Average busy CPUs over run: 46.567
Average busy GPUs over run: 2.523
Peak memory (system): 41.60 GiB

Use the --help option of each command to see all available options.

slurm-bulk-user-stats

Summarize CPU and memory usage for Slurm jobs and show which nodes the jobs are allocated to. The command relies on sacct being available in your PATH. The table includes NGPUS based on AllocTres/AllocGRES when present. If TRES GPU usage metrics are available, the summary also reports mean GPU util and GPU hours (used/requested).

State codes (Slurm):

  • R/RUNNING, PD/PENDING, CD/COMPLETED; other states (e.g., F, CG, S, TO) are grouped under “other” in the summary and listed in the breakdown.

Examples:

# Summarize a specific job and write CSV output
slurm-bulk-user-stats --job 12345 --csv stats.csv

# Summarize multiple specific jobs
slurm-bulk-user-stats --job 12345 12346

# Summarize all running jobs for the current user (default)
slurm-bulk-user-stats

# Summarize all jobs (including finished) for a specific user
slurm-bulk-user-stats --user myuser --include-finished

# Include finished jobs but only show the 20 most recent
slurm-bulk-user-stats --user myuser --include-finished 20

When invoked with no --user or --job options:

  • On a login node (no $SLURM_JOB_ID present), it summarizes pending/running jobs for the current user.
  • Inside a running Slurm job (where $SLURM_JOB_ID is set), it automatically summarizes that specific job.
slurm-bulk-user-stats

The output mirrors the PBS version, showing job state, node list, CPU/memory usage, efficiency metrics, and a summary block with job counts and averages.

About

Utilities for working with high-performance computing (HPC) environments. The scripts help inspect PBS job efficiency and monitor CPU and memory usage on a running system or process tree.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages