[benchmarking] Multiple fixes to stabilize the nightly benchmark suite by rlratzel · Pull Request #2035 · NVIDIA-NeMo/Curator

rlratzel · 2026-05-27T23:39:57Z

Summary

Bundles eight independent fixes uncovered while triaging the Curator
nightly benchmark suite against main (reference nemo-ci pipelines
52840568,
53098506,
53323567,
53333039,
and 53423863),
plus three environmental/observability changes added to enable fair A100
vs EOS comparison runs.

Pass-rate progression: 23/37 → 28/37 → 6/8 (scoped) → 1/2 (scoped) → 40/41
(full suite, 53423863). Sole remaining failure is video_embedding_xenna,
a new entry merged in from upstream main that needs a missing NVENC
library — out of scope for this PR.

1. Make undefined env vars in config non-fatal by default (`b737202d`)

resolve_env_vars previously raised ValueError on the first undefined
${VAR} reference, halting the entire benchmark session even when the
missing var was used by only a few entries.
Default behavior is now to substitute an empty string and log a warning
so unrelated entries can still run.
Adds --strict-config-check CLI flag to run.py to restore the old
fail-fast behavior.
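
The non-strict versus strict behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in benchmarking/runner/utils.py: only the function name resolve_env_vars, the strict switch, the ValueError, and the empty-string-plus-warning default come from this PR. The regex, the recursive handling of lists and dicts, the env parameter, and the logger name are assumptions.

```python
import logging
import os
import re

logger = logging.getLogger("benchmark_config_sketch")  # hypothetical logger name

# Assumed ${VAR} syntax; the real pattern may differ.
_ENV_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_vars(value, strict=False, env=None):
    """Recursively replace ${VAR} references in strings, lists, and dicts."""
    env = os.environ if env is None else env

    def _substitute(match):
        name = match.group(1)
        if name in env:
            return env[name]
        if strict:
            # Old fail-fast behavior, restored by --strict-config-check.
            raise ValueError(f"Environment variable {name} is not defined")
        # New default: warn and substitute an empty string so other entries still run.
        logger.warning("Environment variable %s is not defined; using empty string", name)
        return ""

    if isinstance(value, str):
        return _ENV_VAR_PATTERN.sub(_substitute, value)
    if isinstance(value, list):
        return [resolve_env_vars(item, strict, env) for item in value]
    if isinstance(value, dict):
        return {key: resolve_env_vars(item, strict, env) for key, item in value.items()}
    return value
```

With this shape, an entry that references an undefined variable sees an empty string, while entries that never touch that variable are unaffected.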

2. New `--entries-exact` flag, exact-match entry filtering (`ab1d4385`, `e8c6a52f`, `7b351693`)

Fixes a substring-aliasing bug in CI per-job invocations:

--entries uses pytest's -k substring expression evaluator, so
--entries audio_tagging_tts_xenna also matches
audio_tagging_tts_xenna_repeat. In CI, where each per-job script runs
--entries <entry-name>, this caused the non-_repeat job to also
execute the _repeat entry, polluting that entry's per-entry results
dir and crashing the subsequent legitimate _repeat SLURM job with
Capture file ... already exists at Ray cluster setup.
Adds --entries-exact accepting a comma-separated list of exact entry
names. Every supplied name must match a configured (enabled) entry,
otherwise the run aborts with a ValueError listing the unknown names
alongside the available entry names.
Mutually exclusive with --entries (CLI and Session.from_dict both
enforce this).
benchmarking/tools/ci_benchmark_launcher.sh switched from --entries
to --entries-exact.
Interactive --entries substring expression semantics are unchanged.
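
The difference between substring selection and exact selection can be illustrated with a small sketch. The helper names select_entries_exact and parse_entries_exact are hypothetical, and the substring check is a deliberate simplification of pytest's -k evaluator. The validation rules (every name must match, unknown names listed alongside available ones, duplicates collapsed, YAML order preserved) follow the description in this PR.

```python
def parse_entries_exact(arg):
    """Split a comma-separated CLI value into names, rejecting empty input."""
    names = [part.strip() for part in arg.split(",") if part.strip()]
    if not names:
        raise ValueError("--entries-exact requires at least one entry name")
    return names


def select_entries_exact(requested, configured):
    """Return the requested names in configured (YAML) order, or raise on unknown names."""
    requested_set = set(requested)  # duplicates in the input collapse here
    unknown = sorted(requested_set - set(configured))
    if unknown:
        raise ValueError(
            f"Unknown entry name(s): {unknown}. Available entries: {list(configured)}"
        )
    return [name for name in configured if name in requested_set]


configured = [
    "audio_tagging_tts_xenna",
    "audio_tagging_tts_xenna_repeat",
    "ndd_ray_serve_dp4",
]

# Simplified substring matching selects both siblings (the aliasing bug).
substring_hits = [name for name in configured if "audio_tagging_tts_xenna" in name]

# Exact matching selects only the named entry.
exact_hits = select_entries_exact(parse_entries_exact("audio_tagging_tts_xenna"), configured)
```

Here substring_hits contains both the plain and the _repeat entry, while exact_hits contains only audio_tagging_tts_xenna.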

3. Bump too-tight `timeout_s` values for several entries

Multiple nightly-benchmark.yaml entries had timeout_s values that
didn't cover the actual wall time on the EOS H100 runner. These split
into two categories:

Pre-existing tight limits that overshot during Ray teardown (section 3.1).
"No-op" timeouts: set back when the entry effectively did no work
(e.g. rpv2 data was unreadable, or fuzzy_id_generator artifacts were
stale and the test failed fast); once the underlying issue was fixed
and the test ran for real, the historical ceiling was too tight
(sections 3.2 and 3.3).

3.1 `ndd_ray_serve_dp4`: 700 → 1200 (`ab1d4385`)

SLURM-killed at the 700s --time ceiling ~13s after its benchmark
subprocess succeeded: Ray teardown overhead pushed total wall time past
the limit.
1200s gives ~70% headroom over the observed warm-run wall.
Reference failed job:
https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/327777542
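
As a worked check of the numbers above: the commit message for this bump states that timeout_s: 700 converts to SLURM --time=00:11:40. The sketch below reproduces that conversion and the headroom figure. The function names are hypothetical, and the launcher's actual conversion code is not shown in this PR.

```python
def slurm_time_limit(timeout_s):
    """Format a timeout in seconds as HH:MM:SS, the form quoted for SLURM --time."""
    hours, remainder = divmod(int(timeout_s), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def headroom(ceiling_s, observed_s):
    """Fraction by which the ceiling exceeds the observed wall time."""
    return (ceiling_s - observed_s) / observed_s


print(slurm_time_limit(700))          # 00:11:40
print(slurm_time_limit(1200))         # 00:20:00
print(f"{headroom(1200, 700):.0%}")   # 71%
```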

3.2 `exact_dedup_identification`: 500 → 1500 (`21647b4f`)

The 500s value was set when the test was effectively a no-op (the rpv2
dataset was unreadable due to filesystem permissions, so the test failed
fast on stat()). Once rpv2 access was restored, the test SLURM-killed
at 68% (516/755) of the "Inserting into shuffler" phase.
Shuffler runs at ~1.83 it/s × 755 items = ~410s on EOS H100, plus Ray
cluster setup + dataset stat + post-shuffle dedup compute + cleanup ≈
realistic wall 800-1000s.
1500s gives ~50% headroom.
Reference failed job:
https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/329434352
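
The estimate above can be reproduced with a few lines of arithmetic. All inputs are the figures quoted in this section; the variable names are illustrative only.

```python
items = 755            # shuffler work items
rate_it_per_s = 1.83   # observed shuffler rate on EOS H100

shuffler_s = items / rate_it_per_s        # about 413 s for the shuffler phase alone
progress_at_kill = 516 / items            # fraction done when SLURM killed the job

estimated_wall_s = (800, 1000)            # realistic total wall range from the text
new_timeout_s = 1500
headroom_vs_upper = new_timeout_s / estimated_wall_s[1] - 1   # 0.5, i.e. ~50%

print(round(shuffler_s), f"{progress_at_kill:.0%}", f"{headroom_vs_upper:.0%}")
```

The shuffler phase alone nearly fills the old 500s budget, which is consistent with the job being killed partway through it.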

3.3 `dedup_removal_*`: 1100/1500 → 1800/1800 (`c2f3a6b0`)

Once the upstream fuzzy_id_generator artifacts were refreshed (see
section 5; the delete_scratch: false change made this possible), the
dedup_removal tests actually ran the workload for the first time:

dedup_removal_raydata: observed wall ~1419s (92% complete when killed
at the prior 1100s ceiling) → 1800s (~27% headroom over the observed
~1419s, ~20% over the estimated 1500s full wall).
dedup_removal_xenna: passed at 1523s wall with the prior 1500s ceiling
(zero margin) → 1800s as a preemptive bump to match raydata and absorb
run-to-run variance.

Reference: nemo-ci pipeline 53333039 / leaf 53333328 — raydata killed by
SLURM TIME LIMIT during normal pipeline execution while xenna finished
with zero headroom.

4. Install `lynx` in the CI benchmark launcher (`d0eac901`)

The math benchmarks (math_preprocess, math_preprocess_classifier,
math_preprocess_llm_cleanup) shell out to the lynx text browser via
nemo_curator/stages/math/download/html_extractors/lynx.py for HTML
extraction. lynx is not in the Curator container, so those benchmarks
fail with RuntimeError: lynx executable not found in PATH.

lynx is GPL-licensed, so we deliberately do not bake it into the
redistributable Curator image. Instead it is installed transiently in the
existing benchmark container at CI run time, used during the run, and
discarded with the container. The published image stays GPL-free; the
apt-installed lynx only lives for the lifetime of each CI container.

5. (Reverted) Preserve scratch dir for `fuzzy_dedup_identification` (`63691509` then `a3c38ac6`)

Initial commit 63691509 added delete_scratch: false to the
fuzzy_dedup_identification entry to keep its scratch artifacts around for
the downstream dedup_removal_* benchmarks. In practice the consumption
path is the canonical dataset path:

{datasets_path}/cleaned_exact_dedup_all_cc_fuzzy_output_nightly_container_paths/
  fuzzy_id_generator.json
  FuzzyDuplicateIds/

Operators promote a known-good fuzzy_dedup output to this path once, and
the dedup_removal entries consume it from there on every subsequent
pipeline — there is no same-pipeline dependency. Given that workflow, the
per-entry delete_scratch: false override is unnecessary and just leaves
unused data on lustre across runs. Reverted by a3c38ac6; the entry now
uses the session-level default (delete_scratch: true).

6. Range `domain_label_games_count` metric +/-1% (`0c375b7e`, tightened in `ab31cf11`)

Both domain_classification_raydata and domain_classification_xenna
had a requirement on domain_label_games_count with exact_value: 149816.
Run-to-run output of the classifier drifts by several hundred to a few
thousand classifications, causing repeated false failures while the actual
benchmark (throughput, total docs, number of domains predicted) is
healthy.

Loosens the metric to a +/-1% range (min_value: 148318,
max_value: 151314) which still catches genuine regressions of the
classifier output while tolerating normal run-to-run variability.
domain_label_news_count is left at exact_value: 2817 pending further
investigation.
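
The bounds quoted above follow directly from the reference value. The sketch below recomputes them; the helper names are hypothetical, while min_value, max_value, and the reference count 149816 come from this PR.

```python
def tolerance_range(reference, tolerance):
    """Return (min_value, max_value) for a symmetric relative tolerance."""
    return round(reference * (1 - tolerance)), round(reference * (1 + tolerance))


def within(value, bounds):
    min_value, max_value = bounds
    return min_value <= value <= max_value


one_pct = tolerance_range(149816, 0.01)    # (148318, 151314), the range now in the YAML
five_pct = tolerance_range(149816, 0.05)   # (142325, 157307), the range in the first commit

print(one_pct, five_pct)
print(within(149816 + 1000, one_pct), within(149816 + 2000, one_pct))  # True False
```

A drift of 1000 classifications still passes at ±1%, while a drift of 2000 does not, which is the trade-off the tightening in ab31cf11 accepts.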

7. Raise A100 container memory cap to host capacity (`fd6e9cfd`)

The benchmarking container on the A100 host was capped at 1 TiB RAM (and
512 GiB shm, derived as 50% of container memory) by
_max_container_memory_bytes = 1 * _TB in
benchmarking/tools/gen_runscript_vars.py. The A100 host actually has
~2 TiB RAM and ~1008 GiB /dev/shm — comparable to the EOS H100 host.
The artificial cap made A100's env.json report ~50% of host RAM and
~50% of host shm (512 GiB, which is ~25% of host RAM), surfacing a fake environmental delta in A100 vs EOS
comparison runs even though the workload (Ray plasma capped at 500 GB
on both sides) was bounded identically.

Raises the cap to 2 * _TB so A100's container memory and shm reflect
host capacity. After this change, A100 env.json reports
total_system_memory_bytes ≈ 2.16 × 10¹² and shm_size_bytes ≈
1.08 × 10¹² — matching EOS.
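
The effect of the cap can be checked numerically. This sketch assumes the container memory is the smaller of host memory and the cap, and that shm is 50% of container memory as stated above. The function name, the exact host figure, and the reading of _TB as 1024**4 bytes are assumptions for illustration, not taken from gen_runscript_vars.py.

```python
_TB = 1024**4  # assumption: the text equates 1 * _TB with 1 TiB


def container_limits(host_memory_bytes, max_container_memory_bytes, shm_fraction=0.5):
    """Return (container_memory_bytes, shm_size_bytes) under a memory cap."""
    container = min(host_memory_bytes, max_container_memory_bytes)
    return container, int(container * shm_fraction)


host = 2_160_000_000_000  # approximately the 2.16e12 bytes reported in env.json

old_mem, old_shm = container_limits(host, 1 * _TB)   # capped: 1 TiB and 512 GiB
new_mem, new_shm = container_limits(host, 2 * _TB)   # uncapped: host total and half of it

print(old_mem, old_shm)
print(new_mem, new_shm)
```

With the 2 TiB cap the host total is below the cap, so the container reports the host values, consistent with the 2.16 × 10¹² and 1.08 × 10¹² figures above.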

Also removes an experimental CURATOR_SHM_SIZE_BYTES env-var block in
ci_benchmark_launcher.sh that attempted to cap EOS-side /dev/shm via
mount -o remount,size=…. Pyxis containers on EOS don't grant
CAP_SYS_ADMIN, so the remount silently fell back to the WARNING branch
and the cap never applied. Raising A100 up to match EOS is the
maintainable direction (the EOS shm ceiling cannot be capped down without
nemo-ci infrastructure changes), so the toggle is no longer needed.

8. Background per-GPU stats recorder (`c2fd47e6`)

Adds benchmarking/runner/gpu_stats_recorder.py — GPUStatsRecorder,
a context-managed daemon thread that polls all GPUs via gpustat
(already a curator dep) and writes one row per (timestamp, GPU) pair to
{session_entry_path}/gpustats.csv while each entry's benchmark
subprocess runs.

CSV columns:

column	content
`timestamp_utc`	ISO-8601 timestamp
`gpu_id`	NVML index
`utilization_gpu_pct`	compute utilization 0-100
`utilization_memory_pct`	`memory_used / memory_total * 100`
`temperature_c`	GPU temperature in °C
`processes`	JSON-encoded list of `{pid, username, command, gpu_memory_usage}`

Configurable via a new top-level YAML key in nightly-benchmark.yaml:

gpu_stats_recorder:
  interval_s: 1.0   # 1 Hz; set to 0 to disable
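
A context-managed recorder of this kind might look roughly like the sketch below. It is not the code in gpu_stats_recorder.py: the class name, the injectable poll_fn hook (standing in for the gpustat query so the sketch runs without a GPU), and the 10-second join timeout are assumptions. The column names and the "interval 0 disables, no file written" behavior follow this PR.

```python
import csv
import json
import threading
from datetime import datetime, timezone

COLUMNS = [
    "timestamp_utc", "gpu_id", "utilization_gpu_pct",
    "utilization_memory_pct", "temperature_c", "processes",
]


class GPUStatsRecorderSketch:
    """Samples GPUs on a daemon thread and writes one CSV row per (timestamp, GPU)."""

    def __init__(self, csv_path, poll_fn, interval_s=1.0):
        self.csv_path = csv_path
        self.poll_fn = poll_fn          # hypothetical hook: returns one dict per GPU
        self.interval_s = interval_s
        self._stop = threading.Event()
        self._thread = None
        self._file = None
        self._writer = None

    def __enter__(self):
        if self.interval_s <= 0:        # disabled: no thread, no file
            return self
        self._file = open(self.csv_path, "w", newline="")
        self._writer = csv.writer(self._file)
        self._writer.writerow(COLUMNS)
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        return self

    def _run(self):
        while True:
            now = datetime.now(timezone.utc).isoformat()
            for gpu in self.poll_fn():
                self._writer.writerow([
                    now, gpu["gpu_id"], gpu["utilization_gpu_pct"],
                    gpu["utilization_memory_pct"], gpu["temperature_c"],
                    json.dumps(gpu["processes"]),
                ])
            self._file.flush()
            # Event.wait returns True once stop is set, which ends the loop.
            if self._stop.wait(self.interval_s):
                break

    def __exit__(self, exc_type, exc, tb):
        if self._thread is not None:
            self._stop.set()
            self._thread.join(timeout=10)
            self._file.close()
        return False
```

Wrapping the benchmark subprocess in `with GPUStatsRecorderSketch(path, poll_fn):` ties the recorder's lifetime to the entry, which is the main difference from a poller started in the launcher shell script.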

The recorder polls all NVML-visible GPUs, independent of
CUDA_VISIBLE_DEVICES. This lets us verify post-run whether Ray/Xenna
actually honored the visible-device mask in cross-host comparison runs
(any nonzero util on a masked GPU index = leakage).
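
The post-run leakage check described above could be done with a short script like this sketch. The function name and the sample rows are made up for illustration; only the column names come from this PR. It also assumes that gpu_id values and the CUDA_VISIBLE_DEVICES mask use the same index ordering, which may not hold unless CUDA enumerates devices in PCI bus order.

```python
import csv
import io


def leaked_gpu_ids(csv_file, visible_ids):
    """Return sorted GPU ids outside the visible mask that ever showed nonzero utilization."""
    visible = set(visible_ids)
    leaked = set()
    for row in csv.DictReader(csv_file):
        gpu_id = int(row["gpu_id"])
        utilization = float(row["utilization_gpu_pct"] or 0)
        if gpu_id not in visible and utilization > 0:
            leaked.add(gpu_id)
    return sorted(leaked)


# Fabricated sample rows, for illustration only.
sample = io.StringIO(
    "timestamp_utc,gpu_id,utilization_gpu_pct,utilization_memory_pct,temperature_c,processes\n"
    "2026-01-01T00:00:00+00:00,0,87,40.0,61,[]\n"
    "2026-01-01T00:00:00+00:00,4,0,0.1,35,[]\n"
    "2026-01-01T00:00:00+00:00,5,12,3.0,41,[]\n"
)

print(leaked_gpu_ids(sample, visible_ids=[0, 1, 2, 3]))  # [5]
```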

Replaces the prior shell-based nvidia-smi -l poller (block removed from
ci_benchmark_launcher.sh) which lived outside the runner lifecycle,
wrote ad-hoc file names, and required a separate env-var toggle to enable.

9. Wire `NEMO_CI_*` env vars + Slack viewer-URL block (`da4c6760`)

Curator-side plumbing for an upcoming nemo-ci-hosted developer-launch
wrapper (curator_benchmark_launch.py, still under design). All three
edits are gated behind new env vars / flags so today's nightly schedule
is byte-identical to before.

ci_benchmark_launcher.sh:
- Honors NEMO_CI_SESSION_NAME when set; otherwise picks nightly-<TS>
  for scheduled (cron) pipelines and the legacy benchmark_run_<id>
  name for direct launch_pipeline.py launches.
- Composes a run-viewer URL only when NEMO_CI_VIEWER_HOST is set,
  using the host-side lustre path and the resolved session name.
- Passes --viewer-url and --run-reason through to run.py when set.
run.py: new --viewer-url and --run-reason CLI flags. When
--run-reason is set it lands in env_dict (and thus env.json +
the Slack environment block). When --viewer-url is set the Slack
sink's sink_config["viewer_url"] is patched in-process before
sink.initialize(...) — no YAML schema change required.
slack_sink.py: SlackSink reads optional viewer_url from
sink_config; SlackParentMessage accepts viewer_url and renders
a *Results viewer:* <{url}|open run> section between the
overall-status block and the environment table, only when set.

The new NEMO_CI_* namespace is shared with the upcoming nemo-ci
launcher work and is intentionally generic (not curator-specific) so
other modules can opt in later.
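
The session-name precedence and the optional Slack section can be sketched as below. The real logic lives in a shell script and in slack_sink.py; this Python version is only an illustration. The function names, the is_scheduled / timestamp / run_id parameters, and the exact block layout are assumptions; the environment variable name, the nightly-<TS> and benchmark_run_<id> patterns, and the "*Results viewer:* <{url}|open run>" text come from this PR.

```python
def resolve_session_name(env, is_scheduled, timestamp, run_id):
    """Pick the session name: explicit override, then nightly, then legacy name."""
    override = env.get("NEMO_CI_SESSION_NAME")
    if override:
        return override
    if is_scheduled:
        return f"nightly-{timestamp}"
    return f"benchmark_run_{run_id}"


def viewer_blocks(viewer_url):
    """Return a Slack section block for the viewer link, or nothing when unset."""
    if not viewer_url:
        return []
    return [{
        "type": "section",
        "text": {"type": "mrkdwn", "text": f"*Results viewer:* <{viewer_url}|open run>"},
    }]


print(resolve_session_name({}, True, "20260527", "123"))    # nightly-20260527
print(resolve_session_name({}, False, "20260527", "123"))   # benchmark_run_123
print(viewer_blocks(None))                                  # []
```

Returning an empty list when no URL is set keeps the default message unchanged, which matches the stated goal of leaving the nightly schedule unaffected.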

Verification

Across multiple nemo-ci pipelines:

53098506: first verification of commits 1, 2, 3.1, 3.2: 28/37 pass (up from
23/37 baseline). Confirmed audio_tagging_tts_xenna_repeat works
(--entries-exact), ndd_ray_serve_dp4 works (timeout bump),
exact_dedup_identification works (rpv2 + timeout),
audio_tagging_tts_xenna works without HF_SECRET_KEY set.
53323567: scoped 8-entry run of commits 4, 5, 6: 6/8 pass (math_preprocess*
all pass thanks to lynx install; domain_classification_* pass with the
metric range; fuzzy_dedup_identification passes). The two failures
(dedup_removal_*) were caused by a same-pipeline race with
fuzzy_dedup_identification, fixed by operators promoting fresh
fuzzy_id_generator.json artifacts to the canonical dataset path
between runs.
53333039: scoped 2-entry run with fresh artifacts in place: dedup_removal_xenna
passes (1523s, no margin); dedup_removal_raydata SLURM-killed by TIME
LIMIT at 92% completion → addressed by commit c2f3a6b0 (section 3.3).
53423863: full-suite verification after section 3.3: 40/41 pass. Sole failure is
video_embedding_xenna (new entry merged from upstream main, missing
NVENC library; separate follow-up).
53584394: in flight. Verifies section 7 (cap bump doesn't break EOS) and
section 8 (recorder always-on writes gpustats.csv); also captures empirical
evidence of whether CUDA_VISIBLE_DEVICES=0,1,2,3 actually restricts
Ray/Xenna to 4 GPUs on EOS.

Test plan

Undefined env vars

audio_tagging_tts_xenna ran successfully in 53098506 with
HF_SECRET_KEY unset.
Run with --strict-config-check; confirm it still exits with the
original ValueError.
Run an unmodified config (all env vars defined); confirm no
behavior change.

`--entries-exact`

CI per-job invocation ran exactly one entry per job in 53098506
and 53323567; no cross-entry pollution observed.
--entries-exact <typo> exits with an error listing the unknown
name and the available entry names.
--entries-exact a,b,c runs only those three entries in YAML order.
Passing both --entries and --entries-exact exits with a
"mutually exclusive" error.

Timeout bumps

ndd_ray_serve_dp4 finished without TIME LIMIT cancellation in
53098506.
exact_dedup_identification finished without TIME LIMIT in 53098506.
dedup_removal_raydata and dedup_removal_xenna both finish
without TIME LIMIT cancellation (verified in 53398187 / 53423863).

lynx install

math_preprocess* entries all passed in 53323567 — RuntimeError: lynx executable not found no longer fires.

fuzzy_dedup scratch retention (now reverted)

One-time artifact promotion completed: fresh
fuzzy_id_generator.json + FuzzyDuplicateIds/ written to the
canonical dataset path.
dedup_removal_* consume the promoted artifacts in a clean
pipeline (verified in 53398187).
After artifact promotion, the per-entry delete_scratch: false
override on fuzzy_dedup_identification is no longer needed and has
been reverted (commit a3c38ac6).

domain_label_games_count range

domain_classification_* both passed in 53323567 with the new
range satisfied.

A100 container memory cap (#7)

On the A100 nightly host, after pulling commit fd6e9cfd, a fresh
run produces env.json with total_system_memory_bytes ≈ host total
(~2.16 × 10¹²) and shm_size_bytes ≈ ~1.08 × 10¹².
EOS pipelines continue to succeed (no regression observed from
the unrelated ci_benchmark_launcher.sh shm-block removal —
verified in 53584394 run-up).

GPUStatsRecorder (#8)

After 53584394 completes, gpustats.csv exists under each
per-entry session dir on EOS lustre.
CSV header matches timestamp_utc,gpu_id,utilization_gpu_pct,utilization_memory_pct,temperature_c,processes.
Setting gpu_stats_recorder.interval_s: 0 in the YAML disables
the recorder (no file written); default (omitted key) gives 1 Hz.
Recorder polls 8 rows per timestamp on an EOS node regardless of
CUDA_VISIBLE_DEVICES (NVML-visible, not CUDA-runtime-visible).

🤖 Generated with Claude Code

copy-pr-bot · 2026-05-27T23:40:00Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


praateekmahajan · 2026-05-29T16:14:20Z

      --engine-kwargs='{"tensor_parallel_size": 1}'
      --autoscaling-config='{"min_replicas": 4, "max_replicas": 4}'
-    timeout_s: 700
+    timeout_s: 1200  # warm-run wall ~700s observed; headroom added for cold vLLM model load (cf. ndd_dynamo_dp4: 2700)


I'm wondering why this happened now?

Because even dynamo should not take 2700s rn since we upgraded versions to 1.1.0 (i.e. we can reduce that)

I need to investigate that. I know we had success on the DGX-A100 machine with 700s, but I don't know yet if a larger timeout is needed for the other machine running nightlies because it's slower in general, or if something else is causing a longer runtime on the other machine.

`resolve_env_vars` previously raised `ValueError` on the first undefined `${VAR}` reference, halting the entire session even when the missing var was only used by a few entries. The default is now to substitute an empty string and log a warning. Pass `--strict-config-check` to `run.py` to restore the old fail-fast behavior. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: rlratzel <rratzel@nvidia.com>

In nemo-ci pipeline 52840568 / leaf 52841002, ndd_ray_serve_dp4 was SLURM-killed by TIME LIMIT at ~11:30 wall, only seconds after its benchmark subprocess succeeded (Output 853/853, benchmark wall 249s, total subprocess wall ~450s). The existing timeout_s: 700 converts to SLURM --time=00:11:40, giving no headroom for Ray teardown or cold vLLM model load.
Bump to 1200s (20 min):
- ~70% headroom over the observed warm-run wall
- Still well below ndd_dynamo_dp4's 2700s ceiling, which is documented for cold flash-attn / gpt-oss-20b loads
Reference failed job: https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/327777542
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

The existing --entries flag uses pytest's "-k" expression evaluator, which does substring matching on bare identifiers. This is correct for interactive use but dangerous for automated callers that target a single known entry: passing --entries foo also selects foo_repeat, foo_extra, etc.
Concrete failure: in nemo-ci pipeline 52840568 / leaf 52841002, the SLURM job for entry "audio_tagging_tts_xenna" was invoked with --entries audio_tagging_tts_xenna, which also matched the sibling "audio_tagging_tts_xenna_repeat". That entry was executed within the non-_repeat SLURM job, leaving a logs/ray.log file in the _repeat entry's per-entry results dir. The legitimate _repeat SLURM job then crashed at Ray cluster setup because logs/ is preserved by design (run.py:175-178) and the stale ray.log capture file collided.
Changes:
* benchmarking/run.py: add --entry-exact-name argparse flag (mutually exclusive with --entries); pass through to Session.from_dict.
* benchmarking/runner/session.py: extend Session.from_dict to accept entry_exact_name; when set, filter by exact entry-name equality (takes precedence over entry_filter_expr; passing both raises ValueError).
* benchmarking/tools/ci_benchmark_launcher.sh: switch CI per-job invocation from --entries to --entry-exact-name. ENTRY_NAME is already populated by the per-entry CI job generator with the exact entry name, so no value change is needed.
Interactive --entries behavior is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

Brings the new flag in line with --entries in both look and feel:
* Accepts a comma-separated list of one or more exact entry names, not just a single name. This matches the mental model of --entries (which conceptually selects a set of entries) and lets a single invocation target any subset by exact name.
* Every name in the list must exactly match a configured (enabled) entry; otherwise the run aborts with a ValueError that lists the unknown names alongside the available entry names. This makes typos a hard error rather than a silent no-op.
* Duplicates in the input are collapsed; result order follows the YAML, matching how --entries behaves.
* benchmarking/run.py: rename argparse flag --entry-exact-name to --entries-exact; parse comma-separated value into list[str]; reject empty / whitespace-only inputs; wrap Session.from_dict in try/except to surface ValueError as a clean CLI error.
* benchmarking/runner/session.py: rename parameter entry_exact_name to entries_exact (list[str]); add strict validation that every requested name matches a configured entry; error message lists both missing and available names.
* benchmarking/tools/ci_benchmark_launcher.sh: rename flag in the CI per-job invocation; ENTRY_NAME is a single name today so this works as a single-element list.
Interactive --entries semantics unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

In nemo-ci pipeline 53092974 / leaf 53093663, exact_dedup_identification was SLURM-killed by TIME LIMIT at 68% (516/755) into the "Inserting into shuffler" phase. The shuffler ran at ~1.83 it/s on EOS H100, so the shuffler phase alone needs ~410s, plus Ray cluster setup, dataset stat, post-shuffle dedup compute, and cleanup: total wall ~800-1000s, not fitting in a 500s budget.
The previous 500s value was set when the test was effectively a no-op (the rpv2 dataset was unreadable due to filesystem permissions, so the test failed fast on stat() before doing any real work; see PR description for the rpv2 access fix story). 500s was also reportedly sufficient on a faster system; EOS may simply be slower for this workload. Worth investigating after the test is unblocked.
Bump to 1500s (25 min):
- ~50% headroom over the estimated 1000s realistic wall on EOS
- Parallel in spirit to the ndd_ray_serve_dp4 700 -> 1200 bump earlier in this PR
Reference failed job: https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/329434352
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

The math benchmarks (math_preprocess, math_preprocess_classifier, math_preprocess_llm_cleanup) shell out to the lynx text browser via nemo_curator/stages/math/download/html_extractors/lynx.py for HTML extraction. lynx is not present in the Curator benchmark container, so those benchmarks currently fail with: RuntimeError: lynx executable not found in PATH lynx is GPL-licensed, so we deliberately do not bake it into the redistributable Curator image. Instead it is installed transiently in the existing benchmark container at CI run time. The image we publish stays GPL-free; the apt-installed lynx lives only for the lifetime of the CI container. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: rlratzel <rratzel@nvidia.com>

Adds `delete_scratch: false` to the fuzzy_dedup_identification entry so its scratch directory (under session_entry_dir/scratch/{cache,output}) is retained after the entry finishes. The downstream dedup_removal_* benchmarks read these artifacts at known paths, so the prior default session-level cleanup (delete_scratch: true) was wiping them out before they could be consumed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: rlratzel <rratzel@nvidia.com>

Both domain_classification_raydata and domain_classification_xenna have a requirement on domain_label_games_count with exact_value: 149816. Run-to-run output of the classifier drifts by several hundred to a few thousand classifications, causing repeated false failures while the actual benchmark (throughput, total docs, number of domains predicted) is healthy. Loosens the metric to a +/- 5% range (142325 .. 157307) which still catches genuine regressions of the classifier output while tolerating normal run-to-run variability. domain_label_news_count is left at exact_value: 2817 pending further investigation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: rlratzel <rratzel@nvidia.com>

…rk_env_check Signed-off-by: rlratzel <rratzel@nvidia.com>

Both dedup_removal entries had timeout_s values that fit when the test was effectively a no-op (it failed fast on stale fuzzy_id_generator artifacts before doing real work). Once the upstream fuzzy_id_generator data was refreshed, the actual benchmarks ran for real:
* dedup_removal_raydata observed wall ~1419s (92% complete when killed at the prior 1100s ceiling) -> 1800s (~27% headroom over the estimated 1500s full wall)
* dedup_removal_xenna passed at 1523s wall with the prior 1500s ceiling (no margin) -> 1800s as a preemptive bump to match raydata and absorb run-to-run variance
Reference: nemo-ci pipeline 53333039 / leaf 53333328: raydata killed by SLURM TIME LIMIT during normal pipeline execution while xenna finished successfully but with zero headroom.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

The earlier delete_scratch: false override on fuzzy_dedup_identification was added so the entry's scratch/output artifacts could be picked up by the downstream dedup_removal_* benchmarks within the same pipeline. In practice the artifacts are consumed via the canonical dataset path: {datasets_path}/cleaned_exact_dedup_all_cc_fuzzy_output_nightly_container_paths/ fuzzy_id_generator.json FuzzyDuplicateIds/ Operators promote a known-good fuzzy_dedup output to this path once, and dedup_removal_raydata / dedup_removal_xenna consume it from there on every subsequent pipeline (no same-pipeline dependency). With that workflow in place, the per-entry delete_scratch override is no longer needed and just leaves unused data on lustre across runs. Revert to the session-level default (delete_scratch: true). This reverts commit 6369150. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: rlratzel <rratzel@nvidia.com>

For cross-host benchmark comparisons (e.g. A100 vs H100) the host's /dev/shm size differs (~550 GB vs ~1 TB) and the container inherits the host default. Provide an opt-in env var to remount /dev/shm at a chosen size. Unset preserves prior behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: rlratzel <rratzel@nvidia.com>

Background `nvidia-smi` poller writes one CSV per entry covering all GPUs on the node, independent of CUDA_VISIBLE_DEVICES. Lets us verify post-run whether Ray/Xenna actually honored the visible-device mask (any nonzero util on masked indices ⇒ leakage). Unset → no polling, preserving prior behavior. Subprocess is killed on EXIT via trap so a python crash doesn't leave it orphaned. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: rlratzel <rratzel@nvidia.com>

The benchmarking container cap of 1 TiB on the A100 host shrinks the container's visible memory to ~50% of host and shm to ~25%, which makes env.json report 1024 GiB / 512 GiB even though the host has ~2 TiB / ~1008 GiB. EOS reports the host values directly, so the A100 vs EOS comparison shows an artificial environmental mismatch. Raising the cap to 2 TiB lets A100's container see the host's full memory, matching EOS. Also remove the CURATOR_SHM_SIZE_BYTES env-var block from ci_benchmark_launcher.sh: pyxis-on-EOS does not grant CAP_SYS_ADMIN, so the remount silently fell back to the WARNING branch and never applied. With A100 raised to match EOS, the toggle is no longer needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: rlratzel <rratzel@nvidia.com>

The prior CURATOR_GPU_POLL_INTERVAL_S nvidia-smi block in ci_benchmark_launcher.sh polled GPUs by spawning nvidia-smi as a background process. That implementation was bash-only, lived outside the runner's lifecycle, and named output files ad-hoc. Replace with a Python GPUStatsRecorder that uses gpustat (already a curator dep) on a daemon thread, started/stopped as a context manager around each entry's subprocess in run_entry. Per-entry CSV is written to {session_entry_path}/gpustats.csv with columns: timestamp_utc, gpu_id, utilization_gpu_pct, utilization_memory_pct, temperature_c, processes (JSON-encoded list of {pid, username, command, gpu_memory_usage}). Polling cadence is configured via a new top-level YAML key `gpu_stats_recorder.interval_s` in the benchmark config (default 1.0; set to 0 to disable). The recorder polls all visible GPUs regardless of CUDA_VISIBLE_DEVICES, which lets us verify post-run that Ray/Xenna honored the visible-device mask. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: rlratzel <rratzel@nvidia.com>

Curator-side counterpart of the upcoming nemo-ci curator_benchmark_launch wrapper (still under design). Adds three small, backwards-compatible pieces of plumbing so per-launch metadata (session name, viewer URL, run reason) can flow from the launcher all the way through to the Slack parent message and env.json, gated behind new env vars / flags so today's nightly schedule is unaffected.
* ci_benchmark_launcher.sh:
  - Honor NEMO_CI_SESSION_NAME when set; otherwise pick nightly-<TS> for scheduled (cron) pipelines and the legacy benchmark_run_<id> name for direct launch_pipeline.py manual launches.
  - Compose a run-viewer URL only when NEMO_CI_VIEWER_HOST is set, using the host-side lustre path and the resolved session name.
  - Pass --viewer-url and --run-reason through to run.py when set.
* run.py:
  - New --viewer-url and --run-reason CLI flags (both default None).
  - When --run-reason is set, inject it into env_dict before sinks initialize, so it persists to env.json and appears in the Slack environment block.
  - When --viewer-url is set, patch the Slack sink's sink_config in-process (no YAML schema change required) before sinks initialize.
* slack_sink.py:
  - SlackSink reads optional viewer_url from sink_config.
  - SlackParentMessage accepts viewer_url and renders a "Results viewer: <url|open run>" section between the overall-status block and the environment table, only when set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

praateekmahajan

LGTM! Thanks for this 🙏

power_draw / power_limit / fan_speed are populated by the same NVML query gpustat.new_query() already issues for utilization/memory/temp, so reading them adds no measurable overhead (no extra NVML calls). Datacenter SKUs (H100/A100 chassis cards) usually return None for fan_speed since there's no controllable per-card fan; render None as empty string for CSV cleanliness. Same handling for power_draw / power_limit on the rare hardware that doesn't expose them. CSV columns now: timestamp_utc, gpu_id, utilization_gpu_pct, utilization_memory_pct, temperature_c, power_draw_w, power_limit_w, fan_speed_pct, processes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: rlratzel <rratzel@nvidia.com>

Tighter than the original 142325..157307 (+/-5% around the observed 149816); now 148318..151314 (+/-1%). Catches smaller classifier-output drifts that the prior wider window absorbed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: rlratzel <rratzel@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: rlratzel <rratzel@nvidia.com>

…rk_env_check Signed-off-by: rlratzel <rratzel@nvidia.com>

greptile-apps · 2026-06-04T21:00:40Z

Greptile Summary

This PR bundles eight independent fixes to stabilize the Curator nightly benchmark suite, improving the pass rate from 23/37 to 40/41 entries. Changes span timeout bumps, a new --entries-exact CLI flag to fix CI substring-matching aliasing, non-fatal undefined env-var handling, lynx transient install for math benchmarks, a GPUStatsRecorder background thread, and several observability/environment alignment improvements.

--entries-exact (session.py, run.py, ci_benchmark_launcher.sh): replaces pytest-style substring --entries in CI per-job invocations to prevent one job from accidentally matching and running a sibling entry, with strict validation that every supplied name exists.
GPUStatsRecorder (gpu_stats_recorder.py, run.py): new daemon-thread context manager that polls all NVML-visible GPUs at 1 Hz and writes per-GPU rows to gpustats.csv alongside each benchmark entry; configurable via top-level YAML gpu_stats_recorder.interval_s.
Slack viewer URL / run-reason (slack_sink.py, run.py): new --viewer-url and --run-reason CLI flags intended to surface a results-viewer link in Slack parent messages, but the in-process sink_config patch does not propagate to self.viewer_url which was already set at init time.

Confidence Score: 4/5

Safe to merge for the benchmark stability fixes; the viewer-URL feature introduced here will silently not work until the one-line attribute assignment is corrected.

The core fixes (timeout bumps, --entries-exact, non-fatal env vars, lynx install) are all straightforward and well-verified in the referenced CI pipelines. The GPUStatsRecorder is a best-effort side-channel that cannot affect benchmark correctness. The only functional defect is that --viewer-url won't surface in Slack messages because self.viewer_url is set at SlackSink.__init__ time and the post-init sink_config patch has no effect; this breaks a new observability feature but nothing that was working before.

benchmarking/run.py and benchmarking/runner/sinks/slack_sink.py for the viewer_url propagation issue; benchmarking/runner/gpu_stats_recorder.py for the CSV header discrepancy vs. documentation.

Important Files Changed

Filename	Overview
benchmarking/runner/gpu_stats_recorder.py	New GPUStatsRecorder context-manager daemon thread; CSV header (9 columns) is wider than the 6-column format documented in the PR description and test plan, and there is a narrow write-after-close window if the background thread survives the 10 s join timeout in stop().
benchmarking/run.py	Adds --entries-exact, --strict-config-check, --viewer-url, --run-reason CLI flags and wires GPUStatsRecorder; --viewer-url patches sink_config but not sink.viewer_url, so the URL is silently dropped from Slack messages.
benchmarking/runner/sinks/slack_sink.py	Adds viewer_url support to SlackParentMessage and SlackSink; self.viewer_url is read from sink_config only at init time and never refreshed, making the in-process patch from run.py a no-op.
benchmarking/runner/session.py	Adds entries_exact parameter to from_dict with correct mutual-exclusivity guard and informative ValueError when unknown names are supplied; logic is correct.
benchmarking/runner/utils.py	Refactors resolve_env_vars to accept a strict flag; non-strict mode now warns and substitutes empty string instead of raising ValueError. Clean implementation.
benchmarking/tools/ci_benchmark_launcher.sh	Switches --entries to --entries-exact, adds lynx install, session-name resolution, and optional viewer-URL construction; viewer URL is built with http:// which may not match the actual host scheme.
benchmarking/nightly-benchmark.yaml	Timeout bumps for exact_dedup_identification, dedup_removal_*, ndd_ray_serve_dp4; domain_label_games_count loosened to ±1% range; gpu_stats_recorder config added. All changes match the rationale in the PR.
benchmarking/tools/gen_runscript_vars.py	Raises _max_container_memory_bytes from 1 TiB to 2 TiB to reflect actual A100 host capacity. One-line change, straightforward.
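
The strict versus non-strict behavior described for resolve_env_vars in the table above might look roughly like this. It is a sketch under assumptions: `resolve_env_vars_sketch` is a hypothetical name, it works on a single string rather than a whole config, and the regex for `${VAR}` is my own choice rather than the one in utils.py.

```python
import logging
import os
import re

logger = logging.getLogger(__name__)

# Assumed pattern for ${VAR} references; the real implementation may differ.
_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_vars_sketch(text: str, strict: bool = False) -> str:
    """Substitute ${VAR} references; undefined vars raise only when strict."""

    def _replace(match: re.Match) -> str:
        name = match.group(1)
        value = os.environ.get(name)
        if value is not None:
            return value
        if strict:
            raise ValueError(f"Undefined environment variable: {name}")
        logger.warning("Undefined environment variable %s; using empty string", name)
        return ""

    return _VAR_PATTERN.sub(_replace, text)


os.environ["SKETCH_DEFINED"] = "/data"
os.environ.pop("SKETCH_UNDEFINED", None)
resolved = resolve_env_vars_sketch("${SKETCH_DEFINED}/x${SKETCH_UNDEFINED}")
print(resolved)
```

Passing `strict=True` corresponds to the fail-fast behavior that --strict-config-check restores.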

Sequence Diagram

sequenceDiagram
    participant CLI as run.py (main)
    participant Session
    participant SlackSink
    participant RunEntry as run_entry()
    participant GPURec as GPUStatsRecorder
    participant Subprocess

    CLI->>Session: "from_dict(config, entries_exact=[...])"
    Session-->>CLI: session (filtered entries)

    CLI->>SlackSink: __init__(sink_config)
    Note over SlackSink: self.viewer_url = sink_config.get(viewer_url) = None

    CLI->>SlackSink: "sink_config[viewer_url] = args.viewer_url"
    Note over CLI,SlackSink: self.viewer_url unchanged (still None)

    CLI->>SlackSink: initialize(session_name, session, env_dict)
    SlackSink->>SlackSink: "_create_parent_message(viewer_url=self.viewer_url=None)"

    loop for each entry
        CLI->>RunEntry: run_entry(..., gpu_stats_recorder_interval_s)
        RunEntry->>GPURec: __enter__() start()
        GPURec-->>RunEntry: daemon thread polling gpustats.csv
        RunEntry->>Subprocess: run_command_with_timeout(cmd, timeout)
        Subprocess-->>RunEntry: run_data
        RunEntry->>GPURec: __exit__() stop()
        GPURec-->>RunEntry: CSV flushed and closed
        RunEntry-->>CLI: success/failure
        CLI->>SlackSink: register_benchmark_entry_finished(result)
    end

    CLI->>SlackSink: finalize()

Reviews (1): Last reviewed commit: "Merge remote-tracking branch 'upstream/m..."

greptile-apps · 2026-06-04T21:00:45Z


+        # Optional run-viewer URL surfaced as a "Results viewer" link in the parent
+        # Slack message. Typically populated in-process by run.py from --viewer-url.
+        self.viewer_url: str | None = sink_config.get("viewer_url")


--viewer-url silently dropped — self.viewer_url is never updated after __init__

self.viewer_url is assigned once from sink_config.get("viewer_url") at init time. In run.py, the CLI flag patches sink.sink_config["viewer_url"] = args.viewer_url after init, but that only mutates the backing dict — self.viewer_url still holds None when initialize() later calls _create_parent_message(viewer_url=self.viewer_url). The URL will never appear in the Slack parent message.

The fix is to also set sink.viewer_url = args.viewer_url directly in run.py, or to re-read it at the start of initialize().
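
The stale-attribute behavior is easy to reproduce in isolation. The sketch below uses a hypothetical `SinkSketch` class and a made-up URL; it only mirrors the pattern described above (value copied out of the dict at init, dict patched later) and is not the SlackSink code.

```python
class SinkSketch:
    """Hypothetical stand-in for a sink that caches a config value at init."""

    def __init__(self, sink_config: dict) -> None:
        self.sink_config = sink_config
        # Read once; later changes to sink_config do not update this attribute.
        self.viewer_url = sink_config.get("viewer_url")


sink = SinkSketch({})

# Patching only the backing dict leaves the cached attribute untouched.
sink.sink_config["viewer_url"] = "http://viewer.example/run"
stale = sink.viewer_url

# Setting the attribute as well makes the value visible to later readers.
sink.viewer_url = sink.sink_config["viewer_url"]
fresh = sink.viewer_url
print(stale, fresh)
```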

greptile-apps · 2026-06-04T21:00:46Z

+    def stop(self) -> None:
+        if self._thread is None:
+            return
+        self._stop_event.set()
+        self._thread.join(timeout=10.0)
+        self._thread = None
+        if self._csv_file is not None:
+            self._csv_file.close()
+            self._csv_file = None
+            self._csv_writer = None


Potential write-after-close if background thread outlives 10 s join timeout

stop() calls self._stop_event.set(), then self._thread.join(timeout=10.0). If _poll_once blocks longer than 10 s (e.g., NVML hangs), the join returns but the thread is still alive. _csv_file.close() is then called immediately, leaving the running thread free to call self._csv_writer.writerow(...) on a closed file. That write raises ValueError ("I/O operation on closed file"), or AttributeError once _csv_writer has been set to None; since _poll_loop swallows all exceptions, the error is only logged at WARNING level. No data is lost for the benchmark itself, but the CSV could be left truncated.
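
One possible way to close that window is to check whether the thread actually exited before closing the output. The sketch below is an assumption-laden illustration, not a patch for gpu_stats_recorder.py: `RecorderSketch`, `poll_fn`, and the `closed` flag (standing in for the CSV file handle) are all hypothetical.

```python
import threading


class RecorderSketch:
    """Hypothetical recorder whose stop() never closes under a live thread."""

    def __init__(self, poll_fn, interval_s: float = 0.01) -> None:
        self._poll_fn = poll_fn
        self._interval_s = interval_s
        self._stop_event = threading.Event()
        self._thread = None
        self.closed = False  # stands in for the CSV file handle state

    def start(self) -> None:
        self._thread = threading.Thread(target=self._poll_loop, daemon=True)
        self._thread.start()

    def _poll_loop(self) -> None:
        while not self._stop_event.is_set():
            self._poll_fn()
            self._stop_event.wait(self._interval_s)

    def stop(self, join_timeout_s: float = 10.0) -> bool:
        """Return True if the output was closed, False if the thread is still running."""
        if self._thread is None:
            return True
        self._stop_event.set()
        self._thread.join(timeout=join_timeout_s)
        if self._thread.is_alive():
            # The poll is stuck: leave the output open instead of closing it
            # underneath the writer.
            return False
        self._thread = None
        self.closed = True
        return True


recorder = RecorderSketch(poll_fn=lambda: None)
recorder.start()
closed_cleanly = recorder.stop()
print(closed_cleanly)
```

The trade-off is that a permanently hung poll leaves the file handle open until the process exits, which seems acceptable for a best-effort side channel on a daemon thread.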

greptile-apps · 2026-06-04T21:00:47Z

+    HEADER: ClassVar[list[str]] = [
+        "timestamp_utc",
+        "gpu_id",
+        "utilization_gpu_pct",
+        "utilization_memory_pct",
+        "temperature_c",
+        "power_draw_w",
+        "power_limit_w",
+        "fan_speed_pct",
+        "processes",
+    ]


CSV header in code does not match the documented 6-column format

The PR description and the test plan both say the expected CSV header is timestamp_utc,gpu_id,utilization_gpu_pct,utilization_memory_pct,temperature_c,processes, but the actual HEADER constant here contains 9 fields, adding power_draw_w, power_limit_w, and fan_speed_pct between temperature_c and processes. Any downstream consumer or CI validation that checks for the 6-column format documented in the test plan (CSV header matches timestamp_utc,...,processes) will fail.
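
A consumer can sidestep this kind of mismatch by checking for the columns it needs rather than comparing the whole header. The following is a sketch only: `missing_columns` is a hypothetical helper, and the required set is simply the 6-column list quoted above.

```python
import csv
import io

# Columns a consumer needs; names taken from the 6-column format quoted above.
REQUIRED_COLUMNS = [
    "timestamp_utc",
    "gpu_id",
    "utilization_gpu_pct",
    "utilization_memory_pct",
    "temperature_c",
    "processes",
]


def missing_columns(csv_text: str) -> list[str]:
    """Return required columns absent from the CSV header, ignoring extras."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = set(reader.fieldnames or [])
    return [name for name in REQUIRED_COLUMNS if name not in present]


nine_column_header = (
    "timestamp_utc,gpu_id,utilization_gpu_pct,utilization_memory_pct,"
    "temperature_c,power_draw_w,power_limit_w,fan_speed_pct,processes\n"
)
print(missing_columns(nine_column_header))
```

With this style of check, adding power_draw_w, power_limit_w, and fan_speed_pct does not break validation, while a genuinely missing column is still reported.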

greptile-apps · 2026-06-04T21:00:48Z

+    RESULTS_HOST_DIR="${DEFAULT_CLUSTER_DIR}/curator_ci/results/${BRANCH_NAME}/${SESSION_NAME}"
+    ENC_DIR=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1]))' "${RESULTS_HOST_DIR}")
+    ENC_RUN=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1]))' "${SESSION_NAME}")
+    VIEWER_URL="http://${NEMO_CI_VIEWER_HOST}/run-viewer?dir=${ENC_DIR}&run=${ENC_RUN}"


Viewer URL constructed over plain HTTP, not HTTPS

VIEWER_URL="http://${NEMO_CI_VIEWER_HOST}/..." hardcodes http://. If the viewer host is served only over HTTPS (which is typical for internal dashboards), the link sent to Slack will either fail to load or depend on an HTTP-to-HTTPS redirect. Consider using https:// by default, or parameterizing the scheme as part of NEMO_CI_VIEWER_HOST.
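
For illustration, the same URL construction with a configurable scheme could look like this in Python. The function name, the `scheme` parameter, the https default, and the example host and paths are all assumptions; only the /run-viewer path, the dir and run query parameters, and the use of urllib.parse.quote come from the launcher snippet above.

```python
from urllib.parse import quote


def build_viewer_url(host: str, results_dir: str, run_name: str, scheme: str = "https") -> str:
    """Compose a run-viewer link; the scheme is a parameter instead of hardcoded."""
    # quote() keeps "/" unescaped by default, matching the launcher's quote calls.
    return f"{scheme}://{host}/run-viewer?dir={quote(results_dir)}&run={quote(run_name)}"


url = build_viewer_url("viewer.example", "/lustre/results/my branch/run 1", "run 1")
print(url)
```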

greptile-apps · 2026-06-04T21:00:49Z

+    if args.viewer_url:
+        for sink in session.sinks:
+            if getattr(sink, "name", None) == "slack":
+                sink.sink_config["viewer_url"] = args.viewer_url


Since self.viewer_url is already set from sink_config at init time, patching sink.sink_config["viewer_url"] has no effect on the value _create_parent_message will see. Setting the attribute directly ensures the URL is visible when initialize() runs.

Suggested change

     if args.viewer_url:
         for sink in session.sinks:
             if getattr(sink, "name", None) == "slack":
                 sink.sink_config["viewer_url"] = args.viewer_url
+                sink.viewer_url = args.viewer_url

NVIDIA-NeMo#2035)

* [benchmarking] Make undefined env vars in config non-fatal by default

`resolve_env_vars` previously raised `ValueError` on the first undefined `${VAR}` reference, halting the entire session even when the missing var was only used by a few entries. The default is now to substitute an empty string and log a warning. Pass `--strict-config-check` to `run.py` to restore the old fail-fast behavior.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

* benchmark: bump ndd_ray_serve_dp4 timeout_s 700 -> 1200

In nemo-ci pipeline 52840568 / leaf 52841002, ndd_ray_serve_dp4 was SLURM-killed by TIME LIMIT at ~11:30 wall, only seconds after its benchmark subprocess succeeded (Output 853/853, benchmark wall 249s, total subprocess wall ~450s). The existing timeout_s: 700 converts to SLURM --time=00:11:40, giving no headroom for Ray teardown or cold vLLM model load.

Bump to 1200s (20 min):
- ~70% headroom over the observed warm-run wall
- Still well below ndd_dynamo_dp4's 2700s ceiling, which is documented for cold flash-attn / gpt-oss-20b loads

Reference failed job: https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/327777542

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

* benchmark: add --entry-exact-name flag for exact entry matching

The existing --entries flag uses pytest's "-k" expression evaluator, which does substring matching on bare identifiers. This is correct for interactive use but dangerous for automated callers that target a single known entry: passing --entries foo also selects foo_repeat, foo_extra, etc.

Concrete failure: in nemo-ci pipeline 52840568 / leaf 52841002, the SLURM job for entry "audio_tagging_tts_xenna" was invoked with --entries audio_tagging_tts_xenna, which also matched the sibling "audio_tagging_tts_xenna_repeat". That entry was executed within the non-_repeat SLURM job, leaving a logs/ray.log file in the _repeat entry's per-entry results dir. The legitimate _repeat SLURM job then crashed at Ray cluster setup because logs/ is preserved by design (run.py:175-178) and the stale ray.log capture file collided.

Changes:
  * benchmarking/run.py: add --entry-exact-name argparse flag (mutually exclusive with --entries); pass through to Session.from_dict.
  * benchmarking/runner/session.py: extend Session.from_dict to accept entry_exact_name; when set, filter by exact entry-name equality (takes precedence over entry_filter_expr; passing both raises ValueError).
  * benchmarking/tools/ci_benchmark_launcher.sh: switch CI per-job invocation from --entries to --entry-exact-name. ENTRY_NAME is already populated by the per-entry CI job generator with the exact entry name, so no value change is needed.

Interactive --entries behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

* benchmark: rename --entry-exact-name to --entries-exact (list, strict)

Brings the new flag in line with --entries in both look and feel:
  * Accepts a comma-separated list of one or more exact entry names, not just a single name. This matches the mental model of --entries (which conceptually selects a set of entries) and lets a single invocation target any subset by exact name.
  * Every name in the list must exactly match a configured (enabled) entry; otherwise the run aborts with a ValueError that lists the unknown names alongside the available entry names. This makes typos a hard error rather than a silent no-op.
  * Duplicates in the input are collapsed; result order follows the YAML, matching how --entries behaves.

Changes:
  * benchmarking/run.py: rename argparse flag --entry-exact-name to --entries-exact; parse comma-separated value into list[str]; reject empty / whitespace-only inputs; wrap Session.from_dict in try/except to surface ValueError as a clean CLI error.
  * benchmarking/runner/session.py: rename parameter entry_exact_name to entries_exact (list[str]); add strict validation that every requested name matches a configured entry; error message lists both missing and available names.
  * benchmarking/tools/ci_benchmark_launcher.sh: rename flag in the CI per-job invocation; ENTRY_NAME is a single name today so this works as a single-element list.

Interactive --entries semantics unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

* benchmark: bump exact_dedup_identification timeout_s 500 -> 1500

In nemo-ci pipeline 53092974 / leaf 53093663, exact_dedup_identification was SLURM-killed by TIME LIMIT at 68% (516/755) into the "Inserting into shuffler" phase. The shuffler ran at ~1.83 it/s on EOS H100, so the shuffler phase alone needs ~410s, plus Ray cluster setup, dataset stat, post-shuffle dedup compute, and cleanup: total wall ~800-1000s, not fitting in a 500s budget.

The previous 500s value was set when the test was effectively a no-op (the rpv2 dataset was unreadable due to filesystem permissions, so the test failed fast on stat() before doing any real work; see PR description for the rpv2 access fix story). 500s was also reportedly sufficient on a faster system; EOS may simply be slower for this workload. Worth investigating after the test is unblocked.

Bump to 1500s (25 min):
- ~50% headroom over the estimated 1000s realistic wall on EOS
- Parallel in spirit to the ndd_ray_serve_dp4 700 -> 1200 bump earlier in this PR

Reference failed job: https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/329434352

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

* benchmark: install lynx in ci_benchmark_launcher.sh

The math benchmarks (math_preprocess, math_preprocess_classifier, math_preprocess_llm_cleanup) shell out to the lynx text browser via nemo_curator/stages/math/download/html_extractors/lynx.py for HTML extraction. lynx is not present in the Curator benchmark container, so those benchmarks currently fail with:

    RuntimeError: lynx executable not found in PATH

lynx is GPL-licensed, so we deliberately do not bake it into the redistributable Curator image. Instead it is installed transiently in the existing benchmark container at CI run time. The image we publish stays GPL-free; the apt-installed lynx lives only for the lifetime of the CI container.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

* benchmark: preserve scratch dir for fuzzy_dedup_identification

Adds `delete_scratch: false` to the fuzzy_dedup_identification entry so its scratch directory (under session_entry_dir/scratch/{cache,output}) is retained after the entry finishes. The downstream dedup_removal_* benchmarks read these artifacts at known paths, so the prior default session-level cleanup (delete_scratch: true) was wiping them out before they could be consumed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

* benchmark: range domain_label_games_count +/-5% (was exact)

Both domain_classification_raydata and domain_classification_xenna have a requirement on domain_label_games_count with exact_value: 149816. Run-to-run output of the classifier drifts by several hundred to a few thousand classifications, causing repeated false failures while the actual benchmark (throughput, total docs, number of domains predicted) is healthy.

Loosens the metric to a +/- 5% range (142325 .. 157307) which still catches genuine regressions of the classifier output while tolerating normal run-to-run variability. domain_label_news_count is left at exact_value: 2817 pending further investigation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

* benchmark: bump dedup_removal_* timeout_s for real-data wall

Both dedup_removal entries had timeout_s values that fit when the test was effectively a no-op (it failed fast on stale fuzzy_id_generator artifacts before doing real work). Once the upstream fuzzy_id_generator data was refreshed, the actual benchmarks ran for real:
  * dedup_removal_raydata observed wall ~1419s (92% complete when killed at the prior 1100s ceiling) -> 1800s (~27% headroom over the estimated 1500s full wall)
  * dedup_removal_xenna passed at 1523s wall with the prior 1500s ceiling (no margin) -> 1800s as a preemptive bump to match raydata and absorb run-to-run variance

Reference: nemo-ci pipeline 53333039 / leaf 53333328, where raydata was killed by SLURM TIME LIMIT during normal pipeline execution while xenna finished successfully but with zero headroom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

* Revert "benchmark: preserve scratch dir for fuzzy_dedup_identification"

The earlier delete_scratch: false override on fuzzy_dedup_identification was added so the entry's scratch/output artifacts could be picked up by the downstream dedup_removal_* benchmarks within the same pipeline. In practice the artifacts are consumed via the canonical dataset path:

    {datasets_path}/cleaned_exact_dedup_all_cc_fuzzy_output_nightly_container_paths/
        fuzzy_id_generator.json
        FuzzyDuplicateIds/

Operators promote a known-good fuzzy_dedup output to this path once, and dedup_removal_raydata / dedup_removal_xenna consume it from there on every subsequent pipeline (no same-pipeline dependency). With that workflow in place, the per-entry delete_scratch override is no longer needed and just leaves unused data on lustre across runs. Revert to the session-level default (delete_scratch: true).

This reverts commit 6369150.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

* Add optional /dev/shm cap via CURATOR_SHM_SIZE_BYTES env var

For cross-host benchmark comparisons (e.g. A100 vs H100) the host's /dev/shm size differs (~550 GB vs ~1 TB) and the container inherits the host default. Provide an opt-in env var to remount /dev/shm at a chosen size. Unset preserves prior behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

* Add optional all-GPU utilization poller via CURATOR_GPU_POLL_INTERVAL_S

Background `nvidia-smi` poller writes one CSV per entry covering all GPUs on the node, independent of CUDA_VISIBLE_DEVICES. Lets us verify post-run whether Ray/Xenna actually honored the visible-device mask (any nonzero util on masked indices ⇒ leakage). Unset → no polling, preserving prior behavior. Subprocess is killed on EXIT via trap so a python crash doesn't leave it orphaned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

* Raise A100 container memory cap to 2 TiB; drop EOS-side shm remount

The benchmarking container cap of 1 TiB on the A100 host shrinks the container's visible memory to ~50% of host and shm to ~25%, which makes env.json report 1024 GiB / 512 GiB even though the host has ~2 TiB / ~1008 GiB. EOS reports the host values directly, so the A100 vs EOS comparison shows an artificial environmental mismatch. Raising the cap to 2 TiB lets A100's container see the host's full memory, matching EOS.

Also remove the CURATOR_SHM_SIZE_BYTES env-var block from ci_benchmark_launcher.sh: pyxis-on-EOS does not grant CAP_SYS_ADMIN, so the remount silently fell back to the WARNING branch and never applied. With A100 raised to match EOS, the toggle is no longer needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

* Replace bash GPU poller with threaded GPUStatsRecorder

The prior CURATOR_GPU_POLL_INTERVAL_S nvidia-smi block in ci_benchmark_launcher.sh polled GPUs by spawning nvidia-smi as a background process. That implementation was bash-only, lived outside the runner's lifecycle, and named output files ad-hoc.

Replace with a Python GPUStatsRecorder that uses gpustat (already a curator dep) on a daemon thread, started/stopped as a context manager around each entry's subprocess in run_entry. Per-entry CSV is written to {session_entry_path}/gpustats.csv with columns: timestamp_utc, gpu_id, utilization_gpu_pct, utilization_memory_pct, temperature_c, processes (JSON-encoded list of {pid, username, command, gpu_memory_usage}).

Polling cadence is configured via a new top-level YAML key `gpu_stats_recorder.interval_s` in the benchmark config (default 1.0; set to 0 to disable). The recorder polls all visible GPUs regardless of CUDA_VISIBLE_DEVICES, which lets us verify post-run that Ray/Xenna honored the visible-device mask.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

* Wire NEMO_CI_* env vars + Slack viewer-URL block (curator-side)

Curator-side counterpart of the upcoming nemo-ci curator_benchmark_launch wrapper (still under design). Adds three small, backwards-compatible pieces of plumbing so per-launch metadata (session name, viewer URL, run reason) can flow from the launcher all the way through to the Slack parent message and env.json, gated behind new env vars / flags so today's nightly schedule is unaffected.

  * ci_benchmark_launcher.sh:
    - Honor NEMO_CI_SESSION_NAME when set; otherwise pick nightly-<TS> for scheduled (cron) pipelines and the legacy benchmark_run_<id> name for direct launch_pipeline.py manual launches.
    - Compose a run-viewer URL only when NEMO_CI_VIEWER_HOST is set, using the host-side lustre path and the resolved session name.
    - Pass --viewer-url and --run-reason through to run.py when set.
  * run.py:
    - New --viewer-url and --run-reason CLI flags (both default None).
    - When --run-reason is set, inject it into env_dict before sinks initialize, so it persists to env.json and appears in the Slack environment block.
    - When --viewer-url is set, patch the Slack sink's sink_config in-process (no YAML schema change required) before sinks initialize.
  * slack_sink.py:
    - SlackSink reads optional viewer_url from sink_config.
    - SlackParentMessage accepts viewer_url and renders a "Results viewer: <url|open run>" section between the overall-status block and the environment table, only when set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

* GPUStatsRecorder: also record power draw, power limit, fan speed

power_draw / power_limit / fan_speed are populated by the same NVML query gpustat.new_query() already issues for utilization/memory/temp, so reading them adds no measurable overhead (no extra NVML calls).

Datacenter SKUs (H100/A100 chassis cards) usually return None for fan_speed since there's no controllable per-card fan; render None as empty string for CSV cleanliness. Same handling for power_draw / power_limit on the rare hardware that doesn't expose them.

CSV columns now: timestamp_utc, gpu_id, utilization_gpu_pct, utilization_memory_pct, temperature_c, power_draw_w, power_limit_w, fan_speed_pct, processes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

* Tighten domain_label_games_count tolerance from +/-5% to +/-1%

Tighter than the original 142325..157307 (+/-5% around the observed 149816); now 148318..151314 (+/-1%). Catches smaller classifier-output drifts that the prior wider window absorbed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

* nightly-benchmark.yaml: drop noisy timeout_s commentary

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>

---------

Signed-off-by: rlratzel <rratzel@nvidia.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
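
The +/-5% and +/-1% windows quoted in the commit messages above can be reproduced with a few lines of arithmetic. This is a sketch for checking the numbers: `tolerance_window` is a hypothetical helper, and rounding to the nearest integer is an assumption that happens to match the quoted bounds.

```python
def tolerance_window(expected: int, pct: float) -> tuple[int, int]:
    """Return (low, high) bounds at +/- pct percent around expected."""
    delta = expected * pct / 100.0
    # Rounding to nearest is an assumption; it reproduces the quoted ranges.
    return round(expected - delta), round(expected + delta)


window_5pct = tolerance_window(149816, 5)
window_1pct = tolerance_window(149816, 1)
print(window_5pct, window_1pct)
```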

rlratzel added the docs-only label May 28, 2026

rlratzel changed the title ~~[benchmarking] Make undefined env vars in config non-fatal by default~~ [benchmarking] Nightly-suite fixes: undefined env vars, --entries-exact, ndd_ray_serve_dp4 timeout May 28, 2026

praateekmahajan reviewed May 29, 2026

praateekmahajan approved these changes May 29, 2026

rlratzel and others added 4 commits May 29, 2026 13:23

rlratzel force-pushed the 2606-update_benchmark_env_check branch from ed1855e to 7b35169 Compare May 29, 2026 18:23

rlratzel changed the title ~~[benchmarking] Nightly-suite fixes: undefined env vars, --entries-exact, ndd_ray_serve_dp4 timeout~~ [benchmarking] Nightly-suite fixes: undefined env vars, --entries-exact, timeout adjustments May 29, 2026

rlratzel mentioned this pull request May 29, 2026

[benchmarking] Add ping_users_on_failure toggle, disable in nightly config #2039

Merged

4 tasks

rlratzel and others added 3 commits June 1, 2026 14:06

rlratzel changed the title ~~[benchmarking] Nightly-suite fixes: undefined env vars, --entries-exact, timeout adjustments~~ [benchmarking] Multiple fixes to stabilize the nightly benchmark suite Jun 1, 2026

rlratzel and others added 8 commits June 2, 2026 05:12

Merge remote-tracking branch 'upstream/main' into 2606-update_benchma…

3759b21

…rk_env_check Signed-off-by: rlratzel <rratzel@nvidia.com>

praateekmahajan reviewed Jun 4, 2026

Comment thread benchmarking/runner/gpu_stats_recorder.py

praateekmahajan reviewed Jun 4, 2026

Comment thread benchmarking/nightly-benchmark.yaml Outdated

praateekmahajan reviewed Jun 4, 2026

Comment thread benchmarking/nightly-benchmark.yaml Outdated

praateekmahajan reviewed Jun 4, 2026

Comment thread benchmarking/nightly-benchmark.yaml Outdated

praateekmahajan approved these changes Jun 4, 2026

rlratzel and others added 4 commits June 4, 2026 14:43

nightly-benchmark.yaml: drop noisy timeout_s commentary

c580964

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: rlratzel <rratzel@nvidia.com>

Merge remote-tracking branch 'upstream/main' into 2606-update_benchma…

c15025d

…rk_env_check Signed-off-by: rlratzel <rratzel@nvidia.com>

rlratzel marked this pull request as ready for review June 4, 2026 20:55

rlratzel requested review from a team, ayushdg and sarahyurick as code owners June 4, 2026 20:55

rlratzel enabled auto-merge (squash) June 4, 2026 20:56

copy-pr-bot Bot temporarily deployed to public June 4, 2026 20:56 Inactive

rlratzel merged commit d6a63e6 into NVIDIA-NeMo:main Jun 4, 2026
23 checks passed

greptile-apps Bot reviewed Jun 4, 2026

copy-pr-bot Bot temporarily deployed to public June 4, 2026 21:04 Inactive


2 participants
