adding part of pipeline as smoke merge to main by Jorjeous · Pull Request #2007 · NVIDIA-NeMo/Curator

Jorjeous · 2026-05-21T12:50:53Z

Description

Usage

# Add snippet demonstrating usage

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

copy-pr-bot · 2026-05-21T12:50:58Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-05-21T12:56:49Z

Greptile Summary

This PR adds a speaker ID and audio quality pipeline to NeMo Curator, including speaker embedding extraction (Lhotse/JSONL/AudioTask variants), two-stage AHC clustering (standard and large-scale BIRCH+AHC), per-utterance confidence scoring, and a UTMOSv2 MOS scoring stage.

Embedding extraction (SpeakerEmbeddingLhotseStage, SpeakerEmbeddingRequestStage, SpeakerEmbeddingAudioTaskStage): shard-aware NeMo TitaNet-based extraction with support for local files, tar archives, AIS/S3, and in-memory Lhotse CutSets.
Clustering (SpeakerClusteringStage): three modes — per-shard, global, and grouped — backed by either standard AHC (ahc.py) or memory-bounded BIRCH+AHC (large_scale_clustering_and_scoring.py, the "SCOTCH" algorithm) with min-cluster-size filtering and silhouette-based confidence.
MOS scoring (GetUtmosv2ScoreStage): batch-writes temp WAV files, calls UTMOSv2, and maps scores back to entries by file stem rather than result ordering.

Confidence Score: 4/5

The core clustering logic and UTMOSv2 scoring are sound, but the global clustering mode can crash with an unhandled FileNotFoundError if any embedding shard is missing, while the grouped mode handles the same situation gracefully.

Global and shard-level processing modes propagate FileNotFoundError uncaught from _load_shard_pair, aborting a multi-hour clustering run if one shard file is absent, while grouped mode recovers gracefully. All three modes also use direct dict indexing that raises KeyError on any malformed manifest line instead of skipping it.

nemo_curator/stages/audio/speaker_id/speaker_clustering_and_scoring.py — the three processing-mode methods need consistent error recovery; nemo_curator/stages/audio/speaker_id/speaker_embedding_request.py — unconditional librosa import and fragile channel heuristic.

Important Files Changed

Filename	Overview
nemo_curator/stages/audio/speaker_id/speaker_clustering_and_scoring.py	Core clustering stage with three modes; global mode lacks FileNotFoundError recovery unlike grouped mode; all modes use direct dict indexing that can KeyError on malformed manifests; duplicated helpers create unnecessary torch coupling.
nemo_curator/stages/audio/speaker_id/speaker_embedding_request.py	JSONL-backed embedding stage; unconditional librosa import without fallback can crash mid-batch on resampling; fragile channel-layout heuristic.
nemo_curator/stages/audio/speaker_id/clustering/large_scale_clustering_and_scoring.py	BIRCH+AHC two-stage large-scale clustering; well-documented with leaf-cap backoff; suppressed F821 warning on leaf_centroids assigned inside loop body.
nemo_curator/stages/audio/metrics/utmosv2_score.py	UTMOSv2 MOS scoring stage; prior axis and ordering bugs addressed; stem-keyed score dict and graceful librosa fallback.
nemo_curator/stages/audio/speaker_id/speaker_embedding_lhotse.py	Lhotse shard-aware extractor; NeMo imports are lazy; helper functions duplicated in speaker_clustering_and_scoring.py.
nemo_curator/stages/audio/speaker_id/clustering/ahc.py	Standard AHC clustering with cosine distance; clean implementation with proper scipy imports.
nemo_curator/stages/audio/speaker_id/embedding/extractor.py	Duration-aware dynamic batching for WeSpeaker embeddings; sorts longest-first; clean flush logic.
nemo_curator/stages/audio/speaker_id/embedding/model_loader.py	WeSpeaker model loader bypassing wespeaker/init.py via importlib to avoid heavy optional dependency chain.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Audio Input\nLhotse / JSONL / AudioTask] --> B{Embedding\nExtraction}
    B --> C[SpeakerEmbeddingLhotseStage\nnemo_tarred / lhotse_shar]
    B --> D[SpeakerEmbeddingRequestStage\nJSONL + file paths]
    B --> E[SpeakerEmbeddingAudioTaskStage\nin-memory waveform]
    C --> F[per-shard .npz files]
    D --> F
    E --> F
    F --> G{SpeakerClusteringStage\n_resolve_mode}
    G --> H[shard mode\nper-shard AHC]
    G --> I[global mode\nall-shards AHC]
    G --> J[grouped mode\nbatch AHC + label offset]
    H --> K{Dataset size?}
    I --> K
    J --> K
    K -->|small| L[ahc.py\nstandard cosine AHC]
    K -->|large| M[large_scale_clustering\nBIRCH + AHC on centroids]
    L --> N[min_cluster_size filter]
    M --> N
    N --> O[speaker_label + confidence_score\nannotated JSONL manifests]
    P[Audio Input] --> Q[GetUtmosv2ScoreStage]
    Q --> R[temp WAV dir]
    R --> S[UTMOSv2 model.predict]
    S --> T[stem-keyed score dict]
    T --> U[utmosv2_score in DocumentBatch / AudioTask]

_{Reviews (3): Last reviewed commit: "Merge origin/main into Test_pipeline_MR" | Re-trigger Greptile}

greptile-apps · 2026-05-21T12:56:56Z

+        return f"gpu{cv.split(',')[0]}"
+    return f"pid{os.getpid()}"
+
+if TYPE_CHECKING:
+    from lhotse import Cut, CutSet
+
+try:
+    from nemo.collections.common.data.lhotse.nemo_adapters import (
+        LazyNeMoIterator,
+        LazyNeMoTarredIterator,


Unconditional top-level NeMo import breaks environments without NeMo

The module raises RuntimeError at import time if NeMo is not installed, because the try/except around the NeMo import re-raises as a hard error. Since speaker_id/__init__.py eagerly imports SpeakerEmbeddingLhotseStage from this module, and nemo_curator/stages/audio/__init__.py re-exports it, any import nemo_curator.stages.audio will fail in environments where NeMo is not installed. Moving the LazyNeMoIterator/LazyNeMoTarredIterator imports inside the process() method would make NeMo a lazy dependency and avoid breaking all other audio stages.

greptile-apps · 2026-05-21T12:56:59Z

+        if waveform.ndim > 1:
+            waveform = waveform.mean(axis=1)


Wrong axis for multi-channel stereo mix-down silently corrupts audio

When the waveform is in channels-first format (channels, samples) as used by NeMo and Lhotse, mean(axis=1) averages over the time dimension, yielding shape (channels,) instead of (samples,). That array is then resampled and written as a 1-sample WAV, so UTMOSv2 scores garbage audio with no error. The correct axis for channels-first is 0.

Suggested change

if waveform.ndim > 1:

waveform = waveform.mean(axis=1)

if waveform.ndim > 1:

waveform = waveform.mean(axis=0)

greptile-apps · 2026-05-21T12:57:01Z

+        results = self._model.predict(
+            input_dir=wav_dir,
+            batch_size=self.inference_batch_size,
+            num_repetitions=self.num_repetitions,
+            predict_dataset=self.predict_dataset,
+            num_workers=0,
+            verbose=False,
+        )


_score_dir result ordering is not guaranteed

model.predict(input_dir=wav_dir) returns results whose order UTMOSv2 does not document. The code zips results directly against valid_indices relying on lexicographic file-name order. The zero-padded names happen to work today, but if a future UTMOSv2 version changes traversal order, MOS scores will be silently assigned to the wrong entries.

greptile-apps · 2026-05-21T12:57:03Z

+    try:
+        import librosa
+        return librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)
+    except ImportError:
+        ratio = target_sr / orig_sr
+        indices = np.round(np.arange(0, len(audio), 1 / ratio)).astype(int)
+        indices = indices[indices < len(audio)]
+        return audio[indices]


Fallback resampler performs nearest-neighbor, not linear interpolation

The fallback uses np.round to pick the nearest existing sample index, which is nearest-neighbor resampling. The comment should say so, or the implementation should use np.interp for true linear interpolation.

Suggested change

try:

import librosa

return librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)

except ImportError:

ratio = target_sr / orig_sr

indices = np.round(np.arange(0, len(audio), 1 / ratio)).astype(int)

indices = indices[indices < len(audio)]

return audio[indices]

try:

import librosa

return librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)

except ImportError:

# Nearest-neighbour fallback (low quality - prefer installing librosa).

ratio = target_sr / orig_sr

indices = np.round(np.arange(0, len(audio), 1 / ratio)).astype(int)

indices = indices[indices < len(audio)]

return audio[indices]

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

greptile-apps · 2026-06-09T10:00:18Z

+            dynamic_ncols=True,
+            disable=bar_disable,
+        ):
+            manifest_entries, _cut_ids, embeddings, mapping = self._load_shard_pair(mp)


Missing FileNotFoundError recovery in global mode

_process_global calls self._load_shard_pair(mp) without a try/except, so a single missing .npz file crashes the entire clustering run with an unhandled FileNotFoundError. _process_grouped wraps the same call in a try/except FileNotFoundError with a graceful warning-and-skip, and _load_shard_pair itself explicitly raises FileNotFoundError for missing embedding files. In distributed or multi-shard pipelines where a worker may have failed and left a gap, global mode will abort while the grouped path silently continues — creating a confusing behavioral difference.

greptile-apps · 2026-06-09T10:00:19Z

+                _log_cluster_summary(labels, self.threshold)
+
+                for entry in manifest_entries:
+                    afp = entry[self.audio_filepath_key]


KeyError crash on malformed manifest entries in all processing modes

entry[self.audio_filepath_key] uses direct dict indexing in all three process paths (_process_shard_level, _process_global, _process_grouped). If any JSONL line in the manifest is missing the key (corrupted file, truncated write, schema mismatch), the entire clustering stage raises KeyError and crashes. The upstream _verify_manifest_vs_embeddings helper correctly uses .get(audio_filepath_key, ""), but the downstream write loops do not. This affects all three clustering modes.

greptile-apps · 2026-06-09T10:00:20Z

+            import librosa
+
+            audio = librosa.resample(audio, orig_sr=sr, target_sr=self.target_sample_rate)


Hard librosa import with no fallback can crash mid-batch

When the loaded audio's sample rate differs from target_sample_rate, librosa is imported unconditionally with no try/except. If librosa is not installed the stage raises ModuleNotFoundError in the middle of processing a batch, after the model has been loaded and some rows already processed, potentially leaving a partially-written output file. Compare with _resample() in utmosv2_score.py which wraps the same import in a try/except and falls back gracefully. A similar unconditional import exists at line 252 in the waveform branch of process().

oyilmaz-nvidia

Since it's more domain specific, it's a bit hard for me to review thoroughly. But I asked Claude to get its review.

Can you please at least check the following and if it makes sense, address the points raised? I think the Critical ones definitely need some attention.

Code Review: PR #2007 — Audio Speaker ID & UTMOSv2 Stages

Overview

Adds two major audio curation capabilities:

Speaker diarization pipeline — speaker embedding extraction (three backends: Lhotse/NeMo, request-based, AudioTask), followed by AHC or large-scale BIRCH+AHC clustering and scoring.
UTMOSv2 MOS scoring stage.

Plus Fern docs, a tutorial pipeline, and unit tests for all components.

30 files, 5,257 additions, 0 deletions — pure addition.

Critical Bugs

1. [CRITICAL] Channel-handling convention is inconsistent across the three embedding backends

Three files mix up channels-first vs channels-last, and mean vs first-channel:

File	Operation	Assumed layout
`utmosv2_score.py`	`waveform.mean(axis=0)`	channels-first `(C, T)`
`feature.py` (`load_audio`)	`pcm.mean(axis=1)`	channels-last `(T, C)`
`speaker_embedding_request.py`	`audio[:, 0]`	channels-last `(T, C)`, first channel only
`speaker_embedding_lhotse.py`	`audio[0]`	channels-first `(C, T)`, first channel only

At minimum two of these conventions are wrong. On stereo data this silently corrupts embeddings without raising an error.

2. [CRITICAL] Confidence formula appears inverted (`ahc.py` and `large_scale_clustering_and_scoring.py`)

The silhouette score is (b - a) / max(a, b) where a = intra-cluster distance and b = nearest-cluster distance. The code computes (a - b) / max(a, b) — the negation. A tight, well-separated cluster will produce a negative confidence score and a poor cluster a positive one. This inverts the entire quality metric.

3. [HIGH] `noqa: F821` suppresses a potential `NameError` (`large_scale_clustering_and_scoring.py`)

After del birch, the code references leaf_centroids with # noqa: F821 to silence the undefined-name linter warning. If del birch runs before leaf_centroids is assigned (e.g. in an early backoff iteration), this crashes at runtime rather than at lint time. The suppression hides the problem.

4. [MEDIUM] BIRCH radius backoff can loop uselessly

The backoff loop runs up to 8 attempts with 1.25× geometric growth but caps threshold at 1.0. Once the cap is hit, subsequent iterations repeat the same computation at threshold=1.0 without making progress, wasting compute.

Correctness / Logic Issues

5. [HIGH] `speaker_embedding_request.py` uses first channel (`audio[:, 0]`) instead of mean

utmosv2_score.py and feature.py mix down to mono by averaging channels. The request backend silently drops all channels except the first. Embeddings will differ across backends for the same stereo file.

6. [MEDIUM] `FileNotFoundError` handled inconsistently across the three `SpeakerClusteringStage` processing modes

_process_grouped catches FileNotFoundError and silently skips the shard. _process_global and _process_shard_level let it propagate and abort. This is an undocumented behavioral asymmetry.

Structural Issues

7. [MEDIUM] `SpeakerClusteringStage` bundles three mutually exclusive processing modes in one class

_process_shard_level, _process_global, and _process_grouped diverge in behavior, error handling, and label-offsetting logic with no shared non-trivial code. Three separate stage classes or a strategy pattern would make the distinct behaviors explicit.

8. [LOW] Two 700+ line files

speaker_clustering_and_scoring.py (742 lines) and large_scale_clustering_and_scoring.py (719 lines) interleave math, file I/O, and orchestration. The metric functions (cluster_stats, cluster_quality, speaker_confidence) could be extracted into a shared metrics.py.

9. [MEDIUM] `model_loader.py` injects stub packages into `sys.modules`

Bypassing wespeaker/__init__.py via importlib.util and patching sys.modules is fragile — if the real package is present the stubs will conflict, and the approach silently breaks if the package's internal layout changes. At minimum this needs a comment explaining why it's necessary and when it's safe.

10. [LOW] Lazy vs eager `librosa` import is inconsistent

utmosv2_score.py lazily imports librosa inside _resample with a fallback. speaker_embedding_request.py imports it unconditionally at the top. Pick one pattern.

PR Hygiene

11. [HIGH] PR description is entirely empty

The body is the unmodified template — all sections blank and all checklist items unchecked for a 5,257-line addition. No context for what this replaces, what the intended data flow is, or how to test it. The title ("adding part of pipeline as smoke merge to main") reads like a placeholder.

12. [LOW] `speaker_id/init.py` exports 37 symbols

Makes it hard to distinguish the intended public API from internal helpers. Consider being selective or organizing by sub-module.

Summary

Severity	Issues
CRITICAL	Channel layout mismatch (#1), confidence formula inverted (#2)
HIGH	`noqa` hiding potential NameError (#3), channel inconsistency across backends (#5), empty PR description (#11)
MEDIUM	BIRCH cap wastes iterations (#4), FileNotFoundError asymmetry (#6), SpeakerClusteringStage design (#7), sys.modules patching (#9)
LOW	File size (#8), librosa import inconsistency (#10), export surface (#12)

Block on #1 and #2 — both are silent correctness bugs that produce wrong results on real data.

adding part of pipeline as smoke merge to main

1a15b8f

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Jorjeous requested a review from a team as a code owner May 21, 2026 12:50

Jorjeous requested review from oyilmaz-nvidia and removed request for a team May 21, 2026 12:50

greptile-apps Bot reviewed May 21, 2026

View reviewed changes

Jorjeous added 2 commits June 9, 2026 02:53

fix CI, ruff linter

c44d881

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Merge origin/main into Test_pipeline_MR

1bb9291

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Jorjeous force-pushed the Test_pipeline_MR branch from f102dcd to 1bb9291 Compare June 9, 2026 09:54

greptile-apps Bot reviewed Jun 9, 2026

View reviewed changes

oyilmaz-nvidia reviewed Jun 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

adding part of pipeline as smoke merge to main#2007

adding part of pipeline as smoke merge to main#2007
Jorjeous wants to merge 3 commits into
mainfrom
Test_pipeline_MR

Jorjeous commented May 21, 2026

Uh oh!

copy-pr-bot Bot commented May 21, 2026

Uh oh!

greptile-apps Bot commented May 21, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot May 21, 2026

Uh oh!

greptile-apps Bot May 21, 2026

Uh oh!

greptile-apps Bot May 21, 2026

Uh oh!

greptile-apps Bot May 21, 2026

Uh oh!

greptile-apps Bot Jun 9, 2026

Uh oh!

greptile-apps Bot Jun 9, 2026

Uh oh!

greptile-apps Bot Jun 9, 2026

Uh oh!

oyilmaz-nvidia left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		import librosa

		audio = librosa.resample(audio, orig_sr=sr, target_sr=self.target_sample_rate)

Conversation

Jorjeous commented May 21, 2026

Description

Usage

Checklist

Uh oh!

copy-pr-bot Bot commented May 21, 2026

Uh oh!

greptile-apps Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

oyilmaz-nvidia left a comment

Choose a reason for hiding this comment

Code Review: PR #2007 — Audio Speaker ID & UTMOSv2 Stages

Overview

Critical Bugs

1. [CRITICAL] Channel-handling convention is inconsistent across the three embedding backends

2. [CRITICAL] Confidence formula appears inverted (ahc.py and large_scale_clustering_and_scoring.py)

3. [HIGH] noqa: F821 suppresses a potential NameError (large_scale_clustering_and_scoring.py)

4. [MEDIUM] BIRCH radius backoff can loop uselessly

Correctness / Logic Issues

5. [HIGH] speaker_embedding_request.py uses first channel (audio[:, 0]) instead of mean

6. [MEDIUM] FileNotFoundError handled inconsistently across the three SpeakerClusteringStage processing modes

Structural Issues

7. [MEDIUM] SpeakerClusteringStage bundles three mutually exclusive processing modes in one class

8. [LOW] Two 700+ line files

9. [MEDIUM] model_loader.py injects stub packages into sys.modules

10. [LOW] Lazy vs eager librosa import is inconsistent

PR Hygiene

11. [HIGH] PR description is entirely empty

12. [LOW] speaker_id/__init__.py exports 37 symbols

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps Bot commented May 21, 2026 •

edited

Loading

2. [CRITICAL] Confidence formula appears inverted (`ahc.py` and `large_scale_clustering_and_scoring.py`)

3. [HIGH] `noqa: F821` suppresses a potential `NameError` (`large_scale_clustering_and_scoring.py`)

5. [HIGH] `speaker_embedding_request.py` uses first channel (`audio[:, 0]`) instead of mean

6. [MEDIUM] `FileNotFoundError` handled inconsistently across the three `SpeakerClusteringStage` processing modes

7. [MEDIUM] `SpeakerClusteringStage` bundles three mutually exclusive processing modes in one class

9. [MEDIUM] `model_loader.py` injects stub packages into `sys.modules`

10. [LOW] Lazy vs eager `librosa` import is inconsistent

12. [LOW] `speaker_id/init.py` exports 37 symbols