adding part of pipeline as smoke merge to main#2007
Conversation
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Greptile SummaryThis PR adds a speaker ID and audio quality pipeline to NeMo Curator, including speaker embedding extraction (Lhotse/JSONL/AudioTask variants), two-stage AHC clustering (standard and large-scale BIRCH+AHC), per-utterance confidence scoring, and a UTMOSv2 MOS scoring stage.
Confidence Score: 4/5The core clustering logic and UTMOSv2 scoring are sound, but the global clustering mode can crash with an unhandled FileNotFoundError if any embedding shard is missing, while the grouped mode handles the same situation gracefully. Global and shard-level processing modes propagate FileNotFoundError uncaught from _load_shard_pair, aborting a multi-hour clustering run if one shard file is absent, while grouped mode recovers gracefully. All three modes also use direct dict indexing that raises KeyError on any malformed manifest line instead of skipping it. nemo_curator/stages/audio/speaker_id/speaker_clustering_and_scoring.py — the three processing-mode methods need consistent error recovery; nemo_curator/stages/audio/speaker_id/speaker_embedding_request.py — unconditional librosa import and fragile channel heuristic. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Audio Input\nLhotse / JSONL / AudioTask] --> B{Embedding\nExtraction}
B --> C[SpeakerEmbeddingLhotseStage\nnemo_tarred / lhotse_shar]
B --> D[SpeakerEmbeddingRequestStage\nJSONL + file paths]
B --> E[SpeakerEmbeddingAudioTaskStage\nin-memory waveform]
C --> F[per-shard .npz files]
D --> F
E --> F
F --> G{SpeakerClusteringStage\n_resolve_mode}
G --> H[shard mode\nper-shard AHC]
G --> I[global mode\nall-shards AHC]
G --> J[grouped mode\nbatch AHC + label offset]
H --> K{Dataset size?}
I --> K
J --> K
K -->|small| L[ahc.py\nstandard cosine AHC]
K -->|large| M[large_scale_clustering\nBIRCH + AHC on centroids]
L --> N[min_cluster_size filter]
M --> N
N --> O[speaker_label + confidence_score\nannotated JSONL manifests]
P[Audio Input] --> Q[GetUtmosv2ScoreStage]
Q --> R[temp WAV dir]
R --> S[UTMOSv2 model.predict]
S --> T[stem-keyed score dict]
T --> U[utmosv2_score in DocumentBatch / AudioTask]
Reviews (3): Last reviewed commit: "Merge origin/main into Test_pipeline_MR" | Re-trigger Greptile |
| return f"gpu{cv.split(',')[0]}" | ||
| return f"pid{os.getpid()}" | ||
|
|
||
| if TYPE_CHECKING: | ||
| from lhotse import Cut, CutSet | ||
|
|
||
| try: | ||
| from nemo.collections.common.data.lhotse.nemo_adapters import ( | ||
| LazyNeMoIterator, | ||
| LazyNeMoTarredIterator, |
There was a problem hiding this comment.
Unconditional top-level NeMo import breaks environments without NeMo
The module raises RuntimeError at import time if NeMo is not installed, because the try/except around the NeMo import re-raises as a hard error. Since speaker_id/__init__.py eagerly imports SpeakerEmbeddingLhotseStage from this module, and nemo_curator/stages/audio/__init__.py re-exports it, any import nemo_curator.stages.audio will fail in environments where NeMo is not installed. Moving the LazyNeMoIterator/LazyNeMoTarredIterator imports inside the process() method would make NeMo a lazy dependency and avoid breaking all other audio stages.
| if waveform.ndim > 1: | ||
| waveform = waveform.mean(axis=1) |
There was a problem hiding this comment.
Wrong axis for multi-channel stereo mix-down silently corrupts audio
When the waveform is in channels-first format (channels, samples) as used by NeMo and Lhotse, mean(axis=1) averages over the time dimension, yielding shape (channels,) instead of (samples,). That array is then resampled and written as a 1-sample WAV, so UTMOSv2 scores garbage audio with no error. The correct axis for channels-first is 0.
| if waveform.ndim > 1: | |
| waveform = waveform.mean(axis=1) | |
| if waveform.ndim > 1: | |
| waveform = waveform.mean(axis=0) |
| results = self._model.predict( | ||
| input_dir=wav_dir, | ||
| batch_size=self.inference_batch_size, | ||
| num_repetitions=self.num_repetitions, | ||
| predict_dataset=self.predict_dataset, | ||
| num_workers=0, | ||
| verbose=False, | ||
| ) |
There was a problem hiding this comment.
_score_dir result ordering is not guaranteed
model.predict(input_dir=wav_dir) returns results whose order UTMOSv2 does not document. The code zips results directly against valid_indices relying on lexicographic file-name order. The zero-padded names happen to work today, but if a future UTMOSv2 version changes traversal order, MOS scores will be silently assigned to the wrong entries.
| try: | ||
| import librosa | ||
| return librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr) | ||
| except ImportError: | ||
| ratio = target_sr / orig_sr | ||
| indices = np.round(np.arange(0, len(audio), 1 / ratio)).astype(int) | ||
| indices = indices[indices < len(audio)] | ||
| return audio[indices] |
There was a problem hiding this comment.
Fallback resampler performs nearest-neighbor, not linear interpolation
The fallback uses np.round to pick the nearest existing sample index, which is nearest-neighbor resampling. The comment should say so, or the implementation should use np.interp for true linear interpolation.
| try: | |
| import librosa | |
| return librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr) | |
| except ImportError: | |
| ratio = target_sr / orig_sr | |
| indices = np.round(np.arange(0, len(audio), 1 / ratio)).astype(int) | |
| indices = indices[indices < len(audio)] | |
| return audio[indices] | |
| try: | |
| import librosa | |
| return librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr) | |
| except ImportError: | |
| # Nearest-neighbour fallback (low quality - prefer installing librosa). | |
| ratio = target_sr / orig_sr | |
| indices = np.round(np.arange(0, len(audio), 1 / ratio)).astype(int) | |
| indices = indices[indices < len(audio)] | |
| return audio[indices] |
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
f102dcd to
1bb9291
Compare
| dynamic_ncols=True, | ||
| disable=bar_disable, | ||
| ): | ||
| manifest_entries, _cut_ids, embeddings, mapping = self._load_shard_pair(mp) |
There was a problem hiding this comment.
Missing FileNotFoundError recovery in global mode
_process_global calls self._load_shard_pair(mp) without a try/except, so a single missing .npz file crashes the entire clustering run with an unhandled FileNotFoundError. _process_grouped wraps the same call in a try/except FileNotFoundError with a graceful warning-and-skip, and _load_shard_pair itself explicitly raises FileNotFoundError for missing embedding files. In distributed or multi-shard pipelines where a worker may have failed and left a gap, global mode will abort while the grouped path silently continues — creating a confusing behavioral difference.
| _log_cluster_summary(labels, self.threshold) | ||
|
|
||
| for entry in manifest_entries: | ||
| afp = entry[self.audio_filepath_key] |
There was a problem hiding this comment.
KeyError crash on malformed manifest entries in all processing modes
entry[self.audio_filepath_key] uses direct dict indexing in all three process paths (_process_shard_level, _process_global, _process_grouped). If any JSONL line in the manifest is missing the key (corrupted file, truncated write, schema mismatch), the entire clustering stage raises KeyError and crashes. The upstream _verify_manifest_vs_embeddings helper correctly uses .get(audio_filepath_key, ""), but the downstream write loops do not. This affects all three clustering modes.
| import librosa | ||
|
|
||
| audio = librosa.resample(audio, orig_sr=sr, target_sr=self.target_sample_rate) |
There was a problem hiding this comment.
Hard
librosa import with no fallback can crash mid-batch
When the loaded audio's sample rate differs from target_sample_rate, librosa is imported unconditionally with no try/except. If librosa is not installed the stage raises ModuleNotFoundError in the middle of processing a batch, after the model has been loaded and some rows already processed, potentially leaving a partially-written output file. Compare with _resample() in utmosv2_score.py which wraps the same import in a try/except and falls back gracefully. A similar unconditional import exists at line 252 in the waveform branch of process().
oyilmaz-nvidia
left a comment
There was a problem hiding this comment.
Since it's more domain specific, it's a bit hard for me to review thoroughly. But I asked Claude to get its review.
Can you please at least check the following and if it makes sense, address the points raised? I think the Critical ones definitely need some attention.
Code Review: PR #2007 — Audio Speaker ID & UTMOSv2 Stages
Overview
Adds two major audio curation capabilities:
- Speaker diarization pipeline — speaker embedding extraction (three backends: Lhotse/NeMo, request-based, AudioTask), followed by AHC or large-scale BIRCH+AHC clustering and scoring.
- UTMOSv2 MOS scoring stage.
Plus Fern docs, a tutorial pipeline, and unit tests for all components.
30 files, 5,257 additions, 0 deletions — pure addition.
Critical Bugs
1. [CRITICAL] Channel-handling convention is inconsistent across the three embedding backends
Three files mix up channels-first vs channels-last, and mean vs first-channel:
| File | Operation | Assumed layout |
|---|---|---|
utmosv2_score.py |
waveform.mean(axis=0) |
channels-first (C, T) |
feature.py (load_audio) |
pcm.mean(axis=1) |
channels-last (T, C) |
speaker_embedding_request.py |
audio[:, 0] |
channels-last (T, C), first channel only |
speaker_embedding_lhotse.py |
audio[0] |
channels-first (C, T), first channel only |
At minimum two of these conventions are wrong. On stereo data this silently corrupts embeddings without raising an error.
2. [CRITICAL] Confidence formula appears inverted (ahc.py and large_scale_clustering_and_scoring.py)
The silhouette score is (b - a) / max(a, b) where a = intra-cluster distance and b = nearest-cluster distance. The code computes (a - b) / max(a, b) — the negation. A tight, well-separated cluster will produce a negative confidence score and a poor cluster a positive one. This inverts the entire quality metric.
3. [HIGH] noqa: F821 suppresses a potential NameError (large_scale_clustering_and_scoring.py)
After del birch, the code references leaf_centroids with # noqa: F821 to silence the undefined-name linter warning. If del birch runs before leaf_centroids is assigned (e.g. in an early backoff iteration), this crashes at runtime rather than at lint time. The suppression hides the problem.
4. [MEDIUM] BIRCH radius backoff can loop uselessly
The backoff loop runs up to 8 attempts with 1.25× geometric growth but caps threshold at 1.0. Once the cap is hit, subsequent iterations repeat the same computation at threshold=1.0 without making progress, wasting compute.
Correctness / Logic Issues
5. [HIGH] speaker_embedding_request.py uses first channel (audio[:, 0]) instead of mean
utmosv2_score.py and feature.py mix down to mono by averaging channels. The request backend silently drops all channels except the first. Embeddings will differ across backends for the same stereo file.
6. [MEDIUM] FileNotFoundError handled inconsistently across the three SpeakerClusteringStage processing modes
_process_grouped catches FileNotFoundError and silently skips the shard. _process_global and _process_shard_level let it propagate and abort. This is an undocumented behavioral asymmetry.
Structural Issues
7. [MEDIUM] SpeakerClusteringStage bundles three mutually exclusive processing modes in one class
_process_shard_level, _process_global, and _process_grouped diverge in behavior, error handling, and label-offsetting logic with no shared non-trivial code. Three separate stage classes or a strategy pattern would make the distinct behaviors explicit.
8. [LOW] Two 700+ line files
speaker_clustering_and_scoring.py (742 lines) and large_scale_clustering_and_scoring.py (719 lines) interleave math, file I/O, and orchestration. The metric functions (cluster_stats, cluster_quality, speaker_confidence) could be extracted into a shared metrics.py.
9. [MEDIUM] model_loader.py injects stub packages into sys.modules
Bypassing wespeaker/__init__.py via importlib.util and patching sys.modules is fragile — if the real package is present the stubs will conflict, and the approach silently breaks if the package's internal layout changes. At minimum this needs a comment explaining why it's necessary and when it's safe.
10. [LOW] Lazy vs eager librosa import is inconsistent
utmosv2_score.py lazily imports librosa inside _resample with a fallback. speaker_embedding_request.py imports it unconditionally at the top. Pick one pattern.
PR Hygiene
11. [HIGH] PR description is entirely empty
The body is the unmodified template — all sections blank and all checklist items unchecked for a 5,257-line addition. No context for what this replaces, what the intended data flow is, or how to test it. The title ("adding part of pipeline as smoke merge to main") reads like a placeholder.
12. [LOW] speaker_id/__init__.py exports 37 symbols
Makes it hard to distinguish the intended public API from internal helpers. Consider being selective or organizing by sub-module.
Summary
| Severity | Issues |
|---|---|
| CRITICAL | Channel layout mismatch (#1), confidence formula inverted (#2) |
| HIGH | noqa hiding potential NameError (#3), channel inconsistency across backends (#5), empty PR description (#11) |
| MEDIUM | BIRCH cap wastes iterations (#4), FileNotFoundError asymmetry (#6), SpeakerClusteringStage design (#7), sys.modules patching (#9) |
| LOW | File size (#8), librosa import inconsistency (#10), export surface (#12) |
Block on #1 and #2 — both are silent correctness bugs that produce wrong results on real data.
Description
Usage
# Add snippet demonstrating usageChecklist