Skip to content

adding part of pipeline as smoke merge to main#2007

Open
Jorjeous wants to merge 3 commits into
mainfrom
Test_pipeline_MR
Open

adding part of pipeline as smoke merge to main#2007
Jorjeous wants to merge 3 commits into
mainfrom
Test_pipeline_MR

Conversation

@Jorjeous

Copy link
Copy Markdown
Member

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous Jorjeous requested a review from a team as a code owner May 21, 2026 12:50
@Jorjeous Jorjeous requested review from oyilmaz-nvidia and removed request for a team May 21, 2026 12:50
@copy-pr-bot

copy-pr-bot Bot commented May 21, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a speaker ID and audio quality pipeline to NeMo Curator, including speaker embedding extraction (Lhotse/JSONL/AudioTask variants), two-stage AHC clustering (standard and large-scale BIRCH+AHC), per-utterance confidence scoring, and a UTMOSv2 MOS scoring stage.

  • Embedding extraction (SpeakerEmbeddingLhotseStage, SpeakerEmbeddingRequestStage, SpeakerEmbeddingAudioTaskStage): shard-aware NeMo TitaNet-based extraction with support for local files, tar archives, AIS/S3, and in-memory Lhotse CutSets.
  • Clustering (SpeakerClusteringStage): three modes — per-shard, global, and grouped — backed by either standard AHC (ahc.py) or memory-bounded BIRCH+AHC (large_scale_clustering_and_scoring.py, the "SCOTCH" algorithm) with min-cluster-size filtering and silhouette-based confidence.
  • MOS scoring (GetUtmosv2ScoreStage): batch-writes temp WAV files, calls UTMOSv2, and maps scores back to entries by file stem rather than result ordering.

Confidence Score: 4/5

The core clustering logic and UTMOSv2 scoring are sound, but the global clustering mode can crash with an unhandled FileNotFoundError if any embedding shard is missing, while the grouped mode handles the same situation gracefully.

Global and shard-level processing modes propagate FileNotFoundError uncaught from _load_shard_pair, aborting a multi-hour clustering run if one shard file is absent, while grouped mode recovers gracefully. All three modes also use direct dict indexing that raises KeyError on any malformed manifest line instead of skipping it.

nemo_curator/stages/audio/speaker_id/speaker_clustering_and_scoring.py — the three processing-mode methods need consistent error recovery; nemo_curator/stages/audio/speaker_id/speaker_embedding_request.py — unconditional librosa import and fragile channel heuristic.

Important Files Changed

Filename Overview
nemo_curator/stages/audio/speaker_id/speaker_clustering_and_scoring.py Core clustering stage with three modes; global mode lacks FileNotFoundError recovery unlike grouped mode; all modes use direct dict indexing that can KeyError on malformed manifests; duplicated helpers create unnecessary torch coupling.
nemo_curator/stages/audio/speaker_id/speaker_embedding_request.py JSONL-backed embedding stage; unconditional librosa import without fallback can crash mid-batch on resampling; fragile channel-layout heuristic.
nemo_curator/stages/audio/speaker_id/clustering/large_scale_clustering_and_scoring.py BIRCH+AHC two-stage large-scale clustering; well-documented with leaf-cap backoff; suppressed F821 warning on leaf_centroids assigned inside loop body.
nemo_curator/stages/audio/metrics/utmosv2_score.py UTMOSv2 MOS scoring stage; prior axis and ordering bugs addressed; stem-keyed score dict and graceful librosa fallback.
nemo_curator/stages/audio/speaker_id/speaker_embedding_lhotse.py Lhotse shard-aware extractor; NeMo imports are lazy; helper functions duplicated in speaker_clustering_and_scoring.py.
nemo_curator/stages/audio/speaker_id/clustering/ahc.py Standard AHC clustering with cosine distance; clean implementation with proper scipy imports.
nemo_curator/stages/audio/speaker_id/embedding/extractor.py Duration-aware dynamic batching for WeSpeaker embeddings; sorts longest-first; clean flush logic.
nemo_curator/stages/audio/speaker_id/embedding/model_loader.py WeSpeaker model loader bypassing wespeaker/init.py via importlib to avoid heavy optional dependency chain.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Audio Input\nLhotse / JSONL / AudioTask] --> B{Embedding\nExtraction}
    B --> C[SpeakerEmbeddingLhotseStage\nnemo_tarred / lhotse_shar]
    B --> D[SpeakerEmbeddingRequestStage\nJSONL + file paths]
    B --> E[SpeakerEmbeddingAudioTaskStage\nin-memory waveform]
    C --> F[per-shard .npz files]
    D --> F
    E --> F
    F --> G{SpeakerClusteringStage\n_resolve_mode}
    G --> H[shard mode\nper-shard AHC]
    G --> I[global mode\nall-shards AHC]
    G --> J[grouped mode\nbatch AHC + label offset]
    H --> K{Dataset size?}
    I --> K
    J --> K
    K -->|small| L[ahc.py\nstandard cosine AHC]
    K -->|large| M[large_scale_clustering\nBIRCH + AHC on centroids]
    L --> N[min_cluster_size filter]
    M --> N
    N --> O[speaker_label + confidence_score\nannotated JSONL manifests]
    P[Audio Input] --> Q[GetUtmosv2ScoreStage]
    Q --> R[temp WAV dir]
    R --> S[UTMOSv2 model.predict]
    S --> T[stem-keyed score dict]
    T --> U[utmosv2_score in DocumentBatch / AudioTask]
Loading

Reviews (3): Last reviewed commit: "Merge origin/main into Test_pipeline_MR" | Re-trigger Greptile

Comment on lines +63 to +72
return f"gpu{cv.split(',')[0]}"
return f"pid{os.getpid()}"

if TYPE_CHECKING:
from lhotse import Cut, CutSet

try:
from nemo.collections.common.data.lhotse.nemo_adapters import (
LazyNeMoIterator,
LazyNeMoTarredIterator,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Unconditional top-level NeMo import breaks environments without NeMo

The module raises RuntimeError at import time if NeMo is not installed, because the try/except around the NeMo import re-raises as a hard error. Since speaker_id/__init__.py eagerly imports SpeakerEmbeddingLhotseStage from this module, and nemo_curator/stages/audio/__init__.py re-exports it, any import nemo_curator.stages.audio will fail in environments where NeMo is not installed. Moving the LazyNeMoIterator/LazyNeMoTarredIterator imports inside the process() method would make NeMo a lazy dependency and avoid breaking all other audio stages.

Comment on lines +188 to +189
if waveform.ndim > 1:
waveform = waveform.mean(axis=1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Wrong axis for multi-channel stereo mix-down silently corrupts audio

When the waveform is in channels-first format (channels, samples) as used by NeMo and Lhotse, mean(axis=1) averages over the time dimension, yielding shape (channels,) instead of (samples,). That array is then resampled and written as a 1-sample WAV, so UTMOSv2 scores garbage audio with no error. The correct axis for channels-first is 0.

Suggested change
if waveform.ndim > 1:
waveform = waveform.mean(axis=1)
if waveform.ndim > 1:
waveform = waveform.mean(axis=0)

Comment on lines +228 to +235
results = self._model.predict(
input_dir=wav_dir,
batch_size=self.inference_batch_size,
num_repetitions=self.num_repetitions,
predict_dataset=self.predict_dataset,
num_workers=0,
verbose=False,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 _score_dir result ordering is not guaranteed

model.predict(input_dir=wav_dir) returns results whose order UTMOSv2 does not document. The code zips results directly against valid_indices relying on lexicographic file-name order. The zero-padded names happen to work today, but if a future UTMOSv2 version changes traversal order, MOS scores will be silently assigned to the wrong entries.

Comment on lines +120 to +127
try:
import librosa
return librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)
except ImportError:
ratio = target_sr / orig_sr
indices = np.round(np.arange(0, len(audio), 1 / ratio)).astype(int)
indices = indices[indices < len(audio)]
return audio[indices]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Fallback resampler performs nearest-neighbor, not linear interpolation

The fallback uses np.round to pick the nearest existing sample index, which is nearest-neighbor resampling. The comment should say so, or the implementation should use np.interp for true linear interpolation.

Suggested change
try:
import librosa
return librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)
except ImportError:
ratio = target_sr / orig_sr
indices = np.round(np.arange(0, len(audio), 1 / ratio)).astype(int)
indices = indices[indices < len(audio)]
return audio[indices]
try:
import librosa
return librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)
except ImportError:
# Nearest-neighbour fallback (low quality - prefer installing librosa).
ratio = target_sr / orig_sr
indices = np.round(np.arange(0, len(audio), 1 / ratio)).astype(int)
indices = indices[indices < len(audio)]
return audio[indices]

Jorjeous added 2 commits June 9, 2026 02:53
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous Jorjeous force-pushed the Test_pipeline_MR branch from f102dcd to 1bb9291 Compare June 9, 2026 09:54
dynamic_ncols=True,
disable=bar_disable,
):
manifest_entries, _cut_ids, embeddings, mapping = self._load_shard_pair(mp)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Missing FileNotFoundError recovery in global mode

_process_global calls self._load_shard_pair(mp) without a try/except, so a single missing .npz file crashes the entire clustering run with an unhandled FileNotFoundError. _process_grouped wraps the same call in a try/except FileNotFoundError with a graceful warning-and-skip, and _load_shard_pair itself explicitly raises FileNotFoundError for missing embedding files. In distributed or multi-shard pipelines where a worker may have failed and left a gap, global mode will abort while the grouped path silently continues — creating a confusing behavioral difference.

_log_cluster_summary(labels, self.threshold)

for entry in manifest_entries:
afp = entry[self.audio_filepath_key]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 KeyError crash on malformed manifest entries in all processing modes

entry[self.audio_filepath_key] uses direct dict indexing in all three process paths (_process_shard_level, _process_global, _process_grouped). If any JSONL line in the manifest is missing the key (corrupted file, truncated write, schema mismatch), the entire clustering stage raises KeyError and crashes. The upstream _verify_manifest_vs_embeddings helper correctly uses .get(audio_filepath_key, ""), but the downstream write loops do not. This affects all three clustering modes.

Comment on lines +191 to +193
import librosa

audio = librosa.resample(audio, orig_sr=sr, target_sr=self.target_sample_rate)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Hard librosa import with no fallback can crash mid-batch

When the loaded audio's sample rate differs from target_sample_rate, librosa is imported unconditionally with no try/except. If librosa is not installed the stage raises ModuleNotFoundError in the middle of processing a batch, after the model has been loaded and some rows already processed, potentially leaving a partially-written output file. Compare with _resample() in utmosv2_score.py which wraps the same import in a try/except and falls back gracefully. A similar unconditional import exists at line 252 in the waveform branch of process().

@oyilmaz-nvidia oyilmaz-nvidia left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's more domain specific, it's a bit hard for me to review thoroughly. But I asked Claude to get its review.

Can you please at least check the following and if it makes sense, address the points raised? I think the Critical ones definitely need some attention.

Code Review: PR #2007 — Audio Speaker ID & UTMOSv2 Stages

Overview

Adds two major audio curation capabilities:

  1. Speaker diarization pipeline — speaker embedding extraction (three backends: Lhotse/NeMo, request-based, AudioTask), followed by AHC or large-scale BIRCH+AHC clustering and scoring.
  2. UTMOSv2 MOS scoring stage.

Plus Fern docs, a tutorial pipeline, and unit tests for all components.

30 files, 5,257 additions, 0 deletions — pure addition.


Critical Bugs

1. [CRITICAL] Channel-handling convention is inconsistent across the three embedding backends

Three files mix up channels-first vs channels-last, and mean vs first-channel:

File Operation Assumed layout
utmosv2_score.py waveform.mean(axis=0) channels-first (C, T)
feature.py (load_audio) pcm.mean(axis=1) channels-last (T, C)
speaker_embedding_request.py audio[:, 0] channels-last (T, C), first channel only
speaker_embedding_lhotse.py audio[0] channels-first (C, T), first channel only

At minimum two of these conventions are wrong. On stereo data this silently corrupts embeddings without raising an error.

2. [CRITICAL] Confidence formula appears inverted (ahc.py and large_scale_clustering_and_scoring.py)

The silhouette score is (b - a) / max(a, b) where a = intra-cluster distance and b = nearest-cluster distance. The code computes (a - b) / max(a, b) — the negation. A tight, well-separated cluster will produce a negative confidence score and a poor cluster a positive one. This inverts the entire quality metric.

3. [HIGH] noqa: F821 suppresses a potential NameError (large_scale_clustering_and_scoring.py)

After del birch, the code references leaf_centroids with # noqa: F821 to silence the undefined-name linter warning. If del birch runs before leaf_centroids is assigned (e.g. in an early backoff iteration), this crashes at runtime rather than at lint time. The suppression hides the problem.

4. [MEDIUM] BIRCH radius backoff can loop uselessly

The backoff loop runs up to 8 attempts with 1.25× geometric growth but caps threshold at 1.0. Once the cap is hit, subsequent iterations repeat the same computation at threshold=1.0 without making progress, wasting compute.


Correctness / Logic Issues

5. [HIGH] speaker_embedding_request.py uses first channel (audio[:, 0]) instead of mean

utmosv2_score.py and feature.py mix down to mono by averaging channels. The request backend silently drops all channels except the first. Embeddings will differ across backends for the same stereo file.

6. [MEDIUM] FileNotFoundError handled inconsistently across the three SpeakerClusteringStage processing modes

_process_grouped catches FileNotFoundError and silently skips the shard. _process_global and _process_shard_level let it propagate and abort. This is an undocumented behavioral asymmetry.


Structural Issues

7. [MEDIUM] SpeakerClusteringStage bundles three mutually exclusive processing modes in one class

_process_shard_level, _process_global, and _process_grouped diverge in behavior, error handling, and label-offsetting logic with no shared non-trivial code. Three separate stage classes or a strategy pattern would make the distinct behaviors explicit.

8. [LOW] Two 700+ line files

speaker_clustering_and_scoring.py (742 lines) and large_scale_clustering_and_scoring.py (719 lines) interleave math, file I/O, and orchestration. The metric functions (cluster_stats, cluster_quality, speaker_confidence) could be extracted into a shared metrics.py.

9. [MEDIUM] model_loader.py injects stub packages into sys.modules

Bypassing wespeaker/__init__.py via importlib.util and patching sys.modules is fragile — if the real package is present the stubs will conflict, and the approach silently breaks if the package's internal layout changes. At minimum this needs a comment explaining why it's necessary and when it's safe.

10. [LOW] Lazy vs eager librosa import is inconsistent

utmosv2_score.py lazily imports librosa inside _resample with a fallback. speaker_embedding_request.py imports it unconditionally at the top. Pick one pattern.


PR Hygiene

11. [HIGH] PR description is entirely empty

The body is the unmodified template — all sections blank and all checklist items unchecked for a 5,257-line addition. No context for what this replaces, what the intended data flow is, or how to test it. The title ("adding part of pipeline as smoke merge to main") reads like a placeholder.

12. [LOW] speaker_id/__init__.py exports 37 symbols

Makes it hard to distinguish the intended public API from internal helpers. Consider being selective or organizing by sub-module.


Summary

Severity Issues
CRITICAL Channel layout mismatch (#1), confidence formula inverted (#2)
HIGH noqa hiding potential NameError (#3), channel inconsistency across backends (#5), empty PR description (#11)
MEDIUM BIRCH cap wastes iterations (#4), FileNotFoundError asymmetry (#6), SpeakerClusteringStage design (#7), sys.modules patching (#9)
LOW File size (#8), librosa import inconsistency (#10), export surface (#12)

Block on #1 and #2 — both are silent correctness bugs that produce wrong results on real data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants