Add style check by ayushdg · Pull Request #3 · NVIDIA-NeMo/Curator

ayushdg · 2024-03-18T23:52:52Z

The aim of the PR is to add pre-commit configs and a GHA for style checks.
The PR does not attempt to fix all the style failures present in the repo. That will be done in a followup PR.

The config ends up using isort, black & flake8 (some of it included in the contributing guide).

Happy to also explore ruff which should be a much faster drop in replacement for some of the tools mentioned above and has been adopted by many large python projects.

ryantwolf · 2024-03-20T21:21:29Z

Looks like the .pre-commit-config.yaml of both NeMo Aligner and NeMo are slightly different than ours. Can we change ours to match theirs exactly, or is there a reason ours should be different?

ayushdg · 2024-03-20T21:56:33Z

Looks like the .pre-commit-config.yaml of both NeMo Aligner and NeMo are slightly different than ours. Can we change ours to match theirs exactly, or is there a reason ours should be different?

There's no reason for them to match exactly, some of it depends on the practices we want to follow for the repo so it's good to go over the differences:

pre-commit-hooks: It makes sense to include the hooks included in these libraries, but probably makes sense to keep the additional Trailing-whitespace, end-of-file-fixer unless we find specific examples in the repo where it doesn't make sense.
Full list of hooks here.
Black: Pretty much identical
Isort: I'd argue that adding the black profile should be the case since we're using black as recommended by isort.
Flake8: This is personal preference, a lot of projects prefer some level of linting flake8 is one option that combines rules from PyCodeStyle and PyFlakes. black formatting arguably fixes some of these rules, but happy to omit this, don't have a strong preference.
ci: I'm happy to explore using pre-commit.ci over the pre-commit action directly provided it doesn't require any additional enterprise approvals.

ryantwolf · 2024-03-20T22:07:17Z

Ok cool. Thanks for the clarification.

Yup let's add the ones they use and keep the ones you already have.
Good.
Just looked into this tool. I like it so let's keep it too haha. Do we need to exclude any directories like aligner does?
I'm fine with keeping it so long as we don't run into any issues.
Yeah if you could look into that that could be good. Just to keep things in line with the rest of the repos. If it ends up being a hassle though let me know and we can stick with this.

ayushdg · 2024-03-20T22:12:47Z

Isort: I'd argue that adding the black profile should be the case since we're using black as recommended by isort.

Actually I stand corrected, other repos do use this but specify the config in their pyproject.toml file. I'll probably do the same.

ayushdg · 2024-03-20T22:59:48Z

This also has the side effect of pre-commit.ci auto formatting all the files. @ryantwolf Are you okay updating all the file formats in this PR?

ryantwolf · 2024-03-20T23:02:16Z

It had to happen at some point haha. Let's reformat the code 🫡

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

…line length Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

for more information, see https://pre-commit.ci Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

Co-authored-by: Ryan Wolf <rywolf@nvidia.com> Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

This PR adds a short README.md file for the TinyStories tutorial, with instructions on how to run it. Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com> Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Add workflow for cpu pytests Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * Install wheel to fix fasttext import Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * Omit python 3.8, do not fast fail Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * Check if updating setuptools/pip changes cython errors Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * Explicitly install cython Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * Try freeing up space before install Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * Try rapids_no_initialize Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * remove python 3.9 testing for now Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> --------- Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

ryantwolf

Good with me. Just fix the DCO error by using the --signoff option on your commits.

ayushdg · 2024-03-21T22:42:26Z

Had some weirdness during the rebasing with signoff. Will open a new PR

minor changes for style guide sweep

[benchmarks] Update Semdedup benchmark + Add more metrics logging to KMeans + load_dataset_size accepts dataset ratio

…ion recovery and unified _skip_me tracking Adds optional QwenASR (0.6B) re-transcription stage that recovers samples flagged as hallucinated by QwenOmni. Unifies _skip_me field with source tracking (Hallucination:WhisperHallucination_omni, Recovered:QwenASR, Wrong language:FastTextLID, etc.). Routes the best prediction to downstream stages via SelectBestPredictionStage. - nemo_curator/models/qwen_asr.py — QwenASR model wrapper (qwen_asr lib + vLLM) - nemo_curator/stages/audio/inference/qwen_asr.py — conditional QwenASR stage - nemo_curator/stages/audio/text_filtering/select_best_prediction.py — picker - keep_waveform flag on QwenOmni for downstream audio access - Skip empty text in hallucination detection - Enabled via --asr_model_id (otherwise no-op) Squash cherry-pick of nune-tadevosyan#1. Conflict resolution: - run_pipeline.py: rebuilt to integrate both QwenASR (this PR) and ITN (PR #3) stage chains. Used PR_#1's structure as base, added ITN imports, CLI flags, prompt loading, and conditional ITNRestorationStage append after PnC. - fasttext_lid.py: kept PR_#1's source-tracked "Empty text:{self.name}" (matches the unified _skip_me convention introduced here). #NO_PR Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>

…st, validated improvements on top of the 4 PRs Squash cherry-pick of integration-test's unique commits on top of #1853 + #1 + #3 + #1839: - 633acc7 FastText and Hallucination update → SelectBestPredictionStage: cross-model WER agreement. If both omni and ASR are flagged hallucinated but agree (WER ≤ 100 - min_agreement_pct, default 80%), keep omni and mark recovered — two independent models producing near-identical text is strong evidence the text is correct. → FastTextLIDStage: HuggingFace-format model loader, proper _predict() abstraction, source-tracked _skip_me ("Wrong language:{name}"). - 5fdfa0a additional notes key + skip writing keys after skip_me + pnc prompt + prefill caching → Models (qwen_omni, qwen_asr, qwen_text_llm): notes_key field for diagnostic info, vLLM enable_prefix_caching=True with xxhash. → text_filtering stages: skip writing output keys when skip_me is set. → New file: prompts/pnc_prompt.md. - 15424e3 updated prompt for ITN → Sharper ITN prompt (handles more conversion edge cases). - 0cf8e6c match max model len for ITN and PnC → Aligned ITN/PnC max_model_len (4096), max_num_seqs (16), gpu_memory_utilization (0.95). Wired ITN args through run_pipeline. - 7e32df1 add Qwen3ASR for all → Apply QwenASR recovery to all hallucination flags, not just specific patterns. WhisperHallucinationStage tweaks. - caccd37 Add min word count for FastText → Re-adds min_word_count=2 (FastText is unreliable on single-word inputs). Conflict resolution: - run_pipeline.py: kept multi-line argparse style (ours), kept --source_lang_key, adopted theirs' ITN stage construction (with new max_model_len/num_seqs/gpu_mem args). - fasttext_lid.py: took theirs' richer process logic (min_word_count check, per-sample expected language via source_lang_key, source-tracked _skip_me values). #NO_PR Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>

Merge origin/main into dev to pick up upstream changes (492 files, +57k/-6k): - 26.04 staging release - Generic ASR/TTS audio processing pipeline (#1679) - Dynamo disaggregated serving + validators (#1813, #1820, #1833, #1834, #1861) - ReadSpeech audio curation benchmark + tutorials (#1841, #1851, #1870) - VideoReader path validation, audio waveform leak fixes (#1845, #1765) - Sortformer tutorial fixes + benchmarks (#1764) - Generic audio pipeline + qwen3 support (#1827) - Fern docs (audio + curate-audio sections) Conflict resolution: - nemo_curator/stages/audio/__init__.py: kept dev's lazy __getattr__ registry, added main's new ManifestReader and ManifestWriterStage to both __all__ and _LAZY_IMPORTS (now lazy-loaded from nemo_curator.stages.audio.common). - uv.lock: took main's version (latest dependency resolutions). Removals propagated from main (pre-merge-base files we no longer need): - nemo_curator/stages/audio/alm/alm_manifest_writer.py (replaced by ShardedManifestWriterStage) - nemo_curator/stages/audio/alm/alm_manifest_reader.py - nemo_curator/backends/experimental/* (refactored away) - nemo_curator/core/serve.py (replaced by typed serve config) Verified intact: - SCOTCH pipeline: speaker_id/, hifi_pipeline/slurm_e2e/ (dev-only additions, untouched). - Cherry-picked audio PRs (#1853, #3, #1, #1839, integration-test) all present. Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>

…ng, perf per-actor/per-GPU, reader/writer hardening Adapters + inference co-location (review NVIDIA-NeMo#3/NVIDIA-NeMo#25): - Move nemo_curator/adapters/asr/* -> stages/audio/inference/asr/adapters/* - Move stages/inference/{batch_policy,bucketed_stage}.py -> stages/audio/inference/ - Update all imports, YAML adapter_target / batch_policy._target_, READMEs, tests (tests mirror the new package paths). vLLM teardown (review NVIDIA-NeMo#15): - _cleanup_gpu() no longer calls torch.distributed.destroy_process_group(); vLLM owns its TP group, so tearing down the global PG could corrupt other components. Tutorial backend (review NVIDIA-NeMo#30): - Default backend -> ray_data (YAML + main.py); trim the "Choosing a Backend" section. - Drop tutorial test tests/tutorials/.../test_main_resources.py (review NVIDIA-NeMo#28). Performance reporting: - Unify metrics into a single per_actor block keyed by actor_id; GPU actors embed physical_address (host_ip:indices, now the canonical per-GPU key), gpu_indices/ gpu_uuids, and per-GPU utilization percentiles (nested gpus map). - Add first-class StagePerfStats.invocation_id for exact dedup; __add__ preserves identity only within the same worker. - Add inference_compute_fraction (inference time / total process time). - New GpuUtilSampler (NVML, bounded deque + lock); harden _collect_gpu_uuids to map physical indices -> visible ordinals under CUDA_VISIBLE_DEVICES. - Remove per-shard <output_dir>/<shard_key>_perf.jsonl output. Reader (nemo_tarred_reader): - Collision-safe _ManifestIndex (exact path, then longest path-suffix; ambiguous basenames resolve to no-match). - Enforce max_duration_s against decoded audio duration; narrow decode except to audio errors; write .done for empty/fully-filtered shards. Writer (sharded_manifest_writer): - Batch manifest writes per shard instead of per-utterance open/close. Misc: - Remove redundant super().setup_on_node() in RayActorPoolStageAdapter.__init__ (executor already runs once-per-node setup). - Fix stale docstring (disfluency_text_key); drop bucketize redundancy; trim adapter duplicate volume math; reduce comment verbosity across PR files. Signed-off-by: Aaftab V <aaftabv@nvidia.com>

ayushdg mentioned this pull request Mar 18, 2024

GitHub Actions: Add style check #1

Closed

ayushdg marked this pull request as ready for review March 19, 2024 21:53

ayushdg requested a review from ryantwolf March 19, 2024 21:56

ayushdg marked this pull request as draft March 19, 2024 22:26

ayushdg marked this pull request as ready for review March 20, 2024 23:12

ryantwolf and others added 14 commits March 21, 2024 15:36

Initial commit

716c715

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

gha: Add style check

ade1ac2

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

Add newline

000e086

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

Add pre commit config file

a1bf64c

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

Update action versions

09aeb21

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

Add newline to end of pre-commit-config

b66d36f

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

Remove flake8, move black,isort,pytest configs to pyproject.toml

6ca0994

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

remove style action in favor of precommit.ci

5a9b1b0

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

Remove style.yapf in favor of black update contributing guide around …

7bb7fc0

…line length Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

87bbd1d

for more information, see https://pre-commit.ci Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

Fix circular imports

e223918

Co-authored-by: Ryan Wolf <rywolf@nvidia.com> Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

Update README

cd37b72

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

[Tutorials] Add a readme file for the TinyStories tutorial (#5)

5624e91

This PR adds a short README.md file for the TinyStories tutorial, with instructions on how to run it. Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com> Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

ryantwolf approved these changes Mar 21, 2024

View reviewed changes

ayushdg closed this Mar 21, 2024

ayushdg force-pushed the ayushdg95/ci/style-check branch from a8980a5 to f607633 Compare March 21, 2024 22:40

ayushdg deleted the ayushdg95/ci/style-check branch March 25, 2024 19:48

copy-pr-bot Bot pushed a commit that referenced this pull request Dec 4, 2025

Merge pull request #3 from lbliii/memory_management_guide

fae8728

minor changes for style guide sweep

copy-pr-bot Bot pushed a commit that referenced this pull request Jan 23, 2026

Merge pull request #3 from praateekmahajan/praateek/benchmark-semdedup

a2733ba

[benchmarks] Update Semdedup benchmark + Add more metrics logging to KMeans + load_dataset_size accepts dataset ratio

rlratzel mentioned this pull request Jun 2, 2026

[benchmarking] Multiple fixes to stabilize the nightly benchmark suite #2035

Merged

21 tasks

oyilmaz-nvidia mentioned this pull request Jun 9, 2026

adding part of pipeline as smoke merge to main #2007

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add style check#3

Add style check#3
ayushdg wants to merge 14 commits into
mainfrom
ayushdg95/ci/style-check

ayushdg commented Mar 18, 2024 •

edited

Loading

Uh oh!

ryantwolf commented Mar 20, 2024

Uh oh!

ayushdg commented Mar 20, 2024 •

edited

Loading

Uh oh!

ryantwolf commented Mar 20, 2024

Uh oh!

ayushdg commented Mar 20, 2024 •

edited

Loading

Uh oh!

ayushdg commented Mar 20, 2024

Uh oh!

ryantwolf commented Mar 20, 2024

Uh oh!

ryantwolf left a comment •

edited

Loading

Uh oh!

ayushdg commented Mar 21, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ayushdg commented Mar 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ryantwolf commented Mar 20, 2024

Uh oh!

ayushdg commented Mar 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ryantwolf commented Mar 20, 2024

Uh oh!

ayushdg commented Mar 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ayushdg commented Mar 20, 2024

Uh oh!

ryantwolf commented Mar 20, 2024

Uh oh!

ryantwolf left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ayushdg commented Mar 21, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ayushdg commented Mar 18, 2024 •

edited

Loading

ayushdg commented Mar 20, 2024 •

edited

Loading

ayushdg commented Mar 20, 2024 •

edited

Loading

ryantwolf left a comment •

edited

Loading