Skip to content

Add style check#3

Closed
ayushdg wants to merge 14 commits into
mainfrom
ayushdg95/ci/style-check
Closed

Add style check#3
ayushdg wants to merge 14 commits into
mainfrom
ayushdg95/ci/style-check

Conversation

@ayushdg

@ayushdg ayushdg commented Mar 18, 2024

Copy link
Copy Markdown
Contributor

The aim of the PR is to add pre-commit configs and a GHA for style checks.
The PR does not attempt to fix all the style failures present in the repo. That will be done in a followup PR.

The config ends up using isort, black & flake8 (some of it included in the contributing guide).

Happy to also explore ruff which should be a much faster drop in replacement for some of the tools mentioned above and has been adopted by many large python projects.

@ayushdg ayushdg marked this pull request as ready for review March 19, 2024 21:53
@ayushdg ayushdg requested a review from ryantwolf March 19, 2024 21:56
@ayushdg ayushdg marked this pull request as draft March 19, 2024 22:26
@ryantwolf

Copy link
Copy Markdown
Contributor

Looks like the .pre-commit-config.yaml of both NeMo Aligner and NeMo are slightly different than ours. Can we change ours to match theirs exactly, or is there a reason ours should be different?

@ayushdg

ayushdg commented Mar 20, 2024

Copy link
Copy Markdown
Contributor Author

Looks like the .pre-commit-config.yaml of both NeMo Aligner and NeMo are slightly different than ours. Can we change ours to match theirs exactly, or is there a reason ours should be different?

There's no reason for them to match exactly, some of it depends on the practices we want to follow for the repo so it's good to go over the differences:

  1. pre-commit-hooks: It makes sense to include the hooks included in these libraries, but probably makes sense to keep the additional Trailing-whitespace, end-of-file-fixer unless we find specific examples in the repo where it doesn't make sense.
    Full list of hooks here.
  2. Black: Pretty much identical
  3. Isort: I'd argue that adding the black profile should be the case since we're using black as recommended by isort.
  4. Flake8: This is personal preference, a lot of projects prefer some level of linting flake8 is one option that combines rules from PyCodeStyle and PyFlakes. black formatting arguably fixes some of these rules, but happy to omit this, don't have a strong preference.
  5. ci: I'm happy to explore using pre-commit.ci over the pre-commit action directly provided it doesn't require any additional enterprise approvals.

@ryantwolf

Copy link
Copy Markdown
Contributor

Ok cool. Thanks for the clarification.

  1. Yup let's add the ones they use and keep the ones you already have.
  2. Good.
  3. Just looked into this tool. I like it so let's keep it too haha. Do we need to exclude any directories like aligner does?
  4. I'm fine with keeping it so long as we don't run into any issues.
  5. Yeah if you could look into that that could be good. Just to keep things in line with the rest of the repos. If it ends up being a hassle though let me know and we can stick with this.

@ayushdg

ayushdg commented Mar 20, 2024

Copy link
Copy Markdown
Contributor Author

Isort: I'd argue that adding the black profile should be the case since we're using black as recommended by isort.

Actually I stand corrected, other repos do use this but specify the config in their pyproject.toml file. I'll probably do the same.

@ayushdg

ayushdg commented Mar 20, 2024

Copy link
Copy Markdown
Contributor Author

This also has the side effect of pre-commit.ci auto formatting all the files. @ryantwolf Are you okay updating all the file formats in this PR?

@ryantwolf

Copy link
Copy Markdown
Contributor

It had to happen at some point haha. Let's reformat the code 🫡

@ayushdg ayushdg marked this pull request as ready for review March 20, 2024 23:12
ryantwolf and others added 14 commits March 21, 2024 15:36
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
…line length

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
for more information, see https://pre-commit.ci

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
This PR adds a short README.md file for the TinyStories tutorial, with
instructions on how to run it.

Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Add workflow for cpu pytests

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Install wheel to fix fasttext import

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Omit python 3.8, do not fast fail

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Check if updating setuptools/pip changes cython errors

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Explicitly install cython

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Try freeing up space before install

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Try rapids_no_initialize

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* remove python 3.9 testing for now

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

---------

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

@ryantwolf ryantwolf left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good with me. Just fix the DCO error by using the --signoff option on your commits.

@ayushdg ayushdg closed this Mar 21, 2024
@ayushdg ayushdg force-pushed the ayushdg95/ci/style-check branch from a8980a5 to f607633 Compare March 21, 2024 22:40
@ayushdg

ayushdg commented Mar 21, 2024

Copy link
Copy Markdown
Contributor Author

Had some weirdness during the rebasing with signoff. Will open a new PR

@ayushdg ayushdg deleted the ayushdg95/ci/style-check branch March 25, 2024 19:48
copy-pr-bot Bot pushed a commit that referenced this pull request Dec 4, 2025
minor changes for style guide sweep
copy-pr-bot Bot pushed a commit that referenced this pull request Jan 23, 2026
[benchmarks] Update Semdedup benchmark + Add more metrics logging to KMeans + load_dataset_size accepts dataset ratio
Jorjeous added a commit that referenced this pull request Apr 27, 2026
…ion recovery and unified _skip_me tracking

Adds optional QwenASR (0.6B) re-transcription stage that recovers samples
flagged as hallucinated by QwenOmni. Unifies _skip_me field with source
tracking (Hallucination:WhisperHallucination_omni, Recovered:QwenASR,
Wrong language:FastTextLID, etc.). Routes the best prediction to downstream
stages via SelectBestPredictionStage.

- nemo_curator/models/qwen_asr.py — QwenASR model wrapper (qwen_asr lib + vLLM)
- nemo_curator/stages/audio/inference/qwen_asr.py — conditional QwenASR stage
- nemo_curator/stages/audio/text_filtering/select_best_prediction.py — picker
- keep_waveform flag on QwenOmni for downstream audio access
- Skip empty text in hallucination detection
- Enabled via --asr_model_id (otherwise no-op)

Squash cherry-pick of nune-tadevosyan#1.
Conflict resolution:
- run_pipeline.py: rebuilt to integrate both QwenASR (this PR) and ITN
  (PR #3) stage chains. Used PR_#1's structure as base, added ITN imports,
  CLI flags, prompt loading, and conditional ITNRestorationStage append
  after PnC.
- fasttext_lid.py: kept PR_#1's source-tracked "Empty text:{self.name}"
  (matches the unified _skip_me convention introduced here). #NO_PR

Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>
Jorjeous added a commit that referenced this pull request Apr 27, 2026
…st, validated improvements on top of the 4 PRs

Squash cherry-pick of integration-test's unique commits on top of #1853 + #1 + #3 + #1839:

- 633acc7 FastText and Hallucination update
  → SelectBestPredictionStage: cross-model WER agreement. If both omni and
    ASR are flagged hallucinated but agree (WER ≤ 100 - min_agreement_pct,
    default 80%), keep omni and mark recovered — two independent models
    producing near-identical text is strong evidence the text is correct.
  → FastTextLIDStage: HuggingFace-format model loader, proper _predict()
    abstraction, source-tracked _skip_me ("Wrong language:{name}").

- 5fdfa0a additional notes key + skip writing keys after skip_me + pnc prompt + prefill caching
  → Models (qwen_omni, qwen_asr, qwen_text_llm): notes_key field for
    diagnostic info, vLLM enable_prefix_caching=True with xxhash.
  → text_filtering stages: skip writing output keys when skip_me is set.
  → New file: prompts/pnc_prompt.md.

- 15424e3 updated prompt for ITN
  → Sharper ITN prompt (handles more conversion edge cases).

- 0cf8e6c match max model len for ITN and PnC
  → Aligned ITN/PnC max_model_len (4096), max_num_seqs (16),
    gpu_memory_utilization (0.95). Wired ITN args through run_pipeline.

- 7e32df1 add Qwen3ASR for all
  → Apply QwenASR recovery to all hallucination flags, not just specific
    patterns. WhisperHallucinationStage tweaks.

- caccd37 Add min word count for FastText
  → Re-adds min_word_count=2 (FastText is unreliable on single-word inputs).

Conflict resolution:
- run_pipeline.py: kept multi-line argparse style (ours), kept --source_lang_key,
  adopted theirs' ITN stage construction (with new max_model_len/num_seqs/gpu_mem args).
- fasttext_lid.py: took theirs' richer process logic (min_word_count check,
  per-sample expected language via source_lang_key, source-tracked _skip_me values). #NO_PR

Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>
Jorjeous added a commit that referenced this pull request Apr 27, 2026
Merge origin/main into dev to pick up upstream changes (492 files, +57k/-6k):
- 26.04 staging release
- Generic ASR/TTS audio processing pipeline (#1679)
- Dynamo disaggregated serving + validators (#1813, #1820, #1833, #1834, #1861)
- ReadSpeech audio curation benchmark + tutorials (#1841, #1851, #1870)
- VideoReader path validation, audio waveform leak fixes (#1845, #1765)
- Sortformer tutorial fixes + benchmarks (#1764)
- Generic audio pipeline + qwen3 support (#1827)
- Fern docs (audio + curate-audio sections)

Conflict resolution:
- nemo_curator/stages/audio/__init__.py: kept dev's lazy __getattr__ registry,
  added main's new ManifestReader and ManifestWriterStage to both __all__ and
  _LAZY_IMPORTS (now lazy-loaded from nemo_curator.stages.audio.common).
- uv.lock: took main's version (latest dependency resolutions).

Removals propagated from main (pre-merge-base files we no longer need):
- nemo_curator/stages/audio/alm/alm_manifest_writer.py (replaced by ShardedManifestWriterStage)
- nemo_curator/stages/audio/alm/alm_manifest_reader.py
- nemo_curator/backends/experimental/* (refactored away)
- nemo_curator/core/serve.py (replaced by typed serve config)

Verified intact:
- SCOTCH pipeline: speaker_id/, hifi_pipeline/slurm_e2e/ (dev-only additions, untouched).
- Cherry-picked audio PRs (#1853, #3, #1, #1839, integration-test) all present.

Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>
mohammadaaftabv added a commit to mohammadaaftabv/Curator that referenced this pull request Jun 11, 2026
…ng, perf per-actor/per-GPU, reader/writer hardening

Adapters + inference co-location (review NVIDIA-NeMo#3/NVIDIA-NeMo#25):
- Move nemo_curator/adapters/asr/* -> stages/audio/inference/asr/adapters/*
- Move stages/inference/{batch_policy,bucketed_stage}.py -> stages/audio/inference/
- Update all imports, YAML adapter_target / batch_policy._target_, READMEs, tests
  (tests mirror the new package paths).

vLLM teardown (review NVIDIA-NeMo#15):
- _cleanup_gpu() no longer calls torch.distributed.destroy_process_group(); vLLM
  owns its TP group, so tearing down the global PG could corrupt other components.

Tutorial backend (review NVIDIA-NeMo#30):
- Default backend -> ray_data (YAML + main.py); trim the "Choosing a Backend" section.
- Drop tutorial test tests/tutorials/.../test_main_resources.py (review NVIDIA-NeMo#28).

Performance reporting:
- Unify metrics into a single per_actor block keyed by actor_id; GPU actors embed
  physical_address (host_ip:indices, now the canonical per-GPU key), gpu_indices/
  gpu_uuids, and per-GPU utilization percentiles (nested gpus map).
- Add first-class StagePerfStats.invocation_id for exact dedup; __add__ preserves
  identity only within the same worker.
- Add inference_compute_fraction (inference time / total process time).
- New GpuUtilSampler (NVML, bounded deque + lock); harden _collect_gpu_uuids to map
  physical indices -> visible ordinals under CUDA_VISIBLE_DEVICES.
- Remove per-shard <output_dir>/<shard_key>_perf.jsonl output.

Reader (nemo_tarred_reader):
- Collision-safe _ManifestIndex (exact path, then longest path-suffix; ambiguous
  basenames resolve to no-match).
- Enforce max_duration_s against decoded audio duration; narrow decode except to
  audio errors; write .done for empty/fully-filtered shards.

Writer (sharded_manifest_writer):
- Batch manifest writes per shard instead of per-utterance open/close.

Misc:
- Remove redundant super().setup_on_node() in RayActorPoolStageAdapter.__init__
  (executor already runs once-per-node setup).
- Fix stale docstring (disfluency_text_key); drop bucketize redundancy; trim
  adapter duplicate volume math; reduce comment verbosity across PR files.

Signed-off-by: Aaftab V <aaftabv@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants