Add style check#3
Conversation
|
Looks like the |
There's no reason for them to match exactly, some of it depends on the practices we want to follow for the repo so it's good to go over the differences:
|
|
Ok cool. Thanks for the clarification.
|
Actually I stand corrected, other repos do use this but specify the config in their pyproject.toml file. I'll probably do the same. |
|
This also has the side effect of pre-commit.ci auto formatting all the files. @ryantwolf Are you okay updating all the file formats in this PR? |
|
It had to happen at some point haha. Let's reformat the code 🫡 |
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
…line length Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
for more information, see https://pre-commit.ci Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Ryan Wolf <rywolf@nvidia.com> Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
This PR adds a short README.md file for the TinyStories tutorial, with instructions on how to run it. Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com> Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Add workflow for cpu pytests Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * Install wheel to fix fasttext import Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * Omit python 3.8, do not fast fail Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * Check if updating setuptools/pip changes cython errors Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * Explicitly install cython Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * Try freeing up space before install Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * Try rapids_no_initialize Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * remove python 3.9 testing for now Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> --------- Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
a8980a5 to
f607633
Compare
|
Had some weirdness during the rebasing with signoff. Will open a new PR |
minor changes for style guide sweep
[benchmarks] Update Semdedup benchmark + Add more metrics logging to KMeans + load_dataset_size accepts dataset ratio
…ion recovery and unified _skip_me tracking Adds optional QwenASR (0.6B) re-transcription stage that recovers samples flagged as hallucinated by QwenOmni. Unifies _skip_me field with source tracking (Hallucination:WhisperHallucination_omni, Recovered:QwenASR, Wrong language:FastTextLID, etc.). Routes the best prediction to downstream stages via SelectBestPredictionStage. - nemo_curator/models/qwen_asr.py — QwenASR model wrapper (qwen_asr lib + vLLM) - nemo_curator/stages/audio/inference/qwen_asr.py — conditional QwenASR stage - nemo_curator/stages/audio/text_filtering/select_best_prediction.py — picker - keep_waveform flag on QwenOmni for downstream audio access - Skip empty text in hallucination detection - Enabled via --asr_model_id (otherwise no-op) Squash cherry-pick of nune-tadevosyan#1. Conflict resolution: - run_pipeline.py: rebuilt to integrate both QwenASR (this PR) and ITN (PR #3) stage chains. Used PR_#1's structure as base, added ITN imports, CLI flags, prompt loading, and conditional ITNRestorationStage append after PnC. - fasttext_lid.py: kept PR_#1's source-tracked "Empty text:{self.name}" (matches the unified _skip_me convention introduced here). #NO_PR Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>
…st, validated improvements on top of the 4 PRs Squash cherry-pick of integration-test's unique commits on top of #1853 + #1 + #3 + #1839: - 633acc7 FastText and Hallucination update → SelectBestPredictionStage: cross-model WER agreement. If both omni and ASR are flagged hallucinated but agree (WER ≤ 100 - min_agreement_pct, default 80%), keep omni and mark recovered — two independent models producing near-identical text is strong evidence the text is correct. → FastTextLIDStage: HuggingFace-format model loader, proper _predict() abstraction, source-tracked _skip_me ("Wrong language:{name}"). - 5fdfa0a additional notes key + skip writing keys after skip_me + pnc prompt + prefill caching → Models (qwen_omni, qwen_asr, qwen_text_llm): notes_key field for diagnostic info, vLLM enable_prefix_caching=True with xxhash. → text_filtering stages: skip writing output keys when skip_me is set. → New file: prompts/pnc_prompt.md. - 15424e3 updated prompt for ITN → Sharper ITN prompt (handles more conversion edge cases). - 0cf8e6c match max model len for ITN and PnC → Aligned ITN/PnC max_model_len (4096), max_num_seqs (16), gpu_memory_utilization (0.95). Wired ITN args through run_pipeline. - 7e32df1 add Qwen3ASR for all → Apply QwenASR recovery to all hallucination flags, not just specific patterns. WhisperHallucinationStage tweaks. - caccd37 Add min word count for FastText → Re-adds min_word_count=2 (FastText is unreliable on single-word inputs). Conflict resolution: - run_pipeline.py: kept multi-line argparse style (ours), kept --source_lang_key, adopted theirs' ITN stage construction (with new max_model_len/num_seqs/gpu_mem args). - fasttext_lid.py: took theirs' richer process logic (min_word_count check, per-sample expected language via source_lang_key, source-tracked _skip_me values). #NO_PR Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>
Merge origin/main into dev to pick up upstream changes (492 files, +57k/-6k): - 26.04 staging release - Generic ASR/TTS audio processing pipeline (#1679) - Dynamo disaggregated serving + validators (#1813, #1820, #1833, #1834, #1861) - ReadSpeech audio curation benchmark + tutorials (#1841, #1851, #1870) - VideoReader path validation, audio waveform leak fixes (#1845, #1765) - Sortformer tutorial fixes + benchmarks (#1764) - Generic audio pipeline + qwen3 support (#1827) - Fern docs (audio + curate-audio sections) Conflict resolution: - nemo_curator/stages/audio/__init__.py: kept dev's lazy __getattr__ registry, added main's new ManifestReader and ManifestWriterStage to both __all__ and _LAZY_IMPORTS (now lazy-loaded from nemo_curator.stages.audio.common). - uv.lock: took main's version (latest dependency resolutions). Removals propagated from main (pre-merge-base files we no longer need): - nemo_curator/stages/audio/alm/alm_manifest_writer.py (replaced by ShardedManifestWriterStage) - nemo_curator/stages/audio/alm/alm_manifest_reader.py - nemo_curator/backends/experimental/* (refactored away) - nemo_curator/core/serve.py (replaced by typed serve config) Verified intact: - SCOTCH pipeline: speaker_id/, hifi_pipeline/slurm_e2e/ (dev-only additions, untouched). - Cherry-picked audio PRs (#1853, #3, #1, #1839, integration-test) all present. Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>
…ng, perf per-actor/per-GPU, reader/writer hardening Adapters + inference co-location (review NVIDIA-NeMo#3/NVIDIA-NeMo#25): - Move nemo_curator/adapters/asr/* -> stages/audio/inference/asr/adapters/* - Move stages/inference/{batch_policy,bucketed_stage}.py -> stages/audio/inference/ - Update all imports, YAML adapter_target / batch_policy._target_, READMEs, tests (tests mirror the new package paths). vLLM teardown (review NVIDIA-NeMo#15): - _cleanup_gpu() no longer calls torch.distributed.destroy_process_group(); vLLM owns its TP group, so tearing down the global PG could corrupt other components. Tutorial backend (review NVIDIA-NeMo#30): - Default backend -> ray_data (YAML + main.py); trim the "Choosing a Backend" section. - Drop tutorial test tests/tutorials/.../test_main_resources.py (review NVIDIA-NeMo#28). Performance reporting: - Unify metrics into a single per_actor block keyed by actor_id; GPU actors embed physical_address (host_ip:indices, now the canonical per-GPU key), gpu_indices/ gpu_uuids, and per-GPU utilization percentiles (nested gpus map). - Add first-class StagePerfStats.invocation_id for exact dedup; __add__ preserves identity only within the same worker. - Add inference_compute_fraction (inference time / total process time). - New GpuUtilSampler (NVML, bounded deque + lock); harden _collect_gpu_uuids to map physical indices -> visible ordinals under CUDA_VISIBLE_DEVICES. - Remove per-shard <output_dir>/<shard_key>_perf.jsonl output. Reader (nemo_tarred_reader): - Collision-safe _ManifestIndex (exact path, then longest path-suffix; ambiguous basenames resolve to no-match). - Enforce max_duration_s against decoded audio duration; narrow decode except to audio errors; write .done for empty/fully-filtered shards. Writer (sharded_manifest_writer): - Batch manifest writes per shard instead of per-utterance open/close. Misc: - Remove redundant super().setup_on_node() in RayActorPoolStageAdapter.__init__ (executor already runs once-per-node setup). - Fix stale docstring (disfluency_text_key); drop bucketize redundancy; trim adapter duplicate volume math; reduce comment verbosity across PR files. Signed-off-by: Aaftab V <aaftabv@nvidia.com>
The aim of the PR is to add pre-commit configs and a GHA for style checks.
The PR does not attempt to fix all the style failures present in the repo. That will be done in a followup PR.
The config ends up using
isort, black & flake8(some of it included in the contributing guide).Happy to also explore ruff which should be a much faster drop in replacement for some of the tools mentioned above and has been adopted by many large python projects.