GPU sparse operations package (SpMV, SpMM, SpGEMM, SDDMM, gather, scatter, sparse formats).
pip install . --no-deps --no-build-isolation
Use --no-build-isolation to avoid downloading build deps when offline.
Runtime dependencies (install when needed):
pip install torch triton cupy-cuda12x
- src/flagsparse/ - core package (sparse_operations/ is emitted as several .py modules from string literals in flagsparse.py)
- tests/ - pytest tests
- benchmark/ - performance benchmarks
Run the scripts from the project root, or cd into tests and run them from there (in that case use relative paths such as ../matrix for the .mtx directory).
The commands below are the repository's documented invocation standard. CPU-only install, build, help-text, and smoke paths are checked in CI; GPU-specific examples are documented but not executed there unless you opt into the triton smoke job locally.
pytest accuracy suite - small synthetic CUDA cases, selectable by operator marker:
pytest tests/pytest --mode quick
pytest tests/pytest --mode normal -m "spmv_csr or spmm_csr"
python run_flagsparse_pytest.py --mode quick --ops gather,spmv_csr,spmm_csr --gpus 0
python run_flagsparse_pytest.py --op-list ops.txt --gpus 0,1 --results-dir pytest_results
The runner writes per-operator accuracy.log files plus summary.json, summary.csv, and summary.xlsx when openpyxl is installed.
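The optional summary.xlsx output follows a common optional-dependency pattern. The sketch below is only an illustration, assuming the runner checks whether openpyxl is importable. summary_outputs is a hypothetical helper that is not part of run_flagsparse_pytest.py, and placing the summary files directly under the results directory is also an assumption.

```python
import importlib.util


def summary_outputs(results_dir="pytest_results"):
    """List the summary files a run would be expected to leave behind.

    Hypothetical helper for illustration; not the runner's real code.
    """
    names = ["summary.json", "summary.csv"]
    # Assumption: summary.xlsx is only written when openpyxl can be imported.
    if importlib.util.find_spec("openpyxl") is not None:
        names.append("summary.xlsx")
    return [f"{results_dir}/{name}" for name in names]


print(summary_outputs())
```

On a machine without openpyxl the list simply omits the .xlsx entry, which matches the behaviour described above.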
test_spmv.py - CSR SpMV (SuiteSparse .mtx, synthetic, or CSR CSV export):
python tests/test_spmv.py <dir_or_file.mtx> # batch run, default float32
python tests/test_spmv.py <dir/> --dtype float64 # optional: --index-dtype int32|int64, --warmup, --iters, --no-cusparse
python tests/test_spmv.py --synthetic # synthetic benchmark
python tests/test_spmv.py <dir/> --csv-csr results.csv # all value×index dtypes -> one CSV (per-matrix lines while running)
test_spmv_coo.py - COO SpMV (requires --synthetic or --csv-coo; no standalone .mtx batch):
python tests/test_spmv_coo.py --synthetic
python tests/test_spmv_coo.py <dir/> --csv-coo out.csv
test_spmv_opt.py - SpMV baseline vs optimised A/B (float32 / float64 only):
python tests/test_spmv_opt.py <dir_or_file.mtx> [...]
python tests/test_spmv_opt.py <dir/> --csv out.csv
test_spmm.py - CSR SpMM (.mtx batch, synthetic, or --csv):
python tests/test_spmm.py <dir_or_file.mtx>
python tests/test_spmm.py --synthetic # optional: --skip-api-checks, --skip-alg1-coverage
python tests/test_spmm.py <dir/> --csv results.csv # float32/float64 + int32 in CSV; per-matrix console output
# common options: --dtype, --index-dtype, --dense-cols, --block-n, --block-nnz, --max-segments, --warmup, --iters, --no-cusparse
test_spmm_opt.py - CSR SpMM baseline vs optimised A/B:
python tests/test_spmm_opt.py <dir_or_file.mtx> --dense-cols 32
python tests/test_spmm_opt.py <dir/> --csv spmm_opt.csv # optional: --dtype float32|float64, --dense-cols
# common options: --dtype, --dense-cols, --warmup, --iters
test_spmm_coo.py - native COO SpMM:
python tests/test_spmm_coo.py <dir_or_file.mtx>
python tests/test_spmm_coo.py --synthetic # optional: --route rowrun|atomic|compare, --skip-api-checks, --skip-coo-coverage
python tests/test_spmm_coo.py <dir/> --csv out.csv # only --route rowrun or atomic (not compare)
# same tuning flags as CSR SpMM where applicable: --dense-cols, --block-n, --block-nnz, --warmup, --iters, --no-cusparse
test_sddmm.py - CSR SDDMM (.mtx batch or --csv):
python tests/test_sddmm.py <dir_or_file.mtx> --k 64
python tests/test_sddmm.py <dir/> --csv out.csv # optional: --dtype float32|float64, --acc_mode f32|f64, --k 64
# common options: --dtype, --index-dtype, --acc_mode, --k, --alpha, --beta, --warmup, --iters, --no-cupy-ref, --skip-api-checks
test_spgemm.py - CSR SpGEMM (.mtx batch or --csv):
python tests/test_spgemm.py <dir_or_file.mtx> --input-mode auto
python tests/test_spgemm.py <dir/> --csv results.csv # optional: --dtype float32|float64, --input-mode auto|a_equals_b|a_at, --compare-device cpu|gpu
# common options: --dtype, --index-dtype, --warmup, --iters, --input-mode, --adaptive-loops, --no-cusparse, --ref-blocked-retry, --ref-isolated-retry, --ref-block-rows, --compare-device, --run-api-checks
test_spsv.py - SpSV (triangular solve; square matrices only). CSR and COO share this script; there is no test_spsv_coo.py.
python tests/test_spsv.py --synthetic
python tests/test_spsv.py <dir/> --csv-csr spsv.csv
python tests/test_spsv.py <dir/> --csv-coo out.csv # same CSV columns as CSR
test_spsm.py - SpSM (triangular matrix-matrix solve; square matrices only):
python tests/test_spsm.py --synthetic --n 512 --rhs 32
python tests/test_spsm.py <dir/> --csv-csr spsm_csr.csv --rhs 32
python tests/test_spsm.py <dir/> --csv-coo spsm_coo.csv --rhs 32
test_gather.py / test_scatter.py - gather/scatter benchmarks (pytest or python tests/test_gather.py).
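For readers new to the gather and scatter operators exercised by these scripts, the sketch below shows the conventional semantics in plain Python. It is a CPU illustration only: the function names, the fill value, and the last-write-wins handling of duplicate indices are assumptions, not the FlagSparse API or its documented behaviour.

```python
def gather(src, indices):
    """out[i] = src[indices[i]]"""
    return [src[i] for i in indices]


def scatter(values, indices, size, fill=0.0):
    """out[indices[i]] = values[i]; slots that are never written keep `fill`."""
    out = [fill] * size
    for value, i in zip(values, indices):
        out[i] = value  # assumption: with duplicate indices the last write wins
    return out


dense = [10.0, 20.0, 30.0, 40.0, 50.0]
idx = [4, 0, 2]

picked = gather(dense, idx)         # [50.0, 10.0, 30.0]
restored = scatter(picked, idx, 5)  # [10.0, 0.0, 30.0, 0.0, 50.0]
print(picked, restored)
```

Scattering the gathered values back through the same indices restores exactly the selected positions, which is a convenient round-trip check when the indices are unique.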
Accuracy suites should use tests/pytest/accuracy_utils.py for FlagGems-style
golden reference and tolerance policy. Numeric compute operators compare against
CPU-FP64 golden references cast back to the dtype under test, while exact/logical
outputs compare against CPU int32 references.
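The golden-reference policy can be illustrated with a small CSR SpMV computed on the CPU. This is a standard-library sketch of the idea only: it does not use tests/pytest/accuracy_utils.py, the helper names (to_float32, csr_spmv, allclose) are hypothetical, and the tolerances are placeholder values rather than the project's actual policy.

```python
import struct


def to_float32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def csr_spmv(indptr, indices, data, x, cast=float):
    """y = A @ x for a CSR matrix; `cast` is applied after every multiply and add."""
    y = []
    for row in range(len(indptr) - 1):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc = cast(acc + cast(data[k] * x[indices[k]]))
        y.append(acc)
    return y


def allclose(actual, expected, rtol, atol):
    """Elementwise |actual - expected| <= atol + rtol * |expected|."""
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


# 3x3 CSR matrix with nonzeros at (0,0), (0,2), (1,1), (2,0), (2,2).
indptr = [0, 2, 3, 5]
indices = [0, 2, 1, 0, 2]
data = [to_float32(v) for v in (1.1, 2.2, 3.3, 4.4, 5.5)]
x = [to_float32(v) for v in (0.1, 0.2, 0.3)]

# Golden reference: accumulate in FP64, then cast back to the dtype under test.
golden = [to_float32(v) for v in csr_spmv(indptr, indices, data, x)]

# Stand-in for a kernel result: the same product computed entirely in FP32.
under_test = csr_spmv(indptr, indices, data, x, cast=to_float32)

# Hypothetical float32 tolerances, chosen only for this example.
ok = allclose(under_test, golden, rtol=1e-5, atol=1e-6)
print(ok)
```

Casting the FP64 result back before comparing means the tolerance only has to absorb accumulation-order and rounding differences in the dtype under test, not the representation gap between FP32 and FP64.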
- .github/workflows/ci.yml is CPU-only and runs compile, format checks, lint, source-critical static checks, build, install, and smoke tests on GitHub-hosted runners.
- The smoke set now covers installed-wheel validation, packaging metadata, public API surface, operator registry consistency, shared runtime policy helpers, CLI --help, and README command snippets.
- conf/operators.yaml is the FlagGems-style operator interface registry for public FlagSparse sparse operators and sparse-format helpers.
- .github/workflows/nightly-cpu.yml is a main-branch-only nightly CPU check that repeats the package, lint, and shared-runtime smoke tests.
- .github/workflows/release.yml builds source and wheel artifacts, then attaches them to GitHub Releases on v* tags.
- .github/workflows/triton-smoke.yml is a manual opt-in job for triton-dependent smoke checks.
- .github/workflows/gpu-ci.yml is a manual GPU accuracy smoke workflow for a self-hosted runner labeled self-hosted, linux, and gpu.
- .github/workflows/gpu-benchmark.yml adds an Actions button for synthetic GPU benchmark runs on a self-hosted runner labeled self-hosted, linux, and gpu.
- .github/workflows/release-drafter.yml keeps draft release notes current from merged PRs.
- make help lists the local entry points.
- make ci / make check run the same CPU-only pipeline used by CI.
- make format-check, make lint, and make lint-src are the non-GPU quality gates for CI formatting, CI helper lint, and critical package-source static checks.
- make smoke is the CPU smoke stage alias.
- make release-check / make release build, validate, and checksum release artifacts.
- make triton-smoke and make triton-deps are opt-in local targets for the triton-dependent runtime checks.
- make gpu-env-check validates CUDA visibility through tools/ci/check_gpu_environment.py on a GPU runner.
- make gpu-benchmark runs the quick synthetic benchmark suite on a CUDA machine.
- python tools/ci/run_gpu_benchmark.py --suite quick mirrors the manual GPU benchmark workflow locally on a CUDA machine.
- python tools/ci/run_gpu_benchmark.py --suite full --matrix-dir tests/data runs the full benchmark matrix, including .mtx-backed SpGEMM and SDDMM suites against the repository test matrices.
- tools/ci/requirements-ci.lock.txt and tools/ci/requirements-triton-smoke.lock.txt are the pinned local dependency bundles behind those make targets.
- .github/dependabot.yml keeps GitHub Actions and Python dependency updates visible.
- .github/ISSUE_TEMPLATE/ keeps issue entry points structured for bugs and feature requests.
- The CI dependency bundle now stays on packaging and test tooling only; triton-dependent smoke is opt-in through FLAGSPARSE_TRITON_SMOKE=1.
- Release artifacts now ship with a generated SHA256SUMS manifest and a matching checksum verification step in CI.
- PR quality gates are implemented through the default CPU CI workflow; configure branch protection in GitHub to require the CI / Build and smoke test check before merge.
- GPU accuracy and benchmark scripts still require CUDA hardware; the GPU workflows are manual and only run on a self-hosted GPU runner.
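The SHA256SUMS manifest mentioned above can also be checked by hand after downloading release artifacts. The sketch below assumes the manifest uses the usual sha256sum layout (hex digest, whitespace, file name) and that the artifacts sit next to it; verify_sha256sums and the example file name are hypothetical and are not the project's CI step.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_sha256sums(manifest_path):
    """Return the names of files whose SHA-256 digest does not match the manifest."""
    manifest = Path(manifest_path)
    mismatched = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with a leading '*'
        digest = hashlib.sha256((manifest.parent / name).read_bytes()).hexdigest()
        if digest != expected.lower():
            mismatched.append(name)
    return mismatched


# Self-contained demo with a throwaway artifact (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    payload = b"wheel bytes"
    (Path(tmp) / "example.whl").write_bytes(payload)
    digest = hashlib.sha256(payload).hexdigest()
    (Path(tmp) / "SHA256SUMS").write_text(f"{digest}  example.whl\n")
    print(verify_sha256sums(Path(tmp) / "SHA256SUMS"))  # prints []
```

An empty list means every listed file matched its recorded digest; any returned name should be treated as a corrupted or tampered download.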
- benchmark/performance_utils.py defines the pytest-style performance base class, default metrics (latency_base, latency, speedup), median timing, warmup/iteration controls, CUDA synchronization, CSV record helpers, and the two-level average speedup rule.
- benchmark/attri_util.py and benchmark/core_shapes.yaml keep default and special shape grids centralized.
- benchmark/summary_for_plot.py reads recorded benchmark CSV files and reports the two-level speedup summary.
- benchmark/test_sparse_perf.py is an opt-in pytest entry point; real GPU runs remain manual or self-hosted because GitHub-hosted runners do not provide CUDA GPUs.
- tests/data/*.mtx can be used as the default MatrixMarket smoke dataset for mtx-backed GPU benchmark suites.
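The median-timing approach with warmup and iteration controls can be sketched on the CPU as follows. This is not the code in benchmark/performance_utils.py: median_latency_ms and speedup are hypothetical helpers, and defining speedup as latency_base / latency is an assumption based on the metric names. The two-level average speedup rule is not reproduced here.

```python
import statistics
import time


def median_latency_ms(fn, warmup=5, iters=20, sync=None):
    """Median wall-clock latency of fn() in milliseconds.

    `sync` is an optional callable run after each call; on a GPU one would
    pass a device synchronization such as torch.cuda.synchronize.
    """
    for _ in range(warmup):  # warmup runs are executed but not recorded
        fn()
    if sync is not None:
        sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def speedup(latency_base, latency):
    """Assumed definition: values above 1.0 mean faster than the baseline."""
    return latency_base / latency


def baseline():
    total = 0
    for value in range(100_000):
        total += value
    return total


def candidate():
    return sum(range(100_000))


latency_base = median_latency_ms(baseline)
latency = median_latency_ms(candidate)
record = {
    "latency_base": latency_base,
    "latency": latency,
    "speedup": speedup(latency_base, latency),
}
print(record)
```

The median is less sensitive than the mean to occasional slow iterations, which is why it is a common choice for short kernel timings.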
This project is licensed under the Apache License, Version 2.0.