scICEpy

Single-cell Inconsistency-based Clustering Evaluation for Python

scICEpy evaluates clustering stability on precomputed single-cell graphs stored in AnnData objects. It is designed for Scanpy-style workflows and writes its results back to adata.uns["scICE"].

The Seurat/R counterpart is scICER.

Overview

scICEpy currently supports two analysis modes:

  • Cluster-range mode: run one shared resolution search, derive per-target gamma intervals, then optimize each requested target cluster.
  • Manual resolution mode: skip shared search and evaluate the supplied gamma values directly. Repeated gamma values are deduplicated before evaluation.
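
The mode choice can be pictured as a simple dispatch on whether manual gamma values were supplied (as noted under the API summary, setting resolution skips the shared search). The function below is a hypothetical sketch, not scICEpy's internal code:

```python
# Illustrative sketch of the two-mode dispatch. The function name is
# hypothetical; scICEpy resolves the mode internally.

def pick_mode(cluster_range=None, resolution=None):
    # Manual gamma values take precedence and skip the shared search.
    return "resolution" if resolution is not None else "cluster_range"

print(pick_mode(cluster_range=list(range(2, 21))))  # -> cluster_range
print(pick_mode(resolution=[0.05, 0.10, 0.20]))     # -> resolution
```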

Important current result semantics:

  • Returned public results are keyed by the final merged cluster count in best_labels.
  • In cluster-range mode, source_target_cluster records which requested target produced each returned final cluster result.
  • min_cluster_size affects effective-cluster counting during search and optimization, as well as the final merge applied to best_labels.
  • The public beta parameter is retained for API compatibility and metadata, but the current Python backend reports beta_supported = False and beta_applied = False.
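
The rekeying described above can be sketched with plain Python. This is an illustrative model only (the function, its tie-breaking rule, and the data shapes are hypothetical, not scICEpy's internals):

```python
# Illustrative sketch: rekey per-target results by their final merged
# cluster count, remembering which requested target produced each entry.
# All names and the tie-breaking rule here are hypothetical.

def rekey_by_final_count(per_target_results):
    """per_target_results: {requested_target: final_cluster_count}."""
    public = {}  # final_count -> source_target_cluster
    for requested_target, final_count in per_target_results.items():
        # Keep one entry per final count; later duplicates are dropped
        # from the public result.
        public.setdefault(final_count, requested_target)
    return public

# Targets 4 and 5 both merge down to 4 final clusters, so only one
# entry keyed by final count 4 survives in the public result.
print(rekey_by_final_count({3: 3, 4: 4, 5: 4}))  # -> {3: 3, 4: 4}
```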

Manual resolution mode semantics:

  • Duplicated input gamma values are removed before evaluation.
  • Each remaining gamma is evaluated directly with repeated Leiden trials and bootstrap IC.
  • The public main result is deduplicated by final cluster number, so resolution=[...] is not guaranteed to replay the public output of a prior cluster_range run.
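
The order-preserving deduplication of input gammas can be sketched as follows; the helper name is hypothetical and the real implementation may differ:

```python
# Illustrative order-preserving deduplication of manual gamma values,
# mirroring the behavior described above. The helper is hypothetical.

def dedup_gammas(gammas):
    seen = set()
    unique = []
    for gamma in gammas:
        if gamma not in seen:
            seen.add(gamma)
            unique.append(gamma)
    return unique

print(dedup_gammas([0.05, 0.05, 0.10, 0.20]))  # -> [0.05, 0.1, 0.2]
```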

Installation

Install from source

cd scICEpy
pip install -e .

Development install

cd scICEpy
pip install -e ".[dev]"

Core dependencies

  • Python >= 3.8
  • numpy
  • pandas
  • scipy
  • scanpy
  • anndata
  • matplotlib
  • python-igraph
  • leidenalg

Verification

Quick import check:

python - <<'PY'
import scICEpy
print(scICEpy.__version__)
print(scICEpy.scICE_clustering)
PY

Automated tests:

python -m pytest -q -m "not slow"
python -m pytest -q -m slow tests/test_smoke.py

The fast default suite lives under tests/. The slow smoke test in tests/test_smoke.py exercises end-to-end Scanpy preprocessing, clustering, plotting, and manual resolution mode.

Development note:

  • The implementation logic lives in clustering_dispatch.py, clustering_inputs.py, clustering_modes.py, clustering_reporting.py, gamma_candidates.py, gamma_execution.py, target_optimizer.py, cluster_utils.py, and scICEpy.py.
  • The split helper modules now use explicit one-way imports rather than circular import * chains.

Documentation

  • README.md: quick installation, common usage patterns, and practical examples.
  • design.md: implementation-oriented design document for scICE_clustering(). It explains the full workflow, parameter semantics, output fields, and important behavior differences between cluster_range mode and manual resolution mode.
  • Use design.md when you need to understand how results are produced internally, interpret fields such as resolution_diagnostics, or understand why resolution=[...] is not a one-to-one replay of a previous cluster_range run.

Quick Start

import matplotlib.pyplot as plt
import scanpy as sc
import scICEpy

adata = sc.read_h5ad("your_data.h5ad")

# If the graph is not already present, compute neighbors and UMAP first.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

scICEpy.scICE_clustering(
    adata,
    graph_key="connectivities",
    cluster_range=list(range(2, 21)),
    n_trials=15,
    n_bootstrap=100,
    seed=42,
    verbose=True,
)

results = adata.uns["scICE"]
print(results["analysis_mode"])
print(results["n_cluster"])
print(results["ic"])
print(results["best_cluster"], results["best_resolution"])

fig, ax = scICEpy.plot_ic(adata, threshold=1.005, show_gamma=True)
fig.savefig("scice_ic_plot.png", dpi=150, bbox_inches="tight")
plt.close(fig)

adata = scICEpy.get_robust_labels(adata, threshold=1.005, return_adata=True)
scice_columns = [column for column in adata.obs.columns if column.startswith("scICE_k_")]
if scice_columns:
    sc.pl.umap(
        adata,
        color=scice_columns[: min(3, len(scice_columns))],
        wspace=0.4,
        show=False,
    )
    plt.savefig("scice_umap.png", dpi=150, bbox_inches="tight")
    plt.close()

Manual resolution mode:

scICEpy.scICE_clustering(
    adata,
    resolution=[0.05, 0.05, 0.10, 0.20],
    n_trials=15,
    n_bootstrap=100,
    seed=42,
    verbose=True,
)

results = adata.uns["scICE"]
print(results["analysis_mode"])      # "resolution"
print(results["resolution_input"])   # deduplicated gamma values

Accessing Results

After scICE_clustering() finishes, scICEpy stores a nested result dictionary in adata.uns["scICE"].

results = adata.uns["scICE"]

print(results["analysis_mode"])
print(results["n_cluster"])              # returned final merged cluster counts
print(results["source_target_cluster"])  # originating requested targets
print(results["gamma"])
print(results["ic"])
print(results["consistent_clusters"])
print(results["best_cluster"], results["best_resolution"])
print(results["coverage_complete"], results["search_coverage_complete"])
print(results["parallel_layout"])

Key fields to know:

  • analysis_mode: "cluster_range" or "resolution".
  • n_cluster: returned final merged cluster counts in the public result.
  • source_target_cluster: requested target that produced each returned result in cluster-range mode.
  • gamma, ic, ic_vec, best_labels: main per-result outputs.
  • consistent_clusters, best_cluster, best_resolution: summary fields derived from ic_threshold.
  • coverage_complete: whether every requested target survives the final public result after optimization and final-count rekeying.
  • search_coverage_complete: whether shared search found optimization-ready intervals for all requested targets before optimization.
  • target_diagnostics, resolution_search_diagnostics, optimization_diagnostics, resolution_diagnostics: pandas DataFrames with search, optimization, and manual-resolution diagnostics.
  • parallel_layout: resolved outer/inner worker layout used for the run. Large graphs now bias Phase 1 toward shared process-pool execution and often resolve to inner_workers = 1.
  • min_cluster_size: value used for effective counting and final label merge.
  • beta_supported, beta_applied, beta_support_reason: backend capability metadata for the beta parameter.
  • cluster_range_tested: summary array retained in the public result; it currently mirrors the returned n_cluster values.
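
A quick way to picture coverage_complete: every requested target should still appear among the originating targets of the returned results. The variable names below are hypothetical stand-ins for the real fields:

```python
# Illustrative model of the coverage_complete check: all requested
# targets must survive into source_target_cluster. A sketch only.

requested_targets = [2, 3, 4, 5]
source_target_cluster = [2, 3, 4]   # e.g. target 5 collapsed away

coverage_complete = set(requested_targets) <= set(source_target_cluster)
print(coverage_complete)  # -> False
```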

Labels and Plotting

Extract returned labels as a DataFrame:

labels_df = scICEpy.get_robust_labels(adata, threshold=1.005)
print(labels_df.head())

Or add them directly to adata.obs:

adata = scICEpy.get_robust_labels(adata, threshold=1.005, return_adata=True)

Plot and save the IC distributions:

import matplotlib.pyplot as plt

fig, ax = scICEpy.plot_ic(adata, threshold=1.005, show_gamma=True)
fig.savefig("scice_ic_plot.png", dpi=150, bbox_inches="tight")
plt.close(fig)

If UMAP coordinates are available, you can also add robust labels and save a UMAP figure:

import matplotlib.pyplot as plt
import scanpy as sc

adata = scICEpy.get_robust_labels(adata, threshold=1.005, return_adata=True)
scice_columns = [column for column in adata.obs.columns if column.startswith("scICE_k_")]

if "X_umap" in adata.obsm and scice_columns:
    sc.pl.umap(
        adata,
        color=scice_columns[: min(3, len(scice_columns))],
        wspace=0.4,
        show=False,
    )
    plt.savefig("scice_umap.png", dpi=150, bbox_inches="tight")
    plt.close()

plot_ic() and get_robust_labels() can also consume a raw result dictionary instead of an AnnData object, as long as cell names are available when labels need to be returned.
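
The threshold-based selection behind these helpers can be pictured with a small sketch, assuming IC values of at least 1 where lower means more consistent (the convention the default 1.005 threshold suggests). This is an illustrative model, not the library's code:

```python
# Illustrative filter: keep cluster counts whose IC is within the
# threshold. Assumes lower IC = more consistent; a sketch, not
# scICEpy's actual selection logic.

n_cluster = [2, 3, 4, 5]
ic = [1.001, 1.004, 1.020, 1.150]
threshold = 1.005

consistent = [k for k, score in zip(n_cluster, ic) if score <= threshold]
print(consistent)  # -> [2, 3]
```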

API Summary

Main entry point:

scICEpy.scICE_clustering(
    adata,
    graph_key="connectivities",
    cluster_range=None,
    n_workers=10,
    outer_workers=None,
    inner_workers=None,
    n_trials=15,
    n_bootstrap=100,
    seed=None,
    beta=0.1,
    n_iterations=10,
    max_iterations=150,
    ic_threshold=float("inf"),
    objective_function="CPM",
    remove_threshold=1.15,
    min_cluster_size=2,
    resolution_tolerance=1e-8,
    verbose=True,
    resolution=None,
    copy=False,
    scratch_dir=None,
)

Behavior notes for the most important options:

  • graph_key: which graph in adata.obsp to cluster.
  • cluster_range: requested targets for cluster-range mode.
  • resolution: manual gamma values. When set, shared search is skipped.
  • n_workers: top-level worker budget. scICEpy resolves actual outer and inner worker layout from this budget.
  • outer_workers, inner_workers: optional explicit caps for outer multiprocessing and inner thread work.
  • remove_threshold: cluster-range pre-filter threshold. It is ignored in manual resolution mode.
  • min_cluster_size: when greater than 1, scICEpy counts effective clusters during search and optimization and merges small clusters in final labels.
  • resolution_tolerance: search tolerance used in cluster-range mode.
  • copy: return a modified AnnData copy instead of writing in place.
  • scratch_dir: optional runtime temp root for spill files and temporary working directories.
  • beta: kept for compatibility and metadata, but not applied by the current Python backend; the current version of leidenalg (0.10.2) does not support a beta input.
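
The small-cluster merge that min_cluster_size triggers can be sketched as folding undersized clusters into the largest cluster. This is a hypothetical strategy for illustration; scICEpy's real merge may choose targets differently:

```python
# Hypothetical sketch of a small-cluster merge: clusters below the size
# floor are folded into the largest cluster. scICEpy's actual merge
# strategy may differ.
from collections import Counter

def merge_small_clusters(labels, min_cluster_size):
    sizes = Counter(labels)
    largest = max(sizes, key=sizes.get)
    return [
        largest if sizes[label] < min_cluster_size else label
        for label in labels
    ]

labels = [0, 0, 0, 1, 1, 2]             # cluster 2 has one member
print(merge_small_clusters(labels, 2))  # -> [0, 0, 0, 1, 1, 0]
```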

Public helpers:

  • scICEpy.plot_ic(...)
  • scICEpy.get_robust_labels(...)

Large H5AD Inputs

If your .h5ad file is dominated by expression matrices or layers that scICEpy does not need for clustering, use the large-file wrapper to create a lighter copy, run scICEpy on that copy, and write the results back into the original .h5ad.

Helper scripts:

  • scripts/run_large_h5ad_scice.py
  • scripts/make_light_h5ad.py

The light-file step preserves:

  • obs
  • obsm
  • obsp
  • uns
  • only the first n_vars feature columns from X and aligned per-variable metadata

Primary workflow:

python scripts/run_large_h5ad_scice.py \
  --input your_data.h5ad \
  --light-output your_data.light.h5ad \
  --n-vars 1 \
  --cluster-range 2 3 4 5 6 7 8 9 10 \
  --n-trials 15 \
  --n-bootstrap 100 \
  --seed 42

This workflow:

  • creates your_data.light.h5ad
  • runs scICEpy.scICE_clustering(...) on the light file
  • writes only uns["scICE"] back to your_data.h5ad
  • keeps the light file on disk for reuse or inspection

For H5AD persistence, variable-length result sequences such as bootstrap IC vectors and stored label collections are written through an internal H5AD-safe encoding. plot_ic() and get_robust_labels() read that encoding transparently after reload.
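
HDF5-backed formats cannot store ragged arrays directly, so a common H5AD-safe pattern is to flatten variable-length rows into one values array plus an offsets array (CSR-style). The sketch below shows that general pattern; scICEpy's actual encoding is internal and may differ:

```python
# General CSR-style encoding for ragged sequences: a flat values list
# plus offsets marking where each row starts and ends. Illustrates the
# kind of H5AD-safe layout usable for bootstrap IC vectors; the
# package's real encoding is internal and may differ.

def encode_ragged(rows):
    values, offsets = [], [0]
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return values, offsets

def decode_ragged(values, offsets):
    return [values[start:end] for start, end in zip(offsets, offsets[1:])]

rows = [[1.0, 1.2], [1.1], [1.0, 1.3, 1.4]]
values, offsets = encode_ragged(rows)
assert decode_ragged(values, offsets) == rows
print(offsets)  # -> [0, 2, 3, 6]
```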

Use --resolution 0.05 0.10 0.20 instead of --cluster-range ... when you want manual resolution mode.

If you only want to generate the lightweight file, scripts/make_light_h5ad.py still supports the original one-step conversion:

python scripts/make_light_h5ad.py \
  --input your_data.h5ad \
  --output your_data.light.h5ad \
  --n-vars 1

After the wrapper finishes, read the original file again and plot from the results that were written back:

import matplotlib.pyplot as plt
import scanpy as sc
import scICEpy

adata = sc.read_h5ad("your_data.h5ad")

fig, ax = scICEpy.plot_ic(adata, threshold=1.005, show_gamma=True)
fig.savefig("your_data_scice_ic.png", dpi=150, bbox_inches="tight")
plt.close(fig)

adata = scICEpy.get_robust_labels(adata, threshold=1.005, return_adata=True)

scice_columns = [column for column in adata.obs.columns if column.startswith("scICE_k_")]
if "X_umap" in adata.obsm and scice_columns:
    sc.pl.umap(
        adata,
        color=scice_columns[: min(3, len(scice_columns))],
        wspace=0.4,
        show=False,
    )
    plt.savefig("your_data_scice_umap.png", dpi=150, bbox_inches="tight")
    plt.close()

Parallelism and Performance

scICEpy uses:

  • shared search plus per-target optimization in cluster-range mode
  • outer multiprocessing on Unix where appropriate
  • a shared Phase 1 process pool for large graphs (n_cells >= 200000)
  • per-gamma trial summaries that reuse final-cluster counts and preferred-hit trial bookkeeping across optimization and finalization
  • inner thread pools mainly for small/medium jobs and bootstrap/finalize work
  • runtime temporary directories and optional spill-to-disk for large intermediate matrices

Result assembly keeps the public adata.uns["scICE"] contract unchanged while using one shared final-cluster deduplication path for both cluster-range and manual resolution outputs.

Practical tips:

  • Start with a focused cluster_range instead of scanning more targets than you need.
  • Reduce n_trials and n_bootstrap for large datasets when you need a faster exploratory run.
  • Use n_workers as the top-level budget and only set outer_workers or inner_workers when you need tighter control.
  • Set scratch_dir if you want runtime temporary files written somewhere specific.
  • Keep verbose=True when tuning large jobs so you can inspect search, optimization, and final summary logs.
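
One way to picture how a single n_workers budget could resolve into an outer/inner layout: spread workers across targets first, then give each outer worker the remaining budget as inner threads. This is a hypothetical heuristic for intuition only; scICEpy's real resolution also accounts for graph size and platform:

```python
# Hypothetical resolution of an outer/inner worker layout from one
# n_workers budget. scICEpy's real heuristic is more involved.

def resolve_layout(n_workers, n_targets):
    outer = max(1, min(n_workers, n_targets))
    inner = max(1, n_workers // outer)
    return outer, inner

print(resolve_layout(10, 4))   # -> (4, 2)
print(resolve_layout(10, 19))  # -> (10, 1)
```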

Testing

From the repository root:

pip install -e ".[dev]"
python -m pytest -q -m "not slow"
python -m pytest -q -m slow tests/test_smoke.py

If you launch Python from the repository parent directory, import scICEpy now resolves through a repository-root shim and still exposes the packaged API (scICE_clustering, plot_ic, get_robust_labels, and helper submodules such as scICEpy.target_optimizer).
