Single-cell Inconsistency-based Clustering Evaluation for Python
scICEpy evaluates clustering stability on precomputed single-cell graphs stored
in AnnData objects. It is designed for Scanpy-style workflows and writes its
results back to adata.uns["scICE"].
The Seurat/R counterpart is scICER.
scICEpy currently supports two analysis modes:
- Cluster-range mode: run one shared resolution search, derive per-target gamma intervals, then optimize each requested target cluster.
- Manual resolution mode: skip shared search and evaluate the supplied gamma values directly. Repeated gamma values are deduplicated before evaluation.
Important current result semantics:
- Returned public results are keyed by the final merged cluster count in `best_labels`.
- In cluster-range mode, `source_target_cluster` records which requested target produced each returned final cluster result.
- `min_cluster_size` affects both effective-cluster counting during search and optimization, and the final merge applied to `best_labels`.
- The public `beta` parameter is retained for API compatibility and metadata, but the current Python backend reports `beta_supported = False` and `beta_applied = False`.
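The effective-cluster idea behind `min_cluster_size` can be pictured with a small sketch. This is an illustration only, not scICEpy's internal code; `effective_cluster_count` is a hypothetical name:

```python
from collections import Counter

# Illustrative only: count clusters whose size reaches min_cluster_size,
# mirroring the idea of "effective" clusters during search/optimization.
def effective_cluster_count(labels, min_cluster_size=2):
    sizes = Counter(labels)
    return sum(1 for size in sizes.values() if size >= min_cluster_size)

# Three raw clusters, but the singleton does not count as effective:
labels = ["a", "a", "a", "b", "b", "c"]
print(effective_cluster_count(labels, min_cluster_size=2))  # 2
```

The final merge applied to `best_labels` is a separate step; this sketch only shows the counting side.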
Manual resolution mode semantics:
- Duplicated input gamma values are removed before evaluation.
- Each remaining gamma is evaluated directly with repeated Leiden trials and bootstrap IC.
- The public main result is deduplicated by final cluster number, so `resolution=[...]` is not guaranteed to replay the public output of a prior `cluster_range` run.
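Order-preserving deduplication of the input gammas can be sketched as follows (an illustration of the idea, not scICEpy's exact code path; `dedupe_gammas` is a hypothetical name):

```python
def dedupe_gammas(gammas):
    # Keep the first occurrence of each gamma, preserving input order.
    seen = set()
    unique = []
    for gamma in gammas:
        if gamma not in seen:
            seen.add(gamma)
            unique.append(gamma)
    return unique

print(dedupe_gammas([0.05, 0.05, 0.10, 0.20]))  # [0.05, 0.1, 0.2]
```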
Installation:

```bash
cd scICEpy
pip install -e .
```

Development install:

```bash
cd scICEpy
pip install -e ".[dev]"
```

Requirements:

- Python >= 3.8
- numpy
- pandas
- scipy
- scanpy
- anndata
- matplotlib
- python-igraph
- leidenalg
Quick import check:
```bash
python - <<'PY'
import scICEpy
print(scICEpy.__version__)
print(scICEpy.scICE_clustering)
PY
```

Automated tests:

```bash
python -m pytest -q -m "not slow"
python -m pytest -q -m slow tests/test_smoke.py
```

The fast default suite lives under `tests/`. The slow smoke test in
tests/test_smoke.py exercises end-to-end Scanpy preprocessing, clustering,
plotting, and manual resolution mode.
Development note:
- The implementation logic lives in
clustering_dispatch.py,clustering_inputs.py,clustering_modes.py,clustering_reporting.py,gamma_candidates.py,gamma_execution.py,target_optimizer.py,cluster_utils.py, andscICEpy.py. - The split helper modules now use explicit one-way imports rather than
circular
import *chains.
Documentation:
- README.md: quick installation, common usage patterns, and practical examples.
- design.md: implementation-oriented design document for `scICE_clustering()`. It explains the full workflow, parameter semantics, output fields, and important behavior differences between `cluster_range` mode and manual `resolution` mode.
- Use design.md when you need to understand how results are produced internally, interpret fields such as `resolution_diagnostics`, or understand why `resolution=[...]` is not a one-to-one replay of a previous `cluster_range` run.
Quickstart (cluster-range mode):

```python
import matplotlib.pyplot as plt
import scanpy as sc
import scICEpy

adata = sc.read_h5ad("your_data.h5ad")

# If the graph is not already present, compute neighbors and UMAP first.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

scICEpy.scICE_clustering(
    adata,
    graph_key="connectivities",
    cluster_range=list(range(2, 21)),
    n_trials=15,
    n_bootstrap=100,
    seed=42,
    verbose=True,
)

results = adata.uns["scICE"]
print(results["analysis_mode"])
print(results["n_cluster"])
print(results["ic"])
print(results["best_cluster"], results["best_resolution"])

fig, ax = scICEpy.plot_ic(adata, threshold=1.005, show_gamma=True)
fig.savefig("scice_ic_plot.png", dpi=150, bbox_inches="tight")
plt.close(fig)

adata = scICEpy.get_robust_labels(adata, threshold=1.005, return_adata=True)
scice_columns = [column for column in adata.obs.columns if column.startswith("scICE_k_")]
if scice_columns:
    sc.pl.umap(
        adata,
        color=scice_columns[: min(3, len(scice_columns))],
        wspace=0.4,
        show=False,
    )
    plt.savefig("scice_umap.png", dpi=150, bbox_inches="tight")
    plt.close()
```

Manual resolution mode:

```python
scICEpy.scICE_clustering(
    adata,
    resolution=[0.05, 0.05, 0.10, 0.20],
    n_trials=15,
    n_bootstrap=100,
    seed=42,
    verbose=True,
)

results = adata.uns["scICE"]
print(results["analysis_mode"])     # "resolution"
print(results["resolution_input"])  # deduplicated gamma values
```

After scICE_clustering() finishes, scICEpy stores a nested result dictionary
in adata.uns["scICE"].
```python
results = adata.uns["scICE"]
print(results["analysis_mode"])
print(results["n_cluster"])              # returned final merged cluster counts
print(results["source_target_cluster"])  # originating requested targets
print(results["gamma"])
print(results["ic"])
print(results["consistent_clusters"])
print(results["best_cluster"], results["best_resolution"])
print(results["coverage_complete"], results["search_coverage_complete"])
print(results["parallel_layout"])
```

Key fields to know:
- `analysis_mode`: `"cluster_range"` or `"resolution"`.
- `n_cluster`: returned final merged cluster counts in the public result.
- `source_target_cluster`: requested target that produced each returned result in cluster-range mode.
- `gamma`, `ic`, `ic_vec`, `best_labels`: main per-result outputs.
- `consistent_clusters`, `best_cluster`, `best_resolution`: summary fields derived from `ic_threshold`.
- `coverage_complete`: whether every requested target survives the final public result after optimization and final-count rekeying.
- `search_coverage_complete`: whether shared search found optimization-ready intervals for all requested targets before optimization.
- `target_diagnostics`, `resolution_search_diagnostics`, `optimization_diagnostics`, `resolution_diagnostics`: pandas DataFrames with search, optimization, and manual-resolution diagnostics.
- `parallel_layout`: resolved outer/inner worker layout used for the run. Large graphs now bias Phase 1 toward shared process-pool execution and often resolve to `inner_workers = 1`.
- `min_cluster_size`: value used for effective counting and final label merge.
- `beta_supported`, `beta_applied`, `beta_support_reason`: backend capability metadata for the `beta` parameter.
- `cluster_range_tested`: summary array retained in the public result; it currently mirrors the returned `n_cluster` values.
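Because `n_cluster`, `gamma`, and `ic` are parallel sequences, a threshold filter over the public result can be sketched like this. This is illustrative only (scICEpy already derives `consistent_clusters` for you), and `pick_consistent` is a hypothetical helper:

```python
def pick_consistent(results, threshold=1.005):
    # Pair each returned cluster count with its gamma and IC score,
    # keeping only results at or below the IC threshold.
    rows = zip(results["n_cluster"], results["gamma"], results["ic"])
    return [(k, g, ic) for k, g, ic in rows if ic <= threshold]

demo = {"n_cluster": [3, 5, 8], "gamma": [0.1, 0.4, 0.9], "ic": [1.001, 1.004, 1.2]}
print(pick_consistent(demo))  # [(3, 0.1, 1.001), (5, 0.4, 1.004)]
```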
Extract returned labels as a DataFrame:
```python
labels_df = scICEpy.get_robust_labels(adata, threshold=1.005)
print(labels_df.head())
```

Or add them directly to adata.obs:

```python
adata = scICEpy.get_robust_labels(adata, threshold=1.005, return_adata=True)
```

Plot and save the IC distributions:

```python
import matplotlib.pyplot as plt

fig, ax = scICEpy.plot_ic(adata, threshold=1.005, show_gamma=True)
fig.savefig("scice_ic_plot.png", dpi=150, bbox_inches="tight")
plt.close(fig)
```

If UMAP coordinates are available, you can also add robust labels and save a UMAP figure:

```python
import matplotlib.pyplot as plt
import scanpy as sc

adata = scICEpy.get_robust_labels(adata, threshold=1.005, return_adata=True)
scice_columns = [column for column in adata.obs.columns if column.startswith("scICE_k_")]
if "X_umap" in adata.obsm and scice_columns:
    sc.pl.umap(
        adata,
        color=scice_columns[: min(3, len(scice_columns))],
        wspace=0.4,
        show=False,
    )
    plt.savefig("scice_umap.png", dpi=150, bbox_inches="tight")
    plt.close()
```

plot_ic() and get_robust_labels() can also consume a raw result dictionary
instead of an AnnData object, as long as cell names are available when labels
need to be returned.
Main entry point:
```python
scICEpy.scICE_clustering(
    adata,
    graph_key="connectivities",
    cluster_range=None,
    n_workers=10,
    outer_workers=None,
    inner_workers=None,
    n_trials=15,
    n_bootstrap=100,
    seed=None,
    beta=0.1,
    n_iterations=10,
    max_iterations=150,
    ic_threshold=float("inf"),
    objective_function="CPM",
    remove_threshold=1.15,
    min_cluster_size=2,
    resolution_tolerance=1e-8,
    verbose=True,
    resolution=None,
    copy=False,
    scratch_dir=None,
)
```

Behavior notes for the most important options:
- `graph_key`: which graph in `adata.obsp` to cluster.
- `cluster_range`: requested targets for cluster-range mode.
- `resolution`: manual gamma values. When set, shared search is skipped.
- `n_workers`: top-level worker budget. scICEpy resolves the actual outer and inner worker layout from this budget.
- `outer_workers`, `inner_workers`: optional explicit caps for outer multiprocessing and inner thread work.
- `remove_threshold`: cluster-range pre-filter threshold. It is ignored in manual resolution mode.
- `min_cluster_size`: when greater than 1, scICEpy counts effective clusters during search and optimization and merges small clusters in the final labels.
- `resolution_tolerance`: search tolerance used in cluster-range mode.
- `copy`: return a modified AnnData copy instead of writing in place.
- `scratch_dir`: optional runtime temp root for spill files and temporary working directories.
- `beta`: kept for compatibility and metadata, but not applied by the current Python backend. The current version of leidenalg (0.10.2) does not support a `beta` input.
Public helpers:
- `scICEpy.plot_ic(...)`
- `scICEpy.get_robust_labels(...)`
If your .h5ad file is dominated by expression matrices or layers that
scICEpy does not need for clustering, use the large-file wrapper to create a
lighter copy, run scICEpy on that copy, and write the results back into the
original .h5ad.
Helper scripts:
- `scripts/run_large_h5ad_scice.py`
- `scripts/make_light_h5ad.py`

The light-file step preserves:
- `obs`, `obsm`, `obsp`, `uns`
- only the first `n_vars` feature columns from `X` and aligned per-variable metadata
Primary workflow:
```bash
python scripts/run_large_h5ad_scice.py \
    --input your_data.h5ad \
    --light-output your_data.light.h5ad \
    --n-vars 1 \
    --cluster-range 2 3 4 5 6 7 8 9 10 \
    --n-trials 15 \
    --n-bootstrap 100 \
    --seed 42
```

This workflow:
- creates your_data.light.h5ad
- runs `scICEpy.scICE_clustering(...)` on the light file
- writes only `uns["scICE"]` back to your_data.h5ad
- keeps the light file on disk for reuse or inspection
For H5AD persistence, variable-length result sequences such as bootstrap IC
vectors and stored label collections are written through an internal
H5AD-safe encoding. plot_ic() and get_robust_labels() read that encoding
transparently after reload.
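scICEpy's actual on-disk encoding is internal, but the general flatten-plus-offsets pattern for storing ragged sequences in fixed-shape arrays looks roughly like this (function names are hypothetical):

```python
def encode_ragged(rows):
    # Flatten variable-length rows into one values array plus row offsets.
    values, offsets = [], [0]
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return values, offsets

def decode_ragged(values, offsets):
    # Slice the flat values array back into its original rows.
    return [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]

ic_vecs = [[1.0, 1.01], [1.002, 1.003, 1.004], []]
values, offsets = encode_ragged(ic_vecs)
assert decode_ragged(values, offsets) == ic_vecs
```

Both arrays have fixed element types, which is what makes them straightforward to serialize into H5AD.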
Use `--resolution 0.05 0.10 0.20` instead of `--cluster-range ...` when you want manual resolution mode.
If you only want to generate the lightweight file, scripts/make_light_h5ad.py
still supports the original one-step conversion:
```bash
python scripts/make_light_h5ad.py \
    --input your_data.h5ad \
    --output your_data.light.h5ad \
    --n-vars 1
```

After the wrapper finishes, read the original file again and plot from the results that were written back:
```python
import matplotlib.pyplot as plt
import scanpy as sc
import scICEpy

adata = sc.read_h5ad("your_data.h5ad")

fig, ax = scICEpy.plot_ic(adata, threshold=1.005, show_gamma=True)
fig.savefig("your_data_scice_ic.png", dpi=150, bbox_inches="tight")
plt.close(fig)

adata = scICEpy.get_robust_labels(adata, threshold=1.005, return_adata=True)
scice_columns = [column for column in adata.obs.columns if column.startswith("scICE_k_")]
if "X_umap" in adata.obsm and scice_columns:
    sc.pl.umap(
        adata,
        color=scice_columns[: min(3, len(scice_columns))],
        wspace=0.4,
        show=False,
    )
    plt.savefig("your_data_scice_umap.png", dpi=150, bbox_inches="tight")
    plt.close()
```

scICEpy uses:
- shared search plus per-target optimization in cluster-range mode
- outer multiprocessing on Unix where appropriate
- a shared Phase 1 process pool for large graphs (`n_cells >= 200000`)
- per-gamma trial summaries that reuse final-cluster counts and preferred-hit trial bookkeeping across optimization and finalization
- inner thread pools mainly for small/medium jobs and bootstrap/finalize work
- runtime temporary directories and optional spill-to-disk for large intermediate matrices
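A minimal spill-to-disk sketch under a `scratch_dir`-style temp root might look like this. It is a generic illustration; scICEpy's actual spill format and lifecycle are internal, and `spill`/`unspill` are hypothetical names:

```python
import os
import pickle
import tempfile

def spill(obj, scratch_dir=None):
    # Persist an intermediate object to a temp file and return its path.
    fd, path = tempfile.mkstemp(suffix=".pkl", dir=scratch_dir)
    with os.fdopen(fd, "wb") as fh:
        pickle.dump(obj, fh)
    return path

def unspill(path, remove=True):
    # Load the object back and optionally delete the spill file.
    with open(path, "rb") as fh:
        obj = pickle.load(fh)
    if remove:
        os.remove(path)
    return obj

matrix = [[i * j for j in range(4)] for i in range(4)]
path = spill(matrix)
assert unspill(path) == matrix
assert not os.path.exists(path)
```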
Result assembly keeps the public adata.uns["scICE"] contract unchanged while
using one shared final-cluster deduplication path for both cluster-range and
manual resolution outputs.
Practical tips:
- Start with a focused `cluster_range` instead of scanning more targets than you need.
- Reduce `n_trials` and `n_bootstrap` for large datasets when you need a faster exploratory run.
- Use `n_workers` as the top-level budget and only set `outer_workers` or `inner_workers` when you need tighter control.
- Set `scratch_dir` if you want runtime temporary files written somewhere specific.
- Keep `verbose=True` when tuning large jobs so you can inspect search, optimization, and final summary logs.
From the repository root:
```bash
pip install -e ".[dev]"
python -m pytest -q -m "not slow"
python -m pytest -q -m slow tests/test_smoke.py
```

If you launch Python from the repository parent directory, `import scICEpy` now resolves through a repository-root shim and still exposes the packaged API (scICE_clustering, plot_ic, get_robust_labels, and helper submodules such as scICEpy.target_optimizer).