scICER is an R package that implements a systematic and efficient workflow to evaluate clustering consistency in single-cell RNA-seq data. This is the R port of the original scICE Julia package, designed to be fully compatible with the Seurat ecosystem.
The AnnData/Python counterpart is scICEpy.
Implementation details are documented in design.md.
- 🔬 Automated Cluster Evaluation: Systematically tests multiple cluster numbers to find consistent results
- ⚡ Element-Centric Similarity (ECS): Efficient algorithm for comparing clustering results
- 🧬 Seurat Integration: Seamless integration with Seurat objects and workflows
- 🚀 Parallel Processing: Leverages multiple cores for faster computation
- 📊 Comprehensive Visualization: Built-in plotting functions for result interpretation
- 📈 Robust Statistics: Bootstrap-based confidence intervals and stability assessment
- 🎯 Reproducible Analysis: Optional seed parameter for deterministic results
- 🔍 Smart Parameter Recommendations: Automatically suggests optimal parameters based on dataset size
- ✅ Preprocessing Validation: Checks and guides users through proper Seurat preprocessing
- 📋 Comprehensive Reporting: Generates detailed analysis summaries with interpretation guidance
scICER evaluates clustering consistency by:
- Multiple Clustering: Runs Leiden clustering multiple times with different parameters
- Binary Search: Automatically finds resolution parameters for target cluster numbers
- ECS Calculation: Uses Element-Centric Similarity to efficiently compare clustering results
- Optimization: Iteratively refines clustering to minimize inconsistency
- Bootstrap Validation: Assesses stability through bootstrap sampling
- Threshold Application: Identifies cluster numbers meeting consistency criteria
- Reproducibility Control: Optional seed parameter for deterministic results
Performance Considerations vs scICE Julia package
- Speed: R implementation typically runs slower than the native Julia version
- Memory: Requires more memory for equivalent analyses
- Accuracy: While highly consistent, results may differ slightly from the original scICE due to implementation differences in R vs Julia, random number generation variations, and floating-point precision handling.
Note: These tradeoffs provide better integration with the Seurat ecosystem and R workflows
scICER requires the ClustAssess library for Element-Centric Similarity (ECS) calculations, which provides:
- ~150x faster performance for similarity calculations
- Professional, well-tested implementation by Cambridge Core Bioinformatics
- Identical results to reference implementations
ClustAssess will be automatically installed when you install scICER.
Ensure you have R ≥ 4.0.0 and the required dependencies:
# Install required CRAN packages
install.packages(c(
"Seurat", "SeuratObject", "igraph", "Matrix",
"parallel", "foreach", "doParallel", "dplyr",
"ggplot2", "devtools", "ClustAssess"
))
# Optional: Install leiden package for better performance
install.packages("leiden")# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
# Install scICER from GitHub
devtools::install_github("ATPs/scICER", upgrade = "never")#avoid upgrade packages and use existing oneslibrary(scICER)
library(Seurat)# Load required packages
library(scICER)
library(Seurat)
library(foreach)
library(doParallel)
library(ggplot2)
# Set up parallel processing (optional, but recommended)
n_workers <- 4 # adjust based on your system
# Load your Seurat object
data(pbmc_small) # Example dataset
seurat_obj <- pbmc_small
# Ensure preprocessing is complete
seurat_obj <- NormalizeData(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj)
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:20)
# Run scICER analysis
scice_results <- scICE_clustering(
object = seurat_obj,
cluster_range = 3:20,
n_workers = n_workers,
min_cluster_size = 2, # Default; consistent with Seurat singleton behavior
remove_threshold = Inf,
verbose = TRUE
)
# Or evaluate manually chosen gamma values directly
manual_results <- scICE_clustering(
object = seurat_obj,
resolution = c(0.2, 0.4, 0.8),
n_workers = n_workers,
min_cluster_size = 2,
verbose = TRUE
)
# For reproducible results, use the seed parameter
scice_results <- scICE_clustering(
object = seurat_obj,
cluster_range = 3:20,
n_workers = n_workers,
min_cluster_size = 2,
remove_threshold = Inf,
verbose = TRUE,
seed = 123 # Set seed for reproducible results
)
# Visualize results
plot_ic(scice_results, threshold = 1.005)
# Extract consistent clustering labels
seurat_obj <- get_robust_labels(scice_results, return_seurat = TRUE, object = seurat_obj)
# Visualize with consistent clusters, cluster range from 3 to 15
xl.fig <- list()
for (n in 3:15) {
if (paste0("clusters_", n) %in% colnames(seurat_obj@meta.data)){
xl.fig[[paste0("clusters_", n)]] <- DimPlot(seurat_obj, group.by = paste0("clusters_", n), label = TRUE) + ggtitle(paste("Clusters:", n))
}
}
cowplot::plot_grid(plotlist = xl.fig, ncol = 3)README.md: quick installation, common usage patterns, and practical examples.design.md: implementation-oriented design document forscICE_clustering(). It explains the full workflow, parameter semantics, output fields, and important behavior differences betweencluster_rangemode and manualresolutionmode.- Use
design.mdwhen you need to understand how results are produced internally, interpret fields such asresolution_diagnostics, or understand whyresolution = old_results$gammais not a one-to-one replay of a previouscluster_rangerun.
scICE_clustering(..., verbose = TRUE) prints timestamped progress messages like
[YYYY-MM-DD HH:MM:SS] ... via message(), which goes to stderr.
If you launch jobs in the background and only capture stdout, logs can appear missing.
Use:
Rscript examples/run_scice.R --help
Rscript examples/run_scice.R --input_qs input.qs --output_qs augmented_with_scice.qs > scice.log 2>&1 &
tail -f scice.logFor very large input objects, qs::qread() can be silent for a while before scICER starts.
The provided run_scice.R script prints timestamped checkpoints before and after
each major step, builds a lightweight Seurat object that keeps only one feature
plus the requested graph for the clustering phase, deletes the original large
object before scICE_clustering(), then reloads the original object only for
metadata merge. The final augmented Seurat object is saved with qs.
The script accepts both environment variables and explicit CLI flags such as
--qread_threads, --n_workers, --cluster_range, and --resolution.
If --output_qs is omitted, it defaults to --input_qs.
# Comprehensive scICER analysis
results <- scICE_clustering(
object = your_seurat_object,
cluster_range = 2:20, # Range of cluster numbers to test
n_workers = 8, # Number of parallel workers
n_trials = 15, # Clustering trials per resolution
n_bootstrap = 100, # Bootstrap iterations
seed = 42, # Optional: Set for reproducible results
ic_threshold = 1.005, # Consistency threshold
min_cluster_size = 2, # Default minimum cells per effective cluster (set 1 to disable)
verbose = TRUE # Progress messages
)# Fine-tuned parameters for large datasets
results <- scICE_clustering(
object = large_seurat_object,
graph_name = "RNA_snn", # Optional: specify graph name
cluster_range = 5:12, # Focused range for efficiency
objective_function = "CPM", # "CPM" or "modularity"
beta = 0.1, # Leiden beta parameter
n_iterations = 10, # Initial Leiden iterations
max_iterations = 150, # Maximum optimization iterations
remove_threshold = 1.15, # Filter inconsistent results
min_cluster_size = 2, # Default threshold; final merge only on best_labels
resolution_tolerance = 1e-8, # Resolution search precision
n_trials = 10, # Reduced for speed
n_bootstrap = 50, # Reduced for speed
seed = 12345 # For reproducible results
)It's OK to set a large remove_threshold value, because plot_ic and get_robust_labels all have default threshold = 1.005.
min_cluster_size = 2 approximates Seurat singleton grouping; use min_cluster_size = 1 to keep raw Leiden clusters.
labels stay raw for IC/MEI, while best_labels receives one final merge pass when min_cluster_size > 1.
resolution can be a single gamma or a vector of gamma values. When resolution is provided, scICER skips the cluster_range resolution search, evaluates the supplied gamma values directly, and keeps the lowest-IC result for each final cluster number. If both resolution and cluster_range are provided, resolution takes priority and cluster_range is ignored with an informational message.
# Evaluate one user-provided gamma
result_single_resolution <- scICE_clustering(
object = your_seurat_object,
resolution = 0.2,
n_trials = 15,
n_bootstrap = 100,
min_cluster_size = 2,
seed = 42
)
# Evaluate several user-provided gamma values
results_manual <- scICE_clustering(
object = your_seurat_object,
resolution = c(0.1, 0.2, 0.4, 0.8),
n_trials = 15,
n_bootstrap = 100,
min_cluster_size = 2,
seed = 42
)
# Plot the retained results
plot_ic(results_manual, threshold = 1.005)
# Per-gamma diagnostics for all user-supplied resolutions
results_manual$resolution_diagnostics
# Lowest-IC returned solution
results_manual$best_cluster
results_manual$best_resolution
# resolution takes priority if cluster_range is also supplied
results_manual2 <- scICE_clustering(
object = your_seurat_object,
cluster_range = 3:10,
resolution = c(0.15, 0.3, 0.6),
n_trials = 15,
n_bootstrap = 100,
min_cluster_size = 2,
seed = 42
)# Plot Inconsistency Coefficient (IC) scores
plot_ic(results,
threshold = 1.005,
title = "Clustering Consistency Analysis")
# Plot stability analysis
plot_stability(results)
# Extract summary of consistent clusters
summary <- extract_consistent_clusters(results, threshold = 1.005)
print(summary)
# Get clustering labels as data frame
labels_df <- get_robust_labels(results)
# Add results directly to Seurat object
seurat_obj <- get_robust_labels(results, return_seurat = TRUE, object = seurat_obj)# Complete Seurat + scICER workflow
seurat_obj <- CreateSeuratObject(counts = your_counts)
seurat_obj <- NormalizeData(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj)
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:20)
# Run scICER with reproducibility
scice_results <- scICE_clustering(
seurat_obj,
cluster_range = 3:15,
seed = 42 # For reproducible results
)
# Add consistent clusters to Seurat object
seurat_obj <- get_robust_labels(scice_results, return_seurat = TRUE, object = seurat_obj)
# Continue with standard Seurat analysis
seurat_obj <- RunUMAP(seurat_obj, dims = 1:20)
DimPlot(seurat_obj, group.by = "clusters_8") # Use consistent clustering
# Find markers for consistent clusters
Idents(seurat_obj) <- "clusters_8"
markers <- FindAllMarkers(seurat_obj)The codes used to generate results in scICER paper are available at GitHub (https://github.com/yerry77/scICER-workflow).
The scICE_clustering() function returns a list containing:
gamma: Resolution parameters for each cluster numberlabels: Raw clustering arrays used for IC/MEI (no small-cluster merge)ic: Inconsistency scores (lower = more consistent)ic_vec: Bootstrap IC distributionsn_cluster: Number of clusters testedbest_labels: Final best labels (single post-optimization merge whenmin_cluster_size > 1)consistent_clusters: Cluster numbers meeting consistency thresholdmei: Mutual Element-wise Information scoresanalysis_mode:"cluster_range"or"resolution"resolution_input: Manual gamma values requested by the user inresolutionmoderesolution_diagnostics: Per-gamma summary table inresolutionmodebest_cluster: Returned cluster number with the lowest IC scorebest_resolution: Gamma corresponding tobest_cluster
scICE_clustering() returns a lightweight results object and does not store the full
Seurat input object. When you need metadata columns added back, pass the original
Seurat object explicitly:
seurat_obj <- get_robust_labels(results, return_seurat = TRUE, object = seurat_obj)In resolution mode, resolution_diagnostics keeps one row per input gamma, while the main result object is deduplicated by final cluster number so that each returned cluster number retains only the lowest-IC gamma.
- IC < 1.005: Highly consistent clustering (recommended), less than 0.5% of cells is inconsistent.
- IC 1.005-1.01: Moderately consistent clustering, 0.5% to 1% of cells is inconsistent.
- IC > 1.01: Low consistency (not recommended), more than 1% of cells is inconsistent.
- IC Plot: Shows consistency scores across cluster numbers
- Stability Plot: Displays bootstrap distributions
- UMAP/t-SNE: Visualize clusters with consistent labels
# Get recommended parameters based on dataset size
recommended <- get_recommended_parameters(ncol(large_seurat_obj))
# Optimized settings for large datasets
results <- scICE_clustering(
object = large_seurat_obj,
cluster_range = 5:15, # Focused range
n_trials = 8, # Fewer trials
n_bootstrap = 50, # Fewer bootstrap samples
n_workers = detectCores() - 1, # Maximum parallel processing
seed = 42 # For reproducibility
)For very large Seurat objects, the biggest avoidable memory cost is keeping the
full object in memory during the entire scICER run. scICE_clustering() only
needs the graph and cell names, so a practical pattern is:
- read the original object with
qs::qread(); - build a lightweight Seurat object that keeps only one feature plus the target graph;
- remove the original object and run
gc()before clustering; - run
scICE_clustering()on the lightweight object; - reload the original object and add the chosen labels back with
get_robust_labels(); - save both the results object and the augmented Seurat object with
qs::qsave().
library(Seurat)
library(scICER)
library(qs)
input_qs <- "/path/to/very_large_object.qs"
graph_name <- "RNA_snn"
qread_threads <- 40
full_obj <- qs::qread(input_qs, nthread = qread_threads)
# Keep one feature plus the graph so the clustering phase does not hold the
# full assay/reduction payload in memory.
assay_name <- DefaultAssay(full_obj)
keep_feature <- rownames(full_obj[[assay_name]])[1]
small_obj <- DietSeurat(
object = full_obj,
assays = assay_name,
graphs = graph_name,
dimreducs = NULL,
features = keep_feature,
misc = FALSE
)
rm(full_obj)
gc()
results <- scICE_clustering(
object = small_obj,
graph_name = graph_name,
cluster_range = 4:5, # Keep the range focused for huge objects
n_trials = 4, # Start small and increase only if needed
n_bootstrap = 20,
n_workers = 40,
remove_threshold = Inf,
seed = 123,
verbose = TRUE
)
rm(small_obj)
gc()
full_obj <- qs::qread(input_qs, nthread = qread_threads)
full_obj <- get_robust_labels(
results,
return_seurat = TRUE,
object = full_obj,
threshold = 1.01
)
qs::qsave(results, "scice_results.qs", preset = "balanced")
qs::qsave(full_obj, "very_large_object_with_scice.qs", preset = "balanced")If this is still too slow, reduce cluster_range further or test a small set
of user-provided resolution values directly instead of scanning a wide target
cluster range. For the bundled background script, you can switch to manual
resolution mode without editing the file:
SCICE_RESOLUTION=0.01,0.02 \
SCICE_N_TRIALS=2 \
SCICE_N_BOOTSTRAP=10 \
Rscript examples/run_scice.R --input_qs input.qs --output_qs augmented_with_scice.qs- Reduce
n_trialsandn_bootstrapfor memory-limited systems - Use focused
cluster_rangebased on biological expectations - Install the
leidenpackage for improved performance - Use
get_recommended_parameters()for optimal settings
-
"Graph not found" error:
- Run
FindNeighbors()on your Seurat object first - Use
check_seurat_ready()to validate preprocessing - By default, scICER uses the SNN graph from the active assay (e.g., "RNA_snn")
- You can check available graphs with
names(your_seurat_object@graphs) - Specify a different graph with the
graph_nameparameter if needed
- Run
-
No consistent clusters found:
- Try adjusting
ic_threshold(e.g., 1.01) - Expand
cluster_range - Check preprocessing quality
- Use
create_results_summary()for detailed analysis
- Try adjusting
-
Memory issues:
- Reduce
n_workers,n_trials, orcluster_range - Use
get_recommended_parameters()for guidance
- Reduce
-
Slow performance:
- Install the
leidenpackage - Increase
n_workers - Use recommended parameters for your dataset size
- Install the
-
Reproducibility issues:
- Always set the
seedparameter - Keep all other parameters constant
- Document parameter values used
- Always set the
# Check if leiden package is available
if (requireNamespace("leiden", quietly = TRUE)) {
message("leiden package detected - enhanced performance available")
} else {
message("Consider installing 'leiden' package for better performance")
}
# Validate Seurat object preprocessing
check_seurat_ready(seurat_obj)
# Get recommended parameters
params <- get_recommended_parameters(ncol(seurat_obj))If you use scICER in your research, please cite:
Cao, X. (2024). scICER: Single-cell Inconsistency-based Clustering Evaluation in R.
R package version 1.1.0. https://github.com/ATPs/scICER
The original scICE algorithm citation will be added upon publication
For each number of clusters (e.g., when testing 3-10 clusters):
- The algorithm runs
n_trialsindependent clustering attempts (default: 15) - Each trial uses the Leiden algorithm with different random initializations
- For each cluster number:
- Calculates pairwise similarities between all trial results using Element-Centric Similarity (ECS)
- Computes an Inconsistency (IC) score measuring how different the clustering solutions are
- A lower IC score indicates more consistent clustering across trials
- For cluster numbers with IC scores below your threshold (e.g., < 1.05):
- Selects the "best" clustering solution from the trials
- The "best" solution is the one with highest average similarity to other solutions (most representative)
- This becomes the reported clustering for that number of clusters
So when you get results showing clusters 3-10 have IC scores < 1.05, it means:
- Each of these cluster numbers produced stable results across different random initializations
- The reported clustering solution is the most representative one from all trials
- You can be confident in using these clustering solutions for downstream analysis
If you use resolution = c(...), the same IC logic is applied directly to each supplied gamma. When multiple gamma values collapse to the same final cluster number, scICER keeps the lowest-IC one in the main result and stores all evaluated gamma values in resolution_diagnostics.
n_trials:
- Increasing
n_trials(default: 15) can improve the robustness of results - More trials = more chances to find truly stable clustering solutions
- For large datasets, you might want to reduce trials to improve speed
- Recommended ranges:
- Small datasets (<10k cells): 15-20 trials
- Medium datasets (10k-50k cells): 10-15 trials
- Large datasets (>50k cells): 8-10 trials
n_bootstrap:
- Increasing
n_bootstrap(default: 100) improves confidence in IC score estimates - More bootstrap iterations = more accurate stability assessment
- The default 100 iterations is usually sufficient
- Consider increasing for:
- Critical analyses requiring high confidence
- Publications or benchmark studies
- Unusual or complex datasets
- Recommended ranges:
- Standard analysis: 100 iterations
- High-confidence analysis: 200-500 iterations
- Note: Higher values increase computation time
Remember: Quality preprocessing (normalization, feature selection, dimensionality reduction) often has more impact than increasing these parameters.
- 📋 Issues: Report bugs or request features on GitHub Issues
- 📧 Contact: atps@outlook.com
- 🤝 Contributing: Pull requests welcome!
This package is licensed under the MIT License. See LICENSE file for details.
- Original scICE Julia package developers
- Seurat team for the excellent single-cell framework
- R community for the robust package ecosystem