Skip to content

ATPs/scICER

Repository files navigation

scICER: Single-cell Inconsistency-based Clustering Evaluation in R

R Seurat License: MIT

Overview

scICER is an R package that implements a systematic and efficient workflow to evaluate clustering consistency in single-cell RNA-seq data. This is the R port of the original scICE Julia package, designed to be fully compatible with the Seurat ecosystem.

The AnnData/Python counterpart is scICEpy.

Implementation details are documented in design.md.

Key Features

  • 🔬 Automated Cluster Evaluation: Systematically tests multiple cluster numbers to find consistent results
  • Element-Centric Similarity (ECS): Efficient algorithm for comparing clustering results
  • 🧬 Seurat Integration: Seamless integration with Seurat objects and workflows
  • 🚀 Parallel Processing: Leverages multiple cores for faster computation
  • 📊 Comprehensive Visualization: Built-in plotting functions for result interpretation
  • 📈 Robust Statistics: Bootstrap-based confidence intervals and stability assessment
  • 🎯 Reproducible Analysis: Optional seed parameter for deterministic results
  • 🔍 Smart Parameter Recommendations: Automatically suggests optimal parameters based on dataset size
  • Preprocessing Validation: Checks and guides users through proper Seurat preprocessing
  • 📋 Comprehensive Reporting: Generates detailed analysis summaries with interpretation guidance

Algorithm Overview

scICER evaluates clustering consistency by:

  1. Multiple Clustering: Runs Leiden clustering multiple times with different parameters
  2. Binary Search: Automatically finds resolution parameters for target cluster numbers
  3. ECS Calculation: Uses Element-Centric Similarity to efficiently compare clustering results
  4. Optimization: Iteratively refines clustering to minimize inconsistency
  5. Bootstrap Validation: Assesses stability through bootstrap sampling
  6. Threshold Application: Identifies cluster numbers meeting consistency criteria
  7. Reproducibility Control: Optional seed parameter for deterministic results

Performance Considerations vs scICE Julia package

  • Speed: R implementation typically runs slower than the native Julia version
  • Memory: Requires more memory for equivalent analyses
  • Accuracy: While highly consistent, results may differ slightly from the original scICE due to implementation differences in R vs Julia, random number generation variations, and floating-point precision handling.

Note: These tradeoffs provide better integration with the Seurat ecosystem and R workflows

Dependencies

scICER requires the ClustAssess library for Element-Centric Similarity (ECS) calculations, which provides:

  • ~150x faster performance for similarity calculations
  • Professional, well-tested implementation by Cambridge Core Bioinformatics
  • Identical results to reference implementations

ClustAssess will be automatically installed when you install scICER.

Installation

Prerequisites

Ensure you have R ≥ 4.0.0 and the required dependencies:

# Install required CRAN packages
install.packages(c(
  "Seurat", "SeuratObject", "igraph", "Matrix", 
  "parallel", "foreach", "doParallel", "dplyr", 
  "ggplot2", "devtools", "ClustAssess"
))

# Optional: Install leiden package for better performance
install.packages("leiden")

Install scICER from GitHub

# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")

# Install scICER from GitHub
devtools::install_github("ATPs/scICER", upgrade = "never")#avoid upgrade packages and use existing ones

Load the package

library(scICER)
library(Seurat)

Quick Start

Basic Workflow

# Load required packages
library(scICER)
library(Seurat)
library(foreach)
library(doParallel)
library(ggplot2)

# Set up parallel processing (optional, but recommended)
n_workers <- 4  # adjust based on your system

# Load your Seurat object
data(pbmc_small)  # Example dataset
seurat_obj <- pbmc_small

# Ensure preprocessing is complete
seurat_obj <- NormalizeData(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj)  
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:20)

# Run scICER analysis
scice_results <- scICE_clustering(
  object = seurat_obj,
  cluster_range = 3:20,
  n_workers = n_workers,
  min_cluster_size = 2,  # Default; consistent with Seurat singleton behavior
  remove_threshold = Inf,
  verbose = TRUE
)

# Or evaluate manually chosen gamma values directly
manual_results <- scICE_clustering(
  object = seurat_obj,
  resolution = c(0.2, 0.4, 0.8),
  n_workers = n_workers,
  min_cluster_size = 2,
  verbose = TRUE
)

# For reproducible results, use the seed parameter
scice_results <- scICE_clustering(
  object = seurat_obj,
  cluster_range = 3:20,
  n_workers = n_workers,
  min_cluster_size = 2,
  remove_threshold = Inf,
  verbose = TRUE,
  seed = 123  # Set seed for reproducible results
)

# Visualize results
plot_ic(scice_results, threshold = 1.005)

# Extract consistent clustering labels
seurat_obj <- get_robust_labels(scice_results, return_seurat = TRUE, object = seurat_obj)

# Visualize with consistent clusters, cluster range from 3 to 15
xl.fig <- list()
for (n in 3:15) {
  if (paste0("clusters_", n) %in% colnames(seurat_obj@meta.data)){
    xl.fig[[paste0("clusters_", n)]] <- DimPlot(seurat_obj, group.by = paste0("clusters_", n), label = TRUE) + ggtitle(paste("Clusters:", n))
  }
}

cowplot::plot_grid(plotlist = xl.fig, ncol = 3)

Documentation Notes

  • README.md: quick installation, common usage patterns, and practical examples.
  • design.md: implementation-oriented design document for scICE_clustering(). It explains the full workflow, parameter semantics, output fields, and important behavior differences between cluster_range mode and manual resolution mode.
  • Use design.md when you need to understand how results are produced internally, interpret fields such as resolution_diagnostics, or understand why resolution = old_results$gamma is not a one-to-one replay of a previous cluster_range run.

Background Runs and Verbose Logs

scICE_clustering(..., verbose = TRUE) prints timestamped progress messages like [YYYY-MM-DD HH:MM:SS] ... via message(), which goes to stderr. If you launch jobs in the background and only capture stdout, logs can appear missing.

Use:

Rscript examples/run_scice.R --help
Rscript examples/run_scice.R --input_qs input.qs --output_qs augmented_with_scice.qs > scice.log 2>&1 &
tail -f scice.log

For very large input objects, qs::qread() can be silent for a while before scICER starts. The provided run_scice.R script prints timestamped checkpoints before and after each major step, builds a lightweight Seurat object that keeps only one feature plus the requested graph for the clustering phase, deletes the original large object before scICE_clustering(), then reloads the original object only for metadata merge. The final augmented Seurat object is saved with qs. The script accepts both environment variables and explicit CLI flags such as --qread_threads, --n_workers, --cluster_range, and --resolution. If --output_qs is omitted, it defaults to --input_qs.

Detailed Usage

1. Standard Analysis

# Comprehensive scICER analysis
results <- scICE_clustering(
  object = your_seurat_object,
  cluster_range = 2:20,           # Range of cluster numbers to test
  n_workers = 8,                  # Number of parallel workers  
  n_trials = 15,                  # Clustering trials per resolution
  n_bootstrap = 100,              # Bootstrap iterations
  seed = 42,                      # Optional: Set for reproducible results
  ic_threshold = 1.005,           # Consistency threshold
  min_cluster_size = 2,           # Default minimum cells per effective cluster (set 1 to disable)
  verbose = TRUE                  # Progress messages
)

2. Advanced Parameter Customization

# Fine-tuned parameters for large datasets
results <- scICE_clustering(
  object = large_seurat_object,
  graph_name = "RNA_snn",         # Optional: specify graph name
  cluster_range = 5:12,           # Focused range for efficiency
  objective_function = "CPM",     # "CPM" or "modularity"
  beta = 0.1,                     # Leiden beta parameter
  n_iterations = 10,              # Initial Leiden iterations
  max_iterations = 150,           # Maximum optimization iterations
  remove_threshold = 1.15,        # Filter inconsistent results
  min_cluster_size = 2,           # Default threshold; final merge only on best_labels
  resolution_tolerance = 1e-8,    # Resolution search precision
  n_trials = 10,                  # Reduced for speed
  n_bootstrap = 50,               # Reduced for speed
  seed = 12345                    # For reproducible results
)

It's OK to set a large remove_threshold value, because plot_ic and get_robust_labels all have default threshold = 1.005. min_cluster_size = 2 approximates Seurat singleton grouping; use min_cluster_size = 1 to keep raw Leiden clusters. labels stay raw for IC/MEI, while best_labels receives one final merge pass when min_cluster_size > 1. resolution can be a single gamma or a vector of gamma values. When resolution is provided, scICER skips the cluster_range resolution search, evaluates the supplied gamma values directly, and keeps the lowest-IC result for each final cluster number. If both resolution and cluster_range are provided, resolution takes priority and cluster_range is ignored with an informational message.

3. Manual Resolution Mode

# Evaluate one user-provided gamma
result_single_resolution <- scICE_clustering(
  object = your_seurat_object,
  resolution = 0.2,
  n_trials = 15,
  n_bootstrap = 100,
  min_cluster_size = 2,
  seed = 42
)

# Evaluate several user-provided gamma values
results_manual <- scICE_clustering(
  object = your_seurat_object,
  resolution = c(0.1, 0.2, 0.4, 0.8),
  n_trials = 15,
  n_bootstrap = 100,
  min_cluster_size = 2,
  seed = 42
)

# Plot the retained results
plot_ic(results_manual, threshold = 1.005)

# Per-gamma diagnostics for all user-supplied resolutions
results_manual$resolution_diagnostics

# Lowest-IC returned solution
results_manual$best_cluster
results_manual$best_resolution

# resolution takes priority if cluster_range is also supplied
results_manual2 <- scICE_clustering(
  object = your_seurat_object,
  cluster_range = 3:10,
  resolution = c(0.15, 0.3, 0.6),
  n_trials = 15,
  n_bootstrap = 100,
  min_cluster_size = 2,
  seed = 42
)

4. Visualization and Results

# Plot Inconsistency Coefficient (IC) scores
plot_ic(results, 
        threshold = 1.005,
        title = "Clustering Consistency Analysis")

# Plot stability analysis
plot_stability(results)

# Extract summary of consistent clusters
summary <- extract_consistent_clusters(results, threshold = 1.005)
print(summary)

# Get clustering labels as data frame
labels_df <- get_robust_labels(results)

# Add results directly to Seurat object
seurat_obj <- get_robust_labels(results, return_seurat = TRUE, object = seurat_obj)

4. Integration with Seurat Workflow

# Complete Seurat + scICER workflow
seurat_obj <- CreateSeuratObject(counts = your_counts)
seurat_obj <- NormalizeData(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj)
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:20)

# Run scICER with reproducibility
scice_results <- scICE_clustering(
  seurat_obj, 
  cluster_range = 3:15,
  seed = 42  # For reproducible results
)

# Add consistent clusters to Seurat object
seurat_obj <- get_robust_labels(scice_results, return_seurat = TRUE, object = seurat_obj)

# Continue with standard Seurat analysis
seurat_obj <- RunUMAP(seurat_obj, dims = 1:20)
DimPlot(seurat_obj, group.by = "clusters_8")  # Use consistent clustering

# Find markers for consistent clusters
Idents(seurat_obj) <- "clusters_8"
markers <- FindAllMarkers(seurat_obj)

scICER paper

The codes used to generate results in scICER paper are available at GitHub (https://github.com/yerry77/scICER-workflow).

Understanding the Output

Main Results Object

The scICE_clustering() function returns a list containing:

  • gamma: Resolution parameters for each cluster number
  • labels: Raw clustering arrays used for IC/MEI (no small-cluster merge)
  • ic: Inconsistency scores (lower = more consistent)
  • ic_vec: Bootstrap IC distributions
  • n_cluster: Number of clusters tested
  • best_labels: Final best labels (single post-optimization merge when min_cluster_size > 1)
  • consistent_clusters: Cluster numbers meeting consistency threshold
  • mei: Mutual Element-wise Information scores
  • analysis_mode: "cluster_range" or "resolution"
  • resolution_input: Manual gamma values requested by the user in resolution mode
  • resolution_diagnostics: Per-gamma summary table in resolution mode
  • best_cluster: Returned cluster number with the lowest IC score
  • best_resolution: Gamma corresponding to best_cluster

scICE_clustering() returns a lightweight results object and does not store the full Seurat input object. When you need metadata columns added back, pass the original Seurat object explicitly:

seurat_obj <- get_robust_labels(results, return_seurat = TRUE, object = seurat_obj)

In resolution mode, resolution_diagnostics keeps one row per input gamma, while the main result object is deduplicated by final cluster number so that each returned cluster number retains only the lowest-IC gamma.

Interpretation Guidelines

  • IC < 1.005: Highly consistent clustering (recommended), less than 0.5% of cells is inconsistent.
  • IC 1.005-1.01: Moderately consistent clustering, 0.5% to 1% of cells is inconsistent.
  • IC > 1.01: Low consistency (not recommended), more than 1% of cells is inconsistent.

Visualization Outputs

  • IC Plot: Shows consistency scores across cluster numbers
  • Stability Plot: Displays bootstrap distributions
  • UMAP/t-SNE: Visualize clusters with consistent labels

Performance Tips

For Large Datasets (>50k cells)

# Get recommended parameters based on dataset size
recommended <- get_recommended_parameters(ncol(large_seurat_obj))

# Optimized settings for large datasets
results <- scICE_clustering(
  object = large_seurat_obj,
  cluster_range = 5:15,      # Focused range
  n_trials = 8,              # Fewer trials
  n_bootstrap = 50,          # Fewer bootstrap samples
  n_workers = detectCores() - 1,  # Maximum parallel processing
  seed = 42                  # For reproducibility
)

For Very Large Seurat Objects

For very large Seurat objects, the biggest avoidable memory cost is keeping the full object in memory during the entire scICER run. scICE_clustering() only needs the graph and cell names, so a practical pattern is:

  1. read the original object with qs::qread();
  2. build a lightweight Seurat object that keeps only one feature plus the target graph;
  3. remove the original object and run gc() before clustering;
  4. run scICE_clustering() on the lightweight object;
  5. reload the original object and add the chosen labels back with get_robust_labels();
  6. save both the results object and the augmented Seurat object with qs::qsave().
library(Seurat)
library(scICER)
library(qs)

input_qs <- "/path/to/very_large_object.qs"
graph_name <- "RNA_snn"
qread_threads <- 40

full_obj <- qs::qread(input_qs, nthread = qread_threads)

# Keep one feature plus the graph so the clustering phase does not hold the
# full assay/reduction payload in memory.
assay_name <- DefaultAssay(full_obj)
keep_feature <- rownames(full_obj[[assay_name]])[1]
small_obj <- DietSeurat(
  object = full_obj,
  assays = assay_name,
  graphs = graph_name,
  dimreducs = NULL,
  features = keep_feature,
  misc = FALSE
)

rm(full_obj)
gc()

results <- scICE_clustering(
  object = small_obj,
  graph_name = graph_name,
  cluster_range = 4:5,       # Keep the range focused for huge objects
  n_trials = 4,              # Start small and increase only if needed
  n_bootstrap = 20,
  n_workers = 40,
  remove_threshold = Inf,
  seed = 123,
  verbose = TRUE
)

rm(small_obj)
gc()

full_obj <- qs::qread(input_qs, nthread = qread_threads)
full_obj <- get_robust_labels(
  results,
  return_seurat = TRUE,
  object = full_obj,
  threshold = 1.01
)

qs::qsave(results, "scice_results.qs", preset = "balanced")
qs::qsave(full_obj, "very_large_object_with_scice.qs", preset = "balanced")

If this is still too slow, reduce cluster_range further or test a small set of user-provided resolution values directly instead of scanning a wide target cluster range. For the bundled background script, you can switch to manual resolution mode without editing the file:

SCICE_RESOLUTION=0.01,0.02 \
SCICE_N_TRIALS=2 \
SCICE_N_BOOTSTRAP=10 \
Rscript examples/run_scice.R --input_qs input.qs --output_qs augmented_with_scice.qs

Memory Management

  • Reduce n_trials and n_bootstrap for memory-limited systems
  • Use focused cluster_range based on biological expectations
  • Install the leiden package for improved performance
  • Use get_recommended_parameters() for optimal settings

Troubleshooting

Common Issues

  1. "Graph not found" error:

    • Run FindNeighbors() on your Seurat object first
    • Use check_seurat_ready() to validate preprocessing
    • By default, scICER uses the SNN graph from the active assay (e.g., "RNA_snn")
    • You can check available graphs with names(your_seurat_object@graphs)
    • Specify a different graph with the graph_name parameter if needed
  2. No consistent clusters found:

    • Try adjusting ic_threshold (e.g., 1.01)
    • Expand cluster_range
    • Check preprocessing quality
    • Use create_results_summary() for detailed analysis
  3. Memory issues:

    • Reduce n_workers, n_trials, or cluster_range
    • Use get_recommended_parameters() for guidance
  4. Slow performance:

    • Install the leiden package
    • Increase n_workers
    • Use recommended parameters for your dataset size
  5. Reproducibility issues:

    • Always set the seed parameter
    • Keep all other parameters constant
    • Document parameter values used

Performance Optimization

# Check if leiden package is available
if (requireNamespace("leiden", quietly = TRUE)) {
  message("leiden package detected - enhanced performance available")
} else {
  message("Consider installing 'leiden' package for better performance")
}

# Validate Seurat object preprocessing
check_seurat_ready(seurat_obj)

# Get recommended parameters
params <- get_recommended_parameters(ncol(seurat_obj))

Citation

If you use scICER in your research, please cite:

Cao, X. (2024). scICER: Single-cell Inconsistency-based Clustering Evaluation in R. 
R package version 1.1.0. https://github.com/ATPs/scICER

The original scICE algorithm citation will be added upon publication

Q&A

How does scICER select the final clustering solution?

For each number of clusters (e.g., when testing 3-10 clusters):

  1. The algorithm runs n_trials independent clustering attempts (default: 15)
  2. Each trial uses the Leiden algorithm with different random initializations
  3. For each cluster number:
    • Calculates pairwise similarities between all trial results using Element-Centric Similarity (ECS)
    • Computes an Inconsistency (IC) score measuring how different the clustering solutions are
    • A lower IC score indicates more consistent clustering across trials
  4. For cluster numbers with IC scores below your threshold (e.g., < 1.05):
    • Selects the "best" clustering solution from the trials
    • The "best" solution is the one with highest average similarity to other solutions (most representative)
    • This becomes the reported clustering for that number of clusters

So when you get results showing clusters 3-10 have IC scores < 1.05, it means:

  • Each of these cluster numbers produced stable results across different random initializations
  • The reported clustering solution is the most representative one from all trials
  • You can be confident in using these clustering solutions for downstream analysis

If you use resolution = c(...), the same IC logic is applied directly to each supplied gamma. When multiple gamma values collapse to the same final cluster number, scICER keeps the lowest-IC one in the main result and stores all evaluated gamma values in resolution_diagnostics.

Will increasing n_trials or n_bootstrap improve results?

n_trials:

  • Increasing n_trials (default: 15) can improve the robustness of results
  • More trials = more chances to find truly stable clustering solutions
  • For large datasets, you might want to reduce trials to improve speed
  • Recommended ranges:
    • Small datasets (<10k cells): 15-20 trials
    • Medium datasets (10k-50k cells): 10-15 trials
    • Large datasets (>50k cells): 8-10 trials

n_bootstrap:

  • Increasing n_bootstrap (default: 100) improves confidence in IC score estimates
  • More bootstrap iterations = more accurate stability assessment
  • The default 100 iterations is usually sufficient
  • Consider increasing for:
    • Critical analyses requiring high confidence
    • Publications or benchmark studies
    • Unusual or complex datasets
  • Recommended ranges:
    • Standard analysis: 100 iterations
    • High-confidence analysis: 200-500 iterations
    • Note: Higher values increase computation time

Remember: Quality preprocessing (normalization, feature selection, dimensionality reduction) often has more impact than increasing these parameters.

Support and Contributing

License

This package is licensed under the MIT License. See LICENSE file for details.

Acknowledgments

  • Original scICE Julia package developers
  • Seurat team for the excellent single-cell framework
  • R community for the robust package ecosystem

Related Resources

About

No description, website, or topics provided.

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages