scICER: Single-cell Inconsistency-based Clustering Evaluation in R

Overview

scICER is an R package that implements a systematic and efficient workflow to evaluate clustering consistency in single-cell RNA-seq data. This is the R port of the original scICE Julia package, designed to be fully compatible with the Seurat ecosystem.

The AnnData/Python counterpart is scICEpy.

Implementation details are documented in design.md.

scICER paper

The codes used to generate results in scICER paper are available at GitHub (https://github.com/ATPs/scICER-workflow).

Key Features

🔬 Automated Cluster Evaluation: Systematically tests multiple cluster numbers to find consistent results
⚡ Element-Centric Similarity (ECS): Efficient algorithm for comparing clustering results
🧬 Seurat Integration: Seamless integration with Seurat objects and workflows
🚀 Parallel Processing: Leverages multiple cores for faster computation
📊 Comprehensive Visualization: Built-in plotting functions for result interpretation
📈 Robust Statistics: Bootstrap-based confidence intervals and stability assessment
🎯 Reproducible Analysis: Optional seed parameter for deterministic results
🔍 Smart Parameter Recommendations: Automatically suggests optimal parameters based on dataset size
✅ Preprocessing Validation: Checks and guides users through proper Seurat preprocessing
📋 Comprehensive Reporting: Generates detailed analysis summaries with interpretation guidance

Algorithm Overview

scICER evaluates clustering consistency by:

Multiple Clustering: Runs Leiden clustering multiple times with different parameters
Binary Search: Automatically finds resolution parameters for target cluster numbers
ECS Calculation: Uses Element-Centric Similarity to efficiently compare clustering results
Optimization: Iteratively refines clustering to minimize inconsistency
Bootstrap Validation: Assesses stability through bootstrap sampling
Threshold Application: Identifies cluster numbers meeting consistency criteria
Reproducibility Control: Optional seed parameter for deterministic results

Performance Considerations vs scICE Julia package

Speed: R implementation typically runs slower than the native Julia version
Memory: Requires more memory for equivalent analyses
Accuracy: While highly consistent, results may differ slightly from the original scICE due to implementation differences in R vs Julia, random number generation variations, and floating-point precision handling.

Note: These tradeoffs provide better integration with the Seurat ecosystem and R workflows

Dependencies

scICER requires the ClustAssess library for Element-Centric Similarity (ECS) calculations, which provides:

~150x faster performance for similarity calculations
Professional, well-tested implementation by Cambridge Core Bioinformatics
Identical results to reference implementations

ClustAssess will be automatically installed when you install scICER.

Installation

Prerequisites

Ensure you have R ≥ 4.0.0 and the required dependencies:

# Install required CRAN packages
install.packages(c(
  "Seurat", "SeuratObject", "igraph", "Matrix", 
  "parallel", "foreach", "doParallel", "dplyr", 
  "ggplot2", "devtools", "ClustAssess"
))

# Optional: Install leiden package for better performance
install.packages("leiden")

Install scICER from GitHub

# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")

# Install scICER from GitHub
devtools::install_github("ATPs/scICER", upgrade = "never")#avoid upgrade packages and use existing ones

Load the package

library(scICER)
library(Seurat)

Quick Start

Basic Workflow

# Load required packages
library(scICER)
library(Seurat)
library(foreach)
library(doParallel)
library(ggplot2)

# Set up a parallel worker budget (optional, but recommended)
n_workers <- 4  # start conservatively and adjust based on your system

# Load your Seurat object
data(pbmc_small)  # Example dataset
seurat_obj <- pbmc_small

# Ensure preprocessing is complete
seurat_obj <- NormalizeData(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj)  
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:20)

# Run scICER analysis
scice_results <- scICE_clustering(
  object = seurat_obj,
  cluster_range = 3:20,
  n_workers = n_workers,
  min_cluster_size = 2,  # Default; consistent with Seurat singleton behavior
  remove_threshold = Inf,
  verbose = TRUE
)

# Or evaluate manually chosen gamma values directly
manual_results <- scICE_clustering(
  object = seurat_obj,
  resolution = c(0.2, 0.4, 0.8),
  n_workers = n_workers,
  min_cluster_size = 2,
  verbose = TRUE
)

# For reproducible results, use the seed parameter
scice_results <- scICE_clustering(
  object = seurat_obj,
  cluster_range = 3:20,
  n_workers = n_workers,
  min_cluster_size = 2,
  remove_threshold = Inf,
  verbose = TRUE,
  seed = 123  # Set seed for reproducible results
)

# Visualize results
plot_ic(scice_results, threshold = 1.005)

# Extract consistent clustering labels
seurat_obj <- get_robust_labels(scice_results, return_seurat = TRUE, object = seurat_obj)

# Visualize with consistent clusters, cluster range from 3 to 15
xl.fig <- list()
for (n in 3:15) {
  if (paste0("clusters_", n) %in% colnames(seurat_obj@meta.data)){
    xl.fig[[paste0("clusters_", n)]] <- DimPlot(seurat_obj, group.by = paste0("clusters_", n), label = TRUE) + ggtitle(paste("Clusters:", n))
  }
}

cowplot::plot_grid(plotlist = xl.fig, ncol = 3)

Documentation Notes

README.md: quick installation, common usage patterns, and practical examples.
design.md: implementation-oriented design document for scICE_clustering(). It explains the full workflow, parameter semantics, output fields, and important behavior differences between cluster_range mode and manual resolution mode.
Use design.md when you need to understand how results are produced internally, interpret fields such as resolution_diagnostics, or understand why resolution = old_results$gamma is not a one-to-one replay of a previous cluster_range run.

Background Runs and Verbose Logs

scICE_clustering(..., verbose = TRUE) prints timestamped progress messages like [YYYY-MM-DD HH:MM:SS] ... via message(), which goes to stderr. If you launch jobs in the background and only capture stdout, logs can appear missing.

Use:

Rscript examples/run_scice.R --help
Rscript examples/run_scice.R --input_qs input.qs --output_qs augmented_with_scice.qs > scice.log 2>&1 &
tail -f scice.log

For very large input objects, qs::qread() can be silent for a while before scICER starts. The provided run_scice.R script prints timestamped checkpoints before and after each major step, builds a lightweight Seurat object that keeps two features plus the requested graph for the clustering phase, deletes the original large object before scICE_clustering(), then reloads the original object only for metadata merge. The final augmented Seurat object is saved with qs. The script accepts both environment variables and explicit CLI flags such as --qread_threads, --n_workers, --cluster_range, and --resolution. If --output_qs is omitted, it defaults to --input_qs.

Worker and CPU Usage

n_workers is a scICER worker budget, not a strict upper bound on total CPU cores or OS processes. In cluster_range mode, scICER can run outer workers for different target cluster numbers and nested workers for gamma evaluation, iterative refinement, or bootstrap steps. These nested parallel::mclapply() calls can make the observed process count or CPU usage higher than n_workers. Native libraries used by R packages, especially igraph/Leiden backends, may also use OpenMP or BLAS threads unless they are limited by the environment.

Most users do not need to tune these settings for ordinary local runs. They matter mainly on shared servers, scheduled jobs, or environments with strict CPU limits.

Practical recommendations:

Set n_workers below your scheduler CPU allocation when using broad cluster_range values or large n_trials/n_bootstrap.
Limit native library threading before starting R, for example:

export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export BLIS_NUM_THREADS=1
Rscript examples/run_scice.R --input_qs input.qs --output_qs output.qs

Remember that qs::qread(..., nthread = ...) or the script option --qread_threads is separate from n_workers.
If your environment enforces a hard CPU cap, prefer a smaller n_workers, narrower cluster_range, and lower n_trials/n_bootstrap, then scale up after checking actual CPU and process counts.

Detailed Usage

1. Standard Analysis

# Comprehensive scICER analysis
results <- scICE_clustering(
  object = your_seurat_object,
  cluster_range = 2:20,           # Range of cluster numbers to test
  n_workers = 8,                  # scICER worker budget
  n_trials = 15,                  # Clustering trials per resolution
  n_bootstrap = 100,              # Bootstrap iterations
  seed = 42,                      # Optional: Set for reproducible results
  ic_threshold = 1.005,           # Consistency threshold
  min_cluster_size = 2,           # Default minimum cells per effective cluster (set 1 to disable)
  verbose = TRUE                  # Progress messages
)

2. Advanced Parameter Customization

# Fine-tuned parameters for large datasets
results <- scICE_clustering(
  object = large_seurat_object,
  graph_name = "RNA_snn",         # Optional: specify graph name
  cluster_range = 5:12,           # Focused range for efficiency
  objective_function = "CPM",     # "CPM" or "modularity"
  beta = 0.1,                     # Leiden beta parameter
  n_iterations = 10,              # Initial Leiden iterations
  max_iterations = 150,           # Maximum optimization iterations
  remove_threshold = 1.15,        # Filter inconsistent results
  min_cluster_size = 2,           # Default threshold; final merge only on best_labels
  resolution_tolerance = 1e-8,    # Resolution search precision
  n_trials = 10,                  # Reduced for speed
  n_bootstrap = 50,               # Reduced for speed
  seed = 12345                    # For reproducible results
)

It's OK to set a large remove_threshold value, because plot_ic and get_robust_labels all have default threshold = 1.005. min_cluster_size = 2 approximates Seurat singleton grouping; use min_cluster_size = 1 to keep raw Leiden clusters. labels stay raw for IC/MEI, while best_labels receives one final merge pass when min_cluster_size > 1. resolution can be a single gamma or a vector of gamma values. When resolution is provided, scICER skips the cluster_range resolution search, evaluates the supplied gamma values directly, and keeps the lowest-IC result for each final cluster number. If both resolution and cluster_range are provided, resolution takes priority and cluster_range is ignored with an informational message.

3. Manual Resolution Mode

# Evaluate one user-provided gamma
result_single_resolution <- scICE_clustering(
  object = your_seurat_object,
  resolution = 0.2,
  n_trials = 15,
  n_bootstrap = 100,
  min_cluster_size = 2,
  seed = 42
)

# Evaluate several user-provided gamma values
results_manual <- scICE_clustering(
  object = your_seurat_object,
  resolution = c(0.1, 0.2, 0.4, 0.8),
  n_trials = 15,
  n_bootstrap = 100,
  min_cluster_size = 2,
  seed = 42
)

# Plot the retained results
plot_ic(results_manual, threshold = 1.005)

# Per-gamma diagnostics for all user-supplied resolutions
results_manual$resolution_diagnostics

# Lowest-IC returned solution
results_manual$best_cluster
results_manual$best_resolution

# resolution takes priority if cluster_range is also supplied
results_manual2 <- scICE_clustering(
  object = your_seurat_object,
  cluster_range = 3:10,
  resolution = c(0.15, 0.3, 0.6),
  n_trials = 15,
  n_bootstrap = 100,
  min_cluster_size = 2,
  seed = 42
)

4. Visualization and Results

# Plot Inconsistency Coefficient (IC) scores
plot_ic(results, 
        threshold = 1.005,
        title = "Clustering Consistency Analysis")

# Plot stability analysis
plot_stability(results)

# Extract summary of consistent clusters
summary <- extract_consistent_clusters(results, threshold = 1.005)
print(summary)

# Get clustering labels as data frame
labels_df <- get_robust_labels(results)

# Add results directly to Seurat object
seurat_obj <- get_robust_labels(results, return_seurat = TRUE, object = seurat_obj)

4. Integration with Seurat Workflow

# Complete Seurat + scICER workflow
seurat_obj <- CreateSeuratObject(counts = your_counts)
seurat_obj <- NormalizeData(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj)
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:20)

# Run scICER with reproducibility
scice_results <- scICE_clustering(
  seurat_obj, 
  cluster_range = 3:15,
  seed = 42  # For reproducible results
)

# Add consistent clusters to Seurat object
seurat_obj <- get_robust_labels(scice_results, return_seurat = TRUE, object = seurat_obj)

# Continue with standard Seurat analysis
seurat_obj <- RunUMAP(seurat_obj, dims = 1:20)
DimPlot(seurat_obj, group.by = "clusters_8")  # Use consistent clustering

# Find markers for consistent clusters
Idents(seurat_obj) <- "clusters_8"
markers <- FindAllMarkers(seurat_obj)

Understanding the Output

Main Results Object

The scICE_clustering() function returns a list containing:

gamma: Resolution parameters for each cluster number
labels: Raw clustering arrays used for IC/MEI (no small-cluster merge)
ic: Inconsistency scores (lower = more consistent)
ic_vec: Bootstrap IC distributions
n_cluster: Number of clusters tested
best_labels: Final best labels (single post-optimization merge when min_cluster_size > 1)
consistent_clusters: Cluster numbers meeting consistency threshold
mei: Mutual Element-wise Information scores
analysis_mode: "cluster_range" or "resolution"
resolution_input: Manual gamma values requested by the user in resolution mode
resolution_diagnostics: Per-gamma summary table in resolution mode
best_cluster: Returned cluster number with the lowest IC score
best_resolution: Gamma corresponding to best_cluster

scICE_clustering() returns a lightweight results object and does not store the full Seurat input object. When you need metadata columns added back, pass the original Seurat object explicitly:

seurat_obj <- get_robust_labels(results, return_seurat = TRUE, object = seurat_obj)

In resolution mode, resolution_diagnostics keeps one row per input gamma, while the main result object is deduplicated by final cluster number so that each returned cluster number retains only the lowest-IC gamma.

Interpretation Guidelines

IC < 1.005: Highly consistent clustering (recommended), less than 0.5% of cells is inconsistent.
IC 1.005-1.01: Moderately consistent clustering, 0.5% to 1% of cells is inconsistent.
IC > 1.01: Low consistency (not recommended), more than 1% of cells is inconsistent.

Visualization Outputs

IC Plot: Shows consistency scores across cluster numbers
Stability Plot: Displays bootstrap distributions
UMAP/t-SNE: Visualize clusters with consistent labels

Performance Tips

For Large Datasets (>50k cells)

# Get recommended parameters based on dataset size
recommended <- get_recommended_parameters(ncol(large_seurat_obj))

# Optimized settings for large datasets
results <- scICE_clustering(
  object = large_seurat_obj,
  cluster_range = 5:15,      # Focused range
  n_trials = 8,              # Fewer trials
  n_bootstrap = 50,          # Fewer bootstrap samples
  n_workers = detectCores() - 1,  # Worker budget; actual CPU use can be higher
  seed = 42                  # For reproducibility
)

For Very Large Seurat Objects

For very large Seurat objects, the biggest avoidable memory cost is keeping the full object in memory during the entire scICER run. scICE_clustering() only needs the graph and cell names, so a practical pattern is:

read the original object with qs::qread();
build a lightweight Seurat object that keeps two features plus the target graph;
remove the original object and run gc() before clustering;
run scICE_clustering() on the lightweight object;
reload the original object and add the chosen labels back with get_robust_labels();
save both the results object and the augmented Seurat object with qs::qsave().

library(Seurat)
library(scICER)
library(qs)

input_qs <- "/path/to/very_large_object.qs"
graph_name <- "RNA_snn"
qread_threads <- 40

full_obj <- qs::qread(input_qs, nthread = qread_threads)

# Keep two features plus the graph so the clustering phase does not hold the
# full assay/reduction payload in memory. Some Seurat v5 Assay5 objects fail
# when DietSeurat() keeps only one feature.
assay_name <- DefaultAssay(full_obj)
keep_features <- head(rownames(full_obj[[assay_name]]), 2L)
if (length(keep_features) < 2L || anyNA(keep_features) || any(!nzchar(keep_features))) {
  stop("Need at least two valid feature names for the lightweight Seurat object.")
}
small_obj <- DietSeurat(
  object = full_obj,
  assays = assay_name,
  graphs = graph_name,
  dimreducs = NULL,
  features = keep_features,
  misc = FALSE
)

rm(full_obj)
gc()

results <- scICE_clustering(
  object = small_obj,
  graph_name = graph_name,
  cluster_range = 4:5,       # Keep the range focused for huge objects
  n_trials = 4,              # Start small and increase only if needed
  n_bootstrap = 20,
  n_workers = 40,
  remove_threshold = Inf,
  seed = 123,
  verbose = TRUE
)

rm(small_obj)
gc()

full_obj <- qs::qread(input_qs, nthread = qread_threads)
full_obj <- get_robust_labels(
  results,
  return_seurat = TRUE,
  object = full_obj,
  threshold = 1.01
)

qs::qsave(results, "scice_results.qs", preset = "balanced")
qs::qsave(full_obj, "very_large_object_with_scice.qs", preset = "balanced")

If this is still too slow, reduce cluster_range further or test a small set of user-provided resolution values directly instead of scanning a wide target cluster range. For the bundled background script, you can switch to manual resolution mode without editing the file:

SCICE_RESOLUTION=0.01,0.02 \
SCICE_N_TRIALS=2 \
SCICE_N_BOOTSTRAP=10 \
Rscript examples/run_scice.R --input_qs input.qs --output_qs augmented_with_scice.qs

Memory Management

Reduce n_trials and n_bootstrap for memory-limited systems
Use focused cluster_range based on biological expectations
Install the leiden package for improved performance
Use get_recommended_parameters() for optimal settings

Troubleshooting

Common Issues

"Graph not found" error:
- Run FindNeighbors() on your Seurat object first
- Use check_seurat_ready() to validate preprocessing
- By default, scICER uses the SNN graph from the active assay (e.g., "RNA_snn")
- You can check available graphs with names(your_seurat_object@graphs)
- Specify a different graph with the graph_name parameter if needed
No consistent clusters found:
- Try adjusting ic_threshold (e.g., 1.01)
- Expand cluster_range
- Check preprocessing quality
- Use create_results_summary() for detailed analysis
Memory issues:
- Reduce n_workers, n_trials, or cluster_range
- Use get_recommended_parameters() for guidance
Slow performance:
- Install the leiden package
- Increase n_workers
- Use recommended parameters for your dataset size
Reproducibility issues:
- Always set the seed parameter
- Keep all other parameters constant, including n_workers
- In cluster_range mode, different n_workers values can produce different results even with the same seed
- Document parameter values used

Performance Optimization

# Check if leiden package is available
if (requireNamespace("leiden", quietly = TRUE)) {
  message("leiden package detected - enhanced performance available")
} else {
  message("Consider installing 'leiden' package for better performance")
}

# Validate Seurat object preprocessing
check_seurat_ready(seurat_obj)

# Get recommended parameters
params <- get_recommended_parameters(ncol(seurat_obj))

Citation

If you use scICER in your research, please cite:

Cao, X. (2024). scICER: Single-cell Inconsistency-based Clustering Evaluation in R. 
R package version 1.1.0. https://github.com/ATPs/scICER

The original scICE algorithm citation will be added upon publication

Q&A

Why can results differ when I change `n_workers` even with the same `seed`?

In cluster_range mode, n_workers can affect the resolution-search probe grid and search path. A fixed seed controls randomness within a given run configuration, but it does not guarantee identical results after changing n_workers. For exact reruns, keep n_workers and other analysis parameters unchanged.

How does scICER select the final clustering solution?

For each number of clusters (e.g., when testing 3-10 clusters):

The algorithm runs n_trials independent clustering attempts (default: 15)
Each trial uses the Leiden algorithm with different random initializations
For each cluster number:
- Calculates pairwise similarities between all trial results using Element-Centric Similarity (ECS)
- Computes an Inconsistency (IC) score measuring how different the clustering solutions are
- A lower IC score indicates more consistent clustering across trials
For cluster numbers with IC scores below your threshold (e.g., < 1.05):
- Selects the "best" clustering solution from the trials
- The "best" solution is the one with highest average similarity to other solutions (most representative)
- This becomes the reported clustering for that number of clusters

So when you get results showing clusters 3-10 have IC scores < 1.05, it means:

Each of these cluster numbers produced stable results across different random initializations
The reported clustering solution is the most representative one from all trials
You can be confident in using these clustering solutions for downstream analysis

If you use resolution = c(...), the same IC logic is applied directly to each supplied gamma. When multiple gamma values collapse to the same final cluster number, scICER keeps the lowest-IC one in the main result and stores all evaluated gamma values in resolution_diagnostics.

Will increasing n_trials or n_bootstrap improve results?

n_trials:

Increasing n_trials (default: 15) can improve the robustness of results
More trials = more chances to find truly stable clustering solutions
For large datasets, you might want to reduce trials to improve speed
Recommended ranges:
- Small datasets (<10k cells): 15-20 trials
- Medium datasets (10k-50k cells): 10-15 trials
- Large datasets (>50k cells): 8-10 trials

n_bootstrap:

Increasing n_bootstrap (default: 100) improves confidence in IC score estimates
More bootstrap iterations = more accurate stability assessment
The default 100 iterations is usually sufficient
Consider increasing for:
- Critical analyses requiring high confidence
- Publications or benchmark studies
- Unusual or complex datasets
Recommended ranges:
- Standard analysis: 100 iterations
- High-confidence analysis: 200-500 iterations
- Note: Higher values increase computation time

Remember: Quality preprocessing (normalization, feature selection, dimensionality reduction) often has more impact than increasing these parameters.

Support and Contributing

📋 Issues: Report bugs or request features on GitHub Issues
📧 Contact: atps@outlook.com
🤝 Contributing: Pull requests welcome!

License

This package is licensed under the MIT License. See LICENSE file for details.

Acknowledgments

Original scICE Julia package developers
Seurat team for the excellent single-cell framework
R community for the robust package ecosystem

Name		Name	Last commit message	Last commit date
Latest commit History 115 Commits
R		R
examples		examples
man		man
tests		tests
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
LICENSE.md		LICENSE.md
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README.md		README.md
design.md		design.md
install.R		install.R
install_and_test.R		install_and_test.R

Folders and files

Latest commit

History

Repository files navigation

scICER: Single-cell Inconsistency-based Clustering Evaluation in R

Overview

scICER paper

Key Features

Algorithm Overview

Performance Considerations vs scICE Julia package

Dependencies

Installation

Prerequisites

Install scICER from GitHub

Load the package

Quick Start

Basic Workflow

Documentation Notes

Background Runs and Verbose Logs

Worker and CPU Usage

Detailed Usage

1. Standard Analysis

2. Advanced Parameter Customization

3. Manual Resolution Mode

4. Visualization and Results

4. Integration with Seurat Workflow

Understanding the Output

Main Results Object

Interpretation Guidelines

Visualization Outputs

Performance Tips

For Large Datasets (>50k cells)

For Very Large Seurat Objects

Memory Management

Troubleshooting

Common Issues

Performance Optimization

Citation

Q&A

Why can results differ when I change n_workers even with the same seed?

How does scICER select the final clustering solution?

Will increasing n_trials or n_bootstrap improve results?

Support and Contributing

License

Acknowledgments

Related Resources

About

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases 10

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Why can results differ when I change `n_workers` even with the same `seed`?

Packages