scICER is an R package that implements a systematic and efficient workflow to evaluate clustering consistency in single-cell RNA-seq data. This is the R port of the original scICE Julia package, designed to be fully compatible with the Seurat ecosystem.
The AnnData/Python counterpart is scICEpy.
Implementation details are documented in design.md.
The codes used to generate results in scICER paper are available at GitHub (https://github.com/ATPs/scICER-workflow).
- π¬ Automated Cluster Evaluation: Systematically tests multiple cluster numbers to find consistent results
- β‘ Element-Centric Similarity (ECS): Efficient algorithm for comparing clustering results
- 𧬠Seurat Integration: Seamless integration with Seurat objects and workflows
- π Parallel Processing: Leverages multiple cores for faster computation
- π Comprehensive Visualization: Built-in plotting functions for result interpretation
- π Robust Statistics: Bootstrap-based confidence intervals and stability assessment
- π― Reproducible Analysis: Optional seed parameter for deterministic results
- π Smart Parameter Recommendations: Automatically suggests optimal parameters based on dataset size
- β Preprocessing Validation: Checks and guides users through proper Seurat preprocessing
- π Comprehensive Reporting: Generates detailed analysis summaries with interpretation guidance
scICER evaluates clustering consistency by:
- Multiple Clustering: Runs Leiden clustering multiple times with different parameters
- Binary Search: Automatically finds resolution parameters for target cluster numbers
- ECS Calculation: Uses Element-Centric Similarity to efficiently compare clustering results
- Optimization: Iteratively refines clustering to minimize inconsistency
- Bootstrap Validation: Assesses stability through bootstrap sampling
- Threshold Application: Identifies cluster numbers meeting consistency criteria
- Reproducibility Control: Optional seed parameter for deterministic results
Performance Considerations vs scICE Julia package
- Speed: R implementation typically runs slower than the native Julia version
- Memory: Requires more memory for equivalent analyses
- Accuracy: While highly consistent, results may differ slightly from the original scICE due to implementation differences in R vs Julia, random number generation variations, and floating-point precision handling.
Note: These tradeoffs provide better integration with the Seurat ecosystem and R workflows
scICER requires the ClustAssess library for Element-Centric Similarity (ECS) calculations, which provides:
- ~150x faster performance for similarity calculations
- Professional, well-tested implementation by Cambridge Core Bioinformatics
- Identical results to reference implementations
ClustAssess will be automatically installed when you install scICER.
Ensure you have R β₯ 4.0.0 and the required dependencies:
# Install required CRAN packages
install.packages(c(
"Seurat", "SeuratObject", "igraph", "Matrix",
"parallel", "foreach", "doParallel", "dplyr",
"ggplot2", "devtools", "ClustAssess"
))
# Optional: Install leiden package for better performance
install.packages("leiden")# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
# Install scICER from GitHub
devtools::install_github("ATPs/scICER", upgrade = "never")#avoid upgrade packages and use existing oneslibrary(scICER)
library(Seurat)# Load required packages
library(scICER)
library(Seurat)
library(foreach)
library(doParallel)
library(ggplot2)
# Set up a parallel worker budget (optional, but recommended)
n_workers <- 4 # start conservatively and adjust based on your system
# Load your Seurat object
data(pbmc_small) # Example dataset
seurat_obj <- pbmc_small
# Ensure preprocessing is complete
seurat_obj <- NormalizeData(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj)
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:20)
# Run scICER analysis
scice_results <- scICE_clustering(
object = seurat_obj,
cluster_range = 3:20,
n_workers = n_workers,
min_cluster_size = 2, # Default; consistent with Seurat singleton behavior
remove_threshold = Inf,
verbose = TRUE
)
# Or evaluate manually chosen gamma values directly
manual_results <- scICE_clustering(
object = seurat_obj,
resolution = c(0.2, 0.4, 0.8),
n_workers = n_workers,
min_cluster_size = 2,
verbose = TRUE
)
# For reproducible results, use the seed parameter
scice_results <- scICE_clustering(
object = seurat_obj,
cluster_range = 3:20,
n_workers = n_workers,
min_cluster_size = 2,
remove_threshold = Inf,
verbose = TRUE,
seed = 123 # Set seed for reproducible results
)
# Visualize results
plot_ic(scice_results, threshold = 1.005)
# Extract consistent clustering labels
seurat_obj <- get_robust_labels(scice_results, return_seurat = TRUE, object = seurat_obj)
# Visualize with consistent clusters, cluster range from 3 to 15
xl.fig <- list()
for (n in 3:15) {
if (paste0("clusters_", n) %in% colnames(seurat_obj@meta.data)){
xl.fig[[paste0("clusters_", n)]] <- DimPlot(seurat_obj, group.by = paste0("clusters_", n), label = TRUE) + ggtitle(paste("Clusters:", n))
}
}
cowplot::plot_grid(plotlist = xl.fig, ncol = 3)README.md: quick installation, common usage patterns, and practical examples.design.md: implementation-oriented design document forscICE_clustering(). It explains the full workflow, parameter semantics, output fields, and important behavior differences betweencluster_rangemode and manualresolutionmode.- Use
design.mdwhen you need to understand how results are produced internally, interpret fields such asresolution_diagnostics, or understand whyresolution = old_results$gammais not a one-to-one replay of a previouscluster_rangerun.
scICE_clustering(..., verbose = TRUE) prints timestamped progress messages like
[YYYY-MM-DD HH:MM:SS] ... via message(), which goes to stderr.
If you launch jobs in the background and only capture stdout, logs can appear missing.
Use:
Rscript examples/run_scice.R --help
Rscript examples/run_scice.R --input_qs input.qs --output_qs augmented_with_scice.qs > scice.log 2>&1 &
tail -f scice.logFor very large input objects, qs::qread() can be silent for a while before scICER starts.
The provided run_scice.R script prints timestamped checkpoints before and after
each major step, builds a lightweight Seurat object that keeps two features
plus the requested graph for the clustering phase, deletes the original large
object before scICE_clustering(), then reloads the original object only for
metadata merge. The final augmented Seurat object is saved with qs.
The script accepts both environment variables and explicit CLI flags such as
--qread_threads, --n_workers, --cluster_range, and --resolution.
If --output_qs is omitted, it defaults to --input_qs.
n_workers is a scICER worker budget, not a strict upper bound on total CPU
cores or OS processes. In cluster_range mode, scICER can run outer workers for
different target cluster numbers and nested workers for gamma evaluation,
iterative refinement, or bootstrap steps. These nested parallel::mclapply()
calls can make the observed process count or CPU usage higher than n_workers.
Native libraries used by R packages, especially igraph/Leiden backends, may also
use OpenMP or BLAS threads unless they are limited by the environment.
Most users do not need to tune these settings for ordinary local runs. They matter mainly on shared servers, scheduled jobs, or environments with strict CPU limits.
Practical recommendations:
- Set
n_workersbelow your scheduler CPU allocation when using broadcluster_rangevalues or largen_trials/n_bootstrap. - Limit native library threading before starting R, for example:
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export BLIS_NUM_THREADS=1
Rscript examples/run_scice.R --input_qs input.qs --output_qs output.qs- Remember that
qs::qread(..., nthread = ...)or the script option--qread_threadsis separate fromn_workers. - If your environment enforces a hard CPU cap, prefer a smaller
n_workers, narrowercluster_range, and lowern_trials/n_bootstrap, then scale up after checking actual CPU and process counts.
# Comprehensive scICER analysis
results <- scICE_clustering(
object = your_seurat_object,
cluster_range = 2:20, # Range of cluster numbers to test
n_workers = 8, # scICER worker budget
n_trials = 15, # Clustering trials per resolution
n_bootstrap = 100, # Bootstrap iterations
seed = 42, # Optional: Set for reproducible results
ic_threshold = 1.005, # Consistency threshold
min_cluster_size = 2, # Default minimum cells per effective cluster (set 1 to disable)
verbose = TRUE # Progress messages
)# Fine-tuned parameters for large datasets
results <- scICE_clustering(
object = large_seurat_object,
graph_name = "RNA_snn", # Optional: specify graph name
cluster_range = 5:12, # Focused range for efficiency
objective_function = "CPM", # "CPM" or "modularity"
beta = 0.1, # Leiden beta parameter
n_iterations = 10, # Initial Leiden iterations
max_iterations = 150, # Maximum optimization iterations
remove_threshold = 1.15, # Filter inconsistent results
min_cluster_size = 2, # Default threshold; final merge only on best_labels
resolution_tolerance = 1e-8, # Resolution search precision
n_trials = 10, # Reduced for speed
n_bootstrap = 50, # Reduced for speed
seed = 12345 # For reproducible results
)It's OK to set a large remove_threshold value, because plot_ic and get_robust_labels all have default threshold = 1.005.
min_cluster_size = 2 approximates Seurat singleton grouping; use min_cluster_size = 1 to keep raw Leiden clusters.
labels stay raw for IC/MEI, while best_labels receives one final merge pass when min_cluster_size > 1.
resolution can be a single gamma or a vector of gamma values. When resolution is provided, scICER skips the cluster_range resolution search, evaluates the supplied gamma values directly, and keeps the lowest-IC result for each final cluster number. If both resolution and cluster_range are provided, resolution takes priority and cluster_range is ignored with an informational message.
# Evaluate one user-provided gamma
result_single_resolution <- scICE_clustering(
object = your_seurat_object,
resolution = 0.2,
n_trials = 15,
n_bootstrap = 100,
min_cluster_size = 2,
seed = 42
)
# Evaluate several user-provided gamma values
results_manual <- scICE_clustering(
object = your_seurat_object,
resolution = c(0.1, 0.2, 0.4, 0.8),
n_trials = 15,
n_bootstrap = 100,
min_cluster_size = 2,
seed = 42
)
# Plot the retained results
plot_ic(results_manual, threshold = 1.005)
# Per-gamma diagnostics for all user-supplied resolutions
results_manual$resolution_diagnostics
# Lowest-IC returned solution
results_manual$best_cluster
results_manual$best_resolution
# resolution takes priority if cluster_range is also supplied
results_manual2 <- scICE_clustering(
object = your_seurat_object,
cluster_range = 3:10,
resolution = c(0.15, 0.3, 0.6),
n_trials = 15,
n_bootstrap = 100,
min_cluster_size = 2,
seed = 42
)# Plot Inconsistency Coefficient (IC) scores
plot_ic(results,
threshold = 1.005,
title = "Clustering Consistency Analysis")
# Plot stability analysis
plot_stability(results)
# Extract summary of consistent clusters
summary <- extract_consistent_clusters(results, threshold = 1.005)
print(summary)
# Get clustering labels as data frame
labels_df <- get_robust_labels(results)
# Add results directly to Seurat object
seurat_obj <- get_robust_labels(results, return_seurat = TRUE, object = seurat_obj)# Complete Seurat + scICER workflow
seurat_obj <- CreateSeuratObject(counts = your_counts)
seurat_obj <- NormalizeData(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj)
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:20)
# Run scICER with reproducibility
scice_results <- scICE_clustering(
seurat_obj,
cluster_range = 3:15,
seed = 42 # For reproducible results
)
# Add consistent clusters to Seurat object
seurat_obj <- get_robust_labels(scice_results, return_seurat = TRUE, object = seurat_obj)
# Continue with standard Seurat analysis
seurat_obj <- RunUMAP(seurat_obj, dims = 1:20)
DimPlot(seurat_obj, group.by = "clusters_8") # Use consistent clustering
# Find markers for consistent clusters
Idents(seurat_obj) <- "clusters_8"
markers <- FindAllMarkers(seurat_obj)The scICE_clustering() function returns a list containing:
gamma: Resolution parameters for each cluster numberlabels: Raw clustering arrays used for IC/MEI (no small-cluster merge)ic: Inconsistency scores (lower = more consistent)ic_vec: Bootstrap IC distributionsn_cluster: Number of clusters testedbest_labels: Final best labels (single post-optimization merge whenmin_cluster_size > 1)consistent_clusters: Cluster numbers meeting consistency thresholdmei: Mutual Element-wise Information scoresanalysis_mode:"cluster_range"or"resolution"resolution_input: Manual gamma values requested by the user inresolutionmoderesolution_diagnostics: Per-gamma summary table inresolutionmodebest_cluster: Returned cluster number with the lowest IC scorebest_resolution: Gamma corresponding tobest_cluster
scICE_clustering() returns a lightweight results object and does not store the full
Seurat input object. When you need metadata columns added back, pass the original
Seurat object explicitly:
seurat_obj <- get_robust_labels(results, return_seurat = TRUE, object = seurat_obj)In resolution mode, resolution_diagnostics keeps one row per input gamma, while the main result object is deduplicated by final cluster number so that each returned cluster number retains only the lowest-IC gamma.
- IC < 1.005: Highly consistent clustering (recommended), less than 0.5% of cells is inconsistent.
- IC 1.005-1.01: Moderately consistent clustering, 0.5% to 1% of cells is inconsistent.
- IC > 1.01: Low consistency (not recommended), more than 1% of cells is inconsistent.
- IC Plot: Shows consistency scores across cluster numbers
- Stability Plot: Displays bootstrap distributions
- UMAP/t-SNE: Visualize clusters with consistent labels
# Get recommended parameters based on dataset size
recommended <- get_recommended_parameters(ncol(large_seurat_obj))
# Optimized settings for large datasets
results <- scICE_clustering(
object = large_seurat_obj,
cluster_range = 5:15, # Focused range
n_trials = 8, # Fewer trials
n_bootstrap = 50, # Fewer bootstrap samples
n_workers = detectCores() - 1, # Worker budget; actual CPU use can be higher
seed = 42 # For reproducibility
)For very large Seurat objects, the biggest avoidable memory cost is keeping the
full object in memory during the entire scICER run. scICE_clustering() only
needs the graph and cell names, so a practical pattern is:
- read the original object with
qs::qread(); - build a lightweight Seurat object that keeps two features plus the target graph;
- remove the original object and run
gc()before clustering; - run
scICE_clustering()on the lightweight object; - reload the original object and add the chosen labels back with
get_robust_labels(); - save both the results object and the augmented Seurat object with
qs::qsave().
library(Seurat)
library(scICER)
library(qs)
input_qs <- "/path/to/very_large_object.qs"
graph_name <- "RNA_snn"
qread_threads <- 40
full_obj <- qs::qread(input_qs, nthread = qread_threads)
# Keep two features plus the graph so the clustering phase does not hold the
# full assay/reduction payload in memory. Some Seurat v5 Assay5 objects fail
# when DietSeurat() keeps only one feature.
assay_name <- DefaultAssay(full_obj)
keep_features <- head(rownames(full_obj[[assay_name]]), 2L)
if (length(keep_features) < 2L || anyNA(keep_features) || any(!nzchar(keep_features))) {
stop("Need at least two valid feature names for the lightweight Seurat object.")
}
small_obj <- DietSeurat(
object = full_obj,
assays = assay_name,
graphs = graph_name,
dimreducs = NULL,
features = keep_features,
misc = FALSE
)
rm(full_obj)
gc()
results <- scICE_clustering(
object = small_obj,
graph_name = graph_name,
cluster_range = 4:5, # Keep the range focused for huge objects
n_trials = 4, # Start small and increase only if needed
n_bootstrap = 20,
n_workers = 40,
remove_threshold = Inf,
seed = 123,
verbose = TRUE
)
rm(small_obj)
gc()
full_obj <- qs::qread(input_qs, nthread = qread_threads)
full_obj <- get_robust_labels(
results,
return_seurat = TRUE,
object = full_obj,
threshold = 1.01
)
qs::qsave(results, "scice_results.qs", preset = "balanced")
qs::qsave(full_obj, "very_large_object_with_scice.qs", preset = "balanced")If this is still too slow, reduce cluster_range further or test a small set
of user-provided resolution values directly instead of scanning a wide target
cluster range. For the bundled background script, you can switch to manual
resolution mode without editing the file:
SCICE_RESOLUTION=0.01,0.02 \
SCICE_N_TRIALS=2 \
SCICE_N_BOOTSTRAP=10 \
Rscript examples/run_scice.R --input_qs input.qs --output_qs augmented_with_scice.qs- Reduce
n_trialsandn_bootstrapfor memory-limited systems - Use focused
cluster_rangebased on biological expectations - Install the
leidenpackage for improved performance - Use
get_recommended_parameters()for optimal settings
-
"Graph not found" error:
- Run
FindNeighbors()on your Seurat object first - Use
check_seurat_ready()to validate preprocessing - By default, scICER uses the SNN graph from the active assay (e.g., "RNA_snn")
- You can check available graphs with
names(your_seurat_object@graphs) - Specify a different graph with the
graph_nameparameter if needed
- Run
-
No consistent clusters found:
- Try adjusting
ic_threshold(e.g., 1.01) - Expand
cluster_range - Check preprocessing quality
- Use
create_results_summary()for detailed analysis
- Try adjusting
-
Memory issues:
- Reduce
n_workers,n_trials, orcluster_range - Use
get_recommended_parameters()for guidance
- Reduce
-
Slow performance:
- Install the
leidenpackage - Increase
n_workers - Use recommended parameters for your dataset size
- Install the
-
Reproducibility issues:
- Always set the
seedparameter - Keep all other parameters constant, including
n_workers - In
cluster_rangemode, differentn_workersvalues can produce different results even with the sameseed - Document parameter values used
- Always set the
# Check if leiden package is available
if (requireNamespace("leiden", quietly = TRUE)) {
message("leiden package detected - enhanced performance available")
} else {
message("Consider installing 'leiden' package for better performance")
}
# Validate Seurat object preprocessing
check_seurat_ready(seurat_obj)
# Get recommended parameters
params <- get_recommended_parameters(ncol(seurat_obj))If you use scICER in your research, please cite:
Cao, X. (2024). scICER: Single-cell Inconsistency-based Clustering Evaluation in R.
R package version 1.1.0. https://github.com/ATPs/scICER
The original scICE algorithm citation will be added upon publication
In cluster_range mode, n_workers can affect the resolution-search probe grid
and search path. A fixed seed controls randomness within a given run
configuration, but it does not guarantee identical results after changing
n_workers. For exact reruns, keep n_workers and other analysis parameters
unchanged.
For each number of clusters (e.g., when testing 3-10 clusters):
- The algorithm runs
n_trialsindependent clustering attempts (default: 15) - Each trial uses the Leiden algorithm with different random initializations
- For each cluster number:
- Calculates pairwise similarities between all trial results using Element-Centric Similarity (ECS)
- Computes an Inconsistency (IC) score measuring how different the clustering solutions are
- A lower IC score indicates more consistent clustering across trials
- For cluster numbers with IC scores below your threshold (e.g., < 1.05):
- Selects the "best" clustering solution from the trials
- The "best" solution is the one with highest average similarity to other solutions (most representative)
- This becomes the reported clustering for that number of clusters
So when you get results showing clusters 3-10 have IC scores < 1.05, it means:
- Each of these cluster numbers produced stable results across different random initializations
- The reported clustering solution is the most representative one from all trials
- You can be confident in using these clustering solutions for downstream analysis
If you use resolution = c(...), the same IC logic is applied directly to each supplied gamma. When multiple gamma values collapse to the same final cluster number, scICER keeps the lowest-IC one in the main result and stores all evaluated gamma values in resolution_diagnostics.
n_trials:
- Increasing
n_trials(default: 15) can improve the robustness of results - More trials = more chances to find truly stable clustering solutions
- For large datasets, you might want to reduce trials to improve speed
- Recommended ranges:
- Small datasets (<10k cells): 15-20 trials
- Medium datasets (10k-50k cells): 10-15 trials
- Large datasets (>50k cells): 8-10 trials
n_bootstrap:
- Increasing
n_bootstrap(default: 100) improves confidence in IC score estimates - More bootstrap iterations = more accurate stability assessment
- The default 100 iterations is usually sufficient
- Consider increasing for:
- Critical analyses requiring high confidence
- Publications or benchmark studies
- Unusual or complex datasets
- Recommended ranges:
- Standard analysis: 100 iterations
- High-confidence analysis: 200-500 iterations
- Note: Higher values increase computation time
Remember: Quality preprocessing (normalization, feature selection, dimensionality reduction) often has more impact than increasing these parameters.
- π Issues: Report bugs or request features on GitHub Issues
- π§ Contact: atps@outlook.com
- π€ Contributing: Pull requests welcome!
This package is licensed under the MIT License. See LICENSE file for details.
- Original scICE Julia package developers
- Seurat team for the excellent single-cell framework
- R community for the robust package ecosystem