seuratclustering

Performs unsupervised clustering on single-cell RNA-seq data using Seurat. This process finds nearest neighbors, computes UMAP for visualization, and applies Louvain/Leiden algorithms to identify cell clusters. Clusters can be explored at multiple resolutions to balance granularity and biological relevance.

pwwang 21 4 Updated 6mo ago

Install

npx skillscat add pwwang/immunopipe/seuratclustering

Install via the SkillsCat registry.

SKILL.md

SeuratClustering Process Configuration

Purpose

When to Use

After SeuratPreparing: Standard workflow after QC and normalization
T/B cell selection: After SeuratClusteringOfAllCells (if TOrBCellSelection enabled)
Alternative to reference-based annotation: Use instead of SeuratMap2Ref or CellTypeAnnotation when no suitable reference is available
Standard clustering: When you need unsupervised cell type discovery
Multi-resolution exploration: When unsure of optimal cluster granularity

Configuration Structure

Process Enablement

[SeuratClustering]
cache = true  # Cache intermediate results for faster re-runs

Input Specification

[SeuratClustering.in]
srtobj = ["SeuratPreparing"]  # Path or reference to Seurat object

Environment Variables

Core Parameters

[SeuratClustering.envs]
# Number of cores for parallelization
ncores = 1  # int; Higher values speed up computation

# Metadata column name for cluster labels
ident = "seurat_clusters"  # Default column name

# Cache location for intermediate results
cache = "/tmp"  # Path; Set to false to disable caching

FindNeighbors Parameters

[SeuratClustering.envs.FindNeighbors]
# K-nearest neighbors: Defines neighborhood size (default: 20)
# Larger values capture more global structure
k.param = 20  # int; Range: 5-100 depending on dataset size

# Reduction to use for building neighbor graph
# If not specified, uses sobj@misc$integrated_new_reduction
reduction = "pca"  # Options: "pca", "integrated.cca", "integrated.rpca", etc.

# Dimensions to use from reduction
dims = 30  # int or 1:N; Automatically expanded to 1:dims

# Pruning threshold for shared nearest neighbor (SNN) graph
# 0 = no pruning, 1 = prune everything
prune.SNN = 0.067  # Default: 1/15; Controls graph connectivity

# Nearest neighbor method
nn.method = "annoy"  # Options: "annoy", "rann"

# Graph naming (for multiple integration methods)
graph.name = ["pca_nn", "pca_snn"]  # [NN, SNN] graph names

Full FindNeighbors Parameter List:

k.param (int): Number of nearest neighbors (default: 20)
reduction (str): Dimensional reduction to use (default: "pca")
dims (int): Number of dimensions to use (default: 1:10)
assay (str): Assay to use when dims is NULL
features (list): Features to use when dims is NULL
compute.SNN (bool): Compute shared nearest neighbor graph (default: TRUE)
prune.SNN (float): SNN pruning threshold (default: 1/15)
nn.method (str): NN algorithm - "annoy" or "rann" (default: "annoy")
n.trees (int): Annoy tree count (default: 50)
annoy.metric (str): Distance metric - "euclidean", "cosine", "manhattan", "hamming" (default: "euclidean")
graph.name (list): Names for NN and SNN graphs
verbose (bool): Print output (default: TRUE)

RunUMAP Parameters

[SeuratClustering.envs.RunUMAP]
# Reduction to use for UMAP embedding
reduction = "pca"  # Options: "pca", "integrated.cca", "integrated.rpca"

# Dimensions to use
dims = 30  # int; Automatically expanded to 1:dims

# Use specific features instead of dimensions
# In Seurat, features is only used when dims is NULL, so it is commented out here
# Can be a list with "order" and "n" fields
# features = 30  # or list: {order = "desc(abs(avg_log2FC))", n = 30}

# Number of neighboring points (global vs local structure)
n.neighbors = 30  # int; Range: 5-50; Higher = more global

# Min distance controls cluster tightness
min.dist = 0.3  # float; Range: 0.001-0.5; Higher = more spread out

# Effective scale of embedded points
spread = 1  # float; Works with min.dist for cluster distribution

# Number of embedding dimensions
n.components = 2  # int; Usually 2 for visualization

# Distance metric
metric = "cosine"  # Options: "euclidean", "cosine", "manhattan", etc.

# Learning rate for optimization
learning.rate = 1  # float; Initial learning rate

# Random seed for reproducibility
seed.use = 42  # int

Full RunUMAP Parameter List:

dims (int): Dimensions to use (default: NULL, uses reduction)
reduction (str): Reduction to use (default: "pca")
features (list/int): Features to use instead of dims
n.neighbors (int): Neighborhood size (default: 30)
n.components (int): Embedding dimensions (default: 2)
metric (str): Distance metric (default: "cosine")
min.dist (float): Cluster tightness (default: 0.3)
spread (float): Scale of embedding (default: 1.0)
learning.rate (float): Initial learning rate (default: 1.0)
n.epochs (int): Training epochs (default: auto: 200 large, 500 small)
set.op.mix.ratio (float): Fuzzy set operation ratio (default: 1.0)
local.connectivity (int): Local connectivity (default: 1)
seed.use (int): Random seed (default: 42)
reduction.name (str): Name for UMAP reduction (default: "umap")
verbose (bool): Print output (default: TRUE)

FindClusters Parameters

[SeuratClustering.envs.FindClusters]
# Resolution: Higher = more clusters, Lower = fewer clusters
resolution = 0.8  # float; Default: 0.8
# Multiple resolutions supported: [0.4, 0.6, 0.8, 1.0]
# Range syntax: "0.2:1.0:0.1" -> [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

# Clustering algorithm (1=Louvain, 2=Louvain multilevel, 3=SLM, 4=Leiden)
# Leiden (4) is preferred over Louvain (1-3)
algorithm = 4  # int; 1-4; 4 = Leiden (recommended)

# Leiden implementation
leiden_method = "leidenbase"  # Options: "leidenbase", "igraph"

# Leiden objective function
leiden_objective_function = "modularity"  # Options: "modularity", "CPM"

# Random seed for reproducibility
random.seed = 0  # int

# Number of random starts
n.start = 10  # int

# Max iterations per start
n.iter = 10  # int

# Graph to use for clustering
graph.name = "pca_snn"  # Must match the SNN name, i.e. the second entry of FindNeighbors graph.name

# Cluster name in metadata
cluster.name = "seurat_clusters"  # Can use envs.ident instead

# Group singletons into nearest cluster
group.singletons = true  # bool (TOML booleans are lowercase)

Full FindClusters Parameter List:

resolution (float): Cluster granularity (default: 0.8)
algorithm (int): 1=Louvain, 2=Louvain multilevel, 3=SLM, 4=Leiden (default: 1)
leiden_method (str): "leidenbase" or "igraph" (default: "leidenbase")
leiden_objective_function (str): "modularity" or "CPM" (default: "modularity")
random.seed (int): Random seed (default: 0)
n.start (int): Random starts (default: 10)
n.iter (int): Max iterations (default: 10)
graph.name (str): SNN graph name to use
cluster.name (str): Metadata column for clusters
modularity.fxn (int): Modularity function (default: 1)
group.singletons (bool): Group singletons (default: TRUE)
verbose (bool): Print output (default: TRUE)

External References

FindNeighbors (Seurat v5)

https://satijalab.org/seurat/reference/findneighbors

Constructs shared nearest neighbor (SNN) graph
Computes k-nearest neighbors and Jaccard index for neighborhood overlap
Pruning controls graph connectivity (higher = stricter)
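
To make the Jaccard overlap and pruning described above concrete, here is a minimal Python sketch. It is not Seurat's implementation: the function names and the toy neighbor lists are assumptions made for this illustration, and it only mirrors the documented behavior that edges at or below the prune.SNN threshold are removed.

```python
def jaccard(a, b):
    """Jaccard index of two neighbor sets: shared members over all members."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def snn_edges(neighbors, prune_snn=1 / 15):
    """Return {(i, j): weight} for cell pairs whose neighborhood overlap
    is above the pruning threshold; edges at or below it are dropped."""
    cells = sorted(neighbors)
    edges = {}
    for idx, i in enumerate(cells):
        for j in cells[idx + 1:]:
            weight = jaccard(neighbors[i], neighbors[j])
            if weight > prune_snn:
                edges[(i, j)] = weight
    return edges


# Toy k-nearest-neighbor lists for hypothetical cells (each list includes the cell itself)
knn = {
    "c1": ["c1", "c2", "c3"],
    "c2": ["c2", "c1", "c3"],
    "c3": ["c3", "c1", "c4"],
    "c4": ["c4", "c5", "c6"],
}
edges = snn_edges(knn)                       # default threshold of 1/15
strict_edges = snn_edges(knn, prune_snn=0.25)  # stricter pruning removes weak edges
```

With the default threshold the weak c3/c4 edge (overlap 0.2) survives; raising prune_snn to 0.25 removes it, which is the "higher = stricter" behavior noted above.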

RunUMAP (Seurat v5)

https://satijalab.org/seurat/reference/runumap

Uniform Manifold Approximation and Projection for visualization
n.neighbors: Global structure vs local detail trade-off (5-50)
min.dist: Cluster tightness (0.001-0.5)
spread: Scale of embedding (works with min.dist)

FindClusters (Seurat v5)

https://www.rdocumentation.org/packages/Seurat/versions/5.3.1/topics/FindClusters

Louvain (1-3) vs Leiden (4) clustering algorithms
Leiden preferred: Better community detection, improved over Louvain
Resolution: >1.0 = more clusters, <1.0 = fewer clusters

Algorithm Comparison: Leiden vs Louvain

Leiden (algorithm=4): Preferred, better community detection, refined clusters
Louvain (algorithm=1-3): Faster but less accurate, can produce badly connected communities
Source: Single-cell best practices recommend Leiden
https://www.sc-best-practices.org/cellular_structure/clustering.html

Integration Method Support

When using Seurat integration workflows, point reduction at the matching integrated reduction:

CCA integration: reduction = "integrated.cca"
RPCA integration: reduction = "integrated.rpca"
Harmony integration: reduction = "integrated.harmony"

Configuration Examples

Minimal Configuration

[SeuratClustering]
[SeuratClustering.in]
srtobj = ["SeuratPreparing"]

Result: Uses defaults (PCA, 30 dims, resolution 0.8, Louvain algorithm)

Standard Resolution Sweep

[SeuratClustering]
[SeuratClustering.in]
srtobj = ["SeuratPreparing"]

[SeuratClustering.envs.FindClusters]
resolution = [0.4, 0.6, 0.8, 1.0]  # Test 4 resolutions

Result: Creates one metadata column per resolution (seurat_clusters_0.4, seurat_clusters_0.6, etc.); the final clustering uses the last resolution (1.0)

Range Syntax for Resolution Sweep

[SeuratClustering.envs.FindClusters]
# From 0.2 to 1.0 with step 0.1
resolution = "0.2:1.0:0.1"  # Equivalent to [0.2, 0.3, ..., 1.0]

Leiden Algorithm with Custom Parameters

[SeuratClustering]

[SeuratClustering.envs.FindNeighbors]
k.param = 30  # Larger neighborhood
prune.SNN = 0.05  # Less pruning
graph.name = ["pca_nn", "pca_snn"]

[SeuratClustering.envs.FindClusters]
algorithm = 4  # Leiden
resolution = 1.2  # Higher resolution for more clusters
random.seed = 42  # Reproducible clustering
graph.name = "pca_snn"

Integrated Data (CCA/RPCA)

[SeuratClustering]

[SeuratClustering.envs.FindNeighbors]
reduction = "integrated.cca"  # Use integrated reduction
dims = 30

[SeuratClustering.envs.RunUMAP]
reduction = "integrated.cca"
dims = 30
reduction.name = "umap.cca"

[SeuratClustering.envs.FindClusters]
resolution = 1.0

Custom UMAP Parameters for Better Separation

[SeuratClustering]

[SeuratClustering.envs.RunUMAP]
n.neighbors = 15  # More local detail
min.dist = 0.1  # Tighter clusters
spread = 1.5  # More spread out
seed.use = 123

Using Top Markers for UMAP

[SeuratClustering]

[SeuratClustering.envs.RunUMAP]
# Use top 30 markers for UMAP instead of PCs
features = {order = "desc(abs(avg_log2FC))", n = 30}

Multi-Process with Custom Cluster Names

[SeuratClustering]
[SeuratClustering.envs]
ident = "my_clusters"  # Custom column name

[CellTypeAnnotation]
[CellTypeAnnotation.envs]
newcol = "cell_types"  # Different column to avoid overwriting

[SeuratMap2Ref]
[SeuratMap2Ref.envs]
name = "ref_clusters"  # Another clustering method

Common Patterns

Pattern 1: Single Resolution (Standard)

[SeuratClustering]

[SeuratClustering.envs.FindClusters]
resolution = 0.8  # Default balanced resolution

Pattern 2: Resolution Sweep for Exploration

[SeuratClustering]

[SeuratClustering.envs.FindClusters]
resolution = "0.4:1.2:0.2"  # [0.4, 0.6, 0.8, 1.0, 1.2]

Pattern 3: Leiden with High Resolution (Fine-grained)

[SeuratClustering]

[SeuratClustering.envs.FindNeighbors]
k.param = 25

[SeuratClustering.envs.FindClusters]
algorithm = 4  # Leiden
resolution = 1.5  # More clusters

Pattern 4: Integrated Data (Post-Integration)

[SeuratClustering]

[SeuratClustering.envs.FindNeighbors]
reduction = "integrated.cca"
dims = 30

[SeuratClustering.envs.RunUMAP]
reduction = "integrated.cca"
dims = 30

[SeuratClustering.envs.FindClusters]
resolution = 1.0

Pattern 5: Sparse UMAP for Large Datasets

[SeuratClustering]

[SeuratClustering.envs.FindNeighbors]
nn.method = "annoy"
n.trees = 50

[SeuratClustering.envs.RunUMAP]
n.neighbors = 50  # More global for large datasets

Dependencies

Upstream Processes

Required: SeuratPreparing (or SeuratClusteringOfAllCells if TOrBCellSelection used)
Optional: LoadingRNAFromSeurat with prepared = false (if loading unprepared Seurat object)

Downstream Processes

SeuratClusterStats: Cluster statistics and quality metrics
ClusterMarkers: Differential expression between clusters
MarkersFinder: Flexible marker finding with enrichment analysis
ScRepCombiningExpression: If TCR data present (combines RNA + TCR)
TESSA: TCR-specific clustering analysis

Validation Rules

Resolution Constraints

Must be positive (resolution > 0)
Single value or list of values allowed
Range syntax: "start:end:step" (step defaults to 0.1 if omitted)
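
The range syntax above can be sketched in Python as follows. This is only an illustration of the expansion rule, not the pipeline's own parser; the function name expand_resolution is hypothetical. Decimal arithmetic is used so that repeated additions do not drift (for example to 0.30000000000000004).

```python
from decimal import Decimal


def expand_resolution(spec, default_step="0.1"):
    """Expand "start:end[:step]" into an inclusive list of floats."""
    parts = spec.split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"expected 'start:end[:step]', got {spec!r}")
    start, end = Decimal(parts[0]), Decimal(parts[1])
    # The step falls back to 0.1 when it is omitted
    step = Decimal(parts[2]) if len(parts) == 3 else Decimal(default_step)
    if start <= 0 or step <= 0 or end < start:
        raise ValueError("resolutions must be positive and increasing")
    values = []
    current = start
    while current <= end:
        values.append(float(current))
        current += step
    return values


print(expand_resolution("0.4:1.2:0.2"))  # → [0.4, 0.6, 0.8, 1.0, 1.2]
```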

Dimension Requirements

dims must not exceed available dimensions in reduction
Automatically truncated to min(dims, ncol(reduction) - 1)
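
As a small worked example of the truncation rule above, the following Python sketch clamps a requested dimension count and expands it to the equivalent of R's 1:N. The function and argument names are hypothetical and chosen only for this illustration.

```python
def effective_dims(requested, n_reduction_cols):
    """Clamp to min(requested, ncol(reduction) - 1) and expand to 1..N."""
    n = min(requested, n_reduction_cols - 1)
    if n < 1:
        raise ValueError("reduction has too few columns")
    return list(range(1, n + 1))


# A reduction with 50 columns keeps all 30 requested dimensions
full = effective_dims(30, 50)
# A reduction with only 20 columns is truncated to dimensions 1..19
truncated = effective_dims(30, 20)
```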

Graph Name Consistency

FindClusters.graph.name must match the SNN graph name, which is the second entry of FindNeighbors.graph.name (e.g. "pca_snn" in ["pca_nn", "pca_snn"])
When using multiple integration methods, use unique graph names
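
A quick pre-flight check for the graph-name rule might look like the sketch below. It assumes the envs section has already been loaded into a plain Python dict with "graph.name" as a flat key; that representation and the function name check_graph_names are assumptions for this example, not part of the pipeline.

```python
def check_graph_names(envs):
    """Return a list of problems found in a SeuratClustering envs mapping."""
    problems = []
    neighbors_name = envs.get("FindNeighbors", {}).get("graph.name")
    clusters_name = envs.get("FindClusters", {}).get("graph.name")
    if neighbors_name is None or clusters_name is None:
        return problems  # nothing set explicitly, so nothing to compare
    if not isinstance(neighbors_name, (list, tuple)) or len(neighbors_name) != 2:
        problems.append("FindNeighbors graph.name should be [NN, SNN]")
    elif clusters_name != neighbors_name[1]:
        problems.append(
            f"FindClusters graph.name {clusters_name!r} does not match "
            f"SNN graph {neighbors_name[1]!r}"
        )
    return problems


good = {
    "FindNeighbors": {"graph.name": ["cca_nn", "cca_snn"]},
    "FindClusters": {"graph.name": "cca_snn"},
}
bad = {
    "FindNeighbors": {"graph.name": ["cca_nn", "cca_snn"]},
    "FindClusters": {"graph.name": "pca_snn"},
}
good_problems = check_graph_names(good)  # empty list
bad_problems = check_graph_names(bad)    # one mismatch message
```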

Algorithm Selection

Louvain: algorithm = 1 (original), 2 (multilevel), 3 (SLM)
Leiden: algorithm = 4 (recommended)
Leiden also accepts leiden_method and leiden_objective_function; both have defaults ("leidenbase" and "modularity")

Troubleshooting

Issue: Too Many Small Clusters

Symptoms: Hundreds of tiny clusters, many singletons

Solutions:

[SeuratClustering.envs.FindClusters]
resolution = 0.4  # Lower resolution
algorithm = 4  # Leiden (recommended; see Algorithm Comparison)
group.singletons = true  # Group singletons into nearest cluster

Issue: Clusters Overlapping in UMAP

Symptoms: Poor separation in UMAP visualization

Solutions:

[SeuratClustering.envs.RunUMAP]
min.dist = 0.1  # Tighter clusters
n.neighbors = 15  # More local detail
spread = 1.2  # More separation

Issue: Clustering Not Reproducible

Symptoms: Different clusters on each run

Solutions:

[SeuratClustering.envs.RunUMAP]
# Set seed for a reproducible UMAP embedding (seed.use is a RunUMAP argument, not a FindNeighbors one)
seed.use = 42

[SeuratClustering.envs.FindClusters]
random.seed = 42
n.start = 20  # More random starts than the default (10) for more stable results

Issue: Slow Performance

Symptoms: Clustering takes hours

Solutions:

[SeuratClustering.envs]
ncores = 8  # Use more cores

[SeuratClustering.envs.FindNeighbors]
nn.method = "annoy"  # Faster approximate NN
dims = 20  # Fewer dimensions

[SeuratClustering.envs.RunUMAP]
n.epochs = 200  # Fewer epochs (default auto-selects based on size)

Issue: Badly Connected Communities (Louvain)

Symptoms: Louvain clusters (algorithm 1-3) contain disconnected or poorly connected groups of cells

Solutions:

[SeuratClustering.envs.FindClusters]
algorithm = 4  # Switch to Leiden
leiden_method = "leidenbase"  # Use leidenbase implementation

Issue: Graph Name Conflicts with Multiple Integrations

Symptoms: Wrong graph used for clustering

Solutions:

# CCA integration (use one integration block per config file; TOML does not allow defining the same table twice)
[SeuratClustering.envs.FindNeighbors]
reduction = "integrated.cca"
graph.name = ["cca_nn", "cca_snn"]

[SeuratClustering.envs.FindClusters]
graph.name = "cca_snn"  # Must match SNN graph name

# RPCA integration (alternative to the CCA block; keep it in a separate config or run)
[SeuratClustering.envs.FindNeighbors]
reduction = "integrated.rpca"
graph.name = ["rpca_nn", "rpca_snn"]

[SeuratClustering.envs.FindClusters]
graph.name = "rpca_snn"

Issue: Clustering Not Using Integration

Symptoms: Clustering on raw RNA instead of integrated data

Solutions:

[SeuratClustering.envs.FindNeighbors]
reduction = "integrated.cca"  # Specify integrated reduction
dims = 30

# If using SCTransform + integration:
[SeuratPreparing.envs]
integration_method = "CCA"  # Ensure integration is performed

Best Practices

Use Leiden algorithm (algorithm = 4) for better community detection
Test multiple resolutions to find optimal granularity
Set random seeds for reproducible results
Match reduction to integration if using CCA/RPCA/Harmony
Use custom cluster names when running multiple annotation methods to avoid overwriting
Cache intermediate results for faster re-runs with different parameters
Parallelize with ncores for large datasets (>50k cells)
Use resolution sweeps when unsure of optimal granularity

Related Processes

SeuratClusteringOfAllCells: Clustering before T/B cell selection
SeuratSubClustering: Re-clustering within specific clusters
SeuratMap2Ref: Reference-based supervised clustering
CellTypeAnnotation: Automated cell type annotation
