cdr3clustering

Cluster TCR/BCR clones by CDR3 sequence using GIANA or ClusTCR (both Faiss-based). Adds a `CDR3_Cluster` column to the metadata for clonotype analysis.

pwwang 21 4 Updated 6mo ago


Install

npx skillscat add pwwang/immunopipe/cdr3clustering

Install via the SkillsCat registry.

SKILL.md

CDR3Clustering Process Configuration

Purpose

Cluster TCR/BCR clones by CDR3 sequence using GIANA or ClusTCR (both Faiss-based). Adds a CDR3_Cluster column to the metadata for clonotype analysis.

When to Use

To identify groups of similar TCR/BCR clonotypes
For analyzing TCR sequence convergence
After ScRepCombiningExpression, once TCR/BCR data have been integrated with RNA
For investigating public clonotypes across samples
Before TESSA analysis for epitope specificity

Important: This process only runs when VDJ input is present (TCRData/BCRData columns in SampleInfo).
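A quick pre-flight check can confirm that the sample sheet carries a VDJ column before the pipeline is launched. The following is a minimal sketch, assuming the SampleInfo file is tab-delimited with a header row; the helper name `has_vdj_input` is hypothetical and not part of immunopipe.

```python
import csv
import io

# Column names taken from the note above.
VDJ_COLUMNS = {"TCRData", "BCRData"}


def has_vdj_input(sample_info_text, delimiter="\t"):
    """Return True if the sample sheet header contains TCRData or BCRData."""
    reader = csv.reader(io.StringIO(sample_info_text), delimiter=delimiter)
    header = next(reader, [])
    return bool(VDJ_COLUMNS & {name.strip() for name in header})


# Illustrative sheet contents (paths are made up for this example).
sheet = "Sample\tRNAData\tTCRData\nS1\tdata/S1/rna\tdata/S1/tcr\n"
print(has_vdj_input(sheet))
```

If the check returns False, the process would be expected to be skipped, so the sample sheet is the first place to look.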

Configuration Structure

Process Enablement

[CDR3Clustering]
cache = true

Input Specification

[CDR3Clustering.in]
screpfile = "path/to/combined_object.qs"

Environment Variables

[CDR3Clustering.envs]
type = "auto"      # TCR, BCR, or auto
tool = "GIANA"     # GIANA or ClusTCR
python = "python"   # Path to python
within_sample = true  # Cluster per sample
args = {}          # Tool-specific arguments
chain = "both"     # TRA, TRB, IGH, IGL, IGK, both, heavy, light

GIANA Arguments (via `args`)

[CDR3Clustering.envs.args]
method = "hierarchical"    # hierarchical, kmeans
dist = "hamming"          # hamming, levenshtein
threshold = 0.15           # Distance threshold

ClusTCR Arguments (via `args`)

[CDR3Clustering.envs.args]
method = "two-step"       # mcl, faiss, two-step
n_cpus = 4                # CPUs for MCL
faiss_cluster_size = 5000  # Supercluster size
mcl_params = [1.2, 2]    # [inflation, expansion]

Configuration Examples

Minimal Configuration

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

GIANA with Custom Distance Threshold

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "GIANA"

[CDR3Clustering.envs.args]
method = "hierarchical"
dist = "hamming"
threshold = 0.15

ClusTCR Two-Step (Large Datasets)

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "two-step"
faiss_cluster_size = 5000
n_cpus = 8

ClusTCR MCL (Small Datasets)

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "mcl"
n_cpus = 4

TRB Chain Only

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
chain = "TRB"

Cross-Sample Clustering

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
within_sample = false

Common Patterns

Pattern 1: Standard TCR Beta Chain

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
type = "TCR"
tool = "GIANA"
chain = "TRB"

Pattern 2: Large Dataset (>100K sequences)

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "two-step"
faiss_cluster_size = 5000
n_cpus = 8

Pattern 3: Custom Threshold

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "GIANA"

[CDR3Clustering.envs.args]
threshold = 0.15  # Higher=fewer clusters, Lower=more clusters

Dependencies

Upstream

ScRepCombiningExpression (required): Combined scRepertoire object with TCR/BCR data

Downstream

TESSA: TCR epitope specificity prediction
ClonalStats: Clonality statistics (uses CDR3_Cluster metadata)

Validation Rules

Tool must be "GIANA" or "ClusTCR"
Chain must be valid for data type (TCR: TRA/TRB, BCR: IGH/IGL/IGK)
GIANA requires: biopython, faiss, scikit-learn
ClusTCR requires: clustcr package
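
The tool and chain rules above can be checked before a run. This is a minimal sketch, not the pipeline's own validation: the function name `validate_envs` is hypothetical, the defaults are assumed to be the values shown under Environment Variables, and treating "both" as valid for either type and "heavy"/"light" as BCR-only is an assumption.

```python
VALID_TOOLS = {"GIANA", "ClusTCR"}

# Assumption: "both" applies to TCR and BCR; "heavy"/"light" apply to BCR only.
CHAINS_BY_TYPE = {
    "TCR": {"TRA", "TRB", "both"},
    "BCR": {"IGH", "IGL", "IGK", "both", "heavy", "light"},
}


def validate_envs(envs):
    """Return a list of problems found in a [CDR3Clustering.envs] mapping."""
    problems = []
    tool = envs.get("tool", "GIANA")
    if tool not in VALID_TOOLS:
        problems.append(f"tool must be GIANA or ClusTCR, got {tool!r}")

    vdj_type = envs.get("type", "auto")
    chain = envs.get("chain", "both")
    if vdj_type in CHAINS_BY_TYPE:
        allowed = CHAINS_BY_TYPE[vdj_type]
    else:
        # With type = "auto" the data type is unknown here, so accept any listed chain.
        allowed = set().union(*CHAINS_BY_TYPE.values())
    if chain not in allowed:
        problems.append(f"chain {chain!r} is not valid for type {vdj_type!r}")
    return problems


print(validate_envs({"type": "TCR", "tool": "GIANA", "chain": "IGH"}))
```

The mapping passed in can come from any TOML parser; on Python 3.11+ the standard-library `tomllib.load` returns nested dictionaries that can be indexed down to the `envs` table.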

Computational Considerations

<50K sequences: ClusTCR method = "mcl" (highest quality)
50K-500K sequences: ClusTCR method = "two-step" (balanced)
>500K sequences: GIANA or ClusTCR method = "two-step" (fastest)
Memory: GIANA ~2-4 GB/100K, ClusTCR ~4-8 GB/100K
Runtime: GIANA 1-5 min/100K, ClusTCR two-step 2-10 min/100K
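
The size bands above can be turned into a small helper that proposes settings from a sequence count. This is a sketch only: `suggest_clustering` is a hypothetical name, the third band is read as "more than 500K sequences", and picking GIANA there (rather than ClusTCR two-step, which is also listed) is an arbitrary choice for illustration.

```python
def suggest_clustering(n_sequences):
    """Map a sequence count to tool/args following the size bands above."""
    if n_sequences < 50_000:
        # Small input: MCL is listed as the highest-quality option.
        return {"tool": "ClusTCR", "args": {"method": "mcl"}}
    if n_sequences <= 500_000:
        # Mid-sized input: two-step is listed as the balanced option.
        return {"tool": "ClusTCR", "args": {"method": "two-step"}}
    # Larger input: GIANA or ClusTCR two-step; GIANA chosen here for illustration.
    return {"tool": "GIANA", "args": {}}


for n in (10_000, 200_000, 2_000_000):
    print(n, suggest_clustering(n))
```

The returned dictionary mirrors the `tool` and `args` keys of `[CDR3Clustering.envs]`, so the values can be copied into the configuration by hand.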

Troubleshooting

Process not running

Cause: No VDJ data available
Solution: Verify ScRepCombiningExpression output contains TCR/BCR data

ModuleNotFoundError

Cause: Missing dependencies
Solution:

GIANA: pip install biopython faiss-cpu scikit-learn
ClusTCR: conda install -c conda-forge clustcr
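
Before reinstalling anything, it can help to see which modules are actually missing. The sketch below uses the standard-library `importlib.util.find_spec`; the helper name `missing_modules` is hypothetical. The import names (`Bio` for biopython, `sklearn` for scikit-learn) differ from the package names used at install time. Run it with the same interpreter that the `python` setting points to.

```python
from importlib.util import find_spec

# Import names for the dependencies listed under Validation Rules.
REQUIRED_MODULES = {
    "GIANA": ["Bio", "faiss", "sklearn"],
    "ClusTCR": ["clustcr"],
}


def missing_modules(tool, required=REQUIRED_MODULES):
    """Return the top-level modules for `tool` that cannot be found."""
    return [name for name in required[tool] if find_spec(name) is None]


for tool in REQUIRED_MODULES:
    print(tool, "missing:", missing_modules(tool))
```

An empty list for the selected tool suggests the error comes from a different interpreter or environment than the one being tested.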

Too many/few clusters

Cause: Threshold inappropriate
Solution: Adjust threshold (higher = fewer clusters, lower = more clusters)

Out of memory

Cause: Dataset too large for RAM
Solution: Use within_sample = true, reduce n_cpus, or use GIANA

Slow clustering

Cause: Suboptimal method for dataset size
Solution:

>50K: ClusTCR method = "two-step" with increased n_cpus
Very large (>500K): Use GIANA

Notes on Output Format

Metadata column: CDR3_Cluster

Cluster naming:

S_1, S_2: Single unique CDR3 sequence (may have multiple cells)
M_1, M_2: Multiple unique CDR3 sequences (similar but different)

Interpretation:

S_ prefix: Cells share identical CDR3 sequence
M_ prefix: Cells have similar but different CDR3 sequences
Use CDR3_Cluster as grouping factor in Seurat plots
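
To make the S_/M_ distinction concrete, the sketch below summarizes clusters from per-cell (CDR3 sequence, CDR3_Cluster) pairs. It assumes the metadata has already been exported to such pairs; the function name `summarize_clusters` is hypothetical and the sequences are made up for illustration.

```python
from collections import defaultdict


def summarize_clusters(cells):
    """Summarize clusters from (cdr3_sequence, cluster_label) pairs, one per cell."""
    sequences = defaultdict(set)
    cell_counts = defaultdict(int)
    for cdr3, cluster in cells:
        sequences[cluster].add(cdr3)
        cell_counts[cluster] += 1
    return {
        cluster: {
            # S_ = one unique CDR3 sequence, M_ = several similar sequences.
            "kind": "single" if cluster.startswith("S_") else "multiple",
            "n_unique_cdr3": len(seqs),
            "n_cells": cell_counts[cluster],
        }
        for cluster, seqs in sequences.items()
    }


# Illustrative cells: two share one sequence, two carry similar sequences.
cells = [
    ("CASSLGQAYEQYF", "S_1"),
    ("CASSLGQAYEQYF", "S_1"),
    ("CASSPGTGGYEQYF", "M_1"),
    ("CASSPGTGAYEQYF", "M_1"),
]
for cluster, info in summarize_clusters(cells).items():
    print(cluster, info)
```

An S_ cluster should always report a single unique CDR3 sequence, whatever its cell count, while an M_ cluster should report two or more.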

Performance Tips:

Small (<10K): GIANA defaults (quality over speed)
Medium (10K-100K): ClusTCR two-step with n_cpus=4
Large (100K-1M): ClusTCR two-step with n_cpus=8+ or GIANA
Very large (>1M): GIANA, or ClusTCR two-step with increased faiss_cluster_size (faiss_cluster_size is a ClusTCR argument)
