pwwang

cdr3clustering

Cluster TCR/BCR clones by CDR3 sequences using GIANA or ClusTCR (both Faiss-based). Adds `CDR3_Cluster` column to metadata for clonotype analysis.


Install

npx skillscat add pwwang/immunopipe/cdr3clustering

Install via the SkillsCat registry.

SKILL.md

CDR3Clustering Process Configuration

Purpose

Cluster TCR/BCR clones by CDR3 sequences using GIANA or ClusTCR (both Faiss-based). Adds CDR3_Cluster column to metadata for clonotype analysis.

When to Use

  • To identify groups of similar TCR/BCR clonotypes
  • For analyzing TCR sequence convergence
  • After ScRepCombiningExpression, once TCR/BCR data are integrated with RNA
  • For investigating public clonotypes across samples
  • Before TESSA analysis for epitope specificity

Important: The process runs only when VDJ input is present (TCRData/BCRData columns in SampleInfo).

Configuration Structure

Process Enablement

[CDR3Clustering]
cache = true

Input Specification

[CDR3Clustering.in]
screpfile = "path/to/combined_object.qs"

Environment Variables

[CDR3Clustering.envs]
type = "auto"      # TCR, BCR, or auto
tool = "GIANA"     # GIANA or ClusTCR
python = "python"   # Path to python
within_sample = true  # Cluster per sample
args = {}          # Tool-specific arguments
chain = "both"     # TRA, TRB, IGH, IGL, IGK, both, heavy, light

GIANA Arguments (via args)

[CDR3Clustering.envs.args]
method = "hierarchical"    # hierarchical, kmeans
dist = "hamming"          # hamming, levenshtein
threshold = 0.15           # Distance threshold

ClusTCR Arguments (via args)

[CDR3Clustering.envs.args]
method = "two-step"       # mcl, faiss, two-step
n_cpus = 4                # CPUs for MCL
faiss_cluster_size = 5000  # Supercluster size
mcl_params = [1.2, 2]    # [inflation, expansion]

Configuration Examples

Minimal Configuration

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

GIANA with Custom Distance Threshold

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "GIANA"

[CDR3Clustering.envs.args]
method = "hierarchical"
dist = "hamming"
threshold = 0.15

ClusTCR Two-Step (Large Datasets)

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "two-step"
faiss_cluster_size = 5000
n_cpus = 8

ClusTCR MCL (Small Datasets)

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "mcl"
n_cpus = 4

TRB Chain Only

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
chain = "TRB"

Cross-Sample Clustering

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
within_sample = false

Common Patterns

Pattern 1: Standard TCR Beta Chain

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
type = "TCR"
tool = "GIANA"
chain = "TRB"

Pattern 2: Large Dataset (>100K sequences)

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "two-step"
faiss_cluster_size = 5000
n_cpus = 8

Pattern 3: Custom Threshold

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "GIANA"

[CDR3Clustering.envs.args]
threshold = 0.15  # Higher=fewer clusters, Lower=more clusters

Dependencies

Upstream

  • ScRepCombiningExpression (required): Combined scRepertoire object with TCR/BCR data

Downstream

  • TESSA: TCR epitope specificity prediction
  • ClonalStats: Clonality statistics (uses CDR3_Cluster metadata)

Validation Rules

  1. Tool must be "GIANA" or "ClusTCR"
  2. Chain must be valid for data type (TCR: TRA/TRB, BCR: IGH/IGL/IGK)
  3. GIANA requires: biopython, faiss, scikit-learn
  4. ClusTCR requires: clustcr package
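
Rules 1 and 2 can be checked before launching the pipeline. The following is a minimal sketch, not part of immunopipe: `validate_envs` is a hypothetical helper, the fallback values are simply those shown in the Environment Variables block, and accepting "both", "heavy", and "light" for any data type is an assumption, since the rules above do not say how those values map to TCR or BCR.

```python
# Hypothetical pre-flight check for [CDR3Clustering.envs]; not an immunopipe API.
VALID_TOOLS = {"GIANA", "ClusTCR"}
CHAINS_BY_TYPE = {"TCR": {"TRA", "TRB"}, "BCR": {"IGH", "IGL", "IGK"}}
# Assumption: these generic values are accepted regardless of data type.
GENERIC_CHAINS = {"both", "heavy", "light"}


def validate_envs(envs):
    """Return a list of problems; an empty list means rules 1 and 2 hold."""
    problems = []

    # Rule 1: tool must be "GIANA" or "ClusTCR" (case-sensitive).
    tool = envs.get("tool", "GIANA")
    if tool not in VALID_TOOLS:
        problems.append(f"tool must be 'GIANA' or 'ClusTCR', got {tool!r}")

    # Rule 2: a named chain must match the declared data type.
    data_type = envs.get("type", "auto")
    chain = envs.get("chain", "both")
    known_chains = set().union(*CHAINS_BY_TYPE.values()) | GENERIC_CHAINS
    if chain not in known_chains:
        problems.append(f"unknown chain {chain!r}")
    elif (
        data_type in CHAINS_BY_TYPE          # "auto" cannot be checked here
        and chain not in GENERIC_CHAINS
        and chain not in CHAINS_BY_TYPE[data_type]
    ):
        problems.append(f"chain {chain!r} is not valid for type {data_type!r}")

    return problems
```

For example, `validate_envs({"type": "TCR", "chain": "IGH"})` reports one problem, while the same chain with `type = "auto"` passes because the data type is not known until the process runs.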

Computational Considerations

  • <50K sequences: ClusTCR method = "mcl" (highest quality)
  • 50K-500K sequences: ClusTCR method = "two-step" (balanced)
  • >500K sequences: GIANA or ClusTCR method = "two-step" (fastest)

  • Memory: GIANA ~2-4 GB/100K, ClusTCR ~4-8 GB/100K
  • Runtime: GIANA 1-5 min/100K, ClusTCR two-step 2-10 min/100K
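
The size guidance above can be turned into a small lookup. This is only a sketch: `suggest_clustering` is a hypothetical helper, the handling of the exact boundary values (50K and 500K) is an assumption, and for more than 500K sequences it picks GIANA even though ClusTCR two-step is listed as an equally valid choice.

```python
def suggest_clustering(n_sequences):
    """Map a sequence count to the tool/method suggested in this section.

    Returns a dict shaped like the [CDR3Clustering.envs] table.
    """
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    if n_sequences < 50_000:
        # Small input: MCL is described as the highest-quality option.
        return {"tool": "ClusTCR", "args": {"method": "mcl"}}
    if n_sequences <= 500_000:
        # Mid-sized input: two-step is described as the balanced option.
        return {"tool": "ClusTCR", "args": {"method": "two-step"}}
    # Large input: GIANA chosen here; ClusTCR two-step is the listed alternative.
    return {"tool": "GIANA", "args": {}}
```

The returned dictionary mirrors the keys used in the configuration examples, so its values can be copied into `[CDR3Clustering.envs]` and `[CDR3Clustering.envs.args]` by hand.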

Troubleshooting

Process not running

Cause: No VDJ data available
Solution: Verify ScRepCombiningExpression output contains TCR/BCR data

ModuleNotFoundError

Cause: Missing dependencies
Solution:

  • GIANA: pip install biopython faiss-cpu scikit-learn
  • ClusTCR: conda install -c conda-forge clustcr
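
To see which dependency is missing before re-running the pipeline, a quick import check helps. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper, and the import names (`Bio` for biopython, `sklearn` for scikit-learn) differ from the package names used at install time.

```python
from importlib.util import find_spec

# Import names for the packages listed under Validation Rules.
REQUIRED_MODULES = {
    "GIANA": ["Bio", "faiss", "sklearn"],   # biopython, faiss, scikit-learn
    "ClusTCR": ["clustcr"],
}


def missing_modules(tool, finder=find_spec):
    """Return the import names for `tool` that cannot be found.

    `finder` is injectable so the check can be exercised without
    the real packages installed.
    """
    return [name for name in REQUIRED_MODULES[tool] if finder(name) is None]


if __name__ == "__main__":
    for tool in REQUIRED_MODULES:
        print(tool, "missing:", missing_modules(tool) or "none")
```

Run it with the same interpreter that the `python` setting in `[CDR3Clustering.envs]` points to; otherwise the result may describe a different environment from the one the process uses.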

Too many/few clusters

Cause: Threshold inappropriate
Solution: Adjust threshold (higher = fewer clusters, lower = more clusters)
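
The direction of this effect can be seen in a toy example. The sketch below is not GIANA's algorithm: it assumes, purely for illustration, that the threshold is compared against the fraction of mismatched positions between equal-length sequences, and it uses plain single-linkage grouping. The sequences are made up.

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def count_clusters(seqs, threshold):
    """Single-linkage grouping with union-find; returns the number of clusters."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        same_length = len(seqs[i]) == len(seqs[j])
        if same_length and hamming_fraction(seqs[i], seqs[j]) <= threshold:
            parent[find(i)] = find(j)  # link the two groups

    return len({find(i) for i in range(len(seqs))})


# Made-up CDR3-like strings, all 13 residues long.
toy_seqs = ["CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSPGQGYEQYF", "CASSIRSSYEQYF"]

for t in (0.05, 0.10, 0.35):
    print(t, count_clusters(toy_seqs, t))
```

With these four sequences the cluster count falls from 4 to 2 to 1 as the threshold rises, which matches the rule of thumb above: a looser threshold merges more sequences into fewer clusters.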

Out of memory

Cause: Dataset too large for RAM
Solution: Use within_sample = true, reduce n_cpus, or use GIANA

Slow clustering

Cause: Suboptimal method for dataset size
Solution:

  • >50K: ClusTCR method = "two-step" with increased n_cpus
  • Very large (>500K): Use GIANA

Notes on Output Format

Metadata column: CDR3_Cluster

Cluster naming:

  • S_1, S_2: Single unique CDR3 sequence (may have multiple cells)
  • M_1, M_2: Multiple unique CDR3 sequences (similar but different)

Interpretation:

  • S_ prefix: Cells share identical CDR3 sequence
  • M_ prefix: Cells have similar but different CDR3 sequences
  • Use CDR3_Cluster as grouping factor in Seurat plots
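
Once the CDR3_Cluster column has been exported from the object's metadata, the two prefixes can be tallied to see how many cells fall into single-sequence versus multi-sequence clusters. A minimal sketch, assuming the labels arrive as a plain list of strings and that cells without VDJ data carry a non-string missing value; `summarize_clusters` is a hypothetical helper.

```python
from collections import Counter


def summarize_clusters(labels):
    """Count clusters and cells per prefix (S = single CDR3, M = multiple CDR3s)."""
    cells = Counter()
    clusters = {"S": set(), "M": set()}
    for label in labels:
        if not isinstance(label, str):
            continue  # assumption: missing values (None/NaN) mean no VDJ data
        prefix = label.split("_", 1)[0]
        if prefix in clusters:
            cells[prefix] += 1
            clusters[prefix].add(label)
    return {
        p: {"clusters": len(clusters[p]), "cells": cells[p]} for p in ("S", "M")
    }


example = ["S_1", "S_1", "S_2", "M_1", "M_1", "M_1", None]
print(summarize_clusters(example))
```

In the example, three cells sit in two S_ clusters and three cells share one M_ cluster; the cell without a label is ignored.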

Performance Tips:

  • Small (<10K): GIANA defaults (quality over speed)
  • Medium (10K-100K): ClusTCR two-step with n_cpus=4
  • Large (100K-1M): ClusTCR two-step with n_cpus=8+ or GIANA
  • Very large (>1M): GIANA, or ClusTCR two-step with increased faiss_cluster_size