jaechang-hits

cnvkit-copy-number

"Detect somatic copy number variants (CNVs) from WES, WGS, or targeted sequencing BAM files with CNVkit v0.9.x. Pipeline: calculate bin-level coverage in target/antitarget regions, normalize against a reference, segment copy ratios with CBS or HMM, call amplifications and deletions, generate scatter/diagram plots, estimate tumor purity and ploidy, and export to VCF/SEG. Both CLI and Python API (cnvlib) shown. Use GATK CNV instead for deep WGS with population-scale controls; use CNVkit for targeted or exome sequencing where antitarget bins are critical."

jaechang-hits 188 19 Updated 3mo ago
GitHub

Install

npx skillscat add jaechang-hits/scicraft/cnvkit-copy-number

Install via the SkillsCat registry.

SKILL.md

CNVkit Copy Number Analysis

Overview

CNVkit detects somatic copy number variants (CNVs) from whole-exome sequencing (WES), whole-genome sequencing (WGS), or targeted panel BAM files. It calculates read depth in both on-target (capture) bins and off-target (antitarget) bins, corrects for GC bias and library depth, segments the log2 copy ratio profile with circular binary segmentation (CBS) or a hidden Markov model (HMM), and calls amplifications and deletions. CNVkit provides both a CLI (cnvkit.py) and a Python API (cnvlib) for integration into analysis pipelines, and produces scatter plots, chromosome diagrams, heatmaps, and export files in VCF, BED, and SEG formats.

When to Use

  • Calling somatic copy number variants from tumor-normal paired exome (WES) or targeted panel sequencing
  • Detecting copy number alterations in tumor-only samples using a pooled normal reference
  • Running CNV analysis on whole-genome sequencing (WGS) data with the --method wgs mode
  • Estimating tumor purity and ploidy for samples where purity is unknown, to interpret copy ratio calls
  • Generating SEG format copy number files for GISTIC2, cBioPortal, or IGV visualization
  • Identifying focal amplifications (e.g., ERBB2, MYC) or homozygous deletions (e.g., CDKN2A, RB1)
  • Use GATK CNV (gatk DenoiseReadCounts / gatk ModelSegments) instead for deep WGS cohorts with large matched panel-of-normals (PoN); CNVkit is better suited for targeted/exome data
  • Use Control-FREEC instead when you need allele-frequency-based B-allele fraction modeling alongside CNV calling

Prerequisites

  • Software: CNVkit v0.9.x, Python 3.8+, R (for CBS segmentation), samtools
  • Python packages: cnvlib (installed as part of CNVkit), matplotlib, pandas
  • Input files: sorted, indexed BAM files (tumor ± matched normal); BED file of capture targets; reference genome FASTA; access to R with DNAcopy package for CBS
  • Data requirements: minimum ~50× mean target coverage for WES; WGS works at 20-30×
# Install CNVkit via conda (recommended — handles R/DNAcopy dependency)
conda install -c bioconda cnvkit

# Or via pip (requires R + DNAcopy already installed)
pip install cnvkit

# Verify
cnvkit.py version
# cnvkit 0.9.10

# Install R DNAcopy (for CBS segmentation)
Rscript -e 'if (!requireNamespace("BiocManager")) install.packages("BiocManager"); BiocManager::install("DNAcopy")'

# Index BAM files if not already indexed
samtools index tumor.bam
samtools index normal.bam

Quick Start

# One-command paired tumor/normal CNV analysis (WES)
cnvkit.py batch tumor.bam \
    --normal normal.bam \
    --targets targets.bed \
    --fasta GRCh38.fa \
    --output-dir cnvkit_results/ \
    --diagram --scatter \
    --method hybrid

# Output files in cnvkit_results/:
#   tumor.targetcoverage.cnn   — target bin coverage
#   tumor.antitargetcoverage.cnn — antitarget coverage
#   tumor.cnr                  — copy number ratios
#   tumor.cns                  — segmented copy numbers
#   tumor-scatter.png          — genome-wide scatter plot
#   tumor-diagram.pdf          — chromosome diagram
echo "CNV analysis complete"

Workflow

Step 1: Create Copy Number Reference

Build a reference from one or more normal BAM files. This corrects for systematic biases (GC content, mappability) and sets the neutral baseline.

# Option A: Paired normal reference (single matched normal)
cnvkit.py reference normal.targetcoverage.cnn normal.antitargetcoverage.cnn \
    --fasta GRCh38.fa \
    -o reference_normal.cnn

# Option B: Flat reference (no normal; uses GC/mappability correction only)
# Use when no matched normal is available
cnvkit.py reference \
    --targets targets.bed \
    --fasta GRCh38.fa \
    --output flat_reference.cnn

# Option C: Pooled normal reference from multiple normals (most robust)
cnvkit.py batch \
    normal1.bam normal2.bam normal3.bam \
    --normal \
    --targets targets.bed \
    --fasta GRCh38.fa \
    --output-reference pooled_reference.cnn \
    --output-dir normals_cov/

echo "Reference created: pooled_reference.cnn"

Step 2: Calculate Coverage in Target and Antitarget Bins

Bin the target BED file and compute per-bin read depth for tumor and normal samples.

# First, create accessible bins from the target BED
cnvkit.py target targets.bed \
    --annotate refFlat.txt \
    --split \
    -o targets.split.bed

cnvkit.py antitarget targets.bed \
    --access data/access-5k-mappable.hg38.bed \
    -o antitargets.bed

# Calculate coverage for tumor sample
cnvkit.py coverage tumor.bam targets.split.bed \
    -o tumor.targetcoverage.cnn

cnvkit.py coverage tumor.bam antitargets.bed \
    -o tumor.antitargetcoverage.cnn

echo "Coverage files:"
echo "  tumor.targetcoverage.cnn"
echo "  tumor.antitargetcoverage.cnn"
# Python API equivalent: compute coverage with cnvlib
import cnvlib

# Load and inspect coverage files
target_cov = cnvlib.read("tumor.targetcoverage.cnn")
antitarget_cov = cnvlib.read("tumor.antitargetcoverage.cnn")

print(f"Target bins: {len(target_cov)}")
print(f"Antitarget bins: {len(antitarget_cov)}")
print(f"Mean target depth: {target_cov['depth'].mean():.1f}×")
print(f"Median target depth: {target_cov['depth'].median():.1f}×")

# Check coverage distribution
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 4))
target_cov["depth"].clip(upper=500).hist(bins=50, ax=ax, color="#2c6fad", alpha=0.7)
ax.set_xlabel("Read depth")
ax.set_ylabel("Bin count")
ax.set_title("Target bin coverage distribution")
plt.tight_layout()
plt.savefig("coverage_distribution.png", dpi=150)
print("Saved: coverage_distribution.png")

Step 3: Normalize and Correct Copy Ratios

Normalize tumor coverage against the reference (correcting for GC bias, library depth, and target efficiency).

# Fix sample-level biases relative to the reference
cnvkit.py fix \
    tumor.targetcoverage.cnn \
    tumor.antitargetcoverage.cnn \
    pooled_reference.cnn \
    -o tumor.cnr

echo "Copy number ratios: tumor.cnr"
# tumor.cnr columns: chromosome, start, end, gene, log2, depth, weight
# Inspect copy ratio file with cnvlib Python API
import cnvlib
import pandas as pd

cnr = cnvlib.read("tumor.cnr")
print(f"Total bins: {len(cnr)}")
print(f"Chromosomes: {sorted(cnr.chromosome.unique())}")

# Convert to DataFrame and inspect
df = cnr.data
print(f"\nLog2 copy ratio summary:")
print(df["log2"].describe().round(3))

# Flag high-amplitude events
high_amp = df[df["log2"] >= 2.0]
hom_del  = df[df["log2"] <= -3.0]
print(f"\nHigh amplitude bins (log2 >= 2.0): {len(high_amp)}")
print(f"Homozygous deletion bins (log2 <= -3.0): {len(hom_del)}")
if not high_amp.empty:
    print(high_amp[["chromosome", "start", "end", "gene", "log2"]].head())

Step 4: Segment Copy Number Ratios

Identify contiguous regions of similar copy ratio using CBS (Circular Binary Segmentation) or HMM segmentation.

# CBS segmentation (default; requires R DNAcopy)
cnvkit.py segment tumor.cnr \
    -o tumor.cns \
    --method cbs

# HMM segmentation (no R required; faster)
cnvkit.py segment tumor.cnr \
    -o tumor.hmm.cns \
    --method hmm

echo "Segments: tumor.cns"
# tumor.cns columns: chromosome, start, end, gene, log2, cn, depth, p_ttest, weight, probes
# Python API: segment and inspect
import cnvlib
import subprocess

# Run segmentation via Python subprocess (mirrors CLI)
subprocess.run(
    ["cnvkit.py", "segment", "tumor.cnr", "-o", "tumor.cns", "--method", "cbs"],
    check=True
)

# Load and analyze segments
cns = cnvlib.read("tumor.cns")
df_seg = cns.data
print(f"Total segments: {len(df_seg)}")
print(f"\nSegment log2 summary:")
print(df_seg["log2"].describe().round(3))

# Large segments (>5 Mb) with copy gain or loss
large_events = df_seg[
    ((df_seg["end"] - df_seg["start"]) > 5_000_000) &
    (df_seg["log2"].abs() > 0.3)
].copy()
large_events["size_mb"] = (large_events["end"] - large_events["start"]) / 1e6
print(f"\nLarge CNV segments (>5 Mb, |log2|>0.3): {len(large_events)}")
print(large_events[["chromosome", "start", "end", "log2", "size_mb"]].head(8).to_string(index=False))

Step 5: Call CNV States

Assign integer copy number states and classify amplifications and deletions.

# Call with default thresholds (diploid normal)
cnvkit.py call tumor.cns \
    -o tumor.call.cns \
    --ploidy 2

# Call with tumor purity estimate (if known)
cnvkit.py call tumor.cns \
    --purity 0.7 \
    --ploidy 2 \
    -o tumor.call.purity.cns

echo "Called CNVs: tumor.call.cns"
# Parse called CNV file and classify events
import cnvlib
import pandas as pd

cns_called = cnvlib.read("tumor.call.cns")
df = cns_called.data

# Classify by log2 thresholds (diploid assumed)
# log2 >= 1.0 = high-level amplification (CN >= 4)
# log2 0.2–1.0 = copy gain (CN = 3)
# log2 -1.0–-0.2 = heterozygous deletion (CN = 1)
# log2 <= -3.5 = homozygous deletion (CN = 0)

def classify_cnv(log2):
    if log2 >= 1.0:    return "AMP"
    if log2 >= 0.2:    return "GAIN"
    if log2 <= -3.5:   return "HOMDEL"
    if log2 <= -1.0:   return "LOSS"
    return "NEUTRAL"

df["cnv_class"] = df["log2"].apply(classify_cnv)
print("CNV class counts:")
print(df["cnv_class"].value_counts().to_string())

# Focal amplifications in known oncogenes
oncogenes = ["ERBB2", "MYC", "EGFR", "CCND1", "CDK6", "MDM2", "KRAS"]
focal_amps = df[(df["cnv_class"] == "AMP") &
                (df["gene"].str.split(",").apply(
                    lambda genes: any(g in oncogenes for g in genes)))]
if not focal_amps.empty:
    print(f"\nFocal amplifications in oncogenes:")
    print(focal_amps[["chromosome", "start", "end", "gene", "log2", "cn"]].to_string(index=False))

Step 6: Visualize CNV Profile

Generate scatter plots and chromosome diagrams to review the copy number landscape.

# Genome-wide scatter plot (CNR bins + segments)
cnvkit.py scatter tumor.cnr \
    -s tumor.cns \
    -o tumor-scatter.png

# Chromosome diagram (color-coded by CN state)
cnvkit.py diagram tumor.cnr \
    -s tumor.cns \
    -o tumor-diagram.pdf

# Heatmap across multiple samples
cnvkit.py heatmap sample1.cns sample2.cns sample3.cns \
    -o cohort_heatmap.pdf

echo "Plots saved: tumor-scatter.png, tumor-diagram.pdf, cohort_heatmap.pdf"
# Custom scatter plot with matplotlib highlighting specific genes
import cnvlib
import matplotlib.pyplot as plt
import numpy as np

cnr = cnvlib.read("tumor.cnr")
cns = cnvlib.read("tumor.cns")

# Plot chr7 (EGFR locus) in detail
fig, ax = plt.subplots(figsize=(12, 4))
chrom = "chr7"
cnr_chr = cnr.data[cnr.data["chromosome"] == chrom]
cns_chr = cns.data[cns.data["chromosome"] == chrom]

# Bin dots
ax.scatter(cnr_chr["start"], cnr_chr["log2"],
           s=3, alpha=0.3, color="#aaa", label="Bins")
# Segment lines
for _, seg in cns_chr.iterrows():
    ax.hlines(seg["log2"], seg["start"], seg["end"],
              colors=("#d32f2f" if seg["log2"] > 0.3 else
                      "#1565c0" if seg["log2"] < -0.3 else "#666"),
              lw=3, label="_nolegend_")

# Mark EGFR
egfr_start, egfr_end = 55_019_017, 55_211_628
ax.axvspan(egfr_start, egfr_end, alpha=0.15, color="orange")
ax.text((egfr_start + egfr_end) / 2, ax.get_ylim()[1] * 0.85,
        "EGFR", ha="center", fontsize=9, color="darkorange")

ax.axhline(0, color="black", lw=0.8, ls="--")
ax.axhline(0.585, color="#d32f2f", lw=0.5, ls=":")   # gain threshold
ax.axhline(-1.0, color="#1565c0", lw=0.5, ls=":")    # loss threshold
ax.set_xlabel("Chromosome 7 position (bp)")
ax.set_ylabel("Log2 copy ratio")
ax.set_title(f"CNV Profile — {chrom}")
plt.tight_layout()
plt.savefig("chr7_cnv_scatter.png", dpi=150, bbox_inches="tight")
print("Saved: chr7_cnv_scatter.png")

Step 7: Estimate Tumor Purity and Ploidy

Use CNVkit's purity/ploidy estimation to interpret absolute copy numbers.

# Estimate purity and ploidy from the segmented CNV profile
cnvkit.py call tumor.cns \
    --purity auto \
    --ploidy 2 \
    --method clonal \
    -o tumor.call.auto.cns \
    --center median

# Print purity/ploidy estimate embedded in header
head -5 tumor.call.auto.cns
# Python API: purity/ploidy estimation with cnvlib
import cnvlib
from cnvlib import segmetrics

cns = cnvlib.read("tumor.cns")
cnr = cnvlib.read("tumor.cnr")

# Compute segment-level statistics
cns_with_stats = segmetrics.do_segmetrics(cnr, cns,
    location_stats=["mean", "median"],
    spread_stats=["stdev"])

# Export stats to CSV for review
df = cns_with_stats.data
df.to_csv("tumor_segment_stats.csv", index=False)
print(f"Segment stats saved: tumor_segment_stats.csv")
print(f"Segments: {len(df)}")
print(df[["chromosome", "start", "end", "log2", "cn", "mean", "stdev"]].head(8).to_string(index=False))

Step 8: Export to VCF, BED, and SEG Formats

Export CNV calls for downstream tools (GISTIC2, cBioPortal, IGV, clinical reporting).

# Export to VCF (for clinical variant databases)
cnvkit.py export vcf tumor.call.cns \
    -o tumor.cnv.vcf

# Export to SEG format (for GISTIC2 and cBioPortal)
cnvkit.py export seg tumor.call.cns \
    -o tumor.seg

# Export to BED format (for bedtools/IGV)
cnvkit.py export bed tumor.call.cns \
    -o tumor.cnv.bed

echo "Exported:"
echo "  tumor.cnv.vcf   — VCF format"
echo "  tumor.seg       — SEG format (GISTIC2 / cBioPortal)"
echo "  tumor.cnv.bed   — BED format (IGV / bedtools)"
# Parse and summarize SEG file
import pandas as pd

seg = pd.read_csv("tumor.seg", sep="\t",
                  names=["sample", "chrom", "start", "end", "n_probes", "log2"],
                  comment="#")
print(f"SEG file: {len(seg)} segments")
print(seg.head())

# Count amplifications and deletions by chromosome arm
amp_count = (seg["log2"] >= 0.585).sum()
del_count  = (seg["log2"] <= -1.0).sum()
print(f"\nAmplifications (log2 >= 0.585): {amp_count}")
print(f"Deletions (log2 <= -1.0):       {del_count}")
seg.to_csv("tumor_seg_annotated.csv", index=False)

Key Parameters

Parameter Default Range / Options Effect
--method (batch/coverage) "hybrid" "hybrid", "wgs", "amplicon" Sequencing method; selects target binning strategy
--segment-method (segment) "cbs" "cbs", "hmm", "haar", "none" Segmentation algorithm; CBS requires R DNAcopy
--ploidy (call) 2 16 Assumed baseline ploidy for absolute CN calling
--purity (call) 1.0 0.11.0 or "auto" Tumor cell fraction; corrects log2 ratios for admixed normal
--target-avg-size (target) 200 (WES) 50500 bp Desired mean target bin size after splitting
--antitarget-avg-size 150000 10000500000 bp Antitarget bin size (larger = fewer bins, less noise)
--drop-low-coverage (segment) off flag Drop bins with depth < 5× before segmentation
-p / --processes (coverage) 1 1–CPU count Parallel processes for coverage calculation
--scatter (batch) off flag Automatically generate scatter plot
--diagram (batch) off flag Automatically generate chromosome diagram

Common Recipes

Recipe: Tumor-Only Analysis with Flat Reference

When to use: No matched normal is available; use a flat GC/mappability-corrected reference.

# Step 1: Create flat reference
cnvkit.py reference \
    --targets targets.bed \
    --fasta GRCh38.fa \
    --output flat_reference.cnn

# Step 2: Run batch on tumor-only
cnvkit.py batch tumor.bam \
    --reference flat_reference.cnn \
    --output-dir tumor_only_results/

echo "Tumor-only results: tumor_only_results/"

Recipe: WGS Mode for Whole-Genome Data

When to use: Input is whole-genome sequencing (no capture BED needed).

# WGS mode uses a uniform genome-wide bin grid
cnvkit.py batch tumor_wgs.bam \
    --normal normal_wgs.bam \
    --method wgs \
    --fasta GRCh38.fa \
    --output-dir wgs_results/ \
    --scatter

echo "WGS CNV analysis complete: wgs_results/"

Recipe: Snakemake Integration for Multi-Sample Cohort

When to use: Automate CNVkit for a cohort of paired tumor/normal samples.

# Snakefile — CNVkit batch pipeline
configfile: "config.yaml"
SAMPLES = config["tumor_samples"]   # list of sample names
GENOME  = config["genome_fasta"]
TARGETS = config["targets_bed"]
POOLED  = config["pooled_reference"]  # pre-built pooled normal reference

rule all:
    input:
        expand("cnvkit/{sample}.call.cns", sample=SAMPLES),
        expand("cnvkit/{sample}-scatter.png", sample=SAMPLES),

rule cnvkit_batch:
    input:
        tumor  = "bam/{sample}.tumor.bam",
        ref    = POOLED,
    output:
        cnr    = "cnvkit/{sample}.cnr",
        cns    = "cnvkit/{sample}.cns",
        call   = "cnvkit/{sample}.call.cns",
        scatter = "cnvkit/{sample}-scatter.png",
    threads: 4
    shell:
        """
        cnvkit.py batch {input.tumor} \
            --reference {input.ref} \
            --targets {TARGETS} \
            --fasta {GENOME} \
            --output-dir cnvkit/ \
            --scatter --processes {threads}
        cnvkit.py call cnvkit/{wildcards.sample}.cns \
            --ploidy 2 -o {output.call}
        """

Recipe: Export Gene-Level CNV Table

When to use: Summarize copy number status for a panel of cancer genes from a called CNS file.

import cnvlib
import pandas as pd

cns = cnvlib.read("tumor.call.cns")
df = cns.data

# Cancer genes of interest
cancer_genes = ["ERBB2", "MYC", "EGFR", "CDKN2A", "RB1", "TP53",
                "KRAS", "PTEN", "BRCA1", "BRCA2", "MDM2", "CDK4"]

rows = []
for gene in cancer_genes:
    hits = df[df["gene"].str.contains(gene, na=False)]
    if hits.empty:
        rows.append({"gene": gene, "log2_mean": float("nan"), "cn": "n/a", "status": "not_detected"})
    else:
        best = hits.loc[hits["log2"].abs().idxmax()]
        status = ("AMP" if best["log2"] >= 1.0 else
                  "GAIN" if best["log2"] >= 0.2 else
                  "HOMDEL" if best["log2"] <= -3.5 else
                  "LOSS" if best["log2"] <= -1.0 else "NEUTRAL")
        rows.append({"gene": gene, "log2_mean": round(best["log2"], 3),
                     "cn": best.get("cn", "?"), "status": status})

result_df = pd.DataFrame(rows)
print(result_df.to_string(index=False))
result_df.to_csv("cancer_gene_cnv_summary.csv", index=False)
print("\nSaved: cancer_gene_cnv_summary.csv")

Expected Outputs

Output File Format Description
tumor.targetcoverage.cnn CNN Per-bin target region read depth and log2 coverage
tumor.antitargetcoverage.cnn CNN Per-bin antitarget region coverage (off-target reads)
tumor.cnr CNR Normalized log2 copy ratio per bin (GC/depth corrected)
tumor.cns CNS Segmented copy ratios (CBS/HMM output); one row per segment
tumor.call.cns CNS Called copy number states with integer CN and purity adjustment
tumor-scatter.png PNG Genome-wide scatter plot of bins + segment overlays
tumor-diagram.pdf PDF Chromosome arm diagram color-coded by CN state
tumor.seg SEG GISTIC2/cBioPortal-format segment file
tumor.cnv.vcf VCF VCF-format CNV calls for clinical databases

Troubleshooting

Problem Cause Solution
High noise in antitarget bins Low off-target coverage (<0.1× mean) Increase --antitarget-avg-size to 500kb; use --drop-low-coverage to exclude low bins before segmentation
CBS segmentation fails with Error in DNAcopy R DNAcopy not installed or incompatible Install via BiocManager::install("DNAcopy"); alternatively use --method hmm (no R required)
All segments near 0 (no CNVs detected) Low tumor purity (<20%) or shallow coverage Verify purity with cnvkit.py call --purity auto; check target coverage depth with cnvkit.py coverage
Wavy GC bias across chromosomes GC normalization failed or flat reference used with biased tumor Rebuild reference with matched normal BAMs; check cnvkit.py reference includes --fasta for GC correction
Many very short segments (over-segmentation) CBS threshold too sensitive Add --threshold 0.2 or --smooth-cbs to reduce false segment boundaries; increase minimum segment size
ImportError: No module named cnvlib CNVkit not installed in active environment Activate the correct conda env: conda activate cnvkit_env; verify cnvkit.py version
Chromosome naming mismatch BAM uses 1 but reference uses chr1 or vice versa Ensure BAM, BED, and FASTA all use the same chromosome naming convention

References