jaechang-hits

bwa-mem2-dna-aligner

"Fast short-read DNA aligner for whole-genome, whole-exome, and ChIP-seq alignment to a reference genome. BWA-MEM2 is the 2× faster successor to BWA-MEM; outputs SAM/BAM with read group headers required by GATK. Produces primary alignments with supplementary records for chimeric reads. Use STAR instead for RNA-seq splice-aware alignment; use Bowtie2 as an alternative with comparable accuracy."

jaechang-hits 188 19 Updated 3mo ago
GitHub

Install

npx skillscat add jaechang-hits/scicraft/bwa-mem2-dna-aligner

Install via the SkillsCat registry.

SKILL.md

BWA-MEM2 — DNA Short-Read Aligner

Overview

BWA-MEM2 aligns short DNA reads (Illumina, 50–250 bp) to a reference genome using the BWT-FM index. It is the standard aligner for whole-genome sequencing (WGS), whole-exome sequencing (WES), ChIP-seq, and ATAC-seq DNA alignment. BWA-MEM2 is 2× faster than the original BWA-MEM while producing identical results. It outputs SAM format with proper read group (@RG) headers required by GATK HaplotypeCaller and Picard tools. For paired-end reads, it marks proper pairs and resolves chimeric/split reads into supplementary alignments.

When to Use

  • Aligning WGS or WES Illumina reads to a reference genome for variant calling (SNP, indel, SV)
  • ChIP-seq or ATAC-seq DNA alignment to produce BAM files for peak calling with MACS3
  • Producing GATK-compatible BAM files with @RG read group tags
  • Aligning reads ≥ 50 bp; for shorter reads (< 50 bp), BWA-backtrack may be more appropriate
  • Re-aligning legacy FASTQ files to an updated reference genome assembly
  • Use STAR instead for RNA-seq reads that span splice junctions
  • Use Bowtie2 as an alternative for local alignment or when index size must be minimized

Prerequisites

  • Software: bwa-mem2 (conda or pre-compiled binary), samtools
  • Reference: genome FASTA (e.g., GRCh38, hg19, mm10)
  • RAM: ~28 GB for human genome index; 6–8 GB for mouse
# Install with conda (recommended)
conda install -c bioconda bwa-mem2 samtools

# Or download pre-compiled binary
wget https://github.com/bwa-mem2/bwa-mem2/releases/download/v2.2.1/bwa-mem2-2.2.1_x64-linux.tar.bz2
tar -jxf bwa-mem2-2.2.1_x64-linux.tar.bz2
export PATH="$PWD/bwa-mem2-2.2.1_x64-linux:$PATH"

# Verify
bwa-mem2 version
# 2.2.1

Quick Start

# 1. Build genome index (~30 min, run once)
bwa-mem2 index GRCh38.fa

# 2. Align paired-end reads and sort
bwa-mem2 mem -t 16 -R "@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA" \
    GRCh38.fa sample1_R1.fastq.gz sample1_R2.fastq.gz \
    | samtools sort -@ 8 -o sample1.sorted.bam

# 3. Index the BAM
samtools index sample1.sorted.bam
echo "Aligned reads: $(samtools view -c -F 4 sample1.sorted.bam)"

Workflow

Step 1: Download Reference Genome

Obtain the reference genome FASTA file matching the target assembly.

# Download GRCh38 primary assembly (human)
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
mv GCA_000001405.15_GRCh38_no_alt_analysis_set.fna GRCh38.fa

# Or use ENSEMBL/GENCODE
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/GRCh38.primary_assembly.genome.fa.gz
gunzip GRCh38.primary_assembly.genome.fa.gz

echo "Reference size: $(du -sh GRCh38.fa)"

Step 2: Build BWA-MEM2 Index

Index the reference genome — required once per genome, takes ~25-35 min for human.

# Build index (~28 GB RAM required for human genome)
bwa-mem2 index GRCh38.fa

# This creates: GRCh38.fa.0123, GRCh38.fa.amb, GRCh38.fa.ann,
#               GRCh38.fa.bwt.2bit.64, GRCh38.fa.pac
echo "Index files: $(ls GRCh38.fa.* | wc -l) created"
ls -lh GRCh38.fa.*

Step 3: Align Paired-End Reads

Align FASTQ reads with a read group header required for GATK compatibility.

# Align with read group (required for GATK)
# @RG fields: ID (run ID), SM (sample name), PL (platform), LB (library), PU (flowcell)
bwa-mem2 mem \
    -t 16 \
    -R "@RG\tID:sample1_run1\tSM:sample1\tPL:ILLUMINA\tLB:lib1\tPU:flowcell1" \
    GRCh38.fa \
    sample1_R1.fastq.gz \
    sample1_R2.fastq.gz \
    | samtools sort -@ 8 -m 2G -o sample1.sorted.bam

samtools index sample1.sorted.bam
echo "Alignment complete."
echo "Total reads:   $(samtools view -c sample1.sorted.bam)"
echo "Mapped reads:  $(samtools view -c -F 4 sample1.sorted.bam)"

Step 4: Mark PCR Duplicates

Remove or mark optical and PCR duplicates before variant calling.

# Option A: samtools markdup (fast)
samtools fixmate -m sample1.sorted.bam sample1.fixmate.bam
samtools sort -@ 8 -o sample1.fixmate.sorted.bam sample1.fixmate.bam
samtools markdup -@ 8 sample1.fixmate.sorted.bam sample1.markdup.bam
samtools index sample1.markdup.bam

echo "Duplication rate:"
samtools flagstat sample1.markdup.bam | grep "duplicate"

# Option B: Picard MarkDuplicates (GATK best practices)
picard MarkDuplicates \
    INPUT=sample1.sorted.bam \
    OUTPUT=sample1.markdup.bam \
    METRICS_FILE=sample1.dupmetrics.txt \
    REMOVE_DUPLICATES=false \
    CREATE_INDEX=true
cat sample1.dupmetrics.txt | grep -A2 "ESTIMATED"

Step 5: Assess Alignment Quality

Generate alignment statistics and check key quality metrics.

# Full alignment statistics
samtools flagstat sample1.markdup.bam > sample1.flagstat.txt
cat sample1.flagstat.txt

# Coverage statistics
samtools coverage sample1.markdup.bam | head -30

# Parse key metrics with Python
python3 - << 'EOF'
from pathlib import Path

flagstat = Path("sample1.flagstat.txt").read_text()
for line in flagstat.splitlines():
    if any(kw in line for kw in ["total", "mapped", "properly paired", "duplicate"]):
        print(line)
EOF

Step 6: Complete WGS/WES Pipeline → Variant Calling

Pipe BWA-MEM2 output directly into GATK HaplotypeCaller.

#!/bin/bash
# Complete WGS alignment → variant calling pipeline
GENOME="GRCh38.fa"
SAMPLE="sample1"
R1="data/${SAMPLE}_R1.fastq.gz"
R2="data/${SAMPLE}_R2.fastq.gz"
THREADS=16
OUTDIR="results/${SAMPLE}"
mkdir -p "$OUTDIR"

# Step 1: Align + sort
bwa-mem2 mem -t $THREADS \
    -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:lib1\tPU:run1" \
    $GENOME $R1 $R2 \
    | samtools sort -@ 8 -o $OUTDIR/${SAMPLE}.sorted.bam
samtools index $OUTDIR/${SAMPLE}.sorted.bam

# Step 2: Mark duplicates
samtools fixmate -m $OUTDIR/${SAMPLE}.sorted.bam - \
    | samtools sort -@ 8 \
    | samtools markdup -@ 8 - $OUTDIR/${SAMPLE}.markdup.bam
samtools index $OUTDIR/${SAMPLE}.markdup.bam

# Step 3: GATK variant calling
gatk HaplotypeCaller \
    -R $GENOME \
    -I $OUTDIR/${SAMPLE}.markdup.bam \
    -O $OUTDIR/${SAMPLE}.g.vcf.gz \
    -ERC GVCF \
    --native-pair-hmm-threads 4

echo "Pipeline complete: $OUTDIR/${SAMPLE}.g.vcf.gz"

Key Parameters

Parameter Default Range/Options Effect
-t 1 1–64 CPU threads; use 8–16 for production runs
-R @RG\tID:...\tSM:... Read group string; required for GATK compatibility
-k 19 10–28 Minimum seed length; lower = more sensitive for shorter reads
-w 100 50–500 Band width for Smith-Waterman alignment
-M off flag Mark split/supplementary reads as secondary (BWA-MEM style); needed for Picard compatibility
-a off flag Output all alignments for single-end reads (for seeding; increases file size)
-p off flag Treat input as interleaved paired-end FASTQ
-c 500 100–10000 Skip alignment for MEM count > threshold (reduces multi-mapper noise)
-T 30 20–60 Minimum alignment score threshold; lower = report more low-quality alignments
-Y off flag Use soft clipping for supplementary alignments (recommended for GATK)

Common Recipes

Recipe 1: Batch Align Multiple Samples

#!/bin/bash
# Align all samples in parallel using GNU parallel or sequential loop
GENOME="GRCh38.fa"
SAMPLES=(ctrl_1 ctrl_2 treat_1 treat_2)
THREADS=12

for sample in "${SAMPLES[@]}"; do
    echo "=== Aligning $sample ==="
    bwa-mem2 mem -t $THREADS \
        -R "@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA\tLB:lib1\tPU:run1" \
        $GENOME \
        data/${sample}_R1.fastq.gz \
        data/${sample}_R2.fastq.gz \
        | samtools sort -@ 4 -m 2G -o results/${sample}.sorted.bam
    samtools index results/${sample}.sorted.bam
    
    MAPPED=$(samtools view -c -F 4 results/${sample}.sorted.bam)
    TOTAL=$(samtools view -c results/${sample}.sorted.bam)
    echo "$sample: $MAPPED / $TOTAL reads mapped"
done

Recipe 2: Parse Alignment Metrics with Python

import subprocess
import pandas as pd
from pathlib import Path

samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
metrics = []

for sample in samples:
    bam = f"results/{sample}.sorted.bam"
    result = subprocess.run(
        ["samtools", "flagstat", bam], capture_output=True, text=True
    )
    stats = {}
    for line in result.stdout.splitlines():
        if "total" in line:
            stats["total"] = int(line.split()[0])
        elif "mapped" in line and "%" in line:
            stats["mapped"] = int(line.split()[0])
            stats["pct_mapped"] = float(line.split("(")[1].split("%")[0])
        elif "properly paired" in line:
            stats["properly_paired"] = int(line.split()[0])
    stats["sample"] = sample
    metrics.append(stats)

df = pd.DataFrame(metrics).set_index("sample")
print(df[["total", "mapped", "pct_mapped", "properly_paired"]])
df.to_csv("alignment_metrics.tsv", sep="\t")

Expected Outputs

Output Format Description
*.sorted.bam BAM Coordinate-sorted aligned reads; index with samtools index
*.sorted.bam.bai BAI BAM index; required for random access by GATK and IGV
*.flagstat.txt Text Alignment summary: total/mapped/paired/duplicate counts and percentages
*.markdup.bam BAM Duplicate-marked BAM; use as input to GATK HaplotypeCaller
*.dupmetrics.txt Text Picard duplication metrics with estimated library size

Troubleshooting

Problem Cause Solution
Low mapping rate (< 85%) Genome mismatch, contamination, or low quality reads Verify genome assembly matches sample; run FastQC; trim adapters with Trim Galore
@RG header missing error in GATK -R flag not specified during alignment Re-align with -R "@RG\tID:...\tSM:...\tPL:ILLUMINA"
Out of memory during indexing Insufficient RAM for genome index BWA-MEM2 requires ~28 GB for human; use classic bwa with less RAM if needed
Unbalanced paired-end counts Interleaved FASTQ or file mismatch Add -p for interleaved; verify R1/R2 read counts with zcat r1.fq.gz | wc -l
[E::bwa_idx_load_from_disk] error Index files missing or wrong prefix Re-run bwa-mem2 index genome.fa; ensure all .0123, .bwt.2bit.64 files exist
Slow alignment speed Low thread count or slow I/O Use -t 16 or more; store data on SSD; pipe directly to samtools sort
Supplementary alignments causing issues Split reads in downstream tools Add -M flag to mark split reads as secondary (Picard compatibility)
GATK base quality score recalibration fails Missing known variant VCF Download dbSNP VCF for your genome assembly from NCBI or GATK resource bundle

References