GPTomics

bio-duplicate-handling

Mark and remove PCR/optical duplicates using samtools fixmate and markdup. Use when preparing alignments for variant calling or when duplicate reads would bias analysis.

GPTomics 839 152 Updated 3mo ago

Resources

2
GitHub

Install

npx skillscat add gptomics/bioskills/bio-duplicate-handling

Install via the SkillsCat registry.

SKILL.md

Version Compatibility

Reference examples tested with: picard 3.1+, pysam 0.22+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:

  • Python: pip show <package> then help(module.function) to check signatures
  • CLI: <tool> --version then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

Duplicate Handling

"Remove PCR duplicates from my BAM file" → Mark or remove duplicate reads using the fixmate-sort-markdup pipeline to prevent duplicate bias in variant calling.

  • CLI: samtools fixmate, samtools markdup (samtools)
  • Python: pysam.fixmate(), pysam.markdup() (pysam)

Mark and remove PCR/optical duplicates using samtools.

Why Remove Duplicates?

PCR duplicates are identical copies of the same original molecule, created during library preparation. They:

  • Inflate coverage artificially
  • Bias allele frequencies
  • Can create false positive variant calls

Optical duplicates are clusters read multiple times due to their proximity on the flowcell.

Duplicate Marking Workflow

Goal: Mark PCR/optical duplicates so they can be excluded from downstream variant calling and coverage analysis.

Approach: Name-sort, add mate tags with fixmate, coordinate-sort, then run markdup. The pipeline version avoids intermediate files.

Reference (samtools 1.19+):

# 1. Sort by name (required for fixmate)
samtools sort -n -o namesort.bam input.bam

# 2. Add mate information with fixmate
samtools fixmate -m namesort.bam fixmate.bam

# 3. Sort by coordinate (required for markdup)
samtools sort -o coordsort.bam fixmate.bam

# 4. Mark duplicates
samtools markdup coordsort.bam marked.bam

# 5. Index result
samtools index marked.bam

Pipeline Version

samtools sort -n input.bam | \
    samtools fixmate -m - - | \
    samtools sort - | \
    samtools markdup - marked.bam

samtools index marked.bam

samtools fixmate

Adds mate information required by markdup. Must be run on name-sorted BAM.

Basic Usage

samtools fixmate namesorted.bam fixmate.bam

Add Mate Score Tag (-m)

# Required for markdup to work correctly
samtools fixmate -m namesorted.bam fixmate.bam

Multi-threaded

samtools fixmate -m -@ 4 namesorted.bam fixmate.bam

Remove Secondary/Unmapped

samtools fixmate -r -m namesorted.bam fixmate.bam

samtools markdup

Marks or removes duplicate alignments. Requires coordinate-sorted BAM with mate tags from fixmate.

Mark Duplicates (Keep in File)

samtools markdup input.bam marked.bam

Remove Duplicates

samtools markdup -r input.bam deduped.bam

Output Statistics

samtools markdup -s input.bam marked.bam 2> markdup_stats.txt

Optical Duplicate Distance

# Set pixel distance for optical duplicate detection (default: 100)
samtools markdup -d 2500 input.bam marked.bam

Multi-threaded

samtools markdup -@ 4 input.bam marked.bam

Write Stats to File

samtools markdup -f stats.txt input.bam marked.bam

Duplicate Statistics

Check Duplicate Rate

samtools flagstat marked.bam
# Look for "duplicates" line

Count Duplicates

# Count reads with duplicate flag
samtools view -c -f 1024 marked.bam

Percentage Duplicates

total=$(samtools view -c marked.bam)
dups=$(samtools view -c -f 1024 marked.bam)
echo "scale=2; $dups * 100 / $total" | bc

pysam Python Alternative

Full Pipeline

import pysam

# Sort by name
pysam.sort('-n', '-o', 'namesort.bam', 'input.bam')

# Fixmate
pysam.fixmate('-m', 'namesort.bam', 'fixmate.bam')

# Sort by coordinate
pysam.sort('-o', 'coordsort.bam', 'fixmate.bam')

# Mark duplicates
pysam.markdup('coordsort.bam', 'marked.bam')

# Index
pysam.index('marked.bam')

Check Duplicate Flag

import pysam

with pysam.AlignmentFile('marked.bam', 'rb') as bam:
    total = 0
    duplicates = 0
    for read in bam:
        total += 1
        if read.is_duplicate:
            duplicates += 1

    print(f'Total: {total}')
    print(f'Duplicates: {duplicates}')
    print(f'Rate: {duplicates/total*100:.2f}%')

Filter Out Duplicates

import pysam

with pysam.AlignmentFile('marked.bam', 'rb') as infile:
    with pysam.AlignmentFile('nodup.bam', 'wb', header=infile.header) as outfile:
        for read in infile:
            if not read.is_duplicate:
                outfile.write(read)

Mark Duplicates Manually (Simple Case)

import pysam
from collections import defaultdict

def simple_markdup(input_bam, output_bam):
    seen = defaultdict(set)

    with pysam.AlignmentFile(input_bam, 'rb') as infile:
        with pysam.AlignmentFile(output_bam, 'wb', header=infile.header) as outfile:
            for read in infile:
                if read.is_unmapped:
                    outfile.write(read)
                    continue

                key = (read.reference_id, read.reference_start, read.is_reverse,
                       read.next_reference_id, read.next_reference_start)

                if key in seen:
                    read.is_duplicate = True
                else:
                    seen[key].add(read.query_name)

                outfile.write(read)

simple_markdup('sorted.bam', 'marked.bam')

Alternative: From Aligner

Some aligners can mark duplicates directly:

BWA-MEM2 with samblaster

bwa-mem2 mem ref.fa R1.fq R2.fq | \
    samblaster | \
    samtools sort -o marked.bam

Using Picard (Alternative Tool)

java -jar picard.jar MarkDuplicates \
    I=input.bam \
    O=marked.bam \
    M=metrics.txt

Quick Reference

Task Command
Full workflow sort -n | fixmate -m | sort | markdup
Mark duplicates samtools markdup in.bam out.bam
Remove duplicates samtools markdup -r in.bam out.bam
Count duplicates samtools view -c -f 1024 marked.bam
View non-duplicates samtools view -F 1024 marked.bam
Get stats samtools markdup -s in.bam out.bam

Duplicate FLAG

Flag Value Meaning
0x400 1024 PCR or optical duplicate

Filter Commands

# View only duplicates
samtools view -f 1024 marked.bam

# View non-duplicates only
samtools view -F 1024 marked.bam

# Count non-duplicates
samtools view -c -F 1024 marked.bam

Common Errors

Error Cause Solution
mate not found Input not name-sorted Run samtools sort -n first
no MC tag fixmate not run with -m Re-run fixmate with -m flag
not coordinate sorted Input to markdup not sorted Run samtools sort after fixmate

Related Skills

  • alignment-sorting - Sort by name/coordinate for workflow
  • alignment-filtering - Filter duplicates from output
  • bam-statistics - Check duplicate rates with flagstat
  • variant-calling - Duplicate marking before calling