Mark and remove PCR/optical duplicates using samtools fixmate and markdup. Use when preparing alignments for variant calling or when duplicate reads would bias analysis.
Resources
2Install
npx skillscat add gptomics/bioskills/bio-duplicate-handling Install via the SkillsCat registry.
Version Compatibility
Reference examples tested with: picard 3.1+, pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
pip show <package>thenhelp(module.function)to check signatures - CLI:
<tool> --versionthen<tool> --helpto confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
Duplicate Handling
"Remove PCR duplicates from my BAM file" → Mark or remove duplicate reads using the fixmate-sort-markdup pipeline to prevent duplicate bias in variant calling.
- CLI:
samtools fixmate,samtools markdup(samtools) - Python:
pysam.fixmate(),pysam.markdup()(pysam)
Mark and remove PCR/optical duplicates using samtools.
Why Remove Duplicates?
PCR duplicates are identical copies of the same original molecule, created during library preparation. They:
- Inflate coverage artificially
- Bias allele frequencies
- Can create false positive variant calls
Optical duplicates are clusters read multiple times due to their proximity on the flowcell.
Duplicate Marking Workflow
Goal: Mark PCR/optical duplicates so they can be excluded from downstream variant calling and coverage analysis.
Approach: Name-sort, add mate tags with fixmate, coordinate-sort, then run markdup. The pipeline version avoids intermediate files.
Reference (samtools 1.19+):
# 1. Sort by name (required for fixmate)
samtools sort -n -o namesort.bam input.bam
# 2. Add mate information with fixmate
samtools fixmate -m namesort.bam fixmate.bam
# 3. Sort by coordinate (required for markdup)
samtools sort -o coordsort.bam fixmate.bam
# 4. Mark duplicates
samtools markdup coordsort.bam marked.bam
# 5. Index result
samtools index marked.bamPipeline Version
samtools sort -n input.bam | \
samtools fixmate -m - - | \
samtools sort - | \
samtools markdup - marked.bam
samtools index marked.bamsamtools fixmate
Adds mate information required by markdup. Must be run on name-sorted BAM.
Basic Usage
samtools fixmate namesorted.bam fixmate.bamAdd Mate Score Tag (-m)
# Required for markdup to work correctly
samtools fixmate -m namesorted.bam fixmate.bamMulti-threaded
samtools fixmate -m -@ 4 namesorted.bam fixmate.bamRemove Secondary/Unmapped
samtools fixmate -r -m namesorted.bam fixmate.bamsamtools markdup
Marks or removes duplicate alignments. Requires coordinate-sorted BAM with mate tags from fixmate.
Mark Duplicates (Keep in File)
samtools markdup input.bam marked.bamRemove Duplicates
samtools markdup -r input.bam deduped.bamOutput Statistics
samtools markdup -s input.bam marked.bam 2> markdup_stats.txtOptical Duplicate Distance
# Set pixel distance for optical duplicate detection (default: 100)
samtools markdup -d 2500 input.bam marked.bamMulti-threaded
samtools markdup -@ 4 input.bam marked.bamWrite Stats to File
samtools markdup -f stats.txt input.bam marked.bamDuplicate Statistics
Check Duplicate Rate
samtools flagstat marked.bam
# Look for "duplicates" lineCount Duplicates
# Count reads with duplicate flag
samtools view -c -f 1024 marked.bamPercentage Duplicates
total=$(samtools view -c marked.bam)
dups=$(samtools view -c -f 1024 marked.bam)
echo "scale=2; $dups * 100 / $total" | bcpysam Python Alternative
Full Pipeline
import pysam
# Sort by name
pysam.sort('-n', '-o', 'namesort.bam', 'input.bam')
# Fixmate
pysam.fixmate('-m', 'namesort.bam', 'fixmate.bam')
# Sort by coordinate
pysam.sort('-o', 'coordsort.bam', 'fixmate.bam')
# Mark duplicates
pysam.markdup('coordsort.bam', 'marked.bam')
# Index
pysam.index('marked.bam')Check Duplicate Flag
import pysam
with pysam.AlignmentFile('marked.bam', 'rb') as bam:
total = 0
duplicates = 0
for read in bam:
total += 1
if read.is_duplicate:
duplicates += 1
print(f'Total: {total}')
print(f'Duplicates: {duplicates}')
print(f'Rate: {duplicates/total*100:.2f}%')Filter Out Duplicates
import pysam
with pysam.AlignmentFile('marked.bam', 'rb') as infile:
with pysam.AlignmentFile('nodup.bam', 'wb', header=infile.header) as outfile:
for read in infile:
if not read.is_duplicate:
outfile.write(read)Mark Duplicates Manually (Simple Case)
import pysam
from collections import defaultdict
def simple_markdup(input_bam, output_bam):
seen = defaultdict(set)
with pysam.AlignmentFile(input_bam, 'rb') as infile:
with pysam.AlignmentFile(output_bam, 'wb', header=infile.header) as outfile:
for read in infile:
if read.is_unmapped:
outfile.write(read)
continue
key = (read.reference_id, read.reference_start, read.is_reverse,
read.next_reference_id, read.next_reference_start)
if key in seen:
read.is_duplicate = True
else:
seen[key].add(read.query_name)
outfile.write(read)
simple_markdup('sorted.bam', 'marked.bam')Alternative: From Aligner
Some aligners can mark duplicates directly:
BWA-MEM2 with samblaster
bwa-mem2 mem ref.fa R1.fq R2.fq | \
samblaster | \
samtools sort -o marked.bamUsing Picard (Alternative Tool)
java -jar picard.jar MarkDuplicates \
I=input.bam \
O=marked.bam \
M=metrics.txtQuick Reference
| Task | Command |
|---|---|
| Full workflow | sort -n | fixmate -m | sort | markdup |
| Mark duplicates | samtools markdup in.bam out.bam |
| Remove duplicates | samtools markdup -r in.bam out.bam |
| Count duplicates | samtools view -c -f 1024 marked.bam |
| View non-duplicates | samtools view -F 1024 marked.bam |
| Get stats | samtools markdup -s in.bam out.bam |
Duplicate FLAG
| Flag | Value | Meaning |
|---|---|---|
| 0x400 | 1024 | PCR or optical duplicate |
Filter Commands
# View only duplicates
samtools view -f 1024 marked.bam
# View non-duplicates only
samtools view -F 1024 marked.bam
# Count non-duplicates
samtools view -c -F 1024 marked.bamCommon Errors
| Error | Cause | Solution |
|---|---|---|
mate not found |
Input not name-sorted | Run samtools sort -n first |
no MC tag |
fixmate not run with -m | Re-run fixmate with -m flag |
not coordinate sorted |
Input to markdup not sorted | Run samtools sort after fixmate |
Related Skills
- alignment-sorting - Sort by name/coordinate for workflow
- alignment-filtering - Filter duplicates from output
- bam-statistics - Check duplicate rates with flagstat
- variant-calling - Duplicate marking before calling