resistome-profiler

Reproducible antimicrobial resistance gene detection and functional annotation from bacterial whole-genome sequencing data. Downloads public NCBI reads, performs quality control, genome assembly, annotation, and AMR profiling with validated open-source tools.

FunctionalTools 0 Updated 4mo ago

Resources

GitHub

Install

npx skillscat add functionaltools/resistomeprofiler

Install via the SkillsCat registry.

SKILL.md

ResistomeProfiler: Antimicrobial Resistance Profiling from Bacterial WGS Data

Overview

This skill executes a complete, end-to-end bioinformatics pipeline for detecting antimicrobial resistance (AMR) genes from bacterial whole-genome sequencing (WGS) data. It downloads publicly available Illumina paired-end reads from NCBI SRA, performs quality control, de novo genome assembly, gene annotation, and comprehensive AMR gene detection using multiple complementary databases.

Input: NCBI SRA accession number (default: SRR10971381, an ESBL-producing Escherichia coli isolate)
Output: AMR gene report, annotated genome, assembly quality metrics, and resistance class summary

Step 1: Set Up Environment

Install all required tools via conda/mamba. This ensures version-pinned, reproducible dependencies.

# Create isolated conda environment with all dependencies
mamba create -n resistome_profiler -y -c bioconda -c conda-forge \
  sra-tools=3.1.1 \
  fastp=0.23.4 \
  spades=4.0.0 \
  quast=5.2.0 \
  prokka=1.14.6 \
  ncbi-amrfinderplus=4.0.3 \
  abricate=1.0.1 \
  mlst=2.23.0 \
  seqkit=2.8.2 \
  csvtk=0.30.0

conda activate resistome_profiler

# Update AMRFinderPlus database to latest version
amrfinder --update

Expected output: Environment created successfully, AMRFinderPlus database updated.

Validation: Run conda list -n resistome_profiler | grep -E "fastp|spades|prokka|amrfinderplus" — all four tools should appear with correct versions.

Step 2: Download Sequencing Data from NCBI SRA

Download paired-end Illumina reads for the target isolate. Default accession is SRR10971381 (ESBL-producing E. coli clinical isolate, ~200x coverage).

# Set the SRA accession (change this to analyze a different isolate)
ACCESSION="SRR10971381"
WORKDIR="resistome_profiler_output"
mkdir -p ${WORKDIR}/raw_reads
cd ${WORKDIR}

# Download paired-end FASTQ files
fasterq-dump --split-files --threads 4 --outdir raw_reads/ ${ACCESSION}

# Verify download integrity
echo "=== Read counts ==="
seqkit stats raw_reads/${ACCESSION}_1.fastq raw_reads/${ACCESSION}_2.fastq

Expected output: Two FASTQ files (_1.fastq and _2.fastq) with millions of paired-end reads. seqkit stats should report read counts and average read lengths (typically 150 bp for Illumina).

Validation: Both files should exist, have equal read counts, and total file size > 100 MB.

Step 3: Quality Control and Read Trimming

Remove adapters, low-quality bases, and short reads using fastp. This step is critical for accurate downstream assembly.

mkdir -p qc_reads

fastp \
  --in1 raw_reads/${ACCESSION}_1.fastq \
  --in2 raw_reads/${ACCESSION}_2.fastq \
  --out1 qc_reads/${ACCESSION}_trimmed_1.fastq.gz \
  --out2 qc_reads/${ACCESSION}_trimmed_2.fastq.gz \
  --json qc_reads/fastp_report.json \
  --html qc_reads/fastp_report.html \
  --thread 4 \
  --qualified_quality_phred 20 \
  --length_required 50 \
  --detect_adapter_for_pe \
  --correction \
  --cut_front \
  --cut_tail \
  --cut_window_size 4 \
  --cut_mean_quality 20

# Summarize QC results
echo "=== Post-QC Read Statistics ==="
seqkit stats qc_reads/${ACCESSION}_trimmed_1.fastq.gz qc_reads/${ACCESSION}_trimmed_2.fastq.gz

Expected output: Trimmed FASTQ files with adapter sequences removed. fastp HTML report shows >90% reads passing filters. Average quality score >Q30.

Validation: Check fastp_report.json — filtering_result.passed_filter_reads should be >80% of total input reads.

Step 4: De Novo Genome Assembly

Assemble the bacterial genome using SPAdes in isolate mode (optimized for single-isolate bacterial WGS).

mkdir -p assembly

spades.py \
  --isolate \
  -1 qc_reads/${ACCESSION}_trimmed_1.fastq.gz \
  -2 qc_reads/${ACCESSION}_trimmed_2.fastq.gz \
  -o assembly/${ACCESSION} \
  --threads 4 \
  --memory 8

# Copy final scaffolds
cp assembly/${ACCESSION}/scaffolds.fasta assembly/${ACCESSION}_scaffolds.fasta

Expected output: Assembled genome in scaffolds.fasta. For E. coli, expect ~4.5-5.5 Mbp total assembly size.

Step 5: Assembly Quality Assessment

Evaluate assembly quality using QUAST. This provides N50, total length, number of contigs, and other key metrics.

mkdir -p quality

quast \
  assembly/${ACCESSION}_scaffolds.fasta \
  -o quality/${ACCESSION}_quast \
  --min-contig 500 \
  --threads 4

# Display key metrics
echo "=== Assembly Quality Metrics ==="
cat quality/${ACCESSION}_quast/report.txt

Expected output: QUAST report showing:

Total length: ~4.5-5.5 Mbp (for E. coli)
Number of contigs (>= 500 bp): typically 50-200
N50: typically >50,000 bp
GC content: ~50.5% for E. coli

Validation: Total assembly length should be within 20% of expected genome size for the species. N50 > 20,000 bp indicates acceptable assembly quality.

Step 6: Gene Annotation

Annotate the assembled genome using Prokka to identify coding sequences, rRNAs, tRNAs, and assign functional annotations.

mkdir -p annotation

prokka \
  --outdir annotation/${ACCESSION} \
  --prefix ${ACCESSION} \
  --kingdom Bacteria \
  --genus Escherichia \
  --species coli \
  --cpus 4 \
  --force \
  assembly/${ACCESSION}_scaffolds.fasta

echo "=== Annotation Summary ==="
cat annotation/${ACCESSION}/${ACCESSION}.txt

Expected output: Prokka output directory containing .gff, .gbk, .faa, .ffn files. Summary should show ~4,500-5,500 CDS for E. coli.

Validation: Number of CDS should be proportional to genome size (~1 gene per 1 kbp). At least 3 rRNA operons expected for E. coli.

Step 7: Antimicrobial Resistance Gene Detection (AMRFinderPlus)

Run NCBI AMRFinderPlus for gold-standard AMR gene detection using both nucleotide and protein sequences.

mkdir -p amr_results

# Run AMRFinderPlus with both nucleotide and protein input for maximum sensitivity
amrfinder \
  --nucleotide assembly/${ACCESSION}_scaffolds.fasta \
  --protein annotation/${ACCESSION}/${ACCESSION}.faa \
  --gff annotation/${ACCESSION}/${ACCESSION}.gff \
  --organism Escherichia \
  --plus \
  --threads 4 \
  --output amr_results/${ACCESSION}_amrfinder.tsv \
  --name ${ACCESSION}

echo "=== AMRFinderPlus Results ==="
echo "Total AMR/stress/virulence genes detected:"
wc -l < amr_results/${ACCESSION}_amrfinder.tsv
echo ""
echo "AMR genes by class:"
tail -n +2 amr_results/${ACCESSION}_amrfinder.tsv | \
  csvtk -t cut -f "Class" | sort | uniq -c | sort -rn
echo ""
echo "Top hits:"
head -20 amr_results/${ACCESSION}_amrfinder.tsv | csvtk -t pretty

Expected output: TSV file listing detected AMR genes with gene symbol, sequence name, element type, element subtype, drug class, and percent identity. For an ESBL-producing E. coli, expect detection of beta-lactamase genes (e.g., blaCTX-M, blaTEM), and potentially aminoglycoside, quinolone, and tetracycline resistance genes.

Step 8: Cross-Validation with ABRicate (Multiple Databases)

Run ABRicate against multiple curated databases to cross-validate AMRFinderPlus results and detect additional resistance genes.

# Screen against multiple AMR databases for comprehensive coverage
for DB in ncbi card resfinder argannot megares vfdb; do
  abricate \
    --db ${DB} \
    --minid 80 \
    --mincov 60 \
    --threads 4 \
    assembly/${ACCESSION}_scaffolds.fasta \
    > amr_results/${ACCESSION}_abricate_${DB}.tsv

  echo "=== ABRicate ${DB} database: $(tail -n +2 amr_results/${ACCESSION}_abricate_${DB}.tsv | wc -l) genes detected ==="
done

# Generate summary across all databases
abricate --summary amr_results/${ACCESSION}_abricate_*.tsv > amr_results/${ACCESSION}_abricate_summary.tsv

echo "=== Cross-Database Summary ==="
cat amr_results/${ACCESSION}_abricate_summary.tsv | csvtk -t pretty

Expected output: Gene hits from each database. The ncbi and resfinder databases typically show the highest concordance with AMRFinderPlus. The vfdb database additionally reports virulence factors.

Validation: Core AMR genes (e.g., beta-lactamases) should be detected by at least 3 of the 5 AMR databases, confirming high-confidence calls.

Step 9: MLST Typing

Determine the sequence type (ST) of the isolate for epidemiological context.

mlst assembly/${ACCESSION}_scaffolds.fasta > amr_results/${ACCESSION}_mlst.tsv

echo "=== MLST Result ==="
cat amr_results/${ACCESSION}_mlst.tsv

Expected output: Sequence type assignment (e.g., ST131, ST410) with allele profiles. ST131 is a globally disseminated ESBL-producing E. coli lineage.

Step 10: Generate Consolidated Report

Compile all results into a single, human- and machine-readable summary report.

mkdir -p report

cat > report/${ACCESSION}_resistome_report.md << 'REPORT_HEADER'
# ResistomeProfiler Report
REPORT_HEADER

cat >> report/${ACCESSION}_resistome_report.md << EOF
**Accession:** ${ACCESSION}
**Date:** $(date -u +"%Y-%m-%d %H:%M:%S UTC")

## 1. Assembly Quality
\`\`\`
$(cat quality/${ACCESSION}_quast/report.txt)
\`\`\`

## 2. Annotation Summary
\`\`\`
$(cat annotation/${ACCESSION}/${ACCESSION}.txt)
\`\`\`

## 3. MLST
\`\`\`
$(cat amr_results/${ACCESSION}_mlst.tsv)
\`\`\`

## 4. AMR Genes (AMRFinderPlus)

Total genes/elements detected: $(tail -n +2 amr_results/${ACCESSION}_amrfinder.tsv | wc -l)

### By Drug Class:
\`\`\`
$(tail -n +2 amr_results/${ACCESSION}_amrfinder.tsv | csvtk -t cut -f "Class" | sort | uniq -c | sort -rn)
\`\`\`

### Full AMRFinderPlus Results:
\`\`\`
$(head -30 amr_results/${ACCESSION}_amrfinder.tsv | csvtk -t pretty)
\`\`\`

## 5. Cross-Database Validation (ABRicate)
\`\`\`
$(cat amr_results/${ACCESSION}_abricate_summary.tsv | csvtk -t pretty)
\`\`\`

## 6. Reproducibility
- **Environment:** conda (resistome_profiler)
- **Tools:** fastp 0.23.4, SPAdes 4.0.0, QUAST 5.2.0, Prokka 1.14.6, AMRFinderPlus 4.0.3, ABRicate 1.0.1
- **Input data:** NCBI SRA ${ACCESSION}
- **All tools are open-source and version-pinned for full reproducibility.**
EOF

echo "=== Report generated at report/${ACCESSION}_resistome_report.md ==="
cat report/${ACCESSION}_resistome_report.md

Expected output: A complete markdown report summarizing all analysis results, ready for inspection or publication.

Adapting This Skill

To analyze a different bacterial isolate:

Change the accession: Set ACCESSION="SRR_YOUR_ACCESSION" in Step 2
Change the organism: Update --genus and --species in Step 6 (Prokka) and --organism in Step 7 (AMRFinderPlus)
For metagenomic data: Replace --isolate with --meta in Step 4 (SPAdes)

This skill generalizes to any bacterial species with WGS data in NCBI SRA, including Klebsiella pneumoniae, Staphylococcus aureus, Pseudomonas aeruginosa, Acinetobacter baumannii, and others.

resistome-profiler

Resources

Install

ResistomeProfiler: Antimicrobial Resistance Profiling from Bacterial WGS Data

Overview

Step 1: Set Up Environment

Step 2: Download Sequencing Data from NCBI SRA

Step 3: Quality Control and Read Trimming

Step 4: De Novo Genome Assembly

Step 5: Assembly Quality Assessment

Step 6: Gene Annotation

Step 7: Antimicrobial Resistance Gene Detection (AMRFinderPlus)

Step 8: Cross-Validation with ABRicate (Multiple Databases)

Step 9: MLST Typing

Step 10: Generate Consolidated Report

Adapting This Skill

Categories

Install

Recommended Skills