Resources
1Install
npx skillscat add delphine-l/claude-global/skills-bioinformatics-vgp-pipeline Install via the SkillsCat registry.
VGP Assembly Pipeline Skill
Overview
The Vertebrate Genome Project (VGP) assembly pipeline consists of Galaxy workflows for producing high-quality, phased, chromosome-level genome assemblies. This skill covers workflow selection, execution patterns, and quality control checkpoints.
Trajectories (by frequency of use)
Trajectory A: HiFi + Hi-C (Most Common)
- Inputs: HiFi Reads, Hi-C Reads
- Path: WF1 → WF4 → [WF6] → WF8 → WF9 → PreCuration
- Output: HiC Phased assembly (hap1/hap2)
- WF6: Optional (can skip directly to WF8)
Trajectory B: HiFi + Trio
- Inputs: HiFi Reads, Hi-C Reads, Parental Reads
- Path: WF2 → WF5 → [WF6] → WF8 → WF9 → PreCuration
- Output: Trio Phased assembly (maternal/paternal)
- WF6: Optional (can skip directly to WF8)
Trajectory C: HiFi Only (Least Common)
- Inputs: HiFi Reads only
- Path: WF1 → WF3 → WF6 → WF9 → PreCuration
- Output: Pseudohaplotype assembly (primary/alternate)
- WF6: Required (no Hi-C scaffolding step)
- Note: Skips WF8 entirely
Workflow Selection by Data Availability
Non-trio workflows (HiFi reads only)
- VGP1 (WF1): K-mer profiling with HiFi reads alone
- VGP3 (WF3): HiFi-only assembly with HiFiasm
Trio workflows (HiFi + Parental Illumina)
- VGP2 (WF2): Trio k-mer profiling (HiFi child + Illumina parents)
- VGP5 (WF5): Trio-phased assembly with HiFiasm
Universal scaffolding workflows
- RagTag scaffolding: Used for both trio and non-trio assemblies
- Requires reference genome specification
Methods language pattern
When documenting workflow selection in publications:
"For species with available parental data (trio datasets), we employed
VGP2 → VGP5 workflows. For species without parental data (non-trio datasets),
we performed VGP1 → VGP3 workflows."Workflow Descriptions
| Workflow | Name | Description |
|---|---|---|
| WF0 | Mitochondrial Assembly | MitoHiFi assembly (runs in parallel, may fail if no mito reads) |
| WF1 | K-mer Profiling | Genome size, heterozygosity estimation (HiFi) |
| WF2 | Trio K-mer Profiling | K-mer profiling with parental data |
| WF3 | Hifiasm | HiFi-only assembly |
| WF4 | Hifiasm + HiC | HiC-phased assembly |
| WF5 | Hifiasm Trio | Trio-phased assembly |
| WF6 | Purge Duplicates | Remove haplotypic duplications |
| Deprecated - no longer used | ||
| WF8 | Hi-C Scaffolding | YAHS chromosome scaffolding |
| WF9 | Decontamination | Remove contaminants |
| PreCuration | Pretext Snapshot | Prepare files for manual curation |
Haplotype Execution Patterns
Run Once (Both Haplotypes Together)
- WF1, WF2 (K-mer profiling)
- WF3, WF4, WF5 (Assembly)
- WF6 (Purge Duplicates) - depends on trajectory
- PreCuration
Run Twice (×2 per Haplotype)
- WF8 (Hi-C Scaffolding)
- WF9 (Decontamination)
WF6 (Purge Duplicates) Decision Logic
if trajectory == "C" (HiFi only):
WF6 is REQUIRED
WF6 border: solid
else: # Trajectory A or B
WF6 is OPTIONAL
WF6 border: dashed
Can skip directly to WF8When to skip WF6 (Trajectories A/B):
- Merqury k-mer spectra shows clean haplotype separation
- Assembly QV is already high
- No significant duplication detected
When to run WF6 (Trajectories A/B):
- K-mer spectra shows residual duplications
- Higher heterozygosity samples
- Conservative approach preferred
Coverage Requirements
| Data Type | Minimum Coverage | Notes |
|---|---|---|
| HiFi | 30× | Diploid genome |
| Hi-C | 60× | Diploid genome |
QC Checkpoints
After WF1/WF2 (K-mer Profiling)
- Verify GenomeScope2 model fit
- Check estimated genome size
- Review heterozygosity estimate
After WF4/WF5 (Assembly)
- Inspect Merqury k-mer spectra
- Decide whether to run WF6 based on duplication levels
After WF8 (Hi-C Scaffolding)
- Check Pretext Hi-C contact maps
- Verify chromosome-level scaffolding
- Validate against expected karyotype (see Karyotype Validation below)
After WF9 (Decontamination)
- Review contamination reports
- Check for unexpected removals
Karyotype-Based Scaffold Validation
Sex Chromosome Adjustment
Problem: VGP assemblies often place both sex chromosomes (X+Y or Z+W) in the main haplotype, requiring adjustment to expected chromosome counts.
Solution: When both sex chromosomes present, expected = n + 1 (not n)
Implementation:
# Adjust haploid expected when BOTH sex chromosomes in main haplotype
df['num_chromosomes_haploid_adjusted'] = df['num_chromosomes_haploid'].copy()
both_sex_chr_patterns = [
'Has X and Y',
'Has Z and W',
'has Z and W',
'Has X1, X2, and Y',
'Has Z1, Z2, and W',
'Has 5X and 5Y'
]
if 'Sex chromosomes main haploptype' in df.columns:
has_both_sex = df['Sex chromosomes main haploptype'].isin(both_sex_chr_patterns)
df.loc[has_both_sex & df['num_chromosomes_haploid'].notna(),
'num_chromosomes_haploid_adjusted'] = \
df.loc[has_both_sex & df['num_chromosomes_haploid'].notna(),
'num_chromosomes_haploid'] + 1Biological Reasoning:
- Diploid organisms have two sex chromosomes (XX, XY, ZZ, ZW)
- X and Y (or Z and W) are distinct chromosomes
- If both in main haplotype → two separate scaffolds expected
- Example: Asian elephant 2n=56, n=28, has X+Y → expect 29 scaffolds
Impact: Improved perfect match rate from 0% to ~90% in validation analyses
Validation Metrics:
# Use adjusted counts for validation
achieved = df['total_number_of_chromosomes']
expected = df['num_chromosomes_haploid_adjusted']
perfect_matches = (achieved == expected).sum()
within_1 = ((achieved - expected).abs() <= 1).sum()
ratio = achieved / expectedCommon Pitfalls
❌ Wrong: Compare diploid expected (2n) to haploid assembly
- Results in ~50% achievement rates
- Biologically incorrect
❌ Wrong: Use haploid (n) when both sex chromosomes present
- Underestimates by 1
- Shows artificial "extra scaffold" problem
✅ Correct: Use adjusted haploid (n or n+1 depending on sex chromosome configuration)
WF0 (Mitochondrial) Handling
WF0 runs in parallel with the main pipeline and may fail if:
- No mitochondrial reads present in HiFi data
- This is a biological failure, not technical
def check_mitohifi_failure(wf0_result):
"""Distinguish biological vs technical failure"""
if "no_mito_reads" in wf0_result.log:
return "biological" # Expected for some samples
else:
return "technical" # Investigate furtherVisual Diagram Elements
When creating workflow diagrams:
Color Coding (Suggested)
- K-mer Profiling section: Orange (
#fff3e0) - Assembly section: Green (
#e8f5e9) - Purging section: Purple (
#f3e5f5) - Scaffolding section: Blue (
#e3f2fd) - Finishing section: Green (
#e8f5e9) - WF0 (Mitochondrial): Pink (
#fce4ec)
Visual Indicators
- Solid lines: Required workflow connections
- Dashed lines: Optional skip paths
- Dashed box border: Optional workflow (WF6 in trajectories A/B)
- Solid box border: Required workflow
- Dimmed elements: Workflows not used in current trajectory
Haplotype Badges
- Blue badge (
#e3f2fd): "×2 per haplotype" - runs separately - Green badge (
#e8f5e9): "both haplotypes" - runs together
Input Data Labels
- HiFi Reads: Blue (
#4285f4) - Hi-C Reads: Green (
#34a853) - Parental Reads: Red (
#ea4335)
Summary Table
| Trajectory | Inputs | K-mer | Assembly | Purge | Scaffold | Finish | Output |
|---|---|---|---|---|---|---|---|
| A | HiFi+HiC | WF1 | WF4 | [WF6] | WF8 | WF9→Pre | hap1/hap2 |
| B | HiFi+Trio | WF2 | WF5 | [WF6] | WF8 | WF9→Pre | mat/pat |
| C | HiFi only | WF1 | WF3 | WF6 | - | WF9→Pre | pri/alt |
[WF6] = optional, WF6 = required, - = skipped
Reference Genomes for Scaffolding
Common Reference Genome
GCA_011100685.1 - Frequently used reference genome for RagTag scaffolding in canid genome assemblies.
When documenting scaffolding in methods sections:
- Always specify the reference genome accession
- Include version number if applicable
- Example: "scaffolded using RagTag v2.1.0 with the reference genome GCA_011100685.1"
Best Practices
For reproducibility:
- Document exact accession used
- Specify if custom modifications were made to reference
- Note if different references used for different species/assemblies
Resource Analysis Insights
VGP Workflow Canonical Names
Official VGP workflows normalized to canonical names for analysis:
| ID | Canonical Name | Description |
|---|---|---|
| VGP0 | Mitogenome Assembly (VGP0) | Mitochondrial genome assembly |
| VGP1/WF1 | K-mer Profiling (VGP1) | K-mer profiling for genome size estimation (HiFi) |
| VGP2 | K-mer Profiling Trio (VGP2) | Trio k-mer profiling |
| VGP3 | HiFi-only Assembly (VGP3) | HiFi-only assembly |
| VGP4 | HiFi-HiC Phased Assembly (VGP4) | HiFi-HiC phased assembly |
| VGP5 | HiFi-Trio Phased Assembly (VGP5) | HiFi-Trio phased assembly |
| VGP6 | Purge Duplicates (VGP6) | Purge duplicate contigs |
| VGP6b | Purge Duplicates One Haplotype (VGP6b) | Purge duplicates (haploid mode) |
| VGP7 | BioNano Scaffolding (VGP7) | BioNano optical mapping scaffolding |
| VGP8 | Hi-C Scaffolding (VGP8) | Hi-C scaffolding |
| VGP9 | Assembly Decontamination (VGP9) | Assembly decontamination |
Note: WF1-9 are aliases for VGP1-9 (same workflows, different naming convention).
Official vs Non-Official Workflow Identification
When analyzing VGP workflow metrics, filter for official workflows only:
Official workflows (include):
- Core VGP workflows: VGP0-VGP9, WF0-WF9
- PretextMap/PreCuration workflows
- Workflows with version info (e.g., "v0.1.8", "release v0.3")
- Workflows imported from uploaded files or URLs (official sources)
- "WORKFLOW REPORT TEST" workflows (testing report generation, not execution)
Non-official workflows (exclude):
- Export workflows (utility workflows, e.g., "Export PretextMap Workflow")
- test1, test2, etc. (debug/retry runs with lowercase test + number)
- Attempt1, Attempt2, Fix_Attempt (retry/debug runs)
- "Copy of" workflows (user copies)
- Numbered prefixes: "1. ", "2. ", "3. " (custom project markers)
- Custom project annotations (e.g., "Used for other Columbiformes")
- Typos (e.g., "Gnome Assembly" instead of "Genome")
- "training workflow" (tutorial workflows)
- Experimental integrations (e.g., "ONT-INTEGRATED")
Impact: Filtering typically removes ~20-25% of data, leaving ~80% official workflow executions for accurate analysis.
Species ID (ToLID) Patterns
VGP ToLID Format
VGP uses Tree of Life IDs (ToLIDs) to uniquely identify species:
Pattern: [clade][Genus][Species][Version]
Regex: [a-z][A-Z][a-z]{2}[A-Z][a-z]{2,3}\d+
Components:
[a-z]- Clade prefix (lowercase)a= Amphibianb= Birdf= Fishm= Mammali= Invertebrater= Reptile
[A-Z][a-z]{2}- Genus (3 letters, capitalized first)[A-Z][a-z]{2,3}- Species (2-3 letters, capitalized first)\d+- Version number
Examples:
aGasCar1- Gastrophryne carolinensis (Eastern narrowmouth toad)bAcrTri1- Acridotheres tristis (Common myna)fHopMal1- Hoplias malabaricus (Trahira)mBalRic1- Balaenoptera ricei (Rice's whale)
ToLID Locations in Galaxy
When analyzing VGP workflows in Galaxy, ToLIDs appear in:
- History names (most common) - e.g., "aGasCar1 - HiFi Assembly"
- Workflow inputs - Input dataset names or labels
- Dataset names - Input files named with ToLID prefix
Extracting ToLIDs for Resource Analysis
To link workflow resource usage with genome metadata:
import re
tolid_pattern = r'\b([a-z][A-Z][a-z]{2}[A-Z][a-z]{2,3}\d+)\b'
# From Galaxy history name
history_name = "aGasCar1 - VGP Assembly HiFi-HiC"
match = re.search(tolid_pattern, history_name)
species_id = match.group(1) # "aGasCar1"
# Link with VGP genome metadata
# Match species_id with ToLID column in genome tablesLinking ToLIDs with Genome Characteristics
VGP genome metadata tables use ToLID as primary key:
Common columns:
ToLID- Species identifierSpecies- Scientific nameCommon name- Vernacular nameGenome size¹- Estimated genome size (bp)Heterozygosity¹- Heterozygosity percentageSequencing depth- Coverage depthRepeat content¹- Repeat percentageAssembly version- hap1/hap2 designation
Usage for resource analysis:
# Correlate memory usage with genome size
# Correlate runtime with heterozygosity
# Compare resource efficiency across cladesMerging Species IDs with Metrics Data
Challenge: Galaxy API data comes from separate endpoints:
/api/invocations/{id}+/api/histories/{history_id}→ Species IDs, history names, inputs/api/invocations/{id}/metrics+/api/jobs/{id}/metrics→ Resource usage metrics
Result: Two complementary files:
- Enriched file: Has species_id, history_name, inputs (NO metrics)
- Metrics file: Has memory, CPU, runtime (NO species_id)
Solution Pattern:
# Load both data sources
with open('vgp_assembly_enriched_YYYYMMDD.json') as f:
enriched_data = json.load(f)
with open('vgp_workflows_assembly_runs_metrics.json') as f:
metrics_data = json.load(f)
# Create lookup dictionary keyed by invocation ID
enriched_dict = {inv['id']: inv for inv in enriched_data}
# Merge: Add species data to metrics
merged_data = []
for inv in metrics_data:
inv_id = inv['id']
if inv_id in enriched_dict:
enriched_inv = enriched_dict[inv_id]
inv['species_id'] = enriched_inv.get('species_id')
inv['history_name'] = enriched_inv.get('history_name')
inv['inputs'] = enriched_inv.get('inputs', {})
merged_data.append(inv)
# Save complete dataset
with open('vgp_workflows_assembly_runs_metrics_enriched_YYYYMMDD.json', 'w') as f:
json.dump(merged_data, f, indent=2)
# Report statistics
total = len(merged_data)
with_species = sum(1 for inv in merged_data if inv.get('species_id'))
unique_species = len(set(inv['species_id'] for inv in merged_data if inv.get('species_id')))
print(f'Total invocations: {total}')
print(f'With species_id: {with_species} ({with_species/total*100:.1f}%)')
print(f'Unique species: {unique_species}')Expected Results (VGP assembly workflows):
- ~1,630 invocations total
- ~45% with species_id (not all histories follow naming convention)
- ~129 unique species
Pipeline Integration:
- Add as Step 2.6 in fetch notebook
- Run AFTER both enrichment (Step 2.5) AND metrics (Step 4)
- Output file used by resource analysis notebook
- Fast execution (<1 min, no API calls)
Enables Analysis:
- Memory usage vs genome size
- Runtime vs heterozygosity
- Resource efficiency by species/clade
- Workflow performance across genome characteristics
Data Analysis Troubleshooting
Metric Availability Issues
When analyzing VGP workflow resource usage, be aware of metric availability:
Memory Metrics:
memory.max_usage_in_bytes/memory.limit_in_bytes- Only ~7% of invocations- These are cgroup-based metrics from Docker/container systems
- Only available for newer workflow runs or specific Galaxy configurations
- Most reliable for actual memory usage data
galaxy_memory_mb- ~98% of invocations- Represents memory allocation/request, not actual usage
- Much better coverage for correlation analyses
- Use when cgroup metrics unavailable
Runtime Metrics:
runtime_seconds- Available for most invocations- Use sum of (cores × runtime) for "True CPU Hours" to reflect actual computational resources
Common Data Issues:
Enrichment appears empty: If genome characteristics show 0 species after enrichment:
- Check if enrichment cell has been run (data only enriched in memory during notebook execution)
- Verify species ToLIDs match between workflow data and genome metadata TSV
- Check if genome metadata fields are populated (species may be in TSV with empty values)
Limited correlation data: When combining metrics with genome characteristics:
- Species with rare metrics (like cgroup memory) may not overlap with species having genome data
- Check actual overlap: how many species have BOTH the metric AND genome characteristics
- Consider using more widely available metrics (galaxy_memory_mb, runtime_seconds)
Debugging data availability:
# Check metric availability species_with_metric = set() for inv in data_with_species: metrics = inv.get('metrics', []) for metric in metrics: if metric.get('name') == 'target_metric_name': if metric.get('raw_value'): species_with_metric.add(inv.get('species_id')) # Check genome data availability species_with_genome = set() for inv in data_with_species: if inv.get('genome_size'): # or other characteristic species_with_genome.add(inv.get('species_id')) # Find overlap overlap = species_with_metric.intersection(species_with_genome) print(f"Species with both: {len(overlap)}")
GenomeArk S3 Data Integration
Fetching GenomeScope2 Summaries
GenomeScope2 summary files contain genome characteristics needed for workflow analysis.
S3 Path Pattern:
s3://genomeark/species/{Genus}_{species}/{ToLID}/assembly_vgp_HiC_2.0/evaluation/genomescope/{ToLID}_genomescope__Summary.txtFetch Example:
import subprocess
import re
def fetch_genomescope_summary(species_name, tolid):
"""Fetch genomescope summary from GenomeArk S3."""
species_name_s3 = species_name.replace(' ', '_')
path = f"s3://genomeark/species/{species_name_s3}/{tolid}/assembly_vgp_HiC_2.0/evaluation/genomescope/{tolid}_genomescope__Summary.txt"
result = subprocess.run(
['aws', 's3', 'cp', path, '-', '--no-sign-request'],
capture_output=True,
text=True,
timeout=10
)
return result.stdout if result.returncode == 0 else None
def parse_genomescope_summary(content):
"""Extract genome characteristics from summary."""
data = {}
# Genome Haploid Length (use max value - second column)
match = re.search(r'Genome Haploid Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
if match:
data['genome_size'] = match.group(1).replace(',', '')
# Heterozygosity percentage (max value)
match = re.search(r'Heterozygous \(ab\)\s+[\d.]+%\s+([\d.]+)%', content)
if match:
data['heterozygosity'] = match.group(1)
# Calculate repeat content from repeat and unique lengths
repeat_match = re.search(r'Genome Repeat Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
unique_match = re.search(r'Genome Unique Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
if repeat_match and unique_match:
repeat_length = float(repeat_match.group(1).replace(',', ''))
unique_length = float(unique_match.group(1).replace(',', ''))
total_length = repeat_length + unique_length
if total_length > 0:
repeat_percent = (repeat_length / total_length) * 100
data['repeat_content'] = f"{repeat_percent:.1f}"
return data if data else NoneKey Points:
- Use
--no-sign-requestfor public bucket access - GenomeScope2 format has min/max columns - use max (second) values
- Species names use underscores in S3 paths
- Not all species have GenomeScope data available
- Add timeout protection for batch processing
Workflow Integration:
# Enrich workflow invocations with genome characteristics
for inv in workflow_invocations:
species_id = inv.get('species_id')
if species_id:
content = fetch_genomescope_summary(species_name, species_id)
if content:
genome_data = parse_genomescope_summary(content)
inv.update(genome_data)Success Rate: Expect ~80-90% success rate for VGP species in GenomeArk.
Finding Missing VGP Accessions
Problem
Some curated VGP species have no accession numbers in the enriched dataset and aren't found in GenomeArk AWS. They may have been submitted directly to NCBI or are in external collaborating projects.
Solution: Search NCBI with VGP-specific Filtering
Search NCBI Assembly database:
import requests
from urllib.parse import quote
def search_ncbi_assemblies(species_name):
"""Search NCBI for assemblies by species name"""
base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
# Search for assembly IDs
search_url = f"{base_url}esearch.fcgi"
search_params = {
'db': 'assembly',
'term': f'"{species_name}"[Organism]',
'retmode': 'json',
'retmax': 100
}
search_response = requests.get(search_url, params=search_params)
assembly_ids = search_response.json()['esearchresult']['idlist']
# Fetch assembly details
fetch_url = f"{base_url}esummary.fcgi"
fetch_params = {
'db': 'assembly',
'id': ','.join(assembly_ids),
'retmode': 'json'
}
fetch_response = requests.get(fetch_url, params=fetch_params)
# Extract submitter information
results = []
for assembly_id, data in fetch_response.json()['result'].items():
if assembly_id == 'uids':
continue
results.append({
'accession': data.get('assemblyaccession'),
'submitter': data.get('submitter'),
'name': data.get('assemblyname')
})
return resultsCRITICAL: Filter for VGP-only submissions:
# Only keep VGP-submitted assemblies
vgp_assemblies = [
r for r in results
if r['submitter'] == 'Vertebrate Genomes Project'
]Why this matters:
- Non-VGP assemblies may have different quality standards
- Other projects (Bat1K, Ocean Genomes) may use different assembly methods
- VGP-specific filtering ensures data consistency
- Prevents mixing different curation standards
Example results from VGP Phase 1 enrichment:
- Searched: 17 species missing accessions
- Found: 24 assemblies total from NCBI
- VGP-only: 5 accessions recovered
- mTenEca1 (Tenrec ecaudatus): GCF_050624435.1
- mNeoFlo1 (Neotoma floridana): GCA_050000055.1
- rHydTec1 (Hydromedusa tectifera): GCA_049999965.1
- rPelSom1 (Pelomedusa somalica): GCA_051311615.1
- rPodUni1 (Podocnemis unifilis): GCA_050000005.1
Store recovered data separately:
# Add new columns instead of overwriting existing data
df['accession_recovered'] = None # Primary accession
df['accession_recovered_all'] = None # All accessions (pipe-separated)
# Fill in recovered accessions
for tolid, accs in recovered_accessions.items():
mask = df['tolid'] == tolid
df.loc[mask, 'accession_recovered'] = accs['primary']
df.loc[mask, 'accession_recovered_all'] = accs['all']This preserves data provenance and makes it clear which accessions were found later.
Species not in GenomeArk:
If species have accessions but no tolids and aren't found in AWS:
- These are likely direct NCBI submissions
- May be from collaborating projects (not in GenomeArk)
- May be pre-VGP naming convention
- Document separately for tracking purposes
Curation Impact Analysis - Comparing Methods
Data Filtering for Fair Comparison
When comparing Dual vs Pri/alt curation:
def load_data():
df = pd.read_csv(DATA_FILE)
# Create curation type column
df['curation_type'] = 'None'
df.loc[df['Dual'] == 'Y', 'curation_type'] = 'Dual'
df.loc[df['Pri/alt'] == 'Y', 'curation_type'] = 'Pri/alt'
# Filter to assemblies with accessions only
df = df[df['accession'].notna()].copy()
# EXPLICITLY exclude uncurated assemblies
df = df[df['curation_type'].isin(['Dual', 'Pri/alt'])].copy()
# Verify no "None" remain
none_count = (df['curation_type'] == 'None').sum()
if none_count > 0:
print(f"WARNING: {none_count} uncurated assemblies found!")
return dfQuality Metrics Focus
Include: Post-curation quality metrics
- Scaffold N50, L50, L90
- Gap percentage, scaffold count
- Chromosome-level status
- Assembly size accuracy
- Derived metrics (efficiency ratios, concentration scores)
Exclude: Genome characteristics
- Heterozygosity, repeat content (intrinsic properties)
- Contig-based metrics (pre-curation)
Statistical Testing Pattern
Use non-parametric methods for non-normal distributions:
- Continuous metrics: Mann-Whitney U test
- Categorical metrics: Chi-square test (or Fisher's exact if n<5)
- Handle failures gracefully:
try: chi2, pval, dof, expected = stats.chi2_contingency(table) stats_text = f'p = {pval:.3e}' except ValueError: stats_text = 'N/A (insufficient variation)'
Missing Data Handling
Always check for data availability before plotting:
if len(data_dual) == 0 or len(data_prialt) == 0:
ax.text(0.5, 0.5, 'Insufficient data available',
transform=ax.transAxes, ha='center', va='center')
returnThis prevents crashes when metrics like QV are sparsely populated.
Incremental Script Development
When building analysis scripts with many (10+) plotting functions:
- Start with core functions (01-06): Basic metrics
- Add advanced functions (07-11): Computed metrics
- Test incrementally - Run after each batch
- Update main() function to call new plots
- Update summary statistics in output
Critical: Always update main() when adding new functions:
def main():
df = load_data()
# Basic metrics (01-06)
plot_metric_1(df)
plot_metric_2(df)
# NEW: Advanced metrics (07-08) <- Add these
plot_metric_7(df)
plot_metric_8(df)
# Update count in summary
print(f"Generated {8} figures") # <- Update countForgetting to call new functions means they're defined but never executed.
Tool-Level Resource Optimization
Identifying Top Resource Offender Tools
To prioritize optimization efforts, identify tools consuming most resources:
# From df_jobs DataFrame (job-level metrics)
tool_cpu = df_jobs.groupby('tool_name').agg({
'cpu_hours': ['sum', 'count', 'mean']
}).reset_index()
tool_cpu.columns = ['tool_name', 'total_cpu_hours', 'job_count', 'avg_cpu_hours']
top_cpu_tools = tool_cpu.sort_values('total_cpu_hours', ascending=False).head(3)
tool_memory = df_jobs.groupby('tool_name').agg({
'peak_memory_gb': ['sum', 'count', 'mean', 'max']
}).reset_index()
top_memory_tools = tool_memory.sort_values('total_memory_gb', ascending=False).head(3)Correlating Tool Usage with Genome Characteristics
For each top tool, analyze how genome characteristics affect resource usage:
from scipy import stats
for tool_name in top_cpu_tools['tool_name']:
tool_jobs = df_jobs[df_jobs['tool_name'] == tool_name]
# Aggregate by species
species_data = {}
for _, job in tool_jobs.iterrows():
workflow_id = job['workflow_id']
# Match with workflow invocation to get genome characteristics
for inv in invocations:
if inv['id'] == workflow_id and inv.get('species_id'):
species_id = inv['species_id']
genome_size = inv.get('genome_size')
cpu_hours = job['cpu_hours']
if species_id not in species_data:
species_data[species_id] = {
'genome_size': genome_size,
'cpu_hours': 0
}
species_data[species_id]['cpu_hours'] += cpu_hours
# Calculate correlation
x = [d['genome_size'] for d in species_data.values()]
y = [d['cpu_hours'] for d in species_data.values()]
if len(x) >= 3:
corr, pval = stats.pearsonr(x, y)
sig = '***' if pval < 0.001 else '**' if pval < 0.01 else '*' if pval < 0.05 else 'ns'
print(f'{tool_name}: r={corr:.3f}, p={pval:.2e} ({sig})')Interpretation
High correlation + high p-value significance → Tool resource usage scales with genome characteristic
- Actionable: Optimize tool for this characteristic
- Example: Tool uses 2x CPU per 1 Gb genome size increase
Low/no correlation → Resource usage independent of genome characteristics
- Actionable: Look for algorithmic inefficiencies
- Example: Tool has fixed overhead or inefficient implementation
Significance levels:
- *** (p < 0.001): Highly significant correlation
- ** (p < 0.01): Very significant
- (p < 0.05): Significant
- ns: Not significant
Visualization
Create separate plot for each tool showing all 5 genome characteristics:
- Genome Size, Heterozygosity, Repeat Content, Contig N50, GC Content
- Include trend line, correlation coefficient, and p-value on each subplot
- Point size proportional to number of jobs
Communication Patterns
File Path Confirmation
When modifying files, especially when multiple similar files exist:
Always state the FULL path when making edits:
❌ Bad: "Updating the notebook..." ✅ Good: "Updating /Users/delphine/Workdir/Stats_workflow_run/vgp_workflow_resource_analysis.ipynb"Ask for clarification if multiple candidates exist:
Found two notebooks: 1. /Users/delphine/Workdir/Stats_workflow_run/vgp_workflow_resource_analysis.ipynb 2. /Users/delphine/Workdir/Stats_workflow_run/sharing/vgp_workflow_resource_analysis.ipynb Which one are you running?After user correction, explicitly confirm:
✅ Confirmed: Now modifying the correct file at /path/to/correct/file
User Feedback Interpretation
When user says "the outlier is still there":
- Don't assume code is wrong - may be running different file
- Don't assume user error - may be our mistake
- Verify the file path immediately
- Check if changes were applied to the file user is actually running
This pattern prevents major errors where edits are applied to the wrong file for multiple iterations.
Data Quality Validation and Filtering
GenomeScope Data Validation
Critical Issue: VGP Phase 1 dataset contains placeholder values where genome_size_genomescope == total_length (assembly size copied into GenomeScope column).
Detection and Filtering:
# Filter out fake GenomeScope values
df_filtered = df[(df['total_length'].notna()) &
(df['genome_size_genomescope'].notna()) &
(df['genome_size_genomescope'] != df['total_length'])].copy()
# Report filtering statistics
n_total = len(df[df['genome_size_genomescope'].notna()])
n_fake = len(df[(df['genome_size_genomescope'].notna()) &
(df['genome_size_genomescope'] == df['total_length'])])
n_real = len(df_filtered)
print(f"Total with GenomeScope: {n_total}")
print(f"Fake (copied): {n_fake} ({n_fake/n_total*100:.1f}%)")
print(f"Real estimates: {n_real} ({n_real/n_total*100:.1f}%)")Impact: In VGP Phase 1, 396/545 (72.7%) GenomeScope values were fake, leaving only 149 (27.3%) real independent estimates.
Visual Detection: Scatter plots showing all points on diagonal (assembly = expected) indicate circular data.
Best Practice: Always validate that "expected" genome size values are independent from assembly size before comparative analysis.
Meryl K-mer Database Management
Accessing Meryl Histograms for GenomeScope
Key Insight: GenomeScope only needs histogram files (.hist), not full meryl databases.
File Structure on GenomeArk S3:
s3://genomeark/species/{species}/{tolid}/assembly_*/intermediates/meryl/
├── {tolid}.cut.meryl.hist # ~700KB - THIS IS WHAT YOU NEED
└── {tolid}.cut.meryl/ # Many GB - full database
├── 0x000000.merylData
├── 0x000000.merylIndex
└── ...Direct Histogram URLs (for Galaxy import):
# Pattern:
https://genomeark.s3.amazonaws.com/species/{species}/{tolid}/path/to/meryl/{tolid}.cut.meryl.hist
# Example:
https://genomeark.s3.amazonaws.com/species/Rhinolophus_ferrumequinum/mRhiFer1/assembly_vgp_standard_1.0/intermediates/meryl/mRhiFer1.cut.meryl.histBenefits:
- 1000x smaller download (~700KB vs ~10GB)
- Can import directly into Galaxy via URL
- No need to download full meryl database
- Much faster for batch GenomeScope analysis
Common Mistake: Downloading entire meryl directory when only .hist file is needed.
Assembly Size vs Expected Genome Size Interpretation
Common Pattern: Assemblies ~12% Larger Than Expected
Typical Statistics (from real VGP data, n=149):
- Mean ratio (assembly/expected): 1.116
- Median ratio: 1.077
- Within ±10%: ~58%
- Within ±20%: ~88%
Why Assemblies Are Larger:
- Incomplete haplotig purging: Haplotype-specific sequences retained in primary
- GenomeScope underestimation: High heterozygosity/repeats affect k-mer frequencies
- Organellar DNA: Mitochondrial genomes (~15-20kb) assembled and included
Interpretation:
- Ratio 1.0-1.2: Normal, good assembly
- Ratio > 1.3: Possible retained duplications, check purge_dups metrics
- Ratio < 0.9: Possible missing sequence, check coverage
Not a Quality Problem: Moderate scatter (std dev ~30%) reflects biological variation and k-mer estimation limitations, not assembly quality issues.