VGP Assembly Pipeline Skill

- [Vertebrate Genome Project](https://vertebrategenomesproject.org/)

Delphine-L 17 2 Updated 6mo ago

Resources

GitHub

Install

npx skillscat add delphine-l/claude-global/skills-bioinformatics-vgp-pipeline

Install via the SkillsCat registry.

SKILL.md

VGP Assembly Pipeline Skill

Overview

The Vertebrate Genome Project (VGP) assembly pipeline consists of Galaxy workflows for producing high-quality, phased, chromosome-level genome assemblies. This skill covers workflow selection, execution patterns, and quality control checkpoints.

Trajectories (by frequency of use)

Trajectory A: HiFi + Hi-C (Most Common)

Inputs: HiFi Reads, Hi-C Reads
Path: WF1 → WF4 → [WF6] → WF8 → WF9 → PreCuration
Output: HiC Phased assembly (hap1/hap2)
WF6: Optional (can skip directly to WF8)

Trajectory B: HiFi + Trio

Inputs: HiFi Reads, Hi-C Reads, Parental Reads
Path: WF2 → WF5 → [WF6] → WF8 → WF9 → PreCuration
Output: Trio Phased assembly (maternal/paternal)
WF6: Optional (can skip directly to WF8)

Trajectory C: HiFi Only (Least Common)

Inputs: HiFi Reads only
Path: WF1 → WF3 → WF6 → WF9 → PreCuration
Output: Pseudohaplotype assembly (primary/alternate)
WF6: Required (no Hi-C scaffolding step)
Note: Skips WF8 entirely

Workflow Selection by Data Availability

Non-trio workflows (HiFi reads only)

VGP1 (WF1): K-mer profiling with HiFi reads alone
VGP3 (WF3): HiFi-only assembly with HiFiasm

Trio workflows (HiFi + Parental Illumina)

VGP2 (WF2): Trio k-mer profiling (HiFi child + Illumina parents)
VGP5 (WF5): Trio-phased assembly with HiFiasm

Universal scaffolding workflows

RagTag scaffolding: Used for both trio and non-trio assemblies
Requires reference genome specification

Methods language pattern

When documenting workflow selection in publications:

"For species with available parental data (trio datasets), we employed
VGP2 → VGP5 workflows. For species without parental data (non-trio datasets),
we performed VGP1 → VGP3 workflows."

Workflow Descriptions

Workflow	Name	Description
WF0	Mitochondrial Assembly	MitoHiFi assembly (runs in parallel, may fail if no mito reads)
WF1	K-mer Profiling	Genome size, heterozygosity estimation (HiFi)
WF2	Trio K-mer Profiling	K-mer profiling with parental data
WF3	Hifiasm	HiFi-only assembly
WF4	Hifiasm + HiC	HiC-phased assembly
WF5	Hifiasm Trio	Trio-phased assembly
WF6	Purge Duplicates	Remove haplotypic duplications
~~WF7~~	~~Bionano~~	Deprecated - no longer used
WF8	Hi-C Scaffolding	YAHS chromosome scaffolding
WF9	Decontamination	Remove contaminants
PreCuration	Pretext Snapshot	Prepare files for manual curation

Haplotype Execution Patterns

Run Once (Both Haplotypes Together)

WF1, WF2 (K-mer profiling)
WF3, WF4, WF5 (Assembly)
WF6 (Purge Duplicates) - depends on trajectory
PreCuration

Run Twice (×2 per Haplotype)

WF8 (Hi-C Scaffolding)
WF9 (Decontamination)

WF6 (Purge Duplicates) Decision Logic

if trajectory == "C" (HiFi only):
    WF6 is REQUIRED
    WF6 border: solid
else:  # Trajectory A or B
    WF6 is OPTIONAL
    WF6 border: dashed
    Can skip directly to WF8

When to skip WF6 (Trajectories A/B):

Merqury k-mer spectra shows clean haplotype separation
Assembly QV is already high
No significant duplication detected

When to run WF6 (Trajectories A/B):

K-mer spectra shows residual duplications
Higher heterozygosity samples
Conservative approach preferred

Coverage Requirements

Data Type	Minimum Coverage	Notes
HiFi	30×	Diploid genome
Hi-C	60×	Diploid genome

QC Checkpoints

After WF1/WF2 (K-mer Profiling)

Verify GenomeScope2 model fit
Check estimated genome size
Review heterozygosity estimate

After WF4/WF5 (Assembly)

Inspect Merqury k-mer spectra
Decide whether to run WF6 based on duplication levels

After WF8 (Hi-C Scaffolding)

Check Pretext Hi-C contact maps
Verify chromosome-level scaffolding
Validate against expected karyotype (see Karyotype Validation below)

After WF9 (Decontamination)

Review contamination reports
Check for unexpected removals

Karyotype-Based Scaffold Validation

Sex Chromosome Adjustment

Problem: VGP assemblies often place both sex chromosomes (X+Y or Z+W) in the main haplotype, requiring adjustment to expected chromosome counts.

Solution: When both sex chromosomes present, expected = n + 1 (not n)

Implementation:

# Adjust haploid expected when BOTH sex chromosomes in main haplotype
df['num_chromosomes_haploid_adjusted'] = df['num_chromosomes_haploid'].copy()

both_sex_chr_patterns = [
    'Has X and Y',
    'Has Z and W',
    'has Z and W',
    'Has X1, X2, and Y',
    'Has Z1, Z2, and W',
    'Has 5X and 5Y'
]

if 'Sex chromosomes main haploptype' in df.columns:
    has_both_sex = df['Sex chromosomes main haploptype'].isin(both_sex_chr_patterns)
    df.loc[has_both_sex & df['num_chromosomes_haploid'].notna(),
           'num_chromosomes_haploid_adjusted'] = \
        df.loc[has_both_sex & df['num_chromosomes_haploid'].notna(),
              'num_chromosomes_haploid'] + 1

Biological Reasoning:

Diploid organisms have two sex chromosomes (XX, XY, ZZ, ZW)
X and Y (or Z and W) are distinct chromosomes
If both in main haplotype → two separate scaffolds expected
Example: Asian elephant 2n=56, n=28, has X+Y → expect 29 scaffolds

Impact: Improved perfect match rate from 0% to ~90% in validation analyses

Validation Metrics:

# Use adjusted counts for validation
achieved = df['total_number_of_chromosomes']
expected = df['num_chromosomes_haploid_adjusted']

perfect_matches = (achieved == expected).sum()
within_1 = ((achieved - expected).abs() <= 1).sum()
ratio = achieved / expected

Common Pitfalls

❌ Wrong: Compare diploid expected (2n) to haploid assembly

Results in ~50% achievement rates
Biologically incorrect

❌ Wrong: Use haploid (n) when both sex chromosomes present

Underestimates by 1
Shows artificial "extra scaffold" problem

✅ Correct: Use adjusted haploid (n or n+1 depending on sex chromosome configuration)

WF0 (Mitochondrial) Handling

WF0 runs in parallel with the main pipeline and may fail if:

No mitochondrial reads present in HiFi data
This is a biological failure, not technical

def check_mitohifi_failure(wf0_result):
    """Distinguish biological vs technical failure"""
    if "no_mito_reads" in wf0_result.log:
        return "biological"  # Expected for some samples
    else:
        return "technical"   # Investigate further

Visual Diagram Elements

When creating workflow diagrams:

Color Coding (Suggested)

K-mer Profiling section: Orange (#fff3e0)
Assembly section: Green (#e8f5e9)
Purging section: Purple (#f3e5f5)
Scaffolding section: Blue (#e3f2fd)
Finishing section: Green (#e8f5e9)
WF0 (Mitochondrial): Pink (#fce4ec)

Visual Indicators

Solid lines: Required workflow connections
Dashed lines: Optional skip paths
Dashed box border: Optional workflow (WF6 in trajectories A/B)
Solid box border: Required workflow
Dimmed elements: Workflows not used in current trajectory

Haplotype Badges

Blue badge (#e3f2fd): "×2 per haplotype" - runs separately
Green badge (#e8f5e9): "both haplotypes" - runs together

Input Data Labels

HiFi Reads: Blue (#4285f4)
Hi-C Reads: Green (#34a853)
Parental Reads: Red (#ea4335)

Summary Table

Trajectory	Inputs	K-mer	Assembly	Purge	Scaffold	Finish	Output
A	HiFi+HiC	WF1	WF4	[WF6]	WF8	WF9→Pre	hap1/hap2
B	HiFi+Trio	WF2	WF5	[WF6]	WF8	WF9→Pre	mat/pat
C	HiFi only	WF1	WF3	WF6	-	WF9→Pre	pri/alt

[WF6] = optional, WF6 = required, - = skipped

Reference Genomes for Scaffolding

Common Reference Genome

GCA_011100685.1 - Frequently used reference genome for RagTag scaffolding in canid genome assemblies.

When documenting scaffolding in methods sections:

Always specify the reference genome accession
Include version number if applicable
Example: "scaffolded using RagTag v2.1.0 with the reference genome GCA_011100685.1"

Best Practices

For reproducibility:

Document exact accession used
Specify if custom modifications were made to reference
Note if different references used for different species/assemblies

Resource Analysis Insights

VGP Workflow Canonical Names

Official VGP workflows normalized to canonical names for analysis:

ID	Canonical Name	Description
VGP0	Mitogenome Assembly (VGP0)	Mitochondrial genome assembly
VGP1/WF1	K-mer Profiling (VGP1)	K-mer profiling for genome size estimation (HiFi)
VGP2	K-mer Profiling Trio (VGP2)	Trio k-mer profiling
VGP3	HiFi-only Assembly (VGP3)	HiFi-only assembly
VGP4	HiFi-HiC Phased Assembly (VGP4)	HiFi-HiC phased assembly
VGP5	HiFi-Trio Phased Assembly (VGP5)	HiFi-Trio phased assembly
VGP6	Purge Duplicates (VGP6)	Purge duplicate contigs
VGP6b	Purge Duplicates One Haplotype (VGP6b)	Purge duplicates (haploid mode)
VGP7	BioNano Scaffolding (VGP7)	BioNano optical mapping scaffolding
VGP8	Hi-C Scaffolding (VGP8)	Hi-C scaffolding
VGP9	Assembly Decontamination (VGP9)	Assembly decontamination

Note: WF1-9 are aliases for VGP1-9 (same workflows, different naming convention).

Official vs Non-Official Workflow Identification

When analyzing VGP workflow metrics, filter for official workflows only:

Official workflows (include):

Core VGP workflows: VGP0-VGP9, WF0-WF9
PretextMap/PreCuration workflows
Workflows with version info (e.g., "v0.1.8", "release v0.3")
Workflows imported from uploaded files or URLs (official sources)
"WORKFLOW REPORT TEST" workflows (testing report generation, not execution)

Non-official workflows (exclude):

Export workflows (utility workflows, e.g., "Export PretextMap Workflow")
test1, test2, etc. (debug/retry runs with lowercase test + number)
Attempt1, Attempt2, Fix_Attempt (retry/debug runs)
"Copy of" workflows (user copies)
Numbered prefixes: "1. ", "2. ", "3. " (custom project markers)
Custom project annotations (e.g., "Used for other Columbiformes")
Typos (e.g., "Gnome Assembly" instead of "Genome")
"training workflow" (tutorial workflows)
Experimental integrations (e.g., "ONT-INTEGRATED")

Impact: Filtering typically removes ~20-25% of data, leaving ~80% official workflow executions for accurate analysis.

Species ID (ToLID) Patterns

VGP ToLID Format

VGP uses Tree of Life IDs (ToLIDs) to uniquely identify species:

Pattern: [clade][Genus][Species][Version]

Regex: [a-z][A-Z][a-z]{2}[A-Z][a-z]{2,3}\d+

Components:

[a-z] - Clade prefix (lowercase)
- a = Amphibian
- b = Bird
- f = Fish
- m = Mammal
- i = Invertebrate
- r = Reptile
[A-Z][a-z]{2} - Genus (3 letters, capitalized first)
[A-Z][a-z]{2,3} - Species (2-3 letters, capitalized first)
\d+ - Version number

Examples:

aGasCar1 - Gastrophryne carolinensis (Eastern narrowmouth toad)
bAcrTri1 - Acridotheres tristis (Common myna)
fHopMal1 - Hoplias malabaricus (Trahira)
mBalRic1 - Balaenoptera ricei (Rice's whale)

ToLID Locations in Galaxy

When analyzing VGP workflows in Galaxy, ToLIDs appear in:

History names (most common) - e.g., "aGasCar1 - HiFi Assembly"
Workflow inputs - Input dataset names or labels
Dataset names - Input files named with ToLID prefix

Extracting ToLIDs for Resource Analysis

To link workflow resource usage with genome metadata:

import re

tolid_pattern = r'\b([a-z][A-Z][a-z]{2}[A-Z][a-z]{2,3}\d+)\b'

# From Galaxy history name
history_name = "aGasCar1 - VGP Assembly HiFi-HiC"
match = re.search(tolid_pattern, history_name)
species_id = match.group(1)  # "aGasCar1"

# Link with VGP genome metadata
# Match species_id with ToLID column in genome tables

Linking ToLIDs with Genome Characteristics

VGP genome metadata tables use ToLID as primary key:

Common columns:

ToLID - Species identifier
Species - Scientific name
Common name - Vernacular name
Genome size¹ - Estimated genome size (bp)
Heterozygosity¹ - Heterozygosity percentage
Sequencing depth - Coverage depth
Repeat content¹ - Repeat percentage
Assembly version - hap1/hap2 designation

Usage for resource analysis:

# Correlate memory usage with genome size
# Correlate runtime with heterozygosity
# Compare resource efficiency across clades

Merging Species IDs with Metrics Data

Challenge: Galaxy API data comes from separate endpoints:

/api/invocations/{id} + /api/histories/{history_id} → Species IDs, history names, inputs
/api/invocations/{id}/metrics + /api/jobs/{id}/metrics → Resource usage metrics

Result: Two complementary files:

Enriched file: Has species_id, history_name, inputs (NO metrics)
Metrics file: Has memory, CPU, runtime (NO species_id)

Solution Pattern:

# Load both data sources
with open('vgp_assembly_enriched_YYYYMMDD.json') as f:
    enriched_data = json.load(f)

with open('vgp_workflows_assembly_runs_metrics.json') as f:
    metrics_data = json.load(f)

# Create lookup dictionary keyed by invocation ID
enriched_dict = {inv['id']: inv for inv in enriched_data}

# Merge: Add species data to metrics
merged_data = []
for inv in metrics_data:
    inv_id = inv['id']
    if inv_id in enriched_dict:
        enriched_inv = enriched_dict[inv_id]
        inv['species_id'] = enriched_inv.get('species_id')
        inv['history_name'] = enriched_inv.get('history_name')
        inv['inputs'] = enriched_inv.get('inputs', {})
        merged_data.append(inv)

# Save complete dataset
with open('vgp_workflows_assembly_runs_metrics_enriched_YYYYMMDD.json', 'w') as f:
    json.dump(merged_data, f, indent=2)

# Report statistics
total = len(merged_data)
with_species = sum(1 for inv in merged_data if inv.get('species_id'))
unique_species = len(set(inv['species_id'] for inv in merged_data if inv.get('species_id')))

print(f'Total invocations: {total}')
print(f'With species_id: {with_species} ({with_species/total*100:.1f}%)')
print(f'Unique species: {unique_species}')

Expected Results (VGP assembly workflows):

~1,630 invocations total
~45% with species_id (not all histories follow naming convention)
~129 unique species

Pipeline Integration:

Add as Step 2.6 in fetch notebook
Run AFTER both enrichment (Step 2.5) AND metrics (Step 4)
Output file used by resource analysis notebook
Fast execution (<1 min, no API calls)

Enables Analysis:

Memory usage vs genome size
Runtime vs heterozygosity
Resource efficiency by species/clade
Workflow performance across genome characteristics

Data Analysis Troubleshooting

Metric Availability Issues

When analyzing VGP workflow resource usage, be aware of metric availability:

Memory Metrics:

memory.max_usage_in_bytes / memory.limit_in_bytes - Only ~7% of invocations
- These are cgroup-based metrics from Docker/container systems
- Only available for newer workflow runs or specific Galaxy configurations
- Most reliable for actual memory usage data
galaxy_memory_mb - ~98% of invocations
- Represents memory allocation/request, not actual usage
- Much better coverage for correlation analyses
- Use when cgroup metrics unavailable

Runtime Metrics:

runtime_seconds - Available for most invocations
Use sum of (cores × runtime) for "True CPU Hours" to reflect actual computational resources

Common Data Issues:

Enrichment appears empty: If genome characteristics show 0 species after enrichment:
- Check if enrichment cell has been run (data only enriched in memory during notebook execution)
- Verify species ToLIDs match between workflow data and genome metadata TSV
- Check if genome metadata fields are populated (species may be in TSV with empty values)
Limited correlation data: When combining metrics with genome characteristics:
- Species with rare metrics (like cgroup memory) may not overlap with species having genome data
- Check actual overlap: how many species have BOTH the metric AND genome characteristics
- Consider using more widely available metrics (galaxy_memory_mb, runtime_seconds)

Debugging data availability:

# Check metric availability
species_with_metric = set()
for inv in data_with_species:
    metrics = inv.get('metrics', [])
    for metric in metrics:
        if metric.get('name') == 'target_metric_name':
            if metric.get('raw_value'):
                species_with_metric.add(inv.get('species_id'))

# Check genome data availability
species_with_genome = set()
for inv in data_with_species:
    if inv.get('genome_size'):  # or other characteristic
        species_with_genome.add(inv.get('species_id'))

# Find overlap
overlap = species_with_metric.intersection(species_with_genome)
print(f"Species with both: {len(overlap)}")

GenomeArk S3 Data Integration

Fetching GenomeScope2 Summaries

GenomeScope2 summary files contain genome characteristics needed for workflow analysis.

S3 Path Pattern:

s3://genomeark/species/{Genus}_{species}/{ToLID}/assembly_vgp_HiC_2.0/evaluation/genomescope/{ToLID}_genomescope__Summary.txt

Fetch Example:

import subprocess
import re

def fetch_genomescope_summary(species_name, tolid):
    """Fetch genomescope summary from GenomeArk S3."""
    species_name_s3 = species_name.replace(' ', '_')
    path = f"s3://genomeark/species/{species_name_s3}/{tolid}/assembly_vgp_HiC_2.0/evaluation/genomescope/{tolid}_genomescope__Summary.txt"

    result = subprocess.run(
        ['aws', 's3', 'cp', path, '-', '--no-sign-request'],
        capture_output=True,
        text=True,
        timeout=10
    )

    return result.stdout if result.returncode == 0 else None

def parse_genomescope_summary(content):
    """Extract genome characteristics from summary."""
    data = {}

    # Genome Haploid Length (use max value - second column)
    match = re.search(r'Genome Haploid Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
    if match:
        data['genome_size'] = match.group(1).replace(',', '')

    # Heterozygosity percentage (max value)
    match = re.search(r'Heterozygous \(ab\)\s+[\d.]+%\s+([\d.]+)%', content)
    if match:
        data['heterozygosity'] = match.group(1)

    # Calculate repeat content from repeat and unique lengths
    repeat_match = re.search(r'Genome Repeat Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
    unique_match = re.search(r'Genome Unique Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)

    if repeat_match and unique_match:
        repeat_length = float(repeat_match.group(1).replace(',', ''))
        unique_length = float(unique_match.group(1).replace(',', ''))
        total_length = repeat_length + unique_length
        if total_length > 0:
            repeat_percent = (repeat_length / total_length) * 100
            data['repeat_content'] = f"{repeat_percent:.1f}"

    return data if data else None

Key Points:

Use --no-sign-request for public bucket access
GenomeScope2 format has min/max columns - use max (second) values
Species names use underscores in S3 paths
Not all species have GenomeScope data available
Add timeout protection for batch processing

Workflow Integration:

# Enrich workflow invocations with genome characteristics
for inv in workflow_invocations:
    species_id = inv.get('species_id')
    if species_id:
        content = fetch_genomescope_summary(species_name, species_id)
        if content:
            genome_data = parse_genomescope_summary(content)
            inv.update(genome_data)

Success Rate: Expect ~80-90% success rate for VGP species in GenomeArk.

Finding Missing VGP Accessions

Problem

Some curated VGP species have no accession numbers in the enriched dataset and aren't found in GenomeArk AWS. They may have been submitted directly to NCBI or are in external collaborating projects.

Solution: Search NCBI with VGP-specific Filtering

Search NCBI Assembly database:

import requests
from urllib.parse import quote

def search_ncbi_assemblies(species_name):
    """Search NCBI for assemblies by species name"""
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

    # Search for assembly IDs
    search_url = f"{base_url}esearch.fcgi"
    search_params = {
        'db': 'assembly',
        'term': f'"{species_name}"[Organism]',
        'retmode': 'json',
        'retmax': 100
    }
    search_response = requests.get(search_url, params=search_params)
    assembly_ids = search_response.json()['esearchresult']['idlist']

    # Fetch assembly details
    fetch_url = f"{base_url}esummary.fcgi"
    fetch_params = {
        'db': 'assembly',
        'id': ','.join(assembly_ids),
        'retmode': 'json'
    }
    fetch_response = requests.get(fetch_url, params=fetch_params)

    # Extract submitter information
    results = []
    for assembly_id, data in fetch_response.json()['result'].items():
        if assembly_id == 'uids':
            continue
        results.append({
            'accession': data.get('assemblyaccession'),
            'submitter': data.get('submitter'),
            'name': data.get('assemblyname')
        })

    return results

CRITICAL: Filter for VGP-only submissions:

# Only keep VGP-submitted assemblies
vgp_assemblies = [
    r for r in results
    if r['submitter'] == 'Vertebrate Genomes Project'
]

Why this matters:

Non-VGP assemblies may have different quality standards
Other projects (Bat1K, Ocean Genomes) may use different assembly methods
VGP-specific filtering ensures data consistency
Prevents mixing different curation standards

Example results from VGP Phase 1 enrichment:

Searched: 17 species missing accessions
Found: 24 assemblies total from NCBI
VGP-only: 5 accessions recovered
- mTenEca1 (Tenrec ecaudatus): GCF_050624435.1
- mNeoFlo1 (Neotoma floridana): GCA_050000055.1
- rHydTec1 (Hydromedusa tectifera): GCA_049999965.1
- rPelSom1 (Pelomedusa somalica): GCA_051311615.1
- rPodUni1 (Podocnemis unifilis): GCA_050000005.1

Store recovered data separately:

# Add new columns instead of overwriting existing data
df['accession_recovered'] = None  # Primary accession
df['accession_recovered_all'] = None  # All accessions (pipe-separated)

# Fill in recovered accessions
for tolid, accs in recovered_accessions.items():
    mask = df['tolid'] == tolid
    df.loc[mask, 'accession_recovered'] = accs['primary']
    df.loc[mask, 'accession_recovered_all'] = accs['all']

This preserves data provenance and makes it clear which accessions were found later.

Species not in GenomeArk:
If species have accessions but no tolids and aren't found in AWS:

These are likely direct NCBI submissions
May be from collaborating projects (not in GenomeArk)
May be pre-VGP naming convention
Document separately for tracking purposes

Curation Impact Analysis - Comparing Methods

Data Filtering for Fair Comparison

When comparing Dual vs Pri/alt curation:

def load_data():
    df = pd.read_csv(DATA_FILE)

    # Create curation type column
    df['curation_type'] = 'None'
    df.loc[df['Dual'] == 'Y', 'curation_type'] = 'Dual'
    df.loc[df['Pri/alt'] == 'Y', 'curation_type'] = 'Pri/alt'

    # Filter to assemblies with accessions only
    df = df[df['accession'].notna()].copy()

    # EXPLICITLY exclude uncurated assemblies
    df = df[df['curation_type'].isin(['Dual', 'Pri/alt'])].copy()

    # Verify no "None" remain
    none_count = (df['curation_type'] == 'None').sum()
    if none_count > 0:
        print(f"WARNING: {none_count} uncurated assemblies found!")

    return df

Quality Metrics Focus

Include: Post-curation quality metrics

Scaffold N50, L50, L90
Gap percentage, scaffold count
Chromosome-level status
Assembly size accuracy
Derived metrics (efficiency ratios, concentration scores)

Exclude: Genome characteristics

Heterozygosity, repeat content (intrinsic properties)
Contig-based metrics (pre-curation)

Statistical Testing Pattern

Use non-parametric methods for non-normal distributions:

Continuous metrics: Mann-Whitney U test
Categorical metrics: Chi-square test (or Fisher's exact if n<5)

Handle failures gracefully:

try:
    chi2, pval, dof, expected = stats.chi2_contingency(table)
    stats_text = f'p = {pval:.3e}'
except ValueError:
    stats_text = 'N/A (insufficient variation)'

Missing Data Handling

Always check for data availability before plotting:

if len(data_dual) == 0 or len(data_prialt) == 0:
    ax.text(0.5, 0.5, 'Insufficient data available',
           transform=ax.transAxes, ha='center', va='center')
    return

This prevents crashes when metrics like QV are sparsely populated.

Incremental Script Development

When building analysis scripts with many (10+) plotting functions:

Start with core functions (01-06): Basic metrics
Add advanced functions (07-11): Computed metrics
Test incrementally - Run after each batch
Update main() function to call new plots
Update summary statistics in output

Critical: Always update main() when adding new functions:

def main():
    df = load_data()

    # Basic metrics (01-06)
    plot_metric_1(df)
    plot_metric_2(df)

    # NEW: Advanced metrics (07-08)  <- Add these
    plot_metric_7(df)
    plot_metric_8(df)

    # Update count in summary
    print(f"Generated {8} figures")  # <- Update count

Forgetting to call new functions means they're defined but never executed.

Tool-Level Resource Optimization

Identifying Top Resource Offender Tools

To prioritize optimization efforts, identify tools consuming most resources:

# From df_jobs DataFrame (job-level metrics)
tool_cpu = df_jobs.groupby('tool_name').agg({
    'cpu_hours': ['sum', 'count', 'mean']
}).reset_index()
tool_cpu.columns = ['tool_name', 'total_cpu_hours', 'job_count', 'avg_cpu_hours']

top_cpu_tools = tool_cpu.sort_values('total_cpu_hours', ascending=False).head(3)

tool_memory = df_jobs.groupby('tool_name').agg({
    'peak_memory_gb': ['sum', 'count', 'mean', 'max']
}).reset_index()
top_memory_tools = tool_memory.sort_values('total_memory_gb', ascending=False).head(3)

Correlating Tool Usage with Genome Characteristics

For each top tool, analyze how genome characteristics affect resource usage:

from scipy import stats

for tool_name in top_cpu_tools['tool_name']:
    tool_jobs = df_jobs[df_jobs['tool_name'] == tool_name]

    # Aggregate by species
    species_data = {}
    for _, job in tool_jobs.iterrows():
        workflow_id = job['workflow_id']
        # Match with workflow invocation to get genome characteristics
        for inv in invocations:
            if inv['id'] == workflow_id and inv.get('species_id'):
                species_id = inv['species_id']
                genome_size = inv.get('genome_size')
                cpu_hours = job['cpu_hours']

                if species_id not in species_data:
                    species_data[species_id] = {
                        'genome_size': genome_size,
                        'cpu_hours': 0
                    }
                species_data[species_id]['cpu_hours'] += cpu_hours

    # Calculate correlation
    x = [d['genome_size'] for d in species_data.values()]
    y = [d['cpu_hours'] for d in species_data.values()]

    if len(x) >= 3:
        corr, pval = stats.pearsonr(x, y)
        sig = '***' if pval < 0.001 else '**' if pval < 0.01 else '*' if pval < 0.05 else 'ns'
        print(f'{tool_name}: r={corr:.3f}, p={pval:.2e} ({sig})')

Interpretation

High correlation + high p-value significance → Tool resource usage scales with genome characteristic

Actionable: Optimize tool for this characteristic
Example: Tool uses 2x CPU per 1 Gb genome size increase

Low/no correlation → Resource usage independent of genome characteristics

Actionable: Look for algorithmic inefficiencies
Example: Tool has fixed overhead or inefficient implementation

Significance levels:

*** (p < 0.001): Highly significant correlation
** (p < 0.01): Very significant
- (p < 0.05): Significant
ns: Not significant

Visualization

Create separate plot for each tool showing all 5 genome characteristics:

Genome Size, Heterozygosity, Repeat Content, Contig N50, GC Content
Include trend line, correlation coefficient, and p-value on each subplot
Point size proportional to number of jobs

Communication Patterns

File Path Confirmation

When modifying files, especially when multiple similar files exist:

Always state the FULL path when making edits:

❌ Bad: "Updating the notebook..."
✅ Good: "Updating /Users/delphine/Workdir/Stats_workflow_run/vgp_workflow_resource_analysis.ipynb"

Ask for clarification if multiple candidates exist:

Found two notebooks:
1. /Users/delphine/Workdir/Stats_workflow_run/vgp_workflow_resource_analysis.ipynb
2. /Users/delphine/Workdir/Stats_workflow_run/sharing/vgp_workflow_resource_analysis.ipynb

Which one are you running?

After user correction, explicitly confirm:

✅ Confirmed: Now modifying the correct file at /path/to/correct/file

User Feedback Interpretation

When user says "the outlier is still there":

Don't assume code is wrong - may be running different file
Don't assume user error - may be our mistake
Verify the file path immediately
Check if changes were applied to the file user is actually running

This pattern prevents major errors where edits are applied to the wrong file for multiple iterations.

Data Quality Validation and Filtering

GenomeScope Data Validation

Critical Issue: VGP Phase 1 dataset contains placeholder values where genome_size_genomescope == total_length (assembly size copied into GenomeScope column).

Detection and Filtering:

# Filter out fake GenomeScope values
df_filtered = df[(df['total_length'].notna()) &
                 (df['genome_size_genomescope'].notna()) &
                 (df['genome_size_genomescope'] != df['total_length'])].copy()

# Report filtering statistics
n_total = len(df[df['genome_size_genomescope'].notna()])
n_fake = len(df[(df['genome_size_genomescope'].notna()) &
                (df['genome_size_genomescope'] == df['total_length'])])
n_real = len(df_filtered)

print(f"Total with GenomeScope: {n_total}")
print(f"Fake (copied): {n_fake} ({n_fake/n_total*100:.1f}%)")
print(f"Real estimates: {n_real} ({n_real/n_total*100:.1f}%)")

Impact: In VGP Phase 1, 396/545 (72.7%) GenomeScope values were fake, leaving only 149 (27.3%) real independent estimates.

Visual Detection: Scatter plots showing all points on diagonal (assembly = expected) indicate circular data.

Best Practice: Always validate that "expected" genome size values are independent from assembly size before comparative analysis.

Meryl K-mer Database Management

Accessing Meryl Histograms for GenomeScope

Key Insight: GenomeScope only needs histogram files (.hist), not full meryl databases.

File Structure on GenomeArk S3:

s3://genomeark/species/{species}/{tolid}/assembly_*/intermediates/meryl/
├── {tolid}.cut.meryl.hist          # ~700KB - THIS IS WHAT YOU NEED
└── {tolid}.cut.meryl/              # Many GB - full database
    ├── 0x000000.merylData
    ├── 0x000000.merylIndex
    └── ...

Direct Histogram URLs (for Galaxy import):

# Pattern:
https://genomeark.s3.amazonaws.com/species/{species}/{tolid}/path/to/meryl/{tolid}.cut.meryl.hist

# Example:
https://genomeark.s3.amazonaws.com/species/Rhinolophus_ferrumequinum/mRhiFer1/assembly_vgp_standard_1.0/intermediates/meryl/mRhiFer1.cut.meryl.hist

Benefits:

1000x smaller download (~700KB vs ~10GB)
Can import directly into Galaxy via URL
No need to download full meryl database
Much faster for batch GenomeScope analysis

Common Mistake: Downloading entire meryl directory when only .hist file is needed.

Assembly Size vs Expected Genome Size Interpretation

Common Pattern: Assemblies ~12% Larger Than Expected

Typical Statistics (from real VGP data, n=149):

Mean ratio (assembly/expected): 1.116
Median ratio: 1.077
Within ±10%: ~58%
Within ±20%: ~88%

Why Assemblies Are Larger:

Incomplete haplotig purging: Haplotype-specific sequences retained in primary
GenomeScope underestimation: High heterozygosity/repeats affect k-mer frequencies
Organellar DNA: Mitochondrial genomes (~15-20kb) assembled and included

Interpretation:

Ratio 1.0-1.2: Normal, good assembly
Ratio > 1.3: Possible retained duplications, check purge_dups metrics
Ratio < 0.9: Possible missing sequence, check coverage

Not a Quality Problem: Moderate scatter (std dev ~30%) reflects biological variation and k-mer estimation limitations, not assembly quality issues.

VGP Assembly Pipeline Skill

Resources

Install

VGP Assembly Pipeline Skill

Overview

Trajectories (by frequency of use)

Trajectory A: HiFi + Hi-C (Most Common)

Trajectory B: HiFi + Trio

Trajectory C: HiFi Only (Least Common)

Workflow Selection by Data Availability

Non-trio workflows (HiFi reads only)

Trio workflows (HiFi + Parental Illumina)

Universal scaffolding workflows

Methods language pattern

Workflow Descriptions

Haplotype Execution Patterns

Run Once (Both Haplotypes Together)

Run Twice (×2 per Haplotype)

WF6 (Purge Duplicates) Decision Logic

Coverage Requirements

QC Checkpoints

After WF1/WF2 (K-mer Profiling)

After WF4/WF5 (Assembly)

After WF8 (Hi-C Scaffolding)

After WF9 (Decontamination)

Karyotype-Based Scaffold Validation

Sex Chromosome Adjustment

Common Pitfalls

WF0 (Mitochondrial) Handling

Visual Diagram Elements

Color Coding (Suggested)

Visual Indicators

Haplotype Badges

Input Data Labels

Summary Table

Reference Genomes for Scaffolding

Common Reference Genome

Best Practices

Resource Analysis Insights

VGP Workflow Canonical Names

Official vs Non-Official Workflow Identification

Species ID (ToLID) Patterns

VGP ToLID Format

ToLID Locations in Galaxy

Extracting ToLIDs for Resource Analysis

Linking ToLIDs with Genome Characteristics

Merging Species IDs with Metrics Data

Data Analysis Troubleshooting

Metric Availability Issues

GenomeArk S3 Data Integration

Fetching GenomeScope2 Summaries

Finding Missing VGP Accessions

Problem

Solution: Search NCBI with VGP-specific Filtering

Curation Impact Analysis - Comparing Methods

Data Filtering for Fair Comparison

Quality Metrics Focus

Statistical Testing Pattern

Missing Data Handling

Incremental Script Development

Tool-Level Resource Optimization

Identifying Top Resource Offender Tools

Correlating Tool Usage with Genome Characteristics

Interpretation

Visualization

Communication Patterns

File Path Confirmation

User Feedback Interpretation

Data Quality Validation and Filtering

GenomeScope Data Validation

Meryl K-mer Database Management

Accessing Meryl Histograms for GenomeScope

Assembly Size vs Expected Genome Size Interpretation

Common Pattern: Assemblies ~12% Larger Than Expected

References

Categories

Install

Recommended Skills