fmschulz

jgi-lakehouse

Query JGI Lakehouse (Dremio) for genomics metadata across GOLD/IMG/Mycocosm/Phytozome and download genome files via IMG taxon OIDs.


Install

npx skillscat add fmschulz/omics-skills/jgi-lakehouse

Install via the SkillsCat registry.

SKILL.md

JGI Lakehouse

Use JGI Lakehouse (Dremio) for metadata queries and the JGI filesystem for sequence downloads.

Instructions

  1. Authenticate to Dremio using a personal access token (PAT).
  2. Explore schemas and tables to find the required metadata.
  3. Run SQL queries for project/sample/taxon discovery.
  4. Use IMG taxon OIDs to fetch genome packages from the filesystem.
  5. Validate outputs and record provenance.
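Steps 1 to 3 can be sketched as a small helper that builds the HTTP request for submitting a query. This is a minimal sketch, not the skill's own client. It assumes the Lakehouse exposes Dremio's standard REST SQL endpoint (`POST /api/v3/sql`) and accepts the PAT as a bearer token. `DREMIO_URL` is a placeholder host, not the real JGI address, and `build_sql_request` is a hypothetical name; take the actual connection details from docs/authentication.md.

```python
import json
import urllib.request

# Placeholder host (assumption): replace with the Lakehouse URL from the auth docs.
DREMIO_URL = "https://dremio.example.org"


def build_sql_request(sql: str, token: str, base_url: str = DREMIO_URL) -> urllib.request.Request:
    """Build (but do not send) a request that submits SQL to a Dremio server."""
    if not token:
        raise ValueError("empty PAT; export DREMIO_PAT before querying")
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v3/sql",
        data=body,
        method="POST",
        headers={
            # Assumption: the PAT is sent as a bearer token.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the request easy to inspect before any token leaves the machine.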

Quick Reference

| Task             | Action                          |
|------------------|---------------------------------|
| Auth setup       | See docs/authentication.md      |
| SQL cheatsheet   | See docs/sql-quick-reference.md |
| Table catalog    | See docs/data-catalog.md        |
| GOLD exploration | See docs/explore_gold.md        |

Input Requirements

  • DREMIO_PAT: a Dremio PAT exported as an environment variable (for Lakehouse access)
  • Query intent (taxonomy, ecosystem, project IDs, etc.)
  • JGI filesystem access for downloads
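A short preflight check can confirm these inputs before any query runs. The sketch below is illustrative: `check_inputs` is a hypothetical helper, and the genome root directory is an assumption, since the source does not state where genome packages live on the JGI filesystem.

```python
import os
from pathlib import Path


def check_inputs(env, genome_root):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # The token must be present and non-blank in the environment mapping.
    if not env.get("DREMIO_PAT", "").strip():
        problems.append("DREMIO_PAT is not set or is empty")
    root = Path(genome_root)
    if not root.is_dir():
        problems.append(f"genome root not found: {root}")
    elif not os.access(root, os.R_OK | os.X_OK):
        # Read plus execute permission is needed to list and enter a directory.
        problems.append(f"genome root not readable: {root}")
    return problems
```

A typical call would be `check_inputs(os.environ, "/path/to/genome/root")`, with the path replaced by whatever location your JGI account can reach.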

Output

  • Query results (tables or CSVs)
  • Lists of taxon OIDs or accessions
  • Downloaded genome packages (FNA/FAA/GFF)
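The first two outputs can be produced from the same query result. The sketch below assumes rows arrive as a list of dictionaries with identical keys and that the OID column is named `taxon_oid`; both are assumptions for illustration, and the helper names are hypothetical.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize query rows (one dict per row, same keys) to CSV text."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def unique_taxon_oids(rows, column="taxon_oid"):
    """Collect distinct, non-empty taxon OIDs in first-seen order."""
    seen = {}  # dicts preserve insertion order, so this doubles as an ordered set
    for row in rows:
        value = row.get(column)
        if value is not None and str(value).strip():
            seen.setdefault(str(value).strip(), None)
    return list(seen)
```

Deduplicating before the download step avoids fetching the same genome package twice when several rows point at one taxon.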

Quality Gates

  • SQL queries return expected row counts
  • Taxon OIDs map to existing filesystem packages
  • Downloaded files pass basic integrity checks
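These gates can be expressed as small checks. The following is a sketch under stated assumptions: it supposes one directory per taxon OID under a genome root (the real layout is not given in the source), and it treats "basic integrity" for an FNA or FAA file as "has at least one header line followed by sequence". All function names are hypothetical.

```python
from pathlib import Path


def row_count_ok(row_count, minimum=1, maximum=None):
    """Gate 1: the row count falls inside the expected range."""
    if row_count < minimum:
        return False
    return maximum is None or row_count <= maximum


def missing_packages(taxon_oids, genome_root):
    """Gate 2: return the OIDs with no directory under genome_root."""
    root = Path(genome_root)
    return [oid for oid in taxon_oids if not (root / str(oid)).is_dir()]


def fasta_is_plausible(lines):
    """Gate 3: a header must come first and at least one sequence line must follow."""
    has_header = False
    has_sequence = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            has_header = True
        elif not has_header:
            return False  # sequence data appeared before any header
        else:
            has_sequence = True
    return has_header and has_sequence
```

`fasta_is_plausible` accepts any iterable of lines, so an open file handle can be passed directly. It does not validate the sequence alphabet, which a stricter check might add.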

Examples

Example 1: Basic GOLD query

SELECT gold_id, project_name
FROM "gold-db-2 postgresql".gold.project
WHERE is_public = 'Yes'
LIMIT 5;
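Step 5 of the instructions asks for provenance, but the source does not say which fields to record. One possible record for a query like the one above is sketched here; the field names and the `provenance_record` helper are assumptions, not part of the skill.

```python
import hashlib
from datetime import datetime, timezone


def provenance_record(sql, row_count, source="gold-db-2 postgresql"):
    """Describe one query run so the result can be traced back later."""
    # Collapse whitespace so formatting changes do not alter the hash.
    normalized = " ".join(sql.split())
    return {
        "sql": normalized,
        "sql_sha256": hashlib.sha256(normalized.encode("utf-8")).hexdigest(),
        "source": source,
        "row_count": row_count,
        "run_at_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
```

Storing the record as JSON next to the exported CSV keeps the query text, its hash, and the observed row count together with the data they describe.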

Troubleshooting

Issue: Authentication failures
Solution: Re-create the PAT and confirm it is exported before querying.

Issue: Missing genome files
Solution: Verify IMG taxon OIDs and filesystem path permissions.
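When genome files appear to be missing, it helps to tell a wrong path apart from a permissions problem. The hypothetical helper below is a sketch of that distinction using only the standard library; it makes no assumption about where JGI stores genome packages.

```python
import os


def diagnose_path(path):
    """Classify a path as 'ok', 'missing', or 'permission-denied'."""
    try:
        os.stat(path)
    except (FileNotFoundError, NotADirectoryError):
        return "missing"
    except PermissionError:
        # A parent directory exists but cannot be searched by this user.
        return "permission-denied"
    # The path exists; check that the current user can actually read it.
    return "ok" if os.access(path, os.R_OK) else "permission-denied"
```

A result of `missing` suggests re-checking the taxon OID, while `permission-denied` points at filesystem access for the account in use.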