gget-genomic-databases

"Unified CLI/Python interface to 20+ genomic databases. Use for quick gene lookups (Ensembl search/info/seq), BLAST/BLAT sequence alignment, AlphaFold structure prediction, enrichment analysis (Enrichr), disease/drug associations (OpenTargets), single-cell data (CELLxGENE), cancer genomics (cBioPortal/COSMIC), and expression correlation (ARCHS4). Covers genomics, proteomics, and disease domains. For batch processing or advanced BLAST use biopython; for multi-database Python SDK workflows use bioservices."

jaechang-hits 279 26 Updated 5mo ago

Install

npx skillscat add jaechang-hits/scicraft/gget-genomic-databases

Install via the SkillsCat registry.

SKILL.md

gget — Unified Genomic Database Access

Overview

gget is a command-line tool and Python package providing unified access to 20+ genomic databases and analysis methods. Query gene information, sequences, protein structures, expression data, and disease associations through a consistent interface. All modules work as both CLI tools and Python functions, returning DataFrames (Python) or JSON/CSV (CLI).

When to Use

Looking up gene information (names, IDs, descriptions) across species from Ensembl
Retrieving nucleotide or protein sequences for Ensembl gene/transcript IDs
Running BLAST or BLAT searches against standard reference databases
Predicting protein 3D structures with AlphaFold2 from amino acid sequences
Performing gene set enrichment analysis (GO, KEGG, disease terms) via Enrichr
Querying single-cell RNA-seq datasets from CELLxGENE Census
Finding disease and drug associations for a gene target via OpenTargets
Downloading Ensembl reference genomes and annotations for a species
Finding cancer mutations and genomic alterations via cBioPortal or COSMIC
Getting tissue expression and correlated genes from ARCHS4
For batch processing or advanced BLAST parameters, use biopython instead
For programmatic multi-database workflows with rate limiting, use bioservices instead

Prerequisites

Python packages: gget
Optional setup: Some modules require gget setup <module> before first use (alphafold, cellxgene, elm, gpt)
Environment: Clean virtual environment recommended to avoid dependency conflicts
API notes: gget queries remote databases, so rate-limit large batch queries with time.sleep(). Upstream databases change over time and gget is tested against them biweekly; keep gget updated. Max ~1000 Ensembl IDs per gget.info() call
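The rate-limiting advice above can be sketched as a small helper. This is a minimal sketch, not part of gget: `chunked` and `run_batched` are hypothetical names, and the query function is passed in so that any gget call (for example `gget.info`) can be plugged in.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def run_batched(query: Callable[[List[str]], object], ids: Sequence[str],
                batch_size: int = 1000, pause_s: float = 1.0) -> list:
    """Call `query` once per batch, sleeping between consecutive calls."""
    results = []
    for n, batch in enumerate(chunked(ids, batch_size)):
        if n > 0:
            time.sleep(pause_s)  # pause only between calls, not before the first
        results.append(query(batch))
    return results


# Demo with a stand-in query function (len) instead of a remote call
print(run_batched(len, ["a", "b", "c", "d", "e"], batch_size=2, pause_s=0.0))
```

With gget installed, `run_batched(gget.info, gene_ids)` would return one DataFrame per batch.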

pip install gget

# Optional: setup modules that need additional dependencies
gget setup alphafold   # ~4GB model parameters, requires OpenMM
gget setup cellxgene   # cellxgene-census package
gget setup elm         # local ELM database

Quick Start

import gget

# Search for genes by keyword
results = gget.search(["BRCA1", "tumor suppressor"], species="homo_sapiens")
print(f"Found {len(results)} genes")

# Get detailed gene information (Ensembl + UniProt + NCBI)
info = gget.info(["ENSG00000012048"])
print(f"Gene: {info.iloc[0]['primary_gene_name']}")

# Enrichment analysis on a gene list
enrichment = gget.enrichr(["ACE2", "AGT", "AGTR1"], database="ontology")
print(f"Enriched terms: {len(enrichment)}")

Core API

Module 1: Reference & Gene Search (ref, search, info, seq)

Query Ensembl for gene references, search by keywords, retrieve gene metadata, and fetch sequences.

import gget

# Search for genes by keyword
results = gget.search(["BRCA1", "tumor suppressor"], species="homo_sapiens")
print(f"Found {len(results)} genes")
print(results[["ensembl_id", "gene_name", "biotype"]].head())

# Get detailed gene information (Ensembl + UniProt + NCBI)
info = gget.info(["ENSG00000012048", "ENSG00000139618"])
print(f"Gene info columns: {list(info.columns)}")

import gget

# Retrieve sequences
nucleotide_seqs = gget.seq(["ENSG00000012048"])
# gget.seq returns a flat list alternating FASTA headers and sequences
protein_seqs = gget.seq(["ENSG00000012048"], translate=True, isoforms=True)
print(f"Retrieved {len(protein_seqs) // 2} isoform sequences")

# Download reference genome files (specify release for reproducibility)
ref_links = gget.ref("homo_sapiens", which="gtf", release=112)
print(f"GTF download link: {ref_links}")
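Downstream steps often need sequences keyed by ID rather than as a flat list. The sketch below assumes `gget.seq` output alternates `>header` lines and sequence lines; `fasta_list_to_dict` is a hypothetical helper, not a gget function.

```python
from typing import Dict, List


def fasta_list_to_dict(lines: List[str]) -> Dict[str, str]:
    """Map the first token of each FASTA header to its (joined) sequence."""
    records: Dict[str, str] = {}
    header = None
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            header = parts[0] if parts else ""
            records[header] = ""
        elif header is not None:
            records[header] += line.strip()  # join wrapped sequence lines
    return records


# Demo input shaped like the assumed gget.seq output (made-up IDs and sequences)
demo = [">ENST_DEMO_1 example header", "MKT", ">ENST_DEMO_2", "MA", "GG"]
print(fasta_list_to_dict(demo))
```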

Module 2: Sequence Alignment (blast, blat, muscle, diamond)

BLAST/BLAT remote searches, multiple sequence alignment, and fast local alignment.

import gget
import time

# BLAST against SwissProt (remote API — add delay for batch queries)
blast_results = gget.blast(
    "MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR",
    database="swissprot", limit=10
)
# Column names mirror the NCBI results table; inspect blast_results.columns before indexing by name
print(f"Top hit: {blast_results.iloc[0]['Description']}")
time.sleep(2)  # Rate-limit between BLAST queries

# BLAT — find genomic position (UCSC)
blat_results = gget.blat("ATCGATCGATCGATCGATCG", assembly="human")
# The chromosome column already carries the "chr" prefix (e.g. "chr6")
print(f"Genomic location: {blat_results.iloc[0]['chromosome']}:{blat_results.iloc[0]['start']}")

import gget

# Multiple sequence alignment with Muscle5
# Writes the aligned FASTA to the path given in `out`
gget.muscle("sequences.fasta", out="aligned.afa")

# Fast local alignment with DIAMOND (local, no rate limit needed)
diamond_results = gget.diamond(
    "GGETISAWESQME",
    reference="reference.fasta",
    sensitivity="very-sensitive",
    threads=4
)
print(f"Alignments found: {len(diamond_results)}")
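Both `gget.muscle` and `gget.diamond` above read sequences from FASTA files such as `sequences.fasta` and `reference.fasta`. A minimal sketch for producing such a file from in-memory sequences follows; `write_fasta` is a hypothetical helper and the 60-column wrap is just a common convention.

```python
from pathlib import Path
from typing import Mapping


def write_fasta(records: Mapping[str, str], path: str, width: int = 60) -> Path:
    """Write {name: sequence} pairs as FASTA, wrapping sequences at `width`."""
    if width < 1:
        raise ValueError("width must be at least 1")
    out = Path(path)
    with out.open("w") as handle:
        for name, seq in records.items():
            handle.write(f">{name}\n")
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")
    return out
```

The returned path can then be passed as the `reference` argument of `gget.diamond` or as the input of `gget.muscle`.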

Module 3: Protein Structure (pdb, alphafold, elm)

Download PDB structures, predict structures with AlphaFold2, find linear motifs.

import gget

# Download PDB structure
pdb_data = gget.pdb("7S7U", save=True)

# Predict structure with AlphaFold2 (requires gget setup alphafold)
structure = gget.alphafold(
    "MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR",
    plot=True, show_sidechains=True
)
print("Structure prediction complete, PDB file saved")

import gget

# Find Eukaryotic Linear Motifs (requires gget setup elm)
ortholog_df, regex_df = gget.elm("LIAQSIGQASFV")
print(f"Ortholog motifs: {len(ortholog_df)}, Regex motifs: {len(regex_df)}")

Module 4: Expression & Correlation (archs4, cellxgene, bgee)

Gene expression, tissue expression, correlated genes, single-cell data.

import gget

# Tissue expression from ARCHS4
tissue_expr = gget.archs4("ACE2", which="tissue")
print(f"Expression across {len(tissue_expr)} tissues")

# Correlated genes from ARCHS4
correlated = gget.archs4("ACE2", which="correlation")
print(f"Top correlated gene: {correlated.iloc[0]['gene_symbol']}")

import gget

# Single-cell data from CELLxGENE (requires gget setup cellxgene)
adata = gget.cellxgene(
    gene=["ACE2", "TMPRSS2"],
    tissue="lung",
    cell_type="epithelial cell",
    census_version="2023-07-25"  # pin version for reproducibility
)
print(f"Cells: {adata.n_obs}, Genes: {adata.n_vars}")

# Orthologs and expression from Bgee
orthologs = gget.bgee("ENSG00000169194", type="orthologs")
print(f"Orthologs in {len(orthologs)} species")
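CELLxGENE gene symbols are case-sensitive (see Troubleshooting), so it can help to normalize symbols before querying. The helper below is a heuristic sketch with a hypothetical name: it follows the usual convention of upper-case human symbols and capitalized mouse symbols, and it will mis-format symbols that break that convention (for example human open-reading-frame names containing lower-case "orf").

```python
def format_gene_symbol(symbol: str, species: str) -> str:
    """Apply the usual capitalization convention for human or mouse symbols."""
    s = symbol.strip()
    if species in ("homo_sapiens", "human"):
        return s.upper()                      # e.g. ace2 -> ACE2
    if species in ("mus_musculus", "mouse"):
        return s[:1].upper() + s[1:].lower()  # e.g. ACE2 -> Ace2
    return s  # unknown species: leave the symbol untouched


print(format_gene_symbol("ace2", "homo_sapiens"), format_gene_symbol("ACE2", "mus_musculus"))
```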

Module 5: Disease & Drug Associations (opentargets, enrichr)

Disease associations, drug targets, enrichment analysis.

import gget

# Disease associations from OpenTargets
diseases = gget.opentargets("ENSG00000169194", resource="diseases", limit=10)
print(f"Associated diseases: {len(diseases)}")

# Drug associations
drugs = gget.opentargets("ENSG00000169194", resource="drugs", limit=10)
print(f"Associated drugs: {len(drugs)}")

# OpenTargets resources: diseases, drugs, tractability, pharmacogenetics,
#   expression, depmap, interactions

import gget

# Enrichment analysis via Enrichr
# Database shortcuts: 'pathway' (KEGG), 'transcription' (ChEA),
#   'ontology' (GO_BP), 'diseases_drugs' (GWAS), 'celltypes' (PanglaoDB)
enrichment = gget.enrichr(
    ["ACE2", "AGT", "AGTR1", "TMPRSS2", "DPP4"],
    database="ontology"
)
print(f"Enriched terms: {len(enrichment)}")
# gget.enrichr names its result columns path_name and adj_p_val
print(enrichment[["path_name", "adj_p_val"]].head())

Module 6: Cancer Genomics (cbio, cosmic)

Cancer mutations, copy number alterations, and somatic mutation databases.

import gget

# Search cBioPortal studies
studies = gget.cbio_search(["breast", "lung"])
print(f"Studies found: {len(studies)}")

# Plot cancer genomics heatmap
gget.cbio_plot(
    ["msk_impact_2017"],
    ["AKT1", "ALK", "BRAF"],
    stratification="tissue",
    variation_type="mutation_occurrences"
)

import gget

# COSMIC: requires account + local database download
# First-time: gget.cosmic(searchterm="", download_cosmic=True,
#   email="user@example.com", password="xxx", cosmic_project="cancer")
cosmic_results = gget.cosmic("EGFR", cosmic_tsv_path="cosmic_data.tsv", limit=10)
print(f"COSMIC mutations: {len(cosmic_results)}")

Module 7: Mutation Generation & Utilities (mutate, setup)

Generate mutated sequences and manage module dependencies.

import gget
import pandas as pd

# Generate mutated sequences from mutation annotations
mutations_df = pd.DataFrame({
    "seq_ID": ["seq1", "seq1"],
    "mutation": ["c.4G>T", "c.10del"]
})
mutated = gget.mutate(["ATCGCTAAGCTGATCG"], mutations=mutations_df)
print(f"Generated {len(mutated)} mutated sequences")
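To make the mutation notation concrete, here is a pure-Python sketch that applies the two annotations used above (`c.4G>T` and `c.10del`) to a sequence. It only illustrates the 1-based coordinate convention; it is not how `gget.mutate` is implemented, and `apply_simple_mutation` is a hypothetical name that handles only single-base substitutions and deletions.

```python
import re

_SUBSTITUTION = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")
_DELETION = re.compile(r"^c\.(\d+)del$")


def apply_simple_mutation(seq: str, mutation: str) -> str:
    """Apply a single-base substitution or deletion given in c. notation."""
    sub = _SUBSTITUTION.match(mutation)
    dele = _DELETION.match(mutation)
    if not (sub or dele):
        raise ValueError(f"unsupported mutation: {mutation}")
    pos = int((sub or dele).group(1))  # 1-based position
    if not 1 <= pos <= len(seq):
        raise ValueError(f"position {pos} is outside the sequence")
    if sub:
        ref, alt = sub.group(2), sub.group(3)
        if seq[pos - 1] != ref:
            raise ValueError(f"expected {ref} at position {pos}, found {seq[pos - 1]}")
        return seq[:pos - 1] + alt + seq[pos:]
    return seq[:pos - 1] + seq[pos:]  # drop the base at `pos`


print(apply_simple_mutation("ATCGCTAAGCTGATCG", "c.4G>T"))
print(apply_simple_mutation("ATCGCTAAGCTGATCG", "c.10del"))
```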

Key Concepts

Module Overview

gget organizes 20+ modules by domain. Python interface uses gget.<module>():

Domain	Modules	Primary Database
Gene reference	`ref`, `search`, `info`, `seq`	Ensembl, UniProt, NCBI
Sequence alignment	`blast`, `blat`, `muscle`, `diamond`	NCBI BLAST, UCSC, local
Protein structure	`pdb`, `alphafold`, `elm`	RCSB PDB, AlphaFold2, ELM
Expression	`archs4`, `cellxgene`, `bgee`	ARCHS4, CZ CELLxGENE, Bgee
Disease/drugs	`opentargets`, `enrichr`	OpenTargets, Enrichr
Cancer	`cbio`, `cosmic`	cBioPortal, COSMIC
Utilities	`mutate`, `setup`, `gpt`	local / OpenAI

Output Formats

Context	Default Format	Alternatives
Python	DataFrame or dict	`json=True` for JSON; `save=True` to file
CLI	JSON	`-csv` for CSV; `-o file` to save
Sequences	FASTA (seq, mutate)	--
Structures	PDB file (pdb, alphafold)	JSON alignment error data
Single-cell	AnnData object (cellxgene)	`meta_only=True` for metadata only
Visualization	PNG (cbio plot)	`show=True` for interactive display

Enrichr Database Shortcuts

Shortcut	Full Database Name
`'pathway'`	KEGG_2021_Human
`'transcription'`	ChEA_2016
`'ontology'`	GO_Biological_Process_2021
`'diseases_drugs'`	GWAS_Catalog_2019
`'celltypes'`	PanglaoDB_Augmented_2021

Custom libraries: pass any Enrichr library name directly (e.g., "Jensen_TISSUES").
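When logging or reporting results, it can be useful to record the full library name rather than the shortcut. The sketch below copies the mapping from the table above into a dictionary; `resolve_enrichr_database` is a hypothetical helper and gget performs its own resolution internally.

```python
# Shortcut -> full Enrichr library name, as listed in the table above
ENRICHR_SHORTCUTS = {
    "pathway": "KEGG_2021_Human",
    "transcription": "ChEA_2016",
    "ontology": "GO_Biological_Process_2021",
    "diseases_drugs": "GWAS_Catalog_2019",
    "celltypes": "PanglaoDB_Augmented_2021",
}


def resolve_enrichr_database(name: str) -> str:
    """Return the full library name for a shortcut; pass other names through."""
    return ENRICHR_SHORTCUTS.get(name, name)


print(resolve_enrichr_database("ontology"))
print(resolve_enrichr_database("Jensen_TISSUES"))
```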

OpenTargets Resources

Resource	Description
`diseases`	Disease associations with evidence scores
`drugs`	Drug associations and clinical trial data
`tractability`	Target tractability assessment
`pharmacogenetics`	Pharmacogenetic variants
`expression`	Baseline tissue expression
`depmap`	DepMap gene-disease effects
`interactions`	Protein-protein interactions
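Pulling several of these resources for one target is a common pattern. The following sketch loops over the resource names from the table and keeps going when a single query fails. `collect_resources` is a hypothetical helper; the query function is injected so the example runs without network access.

```python
from typing import Callable, Dict, Iterable, Tuple

OPENTARGETS_RESOURCES = (
    "diseases", "drugs", "tractability", "pharmacogenetics",
    "expression", "depmap", "interactions",
)


def collect_resources(
    query: Callable[[str, str], object],
    ensembl_id: str,
    resources: Iterable[str] = OPENTARGETS_RESOURCES,
) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Run `query(ensembl_id, resource)` per resource; return (results, errors)."""
    resources = list(resources)
    unknown = [r for r in resources if r not in OPENTARGETS_RESOURCES]
    if unknown:
        raise ValueError(f"unknown resources: {unknown}")
    results: Dict[str, object] = {}
    errors: Dict[str, str] = {}
    for resource in resources:
        try:
            results[resource] = query(ensembl_id, resource)
        except Exception as exc:  # keep going if one resource fails
            errors[resource] = str(exc)
    return results, errors


# With gget installed (network required), the query could be:
#   lambda gene_id, res: gget.opentargets(gene_id, resource=res)
results, errors = collect_resources(lambda gid, res: f"{gid}:{res}", "ENSG00000169194",
                                    ["diseases", "drugs"])
print(results, errors)
```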

Reproducibility

Pin database versions for consistent results across analyses:

import gget
# Pin Ensembl release
ref = gget.ref("homo_sapiens", release=112)

# Pin CELLxGENE Census version
adata = gget.cellxgene(gene=["ACE2"], census_version="2023-07-25")

# Always record gget version
print(f"gget version: {gget.__version__}")
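One way to keep the pinned versions alongside the results is to write them to a small JSON record. This is a sketch under stated assumptions: `build_provenance` and its field names are hypothetical, and the version string in the demo is a placeholder for `gget.__version__`.

```python
import json
import platform
from datetime import datetime, timezone
from typing import Optional


def build_provenance(gget_version: str,
                     ensembl_release: Optional[int] = None,
                     census_version: Optional[str] = None) -> dict:
    """Collect the versions that determine whether a run can be reproduced."""
    return {
        "gget_version": gget_version,
        "ensembl_release": ensembl_release,
        "census_version": census_version,
        "python_version": platform.python_version(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# "0.0.0" is a placeholder; pass gget.__version__ in a real run
record = build_provenance("0.0.0", ensembl_release=112, census_version="2023-07-25")
print(json.dumps(record, indent=2))
```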

Common Workflows

Workflow 1: Gene Discovery to Functional Analysis

Goal: Find genes of interest, get their sequences, and perform enrichment analysis.

import gget

# 1. Search for genes
results = gget.search(["GABA", "receptor"], species="homo_sapiens")
gene_ids = results["ensembl_id"].tolist()[:10]

# 2. Get detailed information
info = gget.info(gene_ids)
print(f"Retrieved info for {len(info)} genes")

# 3. Get protein sequences
sequences = gget.seq(gene_ids, translate=True)

# 4. Find correlated genes (archs4 expects a gene symbol unless ensembl=True is set)
correlated = gget.archs4(results["gene_name"].iloc[0], which="correlation")

# 5. Enrichment analysis on correlated genes
gene_list = correlated["gene_symbol"].tolist()[:50]
enrichment = gget.enrichr(gene_list, database="ontology")
print(f"Top enriched term: {enrichment.iloc[0]['path_name']}")

Workflow 2: Target Validation for Drug Discovery

Goal: Investigate a gene's disease associations, druggability, and cancer mutations.

import gget

gene_id = "ENSG00000109906"  # ZBTB16

# 1. Disease associations
diseases = gget.opentargets(gene_id, resource="diseases", limit=20)

# 2. Drug associations
drugs = gget.opentargets(gene_id, resource="drugs")

# 3. Tractability assessment
tractability = gget.opentargets(gene_id, resource="tractability")

# 4. Protein interactions
interactions = gget.opentargets(gene_id, resource="interactions")
print(f"Diseases: {len(diseases)}, Drugs: {len(drugs)}, Interactions: {len(interactions)}")

# 5. Cancer genomics
gget.cbio_plot(["msk_impact_2017"], ["ZBTB16"], stratification="cancer_type")

Workflow 3: Comparative Genomics

Goal: Compare a gene across species using orthologs and sequence alignment.

import gget

# 1. Find orthologs
orthologs = gget.bgee("ENSG00000169194", type="orthologs")

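# 2. Get protein sequences for human and mouse
#    (gget.seq returns a flat list: [">header", "SEQUENCE", ...])
human_seq = gget.seq("ENSG00000169194", translate=True)
# Confirm the mouse ID against the ortholog table from step 1 before relying on it
mouse_seq = gget.seq("ENSMUSG00000026091", translate=True)

# 3. Align sequences (pass the sequence strings, not the header/sequence lists)
alignment = gget.muscle([human_seq[1], mouse_seq[1]])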

# 4. Get human protein structure from PDB
pdb_structure = gget.pdb("7S7U")
print("Comparative analysis complete")
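Once two sequences are aligned, a quick summary statistic is percent identity over the aligned columns. The sketch below assumes you have the two aligned rows as equal-length strings with `-` as the gap character; `percent_identity` is a hypothetical helper, and ignoring gapped columns is one of several possible definitions.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


print(percent_identity("AC-GT", "ACTGA"))
```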

Key Parameters

Parameter	Module(s)	Default	Range / Options	Effect
`species`	search, archs4, cellxgene, enrichr	`"homo_sapiens"`	Any Ensembl species; shortcuts: 'human', 'mouse'	Target organism
`limit`	blast, opentargets, cosmic	`50` / `100`	`1`-`1000`	Maximum results returned
`database`	blast, enrichr	varies	blast: nt/nr/swissprot/pdbaa; enrichr: shortcuts or library names	Target database for query
`which`	ref, archs4	varies	ref: `gtf`,`cdna`,`dna`,`cds`,`pep`; archs4: `correlation`,`tissue`	Data type to retrieve
`translate`	seq	`False`	`True`/`False`	Return amino acid instead of nucleotide sequences
`resource`	opentargets	`"diseases"`	diseases, drugs, tractability, pharmacogenetics, expression, depmap, interactions	OpenTargets data type
`release`	ref, search	latest	Integer Ensembl release number	Pin database version for reproducibility
`census_version`	cellxgene	`"stable"`	`"stable"`, `"latest"`, date string	Pin CELLxGENE Census version
`sensitivity`	diamond, elm	`"very-sensitive"`	`fast` to `ultra-sensitive`	Alignment sensitivity vs speed
`threads`	diamond, elm	`1`	`1`-`N`	CPU threads for alignment
`multimer_recycles`	alphafold	`3`	`3`-`20`	Higher = more accurate multimer prediction

Best Practices

Pin database versions for reproducibility: Use release=112 for Ensembl and census_version="2023-07-25" for CELLxGENE to ensure consistent results across analyses.
Rate-limit batch queries: gget queries remote APIs. Add time.sleep(2) between BLAST/BLAT queries in loops. For gget.info(), limit to ~1000 IDs per call.
Keep gget updated: Upstream databases occasionally change their structure; gget is tested against them biweekly and patched when needed. Run pip install --upgrade gget regularly to avoid breakage from schema changes.
Use Python interface for pipelines, CLI for exploration: Python functions return DataFrames suitable for chaining. CLI with -csv is better for quick one-off lookups.
Check PDB before running AlphaFold: gget.pdb() is instant; AlphaFold prediction takes minutes to hours. Always check if the structure already exists in PDB.
Use database shortcuts in enrichr: The shortcuts ('pathway', 'ontology', etc.) map to curated Enrichr libraries. For custom analyses, pass any Enrichr library name directly.
Cache cBioPortal data for repeated analyses: Use data_dir="./cache" parameter to avoid re-downloading large cancer genomics datasets.
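The "check PDB before running AlphaFold" practice can be expressed as a small fallback function. This is a sketch with hypothetical names: the fetch and predict steps are injected as callables, and it assumes the fetch step returns `None` or raises when no structure is found, which should be verified for the gget version in use.

```python
from typing import Callable, Optional, Tuple


def get_structure(pdb_id: Optional[str], sequence: str,
                  fetch_pdb: Callable[[str], object],
                  predict: Callable[[str], object]) -> Tuple[str, object]:
    """Return ("pdb", data) if a deposited structure exists, else ("predicted", data)."""
    if pdb_id:
        try:
            found = fetch_pdb(pdb_id)
        except Exception:
            found = None  # treat lookup failures as "not available"
        if found is not None:
            return "pdb", found
    return "predicted", predict(sequence)  # slow path: run the prediction


# Stand-in callables; with gget these could be gget.pdb and gget.alphafold
source, _ = get_structure("7S7U", "MKW", lambda pid: {"id": pid}, lambda seq: "model")
print(source)
```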

Common Recipes

Recipe: Batch Gene Information Retrieval

When to use: Need information for many genes at once (up to ~1000 IDs per call).

import time

import gget
import pandas as pd

gene_ids = ["ENSG00000012048", "ENSG00000139618", "ENSG00000141510"]
info = gget.info(gene_ids)
info.to_csv("gene_info_batch.csv")
print(f"Saved info for {len(info)} genes")

# For >1000 genes, batch with rate limiting
all_ids = gene_ids  # replace with your full list of Ensembl IDs
results = []
for i in range(0, len(all_ids), 500):
    batch = all_ids[i:i+500]
    results.append(gget.info(batch))
    time.sleep(1)  # pause between calls to the remote API
all_info = pd.concat(results)  # combine the per-batch DataFrames

Recipe: Custom Enrichment with Background

When to use: Running enrichment against a custom background gene set.

import gget

# Use specific Enrichr library with background genes
enrichment = gget.enrichr(
    ["ACE2", "AGT", "AGTR1"],
    database="Jensen_TISSUES",
    background_list=["ACE2", "AGT", "AGTR1", "TP53", "BRCA1", "MYC"]
)
# gget.enrichr names its result columns path_name and adj_p_val
print(enrichment[["path_name", "adj_p_val"]].head())

Recipe: AlphaFold Structure Prediction with Visualization

When to use: Predicting and visualizing protein structures with confidence coloring.

import gget

# Predict with visualization (PAE + 3D structure)
result = gget.alphafold(
    "MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR",
    plot=True,
    show_sidechains=True,
    relax=True  # AMBER relaxation for final structure
)
# Output: PDB file + predicted aligned error (PAE) JSON
# PAE heatmap auto-generated with plot=True

Recipe: Download Reference Genome for RNA-seq Pipeline

When to use: Setting up reference files for RNA-seq alignment pipelines.

# Download GTF and cDNA for human (specific release)
gget ref -w gtf,cdna -d -r 112 homo_sapiens

# Download genome DNA
gget ref -w dna -d homo_sapiens

Troubleshooting

Problem	Cause	Solution
`ModuleNotFoundError: gget`	Package not installed	`pip install gget` in clean virtual environment
`gget setup alphafold` fails	Python version incompatibility	Use Python 3.8-3.10; check `gget --version`
Empty BLAST results	Sequence too short or no matches	Try longer sequence, different database, or `megablast_off=True`
`cellxgene` gene not found	Case-sensitive gene symbols	Use `'ACE2'` for human, `'Ace2'` for mouse (exact capitalization required)
`gget info` timeout	Too many IDs at once	Limit to ~1000 Ensembl IDs per call; batch with `time.sleep()`
Database structure changed	Upstream database changed after the installed gget release	`pip install --upgrade gget`
COSMIC authentication error	Missing or expired credentials	Re-enter email/password; check COSMIC account status
AlphaFold out of memory	Protein too long for GPU memory	Use shorter sequences or split into domains
Different results on re-run	Database updated between runs	Pin versions: `release=112` for Ensembl, `census_version` for CELLxGENE
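Timeouts and transient failures from remote databases can often be handled by retrying with a growing delay. The sketch below is a generic pattern with a hypothetical `retry` helper, not something gget provides; the sleep function is injectable so the example can be exercised without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(call: Callable[[], T], attempts: int = 3, base_delay_s: float = 1.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `call()` up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError("unreachable")


# With gget installed, a call could be wrapped like:
#   info = retry(lambda: gget.info(["ENSG00000012048"]))
print(retry(lambda: "ok", sleep=lambda s: None))
```

Catching every exception is a deliberate simplification here; in practice, restricting the `except` clause to network-related errors avoids retrying on genuine input mistakes.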

Bundled Resources

2 reference files provide extended coverage of capabilities from the original 3 reference files and 3 script files:

references/module_parameters.md — Consolidates module_reference.md (468 lines). Covers: detailed parameter tables for all 15+ modules with types, defaults, and return value descriptions; CLI vs Python interface differences; setup requirements per module. Relocated inline: most-used module parameters (Core API code blocks), output format summary (Key Concepts table). Omitted: gget gpt module details — trivial OpenAI wrapper, not genomics-specific.
references/databases_workflows.md — Consolidates database_info.md (301 lines) and workflows.md (815 lines). Covers: complete database directory with update frequencies and citation info, extended workflow examples (building reference indices, disease-drug pipeline, multi-species comparative analysis), data consistency and reproducibility guidance. Relocated inline: core database overview (Key Concepts table), top 3 workflows (Common Workflows), reproducibility patterns (Key Concepts). Omitted: scripts/ content (3 files, 590 lines total) — thin wrappers around gget API calls for CLI automation; core patterns absorbed into Core API and Common Workflows. Promotional/vendor content stripped per migration rule 4.

Related Skills

biopython — advanced BLAST parameters, batch sequence processing, GenBank record parsing
bioservices — programmatic multi-database queries with built-in rate limiting (UniProt, KEGG, ChEMBL)
anndata-data-structure — working with AnnData objects returned by gget.cellxgene()
enrichr — deeper enrichment analysis with custom gene set libraries

References

gget documentation — official docs and tutorials
gget GitHub — source code, issues
Luebbert, L. & Pachter, L. (2023). Efficient querying of genomic reference databases with gget. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac836
