ensembl-database

"Query Ensembl REST API for gene/transcript/variant annotations across 300+ species. Retrieve gene info by symbol/ID, sequence, cross-references (HGNC, RefSeq, UniProt), variants, regulatory features, comparative genomics. For bulk local access use pyensembl; for pathway lookups use kegg-database or reactome-database."

jaechang-hits 279 26 Updated 5mo ago

Install

npx skillscat add jaechang-hits/scicraft/ensembl-database

Install via the SkillsCat registry.

SKILL.md

Ensembl Genome Database

Overview

Ensembl is a comprehensive genome annotation database covering 300+ vertebrate and non-vertebrate species. The Ensembl REST API provides programmatic access to gene models, transcript/protein sequences, variant annotations, cross-references, regulatory features, and comparative genomics without requiring any login or API key.

When to Use

Retrieving official gene and transcript annotations (stable IDs, biotype, genomic coordinates) for human or model organism genes
Converting between gene identifier namespaces (HGNC symbol ↔ Ensembl ID ↔ RefSeq ↔ UniProt)
Fetching genomic or cDNA/CDS/protein sequences for a gene or transcript
Looking up variant consequences and functional impact (VEP) for a list of SNPs
Querying regulatory features (promoters, enhancers, CTCF sites) in a genomic region
Performing comparative genomics queries (orthologs, paralogs, gene trees) across species
For local offline access to large genomic annotations, use pyensembl instead
For pathway and metabolic annotations, use kegg-database or reactome-database instead

Prerequisites

Python packages: requests
Data requirements: gene symbols, Ensembl stable IDs (ENSG…/ENST…/ENSP…), or genomic coordinates
Environment: internet connection required; no API key needed
Rate limits: max ~15 requests/second; use expand=1 and batch endpoints to minimize calls

pip install requests
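
The rate limit above can be handled with a small retry wrapper. The sketch below is one possible approach, not part of the original skill: `get_json_with_retry` and `retry_after_seconds` are hypothetical helper names, and the code assumes the server sends a `Retry-After` header (in seconds) with HTTP 429 responses. The session is passed in so the helper works with `requests.Session()` or a stand-in object.

```python
import time

def retry_after_seconds(headers, default=1.0):
    """Read the Retry-After header as seconds, falling back to `default`."""
    try:
        return max(0.0, float(headers.get("Retry-After", default)))
    except (TypeError, ValueError):
        return default

def get_json_with_retry(session, url, params=None, max_tries=3, sleep=time.sleep):
    """GET `url` and return parsed JSON, waiting and retrying on HTTP 429."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(max_tries):
        r = session.get(url, headers={"Content-Type": "application/json"}, params=params)
        if r.status_code == 429 and attempt < max_tries - 1:
            sleep(retry_after_seconds(r.headers))  # back off as the server asks
            continue
        r.raise_for_status()  # raises on the final 429 or any other error
        return r.json()
```

For real requests, pass `requests.Session()` as `session`, for example `get_json_with_retry(requests.Session(), "https://rest.ensembl.org/lookup/symbol/homo_sapiens/BRCA1")`.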

Quick Start

import requests

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

def ensembl_get(endpoint, params=None):
    r = requests.get(f"{BASE}{endpoint}", headers=HEADERS, params=params)
    r.raise_for_status()
    return r.json()

# Look up human BRCA1
gene = ensembl_get("/lookup/symbol/homo_sapiens/BRCA1", params={"expand": 1})
print(f"ID: {gene['id']}, Chr: {gene['seq_region_name']}:{gene['start']}-{gene['end']}")
print(f"Transcripts: {len(gene.get('Transcript', []))}")

Core API

Query 1: Gene Lookup by Symbol or Stable ID

Retrieve gene metadata from a gene symbol or Ensembl stable ID.

import requests

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

# By gene symbol
r = requests.get(
    f"{BASE}/lookup/symbol/homo_sapiens/TP53",
    headers=HEADERS,
    params={"expand": 1}
)
gene = r.json()
print(f"Ensembl ID : {gene['id']}")
print(f"Location   : {gene['seq_region_name']}:{gene['start']}-{gene['end']} ({gene['strand']})")
print(f"Biotype    : {gene['biotype']}")
print(f"Transcripts: {len(gene.get('Transcript', []))}")

# By stable ID (works for genes, transcripts, proteins)
r = requests.get(
    f"{BASE}/lookup/id/ENSG00000141510",
    headers=HEADERS,
    params={"expand": 0}
)
obj = r.json()
print(f"Symbol: {obj.get('display_name')}, Species: {obj.get('species')}")

Query 2: Batch Lookup

Retrieve information for multiple IDs in one call (POST endpoint).

import requests, json

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

# Batch lookup by symbols
symbols = ["BRCA1", "BRCA2", "TP53", "EGFR", "MYC"]
r = requests.post(
    f"{BASE}/lookup/symbol/homo_sapiens",
    headers=HEADERS,
    data=json.dumps({"symbols": symbols})
)
results = r.json()
for sym, data in results.items():
    if data:
        print(f"{sym}: {data['id']} ({data['seq_region_name']}:{data['start']}-{data['end']})")
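
A batch response may not contain an entry for every requested symbol, so it helps to report what was not resolved. The helper below is a minimal sketch with a hypothetical name (`split_found_missing`); it treats a symbol as missing if its key is absent or maps to an empty value, which is an assumption about how unresolved symbols appear in the response.

```python
def split_found_missing(requested, results):
    """Split requested symbols into resolved entries and unresolved symbols."""
    found = {s: results[s] for s in requested if results.get(s)}
    missing = [s for s in requested if s not in found]
    return found, missing

# Example with a hand-written response-like dict (not real API output)
example = {"TP53": {"id": "ENSG00000141510"}, "NOTAGENE": None}
found, missing = split_found_missing(["TP53", "NOTAGENE", "ABSENT"], example)
print(sorted(found), missing)
```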

Query 3: Sequence Retrieval

Fetch genomic, cDNA, CDS, or protein sequences.

import requests

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "text/plain"}

# Protein sequence for canonical transcript
r = requests.get(
    f"{BASE}/sequence/id/ENST00000269305",
    headers=HEADERS,
    params={"type": "protein"}
)
seq = r.text
print(f"Protein sequence ({len(seq)} aa): {seq[:60]}...")

# Genomic region sequence
HEADERS_JSON = {"Content-Type": "application/json"}
r = requests.get(
    f"{BASE}/sequence/region/human/17:43044295..43125364",
    headers=HEADERS_JSON,
    params={"coord_system_version": "GRCh38"}
)
result = r.json()
print(f"Retrieved {len(result['seq'])} bp of genomic sequence")

Query 4: Cross-References (ID Mapping)

Map Ensembl IDs to external database identifiers.

import requests

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

# All xrefs for a gene
r = requests.get(
    f"{BASE}/xrefs/id/ENSG00000141510",
    headers=HEADERS
)
xrefs = r.json()

# Group by database
from collections import defaultdict
by_db = defaultdict(list)
for x in xrefs:
    by_db[x["dbname"]].append(x["primary_id"])

for db in ["HGNC", "RefSeq_gene_name", "Uniprot_gn", "MIM_gene"]:
    if db in by_db:
        print(f"{db}: {by_db[db]}")

Query 5: Variant Consequence Annotation (VEP)

Predict functional consequences of variants via REST VEP endpoint.

import requests, json

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

# Annotate a list of hgvs notations
variants = ["17:g.43094692C>T", "13:g.32929387C>T"]
r = requests.post(
    f"{BASE}/vep/human/hgvs",
    headers=HEADERS,
    data=json.dumps({"hgvs_notations": variants})
)
for v in r.json():
    print(f"\nVariant: {v.get('input')}")
    for tc in v.get("transcript_consequences", [])[:2]:
        print(f"  Gene: {tc.get('gene_symbol')}, Impact: {tc.get('impact')}, Consequence: {tc.get('consequence_terms')}")

# Annotate by rsID
r = requests.get(
    f"{BASE}/vep/human/id/rs699",
    headers=HEADERS
)
v = r.json()[0]
print(f"rsID rs699 in gene: {v['transcript_consequences'][0]['gene_symbol']}")
print(f"Consequence: {v['transcript_consequences'][0]['consequence_terms']}")

Query 6: Regulatory Features

Query regulatory build features in a genomic region.

import requests

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

# Regulatory features in BRCA1 region
r = requests.get(
    f"{BASE}/overlap/region/human/17:43044000-43126000",
    headers=HEADERS,
    params={"feature": "regulatory"}
)
features = r.json()
print(f"Found {len(features)} regulatory features")
for f in features[:5]:
    print(f"  {f.get('feature_type')}: {f.get('start')}-{f.get('end')} ({f.get('description', 'n/a')})")

Query 7: Comparative Genomics (Orthologs / Gene Trees)

Find orthologs and paralogs across species.

import requests

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

# Get mouse ortholog for human TP53
r = requests.get(
    f"{BASE}/homology/symbol/human/TP53",
    headers=HEADERS,
    params={"target_species": "mus_musculus", "type": "orthologues"}
)
data = r.json()
for homo in data["data"][0]["homologies"][:3]:
    tgt = homo["target"]
    print(f"Mouse ortholog: {tgt['id']} ({tgt.get('perc_id', 'n/a')}% identity)")

Key Concepts

Stable IDs and Versioning

Ensembl uses stable IDs with optional version suffixes (e.g., ENSG00000141510.17). Genes (ENSG), transcripts (ENST), proteins (ENSP), and exons (ENSE) each have their own prefix. IDs are preserved across releases when possible; retired IDs can still be resolved via the archive API.
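
The ID layout described above can be checked locally before calling the API. This is a sketch only: `parse_stable_id` is a hypothetical helper, the 11-digit body and the optional species infix (as in mouse `ENSMUSG…`) are assumptions about the format, and only the four prefixes named above are handled.

```python
import re

# ENS + optional species letters + feature letter + 11 digits + optional .version
_STABLE_ID = re.compile(r"^ENS([A-Z]{3,4})?([GTPE])(\d{11})(?:\.(\d+))?$")
_KINDS = {"G": "gene", "T": "transcript", "P": "protein", "E": "exon"}

def parse_stable_id(stable_id):
    """Split a stable ID into its unversioned ID, feature kind, and version."""
    m = _STABLE_ID.match(stable_id.strip())
    if not m:
        raise ValueError(f"not a recognised stable ID: {stable_id!r}")
    species, kind, digits, version = m.groups()
    return {
        "base_id": f"ENS{species or ''}{kind}{digits}",
        "kind": _KINDS[kind],
        "version": int(version) if version else None,
    }

print(parse_stable_id("ENSG00000141510.17"))
```

Stripping the version this way gives the unversioned form used in the lookup examples in this document.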

Assembly Versions

Human genome: GRCh38 (current) and GRCh37 (legacy, via grch37.rest.ensembl.org). Always specify which assembly your coordinates belong to when making region-based queries.
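
One way to keep the assembly explicit is to make it a required choice when building region URLs. The sketch below uses the two hosts named above; `ASSEMBLY_HOSTS` and `region_url` are hypothetical names introduced for illustration.

```python
ASSEMBLY_HOSTS = {
    "GRCh38": "https://rest.ensembl.org",
    "GRCh37": "https://grch37.rest.ensembl.org",
}

def region_url(chrom, start, end, assembly="GRCh38", species="human"):
    """Build an /overlap/region URL on the host that serves `assembly`."""
    if assembly not in ASSEMBLY_HOSTS:
        raise ValueError(f"unsupported assembly: {assembly!r}")
    if start > end:
        raise ValueError("start must not exceed end")
    return f"{ASSEMBLY_HOSTS[assembly]}/overlap/region/{species}/{chrom}:{start}-{end}"

print(region_url("17", 43044295, 43125364))
print(region_url("17", 41196312, 41277500, assembly="GRCh37"))
```

The returned URL still needs a `feature` parameter when requested, as in the overlap examples in this document.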

Common Workflows

Workflow 1: Gene-to-Protein Information Pipeline

Goal: Retrieve all key annotations for a gene list — coordinates, transcripts, xrefs, and canonical protein sequence.

import requests, json, time

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

def batch_lookup(symbols, species="homo_sapiens"):
    r = requests.post(
        f"{BASE}/lookup/symbol/{species}",
        headers=HEADERS,
        data=json.dumps({"symbols": symbols, "expand": 1})
    )
    return r.json()

def canonical_transcript(gene_data):
    """Return the protein-coding transcript dict with the longest translation (a proxy for the canonical transcript)."""
    transcripts = gene_data.get("Transcript", [])
    coding = [t for t in transcripts if t.get("biotype") == "protein_coding"]
    if not coding:
        return None
    return max(coding, key=lambda t: t.get("Translation", {}).get("length", 0))

genes = ["BRCA1", "BRCA2", "TP53"]
lookup = batch_lookup(genes)

for sym in genes:
    g = lookup.get(sym)
    if not g:
        print(f"{sym}: not found")
        continue
    canon = canonical_transcript(g)
    print(f"\n{sym} ({g['id']})")
    print(f"  Location: {g['seq_region_name']}:{g['start']}-{g['end']}")
    if canon:
        prot_len = canon.get("Translation", {}).get("length", "n/a")
        print(f"  Canonical transcript: {canon['id']} ({prot_len} aa)")
    time.sleep(0.1)  # be polite

Workflow 2: Variant Annotation Pipeline

Goal: Annotate a VCF-style variant list with gene, consequence, and impact.

import requests, json, pandas as pd

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

# Input: list of hgvs notations
hgvs_list = [
    "17:g.43094692C>T",
    "17:g.43063873A>G",
    "13:g.32929387C>T",
]

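# Annotate one batch per request (keep each batch to 200 notations or fewer)
def vep_batch(hgvs_batch):
    r = requests.post(
        f"{BASE}/vep/human/hgvs",
        headers=HEADERS,
        params={"canonical": 1},  # ask VEP to flag canonical transcripts; the filter below relies on it
        data=json.dumps({"hgvs_notations": hgvs_batch})
    )
    r.raise_for_status()
    return r.json()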

records = []
for ann in vep_batch(hgvs_list):
    for tc in ann.get("transcript_consequences", []):
        if tc.get("canonical") == 1:
            records.append({
                "variant": ann["input"],
                "gene": tc.get("gene_symbol"),
                "consequence": ",".join(tc.get("consequence_terms", [])),
                "impact": tc.get("impact"),
                "biotype": tc.get("biotype"),
            })

df = pd.DataFrame(records)
print(df.to_string(index=False))
df.to_csv("vep_results.csv", index=False)
print(f"\nSaved {len(df)} variant annotations → vep_results.csv")
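
The workflow above sends a single batch. For longer variant lists, the input can be split into fixed-size chunks first. This is a minimal sketch; `chunked` is a hypothetical helper, and the default of 200 follows the batch size mentioned in the workflow.

```python
def chunked(items, size=200):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Example: 450 placeholder entries split into batches of 200, 200, and 50
batches = list(chunked([f"variant_{n}" for n in range(450)]))
print([len(b) for b in batches])
```

Each chunk can then be passed to `vep_batch` in turn, with a short pause between requests to stay under the rate limit.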

Key Parameters

| Parameter | Module | Default | Range / Options | Effect |
|---|---|---|---|---|
| `expand` | Lookup | `0` | `0` or `1` | Include nested transcripts/translations |
| `type` | Sequence | `"genomic"` | `"genomic"`, `"cdna"`, `"cds"`, `"protein"` | Sequence type to return |
| `target_species` | Homology | `None` | Species name or taxon ID | Filter homologs to target species |
| `feature` | Overlap | required | `"gene"`, `"transcript"`, `"regulatory"`, `"variation"` | Feature type to retrieve |
| `coord_system_version` | Region | `"GRCh38"` | `"GRCh38"`, `"GRCh37"` | Genome assembly |
| `content_type` | All | via header | `"application/json"`, `"text/plain"` | Response format |

Best Practices

Use batch endpoints: POST /lookup/symbol/{species} accepts up to 1000 symbols per request, and POST /vep/human/hgvs takes smaller batches (200 per request, as in Workflow 2); single-ID GET requests in a loop will hit rate limits quickly.
Pin assembly version: For region-based queries always specify coord_system_version=GRCh38 (or use grch37.rest.ensembl.org for legacy coordinates) to avoid silent mismatch errors.
Cache responses: Gene metadata rarely changes between Ensembl releases; cache results to disk (joblib.Memory) to avoid redundant API calls during development.
```
from joblib import Memory
mem = Memory("cache/", verbose=0)
cached_lookup = mem.cache(batch_lookup)
```
Use expand=0 for metadata: When you only need gene coordinates and biotype (not transcript details), keep expand=0 for smaller payloads and faster responses.
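Check canonical flag in VEP: VEP returns consequences for all overlapping transcripts. Request the flag by passing canonical=1, then filter on tc.get("canonical") == 1 to keep one representative consequence per gene; without that parameter the flag is absent and the filter matches nothing.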
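
The canonical filter can be written as a small pure function that falls back to all consequences when nothing is flagged, so a variant is never dropped silently. This is a sketch under stated assumptions: `canonical_consequences` is a hypothetical helper, and it expects VEP-style records that carry a `canonical` key only when the flag was requested.

```python
def canonical_consequences(annotation):
    """Return canonical transcript consequences, or all of them if none are flagged."""
    all_tcs = annotation.get("transcript_consequences", [])
    flagged = [tc for tc in all_tcs if tc.get("canonical") == 1]
    return flagged or all_tcs

# Hand-written record for illustration (not real VEP output)
record = {
    "input": "example",
    "transcript_consequences": [
        {"transcript_id": "T1", "impact": "MODERATE", "canonical": 1},
        {"transcript_id": "T2", "impact": "MODIFIER"},
    ],
}
print([tc["transcript_id"] for tc in canonical_consequences(record)])
```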

Common Recipes

Recipe: Symbol → Ensembl ID Mapping Table

When to use: Build a lookup table from gene symbols to Ensembl IDs for downstream analysis.

import requests, json, pandas as pd

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

symbols = ["EGFR", "KRAS", "BRAF", "PIK3CA", "PTEN", "AKT1", "MYC", "RB1"]
r = requests.post(
    f"{BASE}/lookup/symbol/homo_sapiens",
    headers=HEADERS,
    data=json.dumps({"symbols": symbols})
)
data = r.json()
rows = [{"symbol": s, "ensembl_id": d["id"] if d else None,
         "chrom": d["seq_region_name"] if d else None} for s, d in data.items()]
df = pd.DataFrame(rows)
df.to_csv("symbol_to_ensembl.csv", index=False)
print(df.to_string(index=False))

Recipe: Region Gene Overlap

When to use: Find all genes overlapping a genomic interval (e.g., a GWAS locus).

import requests, pandas as pd

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

chrom, start, end = "17", 43044295, 43125364
r = requests.get(
    f"{BASE}/overlap/region/human/{chrom}:{start}-{end}",
    headers=HEADERS,
    params={"feature": "gene", "biotype": "protein_coding"}
)
genes = r.json()
df = pd.DataFrame([{
    "id": g["id"], "name": g.get("external_name"),
    "start": g["start"], "end": g["end"], "strand": g["strand"]
} for g in genes])
print(df.to_string(index=False))
print(f"\n{len(df)} protein-coding genes in region")

Recipe: Species List

When to use: Check which species are available in Ensembl before querying.

import requests

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

r = requests.get(f"{BASE}/info/species", headers=HEADERS)
species_list = r.json()["species"]
print(f"Total species: {len(species_list)}")
vertebrates = [s for s in species_list if s.get("division") == "EnsemblVertebrates"]
print(f"Vertebrates: {len(vertebrates)}")
for s in vertebrates[:5]:
    print(f"  {s['common_name']} ({s['name']}): {s['assembly']}")

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| `HTTP 429 Too Many Requests` | Exceeding ~15 req/s rate limit | Add `time.sleep(0.1)` between requests; use batch POST endpoints |
| `HTTP 400 Bad Request` on VEP | Malformed HGVS notation | Verify format: `chr:g.posREF>ALT` (e.g., `17:g.43094692C>T`) |
| `Gene not found` | Gene symbol not in Ensembl | Try alternative symbol; check species name (use `homo_sapiens` not `human` for symbols) |
| Region query returns wrong genes | Assembly mismatch | Set `coord_system_version=GRCh38` or use `grch37.rest.ensembl.org` |
| Old ID not resolving | Retired Ensembl ID | Query `GET /archive/id/{id}` to get current mapping |
| `HTTP 503 Service Unavailable` | Server maintenance | Retry after a few minutes; check Ensembl status at status.ensembl.org |

Related Skills

gget-genomic-databases — CLI/Python wrapper covering Ensembl + 20 other databases; use for quick lookups without raw API code
biopython-molecular-biology — Biopython's Entrez module for NCBI databases (alternative for RefSeq/GenBank queries)
kegg-database — Pathway/metabolic annotations for the same gene set
reactome-database — Pathway enrichment and hierarchy queries

References

Ensembl REST API documentation — Interactive API explorer and endpoint reference
Ensembl Help & Documentation — REST API overview
Ensembl stable IDs guide — ID versioning policy
VEP documentation — Variant Effect Predictor full reference
