Query logs from Grafana Loki for Lodestar beacon nodes, execution clients, and validators. Use when investigating node crashes, errors, sync issues, or any log-level debugging. Covers CL (beacon), EL (execution), validator, and infrastructure logs. Requires Grafana access to the Lodestar monitoring stack.
Resources
1Install
npx skillscat add lodekeeper/dotfiles/grafana-loki Install via the SkillsCat registry.
SKILL.md
Grafana Loki Log Querying
Query structured logs from the Lodestar infrastructure via Grafana's Loki datasource.
When to Use
- Investigating node crashes, restarts, or OOM kills
- Debugging sync issues (finalized sync, head sync, range sync)
- Checking EL ↔ CL communication errors (engine API, newPayload, forkchoiceUpdated)
- Finding error patterns across multiple nodes
- Investigating network/gossip issues
- Checking validator performance issues (missed duties, attestation errors)
Quick Start
- Read
references/loki-queries.md— contains all LogQL queries organized by use case - Use
curlwith the Grafana Loki proxy to run queries - Parse results with
jqfor readable output
Connection Info
- Grafana:
https://grafana-lodestar.chainsafe.io - Loki datasource ID: 4
- API endpoint:
GET /api/datasources/proxy/4/loki/api/v1/query_range - Auth: Bearer token (same Grafana service account as Prometheus)
Query Patterns
# Basic log query
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
"$GRAFANA_URL/api/datasources/proxy/4/loki/api/v1/query_range" \
--data-urlencode 'query={instance="<INSTANCE>",job="<JOB>"} |~ "<PATTERN>"' \
--data-urlencode "start=$(date -d '<TIME_AGO>' +%s)" \
--data-urlencode "end=$(date +%s)" \
--data-urlencode "limit=<N>" \
--data-urlencode "direction=backward" | jq -r '.data.result[].values[][1]'Key Parameters
| Parameter | Description | Example |
|---|---|---|
query |
LogQL query string | {instance="unstable-super"} |~ "error" |
start |
Unix timestamp (start of range) | $(date -d '6 hours ago' +%s) |
end |
Unix timestamp (end of range) | $(date +%s) |
limit |
Max log lines to return | 50 |
direction |
forward (oldest first) or backward (newest first) |
backward |
Label Schema
Jobs (log sources)
| Job | Description | Key instances |
|---|---|---|
beacon |
Lodestar CL beacon node | {group}-{type} e.g. unstable-super |
execution |
EL client (Geth, Nethermind, etc.) | Same pattern, may differ |
validator |
Lodestar validator client | Per-node |
checkpointz |
Checkpoint sync server | |
cl_bootnode |
CL bootnode | |
commit-boost |
MEV commit-boost | |
validator-ejector |
Validator ejector service | |
validator-monitor |
Validator monitoring |
Common Labels
| Label | Values | Notes |
|---|---|---|
instance |
unstable-super, beta-sas, etc. |
Primary node identifier |
group |
unstable, beta, stable, feat1–feat4, chiado, gnosis, etc. |
Node group |
job |
See above | Log source type |
network |
holesky, mainnet, gnosis, chiado |
Network |
container |
beacon, execution, etc. |
Docker container |
logstream |
stdout, stderr |
Log stream |
Instance Naming Convention
- Testnet:
{group}-{type}→unstable-super,beta-sas,stable-semi - Mainnet:
{group}-mainnet-{type}→unstable-mainnet-super - Node types:
solo(4-8 custody),semi(64),super(128),sas(128+validator+EL),arm64
LogQL Syntax Reference
Stream Selectors
{instance="unstable-super"} # exact match
{instance=~"unstable-.*"} # regex match
{group="unstable",job="beacon"} # multiple labels
{instance!="unstable-solo"} # negative matchLine Filters
|= "error" # contains (case-sensitive)
|~ "(?i)error" # regex (case-insensitive)
!= "debug" # does not contain
!~ "verbose|debug" # regex exclusionPipeline Stages
| line_format "{{.instance}}: {{__line__}}" # format output
| json # parse JSON logs
| logfmt # parse logfmt
| label_format level=`{{ToUpper .level}}` # transform labelsMetric Queries (Log-based metrics)
# Count errors per minute
count_over_time({instance="unstable-super"} |= "error" [5m])
# Rate of specific events
rate({instance="unstable-super"} |~ "BLOCK_ERROR" [1h])Investigation Playbook
1. Node Crash Investigation
# Step 1: Check restart history via Prometheus
# (See release-metrics skill for process_start_time_seconds)
# Step 2: Find fatal/exit messages
{instance="<NODE>"} |~ "(?i)(fatal|exit|kill|SIGTERM|SIGKILL|OOM|heap out|JavaScript heap|Allocation failed)"
# Step 3: Get last logs before a specific restart time
{instance="<NODE>"} | direction=backward
# Set end= to the restart timestamp, limit=50
# Step 4: Check for warn/error level logs
{instance="<NODE>",job="beacon"} |~ "warn |error "
# Step 5: Check EL errors if block processing issues
{instance="<NODE>",job="execution"} |~ "(?i)(INVALID|rejected|error|failed)"
# Step 6: Check CL↔EL engine API communication
{instance="<NODE>",job="beacon"} |~ "(?i)(engine|newPayload|forkchoiceUpdate|EXECUTION_ERROR)"2. Sync Issue Investigation
# Finalized sync errors
{instance="<NODE>"} |~ "Batch process error|Batch download error|SyncChain Error"
# Sync progress
{instance="<NODE>"} |~ "Sync peer joined|sync.*startEpoch|targetSlot"
# EL sync status
{instance="<NODE>"} |~ "Execution client is syncing|execution.*SYNCED|SYNCING"
# Block import failures
{instance="<NODE>"} |~ "BLOCK_ERROR|Block error|Block verification"3. Network Issue Investigation
# Req/resp errors
{instance="<NODE>"} |~ "Req error|Resp error|REQUEST_ERROR"
# Peer connection issues
{instance="<NODE>"} |~ "DIAL_ERROR|ECONNREFUSED|connection.*close|unexpected end"
# Gossip issues
{instance="<NODE>"} |~ "gossip.*error|gossip.*dropped|gossip.*invalid"4. Validator Performance Investigation
# Missed duties
{instance="<NODE>",job="validator"} |~ "(?i)(miss|skip|late|timeout)"
# Attestation issues
{instance="<NODE>"} |~ "attestation.*error|attestation.*fail"
# Block proposal issues
{instance="<NODE>"} |~ "proposal.*error|propose.*fail|produceBlock"5. Cross-Node Comparison
# Compare errors across a group
{group="unstable",job="execution"} |~ "(?i)(error|INVALID)" | line_format "{{.instance}}: {{__line__}}"
# Find which nodes have a specific error
{group="unstable"} |~ "SPECIFIC_ERROR_MESSAGE" | line_format "{{.instance}}"Tips
- Direction matters: Use
backwardto see the most recent logs first (default for investigation) - Time ranges: Keep queries narrow (30min–6h) for fast results; Loki can be slow on wide ranges
- Limit wisely: Start with
limit=30-50; increase only if needed - Label filtering first: Always filter by
instanceand/orgroupbefore applying line filters - EL instance names may differ: Some EL instances use host-level names (e.g.,
hetzner-gnosis-prod-bn-rescue-0), not the same as CL instance names. Usegrouplabel to match. - No EL logs? Not all nodes have EL log collection configured. Check available instances first.
- ANSI codes: Loki logs contain ANSI color codes (e.g.,
[36m,[39m). These are cosmetic; ignore them. - Lodestar log levels in brackets:
[info],[debug],[verbose],[warn],[error]— filter with|~ "warn |error "(trailing space important to avoid false matches).
References
references/loki-queries.md— Ready-to-use LogQL queries for common scenarios- Related:
../release-metrics/— Prometheus metrics (complements log analysis)
Self-Maintenance
If any commands, file paths, URLs, or configurations in this skill are outdated or no longer work, update this SKILL.md with the correct information after completing your current task. Skills should stay accurate and self-healing — fix what you find broken.