Create and manage production Grafana dashboards with multiple data sources (Prometheus, InfluxDB, Elasticsearch, CloudWatch, Loki, Tempo) for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.
Install
Install via the SkillsCat registry:
npx skillscat add tringo0108/z-command/grafana-dashboards
Grafana Dashboards
Create and manage production-ready Grafana dashboards with multi-source observability for comprehensive system monitoring.
Purpose
Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics across multiple data sources with proper correlations and performance optimization.
When to Use
- Visualize metrics from multiple data sources
- Create custom multi-source dashboards
- Implement SLO dashboards with traces and logs
- Monitor infrastructure with correlated views
- Track business KPIs across systems
- Build unified observability dashboards
Supported Data Sources
| Data Source | Purpose | Query Language |
|---|---|---|
| Prometheus | Metrics collection | PromQL |
| InfluxDB | Time series metrics | InfluxQL / Flux |
| Elasticsearch | Logs and search | Lucene |
| CloudWatch | AWS metrics/logs | CloudWatch Syntax |
| Loki | Log aggregation | LogQL |
| Tempo | Distributed tracing | TraceQL |
| PostgreSQL | Business data | SQL |
| MySQL | Business data | SQL |
Dashboard Design Principles
1. Hierarchy of Information
┌─────────────────────────────────────┐
│ Critical Metrics (Big Numbers) │ ← SLIs, Error rates
├─────────────────────────────────────┤
│ Key Trends (Time Series) │ ← Request rates, latency
├─────────────────────────────────────┤
│ Detailed Metrics (Tables/Heatmaps) │ ← Per-service breakdown
├─────────────────────────────────────┤
│ Correlated Views (Logs/Traces) │ ← Debug information
└─────────────────────────────────────┘
2. RED Method (Services)
- Rate - Requests per second
- Errors - Error rate
- Duration - Latency/response time
3. USE Method (Resources)
- Utilization - % time resource is busy
- Saturation - Queue length/wait time
- Errors - Error count
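The RED signals map directly onto PromQL. The following is a minimal Python sketch, assuming the `http_requests_total` and `http_request_duration_seconds_bucket` metrics used elsewhere in this document and a `service` label; `red_queries` is a hypothetical helper, not part of Grafana.

```python
def red_queries(service: str, window: str = "5m") -> dict:
    """Return PromQL strings for Rate, Errors, and Duration of one service."""
    selector = f'service="{service}"'
    return {
        # Rate: requests per second
        "rate": f"sum(rate(http_requests_total{{{selector}}}[{window}]))",
        # Errors: fraction of requests that returned a 5xx status
        "errors": (
            f'sum(rate(http_requests_total{{{selector}, status=~"5.."}}[{window}]))'
            f" / sum(rate(http_requests_total{{{selector}}}[{window}]))"
        ),
        # Duration: 95th percentile latency from histogram buckets
        "duration_p95": (
            "histogram_quantile(0.95, "
            f"sum(rate(http_request_duration_seconds_bucket{{{selector}}}[{window}])) by (le))"
        ),
    }


if __name__ == "__main__":
    for name, expr in red_queries("checkout").items():
        print(f"{name}: {expr}")
```

Each returned string can be pasted into the `expr` field of a Prometheus panel target.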
Multi-Source Dashboards
Using Mixed Data Source
The Mixed data source enables combining queries from different sources in a single panel:
{
"title": "Service Health Overview",
"type": "timeseries",
"datasource": {
"type": "mixed",
"uid": "-- Mixed --"
},
"targets": [
{
"datasource": { "type": "prometheus" },
"expr": "sum(rate(http_requests_total{service=\"$service\"}[5m]))",
"refId": "A",
"legendFormat": "Requests/s (Prometheus)"
},
{
"datasource": { "type": "cloudwatch" },
"namespace": "AWS/ApplicationELB",
"metricName": "RequestCount",
"dimensions": { "LoadBalancer": "$loadbalancer" },
"refId": "B",
"legendFormat": "ALB Requests (CloudWatch)"
},
{
"datasource": { "type": "influxdb" },
"query": "from(bucket: \"metrics\") |> range(start: v.timeRangeStart, stop: v.timeRangeStop) |> filter(fn: (r) => r._measurement == \"requests\")",
"refId": "C",
"legendFormat": "Requests (InfluxDB)"
}
]
}
Cross-Source Correlations
Correlate metrics, logs, and traces using shared dimensions:
{
"panels": [
{
"title": "Metrics: Error Rate",
"type": "timeseries",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\", service=\"$service\"}[5m]))",
"legendFormat": "5xx Errors"
}
],
"gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
},
{
"title": "Logs: Error Messages",
"type": "logs",
"datasource": "Loki",
"targets": [
{
"expr": "{service=\"$service\"} |= \"error\" | json | level=\"error\"",
"refId": "A"
}
],
"gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }
},
{
"title": "Traces: Failed Requests",
"type": "traces",
"datasource": "Tempo",
"targets": [
{
"query": "{ status = error && resource.service.name = \"$service\" }",
"refId": "A"
}
],
"gridPos": { "x": 0, "y": 8, "w": 24, "h": 8 }
}
]
}
Data Source Examples
Prometheus Queries
# Request rate per service
sum(rate(http_requests_total[5m])) by (service)
# Error percentage
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
# P95 Latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
# CPU usage per pod (in cores)
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))
Loki LogQL Queries
# Error logs for a service
{namespace="$namespace", app="$service"} |= "error"
# Parse JSON logs and filter by level
{app="$service"} | json | level="error" | line_format "{{.message}}"
# Count errors over time
sum(rate({app="$service"} |= "error" [5m]))
# Top 10 error messages by count
topk(10, sum by (error) (count_over_time({app="$service"} | json | level="error" [5m])))
Tempo TraceQL Queries
# Traces with errors
{ status = error }
# Traces for specific service with high latency
{ resource.service.name = "$service" && duration > 500ms }
# Find traces by span name
{ name = "HTTP GET /api/users" }
# Traces with specific attributes
{ span.http.status_code = 500 }
InfluxDB Flux Queries
// Request metrics for the selected service, aggregated per panel interval
from(bucket: "metrics")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "http_requests")
  |> filter(fn: (r) => r.service == "${service}")
  |> aggregateWindow(every: v.windowPeriod, fn: mean)
// Join multiple measurements
metrics = from(bucket: "metrics")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu")
errors = from(bucket: "metrics")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "errors")
join(tables: {m: metrics, e: errors}, on: ["host"])
CloudWatch Queries
{
"datasource": "CloudWatch",
"namespace": "AWS/EC2",
"metricName": "CPUUtilization",
"dimensions": {
"InstanceId": ["$instance"]
},
"statistics": ["Average"],
"period": "300"
}
Elasticsearch Queries
{
"datasource": "Elasticsearch",
"query": "level:error AND service:$service",
"timeField": "@timestamp",
"metrics": [{ "type": "count", "id": "1" }],
"bucketAggs": [
{
"type": "date_histogram",
"field": "@timestamp",
"id": "2",
"settings": { "interval": "auto" }
}
]
}
Variables and Templating
Data Source Variable
Allow users to switch between data sources dynamically:
{
"templating": {
"list": [
{
"name": "datasource",
"type": "datasource",
"query": "prometheus",
"multi": false,
"label": "Data Source"
}
]
}
}
Query Variables Across Sources
{
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(kube_pod_info, namespace)",
"refresh": 1,
"multi": false
},
{
"name": "service",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
"refresh": 2,
"multi": true
},
{
"name": "log_stream",
"type": "query",
"datasource": "Loki",
"query": "label_values({namespace=\"$namespace\"}, app)",
"refresh": 2
},
{
"name": "aws_region",
"type": "query",
"datasource": "CloudWatch",
"query": "regions()",
"refresh": 1
}
]
}
}
Ad-Hoc Filters
Enable dynamic filtering across all panels:
{
"name": "Filters",
"type": "adhoc",
"datasource": "Prometheus"
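}
Usage in queries: no reference is needed. Grafana applies ad hoc filters automatically to every query that uses the variable's data source, so the query stays as written:
sum(rate(http_requests_total[5m]))
Panel Types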
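Whatever its type, a panel is positioned on the dashboard through `gridPos`, as in the Cross-Source Correlations example. The following is a minimal Python sketch for generating those positions, assuming Grafana's 24-column grid; `layout` is a hypothetical helper and the default width and height simply mirror the values used earlier in this document.

```python
GRID_COLUMNS = 24  # Grafana's dashboard grid is 24 columns wide


def layout(panels, width=12, height=8):
    """Return copies of the panels with gridPos set, filling rows left to right."""
    if not 0 < width <= GRID_COLUMNS:
        raise ValueError("width must be between 1 and 24")
    per_row = GRID_COLUMNS // width
    placed = []
    for index, panel in enumerate(panels):
        row, col = divmod(index, per_row)
        grid_pos = {"x": col * width, "y": row * height, "w": width, "h": height}
        placed.append({**panel, "gridPos": grid_pos})
    return placed


if __name__ == "__main__":
    for panel in layout([{"title": "Rate"}, {"title": "Errors"}, {"title": "Latency"}]):
        print(panel["title"], panel["gridPos"])
```

With the defaults, two panels share each row and the third wraps to `y: 8`.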
1. Stat Panel (Single Value)
{
"type": "stat",
"title": "Total Requests",
"targets": [
{
"expr": "sum(http_requests_total)"
}
],
"options": {
"reduceOptions": {
"values": false,
"calcs": ["lastNotNull"]
},
"orientation": "auto",
"textMode": "auto",
"colorMode": "value"
},
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{ "value": 0, "color": "green" },
{ "value": 80, "color": "yellow" },
{ "value": 90, "color": "red" }
]
}
}
}
}
2. Time Series Graph
{
"type": "timeseries",
"title": "CPU Usage",
"targets": [
{
"expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100
}
}
}
3. Logs Panel
{
"type": "logs",
"title": "Application Logs",
"datasource": "Loki",
"targets": [
{
"expr": "{namespace=\"$namespace\", app=\"$service\"}"
}
],
"options": {
"showTime": true,
"showLabels": true,
"wrapLogMessage": true,
"enableLogDetails": true
}
}
4. Traces Panel
{
"type": "traces",
"title": "Distributed Traces",
"datasource": "Tempo",
"targets": [
{
"query": "{ resource.service.name = \"$service\" }"
}
]
}
5. Table with Transformations
{
"type": "table",
"title": "Service Status (Multi-Source)",
"datasource": { "type": "mixed" },
"targets": [
{
"datasource": "Prometheus",
"expr": "up{job=~\"$service\"}",
"format": "table",
"instant": true,
"refId": "A"
},
{
"datasource": "CloudWatch",
"namespace": "AWS/ECS",
"metricName": "CPUUtilization",
"refId": "B"
}
],
"transformations": [
{
"id": "merge"
},
{
"id": "organize",
"options": {
"excludeByName": { "Time": true },
"renameByName": {
"instance": "Instance",
"Value #A": "Status",
"Value #B": "CPU %"
}
}
}
]
}
6. Heatmap
{
"type": "heatmap",
"title": "Latency Distribution",
"targets": [
{
"expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
"format": "heatmap"
}
],
"options": {
"yAxis": { "unit": "s" },
"color": { "scheme": "Turbo" }
}
}
Transformations for Multi-Source Data
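Transformations run inside Grafana after each data source has responded, so they can align results that the sources themselves cannot join. As a rough mental model only, the Python sketch below mimics an outer join on a shared field; `outer_join` and the sample rows are hypothetical, and Grafana's own implementation works on data frames rather than lists of dictionaries.

```python
def outer_join(by, frame_a, frame_b):
    """Outer-join two lists of row dicts on the field named `by`."""
    rows = {}
    for frame in (frame_a, frame_b):
        for row in frame:
            # Rows with the same key are merged; unmatched keys are kept
            rows.setdefault(row[by], {}).update(row)
    # Collect column names in first-seen order
    columns = []
    for row in rows.values():
        for name in row:
            if name not in columns:
                columns.append(name)
    # Fill columns a row does not have with None, like empty table cells
    return [{name: row.get(name) for name in columns} for row in rows.values()]


if __name__ == "__main__":
    prometheus = [{"instance": "web-1", "up": 1}, {"instance": "web-2", "up": 0}]
    cloudwatch = [{"instance": "web-2", "cpu": 71.5}, {"instance": "web-3", "cpu": 12.0}]
    for row in outer_join("instance", prometheus, cloudwatch):
        print(row)
```

Rows present in only one source survive with empty cells, which is the difference between an outer and an inner join.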
Join by Field
Join data from different sources by common field:
{
"transformations": [
{
"id": "joinByField",
"options": {
"byField": "instance",
"mode": "outer"
}
}
]
}
Merge
Combine all series into single frame:
{
"transformations": [
{
"id": "merge"
}
]
}
Rename and Organize
{
"transformations": [
{
"id": "organize",
"options": {
"renameByName": {
"Value #A": "Prometheus Requests",
"Value #B": "CloudWatch Requests"
},
"excludeByName": {
"__name__": true
}
}
}
]
}
Dashboard Provisioning
datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    uid: loki # referenced by tracesToLogs.datasourceUid in the Tempo entry
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: "traceID=(\\w+)"
          name: TraceID
          url: "$${__value.raw}"
  - name: Tempo
    type: tempo
    uid: tempo # referenced by derivedFields.datasourceUid in the Loki entry
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        tags: ["service.name", "pod"]
  - name: InfluxDB
    type: influxdb
    url: http://influxdb:8086
    database: metrics
    jsonData:
      version: Flux
      organization: myorg
      defaultBucket: metrics
    secureJsonData:
      token: ${INFLUXDB_TOKEN}
  - name: CloudWatch
    type: cloudwatch
    jsonData:
      authType: default
      defaultRegion: us-west-2
  - name: Elasticsearch
    type: elasticsearch
    url: http://elasticsearch:9200
    database: "logs-*"
    jsonData:
      esVersion: "8.0.0"
      timeField: "@timestamp"
dashboards.yml
apiVersion: 1
providers:
  - name: "default"
    orgId: 1
    folder: "Production"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards
Alerts with Multi-Source Context
{
"alert": {
"name": "High Error Rate with Log Context",
"conditions": [
{
"evaluator": { "params": [5], "type": "gt" },
"operator": { "type": "and" },
"query": { "params": ["A", "5m", "now"] },
"reducer": { "type": "avg" },
"type": "query"
}
],
"executionErrorState": "alerting",
"for": "5m",
"frequency": "1m",
"message": "Error rate is above 5%\n\nCheck logs: ${__dashboard_url__}?var-service=$service&tab=logs\nTraces: ${__dashboard_url__}?var-service=$service&tab=traces",
"noDataState": "no_data",
"notifications": [{ "uid": "slack-oncall" }]
}
}
Performance Best Practices
1. Query Optimization
| Data Source | Optimization |
|---|---|
| Prometheus | Use recording rules for complex queries |
| Loki | Add stream selectors before filter expressions |
| Tempo | Limit by time range and attributes |
| InfluxDB | Push aggregations down to query level |
| CloudWatch | Use appropriate period (min 60s) |
| Elasticsearch | Use index patterns and time filters |
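To illustrate the Loki row: the stream selector narrows which log streams are read, and line filters then run only on those streams. The following is a minimal Python sketch that always emits the selector first; `logql` and `_quote` are hypothetical helpers, and the label names are taken from the examples in this document.

```python
from typing import Dict, List, Optional


def _quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def logql(labels: Dict[str, str], contains: Optional[List[str]] = None) -> str:
    """Build a LogQL query: stream selector first, then line filters."""
    if not labels:
        raise ValueError("at least one stream label is required")
    selector = ", ".join(f"{key}={_quote(value)}" for key, value in sorted(labels.items()))
    query = "{" + selector + "}"
    for needle in contains or []:
        query += f" |= {_quote(needle)}"
    return query


if __name__ == "__main__":
    print(logql({"namespace": "prod", "app": "api"}, ["error"]))
```

Refusing an empty label set is a deliberate choice here: a query without a stream selector would have to scan every stream.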
2. Dashboard Settings
{
"refresh": "30s",
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": ["5s", "10s", "30s", "1m", "5m"],
"time_options": ["5m", "15m", "1h", "6h", "12h", "24h", "7d"]
}
}
3. Reduce Panel Load
- Use $__interval for automatic resolution
- Set reasonable maxDataPoints
- Use instant queries for non-time-series data
- Lazy load panels (only visible in viewport)
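The first two items interact: the query interval is derived from the time range divided by the maximum number of data points, bounded below by a minimum interval. The Python sketch below approximates that relationship; the step table, the 15-second default minimum, and the round-up rule are assumptions for illustration and do not reproduce Grafana's exact rounding.

```python
# Candidate step sizes in seconds (an assumed table, for illustration only)
NICE_STEPS = [1, 2, 5, 10, 15, 30, 60, 120, 300, 600, 900, 1800,
              3600, 7200, 21600, 43200, 86400]


def approx_interval(range_seconds: int, max_data_points: int, min_interval: int = 15) -> int:
    """Approximate a query step: time range / max points, never below min_interval."""
    if max_data_points <= 0:
        raise ValueError("max_data_points must be positive")
    target = max(range_seconds / max_data_points, min_interval)
    for step in NICE_STEPS:
        if step >= target:
            return step
    return NICE_STEPS[-1]


if __name__ == "__main__":
    print(approx_interval(6 * 3600, 1000))       # six-hour range
    print(approx_interval(7 * 24 * 3600, 500))   # seven-day range
```

Lowering `maxDataPoints` widens the step, so each panel requests fewer samples over the same time range.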
Dashboard Organization
Folder Structure
Production/
├── Infrastructure/
│ ├── Node Overview
│ ├── Kubernetes Cluster
│ └── Network
├── Applications/
│ ├── API Gateway
│ ├── User Service
│ └── Payment Service
├── Databases/
│ ├── PostgreSQL
│ ├── Redis
│ └── Elasticsearch
└── Business/
├── Revenue Dashboard
    └── User Analytics
Tagging Convention
- Environment: production, staging, development
- Team: platform, backend, frontend
- Service: api, database, cache
- Source: prometheus, loki, cloudwatch
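A convention like this is easiest to keep when it is checked automatically. The following is a minimal Python sketch, assuming dashboards are available as parsed JSON with a top-level `tags` list; `TAG_GROUPS` copies the values listed above and `missing_tag_groups` is a hypothetical helper.

```python
# Allowed tags per group, copied from the tagging convention
TAG_GROUPS = {
    "environment": {"production", "staging", "development"},
    "team": {"platform", "backend", "frontend"},
    "service": {"api", "database", "cache"},
    "source": {"prometheus", "loki", "cloudwatch"},
}


def missing_tag_groups(dashboard: dict) -> list:
    """Return the tag groups for which the dashboard carries no tag."""
    tags = set(dashboard.get("tags", []))
    return [group for group, allowed in TAG_GROUPS.items() if not tags & allowed]


if __name__ == "__main__":
    dashboard = {"title": "API Gateway", "tags": ["production", "backend", "api"]}
    print(missing_tag_groups(dashboard))
```

A check like this could run in CI against the provisioned dashboard files before they are deployed.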
Common Pitfalls
| Pitfall | Problem | Solution |
|---|---|---|
| Too many panels | Slow load times | Focus on key metrics |
| Mixed source overload | Query conflicts | Use transformations to align |
| Missing time alignment | Mismatched data | Use consistent time filters |
| High cardinality queries | Memory issues | Filter early, aggregate |
| No variable cascading | Stale filters | Chain variables with refresh |
Best Practices Summary
- Use Mixed data source for multi-source panels
- Correlate with shared dimensions (service, pod, instance)
- Chain variables from coarse to fine (region → cluster → service)
- Optimize queries at source (filters first, aggregations early)
- Use transformations for joining and organizing data
- Set appropriate refresh rates (not too frequent)
- Version control dashboards as JSON/code
- Link traces to logs via derived fields
- Add context to alerts with dashboard URLs
- Organize with folders and tags
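For version control, exported dashboard JSON diffs more cleanly when fields that change on every save are removed and keys are written in a stable order. The following is a minimal Python sketch, assuming the top-level `id`, `version`, and `iteration` fields are the volatile ones in your exports; `normalize_dashboard` is a hypothetical helper.

```python
import json

# Top-level fields assumed to change between saves or between instances
VOLATILE_KEYS = ("id", "version", "iteration")


def normalize_dashboard(raw: str) -> str:
    """Return dashboard JSON with volatile fields removed and keys sorted."""
    dashboard = json.loads(raw)
    for key in VOLATILE_KEYS:
        dashboard.pop(key, None)
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    exported = '{"version": 7, "id": 12, "title": "API", "uid": "abc"}'
    print(normalize_dashboard(exported), end="")
```

Only top-level keys are removed, so panel `id` values inside `panels` are left untouched.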