grafana-dashboards

Create and manage production Grafana dashboards with multiple data sources (Prometheus, InfluxDB, Elasticsearch, CloudWatch, Loki, Tempo) for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.

TriNgo0108 0 Updated 4mo ago

Install

npx skillscat add tringo0108/z-command/grafana-dashboards

Install via the SkillsCat registry.

SKILL.md

Grafana Dashboards

Create and manage production-ready Grafana dashboards with multi-source observability for comprehensive system monitoring.

Purpose

Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics across multiple data sources with proper correlations and performance optimization.

When to Use

Visualize metrics from multiple data sources
Create custom multi-source dashboards
Implement SLO dashboards with traces and logs
Monitor infrastructure with correlated views
Track business KPIs across systems
Build unified observability dashboards

Supported Data Sources

Data Source	Purpose	Query Language
Prometheus	Metrics collection	PromQL
InfluxDB	Time series metrics	InfluxQL / Flux
Elasticsearch	Logs and search	Lucene query syntax
CloudWatch	AWS metrics/logs	Metric math / Logs Insights
Loki	Log aggregation	LogQL
Tempo	Distributed tracing	TraceQL
PostgreSQL	Business data	SQL
MySQL	Business data	SQL

Dashboard Design Principles

1. Hierarchy of Information

┌─────────────────────────────────────┐
│  Critical Metrics (Big Numbers)     │  ← SLIs, Error rates
├─────────────────────────────────────┤
│  Key Trends (Time Series)           │  ← Request rates, latency
├─────────────────────────────────────┤
│  Detailed Metrics (Tables/Heatmaps) │  ← Per-service breakdown
├─────────────────────────────────────┤
│  Correlated Views (Logs/Traces)     │  ← Debug information
└─────────────────────────────────────┘

2. RED Method (Services)

Rate - Requests per second
Errors - Error rate
Duration - Latency/response time
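
As a minimal sketch, the three RED signals can be generated per service as PromQL strings. The metric names come from the query examples elsewhere in this document; the helper name `red_queries`, the `service` and `status` labels, and the 0.95 quantile are assumptions about how the application is instrumented.

```python
def red_queries(service: str, window: str = "5m") -> dict[str, str]:
    """Build PromQL expressions for the RED method for one service."""
    selector = f'service="{service}"'
    return {
        # Rate: requests per second
        "rate": f"sum(rate(http_requests_total{{{selector}}}[{window}]))",
        # Errors: share of requests that returned a 5xx status
        "errors": (
            f'sum(rate(http_requests_total{{{selector}, status=~"5.."}}[{window}]))'
            f" / sum(rate(http_requests_total{{{selector}}}[{window}]))"
        ),
        # Duration: 95th percentile latency from histogram buckets
        "duration": (
            "histogram_quantile(0.95, "
            f"sum(rate(http_request_duration_seconds_bucket{{{selector}}}[{window}])) by (le))"
        ),
    }


queries = red_queries("api")
```

Each value can be pasted into the `expr` field of a Prometheus target.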

3. USE Method (Resources)

Utilization - % time resource is busy
Saturation - Queue length/wait time
Errors - Error count

Multi-Source Dashboards

Using Mixed Data Source

The Mixed data source enables combining queries from different sources in a single panel:

{
  "title": "Service Health Overview",
  "type": "timeseries",
  "datasource": {
    "type": "mixed",
    "uid": "-- Mixed --"
  },
  "targets": [
    {
      "datasource": { "type": "prometheus" },
      "expr": "sum(rate(http_requests_total{service=\"$service\"}[5m]))",
      "refId": "A",
      "legendFormat": "Requests/s (Prometheus)"
    },
    {
      "datasource": { "type": "cloudwatch" },
      "namespace": "AWS/ApplicationELB",
      "metricName": "RequestCount",
      "dimensions": { "LoadBalancer": "$loadbalancer" },
      "refId": "B",
      "legendFormat": "ALB Requests (CloudWatch)"
    },
    {
      "datasource": { "type": "influxdb" },
      "query": "from(bucket: \"metrics\") |> range(start: v.timeRangeStart, stop: v.timeRangeStop) |> filter(fn: (r) => r._measurement == \"requests\")",
      "refId": "C",
      "legendFormat": "Requests (InfluxDB)"
    }
  ]
}
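
When panels like this are generated rather than written by hand, a small helper can keep the `refId` values unique. The following is a sketch only: `mixed_panel` is a hypothetical helper, and the targets are trimmed versions of the ones shown above.

```python
import json
import string


def mixed_panel(title: str, targets: list[dict]) -> dict:
    """Return a time series panel that uses the Mixed data source.

    Each target keeps its own "datasource" entry; refIds are assigned
    in order (A, B, C, ...) so they stay unique within the panel.
    """
    if len(targets) > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 targets")
    numbered = [
        {**target, "refId": string.ascii_uppercase[i]}
        for i, target in enumerate(targets)
    ]
    return {
        "title": title,
        "type": "timeseries",
        "datasource": {"type": "mixed", "uid": "-- Mixed --"},
        "targets": numbered,
    }


panel = mixed_panel(
    "Service Health Overview",
    [
        {"datasource": {"type": "prometheus"}, "expr": "sum(rate(http_requests_total[5m]))"},
        {"datasource": {"type": "cloudwatch"}, "namespace": "AWS/ApplicationELB", "metricName": "RequestCount"},
    ],
)
print(json.dumps(panel, indent=2))
```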

Cross-Source Correlations

Correlate metrics, logs, and traces using shared dimensions:

{
  "panels": [
    {
      "title": "Metrics: Error Rate",
      "type": "timeseries",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~\"5..\", service=\"$service\"}[5m]))",
          "legendFormat": "5xx Errors"
        }
      ],
      "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
    },
    {
      "title": "Logs: Error Messages",
      "type": "logs",
      "datasource": "Loki",
      "targets": [
        {
          "expr": "{service=\"$service\"} |= \"error\" | json | level=\"error\"",
          "refId": "A"
        }
      ],
      "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }
    },
    {
      "title": "Traces: Failed Requests",
      "type": "table",
      "datasource": "Tempo",
      "targets": [
        {
          "queryType": "traceql",
          "query": "{ status = error && resource.service.name = \"$service\" }",
          "refId": "A"
        }
      ],
      "gridPos": { "x": 0, "y": 8, "w": 24, "h": 8 }
    }
  ]
}
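
The `gridPos` values above follow Grafana's 24-column grid. A minimal sketch of computing them, assuming panels are placed left to right and all share one height (`layout` is a hypothetical helper):

```python
GRID_COLUMNS = 24  # Grafana's dashboard grid is 24 units wide


def layout(widths: list[int], height: int = 8) -> list[dict]:
    """Place panels left to right, wrapping to a new row when one is full."""
    positions = []
    x = y = 0
    for w in widths:
        if not 1 <= w <= GRID_COLUMNS:
            raise ValueError(f"width must be between 1 and {GRID_COLUMNS}")
        if x + w > GRID_COLUMNS:  # no room left on this row
            x, y = 0, y + height
        positions.append({"x": x, "y": y, "w": w, "h": height})
        x += w
    return positions


# Two half-width panels followed by one full-width panel,
# the same arrangement as the metrics, logs, and traces panels above.
grid = layout([12, 12, 24])
```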

Data Source Examples

Prometheus Queries

# Request rate per service
sum(rate(http_requests_total[5m])) by (service)

# Error percentage
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100

# P95 Latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# CPU usage per pod (cores)
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))
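
To make the error percentage query concrete, here is a worked example of the arithmetic. It is a simplified model: `simple_rate` is a hypothetical stand-in for `rate()` that ignores counter resets and the boundary extrapolation Prometheus performs, and the sample values are invented for illustration.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase between the first and last (timestamp, value) sample.

    Simplification: ignores counter resets and the extrapolation that
    Prometheus applies at the window boundaries.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter samples over a 5-minute window (seconds, count).
all_requests = [(0, 10_000), (300, 13_000)]
errors_5xx = [(0, 200), (300, 260)]

# 60 errors / 300 s divided by 3000 requests / 300 s, as a percentage.
error_pct = simple_rate(errors_5xx) / simple_rate(all_requests) * 100
```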

Loki LogQL Queries

# Error logs for a service
{namespace="$namespace", app="$service"} |= "error"

# Parse JSON logs and filter by level
{app="$service"} | json | level="error" | line_format "{{.message}}"

# Count errors over time
sum(rate({app="$service"} |= "error" [5m]))

# Top 10 error messages by count
topk(10, sum by (error) (count_over_time({app="$service"} | json | level="error" [5m])))

Tempo TraceQL Queries

# Traces with errors
{ status = error }

# Traces for specific service with high latency
{ resource.service.name = "$service" && duration > 500ms }

# Find traces by span name
{ name = "HTTP GET /api/users" }

# Traces with specific attributes
{ span.http.status_code = 500 }

InfluxDB Flux Queries

from(bucket: "metrics")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "http_requests")
  |> filter(fn: (r) => r.service == "${service}")
  |> aggregateWindow(every: v.windowPeriod, fn: mean)

// Join multiple measurements
metrics = from(bucket: "metrics")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu")

errors = from(bucket: "metrics")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "errors")

join(tables: {m: metrics, e: errors}, on: ["host"])

CloudWatch Queries

{
  "datasource": "CloudWatch",
  "namespace": "AWS/EC2",
  "metricName": "CPUUtilization",
  "dimensions": {
    "InstanceId": ["$instance"]
  },
  "statistics": ["Average"],
  "period": "300"
}

Elasticsearch Queries

{
  "datasource": "Elasticsearch",
  "query": "level:error AND service:$service",
  "timeField": "@timestamp",
  "metrics": [{ "type": "count", "id": "1" }],
  "bucketAggs": [
    {
      "type": "date_histogram",
      "field": "@timestamp",
      "id": "2",
      "settings": { "interval": "auto" }
    }
  ]
}

Variables and Templating

Data Source Variable

Allow users to switch between data sources dynamically:

{
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "multi": false,
        "label": "Data Source"
      }
    ]
  }
}

Query Variables Across Sources

{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 1,
        "multi": false
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
        "refresh": 2,
        "multi": true
      },
      {
        "name": "log_stream",
        "type": "query",
        "datasource": "Loki",
        "query": "label_values({namespace=\"$namespace\"}, app)",
        "refresh": 2
      },
      {
        "name": "aws_region",
        "type": "query",
        "datasource": "CloudWatch",
        "query": "regions()",
        "refresh": 1
      }
    ]
  }
}
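
Listing a parent variable before the variables that reference it keeps the chain easy to follow. The sketch below flags references to variables that are not defined earlier in the list; `unresolved_references` is a hypothetical helper, and it only inspects string queries.

```python
import re

# Matches $name, ${name} and ${name:format} references in a query string.
VAR_REF = re.compile(r"\$\{?(\w+)")


def unresolved_references(variables: list[dict]) -> dict[str, list[str]]:
    """Map each variable name to the referenced variables not defined before it."""
    defined: set[str] = set()
    problems: dict[str, list[str]] = {}
    for var in variables:
        query = var.get("query", "")
        refs = VAR_REF.findall(query) if isinstance(query, str) else []
        # Names starting with "__" are Grafana built-ins such as $__interval.
        missing = [r for r in refs if r not in defined and not r.startswith("__")]
        if missing:
            problems[var["name"]] = missing
        defined.add(var["name"])
    return problems


chained = [
    {"name": "namespace", "query": "label_values(kube_pod_info, namespace)"},
    {"name": "service", "query": 'label_values(kube_service_info{namespace="$namespace"}, service)'},
]
```

Calling `unresolved_references(chained)` returns an empty dict, while the same list in reverse order reports that `service` depends on `namespace`.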

Ad-Hoc Filters

Enable dynamic filtering across all panels:

{
  "name": "Filters",
  "type": "adhoc",
  "datasource": "Prometheus"
}

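Ad hoc filters are not referenced in queries. Grafana adds the selected key/value pairs automatically to every query that uses the variable's data source, so the query is written without them:

sum(rate(http_requests_total[5m]))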

Panel Types

1. Stat Panel (Single Value)

{
  "type": "stat",
  "title": "Total Requests",
  "targets": [
    {
      "expr": "sum(increase(http_requests_total[$__range]))"
    }
  ],
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "value"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "value": 0, "color": "green" },
          { "value": 80, "color": "yellow" },
          { "value": 90, "color": "red" }
        ]
      }
    }
  }
}
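
With `"mode": "absolute"`, each threshold step applies from its value upward until the next step takes over. A minimal sketch of that lookup, assuming the steps use the same shape as the panel JSON above (`threshold_color` is a hypothetical helper):

```python
def threshold_color(value: float, steps: list[dict]) -> str:
    """Return the color of the highest step whose value is <= the reading.

    A step whose value is None is treated as the base step (minus infinity).
    Readings below every step fall back to the first step's color.
    """
    def bound(step: dict) -> float:
        return float("-inf") if step["value"] is None else step["value"]

    color = steps[0]["color"]
    for step in sorted(steps, key=bound):
        if value >= bound(step):
            color = step["color"]
    return color


steps = [
    {"value": 0, "color": "green"},
    {"value": 80, "color": "yellow"},
    {"value": 90, "color": "red"},
]
```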

2. Time Series Graph

{
  "type": "timeseries",
  "title": "CPU Usage",
  "targets": [
    {
      "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "min": 0,
      "max": 100
    }
  }
}

3. Logs Panel

{
  "type": "logs",
  "title": "Application Logs",
  "datasource": "Loki",
  "targets": [
    {
      "expr": "{namespace=\"$namespace\", app=\"$service\"}"
    }
  ],
  "options": {
    "showTime": true,
    "showLabels": true,
    "wrapLogMessage": true,
    "enableLogDetails": true
  }
}

4. Traces Panel

{
  "type": "traces",
  "title": "Distributed Traces",
  "datasource": "Tempo",
  "targets": [
    {
      "query": "{ resource.service.name = \"$service\" }"
    }
  ]
}

5. Table with Transformations

{
  "type": "table",
  "title": "Service Status (Multi-Source)",
  "datasource": { "type": "mixed", "uid": "-- Mixed --" },
  "targets": [
    {
      "datasource": "Prometheus",
      "expr": "up{job=~\"$service\"}",
      "format": "table",
      "instant": true,
      "refId": "A"
    },
    {
      "datasource": "CloudWatch",
      "namespace": "AWS/ECS",
      "metricName": "CPUUtilization",
      "refId": "B"
    }
  ],
  "transformations": [
    {
      "id": "merge"
    },
    {
      "id": "organize",
      "options": {
        "excludeByName": { "Time": true },
        "renameByName": {
          "instance": "Instance",
          "Value #A": "Status",
          "Value #B": "CPU %"
        }
      }
    }
  ]
}

6. Heatmap

{
  "type": "heatmap",
  "title": "Latency Distribution",
  "targets": [
    {
      "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
      "format": "heatmap"
    }
  ],
  "options": {
    "yAxis": { "unit": "s" },
    "color": { "scheme": "Turbo" }
  }
}

Transformations for Multi-Source Data

Join by Field

Join data from different sources by common field:

{
  "transformations": [
    {
      "id": "joinByField",
      "options": {
        "byField": "instance",
        "mode": "outer"
      }
    }
  ]
}
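
To illustrate what an outer join on a shared field produces, here is a sketch that works on plain lists of rows instead of Grafana data frames. The helper name `outer_join` and the sample rows are assumptions for illustration; Grafana's own transformation may differ in details such as column naming.

```python
def outer_join(left: list[dict], right: list[dict], by: str) -> list[dict]:
    """Outer-join two lists of rows on a shared field; gaps are filled with None."""
    columns = sorted({key for row in left + right for key in row} - {by})
    merged: dict = {}
    for row in left + right:
        # Rows sharing the same key are combined; on a column clash the later row wins.
        merged.setdefault(row[by], {}).update(row)
    return [
        {by: key, **{col: merged[key].get(col) for col in columns}}
        for key in sorted(merged)
    ]


# Hypothetical rows from two sources that share the "instance" field.
prometheus_rows = [{"instance": "a", "up": 1}, {"instance": "b", "up": 0}]
cloudwatch_rows = [{"instance": "b", "cpu": 42.0}, {"instance": "c", "cpu": 7.5}]

joined = outer_join(prometheus_rows, cloudwatch_rows, by="instance")
```

Instances present in only one source still appear in the result, with `None` in the columns the other source would have supplied.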

Merge

Combine all series into single frame:

{
  "transformations": [
    {
      "id": "merge"
    }
  ]
}

Rename and Organize

{
  "transformations": [
    {
      "id": "organize",
      "options": {
        "renameByName": {
          "Value #A": "Prometheus Requests",
          "Value #B": "CloudWatch Requests"
        },
        "excludeByName": {
          "__name__": true
        }
      }
    }
  ]
}

Dashboard Provisioning

datasources.yml

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: "traceID=(\\w+)"
          name: TraceID
          url: "$${__value.raw}"

  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        tags: ["service.name", "pod"]

  - name: InfluxDB
    type: influxdb
    url: http://influxdb:8086
    database: metrics
    jsonData:
      version: Flux
      organization: myorg
      defaultBucket: metrics
    secureJsonData:
      token: ${INFLUXDB_TOKEN}

  - name: CloudWatch
    type: cloudwatch
    jsonData:
      authType: default
      defaultRegion: us-west-2

  - name: Elasticsearch
    type: elasticsearch
    url: http://elasticsearch:9200
    database: "logs-*"
    jsonData:
      esVersion: "8.0.0"
      timeField: "@timestamp"

dashboards.yml

apiVersion: 1

providers:
  - name: "default"
    orgId: 1
    folder: "Production"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards

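Alerts with Multi-Source Context

The example uses the legacy dashboard alert format, which was removed in Grafana 11. Newer versions define Grafana-managed alert rules separately from the dashboard JSON.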

{
  "alert": {
    "name": "High Error Rate with Log Context",
    "conditions": [
      {
        "evaluator": { "params": [5], "type": "gt" },
        "operator": { "type": "and" },
        "query": { "params": ["A", "5m", "now"] },
        "reducer": { "type": "avg" },
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "for": "5m",
    "frequency": "1m",
    "message": "Error rate is above 5%\n\nCheck logs: ${__dashboard_url__}?var-service=$service&tab=logs\nTraces: ${__dashboard_url__}?var-service=$service&tab=traces",
    "noDataState": "no_data",
    "notifications": [{ "uid": "slack-oncall" }]
  }
}
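
The interaction between `frequency` and `for` can be sketched as a small state machine. This is a simplified model, assuming the alert fires once the breach has lasted at least the full `for` duration; `alert_states` is a hypothetical helper and the values are invented. With `frequency` at 1m and `for` at 5m, `for_evals` is 5.

```python
def alert_states(evaluations: list[float], threshold: float, for_evals: int) -> list[str]:
    """State after each evaluation: "ok", "pending", or "alerting".

    for_evals is the "for" duration divided by the evaluation frequency.
    """
    states = []
    breaches = 0  # consecutive evaluations above the threshold
    for value in evaluations:
        breaches = breaches + 1 if value > threshold else 0
        if breaches == 0:
            states.append("ok")
        elif breaches > for_evals:
            # The first breach starts the clock, so the full duration has
            # elapsed on the (for_evals + 1)-th consecutive breach.
            states.append("alerting")
        else:
            states.append("pending")
    return states


# Averaged error percentage at one-minute intervals (hypothetical values).
history = alert_states([3, 6, 6, 6, 6, 6, 6, 2], threshold=5, for_evals=5)
```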

Performance Best Practices

1. Query Optimization

Data Source	Optimization
Prometheus	Use recording rules for complex queries
Loki	Add stream selectors before filter expressions
Tempo	Limit by time range and attributes
InfluxDB	Push aggregations down to query level
CloudWatch	Use appropriate period (min 60s)
Elasticsearch	Use index patterns and time filters

2. Dashboard Settings

{
  "refresh": "30s",
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "timepicker": {
    "refresh_intervals": ["5s", "10s", "30s", "1m", "5m"],
    "time_options": ["5m", "15m", "1h", "6h", "12h", "24h", "7d"]
  }
}

3. Reduce Panel Load

Use $__interval for automatic resolution
Set reasonable maxDataPoints
Use instant queries for non-time-series data
Lazy load panels (only visible in viewport)
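
The relationship between time range, `maxDataPoints`, and query resolution can be approximated as follows. This is a simplified sketch: Grafana additionally rounds the result to a fixed set of interval sizes, and the 15-second minimum is an assumed scrape interval, not a value from this document.

```python
def query_interval(range_seconds: float, max_data_points: int, min_interval: float = 15.0) -> float:
    """Approximate step between points: time range / maxDataPoints, floored at min_interval.

    Simplification: the rounding to fixed interval sizes is not reproduced here.
    """
    if max_data_points <= 0:
        raise ValueError("max_data_points must be positive")
    return max(range_seconds / max_data_points, min_interval)


six_hours = query_interval(6 * 3600, max_data_points=1000)  # about 21.6 seconds
one_hour = query_interval(3600, max_data_points=1000)       # clamped to the 15 s minimum
```

Halving `maxDataPoints` doubles the step, which is why lowering it reduces the number of points each panel has to fetch and render.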

Dashboard Organization

Folder Structure

Production/
├── Infrastructure/
│   ├── Node Overview
│   ├── Kubernetes Cluster
│   └── Network
├── Applications/
│   ├── API Gateway
│   ├── User Service
│   └── Payment Service
├── Databases/
│   ├── PostgreSQL
│   ├── Redis
│   └── Elasticsearch
└── Business/
    ├── Revenue Dashboard
    └── User Analytics

Tagging Convention

Environment: production, staging, development
Team: platform, backend, frontend
Service: api, database, cache
Source: prometheus, loki, cloudwatch

Common Pitfalls

Pitfall	Problem	Solution
Too many panels	Slow load times	Focus on key metrics
Mixed source overload	Query conflicts	Use transformations to align
Missing time alignment	Mismatched data	Use consistent time filters
High cardinality queries	Memory issues	Filter early, aggregate
No variable cascading	Stale filters	Chain variables with refresh

Best Practices Summary

Use Mixed data source for multi-source panels
Correlate with shared dimensions (service, pod, instance)
Chain variables from coarse to fine (region → cluster → service)
Optimize queries at source (filters first, aggregations early)
Use transformations for joining and organizing data
Set appropriate refresh rates (not too frequent)
Version control dashboards as JSON/code
Link traces to logs via derived fields
Add context to alerts with dashboard URLs
Organize with folders and tags
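
For version control, exported dashboards produce cleaner diffs when fields that change on every save are dropped and keys are sorted. A minimal sketch, assuming the volatile top-level keys are `id`, `version`, and `iteration` (`normalize_dashboard` is a hypothetical helper, and the key list may need adjusting for a given Grafana version):

```python
import json

# Top-level keys assumed to change between saves or between Grafana instances.
VOLATILE_KEYS = ("id", "version", "iteration")


def normalize_dashboard(raw: str) -> str:
    """Return dashboard JSON with volatile top-level keys removed and keys sorted."""
    dashboard = json.loads(raw)
    for key in VOLATILE_KEYS:
        dashboard.pop(key, None)
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


exported = '{"version": 7, "uid": "abc", "title": "T", "id": 12}'
normalized = normalize_dashboard(exported)
```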

Parent Hub

_devops-cloud-mastery

Part of Workflow

This skill is utilized in the following sequential workflows:

_workflow-data-pipeline
