ethanrcohen

datadog-observability

Search, aggregate, and analyze Datadog logs, metrics, and APM data using pup (Datadog's official CLI). Use when debugging production issues, investigating errors, triaging incidents, checking service health, querying metrics, or when the user mentions Datadog, logs, metrics, APM, or error investigation. Triggers on requests involving log analysis, metric queries, service debugging, error counts, or production monitoring.

ethanrcohen 1 Updated 3mo ago
GitHub

Install

npx skillscat add ethanrcohen/datadog-agent-skill/datadog-observability

Install via the SkillsCat registry.

SKILL.md

Datadog Observability Skill (via pup)

Requires: pup CLI, authenticated via pup auth login or DD_API_KEY + DD_APP_KEY env vars.

Choose Your Workflow

Goal Command
Find errors in a service Search Logs
Count errors / compute metrics Aggregate Logs
Query time-series metrics Query Metrics
List APM services + perf stats APM Services
View service dependencies APM Dependencies

Search Logs

Returns log entries matching a Datadog query.

# Errors in a service in the last hour
pup logs search --query="service:payment AND status:error" --from="1h"

# Filter by service + environment
pup logs search --query="service:user-service AND env:production" --from="15m"

# Advanced attribute filters
pup logs search --query="service:payment AND @duration:>5s" --from="1h"

# Control result count
pup logs search --query="service:payment AND status:error" --from="24h" --limit=200

# Sort oldest first
pup logs search --query="status:error" --from="1h" --sort="asc"

Search Flags

Flag Description
--query Datadog query string (required)
--from Start time: relative (1h, 30m, 7d) or Unix ms (required)
--to End time (default: now)
--limit Max results (default: 50, max: 1000)
--sort asc or desc (default: desc)
--index Comma-separated log indexes
--output / -o json (default), table, yaml

Aggregate Logs

Compute metrics from logs -- counts, averages, percentiles. Useful for triage.

# How many errors per service in the last 24h?
pup logs aggregate --query="status:error" --from="24h" --compute="count" --group-by="service"

# Average request duration by service
pup logs aggregate --query="*" --from="1h" --compute="avg(@duration)" --group-by="service"

# 99th percentile latency
pup logs aggregate --query="service:api" --from="2h" --compute="percentile(@duration, 99)"

# Error count by HTTP status code
pup logs aggregate --query="status:error" --from="1d" --compute="count" --group-by="@http.status_code"

Compute Options

Compute Example Description
count --compute="count" Count matching logs
avg(metric) --compute="avg(@duration)" Average of a numeric attribute
sum(metric) --compute="sum(@bytes)" Sum
min(metric) --compute="min(@latency)" Minimum
max(metric) --compute="max(@latency)" Maximum
cardinality(field) --compute="cardinality(@user.id)" Unique values
percentile(metric, N) --compute="percentile(@duration, 99)" Percentile

Query Metrics

Query time-series metrics data.

# CPU usage across all hosts in the last hour
pup metrics query --query="avg:system.cpu.user{*}" --from="1h"

# Memory for a specific service in production
pup metrics query --query="avg:system.mem.used{service:web,env:prod}" --from="4h"

# Search for available metrics
pup metrics list --filter="system.cpu.*"

# Get metadata for a specific metric
pup metrics get system.cpu.user

Metrics Flags

Flag Description
--query Datadog metrics query (required)
--from Start time: relative (1h, 30m, 7d) or Unix ms (required)
--to End time (default: now)
--output / -o json (default), table, yaml

APM Services

List services and their performance statistics. Note: APM commands use Unix timestamps (not relative time).

# List all APM services
pup apm services list

# Service performance stats (last hour)
pup apm services stats --start=$(date -v-1H +%s) --end=$(date +%s)

# Filter by environment
pup apm services stats --start=$(date -v-1H +%s) --end=$(date +%s) --env=prod

# List operations for a service
pup apm services operations web-server --start=$(date -v-1H +%s) --end=$(date +%s)

# List resources (endpoints) for a service operation
pup apm services resources web-server --operation="GET /api/users" --from=$(date -v-1H +%s) --to=$(date +%s)

APM Flags

Flag Description
--start Start time as Unix timestamp (required for stats/operations)
--end End time as Unix timestamp (required for stats/operations)
--env Filter by environment
--primary-tag Filter by primary tag (group:value)
--output / -o json (default), table, yaml

APM Dependencies

View service call relationships based on trace data.

# All service dependencies in production
pup apm dependencies list --env=prod --start=$(date -v-1H +%s) --end=$(date +%s)

# Dependencies for a specific service
pup apm dependencies list web-server --env=prod --start=$(date -v-1H +%s) --end=$(date +%s)

# Service flow map with performance metrics
pup apm flow-map --query="env:prod" --from=$(date -v-1H +%s) --to=$(date +%s)

Datadog Query Syntax

All query filters use Datadog's standard search syntax:

service:my-service              Filter by service
status:error                    Filter by log status
host:my-host                    Filter by host
env:production                  Filter by environment
@duration:>5s                   Numeric attribute filter
"exact phrase"                  Exact match
service:web AND status:error    Boolean operators (AND, OR, NOT)
service:web-*                   Wildcards
-status:info                    Negation

Output Formats

All commands support --output / -o:

Format Flag Use when
JSON --output json (default) Piping to jq, programmatic analysis
Table --output table Human-readable overview
YAML --output yaml Configuration-style output
# Pipe JSON to jq for field selection
pup logs search --query="status:error" --from="1h" | jq '.data[].attributes.message'

# Human-readable table
pup logs search --query="status:error" --from="1h" --output table

Common Investigation Patterns

# 1. Start broad: what services have errors?
pup logs aggregate --query="status:error" --from="1h" --compute="count" --group-by="service"

# 2. Drill into the top offender
pup logs search --query="service:payment AND status:error" --from="1h" --output table

# 3. Get full JSON details for a specific timeframe
pup logs search --query="service:payment AND status:error" --from="30m" --limit=10

# 4. Check if it's environment-specific
pup logs aggregate --query="service:payment AND status:error" --from="1h" --compute="count" --group-by="env"

# 5. Check APM service health
pup apm services stats --start=$(date -v-1H +%s) --end=$(date +%s) --env=prod

# 6. View service dependencies
pup apm dependencies list payment --env=prod --start=$(date -v-1H +%s) --end=$(date +%s)

# 7. Check a specific metric
pup metrics query --query="avg:trace.servlet.request.duration{service:payment}" --from="1h"

Time Ranges

Logs & Metrics accept relative durations:

Input Meaning
1h 1 hour ago
30m 30 minutes ago
7d 7 days ago
1w 1 week ago
now Current time (default for --to)

APM commands require Unix timestamps. Use date to compute them:

Shell 1 hour ago Now
macOS $(date -v-1H +%s) $(date +%s)
Linux $(date -d '1 hour ago' +%s) $(date +%s)