Search, aggregate, and analyze Datadog logs, metrics, and APM data using pup (Datadog's official CLI). Use when debugging production issues, investigating errors, triaging incidents, checking service health, querying metrics, or when the user mentions Datadog, logs, metrics, APM, or error investigation. Triggers on requests involving log analysis, metric queries, service debugging, error counts, or production monitoring.
Install
Install via the SkillsCat registry:
npx skillscat add ethanrcohen/datadog-agent-skill/datadog-observability
Datadog Observability Skill (via pup)
Requires: pup CLI, authenticated via pup auth login or DD_API_KEY + DD_APP_KEY env vars.
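A quick preflight check can catch a missing binary or missing keys before the first query fails. The sketch below is an assumption-laden helper, not part of pup: the function name `datadog_preflight` and its return codes are hypothetical, and it only inspects the environment-variable path, so it cannot tell whether a session from `pup auth login` already exists.

```shell
# Hypothetical helper: verify that pup is callable and API-key auth is configured.
# Returns 0 when ready, 1 when pup is missing, 2 when either key is unset.
# It does NOT detect a session created by `pup auth login`.
datadog_preflight() {
  if ! command -v pup >/dev/null 2>&1; then
    echo "pup not found on PATH" >&2
    return 1
  fi
  if [ -z "${DD_API_KEY:-}" ] || [ -z "${DD_APP_KEY:-}" ]; then
    echo "DD_API_KEY and DD_APP_KEY are not both set; use 'pup auth login' instead" >&2
    return 2
  fi
  return 0
}
```

Calling `datadog_preflight || exit 1` at the top of an investigation script keeps later failures from being mistaken for empty query results.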
Choose Your Workflow
| Goal | Section |
|---|---|
| Find errors in a service | Search Logs |
| Count errors / compute metrics | Aggregate Logs |
| Query time-series metrics | Query Metrics |
| List APM services + perf stats | APM Services |
| View service dependencies | APM Dependencies |
Search Logs
Returns log entries matching a Datadog query.
# Errors in a service in the last hour
pup logs search --query="service:payment AND status:error" --from="1h"
# Filter by service + environment
pup logs search --query="service:user-service AND env:production" --from="15m"
# Advanced attribute filters
pup logs search --query="service:payment AND @duration:>5s" --from="1h"
# Control result count
pup logs search --query="service:payment AND status:error" --from="24h" --limit=200
# Sort oldest first
pup logs search --query="status:error" --from="1h" --sort="asc"
Search Flags
| Flag | Description |
|---|---|
| --query | Datadog query string (required) |
| --from | Start time: relative (1h, 30m, 7d) or Unix ms (required) |
| --to | End time (default: now) |
| --limit | Max results (default: 50, max: 1000) |
| --sort | asc or desc (default: desc) |
| --index | Comma-separated log indexes |
| --output / -o | json (default), table, yaml |
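Relative durations always end at the current time, so searching a closed window in the past needs absolute values. The sketch below computes Unix millisecond timestamps with shell arithmetic. The helper name `unix_ms_ago` is hypothetical, and passing Unix ms to `--to` is an assumption: the flag table documents that format only for `--from`.

```shell
# Hypothetical helper: print the Unix time in milliseconds for N minutes ago.
# With no argument (or 0) it prints the current time in milliseconds.
unix_ms_ago() {
  local minutes="${1:-0}"
  echo $(( ( $(date +%s) - minutes * 60 ) * 1000 ))
}

# Example, commented out so the snippet runs without pup installed:
# errors logged between 90 and 60 minutes ago.
# Assumes --to accepts Unix ms like --from does.
# pup logs search --query="service:payment AND status:error" \
#   --from="$(unix_ms_ago 90)" --to="$(unix_ms_ago 60)"
```

The resolution is one second, since the value is derived from `date +%s` and multiplied by 1000.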
Aggregate Logs
Compute metrics from logs -- counts, averages, percentiles. Useful for triage.
# How many errors per service in the last 24h?
pup logs aggregate --query="status:error" --from="24h" --compute="count" --group-by="service"
# Average request duration by service
pup logs aggregate --query="*" --from="1h" --compute="avg(@duration)" --group-by="service"
# 99th percentile latency
pup logs aggregate --query="service:api" --from="2h" --compute="percentile(@duration, 99)"
# Error count by HTTP status code
pup logs aggregate --query="status:error" --from="1d" --compute="count" --group-by="@http.status_code"
Compute Options
| Compute | Example | Description |
|---|---|---|
| count | --compute="count" | Count matching logs |
| avg(metric) | --compute="avg(@duration)" | Average of a numeric attribute |
| sum(metric) | --compute="sum(@bytes)" | Sum |
| min(metric) | --compute="min(@latency)" | Minimum |
| max(metric) | --compute="max(@latency)" | Maximum |
| cardinality(field) | --compute="cardinality(@user.id)" | Unique values |
| percentile(metric, N) | --compute="percentile(@duration, 99)" | Percentile |
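Latency triage usually compares several percentiles rather than one. Each example in this section passes a single `--compute` expression, so the sketch below generates one command per percentile instead of assuming that multiple expressions can be combined. The function name `percentile_commands` is hypothetical, and it is a dry run: it prints the commands and executes nothing.

```shell
# Hypothetical helper: print one `pup logs aggregate` command per percentile.
# Arguments: query, relative start time, attribute, then one or more percentiles.
percentile_commands() {
  local query="$1" from="$2" attr="$3"
  shift 3
  local p
  for p in "$@"; do
    printf 'pup logs aggregate --query="%s" --from="%s" --compute="percentile(%s, %s)"\n' \
      "$query" "$from" "$attr" "$p"
  done
}

# Prints three commands, for the 50th, 95th, and 99th percentiles.
percentile_commands "service:api" "2h" "@duration" 50 95 99
```

Printing first makes it easy to review the generated queries before running them one by one.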
Query Metrics
Query time-series metrics data.
# CPU usage across all hosts in the last hour
pup metrics query --query="avg:system.cpu.user{*}" --from="1h"
# Memory for a specific service in production
pup metrics query --query="avg:system.mem.used{service:web,env:prod}" --from="4h"
# Search for available metrics
pup metrics list --filter="system.cpu.*"
# Get metadata for a specific metric
pup metrics get system.cpu.user
Metrics Flags
| Flag | Description |
|---|---|
| --query | Datadog metrics query (required) |
| --from | Start time: relative (1h, 30m, 7d) or Unix ms (required) |
| --to | End time (default: now) |
| --output / -o | json (default), table, yaml |
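The metric queries in this section follow the shape `aggregator:metric{tags}`, with `{*}` meaning no tag filter. As a rough illustration, the sketch below assembles that string from its parts. The function name `metric_query` is hypothetical, and it covers only this basic shape, not functions, rollups, or grouping.

```shell
# Hypothetical helper: build "<aggregator>:<metric>{tag1,tag2,...}".
# With no tags the scope is {*}, matching the CPU example in this section.
metric_query() {
  local agg="$1" metric="$2"
  shift 2
  local scope="*"
  if [ "$#" -gt 0 ]; then
    local IFS=","      # "$*" joins the remaining arguments with commas
    scope="$*"
  fi
  printf '%s:%s{%s}\n' "$agg" "$metric" "$scope"
}

metric_query avg system.cpu.user                       # avg:system.cpu.user{*}
metric_query avg system.mem.used service:web env:prod  # avg:system.mem.used{service:web,env:prod}
```

The result can be passed straight to the query flag, for example `pup metrics query --query="$(metric_query avg system.cpu.user)" --from="1h"`.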
APM Services
List services and their performance statistics. Note: APM commands use Unix timestamps (not relative time).
# List all APM services
pup apm services list
# Service performance stats (last hour)
pup apm services stats --start=$(date -v-1H +%s) --end=$(date +%s)
# Filter by environment
pup apm services stats --start=$(date -v-1H +%s) --end=$(date +%s) --env=prod
# List operations for a service
pup apm services operations web-server --start=$(date -v-1H +%s) --end=$(date +%s)
# List resources (endpoints) for a service operation
pup apm services resources web-server --operation="GET /api/users" --from=$(date -v-1H +%s) --to=$(date +%s)
APM Flags
| Flag | Description |
|---|---|
| --start | Start time as Unix timestamp (required for stats/operations) |
| --end | End time as Unix timestamp (required for stats/operations) |
| --env | Filter by environment |
| --primary-tag | Filter by primary tag (group:value) |
| --output / -o | json (default), table, yaml |
APM Dependencies
View service call relationships based on trace data.
# All service dependencies in production
pup apm dependencies list --env=prod --start=$(date -v-1H +%s) --end=$(date +%s)
# Dependencies for a specific service
pup apm dependencies list web-server --env=prod --start=$(date -v-1H +%s) --end=$(date +%s)
# Service flow map with performance metrics
pup apm flow-map --query="env:prod" --from=$(date -v-1H +%s) --to=$(date +%s)
Datadog Query Syntax
All query filters use Datadog's standard search syntax:
service:my-service Filter by service
status:error Filter by log status
host:my-host Filter by host
env:production Filter by environment
@duration:>5s Numeric attribute filter
"exact phrase" Exact match
service:web AND status:error Boolean operators (AND, OR, NOT)
service:web-* Wildcards
-status:info Negation
Output Formats
All commands support --output / -o:
| Format | Flag | Use when |
|---|---|---|
| JSON | --output json (default) | Piping to jq, programmatic analysis |
| Table | --output table | Human-readable overview |
| YAML | --output yaml | Configuration-style output |
# Pipe JSON to jq for field selection
pup logs search --query="status:error" --from="1h" | jq '.data[].attributes.message'
# Human-readable table
pup logs search --query="status:error" --from="1h" --output table
Common Investigation Patterns
# 1. Start broad: what services have errors?
pup logs aggregate --query="status:error" --from="1h" --compute="count" --group-by="service"
# 2. Drill into the top offender
pup logs search --query="service:payment AND status:error" --from="1h" --output table
# 3. Get full JSON details for a specific timeframe
pup logs search --query="service:payment AND status:error" --from="30m" --limit=10
# 4. Check if it's environment-specific
pup logs aggregate --query="service:payment AND status:error" --from="1h" --compute="count" --group-by="env"
# 5. Check APM service health
pup apm services stats --start=$(date -v-1H +%s) --end=$(date +%s) --env=prod
# 6. View service dependencies
pup apm dependencies list payment --env=prod --start=$(date -v-1H +%s) --end=$(date +%s)
# 7. Check a specific metric
pup metrics query --query="avg:trace.servlet.request.duration{service:payment}" --from="1h"
Time Ranges
Logs & Metrics accept relative durations:
| Input | Meaning |
|---|---|
| 1h | 1 hour ago |
| 30m | 30 minutes ago |
| 7d | 7 days ago |
| 1w | 1 week ago |
| now | Current time (default for --to) |
APM commands require Unix timestamps. Use date to compute them:
| Platform | 1 hour ago | Now |
|---|---|---|
| macOS | $(date -v-1H +%s) | $(date +%s) |
| Linux | $(date -d '1 hour ago' +%s) | $(date +%s) |
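The macOS and Linux `date` offset flags differ, which makes copied commands fail on the other platform. One way to sidestep this is to take the current epoch with `date +%s`, which both support, and subtract the offset with shell arithmetic. The helper name `epoch_hours_ago` below is hypothetical.

```shell
# Hypothetical helper: print the Unix timestamp (seconds) for N hours ago.
# Uses only `date +%s` plus shell arithmetic, so it avoids the -v / -d split.
# Defaults to 1 hour when no argument is given.
epoch_hours_ago() {
  local hours="${1:-1}"
  echo $(( $(date +%s) - hours * 3600 ))
}

# Example, commented out so the snippet runs without pup installed:
# pup apm services stats --start="$(epoch_hours_ago 1)" --end="$(date +%s)" --env=prod
```

Whole hours are enough for the APM examples in this document; finer windows would need a variant that takes minutes or seconds.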