dmonteroh

migration-observability

"Make database migrations safe and observable. Define progress + safety metrics, dashboards, and runbook gates (go/no-go criteria) for live migrations, backfills, and cutovers. Works standalone and is database/tooling agnostic."

dmonteroh 1 Updated 3mo ago

Resources

1
GitHub

Install

npx skillscat add dmonteroh/curated-agent-skills/migration-observability

Install via the SkillsCat registry.

SKILL.md

migration-observability

Provides guidance for running migrations safely (not for authoring SQL/ORM migration steps). It focuses on:

  • progress visibility (are we moving? how fast? ETA?)
  • safety signals (are we harming prod? are errors rising? is lag growing?)
  • runbook gates (objective go/no-go thresholds and rollback triggers)

Use this skill when

  • Running a production migration/backfill/cutover that can’t be “fire and forget”.
  • Dashboards/alerts and objective gates (pause, slow down, rollback, proceed) are required.
  • A shared, deterministic runbook is required for the migration.

Do not use this skill when

  • The change is a tiny, low-risk schema tweak with trivial rollback.
  • Only authoring migration scripts (this skill is about operating them).

Required inputs

  • Migration summary: what changes, why, scope, expected duration, rollback complexity.
  • Migration type: schema-only / backfill / online rewrite / cutover / dual-write / reindex.
  • Operational constraints: maintenance window, allowed error budget, throttling knobs.
  • Available telemetry: metrics, logs, traces, dashboards, alerting system.
  • Data correctness expectations: invariants, validation approach, sampling strategy.

Outputs (artifacts produced)

Minimum artifacts (paths are suggestions; adapt to repo conventions):

  • docs/runbooks/migrations/.md (the runbook with gates)
  • docs/runbooks/migrations/-dashboard.md (dashboard/queries, tool-agnostic)
  • docs/runbooks/migrations/-alerts.md (alert thresholds + paging policy)

Templates:

  • references/runbook-template.md
  • references/metrics-and-gates.md

Workflow (step-by-step with outputs)

  1. Classify the migration (controls what must be observed)
  • Decide type, blast radius, and rollback complexity.
  • Output: short classification summary to include in the runbook.
  1. Define progress metrics (prove it is moving)
  • Pick counters and derived rates (rows processed, throughput, ETA, lag).
  • Decision point: if instrumentation is unavailable, define structured logs and manual sampling cadence.
  • Output: a progress metrics list with target values or expected ranges.
  1. Define safety metrics (prove it is not hurting prod)
  • Select DB, query, app, and correctness signals that exist today.
  • Decision point: if a metric is missing, either add lightweight instrumentation or choose a proxy metric.
  • Output: a safety metrics list with thresholds and measurement sources.
  1. Turn metrics into gates (objective go/no-go)
  • Write Proceed (green), Pause/Throttle (yellow), Rollback (red) per phase.
  • Decision point: if rollback is impossible, define a “stop and stabilize” gate plus explicit escalation steps.
  • Output: a gate table for each phase.
  1. Build the runbook, dashboard, and alert specs
  • Use the templates to capture phases, checks, and thresholds.
  • Output: the three runbook artifacts in your repo structure.
  1. Execute with controls and document outcomes
  • Start with a canary, ramp cautiously, and validate invariants continuously.
  • Decision point: if any gate triggers, follow the runbook’s pause/rollback steps.
  • Output: a short closeout section with verification results and follow-ups.

Common pitfalls

  • Missing baselines, so regressions cannot be detected.
  • Gates with vague language instead of numeric thresholds.
  • No throttle/rollback plan for long-running backfills.
  • Correctness checks that rely on assumptions that cannot be measured.
  • Alerting that pages the wrong team or has no escalation path.

Tooling guidance (agnostic)

This skill does not require a specific stack. Common setups:

  • Metrics: Prometheus/OpenTelemetry/Cloud metrics
  • Dashboards: Grafana / cloud dashboards
  • Alerts: Grafana alerting / PagerDuty / OpsGenie / Slack

If none exist, remain “observable” by emitting structured logs + writing a runbook with manual checks and thresholds.

Examples

Example input:

  • Migration type: backfill
  • Operational constraints: 2-hour maintenance window, batch size throttle
  • Available telemetry: Prometheus metrics + Grafana dashboarding
  • Data correctness: checksum sampling every 10k rows

Example output summary (abbreviated):

  • Runbook: docs/runbooks/migrations/2024-09-customer-backfill.md
  • Dashboard spec: docs/runbooks/migrations/2024-09-customer-backfill-dashboard.md
  • Alerts spec: docs/runbooks/migrations/2024-09-customer-backfill-alerts.md
  • Gates: proceed/pause/rollback thresholds for canary, ramp, full run

Output contract (reporting format)

Return a single summary using labeled bullets with the following fields, in this order:

  • Migration classification (type, blast radius, rollback complexity).
  • Progress metrics with thresholds + sources.
  • Safety metrics with thresholds + sources.
  • Gate table by phase (proceed/pause/rollback).
  • Runbook/dashboard/alerts file paths produced.
  • Risks, assumptions, and missing instrumentation.

References

  • references/README.md