deployment-rollout-debug

Diagnose Deployment rollout failures (stuck rollouts, ProgressDeadlineExceeded, replica mismatch). Checks rollout status, ReplicaSets, and new pod health to identify why an update is failing.

scitix 220 26 Updated 4mo ago

GitHub

Install

npx skillscat add scitix/siclaw/deployment-rollout-debug

Install via the SkillsCat registry.

SKILL.md

Deployment Rollout Failure Diagnosis

When a Deployment rollout is stuck, not progressing, or shows replica mismatches, follow this flow to identify the root cause.

Scope: This skill is for diagnosis only. Once you identify the root cause, report it to the user and stop. Do NOT attempt to roll back, scale, or modify the Deployment — that should be left to the user.

Diagnostic Flow

1. Check rollout status

kubectl rollout status deployment/<name> -n <ns> --timeout=5s

This shows whether the rollout is progressing, complete, or stuck. A timeout indicates the rollout is not making progress.

2. Get Deployment details

kubectl get deployment <name> -n <ns> -o wide

Compare the columns:

DESIRED — target replica count
CURRENT — total pods (old + new)
UP-TO-DATE — pods running the new version
AVAILABLE — pods ready to serve traffic

If UP-TO-DATE < DESIRED or AVAILABLE < DESIRED, the rollout is incomplete.

3. Describe the Deployment

kubectl describe deployment <name> -n <ns>

Focus on:

Conditions — look for Progressing (status and reason) and Available
Events — look for scaling events, errors, or warnings
Strategy — note maxSurge and maxUnavailable settings

4. Check ReplicaSets

kubectl get rs -n <ns> -l app=<name>

If the label app=<name> doesn't match, find ReplicaSets owned by the Deployment:

kubectl get rs -n <ns> --sort-by='.metadata.creationTimestamp' | grep <name>

You should see the old RS (with reduced replicas) and the new RS (scaling up). If the new RS has 0 ready replicas, its pods are failing.

5. Check new ReplicaSet's pods

Find pods from the new RS:

kubectl get pods -n <ns> -l app=<name> --sort-by='.metadata.creationTimestamp'

Check the status of the newest pods. Based on their status:

Pending → Use the pod-pending-debug skill
CrashLoopBackOff / Error → Use the pod-crash-debug skill
ImagePullBackOff / ErrImagePull → Use the image-pull-debug skill
Running but not Ready → Check readiness probe below

6. Match deployment-level patterns

`ProgressDeadlineExceeded` — Rollout timed out

The Deployment's spec.progressDeadlineSeconds (default 600s) has been exceeded without progress.

This is a symptom, not a cause. The root cause is in the new pods — check step 5 for why new pods are not becoming ready.

`MinimumReplicasUnavailable` — Not enough replicas available

The Deployment cannot maintain the minimum number of available replicas during the rollout.

Check the new pods' status (step 5). The new version's pods are failing to start or pass readiness probes.

New pods `Running` but not `Ready` — Readiness probe failing

The new pods started successfully but are failing their readiness probe. The rollout will not progress because the new pods are not considered available.

Check the readiness probe configuration and failures:

kubectl describe pod <new-pod> -n <ns>

Look for Readiness probe failed events. Common causes:

Application not listening on the expected port
Health endpoint returning errors
Probe timing too aggressive (initialDelaySeconds too low)

Rollout stuck with `maxSurge: 0` and `maxUnavailable: 0` — Invalid strategy

If both maxSurge and maxUnavailable are 0, the rollout cannot proceed because it cannot create extra pods and cannot remove existing pods. At least one must be greater than 0.

New RS created but not scaling up — Admission webhook or quota

The new ReplicaSet exists but has 0 replicas or cannot create pods:

kubectl describe rs <new-rs> -n <ns>

Check events for:

Admission webhook denied — a webhook is rejecting the new pod spec
Exceeded quota — resource quota in the namespace prevents creating more pods
FailedCreate — other creation failures

Old RS not scaling down — Waiting for new pods

The old ReplicaSet keeps its replicas because the new RS pods are not ready yet. This is expected behavior — Kubernetes will not remove old pods until new pods are available. Fix the new pods first.

Notes

kubectl rollout status waits for the rollout to complete by default. Use --timeout to avoid blocking.
If the user wants to undo a failed rollout: kubectl rollout undo deployment/<name> -n <ns> (but let the user decide, do not execute this).
For Deployments managed by Helm or ArgoCD, the rollout may be triggered by those tools — check if the issue is in the Helm chart values or ArgoCD sync.
StatefulSet rollouts follow a different pattern (ordered, one-at-a-time by default). This skill is specific to Deployments.

deployment-rollout-debug

Install

Deployment Rollout Failure Diagnosis

Diagnostic Flow

1. Check rollout status

2. Get Deployment details

3. Describe the Deployment

4. Check ReplicaSets

5. Check new ReplicaSet's pods

6. Match deployment-level patterns

ProgressDeadlineExceeded — Rollout timed out

MinimumReplicasUnavailable — Not enough replicas available

New pods Running but not Ready — Readiness probe failing

Rollout stuck with maxSurge: 0 and maxUnavailable: 0 — Invalid strategy

New RS created but not scaling up — Admission webhook or quota

Old RS not scaling down — Waiting for new pods

Notes

Categories

Install

Recommended Skills

`ProgressDeadlineExceeded` — Rollout timed out

`MinimumReplicasUnavailable` — Not enough replicas available

New pods `Running` but not `Ready` — Readiness probe failing

Rollout stuck with `maxSurge: 0` and `maxUnavailable: 0` — Invalid strategy