k8s-ops

Kubernetes cluster operations, deployments, and troubleshooting. Use when deploying manifests, checking rollout status, monitoring pods, debugging failures, viewing logs, managing namespaces, or any kubectl operations. Triggers on mentions of kubernetes, k8s, kubectl, pods, deployments, services, namespaces, or cluster management.

martin-janci 3 Updated 6mo ago

Install

npx skillscat add martin-janci/claude-marketplace/k8s-ops

Install via the SkillsCat registry.

SKILL.md

Kubernetes Operations

Core Workflow

Deployment Lifecycle

# 1. Validate before applying
kubectl apply --dry-run=server -f <manifest> -n <namespace>

# 2. Apply manifests
kubectl apply -f <manifest> -n <namespace>

# 3. Monitor rollout (blocks until complete or timeout)
kubectl rollout status deployment/<name> -n <namespace> --timeout=300s

# 4. Verify pods running
kubectl get pods -n <namespace> -l app=<label> -o wide

# 5. Check events for issues
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
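The five steps above can be chained so that a failed step stops the sequence. The following is a minimal sketch, not a definitive deployment script: the function name `deploy_and_verify` and its argument order are hypothetical, and it lists all pods in the namespace instead of filtering by an `app` label.

```bash
# deploy_and_verify <manifest> <namespace> <deployment-name>
# Hypothetical helper: runs the lifecycle steps in order and stops at the first failure.
deploy_and_verify() {
  local manifest="$1" ns="$2" name="$3"

  # 1. Server-side dry run: surfaces validation errors before anything changes
  kubectl apply --dry-run=server -f "$manifest" -n "$ns" || return 1

  # 2. Apply for real
  kubectl apply -f "$manifest" -n "$ns" || return 1

  # 3. Wait for the rollout; on failure or timeout, print recent events and fail
  if ! kubectl rollout status "deployment/$name" -n "$ns" --timeout=300s; then
    kubectl get events -n "$ns" --sort-by='.lastTimestamp' | tail -20
    return 1
  fi

  # 4. Rollout succeeded: list the pods in the namespace
  kubectl get pods -n "$ns" -o wide
}
```

Usage: `deploy_and_verify <manifest> <namespace> <name>`. The non-zero return code makes it usable as a gate in a CI step.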

Quick Health Check

# Cluster overview
kubectl cluster-info
kubectl get nodes -o wide
kubectl top nodes  # requires metrics-server

# Namespace health
kubectl get all -n <namespace>
kubectl get pods -n <namespace> -o wide
kubectl top pods -n <namespace>

Troubleshooting Decision Tree

Pod Not Starting

Check pod status: kubectl get pods -n <ns> -o wide
Describe for events: kubectl describe pod <pod> -n <ns>
Check logs: kubectl logs <pod> -n <ns> --previous (if crashed)

Common causes:

ImagePullBackOff: Wrong image name/tag, missing imagePullSecrets
CrashLoopBackOff: App crashes on startup (check logs with --previous) or liveness/startup probes are too aggressive
Pending: Insufficient resources, node selector/affinity issues
ContainerCreating: Volume mount issues, init container stuck
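The mapping from status to first check can be sketched as a small lookup. This is an illustrative helper under stated assumptions: `waiting_hint` and `pod_waiting_reason` are hypothetical names, only the first container is inspected, and `Pending` is handled in the empty case because it is a pod phase and not a container waiting reason.

```bash
# waiting_hint <reason>
# Hypothetical helper: maps a container waiting reason to the first thing to check.
waiting_hint() {
  case "$1" in
    ImagePullBackOff|ErrImagePull) echo "check image name/tag and imagePullSecrets" ;;
    CrashLoopBackOff)              echo "check logs with --previous and probe settings" ;;
    ContainerCreating)             echo "check volume mounts and init containers" ;;
    "")                            echo "no waiting container; if Pending, check resources and node selector/affinity" ;;
    *)                             echo "see: kubectl describe pod" ;;
  esac
}

# pod_waiting_reason <pod> <namespace>
# Prints the waiting reason of the first container (empty if it is not waiting).
pod_waiting_reason() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}'
}
```

Usage: `waiting_hint "$(pod_waiting_reason <pod> <ns>)"`.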

Pod Running But Not Receiving Traffic

Check readiness: kubectl get pods -n <ns> (READY column)
Check endpoints: kubectl get endpoints <service> -n <ns>
Check service selector: kubectl describe service <svc> -n <ns>
Test connectivity: kubectl run debug --rm -it --restart=Never --image=busybox -n <ns> -- wget -qO- <service>:<port>
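The endpoints check is the one that most often explains missing traffic, so it can help to script it. A minimal sketch, assuming the classic Endpoints object is available in the cluster; the function name `service_has_endpoints` is hypothetical.

```bash
# service_has_endpoints <service> <namespace>
# Hypothetical helper: succeeds if the Service has at least one ready endpoint address.
# Returns 1 when there are no ready addresses, 2 when kubectl itself fails.
service_has_endpoints() {
  local ips
  ips=$(kubectl get endpoints "$1" -n "$2" \
    -o jsonpath='{.subsets[*].addresses[*].ip}') || return 2
  [ -n "$ips" ]
}
```

An empty result usually means either the Service selector matches no pods or the matching pods are not Ready, which points back at the selector and readiness checks.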

High Restart Count

# Get restart details
kubectl get pods -n <ns> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'

# Check terminated state
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

# Review liveness probe config
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[0].livenessProbe}'
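To narrow the restart listing to pods worth investigating, the same JSONPath output can be filtered with awk. This is a sketch only: `pods_restarting` and its default threshold of 5 are assumptions, and like the command above it reads only the first container of each pod.

```bash
# pods_restarting <namespace> [threshold]
# Hypothetical helper: lists pods whose first container restarted more than
# <threshold> times (default 5), as "name<TAB>count".
pods_restarting() {
  local ns="$1" threshold="${2:-5}"
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
    | awk -F'\t' -v t="$threshold" '$2 + 0 > t { print $1 "\t" $2 }'
}
```

Pods that have no container status yet produce an empty count, which awk treats as 0, so they are never listed.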

Common Operations

Logs

# Single pod
kubectl logs <pod> -n <ns>
kubectl logs <pod> -n <ns> -c <container>  # multi-container
kubectl logs <pod> -n <ns> --previous      # crashed container
kubectl logs <pod> -n <ns> -f              # follow/stream

# All pods with label
kubectl logs -l app=<label> -n <ns> --all-containers

# Since time
kubectl logs <pod> -n <ns> --since=1h
kubectl logs <pod> -n <ns> --since-time="2024-01-01T00:00:00Z"

Exec/Debug

# Interactive shell
kubectl exec -it <pod> -n <ns> -- /bin/sh
kubectl exec -it <pod> -n <ns> -c <container> -- /bin/bash

# Run command
kubectl exec <pod> -n <ns> -- <command>

# Debug with ephemeral container (k8s 1.25+)
kubectl debug -it <pod> -n <ns> --image=busybox --target=<container>

Scaling

# Manual scale
kubectl scale deployment/<name> -n <ns> --replicas=3

# Autoscaling
kubectl autoscale deployment/<name> -n <ns> --min=2 --max=10 --cpu-percent=80
kubectl get hpa -n <ns>

Rollback

# View history
kubectl rollout history deployment/<name> -n <ns>

# Rollback to previous
kubectl rollout undo deployment/<name> -n <ns>

# Rollback to specific revision
kubectl rollout undo deployment/<name> -n <ns> --to-revision=<N>

# Pause/resume rollout
kubectl rollout pause deployment/<name> -n <ns>
kubectl rollout resume deployment/<name> -n <ns>
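Rollout monitoring and rollback combine naturally into a guard that undoes a rollout when it does not complete in time. A minimal sketch, assuming that rolling back to the previous revision is the desired reaction; the name `rollout_or_undo` and the default timeout are hypothetical.

```bash
# rollout_or_undo <deployment-name> <namespace> [timeout]
# Hypothetical helper: waits for the rollout and rolls back to the previous
# revision if it fails or times out. Returns 1 whenever a rollback was triggered.
rollout_or_undo() {
  local name="$1" ns="$2" timeout="${3:-300s}"
  if kubectl rollout status "deployment/$name" -n "$ns" --timeout="$timeout"; then
    return 0
  fi
  echo "rollout of $name did not complete; rolling back" >&2
  kubectl rollout undo "deployment/$name" -n "$ns"
  return 1
}
```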

Resource Management

# Get resource usage
kubectl top pods -n <ns> --sort-by=memory
kubectl top pods -n <ns> --sort-by=cpu

# Describe resource limits
kubectl describe limitrange -n <ns>
kubectl describe resourcequota -n <ns>

# Get requests/limits for pods
kubectl get pods -n <ns> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'

Context & Namespace Management

# View contexts
kubectl config get-contexts
kubectl config current-context

# Switch context
kubectl config use-context <context-name>

# Set default namespace
kubectl config set-context --current --namespace=<ns>

# Create namespace
kubectl create namespace <name>
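Because every command runs against the current context, a script can check that context before doing anything destructive. The guard below is a sketch under the assumption that the expected context name is known in advance; `require_context` is a hypothetical name.

```bash
# require_context <expected-context>
# Hypothetical guard: fails unless kubectl currently points at the expected context.
# Returns 1 on a mismatch, 2 when the current context cannot be read.
require_context() {
  local expected="$1" current
  current=$(kubectl config current-context) || return 2
  if [ "$current" != "$expected" ]; then
    echo "refusing to continue: current context is '$current', expected '$expected'" >&2
    return 1
  fi
}
```

Usage: `require_context <context-name> && kubectl delete -f <manifest> -n <ns>`.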

Output Formats

# Wide output with more columns
kubectl get pods -o wide

# YAML/JSON export
kubectl get deployment <name> -o yaml
kubectl get pod <name> -o json

# Custom columns
kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,IP:.status.podIP

# JSONPath
kubectl get pods -o jsonpath='{.items[*].metadata.name}'
kubectl get secret <name> -o jsonpath='{.data.password}' | base64 -d

Port Forwarding

# Forward pod port
kubectl port-forward pod/<name> <local>:<remote> -n <ns>

# Forward service port
kubectl port-forward svc/<name> <local>:<remote> -n <ns>

# Forward deployment (picks a pod)
kubectl port-forward deployment/<name> <local>:<remote> -n <ns>

Labels & Selectors

# Add label
kubectl label pods <pod> env=prod -n <ns>

# Remove label
kubectl label pods <pod> env- -n <ns>

# Select by label
kubectl get pods -l app=nginx,env=prod -n <ns>
kubectl get pods -l 'env in (prod,staging)' -n <ns>
kubectl delete pods -l app=test -n <ns>

Resource Cleanup

# Delete by manifest
kubectl delete -f <manifest> -n <ns>

# Delete by label
kubectl delete pods -l app=<label> -n <ns>

# Force delete stuck pod (skips graceful shutdown; the container may keep running on the node)
kubectl delete pod <pod> -n <ns> --grace-period=0 --force

# Delete completed/failed pods
kubectl delete pods -n <ns> --field-selector=status.phase=Succeeded
kubectl delete pods -n <ns> --field-selector=status.phase=Failed

Health Probes Reference

Probe Types

Liveness: Is the container alive? Failure → container is restarted
Readiness: Can the container serve traffic? Failure → pod is removed from Service endpoints
Startup: Has the app started? Liveness and readiness checks are held back until it succeeds

Debugging Probes

# Check probe config
kubectl get pod <pod> -n <ns> -o yaml | grep -A10 livenessProbe

# Test HTTP probe manually
kubectl exec <pod> -n <ns> -- wget -qO- localhost:<port>/healthz

# Check probe events
kubectl describe pod <pod> -n <ns> | grep -A5 "Liveness\|Readiness"
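The grep above shows only the liveness probe. To see all three probe definitions side by side, a JSONPath query per probe type works as well. A minimal sketch: `pod_probes` is a hypothetical name, and it inspects only the first container.

```bash
# pod_probes <pod> <namespace>
# Hypothetical helper: prints the startup, liveness, and readiness probe
# config of the first container, one per line (blank if the probe is unset).
pod_probes() {
  local probe
  for probe in startupProbe livenessProbe readinessProbe; do
    printf '%s: ' "$probe"
    kubectl get pod "$1" -n "$2" -o jsonpath="{.spec.containers[0].$probe}"
    printf '\n'
  done
}
```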

Tips

Always pass -n <namespace> explicitly to avoid acting on the wrong namespace
Use --dry-run=client -o yaml to generate manifests: kubectl create deployment <name> --image=<image> --dry-run=client -o yaml
Add --watch to continuously monitor: kubectl get pods -w
Use kubectl explain <resource>.<field> to understand spec fields
Annotate changes: kubectl annotate deployment/<name> kubernetes.io/change-cause="<reason>"
