rohitg00

k8s-troubleshoot

Debug Kubernetes pods, nodes, and workloads. Use when pods are failing, containers crash, nodes are unhealthy, or users mention debugging, troubleshooting, or diagnosing Kubernetes issues.

rohitg00 906 176 Updated 4mo ago

Resources

2
GitHub

Install

npx skillscat add rohitg00/kubectl-mcp-server/k8s-troubleshoot

Install via the SkillsCat registry.

SKILL.md

Kubernetes Troubleshooting

Expert debugging and diagnostics for Kubernetes clusters using kubectl-mcp-server tools.

When to Apply

Use this skill when:

  • User mentions: "debug", "troubleshoot", "diagnose", "failing", "crash", "not starting", "broken"
  • Pod states: Pending, CrashLoopBackOff, ImagePullBackOff, OOMKilled, Error, Unknown
  • Node issues: NotReady, MemoryPressure, DiskPressure, NetworkUnavailable, PIDPressure
  • Keywords: "logs", "events", "describe", "why isn't working", "stuck", "not responding"

Priority Rules

Priority Rule Impact Tools
1 Check pod status first CRITICAL get_pods, describe_pod
2 View recent events CRITICAL get_events
3 Inspect logs (including previous) HIGH get_pod_logs
4 Check resource metrics HIGH get_pod_metrics
5 Verify endpoints MEDIUM get_endpoints
6 Review network policies MEDIUM get_network_policies
7 Examine node status LOW get_nodes, describe_node

Quick Reference

Symptom First Tool Next Steps
Pod Pending describe_pod Check events, node capacity, resource requests
CrashLoopBackOff get_pod_logs(previous=True) Check exit code, resources, liveness probes
ImagePullBackOff describe_pod Verify image name, registry auth, network
OOMKilled get_pod_metrics Increase memory limits, check for memory leaks
ContainerCreating describe_pod Check PVC binding, secrets, configmaps
Terminating (stuck) describe_pod Check finalizers, PDBs, preStop hooks

Diagnostic Workflows

Pod Not Starting

1. get_pods(namespace, label_selector) - Get pod status
2. describe_pod(name, namespace) - See events and conditions
3. get_events(namespace, field_selector="involvedObject.name=<pod>") - Check events
4. get_pod_logs(name, namespace, previous=True) - For crash loops

Common Pod States

State Likely Cause Tools to Use
Pending Scheduling issues describe_pod, get_nodes, get_events
ImagePullBackOff Registry/auth describe_pod, check image name
CrashLoopBackOff App crash get_pod_logs(previous=True)
OOMKilled Memory limit get_pod_metrics, adjust limits
ContainerCreating Volume/network describe_pod, get_pvc

Node Issues

1. get_nodes() - List nodes and status
2. describe_node(name) - See conditions and capacity
3. Check: Ready, MemoryPressure, DiskPressure, PIDPressure
4. node_logs_tool(name, "kubelet") - Kubelet logs

Deep Debugging Workflows

CrashLoopBackOff Investigation

1. get_pod_logs(name, namespace, previous=True) - See why it crashed
2. describe_pod(name, namespace) - Check resource limits, probes
3. get_pod_metrics(name, namespace) - Memory/CPU at crash time
4. If OOM: compare requests/limits to actual usage
5. If app error: check logs for stack trace

Networking Issues

1. get_services(namespace) - Verify service exists
2. get_endpoints(namespace) - Check endpoint backends
3. If empty endpoints: pods don't match selector
4. get_network_policies(namespace) - Check traffic rules
5. For Cilium: cilium_endpoints_list_tool(), hubble_flows_query_tool()

Storage Problems

1. get_pvc(namespace) - Check PVC status
2. describe_pvc(name, namespace) - See binding issues
3. get_storage_classes() - Verify provisioner exists
4. If Pending: check storage class, access modes

DNS Resolution

1. kubectl_exec(pod, namespace, "nslookup kubernetes.default") - Test DNS
2. If fails: check coredns pods in kube-system
3. get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
4. get_pod_logs(name="coredns-*", namespace="kube-system")

Multi-Cluster Debugging

All tools support context parameter for targeting different clusters:

get_pods(namespace="kube-system", context="production-cluster")
get_events(namespace="default", context="staging-cluster")
describe_pod(name="myapp-xyz", namespace="prod", context="prod-east")

Diagnostic Scripts

For comprehensive diagnostics, run the bundled scripts:

Decision Tree

See references/DECISION-TREE.md for visual troubleshooting flowcharts.

Common Errors Reference

See references/COMMON-ERRORS.md for error message explanations and fixes.

Related Tools

Core Diagnostics

  • get_pods, describe_pod, get_pod_logs, get_pod_metrics
  • get_events, get_nodes, describe_node
  • get_resource_usage, compare_namespaces

Advanced (Ecosystem)

  • Cilium: cilium_endpoints_list_tool, hubble_flows_query_tool
  • Istio: istio_proxy_status_tool, istio_analyze_tool

Related Skills