
Kubernetes CrashLoopBackOff: The Definitive Troubleshooting Guide for Production Clusters

Your pod is stuck in CrashLoopBackOff. The restart count is climbing. PagerDuty is firing. You have about three minutes before someone asks what is going on. I have debugged this exact scenario hundreds of times across production EKS, GKE, and bare-metal clusters. Here is the systematic approach that works every single time.

TL;DR: The 60-Second Diagnostic

Run these two commands first:

kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

Check the Last State reason in the describe output:

  • OOMKilled = container exceeded its memory limit. Raise resources.limits.memory
  • Error (exit code 1) = application crashed. Check logs with --previous flag
  • ContainerCannotRun = bad entrypoint, wrong command, or missing binary
  • ImagePullBackOff (shown under State, not Last State, because the container never started) = wrong image tag, expired registry credentials, or private repo without imagePullSecrets

What CrashLoopBackOff Actually Means

CrashLoopBackOff is not an error. It is a status. Kubernetes is telling you that a container in your pod started, crashed, and now Kubernetes is waiting before restarting it again. The "BackOff" part refers to the exponential backoff timer that Kubernetes applies between restart attempts.

The backoff sequence works like this: 10 seconds, 20 seconds, 40 seconds, 80 seconds, 160 seconds, and then it caps at 300 seconds (5 minutes). After a successful run of 10 minutes, the backoff timer resets. This means if your container crashes immediately on startup, you will see restarts spaced further and further apart, up to a maximum of 5 minutes between attempts.
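That schedule can be sketched in a few lines of shell — purely illustrative, since the real timer lives inside the kubelet and is not configurable per pod:

```shell
# Illustrative sketch of the restart backoff: doubles from 10s, capped at 300s.
backoff_schedule() {
  delay=10
  for attempt in 1 2 3 4 5 6 7; do
    echo "restart ${attempt}: wait ${delay}s"
    delay=$(( delay * 2 ))
    if [ "$delay" -gt 300 ]; then delay=300; fi
  done
}
backoff_schedule
```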

Understanding this is critical because it tells you something important: Kubernetes is doing exactly what it should. The container is crashing, and Kubernetes is giving it progressively more breathing room before trying again. The real question is: why is the container crashing?

There are five primary categories of CrashLoopBackOff causes that account for over 95% of incidents I have seen in production. We will walk through each one with the exact kubectl commands to diagnose and fix them.

Step 1: Get the Pod Status and Read the Exit Code

Before you do anything else, get the pod status. This single command gives you 80% of the information you need.

# List all pods in the namespace, sorted by restart count
kubectl get pods -n production --sort-by='.status.containerStatuses[0].restartCount'

# Get detailed status for the crashing pod
kubectl describe pod my-app-7d4b8c6f9-x2k4p -n production

In the describe output, scroll to the Containers section. You are looking for three things:

  • State: Will show Waiting with reason CrashLoopBackOff
  • Last State: Shows Terminated with the reason and exit code from the last crash
  • Restart Count: How many times Kubernetes has restarted this container

The exit code in Last State is your most important clue. Here is what each common exit code means:

  • Exit Code 0: Container exited successfully. This happens when your process completes and exits cleanly, but Kubernetes expects it to keep running. Common with Jobs configured as Deployments by mistake.
  • Exit Code 1: Application error. The process threw an unhandled exception or returned a non-zero status. Check the logs.
  • Exit Code 137: OOMKilled. The container exceeded its memory limit and the kernel killed it with SIGKILL (128 + 9 = 137).
  • Exit Code 139: Segmentation fault (128 + 11). The process accessed invalid memory. Usually a bug in native code or a corrupted binary.
  • Exit Code 143: Container received SIGTERM (128 + 15) and exited. Kubernetes sends SIGTERM to begin a graceful shutdown; if the process does not exit within terminationGracePeriodSeconds, it is force-killed with SIGKILL (exit 137) instead. A 143 often means a failing liveness probe triggered the restart.
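For the Exit Code 0 case, the fix is usually a workload-type change rather than a code change. A minimal sketch of a one-shot task expressed as a Job (the name, image, and command are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
spec:
  backoffLimit: 3            # retry up to 3 times before marking the Job failed
  template:
    spec:
      restartPolicy: Never   # a clean exit is success here, not a crash
      containers:
      - name: migrate
        image: my-app:v2.1.0
        command: ["./migrate"]
```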

Step 2: OOMKilled (Exit Code 137)

This is the single most common cause of CrashLoopBackOff in production. Your container tried to use more memory than its limit allows, and the Linux OOM killer terminated it. There is no warning, no graceful shutdown. The process just dies.

First, confirm the OOMKill:

# Check the Last State reason
kubectl describe pod my-app-7d4b8c6f9-x2k4p -n production | grep -A 5 "Last State"

# Output will show:
#   Last State:  Terminated
#     Reason:    OOMKilled
#     Exit Code: 137

Next, check current memory usage across your pods:

# Requires metrics-server to be installed
kubectl top pod -n production

# Example output:
# NAME                      CPU(cores)   MEMORY(bytes)
# my-app-7d4b8c6f9-x2k4p   45m          480Mi
# my-app-7d4b8c6f9-r8j2n   52m          512Mi

If pods are consistently using memory close to the limit, you need to raise the limit. Here is the fix in your deployment YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    spec:
      containers:
      - name: my-app
        image: my-app:latest
        resources:
          requests:
            memory: "256Mi"   # Scheduler reservation
            cpu: "100m"
          limits:
            memory: "512Mi"   # OOM kill threshold
            cpu: "500m"

A critical concept here: requests vs limits. The request is what the scheduler uses to place your pod on a node. The limit is the hard ceiling. If the container exceeds the memory limit, it gets OOMKilled. If it exceeds the CPU limit, it gets throttled (not killed).

My rule of thumb for production: set the memory request to your application's steady-state usage, and set the limit to 1.5x to 2x the request. This gives the application room for spikes without wasting cluster resources. For Java applications, set -Xmx to about 75% of the memory limit to leave room for off-heap memory, thread stacks, and the JVM itself.
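As a concrete sketch of that heap rule (all numbers illustrative): with a 512Mi limit, cap the heap around 384m. Recent JVMs pick up JAVA_TOOL_OPTIONS automatically, so no image change is needed:

```yaml
containers:
- name: my-app
  image: my-app:v2.1.0
  env:
  - name: JAVA_TOOL_OPTIONS
    value: "-Xmx384m"        # ~75% of the 512Mi limit below
  resources:
    requests:
      memory: "256Mi"        # steady-state usage
    limits:
      memory: "512Mi"        # 2x the request
```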

# Quick check: is this node itself under memory pressure?
kubectl describe node <node-name> | grep -A 5 "Conditions"

# Check memory pressure across all nodes
kubectl top nodes

Step 3: Application Error (Exit Code 1)

When the exit code is 1, the application itself is crashing. The fix depends entirely on what the application logs say. Here is how to get them:

# Get logs from the PREVIOUS (crashed) container
kubectl logs my-app-7d4b8c6f9-x2k4p -n production --previous

# If the pod has multiple containers, specify which one
kubectl logs my-app-7d4b8c6f9-x2k4p -n production -c my-app --previous

# Stream logs from a running (but soon to crash) container
kubectl logs my-app-7d4b8c6f9-x2k4p -n production -f

The --previous flag is essential. Without it, you are looking at logs from the current container instance, which might be empty if the container just started. The previous logs show you what happened right before the crash.

The most common application-level causes I see in production:

Missing Environment Variables

The application starts, tries to read a required environment variable, gets an empty string, and crashes. Check that your ConfigMap or Secret is mounted correctly:

# Check if the ConfigMap exists
kubectl get configmap my-app-config -n production -o yaml

# Check if the Secret exists (values will be base64 encoded)
kubectl get secret my-app-secrets -n production -o yaml

# Verify env vars are set inside the pod
kubectl exec my-app-7d4b8c6f9-x2k4p -n production -- env | grep DATABASE

Failed Database Connection

The application tries to connect to a database on startup, the connection times out or is refused, and the application exits. This often happens after a database failover, a security group change, or when the database is in a different VPC. Verify network connectivity from inside the pod:

# Test database connectivity from inside the pod
kubectl exec my-app-7d4b8c6f9-x2k4p -n production -- nc -zv db-host.rds.amazonaws.com 3306

# Check DNS resolution
kubectl exec my-app-7d4b8c6f9-x2k4p -n production -- nslookup db-host.rds.amazonaws.com

Config File Not Mounted

The application expects a config file at a specific path, but the volume mount is missing or pointing to the wrong path. Check your volume mounts:

# List files at the expected config path
kubectl exec my-app-7d4b8c6f9-x2k4p -n production -- ls -la /app/config/

# Check if the volume mount exists in the pod spec
kubectl get pod my-app-7d4b8c6f9-x2k4p -n production -o json | jq '.spec.containers[0].volumeMounts'

Step 4: Failed Liveness and Readiness Probes

This one is sneaky. Your application might be perfectly healthy, but a misconfigured liveness probe is killing it. Kubernetes uses three types of probes:

  • Startup Probe: Runs only during container startup. If it fails, the container is killed and restarted. Other probes are disabled until the startup probe succeeds.
  • Liveness Probe: Runs continuously after startup. If it fails, Kubernetes kills and restarts the container.
  • Readiness Probe: Runs continuously. If it fails, the pod is removed from Service endpoints (no traffic) but NOT restarted.

The most common mistake: setting a liveness probe with aggressive timing on a slow-starting application. The app takes 60 seconds to initialize, but the liveness probe starts checking at 10 seconds with a 3-second timeout and 3 failure threshold. The app gets killed before it ever finishes starting.

Check if probes are causing the restarts:

# Look for probe failure events
kubectl describe pod my-app-7d4b8c6f9-x2k4p -n production | grep -A 3 "Liveness\|Readiness\|Startup"

# Check events for probe failures
kubectl get events -n production --field-selector involvedObject.name=my-app-7d4b8c6f9-x2k4p

If you see Liveness probe failed in the events, here is the correct way to configure all three probes:

containers:
- name: my-app
  image: my-app:latest
  ports:
  - containerPort: 8080
  # Startup probe: give the app up to 5 minutes to start
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    failureThreshold: 30
    periodSeconds: 10
  # Liveness probe: restart if the app becomes unresponsive
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 0
    periodSeconds: 15
    timeoutSeconds: 5
    failureThreshold: 3
  # Readiness probe: stop sending traffic if the app is busy
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 10
    timeoutSeconds: 3
    failureThreshold: 3

The startup probe is the key here. With failureThreshold: 30 and periodSeconds: 10, the application gets up to 300 seconds (5 minutes) to start. During that time, the liveness and readiness probes are disabled. Once the startup probe succeeds, liveness and readiness probes take over. This pattern works for Java apps with Spring Boot, Python apps loading large ML models, and any application with variable startup times.

Pro Tip: Debug Containers That Crash Too Fast

If the container crashes instantly, you cannot exec into it. Use this workaround to get an interactive shell with the same image:

# Override the entrypoint to sleep instead of starting the app
kubectl run debug-pod --image=my-app:latest --restart=Never \
  --command -- sleep 3600

# Now exec into it and poke around
kubectl exec -it debug-pod -- /bin/sh

# Check if the binary exists, config files are present, etc.
ls -la /app/
cat /app/config/application.yml
env | sort

# Clean up when done
kubectl delete pod debug-pod

For running containers that have not crashed yet, use kubectl exec -it <pod> -- /bin/sh directly. If the image does not have a shell (distroless images), use ephemeral debug containers:

kubectl debug -it my-app-7d4b8c6f9-x2k4p --image=busybox --target=my-app

Step 5: Image Pull Failures

Sometimes the pod never even gets to start the container because Kubernetes cannot pull the image. You will see two related statuses:

  • ErrImagePull: The initial pull attempt failed
  • ImagePullBackOff: Kubernetes is backing off before retrying the pull

Common causes and their fixes:

Wrong Image Tag

# Check what image the pod is trying to pull
kubectl describe pod my-app-7d4b8c6f9-x2k4p -n production | grep "Image:"

# Verify the image exists in your registry
aws ecr describe-images --repository-name my-app --image-ids imageTag=v2.1.0

# Common mistake: using "latest" tag when the registry has no "latest"
# Fix: always use explicit version tags in production

Private Registry Without Credentials

If you are pulling from a private registry (ECR, GCR, Docker Hub private repos), the node needs credentials. For EKS with ECR, this is handled automatically through the node IAM role. For other registries, you need an imagePullSecret:

# Create a docker-registry secret
kubectl create secret docker-registry my-registry-cred \
  --docker-server=registry.example.com \
  --docker-username=myuser \
  --docker-password=mypass \
  -n production

# Reference it in your deployment
spec:
  imagePullSecrets:
  - name: my-registry-cred
  containers:
  - name: my-app
    image: registry.example.com/my-app:v2.1.0

ECR Token Expiry

ECR authentication tokens expire every 12 hours. If your nodes have been running for a while and the kubelet cached an expired token, new pulls will fail. The fix depends on your setup:

# For EKS: ensure the node IAM role has ecr:GetAuthorizationToken
# The kubelet credential provider handles token refresh automatically

# For non-EKS clusters using ECR, use a CronJob to refresh the secret
# or install the ecr-credential-helper
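One possible shape for that CronJob — a sketch, not a drop-in: the schedule, region, registry URL, secret name, and service account are placeholders, the job image must contain both the aws CLI and kubectl, and the service account needs RBAC permission to manage secrets in the namespace:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ecr-token-refresh
spec:
  schedule: "0 */8 * * *"               # well inside the 12-hour token lifetime
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: ecr-refresher
          restartPolicy: OnFailure
          containers:
          - name: refresh
            image: my-tools:latest      # placeholder: needs aws CLI + kubectl
            command:
            - /bin/sh
            - -c
            - |
              TOKEN=$(aws ecr get-login-password --region us-east-1)
              kubectl delete secret my-registry-cred --ignore-not-found
              kubectl create secret docker-registry my-registry-cred \
                --docker-server=123456789012.dkr.ecr.us-east-1.amazonaws.com \
                --docker-username=AWS \
                --docker-password="$TOKEN"
```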

Step 6: Volume Mount and Secret Failures

Volume-related failures prevent the container from starting at all. The pod will be stuck in ContainerCreating or immediately crash with a mount error.

PVC Not Bound

# Check PVC status
kubectl get pvc -n production

# If status is "Pending", check why
kubectl describe pvc my-data-pvc -n production

# Common causes:
# - No StorageClass matches the PVC request
# - EBS CSI driver not installed
# - Availability zone mismatch (EBS volumes are AZ-specific)
# - Insufficient capacity in the storage backend

Secret or ConfigMap Not Found

# The pod spec references a secret that does not exist
# Events will show:
# "MountVolume.SetUp failed: secret 'my-app-tls' not found"

# Verify the secret exists in the correct namespace
kubectl get secret my-app-tls -n production

# Check for typos in the volume mount name
kubectl get pod my-app-7d4b8c6f9-x2k4p -n production \
  -o json | jq '.spec.volumes'

Wrong Key Name in Secret

Your volume mount references a specific key in a Secret or ConfigMap, but that key does not exist. This is a silent failure that often results in an empty file being mounted:

# Check what keys exist in the secret
kubectl get secret my-app-config -n production -o json | jq '.data | keys'

# If you expect a key called "config.yaml" but it is actually "config.yml"
# your application will read an empty file and crash

Exit Code Reference Table

Exit Code | Meaning | Common Cause | Fix
0 | Success (clean exit) | Process completed normally; wrong workload type (should be a Job, not a Deployment). | Use a Job or CronJob instead of a Deployment, or fix the app to run as a long-lived process.
1 | General error | Application exception, missing config, failed dependency connection. | Check kubectl logs --previous. Fix the application error.
2 | Shell misuse | Incorrect shell script syntax, wrong command arguments. | Validate your entrypoint script. Check for bash vs sh compatibility.
126 | Permission denied | Entrypoint binary exists but is not executable. | Run chmod +x on the entrypoint in your Dockerfile. Check securityContext settings.
127 | Command not found | Entrypoint binary does not exist in the container image. | Verify the binary path. Check multi-stage build copies. Ensure the image is correct.
137 | SIGKILL (OOMKilled) | Container exceeded its memory limit; the kernel OOM killer terminated the process. | Increase resources.limits.memory. Profile the application for memory leaks.
139 | SIGSEGV (segfault) | Invalid memory access: corrupted binary or native library bug. | Rebuild the image. Check for architecture mismatch (amd64 vs arm64).
143 | SIGTERM | Process received SIGTERM and exited; often a liveness-probe-triggered restart. | Handle SIGTERM in your application; raise terminationGracePeriodSeconds if shutdown needs more time.
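The 128 + signal arithmetic in the table is easy to verify locally: kill a throwaway shell with each signal and read its exit status.

```shell
# A process terminated by signal N exits with status 128 + N.
code_for() {
  sh -c "kill -$1 \$\$" || echo $?   # spawn a shell, kill it, capture the status
}
echo "SIGKILL (9)  -> $(code_for KILL)"   # 137
echo "SIGTERM (15) -> $(code_for TERM)"   # 143
echo "SIGSEGV (11) -> $(code_for SEGV)"   # 139
```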

Preventing CrashLoopBackOff in Production

Debugging CrashLoopBackOff is reactive. Here is how to prevent it from happening in the first place.

Set Resource Quotas and LimitRanges

Enforce sane defaults at the namespace level so no pod can be deployed without resource limits:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - default:
      memory: "512Mi"
      cpu: "500m"
    defaultRequest:
      memory: "256Mi"
      cpu: "100m"
    type: Container

Use PodDisruptionBudgets

PDBs prevent Kubernetes from evicting too many pods at once during node drains or cluster upgrades:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app

Add Init Containers for Dependencies

Instead of crashing because the database is not ready, use an init container to wait for it:

initContainers:
- name: wait-for-db
  image: busybox:1.36
  command:
  - sh
  - -c
  - |
    until nc -z db-service.production.svc.cluster.local 5432; do
      echo "Waiting for database..."
      sleep 2
    done

Test with the Same Image in CI

Build the image once in CI and deploy the exact same image tag to staging and production. Never use :latest in production. Tag images with the git SHA or a semantic version. This eliminates the entire class of "works on my machine" issues.

Implement Proper Health Endpoints

Your application should expose separate endpoints for liveness, readiness, and startup checks. The liveness endpoint should return 200 if the process is alive. The readiness endpoint should return 200 only if the application can serve traffic (database connected, caches warmed, etc.). Do not make your liveness probe depend on external services, or a database outage will cascade into a pod restart storm.

Common Mistakes That Make Debugging Harder

After years of production Kubernetes operations, here are the mistakes I see teams make repeatedly:

  1. Not using the --previous flag. Without it, kubectl logs shows the current container instance, which might be empty or only show startup messages. The crash information is in the previous instance.
  2. Setting memory limit equal to request. This gives the application zero burst headroom. If memory usage spikes by even 1 byte over the limit, the container gets OOMKilled. Set limits to 1.5x or 2x the request.
  3. Liveness probe too aggressive. A periodSeconds: 5 with failureThreshold: 1 means one slow response kills your container. Use failureThreshold: 3 minimum, and prefer periodSeconds: 15 or higher.
  4. Missing startup probe for slow applications. Java applications, Python ML services, and anything loading large models at startup need a startup probe. Without one, the liveness probe starts checking too early and kills the container before it finishes initializing.
  5. Ignoring events. kubectl get events -n production --sort-by='.lastTimestamp' often shows the root cause (failed mount, failed pull, probe failure) before you even look at logs.
  6. Not checking node-level resources. If the node itself is under memory or disk pressure, Kubernetes will evict pods. Run kubectl describe node and check the Conditions section for MemoryPressure, DiskPressure, and PIDPressure.
  7. Deploying without resource limits. A container without limits can consume all memory on a node, causing OOM kills on every other pod running there.

Frequently Asked Questions

How long does CrashLoopBackOff last? Will it resolve on its own?

The backoff timer caps at 5 minutes between restart attempts. Kubernetes will keep retrying indefinitely. If the underlying issue resolves (for example, a database comes back online), the container will eventually start successfully and the backoff timer resets. However, for permanent issues like misconfigured environment variables or OOM kills, it will loop forever until you fix the root cause.

What is the difference between CrashLoopBackOff and ImagePullBackOff?

CrashLoopBackOff means the container image was pulled successfully, the container started, and then crashed. ImagePullBackOff means Kubernetes could not download the container image at all. The causes are completely different: CrashLoopBackOff is an application or resource problem, while ImagePullBackOff is a registry, authentication, or network problem.

Can CrashLoopBackOff cause data loss?

Yes, potentially. If your application writes to an emptyDir volume (which is ephemeral), all data is lost on every restart. If it writes to a PVC, the data persists across restarts. However, if the crash happens mid-write, you could end up with corrupted files or incomplete database transactions. This is why proper shutdown handling (catching SIGTERM and flushing writes) is critical.
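That shutdown handling can be sketched in shell. The example traps SIGTERM in a child shell and simulates Kubernetes sending the signal; the `echo flushed` is a placeholder for your application's real cleanup (flushing writes, closing connections):

```shell
# Trap SIGTERM, run cleanup, and exit 0 instead of dying with 143.
out=$(
  sh -c '
    trap "echo flushed; exit 0" TERM   # cleanup hook (placeholder body)
    kill -TERM $$                      # simulate the kubelet sending SIGTERM
    sleep 30                           # stand-in for the main serve loop
  '
)
echo "$out"
```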

How do I fix CrashLoopBackOff on a pod with no shell?

Distroless and scratch-based images have no shell, so kubectl exec does not work. Use ephemeral debug containers instead: kubectl debug -it <pod> --image=busybox --target=<container>. This attaches a temporary container to the pod that shares the process namespace, so you can inspect the filesystem and running processes. Alternatively, run a separate debug pod with the same image but override the entrypoint to sleep.

Should I set CPU limits on my pods?

This is debated in the Kubernetes community. CPU limits cause throttling, not OOM kills. A throttled container runs slowly but does not crash. Many production teams set CPU requests (for scheduling) but leave CPU limits unset, allowing pods to burst when the node has spare capacity. Memory limits, on the other hand, should always be set because exceeding them causes OOMKills and CrashLoopBackOff.

The Bottom Line

CrashLoopBackOff is a symptom, not a root cause. The debugging process is always the same: run kubectl describe pod to get the exit code, run kubectl logs --previous to get the crash logs, and work from there. In my experience, 90% of production CrashLoopBackOff incidents fall into three buckets: OOMKilled (raise memory limits), application config errors (check env vars and secrets), and probe failures (add a startup probe). Master these three, and you will resolve most incidents in under five minutes.

For a deeper dive into securing your Kubernetes infrastructure, check out our AWS Security Checklist for Production and Open Ports Security Risks guides. Use our Exposure Checker to scan your public-facing services for misconfigurations.

Written by Usman Khan
DevOps Engineer | MSc Cybersecurity | CEH | AWS Solutions Architect

Usman has 10+ years of experience running production Kubernetes clusters, managing high-traffic infrastructure on AWS, and building zero-knowledge security tools. Read more about the author.