
Nginx 502 Bad Gateway: The Complete Troubleshooting Playbook for Production

It is 2 AM. Your monitoring fires. Every page on the site returns a white screen with "502 Bad Gateway." Traffic is dropping. Revenue is bleeding. This guide is the playbook I wish I had the first time I got that page.

TL;DR: Quick Diagnostic

Run this first:

sudo tail -f /var/log/nginx/error.log | grep -i "upstream"

The three most common causes of Nginx 502 errors:

  • PHP-FPM is down or crashed (check systemctl status php-fpm)
  • Socket path mismatch between Nginx fastcgi_pass and PHP-FPM listen
  • Upstream timeout (backend took too long to respond, Nginx gave up)

Fix the upstream, not Nginx. Nginx is just the messenger.

Why Nginx Returns 502

Before you start changing configuration files, understand what a 502 actually means. Nginx is a reverse proxy. It sits in front of your application (PHP-FPM, Node.js, Gunicorn, uWSGI) and forwards requests to it. When Nginx returns a 502 Bad Gateway, it is telling you: "I tried to reach the upstream backend, and it either refused the connection, returned garbage, or died mid-response."

This is not an Nginx bug. Nginx is doing exactly what it should. It attempted to proxy the request, the backend failed, and Nginx reported the failure honestly. The fix is almost always on the backend side, not in Nginx configuration.

In my experience running high-traffic Magento and PHP platforms, roughly 80% of 502 errors trace back to PHP-FPM. The remaining 20% split between socket misconfigurations, upstream timeouts, and resource exhaustion. Let us walk through each one systematically.

Step 1: Read the Error Log First

Every Nginx 502 troubleshooting session starts in the same place: the error log. Do not guess. Do not start tweaking timeouts. Read the log.

# Watch the error log in real time
sudo tail -f /var/log/nginx/error.log

# Or search for recent upstream errors
sudo grep "upstream" /var/log/nginx/error.log | tail -50

# If you use a custom log path, check your nginx.conf
grep error_log /etc/nginx/nginx.conf

Nginx writes specific error messages for each failure mode. Here is what each one means:

connect() failed (111: Connection refused)

This is the most common message. It means Nginx tried to connect to the upstream (usually PHP-FPM on a socket or TCP port) and nothing was listening. PHP-FPM is either stopped, crashed, or listening on a different address than Nginx expects.

2026/04/06 02:14:33 [error] 1234#0: *5678 connect() failed
(111: Connection refused) while connecting to upstream,
client: 10.0.0.1, server: example.com,
upstream: "fastcgi://127.0.0.1:9000"

connect() failed (2: No such file or directory)

Nginx is trying to connect to a Unix socket file that does not exist. Either PHP-FPM is down (so the socket file was removed), or the socket path in your Nginx config does not match the PHP-FPM pool configuration.

2026/04/06 02:14:33 [error] 1234#0: *5678 connect() failed
(2: No such file or directory) while connecting to upstream,
upstream: "fastcgi://unix:/run/php/php8.2-fpm.sock"
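A fast way to confirm this case is to test whether the socket file exists and is actually a socket. A minimal sketch, assuming the Debian/Ubuntu default path for PHP 8.2 (adjust to match your fastcgi_pass directive):

```shell
# Verify the Unix socket Nginx is pointing at actually exists.
# Path is the Debian/Ubuntu default; yours may differ.
sock=/run/php/php8.2-fpm.sock
if [ -S "$sock" ]; then
  echo "socket present: $sock"
else
  echo "socket missing: $sock -- PHP-FPM is down or listening elsewhere"
fi
```

If the file is missing, jump straight to Step 2 (is PHP-FPM running?) and Step 3 (do the paths match?).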

upstream timed out (110: Connection timed out)

Nginx connected to the backend successfully, but the backend did not respond within the configured timeout. The request is still running on the backend (or it hung). This is different from the backend being down.

2026/04/06 02:14:33 [error] 1234#0: *5678 upstream timed out
(110: Connection timed out) while reading response header
from upstream

recv() failed (104: Connection reset by peer)

The backend accepted the connection, started processing, then abruptly closed it. This usually means PHP-FPM crashed mid-request (segfault, OOM kill, or a fatal PHP error).

2026/04/06 02:14:33 [error] 1234#0: *5678 recv() failed
(104: Connection reset by peer) while reading response
header from upstream

Once you identify which error message you are seeing, you know exactly where to look next.
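To see at a glance which failure mode dominates, tally the messages. The heredoc below is sample data for illustration; in production, pipe in the real log with `grep upstream /var/log/nginx/error.log` instead:

```shell
# Count occurrences of each 502 failure mode. The heredoc holds sample
# lines; replace it with real input:
#   grep upstream /var/log/nginx/error.log
grep -oE 'Connection refused|No such file or directory|timed out|Connection reset by peer' <<'EOF' | sort | uniq -c | sort -rn
[error] connect() failed (111: Connection refused) while connecting to upstream
[error] connect() failed (111: Connection refused) while connecting to upstream
[error] upstream timed out (110: Connection timed out) while reading response header
EOF
# output (one count per failure mode, highest first):
#   2 Connection refused
#   1 timed out
```

The pattern with the highest count tells you which of the steps below to start with.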

Step 2: PHP-FPM Is Down or Crashed

This is the cause of the majority of 502 errors I have debugged in production. PHP-FPM either stopped running entirely, or it ran out of worker processes and cannot accept new connections.

Check if PHP-FPM is running

# Check the service status
sudo systemctl status php8.2-fpm

# Check if the process is actually running
ps aux | grep php-fpm | grep -v grep

# Check if the socket or port is listening
sudo ss -tlnp | grep php
# or for Unix sockets:
ls -la /run/php/php8.2-fpm.sock

If PHP-FPM is not running, start it and check the logs for why it stopped:

# Start PHP-FPM
sudo systemctl start php8.2-fpm

# Check PHP-FPM logs for crash reason
sudo journalctl -u php8.2-fpm --since "1 hour ago" --no-pager

# Also check the PHP-FPM log file directly
sudo tail -100 /var/log/php8.2-fpm.log

The OOM Killer strikes again

On servers running memory-heavy applications (Magento, WordPress with WooCommerce, Laravel with large queues), the Linux OOM killer will terminate PHP-FPM when the system runs out of RAM. This is silent from PHP-FPM's perspective. The process just vanishes.

# Check if OOM killer terminated PHP-FPM
dmesg | grep -i oom
# or
sudo journalctl -k | grep -i "out of memory"

# Check current memory usage
free -h

# See which processes are using the most memory
ps aux --sort=-%mem | head -20

If you find OOM kill entries mentioning php-fpm, you have two options: add more RAM, or reduce PHP-FPM memory consumption by tuning the pool configuration.

pm.max_children exhaustion

Even when PHP-FPM is running, if all worker processes are busy, new requests queue up. Once the queue fills, Nginx gets a connection refused. This is the "silent 502" because PHP-FPM is technically running but unable to accept work.

# Check the PHP-FPM pool config
sudo cat /etc/php/8.2/fpm/pool.d/www.conf | grep -E "^(pm|pm\.)"

# Key settings you will see:
# pm = dynamic
# pm.max_children = 5        <-- often way too low
# pm.start_servers = 2
# pm.min_spare_servers = 1
# pm.max_spare_servers = 3

The default pm.max_children = 5 is absurdly low for any production workload. Each PHP-FPM worker handles one request at a time. Five workers means five concurrent requests. Request number six gets queued or refused.

To calculate an appropriate value:

# Find average memory per PHP-FPM worker
ps --no-headers -o rss -C php-fpm | awk '{ sum += $1 } END { printf "Average: %.0f MB\n", sum/NR/1024 }'

# Formula: max_children = (Total RAM - RAM for OS and other services) / Average worker memory
# Example: Server has 8 GB, OS and MySQL use 3 GB, each worker uses 80 MB
# max_children = (8192 - 3072) / 80 = 64
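The same formula as a runnable snippet, using the example's assumed numbers (plug in your own measurements from the ps command above):

```shell
# Worked example of the max_children formula; all values are the
# example's assumptions, not measurements from your server.
total_mb=8192       # total server RAM
reserved_mb=3072    # OS + MySQL + everything that is not PHP-FPM
worker_mb=80        # average RSS per worker, from the ps command above
echo "max_children = $(( (total_mb - reserved_mb) / worker_mb ))"
# prints: max_children = 64
```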

Step 3: Socket Path Mismatch

This is the classic "it worked yesterday" problem. Someone upgraded PHP (7.4 to 8.2, for example), and the socket path changed. Nginx is still pointing to the old socket. Or the server was rebuilt and a different PHP-FPM pool config was deployed.

You need to verify that Nginx and PHP-FPM agree on the connection method and path.

Check the Nginx side

# Find the fastcgi_pass directive in your site config
sudo grep -r "fastcgi_pass" /etc/nginx/sites-enabled/
# or
sudo grep -r "fastcgi_pass" /etc/nginx/conf.d/

# You will see one of these:
# fastcgi_pass unix:/run/php/php8.2-fpm.sock;   (Unix socket)
# fastcgi_pass 127.0.0.1:9000;                   (TCP)

Check the PHP-FPM side

# Find the listen directive in the pool config
sudo grep "^listen" /etc/php/8.2/fpm/pool.d/www.conf

# You will see one of these:
# listen = /run/php/php8.2-fpm.sock   (Unix socket)
# listen = 127.0.0.1:9000              (TCP)

These two values must match exactly. If Nginx says fastcgi_pass unix:/run/php/php7.4-fpm.sock but PHP-FPM is listening on /run/php/php8.2-fpm.sock, you will get a 502 every time.
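The comparison itself is trivial once you strip the unix: prefix that only Nginx uses. A sketch using the hypothetical mismatched values from the example above:

```shell
# Compare the two sides; these values reproduce the hypothetical
# mismatch described in the text.
nginx_side="unix:/run/php/php7.4-fpm.sock"   # from fastcgi_pass
fpm_side="/run/php/php8.2-fpm.sock"          # from listen =
if [ "${nginx_side#unix:}" = "$fpm_side" ]; then
  echo "OK: paths match"
else
  echo "MISMATCH: nginx -> ${nginx_side#unix:}  php-fpm -> $fpm_side"
fi
```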

TCP vs Unix socket: which to use?

Unix sockets are faster (no TCP overhead) and more secure (filesystem permissions control access). TCP sockets are required when Nginx and PHP-FPM run on different machines or in different containers.

# Unix socket config (recommended for same-server setups)
# Nginx:
fastcgi_pass unix:/run/php/php8.2-fpm.sock;

# PHP-FPM pool:
listen = /run/php/php8.2-fpm.sock
listen.owner = www-data
listen.group = www-data
listen.mode = 0660

# TCP config (required for cross-server or containerized setups)
# Nginx:
fastcgi_pass 127.0.0.1:9000;

# PHP-FPM pool:
listen = 127.0.0.1:9000

After fixing the mismatch, reload both services:

sudo systemctl reload php8.2-fpm
sudo nginx -t && sudo systemctl reload nginx

Step 4: Upstream Timeout

If the error log shows "upstream timed out," the backend is alive but slow. Nginx waited for a response, hit its timeout limit, and gave up with a 502.

The default fastcgi_read_timeout is 60 seconds. For most web requests, that should be plenty. If your application routinely needs more than 60 seconds to respond, the real fix is to optimize the application, not increase the timeout.

When to increase the timeout

There are legitimate cases for longer timeouts: report generation, large file exports, bulk data imports, or long-running cron jobs triggered via HTTP. For these specific routes, increase the timeout selectively:

# In your Nginx server block, for PHP-FPM:
location ~ \.php$ {
    fastcgi_pass unix:/run/php/php8.2-fpm.sock;
    fastcgi_read_timeout 60s;    # default, fine for most requests
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
}

# For a specific slow endpoint, use a longer timeout:
location = /admin/export-report {
    fastcgi_pass unix:/run/php/php8.2-fpm.sock;
    fastcgi_read_timeout 300s;   # 5 minutes for this route only
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
}

For reverse proxy setups (Node.js, Gunicorn, etc.)

# Reverse proxy to a Node.js or Python backend
location / {
    proxy_pass http://127.0.0.1:3000;
    proxy_read_timeout 60s;
    proxy_connect_timeout 10s;
    proxy_send_timeout 60s;

    # Also set buffer sizes to avoid 502 on large responses
    proxy_buffer_size 128k;
    proxy_buffers 4 256k;
    proxy_busy_buffers_size 256k;
}

When NOT to increase the timeout

If every page load takes 30+ seconds and you are tempted to set fastcgi_read_timeout 600s, stop. You are masking the real problem. Common culprits: unindexed database queries, missing caching layers, external API calls without timeouts, or N+1 query patterns. Fix the slow code. Do not make Nginx wait for it.

Pro Tip: Enable the PHP-FPM Status Page

The pm.status_path directive lets you see PHP-FPM worker status in real time. This is invaluable when debugging 502 errors under load.

# In /etc/php/8.2/fpm/pool.d/www.conf:
pm.status_path = /fpm-status

# In Nginx (restrict to localhost or your monitoring IP):
location = /fpm-status {
    allow 127.0.0.1;
    allow 10.0.0.0/8;    # your monitoring subnet
    deny all;
    fastcgi_pass unix:/run/php/php8.2-fpm.sock;
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
}

# Then check it:
curl http://127.0.0.1/fpm-status?full

The output shows active processes, idle count, listen queue length, max children reached count, and per-process request details. If "listen queue" is consistently above zero or "max children reached" keeps climbing, you need more workers.
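Once the status page is up, a small pipeline can pull out just those two early-warning numbers. The printf block below stands in for a hypothetical response body; in practice, pipe in `curl -s http://127.0.0.1/fpm-status` instead:

```shell
# Extract the early-warning metrics from an /fpm-status response.
# The printf lines are hypothetical sample data; replace them with:
#   curl -s http://127.0.0.1/fpm-status
printf '%s\n' \
  'pool:                 www' \
  'process manager:      dynamic' \
  'active processes:     48' \
  'listen queue:         3' \
  'max children reached: 12' |
awk -F':[ ]+' '/^listen queue:|^max children reached:/ { printf "%s = %s\n", $1, $2 }'
# prints:
# listen queue = 3
# max children reached = 12
```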


Step 5: Memory Exhaustion

PHP applications are notorious for consuming memory. A single Magento 2 request can use 256 MB or more. WordPress with WooCommerce and a page builder plugin easily hits 128 MB per request. Multiply that by your worker count and you can exhaust server RAM quickly.

Check PHP memory_limit

# Check the current memory limit
php -i | grep memory_limit

# The FPM value may differ from the CLI value. Check the FPM php.ini directly:
grep memory_limit /etc/php/8.2/fpm/php.ini
# (or create a temporary phpinfo file, load it in a browser, then delete it)

# Check actual memory usage of running workers
ps -eo pid,rss,comm | grep php-fpm | awk '{printf "PID: %s  RSS: %.0f MB  %s\n", $1, $2/1024, $3}'

The relationship between memory_limit and system RAM is critical. If memory_limit is 256M and you have 50 PHP-FPM workers, the theoretical maximum PHP memory consumption is 12.5 GB. If your server only has 8 GB of RAM, you will hit OOM before all workers are busy.

# Safe formula:
# max_children = (Available RAM in MB) / memory_limit
# Available RAM = Total RAM - OS overhead - MySQL/Redis/other services
#
# Example: 8 GB server, 2.5 GB for OS + services, memory_limit = 256M
# max_children = (8192 - 2560) / 256 = 22 workers
#
# For Magento 2 (known memory hog):
# Typical worker memory: 150-300 MB
# Recommended: memory_limit = 756M, max_children based on actual RSS

OOM scenarios with Magento and WordPress

Magento 2 is particularly aggressive with memory during catalog reindexing, sitemap generation, and order export operations. I have seen single CLI processes consume 2-4 GB. In a Kubernetes environment, if your PHP-FPM pod has a 2 GB memory limit and a reindex runs via cron in the same pod, the OOM killer terminates the container, taking all web workers with it. Every request returns 502 until the pod restarts.

The fix: run cron and CLI processes in a separate pod (or at minimum, a separate PHP-FPM pool) with its own memory limit. Do not let a runaway batch job kill your web-serving workers.
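At the pool level, that separation can be as simple as a second pool file with its own memory ceiling. A hypothetical sketch (the [batch] pool name, socket path, and limits are all illustrative, not a recommendation for your workload):

```ini
; /etc/php/8.2/fpm/pool.d/batch.conf -- hypothetical pool for cron/batch work
[batch]
user = www-data
group = www-data
listen = /run/php/php8.2-fpm-batch.sock
listen.owner = www-data
listen.group = www-data
pm = ondemand
pm.max_children = 2
pm.process_idle_timeout = 10s
; generous limit for reindex-style jobs, isolated from the web pool
php_admin_value[memory_limit] = 2G
```

Point your cron jobs and CLI wrappers at this pool's socket so a runaway batch job can only exhaust its own workers.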

Step 6: Kubernetes and Docker Specific 502s

Containerized environments introduce a whole new category of 502 causes that do not exist on bare metal. If you are running Nginx as an ingress controller or sidecar in Kubernetes, these are the most common traps.

Liveness probe killing pods

If your liveness probe is too aggressive (checking every 5 seconds with a 1-second timeout, failing after 3 attempts), a brief CPU spike can cause the probe to fail. Kubernetes kills the pod, Nginx gets "connection refused" for a few seconds during restart, and users see 502.

# Bad: too aggressive
livenessProbe:
  httpGet:
    path: /health
    port: 9000
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3    # killed after 15 seconds of unresponsiveness

# Better: give the application room to breathe
livenessProbe:
  httpGet:
    path: /health
    port: 9000
  initialDelaySeconds: 30
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 5    # killed after 75 seconds of unresponsiveness

Readiness probe not configured

Without a readiness probe, Kubernetes sends traffic to a pod the moment it starts, even if PHP-FPM has not finished initializing. During deployments or pod restarts, you get a brief window of 502 errors. Always define a readiness probe that checks if the application is actually ready to serve requests.

readinessProbe:
  httpGet:
    path: /health
    port: 9000
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3

SCRIPT_FILENAME wrong in containerized setups

When Nginx and PHP-FPM run in separate containers (a common pattern in Kubernetes), Nginx does not have the PHP files. It only serves static assets and proxies dynamic requests. If you use $document_root in your SCRIPT_FILENAME parameter, it resolves to Nginx's root (which has no PHP files). PHP-FPM receives a path that does not exist in its filesystem and returns an error. Nginx reports 502.

# Wrong (in multi-container setups):
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;

# Correct (hardcode the PHP-FPM container's document root):
fastcgi_param SCRIPT_FILENAME /var/www/html/pub$fastcgi_script_name;

This specific issue has caught me multiple times in Kubernetes Magento deployments. The fix is always to hardcode the path as PHP-FPM sees it, not as Nginx sees it.

PHP-FPM Process Manager Modes Compared

The pm directive in your PHP-FPM pool configuration controls how worker processes are managed. Choosing the right mode directly impacts your vulnerability to 502 errors.

static
  • Worker behavior: fixed number of workers, always running. No spawn delay on traffic spikes.
  • Memory usage: high (constant). All workers consume memory even when idle.
  • Best for: high-traffic production servers with consistent load and sufficient RAM.
  • Config example:
      pm = static
      pm.max_children = 50

dynamic
  • Worker behavior: scales between min and max spare. Spawns new workers when demand increases, kills idle ones.
  • Memory usage: medium (variable). Scales with traffic, but spawning takes time.
  • Best for: general purpose. A good default for most servers.
  • Config example:
      pm = dynamic
      pm.max_children = 50
      pm.start_servers = 10
      pm.min_spare_servers = 5
      pm.max_spare_servers = 20

ondemand
  • Worker behavior: zero workers at idle. Spawns workers only when requests arrive, kills them after an idle timeout.
  • Memory usage: low (minimal at idle). Workers exist only during active requests.
  • Best for: low-traffic sites, shared hosting, development. Not recommended for production under load.
  • Config example:
      pm = ondemand
      pm.max_children = 50
      pm.process_idle_timeout = 10s

For production servers handling real traffic, I recommend static or dynamic. The ondemand mode saves memory but introduces latency on the first request after idle (worker spawn time), which can cascade into 502s during traffic bursts.

Preventing Future 502s

Fixing the immediate 502 is only half the job. If you do not set up monitoring and alerting, you will be debugging the same issue again next month.

Monitor PHP-FPM metrics

# Key metrics to track (from /fpm-status):
# - active processes: should stay below max_children
# - listen queue: should be 0 (requests waiting for a worker)
# - max children reached: counter that increments when all workers are busy
# - slow requests: tracks requests exceeding request_slowlog_timeout

# Enable the slow log to catch problematic scripts
request_slowlog_timeout = 5s
slowlog = /var/log/php8.2-fpm.slow.log

Set up alerts

At minimum, alert on these conditions:

  • PHP-FPM service down: immediate page (systemd watchdog or external health check)
  • active_processes > 80% of max_children: warning, you are running out of headroom
  • listen_queue > 0 for more than 30 seconds: requests are queueing, 502s are imminent
  • max_children_reached counter increasing: your pool is undersized
  • Server memory usage > 85%: OOM kill risk is high
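The 80% headroom rule is easy to encode in whatever check system you use. A minimal shell sketch; the two worker counts below are illustrative, and in practice you would feed in live numbers from /fpm-status:

```shell
# Headroom check: warn when busy workers exceed 80% of the pool.
# active and max_children are illustrative values, not live metrics.
active=45
max_children=50
pct=$(( active * 100 / max_children ))
if [ "$pct" -ge 80 ]; then
  echo "WARN: ${pct}% of workers busy (${active}/${max_children})"
fi
# prints: WARN: 90% of workers busy (45/50)
```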

Auto-restart with systemd

# Edit the PHP-FPM systemd service to auto-restart on failure
sudo systemctl edit php8.2-fpm

# Add these lines:
[Service]
Restart=on-failure
RestartSec=5s

# This ensures PHP-FPM restarts automatically if it crashes
# reducing the 502 window from "until someone notices" to ~5 seconds

Load test before it matters

Run load tests against your staging environment to find your breaking point before production finds it for you. Tools like ab, wrk, or k6 can simulate concurrent connections and reveal pm.max_children limits:

# Quick load test with Apache Bench
ab -n 1000 -c 50 https://staging.example.com/

# Better: use k6 for realistic load patterns
# k6 will show you at what concurrency level 502s start appearing

Common Mistakes

After years of debugging 502 errors across dozens of production servers, these are the mistakes I see most often:

  1. Only increasing the timeout. If your application takes 120 seconds to respond, setting fastcgi_read_timeout 300s does not fix the problem. It just delays the error. Find and fix the slow query or API call.
  2. Not checking the OOM killer. PHP-FPM can crash silently when the OOM killer terminates it. Always run dmesg | grep -i oom as part of your diagnostic routine.
  3. Wrong socket permissions. If PHP-FPM runs as www-data but Nginx runs as nginx, the Unix socket needs proper group permissions. Set listen.owner, listen.group, and listen.mode in the pool config.
  4. Ignoring slow database queries. A single unindexed query holding a PHP-FPM worker for 30 seconds means that worker is unavailable for 30 seconds. With a small pool, a few slow queries can exhaust all workers and trigger 502s for everyone.
  5. Not tuning pm.max_children. The default value (typically 5) is a placeholder, not a production configuration. Calculate it based on your server's available RAM and per-worker memory consumption.
  6. Restarting Nginx instead of PHP-FPM. When you see a 502, the instinct is to restart Nginx. But Nginx is not the problem. Restarting Nginx briefly drops all active connections and solves nothing if PHP-FPM is the culprit. Restart (or reload) PHP-FPM instead.

Frequently Asked Questions

What is the difference between 502 and 504?

A 502 Bad Gateway means the upstream backend sent an invalid response or refused the connection entirely. PHP-FPM is down, crashed, or returned something Nginx could not parse. A 504 Gateway Timeout means the upstream is alive but did not respond within the timeout period. In both cases the root cause is on the backend, but 502 usually indicates a harder failure (process down), while 504 indicates a performance problem (process alive but slow).

Can Cloudflare cause 502 errors?

Yes. Cloudflare shows its own branded "502 Bad Gateway" page when it cannot reach your origin server. If you see a Cloudflare-branded error page, the problem is between Cloudflare and your server (origin is down, firewall blocking Cloudflare IPs, or SSL mismatch). If you see a plain white "502 Bad Gateway" page, the problem is between Nginx and your backend (PHP-FPM). To test, bypass Cloudflare and hit your origin directly: curl -sk -H "Host: example.com" https://YOUR_SERVER_IP/. Check your security headers and SSL configuration to rule out TLS issues between Cloudflare and your origin.

How do I calculate the right pm.max_children value?

Measure the actual memory usage of your PHP-FPM workers under load (not at idle). Use ps --no-headers -o rss -C php-fpm to get the RSS of each worker in KB. Average them. Then: max_children = (Total RAM - RAM used by OS, DB, cache, etc.) / Average worker RSS. Leave at least 10-15% headroom for spikes. For example, on an 8 GB server with 3 GB reserved for other services and 100 MB average worker size: (8192 - 3072) / 100 = 51 workers. Round down to 45-50 for safety. Monitor and adjust from there.

Why do I only get 502 errors under load?

This almost always means pm.max_children is too low. When traffic is light, the available workers handle every request. Under load, all workers are busy, new requests queue up, and once the queue overflows, Nginx gets "connection refused." The fix is to increase max_children (if you have RAM) or optimize your application to handle requests faster (so workers free up sooner). Enable the pm.status_path as described above to confirm: if "max children reached" is incrementing, that is your bottleneck.

How do I do a zero-downtime restart of PHP-FPM?

Use reload, not restart. The reload signal (USR2) tells PHP-FPM to gracefully finish serving active requests, then start new worker processes with the updated configuration. Active connections are not dropped.

# Graceful reload (zero downtime):
sudo systemctl reload php8.2-fpm

# Or send the signal directly:
sudo kill -USR2 $(cat /run/php/php8.2-fpm.pid)

# AVOID this in production (drops all active connections):
sudo systemctl restart php8.2-fpm

After reloading, verify with systemctl status php8.2-fpm and check your error log to confirm no new 502 errors appear.


The Bottom Line

Nginx 502 Bad Gateway is never an Nginx problem. It is a backend problem that Nginx is reporting honestly. Start with the error log, identify the failure pattern, and work backward to the root cause. In production, the fix is almost always one of these: restart PHP-FPM, fix a socket mismatch, increase pm.max_children, or address memory exhaustion.

Build monitoring and alerting around PHP-FPM metrics so you catch the next one before your users do. And while you are in the server, make sure the rest of your configuration is solid too.

Related tools: Exposure Checker, SSL Checker, DNS Lookup, Nginx Config Generator, HTTP Status Codes, and 70+ more free tools.

Related reading: Website Security Headers Guide, Open Ports Security Risks, Cloudflare WAF 403 Forbidden Fix.

Written by Usman Khan
DevOps Engineer | MSc Cybersecurity | CEH | AWS Solutions Architect

Usman has 10+ years of experience securing enterprise infrastructure, managing high-traffic PHP platforms, and debugging production incidents at scale. He writes from real-world experience running Magento, WordPress, and Laravel on Nginx in both bare-metal and Kubernetes environments. Read more about the author.