
Health checks and crashes


You’ve fixed the port binding. The service starts, Render detects the port, and then… the deploy still fails. This step covers the two big families of “almost-live” failures:

  • Health check failures - Render found your port but the configured healthCheckPath doesn’t return 200.
  • Crash loops - the process keeps starting, doing something, and exiting non-zero before Render can mark it live.

Both surface as “Deploy failed” after Detected service running on port, and the diagnosis lives in the runtime logs.

The boot → health-check → live timeline

flowchart LR
  start["startCommand runs"]
  port["Port detected"]
  hc["Health check<br/>HTTP GET healthCheckPath"]
  live["LIVE"]

  start -->|"binds port"| port
  port -->|"if configured"| hc
  port -->|"if no health check"| live
  hc -->|"200 OK"| live
  hc -->|"non-2xx, timeout, or never reachable"| fail["Deploy failed"]
  start -->|"process exits"| crash["Crash before binding"]

Two failure surfaces appear here:

  1. Crash before binding - process exited before Render saw the port. Read the stack trace.
  2. Health check fails - port found, but GET /healthz returns 500 or times out.

Crashes on boot

The clearest failure shape: your process starts, prints a stack trace, and exits non-zero.

What you’ll see

Runtime log: missing env var
==> Running 'npm start'
> app@1.0.0 start
> node server.js
Error: DATABASE_URL is not defined
    at validateConfig (/opt/render/project/src/config.js:12:9)
    at Object.<anonymous> (/opt/render/project/src/server.js:3:1)
==> Application exited with code 1
==> Running 'npm start'
> app@1.0.0 start
> node server.js
Error: DATABASE_URL is not defined
... (same error, repeats indefinitely until deploy timeout) ...
==> Deploy failed

Runtime log: Python ImportError on boot
==> Running 'gunicorn myapp:app --bind 0.0.0.0:$PORT'
[2026-05-21 18:42:11 +0000] [1] [INFO] Starting gunicorn 21.2.0
[2026-05-21 18:42:11 +0000] [1] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/opt/render/project/src/myapp.py", line 5, in <module>
    from redis import Redis
ModuleNotFoundError: No module named 'redis'
==> Application exited with code 3

What it means

The first error line tells you exactly why the process couldn’t start. The same error repeating means Render’s restart loop is reproducing the failure deterministically.

Common crash-on-boot causes

| Pattern | Cause | Fix |
| --- | --- | --- |
| Error: DATABASE_URL is not defined (or SECRET_KEY, etc.) | Required env var isn’t set in the Render Dashboard | Set it; or, if it should come from a database, use fromDatabase in your Blueprint |
| ECONNREFUSED 127.0.0.1:5432 | App is trying to connect to a local database that doesn’t exist on Render | Set DATABASE_URL to the Render-provided internal connection string |
| ModuleNotFoundError: No module named 'X' at runtime (Node: Cannot find module 'X') | A dependency works in dev but is missing from production deps | Python: add it to requirements.txt (not requirements-dev.txt). Node: add it to dependencies (not devDependencies) |
| EACCES: permission denied writing to disk | App writes to the filesystem but you don’t have a persistent disk | Add a disk, or write to /tmp (ephemeral but allowed) |
| bind: address already in use | Two listeners trying to grab the same port | Usually a duplicate app.listen call from a hot-reload library still active in production |

The pattern is always the same: read the first stack trace, identify what the app needs that isn’t there, provide it.
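
Finding out what the app needs is faster when the app reports every missing variable in one log line instead of crashing on the first. A minimal sketch in Python, assuming a hypothetical list of required names (REQUIRED_VARS, missing_vars, and validate_config are illustrative, not part of Render):

```python
import os
import sys

# Hypothetical list of variables this app cannot start without.
REQUIRED_VARS = ["DATABASE_URL", "SECRET_KEY"]


def missing_vars(env, required=REQUIRED_VARS):
    """Return the required names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


def validate_config(env=os.environ):
    """Exit non-zero with one clear log line if anything is missing."""
    missing = missing_vars(env)
    if missing:
        print(f"Missing required env vars: {', '.join(missing)}", file=sys.stderr)
        sys.exit(1)
```

Calling validate_config() as the first statement of the entry point keeps the failure at the top of the runtime log, where the first-error rule tells you to look.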

Health check failures

Once the port is detected, Render starts polling your healthCheckPath (if configured). If it doesn’t return 200 within the grace window, the deploy fails.

What you’ll see

Runtime log: health check failing
==> Detected service running on port 10000
==> Health check at '/healthz' returned 503 (5 attempts)
==> Deploy failed: Health check returned non-2xx status

Runtime log: health check path 404
==> Detected service running on port 10000
==> Health check at '/health' returned 404 (5 attempts)
==> Deploy failed: Health check returned non-2xx status

Runtime log: health check timeout
==> Detected service running on port 10000
==> Health check at '/healthz' timed out (5 attempts)
==> Deploy failed

What each variant means

| Status | Likely cause |
| --- | --- |
| 404 | The path doesn’t exist on your service. Either you’ve typo’d the path in Render’s config or the route was removed |
| 503 / 500 | The endpoint exists but the service isn’t healthy - usually a dependency check (DB, Redis) is failing inside the handler |
| Timeout | The endpoint is taking too long. App is busy starting up, or the handler does heavy work synchronously |
| 405 Method Not Allowed | Your /healthz only accepts POST. Render uses GET |
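
You can reproduce each variant locally before redeploying by sending the same kind of GET request yourself. The sketch below is one way to do that with the Python standard library. The 5-second timeout, the function names, and the local URL are assumptions for illustration, not Render's actual probe settings.

```python
import socket
import urllib.error
import urllib.request


def classify(status):
    """Map a probe result (HTTP status, or None for timeout/unreachable) to a likely cause."""
    if status is None:
        return "timeout or unreachable: app still starting, or handler too slow"
    if 200 <= status < 300:
        return "healthy"
    if status == 404:
        return "path not found: check healthCheckPath against your routes"
    if status == 405:
        return "method not allowed: the route must accept GET"
    if status >= 500:
        return "handler failing: look for a dependency check inside it"
    return f"unexpected status {status}"


def probe(url, timeout=5.0):
    """Send one GET and return the HTTP status, or None on timeout/connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:  # non-2xx responses land here
        return err.code
    except (urllib.error.URLError, socket.timeout, ConnectionError):
        return None
```

With the service running locally, `print(classify(probe("http://localhost:10000/healthz")))` gives a one-line diagnosis that lines up with the table above.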

The fix

A health check endpoint should be cheap, fast, and independent of external dependencies:

Good health check (Node / Express)
app.get("/healthz", (req, res) => {
  res.status(200).json({ status: "ok" });
});

Good health check (FastAPI)
@app.get("/healthz")
def healthz():
    return {"status": "ok"}

Things to avoid doing in the health check handler:

  • Connecting to the database on every call (slow under load, false negatives on transient DB issues).
  • Checking external services (your /healthz shouldn’t fail because Stripe is having an outage).
  • Doing CPU-heavy work.

A health check should answer “is this process alive enough to receive traffic?”, not “is everything in the world working?”

Tuning the health check

In the Render Dashboard’s Settings → Health Check section (or healthCheckPath in your Blueprint), you can also adjust how many failures Render tolerates before failing the deploy. For services with heavy startup work (loading models, warming caches), prefer signalling readiness from your code over raising the tolerance. Render’s grace window is generous enough for normal startup, but if you’re loading a 4 GB language model into RAM, point the health check at a separate endpoint that only goes green once the model is ready.

Two-phase health check
let ready = false;

app.get("/healthz", (req, res) => {
  // Live check: process is up
  res.status(200).json({ status: "ok", ready });
});

app.get("/ready", (req, res) => {
  // Ready check: model loaded, queues warmed
  res.status(ready ? 200 : 503).json({ ready });
});

// loadHeavyModel() stands in for your own startup work
(async () => {
  await loadHeavyModel();
  ready = true;
  console.log("Service ready to receive traffic");
})().catch((err) => {
  // Crash loudly so the failure shows up in the runtime logs
  console.error("Startup failed:", err);
  process.exit(1);
});

Point Render’s healthCheckPath at /ready, and the deploy will only flip live once the heavy startup completes.
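
The same two-phase idea can be sketched in Python using only the standard library. This is an illustration under assumptions, not a production server: load_heavy_model is a hypothetical placeholder for your real startup work, and health_status, warm_up, and serve are names invented for this example.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

ready = threading.Event()  # set once the heavy startup work has finished


def health_status(path, is_ready):
    """Return the HTTP status each endpoint should answer with."""
    if path == "/healthz":
        return 200                       # liveness: the process is up
    if path == "/ready":
        return 200 if is_ready else 503  # readiness: heavy startup finished
    return 404


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(health_status(self.path, ready.is_set()))
        self.end_headers()


def load_heavy_model():
    """Hypothetical stand-in for model loading or cache warming."""
    ...


def warm_up():
    load_heavy_model()
    ready.set()


def serve(port):
    # Bind first so the port is detected, then warm up in the background.
    server = ThreadingHTTPServer(("0.0.0.0", port), Handler)
    threading.Thread(target=warm_up, daemon=True).start()
    server.serve_forever()
```

Running the warm-up in a background thread means the port is bound immediately, while /ready keeps answering 503 until the work is done.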

OOM kills

The classic “I built it, it worked for a minute, then died with no warning.”

What you’ll see

Runtime log: OOM kill
[2026-05-21 18:45:12] Loading 8 GB embedding model...
==> Out of memory! Process was killed
==> Application exited with code 137
==> Restarting...

Or - and this is the tricky one - no application log at all before the exit:

Runtime log: OOM with no warning
==> Detected service running on port 10000
==> Your service is live 🎉
(... 30 minutes pass ...)
==> Application exited with code 137
==> Restarting...

What it means

Exit code 137 means the process was killed with SIGKILL (128 + 9). When no SIGTERM precedes it in the log, the usual cause is an OOM kill: the kernel killed your process because the instance ran out of memory.
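
The 128 + signal arithmetic applies to other exit codes too. A small sketch for decoding them on Linux (describe_exit is an illustrative helper, not a Render tool):

```python
import signal


def describe_exit(code):
    """Explain a process exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name} (128 + {signum})"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return "clean exit" if code == 0 else f"exited on its own with status {code}"


print(describe_exit(137))  # killed by SIGKILL (128 + 9)
print(describe_exit(143))  # killed by SIGTERM (128 + 15)
print(describe_exit(1))    # exited on its own with status 1
```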

Diagnosis

Two questions:

  1. Was it instant or gradual? Instant OOM during startup means your normal RSS exceeds the instance limit - wrong instance size for the workload. Gradual OOM (live for a while, then dies) means a memory leak.
  2. What does the Metrics tab show? Memory plateaus in a healthy service. A steadily climbing graph that crashes at the plan limit is a leak.

The fix

  • Wrong instance size → upgrade to a larger plan. The scaling docs list the memory ceilings per plan.
  • Memory leak → use a heap profiler locally (Node: --inspect + Chrome DevTools; Python: tracemalloc; Go: pprof). Render itself doesn’t profile your process, but it gives you the time-of-death and the metrics graph.
  • One-time spike (large request, batch job) → move the heavy work to a background worker, or tune the queue worker count down. If you have one process with 8 worker threads each doing 1 GB of work, you’re at 8 GB - instances much smaller than that will OOM.

render.yaml: bump the instance size
services:
  - type: web
    name: api
    plan: standard # was 'starter' - 2 GB instead of 512 MB
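
For the memory-leak case in a Python service, tracemalloc can show which source lines grew between two points in time. The sketch below fakes a leak with a list that is never trimmed. leaky_cache and top_growth are made up for the example; in a real service you would take the snapshots some minutes apart under load.

```python
import tracemalloc


def top_growth(before, after, limit=3):
    """Return the source lines whose allocations changed most between two snapshots."""
    return after.compare_to(before, "lineno")[:limit]


tracemalloc.start()
before = tracemalloc.take_snapshot()

leaky_cache = []  # hypothetical leak: grows and is never trimmed
for i in range(10_000):
    leaky_cache.append(f"request-{i}" * 10)

after = tracemalloc.take_snapshot()
for stat in top_growth(before, after):
    print(stat)  # each entry shows file:line plus the size and count deltas
tracemalloc.stop()
```

The top entry should point at the line that appends to the cache, which is the kind of lead you want before deciding between a code fix and a bigger plan.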

SIGTERM and graceful shutdown

A different failure mode that often shows up as “deploys started failing after I added zero-downtime deploys.”

What you’ll see

Runtime log: SIGTERM not handled
==> SIGTERM received, shutting down gracefully
==> Application did not exit within 30 seconds, sending SIGKILL
==> Application exited with code 137
==> New instance is starting...

Runtime log: workers killed mid-request
[CRITICAL] WORKER TIMEOUT (pid:23)
[ERROR] Worker (pid:23) was sent SIGKILL! Perhaps out of memory?

What it means

When Render deploys a new version or scales down, it sends SIGTERM to the old process and waits up to 30 seconds for it to exit cleanly. If the process is still running after that window, Render sends SIGKILL, which is what the logs above show.

The two flavors:

  • Process doesn’t handle SIGTERM at all → server keeps running, drops connections at SIGKILL time. Ugly logs but recoverable.
  • Long-running requests block shutdown → a worker stuck on a slow request can’t drain in time. Often shows up as WORKER TIMEOUT in gunicorn logs (including gunicorn running uvicorn workers).

The fix

Wire a SIGTERM handler that stops accepting new connections and drains the in-flight ones:

Node / Express graceful shutdown
const server = app.listen(port, "0.0.0.0");

process.on("SIGTERM", () => {
  console.log("SIGTERM received, draining...");
  server.close(() => {
    console.log("All connections drained, exiting");
    process.exit(0);
  });
  // Force exit after 25s to stay under Render's 30s window
  setTimeout(() => process.exit(0), 25000).unref();
});

gunicorn graceful timeout
# In startCommand or gunicorn.conf.py
gunicorn myapp:app --bind 0.0.0.0:$PORT \
  --workers 4 \
  --graceful-timeout 25 \
  --timeout 30
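
If you prefer the config-file route, the same flags map onto gunicorn settings of the same names. A sketch of an equivalent gunicorn.conf.py, assuming the file sits in the directory gunicorn is started from; the 10000 fallback port is only for local runs:

```python
# gunicorn.conf.py
import os

# Render supplies PORT; 10000 is only a local fallback for this sketch.
bind = f"0.0.0.0:{os.environ.get('PORT', '10000')}"
workers = 4
graceful_timeout = 25  # seconds workers get to finish in-flight requests after SIGTERM
timeout = 30           # seconds of silence before a worker is killed and restarted
```

With this file in place the start command can shrink to `gunicorn myapp:app`.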

For background workers (no HTTP listener), the same pattern: SIGTERM → finish current job → exit.

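Python worker
import queue
import signal

shutdown = False

def handle_sigterm(*_):
    global shutdown
    shutdown = True

signal.signal(signal.SIGTERM, handle_sigterm)

# jobs (a queue.Queue) and process() stand in for your own job source and handler
while not shutdown:
    try:
        job = jobs.get(timeout=5)
    except queue.Empty:
        continue  # nothing to do; loop back and re-check the shutdown flag
    process(job)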

Crash loop vs flapping service

Two patterns that look identical from a distance:

| Pattern | Symptom | Likely cause |
| --- | --- | --- |
| Crash loop | Service exits on boot, restarts, exits, restarts, indefinitely | Deterministic boot error (missing env var, bad config) |
| Flapping | Service goes live, stays up for minutes/hours, dies, restarts | Non-deterministic (memory leak, slow OOM, external dep timing out) |

The Events feed disambiguates them: crash loops show many restart events in a few minutes; flapping services have minutes or hours between restarts. The diagnosis paths are completely different.
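
If you copy the restart times out of the Events feed, the distinction can be made mechanical. A rough sketch, assuming a 2-minute cut-off between the two patterns; that threshold and the function name are illustrative choices, not Render settings:

```python
from datetime import datetime, timedelta


def classify_restarts(timestamps, loop_gap=timedelta(minutes=2)):
    """Label a restart history as 'crash loop', 'flapping', or 'stable'.

    loop_gap is an assumed threshold, not a Render setting.
    """
    if len(timestamps) < 2:
        return "stable"
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    typical_gap = sorted(gaps)[len(gaps) // 2]  # middle gap, robust to one outlier
    return "crash loop" if typical_gap <= loop_gap else "flapping"


t0 = datetime(2026, 5, 21, 18, 0)
print(classify_restarts([t0 + timedelta(seconds=20 * i) for i in range(6)]))  # crash loop
print(classify_restarts([t0 + timedelta(minutes=45 * i) for i in range(4)]))  # flapping
```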

Your service exits with code 137 about 20 minutes after going live. The Metrics tab shows memory climbing steadily from 200 MB at boot until it reaches the instance’s memory limit at exit. Which is the right next step in the method?

What you learned

  • Crashes on boot have a stack trace as the first error - read it, identify what's missing (env var, dep, network), provide it
  • Health check failures break down by status: 404 (wrong path), 503/500 (handler is failing internally), timeout (handler too slow). Fix the handler, not the check tolerance
  • Use a two-phase health check (`/healthz` cheap, `/ready` heavy) when your service has slow startup like model loading
  • Exit code 137 usually means an OOM kill. Check the memory graph in the Metrics tab - an instant OOM at startup means the instance is too small, a steady climb means a leak
  • Set `NODE_OPTIONS=--max-old-space-size=...` to let V8 use more of the available RAM
  • Trap SIGTERM and drain in under 25 seconds. Without a handler, Render sends SIGKILL at 30 seconds and you drop in-flight requests