Health checks and crashes — When deploys go wrong

You’ve fixed the port binding. The service starts, Render detects the port, and then… the deploy still fails. This step covers the two big families of “almost-live” failures:

Health check failures - Render found your port but the configured healthCheckPath doesn’t return a 2xx response.
Crash loops - the process keeps starting, doing something, and exiting non-zero before Render can mark it live.

Both end in “Deploy failed”, and the diagnosis lives in the runtime logs. Health check failures appear after the Detected service running on port line; a crash on boot usually appears before it.

The boot → health-check → live timeline

flowchart LR
  start["startCommand runs"]
  port["Port detected"]
  hc["Health check<br/>HTTP GET healthCheckPath"]
  live["LIVE"]

  start -->|"binds port"| port
  port -->|"if configured"| hc
  port -->|"if no health check"| live
  hc -->|"200 OK"| live
  hc -->|"non-2xx, timeout, or never reachable"| fail["Deploy failed"]
  start -->|"process exits"| crash["Crash before binding"]

Two failure surfaces appear here:

Crash before binding - process exited before Render saw the port. Read the stack trace.
Health check fails - port found, but GET /healthz returns 500 or times out.

Crashes on boot

The clearest failure shape: your process starts, prints a stack trace, and exits non-zero.

What you’ll see

==> Running 'npm start'
> app@1.0.0 start
> node server.js
Error: DATABASE_URL is not defined
    at validateConfig (/opt/render/project/src/config.js:12:9)
    at Object.<anonymous> (/opt/render/project/src/server.js:3:1)
==> Application exited with code 1
==> Running 'npm start'
> app@1.0.0 start
> node server.js
Error: DATABASE_URL is not defined
... (same error, repeats indefinitely until deploy timeout) ...
==> Deploy failed

==> Running 'gunicorn myapp:app --bind 0.0.0.0:$PORT'
[2026-05-21 18:42:11 +0000] [1] [INFO] Starting gunicorn 21.2.0
[2026-05-21 18:42:11 +0000] [1] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/opt/render/project/src/myapp.py", line 5, in <module>
    from redis import Redis
ModuleNotFoundError: No module named 'redis'
==> Application exited with code 3

What it means

The first error line tells you exactly why the process couldn’t start. The same error repeating means Render’s restart loop is reproducing the failure deterministically.

Common crash-on-boot causes

Pattern	Cause	Fix
`Error: DATABASE_URL is not defined` (or `SECRET_KEY`, etc.)	Required env var isn’t set in the Render Dashboard	Set it; or, if it should come from a database, use `fromDatabase` in your Blueprint
`ECONNREFUSED 127.0.0.1:5432`	App is trying to connect to a local database that doesn’t exist on Render	Set `DATABASE_URL` to the Render-provided internal connection string
`ModuleNotFoundError: No module named 'X'` at runtime	A dependency works in dev but is missing from production deps	Add it to `dependencies` (not `devDependencies`), or `requirements.txt` (not `requirements-dev.txt`)
`EACCES: permission denied` writing to disk	App writes to the filesystem but you don’t have a persistent disk	Add a disk, or write to `/tmp` (ephemeral but allowed)
`bind: address already in use`	Two listeners trying to grab the same port	Usually a duplicate `app.listen` call from a hot-reload library still active in production

The pattern is always the same: read the first stack trace, identify what the app needs that isn’t there, provide it.
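
A minimal sketch of that pattern for a Python service: check every required variable at startup and report all the missing ones in a single message, so one failed deploy tells you everything that needs to be set. The names in `REQUIRED_VARS` and the helpers `missing_env_vars` and `validate_config` are illustrative assumptions, not part of any Render API.

```python
import os
import sys

# Hypothetical list: replace with the variables your app actually needs.
REQUIRED_VARS = ["DATABASE_URL", "SECRET_KEY"]


def missing_env_vars(env, required=REQUIRED_VARS):
    """Return the required names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


def validate_config(env=os.environ):
    """Exit non-zero with one clear message listing everything missing."""
    missing = missing_env_vars(env)
    if missing:
        print(f"Missing required env vars: {', '.join(missing)}", file=sys.stderr)
        sys.exit(1)
```

Calling `validate_config()` as the first statement of your entry point keeps the failure at the top of the log, where the first stack trace would otherwise be.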

Health check failures

Once the port is detected, Render starts polling your healthCheckPath (if configured). If it doesn’t return a 2xx response within the grace window, the deploy fails.

What you’ll see

==> Detected service running on port 10000
==> Health check at '/healthz' returned 503 (5 attempts)
==> Deploy failed: Health check returned non-2xx status

==> Detected service running on port 10000
==> Health check at '/health' returned 404 (5 attempts)
==> Deploy failed: Health check returned non-2xx status

==> Detected service running on port 10000
==> Health check at '/healthz' timed out (5 attempts)
==> Deploy failed

What each variant means

Status	Likely cause
`404`	The path doesn’t exist on your service. Either you’ve typo’d the path in Render’s config or the route was removed
`503` / `500`	The endpoint exists but the service isn’t healthy - usually a dependency check (DB, Redis) is failing inside the handler
Timeout	The endpoint is taking too long. App is busy starting up, or the handler does heavy work synchronously
`405 Method Not Allowed`	Your `/healthz` only accepts POST. Render uses GET
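
To reproduce the check before deploying, you can send the same kind of GET request to a locally running copy of the service. The sketch below uses only the Python standard library; the 5-second timeout and the helper names `probe` and `classify_status` are assumptions for illustration, not Render's actual settings.

```python
import urllib.error
import urllib.request


def classify_status(status):
    """Map an HTTP status (or None for no response) to the likely cause."""
    if status is None:
        return "timeout: handler too slow or app still starting"
    if 200 <= status < 300:
        return "healthy"
    if status == 404:
        return "wrong path: route missing or path mistyped"
    if status == 405:
        return "wrong method: the route does not accept GET"
    if status >= 500:
        return "handler failing: check dependency calls inside it"
    return f"unexpected status {status}"


def probe(url, timeout=5.0):
    """Send one GET; return the status code, or None on timeout or connection failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # error statuses such as 404 or 503 arrive as HTTPError
    except OSError:
        return None  # covers URLError, refused connections, and socket timeouts
```

For example, `print(classify_status(probe("http://localhost:10000/healthz")))` run against a local instance shows which row of the table you are in.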

The fix

A health check endpoint should be cheap, fast, and free of external dependencies:

app.get("/healthz", (req, res) => {
  res.status(200).json({ status: "ok" });
});

@app.get("/healthz")
def healthz():
    return {"status": "ok"}

Things to avoid doing in the health check handler:

Connecting to the database on every call (slow under load, false negatives on transient DB issues).
Checking external services (your /healthz shouldn’t fail because Stripe is having an outage).
Doing CPU-heavy work.

A health check should answer “is this process alive enough to receive traffic?”, not “is everything in the world working?”

Tuning the health check

In the Render Dashboard’s Settings → Health Check section (or healthCheckPath in your Blueprint), you can also adjust how many failures Render tolerates before failing the deploy. For services with heavy startup work (loading models, warming caches), handle the delay in your code rather than raising the tolerance. Render’s grace window is generous enough for normal startup, but if you’re loading a 4 GB language model into RAM, point the health check at a separate endpoint that only returns 200 once the model is ready.

let ready = false;

app.get("/healthz", (req, res) => {
  // Live check: process is up
  res.status(200).json({ status: "ok", ready });
});

app.get("/ready", (req, res) => {
  // Ready check: model loaded, queues warmed
  res.status(ready ? 200 : 503).json({ ready });
});

(async () => {
  await loadHeavyModel();
  ready = true;
  console.log("Service ready to receive traffic");
})();

Point Render’s healthCheckPath at /ready, and the deploy will only flip live once the heavy startup completes.

OOM kills

The classic “I built it, it worked for a minute, then died with no warning.”

What you’ll see

[2026-05-21 18:45:12] Loading 8 GB embedding model...
==> Out of memory! Process was killed
==> Application exited with code 137
==> Restarting...

Or - and this is the tricky one - no application log at all before the exit:

==> Detected service running on port 10000
==> Your service is live 🎉
(... 30 minutes pass ...)
==> Application exited with code 137
==> Restarting...

What it means

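Exit code 137 means the process was killed with SIGKILL (128 + 9, the signal number). The most common sender is the kernel’s OOM killer, which ends your process when the instance runs out of memory. A SIGKILL sent after a missed shutdown window produces the same code, so confirm with the Out of memory log line or the Metrics tab.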
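
The arithmetic behind the code can be sketched with Python's standard `signal` module. The helper name `describe_exit_code` is hypothetical. Note that the code only identifies the signal; whether the sender was the OOM killer still has to come from the logs or the memory graph.

```python
import signal


def describe_exit_code(code):
    """Explain a process exit code using the 128 + signal-number convention."""
    if code == 0:
        return "clean exit"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"  # number not defined on this platform
        return f"killed by {name}"
    return f"application error (exit code {code})"
```

On Linux, `describe_exit_code(137)` reports SIGKILL and `describe_exit_code(143)` reports SIGTERM.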

Diagnosis

Two questions:

Was it instant or gradual? Instant OOM during startup means your normal RSS exceeds the instance limit - wrong instance size for the workload. Gradual OOM (live for a while, then dies) means a memory leak.
Look at the Metrics tab. Memory should plateau in healthy services. A steadily-climbing graph that crashes at the plan limit is a leak.
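
If you export memory samples (for example, by reading values off the Metrics graph), the two questions can be turned into a rough rule of thumb. This is a sketch under stated assumptions: the 90% "already at the limit" cutoff, the 20% growth threshold, and the function name `classify_memory_trend` are all illustrative choices, not Render features.

```python
def classify_memory_trend(samples_mb, limit_mb, growth_threshold=0.2):
    """Classify memory samples (oldest first) as 'oversized', 'leak', or 'stable'.

    growth_threshold is an assumed cutoff: an increase of more than 20%
    between the first and second half of the window counts as a climb.
    """
    if not samples_mb:
        raise ValueError("need at least one sample")
    # Near the limit from the very first sample: the baseline does not fit the plan.
    if samples_mb[0] >= 0.9 * limit_mb:
        return "oversized"
    half = len(samples_mb) // 2
    if half == 0:
        return "stable"
    first = sum(samples_mb[:half]) / half
    second = sum(samples_mb[half:]) / (len(samples_mb) - half)
    if second > first * (1 + growth_threshold):
        return "leak"
    return "stable"
```

A result of "leak" is only a hypothesis to test with a profiler; a service that legitimately warms a large cache can produce the same shape for a while.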

The fix

Wrong instance size → upgrade to a larger plan. The scaling docs list the memory ceilings per plan.
Memory leak → use a heap profiler locally (Node: --inspect + Chrome DevTools; Python: tracemalloc; Go: pprof). Render itself doesn’t profile your process, but it gives you the time-of-death and the metrics graph.
One-time spike (large request, batch job) → move the heavy work to a background worker, or tune the queue worker count down. If you have one process with 8 worker threads each doing 1 GB of work, you’re at 8 GB - instances much smaller than that will OOM.

services:
  - type: web
    name: api
    plan: standard  # was 'starter' - 2 GB instead of 512 MB
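
The worker arithmetic from the one-time-spike fix above can be written down as a quick estimate before you pick a plan or a worker count. The function names and the idea of a fixed per-process base overhead are assumptions for illustration; measure your own numbers locally.

```python
def peak_memory_mb(workers, base_mb, per_worker_mb):
    """Estimated peak: fixed process overhead plus each worker's working set."""
    return base_mb + workers * per_worker_mb


def max_workers(limit_mb, base_mb, per_worker_mb):
    """Largest worker count whose estimated peak stays within limit_mb."""
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    return max(0, (limit_mb - base_mb) // per_worker_mb)
```

With 8 workers at 1024 MB each and no base overhead, `peak_memory_mb(8, 0, 1024)` gives 8192 MB, which matches the 8 GB figure in the text.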

SIGTERM and graceful shutdown

A different failure mode that often shows up as “deploys started failing after I added zero-downtime deploys.”

What you’ll see

==> SIGTERM received, shutting down gracefully
==> Application did not exit within 30 seconds, sending SIGKILL
==> Application exited with code 137
==> New instance is starting...

[CRITICAL] WORKER TIMEOUT (pid:23)
[ERROR] Worker (pid:23) was sent SIGKILL! Perhaps out of memory?

What it means

When Render deploys a new version or scales down, it sends SIGTERM to the old process and waits up to 30 seconds for it to exit cleanly. If the process ignores the signal, Render sends SIGKILL - which is what shows in logs above.

The two flavors:

Process doesn’t handle SIGTERM at all → server keeps running, drops connections at SIGKILL time. Ugly logs but recoverable.
Long-running requests block shutdown → a worker stuck on a slow request can’t drain in time. Often shows up as WORKER TIMEOUT in gunicorn, including when gunicorn manages uvicorn workers.

The fix

Wire a SIGTERM handler that stops accepting new connections and drains the in-flight ones:

const server = app.listen(port, "0.0.0.0");

process.on("SIGTERM", () => {
  console.log("SIGTERM received, draining...");
  server.close(() => {
    console.log("All connections drained, exiting");
    process.exit(0);
  });
  // Force exit after 25s to stay under Render's 30s window
  setTimeout(() => process.exit(0), 25000).unref();
});

# In startCommand or gunicorn.conf.py
gunicorn myapp:app --bind 0.0.0.0:$PORT \
  --workers 4 \
  --graceful-timeout 25 \
  --timeout 30
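
The same settings can live in a `gunicorn.conf.py` file instead of the command line. This is a sketch assuming the values from the command above; the `10000` fallback is only there so the file also loads outside Render, where `PORT` is not set.

```python
# gunicorn.conf.py
import os

# Render injects PORT; 10000 is only a local fallback for this sketch.
bind = f"0.0.0.0:{os.environ.get('PORT', '10000')}"
workers = 4
graceful_timeout = 25  # seconds workers get to finish in-flight requests after SIGTERM
timeout = 30           # seconds of silence before a worker is killed and restarted
```

Start the server with `gunicorn -c gunicorn.conf.py myapp:app`. Keeping `graceful_timeout` below the 30-second window leaves a few seconds of margin before SIGKILL.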

For background workers (no HTTP listener), the same pattern: SIGTERM → finish current job → exit.

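import queue
import signal

jobs = queue.Queue()  # stand-in for your real job queue
shutdown = False

def handle_sigterm(*_):
    # Only set a flag; the loop finishes the current job before exiting
    global shutdown
    shutdown = True

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutdown:
    try:
        job = jobs.get(timeout=5)
    except queue.Empty:
        continue  # nothing queued; re-check the shutdown flag
    process(job)  # your job handler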

Crash loop vs flapping service

Two patterns that look identical from a distance:

Pattern	Symptom	Likely cause
Crash loop	Service exits on boot, restarts, exits, restarts, indefinitely	Deterministic boot error (missing env var, bad config)
Flapping	Service goes live, stays up for minutes/hours, dies, restarts	Non-deterministic (memory leak, slow OOM, external dep timing out)

The Events feed disambiguates them: crash loops show many restart events in a few minutes; flapping services have minutes or hours between restarts. The diagnosis paths are completely different.
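
If you copy restart times out of the Events feed, the distinction can be made mechanical. The sketch below is one way to do it; the 2-minute cutoff and the function name `classify_restarts` are assumptions chosen for illustration, not thresholds Render defines.

```python
from datetime import datetime, timedelta


def classify_restarts(timestamps, loop_gap=timedelta(minutes=2)):
    """Label restart times as 'crash loop', 'flapping', or 'no pattern'.

    loop_gap is an assumed cutoff: restarts closer together than this are
    treated as the same boot failure repeating.
    """
    if len(timestamps) < 2:
        return "no pattern"
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    # Upper median of the gaps, so one unusual gap does not decide the label.
    median_gap = sorted(gaps)[len(gaps) // 2]
    return "crash loop" if median_gap < loop_gap else "flapping"
```

Restarts every 20 seconds come back as a crash loop, while restarts every 45 minutes come back as flapping.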

Your service exits with code 137 about 20 minutes after going live. The Metrics tab shows memory climbing steadily from 200 MB at boot to the 512 MB limit at exit. The instance is on the Starter plan. Which is the right next step in the method?

Upgrade to Standard and call it done
Add a SIGTERM handler so shutdown is graceful
Form a hypothesis ('memory leak') and isolate it locally with a heap profiler; only then decide whether to fix the leak or scale up
Disable the health check to stop the restarts

What you learned

Crashes on boot have a stack trace as the first error - read it, identify what's missing (env var, dep, network), provide it
Health check failures break down by status: 404 (wrong path), 503/500 (handler is failing internally), timeout (handler too slow). Fix the handler, not the check tolerance
Use a two-phase health check (`/healthz` cheap, `/ready` heavy) when your service has slow startup like model loading
Exit code 137 means SIGKILL, most often an OOM kill. Check the Metrics tab for a memory graph - a flat line at the plan limit means wrong instance size, a steady climb means a leak
Set `NODE_OPTIONS=--max-old-space-size=...` to let V8 use more of the available RAM
Trap SIGTERM and drain in under 25 seconds. Without a handler, Render sends SIGKILL at 30 seconds and you drop in-flight requests