You’ve fixed the port binding. The service starts, Render detects the port, and then… the deploy still fails. This step covers the two big families of “almost-live” failures:
- Health check failures - Render found your port but the configured `healthCheckPath` doesn’t return 200.
- Crash loops - the process keeps starting, doing something, and exiting non-zero before Render can mark it live.
Both surface as “Deploy failed” after Detected service running on port, and the diagnosis lives in the runtime logs.
The boot → health-check → live timeline

    flowchart LR
      start["startCommand runs"]
      port["Port detected"]
      hc["Health check<br/>HTTP GET healthCheckPath"]
      live["LIVE"]
      start -->|"binds port"| port
      port -->|"if configured"| hc
      port -->|"if no health check"| live
      hc -->|"200 OK"| live
      hc -->|"non-2xx, timeout, or never reachable"| fail["Deploy failed"]
      start -->|"process exits"| crash["Crash before binding"]

Two failure surfaces appear here:
- Crash before binding - process exited before Render saw the port. Read the stack trace.
- Health check fails - port found, but `GET /healthz` returns 500 or times out.
Crashes on boot
The clearest failure shape: your process starts, prints a stack trace, and exits non-zero.
What you’ll see

Node:

    ==> Running 'npm start'
    > app@1.0.0 start
    > node server.js
    Error: DATABASE_URL is not defined
        at validateConfig (/opt/render/project/src/config.js:12:9)
        at Object.<anonymous> (/opt/render/project/src/server.js:3:1)
    ==> Application exited with code 1
    ==> Running 'npm start'
    > app@1.0.0 start
    > node server.js
    Error: DATABASE_URL is not defined
    ... (same error, repeats indefinitely until deploy timeout) ...
    ==> Deploy failed

Python (gunicorn):

    ==> Running 'gunicorn myapp:app --bind 0.0.0.0:$PORT'
    [2026-05-21 18:42:11 +0000] [1] [INFO] Starting gunicorn 21.2.0
    [2026-05-21 18:42:11 +0000] [1] [ERROR] Exception in worker process
    Traceback (most recent call last):
      File "/opt/render/project/src/myapp.py", line 5, in <module>
        from redis import Redis
    ModuleNotFoundError: No module named 'redis'
    ==> Application exited with code 3

What it means
The first error line tells you exactly why the process couldn’t start. The same error repeating means Render’s restart loop is reproducing the failure deterministically.
Common crash-on-boot causes
| Pattern | Cause | Fix |
|---|---|---|
| `Error: DATABASE_URL is not defined` (or `SECRET_KEY`, etc.) | Required env var isn’t set in the Render Dashboard | Set it; or, if it should come from a database, use `fromDatabase` in your Blueprint |
| `ECONNREFUSED 127.0.0.1:5432` | App is trying to connect to a local database that doesn’t exist on Render | Set `DATABASE_URL` to the Render-provided internal connection string |
| `ModuleNotFoundError: No module named 'X'` at runtime | A dependency works in dev but is missing from production deps | Add it to `dependencies` (not `devDependencies`), or `requirements.txt` (not `requirements-dev.txt`) |
| `EACCES: permission denied` writing to disk | App writes to the filesystem but you don’t have a persistent disk | Add a disk, or write to `/tmp` (ephemeral but allowed) |
| `bind: address already in use` | Two listeners trying to grab the same port | Usually a duplicate `app.listen` call from a hot-reload library still active in production |
The pattern is always the same: read the first stack trace, identify what the app needs that isn’t there, provide it.
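One way to make that first stack trace easy to read is to validate configuration up front and report every missing variable in a single message. The following is a minimal sketch, not a prescribed pattern: `REQUIRED_VARS`, `missing_env_vars`, and `validate_config` are illustrative names, and the variable list is an assumption you would replace with what your app actually needs.

```python
import os
import sys
from typing import Iterable, List, Mapping

# Hypothetical list: replace with the variables your app really requires.
REQUIRED_VARS = ["DATABASE_URL", "SECRET_KEY"]


def missing_env_vars(required: Iterable[str], environ: Mapping[str, str]) -> List[str]:
    """Return the required names that are unset or empty, in the order given."""
    return [name for name in required if not environ.get(name)]


def validate_config(environ: Mapping[str, str] = os.environ) -> None:
    """Exit non-zero with one clear message instead of failing later, mid-request."""
    missing = missing_env_vars(REQUIRED_VARS, environ)
    if missing:
        # sys.exit with a string prints it to stderr and exits with code 1.
        sys.exit("Missing required environment variables: " + ", ".join(missing))
```

Calling `validate_config()` as the first statement of the entry point means each restart in a crash loop prints one readable line naming everything that is missing.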
Health check failures
Once the port is detected, Render starts polling your healthCheckPath (if configured). If it doesn’t return 200 within the grace window, the deploy fails.
What you’ll see

    ==> Detected service running on port 10000
    ==> Health check at '/healthz' returned 503 (5 attempts)
    ==> Deploy failed: Health check returned non-2xx status

    ==> Detected service running on port 10000
    ==> Health check at '/health' returned 404 (5 attempts)
    ==> Deploy failed: Health check returned non-2xx status

    ==> Detected service running on port 10000
    ==> Health check at '/healthz' timed out (5 attempts)
    ==> Deploy failed

What each variant means
| Status | Likely cause |
|---|---|
| 404 | The path doesn’t exist on your service. Either you’ve typo’d the path in Render’s config or the route was removed |
| 503 / 500 | The endpoint exists but the service isn’t healthy - usually a dependency check (DB, Redis) is failing inside the handler |
| Timeout | The endpoint is taking too long. App is busy starting up, or the handler does heavy work synchronously |
| 405 Method Not Allowed | Your `/healthz` only accepts POST. Render uses GET |
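Before pushing another deploy, it can help to run the same GET against the service on your own machine and map the result onto the table. This is a rough sketch using only the standard library; `probe` and `classify` are illustrative names, and the wording of each diagnosis simply restates the table rather than anything Render reports.

```python
import urllib.error
import urllib.request
from typing import Optional


def classify(status: Optional[int]) -> str:
    """Map a health-check result to the likely cause from the status table."""
    if status is None:
        return "timeout or unreachable: handler too slow or app still starting"
    if 200 <= status < 300:
        return "healthy"
    if status == 404:
        return "path does not exist: check for a typo or a removed route"
    if status == 405:
        return "endpoint does not accept GET"
    if status >= 500:
        return "handler is failing internally, often a dependency check"
    return f"unexpected status {status}"


def probe(url: str, timeout: float = 5.0) -> Optional[int]:
    """Send one GET; return the HTTP status, or None on timeout or connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises HTTPError for 4xx/5xx responses
    except OSError:
        return None      # URLError and socket timeouts are OSError subclasses
```

For example, `print(classify(probe("http://localhost:10000/healthz")))` checks a locally running instance; the port and path here are assumptions taken from the sample logs.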
The fix
A health check endpoint should be cheap and free of external dependencies:

Node:

    app.get("/healthz", (req, res) => {
      res.status(200).json({ status: "ok" });
    });

Python:

    @app.get("/healthz")
    def healthz():
        return {"status": "ok"}

Things to avoid doing in the health check handler:
- Connecting to the database on every call (slow under load, false negatives on transient DB issues).
- Checking external services (your `/healthz` shouldn’t fail because Stripe is having an outage).
- Doing CPU-heavy work.
A health check should answer “is this process alive enough to receive traffic?”, not “is everything in the world working?”
Tuning the health check
In the Render Dashboard’s Settings → Health Check section (or `healthCheckPath` in your Blueprint), you can also adjust how many failures Render tolerates before failing the deploy. For services with heavy startup work (loading models, warming caches), handle the delay in your code rather than raising the tolerance. Render’s grace window is generous enough for normal startup, but if you’re loading a 4 GB language model into RAM, point the health check at a separate endpoint that only goes green once the model is ready.

    let ready = false;

    app.get("/healthz", (req, res) => {
      // Live check: process is up
      res.status(200).json({ status: "ok", ready });
    });

    app.get("/ready", (req, res) => {
      // Ready check: model loaded, queues warmed
      res.status(ready ? 200 : 503).json({ ready });
    });

    (async () => {
      await loadHeavyModel();
      ready = true;
      console.log("Service ready to receive traffic");
    })();

Point Render’s `healthCheckPath` at `/ready`, and the deploy will only flip live once the heavy startup completes.
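The same two-endpoint idea can be sketched in Python with only the standard library. `status_for`, `HealthHandler`, and `serve` are illustrative names, and `load_heavy_model` is a hypothetical callable standing in for your slow startup work; a real service would more likely add these two routes to the framework it already uses.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

ready = threading.Event()  # set once the heavy startup work has finished


def status_for(path: str, is_ready: bool) -> int:
    """Liveness is always 200; readiness is 503 until startup completes."""
    if path == "/healthz":
        return 200
    if path == "/ready":
        return 200 if is_ready else 503
    return 404


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        body = json.dumps({"ready": ready.is_set()}).encode()
        self.send_response(status_for(self.path, ready.is_set()))
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int, load_heavy_model) -> None:
    """Bind the port immediately, then do the slow loading in a background thread."""
    def startup() -> None:
        load_heavy_model()  # hypothetical slow startup function supplied by the caller
        ready.set()

    threading.Thread(target=startup, daemon=True).start()
    ThreadingHTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Binding before loading matters here: the port is detected right away, while `/ready` keeps answering 503 until `ready.set()` runs.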
OOM kills
The classic “I built it, it worked for a minute, then died with no warning.”
What you’ll see

    [2026-05-21 18:45:12] Loading 8 GB embedding model...
    ==> Out of memory! Process was killed
    ==> Application exited with code 137
    ==> Restarting...

Or - and this is the tricky one - no application log at all before the exit:

    ==> Detected service running on port 10000
    ==> Your service is live 🎉
    (... 30 minutes pass ...)
    ==> Application exited with code 137
    ==> Restarting...

What it means
Exit code 137 means the process was killed with SIGKILL (it’s 128 + 9, the signal number). With logs like the ones above, the cause is an OOM kill: the kernel killed your process because the instance ran out of memory.
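The 128 + signal-number convention applies to other signals too, so it can be decoded mechanically. A small sketch, assuming a POSIX system (signal names come from Python's `signal` module); `describe_exit` is an illustrative helper, not part of any Render tooling.

```python
import signal


def describe_exit(code: int) -> str:
    """Decode a shell-style exit code: values above 128 mean 'killed by signal (code - 128)'."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"killed by unknown signal {code - 128}"
    return f"exited with code {code}"


print(describe_exit(137))  # 137 - 128 = 9, which is SIGKILL
print(describe_exit(143))  # 143 - 128 = 15, which is SIGTERM
print(describe_exit(1))    # an ordinary non-zero exit from the app itself
```

An app can also choose to exit with a code above 128 on its own, so treat the decoded signal as a strong hint and confirm it against the surrounding log lines.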
Diagnosis
Two questions:
- Was it instant or gradual? Instant OOM during startup means your normal RSS exceeds the instance limit - wrong instance size for the workload. Gradual OOM (live for a while, then dies) means a memory leak.
- Look at the Metrics tab. Memory should plateau in healthy services. A steadily-climbing graph that crashes at the plan limit is a leak.
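If the graph suggests a leak and the service is written in Python, `tracemalloc` can show which source lines grew between two points in time when you reproduce the problem locally. The sketch below fakes a leak with a list that only grows; `top_growth` and `leaky_cache` are illustrative names, and in a real investigation you would take the two snapshots some minutes apart under realistic traffic.

```python
import tracemalloc


def top_growth(before: tracemalloc.Snapshot, after: tracemalloc.Snapshot, limit: int = 5):
    """Return the source lines whose allocated size grew most between two snapshots."""
    # compare_to sorts by absolute size difference, largest first.
    stats = after.compare_to(before, "lineno")
    return [s for s in stats if s.size_diff > 0][:limit]


tracemalloc.start()
before = tracemalloc.take_snapshot()

leaky_cache = []  # stand-in for a leak: a list that is appended to and never trimmed
for i in range(10_000):
    leaky_cache.append("x" * 100 + str(i))

after = tracemalloc.take_snapshot()
for stat in top_growth(before, after):
    print(stat)  # each line shows file:line, size delta, and allocation count delta
tracemalloc.stop()
```

Lines that keep appearing at the top across repeated snapshots are the ones worth reading first.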
The fix
- Wrong instance size → upgrade to a larger plan. The scaling docs list the memory ceilings per plan.
- Memory leak → use a heap profiler locally (Node: `--inspect` + Chrome DevTools; Python: `tracemalloc`; Go: `pprof`). Render itself doesn’t profile your process, but it gives you the time-of-death and the metrics graph.
- One-time spike (large request, batch job) → move the heavy work to a background worker, or tune the queue worker count down. If you have one process with 8 worker threads each doing 1 GB of work, you’re at 8 GB - instances much smaller than that will OOM.
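The worker arithmetic in the last item can be written down as a quick budget check. This is a back-of-the-envelope sketch with illustrative function names; the per-worker peak and the baseline figure are numbers you would measure yourself, not values Render provides.

```python
def peak_mb(workers: int, per_worker_mb: int, baseline_mb: int = 0) -> int:
    """Worst case: every worker at its peak at once, on top of a shared baseline."""
    return baseline_mb + workers * per_worker_mb


def max_workers(instance_mb: int, per_worker_mb: int, baseline_mb: int = 0) -> int:
    """How many workers fit in the instance if each needs per_worker_mb at peak."""
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    return max((instance_mb - baseline_mb) // per_worker_mb, 0)


print(peak_mb(8, 1024))         # 8 workers x 1 GB each, in MB
print(max_workers(4096, 1024))  # workers that fit in a 4 GB instance
```

If `max_workers` comes out lower than the configured worker count, either reduce the count or move to a larger instance.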
For the instance-size case, the change is one line in your Blueprint:

    services:
      - type: web
        name: api
        plan: standard  # was 'starter' - 2 GB instead of 512 MB

SIGTERM and graceful shutdown
A different failure mode that often shows up as “deploys started failing after I added zero-downtime deploys.”
What you’ll see

    ==> SIGTERM received, shutting down gracefully
    ==> Application did not exit within 30 seconds, sending SIGKILL
    ==> Application exited with code 137
    ==> New instance is starting...

With gunicorn:

    [CRITICAL] WORKER TIMEOUT (pid:23)
    [ERROR] Worker (pid:23) was sent SIGKILL! Perhaps out of memory?

What it means
When Render deploys a new version or scales down, it sends SIGTERM to the old process and waits up to 30 seconds for it to exit cleanly. If the process ignores the signal, Render sends SIGKILL - which is what shows in logs above.
The two flavors:
- Process doesn’t handle SIGTERM at all → server keeps running, drops connections at SIGKILL time. Ugly logs but recoverable.
- Long-running requests block shutdown → a worker stuck on a slow request can’t drain in time. Often shows up as `WORKER TIMEOUT` in gunicorn or uvicorn.
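To see which flavor you have before deploying, you can imitate the SIGTERM, wait, SIGKILL sequence against a local copy of your process. This is only a local simulation of the behavior described above, not Render's implementation; `stop_with_grace` is an illustrative name and the code assumes a POSIX system.

```python
import subprocess


def stop_with_grace(proc: subprocess.Popen, grace_seconds: float = 30.0) -> int:
    """Send SIGTERM, wait up to grace_seconds, then SIGKILL. Returns the return code."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL: cannot be caught or ignored
        return proc.wait()
```

On POSIX, `Popen` reports death by signal N as a return code of -N, so -9 here corresponds to the exit code 137 in the logs, and 0 means your handler drained and exited cleanly within the window.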
The fix
Wire a SIGTERM handler that stops accepting new connections and drains the in-flight ones:

Node:

    const server = app.listen(port, "0.0.0.0");

    process.on("SIGTERM", () => {
      console.log("SIGTERM received, draining...");
      server.close(() => {
        console.log("All connections drained, exiting");
        process.exit(0);
      });
      // Force exit after 25s to stay under Render's 30s window
      setTimeout(() => process.exit(0), 25000).unref();
    });

gunicorn:

    # In startCommand or gunicorn.conf.py
    gunicorn myapp:app --bind 0.0.0.0:$PORT \
      --workers 4 \
      --graceful-timeout 25 \
      --timeout 30

For background workers (no HTTP listener), the same pattern: SIGTERM → finish current job → exit.

    import signal
    from queue import Empty

    shutdown = False

    def handle_sigterm(*_):
        global shutdown
        shutdown = True

    signal.signal(signal.SIGTERM, handle_sigterm)

    # `queue` and `process` stand for your app's job queue and job handler
    while not shutdown:
        try:
            job = queue.get(timeout=5)
        except Empty:
            continue  # no job yet; loop back and re-check the shutdown flag
        process(job)

Crash loop vs flapping service
Two patterns that look identical from a distance:
| Pattern | Symptom | Likely cause |
|---|---|---|
| Crash loop | Service exits on boot, restarts, exits, restarts, indefinitely | Deterministic boot error (missing env var, bad config) |
| Flapping | Service goes live, stays up for minutes/hours, dies, restarts | Non-deterministic (memory leak, slow OOM, external dep timing out) |
The Events feed disambiguates them: crash loops show many restart events in a few minutes; flapping services have minutes or hours between restarts. The diagnosis paths are completely different.
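If you copy the restart timestamps out of the Events feed, the distinction can be made mechanically from the gaps between them. The sketch below is a heuristic under stated assumptions: `classify_restarts` is an illustrative name, and the two-minute threshold is an arbitrary cutoff chosen for the example, not a Render definition.

```python
from datetime import datetime, timedelta
from typing import List


def classify_restarts(restarts: List[datetime],
                      loop_gap: timedelta = timedelta(minutes=2)) -> str:
    """Label a restart history by the typical gap between consecutive restarts."""
    if len(restarts) < 2:
        return "not enough data"
    ordered = sorted(restarts)
    gaps = sorted(b - a for a, b in zip(ordered, ordered[1:]))
    typical = gaps[len(gaps) // 2]  # median gap (upper middle for an even count)
    return "crash loop" if typical <= loop_gap else "flapping"


t0 = datetime(2026, 5, 21, 18, 0, 0)
print(classify_restarts([t0 + timedelta(seconds=20 * i) for i in range(6)]))
print(classify_restarts([t0 + timedelta(hours=3 * i) for i in range(4)]))
```

Using the median rather than the mean keeps one unusually long gap, such as a quiet night, from hiding an otherwise tight loop.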
What you learned
- Crashes on boot have a stack trace as the first error - read it, identify what's missing (env var, dep, network), provide it
- Health check failures break down by status: 404 (wrong path), 503/500 (handler is failing internally), timeout (handler too slow). Fix the handler, not the check tolerance
- Use a two-phase health check (`/healthz` cheap, `/ready` heavy) when your service has slow startup like model loading
- Exit code 137 usually means an OOM kill. Check the memory graph in the Metrics tab: an immediate OOM at startup means the instance is too small, a steady climb means a leak
- Set `NODE_OPTIONS=--max-old-space-size=...` to let V8 use more of the available RAM
- Trap SIGTERM and drain in under 25 seconds. Without a handler, Render sends SIGKILL at 30 seconds and you drop in-flight requests