The playbook — When deploys go wrong

You’ve walked the method (step 02), the logs surface (step 03), and the four failure families (steps 04-07). This last step is the stitching: a single decision tree that takes you from “deploy failed” to “I know exactly what’s wrong” in a bounded number of clicks, plus a runbook entry you can drop into your team’s docs.

The master flowchart

Save this somewhere. Print it. Tattoo it on a forearm if you must.

flowchart TB
  start["A deploy or service is broken"]
  events["Open the Events feed"]
  status{"What's the status<br/>of the most recent event?"}

  start --> events --> status

  status -->|"'Build failed'"| build["Open deploy log<br/>→ search for 'error'<br/>→ first match is the cause"]
  status -->|"'Pre-deploy failed'"| build
  status -->|"'Deploy failed'<br/>(after build/predeploy)"| boot["Open runtime logs<br/>→ look for stack trace OR<br/>'No open ports detected'"]
  status -->|"'Live' but errors<br/>in the wild"| runtime["Reproduce with curl<br/>→ note the HTTP code<br/>→ check runtime logs"]

  build --> b1["Step 04:<br/>Build-failure library"]
  boot --> bo1{"What does the<br/>runtime log say?"}
  bo1 -->|"'No open ports detected'"| port["Step 05:<br/>Port binding"]
  bo1 -->|"Stack trace<br/>before exit"| crash["Step 06:<br/>Crash on boot"]
  bo1 -->|"Health check<br/>failed"| hc["Step 06:<br/>Health checks"]

  runtime --> r1{"HTTP code?"}
  r1 -->|"4xx"| app["Step 07:<br/>App-level errors"]
  r1 -->|"5xx (not 502)"| app
  r1 -->|"502"| edge["Step 07:<br/>Edge couldn't reach app"]

Four statuses feed three branches off the top (build and pre-deploy failures share one), each with a known target step. No branch is deeper than three levels. If you find yourself drilling further, you’re probably trying to solve two problems at once: split them, fix one, then go back to the other.
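The chart is small enough to encode as a lookup, which can help when you want the same routing in a script or a chat-bot reply. The sketch below is one way to do that in Python. The function name `next_step` and its parameters are illustrative, and the status strings are copied from the chart's labels rather than from any Render API.

```python
from typing import Optional


def next_step(status: str, runtime_log: str = "", http_code: Optional[int] = None) -> str:
    """Walk the flowchart: event status -> where to look -> tutorial step."""
    status = status.strip().lower()

    # Build and pre-deploy failures share one branch: both live in the deploy log.
    if status in ("build failed", "pre-deploy failed"):
        return "Step 04: Build-failure library"

    if status == "deploy failed":
        log = runtime_log.lower()
        if "no open ports detected" in log:
            return "Step 05: Port binding"
        if "health check" in log:
            return "Step 06: Health checks"
        # Simplification: anything else on this branch is treated as a crash on boot.
        return "Step 06: Crash on boot"

    if status == "live":
        if http_code is None:
            return "Reproduce with curl and note the HTTP code"
        if http_code == 502:
            return "Step 07: Edge couldn't reach app"
        if 400 <= http_code < 600:
            return "Step 07: App-level errors"
        return "Not an error code: re-check the reproduction"

    return "Unknown status: open the Events feed"
```

For example, `next_step("Deploy failed", runtime_log="==> No open ports detected")` routes to the port-binding step, and `next_step("Live", http_code=502)` routes to the edge branch of step 07.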

The 60-second triage script

When a deploy or service breaks, run this script in your head before anything else:

1. Open the Events feed for the service. It’s the timeline. The most recent red event is the failure; the most recent change before that is the suspect.
2. Note the status of the failing event. ‘Build failed’ / ‘Pre-deploy failed’ → deploy log. ‘Deploy failed’ → runtime logs. Live with errors → runtime logs scoped by time.
3. Open the right log and search for `error`. Jump to the first match. Read 5 lines above and 10 below.
4. Match the first error against the failure library. Steps 04-07 catalogue the common ones. If yours isn’t there, the shape is usually similar to one that is.
5. Write the hypothesis in one sentence before changing anything. “X is missing/wrong because Y.” If you can’t write Y, keep reading the logs.
6. Apply the smallest fix that disproves the hypothesis. One commit, one env var, one Blueprint change. Not three at once.

If you’ve done this twice and you’re still stuck, escalate (step 02, section 6). The script is good for ~80% of failures; the rest need more context, and the longer you swing without a fresh perspective, the worse the swing gets.
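The "search for `error`, read 5 lines above and 10 below" step is mechanical enough to script against a log you have saved or copied out of the dashboard. This is a minimal sketch, not a Render tool: the function name `first_error_context` is hypothetical, and it assumes the log is plain text with one entry per line.

```python
from typing import List


def first_error_context(log_text: str, needle: str = "error",
                        before: int = 5, after: int = 10) -> List[str]:
    """Return the FIRST line containing `needle` plus its surrounding context."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        # Case-insensitive match, so "ERROR", "Error" and "error" all count.
        if needle.lower() in line.lower():
            start = max(0, i - before)
            end = min(len(lines), i + after + 1)
            return lines[start:end]
    # No match: the caller should widen the search term or the time window.
    return []
```

Stopping at the first match is deliberate: later matches are usually consequences of the first one.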

A printable cheat sheet

Pin this next to your monitor:

Symptom in Events feed	Where to look	Most common causes
Build failed	Deploy log → build section	Module not found, language version, lockfile drift
Pre-deploy failed	Deploy log → pre-deploy section	Failed migration, missing env var, broken seed script
Deploy failed (after build OK)	Runtime logs of new instance	Port binding (0.0.0.0/PORT), missing CMD, crash on boot
Deploy failed (port detected)	Runtime logs of new instance	Health check failing, slow startup, missing dependency
Live but 502s	Runtime logs at request time	Node keep-alive, gunicorn worker timeout, intermittent network
Live but 500s	Runtime logs at request time	Uncaught exception, DB SSL, pool exhaustion
Live but 404s on SPA routes	Static site config	Missing rewrite rule for `/*` → `/index.html`
Live but 400s	Runtime logs at request time	Django ALLOWED_HOSTS, custom domain not in app’s allowlist
Service repeatedly restarting	Runtime logs + Metrics tab	OOM (exit 137), unhandled exception in non-request code, missing env var
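Of the causes in the table, "Port binding (0.0.0.0/PORT)" is the easiest to show in code. The sketch below uses only Python's standard library and assumes the platform passes the port in a `PORT` environment variable. The fallback of 10000 and the names `listen_address`, `Health` and `make_server` are assumptions for illustration, so check the fallback against the Render docs.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def listen_address(default_port: int = 10000) -> Tuple[str, int]:
    """Bind on all interfaces, on the port given in the PORT env var."""
    # 0.0.0.0 rather than 127.0.0.1: a loopback-only listener cannot be
    # reached from outside the container, so no open port is detected.
    return "0.0.0.0", int(os.environ.get("PORT", default_port))


class Health(BaseHTTPRequestHandler):
    """Answers every GET with 200 so a health check has something to hit."""

    def do_GET(self) -> None:
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok\n")


def make_server() -> HTTPServer:
    # Creating the server binds the socket; it does not start serving yet.
    return HTTPServer(listen_address(), Health)
```

Calling `make_server().serve_forever()` starts the listener. The same two rules (host `0.0.0.0`, port from `PORT`) apply whatever framework you use.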

The four-category drift checklist

When the failure doesn’t match a known pattern, walk the four categories of local-vs-Render drift from step 01. For each, ask one question:

Versions: “What language and dependency versions am I using locally? What is the deploy log saying Render is using? Are they the same major and minor?”
Environment variables: “What env vars are set locally (check .env, your shell, your IDE config)? Are all of them set on Render? In particular: DB URL, secret keys, API keys, NODE_ENV/RUST_LOG.”
Filesystem: “Do any of my imports differ in case from the actual file? Am I writing to disk anywhere without a persistent disk attached?”
Network: “Am I using the right DB URL (internal vs external)? Is sslmode set correctly? Am I assuming localhost reachability that doesn’t exist between services?”

If you’ve answered all four and the gap is still hiding, you’ve ruled out the causes behind the overwhelming majority of failures. The remainder is platform issues or genuinely novel application bugs, both worth a support ticket or a deeper read of the Render docs.
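The first two questions can be answered by comparison rather than by eye. The sketch below assumes you have a local `.env` file in simple `KEY=VALUE` form and a list of the variable names set on the service, copied by hand from its environment settings. All three function names are hypothetical, and the parser deliberately ignores `export` prefixes and multi-line values.

```python
from typing import Dict, Iterable, List, Tuple


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def missing_on_remote(local_env: Dict[str, str], remote_keys: Iterable[str]) -> List[str]:
    """Names that exist locally but are not set on the deployed service."""
    remote = set(remote_keys)
    return sorted(key for key in local_env if key not in remote)


def same_major_minor(local: str, remote: str) -> bool:
    """Compare versions such as '20.11.1' and 'v20.11.0' on major.minor only."""
    def major_minor(version: str) -> Tuple[str, ...]:
        return tuple(version.strip().lstrip("v").split(".")[:2])
    return major_minor(local) == major_minor(remote)
```

A non-empty result from `missing_on_remote` or a `False` from `same_major_minor` is a hypothesis worth writing down before you change anything.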

A runbook entry to copy

Drop this into your team’s runbook. It’s the externalised version of the method, sized for someone on call at 3am who can’t think clearly:

RUNBOOK.md

# Render: a deploy or service is broken

## 1. Where is the failure?

- Open https://dashboard.render.com → the affected service → **Events** tab.
- The most recent red event tells you which phase failed:
  - **Build failed** → click the deploy → read the build log top-to-bottom; search for "error".
  - **Pre-deploy failed** → same deploy log, scroll to the pre-deploy section.
  - **Deploy failed** with green build → click the service's **Logs** tab and filter to the deploy's time window.
  - **Live but errors** → reproduce with curl, note the HTTP code, open Logs.

## 2. Read the FIRST error

- The first error is the cause. Everything after it is consequence.
- Scroll UP, not down. Skip the "process exited" line at the bottom.
- Render's `==>` lines tell you what the *platform* was doing; everything else is your app.

## 3. Match against known patterns

- See: <link to this tutorial>
- Build phase: tutorial step 04
- Boot phase: tutorial step 05
- Crash / health check: tutorial step 06
- Runtime (HTTP errors): tutorial step 07

## 4. Apply ONE fix

- Smallest change that disproves the hypothesis.
- Wait for the redeploy. Don't push three commits to "try things".

## 5. Still stuck?

- Status page: https://status.render.com - check for incidents.
- Escalation channel: <your support channel / Discord / ticket queue>.
- Include: service ID, deploy ID, UTC timestamp, first error line, what you tried.

## Quick links

- Render docs troubleshooting: https://render.com/docs/troubleshooting-deploys
- Render logging docs: https://render.com/docs/logging
- This tutorial: <link>

Customise the links to your team’s tools and you’re done.
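The "Include:" line in section 5 is the part people skip at 3am, so it can be worth templating. This is a hedged sketch of a formatter for that message. The function name `escalation_report` and the field labels are illustrative, and the IDs in the example are placeholders, not real Render identifiers.

```python
from datetime import datetime, timezone
from typing import List, Optional


def escalation_report(service_id: str, deploy_id: str, first_error: str,
                      tried: List[str], when: Optional[datetime] = None) -> str:
    """Format the five items the runbook asks for into one pasteable block."""
    # Pass a timezone-aware datetime; a naive one is interpreted as local time.
    when = when or datetime.now(timezone.utc)
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    lines = [
        f"Service ID: {service_id}",
        f"Deploy ID: {deploy_id}",
        f"Timestamp (UTC): {stamp}",
        f"First error: {first_error.strip()}",
        "Tried:",
    ]
    lines += [f"  - {item}" for item in tried] or ["  - (nothing yet)"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(escalation_report(
        "example-service-id",   # placeholder
        "example-deploy-id",    # placeholder
        "Error: Cannot find module 'express'",
        ["cleared build cache and redeployed"],
    ))
```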

Practising on a calm day

The method is muscle memory. Practising it when nothing is broken is the difference between using it well during an incident and forgetting it exists.

A 15-minute drill you can run any time:

1. Pick a recent successful deploy. Open its log. Read the `==>` markers and identify each phase. Verify you can name what each section does without scrolling.
2. Pick a recent failed deploy (if you have one). Walk the method out loud. Don’t peek at how it was actually fixed until you’ve written down your hypothesis.
3. Break a test service deliberately. Set NODE_VERSION=14 on a Node service that needs 20. Redeploy. Watch the failure, walk through this tutorial’s relevant step, and fix it. Repeat with a wrong PORT, a missing env var, etc.
4. Time yourself. From the moment the deploy fails to the moment you’ve written a hypothesis. Aim for under 90 seconds.

The fastest engineers aren’t faster typists - they recognise patterns faster. The library you’ve built across steps 04-07 is the pattern recognition.

When the docs are the source of truth

This tutorial captures the most common failures as of writing. Render’s platform evolves: new runtime defaults, new instance types, new platform features. If something here ever disagrees with the Render docs, the docs win:

Troubleshooting your deploy - the canonical list of common errors.
Logging - current log retention, streaming, and the log explorer.
Native runtimes - what’s pre-installed and what isn’t.
Health checks - the configuration surface and behavior.

Bookmark them all. The fastest answer to “is this a Render quirk or my bug?” is often a 30-second skim of one of these pages.

Where to go next

You’ve now got a complete troubleshooting toolkit - method, logs literacy, a failure library across all four phases, a flowchart, and a runbook. Natural next stops:

Render CLI for power users - turns the Render Dashboard-based diagnosis loop into a scriptable, CI-integrated one. The `render logs` and `render deploys` commands compress a lot of clicks.
Postgres on Render: a deep dive - connection pooling, SSL, and the four-error playbook for the database-layer issues that often appear as 500s in this tutorial.
Advanced Blueprint patterns - the right Blueprint shape removes whole categories of failure (especially env-var drift between services) before they happen.

It's 2am. A teammate pages you: 'Production is on fire. The site is throwing 502s.' You're not awake enough to think. Which of these does the playbook tell you to do FIRST?

Roll back to the previous deploy immediately
Open https://status.render.com to check for a platform incident
Open the service's Events feed and runtime Logs tab, then identify the surface (this is a runtime/edge issue, not a deploy issue)
SSH into the running service and start debugging

What you learned

The master flowchart starts at the Events feed, splits on the most recent event's status, and lands in steps 04–07 within at most three hops
The 60-second triage script: open events, identify the surface, read the FIRST error, match against the library, write one hypothesis, apply the smallest fix
Cheat sheet: build failed → deploy log; deploy failed (build green) → runtime log; live with errors → runtime log scoped by time
Four-category drift checklist: versions, env vars, filesystem, network. Walk all four when the failure isn't matching a known pattern
The runbook entry is meant to be readable at 3am - keep it short, link out to the deep-dive (this tutorial), and update it when you hit new patterns
Practise on a calm day. Pattern recognition is the difference between a 5-minute fix and a 90-minute fire
When in doubt, the [Render docs on troubleshooting](https://render.com/docs/troubleshooting-deploys) are the source of truth