Scale and survive failures — Localhost Part 1: Deploy an AI code-review agent on Render

The worker deploy unlocked two things in-process review couldn’t do: scale the agent independently, and keep work alive when the web tier restarts. Before you prove that in the Dashboard, read the one part that makes the worker pattern real: message acknowledgement.

1. Trace the ack semantics

Open the queue helper and find the entry-processing function. In the current workshop repo, the function is implemented so you can inspect the contract directly:

TypeScript track
Queue helper: packages/queue-agents/src/kv.ts
Function: processEntry

Python track
Queue helper: packages/queue_agents/src/queue_agents/kv.py
Function: process_entry

Parse the stream entry into a job.
Run the handler.
On success, call XACK so the consumer group does not redeliver the message.
On failure, log and return without acking. The message stays pending and can be retried.

Open the queue helper. In your fork, open the queue helper for your track.
Find the entry processor. Read the try block: it parses the entry, runs the handler, then calls xack only after success.
Check the failure path. Read the error path: it logs and returns without acknowledging the stream entry, so the message stays pending for retry.

This is the queue ownership the rest of the workshop contrasts with Workflows. Acknowledging too early can lose work. Letting errors escape can kill the consumer loop. Forgetting to ack on success leaves the job pending in Redis, where it can be reclaimed and run again.
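The contract above can be sketched in a few lines. This is a minimal illustration, not the repo's implementation: the signature, the `payload` field name, and the logger name are assumptions made for the example, and `client` is any object exposing a redis-py style `xack(stream, group, *ids)` method.

```python
import json
import logging
from typing import Any, Callable, Mapping

log = logging.getLogger("worker")  # hypothetical logger name


def process_entry(
    client: Any,
    stream: str,
    group: str,
    entry_id: str,
    fields: Mapping[str, str],
    handler: Callable[[dict], None],
) -> bool:
    """Run one stream entry; return True only if it was acknowledged."""
    try:
        # "payload" is a hypothetical field name holding the JSON-encoded job.
        job = json.loads(fields["payload"])
        handler(job)
        # Ack only after the handler has returned, so a crash mid-run
        # leaves the entry pending instead of losing it.
        client.xack(stream, group, entry_id)
        return True
    except Exception:
        # Log and return without acking: the entry stays pending for retry,
        # and the exception never escapes into the consumer loop.
        log.exception("entry %s failed; leaving it pending", entry_id)
        return False
```

Moving the `xack` call above `handler(job)` would reproduce the "acknowledging too early" failure: the entry is gone from the pending list before the work has actually succeeded.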

2. Verify the queue behavior

Start Redis or Valkey locally if it is not already running:

With a local install:

redis-server

With Docker:

docker run --rm -p 6379:6379 redis

In another terminal, run the focused worker test:

TypeScript track:

VALKEY_URL=redis://127.0.0.1:6379 npm run test:worker

Python track:

VALKEY_URL=redis://127.0.0.1:6379 uv run pytest tests/integration/test_queue_kv.py

All three focused tests should pass: the success ack, the failed-handler pending state, and a stale pending entry reclaimed for another attempt. If the run reports 0 tests or skipped, VALKEY_URL never reached the test process and the suite skipped itself; see Troubleshooting. The repo ships this implementation so you can inspect the contract, verify it locally, and connect that contract to the deployed behavior.
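Because a skipped run still looks green, it can help to check the environment before trusting the result. The sketch below is not part of the workshop repo: `ping` and `preflight` are hypothetical helpers, and it assumes VALKEY_URL is a plain redis://host:port URL with no password or TLS. It sends a raw PING over a socket, which is what redis-cli ping does under the hood.

```python
import socket
from typing import Mapping
from urllib.parse import urlparse


def ping(url: str, timeout: float = 2.0) -> bool:
    """Return True if a Redis-protocol server at `url` answers PING with PONG."""
    parsed = urlparse(url)
    host = parsed.hostname or "127.0.0.1"
    port = parsed.port or 6379
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # PING encoded as a one-element RESP array.
            sock.sendall(b"*1\r\n$4\r\nPING\r\n")
            reply = sock.recv(64)
    except OSError:
        # Covers connection refused, timeouts, and unresolved hosts.
        return False
    return reply.startswith(b"+PONG")


def preflight(env: Mapping[str, str]) -> str:
    """Describe whether the worker tests would really run against a server."""
    url = env.get("VALKEY_URL")
    if not url:
        return "VALKEY_URL is unset: the worker tests would skip"
    if not ping(url):
        return f"no PONG from {url}: is Redis or Valkey running?"
    return "ok"
```

Calling `preflight(os.environ)` from a Python shell started with the same inline VALKEY_URL should return "ok" when the server is up.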

3. Scale out

Submit several reviews quickly against the <your-username>-queue-agents-web URL. Use the LlamaIndex baseline PR, the OpenAI Agents trace PR, or any other public PR from the dashboard picker. With one worker, jobs queue up and drain one at a time. Use the dashboard’s Status and Run time (s) columns as your quick read on how long each run waits and works.

Now add workers:

Open the worker. In the Render Dashboard, open <your-username>-queue-agents-worker.
Raise the instance count. Go to Scaling and set instances to 3. This is the numInstances field from the Blueprint.
Resubmit several reviews. Submit several reviews again and compare the new rows’ status changes and run times.

Jobs move through the Key Value stream and spread across the worker instances. Throughput went up with no code change. The agent did not learn to be faster; you gave it more places to run.

flowchart LR
  web["queue-agents-web<br/>producer"]
  kv[("Key Value<br/>stream + consumer group")]
  w1["worker 1"]
  w2["worker 2"]
  w3["worker 3"]

  web -->|enqueue| kv
  kv --> w1
  kv --> w2
  kv --> w3
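To see why more instances raise throughput without touching the agent, here is a small back-of-the-envelope model. It is an idealized sketch, not a measurement: `drain_time` is a hypothetical helper, the 2-second review duration is an assumed figure, and it ignores queue and network overhead. Each job simply goes to whichever worker frees up first.

```python
import heapq


def drain_time(durations: list[float], workers: int) -> float:
    """Seconds until the last job finishes when each job goes to the next free worker."""
    if workers < 1:
        raise ValueError("need at least one worker")
    # Min-heap of the times at which each worker becomes free.
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    finish = 0.0
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        end = start + duration
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


burst = [2.0] * 6  # six reviews, each assumed to take 2 seconds
print(drain_time(burst, 1))  # 12.0
print(drain_time(burst, 3))  # 4.0
```

Each review still takes the same 2 seconds in both runs. Only the waiting time shrinks, which is the effect to look for in the dashboard rows.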

4. Survive a restart

The other payoff is durability. The work no longer lives in a request, so killing the request can’t kill the work.

Start a review. Submit a public PR and confirm a worker picks it up in the logs.
Restart the web service. While the review is in flight, redeploy or restart <your-username>-queue-agents-web from the Dashboard.
Check the result. The worker finishes independently. When the web service comes back, the completed review row is already in Postgres and visible in the dashboard.

In Pattern 1, that restart would have lost the run. Here the job was already on the queue and the worker was already running it, both outside the web service’s lifecycle.
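The producer side of that split can be sketched as follows. This is an illustration under assumptions, not the web service's actual code: `enqueue_review`, the `payload` field, and the `pr_url` key are hypothetical names, and `client` is any object with a redis-py style `xadd(name, fields)` method.

```python
import json
from typing import Any


def enqueue_review(client: Any, stream: str, pr_url: str) -> str:
    """Append a review job to the stream and return its entry id."""
    # The request handler only appends the job and returns. From here on the
    # entry lives in the stream, not in the web process, so restarting the
    # web service cannot drop it.
    fields = {"payload": json.dumps({"pr_url": pr_url})}  # hypothetical shape
    return client.xadd(stream, fields)
```

The function returns as soon as the entry is stored, which is why the web tier can restart while a worker is still running the review.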

5. Count what you now own

Scale and durability were not free. Open the queue helper again and count the coordination layer:

TypeScript track: use packages/queue-agents/src/kv.ts.

Python track: use packages/queue_agents/src/queue_agents/kv.py.

The stream and the consumer group.
Blocking reads that wait for the next job.
Acks, so a job isn’t lost or double-run.
Retry-on-failure for un-acked messages.
The pub/sub channel that reports progress while the dashboard reads durable review state.

It is not a huge file. But every line is coordination code you now own, debug, and keep correct. There is still no built-in trace of which agent ran where or how long each step took.
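To make that inventory concrete, here is a rough sketch of one polling pass covering the consumer group, the blocking read, and the retry path. It is written under assumptions rather than copied from the helper: the stream and group names are hypothetical, `client` is assumed to behave like a redis-py client with its default response format, and the 30-second idle window is the figure given in Troubleshooting. The progress pub/sub channel is left out.

```python
from typing import Any

STREAM = "reviews"         # hypothetical stream name
GROUP = "review-workers"   # hypothetical consumer group name
RECLAIM_IDLE_MS = 30_000   # 30-second idle reclaim window


def ensure_group(client: Any) -> None:
    """Create the consumer group (and the stream) if they do not exist yet."""
    try:
        client.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
    except Exception as exc:
        # Redis answers BUSYGROUP when the group already exists; that is fine.
        if "BUSYGROUP" not in str(exc):
            raise


def poll_once(client: Any, consumer: str, block_ms: int = 5_000) -> list:
    """Return entries to process: stale pending ones first, then new ones."""
    # Retry path: take over entries that have sat un-acked past the idle window.
    claimed = client.xautoclaim(
        STREAM, GROUP, consumer,
        min_idle_time=RECLAIM_IDLE_MS, start_id="0-0", count=10,
    )[1]
    if claimed:
        return list(claimed)
    # Blocking read: ">" asks for entries never delivered to this group.
    batches = client.xreadgroup(GROUP, consumer, {STREAM: ">"}, count=1, block=block_ms)
    return [entry for _stream, entries in (batches or []) for entry in entries]
```

A worker would call `ensure_group` once at startup, then loop over `poll_once` and hand each returned entry to the entry processor, which decides whether to ack.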

6. Next up: Workflows

You’ll compare these two stacks with the Workflows version in the next tutorial. The shared workshop workspace is torn down after the session.

Why does a failed worker handler leave the stream entry unacknowledged?

So Postgres can delete the failed review row
So the consumer group can keep the message pending and retry it later
So the web service can resend the original HTTP request
So the agent can skip the same PR next time

Troubleshooting

Find the symptom that matches what you’re seeing, then apply the fix.

The test reports green but ran zero tests. Both worker suites skip themselves when VALKEY_URL is unset, and the exit code is still 0. Confirm the run actually executed: you want 3 passing tests (acks on success, leaves un-acked on failure, redelivers a pending entry). A green run with 0 tests or skipped means VALKEY_URL didn’t reach the process. Always pass it inline, as shown in the command.

command not found: redis-server or Cannot connect to the Docker daemon. Install Redis (brew install redis && redis-server) or run docker run --rm -p 6379:6379 redis with Docker Desktop running. Verify with redis-cli ping returning PONG before running the test, and keep that terminal open.

Address already in use on port 6379. Something is already on the default port. You don’t need a second instance: confirm it with redis-cli ping and point the test at it. To clear a stray one: docker rm -f <id> or brew services stop redis.

The test passes but real use fails with connection refused. The test only skips on a missing env var, not a missing server. ECONNREFUSED 127.0.0.1:6379 means no Redis/Valkey is actually running on that port.

Scaling to 3 workers shows no throughput change. With the mock model each review finishes in milliseconds, so one worker never visibly bottlenecks. Submit a burst (5+ reviews at once) so the queue backs up, wait for all three instances to go live, then watch the Status and Run time (s) columns. The contrast is clearest with a real model, where each review takes seconds.

The restart-durability demo finishes before you can restart. A mock review is near-instant, so use a real model or a larger PR. Confirm a worker picked up the job in its logs first, then restart <your-username>-queue-agents-web (the web service, not the worker). The completed row lands in Postgres independent of the web lifecycle.

A failing job retries on a ~30s delay, not instantly. On failure the entry stays pending and is redelivered only after a 30-second idle reclaim window. The focused test forces immediate retry, so it looks instant there but waits 30s in the deployed worker. A worker log looping on the same entry id every ~30s is a failing handler being reclaimed, not a hang.

command not found: pytest. pytest lives in the workspace .venv, not on your PATH. Keep the uv run prefix shown in the command above; bare pytest won’t resolve.

What you learned

Traced the entry processor: successful jobs are acked and failed jobs stay pending for retry
Ran the focused worker queue test against local Redis or Valkey to prove the ack boundary
Scaled the worker with `numInstances` and watched throughput rise across several review submissions
Restarted the web service mid-review and confirmed the work survived in Postgres
Set up the contrast for Pattern 3, where Render Workflows replaces that coordination code