Go further with your review task — Localhost Part 2: Run AI agents as Render Workflows

You have a working, deployed task. These challenges are optional and take-home: each adds a real agentic capability on top of your authored review task, and each reinforces the same lesson: the substrate makes hard things declarative. Pick one, or work through all three.

Production hardening map

The deployed reviewer is a working slice, not the end state for a production agent. Four next layers plug into this codebase without changing the agent itself:

Evals. Score the shared review runner over labeled PRs so quality regressions fail before they ship. The mock model keeps CI runs reproducible, and the decision parser gives you a structured verdict to compare.
Guardrails. Validate input and output: strip prompt-injection from a diff before an agent sees it, and re-ask when the judge’s JSON is malformed. The filter-diff step and decision parser are the hooks.
Circuit breakers. Cap tokens and wall-clock per run, and fail fast when a dependency degrades instead of spending the budget on retries. The shared Budget type and per-task retry config are the starting points.
Observability. Track p95 latency, token spend, and per-agent failure rate. Every agent already runs as a traced task keyed by runId; the next step is aggregation and export.
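
To make the evals layer concrete, here is a minimal sketch of a scoring gate. It assumes a hypothetical review callable that maps a labeled case to a verdict string; LabeledCase, score, gate, and the 0.9 threshold are illustrative names and values, not part of the workshop repo. In the real codebase the callable would wrap the shared review runner and the decision parser.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LabeledCase:
    """One labeled PR: an identifier plus the verdict a human expects."""
    pr_id: str
    expected_verdict: str


def score(cases: list[LabeledCase], review: Callable[[LabeledCase], str]) -> float:
    """Return the fraction of cases where the reviewer matched the label."""
    if not cases:
        raise ValueError("need at least one labeled case")
    hits = sum(1 for case in cases if review(case) == case.expected_verdict)
    return hits / len(cases)


def gate(accuracy: float, threshold: float = 0.9) -> None:
    """Fail the CI run when accuracy drops below the threshold."""
    if accuracy < threshold:
        raise SystemExit(f"eval regression: {accuracy:.2f} < {threshold:.2f}")


# Hypothetical stand-in for the mock-model runner: it always approves.
def always_approve(case: LabeledCase) -> str:
    return "approve"


CASES = [
    LabeledCase("pr-1", "approve"),
    LabeledCase("pr-2", "approve"),
    LabeledCase("pr-3", "request-changes"),
    LabeledCase("pr-4", "approve"),
]

accuracy = score(CASES, always_approve)  # 3 of the 4 labels match
```

Calling gate(accuracy) in CI would exit non-zero here, because an always-approve reviewer misses the one request-changes case.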

1. Add a reflection loop to the judge

Instead of the judge emitting a verdict in one shot, have it critique its own draft and revise before returning. Reflection is one of the highest-leverage agent patterns: a second pass catches the judge’s own over- and under-reactions.

A reflection loop is just calling the same judge task again with its previous output fed back in. On the Pattern 2 queue you’d hand-roll the re-enqueue. As a task it’s a loop.

import { task } from "@renderinc/sdk/workflows";
import { REVIEWERS, judge, parseDecision, type Patch } from "@workshop/agent";
import { storeTracer } from "@workshop/db";

const judgeTask = task(
  { name: "judge", timeoutSeconds: 120 },
  async (input: Record<string, unknown>) => judge.run(input, { tracer: storeTracer() }),
);

// Fan out the reviewers, then pair each result with its agent name.
const results = await Promise.all(
  REVIEWERS.map((agent) =>
    task(
      { name: agent.name },
      async (input: { patches: Patch[] }) => agent.run(input, { tracer: storeTracer() }),
    )({ patches: filtered.patches }),
  ),
);
const findings = REVIEWERS.map((agent, i) => ({ agent: agent.name, note: results[i].text }));

// Draft, then reflect: the judge sees its own prior verdict each pass.
let decision = parseDecision((await judgeTask({ findings })).text);
for (let pass = 0; pass < 2; pass++) {
  const reflected = await judgeTask({
    findings,
    previousVerdict: decision,
    instruction: "Critique your previous verdict and revise if needed. Return the same JSON shape.",
  });
  const next = parseDecision(reflected.text);
  if (next.verdict === decision.verdict && next.reason === decision.reason) break;
  decision = next;
}

import asyncio

from render_sdk import Workflows
from workshop_agent import REVIEWERS, judge, parse_decision
from workshop_agent.types import RunContext
from workshop_db import store_tracer

app = Workflows()

@app.task(name="judge", timeout_seconds=120)
async def judge_task(input: dict) -> dict:
    ctx = RunContext(tracer=store_tracer())
    result = await judge.run(input, ctx)
    return {
        "text": result.text,
        "usage": {
            "input_tokens": result.usage.input_tokens,
            "output_tokens": result.usage.output_tokens,
        },
    }

# Fan out the reviewers, then pair each result with its agent name.
ctx = RunContext(tracer=store_tracer())
results = await asyncio.gather(*[
    agent.run({"patches": patches}, ctx) for agent in REVIEWERS
])
findings = [{"agent": agent.name, "note": results[i].text} for i, agent in enumerate(REVIEWERS)]

# Draft, then reflect: the judge sees its own prior verdict each pass.
decision = parse_decision((await judge_task({"findings": findings}))["text"])
for _ in range(2):
    reflected = await judge_task({
        "findings": findings,
        "previousVerdict": {"verdict": decision.verdict, "reason": decision.reason},
        "instruction": "Critique your previous verdict and revise if needed. Return the same JSON shape.",
    })
    next_decision = parse_decision(reflected["text"])
    if next_decision.verdict == decision.verdict and next_decision.reason == decision.reason:
        break
    decision = next_decision

The judge reads its input as JSON, so the extra previous-verdict and instruction keys just show up in its context. No prompt surgery needed. Watch the trace: you’ll see the judge task invoked multiple times under one authored review run, each its own isolated, retried, traced instance. The loop is your control flow. Durability is still the platform’s.
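
If you want to reason about the loop without deploying anything, the control flow can be isolated from the platform. The sketch below is an assumption-laden stand-in: Decision, reflect, and stub_judge are hypothetical names, and the stub replaces the real judge task with canned answers so you can see when the loop stops.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    verdict: str
    reason: str


async def reflect(judge, findings, max_passes: int = 2):
    """Draft once, then re-ask until the verdict is stable or passes run out.

    Returns the final decision and the total number of judge calls.
    """
    decision = await judge({"findings": findings})
    calls = 1
    for _ in range(max_passes):
        revised = await judge({"findings": findings, "previousVerdict": decision})
        calls += 1
        if revised == decision:  # same verdict and reason: stop early
            break
        decision = revised
    return decision, calls


# Hypothetical stub judge: over-reacts on the draft, then settles.
async def stub_judge(payload: dict) -> Decision:
    if "previousVerdict" not in payload:
        return Decision("request-changes", "possible SQL injection")
    return Decision("approve", "query is parameterized")


final, calls = asyncio.run(reflect(stub_judge, findings=[]))
```

With this stub the judge is called three times: one draft, one revision that changes the verdict, and one confirming pass that matches and ends the loop. The cap on passes is what keeps a judge that never settles from looping forever.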

2. Wire in an MCP tool

Give a reviewer a tool backed by an external MCP server, the same way you’d plug in a real capability (web fetch, a vuln database, your own internal service). Tools and MCP sources come from the shared registry, so wiring one makes it available to all three patterns at once.

Add the optional dependency, then drop a source file in the shared tools directory:

npm install @modelcontextprotocol/sdk --workspace @workshop/agent

import { defineMcpSource } from "./tool.js";

// Stdio transport: the registry spawns this command and connects over stdio.
// Tools are auto-namespaced as `docs__<toolName>`.
export default defineMcpSource({
  id: "docs",
  command: "npx",
  args: ["-y", "@modelcontextprotocol/server-filesystem", process.cwd()],
});

uv add mcp --package workshop-agent

from workshop_agent.tools.tool import McpSourceSpec, define_mcp_source

# Stdio transport: the registry spawns this command and connects over stdio.
# Tools are auto-namespaced as `docs__<toolName>`.
source = define_mcp_source(McpSourceSpec(
    id="docs",
    command="npx",
    args=["-y", "@modelcontextprotocol/server-filesystem", "."],
))

MCP tool ids are namespaced as source id plus tool name, for example docs__read_file. Reference them by that name in an agent’s tools array. Add docs__read_file to the security reviewer’s tools, then run a reviewer task. The trace shows a tool span for the MCP call nested under the agent’s LLM turns. The shared tool resolver connects the source on agent.run() and tears the connection down afterward. You wrote none of that lifecycle.
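
The naming convention itself is simple enough to sketch. The two helpers below are hypothetical and only illustrate the source id plus tool name scheme described above; the registry's actual implementation is not shown in this workshop text and may differ.

```python
SEPARATOR = "__"


def namespaced_id(source_id: str, tool_name: str) -> str:
    """Build a tool id such as docs__read_file from its two parts."""
    if SEPARATOR in source_id:
        raise ValueError("source id must not contain the separator")
    return f"{source_id}{SEPARATOR}{tool_name}"


def split_id(tool_id: str) -> tuple[str, str]:
    """Split a namespaced tool id at the first separator."""
    source_id, sep, tool_name = tool_id.partition(SEPARATOR)
    if not sep or not source_id or not tool_name:
        raise ValueError(f"not a namespaced tool id: {tool_id!r}")
    return source_id, tool_name


tool_id = namespaced_id("docs", "read_file")
```

Splitting at the first separator is why the sketch rejects a source id that contains a double underscore: the id would no longer round-trip.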

3. Add a human-in-the-loop gate

Before a request-changes verdict (or any block finding) is acted on, require a human to approve or override it. This is the pattern behind “the agent proposes, a person disposes.”

The hard part of a human gate is the wait: work has to survive an arbitrarily long pause. In Pattern 1 that pause dies with the request. In Pattern 2 you’d own the parked-job bookkeeping. With tasks, you split the workflow at the decision boundary and let the platform hold state between the halves.

// Phase 1 - propose. A deterministic gate decides if a human is needed.
const decision = parseDecision((await judgeTask({ findings })).text);
const needsHuman =
  decision.verdict === "request-changes" ||
  decision.findings.some((f) => f.severity === "block");

if (needsHuman) {
  await savePendingApproval(input.url, decision); // park it durably
  return { status: "awaiting-approval", verdict: decision.verdict };
}
return { status: "auto-approved", verdict: decision.verdict };

// Phase 2 - a separate task (or webhook) the human triggers to resolve it:
//   resolveApproval(url, "approve" | "reject") -> finalize and act.

# Phase 1 - propose. A deterministic gate decides if a human is needed.
decision = parse_decision((await judge_task({"findings": findings}))["text"])
needs_human = (
    decision.verdict == "request-changes"
    or any(f.get("severity") == "block" for f in decision.findings)
)

if needs_human:
    await save_pending_approval(url, decision)  # park it durably
    return {"status": "awaiting-approval", "verdict": decision.verdict}
return {"status": "auto-approved", "verdict": decision.verdict}

# Phase 2 - a separate task (or webhook) the human triggers to resolve it:
#   resolve_approval(url, "approve" | "reject") -> finalize and act.

The two helpers above are yours to write. Here’s the minimal version that completes the challenge; swap the in-memory store for Postgres (through the shared database package) or any key-value store when you want it to survive a restart:

// Minimal store. Swap the Map for a pending_approvals table for real durability.
const pendingApprovals = new Map<string, unknown>();

async function savePendingApproval(url: string, decision: unknown): Promise<void> {
  pendingApprovals.set(url, decision);
}

async function resolveApproval(url: string, action: "approve" | "reject"): Promise<unknown> {
  const decision = pendingApprovals.get(url);
  pendingApprovals.delete(url);
  return { url, action, decision };
}

# Minimal store. Swap the dict for a pending_approvals table for real durability.
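# The parsed decision is an object with attributes, not a dict, so it is typed loosely here.
_pending_approvals: dict[str, object] = {}

async def save_pending_approval(url: str, decision: object) -> None:
    _pending_approvals[url] = decision

async def resolve_approval(url: str, action: str) -> dict:
    # pop() returns None when nothing is parked for this URL.
    decision = _pending_approvals.pop(url, None)
    return {"url": url, "action": action, "decision": decision}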

Trigger a PR that earns request-changes and the run returns awaiting-approval instead of finalizing. Resolve it from the second task and watch both halves show up as separate runs in the trace, linked by the PR URL.

Where this leaves you

You added durable, retried, isolated, traced execution by writing plain functions and config objects. In the worker pattern, that same set of guarantees took a queue, a consumer group, acks, retries, and a pub/sub bus, all code you had to own and debug. That is the whole arc of the workshop: the agent never changed. The substrate did the work.

A human-in-the-loop gate parks a `request-changes` verdict and waits for a person to approve it. Why is that wait cheap on Workflows but expensive on the Pattern 2 queue?

Workflows poll the database faster than the worker can
Durable execution holds the run's state across an arbitrarily long pause, so you split the workflow at the decision boundary instead of owning parked-job bookkeeping
The queue can't store a verdict, so approvals are impossible in Pattern 2
Workflows skip the judge step whenever a human is involved

Troubleshooting

Find the symptom that matches what you’re seeing, then apply the fix. These are optional take-home challenges, and the snippets are excerpts, not paste-and-run files.

savePendingApproval is not defined (or save_pending_approval). These helpers are not shipped in the repo. The human-gate challenge expects you to write them, backed by Postgres through the shared database package or any key-value store. A minimal version is a pending_approvals table plus a function that upserts the verdict by PR URL, and a second function to resolve it.
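
As a rough picture of that table-backed version, the sketch below uses Python's built-in sqlite3 with an in-memory database as a stand-in for Postgres, so it runs without any setup. That swap is an assumption for illustration: the shared database package's API is not shown here, and these helpers are synchronous where the challenge's are async. The upsert uses the same ON CONFLICT shape you would write in Postgres.

```python
import json
import sqlite3

# Assumption: an in-memory SQLite database stands in for Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pending_approvals (url TEXT PRIMARY KEY, decision TEXT NOT NULL)"
)


def save_pending_approval(url: str, decision: dict) -> None:
    """Upsert the verdict by PR URL so a re-run overwrites the parked row."""
    conn.execute(
        "INSERT INTO pending_approvals (url, decision) VALUES (?, ?) "
        "ON CONFLICT(url) DO UPDATE SET decision = excluded.decision",
        (url, json.dumps(decision)),
    )
    conn.commit()


def resolve_approval(url: str, action: str) -> dict:
    """Remove the parked row and return it together with the human's action."""
    if action not in ("approve", "reject"):
        raise ValueError(f"unknown action: {action!r}")
    row = conn.execute(
        "SELECT decision FROM pending_approvals WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no pending approval for {url}")
    conn.execute("DELETE FROM pending_approvals WHERE url = ?", (url,))
    conn.commit()
    return {"url": url, "action": action, "decision": json.loads(row[0])}


# Hypothetical PR URL; the second save overwrites the first.
PR_URL = "https://example.com/pr/1"
save_pending_approval(PR_URL, {"verdict": "request-changes", "reason": "draft"})
save_pending_approval(PR_URL, {"verdict": "request-changes", "reason": "final"})
resolved = resolve_approval(PR_URL, "approve")
```

Raising on a missing row, instead of returning an empty decision, makes a double-resolve or a typo in the URL visible to the person approving.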

The snippets won’t compile when pasted at the top of the file. They reference patches, filtered, findings, and input.url, which only exist inside the existing task body, and they repeat imports the file already has. Splice the logic inside the existing your-review task body and merge the new import names into the existing import line rather than re-declaring task / app.

The gate never fires; every run is auto-approved. The mock judge always returns approve and never emits severity: block, so needsHuman is always false. To exercise the gate, set a real key and review a PR with real problems, or temporarily force the verdict. The top-of-page “exercises every path” note holds for the happy path, not the request-changes branch.

f.severity is always undefined. Findings are untyped passthrough from the model; parseDecision doesn’t enforce a schema, so severity is present only if the model emitted it. For a robust gate, key on decision.verdict (always normalized) rather than per-finding severity.
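
One way to write that more tolerant gate is sketched below. It assumes the decision has been converted to a plain dict (the needs_human helper and that shape are illustrative, not the repo's types). The verdict decides first, and severity is treated as a best-effort extra that may be missing or malformed.

```python
def needs_human(decision: dict) -> bool:
    """Gate on the normalized verdict first; treat severity as best-effort."""
    if decision.get("verdict") == "request-changes":
        return True
    findings = decision.get("findings") or []
    # Findings are untyped passthrough, so skip anything that is not a dict.
    return any(
        isinstance(finding, dict) and finding.get("severity") == "block"
        for finding in findings
    )


# A finding without a severity key does not crash the gate or trip it.
quiet = needs_human({"verdict": "approve", "findings": [{"note": "nit"}]})
blocked = needs_human({"verdict": "approve", "findings": [{"severity": "block"}]})
```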

The MCP tool hangs or fails with npx: not found. The filesystem MCP server is a Node package launched via npx, even on the Python track, and npx -y downloads it from npm on first use. You need Node installed and network access; on a flaky venue network the first launch can hang silently for several minutes. Warm the cache during setup (npx -y @modelcontextprotocol/server-filesystem --help), or skip this challenge offline. If the tool returns “access denied,” pass an explicit absolute directory as the server’s root instead of process.cwd() (TypeScript) or "." (Python).
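
A small preflight can surface both problems before the agent runs. This sketch uses only the standard library; npx_available and filesystem_server_args are hypothetical helper names, and the args list simply mirrors the one in the source file above, with the root resolved to an absolute path.

```python
import shutil
from pathlib import Path


def npx_available(which=shutil.which) -> bool:
    """Return True when an npx executable is found on PATH."""
    return which("npx") is not None


def filesystem_server_args(root: str = ".") -> list[str]:
    """Build the server args with an explicit absolute root directory."""
    return ["-y", "@modelcontextprotocol/server-filesystem", str(Path(root).resolve())]


args = filesystem_server_args(".")
```

Checking npx_available() during setup turns a confusing spawn failure into a clear message, though it cannot tell you whether the npm download will succeed on the venue network.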

The fan-out only runs security and performance, never UX. REVIEWERS is just [security, performance]; the UX reviewer is added conditionally. Use selectReviewers(patches) if you want the frontend reviewer included.

What you learned

Added a judge reflection loop: the same judge task called again with its prior verdict fed back in
Wired an MCP tool through the shared registry so all three patterns get it at once
Gated a `request-changes` verdict behind a human, using durable execution to hold the wait
Confirmed the arc: the agent never changed, the substrate carried the new guarantees