Chaos drill (break a shard, prove recovery) — ETL on Workflows, Part 2: Productionize and scale it

In this step you’ll deliberately fail shard 3 on its first attempt, trigger a run, and confirm two things from the output: every customer made it through exactly once, and the Render Dashboard shows the retry that healed it.

Inject the failure

Before

  @app.task(retry=Retry(max_retries=3, wait_duration_ms=2000))
- def process_shard(shard_id: int) -> dict:
-   log(event="shard_start", shard_id=shard_id)
    profiles = build_profiles_for_shard(shard_id)
-   log(event="shard_end", shard_id=shard_id, n=len(profiles))
    return {"shard_id": shard_id, "profiles": profiles, "count": len(profiles)}

After

  @app.task(retry=Retry(max_retries=3, wait_duration_ms=2000))
+ def process_shard(shard_id: int, attempt: int = 1) -> dict:
+   # Needs "import os" at the top of the module for os.environ.
+   log(event="shard_start", shard_id=shard_id, attempt=attempt)
+   if os.environ.get("CHAOS_FAIL_SHARD") == str(shard_id) and attempt == 1:
+     raise RuntimeError(f"chaos: failing shard {shard_id} on first attempt")
    profiles = build_profiles_for_shard(shard_id)
+   log(event="shard_end", shard_id=shard_id, attempt=attempt, n=len(profiles))
    return {"shard_id": shard_id, "profiles": profiles, "count": len(profiles)}

Before

  const processShard = task(
  { name: "process_shard", retry: { maxRetries: 3, waitDurationMs: 2000 } },
  function processShard(shardId: number, attempt = 1) {
    log({ event: "shard_start", shard_id: shardId, attempt });
    const result = buildProfilesForShard(shardId);
    log({ event: "shard_end", shard_id: shardId, attempt, n: result.count });
    return result;
  }
  );

After

  const processShard = task(
  { name: "process_shard", retry: { maxRetries: 3, waitDurationMs: 2000 } },
  function processShard(shardId: number, attempt = 1) {
    log({ event: "shard_start", shard_id: shardId, attempt });
+   if (process.env.CHAOS_FAIL_SHARD === String(shardId) && attempt === 1) {
+     throw new Error(`chaos: failing shard ${shardId} on first attempt`);
+   }
    const result = buildProfilesForShard(shardId);
    log({ event: "shard_end", shard_id: shardId, attempt, n: result.count });
    return result;
  }
  );

The attempt value is the only moving part. Use the retry-attempt field exposed by your installed SDK. If your SDK exposes it through task context instead of an argument, read it there and keep the same condition: fail shard 3 on attempt 1, then let the retry continue normally. Do not key this only on CHAOS_FAIL_SHARD, or every retry will fail too.
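The gate condition can be pulled out into a small pure function so it is easy to check without deploying anything. This is a minimal sketch, assuming the attempt number is a 1-based integer as in the task above; should_inject_failure is a hypothetical helper name, not part of any SDK.

```python
import os
from typing import Mapping, Optional


def should_inject_failure(
    shard_id: int,
    attempt: int,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Return True only for the targeted shard on its first attempt.

    Hypothetical helper: `attempt` is assumed to be 1-based, matching the
    `attempt == 1` condition used in process_shard.
    """
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    target = env.get("CHAOS_FAIL_SHARD")
    # An unset or empty variable disables the drill entirely.
    if not target:
        return False
    # Both conditions are required: keying on the env var alone would make
    # every retry fail as well.
    return target == str(shard_id) and attempt == 1
```

With CHAOS_FAIL_SHARD set to 3, this returns True for shard 3 on attempt 1 and False for shard 3 on attempt 2 or for any other shard, which is the behavior the drill depends on.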

Trigger the drill

Terminal

$ git add -A && git commit -m 'add chaos gate' && git push
$ # wait for the deploy to finish, then:
$ CHAOS_FAIL_SHARD=3 python trigger.py
Run started: <run-id>

Before you trigger the run, set CHAOS_FAIL_SHARD=3 on the Workflow service, not just in your local shell. In the Render Dashboard, open the Workflow service, go to Environment, add the env var, save, and redeploy. The tasks run on Render, so a local-only env var will not reach the remote process_shard subtasks and no failure will be injected.

Watch the retry in the Render Dashboard

Open Runs, find the run id from your terminal, and expand the parent task. Shard 3 should show one failed attempt followed by a successful retry. The other process_shard rows should finish green on their first attempt. Click into shard 3 and read the JSON logs from step 5: you should see shard_start for attempt 1, the chaos error, then shard_start and shard_end for the retry.

Hint

Read the retry timeline left to right. The failed row shows attempt 1. The next row for the same shard shows attempt 2. The gap between them should roughly match the retry wait plus scheduling time.
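If you copy the logs out of the Dashboard, the same check can be scripted. The sketch below assumes the logs are available as text with one JSON object per line and that they carry the event, shard_id, and attempt fields logged by process_shard; attempts_by_shard and retried_shards are hypothetical helper names.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def attempts_by_shard(log_lines: Iterable[str]) -> Dict[int, List[int]]:
    """Map each shard_id to the sorted attempts that logged shard_start."""
    seen = defaultdict(set)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as the chaos traceback
        if not isinstance(record, dict):
            continue
        if record.get("event") == "shard_start":
            # Assumption: a missing attempt field means the first attempt.
            seen[record["shard_id"]].add(record.get("attempt", 1))
    return {shard: sorted(attempts) for shard, attempts in seen.items()}


def retried_shards(log_lines: Iterable[str]) -> List[int]:
    """Return the shard ids that started more than once."""
    summary = attempts_by_shard(log_lines)
    return sorted(shard for shard, attempts in summary.items() if len(attempts) > 1)
```

For a successful drill, retried_shards should return only shard 3, and attempts_by_shard should list attempts 1 and 2 for it and a single attempt for every other shard.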

Verify correctness

The whole point of the drill is to prove the retry didn’t break the output. Two checks:

Terminal

$ # 1. Row count matches input:
$ wc -l merged_output.csv sample_data/crm.csv
  1001 merged_output.csv
  1001 sample_data/crm.csv
  2002 total
$ # 2. No duplicate customer_id in the output:
$ tail -n +2 merged_output.csv | cut -d, -f1 | sort | uniq -d

The two per-file wc -l values should match: 1 header row plus 1,000 customers in each file. The final total line is just their sum and can be ignored. The duplicate check skips the header, takes the first column (customer_id), and should print nothing. If it prints any customer_id, the retry produced a duplicate output row, and the idempotency key or aggregator de-dupe from step 5 is not doing its job.
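The same two checks can be run from Python, which also copes with quoted fields that contain commas or newlines, where wc and cut would miscount. This is a minimal sketch, assuming both files have exactly one header row and that customer_id is the first column of the output, as the cut command above implies; data_rows and verify are hypothetical helper names.

```python
import csv
from collections import Counter
from typing import List, Tuple


def data_rows(path: str) -> List[List[str]]:
    """Read a CSV file and return its non-empty rows, without the header."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # drop the header row
        return [row for row in reader if row]


def verify(output_path: str, input_path: str) -> Tuple[bool, List[str]]:
    """Return (row counts match, sorted list of duplicated customer ids)."""
    output_rows = data_rows(output_path)
    input_rows = data_rows(input_path)
    # Assumption: customer_id is the first column of the output file.
    ids = [row[0] for row in output_rows]
    duplicates = sorted(cid for cid, n in Counter(ids).items() if n > 1)
    return len(output_rows) == len(input_rows), duplicates
```

Calling verify("merged_output.csv", "sample_data/crm.csv") after a clean drill should return True and an empty list. Any id in the list points to a shard whose retry wrote its rows twice.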

What you learned

Gating chaos on an env var keeps the failure controllable and reversible
A first-attempt-only failure plus a retry policy is the simplest reproducible drill
The Runs tab shows the retry timeline; per-shard JSON logs explain what each attempt did
Row count + duplicate check is enough to prove the pipeline is exactly-once at the output layer