What you'll build — ETL on Workflows, Part 2: Productionize and scale it

This is ETL on Workflows, Part 2. If you haven’t built a sharded ETL on Workflows yet, do Part 1 first.

Part 1 teaches the fan-out design: hash routing, shard workers, and an orchestrator. This tutorial assumes that shape already exists. Here you add the production layer: retries, idempotency, structured logs, failure drills, and measured scale-up.

By the end of this tutorial you’ll have a Render Workflow that merges customer data from four CSV sources into enriched profiles, recovers gracefully when one shard fails, and scales to 1M+ records on a single SDK call.

You’ll start from render-examples/data-processor-workflow, a working customer-data merge pipeline shipped in both Python and TypeScript (the same pipeline Part 1 walks you through building). The tutorial’s job is to take that working code and harden it for production.

Most ETL pipelines start as a single script that loops through a dataset, then grow from there. The single-script approach works until the dataset doesn’t fit in one process, one transient failure kills a multi-hour run, or you can’t tell which records actually made it through. Render Workflows fixes the structural problems: each task runs in its own instance, fan-out is one SDK call, and retries are built in. You still have to make the pipeline safe to re-run. That’s what “production” means here: sharding for parallelism, idempotency for correctness, retries for resilience, and per-shard observability so you can prove all three.

Before you start

You’ll need:

A Render account with Workflows enabled.
The Render CLI 2.11.0+ for render workflows dev and the deploy in step 4.
Python 3.11+ or Node.js 20+, depending on the language you pick.
A GitHub account for the fork-and-deploy step.
Part 1, or a working sharded pipeline you can harden. The chaos drill and benchmark assume you already understand the fan-out shape.

Skim Workflows limits before the scale-up in step 7 so the per-task and per-run ceilings don’t surprise you.

What “production” means here

You’ll add five things on top of the reference repo’s demo:

Retries with exponential backoff on the shard task, so a flaky upstream or a transient network blip recovers on its own.
Idempotent shard merges. Re-running a shard produces byte-identical output, every time.
Structured per-shard logs with timing and record counts. Readable in the Render Dashboard, greppable in your terminal.
A chaos drill where you deliberately fail one shard, watch the retry, and verify the final output has no duplicates and no missing records.
A benchmarked scale-up. Same code, more shards or a bigger instance plan, with your own before/after numbers.
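
To make the first and third items concrete, here is a minimal plain-Python sketch of what a retry policy with exponential backoff and a structured per-shard log line do. It does not use the Render SDK, and it is not how the reference repo configures retries. The names `backoff_delays`, `run_with_retries`, and `log_shard`, the field names in the log line, and the default delays are all assumptions made for illustration.

```python
import json
import time


def backoff_delays(max_retries, base=1.0, factor=2.0, cap=30.0):
    """Seconds to wait before each retry: base, base*factor, ... capped at cap."""
    return [min(cap, base * factor ** i) for i in range(max_retries)]


def log_shard(shard_id, attempt, status, started, **fields):
    """Emit one JSON line per attempt so logs can be filtered by shard_id."""
    line = {
        "event": "shard_attempt",  # hypothetical event name
        "shard_id": shard_id,
        "attempt": attempt,
        "status": status,
        "duration_ms": round((time.perf_counter() - started) * 1000, 2),
        **fields,
    }
    print(json.dumps(line, sort_keys=True))


def run_with_retries(fn, shard_id, max_retries=3, base=1.0, sleep=time.sleep):
    """Call fn(shard_id); on failure, wait with exponential backoff and retry."""
    delays = backoff_delays(max_retries, base=base)
    for attempt in range(max_retries + 1):
        started = time.perf_counter()
        try:
            result = fn(shard_id)
        except Exception as exc:
            log_shard(shard_id, attempt, "error", started, error=str(exc))
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure to the caller
            sleep(delays[attempt])
        else:
            log_shard(shard_id, attempt, "ok", started, records=len(result))
            return result
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. In the tutorial itself the platform handles the retry loop, so only the policy values and the log shape carry over.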

The system at a glance

Every run starts from the SDK client, a small Python or TypeScript script you’ll write in step 3. The client calls merge_customer_data, the orchestrator task. The orchestrator spawns N process_shard subtasks that run in parallel. Each subtask loads the four CSV sources, hashes each customer_id to decide which records belong to its shard, merges its slice across all four sources, and enriches the profiles. The orchestrator aggregates the results and returns them to the caller. The demo repo also ships a frontend and an API service as alternative triggers. You’ll ignore both. The SDK client is enough.
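
The routing step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the reference repo's code: `shard_for` and `process_shard_sketch` are hypothetical names, the choice of SHA-256 is an assumption, and the sources are passed in as already-parsed rows rather than loaded from CSV. A hash from `hashlib` is used because Python's built-in `hash()` for strings is salted per process, which would route the same customer to different shards on different instances.

```python
import hashlib


def shard_for(customer_id: str, num_shards: int) -> int:
    """Map a customer_id to a shard index with a hash that is stable across processes."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def process_shard_sketch(shard_id, num_shards, sources):
    """Merge this shard's slice of every source into one profile per customer.

    `sources` maps a source name to a list of row dicts that each carry
    a "customer_id" key (hypothetical layout for this sketch).
    """
    profiles = {}
    for source_name, rows in sources.items():
        for row in rows:
            cid = row["customer_id"]
            if shard_for(cid, num_shards) != shard_id:
                continue  # this record belongs to another shard
            profile = profiles.setdefault(cid, {"customer_id": cid})
            profile[source_name] = {k: v for k, v in row.items() if k != "customer_id"}
    # Sort by customer_id so a rerun of the same shard yields the same ordering.
    return [profiles[cid] for cid in sorted(profiles)]
```

Because the shard index depends only on `customer_id` and the shard count, every source file sends a given customer to the same shard. The sorted return value is one ingredient of reproducible output. Serialization would also need to be deterministic for reruns to be byte-identical.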

Why shard at all?

A single-process loop over 1M customers across four source files takes 30 seconds or more of wall-clock time, and only if nothing fails halfway through. The same workload split across 10 shards runs in 3 to 5 seconds because each shard executes on its own instance, in parallel, with its own memory and CPU. The cost stays the same (you pay per task-second, not per shard). The failure radius shrinks too: if one shard hits a bad row, only that shard retries. The other nine keep their results. Fan-out gives you throughput and isolation at once.
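
The isolation property can be simulated locally. The sketch below is an assumption-laden stand-in: threads from `concurrent.futures` play the role of separate task instances, and `fan_out` is a hypothetical helper, not the orchestrator from the reference repo. It shows only the bookkeeping: shards that succeed keep their results, and only the shard that raised is resubmitted.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(process_shard, num_shards, max_attempts=2):
    """Run every shard in parallel; resubmit only the shards that raised."""
    results = {}
    attempts = {i: 0 for i in range(num_shards)}
    pending = list(range(num_shards))
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        while pending:
            futures = {i: pool.submit(process_shard, i) for i in pending}
            failed = []
            for i, future in futures.items():
                attempts[i] += 1
                try:
                    results[i] = future.result()
                except Exception:
                    if attempts[i] >= max_attempts:
                        raise  # this shard is out of attempts
                    failed.append(i)  # retry this shard only
            pending = failed
    return results, attempts
```

If shard 3 fails once and then succeeds, `attempts` ends with 2 for shard 3 and 1 for every other shard, which is the behavior the paragraph above describes.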

You're running the 10-shard merge against 1M customers. Shard 3 hits a bad row and the task errors. What happens to the other nine shards' results?

The whole workflow aborts and no merged output is returned
The other nine keep their results; only shard 3 retries
All ten shards retry together to keep the output consistent
The workflow returns partial output and marks the run as failed

Roadmap

1. What you’ll build. This page.
2. Tour the repo and run one shard locally. Clone the reference repo, generate a small sample dataset, and watch the tasks register on the local dev server.
3. Understand the fan-out pattern. Read the orchestrator and shard tasks, then trigger your first end-to-end run from a tiny SDK client script.
4. Deploy the workflow to Render. Push your fork, create the Workflow service in the Render Dashboard, and trigger it remotely from the same script.
5. Harden the tasks. Add retry policies, make each shard idempotent, and emit structured per-shard timing logs.
6. Chaos drill. Deliberately fail one shard, watch the retry timeline in the Render Dashboard, and verify the output is exactly-once.
7. Scale up and benchmark. Regenerate at 1M rows, bump shard count and instance plan, record your own before/after numbers.
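
The "exactly-once" check in the chaos drill comes down to comparing the customer IDs that went in with the ones that came out. A minimal sketch of that comparison follows, assuming the merged output is a list of dicts with a `customer_id` key. The function name `verify_exactly_once` and the report shape are hypothetical, not part of the reference repo.

```python
from collections import Counter


def verify_exactly_once(expected_ids, merged_profiles):
    """Compare customer_ids in the merged output with the ids in the input."""
    counts = Counter(profile["customer_id"] for profile in merged_profiles)
    expected = set(expected_ids)
    produced = set(counts)
    return {
        "duplicates": sorted(cid for cid, n in counts.items() if n > 1),
        "missing": sorted(expected - produced),
        "unexpected": sorted(produced - expected),
    }
```

A run passes the drill when all three lists are empty, for example `not any(report.values())`.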

What you learned

You'll harden an existing Render Workflows ETL, not build one from scratch
Hash-based sharding keeps each customer's records on a single shard across all source files
The production layer is retries, idempotency, structured logs, a chaos drill, and a benchmarked scale-up
Triggers come from a small SDK client script. No frontend or API service in this tutorial.