This is ETL on Workflows, Part 1. The series takes you from a blank `render workflows init` to a hardened, benchmarked ETL pipeline on Render.
Part 1 builds the sharded pipeline: hash routing, fan-out, shard workers, orchestration, aggregation, and a deployed run. Part 2 takes that same pipeline and makes it production-ready with retries, idempotency, structured logs, a chaos drill, and a scale benchmark.
By the end of this tutorial you’ll have a Render Workflow that loads four customer-data CSVs (CRM, Billing, Product, Support), shards records across N parallel subtasks, merges and enriches each customer’s profile, and runs end-to-end against a deployed service from a small SDK client script.
The finished thing is render-examples/data-processor-workflow. You’ll arrive at roughly the same code shape on your own, learning the design decisions as you go.
Before you start
You’ll need:
- A Render account with Workflows enabled (for the deploy step at the end).
- The Render CLI 2.11.0+ for `render workflows init` and `render workflows dev`.
- Python 3.11+ or Node.js 20+, depending on the language tabs you pick.
- Completed the Render Workflows quickstart, or comfort with the SDK basics. The tutorial assumes you know what a task is and how `run_task` works.
- Comfort in a terminal and with a Python or Node project.
The Workflows limits page is worth skimming if you plan to push this to bigger datasets later.
What “sharded” means here
A naive ETL loops through 1M customers in one process. That works until the dataset doesn’t fit in memory, one transient failure kills the whole run, or the loop takes longer than your job runner allows. The fan-out pattern fixes all three: hash each `customer_id` into one of N shards, run N tasks in parallel on their own instances, and aggregate the results. The pipeline does the work of N shards in roughly the time of one.
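As a rough illustration of the routing step, here is a minimal sketch of a deterministic shard function. The name `shard_for` and the choice of SHA-256 are assumptions for this example, not necessarily what the reference repo uses. One detail worth knowing: Python’s built-in `hash()` is salted per process for strings, so it can’t be relied on when shards run in separate processes.

```python
import hashlib


def shard_for(customer_id: str, num_shards: int) -> int:
    """Map a customer ID to a shard index in [0, num_shards).

    Hypothetical helper: uses SHA-256 so the result is identical in every
    process, unlike the built-in hash(), which is salted per process for str.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # Take the first 8 bytes as an unsigned integer, then reduce modulo N.
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for cid in ("cust-001", "cust-002", "cust-003"):
        print(cid, "->", shard_for(cid, 8))
```

Any stable hash would do here. What matters is that the same `customer_id` maps to the same shard no matter which process or source file it is seen in.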
The design choices that make this work are the lessons of this tutorial:
- Hash routing on a stable key so every record for the same customer lands on the same shard, even when the records live in different source files.
- Subtasks own their own I/O so the orchestrator doesn’t have to pass large payloads. Workflow task arguments are JSON, capped at 4 MB. A pre-built shard slice for 1M customers would blow past that.
- The orchestrator owns coordination and aggregation, not data movement. It spawns subtasks and combines their summarized results.
- JSON-serializable inputs and outputs at every task boundary so subtasks can run in their own processes with no shared memory.
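To make the payload argument concrete, the sketch below compares the JSON size of a bare shard ID with an estimated pre-built slice. The record shape, the shard count of 4, and reading the cap as 4 × 1024 × 1024 bytes are all assumptions for illustration; real records will differ in size.

```python
import json

ARG_CAP_BYTES = 4 * 1024 * 1024  # assumed reading of the 4 MB argument cap
TOTAL_CUSTOMERS = 1_000_000
NUM_SHARDS = 4  # hypothetical shard count


def json_bytes(value) -> int:
    """Size in bytes of a value once serialized as UTF-8 JSON."""
    return len(json.dumps(value).encode("utf-8"))


# Synthetic records with an invented shape, used only to estimate size.
sample = [
    {"customer_id": f"C{i:07d}", "name": f"Customer {i}", "plan": "pro", "open_tickets": 3}
    for i in range(1000)
]
avg_record_bytes = json_bytes(sample) / len(sample)

# Extrapolate from the sample to one shard's share of 1M customers.
slice_bytes = avg_record_bytes * TOTAL_CUSTOMERS / NUM_SHARDS

print("shard_id argument fits:", json_bytes(3) <= ARG_CAP_BYTES)
print("pre-built slice fits:", slice_bytes <= ARG_CAP_BYTES)
```

Under these assumptions the integer argument is a single byte, while the slice lands well above the cap, which is why each subtask loads its own data.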
The system at a glance
The orchestrator never touches a CSV. It hands each `process_shard` subtask just a `shard_id` (a single integer). Each subtask reads the four CSVs from disk, filters to its shard’s customers (using `hash(customer_id) % N == shard_id`), merges across sources, and returns the enriched profiles for that shard. The orchestrator aggregates the returns into a stats summary.
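The filter-then-merge step inside a shard worker might look like the following sketch. It is plain Python with no Workflows SDK calls. The two in-memory CSV strings stand in for files on disk, and their columns, the `SOURCES` dict, `NUM_SHARDS`, and the `shard_for` helper are all invented for illustration.

```python
import csv
import hashlib
import io
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical; every worker must agree on this value

# In-memory stand-ins for two of the four source files (made-up data).
SOURCES = {
    "crm": "customer_id,name\nC1,Ada\nC2,Grace\nC3,Linus\n",
    "billing": "customer_id,plan\nC1,pro\nC3,free\n",
}


def shard_for(customer_id: str, num_shards: int) -> int:
    """Deterministic routing (assumed helper; SHA-256 is one possible choice)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def process_shard(shard_id: int) -> list[dict]:
    """Read every source, keep only this shard's rows, merge by customer_id."""
    profiles: dict[str, dict] = defaultdict(dict)
    for source_name, text in SOURCES.items():
        for row in csv.DictReader(io.StringIO(text)):
            cid = row["customer_id"]
            if shard_for(cid, NUM_SHARDS) != shard_id:
                continue  # this record belongs to another shard
            profiles[cid]["customer_id"] = cid
            # Nest each source's fields under the source name.
            profiles[cid][source_name] = {
                k: v for k, v in row.items() if k != "customer_id"
            }
    return [profiles[cid] for cid in sorted(profiles)]


if __name__ == "__main__":
    for shard_id in range(NUM_SHARDS):
        print(shard_id, process_shard(shard_id))
```

Because routing depends only on `customer_id`, a customer’s CRM row and billing row always meet in the same shard, and every customer appears in exactly one shard’s output.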
Roadmap
- What you’ll build. This page.
- Scaffold with `render workflows init` and strip the examples. Get a clean project that registers zero tasks, ready to receive your own.
- Drop in sample data. Pull the data generator from the reference repo and generate four CSVs of 1K rows each.
- Write the sharding helper. A `sharding` module with one function: deterministic hash routing.
- Write the shard worker. `process_shard` takes a `shard_id`, loads all four CSVs, filters to its shard’s customers, merges across sources, enriches, and returns the profiles.
- Write the orchestrator. `merge_customer_data` spawns N `process_shard` subtasks in parallel, awaits all of them, and aggregates the results into a stats summary.
- Trigger and verify locally. A small SDK client script. Run end-to-end against 1K rows. Inspect the aggregated output.
- Deploy to Render. Push to GitHub, create the Workflow service, trigger the deployed pipeline from the same script.
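To preview the shape of the orchestrator step, here is a sketch of the fan-out and aggregation using plain `asyncio` as a stand-in for the Workflows SDK. It does not use Render’s API. The `process_shard` body returns fabricated placeholder profiles, and the stats fields are assumptions chosen for the example.

```python
import asyncio

NUM_SHARDS = 4  # hypothetical shard count


async def process_shard(shard_id: int) -> list[dict]:
    """Stand-in shard worker returning placeholder, JSON-serializable profiles."""
    await asyncio.sleep(0)  # placeholder for real I/O and compute
    return [
        {"customer_id": f"C{shard_id}-{i}", "has_billing": i % 2 == 0}
        for i in range(3)
    ]


async def merge_customer_data(num_shards: int = NUM_SHARDS) -> dict:
    """Spawn one worker per shard, await them all, and summarize the results."""
    results = await asyncio.gather(
        *(process_shard(shard_id) for shard_id in range(num_shards))
    )
    profiles = [profile for shard in results for profile in shard]
    return {
        "shards": len(results),
        "total_customers": len(profiles),
        "with_billing": sum(1 for p in profiles if p["has_billing"]),
    }


if __name__ == "__main__":
    print(asyncio.run(merge_customer_data()))
```

The orchestrator passes only an integer to each worker and receives only JSON-friendly values back, which mirrors the task-boundary rule described earlier.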
When you finish, Part 2 picks up the same code and changes the question from “How do I build fan-out ETL?” to “How do I make it safe to run in production?”
What you learned
- This tutorial has you design and build a sharded customer-data merge pipeline from scratch on Render Workflows.
- Hash routing on `customer_id` is what makes the per-shard merge correct across four source files.
- Subtasks own data loading and per-shard compute. The orchestrator owns coordination and aggregation.
- Part 2 of the series productionizes what you build here: retries, idempotency, structured logs, a chaos drill, and a scale benchmark.