What you'll build — ETL on Workflows, Part 1: Build a sharded pipeline

This is ETL on Workflows, Part 1. The series takes you from a blank render workflows init to a hardened, benchmarked ETL pipeline on Render.

Part 1 builds the sharded pipeline: hash routing, fan-out, shard workers, orchestration, aggregation, and a deployed run. Part 2 takes that same pipeline and makes it production-ready with retries, idempotency, structured logs, a chaos drill, and a scale benchmark.

By the end of this tutorial you’ll have a Render Workflow that loads four customer-data CSVs (CRM, Billing, Product, Support), shards records across N parallel subtasks, merges and enriches each customer’s profile, and runs end-to-end against a deployed service from a small SDK client script.

The finished project is render-examples/data-processor-workflow. You’ll arrive at roughly the same code shape on your own, learning the design decisions as you go.

Before you start

You’ll need:

A Render account with Workflows enabled (for the deploy step at the end).
The Render CLI 2.11.0+ for render workflows init and render workflows dev.
Python 3.11+ or Node.js 20+, depending on the language tabs you pick.
Familiarity with the SDK basics, for example from completing the Render Workflows quickstart. The tutorial assumes you know what a task is and how run_task works.
Comfort in a terminal and with a Python or Node project.

The Workflows limits page is worth skimming if you plan to push this to bigger datasets later.

What “sharded” means here

A naive ETL loops through 1M customers in one process. That works until the dataset doesn’t fit in memory, one transient failure kills the whole run, or the loop takes longer than your job runner allows. The fan-out pattern fixes all three: hash each customer_id into one of N shards, run N tasks in parallel on their own instances, then aggregate the results. The pipeline does the work of N shards in roughly the time of one.
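The shape of that pattern can be sketched in plain Python. This is an illustration under stated assumptions, not the Render SDK: asyncio coroutines stand in for Workflow subtasks (in the real pipeline each one runs on its own instance), the names process_shard and merge_customer_data are borrowed from the roadmap further down, and the stub worker routes integer IDs with a plain modulo instead of hashing real customer records.

```python
import asyncio


async def process_shard(shard_id: int, num_shards: int) -> dict:
    """Stand-in for one subtask: count the customers that belong to one shard."""
    # Hypothetical workload: customer IDs 0..99, routed by simple modulo.
    customer_ids = [cid for cid in range(100) if cid % num_shards == shard_id]
    await asyncio.sleep(0.01)  # simulate I/O so the shard tasks overlap
    return {"shard_id": shard_id, "customers": len(customer_ids)}


async def merge_customer_data(num_shards: int) -> dict:
    """Fan out one task per shard, wait for all of them, aggregate the summaries."""
    results = await asyncio.gather(
        *(process_shard(shard_id, num_shards) for shard_id in range(num_shards))
    )
    return {
        "shards": len(results),
        "total_customers": sum(r["customers"] for r in results),
    }


summary = asyncio.run(merge_customer_data(num_shards=4))
print(summary)  # {'shards': 4, 'total_customers': 100}
```

Each shard task receives only small arguments and returns only a small summary, which is what lets the aggregation step stay cheap.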

The design choices that make this work are the lessons of this tutorial:

Hash routing on a stable key so every record for the same customer lands on the same shard, even when the records live in different source files.
Subtasks own their own I/O so the orchestrator doesn’t have to pass large payloads. Workflow task arguments are JSON, capped at 4 MB. A pre-built shard slice for 1M customers would blow past that.
The orchestrator owns coordination and aggregation, not data movement. It spawns subtasks and combines their summarized results.
JSON-serializable inputs and outputs at every task boundary so subtasks can run in their own processes with no shared memory.
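The first of these choices, hash routing on a stable key, fits in a few lines. The following is a minimal sketch rather than the tutorial's actual sharding module: the function name shard_for and the use of SHA-256 are assumptions made for illustration. One detail worth knowing is that Python's built-in hash() is salted per process for strings by default, so separate subtask processes would disagree about which shard owns a customer. A digest from hashlib is deterministic everywhere.

```python
import hashlib


def shard_for(customer_id: str, num_shards: int) -> int:
    """Map a customer_id to a shard index in [0, num_shards).

    Hypothetical helper: the name and the SHA-256 choice are assumptions.
    The digest is used only because it is stable across processes and
    machines, unlike the built-in hash() for strings.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce mod N.
    return int.from_bytes(digest[:8], "big") % num_shards


# Records for the same customer route to the same shard, no matter which
# source file they came from.
crm_row = {"customer_id": "cust-000042", "name": "Ada"}
billing_row = {"customer_id": "cust-000042", "plan": "pro"}
same_shard = shard_for(crm_row["customer_id"], 10) == shard_for(
    billing_row["customer_id"], 10
)
print(same_shard)  # True
```

Changing num_shards reassigns most customers to different shards, so every subtask in a run needs to agree on the same N.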

The system at a glance

The orchestrator never touches a CSV. It hands each process_shard subtask just a shard_id (a single integer). Each subtask reads the four CSVs from disk, filters to its shard’s customers (using hash(customer_id) % N == shard_id), merges across sources, and returns the enriched profiles for that shard. The orchestrator aggregates the returns into a stats summary.
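A compressed version of that flow, as a sketch under assumptions: two in-memory CSV strings with made-up columns stand in for the four files on disk, shard_for is a hypothetical SHA-256 routing helper, and the loop at the bottom calls the worker directly where the real orchestrator would spawn subtasks.

```python
import csv
import hashlib
import io

# Tiny in-memory stand-ins for two of the four CSVs (hypothetical columns).
CRM_CSV = "customer_id,name\nc1,Ada\nc2,Grace\nc3,Linus\n"
BILLING_CSV = "customer_id,plan\nc3,team\nc1,free\nc2,pro\n"


def shard_for(customer_id: str, num_shards: int) -> int:
    """Deterministic routing (assumed helper, not the tutorial's module)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def process_shard(shard_id: int, num_shards: int) -> list[dict]:
    """Load every source, keep only this shard's customers, merge by customer_id."""
    profiles: dict[str, dict] = {}
    for text in (CRM_CSV, BILLING_CSV):
        for row in csv.DictReader(io.StringIO(text)):
            if shard_for(row["customer_id"], num_shards) != shard_id:
                continue  # another shard owns this customer
            profiles.setdefault(row["customer_id"], {}).update(row)
    return list(profiles.values())


# The caller passes only small integers; each worker does its own loading.
all_profiles = [p for shard_id in range(3) for p in process_shard(shard_id, 3)]
for profile in sorted(all_profiles, key=lambda p: p["customer_id"]):
    print(profile)
```

Because both files are filtered with the same routing function, each customer's CRM and billing rows meet in exactly one worker, and no customer appears in two shards.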

You're sharding 1M customers across 10 workers. Why hash by `customer_id` rather than slice the CSV by row index?

Hashing is faster than slicing for large CSVs
Row-index slicing requires sorting first, which is expensive for 1M rows
Hashing guarantees that every record for the same customer lands on the same shard, even across different source files
Render Workflows requires hash-based routing for any task with more than 5 subtasks

Roadmap

What you’ll build. This page.
Scaffold with render workflows init and strip the examples. Get a clean project that registers zero tasks, ready to receive your own.
Drop in sample data. Pull the data generator from the reference repo and generate four CSVs of 1K rows each.
Write the sharding helper. A sharding module with one function: deterministic hash routing.
Write the shard worker. process_shard takes a shard_id, loads all four CSVs, filters to its shard’s customers, merges across sources, enriches, and returns the profiles.
Write the orchestrator. merge_customer_data spawns N process_shard subtasks in parallel, awaits all of them, and aggregates the results into a stats summary.
Trigger and verify locally. A small SDK client script. Run end-to-end against 1K rows. Inspect the aggregated output.
Deploy to Render. Push to GitHub, create the Workflow service, trigger the deployed pipeline from the same script.

When you finish, Part 2 picks up the same code and changes the question from “How do I build fan-out ETL?” to “How do I make it safe to run in production?”

What you learned

This tutorial has you design and build a sharded customer-data merge pipeline from scratch on Render Workflows
Hash routing on `customer_id` is what makes the per-shard merge correct across four source files
Subtasks own data loading and per-shard compute. The orchestrator owns coordination and aggregation
Part 2 of the series productionizes what you build here: retries, idempotency, chaos drill, scale benchmark