Intermediate · ⏱ 90 min · 8 steps

ETL on Workflows, Part 1: Build a sharded pipeline

Design and build a sharded customer-data pipeline from scratch with hash routing, fan-out, and aggregation. Part 2 productionizes and scales the same pipeline.


#workflows #python #typescript #etl #etl-series

Prerequisites

Completed the Render Workflows quickstart, or equivalent SDK familiarity
Render CLI 2.11.0+
Python 3.11+ or Node 20+
Comfortable in a terminal and with a Python or Node project
A Render account with Workflows enabled (for the deploy step at the end)

Steps

01 What you'll build (5 min)
   Part 1 of ETL on Workflows. Design the sharded customer-data merge pipeline you'll later harden and scale.

02 Scaffold with `render workflows init` and strip the examples (10 min)
   Get a clean project that registers zero tasks, ready to receive your own.

03 Drop in sample data (10 min)
   Pull the data generator from the reference repo, generate 1K-row CSVs for the four sources, and look at what the merge will have to handle.

04 Write the sharding helper (8 min)
   A small module with one function that hashes a `customer_id` into a stable shard index. Deterministic, source-agnostic, no external deps.

05 Write the shard worker (`process_shard`) (15 min)
   `process_shard` takes a `shard_id`, loads all four CSVs, filters to its shard's customers, merges across sources, enriches each profile, and returns them.

06 Write the orchestrator (`merge_customer_data`) (12 min)
   `merge_customer_data` spawns N `process_shard` subtasks in parallel, awaits all of them, and aggregates the returns into a stats summary.

07 Trigger and verify locally (8 min)
   Write a small SDK client script, run the pipeline end-to-end against 1K rows, and confirm the aggregated output.

08 Deploy to Render (15 min)
   Push your code to GitHub, create the Workflow service in the Render Dashboard, and run the same trigger script against the deployed slug.
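
Preview: the shape of the pipeline

Before starting, it may help to see the three pieces from steps 04 to 06 in miniature. The sketch below is plain Python and does not use the Render Workflows SDK: `asyncio.gather` stands in for spawning subtasks, and a small in-memory `SOURCES` dict stands in for the four CSVs. The shard count, the source names, the sample rows, and the choice of SHA-256 for hashing are all assumptions made for illustration; the tutorial's own code may differ.

```python
import asyncio
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # assumed shard count, for illustration only

# Hypothetical in-memory stand-in for the four CSV sources.
SOURCES = {
    "crm": [{"customer_id": "c1", "email": "a@example.com"},
            {"customer_id": "c2", "email": "b@example.com"}],
    "billing": [{"customer_id": "c1", "plan": "pro"}],
    "support": [{"customer_id": "c2", "tickets": 3}],
    "web": [{"customer_id": "c3", "visits": 7}],
}


def shard_for(customer_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a customer_id to a stable shard index in [0, num_shards)."""
    # The built-in hash() is salted per process for strings, so a
    # cryptographic digest is used here to keep routing deterministic.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


async def process_shard(shard_id: int, num_shards: int = NUM_SHARDS) -> list[dict]:
    """Merge rows from every source for the customers routed to this shard."""
    profiles: dict[str, dict] = defaultdict(dict)
    for source, rows in SOURCES.items():
        for row in rows:
            if shard_for(row["customer_id"], num_shards) != shard_id:
                continue  # this customer belongs to another shard
            profile = profiles[row["customer_id"]]
            profile.update(row)
            # Minimal "enrichment": remember which sources contributed.
            profile.setdefault("sources", []).append(source)
    return list(profiles.values())


async def merge_customer_data(num_shards: int = NUM_SHARDS) -> dict:
    """Fan out one worker per shard, await them all, and aggregate stats."""
    results = await asyncio.gather(
        *(process_shard(i, num_shards) for i in range(num_shards))
    )
    per_shard = [len(profiles) for profiles in results]
    return {"shards": num_shards, "customers": sum(per_shard), "per_shard": per_shard}


if __name__ == "__main__":
    print(asyncio.run(merge_customer_data()))
```

Because every source routes a given `customer_id` through the same hash, all rows for one customer land in the same shard, so each worker can merge its customers without coordinating with the others.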