Render Tutorials
ETL on Workflows, Part 1: Build a sharded pipeline

Write the sharding helper

In this step you’ll write the routing function every process_shard subtask uses to decide which customers belong to it. The whole module is one function and a constant, but the design decisions matter: this is what keeps every source file’s records for the same customer together.

The contract

get_shard_id(customer_id) must:

  • Return the same shard index for the same customer_id on every call, in every process, forever.
  • Distribute customer IDs roughly evenly across shards (no hot shard).
  • Have no external dependencies, no network calls, no random seeds.

The simplest implementation that satisfies all three is hash(customer_id) % NUM_SHARDS, where hash is a stable hash function: one whose output for a given input never varies between processes or machines.

Write the module

workflows/sharding.py

import hashlib

NUM_SHARDS = 10


def get_shard_id(customer_id: str) -> int:
    # MD5 is used here for stable, well-spread routing, not for security.
    hash_bytes = hashlib.md5(customer_id.encode()).digest()
    # Read the first 4 digest bytes as a big-endian unsigned int, then bucket it.
    return int.from_bytes(hash_bytes[:4], "big") % NUM_SHARDS

workflows/src/sharding.ts

import { createHash } from "node:crypto";

export const NUM_SHARDS = 10;

export function getShardId(customerId: string): number {
  // Same scheme as the Python helper: first 4 bytes of the MD5 digest, big-endian.
  const hashBytes = createHash("md5").update(customerId).digest();
  return hashBytes.readUInt32BE(0) % NUM_SHARDS;
}
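
As a rough illustration of why the contract matters, the sketch below buckets rows from two made-up source files and shows that each customer's rows end up in a single shard. The group_by_shard helper and the sample rows are assumptions for this example, not part of the tutorial's pipeline; the real process_shard subtask may filter records differently.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 10


def get_shard_id(customer_id: str) -> int:
    # Repeated from workflows/sharding.py so this sketch runs on its own.
    hash_bytes = hashlib.md5(customer_id.encode()).digest()
    return int.from_bytes(hash_bytes[:4], "big") % NUM_SHARDS


def group_by_shard(records):
    """Bucket records by the shard their customer_id routes to (hypothetical helper)."""
    shards = defaultdict(list)
    for record in records:
        shards[get_shard_id(record["customer_id"])].append(record)
    return shards


# Hypothetical rows from two source files that mention the same customers.
orders = [{"customer_id": f"cust_{i:08d}", "source": "orders"} for i in range(5)]
payments = [{"customer_id": f"cust_{i:08d}", "source": "payments"} for i in range(5)]

shards = group_by_shard(orders + payments)
for shard_id, rows in sorted(shards.items()):
    print(shard_id, [(r["customer_id"], r["source"]) for r in rows])
```

Because routing depends only on customer_id, the "orders" and "payments" rows for any one customer always print under the same shard index.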

Confirm it’s deterministic

A throwaway one-liner is enough here. Run it twice and confirm that both runs produce identical output:

Terminal
$ python -c "from workflows.sharding import get_shard_id; print([get_shard_id(f'cust_{i:08d}') for i in range(10)])"
[3, 8, 1, 5, 9, 0, 4, 7, 2, 6]
$ # Re-run. The output is byte-identical:
$ python -c "from workflows.sharding import get_shard_id; print([get_shard_id(f'cust_{i:08d}') for i in range(10)])"
[3, 8, 1, 5, 9, 0, 4, 7, 2, 6]

Terminal
$ npx tsx -e "import { getShardId } from './workflows/src/sharding'; console.log([...Array(10)].map((_, i) => getShardId('cust_' + String(i).padStart(8, '0'))))"
[3, 8, 1, 5, 9, 0, 4, 7, 2, 6]
$ # Re-run. The output is byte-identical:
$ # (same command again)
[3, 8, 1, 5, 9, 0, 4, 7, 2, 6]

If the two runs disagree, the hash function isn’t stable across processes. That’s the single most common bug in shard routing.
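
To see what an unstable hash looks like, the sketch below runs the same one-liner in fresh interpreters under different PYTHONHASHSEED values. By default Python picks a random seed for string hashing in each process; the fixed seeds here only make the effect reproducible. The run_with_seed helper and both snippets are assumptions for illustration, not part of the tutorial's code.

```python
import os
import subprocess
import sys

# Shard ids for ten customer ids, computed with MD5 (as in the helper).
MD5_SNIPPET = (
    "import hashlib; "
    "print([int.from_bytes(hashlib.md5(f'cust_{i:08d}'.encode()).digest()[:4], 'big') % 10 "
    "for i in range(10)])"
)
# The same routing, but using Python's built-in hash().
BUILTIN_SNIPPET = "print([hash(f'cust_{i:08d}') % 10 for i in range(10)])"


def run_with_seed(snippet: str, seed: int) -> str:
    """Run snippet in a fresh interpreter with the given PYTHONHASHSEED."""
    env = {**os.environ, "PYTHONHASHSEED": str(seed)}
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


seeds = (1, 2, 3)
md5_outputs = {run_with_seed(MD5_SNIPPET, seed) for seed in seeds}
builtin_outputs = {run_with_seed(BUILTIN_SNIPPET, seed) for seed in seeds}

print("distinct outputs with md5:   ", len(md5_outputs))
print("distinct outputs with hash():", len(builtin_outputs))
```

The MD5 version should report a single distinct output no matter the seed, while the built-in hash() version should report several, which is the disagreement the two-run check is meant to catch.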

Hint

With 10 shards and 1M customers, you expect roughly 100K customers per shard. MD5 has enough entropy that the distribution is uniform to within a few percent (test it yourself with Counter(get_shard_id(...) for ...) on a sample). The same holds at 100 shards. The reason to cap the shard count isn’t collision risk but coordination overhead: every shard is its own Workflow instance, and each one needs to boot, run, and shut down.
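
The Counter check mentioned above could look something like the following. It is a minimal sketch: SAMPLE_SIZE and the cust_ ID format are assumptions chosen to keep the run fast, and the helper is repeated so the snippet is self-contained.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 10
SAMPLE_SIZE = 100_000  # assumption: a smaller sample than 1M, to keep the run quick


def get_shard_id(customer_id: str) -> int:
    # Repeated from workflows/sharding.py so this sketch runs on its own.
    hash_bytes = hashlib.md5(customer_id.encode()).digest()
    return int.from_bytes(hash_bytes[:4], "big") % NUM_SHARDS


# Count how many sample customers land in each shard.
counts = Counter(get_shard_id(f"cust_{i:08d}") for i in range(SAMPLE_SIZE))

expected = SAMPLE_SIZE / NUM_SHARDS
for shard_id in range(NUM_SHARDS):
    deviation = (counts[shard_id] - expected) / expected
    print(f"shard {shard_id}: {counts[shard_id]:>6} customers ({deviation:+.2%})")
```

Changing NUM_SHARDS to 100 repeats the same check for the larger shard count.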

You ship the pipeline with Python's built-in `hash()` instead of `hashlib.md5`. The local dev run looks fine. What breaks on Render?

What you learned

  • Sharding has one contract: same input always returns the same output, in any process, forever
  • Cryptographic-style hashes (MD5, SHA-256) are the safe default. Python's built-in `hash()` is randomized per process for strings, so it is unsafe for routing
  • Distribution across shards is uniform enough with MD5 that you don't need a separate balancer
  • The whole helper is one function. The lesson is the contract, not the code