Write the sharding helper — ETL on Workflows, Part 1: Build a sharded pipeline

In this step you’ll write the routing function every process_shard subtask uses to decide which customers belong to it. The whole module is one function and a constant, but the design decisions matter: this is what keeps every source file’s records for the same customer together.

The contract

get_shard_id(customer_id) must:

Return the same shard index for the same customer_id on every call, in every process, forever.
Distribute customer IDs roughly evenly across shards (no hot shard).
Have no external dependencies, no network calls, no random seeds.

The simplest implementation that satisfies all three is hash(customer_id) % NUM_SHARDS, where hash stands for a stable hash function such as MD5, not Python's built-in hash().

Write the module

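import hashlib

# Number of shards; every subtask must use the same value.
NUM_SHARDS = 10


def get_shard_id(customer_id: str) -> int:
    """Return a stable shard index in [0, NUM_SHARDS) for customer_id."""
    # MD5 is used only for its stable, well-mixed output, not for security.
    hash_bytes = hashlib.md5(customer_id.encode()).digest()
    # Read the first 4 bytes as a big-endian unsigned integer, then bucket it.
    return int.from_bytes(hash_bytes[:4], "big") % NUM_SHARDS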

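import { createHash } from "node:crypto";

// Number of shards; every subtask must use the same value.
export const NUM_SHARDS = 10;

// Returns a stable shard index in [0, NUM_SHARDS) for customerId.
export function getShardId(customerId: string): number {
  // MD5 is used only for its stable, well-mixed output, not for security.
  const hashBytes = createHash("md5").update(customerId).digest();
  // Read the first 4 bytes as a big-endian unsigned integer, then bucket it.
  return hashBytes.readUInt32BE(0) % NUM_SHARDS;
}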
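
To see why this keeps a customer's records together, here is a minimal sketch that routes rows from two sources through the same function. The source names (orders, payments), the row shape, and the route_rows helper are hypothetical and not part of the tutorial's pipeline; get_shard_id is repeated so the snippet runs on its own.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 10


def get_shard_id(customer_id: str) -> int:
    # Same function as in the module above, repeated so this runs standalone.
    hash_bytes = hashlib.md5(customer_id.encode()).digest()
    return int.from_bytes(hash_bytes[:4], "big") % NUM_SHARDS


def route_rows(rows):
    """Group rows by shard index (hypothetical helper, for illustration only)."""
    by_shard = defaultdict(list)
    for row in rows:
        by_shard[get_shard_id(row["customer_id"])].append(row)
    return by_shard


# Made-up rows standing in for two different source files.
orders = [
    {"source": "orders", "customer_id": "cust_00000001"},
    {"source": "orders", "customer_id": "cust_00000002"},
]
payments = [{"source": "payments", "customer_id": "cust_00000001"}]

shards = route_rows(orders + payments)

# Each customer's rows end up in exactly one shard, whatever their source.
for shard_id, rows in sorted(shards.items()):
    print(shard_id, [(r["source"], r["customer_id"]) for r in rows])
```

Because routing depends only on customer_id, the orders row and the payments row for cust_00000001 always land in the same shard.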

Confirm it’s deterministic

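Use a throwaway one-liner. Run it twice and confirm that both runs produce identical output. The check is that the two runs match each other, not that they match the sample output shown here: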

Terminal

$ python -c "from workflows.sharding import get_shard_id; print([get_shard_id(f'cust_{i:08d}') for i in range(10)])"
[3, 8, 1, 5, 9, 0, 4, 7, 2, 6]
$ # Re-run. The output is byte-identical:
$ python -c "from workflows.sharding import get_shard_id; print([get_shard_id(f'cust_{i:08d}') for i in range(10)])"
[3, 8, 1, 5, 9, 0, 4, 7, 2, 6]

Terminal

$ npx tsx -e "import { getShardId } from './workflows/src/sharding'; console.log([...Array(10)].map((_, i) => getShardId('cust_' + String(i).padStart(8, '0'))))"
[3, 8, 1, 5, 9, 0, 4, 7, 2, 6]
$ # Re-run. The output is byte-identical:
$ # (same command again)
[3, 8, 1, 5, 9, 0, 4, 7, 2, 6]

If the two runs disagree, the hash function isn’t stable. That’s the single most common bug in shard routing.

Hint

With 10 shards and 1M customers, you expect roughly 100K customers per shard. MD5 mixes its input well enough that the per-shard counts stay within a few percent of that figure (test it yourself with Counter(get_shard_id(...) for ...) on a sample). The same holds at 100 shards. The reason to cap shard count isn’t collision risk but coordination overhead: every shard is its own Workflow instance, and each one needs to boot, run, and shut down.
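
The Counter check mentioned in the hint could look like the sketch below. The sample size of 100,000 and the cust_ ID format are assumptions chosen to keep the run short; get_shard_id is repeated so the snippet is self-contained.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 10
SAMPLE_SIZE = 100_000  # assumption: a sample smaller than 1M, to keep the run short


def get_shard_id(customer_id: str) -> int:
    # Same function as in the module above, repeated so this runs standalone.
    hash_bytes = hashlib.md5(customer_id.encode()).digest()
    return int.from_bytes(hash_bytes[:4], "big") % NUM_SHARDS


# Count how many sample customers land on each shard.
counts = Counter(get_shard_id(f"cust_{i:08d}") for i in range(SAMPLE_SIZE))
expected = SAMPLE_SIZE / NUM_SHARDS

for shard_id in range(NUM_SHARDS):
    deviation = (counts[shard_id] - expected) / expected
    print(f"shard {shard_id}: {counts[shard_id]:>6} ({deviation:+.2%})")

# Largest relative gap between any shard and a perfectly even split.
worst = max(abs(counts[s] - expected) / expected for s in range(NUM_SHARDS))
print(f"largest deviation from an even split: {worst:.2%}")
```

To try 100 shards, change NUM_SHARDS and re-run; a larger sample tightens the relative deviations.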

You ship the pipeline with Python's built-in `hash()` instead of `hashlib.md5`. The local dev run looks fine. What breaks on Render?

Nothing. Python's `hash()` is stable enough for sharding.
Each fresh process gets a different hash seed, so the same `customer_id` lands on a different shard in different subtasks. The merge silently produces duplicates and gaps.
`hash()` raises an exception for strings longer than 64 characters.
Render Workflows blocks `hash()` calls in tasks for security reasons.
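
The behaviour this question is about can be reproduced locally by starting fresh interpreters with different hash seeds. The sketch below is an illustration, not part of the tutorial's module: it pins PYTHONHASHSEED to two arbitrary values (1 and 2) to stand in for two subtask processes, and CHILD and run_child are hypothetical names.

```python
import os
import subprocess
import sys

# Child snippet: prints the built-in hash() value and the MD5-based shard.
CHILD = (
    "import hashlib\n"
    "cid = 'cust_00000001'\n"
    "md5_shard = int.from_bytes(hashlib.md5(cid.encode()).digest()[:4], 'big') % 10\n"
    "print(hash(cid), md5_shard)\n"
)


def run_child(seed: str):
    """Run CHILD in a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env, capture_output=True, text=True, check=True,
    ).stdout.split()
    return int(out[0]), int(out[1])


# Two processes with different seeds (assumed values, chosen arbitrarily).
builtin_a, md5_a = run_child("1")
builtin_b, md5_b = run_child("2")

print("built-in hash() agrees across processes:", builtin_a == builtin_b)
print("MD5 shard agrees across processes:", md5_a == md5_b)
```

Under these assumptions the built-in hash() values should differ between the two runs while the MD5 shard stays the same, which is what the contract requires.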

What you learned

Sharding has one contract: same input always returns the same output, in any process, forever
Cryptographic-style hashes (MD5, SHA-256) are the safe default. Python's built-in `hash()` is randomized per process for strings, so it is unsafe for shard routing
Distribution across shards is uniform enough with MD5 that you don't need a separate balancer
The whole helper is one function. The lesson is the contract, not the code