Drop in sample data — ETL on Workflows, Part 1: Build a sharded pipeline

In this step you’ll grab the sample-data generator from render-examples/data-processor-workflow and produce four CSV files: crm.csv, billing.csv, product.csv, and support.csv. Each one represents one source system in a real customer-data stack.

The generator isn’t what this tutorial is about, so you’ll copy it rather than write it. The merge logic in steps 4 to 6 is what matters.

Pull the generator

Terminal

$ mkdir -p scripts && cd scripts
$ curl -fsSL https://raw.githubusercontent.com/render-examples/data-processor-workflow/main/scripts/generate_data.py -o generate_data.py

The generator uses only the Python standard library (random, csv, datetime). No extra installs. If you chose the TypeScript path for everything else in this tutorial, you still need Python on your machine for this one script. The CSV files it produces are language-agnostic.

Generate 1K rows per source

The generator hardcodes its output path to sample_data/ at the repo root (one level up from scripts/). Run it from scripts/, where the previous step left you:

Terminal

$ python generate_data.py --rows 1000
Generating 1,000 rows per CSV file...

Generating CRM data...
  Created sample_data/crm.csv (1,000 rows)
Generating billing data...
  Created sample_data/billing.csv (1,000 rows)
Generating product data...
  Created sample_data/product.csv (1,000 rows)
Generating support data...
  Created sample_data/support.csv (1,000 rows)

Done! Generated 4,000 total records in customer-merge/sample_data
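
If you want to confirm the row counts yourself, the sketch below counts the data rows in each file. It assumes every CSV starts with a single header row; the names `count_rows`, `check_sample_data`, and `SOURCES` are illustrative helpers and not part of the generator.

```python
import csv
from pathlib import Path

# The four source names used in this tutorial.
SOURCES = ("crm", "billing", "product", "support")


def count_rows(csv_path):
    """Return the number of data rows in a CSV file, excluding the header."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row (assumed to be present)
        return sum(1 for _ in reader)


def check_sample_data(data_dir="sample_data", expected=1000):
    """Map each source name to True if its file has the expected row count."""
    return {
        name: count_rows(Path(data_dir) / f"{name}.csv") == expected
        for name in SOURCES
    }
```

Calling `check_sample_data("../sample_data")` from scripts/ (or `check_sample_data()` from the repo root) should report `True` for all four sources after the run above.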

What’s in each source

Source	Columns	Owns
`crm.csv`	`customer_id, email, company_name, industry, employee_count, deal_stage, deal_value, sales_owner, last_contact`	Sales-facing facts
`billing.csv`	`customer_id, email, plan, mrr, payment_status, subscription_start, last_payment`	Revenue and contract state
`product.csv`	`customer_id, email, signup_date, last_active, total_sessions, features_used, usage_pct, account_status`	Product engagement
`support.csv`	`customer_id, email, total_tickets, open_tickets, avg_resolution_hrs, last_ticket_date, nps_score, csat_score`	Service quality
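
To see which columns the four files have in common, you can intersect their header rows. This is a minimal sketch that assumes each file's first line is its header; `shared_columns` is a hypothetical helper name.

```python
import csv


def shared_columns(csv_paths):
    """Return the columns present in every file's header, in first-file order."""
    headers = []
    for path in csv_paths:
        with open(path, newline="") as f:
            # The first row is assumed to be the header; empty files give [].
            headers.append(next(csv.reader(f), []))
    if not headers:
        return []
    common = set(headers[0]).intersection(*headers[1:])
    return [col for col in headers[0] if col in common]
```

Going by the table above, the result for the four sample files would be `customer_id` and `email`. Only `customer_id` is usable as a join key, though: the spot-check later in this step shows a different email value in each source for the same customer.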

Spot-check one customer

Pick any customer_id from crm.csv and look it up across the other three files to see what the merge has to combine. From the repo root (one level up from scripts/):

Terminal

$ cd ..
$ CUST=$(head -2 sample_data/crm.csv | tail -1 | cut -d, -f1)
$ echo "Looking up $CUST across all four sources:"
Looking up cust_00000001 across all four sources:
$ for f in sample_data/*.csv; do echo "--- $f ---"; grep "^$CUST," "$f"; done
--- sample_data/billing.csv ---
cust_00000001,billing_1@example.com,Business,247,Active,...
--- sample_data/crm.csv ---
cust_00000001,jordan.smith@globalcorp.com,Global Corp,Technology,250,...
--- sample_data/product.csv ---
cust_00000001,product_1@example.com,2024-08-12,...
--- sample_data/support.csv ---
cust_00000001,support_1@example.com,3,1,12.4,2026-02-14,8,4.2
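
If grep and cut aren't available on your machine, the same lookup can be sketched in Python. `lookup_customer` is a hypothetical helper; it assumes each file has a header row that includes `customer_id`.

```python
import csv
from pathlib import Path


def lookup_customer(customer_id, data_dir="sample_data"):
    """Return {source_name: row} for each CSV in data_dir containing the customer."""
    found = {}
    # sorted() gives the same alphabetical order as the shell glob above.
    for path in sorted(Path(data_dir).glob("*.csv")):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if row.get("customer_id") == customer_id:
                    found[path.stem] = row  # e.g. "crm" for crm.csv
                    break  # first match per file is enough
    return found
```

From the repo root, `lookup_customer("cust_00000001")` returns a dict keyed by source name. A source that doesn't contain the customer is simply absent from the result.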

The generator produces one row per customer per source, so every customer appears in all four files. Real pipelines aren’t that clean: some customers exist only in the CRM (new leads with no billing yet), some have billing and support records but no product activity (a churn risk), and some are missing from the CRM entirely (legacy accounts). The merge in step 5 uses dict.update()-style merging, so a missing source leaves its fields out of the profile instead of causing a failure.
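
The idea can be illustrated with a few lines of Python. This is a sketch of dict.update()-style merging in general, not the step 5 implementation; `merge_profile`, the `cust_a`/`cust_b` IDs, and the row values are hypothetical.

```python
def merge_profile(customer_id, sources):
    """Combine one customer's rows from several sources into a single dict.

    sources maps a source name to {customer_id: row_dict}.
    """
    profile = {"customer_id": customer_id}
    for rows_by_id in sources.values():
        row = rows_by_id.get(customer_id)
        if row is None:
            continue  # customer missing from this source: skip it, don't fail
        profile.update(row)  # copy this source's fields into the profile
    return profile


# Hypothetical rows: cust_b exists in billing only.
crm = {"cust_a": {"customer_id": "cust_a", "industry": "Technology"}}
billing = {
    "cust_a": {"customer_id": "cust_a", "plan": "Business"},
    "cust_b": {"customer_id": "cust_b", "plan": "Starter"},
}
sources = {"crm": crm, "billing": billing, "product": {}, "support": {}}

print(merge_profile("cust_a", sources))
# {'customer_id': 'cust_a', 'industry': 'Technology', 'plan': 'Business'}
print(merge_profile("cust_b", sources))
# {'customer_id': 'cust_b', 'plan': 'Starter'}
```

With a plain update(), a column that appears in more than one source (such as email) ends up holding the value from whichever source is merged last, so a real merge has to decide how to treat those columns.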

What you learned

Four source files (CRM, Billing, Product, Support) share `customer_id` as the join key
The merge in step 5 produces one enriched profile per customer by combining all four sources
1K rows keeps iteration fast; you'll switch to 1M in the productionize tutorial
Real pipelines have customers missing from some sources; the merge has to handle that