In this step you’ll grab the sample-data generator from render-examples/data-processor-workflow and produce four CSV files: crm.csv, billing.csv, product.csv, and support.csv. Each one represents one source system in a real customer-data stack.
The generator isn’t what this tutorial is about, so you’ll copy it rather than write it. The merge logic in steps 4 to 6 is what matters.
Pull the generator
$ mkdir -p scripts && cd scripts
$ curl -fsSL https://raw.githubusercontent.com/render-examples/data-processor-workflow/main/scripts/generate_data.py -o generate_data.py
The generator uses only the Python standard library (random, csv, datetime). No extra installs. If you chose the TypeScript path for everything else in this tutorial, you still need Python on your machine for this one script. The CSV files it produces are language-agnostic.
Generate 1K rows per source
The generator hardcodes its output path to sample_data/ at the repo root (one level up from scripts/). Run it from scripts/:
$ python generate_data.py --rows 1000
Generating 1,000 rows per CSV file...
Generating CRM data...
Created sample_data/crm.csv (1,000 rows)
Generating billing data...
Created sample_data/billing.csv (1,000 rows)
Generating product data...
Created sample_data/product.csv (1,000 rows)
Generating support data...
Created sample_data/support.csv (1,000 rows)
Done! Generated 4,000 total records in customer-merge/sample_data
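To confirm the run matches the log above, you can count the data rows in each file. The following is a minimal sketch and not part of the generator: `count_rows` and `count_all` are hypothetical helpers, and the sketch assumes you run it from the repo root (where `sample_data/` lives) and that each file starts with a single header row.

```python
import csv
from pathlib import Path

# The four source names from this step.
SOURCES = ["crm", "billing", "product", "support"]


def count_rows(csv_path):
    """Return the number of data rows in a CSV file, excluding the header."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, if there is one
        return sum(1 for _ in reader)


def count_all(data_dir="sample_data"):
    """Map each source name to its row count; a missing file maps to None."""
    counts = {}
    for name in SOURCES:
        path = Path(data_dir) / f"{name}.csv"
        counts[name] = count_rows(path) if path.exists() else None
    return counts


if __name__ == "__main__":
    for name, n in count_all().items():
        print(f"{name}.csv: {n if n is not None else 'not found'}")
```

With `--rows 1000`, each file should report 1000 if the assumptions above hold.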
What’s in each source
| Source | Columns | Owns |
|---|---|---|
| crm.csv | customer_id, email, company_name, industry, employee_count, deal_stage, deal_value, sales_owner, last_contact | Sales-facing facts |
| billing.csv | customer_id, email, plan, mrr, payment_status, subscription_start, last_payment | Revenue and contract state |
| product.csv | customer_id, email, signup_date, last_active, total_sessions, features_used, usage_pct, account_status | Product engagement |
| support.csv | customer_id, email, total_tickets, open_tickets, avg_resolution_hrs, last_ticket_date, nps_score, csat_score | Service quality |
Spot-check one customer
Pick any customer_id from crm.csv and look it up across the other three files to see what the merge has to combine. From the repo root (one level up from scripts/):
$ cd ..
$ CUST=$(head -2 sample_data/crm.csv | tail -1 | cut -d, -f1)
$ echo "Looking up $CUST across all four sources:"
Looking up cust_00000001 across all four sources:
$ for f in sample_data/*.csv; do echo "--- $f ---"; grep "^$CUST," $f; done
--- sample_data/billing.csv ---
cust_00000001,billing_1@example.com,Business,247,Active,...
--- sample_data/crm.csv ---
cust_00000001,jordan.smith@globalcorp.com,Global Corp,Technology,250,...
--- sample_data/product.csv ---
cust_00000001,product_1@example.com,2024-08-12,...
--- sample_data/support.csv ---
cust_00000001,support_1@example.com,3,1,12.4,2026-02-14,8,4.2
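If you don't have `grep` and `cut` available, the same spot-check can be done in Python. This is a rough equivalent, not code from the tutorial repo: `lookup_customer` is a hypothetical helper, and it assumes every file has a `customer_id` header column as listed in the table above.

```python
import csv
from pathlib import Path


def lookup_customer(customer_id, data_dir="sample_data"):
    """Return {source_name: row_dict} for every CSV that contains customer_id."""
    found = {}
    data_path = Path(data_dir)
    if not data_path.is_dir():
        return found  # nothing to search
    for path in sorted(data_path.glob("*.csv")):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if row.get("customer_id") == customer_id:
                    found[path.stem] = row  # key is the file name without .csv
                    break  # stop at the first match in this file
    return found


if __name__ == "__main__":
    # Run from the repo root, like the shell version above.
    for source, row in lookup_customer("cust_00000001").items():
        print(f"--- {source} ---")
        print(row)
```

Because `csv.DictReader` keys each value by its column name, the result is easier to read than the raw comma-separated lines, and a source that lacks the customer is simply absent from the returned dict.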
The generator produces one row per customer per source, so every customer appears in all four files. Real pipelines aren’t that clean: some customers are CRM-only (new leads with no billing yet), some have billing and support records but no product activity (churn risk), and some are missing from CRM entirely (legacy accounts). The merge in step 5 uses dict.update()-style merging, so a missing source just leaves those fields out instead of failing.
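The idea can be sketched as follows. This is an illustration of `dict.update()`-style merging under stated assumptions, not the step 5 implementation: `merge_profile` and its `order` parameter are hypothetical, and the sample values are taken from the spot-check output above.

```python
def merge_profile(rows_by_source, order=("crm", "billing", "product", "support")):
    """Combine one customer's per-source rows into a single flat dict.

    rows_by_source maps a source name to that customer's row, and may omit
    sources the customer does not appear in. The order is an assumption.
    """
    profile = {}
    for source in order:
        row = rows_by_source.get(source)
        if row is None:
            continue  # customer absent from this source: its fields are left out
        profile.update(row)  # later sources overwrite keys they share with earlier ones
    return profile


crm_row = {
    "customer_id": "cust_00000001",
    "email": "jordan.smith@globalcorp.com",
    "company_name": "Global Corp",
}
billing_row = {
    "customer_id": "cust_00000001",
    "email": "billing_1@example.com",
    "plan": "Business",
    "mrr": "247",
}

lead = merge_profile({"crm": crm_row})  # CRM-only customer
full = merge_profile({"crm": crm_row, "billing": billing_row})

print("plan" in lead)  # False
print(full["email"])   # billing_1@example.com
```

One consequence of plain `update()` is visible in the last line: all four sources carry an `email` column, so whichever source is applied last wins. A real merge would need to either pick an authoritative source per field or prefix the keys; how step 5 resolves this is not shown here.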
What you learned
- Four source files (CRM, Billing, Product, Support) share `customer_id` as the join key
- The merge in step 5 produces one enriched profile per customer by combining all four sources
- 1K rows keeps iteration fast; you'll switch to 1M in the productionize tutorial
- Real pipelines have customers missing from some sources; the merge has to handle that