Trigger and verify locally — ETL on Workflows, Part 1: Build a sharded pipeline

In this step you’ll write the trigger script you’ll use for the rest of the series, run the pipeline against your 1K-row sample, and confirm merge_customer_data returns the expected stats summary.

Write the trigger script

Save this file in the repo root (one level up from workflows/) as trigger.py for Python or trigger.ts for TypeScript:

import os
from render_sdk import Render

render = Render()

slug = os.getenv("WORKFLOW_SLUG", "local/merge_customer_data")
result = render.workflows.run_task(slug, [])
data = result.results

print(f"Generated {data['profiles_generated']} profiles across {data['shards_processed']} shards")
print(f"Avg health score: {data['statistics']['avg_health_score']}")
print(f"Churn distribution: {data['statistics']['churn_distribution']}")
print(f"Sample profile keys: {sorted((data['sample_profile'] or {}).keys())}")

assert data["profiles_generated"] == 1000, f"expected 1000 profiles, got {data['profiles_generated']}"
print("OK")

import { Render } from "@renderinc/sdk";

const render = new Render();

const slug = process.env.WORKFLOW_SLUG ?? "local/merge_customer_data";
const started = await render.workflows.startTask(slug, []);
const finished = await started.get();
const data = finished.results as {
  profiles_generated: number;
  shards_processed: number;
  sample_profile: Record<string, unknown> | null;
  statistics: { avg_health_score: number; churn_distribution: Record<string, number> };
};

console.log(`Generated ${data.profiles_generated} profiles across ${data.shards_processed} shards`);
console.log(`Avg health score: ${data.statistics.avg_health_score}`);
console.log(`Churn distribution: ${JSON.stringify(data.statistics.churn_distribution)}`);
console.log(`Sample profile keys: ${Object.keys(data.sample_profile ?? {}).sort().join(", ")}`);

if (data.profiles_generated !== 1000) {
  throw new Error(`expected 1000 profiles, got ${data.profiles_generated}`);
}
console.log("OK");

The script defaults to the local-dev slug (local/merge_customer_data). To target the deployed Workflow on Render, set WORKFLOW_SLUG to the deployed slug. The same file works locally and remotely. You’ll change two env vars in step 8 and run it again.
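The slug fallback can be isolated into a small helper to make the behavior easier to see. This is a minimal sketch: resolve_slug is a hypothetical name, not part of the Render SDK, and the deployed slug in the usage lines is a placeholder rather than a real value.

```python
import os
from typing import Mapping, Optional

# Local-dev slug used by the trigger script.
DEFAULT_SLUG = "local/merge_customer_data"


def resolve_slug(env: Optional[Mapping[str, str]] = None) -> str:
    """Return WORKFLOW_SLUG if it is set, otherwise the local-dev default.

    Hypothetical helper for illustration; it mirrors the os.getenv call
    in the trigger script.
    """
    env = os.environ if env is None else env
    return env.get("WORKFLOW_SLUG", DEFAULT_SLUG)


# No override: falls back to the local-dev slug.
print(resolve_slug({}))
# Override present (placeholder value): the override wins.
print(resolve_slug({"WORKFLOW_SLUG": "my-workflow/merge_customer_data"}))
```

Passing the environment as an argument keeps the helper testable without touching the real process environment.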

Run it end to end

Make sure the dev server from step 6 is still running. In a second terminal:

Terminal

$ RENDER_USE_LOCAL_DEV=true python trigger.py
Generated 1000 profiles across 10 shards
Avg health score: 52.7
Churn distribution: {'LOW': 412, 'MEDIUM': 487, 'HIGH': 101}
Sample profile keys: ['account_status', 'avg_resolution_hrs', 'churn_risk', 'company_name', 'csat_score', 'customer_id', 'deal_stage', 'deal_value', 'email', 'employee_count', 'expansion_potential', 'features_used', 'health_score', 'industry', 'last_active', 'last_contact', 'last_payment', 'last_ticket_date', 'mrr', 'nps_score', 'open_tickets', 'payment_status', 'plan', 'sales_owner', 'signup_date', 'subscription_start', 'total_sessions', 'total_tickets', 'usage_pct']
OK

Terminal

$ RENDER_USE_LOCAL_DEV=true npx tsx trigger.ts
Generated 1000 profiles across 10 shards
Avg health score: 52.7
Churn distribution: {"LOW":412,"MEDIUM":487,"HIGH":101}
Sample profile keys: account_status, avg_resolution_hrs, churn_risk, ...
OK

Three signals confirm the pipeline works:

profiles_generated == 1000 matches the input row count. No customers were dropped and none were duplicated.
shards_processed == 10 means all ten subtasks completed.
The sample profile has fields from all four sources (industry from CRM, mrr from Billing, total_sessions from Product, nps_score from Support, plus the three enrichment fields). The merge worked.
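The three signals above can also be checked programmatically. The sketch below is one possible way to do it; check_summary and SOURCE_FIELDS are hypothetical names, and the field-to-source mapping uses only the four representative fields named in the list above.

```python
from typing import Any, Dict, List

# One representative field per source, taken from the list above.
SOURCE_FIELDS = {
    "CRM": "industry",
    "Billing": "mrr",
    "Product": "total_sessions",
    "Support": "nps_score",
}


def check_summary(data: Dict[str, Any], expected_rows: int, expected_shards: int) -> List[str]:
    """Return a list of failed checks; an empty list means all three signals hold."""
    problems = []
    # Signal 1: every input row produced exactly one profile.
    if data["profiles_generated"] != expected_rows:
        problems.append(
            f"profiles_generated is {data['profiles_generated']}, expected {expected_rows}"
        )
    # Signal 2: every shard subtask completed.
    if data["shards_processed"] != expected_shards:
        problems.append(
            f"shards_processed is {data['shards_processed']}, expected {expected_shards}"
        )
    # Signal 3: the sample profile carries a field from each source.
    sample = data.get("sample_profile") or {}
    for source, field in SOURCE_FIELDS.items():
        if field not in sample:
            problems.append(f"sample profile is missing {field!r} from {source}")
    return problems


# Usage with a trimmed-down summary (sample values are illustrative only).
summary = {
    "profiles_generated": 1000,
    "shards_processed": 10,
    "sample_profile": {"industry": "x", "mrr": 1, "total_sessions": 2, "nps_score": 3},
}
print(check_summary(summary, expected_rows=1000, expected_shards=10))  # prints []
```

Returning a list of problems instead of raising on the first failure shows every broken signal in one run.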

The aggregated output deliberately does not return the full profile list. Returning it would risk exceeding the 4 MB return-payload limit on the orchestrator, and it isn’t useful for verification at this size. In a real pipeline the orchestrator would write profiles to S3, Postgres, or another sink before returning the stats summary.
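To get a rough feel for how close a return value is to that limit, you can measure its serialized size. The sketch below rests on two assumptions: that the limit applies to something close to the JSON-encoded size, and that 4 MB means 4 × 1024 × 1024 bytes. The text above states neither, and payload_bytes and fits_return_limit are hypothetical helpers.

```python
import json
from typing import Any

# Assumption: "4 MB" is read as 4 MiB. Adjust if the platform counts differently.
RETURN_LIMIT_BYTES = 4 * 1024 * 1024


def payload_bytes(payload: Any) -> int:
    """Size of the payload when JSON-encoded as UTF-8."""
    return len(json.dumps(payload).encode("utf-8"))


def fits_return_limit(payload: Any, limit: int = RETURN_LIMIT_BYTES) -> bool:
    """True if the JSON-encoded payload is within the assumed limit."""
    return payload_bytes(payload) <= limit


# A stats summary is a few hundred bytes at most.
summary = {
    "profiles_generated": 1000,
    "shards_processed": 10,
    "statistics": {"avg_health_score": 52.7},
}
print(payload_bytes(summary), fits_return_limit(summary))
```

The summary stays roughly the same size as the input grows, while a full profile list grows linearly with the row count. That is why a summary is the safer thing to return.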

Most first-run failures fall into three buckets:

DATA_DIR points at the wrong path. The default is ../sample_data relative to workflows/. If you ran generate_data.py from a different location, set DATA_DIR explicitly or move the CSVs.
RENDER_USE_LOCAL_DEV is not set. Without it, the SDK tries to call Render’s API and fails with an auth error.
The dev server isn’t running, or it crashed on a missing import. Restart it in the foreground (no &) so you can see the traceback.
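The first two failure modes can be caught before triggering the task. The following is a minimal preflight sketch, not part of the Render SDK. preflight is a hypothetical helper, and it assumes that the data files end in .csv and that the variable is considered set only when it equals "true". It does not check whether the dev server is running, because the text above does not say how to reach it.

```python
import os
from pathlib import Path
from typing import List, Mapping


def preflight(env: Mapping[str, str], workflows_dir: Path) -> List[str]:
    """Return human-readable problems found before triggering the pipeline."""
    problems = []

    # DATA_DIR defaults to ../sample_data relative to workflows/.
    data_dir = Path(env.get("DATA_DIR", "../sample_data"))
    if not data_dir.is_absolute():
        data_dir = workflows_dir / data_dir
    if not data_dir.is_dir():
        problems.append(f"DATA_DIR does not exist: {data_dir}")
    elif not any(data_dir.glob("*.csv")):
        problems.append(f"no CSV files found in {data_dir}")

    # Without this variable the SDK calls Render's API instead of the dev server.
    if env.get("RENDER_USE_LOCAL_DEV", "").lower() != "true":
        problems.append("RENDER_USE_LOCAL_DEV is not set to true")

    return problems


if __name__ == "__main__":
    # Run from the repo root, where workflows/ is a subdirectory.
    for problem in preflight(os.environ, Path("workflows")):
        print("WARN:", problem)
```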

What you learned

The SDK client is one file. The same script works locally and against the deployed Workflow.
`RENDER_USE_LOCAL_DEV=true` targets the local dev server. Without it, the SDK talks to Render.
Three signals prove correctness: total profile count, shard count, sample-profile field shape.
The aggregated output is intentionally a stats summary, not the raw profiles, to stay inside the 4 MB return limit.