ETL on Workflows, Part 1: Build a sharded pipeline

Trigger and verify locally

⏱ 8 min

In this step you’ll write the trigger script you’ll use for the rest of the series, run the pipeline against your 1K-row sample, and confirm merge_customer_data returns the expected stats summary.

Write the trigger script

Put this file in the repo root (one level up from workflows/):

trigger.py
import os

from render_sdk import Render

render = Render()

# Defaults to the local-dev slug; set WORKFLOW_SLUG to target the deployed Workflow.
slug = os.getenv("WORKFLOW_SLUG", "local/merge_customer_data")

# Run the orchestrator task with no arguments and read its returned stats summary.
result = render.workflows.run_task(slug, [])
data = result.results

print(f"Generated {data['profiles_generated']} profiles across {data['shards_processed']} shards")
print(f"Avg health score: {data['statistics']['avg_health_score']}")
print(f"Churn distribution: {data['statistics']['churn_distribution']}")
print(f"Sample profile keys: {sorted((data['sample_profile'] or {}).keys())}")

# Fail loudly if any of the 1,000 sample rows was dropped or duplicated.
assert data["profiles_generated"] == 1000, f"expected 1000 profiles, got {data['profiles_generated']}"
print("OK")
trigger.ts
import { Render } from "@renderinc/sdk";

const render = new Render();
const slug = process.env.WORKFLOW_SLUG ?? "local/merge_customer_data";

const started = await render.workflows.startTask(slug, []);
const finished = await started.get();
const data = finished.results as {
  profiles_generated: number;
  shards_processed: number;
  sample_profile: Record<string, unknown> | null;
  statistics: { avg_health_score: number; churn_distribution: Record<string, number> };
};

console.log(`Generated ${data.profiles_generated} profiles across ${data.shards_processed} shards`);
console.log(`Avg health score: ${data.statistics.avg_health_score}`);
console.log(`Churn distribution: ${JSON.stringify(data.statistics.churn_distribution)}`);
console.log(`Sample profile keys: ${Object.keys(data.sample_profile ?? {}).sort().join(", ")}`);

if (data.profiles_generated !== 1000) {
  throw new Error(`expected 1000 profiles, got ${data.profiles_generated}`);
}
console.log("OK");

The script defaults to the local-dev slug (local/merge_customer_data). To run against Render, set WORKFLOW_SLUG to the deployed slug. The same file works locally and remotely: you’ll change two env vars in step 8 and run it again.
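The slug switch is a single environment lookup with a fallback. Here is a minimal sketch of that lookup in isolation. `resolve_slug` is a hypothetical helper, not part of the SDK, and the deployed slug in the usage lines is a placeholder rather than a real value.

```python
import os

DEFAULT_SLUG = "local/merge_customer_data"  # local-dev slug used by trigger.py


def resolve_slug(env=None):
    """Return WORKFLOW_SLUG if it is set and non-empty, else the local-dev slug.

    Hypothetical helper for illustration. Unlike os.getenv with a default,
    this also falls back when the variable is set but empty.
    """
    env = os.environ if env is None else env
    return env.get("WORKFLOW_SLUG") or DEFAULT_SLUG


# Placeholder slug: the real value comes from your own deployed Workflow.
print(resolve_slug({}))
print(resolve_slug({"WORKFLOW_SLUG": "my-workflow/merge_customer_data"}))
```

Passing the environment as a plain dict keeps the lookup easy to exercise without touching the real process environment.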

Run it end to end

Make sure the dev server from step 6 is still running. In a second terminal:

Terminal
$ RENDER_USE_LOCAL_DEV=true python trigger.py
Generated 1000 profiles across 10 shards
Avg health score: 52.7
Churn distribution: {'LOW': 412, 'MEDIUM': 487, 'HIGH': 101}
Sample profile keys: ['account_status', 'avg_resolution_hrs', 'churn_risk', 'company_name', 'csat_score', 'customer_id', 'deal_stage', 'deal_value', 'email', 'employee_count', 'expansion_potential', 'features_used', 'health_score', 'industry', 'last_active', 'last_contact', 'last_payment', 'last_ticket_date', 'mrr', 'nps_score', 'open_tickets', 'payment_status', 'plan', 'sales_owner', 'signup_date', 'subscription_start', 'total_sessions', 'total_tickets', 'usage_pct']
OK
Terminal
$ RENDER_USE_LOCAL_DEV=true npx tsx trigger.ts
Generated 1000 profiles across 10 shards
Avg health score: 52.7
Churn distribution: {"LOW":412,"MEDIUM":487,"HIGH":101}
Sample profile keys: account_status, avg_resolution_hrs, churn_risk, ...
OK

Three signals confirm the pipeline works:

  • profiles_generated == 1000 matches the input row count. No customers were dropped and none were duplicated.
  • shards_processed == 10 means all ten subtasks completed.
  • The sample profile has fields from all four sources (industry from CRM, mrr from Billing, total_sessions from Product, nps_score from Support, plus the three enrichment fields). The merge worked.
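The three signals can also be checked programmatically. The sketch below works on the `data` dictionary that trigger.py prints from. `verify_summary` is a hypothetical helper, and the names of the three enrichment fields are an assumption inferred from the sample profile keys, since this step does not list them. The final check, that the churn buckets sum to the profile count, goes beyond the three signals and assumes every profile falls into exactly one bucket.

```python
# One representative field per source system, as named in the list above.
SOURCE_FIELDS = {
    "CRM": "industry",
    "Billing": "mrr",
    "Product": "total_sessions",
    "Support": "nps_score",
}
# Assumption: these are the three enrichment fields (inferred, not stated in this step).
ENRICHMENT_FIELDS = ("health_score", "churn_risk", "expansion_potential")


def verify_summary(data, expected_rows=1000, expected_shards=10):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if data["profiles_generated"] != expected_rows:
        problems.append(
            f"profiles_generated is {data['profiles_generated']}, expected {expected_rows}"
        )
    if data["shards_processed"] != expected_shards:
        problems.append(
            f"shards_processed is {data['shards_processed']}, expected {expected_shards}"
        )
    sample = data.get("sample_profile") or {}
    for source, field in SOURCE_FIELDS.items():
        if field not in sample:
            problems.append(f"sample profile is missing {field!r} from {source}")
    for field in ENRICHMENT_FIELDS:
        if field not in sample:
            problems.append(f"sample profile is missing enrichment field {field!r}")
    # Extra consistency check: assumes each profile lands in exactly one churn bucket.
    churn_total = sum(data["statistics"]["churn_distribution"].values())
    if churn_total != data["profiles_generated"]:
        problems.append(
            f"churn buckets sum to {churn_total}, not {data['profiles_generated']}"
        )
    return problems


# Example input mirroring the sample run; field values are omitted (None) on purpose.
example = {
    "profiles_generated": 1000,
    "shards_processed": 10,
    "sample_profile": dict.fromkeys(
        list(SOURCE_FIELDS.values()) + list(ENRICHMENT_FIELDS)
    ),
    "statistics": {
        "avg_health_score": 52.7,
        "churn_distribution": {"LOW": 412, "MEDIUM": 487, "HIGH": 101},
    },
}
print(verify_summary(example) or "OK")
```

Returning a list of problems instead of asserting on the first failure shows every broken signal in one run, which tends to be more useful when a shard fails and the merge is incomplete at the same time.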

The aggregated output deliberately does not return the full profile list. Returning it would risk exceeding the 4 MB return-payload limit on the orchestrator, and it isn’t useful for verification at this size. In a real pipeline the orchestrator would write profiles to S3, Postgres, or another sink before returning the stats summary.
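To get a feel for how much headroom a return value has, you can measure its serialized size. This is a rough sketch under two assumptions that the tutorial does not confirm: that "4 MB" means 4 × 1024 × 1024 bytes, and that the payload is counted as UTF-8 encoded JSON. Both helper names are hypothetical.

```python
import json

# Assumption: the limit is 4 * 1024 * 1024 bytes of UTF-8 JSON.
# The platform may measure the payload differently.
RETURN_LIMIT_BYTES = 4 * 1024 * 1024


def payload_size_bytes(payload):
    """Size of the payload when serialized as compact UTF-8 JSON."""
    return len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))


def fits_return_limit(payload, limit=RETURN_LIMIT_BYTES):
    """True if the serialized payload is at or under the assumed limit."""
    return payload_size_bytes(payload) <= limit


# Stats summary shaped like the one the orchestrator returns in the sample run.
summary = {
    "profiles_generated": 1000,
    "shards_processed": 10,
    "statistics": {
        "avg_health_score": 52.7,
        "churn_distribution": {"LOW": 412, "MEDIUM": 487, "HIGH": 101},
    },
}
print(payload_size_bytes(summary), fits_return_limit(summary))
```

A summary like this stays small no matter how many rows the pipeline processes, whereas a full profile list grows linearly with the input.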

Most first-run failures fall into three buckets:

  • DATA_DIR points at the wrong path. The default is ../sample_data relative to workflows/. If you ran generate_data.py from a different location, set DATA_DIR explicitly or move the CSVs.
  • RENDER_USE_LOCAL_DEV is not set. Without it, the SDK tries to call Render’s API and fails with an auth error.
  • The dev server isn’t running, or it crashed on a missing import. Restart it in the foreground (no &) so you can see the traceback.
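The first two buckets can be caught before the pipeline runs. The sketch below is a hypothetical preflight helper, not part of the SDK. It assumes that a relative DATA_DIR is resolved against workflows/, that the sample files are CSVs directly inside that directory, and that the SDK expects the literal value true. It does not probe the dev server, because this step does not say where the server listens.

```python
import os
from pathlib import Path


def preflight(workflows_dir, env=None):
    """Return warnings for misconfigurations visible from disk and environment.

    Hypothetical helper. Assumes a relative DATA_DIR is resolved against
    workflows_dir and that the sample data is stored as *.csv files.
    """
    env = os.environ if env is None else env
    warnings = []

    # Default taken from the tutorial: ../sample_data relative to workflows/.
    data_dir = Path(env.get("DATA_DIR", "../sample_data"))
    if not data_dir.is_absolute():
        data_dir = Path(workflows_dir) / data_dir
    if not data_dir.is_dir():
        warnings.append(f"DATA_DIR does not exist: {data_dir}")
    elif not any(data_dir.glob("*.csv")):
        warnings.append(f"no CSV files found in {data_dir}")

    # Assumption: only the literal string "true" enables local-dev mode.
    if env.get("RENDER_USE_LOCAL_DEV") != "true":
        warnings.append("RENDER_USE_LOCAL_DEV is not 'true'; the SDK will call Render's API")

    return warnings


# Run from the repo root, where workflows/ is a direct child.
for line in preflight("workflows") or ["preflight OK"]:
    print(line)
```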

What you learned

  • The SDK client is one file. The same script works locally and against the deployed Workflow.
  • `RENDER_USE_LOCAL_DEV=true` targets the local dev server. Without it, the SDK talks to Render.
  • Three signals prove correctness: total profile count, shard count, and sample-profile field shape.
  • The aggregated output is intentionally a stats summary, not the raw profiles, to stay inside the 4 MB return limit.