Engineering
December 03, 2025

How Render Services Stayed Up During the AWS October Outage

Kevin Paulisse

October 20, 2025: A cloud outage ripples across the web. Lasting nearly 15 hours, the disruption to AWS's us-east-1 region impacted about 30% of the 7,000+ cloud providers tracked by Parametrix's real-time monitoring analysis. AWS published a detailed post-incident summary explaining the root causes and their remediation efforts. The Render team is grateful for their transparency, and for the proactive partnership of their technical account management team.

The outage's impact varied from provider to provider: some went completely dark, while others merely experienced slowdowns. In Render's case, builds and deploys in our Virginia region became degraded with elevated latency, slower build times, and delays in resource provisioning.

Despite these degradations, active Render services and workloads remained up and running throughout the incident. None of our customers experienced complete downtime.

So how did Render come through (comparatively) unscathed? This post explores why our infrastructure responded the way it did to this specific failure mode. We'll walk through the architectural choices we made (in some cases years prior), explain how those decisions shaped our actions during the incident, and share what we're improving for the future.

How Render was affected

The AWS incident began on October 20 at 07:11 UTC (just after midnight Pacific time). The Render platform initially remained completely healthy (more on this later), but after about 90 minutes we were unable to provision new EC2 instances in us-east-1. This meant we couldn't add nodes to our Virginia cluster to handle increasing demand for service builds and deploys. Build times slowed as existing capacity became saturated.

This particular bottleneck resolved when AWS re-enabled EC2 provisioning later in the morning. However, a new set of issues followed:

  • Newly provisioned nodes came up in a degraded state, unable to initialize network configuration or reliably run containers.

  • Some workloads became stuck trying to start or stop.

  • Database creation experienced delays and some failures due to impaired EBS volume provisioning. Stateful workloads struggled to attach EBS volumes due to inconsistencies in AWS's storage subsystem.

  • Proxy services, which route HTTP traffic to customer web services and static sites, experienced elevated latency due to degradation of our Network Load Balancer:

    [Image: Slack communication regarding Virginia proxy performance during the incident]

Critically, throughout all of this, existing workloads kept running. Data remained persistent, and no customers experienced complete downtime.

The incident resolved gradually as AWS restored EC2 launch capacity and network stability throughout the day. By approximately 22:53 UTC (3:53 PM PDT), AWS confirmed full recovery. We validated that all workloads were scheduled, node pools were stable, and customer-facing latency had returned to baseline before declaring the incident resolved on our end.

Why our infrastructure responded this way

Render's comparative resilience during this outage was the result of numerous architectural decisions made over the course of 6+ years. In the face of this particular failure mode, those decisions worked meaningfully in our favor, owing mostly to diligence and partly to simple luck.

Low-level AWS primitives

Render's AWS-hosted infrastructure uses foundational services like EC2, S3, and ELB. By avoiding higher-level managed services like DynamoDB, RDS, and Lambda, we retain more direct control over our infrastructure and reduce our dependency on AWS's managed service layers.

This choice made a pivotal difference during the October 20 outage. As described in AWS's writeup, the root cause originated in DynamoDB's DNS management system, cascading from there to higher-level AWS services that depend on it. Because Render doesn't use these services, the initial failures didn't affect us. Our issues manifested only later in the cascade, after EC2 provisioning had been throttled and basic network load balancers became degraded. And even then, the scope of the impact was limited to capacity constraints, as opposed to complete service failures.

This choice is deliberate and one we regularly re-evaluate to avoid "not invented here" becoming our default posture. We readily adopt third-party services when they aren’t core to our product, but compute management and orchestration are fundamental to what Render does, so we keep those components under our direct control. The tradeoff is clear: operating on lower-level primitives means taking on more responsibility for orchestration, scaling, monitoring, and operations. All of this requires greater engineering investment, but that investment paid off during this incident: we could see precisely what was failing and respond with targeted action.

How this affected our response

When throttling began for EC2 provisioning, Render engineers quickly identified the capacity constraint and received prompt confirmation from our AWS Technical Account Manager. To preserve as much compute capacity as possible, we set the minimum capacity on our active Auto Scaling groups to their current desired values. This prevented AWS's autoscaling from attempting to scale down healthy nodes during the outage.
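
As a rough sketch of that intervention (not our actual tooling; the region and the pinning logic below are illustrative assumptions), pinning each group's minimum to its current desired capacity with boto3 could look like this:

    import boto3

    # Illustrative region and client setup; Render's actual tooling differs.
    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    def pin_minimums_to_desired():
        """Set MinSize equal to DesiredCapacity on every Auto Scaling group,
        so scale-in can't remove healthy nodes while new launches are throttled."""
        paginator = autoscaling.get_paginator("describe_auto_scaling_groups")
        for page in paginator.paginate():
            for group in page["AutoScalingGroups"]:
                name = group["AutoScalingGroupName"]
                desired = group["DesiredCapacity"]
                if group["MinSize"] < desired:
                    autoscaling.update_auto_scaling_group(
                        AutoScalingGroupName=name,
                        MinSize=desired,
                    )
                    print(f"Pinned {name}: MinSize -> {desired}")

    pin_minimums_to_desired()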

Engineers continuously monitored EC2 launch attempts throughout the day, noting exactly which API calls were failing and when capacity began to return. This visibility into lower-level primitives enabled us to make informed decisions about when it was safe to resume normal scaling operations, along with which workloads to prioritize with our limited capacity.
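
To give a concrete sense of that visibility, here is a minimal sketch (again using boto3, with a hypothetical Auto Scaling group name; our real monitoring is more involved) that polls a group's recent scaling activities and surfaces failed launch attempts along with the messages AWS returned:

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    def report_failed_launches(group_name: str, limit: int = 20):
        """Print recent scaling activities that failed, along with the
        status message AWS returned for each attempt."""
        resp = autoscaling.describe_scaling_activities(
            AutoScalingGroupName=group_name,
            MaxRecords=limit,
        )
        for activity in resp["Activities"]:
            if activity["StatusCode"] in ("Failed", "Cancelled"):
                print(
                    f"{activity['StartTime']} {activity['StatusCode']}: "
                    f"{activity.get('StatusMessage', 'no message')}"
                )

    # Hypothetical Auto Scaling group name, for illustration only.
    report_failed_launches("virginia-build-nodes")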

Self-managed Kubernetes

Render operates its own Kubernetes clusters on AWS instead of using EKS (Elastic Kubernetes Service). In addition to providing complete control over scheduling, scaling, and node management, this enabled us to maintain operational control throughout the incident.

Importantly, our Kubernetes control plane runs independently of the clusters where customer workloads run (and it doesn't run in us-east-1). This isolation of failure domains paid massive dividends here: our ability to observe and manage the affected clusters was never compromised.

The cost of this approach is complexity and operational overhead. Running self-managed, multi-region infrastructure introduces a slew of networking and coordination challenges: we own Kubernetes upgrades, security patches, and control plane reliability, and we have to maintain deep operational knowledge in-house. But during a crisis like this one, that operational depth translates into a significant advantage.

Separately, this incident exposed a tradeoff in Render's approach to managing Kubernetes versions. To prioritize platform reliability for our customers, we intentionally keep our production clusters a couple of versions behind the bleeding-edge release. This meant that we were unable to upgrade to a newer EBS CSI driver that would have avoided some of the previously mentioned volume attachment issues. This reflects the inherent tradeoff in self-managing infrastructure: full control comes with full responsibility to strike the right balance between recency and stability.

How this affected our response

The infrastructure team executed multiple tactical interventions that were only possible because of this direct control. We reallocated proxy nodes from our free tier to paid customers, temporarily prioritizing production workloads while capacity was constrained. We also adjusted minimum replica counts for critical services and made real-time decisions about resource allocation as the situation evolved.
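
To illustrate the shape of these interventions, here is a hedged sketch using the Kubernetes Python client; the label key, node name, namespace, HPA name, and replica count are hypothetical placeholders, not Render's actual configuration:

    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()
    autoscaling_v1 = client.AutoscalingV1Api()

    # Relabel a node from the free-tier proxy pool to the paid-tier pool.
    # The label key and values are hypothetical, for illustration only.
    def move_node_to_paid_pool(node_name: str):
        core.patch_node(node_name, {
            "metadata": {"labels": {"proxy-tier": "paid"}}
        })

    # Raise the replica floor on a critical service's HorizontalPodAutoscaler
    # while capacity is constrained (hypothetical namespace and HPA name).
    def raise_min_replicas(namespace: str, hpa_name: str, min_replicas: int):
        autoscaling_v1.patch_namespaced_horizontal_pod_autoscaler(
            name=hpa_name,
            namespace=namespace,
            body={"spec": {"minReplicas": min_replicas}},
        )

    move_node_to_paid_pool("node-abc123")
    raise_min_replicas("proxy", "proxy-paid", 12)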

When stateful workloads encountered EBS volume attachment issues due to AWS storage subsystem inconsistencies, engineers manually cleaned up stuck persistent volume claims, restoring functionality for critical databases and logging infrastructure.
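
As a rough sketch of that kind of cleanup (the exact remediation was specific to this failure; the code below only illustrates the general technique of finding unbound claims and clearing finalizers on a claim stuck deleting, and only after the underlying volume state has been verified):

    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    def find_unbound_claims():
        """List PVCs that are not Bound so an operator can inspect them."""
        pvcs = core.list_persistent_volume_claim_for_all_namespaces()
        return [
            (p.metadata.namespace, p.metadata.name, p.status.phase)
            for p in pvcs.items
            if p.status.phase != "Bound"
        ]

    def clear_finalizers(namespace: str, name: str):
        """Remove finalizers from a claim stuck in Terminating. Use with care:
        only after verifying the state of the underlying EBS volume."""
        core.patch_namespaced_persistent_volume_claim(
            name=name,
            namespace=namespace,
            body={"metadata": {"finalizers": None}},
        )

    for ns, name, phase in find_unbound_claims():
        print(f"{ns}/{name} is {phase}")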

Non-reliance on us-east-1

From its inception, Render has intentionally avoided hosting critical platform infrastructure in the us-east-1 region. As AWS’s oldest and largest region, us-east-1 has historically seen a higher frequency of customer-facing incidents than other regions, due to its scale and central role in AWS’s infrastructure. According to StatusGator's analysis of AWS reliability data, us-east-1 experienced 10 outages totaling nearly 34 hours in 2025. For US-hosted infrastructure, Render has built a significantly larger customer footprint in us-west-2 (Oregon) and us-east-2 (Ohio).

How this affected the incident

During the October 20 outage, this strategic choice limited our exposure: most of our customers don't run applications in us-east-1, so the scope of the impact was inherently constrained. The large majority of Render's user base continued operating normally without any interruption whatsoever.

The role of circumstance

We're immensely proud of the performance of both our team and platform on October 20, and we are also the first to acknowledge the role that luck played in our outcome. Had this exact same failure occurred in us-west-2 (where a much larger percentage of our customers run their apps), the operational impact would have been far greater.

This incident reinforced that regional distribution is an essential pillar of a platform's resilience strategy, both for technical reasons and for limiting the blast radius when failures do occur.

What we're improving

No incident response is perfect, and this one exposed several areas where we can do better:

Operationalization. Some of our most effective interventions (such as adjusting Auto Scaling Group minimums and reallocating proxy nodes between tiers) were largely improvised, requiring manual action and real-time problem solving under pressure. Ideally, solutions like these are documented, tested procedures that any on-call engineer can execute confidently (or, better yet, that automated systems can perform safely without human intervention).

Tooling fallbacks. Communication proved challenging during parts of the incident. Several systems that Render engineers use to coordinate incident response and communicate with our customers were degraded or unavailable due to the very same incident we were addressing. Engineers adapted by posting manual updates on Twitter and coordinating through alternate channels, but we lacked clear procedures for these fallback scenarios. We're establishing redundant communication paths and documenting the credentials, access methods, and procedures for manually publishing status updates when primary tools fail.

Proactive failure detection. We're implementing test frameworks to help us detect provider-level issues sooner. With simple, periodic tests that attempt to provision a small EC2 instance or perform other basic operations, we can more quickly distinguish between our own system hiccups and upstream provider failures. This kind of lightweight signal could have helped us confirm the AWS provisioning issues more quickly.
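
As a sketch of what such a probe might look like (the AMI ID is a placeholder, and real checks would cover more operations and feed an alerting system), a periodic job could attempt a tiny launch and classify the result:

    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def probe_ec2_provisioning(ami_id: str) -> str:
        """Launch and immediately terminate a tiny instance, returning a
        status string that a scheduler or alerting hook could act on."""
        try:
            resp = ec2.run_instances(
                ImageId=ami_id,  # placeholder AMI ID
                InstanceType="t3.micro",
                MinCount=1,
                MaxCount=1,
            )
            instance_id = resp["Instances"][0]["InstanceId"]
            ec2.terminate_instances(InstanceIds=[instance_id])
            return "ok"
        except ClientError as err:
            # Codes like RequestLimitExceeded or InsufficientInstanceCapacity
            # point at provider-side constraints rather than our own systems.
            return f"degraded: {err.response['Error']['Code']}"

    print(probe_ec2_provisioning("ami-0123456789abcdef0"))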

These improvements reflect a learning mindset. We did many things right during this incident, but we also identified important gaps. The work ahead is about transforming the ad-hoc solutions that worked into repeatable, reliable processes.

The bigger picture

We want to be clear about what this incident does and doesn't tell us. Our infrastructure remained operational during this specific failure mode, where AWS's problems cascaded from DynamoDB through EC2 provisioning and network load balancers. But a different type of failure could expose different limitations in our architecture. We're not claiming universal resilience, but rather explaining how our particular set of tradeoffs played out during this particular incident.

This outage reinforces some important lessons about infrastructure resilience:

  • Architectural choices made years before a crisis can fundamentally shape how you experience that crisis. The decision to use lower-level AWS primitives, self-manage Kubernetes, and distribute across regions didn't prevent us from being affected, but it changed what "being affected" meant.
  • There are limits to what any infrastructure team can control. When fundamental cloud provider services fail, you're working within constraints set by someone else's recovery timeline.
  • The tradeoffs are real. Our architectural choices require more operational investment, more expertise, and more ongoing maintenance than using managed services. During normal operations, that cost is constant. During a crisis, it can become an advantage.

Ultimately, resilience isn't about preventing all failures. It's about understanding your architecture well enough to know how it will degrade, and having the tools and knowledge to manage that degradation effectively. This incident tested our architecture in ways we hadn't fully anticipated, and we learned from it. That's how infrastructure gets stronger.