Root Cause Analysis for Render's Extended Service Disruption on 3/26/24

The Render Team

March 29, 2024

Summary

On March 26, 2024, at 16:07 UTC, services on Render's platform were disrupted after an unintended restart of all customer and system workloads. No data was lost, and static sites and services not dependent on PostgreSQL, Redis, or Disks recovered within 20 minutes. However, recovery of managed data stores and other services with attached disks took significantly longer due to the scale of the event and underlying rate limits associated with moving disks between machines. Full functionality across all regions was restored at 20:00 UTC.

Most PostgreSQL databases, Redis instances, and services with attached disks saw much longer recovery times due to system-level throttling and rate limits that weren't designed for an event of this nature and scale. Centralized logging and metrics services were also slow to recover due to these limits. We increased the underlying rate limits after discovering the root cause of the delay, and took other actions to improve recovery times. However, even with these mitigations, the scale of the event delayed complete recovery of paid services until 18:45 UTC and free services until 20:00 UTC.

We are incredibly sorry for the extended disruption we caused for many customers. Platform reliability has always been our top priority as a company, and we let you down.

We have implemented measures to prevent an incident of this scale and nature going forward. We are prioritizing further prevention and mitigation measures to improve platform resilience.

Timeline

All times in UTC on 2024-03-26.

Time	Event
16:07	An unintended change causes a restart of all customer and system workloads.
16:07	Render engineers are paged.
16:09	We open an internal critical incident to investigate.
16:19	We identify the source of the restart and disable it to prevent further restarts.
16:19	We open a public incident on https://status.render.com and continue to investigate and mitigate.
16:21	All static sites, stateless services in all regions, and all services hosted on GCP are restored.
17:48	All paid services are restored in Singapore.
17:53	All paid services are restored in Ohio.
18:34	All paid services are restored in Oregon.
18:45	All paid services are restored in Frankfurt.
19:40	Logs are restored in all regions.
20:00	All free services are restored in all regions.

What happened

On March 26, 2024, at 16:07 UTC, a faulty code change caused a restart of all workloads on our platform. This change was put behind a feature flag and tested manually and automatically in Render's development and staging environments, but a combination of issues ultimately led to the bug making it to production:

The testing infrastructure for the code change was inconsistent across production and non-production environments.
The change was feature-flagged, but a subtle bug in the feature-flagging code prevented the faulty code from running in non-production environments, so the problem did not surface sooner.

Our systems paged our engineers as soon as the incident started, and we opened an internal incident to investigate. We declared a public incident 12 minutes after the initial report. We quickly identified the faulty code and disabled it to prevent additional workload restarts.

While many services without attached disks recovered within minutes, components responsible for restarting services with attached disks (PostgreSQL, Redis, and services with explicitly attached disks) were severely overloaded due to the unprecedented scale of the event, leading to significantly increased recovery times for many of these services.

When services restart, they are often transparently moved to a different host. When services with an attached disk (including managed PostgreSQL and Redis instances) are restarted and moved to a different host, our systems detach the disk from the source host and attach it to the target host. In isolation, a single detach-attach operation takes, at most, a few minutes. However, hundreds of thousands of services with disks were restarted simultaneously during the incident, overloading the systems responsible for moving disks between machines and significantly slowing down our ability to restore service.
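To see why rate limits dominate recovery time at this scale, consider a rough back-of-the-envelope sketch. The disk count and rate limits below are hypothetical placeholders chosen only to show the scaling, not Render's actual figures, and the function name is an assumption made for illustration.

```python
import math

def min_recovery_minutes(disks_to_move: int, moves_per_minute: int) -> int:
    """Lower bound on recovery time when disk moves are rate limited.

    Ignores per-operation latency and retries, so real recovery is slower.
    """
    if moves_per_minute <= 0:
        raise ValueError("moves_per_minute must be positive")
    return math.ceil(disks_to_move / moves_per_minute)

# Hypothetical numbers: the same backlog under two different rate limits.
print(min_recovery_minutes(100_000, 500))    # 200
print(min_recovery_minutes(100_000, 5_000))  # 20
```

Even though a single detach-attach operation is fast, the backlog divided by the permitted rate sets a floor on total recovery time, which is why raising the limits mattered more than speeding up any individual operation.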

As we worked to expedite the recovery process, we discovered and quickly increased default rate limits for the detach-attach process. We also noticed throttling in an underlying infrastructure provider and worked with the provider to increase rate limits across all impacted regions. We increased these limits to the maximum values possible without creating further instability in our systems. We also prioritized paid service recovery by temporarily suspending free PostgreSQL instances during the incident. These changes enabled considerably faster recovery of impacted services; however, full recovery took longer due to the overwhelming volume of the restarts.
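The prioritization described above was achieved by suspending free PostgreSQL instances. As a generic illustration of the same idea, and not a description of Render's systems, the sketch below uses a different technique: a priority queue that hands out paid services first within each rate-limit window. The class, tier constants, and service names are all hypothetical.

```python
import heapq
from itertools import count

PAID, FREE = 0, 1  # hypothetical tiers; the lower value is served first

class RecoveryQueue:
    """Orders pending disk moves so that paid services recover first."""

    def __init__(self) -> None:
        self._heap = []
        self._order = count()  # tie-breaker keeps FIFO order within a tier

    def add(self, service_id: str, tier: int) -> None:
        heapq.heappush(self._heap, (tier, next(self._order), service_id))

    def next_batch(self, limit: int) -> list[str]:
        """Pop up to `limit` services, modeling one rate-limit window."""
        batch = []
        while self._heap and len(batch) < limit:
            _, _, service_id = heapq.heappop(self._heap)
            batch.append(service_id)
        return batch

queue = RecoveryQueue()
queue.add("free-db-1", FREE)
queue.add("paid-db-1", PAID)
queue.add("paid-db-2", PAID)
print(queue.next_batch(2))  # ['paid-db-1', 'paid-db-2']
print(queue.next_batch(2))  # ['free-db-1']
```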

Some monitoring and logging systems that rely on attached disks were also unavailable during the incident, leading to gaps in some service metrics and logs.

Mitigations

This incident was the most severe and widespread outage in Render's history, and it surfaced multiple opportunities for us to improve platform reliability further and minimize time to recovery. They are listed below and are being implemented with the highest priority.

Increased disk management rate limits

As discussed above, we increased multiple rate limits in the systems responsible for moving disks between machines. We are confident in our ability to prevent similar incidents in the future, and we are now also equipped to recover much faster should one occur.

Ensure consistency between production and non-production systems

Our investigation found subtle differences in testing infrastructure between our production and non-production environments. We are working to standardize and improve our testing processes to prevent similar incidents going forward.

Improve our disk management infrastructure

In addition to increased rate limits, we are also making code changes to the components that manage disks. Specifically, we will rely more on batched operations, increasing disk management throughput by an order of magnitude.
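As a minimal sketch of why batching helps, the example below counts how many requests are needed to move a set of disks with and without batching. The batch size, disk identifiers, and helper names are assumptions for illustration; the source does not state which operations are batched or what batch sizes the underlying systems accept.

```python
from typing import Iterator, Sequence

def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def requests_needed(disk_count: int, batch_size: int) -> int:
    """Count the requests required if each request can carry one batch."""
    disks = [f"disk-{i}" for i in range(disk_count)]  # hypothetical IDs
    return sum(1 for _ in batched(disks, batch_size))

print(requests_needed(1_000, 1))   # 1000
print(requests_needed(1_000, 10))  # 100
```

When a rate limit is counted per request, a tenfold larger batch moves ten times as many disks within the same limit, which is one way an order-of-magnitude throughput gain can arise.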

Restrict permissions for control plane components

The control plane code that triggered the incident only needed to interact with a small subset of system resources and should not have had permission to restart existing customer services and data stores. We are adding system-level restrictions so that only necessary control plane components and services can interact with customer services.
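The principle here is least privilege: each component may perform only the actions it has been explicitly granted. The following is a hedged sketch of an allowlist check, not Render's implementation; the component names, action names, and exception type are hypothetical.

```python
# Hypothetical allowlist: component name -> actions it may perform.
ALLOWED_ACTIONS = {
    "config-syncer": {"read_config"},
    "workload-scheduler": {"read_config", "restart_workload"},
}

class PermissionDenied(Exception):
    """Raised when a component attempts an action it was not granted."""

def authorize(component: str, action: str) -> None:
    """Deny by default: unknown components and ungranted actions both fail."""
    if action not in ALLOWED_ACTIONS.get(component, set()):
        raise PermissionDenied(f"{component} may not perform {action}")

authorize("workload-scheduler", "restart_workload")  # permitted, returns None

try:
    authorize("config-syncer", "restart_workload")
except PermissionDenied as exc:
    print(exc)  # config-syncer may not perform restart_workload
```

With a deny-by-default check of this kind, a bug in a component that only needs to read configuration cannot restart workloads, regardless of what the faulty code attempts.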

Improve incident communication

Our investigation uncovered multiple gaps in incident communications. It took 12 minutes from the start of the incident to update Render's public status; while our engineers worked to collect enough information to provide a meaningful update, we should have opened a public incident sooner.

In our initial update, we incorrectly used the 'Degraded Performance' status instead of 'Partial Outage' or 'Full Outage'. As a result, individual component statuses did not reflect the severity of the incident until our next update 22 minutes later.

We understand the critical importance of timely and accurate updates during incidents; we are working on automation and improving our incident response processes to ensure that our status page and other communication channels are updated as soon as relevant information becomes available.