Summary
On March 26, 2024, at 16:07 UTC, services on Render's platform were disrupted after an unintended restart of all customer and system workloads. No data was lost, and static sites and services not dependent on PostgreSQL, Redis, or Disks recovered within 20 minutes. However, recovery of managed data stores and other services with attached disks took significantly longer due to the scale of the event and underlying rate limits associated with moving disks between machines. Full functionality across all regions was restored at 20:00 UTC.

Most PostgreSQL databases, Redis instances, and services with attached disks saw much longer recovery times due to system-level throttling and rate limits that weren't designed for an event of this nature and scale. Centralized logging and metrics services were also slow to recover due to these limits. We increased the underlying rate limits after discovering the root cause of the delay, and took other actions to improve recovery times. However, even with these mitigations, the scale of the event delayed complete recovery of paid services until 18:45 UTC and free services until 20:00 UTC.

We are incredibly sorry for the extended disruption we caused for many customers. Platform reliability has always been our top priority as a company, and we let you down. We have implemented measures to prevent an incident of this scale and nature going forward. We are prioritizing further prevention and mitigation measures to improve platform resilience.

Timeline
All times in UTC on 2024-03-26.

Time | Event |
---|---|
16:07 | An unintended change causes a restart of all customer and system workloads. |
16:07 | Render engineers are paged. |
16:09 | We open an internal critical incident to investigate. |
16:19 | We identify the source of the restart and disable it to prevent further restarts. |
16:19 | We open a public incident on https://status.render.com and continue to investigate and mitigate. |
16:21 | All static sites, stateless services in all regions, and all services hosted on GCP are restored. |
17:48 | All paid services are restored in Singapore. |
17:53 | All paid services are restored in Ohio. |
18:34 | All paid services are restored in Oregon. |
18:45 | All paid services are restored in Frankfurt. |
19:40 | Logs are restored in all regions. |
20:00 | All free services are restored in all regions. |
What happened
On March 26, 2024, at 16:07 UTC, a faulty code change caused a restart of all workloads on our platform. This change was put behind a feature flag and tested manually and automatically in Render's development and staging environments, but a combination of issues ultimately led to the bug making it to production:

- The testing infrastructure for the code change was inconsistent across production and non-production environments.
- The change was feature-flagged, but a subtle bug in the feature-flagging code prevented the faulty code from running in non-production environments and surfacing sooner.