Engineering
December 17, 2024

Breaking down OpenAI's outage: How to avoid a hidden DNS dependency in Kubernetes

David Mauskop
OpenAI recently experienced an hours-long, platform-wide outage after a newly deployed telemetry service overloaded their Kubernetes (K8s) control planes. At Render, we have years of experience running self-managed Kubernetes clusters in production. In reading OpenAI's postmortem, we recognized a story we've heard and lived before. If OpenAI's setup is like ours, there may be a key detail they missed in their otherwise excellent writeup. With a couple of lines of K8s configuration, they could have significantly reduced the severity of this incident.

The control plane and data plane in K8s

As OpenAI highlighted in their postmortem, Kubernetes clusters are split conceptually into a control plane and a data plane. The control plane allows people or systems to make changes to a cluster. The data plane allows services running in the cluster to continue running with their current configuration.
The control plane versus data plane nodes. Image sourced from the official Kubernetes docs.
If the control plane goes down, the cluster becomes partially degraded. You can't deploy new code or run a new job, and some of the observability tools you normally rely on won't be available. Important Kubernetes features, like autoscaling and automatic recovery from machine failure, stop working. Thus, the Kubernetes community has put in a lot of effort to ensure Kubernetes can be run with a highly-available control plane. The cluster being partially degraded is still better than it being fully down. Even though the data plane may be less resilient and less observable when detached from the control plane, it will continue to run.

A design principle for the data plane

It's useful to keep in mind the distinction between the control plane and data plane when designing any distributed system, not just Kubernetes. The data plane typically has stricter uptime requirements than the control plane, which means you want to remove any dependencies the data plane has on the control plane. OpenAI's postmortem says that such a dependency existed in their system, because data plane services needed the Kubernetes API server (a key part of the control plane) for DNS resolution:
"The Kubernetes data plane (responsible for handling user requests) is designed to operate independently of the control plane. However, the Kubernetes API server is required for DNS resolution, which is a critical dependency for many of our services."
This explanation isn't quite right, because the Kubernetes API server is not directly required to resolve DNS. DNS depends on the API server in a more subtle way, as we discovered firsthand in our own incident.

An easy-to-miss DNS dependency

In January 2022, an etcd memory spike overloaded the Kubernetes control plane in one of Render's Frankfurt clusters. As in the OpenAI incident, our incident became much more severe when DNS resolution started failing. Initially, only operations like new deploys and cron job runs were failing: very bad, but not catastrophic. After DNS went down, requests to already-running services started failing. Catastrophe.
The main issue was that our DNS servers were running on control plane nodes, and we rebooted those nodes during the incident. And as we found out, the Kubernetes API server is not required to resolve DNS, but it is required during DNS initialization. Since the API server was down along with the rest of our control plane, our DNS servers couldn't restart. If our DNS servers had continued running throughout the incident, they would have been able to fulfill DNS requests from their in-memory caches [1], thus limiting the impact of the incident.
The core issue was that our data plane wasn't sufficiently isolated from our control plane. Even if we hadn't made the mistake of rebooting the control plane nodes, the etcd memory spike alone could have forced a restart of our DNS servers.

A problematic default in K8s

OpenAI's postmortem suggests they may have hit this same issue we did, which stems from running DNS servers on the control plane. The postmortem mentions that they attempted to mitigate by scaling up their API servers. It's possible that scaling up their API servers required restarting control plane nodes, including DNS servers. It's also possible that the initial surge in API server load forced the co-located DNS servers to restart. I suspect that many Kubernetes administrators run DNS servers—specifically CoreDNS servers—on control plane nodes, not because they intentionally chose to, but because it's the default for clusters configured with kubeadm. It makes sense that kubeadm runs CoreDNS on control plane nodes during cluster bootstrapping, before data plane nodes (more commonly known as worker nodes) have joined the cluster. In a production cluster, though, we want stronger control plane-data plane separation. So, after bootstrapping a new cluster, it's a good idea to take additional steps. Here's what we did.
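As an aside, one way to check whether a cluster still has this default placement is to compare where the CoreDNS pods are scheduled against the nodes labeled as control plane. The sketch below is an illustration only. It assumes the output of kubectl get pods -n kube-system -l k8s-app=kube-dns -o json and kubectl get nodes -o json has already been parsed into dictionaries, that CoreDNS pods carry the kubeadm default k8s-app=kube-dns label, and that control plane nodes carry the node-role.kubernetes.io/control-plane label. The function name and the sample data are hypothetical.

```python
# Label that kubeadm applies to control plane nodes.
CONTROL_PLANE_LABEL = "node-role.kubernetes.io/control-plane"


def coredns_on_control_plane(pods, nodes):
    """Return (pod name, node name) pairs for pods scheduled on control plane nodes.

    `pods` and `nodes` are parsed `kubectl get ... -o json` list objects.
    """
    control_plane_nodes = {
        node["metadata"]["name"]
        for node in nodes["items"]
        if CONTROL_PLANE_LABEL in node["metadata"].get("labels", {})
    }
    return [
        (pod["metadata"]["name"], pod["spec"]["nodeName"])
        for pod in pods["items"]
        # Pending pods have no nodeName yet, so use .get() here.
        if pod["spec"].get("nodeName") in control_plane_nodes
    ]


# Hypothetical sample data, trimmed to the fields the function reads.
sample_nodes = {"items": [
    {"metadata": {"name": "cp-1", "labels": {CONTROL_PLANE_LABEL: ""}}},
    {"metadata": {"name": "worker-1", "labels": {}}},
]}
sample_pods = {"items": [
    {"metadata": {"name": "coredns-a"}, "spec": {"nodeName": "cp-1"}},
    {"metadata": {"name": "coredns-b"}, "spec": {"nodeName": "worker-1"}},
    {"metadata": {"name": "coredns-c"}, "spec": {}},
]}

flagged = coredns_on_control_plane(sample_pods, sample_nodes)
print(flagged)
```

An empty result would suggest that no CoreDNS replica shares a node with the control plane. It says nothing about other dependencies the DNS servers may have on those nodes.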

How we redesigned our control plane

In response to the January 2022 incident, we made two key changes to our control plane:
  1. We started running CoreDNS on data plane nodes instead of control plane nodes.
  2. We split out etcd to run on dedicated nodes. We isolated etcd because it doesn't perform well under resource contention, especially when that leads to increases in disk write latency.
Before the redesign: our Kubernetes cluster configuration
After the redesign: our Kubernetes cluster configuration

Implementing the CoreDNS change

To move CoreDNS to worker nodes, we updated the tolerations in the CoreDNS pod spec. Originally, the kubeadm-generated spec included this snippet:
tolerations:
- key: CriticalAddonsOnly
  operator: Exists
- key: node-role.kubernetes.io/control-plane
  effect: NoSchedule
We updated it to look something like this:
tolerations:
- key: node-type
  operator: Equal
  value: data-plane-stable   # value changed for clarity
  effect: NoSchedule
Like many Kubernetes admins, we use taints to create different kinds of worker nodes. A toleration only permits scheduling onto a tainted node; it doesn't force it. But because all our worker nodes are tainted, and the updated spec no longer tolerates the control plane taint, this small change makes it so CoreDNS pods can only be scheduled on nodes that are tainted with node-type=data-plane-stable:NoSchedule.
Overall, these changes have proven effective. In May of this year, we had an incident in which several of our Kubernetes control planes failed simultaneously. We were able to contain the impact because CoreDNS was properly isolated from the control plane. Our data plane kept on data planing.
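To see why the toleration change described above moves CoreDNS, it can help to model how the scheduler matches tolerations against taints. The following is a simplified sketch of that matching rule, not the scheduler's actual code: it ignores node selectors, affinity, and resources. The node names and the "batch" taint value are hypothetical, and it assumes every worker node carries a node-type taint, as ours do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # Kubernetes treats an omitted operator as Equal
    value: str = ""
    effect: str = ""         # an omitted effect matches every effect


def tolerates(toleration, taint):
    """Return True if this toleration matches this taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    if toleration.operator == "Equal":
        return toleration.value == taint.value
    return False


def can_schedule(tolerations, node_taints):
    """A pod fits only if every NoSchedule/NoExecute taint is tolerated."""
    hard = [t for t in node_taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


KUBEADM_DEFAULT = [
    Toleration(key="CriticalAddonsOnly", operator="Exists"),
    Toleration(key="node-role.kubernetes.io/control-plane", effect="NoSchedule"),
]
UPDATED = [
    Toleration(key="node-type", operator="Equal",
               value="data-plane-stable", effect="NoSchedule"),
]

# Hypothetical nodes; "batch" stands in for any other worker node type.
NODES = {
    "control-plane": [Taint("node-role.kubernetes.io/control-plane", "NoSchedule")],
    "stable-worker": [Taint("node-type", "NoSchedule", "data-plane-stable")],
    "other-worker": [Taint("node-type", "NoSchedule", "batch")],
}


def placements(tolerations):
    """Names of the nodes whose taints these tolerations allow."""
    return sorted(n for n, taints in NODES.items() if can_schedule(tolerations, taints))


print(placements(KUBEADM_DEFAULT))  # ['control-plane']
print(placements(UPDATED))          # ['stable-worker']
```

In a cluster with untainted worker nodes, removing the control plane toleration would still keep CoreDNS off the control plane, but the pods could land on any untainted worker. Pinning them to one node type would then need a node selector or affinity rule in addition to the toleration.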

Telemetry—a common cause

As it happens, our May incident was caused by a telemetry service overwhelming our control plane. So, this section from OpenAI's root cause analysis sounded familiar to us:
"The issue stemmed from a new telemetry service deployment […] this new service's configuration unintentionally caused every node in each cluster to execute resource-intensive Kubernetes API operations whose cost scaled with the size of the cluster. With thousands of nodes performing these operations simultaneously, the Kubernetes API servers became overwhelmed"
Why are telemetry services so problematic for Kubernetes? They tend to run as DaemonSets, meaning there's one instance for every node in the cluster. On initialization, each instance of a telemetry service may need to perform an expensive operation against the Kubernetes API server, like listing all pods on the node it’s observing.
Furthermore, when crafting these API requests, it's easy for a developer to inadvertently bypass the API server cache and do an expensive consistent read against etcd. Kubernetes resources are indexed in etcd by namespace, not by node, so listing all pods on a node means listing all pods in the cluster and then filtering the result. [2]
As a follow-up to our incident in May, we started monitoring for expensive list requests from unexpected sources, alongside other mitigations.
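To make the cache point concrete, here is a minimal sketch of the request a per-node agent might send to list the pods on its own node. It builds the path for the cluster-wide pod list endpoint with a spec.nodeName field selector and, as footnote 2 describes, sets resourceVersion to "0" so the API server can answer from its cache. The function names and the node name are hypothetical, and the second helper is only a rough heuristic for spotting requests that leave resourceVersion unset. It is not a description of our monitoring.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def pods_on_node_path(node_name, allow_cached=True):
    """Build the API path that lists all pods scheduled on one node."""
    params = {"fieldSelector": f"spec.nodeName={node_name}"}
    if allow_cached:
        # resourceVersion=0 lets the API server serve the list from its cache.
        params["resourceVersion"] = "0"
    return "/api/v1/pods?" + urlencode(params)


def may_bypass_cache(request_uri):
    """Heuristic: a list request with no resourceVersion may read from etcd."""
    query = parse_qs(urlsplit(request_uri).query)
    return "resourceVersion" not in query


cached = pods_on_node_path("worker-1")
uncached = pods_on_node_path("worker-1", allow_cached=False)

print(cached)    # /api/v1/pods?fieldSelector=spec.nodeName%3Dworker-1&resourceVersion=0
print(may_bypass_cache(cached))    # False
print(may_bypass_cache(uncached))  # True
```

The trade-off is freshness: a response served with resourceVersion set to "0" can be slightly stale, which is usually acceptable for an agent that goes on to watch for changes.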

Hard lessons, the easy way

On Render's infrastructure team, we love a good incident story. We are connoisseurs of the form. For a recent on-site, we toured the Dandelion Chocolate Factory near our office in San Francisco. During the Q&A, we only wanted to hear stories about the worst incidents our tour guide had encountered on the factory floor. Sometimes an incident story teaches you new lessons (like when you spill gallons of molten chocolate, you need to scoop it all up before it cools and hardens), and sometimes it reinforces familiar ones.
We share some lessons here not to gloat, but because we have been bitten by issues similar to the ones that bit OpenAI. We hope you can learn these lessons the easy way, not the hard way. By sharing knowledge, we can collectively make our infrastructure more reliable.

Footnotes

  1. The kubernetes plugin that ships with CoreDNS manages this in-memory cache.
  2. To indicate that results can be served from the cache, add resourceVersion="0" as a query parameter on API calls. As of Kubernetes version 1.31, there's also a beta feature to do consistent reads from the cache.