How Render Scaled Knative to 100k+ Web Apps

Hieu Nguyen

October 03, 2023

In November 2021, Render introduced a free tier for hobbyist developers and teams who want to kick the tires. Adoption grew at a steady, predictable rate—until Heroku announced the end of their free offering ten months later:

Comparison of Render free-tier apps created each week, late 2022

Render's free-tier adoption rate doubled immediately and grew from there (awesome), causing our infrastructure to creak under the load (less awesome). In the span of a month, we experienced four incidents related to this surge. We knew that if Free usage continued to grow (and it very much has—as of this writing, tens of thousands of free-tier apps are created each week), we needed to make it much more scalable. This post describes the first step we took along that path.

How we initially built Free

Some background: unlike other services on Render, free-tier web services "scale to zero" (as in, they stop running) if they go 15 minutes without receiving traffic. They start up again whenever they next receive an incoming request. This hibernation behavior helps us provide a no-cost offering without breaking the bank.

However, this desired behavior presented an immediate development challenge. Render uses Kubernetes (K8s) behind the scenes, and K8s didn't natively support scale-to-zero (it still doesn't, as of September 2023). In looking for a solution that did, we found and settled on Knative (kay-NAY-tiv). Knative extended Kubernetes with serverless support, a natural fit for services that would regularly spin up and down.

In the interest of shipping quickly, we deployed Knative with its default configuration. And, until our growth spurt nearly a year later, those defaults worked without issue.

Where we hit a wall

With the free-tier surge, the total number of apps on Render effectively quadrupled. This put significant strain on the networking layer of each of our Kubernetes clusters. To understand the nature of that strain, let's look at how this layer operates.

Two networking components run on every node in every cluster: Calico and kube-proxy. Calico mainly takes care of IP address management, or IPAM: assigning IP addresses to Pods and Services (we're using capital-S Service to refer to a Kubernetes Service, to distinguish it from the services that customers create on Render). It also enforces Network Policies by managing iptables rules on the node. kube-proxy configures a different set of routing rules on the node to ensure traffic destined for a Service is load-balanced across all backing Pods.

Both of these components do their jobs by listening for creates, updates, and deletes to all Pods and Services in the cluster. As you can imagine, having more Pods and Services that changed more frequently resulted in more work:

More work meant more CPU consumption. Remember, both Calico and kube-proxy run on every node. The more CPU these components used, the less we had left to run our customers' apps.
More work meant higher update latency. As the work queue grew, each networking change took longer to propagate due to increased time spent waiting in the queue. This delay is defined as the network programming latency, or NPL (read more about NPL here). When there was high NPL, traffic could be routed using stale rules that led nowhere (the Pod had already been destroyed), causing connections to fail intermittently.

To mitigate these issues, we needed to reduce the overhead each free-tier app added to our networking machinery.

"Serviceless" Knative

As mentioned, we'd deployed out-of-the-box Knative to handle free-tier resource provisioning. We took a closer look at exactly what K8s primitives were being provisioned for each free-tier app:

One Pod (for running the application code). Expected.
2N + 1 Services, where N is the number of times the app was deployed. This is because Knative manages changes with Revisions, and retained resources belonging to historical Revisions. Unexpected.
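
To get a feel for how quickly 2N + 1 adds up, here is a small worked example in Go. The formula is from the post; reading it as one per-app Service plus two per Revision is an interpretation, and the deploy counts in `main` are made up for illustration.

```go
package main

import "fmt"

// servicesPerApp returns the number of Kubernetes Services that default
// Knative kept around for an app that had been deployed `deploys` times.
func servicesPerApp(deploys int) int {
	return 2*deploys + 1
}

// totalServices sums the Service count over a set of apps, given how many
// times each app was deployed.
func totalServices(deployCounts []int) int {
	total := 0
	for _, n := range deployCounts {
		total += servicesPerApp(n)
	}
	return total
}

func main() {
	// Hypothetical deploy counts for three apps.
	counts := []int{1, 2, 3}
	for _, n := range counts {
		fmt.Printf("%d deploys -> %d Services\n", n, servicesPerApp(n))
	}
	fmt.Println("total:", totalServices(counts))
}
```

Because historical Revisions keep their Services, the count for an app only ever grows as it is redeployed, even though at most one Pod is serving traffic.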

We figured the Pod needed to stay, but did we really need all those Kubernetes Services? What if we could get away with fewer, or even zero? We dove deeper into how those resources interacted in a cluster and learned what each of the Knative-provisioned Services was for:

The Placeholder Service was a dummy service that existed to prevent naming collisions among resources for Knative-managed apps. There was one for every free-tier app.
The Public Service routed incoming traffic to the app from the public internet.
The Private Service routed incoming cluster-local traffic based on whether the app was scaled up.
- If scaled up, traffic was routed to the Pod.
- If scaled down, traffic was routed to the cluster's Knative proxy (called the activator), which handled scaling up the app by creating a Pod.
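
The Private Service's behavior boils down to a two-way routing decision. The sketch below expresses that decision as a plain Go function; the `revision` type and the activator address are hypothetical stand-ins, not Knative types or real endpoints.

```go
package main

import "fmt"

// activatorAddr is a hypothetical address for the cluster's Knative activator.
const activatorAddr = "activator.example.internal:80"

// revision is a pared-down, hypothetical view of an app's current state.
type revision struct {
	podAddr string // empty when the app is scaled to zero
}

// privateServiceTarget mirrors the routing rule described above: send traffic
// to the Pod when one exists, otherwise to the activator so it can scale up.
func privateServiceTarget(r revision) string {
	if r.podAddr != "" {
		return r.podAddr
	}
	return activatorAddr
}

func main() {
	fmt.Println("scaled up:  ", privateServiceTarget(revision{podAddr: "10.0.0.5:8012"}))
	fmt.Println("scaled down:", privateServiceTarget(revision{}))
}
```

Seen this way, the Service is doing very little for an app with at most one Pod, which is what made it a candidate for removal.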

Armed with this newfound knowledge, we devised a path to remove all of these Services.

Step by step

We started simple with the dummy Placeholder Service, which did literally nothing. There was no risk of naming collisions among our Knative-managed resources, so we updated the Knative Route controller to stop creating the Placeholder Service. ❌

Next! While the Public Service (for public internet routing) is needed for plenty of Knative use cases out there, in Render-land, all requests from the public internet must pass through our load-balancing layer. This means requests are guaranteed to be cluster-local by the time they reach Pods, so the Public Service also had nothing to do! We patched Knative to stop reconciling it and its related Endpoint resources. ❌

Finally, the Private Service (for cluster-local routing). We put together two facts: Services are used to balance load across backing Pods, and a free-tier app can have at most one Pod receiving traffic at a time, making load balancing slightly unnecessary. There were two changes we needed to make:

Streamline traffic to flow exclusively through the activator, as we no longer had a Service to split traffic to when the app was scaled up. With a little experimentation, we discovered that the activator could both wake Pods and reverse-proxy to an already-awake Pod, even though that behavior wasn't documented! We just needed to set the right headers.
Patch the activator to listen for changes to Pod readiness states, and route directly to Pod IP addresses (thanks, Calico!). By default, the activator listens for changes to EndpointSlices, but those are tied to the Services we were hoping to delete. See the code additions in the diff:

diff --git a/pkg/activator/net/helpers.go b/pkg/activator/net/helpers.go
index 3a8e5ef49..0e7d5c3d0 100644
--- a/pkg/activator/net/helpers.go
+++ b/pkg/activator/net/helpers.go
@@ -92,6 +92,47 @@ func endpointsToDests(endpoints *corev1.Endpoints, portName string) (ready, notR
 	return ready, notReady
 }
+// podsToDests returns the pod's IP as the ready set if the pod is Ready,
+// and as the notReady set if not.
+func podsToDests(pods []*corev1.Pod) (_ready, _notReady sets.String) {
+	ready := sets.String{}
+	notReady := sets.String{}
+	// 8012 is the port of the queue-proxy container that runs alongside
+	// the user-server container.
+	port := "8012"
+	for _, pod := range pods {
+		if podIsReady(pod) {
+			for _, ip := range pod.Status.PodIPs {
+				ready.Insert(net.JoinHostPort(ip.IP, port))
+			}
+		} else {
+			for _, ip := range pod.Status.PodIPs {
+				notReady.Insert(net.JoinHostPort(ip.IP, port))
+			}
+		}
+
+	}
+	return ready, notReady
+}
+
+func podIsReady(pod *corev1.Pod) bool {
+	// This is the same logic that the K8s endpoints controller uses to list
+	// the Pod's IP under the Endpoint's `Addresses` (as opposed to `NotReadyAddresses`).
+	//
+	// https://github.com/kubernetes/kubernetes/blob/8a3f72074e6390f8f14a89a7b78cef4379737216/pkg/controller/endpoint/endpoints_controller.go#L624
+	//
+	// If we "mistime" it by interpreting the pod as ready before it is indeed
+	// ready, then the request that triggers scale-from-zero will always return
+	// some connection error, as the activator will try to reach the user-server
+	// container before it can serve requests.
+	for _, cond := range pod.Status.Conditions {
+		if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
+			return true
+		}
+	}
+	return false
+}
+
 // getServicePort takes a service and a protocol and returns the port number of
 // the port named for that protocol. If the port is not found then ok is false.
 func getServicePort(protocol networking.ProtocolType, svc *corev1.Service) (int, bool) {
diff --git a/pkg/activator/net/revision_backends.go b/pkg/activator/net/revision_backends.go
index 3d1524925..94758eae1 100644
--- a/pkg/activator/net/revision_backends.go
+++ b/pkg/activator/net/revision_backends.go
@@ -31,6 +31,8 @@ import (
 	"go.uber.org/zap"
 	"go.uber.org/zap/zapcore"
 	"golang.org/x/sync/errgroup"
+	"k8s.io/apimachinery/pkg/labels"
+	podsinformer "knative.dev/pkg/client/injection/kube/informers/core/v1/pod"
 	corev1 "k8s.io/api/core/v1"
 	"k8s.io/apimachinery/pkg/types"
@@ -47,6 +49,7 @@ import (
 	"knative.dev/pkg/logging"
 	"knative.dev/pkg/logging/logkey"
 	"knative.dev/pkg/reconciler"
+
 	"knative.dev/serving/pkg/apis/serving"
 	revisioninformer "knative.dev/serving/pkg/client/injection/informers/serving/v1/revision"
 	servinglisters "knative.dev/serving/pkg/client/listers/serving/v1"
@@ -441,6 +444,7 @@ type revisionBackendsManager struct {
 	ctx            context.Context
 	revisionLister servinglisters.RevisionLister
 	serviceLister  corev1listers.ServiceLister
+	podLister      corev1listers.PodLister
 	revisionWatchers    map[types.NamespacedName]*revisionWatcher
 	revisionWatchersMux sync.RWMutex
@@ -466,6 +470,7 @@ func newRevisionBackendsManagerWithProbeFrequency(ctx context.Context, tr http.R
 		ctx:              ctx,
 		revisionLister:   revisioninformer.Get(ctx).Lister(),
 		serviceLister:    serviceinformer.Get(ctx).Lister(),
+		podLister:        podsinformer.Get(ctx).Lister(),
 		revisionWatchers: make(map[types.NamespacedName]*revisionWatcher),
 		updateCh:         make(chan revisionDestsUpdate),
 		transport:        tr,
@@ -478,6 +483,7 @@ func newRevisionBackendsManagerWithProbeFrequency(ctx context.Context, tr http.R
 	endpointsInformer.Informer().AddEventHandler(cache.FilteringResourceEventHandler{
 		FilterFunc: reconciler.ChainFilterFuncs(
 			reconciler.LabelExistsFilterFunc(serving.RevisionUID),
+			reconciler.Not(reconciler.LabelExistsFilterFunc(serving.RenderServicelessKey)),
 			// We are only interested in the private services, since that is
 			// what is populated by the actual revision backends.
 			reconciler.LabelFilterFunc(networking.ServiceTypeKey, string(networking.ServiceTypePrivate), false),
@@ -488,6 +494,18 @@ func newRevisionBackendsManagerWithProbeFrequency(ctx context.Context, tr http.R
 			DeleteFunc: rbm.endpointsDeleted,
 		},
 	})
+	podsInformer := podsinformer.Get(ctx)
+	podsInformer.Informer().AddEventHandler(cache.FilteringResourceEventHandler{
+		FilterFunc: reconciler.ChainFilterFuncs(
+			reconciler.LabelExistsFilterFunc(serving.RevisionUID),
+			reconciler.LabelExistsFilterFunc(serving.RenderServicelessKey),
+		),
+		Handler: cache.ResourceEventHandlerFuncs{
+			AddFunc:    rbm.podUpdated,
+			UpdateFunc: controller.PassNew(rbm.podUpdated),
+			DeleteFunc: rbm.podDeleted,
+		},
+	})
 	go func() {
 		// updateCh can only be closed after revisionWatchers are done running
@@ -566,6 +584,40 @@ func (rbm *revisionBackendsManager) endpointsUpdated(newObj interface{}) {
 	}
 }
+// podUpdated is a handler function to be used by the Pods informer.
+// It updates the endpoints in the RevisionBackendsManager if the hosts changed.
+func (rbm *revisionBackendsManager) podUpdated(newObj interface{}) {
+	// Ignore the updates when we've terminated.
+	select {
+	case <-rbm.ctx.Done():
+		return
+	default:
+	}
+	pod := newObj.(*corev1.Pod)
+	revID := types.NamespacedName{Namespace: pod.Namespace, Name: pod.Labels[serving.RevisionLabelKey]}
+
+	rw, err := rbm.getOrCreateRevisionWatcher(revID)
+	if err != nil {
+		rbm.logger.Errorw("Failed to get revision watcher", zap.Error(err), zap.String(logkey.Key, revID.String()))
+		return
+	}
+	selector, _ := labels.Parse(fmt.Sprintf("%s=%s", serving.RevisionLabelKey, pod.Labels[serving.RevisionLabelKey]))
+	allPods, err := rbm.podLister.Pods(pod.Namespace).List(selector)
+	if err != nil {
+		rbm.logger.Errorw("Failed to list pods belonging to revision", zap.Error(err), zap.String(logkey.Key, revID.String()))
+		return
+	}
+	ready, notReady := podsToDests(allPods)
+	select {
+	case <-rbm.ctx.Done():
+		return
+	case rw.destsCh <- dests{ready: ready, notReady: notReady}:
+	}
+}
+
 // deleteRevisionWatcher deletes the revision watcher for rev if it exists. It expects
 // a write lock is held on revisionWatchersMux when calling.
 func (rbm *revisionBackendsManager) deleteRevisionWatcher(rev types.NamespacedName) {
@@ -590,3 +642,7 @@ func (rbm *revisionBackendsManager) endpointsDeleted(obj interface{}) {
 	defer rbm.revisionWatchersMux.Unlock()
 	rbm.deleteRevisionWatcher(revID)
 }
+
+func (rbm *revisionBackendsManager) podDeleted(obj interface{}) {
+	rbm.podUpdated(obj)
+}
diff --git a/pkg/apis/serving/register.go b/pkg/apis/serving/register.go
index 61a2e8b4a..c067f962b 100644
--- a/pkg/apis/serving/register.go
+++ b/pkg/apis/serving/register.go
@@ -25,6 +25,8 @@ const (
 	// GroupNamePrefix is the prefix for label key and annotation key
 	GroupNamePrefix = GroupName + "/"
+	RenderServicelessKey = "render.com/knative-serviceless"
+
 	// ConfigurationLabelKey is the label key attached to a Revision indicating by
 	// which Configuration it is created.
 	ConfigurationLabelKey = GroupName + "/configuration"
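
For readers who want to experiment with the core of this patch without a Kubernetes client, the following is a standalone sketch of the `podsToDests` idea using only the Go standard library. The `pod` struct is a simplified stand-in for `corev1.Pod`, and its `ready` field stands in for the `PodReady` condition check in `podIsReady`; plain maps replace `sets.String`. Port 8012 is the queue-proxy port named in the diff.

```go
package main

import (
	"fmt"
	"net"
	"sort"
)

// queueProxyPort is the queue-proxy port used in the diff above.
const queueProxyPort = "8012"

// pod is a simplified stand-in for corev1.Pod with only the fields used here.
type pod struct {
	ips   []string
	ready bool // stands in for the PodReady condition being True
}

// podsToDests splits pod addresses into ready and not-ready destination sets,
// mirroring the shape of the function added to the activator.
func podsToDests(pods []pod) (ready, notReady map[string]struct{}) {
	ready = map[string]struct{}{}
	notReady = map[string]struct{}{}
	for _, p := range pods {
		target := notReady
		if p.ready {
			target = ready
		}
		for _, ip := range p.ips {
			// JoinHostPort brackets IPv6 addresses, so dual-stack Pods work too.
			target[net.JoinHostPort(ip, queueProxyPort)] = struct{}{}
		}
	}
	return ready, notReady
}

// sortedKeys returns a set's members in a stable order for printing.
func sortedKeys(set map[string]struct{}) []string {
	keys := make([]string, 0, len(set))
	for k := range set {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	pods := []pod{
		{ips: []string{"10.0.0.5"}, ready: true},
		{ips: []string{"10.0.0.6", "fd00::6"}, ready: false},
	}
	ready, notReady := podsToDests(pods)
	fmt.Println("ready:    ", sortedKeys(ready))
	fmt.Println("not ready:", sortedKeys(notReady))
}
```

Keeping not-ready Pods in a separate set matters for the reason given in the `podIsReady` comment: if the activator forwarded a request before the Pod was ready, the request that triggered scale-from-zero would fail.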

And just like that, the Private Service was no more. ❌
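
One detail of the diff worth a closer look is how the `render.com/knative-serviceless` label splits resources between the two informers: labeled resources are handled by the new Pod handler and excluded from the existing Endpoints handler. The sketch below re-creates that filter composition with plain functions over label maps. The helper names are hypothetical simplifications of the `reconciler` filter helpers, the revision UID label value is an assumption about upstream Knative, and the private-Service type filter from the real chain is omitted.

```go
package main

import "fmt"

const (
	// revisionUIDKey is assumed to match Knative's serving.RevisionUID label.
	revisionUIDKey = "serving.knative.dev/revisionUID"
	// servicelessKey is the label key introduced in the diff.
	servicelessKey = "render.com/knative-serviceless"
)

// filterFunc decides whether a resource with the given labels is handled.
type filterFunc func(labels map[string]string) bool

// labelExists passes resources that carry the given label key.
func labelExists(key string) filterFunc {
	return func(labels map[string]string) bool {
		_, ok := labels[key]
		return ok
	}
}

// not inverts a filter.
func not(f filterFunc) filterFunc {
	return func(labels map[string]string) bool { return !f(labels) }
}

// chain passes only if every filter passes.
func chain(fs ...filterFunc) filterFunc {
	return func(labels map[string]string) bool {
		for _, f := range fs {
			if !f(labels) {
				return false
			}
		}
		return true
	}
}

var (
	// endpointsFilter keeps the original path for non-serviceless revisions.
	endpointsFilter = chain(labelExists(revisionUIDKey), not(labelExists(servicelessKey)))
	// podsFilter routes serviceless revisions to the new Pod handler.
	podsFilter = chain(labelExists(revisionUIDKey), labelExists(servicelessKey))
)

func main() {
	serviceless := map[string]string{revisionUIDKey: "abc", servicelessKey: "true"}
	classic := map[string]string{revisionUIDKey: "def"}

	fmt.Println("serviceless -> pods handler:     ", podsFilter(serviceless))
	fmt.Println("serviceless -> endpoints handler:", endpointsFilter(serviceless))
	fmt.Println("classic     -> pods handler:     ", podsFilter(classic))
	fmt.Println("classic     -> endpoints handler:", endpointsFilter(classic))
}
```

Since the two filters are mutually exclusive, a revision is tracked through exactly one path, which would allow the label to be rolled out app by app.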

Want to go deeper under the hood? Check out an abridged version of the design doc for removing the Private Service.

At the end of this entire optimization pass, the networking architecture for a free-tier app had been simplified to the following:

Free-tier architecture after Knative Service removal

Zero Kubernetes Services per free-tier app! Predictably, K8s Service counts plummeted across our clusters:

Change in K8s Service count in each Render cluster, November 2022

With these improvements, Calico and kube-proxy's combined usage fell by hundreds of CPU seconds in our largest cluster.

CPU usage for Calico and kube-proxy in one Render cluster, November 2022

With compute resources freed up, free-tier network latency and stability improved dramatically. But even so, we knew we had more work to do.

A moving target

Our Knative tweaks bought us some much-needed breathing room, but ultimately, free-tier usage began to put a strain even on this optimized architecture. The time was quickly approaching for us to rip out Knative entirely, in favor of a home-grown solution that was tailor-made for Render's needs. But that's a story for another post!

Design doc for removing the Private Service (abridged)
Knative documentation
Careers at Render 😉