Inside the hypercube of bad vibes: the namespace dimension

Plenty of teams run Kubernetes clusters bigger than ours. More nodes, more pods, more ingresses, you name it. In most dimensions, someone out there has us beat.
There's one dimension where I suspect we might be near the very top: namespaces. I say that because we keep running into odd behavior in any process that has to keep track of them. In particular, anything that listwatches them ends up using a surprising amount of memory and puts real pressure on the apiserver. This has become one of those scaling quirks you only really notice once you hit a certain threshold. As this memory overhead adds up, efficiency decreases: each byte we have to use for management is a byte we can't put towards user services.
The problem gets significantly worse when a daemonset needs to listwatch namespaces or network policies (netpols, which we define per namespace). Since daemonsets run a pod on every node, each of those pods independently performs a listwatch on the same resources. As a result, memory usage increases with the number of nodes.
Even worse, these listwatch calls can put significant load on the apiserver. If many daemonset pods restart at once, such as during a rollout, they can overwhelm the server with requests and cause real disruption.
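For a rough sense of what that per-pod cost looks like, here is a minimal client-go sketch of the kind of namespace informer a daemonset pod might run. It is illustrative only; neither Calico nor Vector works exactly this way. The point is that every replica keeps its own full copy of every Namespace object, so the total footprint scales with namespaces times nodes.

```go
// Illustrative sketch only: roughly the kind of namespace listwatch a
// daemonset pod might run. Each replica keeps its own in-memory cache of
// every Namespace object, so the total cost scales with namespaces x nodes.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The initial LIST pulls every Namespace object from the apiserver; the
	// WATCH then streams updates. With hundreds of thousands of namespaces,
	// both the list call and the resulting cache are expensive, and every
	// daemonset pod pays that cost independently.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	nsInformer := factory.Core().V1().Namespaces().Informer()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	count := 0
	for _, obj := range nsInformer.GetStore().List() {
		if _, ok := obj.(*corev1.Namespace); ok {
			count++ // every pod of the daemonset holds all of these in memory
		}
	}
	fmt.Println("namespaces cached on this node:", count)
}
```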
Following the memory trail
A few months ago, if you looked at our nodes, the largest memory consumers were often daemonsets, in particular Calico and Vector, which handle network configuration and log collection respectively.
We had already done some work to reduce Calico’s memory usage, working closely with the project’s maintainers to make it scale more efficiently. That optimization effort was a big win for us, and it gave us useful insight into how memory behaves when namespaces scale up.


To support that work, we set up a staging cluster with several hundred thousand namespaces. We knew that per-namespace netpols were the scaling factor that stressed Calico, so we reproduced those conditions to validate our changes.
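For anyone who wants to reproduce a similar shape, the fixture can be as simple as creating namespaces in a loop, each with one NetworkPolicy. The sketch below is illustrative only; the counts, names, and default-deny policy are assumptions rather than our actual tooling.

```go
// Rough sketch of seeding a staging cluster with many namespaces, each
// carrying one NetworkPolicy, to mimic the per-namespace netpol shape.
// The names, counts, and policy contents here are illustrative.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	netv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	const count = 100000 // several hundred thousand in the real test
	for i := 0; i < count; i++ {
		name := fmt.Sprintf("load-test-%06d", i)

		ns := &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: name}}
		if _, err := client.CoreV1().Namespaces().Create(ctx, ns, metav1.CreateOptions{}); err != nil {
			panic(err)
		}

		// One policy per namespace: the pattern that stresses anything
		// listwatching netpols cluster-wide.
		np := &netv1.NetworkPolicy{
			ObjectMeta: metav1.ObjectMeta{Name: "default-deny", Namespace: name},
			Spec: netv1.NetworkPolicySpec{
				PodSelector: metav1.LabelSelector{},
				PolicyTypes: []netv1.PolicyType{netv1.PolicyTypeIngress},
			},
		}
		if _, err := client.NetworkingV1().NetworkPolicies(name).Create(ctx, np, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```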
While running those tests, we noticed something strange. Vector, another daemonset, also started consuming large amounts of memory.

The pattern looked familiar, and we knew we had another problem to dig into. Vector obviously wasn’t looking at netpols, but after poking around a bit we found it was listwatching namespaces from every node so that its kubernetes_logs source could attach namespace labels to each pod’s log events.
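Conceptually, that enrichment is just “look up the pod’s namespace in a local cache and copy its labels onto each log event,” which is why every Vector pod ends up holding a namespace cache. Here is a simplified sketch of the idea, in Go rather than Vector’s actual Rust, with hypothetical types:

```go
// Simplified illustration of why a log agent caches namespaces: it stamps
// each pod's log events with that namespace's labels. This is not Vector's
// actual implementation (Vector is Rust); the types here are hypothetical.
package main

import "fmt"

// LogEvent is a stand-in for an enriched log record.
type LogEvent struct {
	Message         string
	PodNamespace    string
	NamespaceLabels map[string]string
}

// enrich copies namespace labels onto the event, using a cache the agent
// keeps current via a namespace listwatch; that cache is what gets expensive
// with hundreds of thousands of namespaces.
func enrich(event *LogEvent, nsLabelCache map[string]map[string]string) {
	if labels, ok := nsLabelCache[event.PodNamespace]; ok {
		event.NamespaceLabels = labels
	}
}

func main() {
	cache := map[string]map[string]string{
		"team-a": {"example.com/tier": "user"}, // hypothetical label
	}
	ev := &LogEvent{Message: "hello", PodNamespace: "team-a"}
	enrich(ev, cache)
	fmt.Println(ev.NamespaceLabels)
}
```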
Do we really need these labels?
That gave us an idea: what if Vector didn’t need to use namespaces at all? Was that even possible?
As it turns out, it was possible: namespace labels were in use in our configuration, but only to check whether a pod belonged to a user namespace.
Conveniently, we realized we could hackily describe that condition in another way, and the memory savings were absolutely worth it.
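To give a flavor of the kind of swap this was (the real condition in our config is different, and the naming convention below is purely hypothetical): instead of asking whether the pod’s namespace carries a particular label, you can often answer the same question from information already attached to the pod, such as its namespace name.

```go
// Hypothetical illustration of replacing a namespace-label check with one
// based on information already attached to the pod. The "user-" prefix
// convention is invented for this example; our real condition is different.
package main

import (
	"fmt"
	"strings"
)

// isUserPodByLabel is the old-style check: it needs namespace labels, which
// in turn forces every Vector pod to listwatch namespaces.
func isUserPodByLabel(nsLabels map[string]string) bool {
	return nsLabels["example.com/tier"] == "user" // hypothetical label
}

// isUserPodByName only needs the namespace name, which the log source
// already attaches to each event, so no namespace cache is required.
func isUserPodByName(namespace string) bool {
	return strings.HasPrefix(namespace, "user-") // hypothetical convention
}

func main() {
	fmt.Println(isUserPodByLabel(map[string]string{"example.com/tier": "user"}))
	fmt.Println(isUserPodByName("user-team-a"))
}
```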
Building the fix (and breaking the logs)
At that point, it felt almost too good to be true. We reached out to the Vector maintainers to ask whether disabling this behavior would actually work, and whether they would be open to accepting a contribution if we made it happen.

From there, all that was left was to try it. The code change was straightforward. We added a new config option and threaded it through the relevant parts of the codebase.
After a few hours of flailing at rustc, we finally had a Docker image built and were ready to test the theory. The container ran cleanly with no errors in the logs, which seemed promising.
But then we hit a snag. Nothing was being emitted. No logs at all. I couldn’t figure out why.
Thankfully, our pal Claude came to the rescue:


The numbers don’t add up
The change cut memory usage by 50 percent. A huge win. We were ready to wrap it up and ship to production.
But then Hieu, one of our teammates, asked a very good question.
He was right: something didn’t add up.
A few hours later, after repeatedly banging my head against the wall, I still hadn’t found anything. There was still a full gibibyte of memory unaccounted for. My whole theory about how this worked was starting to fall apart.
I even dropped into the channel to see if anyone had Valgrind experience:
Me (later in channel): anybody got a background in valgrind? seems pretty straightforward to get working so far but it won’t end up interfacing with pyroscope. we’ll have to exec in and gdb manually.
The answer was no.
In a last-ditch effort, I profiled it again and finally saw the answer. It had been staring me in the face the whole time.
We actually had two kubernetes_logs sources on user nodes. I had only set the flag on one of them. Once I applied it to both, memory usage dropped to the level we had seen in staging before the extra namespaces were added.
Shipping it
I put together a full pull request, and after waiting a little while, it shipped!

Around the same time, our colleague Mark happened to be on-call. He did his usual magic — pulled everything together, tested the rollout in staging, and got it shipped to production.
I’ll let the results speak for themselves. 

7 TiB later
Debugging infrastructure at scale is rarely about one big “aha” moment. It’s often the result of many small questions, small changes, and small wins stacked up until something clicks.
In this case, it started with a memory chart that didn’t look quite right, a teammate asking the right question at the right time, and a bit of persistence. When applied to our whole infrastructure, that simple fix freed up 7 TiB of memory, reduced risk during rollouts, and made the system easier to reason about.
Huge thanks to Hieu for pushing the investigation forward, Mark for shipping it smoothly, and the Vector maintainers for being responsive and open to the change.
If you’re running daemonsets at scale and seeing unexplained memory pressure, it might be worth asking:
Do you really need those namespace labels?