
Increasing the Container PID limit on AWS EKS Bottlerocket

TL;DR

  • Problem: Workload expected to run on EKS Bottlerocket needed higher PID limit for dynamic process creation.

  • Cause: Container PID limit set by systemd’s DefaultTasksMax (15% of system-wide limit).

  • Fix: Increased pod PID limit (podPidsLimit) and set DefaultTasksMax=infinity via bootstrap container.

  • Key Insight: Pod and container PID limits must align; adjust container limits for specific workloads.

  • Result: Workload enabled successfully on Bottlerocket.

Workload Requirements

The workload we needed to support on EKS Bottlerocket dynamically creates processes after the container starts. We know this isn't ideal, but changing it would require a major rearchitecture, so it is a constraint we have to accommodate.

At startup the workload reads /sys/fs/cgroup/pids.max and expects the value max, verifying that the PID limit is effectively unlimited. Without this check, the problem would only surface once the PID limit is reached: the workload could no longer create processes, leading to service disruption.

To support this workload on EKS Bottlerocket, we needed to increase the PID limit on the container level, i.e. set /sys/fs/cgroup/pids.max inside the container to max.

We first tried to increase the PID limit on the pod level via the kubelet flag podPidsLimit ([ref]). This didn't have the desired effect: /sys/fs/cgroup/pids.max inside the container still wasn't set to max.
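For reference, podPidsLimit can also be set via the kubelet configuration file. This is a minimal sketch; the value 4194304 is an illustrative choice (matching the kernel's pid_max), not necessarily the value we used:

```yaml
# KubeletConfiguration fragment; podPidsLimit raises the per-pod PID limit.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 4194304
```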

Where does the container PID limit come from, if not from the pod limit?

It is 15% of the lower of the system-wide thread and PID limits:

    cat /proc/sys/kernel/threads-max
    505969
    cat /proc/sys/kernel/pid_max
    4194304

    min(505969, 4194304) * 0.15 ≈ 75895
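The calculation above can be reproduced in shell arithmetic (the two values are hardcoded from the node shown above; they will differ per instance):

```shell
# systemd's default: DefaultTasksMax is 15% of the lower of the kernel's
# thread limit and PID limit.
threads_max=505969   # from /proc/sys/kernel/threads-max
pid_max=4194304      # from /proc/sys/kernel/pid_max
limit=$(( threads_max < pid_max ? threads_max : pid_max ))
echo $(( limit * 15 / 100 ))   # prints 75895
```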

This corresponds to the systemd setting DefaultTasksMax, which we discovered by inspecting the cgroups and their settings.

The value of /sys/fs/cgroup/pids.max in the container is therefore 75895, which is not enough in our scenario, as the workload needs to create many more processes.

Technologies Involved

From the PodSpec to the container cgroup, which component manages what?

On Bottlerocket the kubelet, as the node agent, takes a PodSpec and instructs containerd, the container runtime, to start all containers listed in it. Containerd, using systemd as its cgroup driver (cgroupDriver), has systemd create units for the containers. Systemd creates a cgroup for each unit and starts each container process.

Control groups are a Linux kernel feature for hierarchically controlling resources and are essential to container technology. The relevance to PID limits: since container cgroups are children of the pod cgroup, the pod-level PID limit caps the effective limit of every container in the pod.

Conversely, even if the pod PID limit is high enough but the container PID limit isn't, the container can't create more processes than its own cgroup allows. The kubelet's podPidsLimit flag applies to the pod cgroup.
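The resulting behavior amounts to taking the minimum along the cgroup path. This toy shell function (our own illustration, not kernel code) mimics how a pids.max of max at one level still leaves lower limits at other levels in force:

```shell
# Toy model: the effective PID limit is the smallest numeric pids.max along
# the cgroup hierarchy; "max" means no limit at that level.
effective_limit() {
  local min=max v
  for v in "$@"; do
    [ "$v" = max ] && continue
    if [ "$min" = max ] || [ "$v" -lt "$min" ]; then min=$v; fi
  done
  echo "$min"
}
effective_limit max 75895 max   # root unlimited, container capped -> 75895
```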

A pod cgroup is the parent of a container cgroup:

    ...
    ├─kubepods-besteffort-pod21069422_dc7e_4a7f_b27e_379eee8396c8.slice
    │ ├─cri-containerd-8b7905f34fae7450ed3e8ed0445826144e740ab2575ceb57de1896c373e049fe.scope
    │ │ └─1761 /pause
    ...

Further cgroups above and below in the hierarchy are omitted. The important insight is the hierarchical structure of the cgroups.

Setting DefaultTasksMax to infinity

We know from the previous section that the systemd property DefaultTasksMax determines the container PID limit, so this is the property to tweak.

Systemd reads drop-in configuration files matching /run/systemd/system.conf.d/*.conf. We can create a configuration file there that explicitly sets DefaultTasksMax to the desired value.

    # /run/systemd/system.conf.d/custom.conf
    [Manager]
    DefaultTasksMax=infinity

In combination with the increased pod PID limit, this leads to the maximum available limit in the container.

Shipping The Fix

Bottlerocket bootstrap containers ([ref]) are intended to set up the host during boot. All we need is a container image that creates the config file, referenced as a bootstrap container.

Systemd will start the container during boot, and the configuration will be set up before the node is registered in the Kubernetes cluster.
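As a sketch, the bootstrap container's entrypoint could look like the following. We assume the host root is mounted at /.bottlerocket/rootfs, which is where Bottlerocket exposes it to bootstrap containers; the fallback to a temp dir is only there to allow a local dry run:

```shell
#!/usr/bin/env bash
# Hypothetical bootstrap-container entrypoint: write the systemd drop-in
# onto the host filesystem.
set -eu
host_root="/.bottlerocket/rootfs"
[ -d "$host_root" ] || host_root="$(mktemp -d)"   # local dry-run fallback
conf_dir="$host_root/run/systemd/system.conf.d"
mkdir -p "$conf_dir"
cat > "$conf_dir/custom.conf" <<'EOF'
[Manager]
DefaultTasksMax=infinity
EOF
```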

All nodes that reference the bootstrap container will come up with the increased container-level PID limit for every workload scheduled on them.
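A bootstrap container is referenced in the node's user-data. This is a hedged example of what that could look like (the ECR image URI and tag are placeholders); since /run is a tmpfs and the drop-in doesn't survive a reboot, the container should run on every boot:

```toml
# Bottlerocket user-data: run the bootstrap container on every boot.
[settings.bootstrap-containers.default-tasks-max]
source = "123456789012.dkr.ecr.eu-central-1.amazonaws.com/bootstrap-default-tasks-max:v1"
mode = "always"
essential = true
```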

Conclusion

Increasing the container PID limit by changing DefaultTasksMax fulfilled the requirements, and we successfully accommodated the workload on EKS Bottlerocket.

PID limit on Pod Level vs. Container Level

When should you set the limit on the pod level, and when on the container level?

The pod-level limit can be seen as protection against any container in the pod exhausting all PIDs (e.g. through a fork bomb) and potentially bringing down the node. Increasing the container-level PID limit is only required in rare scenarios where a workload creates new processes at runtime, like a web server forking a process per connection.

Note that to increase the container PID limit, the pod PID limit also needs to be increased due to the hierarchical nature of cgroups.

Jonas Bührle
Platform Engineer

Jonas joined Celonis as a Platform Engineer through the FutureNaut program in September 2022. He has a dual degree in Computer Science from the Technical University Ulm and the Rose-Hulman Institute of Technology in Indiana.

Before joining Celonis, Jonas worked as a Full-Stack Engineer as well as a Systems Engineer. At Celonis, he is part of the team that builds the Celonis Cloud Platform which powers the Celonis Software.

Outside of work, Jonas likes team and long distance sports, especially football where he is a player as well as a referee.
