Problem: Workload expected to run on EKS Bottlerocket needed higher PID limit for dynamic process creation.
Cause: Container PID limit set by systemd’s DefaultTasksMax (15% of system-wide limit).
Fix: Increased pod PID limit (podPidsLimit) and set DefaultTasksMax=infinity via bootstrap container.
Key Insight: Pod and container PID limits must align; adjust container limits for specific workloads.
Result: Workload enabled successfully on Bottlerocket.
In our case, the workload we needed to support on EKS Bottlerocket dynamically creates processes after the container starts. We know this isn't ideal, but changing it would require a major rearchitecture, which makes it a constraint we have to accommodate.
At startup, the workload checks the file /sys/fs/cgroup/pids.max and expects the value max as verification of a high PID limit. Without this check, a problem would only surface once the PID limit is reached and the workload can no longer create processes, leading to service disruption.
To support running this workload on EKS Bottlerocket, we needed to increase the PID limit at the container level, i.e. set /sys/fs/cgroup/pids.max inside the container to max.
We first tried to increase the PID limit at the pod level via the kubelet setting podPidsLimit ([ref]), which didn't have the desired effect: /sys/fs/cgroup/pids.max in the container still wasn't set to max.
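For reference, here is a sketch of how the pod PID limit is configured. In a kubelet configuration file this is the podPidsLimit field; on Bottlerocket the same value can be set through the node settings (the TOML key below follows Bottlerocket's documented kubernetes settings; the value 200000 is illustrative):

# kubelet configuration (KubeletConfiguration, excerpt)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 200000

# Bottlerocket user-data (TOML, excerpt)
[settings.kubernetes]
pod-pids-limit = 200000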
Where does the container PID limit come from, if not from the pod limit?
It is 15% of the lower of the system-wide thread and PID limits:
cat /proc/sys/kernel/threads-max
505969
cat /proc/sys/kernel/pid_max
4194304
min(505969, 4194304) * 0.15 = 75895 (rounded down)
This corresponds to the systemd setting DefaultTasksMax, which we discovered by inspecting the cgroups and their settings.
The value of /sys/fs/cgroup/pids.max in the container is therefore 75895. Not enough in our scenario, as the workload needs to create many more processes.
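The effective value can also be read from the systemd manager directly on the node; the output below corresponds to the calculation above:

systemctl show --property DefaultTasksMax
DefaultTasksMax=75895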
From the PodSpec to the container cgroup, which component manages what?
On Bottlerocket, the kubelet as node agent takes a PodSpec and instructs containerd, the container runtime, to start all containers listed in the PodSpec. Containerd in turn instructs systemd, configured as its cgroup driver, to create systemd units for the containers. Systemd creates a cgroup for each of the units and starts each container process.
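For illustration, the systemd cgroup driver is selected in containerd's CRI runc options; the excerpt below follows the upstream containerd configuration (already the default on Bottlerocket, and the exact plugin path can vary between containerd versions):

# /etc/containerd/config.toml (excerpt)
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true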
Control groups (cgroups) are a Linux kernel feature to hierarchically control resources and are essential to container technology. The relevance to PID limits is that the pod-level PID limit caps the limits of its containers, as they are child cgroups of the pod.
Conversely, even if the pod PID limit is high enough, a container can't create more processes than the limit on its own cgroup allows. The podPidsLimit setting on the kubelet applies to the pod cgroup.
A pod cgroup is the parent of a container cgroup:
...
├─kubepods-besteffort-pod21069422_dc7e_4a7f_b27e_379eee8396c8.slice
│ ├─cri-containerd-8b7905f34fae7450ed3e8ed0445826144e740ab2575ceb57de1896c373e049fe.scope
│ │└─1761 /pause
...
Further cgroups above and below in the hierarchy have been left out. The important insight is the hierarchical structure of the cgroups.
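Both limits can be inspected on the node. A sketch with shortened paths, where <uid> and <id> stand in for the pod UID and container ID seen in the excerpt above:

# pod-level limit, set via podPidsLimit
cat /sys/fs/cgroup/kubepods.slice/.../kubepods-besteffort-pod<uid>.slice/pids.max
# container-level limit, set via DefaultTasksMax
cat /sys/fs/cgroup/kubepods.slice/.../kubepods-besteffort-pod<uid>.slice/cri-containerd-<id>.scope/pids.max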
From the previous section we know that the systemd property DefaultTasksMax determines the container PID limit. This is the property to tweak.
Systemd interprets configuration files under /run/systemd/system.conf.d/*.conf. We can create a configuration file there to explicitly set the desired value for DefaultTasksMax.
# /run/systemd/system.conf.d/custom.conf
[Manager]
DefaultTasksMax=infinity
In combination with the increased pod PID limit, this leads to the maximum available limit in the container.
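Inside a container on a node with both settings applied, the check the workload performs at startup then sees the expected value:

cat /sys/fs/cgroup/pids.max
max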
Bottlerocket bootstrap containers ([ref]) are intended to set up the host during boot. All we need is a container that creates the config file, referenced as a bootstrap container.
Systemd will start the container during boot, and the configuration will be set up before the node is registered in the Kubernetes cluster.
All nodes that reference the bootstrap container will come up with the increased container-level PID limit, available to all workloads scheduled on them.
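A minimal sketch of such a bootstrap container follows. The image, script, and registry URL are illustrative; inside bootstrap containers the host root filesystem is mounted at /.bottlerocket/rootfs, and the bootstrap-containers keys follow Bottlerocket's documented settings:

# Dockerfile (illustrative)
FROM alpine
COPY setup.sh /setup.sh
RUN chmod +x /setup.sh
ENTRYPOINT ["/setup.sh"]

#!/bin/sh
# setup.sh: writes the systemd drop-in onto the host filesystem
mkdir -p /.bottlerocket/rootfs/run/systemd/system.conf.d
cat > /.bottlerocket/rootfs/run/systemd/system.conf.d/custom.conf <<'EOF'
[Manager]
DefaultTasksMax=infinity
EOF

# Bottlerocket user-data (TOML, excerpt)
[settings.bootstrap-containers.pid-limit]
source = "<registry>/pid-limit-bootstrap:latest"
mode = "always"
essential = true

mode = "always" matters here: /run is a tmpfs, so the drop-in has to be recreated on every boot. With essential = true, a failing setup fails the boot instead of silently registering a node without the increased limit.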
Increasing the container PID limit by changing DefaultTasksMax fulfilled the requirement, and we successfully accommodated the workload on EKS Bottlerocket.
When to tune the PID limit on pod level and when on container level?
The pod-level limit can be seen as protection against any container in that pod exhausting all PIDs (e.g. through a fork bomb) and potentially bringing down the node. Increasing the container-level limit is only required in rare scenarios where a workload creates many processes at runtime, such as a web server forking a new process per connection.
Note that to increase the container PID limit, the pod PID limit also needs to be increased due to the hierarchical nature of cgroups.