Kubernetes Blog

The Kubernetes blog is used by the project to communicate new features, community reports, and any news that might be relevant to the Kubernetes community.
-
Kubernetes v1.33: User Namespaces enabled by default!
on April 25, 2025 at 6:30 pm
In Kubernetes v1.33, support for user namespaces is enabled by default. This means that, when the stack requirements are met, pods can opt in to use user namespaces. To use the feature there is no longer any need to enable a Kubernetes feature flag! In this blog post we answer some common questions about user namespaces. But, before we dive into that, let's recap what user namespaces are and why they are important.

What is a user namespace?

Note: Linux user namespaces are a different concept from Kubernetes namespaces. The former is a Linux kernel feature; the latter is a Kubernetes feature.

Linux provides different namespaces to isolate processes from each other. For example, a typical Kubernetes pod runs within a network namespace to isolate the network identity and a PID namespace to isolate the processes. One Linux namespace that was left behind is the user namespace. It isolates the UIDs and GIDs of the containers from the ones on the host. The identifiers in a container can be mapped to identifiers on the host in a way where host and container(s) never end up with overlapping UIDs/GIDs. Furthermore, the identifiers can be mapped to unprivileged, non-overlapping UIDs and GIDs on the host. This brings three key benefits:

- Prevention of lateral movement: As the UIDs and GIDs for different containers are mapped to different UIDs and GIDs on the host, containers have a harder time attacking each other, even if they escape the container boundaries. For example, suppose container A runs with different UIDs and GIDs on the host than container B. In that case, the operations it can do on container B's files and processes are limited: it can only read/write what a file allows to others, as it will never have owner or group permissions (the UIDs/GIDs on the host are guaranteed to be different for different containers).
- Increased host isolation: As the UIDs and GIDs are mapped to unprivileged users on the host, if a container escapes the container boundaries, even if it runs as root inside the container, it has no privileges on the host. This greatly limits what host files it can read/write, which processes it can send signals to, etc. Furthermore, capabilities granted are only valid inside the user namespace and not on the host, limiting the impact a container escape can have.
- Enablement of new use cases: User namespaces allow containers to gain certain capabilities inside their own user namespace without affecting the host. This unlocks new possibilities, such as running applications that require privileged operations without granting full root access on the host. This is particularly useful for running nested containers.

User namespace IDs allocation

If a pod running as the root user without a user namespace manages to break out, it has root privileges on the node. If some capabilities were granted to the container, the capabilities are valid on the host too. None of this is true when using user namespaces (modulo bugs, of course 🙂).

Demos

Rodrigo created demos to understand how some CVEs are mitigated when user namespaces are used. We showed them here before (see here and here), but take a look if you haven't:

- Mitigation of CVE-2024-21626 with user namespaces
- Mitigation of CVE-2022-0492 with user namespaces

Everything you wanted to know about user namespaces in Kubernetes

Here we try to answer some of the questions we have been asked about user namespaces support in Kubernetes.

1. What are the requirements to use it?

The requirements are documented here.
But we will elaborate a bit more in the following questions. Note this is a Linux-only feature.

2. How do I configure a pod to opt-in?

A complete step-by-step guide is available here. But the short version is that you need to set the hostUsers: false field in the pod spec. For example:

  apiVersion: v1
  kind: Pod
  metadata:
    name: userns
  spec:
    hostUsers: false
    containers:
    - name: shell
      command: ["sleep", "infinity"]
      image: debian

Yes, it is that simple. Applications will run just fine, without any other changes needed (unless your application needs the privileges). User namespaces allow you to run as root inside the container without having privileges on the host. However, if your application needs privileges on the host, for example an app that needs to load a kernel module, then you can't use user namespaces.

3. What are idmap mounts and why do the file-systems used need to support them?

Idmap mounts are a Linux kernel feature that applies a mapping of UIDs/GIDs when accessing a mount. When combined with user namespaces, it greatly simplifies the support for volumes, as you can forget about the host UIDs/GIDs the user namespace is using. In particular, thanks to idmap mounts we can:

- Run each pod with different UIDs/GIDs on the host. This is key for the lateral movement prevention we mentioned earlier.
- Share volumes with pods that don't use user namespaces.
- Enable/disable user namespaces without needing to chown the pod's volumes.

Support for idmap mounts in the kernel is per file-system, and different kernel releases added support for idmap mounts on different file-systems. To find which kernel version added support for each file-system, you can check out the mount_setattr man page, or the online version of it here. Most popular file-systems are supported; the notable absence that isn't supported yet is NFS.

4. Can you clarify exactly which file-systems need to support idmap mounts?

The file-systems that need to support idmap mounts are all the file-systems used by a pod in the pod.spec.volumes field. This means: for PV/PVC volumes, the file-system used in the PV needs to support idmap mounts; for hostPath volumes, the file-system used in the hostPath needs to support idmap mounts.

What does this mean for secrets/configmaps/projected/downwardAPI volumes? For these volumes, the kubelet creates a tmpfs file-system. So, you will need a 6.3 kernel to use these volumes (note that if you use them as env variables it is fine).

And what about emptyDir volumes? Those volumes are created by the kubelet by default in /var/lib/kubelet/pods/. You can also use a custom directory for this. But what needs to support idmap mounts is the file-system used in that directory.

The kubelet creates some more files for the container, like /etc/hostname, /etc/resolv.conf, /dev/termination-log, /etc/hosts, etc. These files are also created in /var/lib/kubelet/pods/ by default, so it's important for the file-system used in that directory to support idmap mounts.

Also, some container runtimes may put some of these ephemeral volumes inside a tmpfs file-system, in which case you will need support for idmap mounts in tmpfs.

5. Can I use a kernel older than 6.3?

Yes, but you will need to make sure you are not using a tmpfs file-system. If you avoid that, you can easily use 5.19 (if all the other file-systems you use support idmap mounts in that kernel). It can be tricky to avoid using tmpfs, though, as we just described above.
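As a purely illustrative sketch (not from the original post), a pod targeting a 5.19 kernel would have to stick to non-tmpfs volume types and, as explained next, also opt out of the automatically mounted service account token; the pod name below is hypothetical:

  apiVersion: v1
  kind: Pod
  metadata:
    name: userns-old-kernel        # hypothetical name
  spec:
    hostUsers: false
    automountServiceAccountToken: false   # the default token mount is a tmpfs-backed projected volume
    containers:
    - name: app
      image: debian
      command: ["sleep", "infinity"]
      volumeMounts:
      - name: scratch
        mountPath: /scratch
    volumes:
    - name: scratch
      emptyDir: {}                 # stored under /var/lib/kubelet/pods/ on the node's disk,
                                   # not tmpfs (as long as you don't set medium: Memory)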
Besides having to avoid those volume types, you will also have to avoid mounting the service account token. Every pod has it mounted by default, and it uses a projected volume that, as we mentioned, uses a tmpfs file-system.

You could even go lower than 5.19, all the way to 5.12. However, your container rootfs probably uses an overlayfs file-system, and support for overlayfs was added in 5.19. We wouldn't recommend using a kernel older than 5.19, as not being able to use idmap mounts for the rootfs is a big limitation. If you absolutely need to, you can check this blog post Rodrigo wrote some years ago, about tricks to use user namespaces when you can't support idmap mounts on the rootfs.

6. If my stack supports user namespaces, do I need to configure anything else?

No, if your stack supports it and you are using Kubernetes v1.33, there is nothing you need to configure. You should be able to follow the task: Use a user namespace with a pod. However, in case you have specific requirements, you may configure various options. You can find more information here. You can also enable a feature gate to relax the PSS rules.

7. The demos are nice, but are there more CVEs that this mitigates?

Yes, quite a lot, actually! Besides the ones in the demo, the KEP has more CVEs you can check. That list is not exhaustive; there are many more.

8. Can you sum up why user namespaces are important?

Think about running a process as root, maybe even an untrusted process. Do you think that is secure? What if we limit it by adding seccomp and AppArmor, mask some files in /proc (so it can't crash the node, etc.) and some more tweaks? Wouldn't it be better if we didn't give it privileges in the first place, instead of trying to play whack-a-mole with all the possible ways root can escape? This is what user namespaces do, plus some other goodies:

- Run as an unprivileged user on the host without making changes to your application. Greg and Vinayak did a great talk on the pains you can face when trying to run unprivileged without user namespaces. The pains part starts at this minute.
- All pods run with different UIDs/GIDs, which significantly improves protection against lateral movement. This is guaranteed with user namespaces (the kubelet chooses them for you). In the same talk, Greg and Vinayak show that to achieve the same without user namespaces, they went through a quite complex custom solution. This part starts at this minute.
- The capabilities granted are only granted inside the user namespace. That means that if a pod breaks out of the container, they are not valid on the host. We can't provide that without user namespaces.
- It enables new use-cases in a secure way. You can run docker in docker, unprivileged container builds, Kubernetes inside Kubernetes, etc., all in a secure way. Most of the previous solutions to do this required privileged containers or put the node at a high risk of compromise.

9. Is there container runtime documentation for user namespaces?

Yes, we have containerd documentation. This explains different limitations of containerd 1.7 and how to use user namespaces in containerd without Kubernetes pods (using ctr). Note that if you use containerd, you need containerd 2.0 or higher to use user namespaces with Kubernetes. CRI-O doesn't have special documentation for user namespaces; it works out of the box.

10. What about the other container runtimes?

No other container runtime that we are aware of supports user namespaces with Kubernetes. That sadly includes cri-dockerd too.
11. I'd like to learn more about it, what would you recommend?

Rodrigo did an introduction to user namespaces at KubeCon 2022: Run As "Root", Not Root: User Namespaces In K8s - Marga Manterola, Isovalent & Rodrigo Campos Catelin.

Also, this aforementioned presentation at KubeCon 2023 can be useful as a motivation for user namespaces: Least Privilege Containers: Keeping a Bad Day from Getting Worse - Greg Castle & Vinayak Goyal.

Bear in mind the presentations are a few years old and some things have changed since then. Use the Kubernetes documentation as the source of truth.

If you would like to learn more about the low-level details of user namespaces, you can check man 7 user_namespaces and man 1 unshare. You can easily create namespaces and experiment with how they behave. Be aware that the unshare tool has a lot of flexibility, and with that come options to create incomplete setups.

If you would like to know more about idmap mounts, you can check the Linux kernel documentation for them.

Conclusions

Running pods as root is not ideal, and running them as non-root is also hard with containers, as it can require a lot of changes to the applications. User namespaces are a unique feature to let you have the best of both worlds: run as non-root, without any changes to your application.

This post covered what user namespaces are, why they are important, some real world examples of CVEs mitigated by user namespaces, and some common questions. Hopefully, this post helped you to eliminate the last doubts you had and you will now try user namespaces (if you didn't already!).

How do I get involved?

You can reach SIG Node by several means:
- Slack: #sig-node
- Mailing list
- Open Community Issues/PRs

You can also contact us directly:
- GitHub: @rata @giuseppe @saschagrunert
- Slack: @rata @giuseppe @sascha
-
Continuing the transition from Endpoints to EndpointSlices
on April 24, 2025 at 6:30 pm
Since the addition of EndpointSlices (KEP-752) as alpha in v1.15 and later GA in v1.21, the Endpoints API in Kubernetes has been gathering dust. New Service features like dual-stack networking and traffic distribution are only supported via the EndpointSlice API, so all service proxies, Gateway API implementations, and similar controllers have had to be ported from using Endpoints to using EndpointSlices. At this point, the Endpoints API is really only there to avoid breaking end user workloads and scripts that still make use of it.

As of Kubernetes 1.33, the Endpoints API is now officially deprecated, and the API server will return warnings to users who read or write Endpoints resources rather than using EndpointSlices.

Eventually, the plan (as documented in KEP-4974) is to change the Kubernetes Conformance criteria to no longer require that clusters run the Endpoints controller (which generates Endpoints objects based on Services and Pods), to avoid doing work that is unneeded in most modern-day clusters.

Thus, while the Kubernetes deprecation policy means that the Endpoints type itself will probably never completely go away, users who still have workloads or scripts that use the Endpoints API should start migrating them to EndpointSlices.

Notes on migrating from Endpoints to EndpointSlices

Consuming EndpointSlices rather than Endpoints

For end users, the biggest change between the Endpoints API and the EndpointSlice API is that while every Service with a selector has exactly 1 Endpoints object (with the same name as the Service), a Service may have any number of EndpointSlices associated with it:

  $ kubectl get endpoints myservice
  Warning: v1 Endpoints is deprecated in v1.33+; use discovery.k8s.io/v1 EndpointSlice
  NAME        ENDPOINTS          AGE
  myservice   10.180.3.17:443    1h

  $ kubectl get endpointslice -l kubernetes.io/service-name=myservice
  NAME              ADDRESSTYPE   PORTS   ENDPOINTS          AGE
  myservice-7vzhx   IPv4          443     10.180.3.17        21s
  myservice-jcv8s   IPv6          443     2001:db8:0123::5   21s

In this case, because the service is dual stack, it has 2 EndpointSlices: 1 for IPv4 addresses and 1 for IPv6 addresses. (The Endpoints API does not support dual stack, so the Endpoints object shows only the addresses in the cluster's primary address family.) Although any Service with multiple endpoints can have multiple EndpointSlices, there are three main cases where you will see this:

- An EndpointSlice can only represent endpoints of a single IP family, so dual-stack Services will have separate EndpointSlices for IPv4 and IPv6 (a sample dual-stack Service manifest is sketched after this list).
- All of the endpoints in an EndpointSlice must target the same ports. So, for example, if you have a set of endpoint Pods listening on port 80, and roll out an update to make them listen on port 8080 instead, then while the rollout is in progress, the Service will need 2 EndpointSlices: 1 for the endpoints listening on port 80, and 1 for the endpoints listening on port 8080.
- When a Service has more than 100 endpoints, the EndpointSlice controller will split the endpoints into multiple EndpointSlices rather than aggregating them into a single excessively-large object like the Endpoints controller does.
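For reference, and not taken from the original post, a dual-stack Service like the one shown above could be produced by a manifest along these lines; the name and selector are hypothetical, and the cluster must have dual-stack networking enabled:

  apiVersion: v1
  kind: Service
  metadata:
    name: myservice                 # matches the Service name used in the examples above
  spec:
    selector:
      app: myapp                    # hypothetical label on the backing Pods
    ipFamilyPolicy: PreferDualStack # request both IPv4 and IPv6 if the cluster supports it
    ports:
    - name: https
      port: 443
      targetPort: 443

With this Service, the EndpointSlice controller creates one slice per address family, which is what the kubectl get endpointslice output above reflects.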
Because there is not a predictable 1-to-1 mapping between Services and EndpointSlices, there is no way to know what the actual name of the EndpointSlice resource(s) for a Service will be ahead of time; thus, instead of fetching the EndpointSlice(s) by name, you instead ask for all EndpointSlices with a "kubernetes.io/service-name" label pointing to the Service:

  $ kubectl get endpointslice -l kubernetes.io/service-name=myservice

A similar change is needed in Go code. With Endpoints, you would do something like:

  // Get the Endpoints named `name` in `namespace`.
  endpoint, err := client.CoreV1().Endpoints(namespace).Get(ctx, name, metav1.GetOptions{})
  if err != nil {
      if apierrors.IsNotFound(err) {
          // No Endpoints exists for the Service (yet?)
          …
      }
      // handle other errors
      …
  }

  // process `endpoint`
  …

With EndpointSlices, this becomes:

  // Get all EndpointSlices for Service `name` in `namespace`.
  slices, err := client.DiscoveryV1().EndpointSlices(namespace).List(ctx,
      metav1.ListOptions{LabelSelector: discoveryv1.LabelServiceName + "=" + name})
  if err != nil {
      // handle errors
      …
  } else if len(slices.Items) == 0 {
      // No EndpointSlices exist for the Service (yet?)
      …
  }

  // process `slices.Items`
  …

Generating EndpointSlices rather than Endpoints

For people (or controllers) generating Endpoints, migrating to EndpointSlices is slightly easier, because in most cases you won't have to worry about multiple slices. You just need to update your YAML or Go code to use the new type (which organizes the information in a slightly different way than Endpoints did). For example, this Endpoints object:

  apiVersion: v1
  kind: Endpoints
  metadata:
    name: myservice
  subsets:
  - addresses:
    - ip: 10.180.3.17
      nodeName: node-4
    - ip: 10.180.5.22
      nodeName: node-9
    - ip: 10.180.18.2
      nodeName: node-7
    notReadyAddresses:
    - ip: 10.180.6.6
      nodeName: node-8
    ports:
    - name: https
      protocol: TCP
      port: 443

would become something like:

  apiVersion: discovery.k8s.io/v1
  kind: EndpointSlice
  metadata:
    name: myservice
    labels:
      kubernetes.io/service-name: myservice
  addressType: IPv4
  endpoints:
  - addresses:
    - 10.180.3.17
    nodeName: node-4
  - addresses:
    - 10.180.5.22
    nodeName: node-9
  - addresses:
    - 10.180.18.2
    nodeName: node-7
  - addresses:
    - 10.180.6.6
    nodeName: node-8
    conditions:
      ready: false
  ports:
  - name: https
    protocol: TCP
    port: 443

Some points to note:

- This example uses an explicit name, but you could also use generateName and let the API server append a unique suffix. The name itself does not matter: what matters is the "kubernetes.io/service-name" label pointing back to the Service.
- You have to explicitly indicate addressType: IPv4 (or IPv6).
- An EndpointSlice is similar to a single element of the "subsets" array in Endpoints. An Endpoints object with multiple subsets will normally need to be expressed as multiple EndpointSlices, each with different "ports".
- The endpoints and addresses fields are both arrays, but by convention, each addresses array only contains a single element. If your Service has multiple endpoints, then you need to have multiple elements in the endpoints array, each with a single element in its addresses array.
- The Endpoints API lists "ready" and "not-ready" endpoints separately, while the EndpointSlice API allows each endpoint to have conditions (such as "ready: false") associated with it.

And of course, once you have ported to EndpointSlice, you can make use of EndpointSlice-specific features, such as topology hints and terminating endpoints. Consult the EndpointSlice API documentation for more information.
-
Kubernetes v1.33: Octarine
on April 23, 2025 at 6:30 pm
Editors: Agustina Barbetta, Aakanksha Bhende, Udi Hofesh, Ryota Sawada, Sneha Yadav

Similar to previous releases, the release of Kubernetes v1.33 introduces new stable, beta, and alpha features. The consistent delivery of high-quality releases underscores the strength of our development cycle and the vibrant support from our community. This release consists of 64 enhancements. Of those enhancements, 18 have graduated to Stable, 20 are entering Beta, 24 have entered Alpha, and 2 are deprecated or withdrawn. There are also several notable deprecations and removals in this release; make sure to read about those if you already run an older version of Kubernetes.

Release theme and logo

The theme for Kubernetes v1.33 is Octarine: The Color of Magic [1], inspired by Terry Pratchett's Discworld series. This release highlights the open source magic [2] that Kubernetes enables across the ecosystem. If you're familiar with the world of Discworld, you might recognize a small swamp dragon perched atop the tower of the Unseen University, gazing up at the Kubernetes moon above the city of Ankh-Morpork with 64 stars [3] in the background.

As Kubernetes moves into its second decade, we celebrate the wizardry of its maintainers, the curiosity of new contributors, and the collaborative spirit that fuels the project. The v1.33 release is a reminder that, as Pratchett wrote, "It's still magic even if you know how it's done." Even if you know the ins and outs of the Kubernetes code base, stepping back at the end of the release cycle, you'll realize that Kubernetes remains magical.

Kubernetes v1.33 is a testament to the enduring power of open source innovation, where hundreds of contributors [4] from around the world work together to create something truly extraordinary. Behind every new feature, the Kubernetes community works to maintain and improve the project, ensuring it remains secure, reliable, and released on time. Each release builds upon the other, creating something greater than we could achieve alone.

[1] Octarine is the mythical eighth color, visible only to those attuned to the arcane—wizards, witches, and, of course, cats. And occasionally, someone who's stared at IPtable rules for too long.
[2] Any sufficiently advanced technology is indistinguishable from magic…?
[3] It's not a coincidence 64 KEPs (Kubernetes Enhancement Proposals) are also included in v1.33.
[4] See the Project Velocity section for v1.33 🚀

Spotlight on key updates

Kubernetes v1.33 is packed with new features and improvements. Here are a few select updates the Release Team would like to highlight!

Stable: Sidecar containers

The sidecar pattern involves deploying separate auxiliary container(s) to handle extra capabilities in areas such as networking, logging, and metrics gathering. Sidecar containers graduate to stable in v1.33.

Kubernetes implements sidecars as a special class of init containers with restartPolicy: Always, ensuring that sidecars start before application containers, remain running throughout the pod's lifecycle, and terminate automatically after the main containers exit. Additionally, sidecars can utilize probes (startup, readiness, liveness) to signal their operational state, and their Out-Of-Memory (OOM) score adjustments are aligned with primary containers to prevent premature termination under memory pressure. To learn more, read Sidecar Containers.

This work was done as part of KEP-753: Sidecar Containers led by SIG Node.
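As an illustrative sketch rather than an excerpt from the release notes, a native sidecar is simply an entry in initContainers with restartPolicy: Always; the names and images below are hypothetical, and the readiness probe shows how a sidecar can signal its operational state:

  apiVersion: v1
  kind: Pod
  metadata:
    name: app-with-sidecar              # hypothetical name
  spec:
    initContainers:
    - name: proxy-sidecar               # hypothetical sidecar container
      image: example.com/proxy:latest   # hypothetical image
      restartPolicy: Always             # this is what makes the init container a sidecar
      readinessProbe:                   # sidecars may define startup/readiness/liveness probes
        httpGet:
          path: /healthz
          port: 15020
    containers:
    - name: app
      image: example.com/app:latest     # hypothetical main application image
      ports:
      - containerPort: 8080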
Beta: In-place resource resize for vertical scaling of Pods

Workloads can be defined using APIs like Deployment, StatefulSet, etc. These describe the template for the Pods that should run, including memory and CPU resources, as well as the number of Pod replicas that should run. Workloads can be scaled horizontally by updating the Pod replica count, or vertically by updating the resources required in the Pod's container(s). Before this enhancement, container resources defined in a Pod's spec were immutable, and updating any of these details within a Pod template would trigger Pod replacement.

But what if you could dynamically update the resource configuration for your existing Pods without restarting them? KEP-1287 exists precisely to allow such in-place Pod updates. It was released as alpha in v1.27, and has graduated to beta in v1.33. This opens up various possibilities for vertical scale-up of stateful processes without any downtime, seamless scale-down when the traffic is low, and even allocating larger resources during startup, which can then be reduced once the initial setup is complete.

This work was done as part of KEP-1287: In-Place Update of Pod Resources led by SIG Node and SIG Autoscaling.

Alpha: New configuration option for kubectl with .kuberc for user preferences

In v1.33, kubectl introduces a new alpha feature with an opt-in configuration file, .kuberc, for user preferences. This file can contain kubectl aliases and overrides (e.g. defaulting to use server-side apply), while leaving cluster credentials and host information in kubeconfig. This separation allows sharing the same user preferences for kubectl interaction, regardless of target cluster and kubeconfig used.

To enable this alpha feature, users can set the environment variable KUBECTL_KUBERC=true and create a .kuberc configuration file. By default, kubectl looks for this file in ~/.kube/kuberc. You can also specify an alternative location using the --kuberc flag, for example: kubectl --kuberc /var/kube/rc.

This work was done as part of KEP-3104: Separate kubectl user preferences from cluster configs led by SIG CLI.

Features graduating to Stable

This is a selection of some of the improvements that are now stable following the v1.33 release.

Backoff limits per index for indexed Jobs

This release graduates a feature that allows setting backoff limits on a per-index basis for Indexed Jobs. Traditionally, the backoffLimit parameter in Kubernetes Jobs specifies the number of retries before considering the entire Job as failed. This enhancement allows each index within an Indexed Job to have its own backoff limit, providing more granular control over retry behavior for individual tasks. This ensures that the failure of specific indices does not prematurely terminate the entire Job, allowing the other indices to continue processing independently.

This work was done as part of KEP-3850: Backoff Limit Per Index For Indexed Jobs led by SIG Apps.

Job success policy

Using .spec.successPolicy, users can specify which pod indexes must succeed (succeededIndexes), how many pods must succeed (succeededCount), or a combination of both. This feature benefits various workloads, including simulations where partial completion is sufficient, and leader-worker patterns where only the leader's success determines the Job's overall outcome.

This work was done as part of KEP-3998: Job success/completion policy led by SIG Apps.
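As a hypothetical sketch combining the two Job features above (not an example from the release notes), an Indexed Job could give each index its own retry budget and declare overall success once the leader index (index 0) succeeds; the Job name and image are made up:

  apiVersion: batch/v1
  kind: Job
  metadata:
    name: leader-worker              # hypothetical Job name
  spec:
    completions: 4
    parallelism: 4
    completionMode: Indexed
    backoffLimitPerIndex: 1          # each index may retry once before that index is marked failed
    successPolicy:
      rules:
      - succeededIndexes: "0"        # the Job succeeds as soon as the leader index succeeds
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: worker
          image: example.com/worker:latest   # hypothetical image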
Bound ServiceAccount token security improvements

This enhancement introduced features such as including a unique token identifier (i.e. JWT ID Claim, also known as JTI) and node information within the tokens, enabling more precise validation and auditing. Additionally, it supports node-specific restrictions, ensuring that tokens are only usable on designated nodes, thereby reducing the risk of token misuse and potential security breaches. These improvements, now generally available, aim to enhance the overall security posture of service account tokens within Kubernetes clusters.

This work was done as part of KEP-4193: Bound service account token improvements led by SIG Auth.

Subresource support in kubectl

The --subresource argument is now generally available for kubectl subcommands such as get, patch, edit, apply and replace, allowing users to fetch and update subresources for all resources that support them. To learn more about the subresources supported, visit the kubectl reference.

This work was done as part of KEP-2590: Add subresource support to kubectl led by SIG CLI.

Multiple Service CIDRs

This enhancement introduced a new implementation of allocation logic for Service IPs. Across the whole cluster, every Service of type: ClusterIP must have a unique IP address assigned to it. Trying to create a Service with a specific cluster IP that has already been allocated will return an error. The updated IP address allocator logic uses two newly stable API objects: ServiceCIDR and IPAddress. Now generally available, these APIs allow cluster administrators to dynamically increase the number of IP addresses available for type: ClusterIP Services (by creating new ServiceCIDR objects).

This work was done as part of KEP-1880: Multiple Service CIDRs led by SIG Network.

nftables backend for kube-proxy

The nftables backend for kube-proxy is now stable, adding a new implementation that significantly improves performance and scalability for the Services implementation within Kubernetes clusters. For compatibility reasons, iptables remains the default on Linux nodes. Check the migration guide if you want to try it out.

This work was done as part of KEP-3866: nftables kube-proxy backend led by SIG Network.

Topology aware routing with trafficDistribution: PreferClose

This release graduates topology-aware routing and traffic distribution to GA, which allows you to optimize service traffic in multi-zone clusters. The topology-aware hints in EndpointSlices enable components like kube-proxy to prioritize routing traffic to endpoints within the same zone, thereby reducing latency and cross-zone data transfer costs. Building upon this, the trafficDistribution field is added to the Service specification, with the PreferClose option directing traffic to the nearest available endpoints based on network topology. This configuration enhances performance and cost-efficiency by minimizing inter-zone communication.

This work was done as part of KEP-4444: Traffic Distribution for Services and KEP-2433: Topology Aware Routing led by SIG Network.

Options to reject non SMT-aligned workload

This feature added policy options to the CPU Manager, enabling it to reject workloads that do not align with Simultaneous Multithreading (SMT) configurations.
This enhancement, now generally available, ensures that when a pod requests exclusive use of CPU cores, the CPU Manager can enforce allocation of entire core pairs (comprising primary and sibling threads) on SMT-enabled systems, thereby preventing scenarios where workloads share CPU resources in unintended ways. This work was done as part of KEP-2625: node: cpumanager: add options to reject non SMT-aligned workload led by SIG Node. Defining Pod affinity or anti-affinity using matchLabelKeys and mismatchLabelKeys The matchLabelKeys and mismatchLabelKeys fields are available in Pod affinity terms, enabling users to finely control the scope where Pods are expected to co-exist (Affinity) or not (AntiAffinity). These newly stable options complement the existing labelSelector mechanism. The affinity fields facilitate enhanced scheduling for versatile rolling updates, as well as isolation of services managed by tools or controllers based on global configurations. This work was done as part of KEP-3633: Introduce MatchLabelKeys to Pod Affinity and Pod Anti Affinity led by SIG Scheduling. Considering taints and tolerations when calculating Pod topology spread skew This enhanced PodTopologySpread by introducing two fields: nodeAffinityPolicy and nodeTaintsPolicy. These fields allow users to specify whether node affinity rules and node taints should be considered when calculating pod distribution across nodes. By default, nodeAffinityPolicy is set to Honor, meaning only nodes matching the pod’s node affinity or selector are included in the distribution calculation. The nodeTaintsPolicy defaults to Ignore, indicating that node taints are not considered unless specified. This enhancement provides finer control over pod placement, ensuring that pods are scheduled on nodes that meet both affinity and taint toleration requirements, thereby preventing scenarios where pods remain pending due to unsatisfied constraints. This work was done as part of KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew led by SIG Scheduling. Volume populators After being released as beta in v1.24, volume populators have graduated to GA in v1.33. This newly stable feature provides a way to allow users to pre-populate volumes with data from various sources, and not just from PersistentVolumeClaim (PVC) clones or volume snapshots. The mechanism relies on the dataSourceRef field within a PersistentVolumeClaim. This field offers more flexibility than the existing dataSource field, and allows for custom resources to be used as data sources. A special controller, volume-data-source-validator, validates these data source references, alongside a newly stable CustomResourceDefinition (CRD) for an API kind named VolumePopulator. The VolumePopulator API allows volume populator controllers to register the types of data sources they support. You need to set up your cluster with the appropriate CRD in order to use volume populators. This work was done as part of KEP-1495: Generic data populators led by SIG Storage. Always honor PersistentVolume reclaim policy This enhancement addressed an issue where the Persistent Volume (PV) reclaim policy is not consistently honored, leading to potential storage resource leaks. Specifically, if a PV is deleted before its associated Persistent Volume Claim (PVC), the “Delete” reclaim policy may not be executed, leaving the underlying storage assets intact. 
To mitigate this, Kubernetes now sets finalizers on relevant PVs, ensuring that the reclaim policy is enforced regardless of the deletion sequence. This enhancement prevents unintended retention of storage resources and maintains consistency in PV lifecycle management. This work was done as part of KEP-2644: Always Honor PersistentVolume Reclaim Policy led by SIG Storage. New features in Beta This is a selection of some of the improvements that are now beta following the v1.33 release. Support for Direct Service Return (DSR) in Windows kube-proxy DSR provides performance optimizations by allowing the return traffic routed through load balancers to bypass the load balancer and respond directly to the client; reducing load on the load balancer and also reducing overall latency. For information on DSR on Windows, read Direct Server Return (DSR) in a nutshell. Initially introduced in v1.14, support for DSR has been promoted to beta by SIG Windows as part of KEP-5100: Support for Direct Service Return (DSR) and overlay networking in Windows kube-proxy. Structured parameter support While structured parameter support continues as a beta feature in Kubernetes v1.33, this core part of Dynamic Resource Allocation (DRA) has seen significant improvements. A new v1beta2 version simplifies the resource.k8s.io API, and regular users with the namespaced cluster edit role can now use DRA. The kubelet now includes seamless upgrade support, enabling drivers deployed as DaemonSets to use a rolling update mechanism. For DRA implementations, this prevents the deletion and re-creation of ResourceSlices, allowing them to remain unchanged during upgrades. Additionally, a 30-second grace period has been introduced before the kubelet cleans up after unregistering a driver, providing better support for drivers that do not use rolling updates. This work was done as part of KEP-4381: DRA: structured parameters by WG Device Management, a cross-functional team including SIG Node, SIG Scheduling, and SIG Autoscaling. Dynamic Resource Allocation (DRA) for network interfaces The standardized reporting of network interface data via DRA, introduced in v1.32, has graduated to beta in v1.33. This enables more native Kubernetes network integrations, simplifying the development and management of networking devices. This was covered previously in the v1.32 release announcement blog. This work was done as part of KEP-4817: DRA: Resource Claim Status with possible standardized network interface data led by SIG Network, SIG Node, and WG Device Management. Handle unscheduled pods early when scheduler does not have any pod on activeQ This feature improves queue scheduling behavior. Behind the scenes, the scheduler achieves this by popping pods from the backoffQ, which are not backed off due to errors, when the activeQ is empty. Previously, the scheduler would become idle even when the activeQ was empty; this enhancement improves scheduling efficiency by preventing that. This work was done as part of KEP-5142: Pop pod from backoffQ when activeQ is empty led by SIG Scheduling. Asynchronous preemption in the Kubernetes Scheduler Preemption ensures higher-priority pods get the resources they need by evicting lower-priority ones. Asynchronous Preemption, introduced in v1.32 as alpha, has graduated to beta in v1.33. With this enhancement, heavy operations such as API calls to delete pods are processed in parallel, allowing the scheduler to continue scheduling other pods without delays. 
This improvement is particularly beneficial in clusters with high Pod churn or frequent scheduling failures, ensuring a more efficient and resilient scheduling process. This work was done as part of KEP-4832: Asynchronous preemption in the scheduler led by SIG Scheduling. ClusterTrustBundles ClusterTrustBundle, a cluster-scoped resource designed for holding X.509 trust anchors (root certificates), has graduated to beta in v1.33. This API makes it easier for in-cluster certificate signers to publish and communicate X.509 trust anchors to cluster workloads. This work was done as part of KEP-3257: ClusterTrustBundles (previously Trust Anchor Sets) led by SIG Auth. Fine-grained SupplementalGroups control Introduced in v1.31, this feature graduates to beta in v1.33 and is now enabled by default. Provided that your cluster has the SupplementalGroupsPolicy feature gate enabled, the supplementalGroupsPolicy field within a Pod’s securityContext supports two policies: the default Merge policy maintains backward compatibility by combining specified groups with those from the container image’s /etc/group file, whereas the new Strict policy applies only to explicitly defined groups. This enhancement helps to address security concerns where implicit group memberships from container images could lead to unintended file access permissions and bypass policy controls. This work was done as part of KEP-3619: Fine-grained SupplementalGroups control led by SIG Node. Support for mounting images as volumes Support for using Open Container Initiative (OCI) images as volumes in Pods, introduced in v1.31, has graduated to beta. This feature allows users to specify an image reference as a volume in a Pod while reusing it as a volume mount within containers. It opens up the possibility of packaging the volume data separately, and sharing them among containers in a Pod without including them in the main image, thereby reducing vulnerabilities and simplifying image creation. This work was done as part of KEP-4639: VolumeSource: OCI Artifact and/or Image led by SIG Node and SIG Storage. Support for user namespaces within Linux Pods One of the oldest open KEPs as of writing is KEP-127, Pod security improvement by using Linux User namespaces for Pods. This KEP was first opened in late 2016, and after multiple iterations, had its alpha release in v1.25, initial beta in v1.30 (where it was disabled by default), and has moved to on-by-default beta as part of v1.33. This support will not impact existing Pods unless you manually specify pod.spec.hostUsers to opt in. As highlighted in the v1.30 sneak peek blog, this is an important milestone for mitigating vulnerabilities. This work was done as part of KEP-127: Support User Namespaces in pods led by SIG Node. Pod procMount option The procMount option, introduced as alpha in v1.12, and off-by-default beta in v1.31, has moved to an on-by-default beta in v1.33. This enhancement improves Pod isolation by allowing users to fine-tune access to the /proc filesystem. Specifically, it adds a field to the Pod securityContext that lets you override the default behavior of masking and marking certain /proc paths as read-only. This is particularly useful for scenarios where users want to run unprivileged containers inside the Kubernetes Pod using user namespaces. Normally, the container runtime (via the CRI implementation) starts the outer container with strict /proc mount settings. 
However, to successfully run nested containers with an unprivileged Pod, users need a mechanism to relax those defaults, and this feature provides exactly that. This work was done as part of KEP-4265: add ProcMount option led by SIG Node. CPUManager policy to distribute CPUs across NUMA nodes This feature adds a new policy option for the CPU Manager to distribute CPUs across Non-Uniform Memory Access (NUMA) nodes, rather than concentrating them on a single node. It optimizes CPU resource allocation by balancing workloads across multiple NUMA nodes, thereby improving performance and resource utilization in multi-NUMA systems. This work was done as part of KEP-2902: Add CPUManager policy option to distribute CPUs across NUMA nodes instead of packing them led by SIG Node. Zero-second sleeps for container PreStop hooks Kubernetes 1.29 introduced a Sleep action for the preStop lifecycle hook in Pods, allowing containers to pause for a specified duration before termination. This provides a straightforward method to delay container shutdown, facilitating tasks such as connection draining or cleanup operations. The Sleep action in a preStop hook can now accept a zero-second duration as a beta feature. This allows defining a no-op preStop hook, which is useful when a preStop hook is required but no delay is desired. This work was done as part of KEP-3960: Introducing Sleep Action for PreStop Hook and KEP-4818: Allow zero value for Sleep Action of PreStop Hook led by SIG Node. Internal tooling for declarative validation of Kubernetes-native types Behind the scenes, the internals of Kubernetes are starting to use a new mechanism for validating objects and changes to objects. Kubernetes v1.33 introduces validation-gen, an internal tool that Kubernetes contributors use to generate declarative validation rules. The overall goal is to improve the robustness and maintainability of API validations by enabling developers to specify validation constraints declaratively, reducing manual coding errors and ensuring consistency across the codebase. This work was done as part of KEP-5073: Declarative Validation Of Kubernetes Native Types With validation-gen led by SIG API Machinery. New features in Alpha This is a selection of some of the improvements that are now alpha following the v1.33 release. Configurable tolerance for HorizontalPodAutoscalers This feature introduces configurable tolerance for HorizontalPodAutoscalers, which dampens scaling reactions to small metric variations. This work was done as part of KEP-4951: Configurable tolerance for Horizontal Pod Autoscalers led by SIG Autoscaling. Configurable container restart delay Introduced as alpha1 in v1.32, this feature provides a set of kubelet-level configurations to fine-tune how CrashLoopBackOff is handled. This work was done as part of KEP-4603: Tune CrashLoopBackOff led by SIG Node. Custom container stop signals Before Kubernetes v1.33, stop signals could only be set in container image definitions (for example, via the StopSignal configuration field in the image metadata). If you wanted to modify termination behavior, you needed to build a custom container image. By enabling the (alpha) ContainerStopSignals feature gate in Kubernetes v1.33, you can now define custom stop signals directly within Pod specifications. This is defined in the container’s lifecycle.stopSignal field and requires the Pod’s spec.os.name field to be present. 
If unspecified, containers fall back to the image-defined stop signal (if present), or the container runtime default (typically SIGTERM for Linux). This work was done as part of KEP-4960: Container Stop Signals led by SIG Node. DRA enhancements galore! Kubernetes v1.33 continues to develop Dynamic Resource Allocation (DRA) with features designed for today’s complex infrastructures. DRA is an API for requesting and sharing resources between pods and containers inside a pod. Typically those resources are devices such as GPUs, FPGAs, and network adapters. The following are all the alpha DRA feature gates introduced in v1.33: Similar to Node taints, by enabling the DRADeviceTaints feature gate, devices support taints and tolerations. An admin or a control plane component can taint devices to limit their usage. Scheduling of pods which depend on those devices can be paused while a taint exists and/or pods using a tainted device can be evicted. By enabling the feature gate DRAPrioritizedList, DeviceRequests get a new field named firstAvailable. This field is an ordered list that allows the user to specify that a request may be satisfied in different ways, including allocating nothing at all if some specific hardware is not available. With feature gate DRAAdminAccess enabled, only users authorized to create ResourceClaim or ResourceClaimTemplate objects in namespaces labeled with resource.k8s.io/admin-access: “true” can use the adminAccess field. This ensures that non-admin users cannot misuse the adminAccess feature. While it has been possible to consume device partitions since v1.31, vendors had to pre-partition devices and advertise them accordingly. By enabling the DRAPartitionableDevices feature gate in v1.33, device vendors can advertise multiple partitions, including overlapping ones. The Kubernetes scheduler will choose the partition based on workload requests, and prevent the allocation of conflicting partitions simultaneously. This feature gives vendors the ability to dynamically create partitions at allocation time. The allocation and dynamic partitioning are automatic and transparent to users, enabling improved resource utilization. These feature gates have no effect unless you also enable the DynamicResourceAllocation feature gate. This work was done as part of KEP-5055: DRA: device taints and tolerations, KEP-4816: DRA: Prioritized Alternatives in Device Requests, KEP-5018: DRA: AdminAccess for ResourceClaims and ResourceClaimTemplates, and KEP-4815: DRA: Add support for partitionable devices, led by SIG Node, SIG Scheduling and SIG Auth. Robust image pull policy to authenticate images for IfNotPresent and Never This feature allows users to ensure that kubelet requires an image pull authentication check for each new set of credentials, regardless of whether the image is already present on the node. This work was done as part of KEP-2535: Ensure secret pulled images led by SIG Auth. Node topology labels are available via downward API This feature enables Node topology labels to be exposed via the downward API. Prior to Kubernetes v1.33, a workaround involved using an init container to query the Kubernetes API for the underlying node; this alpha feature simplifies how workloads can access Node topology information. This work was done as part of KEP-4742: Expose Node labels via downward API led by SIG Node. Better pod status with generation and observed generation Prior to this change, the metadata.generation field was unused in pods. 
Along with extending to support metadata.generation, this feature will introduce status.observedGeneration to provide clearer pod status.

This work was done as part of KEP-5067: Pod Generation led by SIG Node.

Support for split level 3 cache architecture with kubelet's CPU Manager

Previously, the kubelet's CPU Manager was unaware of split L3 cache architecture (also known as Last Level Cache, or LLC), and could potentially distribute CPU assignments without considering the split L3 cache, causing a noisy neighbor problem. This alpha feature improves the CPU Manager to better assign CPU cores for better performance.

This work was done as part of KEP-5109: Split L3 Cache Topology Awareness in CPU Manager led by SIG Node.

PSI (Pressure Stall Information) metrics for scheduling improvements

This feature adds support on Linux nodes for providing PSI stats and metrics using cgroupv2. It can detect resource shortages and provide nodes with more granular control for pod scheduling.

This work was done as part of KEP-4205: Support PSI based on cgroupv2 led by SIG Node.

Secret-less image pulls with kubelet

The kubelet's on-disk credential provider now supports optional Kubernetes ServiceAccount (SA) token fetching. This simplifies authentication with image registries by allowing cloud providers to better integrate with OIDC compatible identity solutions.

This work was done as part of KEP-4412: Projected service account tokens for Kubelet image credential providers led by SIG Auth.

Graduations, deprecations, and removals in v1.33

Graduations to stable

This lists all the features that have graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes. This release includes a total of 18 enhancements promoted to stable:

- Take taints/tolerations into consideration when calculating PodTopologySpread skew
- Introduce MatchLabelKeys to Pod Affinity and Pod Anti Affinity
- Bound service account token improvements
- Generic data populators
- Multiple Service CIDRs
- Topology Aware Routing
- Portworx file in-tree to CSI driver migration
- Always Honor PersistentVolume Reclaim Policy
- nftables kube-proxy backend
- Deprecate status.nodeInfo.kubeProxyVersion field
- Add subresource support to kubectl
- Backoff Limit Per Index For Indexed Jobs
- Job success/completion policy
- Sidecar Containers
- CRD Validation Ratcheting
- node: cpumanager: add options to reject non SMT-aligned workload
- Traffic Distribution for Services
- Recursive Read-only (RRO) mounts

Deprecations and removals

As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better ones to improve the project's overall health. See the Kubernetes deprecation and removal policy for more details on this process. Many of these deprecations and removals were announced in the Deprecations and Removals blog post.

Deprecation of the stable Endpoints API

The EndpointSlices API has been stable since v1.21, which effectively replaced the original Endpoints API. While the original Endpoints API was simple and straightforward, it also posed some challenges when scaling to large numbers of network endpoints. The EndpointSlices API has introduced new features such as dual-stack networking, making the original Endpoints API ready for deprecation. This deprecation affects only those who use the Endpoints API directly from workloads or scripts; these users should migrate to use EndpointSlices instead.
There will be a dedicated blog post with more details on the deprecation implications and migration plans. You can find more in KEP-4974: Deprecate v1.Endpoints. Removal of kube-proxy version information in node status Following its deprecation in v1.31, as highlighted in the v1.31 release announcement, the .status.nodeInfo.kubeProxyVersion field for Nodes was removed in v1.33. This field was set by kubelet, but its value was not consistently accurate. As it has been disabled by default since v1.31, this field has been removed entirely in v1.33. You can find more in KEP-4004: Deprecate status.nodeInfo.kubeProxyVersion field. Removal of in-tree gitRepo volume driver The gitRepo volume type has been deprecated since v1.11, nearly 7 years ago. Since its deprecation, there have been security concerns, including how gitRepo volume types can be exploited to gain remote code execution as root on the nodes. In v1.33, the in-tree driver code is removed. There are alternatives such as git-sync and initContainers. gitVolumes in the Kubernetes API is not removed, and thus pods with gitRepo volumes will be admitted by kube-apiserver, but kubelets with the feature-gate GitRepoVolumeDriver set to false will not run them and return an appropriate error to the user. This allows users to opt-in to re-enabling the driver for 3 versions to give them enough time to fix workloads. The feature gate in kubelet and in-tree plugin code is planned to be removed in the v1.39 release. You can find more in KEP-5040: Remove gitRepo volume driver. Removal of host network support for Windows pods Windows Pod networking aimed to achieve feature parity with Linux and provide better cluster density by allowing containers to use the Node’s networking namespace. The original implementation landed as alpha with v1.26, but because it faced unexpected containerd behaviours and alternative solutions were available, the Kubernetes project has decided to withdraw the associated KEP. Support was fully removed in v1.33. Please note that this does not affect HostProcess containers, which provides host network as well as host level access. The KEP withdrawn in v1.33 was about providing the host network only, which was never stable due to technical limitations with Windows networking logic. You can find more in KEP-3503: Host network support for Windows pods. Release notes Check out the full details of the Kubernetes v1.33 release in our release notes. Availability Kubernetes v1.33 is available for download on GitHub or on the Kubernetes download page. To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.33 using kubeadm. Release Team Kubernetes is only possible with the support, commitment, and hard work of its community. Release Team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management. We would like to thank the entire Release Team for the hours spent hard at work to deliver the Kubernetes v1.33 release to our community. The Release Team’s membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. There was a new team structure adopted in this release cycle, which was to combine Release Notes and Docs subteams into a unified subteam of Docs. 
Thanks to the meticulous effort in organizing the relevant information and resources from the new Docs team, both Release Notes and Docs tracking have seen a smooth and successful transition.

Finally, a very special thanks goes out to our release lead, Nina Polshakova, for her support throughout a successful release cycle, her advocacy, her efforts to ensure that everyone could contribute effectively, and her challenges to improve the release process.

Project velocity

The CNCF K8s DevStats project aggregates several interesting data points related to the velocity of Kubernetes and various subprojects. This includes everything from individual contributions, to the number of companies contributing, and illustrates the depth and breadth of effort that goes into evolving this ecosystem.

During the v1.33 release cycle, which spanned 15 weeks from January 13 to April 23, 2025, Kubernetes received contributions from as many as 121 different companies and 570 individuals (as of writing, a few weeks before the release date). In the wider cloud native ecosystem, the figure goes up to 435 companies counting 2400 total contributors. You can find the data source in this dashboard. Compared to the velocity data from the previous release, v1.32, we see a similar level of contribution from companies and individuals, indicating strong community interest and engagement.

Note that "contribution" counts when someone makes a commit, code review, comment, creates an issue or PR, reviews a PR (including blogs and documentation) or comments on issues and PRs. If you are interested in contributing, visit Getting Started on our contributor website. Check out DevStats to learn more about the overall velocity of the Kubernetes project and community.

Event update

Explore upcoming Kubernetes and cloud native events, including KubeCon + CloudNativeCon, KCD, and other notable conferences worldwide. Stay informed and get involved with the Kubernetes community!

May 2025
- KCD – Kubernetes Community Days: Costa Rica: May 3, 2025 | Heredia, Costa Rica
- KCD – Kubernetes Community Days: Helsinki: May 6, 2025 | Helsinki, Finland
- KCD – Kubernetes Community Days: Texas Austin: May 15, 2025 | Austin, USA
- KCD – Kubernetes Community Days: Seoul: May 22, 2025 | Seoul, South Korea
- KCD – Kubernetes Community Days: Istanbul, Turkey: May 23, 2025 | Istanbul, Turkey
- KCD – Kubernetes Community Days: San Francisco Bay Area: May 28, 2025 | San Francisco, USA

June 2025
- KCD – Kubernetes Community Days: New York: June 4, 2025 | New York, USA
- KCD – Kubernetes Community Days: Czech & Slovak: June 5, 2025 | Bratislava, Slovakia
- KCD – Kubernetes Community Days: Bengaluru: June 6, 2025 | Bangalore, India
- KubeCon + CloudNativeCon China 2025: June 10-11, 2025 | Hong Kong
- KCD – Kubernetes Community Days: Antigua Guatemala: June 14, 2025 | Antigua Guatemala, Guatemala
- KubeCon + CloudNativeCon Japan 2025: June 16-17, 2025 | Tokyo, Japan
- KCD – Kubernetes Community Days: Nigeria, Africa: June 19, 2025 | Nigeria, Africa

July 2025
- KCD – Kubernetes Community Days: Utrecht: July 4, 2025 | Utrecht, Netherlands
- KCD – Kubernetes Community Days: Taipei: July 5, 2025 | Taipei, Taiwan
- KCD – Kubernetes Community Days: Lima, Peru: July 19, 2025 | Lima, Peru

August 2025
- KubeCon + CloudNativeCon India 2025: August 6-7, 2025 | Hyderabad, India
- KCD – Kubernetes Community Days: Colombia: August 29, 2025 | Bogotá, Colombia

You can find the latest KCD details here.
Upcoming release webinar
Join members of the Kubernetes v1.33 Release Team on Friday, May 16th 2025 at 4:00 PM (UTC) to learn about the highlights of this release, as well as the deprecations and removals, to help you plan for upgrades. For more information and registration, visit the event page on the CNCF Online Programs site.

Get involved
The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you'd like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.
Follow us on Bluesky @kubernetes.io for the latest updates
Join the community discussion on Discuss
Join the community on Slack
Post questions (or answer questions) on Server Fault or Stack Overflow
Share your Kubernetes story
Read more about what's happening with Kubernetes on the blog
Learn more about the Kubernetes Release Team
-
Kubernetes Multicontainer Pods: An Overview
on April 22, 2025 at 12:00 am
As cloud-native architectures continue to evolve, Kubernetes has become the go-to platform for deploying complex, distributed systems. One of the most powerful yet nuanced design patterns in this ecosystem is the sidecar pattern: a technique that allows developers to extend application functionality without diving deep into source code.

The origins of the sidecar pattern
Think of a sidecar like a trusty companion motorcycle attachment. Historically, IT infrastructures have always used auxiliary services to handle critical tasks. Before containers, we relied on background processes and helper daemons to manage logging, monitoring, and networking. The microservices revolution transformed this approach, making sidecars a structured and intentional architectural choice. With the rise of microservices, the sidecar pattern became more clearly defined, allowing developers to offload specific responsibilities from the main service without altering its code. Service meshes like Istio and Linkerd have popularized sidecar proxies, demonstrating how these companion containers can elegantly handle observability, security, and traffic management in distributed systems.

Kubernetes implementation
In Kubernetes, sidecar containers operate within the same Pod as the main application, enabling communication and resource sharing. Does this sound just like defining multiple containers alongside each other inside the Pod? It actually does, and this is how sidecar containers had to be implemented before Kubernetes v1.29.0, which introduced native support for sidecars. Sidecar containers can now be defined within a Pod manifest using the spec.initContainers field. What makes it a sidecar container is that you specify it with restartPolicy: Always. You can see an example of this below, which is a partial snippet of the full Kubernetes manifest:

initContainers:
  - name: logshipper
    image: alpine:latest
    restartPolicy: Always
    command: ['sh', '-c', 'tail -F /opt/logs.txt']
    volumeMounts:
      - name: data
        mountPath: /opt

That field name, spec.initContainers, may sound confusing. Why does defining a sidecar container mean adding an entry to the spec.initContainers array? Classic init containers run to completion just before the main application starts, so they are one-off, whereas sidecars often run in parallel to the main app container. It is the restartPolicy: Always on an entry in spec.initContainers that distinguishes a Kubernetes-native sidecar container from a classic init container and ensures it is kept running. A fuller, hypothetical manifest is sketched below.
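To put the partial snippet above in context, here is a minimal sketch of a complete Pod manifest. Only the logshipper sidecar entry comes from the snippet; the Pod name, the main container named app, its placeholder image and command, and the emptyDir volume are illustrative assumptions.

apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar        # hypothetical Pod name
spec:
  initContainers:
    - name: logshipper          # native sidecar: an init container with restartPolicy: Always
      image: alpine:latest
      restartPolicy: Always
      command: ['sh', '-c', 'tail -F /opt/logs.txt']
      volumeMounts:
        - name: data
          mountPath: /opt
  containers:
    - name: app                 # illustrative main application container
      image: alpine:latest      # placeholder; a real application image would go here
      command: ['sh', '-c', 'while true; do date >> /opt/logs.txt; sleep 5; done']
      volumeMounts:
        - name: data
          mountPath: /opt
  volumes:
    - name: data
      emptyDir: {}              # shared scratch volume so the sidecar can tail the app's log file

Because the logshipper entry sits under spec.initContainers, it starts before the app container; because of restartPolicy: Always, it keeps running alongside the app and is restarted if it ever exits.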
When to embrace (or avoid) sidecars
While the sidecar pattern can be useful in many cases, it is generally not the preferred approach unless the use case justifies it. Adding a sidecar increases complexity, resource consumption, and potential network latency. Instead, simpler alternatives such as built-in libraries or shared infrastructure should be considered first.

Deploy a sidecar when:
You need to extend application functionality without touching the original code
Implementing cross-cutting concerns like logging, monitoring or security
Working with legacy applications requiring modern networking capabilities
Designing microservices that demand independent scaling and updates

Proceed with caution if:
Resource efficiency is your primary concern
Minimal network latency is critical
Simpler alternatives exist
You want to minimize troubleshooting complexity

Four essential multi-container patterns

Init container pattern
The init container pattern is used to execute (often critical) setup tasks before the main application container starts. Unlike regular containers, init containers run to completion and then terminate, ensuring that the preconditions for the main application are met (a minimal sketch follows at the end of this post).
Ideal for:
Preparing configurations
Loading secrets
Verifying dependency availability
Running database migrations
The init container ensures your application starts in a predictable, controlled environment without code modifications.

Ambassador pattern
An ambassador container provides Pod-local helper services that expose a simple way to access a network service. Commonly, ambassador containers send network requests on behalf of an application container and take care of challenges such as service discovery, peer identity verification, or encryption in transit.
Perfect when you need to:
Offload client connectivity concerns
Implement language-agnostic networking features
Add security layers like TLS
Create robust circuit breakers and retry mechanisms

Configuration helper
A configuration helper sidecar provides configuration updates to an application dynamically, ensuring it always has access to the latest settings without disrupting the service. Often the helper needs to provide an initial configuration before the application can start successfully.
Use cases:
Fetching environment variables and secrets
Polling for configuration changes
Decoupling configuration management from application logic

Adapter pattern
An adapter (or sometimes façade) container enables interoperability between the main application container and external services. It does this by translating data formats, protocols, or APIs.
Strengths:
Transforming legacy data formats
Bridging communication protocols
Facilitating integration between mismatched services

Wrap-up
While sidecar patterns offer tremendous flexibility, they are not a silver bullet. Each added sidecar introduces complexity, consumes resources, and potentially increases operational overhead. Always evaluate simpler alternatives first. The key is strategic implementation: use sidecars as precision tools to solve specific architectural challenges, not as a default approach. When used correctly, they can improve security, networking, and configuration management in containerized environments. Choose wisely, implement carefully, and let your sidecars elevate your container ecosystem.
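As referenced in the init container pattern description above, here is a minimal sketch of a classic init container that blocks the main application until a dependency is resolvable. The Pod name, images, the my-database Service name, and the commands are illustrative assumptions rather than content from the post.

apiVersion: v1
kind: Pod
metadata:
  name: app-with-init            # hypothetical Pod name
spec:
  initContainers:
    - name: wait-for-db          # classic init container: no restartPolicy: Always, runs once to completion
      image: busybox:1.28
      command: ['sh', '-c', 'until nslookup my-database; do echo waiting for my-database; sleep 2; done']
  containers:
    - name: app
      image: alpine:latest       # placeholder for the real application image
      command: ['sh', '-c', 'echo dependency is resolvable, starting app; sleep 3600']

Unlike the sidecar sketch earlier, the kubelet waits for wait-for-db to exit successfully before starting the app container, which is exactly the run-to-completion behaviour described above.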
-
Introducing kube-scheduler-simulator
on April 7, 2025 at 12:00 am
The Kubernetes Scheduler is a crucial control plane component that determines which node a Pod will run on. Thus, anyone utilizing Kubernetes relies on a scheduler. kube-scheduler-simulator is a simulator for the Kubernetes scheduler that started as a Google Summer of Code 2021 project developed by me (Kensei Nakada) and later received a lot of contributions. This tool allows users to closely examine the scheduler's behavior and decisions. It is useful for casual users who employ scheduling constraints (for example, inter-Pod affinity) and experts who extend the scheduler with custom plugins.

Motivation
The scheduler often appears as a black box, composed of many plugins that each contribute to the scheduling decision-making process from their unique perspectives. Understanding its behavior can be challenging due to the multitude of factors it considers. Even if a Pod appears to be scheduled correctly in a simple test cluster, it might have been scheduled based on different calculations than expected. This discrepancy could lead to unexpected scheduling outcomes when deployed in a large production environment. Also, testing a scheduler is a complex challenge. There are countless patterns of operations executed within a real cluster, making it unfeasible to anticipate every scenario with a finite number of tests. More often than not, bugs are discovered only when the scheduler is deployed in an actual cluster. Actually, many bugs are found by users after shipping the release, even in the upstream kube-scheduler. Having a development or sandbox environment for testing the scheduler (or, indeed, any Kubernetes controllers) is a common practice. However, this approach falls short of capturing all the potential scenarios that might arise in a production cluster, because a development cluster is often much smaller, with notable differences in workload sizes and scaling dynamics. It never sees exactly the same use or exhibits the same behavior as its production counterpart. The kube-scheduler-simulator aims to solve those problems. It enables users to test their scheduling constraints, scheduler configurations, and custom plugins while checking every detailed part of scheduling decisions. It also allows users to create a simulated cluster environment, where they can test their scheduler with the same resources as their production cluster without affecting actual workloads.

Features of the kube-scheduler-simulator
The kube-scheduler-simulator's core feature is its ability to expose the scheduler's internal decisions. The scheduler operates based on the scheduling framework, using various plugins at different extension points to filter nodes (Filter phase), score nodes (Score phase), and ultimately determine the best node for the Pod. The simulator allows users to create Kubernetes resources and observe how each plugin influences the scheduling decisions for Pods. This visibility helps users understand the scheduler's workings and define appropriate scheduling constraints.

The simulator web frontend
Inside the simulator, a debuggable scheduler runs instead of the vanilla scheduler. This debuggable scheduler outputs the results of each scheduler plugin at every extension point to the Pod's annotations, as the following manifest shows, and the web frontend formats and visualizes the scheduling results based on these annotations.

kind: Pod
apiVersion: v1
metadata:
  # The JSONs within these annotations are manually formatted for clarity in the blog post.
  annotations:
    kube-scheduler-simulator.sigs.k8s.io/bind-result: '{"DefaultBinder":"success"}'
    kube-scheduler-simulator.sigs.k8s.io/filter-result: >-
      {
        "node-jjfg5":{
          "NodeName":"passed",
          "NodeResourcesFit":"passed",
          "NodeUnschedulable":"passed",
          "TaintToleration":"passed"
        },
        "node-mtb5x":{
          "NodeName":"passed",
          "NodeResourcesFit":"passed",
          "NodeUnschedulable":"passed",
          "TaintToleration":"passed"
        }
      }
    kube-scheduler-simulator.sigs.k8s.io/finalscore-result: >-
      {
        "node-jjfg5":{
          "ImageLocality":"0",
          "NodeAffinity":"0",
          "NodeResourcesBalancedAllocation":"52",
          "NodeResourcesFit":"47",
          "TaintToleration":"300",
          "VolumeBinding":"0"
        },
        "node-mtb5x":{
          "ImageLocality":"0",
          "NodeAffinity":"0",
          "NodeResourcesBalancedAllocation":"76",
          "NodeResourcesFit":"73",
          "TaintToleration":"300",
          "VolumeBinding":"0"
        }
      }
    kube-scheduler-simulator.sigs.k8s.io/permit-result: '{}'
    kube-scheduler-simulator.sigs.k8s.io/permit-result-timeout: '{}'
    kube-scheduler-simulator.sigs.k8s.io/postfilter-result: '{}'
    kube-scheduler-simulator.sigs.k8s.io/prebind-result: '{"VolumeBinding":"success"}'
    kube-scheduler-simulator.sigs.k8s.io/prefilter-result: '{}'
    kube-scheduler-simulator.sigs.k8s.io/prefilter-result-status: >-
      {
        "AzureDiskLimits":"",
        "EBSLimits":"",
        "GCEPDLimits":"",
        "InterPodAffinity":"",
        "NodeAffinity":"",
        "NodePorts":"",
        "NodeResourcesFit":"success",
        "NodeVolumeLimits":"",
        "PodTopologySpread":"",
        "VolumeBinding":"",
        "VolumeRestrictions":"",
        "VolumeZone":""
      }
    kube-scheduler-simulator.sigs.k8s.io/prescore-result: >-
      {
        "InterPodAffinity":"",
        "NodeAffinity":"success",
        "NodeResourcesBalancedAllocation":"success",
        "NodeResourcesFit":"success",
        "PodTopologySpread":"",
        "TaintToleration":"success"
      }
    kube-scheduler-simulator.sigs.k8s.io/reserve-result: '{"VolumeBinding":"success"}'
    kube-scheduler-simulator.sigs.k8s.io/result-history: >-
      [
        {
          "kube-scheduler-simulator.sigs.k8s.io/bind-result":"{\"DefaultBinder\":\"success\"}",
          "kube-scheduler-simulator.sigs.k8s.io/filter-result":"{\"node-jjfg5\":{\"NodeName\":\"passed\",\"NodeResourcesFit\":\"passed\",\"NodeUnschedulable\":\"passed\",\"TaintToleration\":\"passed\"},\"node-mtb5x\":{\"NodeName\":\"passed\",\"NodeResourcesFit\":\"passed\",\"NodeUnschedulable\":\"passed\",\"TaintToleration\":\"passed\"}}",
          "kube-scheduler-simulator.sigs.k8s.io/finalscore-result":"{\"node-jjfg5\":{\"ImageLocality\":\"0\",\"NodeAffinity\":\"0\",\"NodeResourcesBalancedAllocation\":\"52\",\"NodeResourcesFit\":\"47\",\"TaintToleration\":\"300\",\"VolumeBinding\":\"0\"},\"node-mtb5x\":{\"ImageLocality\":\"0\",\"NodeAffinity\":\"0\",\"NodeResourcesBalancedAllocation\":\"76\",\"NodeResourcesFit\":\"73\",\"TaintToleration\":\"300\",\"VolumeBinding\":\"0\"}}",
          "kube-scheduler-simulator.sigs.k8s.io/permit-result":"{}",
          "kube-scheduler-simulator.sigs.k8s.io/permit-result-timeout":"{}",
          "kube-scheduler-simulator.sigs.k8s.io/postfilter-result":"{}",
          "kube-scheduler-simulator.sigs.k8s.io/prebind-result":"{\"VolumeBinding\":\"success\"}",
          "kube-scheduler-simulator.sigs.k8s.io/prefilter-result":"{}",
          "kube-scheduler-simulator.sigs.k8s.io/prefilter-result-status":"{\"AzureDiskLimits\":\"\",\"EBSLimits\":\"\",\"GCEPDLimits\":\"\",\"InterPodAffinity\":\"\",\"NodeAffinity\":\"\",\"NodePorts\":\"\",\"NodeResourcesFit\":\"success\",\"NodeVolumeLimits\":\"\",\"PodTopologySpread\":\"\",\"VolumeBinding\":\"\",\"VolumeRestrictions\":\"\",\"VolumeZone\":\"\"}",
          "kube-scheduler-simulator.sigs.k8s.io/prescore-result":"{\"InterPodAffinity\":\"\",\"NodeAffinity\":\"success\",\"NodeResourcesBalancedAllocation\":\"success\",\"NodeResourcesFit\":\"success\",\"PodTopologySpread\":\"\",\"TaintToleration\":\"success\"}",
          "kube-scheduler-simulator.sigs.k8s.io/reserve-result":"{\"VolumeBinding\":\"success\"}",
          "kube-scheduler-simulator.sigs.k8s.io/score-result":"{\"node-jjfg5\":{\"ImageLocality\":\"0\",\"NodeAffinity\":\"0\",\"NodeResourcesBalancedAllocation\":\"52\",\"NodeResourcesFit\":\"47\",\"TaintToleration\":\"0\",\"VolumeBinding\":\"0\"},\"node-mtb5x\":{\"ImageLocality\":\"0\",\"NodeAffinity\":\"0\",\"NodeResourcesBalancedAllocation\":\"76\",\"NodeResourcesFit\":\"73\",\"TaintToleration\":\"0\",\"VolumeBinding\":\"0\"}}",
          "kube-scheduler-simulator.sigs.k8s.io/selected-node":"node-mtb5x"
        }
      ]
    kube-scheduler-simulator.sigs.k8s.io/score-result: >-
      {
        "node-jjfg5":{
          "ImageLocality":"0",
          "NodeAffinity":"0",
          "NodeResourcesBalancedAllocation":"52",
          "NodeResourcesFit":"47",
          "TaintToleration":"0",
          "VolumeBinding":"0"
        },
        "node-mtb5x":{
          "ImageLocality":"0",
          "NodeAffinity":"0",
          "NodeResourcesBalancedAllocation":"76",
          "NodeResourcesFit":"73",
          "TaintToleration":"0",
          "VolumeBinding":"0"
        }
      }
    kube-scheduler-simulator.sigs.k8s.io/selected-node: node-mtb5x

Users can also integrate their custom plugins or extenders into the debuggable scheduler and visualize their results. This debuggable scheduler can also run standalone, for example, on any Kubernetes cluster or in integration tests. This would be useful for custom plugin developers who want to test their plugins or examine their custom scheduler in a real cluster with better debuggability (a small kubectl sketch for reading these result annotations appears at the end of this post).

The simulator as a better dev cluster
As mentioned earlier, with a limited set of tests, it is impossible to predict every possible scenario in a real-world cluster. Typically, users will test the scheduler in a small development cluster before deploying it to production, hoping that no issues arise. The simulator's importing feature provides a solution by allowing users to simulate deploying a new scheduler version in a production-like environment without impacting their live workloads. By continuously syncing between a production cluster and the simulator, users can safely test a new scheduler version with the same resources their production cluster handles. Once confident in its performance, they can proceed with the production deployment, reducing the risk of unexpected issues.

What are the use cases?
Cluster users: Examine whether scheduling constraints (for example, PodAffinity, PodTopologySpread) work as intended.
Cluster admins: Assess how a cluster would behave with changes to the scheduler configuration.
Scheduler plugin developers: Test custom scheduler plugins or extenders, use the debuggable scheduler in integration tests or development clusters, or use the syncing feature for testing within a production-like environment.

Getting started
The simulator only requires Docker to be installed on a machine; a Kubernetes cluster is not necessary.

git clone git@github.com:kubernetes-sigs/kube-scheduler-simulator.git
cd kube-scheduler-simulator
make docker_up

You can then access the simulator's web UI at http://localhost:3000. Visit the kube-scheduler-simulator repository for more details!

Getting involved
The scheduler simulator is developed by Kubernetes SIG Scheduling. Your feedback and contributions are welcome! Open issues or PRs at the kube-scheduler-simulator repository.
Join the conversation on the #sig-scheduling Slack channel.

Acknowledgments
The simulator has been maintained by dedicated volunteer engineers, overcoming many challenges to reach its current form. A big shout out to all the awesome contributors!
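As referenced earlier in this post, here is a small, hypothetical sketch of reading the debuggable scheduler's result annotations with kubectl. It assumes the debuggable scheduler is running standalone in a cluster that your current kubectl context points at, and that a Pod has already been scheduled by it; the Pod name sample-pod is a placeholder.

# Show the whole Pod, including the kube-scheduler-simulator.sigs.k8s.io/* annotations
kubectl get pod sample-pod -o yaml

# Pull out just the node the debuggable scheduler selected;
# dots inside the annotation key are escaped for jsonpath
kubectl get pod sample-pod -o jsonpath='{.metadata.annotations.kube-scheduler-simulator\.sigs\.k8s\.io/selected-node}'

The simulator's web frontend renders the same annotations graphically; this is just a quick way to inspect them from the command line.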