Kubernetes Blog The Kubernetes blog is used by the project to communicate new features, community reports, and any news that might be relevant to the Kubernetes community.

  • Kubernetes v1.33 sneak peek
    on March 26, 2025 at 6:30 pm

    As the release of Kubernetes v1.33 approaches, the Kubernetes project continues to evolve. Features may be deprecated, removed, or replaced to improve the overall health of the project. This blog post outlines some planned changes for the v1.33 release, which the release team believes you should be aware of to ensure the continued smooth operation of your Kubernetes environment and to keep you up-to-date with the latest developments. The information below is based on the current status of the v1.33 release and is subject to change before the final release date. The Kubernetes API removal and deprecation process The Kubernetes project has a well-documented deprecation policy for features. This policy states that stable APIs may only be deprecated when a newer, stable version of that same API is available and that APIs have a minimum lifetime for each stability level. A deprecated API has been marked for removal in a future Kubernetes release. It will continue to function until removal (at least one year from the deprecation), but usage will result in a warning being displayed. Removed APIs are no longer available in the current version, at which point you must migrate to using the replacement. Generally available (GA) or stable API versions may be marked as deprecated but must not be removed within a major version of Kubernetes. Beta or pre-release API versions must be supported for 3 releases after the deprecation. Alpha or experimental API versions may be removed in any release without prior deprecation notice; this process can become a withdrawal in cases where a different implementation for the same feature is already in place. Whether an API is removed as a result of a feature graduating from beta to stable, or because that API simply did not succeed, all removals comply with this deprecation policy. Whenever an API is removed, migration options are communicated in the deprecation guide. Deprecations and removals for Kubernetes v1.33 Deprecation of the stable Endpoints API The EndpointSlices API has been stable since v1.21, which effectively replaced the original Endpoints API. While the original Endpoints API was simple and straightforward, it also posed some challenges when scaling to large numbers of network endpoints. The EndpointSlices API has introduced new features such as dual-stack networking, making the original Endpoints API ready for deprecation. This deprecation only impacts those who use the Endpoints API directly from workloads or scripts; these users should migrate to use EndpointSlices instead. There will be a dedicated blog post with more details on the deprecation implications and migration plans in the coming weeks. You can find more in KEP-4974: Deprecate v1.Endpoints. Removal of kube-proxy version information in node status Following its deprecation in v1.31, as highlighted in the release announcement, the status.nodeInfo.kubeProxyVersion field will be removed in v1.33. This field was set by kubelet, but its value was not consistently accurate. As it has been disabled by default since v1.31, the v1.33 release will remove this field entirely. You can find more in KEP-4004: Deprecate status.nodeInfo.kubeProxyVersion field. Removal of host network support for Windows pods Windows Pod networking aimed to achieve feature parity with Linux and provide better cluster density by allowing containers to use the Node’s networking namespace. The original implementation landed as alpha with v1.26, but as it faced unexpected containerd behaviours, and alternative solutions were available, the Kubernetes project has decided to withdraw the associated KEP. We’re expecting to see support fully removed in v1.33. You can find more in KEP-3503: Host network support for Windows pods. Featured improvement of Kubernetes v1.33 As authors of this article, we picked one improvement as the most significant change to call out! Support for user namespaces within Linux Pods One of the oldest open KEPs today is KEP-127, Pod security improvement by using Linux User namespaces for Pods. This KEP was first opened in late 2016, and after multiple iterations, had its alpha release in v1.25, initial beta in v1.30 (where it was disabled by default), and now is set to be a part of v1.33, where the feature is available by default. This support will not impact existing Pods unless you manually specify pod.spec.hostUsers to opt in. As highlighted in the v1.30 sneak peek blog, this is an important milestone for mitigating vulnerabilities. You can find more in KEP-127: Support User Namespaces in pods. Selected other Kubernetes v1.33 improvements The following list of enhancements is likely to be included in the upcoming v1.33 release. This is not a commitment and the release content is subject to change. In-place resource resize for vertical scaling of Pods When provisioning a Pod, you can use various resources such as Deployment, StatefulSet, etc. Scalability requirements may need horizontal scaling by updating the Pod replica count, or vertical scaling by updating resources allocated to Pod’s container(s). Before this enhancement, container resources defined in a Pod’s spec were immutable, and updating any of these details within a Pod template would trigger Pod replacement. But what if you could dynamically update the resource configuration for your existing Pods without restarting them? The KEP-1287 is precisely to allow such in-place Pod updates. It opens up various possibilities of vertical scale-up for stateful processes without any downtime, seamless scale-down when the traffic is low, and even allocating larger resources during startup that is eventually reduced once the initial setup is complete. This was released as alpha in v1.27, and is expected to land as beta in v1.33. You can find more in KEP-1287: In-Place Update of Pod Resources. DRA’s ResourceClaim Device Status graduates to beta The devices field in ResourceClaim status, originally introduced in the v1.32 release, is likely to graduate to beta in v1.33. This field allows drivers to report device status data, improving both observability and troubleshooting capabilities. For example, reporting the interface name, MAC address, and IP addresses of network interfaces in the status of a ResourceClaim can significantly help in configuring and managing network services, as well as in debugging network related issues. You can read more about ResourceClaim Device Status in Dynamic Resource Allocation: ResourceClaim Device Status document. Also, you can find more about the planned enhancement in KEP-4817: DRA: Resource Claim Status with possible standardized network interface data. Ordered namespace deletion This KEP introduces a more structured deletion process for Kubernetes namespaces to ensure secure and deterministic resource removal. The current semi-random deletion order can create security gaps or unintended behaviour, such as Pods persisting after their associated NetworkPolicies are deleted. By enforcing a structured deletion sequence that respects logical and security dependencies, this approach ensures Pods are removed before other resources. The design improves Kubernetes’s security and reliability by mitigating risks associated with non-deterministic deletions. You can find more in KEP-5080: Ordered namespace deletion. Enhancements for indexed job management These two KEPs are both set to graduate to GA to provide better reliability for job handling, specifically for indexed jobs. KEP-3850 provides per-index backoff limits for indexed jobs, which allows each index to be fully independent of other indexes. Also, KEP-3998 extends Job API to define conditions for making an indexed job as successfully completed when not all indexes are succeeded. You can find more in KEP-3850: Backoff Limit Per Index For Indexed Jobs and KEP-3998: Job success/completion policy. Want to know more? New features and deprecations are also announced in the Kubernetes release notes. We will formally announce what’s new in Kubernetes v1.33 as part of the CHANGELOG for that release. Kubernetes v1.33 release is planned for Wednesday, 23rd April, 2025. Stay tuned for updates! You can also see the announcements of changes in the release notes for: Kubernetes v1.32 Kubernetes v1.31 Kubernetes v1.30 Get involved The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support. Follow us on Bluesky @kubernetes.io for the latest updates Join the community discussion on Discuss Join the community on Slack Post questions (or answer questions) on Server Fault or Stack Overflow Share your Kubernetes story Read more about what’s happening with Kubernetes on the blog Learn more about the Kubernetes Release Team

  • Fresh Swap Features for Linux Users in Kubernetes 1.32
    on March 25, 2025 at 6:00 pm

    Swap is a fundamental and an invaluable Linux feature. It offers numerous benefits, such as effectively increasing a node’s memory by swapping out unused data, shielding nodes from system-level memory spikes, preventing Pods from crashing when they hit their memory limits, and much more. As a result, the node special interest group within the Kubernetes project has invested significant effort into supporting swap on Linux nodes. The 1.22 release introduced Alpha support for configuring swap memory usage for Kubernetes workloads running on Linux on a per-node basis. Later, in release 1.28, support for swap on Linux nodes has graduated to Beta, along with many new improvements. In the following Kubernetes releases more improvements were made, paving the way to GA in the near future. Prior to version 1.22, Kubernetes did not provide support for swap memory on Linux systems. This was due to the inherent difficulty in guaranteeing and accounting for pod memory utilization when swap memory was involved. As a result, swap support was deemed out of scope in the initial design of Kubernetes, and the default behavior of a kubelet was to fail to start if swap memory was detected on a node. In version 1.22, the swap feature for Linux was initially introduced in its Alpha stage. This provided Linux users the opportunity to experiment with the swap feature for the first time. However, as an Alpha version, it was not fully developed and only partially worked on limited environments. In version 1.28 swap support on Linux nodes was promoted to Beta. The Beta version was a drastic leap forward. Not only did it fix a large amount of bugs and made swap work in a stable way, but it also brought cgroup v2 support, introduced a wide variety of tests which include complex scenarios such as node-level pressure, and more. It also brought many exciting new capabilities such as the LimitedSwap behavior which sets an auto-calculated swap limit to containers, OpenMetrics instrumentation support (through the /metrics/resource endpoint) and Summary API for VerticalPodAutoscalers (through the /stats/summary endpoint), and more. Today we are working on more improvements, paving the way for GA. Currently, the focus is especially towards ensuring node stability, enhanced debug abilities, addressing user feedback, polishing the feature and making it stable. For example, in order to increase stability, containers in high-priority pods cannot access swap which ensures the memory they need is ready to use. In addition, the UnlimitedSwap behavior was removed since it might compromise the node’s health. Secret content protection against swapping has also been introduced (see relevant security-risk section for more info). To conclude, compared to previous releases, the kubelet’s support for running with swap enabled is more stable and robust, more user-friendly, and addresses many known shortcomings. That said, the NodeSwap feature introduces basic swap support, and this is just the beginning. In the near future, additional features are planned to enhance swap functionality in various ways, such as improving evictions, extending the API, increasing customizability, and more! How do I use it? In order for the kubelet to initialize on a swap-enabled node, the failSwapOn field must be set to false on kubelet’s configuration setting, or the deprecated –fail-swap-on command line flag must be deactivated. It is possible to configure the memorySwap.swapBehavior option to define the manner in which a node utilizes swap memory. For instance, # this fragment goes into the kubelet’s configuration file memorySwap: swapBehavior: LimitedSwap The currently available configuration options for swapBehavior are: NoSwap (default): Kubernetes workloads cannot use swap. However, processes outside of Kubernetes’ scope, like system daemons (such as kubelet itself!) can utilize swap. This behavior is beneficial for protecting the node from system-level memory spikes, but it does not safeguard the workloads themselves from such spikes. LimitedSwap: Kubernetes workloads can utilize swap memory, but with certain limitations. The amount of swap available to a Pod is determined automatically, based on the proportion of the memory requested relative to the node’s total memory. Only non-high-priority Pods under the Burstable Quality of Service (QoS) tier are permitted to use swap. For more details, see the section below. If configuration for memorySwap is not specified, by default the kubelet will apply the same behaviour as the NoSwap setting. On Linux nodes, Kubernetes only supports running with swap enabled for hosts that use cgroup v2. On cgroup v1 systems, all Kubernetes workloads are not allowed to use swap memory. Install a swap-enabled cluster with kubeadm Before you begin It is required for this demo that the kubeadm tool be installed, following the steps outlined in the kubeadm installation guide. If swap is already enabled on the node, cluster creation may proceed. If swap is not enabled, please refer to the provided instructions for enabling swap. Create a swap file and turn swap on I’ll demonstrate creating 4GiB of swap, both in the encrypted and unencrypted case. Setting up unencrypted swap An unencrypted swap file can be set up as follows. # Allocate storage and restrict access fallocate –length 4GiB /swapfile chmod 600 /swapfile # Format the swap space mkswap /swapfile # Activate the swap space for paging swapon /swapfile Setting up encrypted swap An encrypted swap file can be set up as follows. Bear in mind that this example uses the cryptsetup binary (which is available on most Linux distributions). # Allocate storage and restrict access fallocate –length 4GiB /swapfile chmod 600 /swapfile # Create an encrypted device backed by the allocated storage cryptsetup –type plain –cipher aes-xts-plain64 –key-size 256 -d /dev/urandom open /swapfile cryptswap # Format the swap space mkswap /dev/mapper/cryptswap # Activate the swap space for paging swapon /dev/mapper/cryptswap Verify that swap is enabled Swap can be verified to be enabled with both swapon -s command or the free command > swapon -s Filename Type Size Used Priority /dev/dm-0 partition 4194300 0 -2 > free -h total used free shared buff/cache available Mem: 3.8Gi 1.3Gi 249Mi 25Mi 2.5Gi 2.5Gi Swap: 4.0Gi 0B 4.0Gi Enable swap on boot After setting up swap, to start the swap file at boot time, you either set up a systemd unit to activate (encrypted) swap, or you add a line similar to /swapfile swap swap defaults 0 0 into /etc/fstab. Set up a Kubernetes cluster that uses swap-enabled nodes To make things clearer, here is an example kubeadm configuration file kubeadm-config.yaml for the swap enabled cluster. — apiVersion: “kubeadm.k8s.io/v1beta3” kind: InitConfiguration — apiVersion: kubelet.config.k8s.io/v1beta1 kind: KubeletConfiguration failSwapOn: false memorySwap: swapBehavior: LimitedSwap Then create a single-node cluster using kubeadm init –config kubeadm-config.yaml. During init, there is a warning that swap is enabled on the node and in case the kubelet failSwapOn is set to true. We plan to remove this warning in a future release. How is the swap limit being determined with LimitedSwap? The configuration of swap memory, including its limitations, presents a significant challenge. Not only is it prone to misconfiguration, but as a system-level property, any misconfiguration could potentially compromise the entire node rather than just a specific workload. To mitigate this risk and ensure the health of the node, we have implemented Swap with automatic configuration of limitations. With LimitedSwap, Pods that do not fall under the Burstable QoS classification (i.e. BestEffort/Guaranteed QoS Pods) are prohibited from utilizing swap memory. BestEffort QoS Pods exhibit unpredictable memory consumption patterns and lack information regarding their memory usage, making it difficult to determine a safe allocation of swap memory. Conversely, Guaranteed QoS Pods are typically employed for applications that rely on the precise allocation of resources specified by the workload, with memory being immediately available. To maintain the aforementioned security and node health guarantees, these Pods are not permitted to use swap memory when LimitedSwap is in effect. In addition, high-priority pods are not permitted to use swap in order to ensure the memory they consume always residents on disk, hence ready to use. Prior to detailing the calculation of the swap limit, it is necessary to define the following terms: nodeTotalMemory: The total amount of physical memory available on the node. totalPodsSwapAvailable: The total amount of swap memory on the node that is available for use by Pods (some swap memory may be reserved for system use). containerMemoryRequest: The container’s memory request. Swap limitation is configured as: (containerMemoryRequest / nodeTotalMemory) × totalPodsSwapAvailable In other words, the amount of swap that a container is able to use is proportionate to its memory request, the node’s total physical memory and the total amount of swap memory on the node that is available for use by Pods. It is important to note that, for containers within Burstable QoS Pods, it is possible to opt-out of swap usage by specifying memory requests that are equal to memory limits. Containers configured in this manner will not have access to swap memory. How does it work? There are a number of possible ways that one could envision swap use on a node. When swap is already provisioned and available on a node, the kubelet is able to be configured so that: It can start with swap on. It will direct the Container Runtime Interface to allocate zero swap memory to Kubernetes workloads by default. Swap configuration on a node is exposed to a cluster admin via the memorySwap in the KubeletConfiguration. As a cluster administrator, you can specify the node’s behaviour in the presence of swap memory by setting memorySwap.swapBehavior. The kubelet employs the CRI (container runtime interface) API, and directs the container runtime to configure specific cgroup v2 parameters (such as memory.swap.max) in a manner that will enable the desired swap configuration for a container. For runtimes that use control groups, the container runtime is then responsible for writing these settings to the container-level cgroup. How can I monitor swap? Node and container level metric statistics Kubelet now collects node and container level metric statistics, which can be accessed at the /metrics/resource (which is used mainly by monitoring tools like Prometheus) and /stats/summary (which is used mainly by Autoscalers) kubelet HTTP endpoints. This allows clients who can directly interrogate the kubelet to monitor swap usage and remaining swap memory when using LimitedSwap. Additionally, a machine_swap_bytes metric has been added to cadvisor to show the total physical swap capacity of the machine. See this page for more info. Node Feature Discovery (NFD) Node Feature Discovery is a Kubernetes addon for detecting hardware features and configuration. It can be utilized to discover which nodes are provisioned with swap. As an example, to figure out which nodes are provisioned with swap, use the following command: kubectl get nodes -o jsonpath='{range .items[?(@.metadata.labels.feature\.node\.kubernetes\.io/memory-swap)]}{.metadata.name}{“\t”}{.metadata.labels.feature\.node\.kubernetes\.io/memory-swap}{“\n”}{end}’ This will result in an output similar to: k8s-worker1: true k8s-worker2: true k8s-worker3: false In this example, swap is provisioned on nodes k8s-worker1 and k8s-worker2, but not on k8s-worker3. Caveats Having swap available on a system reduces predictability. While swap can enhance performance by making more RAM available, swapping data back to memory is a heavy operation, sometimes slower by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system’s behaviour under memory pressure. Enabling swap increases the risk of noisy neighbors, where Pods that frequently use their RAM may cause other Pods to swap. In addition, since swap allows for greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, and due to unexpected packing configurations, the scheduler currently does not account for swap memory usage. This heightens the risk of noisy neighbors. The performance of a node with swap memory enabled depends on the underlying physical storage. When swap memory is in use, performance will be significantly worse in an I/O operations per second (IOPS) constrained environment, such as a cloud VM with I/O throttling, when compared to faster storage mediums like solid-state drives or NVMe. As swap might cause IO pressure, it is recommended to give a higher IO latency priority to system critical daemons. See the relevant section in the recommended practices section below. Memory-backed volumes On Linux nodes, memory-backed volumes (such as secret volume mounts, or emptyDir with medium: Memory) are implemented with a tmpfs filesystem. The contents of such volumes should remain in memory at all times, hence should not be swapped to disk. To ensure the contents of such volumes remain in memory, the noswap tmpfs option is being used. The Linux kernel officially supports the noswap option from version 6.3 (more info can be found in Linux Kernel Version Requirements). However, the different distributions often choose to backport this mount option to older Linux versions as well. In order to verify whether the node supports the noswap option, the kubelet will do the following: If the kernel’s version is above 6.3 then the noswap option will be assumed to be supported. Otherwise, kubelet would try to mount a dummy tmpfs with the noswap option at startup. If kubelet fails with an error indicating of an unknown option, noswap will be assumed to not be supported, hence will not be used. A kubelet log entry will be emitted to warn the user about memory-backed volumes might swap to disk. If kubelet succeeds, the dummy tmpfs will be deleted and the noswap option will be used. If the noswap option is not supported, kubelet will emit a warning log entry, then continue its execution. It is deeply encouraged to encrypt the swap space. See the section above with an example for setting unencrypted swap. However, handling encrypted swap is not within the scope of kubelet; rather, it is a general OS configuration concern and should be addressed at that level. It is the administrator’s responsibility to provision encrypted swap to mitigate this risk. Good practice for using swap in a Kubernetes cluster Disable swap for system-critical daemons During the testing phase and based on user feedback, it was observed that the performance of system-critical daemons and services might degrade. This implies that system daemons, including the kubelet, could operate slower than usual. If this issue is encountered, it is advisable to configure the cgroup of the system slice to prevent swapping (i.e., set memory.swap.max=0). Protect system-critical daemons for I/O latency Swap can increase the I/O load on a node. When memory pressure causes the kernel to rapidly swap pages in and out, system-critical daemons and services that rely on I/O operations may experience performance degradation. To mitigate this, it is recommended for systemd users to prioritize the system slice in terms of I/O latency. For non-systemd users, setting up a dedicated cgroup for system daemons and processes and prioritizing I/O latency in the same way is advised. This can be achieved by setting io.latency for the system slice, thereby granting it higher I/O priority. See cgroup’s documentation for more info. Swap and control plane nodes The Kubernetes project recommends running control plane nodes without any swap space configured. The control plane primarily hosts Guaranteed QoS Pods, so swap can generally be disabled. The main concern is that swapping critical services on the control plane could negatively impact performance. Use of a dedicated disk for swap It is recommended to use a separate, encrypted disk for the swap partition. If swap resides on a partition or the root filesystem, workloads may interfere with system processes that need to write to disk. When they share the same disk, processes can overwhelm swap, disrupting the I/O of kubelet, container runtime, and systemd, which would impact other workloads. Since swap space is located on a disk, it is crucial to ensure the disk is fast enough for the intended use cases. Alternatively, one can configure I/O priorities between different mapped areas of a single backing device. Looking ahead As you can see, the swap feature was dramatically improved lately, paving the way for a feature GA. However, this is just the beginning. It’s a foundational implementation marking the beginning of enhanced swap functionality. In the near future, additional features are planned to further improve swap capabilities, including better eviction mechanisms, extended API support, increased customizability, better debug abilities and more! How can I learn more? You can review the current documentation for using swap with Kubernetes. For more information, please see KEP-2400 and its design proposal. How do I get involved? Your feedback is always welcome! SIG Node meets regularly and can be reached via Slack (channel #sig-node), or the SIG’s mailing list. A Slack channel dedicated to swap is also available at #sig-node-swap. Feel free to reach out to me, Itamar Holder (@iholder101 on Slack and GitHub) if you’d like to help or ask further questions.

  • Ingress-nginx CVE-2025-1974: What You Need to Know
    on March 24, 2025 at 8:00 pm

    Today, the ingress-nginx maintainers have released patches for a batch of critical vulnerabilities that could make it easy for attackers to take over your Kubernetes cluster. If you are among the over 40% of Kubernetes administrators using ingress-nginx, you should take action immediately to protect your users and data. Background Ingress is the traditional Kubernetes feature for exposing your workload Pods to the world so that they can be useful. In an implementation-agnostic way, Kubernetes users can define how their applications should be made available on the network. Then, an ingress controller uses that definition to set up local or cloud resources as required for the user’s particular situation and needs. Many different ingress controllers are available, to suit users of different cloud providers or brands of load balancers. Ingress-nginx is a software-only ingress controller provided by the Kubernetes project. Because of its versatility and ease of use, ingress-nginx is quite popular: it is deployed in over 40% of Kubernetes clusters! Ingress-nginx translates the requirements from Ingress objects into configuration for nginx, a powerful open source webserver daemon. Then, nginx uses that configuration to accept and route requests to the various applications running within a Kubernetes cluster. Proper handling of these nginx configuration parameters is crucial, because ingress-nginx needs to allow users significant flexibility while preventing them from accidentally or intentionally tricking nginx into doing things it shouldn’t. Vulnerabilities Patched Today Four of today’s ingress-nginx vulnerabilities are improvements to how ingress-nginx handles particular bits of nginx config. Without these fixes, a specially-crafted Ingress object can cause nginx to misbehave in various ways, including revealing the values of Secrets that are accessible to ingress-nginx. By default, ingress-nginx has access to all Secrets cluster-wide, so this can often lead to complete cluster takeover by any user or entity that has permission to create an Ingress. The most serious of today’s vulnerabilities, CVE-2025-1974, rated 9.8 CVSS, allows anything on the Pod network to exploit configuration injection vulnerabilities via the Validating Admission Controller feature of ingress-nginx. This makes such vulnerabilities far more dangerous: ordinarily one would need to be able to create an Ingress object in the cluster, which is a fairly privileged action. When combined with today’s other vulnerabilities, CVE-2025-1974 means that anything on the Pod network has a good chance of taking over your Kubernetes cluster, with no credentials or administrative access required. In many common scenarios, the Pod network is accessible to all workloads in your cloud VPC, or even anyone connected to your corporate network! This is a very serious situation. Today, we have released ingress-nginx v1.12.1 and v1.11.5, which have fixes for all five of these vulnerabilities. Your next steps First, determine if your clusters are using ingress-nginx. In most cases, you can check this by running kubectl get pods –all-namespaces –selector app.kubernetes.io/name=ingress-nginx with cluster administrator permissions. If you are using ingress-nginx, make a plan to remediate these vulnerabilities immediately. The best and easiest remedy is to upgrade to the new patch release of ingress-nginx. All five of today’s vulnerabilities are fixed by installing today’s patches. If you can’t upgrade right away, you can significantly reduce your risk by turning off the Validating Admission Controller feature of ingress-nginx. If you have installed ingress-nginx using Helm Reinstall, setting the Helm value controller.admissionWebhooks.enabled=false If you have installed ingress-nginx manually delete the ValidatingWebhookconfiguration called ingress-nginx-admission edit the ingress-nginx-controller Deployment or Daemonset, removing –validating-webhook from the controller container’s argument list If you turn off the Validating Admission Controller feature as a mitigation for CVE-2025-1974, remember to turn it back on after you upgrade. This feature provides important quality of life improvements for your users, warning them about incorrect Ingress configurations before they can take effect. Conclusion, thanks, and further reading The ingress-nginx vulnerabilities announced today, including CVE-2025-1974, present a serious risk to many Kubernetes users and their data. If you use ingress-nginx, you should take action immediately to keep yourself safe. Thanks go out to Nir Ohfeld, Sagi Tzadik, Ronen Shustin, and Hillai Ben-Sasson from Wiz for responsibly disclosing these vulnerabilities, and for working with the Kubernetes SRC members and ingress-nginx maintainers (Marco Ebert and James Strong) to ensure we fixed them effectively. For further information about the maintenance and future of ingress-nginx, please see this GitHub issue and/or attend James and Marco’s KubeCon/CloudNativeCon EU 2025 presentation. For further information about the specific vulnerabilities discussed in this article, please see the appropriate GitHub issue: CVE-2025-24513, CVE-2025-24514, CVE-2025-1097, CVE-2025-1098, or CVE-2025-1974

  • Introducing JobSet
    on March 23, 2025 at 12:00 am

    Authors: Daniel Vega-Myhre (Google), Abdullah Gharaibeh (Google), Kevin Hannon (Red Hat) In this article, we introduce JobSet, an open source API for representing distributed jobs. The goal of JobSet is to provide a unified API for distributed ML training and HPC workloads on Kubernetes. Why JobSet? The Kubernetes community’s recent enhancements to the batch ecosystem on Kubernetes has attracted ML engineers who have found it to be a natural fit for the requirements of running distributed training workloads. Large ML models (particularly LLMs) which cannot fit into the memory of the GPU or TPU chips on a single host are often distributed across tens of thousands of accelerator chips, which in turn may span thousands of hosts. As such, the model training code is often containerized and executed simultaneously on all these hosts, performing distributed computations which often shard both the model parameters and/or the training dataset across the target accelerator chips, using communication collective primitives like all-gather and all-reduce to perform distributed computations and synchronize gradients between hosts. These workload characteristics make Kubernetes a great fit for this type of workload, as efficiently scheduling and managing the lifecycle of containerized applications across a cluster of compute resources is an area where it shines. It is also very extensible, allowing developers to define their own Kubernetes APIs, objects, and controllers which manage the behavior and life cycle of these objects, allowing engineers to develop custom distributed training orchestration solutions to fit their needs. However, as distributed ML training techniques continue to evolve, existing Kubernetes primitives do not adequately model them alone anymore. Furthermore, the landscape of Kubernetes distributed training orchestration APIs has become fragmented, and each of the existing solutions in this fragmented landscape has certain limitations that make it non-optimal for distributed ML training. For example, the KubeFlow training operator defines custom APIs for different frameworks (e.g. PyTorchJob, TFJob, MPIJob, etc.); however, each of these job types are in fact a solution fit specifically to the target framework, each with different semantics and behavior. On the other hand, the Job API fixed many gaps for running batch workloads, including Indexed completion mode, higher scalability, Pod failure policies and Pod backoff policy to mention a few of the most recent enhancements. However, running ML training and HPC workloads using the upstream Job API requires extra orchestration to fill the following gaps: Multi-template Pods : Most HPC or ML training jobs include more than one type of Pods. The different Pods are part of the same workload, but they need to run a different container, request different resources or have different failure policies. A common example is the driver-worker pattern. Job groups : Large scale training workloads span multiple network topologies, running across multiple racks for example. Such workloads are network latency sensitive, and aim to localize communication and minimize traffic crossing the higher-latency network links. To facilitate this, the workload needs to be split into groups of Pods each assigned to a network topology. Inter-Pod communication : Create and manage the resources (e.g. headless Services) necessary to establish communication between the Pods of a job. Startup sequencing : Some jobs require a specific start sequence of pods; sometimes the driver is expected to start first (like Ray or Spark), in other cases the workers are expected to be ready before starting the driver (like MPI). JobSet aims to address those gaps using the Job API as a building block to build a richer API for large-scale distributed HPC and ML use cases. How JobSet Works JobSet models a distributed batch workload as a group of Kubernetes Jobs. This allows a user to easily specify different pod templates for different distinct groups of pods (e.g. a leader, workers, parameter servers, etc.). It uses the abstraction of a ReplicatedJob to manage child Jobs, where a ReplicatedJob is essentially a Job Template with some desired number of Job replicas specified. This provides a declarative way to easily create identical child-jobs to run on different islands of accelerators, without resorting to scripting or Helm charts to generate many versions of the same job but with different names. Some other key JobSet features which address the problems described above include: Replicated Jobs : In modern data centers, hardware accelerators like GPUs and TPUs allocated in islands of homogenous accelerators connected via a specialized, high bandwidth network links. For example, a user might provision nodes containing a group of hosts co-located on a rack, each with H100 GPUs, where GPU chips within each host are connected via NVLink, with a NVLink Switch connecting the multiple NVLinks. TPU Pods are another example of this: TPU ViperLitePods consist of 64 hosts, each with 4 TPU v5e chips attached, all connected via ICI mesh. When running a distributed training job across multiple of these islands, we often want to partition the workload into a group of smaller identical jobs, 1 per island, where each pod primarily communicates with the pods within the same island to do segments of distributed computation, and keeping the gradient synchronization over DCN (data center network, which is lower bandwidth than ICI) to a bare minimum. Automatic headless service creation, configuration, and lifecycle management : Pod-to-pod communication via pod hostname is enabled by default, with automatic configuration and lifecycle management of the headless service enabling this. Configurable success policies : JobSet has configurable success policies which target specific ReplicatedJobs, with operators to target “Any” or “All” of their child jobs. For example, you can configure the JobSet to be marked complete if and only if all pods that are part of the “worker” ReplicatedJob are completed. Configurable failure policies : JobSet has configurable failure policies which allow the user to specify a maximum number of times the JobSet should be restarted in the event of a failure. If any job is marked failed, the entire JobSet will be recreated, allowing the workload to resume from the last checkpoint. When no failure policy is specified, if any job fails, the JobSet simply fails. Exclusive placement per topology domain : JobSet allows users to express that child jobs have 1:1 exclusive assignment to a topology domain, typically an accelerator island like a rack. For example, if the JobSet creates two child jobs, then this feature will enforce that the pods of each child job will be co-located on the same island, and that only one child job is allowed to schedule per island. This is useful for scenarios where we want to use a distributed data parallel (DDP) training strategy to train a model using multiple islands of compute resources (GPU racks or TPU slices), running 1 model replica in each accelerator island, ensuring the forward and backward passes themselves occur within a single model replica occurs over the high bandwidth interconnect linking the accelerators chips within the island, and only the gradient synchronization between model replicas occurs across accelerator islands over the lower bandwidth data center network. Integration with Kueue : Users can submit JobSets via Kueue to oversubscribe their clusters, queue workloads to run as capacity becomes available, prevent partial scheduling and deadlocks, enable multi-tenancy, and more. Example use case Distributed ML training on multiple TPU slices with Jax The following example is a JobSet spec for running a TPU Multislice workload on 4 TPU v5e slices. To learn more about TPU concepts and terminology, please refer to these docs. This example uses Jax, an ML framework with native support for Just-In-Time (JIT) compilation targeting TPU chips via OpenXLA. However, you can also use PyTorch/XLA to do ML training on TPUs. This example makes use of several JobSet features (both explicitly and implicitly) to support the unique scheduling requirements of TPU multislice training out-of-the-box with very little configuration required by the user. # Run a simple Jax workload on apiVersion: jobset.x-k8s.io/v1alpha2 kind: JobSet metadata: name: multislice annotations: # Give each child Job exclusive usage of a TPU slice alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool spec: failurePolicy: maxRestarts: 3 replicatedJobs: – name: workers replicas: 4 # Set to number of TPU slices template: spec: parallelism: 2 # Set to number of VMs per TPU slice completions: 2 # Set to number of VMs per TPU slice backoffLimit: 0 template: spec: hostNetwork: true dnsPolicy: ClusterFirstWithHostNet nodeSelector: cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice cloud.google.com/gke-tpu-topology: 2×4 containers: – name: jax-tpu image: python:3.8 ports: – containerPort: 8471 – containerPort: 8080 securityContext: privileged: true command: – bash – -c – | pip install “jax[tpu]” -f https://storage.googleapis.com/jax-releases/libtpu_releases.html python -c ‘import jax; print(“Global device count:”, jax.device_count())’ sleep 60 resources: limits: google.com/tpu: 4 Future work and getting involved We have a number of features on the JobSet roadmap planned for development this year, which can be found in the JobSet roadmap. Please feel free to reach out with feedback of any kind. We’re also open to additional contributors, whether it is to fix or report bugs, or help add new features or write documentation. You can get in touch with us via our repo, mailing list or on Slack. Last but not least, thanks to all our contributors who made this project possible!

  • Spotlight on SIG Apps
    on March 12, 2025 at 12:00 am

    In our ongoing SIG Spotlight series, we dive into the heart of the Kubernetes project by talking to the leaders of its various Special Interest Groups (SIGs). This time, we focus on SIG Apps, the group responsible for everything related to developing, deploying, and operating applications on Kubernetes. Sandipan Panda (DevZero) had the opportunity to interview Maciej Szulik (Defense Unicorns) and Janet Kuo (Google), the chairs and tech leads of SIG Apps. They shared their experiences, challenges, and visions for the future of application management within the Kubernetes ecosystem. Introductions Sandipan: Hello, could you start by telling us a bit about yourself, your role, and your journey within the Kubernetes community that led to your current roles in SIG Apps? Maciej: Hey, my name is Maciej, and I’m one of the leads for SIG Apps. Aside from this role, you can also find me helping SIG CLI and also being one of the Steering Committee members. I’ve been contributing to Kubernetes since late 2014 in various areas, including controllers, apiserver, and kubectl. Janet: Certainly! I’m Janet, a Staff Software Engineer at Google, and I’ve been deeply involved with the Kubernetes project since its early days, even before the 1.0 launch in 2015. It’s been an amazing journey! My current role within the Kubernetes community is one of the chairs and tech leads of SIG Apps. My journey with SIG Apps started organically. I started with building the Deployment API and adding rolling update functionalities. I naturally gravitated towards SIG Apps and became increasingly involved. Over time, I took on more responsibilities, culminating in my current leadership roles. About SIG Apps All following answers were jointly provided by Maciej and Janet. Sandipan: For those unfamiliar, could you provide an overview of SIG Apps’ mission and objectives? What key problems does it aim to solve within the Kubernetes ecosystem? As described in our charter, we cover a broad area related to developing, deploying, and operating applications on Kubernetes. That, in short, means we’re open to each and everyone showing up at our bi-weekly meetings and discussing the ups and downs of writing and deploying various applications on Kubernetes. Sandipan: What are some of the most significant projects or initiatives currently being undertaken by SIG Apps? At this point in time, the main factors driving the development of our controllers are the challenges coming from running various AI-related workloads. It’s worth giving credit here to two working groups we’ve sponsored over the past years: The Batch Working Group, which is looking at running HPC, AI/ML, and data analytics jobs on top of Kubernetes. The Serving Working Group, which is focusing on hardware-accelerated AI/ML inference. Best practices and challenges Sandipan: SIG Apps plays a crucial role in developing application management best practices for Kubernetes. Can you share some of these best practices and how they help improve application lifecycle management? Implementing health checks and readiness probes ensures that your applications are healthy and ready to serve traffic, leading to improved reliability and uptime. The above, combined with comprehensive logging, monitoring, and tracing solutions, will provide insights into your application’s behavior, enabling you to identify and resolve issues quickly. Auto-scale your application based on resource utilization or custom metrics, optimizing resource usage and ensuring your application can handle varying loads. Use Deployment for stateless applications, StatefulSet for stateful applications, Job and CronJob for batch workloads, and DaemonSet for running a daemon on each node. Use Operators and CRDs to extend the Kubernetes API to automate the deployment, management, and lifecycle of complex applications, making them easier to operate and reducing manual intervention. Sandipan: What are some of the common challenges SIG Apps faces, and how do you address them? The biggest challenge we’re facing all the time is the need to reject a lot of features, ideas, and improvements. This requires a lot of discipline and patience to be able to explain the reasons behind those decisions. Sandipan: How has the evolution of Kubernetes influenced the work of SIG Apps? Are there any recent changes or upcoming features in Kubernetes that you find particularly relevant or beneficial for SIG Apps? The main benefit for both us and the whole community around SIG Apps is the ability to extend kubernetes with Custom Resource Definitions and the fact that users can build their own custom controllers leveraging the built-in ones to achieve whatever sophisticated use cases they might have and we, as the core maintainers, haven’t considered or weren’t able to efficiently resolve inside Kubernetes. Contributing to SIG Apps Sandipan: What opportunities are available for new contributors who want to get involved with SIG Apps, and what advice would you give them? We get the question, “What good first issue might you recommend we start with?” a lot 🙂 But unfortunately, there’s no easy answer to it. We always tell everyone that the best option to start contributing to core controllers is to find one you are willing to spend some time with. Read through the code, then try running unit tests and integration tests focusing on that controller. Once you grasp the general idea, try breaking it and the tests again to verify your breakage. Once you start feeling confident you understand that particular controller, you may want to search through open issues affecting that controller and either provide suggestions, explaining the problem users have, or maybe attempt your first fix. Like we said, there are no shortcuts on that road; you need to spend the time with the codebase to understand all the edge cases we’ve slowly built up to get to the point where we are. Once you’re successful with one controller, you’ll need to repeat that same process with others all over again. Sandipan: How does SIG Apps gather feedback from the community, and how is this feedback integrated into your work? We always encourage everyone to show up and present their problems and solutions during our bi-weekly meetings. As long as you’re solving an interesting problem on top of Kubernetes and you can provide valuable feedback about any of the core controllers, we’re always happy to hear from everyone. Looking ahead Sandipan: Looking ahead, what are the key focus areas or upcoming trends in application management within Kubernetes that SIG Apps is excited about? How is the SIG adapting to these trends? Definitely the current AI hype is the major driving factor; as mentioned above, we have two working groups, each covering a different aspect of it. Sandipan: What are some of your favorite things about this SIG? Without a doubt, the people that participate in our meetings and on Slack, who tirelessly help triage issues, pull requests and invest a lot of their time (very frequently their private time) into making kubernetes great! SIG Apps is an essential part of the Kubernetes community, helping to shape how applications are deployed and managed at scale. From its work on improving Kubernetes’ workload APIs to driving innovation in AI/ML application management, SIG Apps is continually adapting to meet the needs of modern application developers and operators. Whether you’re a new contributor or an experienced developer, there’s always an opportunity to get involved and make an impact. If you’re interested in learning more or contributing to SIG Apps, be sure to check out their SIG README and join their bi-weekly meetings. SIG Apps Mailing List SIG Apps on Slack

Scroll to Top