Kubernetes Blog
The Kubernetes blog is used by the project to communicate new features, community reports, and any news that might be relevant to the Kubernetes community.
-
Kubernetes v1.34: Pods Report DRA Resource Health
on September 17, 2025 at 6:30 pm
The rise of AI/ML and other high-performance workloads has made specialized hardware like GPUs, TPUs, and FPGAs a critical component of many Kubernetes clusters. However, as discussed in a previous blog post about navigating failures in Pods with devices, when this hardware fails it can be difficult to diagnose, leading to significant downtime. With the release of Kubernetes v1.34, we are excited to announce a new alpha feature that brings much-needed visibility into the health of these devices.

This work extends KEP-4680, which first introduced a mechanism for reporting the health of devices managed by Device Plugins. Now that capability is being extended to Dynamic Resource Allocation (DRA). Controlled by the ResourceHealthStatus feature gate, this enhancement allows DRA drivers to report device health directly into a Pod's .status field, providing crucial insights for operators and developers.

Why expose device health in Pod status?

For stateful applications or long-running jobs, a device failure can be disruptive and costly. By exposing device health in a Pod's .status field, Kubernetes provides a standardized way for users and automation tools to quickly diagnose issues. If a Pod is failing, you can now check its status to see whether an unhealthy device is the root cause, saving valuable time that might otherwise be spent debugging application code.

How it works

This feature introduces a new, optional communication channel between the kubelet and DRA drivers, built on three core components.

A new gRPC health service: A new gRPC service, DRAResourceHealth, is defined in the dra-health/v1alpha1 API group. DRA drivers can implement this service to stream device health updates to the kubelet. The service includes a NodeWatchResources server-streaming RPC that sends the health status (Healthy, Unhealthy, or Unknown) for the devices the driver manages.

Kubelet integration: The kubelet's DRAPluginManager discovers which drivers implement the health service. For each compatible driver, it starts a long-lived NodeWatchResources stream to receive health updates. The DRA manager then consumes these updates and stores them in a persistent healthInfoCache that survives kubelet restarts.

Populating the Pod status: When a device's health changes, the DRA manager identifies all Pods affected by the change and triggers a Pod status update. A new field, allocatedResourcesStatus, is now part of the v1.ContainerStatus API object. The kubelet populates this field with the current health of each device allocated to the container.

A practical example

If a Pod is in a CrashLoopBackOff state, you can use kubectl describe pod <pod-name> to inspect its status. If an allocated device has failed, the output will now include the allocatedResourcesStatus field, clearly indicating the problem:

```yaml
status:
  containerStatuses:
  - name: my-gpu-intensive-container
    # ... other container statuses
    allocatedResourcesStatus:
    - name: "claim:my-gpu-claim"
      resources:
      - resourceID: "example.com/gpu-a1b2-c3d4"
        health: "Unhealthy"
```

This explicit status makes it clear that the issue is with the underlying hardware, not the application. You can also build failure-detection logic that reacts to unhealthy devices associated with a Pod, for example by de-scheduling the Pod, as in the sketch below.
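To make that concrete, here is a minimal, hypothetical reaction loop using client-go: it scans Pods, inspects allocatedResourcesStatus, and deletes any Pod whose container reports an Unhealthy device so that its owning controller can recreate it elsewhere. The namespace and the delete-to-reschedule policy are assumptions for this sketch, not part of the feature.

```go
// Hypothetical sketch: delete Pods whose containers report an Unhealthy
// allocated device, letting their controllers reschedule them.
package main

import (
	"context"
	"log"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// "default" namespace is an assumption for the example.
	pods, err := client.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			for _, rs := range cs.AllocatedResourcesStatus {
				for _, dev := range rs.Resources {
					if dev.Health == v1.ResourceHealthStatusUnhealthy {
						log.Printf("pod %s: device %s in %s is unhealthy; deleting pod",
							pod.Name, dev.ResourceID, rs.Name)
						// Deleting lets the owning controller recreate the Pod elsewhere.
						_ = client.CoreV1().Pods(pod.Namespace).Delete(
							context.TODO(), pod.Name, metav1.DeleteOptions{})
					}
				}
			}
		}
	}
}
```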
How to use this feature

As this is an alpha feature in Kubernetes v1.34, you must take the following steps to use it:

1. Enable the ResourceHealthStatus feature gate on your kube-apiserver and kubelets (see the example configuration at the end of this post).
2. Ensure you are using a DRA driver that implements the v1alpha1 DRAResourceHealth gRPC service.

DRA drivers

If you are developing a DRA driver, think through your device failure detection strategy and make sure your driver integrates with this feature. Doing so improves the user experience and simplifies debugging of hardware issues.

What's next?

This is the first step in a broader effort to improve how Kubernetes handles device failures. As we gather feedback on this alpha feature, the community is planning several key enhancements before graduating to beta:

- Detailed health messages: To improve the troubleshooting experience, we plan to add a human-readable message field to the gRPC API. This will allow DRA drivers to provide specific context for a health status, such as "GPU temperature exceeds threshold" or "NVLink connection lost".
- Configurable health timeouts: The timeout for marking a device's health as "Unknown" is currently hardcoded. We plan to make this configurable, likely on a per-driver basis, to better accommodate the different health-reporting characteristics of various hardware.
- Improved post-mortem troubleshooting: We will address a known limitation where health updates may not be applied to Pods that have already terminated. This fix will ensure that the health status of a device at the time of failure is preserved, which is crucial for troubleshooting batch jobs and other "run-to-completion" workloads.

This feature was developed as part of KEP-4680, and community feedback is crucial as we work toward graduating it to beta. More improvements to device failure handling in Kubernetes are on the way; we encourage you to try this feature out and share your experiences with the SIG Node community!
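As referenced in step 1 above, enabling the gate on a kubelet might look like the following configuration excerpt. The featureGates stanza is standard KubeletConfiguration syntax; on the kube-apiserver the equivalent is the --feature-gates=ResourceHealthStatus=true command-line flag.

```yaml
# KubeletConfiguration excerpt enabling the ResourceHealthStatus alpha gate.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  ResourceHealthStatus: true
```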
-
Kubernetes v1.34: Moving Volume Group Snapshots to v1beta2
on September 16, 2025 at 6:30 pm
Volume group snapshots were introduced as an alpha feature in the Kubernetes 1.27 release and moved to beta in the Kubernetes 1.32 release. The recent release of Kubernetes v1.34 moved that support to a second beta. The support for volume group snapshots relies on a set of extension APIs for group snapshots. These APIs allow users to take crash-consistent snapshots of a set of volumes. Behind the scenes, Kubernetes uses a label selector to group multiple PersistentVolumeClaims for snapshotting. A key aim is to allow you to restore that set of snapshots to new volumes and recover your workload from a crash-consistent recovery point. This feature is only supported for CSI volume drivers.

What's new in Beta 2?

While testing the beta version, we encountered an issue where the restoreSize field was not set for individual VolumeSnapshotContents and VolumeSnapshots if the CSI driver did not implement the ListSnapshots RPC call. We evaluated various options and decided to make this change by releasing a new beta version of the API. Specifically, a VolumeSnapshotInfo struct is added in v1beta2; it contains information for an individual volume snapshot that is a member of a volume group snapshot. VolumeSnapshotInfoList, a list of VolumeSnapshotInfo, is added to VolumeGroupSnapshotContentStatus, replacing VolumeSnapshotHandlePairList. VolumeSnapshotInfoList is a list of snapshot information returned by the CSI driver to identify snapshots on the storage system. It is populated by the csi-snapshotter sidecar based on the CSI CreateVolumeGroupSnapshotResponse returned by the CSI driver's CreateVolumeGroupSnapshot call. Existing v1beta1 API objects are converted to the new v1beta2 API objects by a conversion webhook. (A rough sketch of the new type appears at the end of this post.)

What's next?

Depending on feedback and adoption, the Kubernetes project plans to move the volume group snapshot implementation to general availability (GA) in a future release.

How can I learn more?

- The design spec for the volume group snapshot feature.
- The code repository for volume group snapshot APIs and controller.
- CSI documentation on the group snapshot feature.

How do I get involved?

This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to the contributors who stepped up these last few quarters to help the project reach beta:

- Ben Swartzlander (bswartz)
- Hemant Kumar (gnufied)
- Jan Šafránek (jsafrane)
- Madhu Rajanna (Madhu-1)
- Michelle Au (msau42)
- Niels de Vos (nixpanic)
- Leonardo Cecchi (leonardoce)
- Saad Ali (saad-ali)
- Xing Yang (xing-yang)
- Yati Padia (yati1998)

For those interested in getting involved with the design and development of CSI or any part of the Kubernetes storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors. We also hold regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.
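For orientation, here is a rough Go sketch of what the v1beta2 status change described above could look like. Only VolumeSnapshotInfo, VolumeSnapshotInfoList, and the restoreSize motivation come from this post; the exact field set is an assumption, so consult the kubernetes-csi/external-snapshotter repository for the authoritative definitions.

```go
// Illustrative sketch only; the real types live in the
// kubernetes-csi/external-snapshotter repository.
package v1beta2

import (
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// VolumeSnapshotInfo carries information about one member snapshot of a
// volume group snapshot (hypothetical field set).
type VolumeSnapshotInfo struct {
	// VolumeHandle identifies the source volume on the storage system.
	VolumeHandle string `json:"volumeHandle"`
	// SnapshotHandle identifies the snapshot on the storage system.
	SnapshotHandle string `json:"snapshotHandle"`
	// RestoreSize is the minimum size of a volume restored from this
	// snapshot; this is the field that motivated the v1beta2 revision.
	RestoreSize *resource.Quantity `json:"restoreSize,omitempty"`
	// CreationTime is when the snapshot was taken (assumed field).
	CreationTime *metav1.Time `json:"creationTime,omitempty"`
}

// VolumeGroupSnapshotContentStatus embeds the new list, replacing the
// older VolumeSnapshotHandlePairList (other status fields elided).
type VolumeGroupSnapshotContentStatus struct {
	VolumeSnapshotInfoList []VolumeSnapshotInfo `json:"volumeSnapshotInfoList,omitempty"`
}
```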
-
Kubernetes v1.34: Decoupled Taint Manager Is Now Stable
on September 15, 2025 at 6:30 pm
This enhancement separates the responsibility of managing node lifecycle and Pod eviction into two distinct components. Previously, the node lifecycle controller handled both marking nodes as unhealthy with NoExecute taints and evicting Pods from them. Now, a dedicated taint eviction controller manages the eviction process, while the node lifecycle controller focuses solely on applying taints. This separation not only improves code organization but also makes it easier to improve the taint eviction controller or build custom implementations of taint-based eviction.

What's new?

The SeparateTaintEvictionController feature gate has been promoted to GA in this release. Users can optionally disable taint-based eviction by setting --controllers=-taint-eviction-controller in kube-controller-manager, as shown in the example after this post.

How can I learn more?

For more details, refer to the KEP and to the beta announcement article: Kubernetes 1.29: Decoupling taint manager from node lifecycle controller.

How to get involved?

We offer a huge thank you to all the contributors who helped with the design, implementation, and review of this feature and helped move it from beta to stable:

- Ed Bartosh (@bart0sh)
- Yuan Chen (@yuanchen8911)
- Aldo Culquicondor (@alculquicondor)
- Baofa Fan (@carlory)
- Sergey Kanzhelev (@SergeyKanzhelev)
- Tim Bannister (@lmktfy)
- Maciej Skoczeń (@macsko)
- Maciej Szulik (@soltysh)
- Wojciech Tyczynski (@wojtek-t)
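As a concrete illustration of the flag mentioned under "What's new?", the invocation below disables only taint-based eviction while the * wildcard keeps all other default controllers enabled; the rest of a real kube-controller-manager command line is elided.

```sh
# Disable the taint eviction controller; keep everything else on by default.
kube-controller-manager --controllers='*,-taint-eviction-controller'
```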
-
Kubernetes v1.34: Autoconfiguration for Node Cgroup Driver Goes GA
on September 12, 2025 at 6:30 pm
Historically, configuring the correct cgroup driver has been a pain point for users running new Kubernetes clusters. On Linux systems, there are two different cgroup drivers: cgroupfs and systemd. In the past, both the kubelet and the CRI implementation (such as CRI-O or containerd) needed to be configured to use the same cgroup driver, or else the kubelet would misbehave without any explicit error message. This was a source of headaches for many cluster admins. Now, we've (almost) arrived at the end of that headache.

Automated cgroup driver detection

In v1.28.0, the SIG Node community introduced the KubeletCgroupDriverFromCRI feature gate, which instructs the kubelet to ask the CRI implementation which cgroup driver to use. After many releases of waiting for each CRI implementation to ship new major versions and get packaged in major operating systems, this feature has gone GA as of Kubernetes v1.34.0. In addition to setting the feature gate, a cluster admin needs to ensure their CRI implementation is new enough:

- containerd: support was added in v2.0.0
- CRI-O: support was added in v1.28.0

Announcement: Kubernetes is deprecating containerd v1.y support

While CRI-O releases versions that match Kubernetes versions, so CRI-O versions without this behavior are no longer supported, containerd maintains its own release cycle. containerd support for this feature exists only in v2.0 and later, but Kubernetes v1.34 still supports containerd 1.7 and other LTS releases of containerd. The Kubernetes SIG Node community has formally agreed on a final support timeline for containerd v1.y: Kubernetes v1.35 will be the last release line to offer this support, and support will be dropped in v1.36.0.

To help administrators manage this transition, a new detection mechanism is available: you can monitor the kubelet_cri_losing_support metric to determine whether any nodes in your cluster are using a containerd version that will soon be too old. The presence of this metric with a version label of 1.36.0 indicates that the node's containerd runtime is not new enough for the upcoming requirements. Consequently, an administrator will need to upgrade containerd to v2.0 or later before, or at the same time as, upgrading the kubelet to v1.36.0. A quick way to check a node is sketched below.
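One way to check a specific node is to scrape its kubelet metrics endpoint through the API server proxy, a general-purpose technique, and search for the metric named above; <node-name> is a placeholder for one of your nodes.

```sh
# Look for the containerd-deprecation signal on one node's kubelet.
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics" \
  | grep kubelet_cri_losing_support
```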
-
Kubernetes v1.34: Mutable CSI Node Allocatable Graduates to Beta
on September 11, 2025 at 6:30 pm
The functionality for CSI drivers to update information about attachable volume count on nodes, first introduced as alpha in Kubernetes v1.33, has graduated to beta in the Kubernetes v1.34 release! This marks a significant milestone in enhancing the accuracy of stateful Pod scheduling by reducing failures caused by outdated attachable volume capacity information.

Background

Traditionally, Kubernetes CSI drivers report a static maximum volume attachment limit when initializing. However, actual attachment capacities can change during a node's lifecycle for various reasons, such as:

- Manual or external operations attaching or detaching volumes outside of Kubernetes control.
- Dynamically attached network interfaces or specialized hardware (GPUs, NICs, etc.) consuming available slots.
- Multi-driver scenarios, in which one CSI driver's operations affect the available capacity reported by another.

Static reporting can cause Kubernetes to schedule Pods onto nodes that appear to have capacity but don't, leaving Pods stuck in a ContainerCreating state.

Dynamically adapting CSI volume limits

With this new feature, Kubernetes enables CSI drivers to dynamically adjust and report node attachment capacities at runtime. This ensures that the scheduler, as well as other components relying on this information, has the most accurate, up-to-date view of node capacity.

How it works

Kubernetes supports two mechanisms for updating the reported node volume limits:

- Periodic updates: CSI drivers specify an interval at which to refresh the node's allocatable capacity.
- Reactive updates: an immediate update is triggered when a volume attachment fails due to exhausted resources (a ResourceExhausted error).

Enabling the feature

To use this beta feature, the MutableCSINodeAllocatableCount feature gate must be enabled in these components:

- kube-apiserver
- kubelet

Example CSI driver configuration

Below is an example of configuring a CSI driver to enable periodic updates every 60 seconds:

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.k8s.io
spec:
  nodeAllocatableUpdatePeriodSeconds: 60
```

This configuration directs the kubelet to call the CSI driver's NodeGetInfo method every 60 seconds and update the node's allocatable volume count. Kubernetes enforces a minimum update interval of 10 seconds to balance accuracy and resource usage.

Immediate updates on attachment failures

When a volume attachment operation fails with a ResourceExhausted error (gRPC code 8), Kubernetes immediately updates the allocatable count instead of waiting for the next periodic update. The kubelet then marks the affected Pods as Failed, enabling their controllers to recreate them. This prevents Pods from getting permanently stuck in the ContainerCreating state. (A sketch of how a driver might surface this error appears at the end of this post.)

Getting started

To enable this feature in your Kubernetes v1.34 cluster:

1. Enable the MutableCSINodeAllocatableCount feature gate on the kube-apiserver and kubelet components.
2. Update your CSI driver configuration by setting nodeAllocatableUpdatePeriodSeconds.
3. Monitor and observe improvements in scheduling accuracy and Pod placement reliability.

Next steps

This feature is currently in beta and the Kubernetes community welcomes your feedback. Test it, share your experiences, and help guide its evolution to GA stability. Join discussions in the Kubernetes Storage Special Interest Group (SIG Storage) to shape the future of Kubernetes storage capabilities.
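To illustrate the reactive path described above, here is a minimal, hypothetical fragment of a CSI controller plugin that returns gRPC code 8 when it runs out of attachment slots. The slot-tracking map is invented for the example; only the ResourceExhausted code and the ControllerPublishVolume RPC come from the CSI specification.

```go
// Hypothetical CSI controller fragment returning ResourceExhausted when
// no attachment slots remain, so Kubernetes can refresh the node's
// allocatable volume count immediately.
package exampledriver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// driver is a stand-in for a real CSI controller plugin; the slot map is
// an invented way to track per-node attachment capacity.
type driver struct {
	freeSlots map[string]int // nodeID -> remaining attachment slots
}

func (d *driver) ControllerPublishVolume(
	ctx context.Context, req *csi.ControllerPublishVolumeRequest,
) (*csi.ControllerPublishVolumeResponse, error) {
	if d.freeSlots[req.GetNodeId()] == 0 {
		// gRPC code 8 (ResourceExhausted): signals Kubernetes to update
		// the allocatable count without waiting for the next periodic
		// NodeGetInfo call.
		return nil, status.Errorf(codes.ResourceExhausted,
			"no attachment slots left on node %s", req.GetNodeId())
	}
	d.freeSlots[req.GetNodeId()]--
	// ... attach the volume via the storage backend here ...
	return &csi.ControllerPublishVolumeResponse{}, nil
}
```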