Kubernetes Blog The Kubernetes blog is used by the project to communicate new features, community reports, and any news that might be relevant to the Kubernetes community.
-
Tuning Linux Swap for Kubernetes: A Deep Dive
on August 19, 2025 at 6:30 pm
The Kubernetes NodeSwap feature, likely to graduate to stable in the upcoming Kubernetes v1.34 release, allows swap usage: a significant shift from the conventional practice of disabling swap for performance predictability. This article focuses exclusively on tuning swap on Linux nodes, where this feature is available. By allowing Linux nodes to use secondary storage for additional virtual memory when physical RAM is exhausted, node swap support aims to improve resource utilization and reduce out-of-memory (OOM) kills.

However, enabling swap is not a “turn-key” solution. The performance and stability of your nodes under memory pressure are critically dependent on a set of Linux kernel parameters. Misconfiguration can lead to performance degradation and interfere with Kubelet’s eviction logic.

In this blog post, I’ll dive into critical Linux kernel parameters that govern swap behavior. I will explore how these parameters influence Kubernetes workload performance, swap utilization, and crucial eviction mechanisms. I will present various test results showcasing the impact of different configurations, and share my findings on achieving optimal settings for stable and high-performing Kubernetes clusters.

Introduction to Linux swap

At a high level, the Linux kernel manages memory through pages, typically 4KiB in size. When physical memory becomes constrained, the kernel’s page replacement algorithm decides which pages to move to swap space. While the exact logic is a sophisticated optimization, this decision-making process is influenced by certain key factors:

Page access patterns (how recently pages are accessed)
Page dirtiness (whether pages have been modified)
Memory pressure (how urgently the system needs free memory)

Anonymous vs File-backed memory

It is important to understand that not all memory pages are the same. The kernel distinguishes between anonymous and file-backed memory.
Anonymous memory: This is memory that is not backed by a specific file on the disk, such as a program’s heap and stack. From the application’s perspective this is private memory, and when the kernel needs to reclaim these pages, it must write them to a dedicated swap device.

File-backed memory: This memory is backed by a file on a filesystem. This includes a program’s executable code, shared libraries, and filesystem caches. When the kernel needs to reclaim these pages, it can simply discard them if they have not been modified (“clean”). If a page has been modified (“dirty”), the kernel must first write the changes back to the file before it can be discarded.

While a system without swap can still reclaim clean file-backed pages under pressure by dropping them, it has no way to offload anonymous memory. Enabling swap provides this capability, allowing the kernel to move less-frequently accessed memory pages to disk to conserve memory and avoid system OOM kills.

Key kernel parameters for swap tuning

To effectively tune swap behavior, Linux provides several kernel parameters that can be managed via sysctl.

vm.swappiness: This is the most well-known parameter. It is a value from 0 to 200 (100 in older kernels) that controls the kernel’s preference for swapping anonymous memory pages versus reclaiming file-backed memory pages (page cache).

High value (e.g., 90+): The kernel will be aggressive in swapping out less-used anonymous memory to make room for file-cache.
Low value (e.g., < 10): The kernel will strongly prefer dropping file cache pages over swapping anonymous memory.

vm.min_free_kbytes: This parameter tells the kernel to keep a minimum amount of memory free as a buffer. When the amount of free memory drops below this safety buffer, the kernel starts reclaiming pages more aggressively (swapping, and eventually handling OOM kills).
Function: It acts as a safety lever to ensure the kernel has enough memory for critical allocation requests that cannot be deferred.
Impact on swap: Setting a higher min_free_kbytes effectively raises the floor for free memory, causing the kernel to initiate swap earlier under memory pressure.

vm.watermark_scale_factor: This setting controls the gap between different watermarks: min, low and high, which are calculated based on min_free_kbytes.

Watermarks explained:

low: When free memory is below this mark, the kswapd kernel process wakes up to reclaim pages in the background. This is when a swapping cycle begins.
min: When free memory hits this minimum level, aggressive page reclamation will block process allocation. Failing to reclaim pages will cause OOM kills.
high: Memory reclamation stops once the free memory reaches this level.

Impact: A higher watermark_scale_factor creates a larger buffer between the low and min watermarks. This gives kswapd more time to reclaim memory gradually before the system hits a critical state.

In a typical server workload, you might have a long-running process with some memory that becomes ‘cold’. A higher swappiness value can free up RAM by swapping out the cold memory, for other active processes that can benefit from keeping their file-cache.

Tuning the min_free_kbytes and watermark_scale_factor parameters so that the swapping window starts earlier gives kswapd more room to offload memory to disk and helps prevent OOM kills during sudden memory spikes.

Swap tests and results

To understand the real-world impact of these parameters, I designed a series of stress tests.

Test setup

Environment: GKE on Google Cloud
Kubernetes version: 1.33.2
Node configuration: n2-standard-2 (8GiB RAM, 50GB swap on a pd-balanced disk, without encryption), Ubuntu 22.04
Workload: A custom Go application designed to allocate memory at a configurable rate, generate file-cache pressure, and simulate different memory access patterns (random vs sequential).
Monitoring: A sidecar container capturing system metrics every second.
Protection: Critical system components (kubelet, container runtime, sshd) were prevented from swapping by setting memory.swap.max=0 in their respective cgroups.

Test methodology

I ran a stress-test pod on nodes with different swappiness settings (0, 60, and 90) and varied the min_free_kbytes and watermark_scale_factor parameters to observe the outcomes under heavy memory allocation and I/O pressure.

Visualizing swap in action

The graph below, from a 100MBps stress test, shows swap in action. As free memory (in the “Memory Usage” plot) decreases, swap usage (Swap Used (GiB)) and swap-out activity (Swap Out (MiB/s)) increase. Critically, as the system relies more on swap, the I/O activity and corresponding wait time (IO Wait % in the “CPU Usage” plot) also rise, indicating CPU stress.

Findings

My initial tests with default kernel parameters (swappiness=60, min_free_kbytes=68MB, watermark_scale_factor=10) quickly led to OOM kills and even unexpected node restarts under high memory pressure. By selecting appropriate kernel parameters, a good balance between node stability and performance can be achieved.

The impact of swappiness

The swappiness parameter directly influences the kernel’s choice between reclaiming anonymous memory (swapping) and dropping page cache. To observe this, I ran a test where one pod generated and held file-cache pressure, followed by a second pod allocating anonymous memory at 100MB/s, to observe the kernel’s preference on reclaim.

My findings reveal a clear trade-off:

swappiness=90: The kernel proactively swapped out the inactive anonymous memory to keep the file cache. This resulted in high and sustained swap usage and significant I/O activity (“Blocks Out”), which in turn caused spikes in I/O wait on the CPU.
swappiness=0: The kernel favored dropping file-cache pages, delaying swap consumption. However, it’s critical to understand that this does not disable swapping.
When memory pressure was high, the kernel still swapped anonymous memory to disk.

The choice is workload-dependent. For workloads sensitive to I/O latency, a lower swappiness is preferable. For workloads that rely on a large and frequently accessed file cache, a higher swappiness may be beneficial, provided the underlying disk is fast enough to handle the load.

Tuning watermarks to prevent eviction and OOM kills

The most critical challenge I encountered was the interaction between rapid memory allocation and Kubelet’s eviction mechanism. When my test pod, which was deliberately configured to overcommit memory, allocated it at a high rate (e.g., 300-500 MBps), the system quickly ran out of free memory.

With default watermarks, the buffer for reclamation was too small. Before kswapd could free up enough memory by swapping, the node would hit a critical state, leading to two potential outcomes:

Kubelet eviction: If kubelet’s eviction manager detected memory.available was below its threshold, it would evict the pod.
OOM killer: In some high-rate scenarios, the OOM Killer would activate before eviction could complete, sometimes killing higher priority pods that were not the source of the pressure.

To mitigate this I tuned the watermarks:

Increased min_free_kbytes to 512MiB: This forces the kernel to start reclaiming memory much earlier, providing a larger safety buffer.
Increased watermark_scale_factor to 2000: This widened the gap between the low and high watermarks (to low ≈337,000 pages and high ≈592,000 pages in my test node’s /proc/zoneinfo, compared with ≈13,000 and ≈16,000 pages with the defaults), effectively increasing the swapping window.

This combination gave kswapd a larger operational zone and more time to swap pages to disk during memory spikes, successfully preventing both premature evictions and OOM kills in my test runs.
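To make the effect of watermark_scale_factor concrete, here is a minimal Python sketch of how the low and high watermarks appear to follow from a zone's min watermark. The formula reflects my reading of the kernel's per-zone watermark setup and should be treated as an assumption, not a specification. The function name zone_watermarks is hypothetical, and the per-zone min value is taken directly from /proc/zoneinfo rather than derived from min_free_kbytes.

```python
from typing import Tuple


def zone_watermarks(min_pages: int, managed_pages: int,
                    watermark_scale_factor: int) -> Tuple[int, int]:
    """Return (low, high) watermarks in pages for a single zone.

    Assumed formula: the gap between watermarks is the larger of
    min/4 and managed_pages * watermark_scale_factor / 10000.
    """
    gap = max(min_pages // 4, managed_pages * watermark_scale_factor // 10000)
    low = min_pages + gap
    high = min_pages + 2 * gap
    return low, high


# Inputs are the min and managed values reported by /proc/zoneinfo
# for the two configurations tested in this article.
print(zone_watermarks(10504, 1265603, 10))    # default settings
print(zone_watermarks(82109, 1274542, 2000))  # tuned settings
```

With a small scale factor the gap is dominated by min/4, so raising min_free_kbytes alone only modestly widens the window; a large scale factor makes the gap proportional to the zone's managed memory instead.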
The table below compares watermark levels from /proc/zoneinfo (non-NUMA node):

min_free_kbytes=67584KiB and watermark_scale_factor=10

Node 0, zone Normal
  pages free 583273
  boost 0
  min 10504
  low 13130
  high 15756
  spanned 1310720
  present 1310720
  managed 1265603

min_free_kbytes=524288KiB and watermark_scale_factor=2000

Node 0, zone Normal
  pages free 470539
  min 82109
  low 337017
  high 591925
  spanned 1310720
  present 1310720
  managed 1274542

The graph below reveals that the kernel buffer size and scaling factor play a crucial role in determining how the system responds to memory load. With the right combination of these parameters, the system can effectively use swap space to avoid eviction and maintain stability.

Risks and recommendations

Enabling swap in Kubernetes is a powerful tool, but it comes with risks that must be managed through careful tuning.

Risk of performance degradation: Swapping is orders of magnitude slower than accessing RAM. If an application’s active working set is swapped out, its performance will suffer dramatically due to high I/O wait times (thrashing). Swap should preferably be provisioned on SSD-backed storage to improve performance.

Risk of masking memory leaks: Swap can hide memory leaks in applications, which might otherwise lead to a quick OOM kill. With swap, a leaky application might slowly degrade node performance over time, making the root cause harder to diagnose.

Risk of disabling evictions: Kubelet proactively monitors the node for memory pressure and terminates pods to reclaim the resources. Improper tuning can lead to OOM kills before kubelet has a chance to evict pods gracefully. A properly configured min_free_kbytes is essential to ensure kubelet’s eviction mechanism remains effective.

Kubernetes context

Together, the kernel watermarks and kubelet eviction threshold create a series of memory pressure zones on a node. The eviction-threshold parameters need to be adjusted so that Kubernetes-managed evictions occur before OOM kills.
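The idea of memory pressure zones can be sketched as a small classifier. This is an illustrative Python model only: the names Watermarks and pressure_zone are hypothetical, the zone labels are my own wording for the watermark behavior described earlier in this article, and the kubelet eviction threshold is deliberately left out because memory.available is not the same quantity as free pages.

```python
from typing import NamedTuple


class Watermarks(NamedTuple):
    """Per-zone watermarks in pages, as reported by /proc/zoneinfo."""
    min_pages: int
    low_pages: int
    high_pages: int


def pressure_zone(free_pages: int, wm: Watermarks) -> str:
    """Classify free memory against the kernel watermarks (simplified)."""
    if free_pages < wm.min_pages:
        # Allocations may block while the kernel reclaims directly.
        return "direct reclaim"
    if free_pages < wm.low_pages:
        # kswapd is woken and reclaims (including swapping) in the background.
        return "background reclaim"
    if free_pages < wm.high_pages:
        # kswapd, if already running, keeps going until the high mark.
        return "reclaim until high"
    return "no reclaim"


# Watermarks and free-page counts from the two zoneinfo snapshots above.
default_wm = Watermarks(10504, 13130, 15756)
tuned_wm = Watermarks(82109, 337017, 591925)
print(pressure_zone(583273, default_wm))
print(pressure_zone(470539, tuned_wm))
```

The wider the span between min_pages and high_pages, the more free-memory values fall into the two background zones rather than straight into direct reclaim.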
As the diagram shows, an ideal configuration creates a large enough ‘swapping zone’ (between high and min watermarks) so that the kernel can handle memory pressure by swapping before available memory drops into the Eviction/Direct Reclaim zone.

Recommended starting point

Based on these findings, I recommend the following as a starting point for Linux nodes with swap enabled. You should benchmark this with your own workloads.

vm.swappiness=60: Linux default is a good starting point for general-purpose workloads. However, the ideal value is workload-dependent, and swap-sensitive applications may need more careful tuning.
vm.min_free_kbytes=500000 (500MB): Set this to a reasonably high value (e.g., 2-3% of total node memory) to give the node a reasonable safety buffer.
vm.watermark_scale_factor=2000: Create a larger window for kswapd to work with, preventing OOM kills during sudden memory allocation spikes.

I encourage running benchmark tests with your own workloads in test environments when setting up swap for the first time in your Kubernetes cluster. Swap performance can be sensitive to environmental differences such as CPU load, disk type (SSD vs HDD) and I/O patterns.
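As a rough illustration of turning these recommendations into configuration, the Python sketch below derives min_free_kbytes as a percentage of node memory and renders the three settings in sysctl "key = value" form. The function names, the 3% default, and the idea of writing the result to a file under /etc/sysctl.d are assumptions made for this example; treat the output as a starting point to benchmark, not a definitive configuration.

```python
from typing import Dict


def recommended_sysctls(total_mem_kib: int,
                        min_free_percent: int = 3,
                        swappiness: int = 60,
                        watermark_scale_factor: int = 2000) -> Dict[str, int]:
    """Build the starting-point settings for a node with swap enabled."""
    return {
        "vm.swappiness": swappiness,
        # Integer arithmetic keeps the result exact and reproducible.
        "vm.min_free_kbytes": total_mem_kib * min_free_percent // 100,
        "vm.watermark_scale_factor": watermark_scale_factor,
    }


def render_sysctl_conf(settings: Dict[str, int]) -> str:
    """Render settings as lines suitable for a sysctl configuration file."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items()) + "\n"


# Example: an 8GiB node, like the one used in the tests above.
settings = recommended_sysctls(8 * 1024 * 1024)
print(render_sysctl_conf(settings), end="")
```

Keeping the percentage as a parameter makes it easy to compare a proportional buffer against a fixed value such as the 512MiB used in the tests.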
-
Introducing Headlamp AI Assistant
on August 7, 2025 at 7:00 pm
This announcement originally appeared on the Headlamp blog. To simplify Kubernetes management and troubleshooting, we’re thrilled to introduce Headlamp AI Assistant: a powerful new plugin for Headlamp that helps you understand and operate your Kubernetes clusters and applications with greater clarity and ease. Whether you’re a seasoned engineer or just getting started, the AI Assistant offers: Fast time to value: Ask questions like “Is my application healthy?” or “How can I fix this?” without needing deep Kubernetes knowledge. Deep insights: Start with high-level queries and dig deeper with prompts like “List all the problematic pods” or “How can I fix this pod?” Focused & relevant: Ask questions in the context of what you’re viewing in the UI, such as “What’s wrong here?” Action-oriented: Let the AI take action for you, like “Restart that deployment”, with your permission. Here is a demo of the AI Assistant in action as it helps troubleshoot an application running with issues in a Kubernetes cluster: Hopping on the AI train Large Language Models (LLMs) have transformed not just how we access data but also how we interact with it. The rise of tools like ChatGPT opened a world of possibilities, inspiring a wave of new applications. Asking questions or giving commands in natural language is intuitive, especially for users who aren’t deeply technical. Now everyone can quickly ask how to do X or Y, without feeling awkward or having to traverse pages and pages of documentation like before. Therefore, Headlamp AI Assistant brings a conversational UI to Headlamp, powered by LLMs that Headlamp users can configure with their own API keys. It is available as a Headlamp plugin, making it easy to integrate into your existing setup. Users can enable it by installing the plugin and configuring it with their own LLM API keys, giving them control over which model powers the assistant. 
Once enabled, the assistant becomes part of the Headlamp UI, ready to respond to contextual queries and perform actions directly from the interface. Context is everything As expected, the AI Assistant is focused on helping users with Kubernetes concepts. Yet, while there is a lot of value in responding to Kubernetes related questions from Headlamp’s UI, we believe that the great benefit of such an integration is when it can use the context of what the user is experiencing in an application. So, the Headlamp AI Assistant knows what you’re currently viewing in Headlamp, and this makes the interaction feel more like working with a human assistant. For example, if a pod is failing, users can simply ask “What’s wrong here?” and the AI Assistant will respond with the root cause, like a missing environment variable or a typo in the image name. Follow-up prompts like “How can I fix this?” allow the AI Assistant to suggest a fix, streamlining what used to take multiple steps into a quick, conversational flow. Sharing the context from Headlamp is not a trivial task though, so it’s something we will keep working on perfecting. Tools Context from the UI is helpful, but sometimes additional capabilities are needed. If the user is viewing the pod list and wants to identify problematic deployments, switching views should not be necessary. To address this, the AI Assistant includes support for a Kubernetes tool. This allows asking questions like “Get me all deployments with problems” prompting the assistant to fetch and display relevant data from the current cluster. Likewise, if the user requests an action like “Restart that deployment” after the AI points out what deployment needs restarting, it can also do that. In case of “write” operations, the AI Assistant does check with the user for permission to run them. AI Plugins Although the initial version of the AI Assistant is already useful for Kubernetes users, future iterations will expand its capabilities. 
Currently, the assistant supports only the Kubernetes tool, but further integration with Headlamp plugins is underway. Similarly, we could get richer insights for GitOps via the Flux plugin, monitoring through Prometheus, package management with Helm, and more. And of course, as the popularity of MCP grows, we are looking into how to integrate it as well, in a more plug-and-play fashion.

Try it out!

We hope this first version of the AI Assistant helps users manage Kubernetes clusters more effectively and assists newcomers in navigating the learning curve. We invite you to try out this early version and give us your feedback. The AI Assistant plugin can be installed from Headlamp’s Plugin Catalog in the desktop version, or by using the container image when deploying Headlamp. Stay tuned for the future versions of the Headlamp AI Assistant!
-
Kubernetes v1.34 Sneak Peek
on July 28, 2025 at 12:00 am
Kubernetes v1.34 is coming at the end of August 2025. This release will not include any removal or deprecation, but it is packed with an impressive number of enhancements. Here are some of the features we are most excited about in this cycle! Please note that this information reflects the current state of v1.34 development and may change before release. Featured enhancements of Kubernetes v1.34 The following list highlights some of the notable enhancements likely to be included in the v1.34 release, but is not an exhaustive list of all planned changes. This is not a commitment and the release content is subject to change. The core of DRA targets stable Dynamic Resource Allocation (DRA) provides a flexible way to categorize, request, and use devices like GPUs or custom hardware in your Kubernetes cluster. Since the v1.30 release, DRA has been based around claiming devices using structured parameters that are opaque to the core of Kubernetes. The relevant enhancement proposal, KEP-4381, took inspiration from dynamic provisioning for storage volumes. DRA with structured parameters relies on a set of supporting API kinds: ResourceClaim, DeviceClass, ResourceClaimTemplate, and ResourceSlice API types under resource.k8s.io, while extending the .spec for Pods with a new resourceClaims field. The core of DRA is targeting graduation to stable in Kubernetes v1.34. With DRA, device drivers and cluster admins define device classes that are available for use. Workloads can claim devices from a device class within device requests. Kubernetes allocates matching devices to specific claims and places the corresponding Pods on nodes that can access the allocated devices. This framework provides flexible device filtering using CEL, centralized device categorization, and simplified Pod requests, among other benefits. Once this feature has graduated, the resource.k8s.io/v1 APIs will be available by default. 
ServiceAccount tokens for image pull authentication The ServiceAccount token integration for kubelet credential providers is likely to reach beta and be enabled by default in Kubernetes v1.34. This allows the kubelet to use these tokens when pulling container images from registries that require authentication. That support already exists as alpha, and is tracked as part of KEP-4412. The existing alpha integration allows the kubelet to use short-lived, automatically rotated ServiceAccount tokens (that follow OIDC-compliant semantics) to authenticate to a container image registry. Each token is scoped to one associated Pod; the overall mechanism replaces the need for long-lived image pull Secrets. Adopting this new approach reduces security risks, supports workload-level identity, and helps cut operational overhead. It brings image pull authentication closer to modern, identity-aware good practice. Pod replacement policy for Deployments After a change to a Deployment, terminating pods may stay up for a considerable amount of time and may consume additional resources. As part of KEP-3973, the .spec.podReplacementPolicy field will be introduced (as alpha) for Deployments. If your cluster has the feature enabled, you’ll be able to select one of two policies: TerminationStarted Creates new pods as soon as old ones start terminating, resulting in faster rollouts at the cost of potentially higher resource consumption. TerminationComplete Waits until old pods fully terminate before creating new ones, resulting in slower rollouts but ensuring controlled resource consumption. This feature makes Deployment behavior more predictable by letting you choose when new pods should be created during updates or scaling. It’s beneficial when working in clusters with tight resource constraints or with workloads with long termination periods. 
It’s expected to be available as an alpha feature and can be enabled using the DeploymentPodReplacementPolicy and DeploymentReplicaSetTerminatingReplicas feature gates in the API server and kube-controller-manager. Production-ready tracing for kubelet and API Server To address the longstanding challenge of debugging node-level issues by correlating disconnected logs, KEP-2831 provides deep, contextual insights into the kubelet. This feature instruments critical kubelet operations, particularly its gRPC calls to the Container Runtime Interface (CRI), using the vendor-agnostic OpenTelemetry standard. It allows operators to visualize the entire lifecycle of events (for example: a Pod startup) to pinpoint sources of latency and errors. Its most powerful aspect is the propagation of trace context; the kubelet passes a trace ID with its requests to the container runtime, enabling runtimes to link their own spans. This effort is complemented by a parallel enhancement, KEP-647, which brings the same tracing capabilities to the Kubernetes API server. Together, these enhancements provide a more unified, end-to-end view of events, simplifying the process of pinpointing latency and errors from the control plane down to the node. These features have matured through the official Kubernetes release process. KEP-2831 was introduced as an alpha feature in v1.25, while KEP-647 debuted as alpha in v1.22. Both enhancements were promoted to beta together in the v1.27 release. Looking forward, Kubelet Tracing (KEP-2831) and API Server Tracing (KEP-647) are now targeting graduation to stable in the upcoming v1.34 release. PreferSameZone and PreferSameNode traffic distribution for Services The spec.trafficDistribution field within a Kubernetes Service allows users to express preferences for how traffic should be routed to Service endpoints. KEP-3015 deprecates PreferClose and introduces two additional values: PreferSameZone and PreferSameNode. 
PreferSameZone is equivalent to the current PreferClose. PreferSameNode prioritizes sending traffic to endpoints on the same node as the client. This feature was introduced in v1.33 behind the PreferSameTrafficDistribution feature gate. It is targeting graduation to beta in v1.34 with its feature gate enabled by default.

Support for KYAML: a Kubernetes dialect of YAML

KYAML aims to be a safer and less ambiguous YAML subset, and was designed specifically for Kubernetes. Whatever version of Kubernetes you use, you’ll be able to use KYAML for writing manifests and/or Helm charts. You can write KYAML and pass it as an input to any version of kubectl, because all KYAML files are also valid as YAML. With kubectl v1.34, we expect you’ll also be able to request KYAML output from kubectl (as in kubectl get -o kyaml …). If you prefer, you can still request the output in JSON or YAML format.

KYAML addresses specific challenges with both YAML and JSON. YAML’s significant whitespace requires careful attention to indentation and nesting, while its optional string-quoting can lead to unexpected type coercion (for example: “The Norway Bug”). Meanwhile, JSON lacks comment support and has strict requirements for trailing commas and quoted keys.

KEP-5295 introduces KYAML, which tries to address the most significant problems by:

Always double-quoting value strings
Leaving keys unquoted unless they are potentially ambiguous
Always using {} for mappings (associative arrays)
Always using [] for lists

This might sound a lot like JSON, because it is! But unlike JSON, KYAML supports comments, allows trailing commas, and doesn’t require quoted keys.

We’re hoping to see KYAML introduced as a new output format for kubectl v1.34. As with all these features, none of these changes are 100% confirmed; watch this space!

As a format, KYAML is and will remain a strict subset of YAML, ensuring that any compliant YAML parser can parse KYAML documents.
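The four formatting rules listed above can be illustrated with a small Python sketch that prints a nested structure in a KYAML-like style. This is not kubectl's implementation and makes no claim to match its exact output: the function names are hypothetical, and the test for an "ambiguous" key (a short list of YAML boolean-like and null-like words, plus anything that is not a plain identifier) is my own assumption.

```python
import json
import re

# Assumption: keys that look like plain identifiers may stay unquoted.
_PLAIN_KEY = re.compile(r"^[A-Za-z_][A-Za-z0-9_./-]*$")
# Assumption: words that YAML parsers may coerce to booleans or null.
_AMBIGUOUS = {"y", "n", "yes", "no", "on", "off", "true", "false", "null"}


def format_key(key: str) -> str:
    """Leave a key unquoted unless it is potentially ambiguous."""
    if _PLAIN_KEY.match(key) and key.lower() not in _AMBIGUOUS:
        return key
    return json.dumps(key)


def to_kyaml_like(value, indent: int = 0) -> str:
    """Render dicts with {}, lists with [], and strings double-quoted."""
    pad = "  " * indent
    inner = "  " * (indent + 1)
    if isinstance(value, dict):
        if not value:
            return "{}"
        rows = [f"{inner}{format_key(str(k))}: {to_kyaml_like(v, indent + 1)},"
                for k, v in value.items()]
        return "{\n" + "\n".join(rows) + "\n" + pad + "}"
    if isinstance(value, (list, tuple)):
        if not value:
            return "[]"
        rows = [f"{inner}{to_kyaml_like(v, indent + 1)}," for v in value]
        return "[\n" + "\n".join(rows) + "\n" + pad + "]"
    # Strings are always double-quoted; other scalars use JSON spelling.
    return json.dumps(value)


print(to_kyaml_like({"kind": "Pod", "no": "NO", "ports": [80]}))
```

Because the string "NO" is always emitted in double quotes, a YAML parser cannot coerce it into a boolean, which is the kind of surprise the quoting rule is meant to avoid.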
Kubernetes does not require you to provide input specifically formatted as KYAML, and we have no plans to change that. Fine-grained autoscaling control with HPA configurable tolerance KEP-4951 introduces a new feature that allows users to configure autoscaling tolerance on a per-HPA basis, overriding the default cluster-wide 10% tolerance setting that often proves too coarse-grained for diverse workloads. The enhancement adds an optional tolerance field to the HPA’s spec.behavior.scaleUp and spec.behavior.scaleDown sections, enabling different tolerance values for scale-up and scale-down operations, which is particularly valuable since scale-up responsiveness is typically more critical than scale-down speed for handling traffic surges. Released as alpha in Kubernetes v1.33 behind the HPAConfigurableTolerance feature gate, this feature is expected to graduate to beta in v1.34. This improvement helps to address scaling challenges with large deployments, where for scaling in, a 10% tolerance might mean leaving hundreds of unnecessary Pods running. Using the new, more flexible approach would enable workload-specific optimization for both responsive and conservative scaling behaviors. Want to know more? New features and deprecations are also announced in the Kubernetes release notes. We will formally announce what’s new in Kubernetes v1.34 as part of the CHANGELOG for that release. The Kubernetes v1.34 release is planned for Wednesday 27th August 2025. Stay tuned for updates! Get involved The simplest way to get involved with Kubernetes is to join one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support. 
Follow us on Bluesky @kubernetes.io for the latest updates Join the community discussion on Discuss Join the community on Slack Post questions (or answer questions) on Server Fault or Stack Overflow Share your Kubernetes story Read more about what’s happening with Kubernetes on the blog Learn more about the Kubernetes Release Team
-
Post-Quantum Cryptography in Kubernetes
on July 18, 2025 at 12:00 am
The world of cryptography is on the cusp of a major shift with the advent of quantum computing. While powerful quantum computers are still largely theoretical for many applications, their potential to break current cryptographic standards is a serious concern, especially for long-lived systems. This is where Post-Quantum Cryptography (PQC) comes in. In this article, I’ll dive into what PQC means for TLS and, more specifically, for the Kubernetes ecosystem. I’ll explain what the (surprising) state of PQC in Kubernetes is and what the implications are for current and future clusters.

What is Post-Quantum Cryptography

Post-Quantum Cryptography refers to cryptographic algorithms that are thought to be secure against attacks by both classical and quantum computers. The primary concern is that quantum computers, using algorithms like Shor’s Algorithm, could efficiently break widely used public-key cryptosystems such as RSA and Elliptic Curve Cryptography (ECC), which underpin much of today’s secure communication, including TLS. The industry is actively working on standardizing and adopting PQC algorithms. One of the first to be standardized by NIST is the Module-Lattice Key Encapsulation Mechanism (ML-KEM), formerly known as Kyber, and now standardized as FIPS-203 (PDF download).

It is difficult to predict when quantum computers will be able to break classical algorithms. However, it is clear that we need to start migrating to PQC algorithms now, as the next section shows. To get a feeling for the predicted timeline we can look at a NIST report covering the transition to post-quantum cryptography standards. It declares that systems using classical cryptography should be deprecated after 2030 and disallowed after 2035.

Key exchange vs. digital signatures: different needs, different timelines

In TLS, there are two main cryptographic operations we need to secure:

Key Exchange: This is how the client and server agree on a shared secret to encrypt their communication.
If an attacker records encrypted traffic today, they could decrypt it in the future, if they gain access to a quantum computer capable of breaking the key exchange. This makes migrating KEMs to PQC an immediate priority. Digital Signatures: These are primarily used to authenticate the server (and sometimes the client) via certificates. The authenticity of a server is verified at the time of connection. While important, the risk of an attack today is much lower, because the decision of trusting a server cannot be abused after the fact. Additionally, current PQC signature schemes often come with significant computational overhead and larger key/signature sizes compared to their classical counterparts. Another significant hurdle in the migration to PQ certificates is the upgrade of root certificates. These certificates have long validity periods and are installed in many devices and operating systems as trust anchors. Given these differences, the focus for immediate PQC adoption in TLS has been on hybrid key exchange mechanisms. These combine a classical algorithm (such as Elliptic Curve Diffie-Hellman Ephemeral (ECDHE)) with a PQC algorithm (such as ML-KEM). The resulting shared secret is secure as long as at least one of the component algorithms remains unbroken. The X25519MLKEM768 hybrid scheme is the most widely supported one. State of PQC key exchange mechanisms (KEMs) today Support for PQC KEMs is rapidly improving across the ecosystem. Go: The Go standard library’s crypto/tls package introduced support for X25519MLKEM768 in version 1.24 (released February 2025). Crucially, it’s enabled by default when there is no explicit configuration, i.e., Config.CurvePreferences is nil. Browsers & OpenSSL: Major browsers like Chrome (version 131, November 2024) and Firefox (version 135, February 2025), as well as OpenSSL (version 3.5.0, April 2025), have also added support for the ML-KEM based hybrid scheme. 
Apple is also rolling out support for X25519MLKEM768 in version 26 of their operating systems. Given the proliferation of Apple devices, this will have a significant impact on the global PQC adoption.

For a more detailed overview of the state of PQC in the wider industry, see this blog post by Cloudflare.

Post-quantum KEMs in Kubernetes: an unexpected arrival

So, what does this mean for Kubernetes? Kubernetes components, including the API server and kubelet, are built with Go.

As of Kubernetes v1.33, released in April 2025, the project uses Go 1.24. A quick check of the Kubernetes codebase reveals that Config.CurvePreferences is not explicitly set. This leads to a fascinating conclusion: Kubernetes v1.33, by virtue of using Go 1.24, supports hybrid post-quantum X25519MLKEM768 for TLS connections by default!

You can test this yourself. If you set up a Minikube cluster running Kubernetes v1.33.0, you can connect to the API server using a recent OpenSSL client:

$ minikube start --kubernetes-version=v1.33.0
$ kubectl cluster-info
Kubernetes control plane is running at https://127.0.0.1:<PORT>
$ kubectl config view --minify --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 -d > ca.crt
$ openssl version
OpenSSL 3.5.0 8 Apr 2025 (Library: OpenSSL 3.5.0 8 Apr 2025)
$ echo -n "Q" | openssl s_client -connect 127.0.0.1:<PORT> -CAfile ca.crt
[…]
Negotiated TLS1.3 group: X25519MLKEM768
[…]
DONE

Lo and behold, the negotiated group is X25519MLKEM768! This is a significant step towards making Kubernetes quantum-safe, seemingly without a major announcement or dedicated KEP (Kubernetes Enhancement Proposal).

The Go version mismatch pitfall

An interesting wrinkle emerged with Go versions 1.23 and 1.24. Go 1.23 included experimental support for a draft version of ML-KEM, identified as X25519Kyber768Draft00. This was also enabled by default if Config.CurvePreferences was nil. Kubernetes v1.32 used Go 1.23.
However, Go 1.24 removed the draft support and replaced it with the standardized version X25519MLKEM768.

What happens if a client and server are using mismatched Go versions (one on 1.23, the other on 1.24)? They won’t have a common PQC KEM to negotiate, and the handshake will fall back to classical ECC curves (e.g., X25519).

How could this happen in practice? Consider a scenario: A Kubernetes cluster is running v1.32 (using Go 1.23 and thus X25519Kyber768Draft00). A developer upgrades their kubectl to v1.33, compiled with Go 1.24, which only supports X25519MLKEM768. Now, when kubectl communicates with the v1.32 API server, they no longer share a common PQC algorithm. The connection will downgrade to classical cryptography, silently losing the PQC protection that had been in place. This highlights the importance of understanding the implications of Go version upgrades, and the details of the TLS stack.

Limitations: packet size

One practical consideration with ML-KEM is the size of its public keys: encoded keys are around 1.2 kilobytes for ML-KEM-768. This can cause the initial TLS ClientHello message not to fit inside a single TCP/IP packet, given the typical networking constraints (most commonly, the standard Ethernet frame size limit of 1500 bytes). Some TLS libraries or network appliances might not handle this gracefully, assuming the ClientHello always fits in one packet. This issue has been observed in some Kubernetes-related projects and networking components, potentially leading to connection failures when PQC KEMs are used. More details can be found at tldr.fail.

State of Post-Quantum Signatures

While KEMs are seeing broader adoption, PQC digital signatures are further behind in terms of widespread integration into standard toolchains. NIST has published standards for PQC signatures, such as ML-DSA (FIPS 204) and SLH-DSA (FIPS 205).
However, implementing these in a way that’s broadly usable (e.g., for PQC Certificate Authorities) presents challenges:

* Larger Keys and Signatures: PQC signature schemes often have significantly larger public keys and signature sizes compared to classical algorithms like Ed25519 or RSA. For instance, Dilithium2 keys can be 30 times larger than Ed25519 keys, and certificates can be 12 times larger.
* Performance: Signing and verification operations can be substantially slower. While some algorithms are on par with classical algorithms, others may have a much higher overhead, sometimes on the order of 10x to 1000x worse performance. To improve this situation, NIST is running a second round of standardization for PQC signatures.
* Toolchain Support: Mainstream TLS libraries and CA software do not yet have mature, built-in support for these new signature algorithms. The Go team, for example, has indicated that ML-DSA support is a high priority, but the soonest it might appear in the standard library is Go 1.26 (as of May 2025).

Cloudflare’s CIRCL (Cloudflare Interoperable Reusable Cryptographic Library) implements some PQC signature schemes like variants of Dilithium, and Cloudflare maintains a fork of Go (cfgo) that integrates CIRCL. Using cfgo, it’s possible to experiment with generating certificates signed with PQC algorithms like Ed25519-Dilithium2. However, this requires using a custom Go toolchain and is not yet part of the mainstream Kubernetes or Go distributions.

Conclusion

The journey to a post-quantum secure Kubernetes is underway, and perhaps further along than many realize, thanks to the proactive adoption of ML-KEM in Go. With Kubernetes v1.33, users are already benefiting from hybrid post-quantum key exchange in many TLS connections by default. However, awareness of potential pitfalls, such as Go version mismatches leading to downgrades and issues with ClientHello packet sizes, is crucial.
While PQC for KEMs is becoming a reality, PQC for digital signatures and certificate hierarchies is still in earlier stages of development and adoption for mainstream use. As Kubernetes maintainers and contributors, staying informed about these developments will be key to ensuring the long-term security of the platform.
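To make the Go version mismatch pitfall described in this post more concrete, here is a minimal Python sketch of key exchange group selection. It is a simplified model, not the actual crypto/tls logic: the group names are real, but the preference lists (GO_1_23_GROUPS, GO_1_24_GROUPS) and the helper functions are assumptions chosen for illustration, and real TLS stacks differ in whose preference order wins.

```python
# Simplified model of TLS 1.3 key exchange group selection.
# The group names are real; the preference lists are assumptions that
# approximate Go's defaults when Config.CurvePreferences is nil.
GO_1_23_GROUPS = ["X25519Kyber768Draft00", "X25519", "P-256", "P-384", "P-521"]
GO_1_24_GROUPS = ["X25519MLKEM768", "X25519", "P-256", "P-384", "P-521"]

# Hybrid groups that include a post-quantum component.
HYBRID_PQ_GROUPS = {"X25519Kyber768Draft00", "X25519MLKEM768"}


def negotiate_group(client_groups, server_groups):
    """Return the first client-preferred group the server also supports, or None."""
    supported = set(server_groups)
    for group in client_groups:
        if group in supported:
            return group
    return None


def describe(client_groups, server_groups):
    """Return (group, is_hybrid_pq) for a simulated handshake."""
    group = negotiate_group(client_groups, server_groups)
    return group, group in HYBRID_PQ_GROUPS


# kubectl v1.33 (Go 1.24) talking to an API server v1.33 (Go 1.24)
print(describe(GO_1_24_GROUPS, GO_1_24_GROUPS))
# kubectl v1.33 (Go 1.24) talking to an API server v1.32 (Go 1.23)
print(describe(GO_1_24_GROUPS, GO_1_23_GROUPS))
```

In this model the matched pair selects X25519MLKEM768, while the mismatched pair shares no hybrid group and silently falls back to X25519, which is the downgrade the post warns about.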
-
Navigating Failures in Pods With Devices
on July 3, 2025 at 12:00 am
Kubernetes is the de facto standard for container orchestration, but when it comes to handling specialized hardware like GPUs and other accelerators, things get a bit complicated. This blog post dives into the challenges of managing failure modes when operating pods with devices in Kubernetes, based on insights from Sergey Kanzhelev and Mrunal Patel’s talk at KubeCon NA 2024. You can follow the links to slides and recording.

The AI/ML boom and its impact on Kubernetes

The rise of AI/ML workloads has brought new challenges to Kubernetes. These workloads often rely heavily on specialized hardware, and any device failure can significantly impact performance and lead to frustrating interruptions. As highlighted in the 2024 Llama paper, hardware issues, particularly GPU failures, are a major cause of disruption in AI/ML training. You can also learn how much effort NVIDIA spends on handling device failures and maintenance in the KubeCon talk by Ryan Hallisey and Piotr Prokop, “All-Your-GPUs-Are-Belong-to-Us: An Inside Look at NVIDIA’s Self-Healing GeForce NOW Infrastructure” (recording), where they report seeing 19 remediation requests per 1000 nodes a day! We also see data centers offering spot consumption models and overcommit on power, making device failures commonplace and a part of the business model.

However, Kubernetes’s view on resources is still very static. The resource is either there or not. And if it is there, the assumption is that it will stay there fully functional – Kubernetes lacks good support for handling full or partial hardware failures. These long-existing assumptions combined with the overall complexity of a setup lead to a variety of failure modes, which we discuss here.

Understanding AI/ML workloads

Generally, all AI/ML workloads require specialized hardware, have challenging scheduling requirements, and are expensive when idle. AI/ML workloads typically fall into two categories – training and inference.
Here is an oversimplified view of those categories’ characteristics, which are different from traditional workloads like web services:

Training: These workloads are resource-intensive, often consuming entire machines and running as gangs of pods. Training jobs are usually “run to completion” – but that could be days, weeks or even months. Any failure in a single pod can necessitate restarting the entire step across all the pods.

Inference: These workloads are usually long-running or run indefinitely, and can be small enough to consume a subset of a Node’s devices or large enough to span multiple nodes. They often require downloading huge files with the model weights.

These workload types specifically break many past assumptions:

Workload assumptions before and now

Before | Now
Can get a better CPU and the app will work faster. | Require a specific device (or class of devices) to run.
When something doesn’t work, just recreate it. | Allocation or reallocation is expensive.
Any node will work. No need to coordinate between Pods. | Scheduled in a special way – devices often connected in a cross-node topology.
Each Pod can be plug-and-play replaced if failed. | Pods are a part of a larger task. Lifecycle of an entire task depends on each Pod.
Container images are slim and easily available. | Container images may be so big that they require special handling.
Long initialization can be offset by slow rollout. | Initialization may be long and should be optimized, sometimes across many Pods together.
Compute nodes are commoditized and relatively inexpensive, so some idle time is acceptable. | Nodes with specialized hardware can be an order of magnitude more expensive than those without, so idle time is very wasteful.

The existing failure model relies on old assumptions. It may still work for the new workload types, but it has limited knowledge about devices and is very expensive for them, in some cases prohibitively so. You will see more examples later in this article.
Why Kubernetes still reigns supreme

This article does not go deeper into the question of why not to start fresh for AI/ML workloads, given how different they are from traditional Kubernetes workloads. Despite many challenges, Kubernetes remains the platform of choice for AI/ML workloads. Its maturity, security, and rich ecosystem of tools make it a compelling option. While alternatives exist, they often lack the years of development and refinement that Kubernetes offers. And the Kubernetes developers are actively addressing the gaps identified in this article and beyond.

The current state of device failure handling

This section outlines different failure modes and the best practices and DIY (Do-It-Yourself) solutions used today. The next section describes a roadmap for improving the handling of those failure modes.

Failure modes: K8s infrastructure

In order to understand the failures related to the Kubernetes infrastructure, you need to understand how many moving parts are involved in scheduling a Pod on the node. The sequence of events when the Pod is scheduled on the Node is as follows:

1. Device plugin is scheduled on the Node
2. Device plugin is registered with the kubelet via local gRPC
3. Kubelet uses device plugin to watch for devices and updates capacity of the node
4. Scheduler places a user Pod on a Node based on the updated capacity
5. Kubelet asks Device plugin to Allocate devices for a User Pod
6. Kubelet creates a User Pod with the allocated devices attached to it

A diagram in the original post shows some of the actors involved.

As there are so many actors interconnected, every one of them and every connection may experience interruptions.
This leads to many exceptional situations that are often considered failures, and may cause serious workload interruptions:

* Pods failing admission at various stages of their lifecycle
* Pods unable to run on perfectly fine hardware
* Scheduling taking an unexpectedly long time

The goal for Kubernetes is to make the interaction between these components as reliable as possible. Kubelet already implements retries, grace periods, and other techniques to improve it. The roadmap section goes into details on other edge cases that the Kubernetes project tracks. However, all these improvements only work when these best practices are followed:

* Configure and restart kubelet and the container runtime (such as containerd or CRI-O) as early as possible to not interrupt the workload.
* Monitor device plugin health and carefully plan for upgrades.
* Do not overload the node with less-important workloads to prevent interruption of device plugin and other components.
* Configure tolerations on user Pods to handle node readiness flakes.
* Configure and code graceful termination logic carefully to not block devices for too long.

Another class of Kubernetes infra-related issues is driver-related. With traditional resources like CPU and memory, no compatibility checks between the application and hardware were needed. With special devices like hardware accelerators, there are new failure modes. Device drivers installed on the node:

* Must match the hardware
* Must be compatible with the app
* Must work with other drivers (like nccl, etc.)

Best practices for handling driver versions:

* Monitor driver installer health
* Plan upgrades of infrastructure and Pods to match the version
* Have canary deployments whenever possible

Following the best practices in this section and using device plugins and device driver installers from trusted and reliable sources generally eliminates this class of failures. Kubernetes is tracking work to make this space even better.
Failure modes: device failed

There is very little handling of device failure in Kubernetes today. Device plugins report the device failure only by changing the count of allocatable devices. And Kubernetes relies on standard mechanisms like liveness probes or container failures to allow Pods to communicate the failure condition to the kubelet. However, Kubernetes does not correlate device failures with container crashes and does not offer any mitigation beyond restarting the container while it stays attached to the same device.

This is why many plugins and DIY solutions exist to handle device failures based on various signals.

Health controller

In many cases a failed device will result in unrecoverable and very expensive nodes doing nothing. A simple DIY solution is a node health controller. The controller could compare the device allocatable count with the capacity and, if the capacity is greater, start a timer. Once the timer reaches a threshold, the health controller kills and recreates the node.

There are problems with the health controller approach:

* Root cause of the device failure is typically not known
* The controller is not workload aware
* Failed device might not be in use and you want to keep other devices running
* The detection may be too slow as it is very generic
* The node may be part of a bigger set of nodes and simply cannot be deleted in isolation without other nodes

There are variations of the health controller solving some of the problems above. The overall theme here though is that to best handle failed devices, you need customized handling for the specific workload. Kubernetes doesn’t yet offer enough abstraction to express how critical the device is for a node, for the cluster, and for the Pod it is assigned to.

Pod failure policy

Another DIY approach for device failure handling is a per-pod reaction on a failed device. This approach is applicable for training workloads that are implemented as Jobs.
A Pod can define special error codes for device failures. For example, whenever unexpected device behavior is encountered, the Pod exits with a special exit code. Then the Pod failure policy can handle the device failure in a special way. Read more on Handling retriable and non-retriable pod failures with Pod failure policy.

There are some problems with the Pod failure policy approach for Jobs:

* There is no well-known device failed condition, so this approach does not work for the generic Pod case.
* Error codes must be coded carefully and in some cases are hard to guarantee.
* It only works with Jobs with restartPolicy: Never, due to the limitation of the pod failure policy feature.

So, this solution has limited applicability.

Custom pod watcher

A slightly more generic approach is to implement the Pod watcher as a DIY solution or use third-party tools offering this functionality. The pod watcher is most often used to handle device failures for inference workloads.

Since Kubernetes just keeps a pod assigned to a device, even if the device is reportedly unhealthy, the idea is to detect this situation with the pod watcher and apply some remediation. It often involves obtaining device health status and its mapping to the Pod using the Pod Resources API on the node. If a device fails, the watcher can then delete the attached Pod as a remediation. The ReplicaSet will handle the Pod recreation on a healthy device.

The other reasons to implement this watcher:

* Without it, the Pod will keep being assigned to the failed device forever.
* There is no descheduling for a pod with restartPolicy=Always.
* There are no built-in controllers that delete Pods in CrashLoopBackOff.

Problems with the custom pod watcher:

* The signal for the pod watcher is expensive to get, and involves some privileged actions.
* It is a custom solution and it assumes the importance of a device for a Pod.
* The pod watcher relies on external controllers to reschedule a Pod.
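As a rough illustration of the custom pod watcher idea above, the following Python sketch shows only the decision step: given a device-to-Pod mapping and a set of unhealthy devices, pick the Pods to remediate. Everything here is hypothetical: DeviceAssignment, pods_to_remediate, and the device and Pod names are made up for the example, and the sketch does not call the Pod Resources API or the Kubernetes API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceAssignment:
    """One device-to-Pod mapping (hypothetical shape, not the real API type)."""
    namespace: str
    pod: str
    device_id: str


def pods_to_remediate(assignments, unhealthy_device_ids):
    """Return sorted (namespace, pod) pairs holding at least one unhealthy device."""
    unhealthy = set(unhealthy_device_ids)
    return sorted({(a.namespace, a.pod) for a in assignments if a.device_id in unhealthy})


# Hypothetical snapshot for one node.
assignments = [
    DeviceAssignment("inference", "llm-0", "gpu-0"),
    DeviceAssignment("inference", "llm-0", "gpu-1"),
    DeviceAssignment("inference", "llm-1", "gpu-2"),
]
for namespace, pod in pods_to_remediate(assignments, ["gpu-1"]):
    # A real watcher would delete the Pod through the Kubernetes API here.
    print(f"would delete {namespace}/{pod}")
```

Keeping the selection logic separate from the API calls makes it easy to test, but it also shows the limitation named above: the sketch treats every assigned device as critical for its Pod, which a real workload may not.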
There are more variations of DIY solutions for handling device failures or upcoming maintenance. Overall, Kubernetes has enough extension points to implement these solutions. However, some extension points require higher privilege than users may be comfortable with or are too disruptive. The roadmap section goes into more details on specific improvements in handling the device failures.

Failure modes: container code failed

When the container code fails or something bad happens with it, like out of memory conditions, Kubernetes knows how to handle those cases. Either the container is restarted, or the Pod fails (if it has restartPolicy: Never) and is rescheduled on another node. Kubernetes has limited expressiveness on what is a failure (for example, non-zero exit code or liveness probe failure) and how to react to such a failure (mostly either Always restart or immediately fail the Pod).

This level of expressiveness is often not enough for the complicated AI/ML workloads. AI/ML pods are better rescheduled locally or even in-place as that would save on image pulling time and device allocation. AI/ML pods are often interconnected and need to be restarted together. This adds another level of complexity and optimizing it often brings major savings in running AI/ML workloads.

There are various DIY solutions to orchestrate the handling of Pod failures. The most typical one is to wrap the main executable in a container with an orchestrator process. This orchestrator is able to restart the main executable whenever the job needs to be restarted because some other pod has failed.

Solutions like this are very fragile and elaborate. Still, they are often worth it for the money saved compared to a regular JobSet delete/recreate cycle when used in large training jobs. Making these solutions less fragile and more streamlined by developing new hooks and extension points in Kubernetes will make them easy to apply to smaller jobs, benefiting everybody.
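The wrapper idea above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not how any particular training framework does it: run_with_restarts and its parameters are hypothetical names, and the should_restart callback here only looks at the local exit code, whereas a real orchestrator would also consult some job-wide signal that another pod has failed.

```python
import subprocess
import sys


def run_with_restarts(command, max_restarts=3, should_restart=lambda exit_code: exit_code != 0):
    """Run command, restarting it in place while should_restart(exit_code) is true.

    Returns (final_exit_code, restarts_used).
    """
    restarts = 0
    while True:
        # The container keeps running; only the main executable is relaunched.
        exit_code = subprocess.run(command).returncode
        if not should_restart(exit_code) or restarts >= max_restarts:
            return exit_code, restarts
        restarts += 1


# Stand-in for a training process that always fails with exit code 3.
exit_code, restarts = run_with_restarts(
    [sys.executable, "-c", "raise SystemExit(3)"], max_restarts=2
)
print(f"exit code {exit_code} after {restarts} restarts")
```

Because the restart happens inside the container, the image pull and the device allocation are reused, which is where the savings described above come from.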
Failure modes: device degradation

Not all device failures are terminal for the overall workload or batch job. As the hardware stack gets more and more complex, misconfiguration on one of the hardware stack layers, or driver failures, may result in devices that are functional, but lagging on performance. One device that is lagging behind can slow down the whole training job.

We see reports of such cases more and more often. Kubernetes has no way to express this type of failure today and, since it is the newest type of failure mode, there is little best-practice guidance from hardware vendors for detection and little third-party tooling for remediation of these situations.

Typically, these failures are detected based on observed workload characteristics. For example, the expected speed of AI/ML training steps on particular hardware. Remediation for those issues is highly dependent on the workload’s needs.

Roadmap

As outlined in a section above, Kubernetes offers a lot of extension points which are used to implement various DIY solutions. The space of AI/ML is developing very fast, with changing requirements and usage patterns. SIG Node is taking a measured approach of enabling more extension points to implement the workload-specific scenarios over introduction of new semantics to support specific scenarios. This means prioritizing making information about failures readily available over implementing automatic remediations for those failures that might only be suitable for a subset of workloads.

This approach ensures there are no drastic changes for workload handling which may break existing, well-oiled DIY solutions or experiences with the existing more traditional workloads.

Many error handling techniques used today work for AI/ML, but are very expensive. SIG Node will invest in extension points to make those cheaper, with the understanding that cost reduction for AI/ML is critical.

The following is the set of specific investments we envision for various failure modes.
Roadmap for failure modes: K8s infrastructure

The area of Kubernetes infrastructure is the easiest to understand and very important to get right for the upcoming transition from Device Plugins to DRA. SIG Node is tracking many work items in this area, most notably the following:

* integrate kubelet with the systemd watchdog · Issue #127460
* DRA: detect stale DRA plugin sockets · Issue #128696
* Support takeover for devicemanager/device-plugin · Issue #127803
* Kubelet plugin registration reliability · Issue #127457
* Recreate the Device Manager gRPC server if failed · Issue #128167
* Retry pod admission on device plugin grpc failures · Issue #128043

Basically, every interaction of Kubernetes components must be reliable via either the kubelet improvements or the best practices in plugins development and deployment.

Roadmap for failure modes: device failed

For the device failures some patterns are already emerging in common scenarios that Kubernetes can support. However, the very first step is to make information about failed devices more easily available. This starts with the work in KEP 4680 (Add Resource Health Status to the Pod Status for Device Plugin and DRA).

Longer-term ideas, still to be tested, include:

* Integrate device failures into Pod Failure Policy.
* Node-local retry policies, enabling pod failure policies for Pods with restartPolicy=OnFailure and possibly beyond that.
* Ability to deschedule a pod, including one with restartPolicy: Always, so it can get a new device allocated.
* Add device health to the ResourceSlice used to represent devices in DRA, rather than simply withdrawing an unhealthy device from the ResourceSlice.

Roadmap for failure modes: container code failed

The main improvements to handle container code failures for AI/ML workloads are all targeting cheaper error handling and recovery. The savings come mostly from reusing pre-allocated resources as much as possible.
This ranges from reusing the Pods by restarting containers in-place, to node-local restart of containers instead of rescheduling whenever possible, to snapshotting support, and to rescheduling that prioritizes the same node to save on image pulls.

Consider this scenario: a big training job needs 512 Pods to run, and one of the Pods fails. This means that all Pods need to be interrupted and synced up to restart the failed step. The most efficient way to achieve this generally is to reuse as many Pods as possible by restarting them in-place, while replacing the failed Pod to clear up the error from it. The original post demonstrates this in a picture.

It is possible to implement this scenario, but all solutions implementing it are fragile due to lack of certain extension points in Kubernetes. Adding these extension points to implement this scenario is on the Kubernetes roadmap.

Roadmap for failure modes: device degradation

There is very little done in this area – there is no clear detection signal, very limited troubleshooting tooling, and no built-in semantics to express the “degraded” device on Kubernetes. There has been discussion of adding data on device performance or degradation in the ResourceSlice used by DRA to represent devices, but it is not yet clearly defined. There are also projects like node-healthcheck-operator that can be used for some scenarios.

We expect developments in this area from hardware vendors and cloud providers, and we expect to see mostly DIY solutions in the near future. As more users get exposed to AI/ML workloads, this is a space needing feedback on patterns used here.

Join the conversation

The Kubernetes community encourages feedback and participation in shaping the future of device failure handling. Join SIG Node and contribute to the ongoing discussions!

This blog post provides a high-level overview of the challenges and future directions for device failure management in Kubernetes.
By addressing these issues, Kubernetes can solidify its position as the leading platform for AI/ML workloads, ensuring resilience and reliability for applications that depend on specialized hardware.
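As an illustration of the kind of DIY detection mentioned in the device degradation sections of this post, here is a minimal Python sketch that flags lagging Pods from observed training step times. It is only one possible heuristic: find_stragglers, the slowdown_factor threshold, and the sample timings are all assumptions made up for the example, and the post itself does not prescribe any specific detection method.

```python
from statistics import median


def find_stragglers(step_seconds_by_pod, slowdown_factor=1.2):
    """Return sorted pod names whose step time exceeds slowdown_factor times the median."""
    if not step_seconds_by_pod:
        return []
    baseline = median(step_seconds_by_pod.values())
    return sorted(
        pod
        for pod, seconds in step_seconds_by_pod.items()
        if seconds > slowdown_factor * baseline
    )


# Hypothetical per-Pod duration of the latest training step, in seconds.
observed = {"trainer-0": 10.0, "trainer-1": 10.2, "trainer-2": 9.9, "trainer-3": 14.0}
print(find_stragglers(observed))
```

Comparing against the median of the gang, rather than a fixed expected speed, avoids hard-coding a hardware-specific number, but the right remediation for a flagged Pod still depends on the workload.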