Kubernetes Blog

The Kubernetes blog is used by the project to communicate new features, community reports, and any news that might be relevant to the Kubernetes community.
-
Kubernetes v1.35: Introducing Workload Aware Scheduling
on December 29, 2025 at 6:30 pm
Scheduling large workloads is a much more complex and fragile operation than scheduling a single Pod, as it often requires considering all Pods together instead of scheduling each one independently. For example, when scheduling a machine learning batch job, you often need to place each worker strategically, such as on the same rack, to make the entire process as efficient as possible. At the same time, the Pods that are part of such a workload are very often identical from the scheduling perspective, which fundamentally changes how this process should look. There are many custom schedulers adapted to perform workload scheduling efficiently, but considering how common and important workload scheduling is to Kubernetes users, especially in the AI era with a growing number of use cases, it is high time to make workloads a first-class citizen for kube-scheduler and support them natively.

Workload aware scheduling

The recent 1.35 release of Kubernetes delivered the first tranche of workload aware scheduling improvements. These are part of a wider effort aimed at improving the scheduling and management of workloads. The effort will span many SIGs and releases, and is expected to gradually expand the capabilities of the system toward the north star goal: seamless workload scheduling and management in Kubernetes, including, but not limited to, preemption and autoscaling.

Kubernetes v1.35 introduces the Workload API that you can use to describe the desired shape as well as the scheduling-oriented requirements of a workload. It comes with an initial implementation of gang scheduling that instructs the kube-scheduler to schedule gang Pods in an all-or-nothing fashion. Finally, we improved scheduling of identical Pods (which typically make up a gang) to speed up the process thanks to the opportunistic batching feature.

Workload API

The new Workload API resource is part of the scheduling.k8s.io/v1alpha1 API group. This resource acts as a structured, machine-readable definition of the scheduling requirements of a multi-Pod application. While user-facing workloads like Jobs define what to run, the Workload resource determines how a group of Pods should be scheduled and how its placement should be managed throughout its lifecycle.

A Workload allows you to define a group of Pods and apply a scheduling policy to them. Here is what a gang scheduling configuration looks like. You can define a podGroup named workers and apply the gang policy with a minCount of 4.

apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        # The gang is schedulable only if 4 pods can run at once
        minCount: 4

When you create your Pods, you link them to this Workload using the new workloadRef field:

apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
  …
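Once the alpha API is enabled in your cluster, Workload objects can be managed with the usual kubectl verbs. A minimal sketch is shown below; the file names are illustrative, and the lower-case plural resource name workloads.scheduling.k8s.io is an assumption:

# Create the Workload and the Pods that reference it (file names are illustrative)
kubectl apply -f training-job-workload.yaml
kubectl apply -f workers.yaml

# List Workload objects in the namespace; the fully qualified resource name
# avoids ambiguity with similarly named resources (plural form assumed)
kubectl get workloads.scheduling.k8s.io -n some-ns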
How gang scheduling works

The gang policy enforces all-or-nothing placement. Without gang scheduling, a Job might be partially scheduled, consuming resources without being able to run, leading to resource wastage and potential deadlocks.

When you create Pods that are part of a gang-scheduled pod group, the scheduler's GangScheduling plugin manages the lifecycle independently for each pod group (or replica key). When you create your Pods (or a controller creates them for you), the scheduler blocks them from scheduling until:
- The referenced Workload object is created.
- The referenced pod group exists in that Workload.
- The number of pending Pods in that group meets your minCount.

Once enough Pods arrive, the scheduler tries to place them. However, instead of binding them to nodes immediately, the Pods wait at a Permit gate. The scheduler checks whether it has found valid assignments for the entire group (at least the minCount). If there is room for the group, the gate opens and all Pods are bound to nodes. If only a subset of the group's Pods was successfully scheduled within a timeout (set to 5 minutes), the scheduler rejects all of the Pods in the group. They go back to the queue, freeing up the reserved resources for other workloads.

We'd like to point out that while this is a first implementation, the Kubernetes project firmly intends to improve and expand the gang scheduling algorithm in future releases. Benefits we hope to deliver include a single-cycle scheduling phase for a whole gang, workload-level preemption, and more, moving towards the north star goal.

Opportunistic batching

In addition to explicit gang scheduling, v1.35 introduces opportunistic batching. This is a Beta feature that improves scheduling latency for identical Pods. Unlike gang scheduling, this feature does not require the Workload API or any explicit opt-in on the user's part. It works opportunistically within the scheduler by identifying Pods that have identical scheduling requirements (container images, resource requests, affinities, etc.). When the scheduler processes a Pod, it can reuse the feasibility calculations for subsequent identical Pods in the queue, significantly speeding up the process. Most users will benefit from this optimization automatically, without taking any special steps, provided their Pods meet the following criteria.

Restrictions

Opportunistic batching works under specific conditions. All fields used by the kube-scheduler to find a placement must be identical between Pods. Additionally, using some features disables the batching mechanism for those Pods to ensure correctness. Note that you may need to review your kube-scheduler configuration to ensure it is not implicitly disabling batching for your workloads. See the docs for more details about restrictions.

The north star vision

The project has a broad ambition to deliver workload aware scheduling. These new APIs and scheduling enhancements are just the first steps. In the near future, the effort aims to tackle:
- Introducing a workload scheduling phase
- Improved support for multi-node DRA and topology aware scheduling
- Workload-level preemption
- Improved integration between scheduling and autoscaling
- Improved interaction with external workload schedulers
- Managing placement of workloads throughout their entire lifecycle
- Multi-workload scheduling simulations
- And more.

The priority and implementation order of these focus areas are subject to change. Stay tuned for further updates.

Getting started

To try the workload aware scheduling improvements (see the flag sketch below):
- Workload API: Enable the GenericWorkload feature gate on both kube-apiserver and kube-scheduler, and ensure the scheduling.k8s.io/v1alpha1 API group is enabled.
- Gang scheduling: Enable the GangScheduling feature gate on kube-scheduler (requires the Workload API to be enabled).
- Opportunistic batching: As a Beta feature, it is enabled by default in v1.35. You can disable it using the OpportunisticBatching feature gate on kube-scheduler if needed.
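As a rough illustration of the opt-in steps above, the control plane flags might look like this. This is a sketch only: how you pass flags depends on how your control plane is deployed, and the --runtime-config entry assumes the v1alpha1 API group is not already enabled.

# kube-apiserver: enable the Workload API and the v1alpha1 scheduling API group
kube-apiserver \
  --feature-gates=GenericWorkload=true \
  --runtime-config=scheduling.k8s.io/v1alpha1=true \
  ...   # remaining flags unchanged

# kube-scheduler: enable the Workload API and gang scheduling
kube-scheduler \
  --feature-gates=GenericWorkload=true,GangScheduling=true \
  ...   # remaining flags unchanged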
We encourage you to try out workload aware scheduling in your test clusters and share your experiences to help shape the future of Kubernetes scheduling. You can send your feedback by:
- Reaching out via Slack (#sig-scheduling).
- Commenting on the workload aware scheduling tracking issue.
- Filing a new issue in the Kubernetes repository.

Learn more

- Read the KEPs for the Workload API and gang scheduling, and for opportunistic batching.
- Track the workload aware scheduling issue for recent updates.
-
Kubernetes v1.35: Fine-grained Supplemental Groups Control Graduates to GA
on December 23, 2025 at 6:30 pm
On behalf of Kubernetes SIG Node, we are pleased to announce the graduation of fine-grained supplemental groups control to General Availability (GA) in Kubernetes v1.35! The new Pod field, supplementalGroupsPolicy, was introduced as an opt-in alpha feature in Kubernetes v1.31 and graduated to beta in v1.33. Now, the feature is generally available.

This feature allows you to implement more precise control over supplemental groups in Linux containers, which can strengthen your security posture, particularly when accessing volumes. Moreover, it also enhances the transparency of UID/GID details in containers, offering improved security oversight.

If you are planning to upgrade your cluster from v1.32 or an earlier version, please be aware that some breaking behavioral changes were introduced in beta (v1.33). For more details, see the behavioral changes introduced in beta and the upgrade considerations sections of the previous blog post for the graduation to beta.

Motivation: Implicit group memberships defined in /etc/group in the container image

Even though the majority of Kubernetes cluster admins/users may not be aware of this, by default Kubernetes merges group information from the Pod with information defined in /etc/group in the container image.

Here's an example: a Pod manifest that specifies spec.securityContext.runAsUser: 1000, spec.securityContext.runAsGroup: 3000 and spec.securityContext.supplementalGroups: 4000 as part of the Pod's security context.

apiVersion: v1
kind: Pod
metadata:
  name: implicit-groups-example
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000]
  containers:
  - name: example-container
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      allowPrivilegeEscalation: false

What is the result of the id command in the example-container container? The output should be similar to this:

uid=1000 gid=3000 groups=3000,4000,50000

Where does group ID 50000 in the supplementary groups (the groups field) come from, even though 50000 is not defined in the Pod's manifest at all? The answer is the /etc/group file in the container image. Checking the contents of /etc/group in the container image reveals something like the following:

user-defined-in-image:x:1000:
group-defined-in-image:x:50000:user-defined-in-image

This shows that the container's primary user 1000 belongs to the group 50000 in the last entry. Thus, the group membership defined in /etc/group in the container image for the container's primary user is implicitly merged with the information from the Pod. Please note that this was a design decision the current CRI implementations inherited from Docker, and the community never really reconsidered it until now.

What's wrong with it?

The implicitly merged group information from /etc/group in the container image poses a security risk. These implicit GIDs can't be detected or validated by policy engines because there's no record of them in the Pod manifest. This can lead to unexpected access control issues, particularly when accessing volumes (see kubernetes/kubernetes#112879 for details), because file permissions are controlled by UIDs/GIDs in Linux.

Fine-grained supplemental groups control in a Pod: supplementalGroupsPolicy

To tackle this problem, a Pod's .spec.securityContext now includes a supplementalGroupsPolicy field. This field lets you control how Kubernetes calculates the supplementary groups for container processes within a Pod.
The available policies are:
- Merge: The group membership defined in /etc/group for the container's primary user is merged in. If no policy is specified, Merge is applied (that is, the as-is behavior, for backward compatibility).
- Strict: Only the group IDs specified in fsGroup, supplementalGroups, or runAsGroup are attached as supplementary groups to the container processes. Group memberships defined in /etc/group for the container's primary user are ignored.

Here is how the Strict policy works. The following Pod manifest specifies supplementalGroupsPolicy: Strict:

apiVersion: v1
kind: Pod
metadata:
  name: strict-supplementalgroups-policy-example
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000]
    supplementalGroupsPolicy: Strict
  containers:
  - name: example-container
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      allowPrivilegeEscalation: false

The result of the id command in the example-container container should be similar to this:

uid=1000 gid=3000 groups=3000,4000

You can see that the Strict policy excludes group 50000 from groups! Thus, ensuring supplementalGroupsPolicy: Strict (enforced by some policy mechanism) helps prevent implicit supplementary groups in a Pod.

Note: A container with sufficient privileges can change its process identity. The supplementalGroupsPolicy only affects the initial process identity. Read on for more details.

Attached process identity in Pod status

This feature also exposes the process identity attached to the first container process of each container via the .status.containerStatuses[].user.linux field. It is helpful for seeing whether implicit group IDs are attached.

…
status:
  containerStatuses:
  - name: ctr
    user:
      linux:
        gid: 3000
        supplementalGroups:
        - 3000
        - 4000
        uid: 1000
…

Note: Please note that the value in the status.containerStatuses[].user.linux field is the identity initially attached to the first container process in the container. If the container has sufficient privilege to call system calls related to process identity (e.g. setuid(2), setgid(2), setgroups(2), etc.), the container process can change its identity. Thus, the actual process identity can be dynamic. There are several ways to restrict these permissions in containers. We suggest the following as simple solutions:
- Set privileged: false and allowPrivilegeEscalation: false in your container's securityContext, or
- Conform your Pod to the Restricted policy in the Pod Security Standards.

Also, the kubelet has no visibility into NRI plugins or container runtime internals. A cluster administrator configuring nodes, or a highly privileged workload running with local administrator permissions, may change supplemental groups for any Pod. However, this is outside the scope of Kubernetes control and should not be a concern for security-hardened nodes.

Strict policy requires up-to-date container runtimes

The high-level container runtime (e.g. containerd, CRI-O) plays a key role in calculating the supplementary group IDs that will be attached to the containers. Thus, supplementalGroupsPolicy: Strict requires a CRI runtime that supports this feature. The old behavior (supplementalGroupsPolicy: Merge) can work with a CRI runtime that does not support this feature, because this policy is fully backward compatible.

Here are some CRI runtimes that support this feature, and the versions you need to be running:
- containerd: v2.0 or later
- CRI-O: v1.31 or later

You can see whether the feature is supported in a Node's .status.features.supplementalGroupsPolicy field. Please note that this field is different from the status.declaredFeatures field introduced in KEP-5328: Node Declared Features (formerly Node Capabilities).

apiVersion: v1
kind: Node
…
status:
  features:
    supplementalGroupsPolicy: true
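If you prefer checking that field from the command line, a jsonpath query is one option. A minimal sketch, assuming the field shown above is populated on your nodes; <node-name> is a placeholder:

# Prints "true" when the node's kubelet and container runtime support the Strict policy
kubectl get node <node-name> \
  -o jsonpath='{.status.features.supplementalGroupsPolicy}{"\n"}'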
As container runtime support for this feature becomes universal, various security policies may start enforcing the Strict behavior as the more secure option. It is best practice to make sure your Pods are ready for such enforcement and that all supplemental groups are transparently declared in the Pod spec, rather than in images.

Getting involved

This enhancement was driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!

How can I learn more?

- Configure a Security Context for a Pod or Container, for further details of supplementalGroupsPolicy
- KEP-3619: Fine-grained SupplementalGroups control
-
Kubernetes v1.35: Kubelet Configuration Drop-in Directory Graduates to GA
on December 22, 2025 at 6:30 pm
With the recent v1.35 release of Kubernetes, support for a kubelet configuration drop-in directory is generally available. The newly stable feature simplifies the management of kubelet configuration across large, heterogeneous clusters.

With v1.35, the kubelet command line argument --config-dir is production-ready and fully supported, allowing you to specify a directory containing kubelet configuration drop-in files. All files in that directory are automatically merged with your main kubelet configuration. This allows cluster administrators to maintain a cohesive base configuration for kubelets while enabling targeted customizations for different node groups or use cases, without complex tooling or manual configuration management.

The problem: managing kubelet configuration at scale

As Kubernetes clusters grow larger and more complex, they often include heterogeneous node pools with different hardware capabilities, workload requirements, and operational constraints. This diversity necessitates different kubelet configurations across node groups, yet managing these varied configurations at scale becomes increasingly challenging. Several pain points emerge:
- Configuration drift: Different nodes may have slightly different configurations, leading to inconsistent behavior
- Node group customization: GPU nodes, edge nodes, and standard compute nodes often require different kubelet settings
- Operational overhead: Maintaining separate, complete configuration files for each node type is error-prone and difficult to audit
- Change management: Rolling out configuration changes across heterogeneous node pools requires careful coordination

Before this support was added to Kubernetes, cluster administrators had to choose between using a single monolithic configuration file for all nodes, manually maintaining multiple complete configuration files, or relying on separate tooling. Each approach had its own drawbacks. This graduation to stable gives cluster administrators a fully supported fourth way to solve that challenge.

Example use cases

Managing heterogeneous node pools

Consider a cluster with multiple node types: standard compute nodes, high-capacity nodes (such as those with GPUs or large amounts of memory), and edge nodes with specialized requirements.

Base configuration

File: 00-base.conf

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDNS:
- "10.96.0.10"
clusterDomain: cluster.local

High-capacity node override

File: 50-high-capacity-nodes.conf

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 50
systemReserved:
  memory: "4Gi"
  cpu: "1000m"

Edge node override

File: 50-edge-nodes.conf (edge compute typically has lower capacity)

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "5%"

With this structure, high-capacity nodes apply both the base configuration and the capacity-specific overrides, while edge nodes apply the base configuration with edge-specific settings.
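To tie the example together, a node points the kubelet at both its main configuration file and the drop-in directory. This is a sketch: the paths are illustrative, and where you set the flags depends on how your nodes are provisioned (for example, in a systemd unit drop-in).

# Example layout on a high-capacity node (paths are illustrative):
#   /etc/kubernetes/kubelet-config.yaml              main kubelet configuration
#   /etc/kubernetes/kubelet.conf.d/00-base.conf
#   /etc/kubernetes/kubelet.conf.d/50-high-capacity-nodes.conf

# Kubelet invocation combining the main config with the drop-in directory
kubelet \
  --config=/etc/kubernetes/kubelet-config.yaml \
  --config-dir=/etc/kubernetes/kubelet.conf.d \
  ...   # remaining flags unchanged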
Gradual configuration rollouts

When rolling out configuration changes, you can:
1. Add a new drop-in file with a high numeric prefix (e.g., 99-new-feature.conf)
2. Test the changes on a subset of nodes
3. Gradually roll out to more nodes
4. Once stable, merge the changes into the base configuration

Viewing the merged configuration

Since configuration is now spread across multiple files, you can inspect the final merged configuration using the kubelet's /configz endpoint:

# Start kubectl proxy
kubectl proxy

# In another terminal, fetch the merged configuration
# Change the '<node-name>' placeholder before running the curl command
curl -X GET http://127.0.0.1:8001/api/v1/nodes/<node-name>/proxy/configz | jq .

This shows the actual configuration the kubelet is using after all merging has been applied. The merged configuration also includes any configuration settings that were specified via kubelet command-line arguments.

For detailed setup instructions, configuration examples, and merging behavior, see the official documentation:
- Set Kubelet Parameters Via A Configuration File
- Kubelet Configuration Directory Merging

Good practices

When using the kubelet configuration drop-in directory:
- Test configurations incrementally: Always test new drop-in configurations on a subset of nodes before rolling out cluster-wide to minimize risk
- Version control your drop-ins: Store your drop-in configuration files in version control (or the configuration source from which these are generated) alongside your infrastructure as code to track changes and enable easy rollbacks
- Use numeric prefixes for predictable ordering: Name files with numeric prefixes (e.g., 00-, 50-, 90-) to explicitly control merge order and make the configuration layering obvious to other administrators
- Be mindful of temporary files: Some text editors automatically create backup files (such as .bak, .swp, or files with a ~ suffix) in the same directory when editing. Ensure these temporary or backup files are not left in the configuration directory, as they may be processed by the kubelet

Acknowledgments

This feature was developed through the collaborative efforts of SIG Node. Special thanks to all contributors who helped design, implement, test, and document this feature across its journey from alpha in v1.28, through beta in v1.30, to GA in v1.35. To provide feedback on this feature, join the Kubernetes Node Special Interest Group, participate in discussions on the public Slack channel (#sig-node), or file an issue on GitHub.

Get involved

If you have feedback or questions about kubelet configuration management, or want to share your experience using this feature, join the discussion:
- SIG Node community page
- Kubernetes Slack in the #sig-node channel
- SIG Node mailing list

SIG Node would love to hear about your experiences using this feature in production!
-
Avoiding Zombie Cluster Members When Upgrading to etcd v3.6
on December 21, 2025 at 12:00 am
This article is a mirror of an original that was recently published to the official etcd blog. The key takeaway? Always upgrade to etcd v3.5.26 or later before moving to v3.6. This ensures your cluster is automatically repaired, and avoids zombie members.

Issue summary

Recently, the etcd community addressed an issue that may appear when users upgrade from v3.5 to v3.6. This bug can cause the cluster to report "zombie members": etcd nodes that were removed from the database cluster some time ago, and that re-appear and join database consensus. The etcd cluster is then inoperable until these zombie members are removed.

In etcd v3.5 and earlier, the v2store was the source of truth for membership data, even though the v3store was also present. As part of our v2store deprecation plan, in v3.6 the v3store is the source of truth for cluster membership. Through a bug report we found out that, in some older clusters, the v2store and v3store could become inconsistent. This inconsistency manifests after upgrading as old, removed "zombie" cluster members re-appearing in the cluster.

The fix and upgrade path

We've added a mechanism in etcd v3.5.26 to automatically sync the v3store from the v2store, ensuring that affected clusters are repaired before upgrading to 3.6.x. To support the many users currently upgrading to 3.6, we have provided the following safe upgrade path:
1. Upgrade your cluster to v3.5.26 or later.
2. Wait and confirm that all members are healthy post-update.
3. Upgrade to v3.6.

We are unable to provide a safe workaround path for users who have some obstacle preventing updating to v3.5.26. As such, if v3.5.26 is not available from your packaging source or vendor, you should delay upgrading to v3.6 until it is.
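As a practical aid for step 2, etcdctl can confirm member health and the running versions before you proceed to v3.6. A minimal sketch; endpoint addresses and any TLS flags depend on your deployment:

# Check that every member reports healthy
etcdctl endpoint health --cluster

# Review the member list and confirm all members are running v3.5.26 or later
etcdctl endpoint status --cluster -w table
etcdctl member list -w table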
Additional technical detail

The information below is offered for reference only. Users can follow the safe upgrade path without knowledge of the following details.

This issue is encountered with clusters that have been running in production on etcd v3.5.25 or earlier. It is a side effect of adding and removing members from the cluster, or of recovering the cluster from failure. This means that the issue is more likely the older the etcd cluster is, but it cannot be ruled out for any user, regardless of the age of the cluster.

etcd maintainers, working with issue reporters, have found three possible triggers for the issue based on symptoms and an analysis of etcd code and logs:
- Bug in etcdctl snapshot restore (v3.4 and older versions): When restoring a snapshot using etcdctl snapshot restore, etcdctl was supposed to remove existing members before adding the new ones. In v3.4, due to a bug, old members were not removed, resulting in zombie members. Refer to the comment on etcdctl.
- --force-new-cluster in v3.5 and earlier versions: In rare cases, forcibly creating a new single-member cluster did not fully remove old members, leaving zombies. The issue was resolved in v3.5.22. Please refer to this PR in the Raft project for detailed technical information.
- --unsafe-no-sync enabled: If --unsafe-no-sync is enabled, in rare cases etcd might persist a membership change to the v3store but crash before writing it to the WAL, causing inconsistency between the v2store and v3store. This is a problem for single-member clusters. For multi-member clusters, forcibly creating a new single-member cluster from the crashed node's data may lead to zombie members. Note that --unsafe-no-sync is generally not recommended, as it may break the guarantees given by the consensus protocol.

Importantly, there may be other triggers for v2store and v3store membership data becoming inconsistent that we have not yet found. This means that you cannot assume you are safe just because you have not performed any of the three actions above. Once users are upgraded to etcd v3.6, the v3store becomes the source of membership data, and further inconsistency is not possible.

Advanced users who want to verify the consistency between the v2store and v3store can follow the steps described in this comment. This check is not required to fix the issue, nor does SIG etcd recommend bypassing the v3.5.26 update regardless of the results of the check.

Key takeaway

Always upgrade to v3.5.26 or later before moving to v3.6. This ensures your cluster is automatically repaired and avoids zombie members.

Acknowledgements

We would like to thank Christian Baumann for reporting this long-standing upgrade issue. His report and follow-up work helped bring the issue to our attention so that we could investigate and resolve it upstream.
-
Kubernetes 1.35: In-Place Pod Resize Graduates to Stable
on December 19, 2025 at 6:30 pm
This release marks a major step: more than 6 years after its initial conception, the In-Place Pod Resize feature (also known as In-Place Pod Vertical Scaling), first introduced as alpha in Kubernetes v1.27 and graduated to beta in Kubernetes v1.33, is now stable (GA) in Kubernetes 1.35! This graduation is a major milestone for improving resource efficiency and flexibility for workloads running on Kubernetes.

What is in-place Pod Resize?

In the past, the CPU and memory resources allocated to a container in a Pod were immutable. This meant changing them required deleting and recreating the entire Pod. For stateful services, batch jobs, or latency-sensitive workloads, this was an incredibly disruptive operation. In-Place Pod Resize makes CPU and memory requests and limits mutable, allowing you to adjust these resources within a running Pod, often without requiring a container restart.

Key concepts:
- Desired resources: A container's spec.containers[*].resources field now represents the desired resources. For CPU and memory, these fields are now mutable.
- Actual resources: The status.containerStatuses[*].resources field reflects the resources currently configured for a running container.
- Triggering a resize: You can request a resize by updating the desired requests and limits in the Pod's specification, using the new resize subresource.

How can I start using in-place Pod Resize?

Detailed usage instructions and examples are provided in the official documentation: Resize CPU and Memory Resources assigned to Containers.
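As a quick illustration of the resize subresource, a resize request can be issued with kubectl patch. This is a sketch: the Pod and container names are placeholders, and a reasonably recent kubectl with --subresource support is assumed.

# Raise the CPU request and limit of a running container in place,
# without recreating the Pod (<pod-name> and <container-name> are placeholders)
kubectl patch pod <pod-name> --subresource resize --patch \
  '{"spec":{"containers":[{"name":"<container-name>","resources":{"requests":{"cpu":"800m"},"limits":{"cpu":"800m"}}}]}}'

# Compare desired resources (spec) against the actual allocation reported in status
kubectl get pod <pod-name> \
  -o jsonpath='{.status.containerStatuses[0].resources}{"\n"}'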
How does this help me?

In-place Pod Resize is a foundational building block that unlocks seamless vertical autoscaling and improvements to workload efficiency.

Resources adjusted without disruption

Workloads sensitive to latency or restarts can have their resources modified in place without downtime or loss of state.

More powerful autoscaling

Autoscalers are now able to adjust resources with less impact. For example, the Vertical Pod Autoscaler (VPA)'s InPlaceOrRecreate update mode, which leverages this feature, has graduated to beta. This allows resources to be adjusted automatically and seamlessly based on usage, with minimal disruption. See AEP-4016 for more details.

Address transient resource needs

Workloads that temporarily need more resources can be adjusted quickly. This enables features like CPU Startup Boost (AEP-7862), where applications can request more CPU during startup and then automatically scale back down.

Here are a few examples of use cases:
- A game server that needs to adjust its size with a shifting player count.
- A pre-warmed worker that can be shrunk while unused but inflated with the first request.
- Dynamically scaling with load for efficient bin-packing.
- Increased resources for JIT compilation on startup.

Changes between beta (1.33) and stable (1.35)

Since the initial beta in v1.33, development effort has primarily gone into stabilizing the feature and improving its usability based on community feedback. Here are the primary changes for the stable release:

Memory limit decrease

Decreasing memory limits was previously prohibited. This restriction has been lifted, and memory limit decreases are now permitted. The kubelet attempts to prevent OOM kills by allowing the resize only if the current memory usage is below the new desired limit. However, this check is best-effort and not guaranteed.

Prioritized resizes

If a node doesn't have enough room to accept all resize requests, Deferred resizes are reattempted based on the following priority:
1. PriorityClass
2. QoS class
3. Duration Deferred, with older requests prioritized first.

Pod-level resources (alpha)

Support for in-place Pod Resize with Pod Level Resources has been introduced behind its own feature gate, which is alpha in v1.35.

Increased observability

There are new kubelet metrics and Pod events specifically associated with In-Place Pod Resize to help users track and debug resource changes.

What's next?

The graduation of In-Place Pod Resize to stable opens the door for powerful integrations across the Kubernetes ecosystem. Several areas of further improvement are currently planned.

Integration with autoscalers and other projects

There are planned integrations with several autoscalers and other projects to improve workload efficiency at a larger scale. Some projects under discussion:
- VPA CPU startup boost (AEP-7862): Allows applications to request more CPU at startup and scale back down after a specific period of time.
- VPA support for in-place updates (AEP-4016): VPA support for InPlaceOrRecreate has recently graduated to beta, with the eventual goal of graduating the feature to stable. Support for the InPlace mode is still being worked on; see this pull request.
- Ray autoscaler: Plans to leverage In-Place Pod Resize to improve workload efficiency. See this Google Cloud blog post for more details.
- Agent-sandbox "Soft-Pause": Investigating leveraging In-Place Pod Resize for improved latency. See the GitHub issue for more details.
- Runtime support: Java and Python runtimes do not support resizing memory without a restart. There is an open conversation with the Java developers; see the bug.

If you have a project that could benefit from integration with in-place Pod resize, please reach out using the channels listed in the feedback section!

Feature expansion

Today, In-Place Pod Resize is prohibited when used in combination with swap, the static CPU Manager, or the static Memory Manager. Additionally, resources other than CPU and memory are still immutable. Expanding the set of supported features and resources is under consideration as more feedback about community needs comes in.

There are also plans to support workload preemption: if there is not enough room on the node for the resize of a high-priority Pod, the goal is to enable policies that automatically evict a lower-priority Pod or upsize the node.

Improved stability

Resolve kubelet-scheduler race conditions

There are known race conditions between the kubelet and the scheduler with regard to in-place Pod resize. Work is underway to resolve these issues over the next few releases. See the issue for more details.

Safer memory limit decrease

The kubelet's best-effort check for OOM-kill prevention can be made even safer by moving the memory usage check into the container runtime itself. See the issue for more details.

Providing feedback

As we look to further build on this foundational feature, please share your feedback on how to improve and extend it. You can share your feedback through GitHub issues, mailing lists, or the Slack channels of the Kubernetes #sig-node and #sig-autoscaling communities.

Thank you to everyone who contributed to making this long-awaited feature a reality!
