Kubernetes Blog The Kubernetes blog is used by the project to communicate new features, community reports, and any news that might be relevant to the Kubernetes community.

  • Kubernetes v1.36: New Metric for Route Sync in the Cloud Controller Manager
    on May 15, 2026 at 6:35 pm

    This article was originally published with the wrong date. It was later republished, dated the 15th of May 2026.

    Kubernetes v1.36 introduces a new alpha counter metric route_controller_route_sync_total to the Cloud Controller Manager (CCM) route controller implementation at k8s.io/cloud-provider. This metric increments each time routes are synced with the cloud provider.

    A/B testing watch-based route reconciliation

    This metric was added to help operators validate the CloudControllerManagerWatchBasedRoutesReconciliation feature gate introduced in Kubernetes v1.35. That feature gate switches the route controller from a fixed-interval loop to a watch-based approach that only reconciles when nodes actually change. This reduces unnecessary API calls to the infrastructure provider, lowering pressure on rate-limited APIs and allowing operators to make more efficient use of their available quota.

    To A/B test this, compare route_controller_route_sync_total with the feature gate disabled (default) versus enabled. In clusters where node changes are infrequent, you should see a significant drop in the sync rate with the feature gate turned on.

    Example: expected behavior

    With the feature gate disabled (the default fixed-interval loop), the counter increments steadily regardless of whether any node changes occurred:

        # After 10 minutes with no node changes
        route_controller_route_sync_total 60
        # After 20 minutes, still no node changes
        route_controller_route_sync_total 120

    With the feature gate enabled (watch-based reconciliation), the counter only increments when nodes are actually added, removed, or updated:

        # After 10 minutes with no node changes
        route_controller_route_sync_total 1
        # After 20 minutes, still no node changes: counter unchanged
        route_controller_route_sync_total 1
        # A new node joins the cluster: counter increments
        route_controller_route_sync_total 2

    The difference is especially visible in stable clusters where nodes rarely change.

    Where can I give feedback?

    If you have feedback, feel free to reach out through any of the following channels:

      • The #sig-cloud-provider channel on Kubernetes Slack
      • The KEP-5237 issue on GitHub
      • The SIG Cloud Provider community page for other communication channels

    How can I learn more?

    For more details, refer to KEP-5237.
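To make the A/B comparison concrete, here is a minimal sketch that turns two samples of route_controller_route_sync_total into an average sync rate. The function name and the way the samples are obtained are assumptions for illustration; the sample values are the ones from the expected-behavior example in this post.

```python
def sync_rate_per_minute(first_sample: float, second_sample: float,
                         minutes_apart: float) -> float:
    """Average per-minute increase of a counter between two samples."""
    if minutes_apart <= 0:
        raise ValueError("minutes_apart must be positive")
    if second_sample < first_sample:
        # A counter only decreases when the exporting process restarts.
        raise ValueError("counter was reset between samples")
    return (second_sample - first_sample) / minutes_apart


# Samples at the 10 and 20 minute marks, taken from the example in this post.
fixed_interval_rate = sync_rate_per_minute(60, 120, 10)  # feature gate disabled
watch_based_rate = sync_rate_per_minute(1, 1, 10)        # feature gate enabled

print(f"fixed-interval: {fixed_interval_rate} syncs/min")
print(f"watch-based:    {watch_based_rate} syncs/min")
```

In a real cluster you would more likely let your monitoring system compute this rate from scraped samples; the sketch only shows the arithmetic behind the comparison.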

  • Kubernetes v1.36: Mixed Version Proxy Graduates to Beta
    on May 15, 2026 at 6:00 pm

    Back in Kubernetes 1.28, we introduced the Mixed Version Proxy (MVP) as an Alpha feature (under the feature gate UnknownVersionInteroperabilityProxy) in a previous blog post. The goal was simple but critical: make cluster upgrades safer by ensuring that requests for resources not yet known to an older API server are correctly routed to a newer peer API server, instead of returning an incorrect 404 Not Found.

    We are excited to announce that the Mixed Version Proxy is moving to Beta in Kubernetes 1.36 and will be enabled by default! The feature has evolved significantly since its initial release, addressing key gaps and modernizing its architecture. Here is a look at how the feature has evolved and what you need to know to leverage it in your clusters.

    What problem are we solving?

    In a highly available control plane undergoing an upgrade, you often have API servers running different versions. These servers might serve different sets of APIs (Groups, Versions, Resources).

    Without MVP, if a client request lands on an API server that does not serve the requested resource (e.g., a new API version introduced in the upgrade), that server returns a 404 Not Found. This is technically incorrect because the resource is available in the cluster, just not on that specific server. This can lead to serious side effects, such as mistaken garbage collection or blocked namespace deletions.

    MVP solves this by proxying the request to a peer API server that can serve it.

    Sequence diagram: request proxying between the Client, API Server A (older/different), and API Server B (newer/capable)

      1. The client sends a request for a resource (e.g., v2) to API Server A, which determines that it cannot serve the request locally.
      2. API Server A looks up a capable peer in its discovery cache.
      3. API Server A proxies the request to API Server B (adding the x-kubernetes-peer-proxied header).
      4. API Server B processes the request locally.
      5. API Server B returns the response to API Server A.
      6. API Server A forwards the response to the client.

    How has it evolved since 1.28?

    The initial Alpha implementation was a great proof of concept, but it had some limitations and relied on older mechanisms. Here is how we have modernized it for Beta:

    From StorageVersion API to Aggregated Discovery

    In the Alpha version, API servers relied on the StorageVersion API to figure out which peers served which resources. While functional, this approach had a significant limitation: the StorageVersion API is not yet supported for CRDs and aggregated APIs.

    For Beta, we have replaced the reliance on StorageVersion API calls with the use of Aggregated Discovery. API servers now use the aggregated discovery data to dynamically understand the capabilities of their peers.

    The Missing Piece: Peer-Aggregated Discovery

    The 1.28 blog post noted a significant gap: while we could proxy resource requests, discovery requests still only showed what the local API server knew about.

    In 1.36, we have added Peer-Aggregated Discovery support! Now, when a client performs discovery (e.g., listing available APIs), the API server merges its local view with the discovery data from all active peers. This provides clients with a complete, unified view of all APIs available across the entire cluster, regardless of which API server they connected to.

    Sequence diagram: peer-aggregated discovery between the Client, API Server A, and API Server B

      1. The client requests the discovery document from API Server A.
      2. API Server A gets its local APIs.
      3. API Server A gets the peer APIs from API Server B (cached or direct).
      4. API Server A merges and sorts the lists deterministically.
      5. API Server A returns the unified discovery document to the client.

    Peer-aggregated discovery is the default behavior when the --peer-ca-file flag is set; otherwise the server falls back to showing only its local APIs. There may still be cases where you need to inspect only the resources served by the specific API server you are connected to. You can request this non-aggregated view by including the profile=nopeer parameter in your request’s Accept header (e.g., Accept: application/json;g=apidiscovery.k8s.io;v=v2;as=APIGroupDiscoveryList;profile=nopeer).

    Required configuration

    While the feature gate will be enabled by default, it requires certain flags to be set to allow for secure communication between peer API servers. To function correctly, make sure your API server is configured with the following flags:

      • --feature-gates=UnknownVersionInteroperabilityProxy=true: This will be the default in 1.36, but it is good to verify.
      • --peer-ca-file=<path-to-ca>: [CRITICAL] This is a required flag. You must provide the CA bundle that the source API server will use to authenticate the serving certificates of destination peer API servers. Without this, proxying will fail due to TLS verification errors.
      • --peer-advertise-ip and --peer-advertise-port: These flags set the network address that peers should use to reach this API server. If unset, the values from --advertise-address or --bind-address are used. If you have complex network topologies where API servers communicate over a specific internal interface, setting these flags explicitly is highly recommended.
    Configuring with kubeadm

    If you manage your cluster with kubeadm, you can configure these flags in your ClusterConfiguration file:

        apiVersion: kubeadm.k8s.io/v1beta4
        kind: ClusterConfiguration
        apiServer:
          # In v1beta4, extraArgs is a list of name/value pairs
          extraArgs:
            - name: peer-ca-file
              value: "/etc/kubernetes/pki/ca.crt"
            # add peer-advertise-ip and peer-advertise-port entries if needed

    Call to action

    If you are running multi-master clusters and upgrading them regularly, the Mixed Version Proxy is a major safety improvement. With it becoming default in 1.36, we encourage you to:

      • Review your API server flags to ensure --peer-ca-file is set properly.
      • Test the feature in your staging environments as you prepare for the 1.36 upgrade.
      • Provide feedback to SIG API Machinery (Slack, mailing list, or by attending SIG API Machinery meetings) on your experience.
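The two behaviors described in this post (proxying a request to a capable peer, and merging discovery data from peers) can be illustrated with a small in-memory model. This is a toy sketch, not the kube-apiserver implementation: the ApiServer class, its methods, and the example.io/v2/widgets resource are hypothetical, and real API servers use the aggregated discovery cache and TLS-authenticated connections rather than direct object references.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ApiServer:
    name: str
    served: set[str]                       # "group/version/resource" strings served locally
    peers: list[ApiServer] = field(default_factory=list)

    def route(self, gvr: str) -> str:
        """Serve locally if possible, otherwise proxy to a capable peer."""
        if gvr in self.served:
            return f"served by {self.name}"
        for peer in sorted(self.peers, key=lambda p: p.name):
            if gvr in peer.served:
                return f"proxied to {peer.name}"
        # No server in the cluster serves it, so 404 is now the correct answer.
        return "404 Not Found"

    def discovery(self, include_peers: bool = True) -> list[str]:
        """Merge local and peer APIs and sort them deterministically."""
        merged = set(self.served)
        if include_peers:                  # False models the profile=nopeer view
            for peer in self.peers:
                merged |= peer.served
        return sorted(merged)


older = ApiServer("api-server-a", {"apps/v1/deployments"})
newer = ApiServer("api-server-b", {"apps/v1/deployments", "example.io/v2/widgets"})
older.peers.append(newer)

print(older.route("example.io/v2/widgets"))
print(older.discovery())
print(older.discovery(include_peers=False))
```

The model makes the key point visible: a 404 is only returned when no server in the cluster can serve the resource, and clients see the same discovery document whichever server they reach.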

  • Kubernetes v1.36: Deprecation and removal of Service ExternalIPs
    on May 14, 2026 at 6:35 pm

    The .spec.externalIPs field for Service was an early attempt to provide cloud-load-balancer-like functionality for non-cloud clusters. Unfortunately, the API assumes that every user in the cluster is fully trusted, and in any situation where that is not the case, it enables various security exploits, as described in CVE-2020-8554.

    Since Kubernetes 1.21, the Kubernetes project has recommended that all users disable .spec.externalIPs. To make that easier, Kubernetes also added an admission controller (DenyServiceExternalIPs) that can be enabled to do this. At the time, SIG Network felt that blocking the functionality by default was too large a breaking change to consider.

    However, the security problems are still there, and as a project we’re increasingly unhappy with the “insecure by default” state of the feature. Additionally, there are now several better alternatives for non-cloud clusters wanting load-balancer-like functionality.

    As a result, the .spec.externalIPs field for Service is now formally deprecated in Kubernetes 1.36. We expect that a future minor release of Kubernetes will drop implementation of the behavior from kube-proxy, and will update the Kubernetes conformance criteria to require that conforming implementations do not provide support.

    A note on terminology, and what hasn’t been deprecated

    The phrase external IP is somewhat overloaded in Kubernetes:

      • The Service API has a field .spec.externalIPs that can be used to add additional IP addresses that a Service will respond on.
      • The Node API’s .status.addresses field can list addresses of several different types, one of which is called ExternalIP.
      • The kubectl tool, when displaying information about a Service of type LoadBalancer in the default output format, will show the load balancer IP address under the column heading EXTERNAL-IP.

    This deprecation is about the first of those. If you are not setting the field externalIPs in any of your Services, then it does not apply to you. That said, as a precaution, you may still want to enable the DenyServiceExternalIPs admission controller to block any future use of the externalIPs field.

    Alternatives to externalIPs

    If you are using .spec.externalIPs, then there are several alternatives. Consider a Service like the following:

        apiVersion: v1
        kind: Service
        metadata:
          name: my-example-service
        spec:
          type: ClusterIP
          selector:
            app.kubernetes.io/name: my-example-app
          ports:
            - protocol: TCP
              port: 80
              targetPort: 8080
          externalIPs:
            - "192.0.2.4"

    Using manually-managed LoadBalancer Services instead of externalIPs

    The easiest (but also worst) option is to just switch from using externalIPs to using a type: LoadBalancer service, and assigning a load balancer IP by hand. This is, essentially, exactly the same as externalIPs, with one important difference: the load balancer IP is part of the Service’s .status, not its .spec, and in a cluster with RBAC enabled, it can’t be edited by ordinary users by default. Thus, this replacement for externalIPs would only be available to users who were given permission by the admins (although those users would then be fully empowered to replicate CVE-2020-8554; there would still not be any further checks to ensure that one user wasn’t stealing another user’s IPs, etc.).
    Because of the way that .status works in Kubernetes, you must create the Service without a load balancer IP, and then add the IP as a second step:

        $ cat loadbalancer-service.yaml
        apiVersion: v1
        kind: Service
        metadata:
          name: my-example-service
        spec:
          # prevent any real load balancer controllers from managing this service
          # by using a non-existent loadBalancerClass
          loadBalancerClass: non-existent-class
          type: LoadBalancer
          selector:
            app.kubernetes.io/name: my-example-app
          ports:
            - protocol: TCP
              port: 80
              targetPort: 8080

        $ kubectl apply -f loadbalancer-service.yaml
        service/my-example-service created

        $ kubectl patch service my-example-service --subresource=status --type=merge -p '{"status":{"loadBalancer":{"ingress":[{"ip":"192.0.2.4"}]}}}'

    Using a non-cloud based load balancer controller

    Although LoadBalancer services were originally designed to be backed by cloud load balancers, Kubernetes can also support them on non-cloud platforms by using a third-party load balancer controller such as MetalLB. This solves the security problems associated with externalIPs because the administrator can configure what ranges of IP addresses the controller will assign to services, and the controller will ensure that two services can’t both use the same IP.

    So, for example, after installing and configuring MetalLB, a cluster administrator could configure a pool of IP addresses for use in the cluster:

        apiVersion: metallb.io/v1beta1
        kind: IPAddressPool
        metadata:
          name: production
          namespace: metallb-system
        spec:
          addresses:
            - 192.0.2.0/24
          autoAssign: true
          avoidBuggyIPs: false

    After which a user can create a type: LoadBalancer Service and MetalLB will handle the assignment of the IP address.
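To illustrate why a controller-managed pool closes the hole that externalIPs leaves open, here is a toy allocator that only hands out addresses from an administrator-defined range and never assigns the same address twice. It is a sketch under stated assumptions, not MetalLB's implementation: the IPPool class and its assign method are hypothetical, and skipping the network and broadcast addresses (via hosts()) is a simplification of this sketch only.

```python
import ipaddress
from typing import Dict, Optional


class IPPool:
    """Hypothetical pool that assigns each Service a unique IP from one CIDR."""

    def __init__(self, cidr: str) -> None:
        self.network = ipaddress.ip_network(cidr)
        self.assigned: Dict[str, str] = {}      # service name -> assigned IP

    def assign(self, service: str, requested: Optional[str] = None) -> str:
        if service in self.assigned:            # assignment is idempotent
            return self.assigned[service]
        in_use = set(self.assigned.values())
        if requested is not None:
            if ipaddress.ip_address(requested) not in self.network:
                raise ValueError(f"{requested} is outside the pool {self.network}")
            if requested in in_use:
                raise ValueError(f"{requested} is already assigned")
            chosen = requested
        else:
            # Pick the first free host address in the pool.
            chosen = next(
                (str(ip) for ip in self.network.hosts() if str(ip) not in in_use),
                None,
            )
            if chosen is None:
                raise ValueError("pool exhausted")
        self.assigned[service] = chosen
        return chosen


pool = IPPool("192.0.2.0/24")
print(pool.assign("my-example-service", requested="192.0.2.4"))
print(pool.assign("another-service"))
```

Unlike externalIPs, a second Service asking for 192.0.2.4, or for an address outside the administrator's range, is rejected rather than silently accepted.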
    MetalLB even supports the deprecated loadBalancerIP field in Service, so the end user can request a specific IP (assuming it is available) for backward-compatibility with the externalIPs approach, rather than being assigned one at random:

        apiVersion: v1
        kind: Service
        metadata:
          name: my-example-service
        spec:
          type: LoadBalancer
          selector:
            app.kubernetes.io/name: my-example-app
          ports:
            - protocol: TCP
              port: 80
              targetPort: 8080
          loadBalancerIP: "192.0.2.4"

    Similar approaches would work with other load balancer controllers. This approach gives cluster administrators, rather than users, control over which IP addresses are assigned.

    Using Gateway API

    Another potential solution is to use an implementation of the Gateway API. Gateway API allows cluster administrators to define a Gateway resource, which can have an IP address attached to it via the .spec.addresses field. Since Gateway resources are designed to be managed by cluster administrators, RBAC rules can be put in place to only allow privileged users to manage them.

    An example of how this could look is:

        apiVersion: gateway.networking.k8s.io/v1
        kind: Gateway
        metadata:
          name: example-gateway
        spec:
          gatewayClassName: example-gateway-class
          addresses:
            - type: IPAddress
              value: "192.0.2.4"
        ---
        apiVersion: gateway.networking.k8s.io/v1
        kind: HTTPRoute
        metadata:
          name: example-route
        spec:
          parentRefs:
            - name: example-gateway
          rules:
            - backendRefs:
                - name: example-svc
                  port: 80
        ---
        apiVersion: v1
        kind: Service
        metadata:
          name: example-svc
        spec:
          type: ClusterIP
          selector:
            app.kubernetes.io/name: example-app
          ports:
            - protocol: TCP
              port: 80
              targetPort: 8080

    The Gateway API project is the next generation of Kubernetes Ingress, Load Balancing, and Service Mesh APIs. Gateway API was designed to fix the shortcomings of the Service and Ingress resources, making it a reliable, robust solution that is under active development.
    Timeline for externalIPs deprecation

    The rough timeline for this deprecation is as follows:

      • With the release of Kubernetes 1.36, the field was deprecated; Kubernetes now emits warnings when a user uses this field.
      • About a year later (v1.40 at the earliest), support for .spec.externalIPs will be disabled in kube-proxy, but users will have a way to opt back in should they require more time to migrate away.
      • About another year later (v1.43 at the earliest), support will be disabled completely; users won’t have a way to opt back in.
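To check whether this deprecation affects a cluster, one option is to scan existing Service objects for a non-empty .spec.externalIPs. The sketch below assumes the Services have already been parsed into Python dictionaries (for example from the items list in the JSON output of kubectl get services --all-namespaces -o json); the function name is hypothetical.

```python
from typing import Any, Dict, List


def services_using_external_ips(objects: List[Dict[str, Any]]) -> List[str]:
    """Return "namespace/name" for every Service that sets spec.externalIPs."""
    found = []
    for obj in objects:
        if obj.get("kind") != "Service":
            continue
        if obj.get("spec", {}).get("externalIPs"):   # missing or empty list is fine
            meta = obj.get("metadata", {})
            found.append(f'{meta.get("namespace", "default")}/{meta.get("name", "")}')
    return sorted(found)


example_objects = [
    {"kind": "Service",
     "metadata": {"name": "my-example-service", "namespace": "default"},
     "spec": {"type": "ClusterIP", "externalIPs": ["192.0.2.4"]}},
    {"kind": "Service",
     "metadata": {"name": "plain-service", "namespace": "default"},
     "spec": {"type": "ClusterIP"}},
]

print(services_using_external_ips(example_objects))
```

An empty result means none of the scanned Services rely on the deprecated field.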

  • Kubernetes v1.36: Advancing Workload-Aware Scheduling
    on May 13, 2026 at 6:35 pm

    AI/ML and batch workloads introduce unique scheduling challenges that go beyond simple Pod-by-Pod scheduling. In Kubernetes v1.35, we introduced the first tranche of workload-aware scheduling improvements, featuring the foundational Workload API alongside basic gang scheduling support built on a Pod-based framework, and an opportunistic batching feature to efficiently process identical Pods. Kubernetes v1.36 introduces a significant architectural evolution by cleanly separating API concerns: the Workload API acts as a static template, while the new PodGroup API handles the runtime state. To support this, the kube-scheduler features a new PodGroup scheduling cycle that enables atomic workload processing and paves the way for future enhancements. This release also debuts the first iterations of topology-aware scheduling and workload-aware preemption to advance scheduling capabilities. Additionally, ResourceClaim support for workloads unlocks Dynamic Resource Allocation (DRA) for PodGroups. Finally, to demonstrate real-world readiness, v1.36 delivers the first phase of integration between the Job controller and the new API. Workload and PodGroup API updates The Workload API now serves as a static template, while the new PodGroup API describes the runtime object. Kubernetes v1.36 introduces the Workload and PodGroup APIs as part of the scheduling.k8s.io/v1alpha2 API group, completely replacing the previous v1alpha1 API version. In v1.35, Pod groups and their runtime states were embedded within the Workload resource. The new model decouples these concepts: the Workload now serves as a static template object, while the PodGroup manages the runtime state. This separation also improves performance and scalability as the PodGroup API allows per-replica sharding of status updates. Because the Workload API acts merely as a template, the kube-scheduler’s logic is streamlined. 
    The scheduler can directly read the PodGroup, which contains all the information required by the scheduler, without needing to watch or parse the Workload object itself.

    Here is what the updated configuration looks like. Workload controllers (such as the Job controller) define the Workload object, which now acts as a static template for your Pod groups:

        apiVersion: scheduling.k8s.io/v1alpha2
        kind: Workload
        metadata:
          name: training-job-workload
          namespace: some-ns
        spec:
          # Pod groups are now defined as templates,
          # which contain the PodGroup objects' spec fields.
          podGroupTemplates:
            - name: workers
              schedulingPolicy:
                gang:
                  # The gang is schedulable only if 4 pods can run at once
                  minCount: 4

    Controllers then stamp out runtime PodGroup instances based on those templates. The PodGroup runtime object holds the actual scheduling policy and references the template from which it was created. It also has a status containing conditions that mirror the states of individual Pods, reflecting the overall scheduling state of the group:

        apiVersion: scheduling.k8s.io/v1alpha2
        kind: PodGroup
        metadata:
          name: training-job-workers-pg
          namespace: some-ns
        spec:
          # The PodGroup references the Workload template it originated from.
          # In comparison, .metadata.ownerReferences points to the "true" workload object,
          # e.g., a Job.
          podGroupTemplateRef:
            workload:
              workloadName: training-job-workload
              podGroupTemplateName: workers
          # The actual scheduling policy is placed inside the runtime PodGroup
          schedulingPolicy:
            gang:
              minCount: 4
        status:
          # The status contains conditions mirroring individual Pod conditions.
          conditions:
            - type: PodGroupScheduled
              status: "True"
              lastTransitionTime: 2026-04-03T00:00:00Z

    Finally, to bridge this new architecture with individual Pods, the workloadRef field in the Pod API has been replaced with the schedulingGroup field.
When creating Pods, you link them directly to the runtime PodGroup: apiVersion: v1 kind: Pod metadata: name: worker-0 namespace: some-ns spec: # The workloadRef field has been replaced by schedulingGroup schedulingGroup: podGroupName: training-job-workers-pg … By keeping the Workload as a static template and elevating the PodGroup to a first-class, standalone API, we establish a robust foundation for building advanced workload scheduling capabilities in future Kubernetes releases. PodGroup scheduling cycle and gang scheduling To efficiently manage these workloads, the kube-scheduler now features a dedicated PodGroup scheduling cycle. Instead of evaluating and reserving resources sequentially Pod-by-Pod, which risks scheduling deadlocks, the scheduler evaluates the group as a unified operation. When the scheduler pops a PodGroup member from the scheduling queue, regardless of the group’s specific policy, it fetches the rest of the queued Pods for that group, sorts them deterministically, and executes an atomic scheduling cycle as follows: The scheduler takes a single snapshot of the cluster state to prevent race conditions and ensure consistency while evaluating the entire group. It then attempts to find valid Node placements for all Pods in the group using a PodGroup scheduling algorithm, which leverages the standard Pod-based filtering and scoring phases. Based on the algorithm’s outcome, the scheduling decision is applied atomically for the entire PodGroup. Success: If the placement is found and group constraints are met, the schedulable member Pods are moved directly to the binding phase together. Any remaining unschedulable Pods are returned to the scheduling queue to wait for available resources so they can join the already scheduled Pods. (Note: If new Pods are added to a PodGroup after others are already scheduled, the cycle evaluates the new Pods while accounting for the existing ones. Crucially, Pods already assigned to Nodes remain running. 
The scheduler will not unassign or evict them, even if the group fails to meet its requirements in subsequent cycles.) Failure: If the group fails to meet its requirements, the entire group is considered unschedulable. None of the Pods are bound, and they are returned to the scheduling queue to retry later after a backoff period. This cycle acts as the foundation for gang scheduling. When your workload requires strict all-or-nothing placement, the gang policy leverages this cycle to prevent partial deployments that lead to resource wastage and potential deadlocks. While the scheduler still holds the Pods in the PreEnqueue until the minCount requirement is met, the actual scheduling phase now relies entirely on the new PodGroup cycle. Specifically, during the algorithm’s execution, the scheduler verifies that the number of schedulable Pods satisfies the minCount. If the cluster cannot accommodate the required minimum, none of the pods are bound. The group fails and waits for sufficient resources to free up. Limitations The first version of the PodGroup scheduling cycle comes with certain limitations: For basic homogeneous Pod groups (i.e., those where all Pods have identical scheduling requirements and lack inter-Pod dependencies like affinity, anti-affinity, or topology spread constraints), the algorithm is expected to find a placement if one exists. For heterogeneous Pod groups, finding a valid placement if one exists is not guaranteed, even when the solution might seem trivial. For Pod groups with inter-Pod dependencies, finding a valid placement if one exists is not guaranteed. In addition to the above, for cases involving intra-group dependencies (e.g., when the schedulability of one Pod depends on another group member via inter-Pod affinity), this algorithm may fail to find a placement regardless of cluster state due to its deterministic processing order. 
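The all-or-nothing behavior of the gang policy described above can be sketched with a toy model. This is only an illustration under simplifying assumptions: it considers a single CPU resource, uses first-fit in place of the scheduler's real filtering and scoring phases, and the function name and data shapes are hypothetical.

```python
from typing import Dict, Optional


def schedule_pod_group(pod_cpu: Dict[str, int],
                       node_free_cpu: Dict[str, int],
                       min_count: int) -> Optional[Dict[str, str]]:
    """Return pod -> node placements, or None if fewer than min_count Pods fit."""
    snapshot = dict(node_free_cpu)       # one snapshot for the whole group; input is untouched
    placement: Dict[str, str] = {}
    for pod in sorted(pod_cpu):          # deterministic processing order
        for node in sorted(snapshot):
            if snapshot[node] >= pod_cpu[pod]:
                snapshot[node] -= pod_cpu[pod]
                placement[pod] = node
                break
    if len(placement) < min_count:
        return None                      # gang not satisfied: no Pod is bound
    return placement                     # unplaced Pods would go back to the queue


workers = {f"worker-{i}": 2 for i in range(4)}

print(schedule_pod_group(workers, {"node-a": 4, "node-b": 4}, min_count=4))
print(schedule_pod_group(workers, {"node-a": 4, "node-b": 2}, min_count=4))
```

In the second call only three of the four workers fit, so the function returns None instead of a partial placement, which is the resource-wasting outcome gang scheduling is meant to prevent.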
    Topology-aware scheduling

    For complex distributed workloads like AI/ML training or batch processing, placing Pods randomly across a cluster can introduce significant network latency and bottleneck overall performance.

    Topology-aware scheduling addresses this problem by allowing you to define topology constraints directly on a PodGroup, ensuring its Pods are co-located within specific physical or logical domains:

        apiVersion: scheduling.k8s.io/v1alpha2
        kind: PodGroup
        metadata:
          name: topology-aware-workers-pg
        spec:
          schedulingPolicy:
            gang:
              minCount: 4
          # Enforce that the pods are co-located based on the rack topology
          schedulingConstraints:
            topology:
              - key: topology.kubernetes.io/rack

    In this example, the kube-scheduler attempts to schedule the Pods across various combinations of Nodes that match the rack topology constraint. It then selects the optimal placement based on how efficiently the PodGroup utilizes resources and how many Pods can successfully be scheduled within that domain.

    To achieve this, the scheduler extends the PodGroup scheduling cycle with a dedicated placement-based algorithm consisting of three phases:

      1. Generate candidate placements (subsets of Nodes that are theoretically feasible for the PodGroup’s assignment) based on the group’s scheduling constraints. The topology-aware scheduling plugin uses the new PlacementGenerate extension point to create these placements.
      2. Evaluate each proposed placement to confirm whether the entire PodGroup can actually fit there.
      3. Score all feasible placements to select the best fit for the PodGroup. The topology-aware scheduling plugins use the new PlacementScore extension point to score these placements.

    Currently, topology-aware scheduling does not trigger Pod preemption to satisfy constraints. However, we plan to integrate workload-aware preemption with topology constraints in the upcoming release.
    While Kubernetes v1.36 delivers this foundational topology-aware scheduling, the Kubernetes project is planning to expand its capabilities soon. Future updates will introduce support for multiple topology levels, soft constraints (preferences), deeper integration with Dynamic Resource Allocation (DRA), and more robust behavior when paired with the basic scheduling policy.

    Workload-aware preemption

    To support the new PodGroup scheduling cycle, Kubernetes v1.36 introduces a new type of preemption mechanism called workload-aware preemption. When a PodGroup cannot be scheduled, the scheduler uses this mechanism to try to make room for it.

    Compared to the default preemption used in the standard Pod-by-Pod scheduling cycle, this new mechanism treats the entire PodGroup as a single preemptor unit. Instead of evaluating preemption victims on each Node separately, it searches across the entire cluster. This allows the scheduler to preempt Pods from multiple Nodes simultaneously, making enough space to schedule the whole PodGroup afterwards.

    Workload-aware preemption also introduces two additional concepts directly to the PodGroup API:

      • PodGroup priority, which overrides the priority of the individual Pods forming the PodGroup.
      • PodGroup disruptionMode, which dictates whether the Pods within a PodGroup can be preempted independently, or whether they have to be preempted together in an all-or-nothing fashion.

    In Kubernetes v1.36, these fields are only respected by the workload-aware preemption mechanism. The people working on this set of features are hoping to extend support for these fields to other disruption sources, including default preemption used in the Pod-by-Pod scheduling cycle, in future releases.
        apiVersion: scheduling.k8s.io/v1alpha2
        kind: PodGroup
        metadata:
          name: victim-pg
        spec:
          priorityClassName: high-priority
          priority: 1000
          disruptionMode: PodGroup

    In this example, when the scheduler evaluates victim-pg as a potential preemption victim during a workload-aware preemption cycle, it will use 1000 as its priority and preempt the PodGroup in a strictly all-or-nothing fashion.

    DRA ResourceClaim support for workloads

    Since its general availability in Kubernetes v1.34, DRA has enabled Pods to make detailed requests for devices like GPUs, TPUs, and NICs. Requested devices can be shared by multiple Pods requesting the same ResourceClaim by name. Other requests can be replicated through a ResourceClaimTemplate, in which case Kubernetes generates one ResourceClaim with a non-deterministic name for each Pod referencing the template. However, large-scale workloads that require certain Pods to share certain devices have so far been left to create and manage individual ResourceClaims themselves.

    Now, in addition to Pods, PodGroups can represent the replicable unit for a ResourceClaimTemplate. For ResourceClaimTemplates referenced by one of a PodGroup’s spec.resourceClaims, Kubernetes generates one ResourceClaim for the entire PodGroup, no matter how many Pods are in the group. When one of a Pod’s spec.resourceClaims for a ResourceClaimTemplate matches one of its PodGroup’s spec.resourceClaims, the Pod’s claim resolves to the ResourceClaim generated for the PodGroup and a ResourceClaim will not be generated for that individual Pod. A single PodGroupTemplate in a Workload object can express resource requests which are both copied for each distinct PodGroup and shareable by the Pods within each group.
    The following example shows two Pods requesting the same ResourceClaim generated from a ResourceClaimTemplate for their PodGroup:

        apiVersion: scheduling.k8s.io/v1alpha2
        kind: PodGroup
        metadata:
          name: training-job-workers-pg
        spec:
          …
          resourceClaims:
            - name: pg-claim
              resourceClaimTemplateName: my-claim-template
        ---
        apiVersion: v1
        kind: Pod
        metadata:
          name: topology-aware-workers-pg-pod-1
        spec:
          …
          schedulingGroup:
            podGroupName: training-job-workers-pg
          resourceClaims:
            - name: pg-claim
              resourceClaimTemplateName: my-claim-template
        ---
        apiVersion: v1
        kind: Pod
        metadata:
          name: topology-aware-workers-pg-pod-2
        spec:
          …
          schedulingGroup:
            podGroupName: training-job-workers-pg
          resourceClaims:
            - name: pg-claim
              resourceClaimTemplateName: my-claim-template

    In addition, ResourceClaims referenced by PodGroups, either through resourceClaimName or the claim generated from resourceClaimTemplateName, become reserved for the entire PodGroup. Previously, kube-scheduler could only list individual Pods in a ResourceClaim’s status.reservedFor field, which is limited to 256 items. Now, a single PodGroup reference in status.reservedFor can represent many more than 256 Pods, allowing high-cardinality sharing of devices.

    Together, these changes enable massive workloads with complex topologies to utilize DRA for scalable device management.

    Integration with the Job controller

    In Kubernetes v1.36, the Job controller can create and manage Workload and PodGroup objects on your behalf, so that Jobs representing a tightly coupled parallel application, such as distributed AI training, are gang-scheduled without any additional tooling.

    Without this integration, you would have to create the Workload and PodGroup yourself and wire their references into the Pod template. Now, the Job controller automates this process natively.
    When the WorkloadWithJob feature gate is enabled, the Job controller automatically:

      • creates a Workload and a corresponding runtime PodGroup for each qualifying Job,
      • sets .spec.schedulingGroup on every Pod the Job creates so the scheduler treats them as a single gang, and
      • sets the Job as the owner of the generated objects, so they are garbage-collected when the Job is deleted.

    When does the integration kick in?

    To keep the first feature iteration predictable, the Job controller only creates a Workload and PodGroup when the Job has a well-defined, fixed shape:

      • .spec.parallelism is greater than 1
      • .spec.completionMode is set to Indexed
      • .spec.completions is equal to .spec.parallelism
      • The schedulingGroup is not already set on the Pod template.

    These conditions describe the class of Jobs that gang scheduling can reason about: each Pod has a stable identity (Indexed), the gang size is known and fixed at admission time (parallelism == completions), and no other controller has already claimed scheduling responsibility (schedulingGroup field is unset). Jobs that do not meet these conditions are scheduled Pod-by-Pod, exactly as before.

    If you set schedulingGroup on the Pod template yourself (for example, because a higher-level controller is managing the workload), the Job controller leaves the Pod template alone and does not create its own Workload or PodGroup. This makes the feature safe to enable in clusters that already use an external batch system.

    Here is an example of a Job that qualifies for gang scheduling:

        apiVersion: batch/v1
        kind: Job
        metadata:
          name: training-job
          namespace: job-ns
        spec:
          completionMode: Indexed
          parallelism: 4
          completions: 4
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: worker
                  image: registry.example/trainer:latest

    The Job controller creates a Workload and a PodGroup owned by this Job, and every Pod it creates carries a .spec.schedulingGroup that points at the generated PodGroup.
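The qualifying conditions listed above can be expressed as a small predicate over a Job manifest. This is a sketch that mirrors the conditions as stated in this post, not the Job controller's actual code; the function name is hypothetical, and the manifest is assumed to be already parsed into a dictionary.

```python
from typing import Any, Dict


def qualifies_for_gang_scheduling(job: Dict[str, Any]) -> bool:
    """Check the four conditions under which a Workload and PodGroup are created."""
    spec = job.get("spec", {})
    pod_spec = spec.get("template", {}).get("spec", {})
    parallelism = spec.get("parallelism", 1)          # the Job API defaults this to 1
    return (
        parallelism > 1
        and spec.get("completionMode") == "Indexed"
        and spec.get("completions") == parallelism
        and "schedulingGroup" not in pod_spec          # nobody else manages scheduling
    )


training_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "training-job", "namespace": "job-ns"},
    "spec": {
        "completionMode": "Indexed",
        "parallelism": 4,
        "completions": 4,
        "template": {"spec": {"restartPolicy": "Never"}},
    },
}

print(qualifies_for_gang_scheduling(training_job))
```

A Job that fails any one of the checks is simply scheduled Pod-by-Pod, as before.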
The Pods are then scheduled together once all four can be placed at the same time using the PodGroup scheduling cycle described earlier in this post. What’s not covered yet The current constraints limit this integration to static, indexed, fully-parallel Jobs. Support for additional workload shapes, including elastic Jobs and other built-in controllers, is tracked in KEP-5547. In future Kubernetes releases, this integration will expand to support additional workload controllers, and the current constraints for Jobs may be relaxed. What’s next? The journey for workload-aware scheduling doesn’t stop here. For v1.37, the community is actively working on: Graduating Workload and PodGroup APIs to Beta: Our primary goal is to mature the Workload and PodGroup APIs to the Beta stage, solidifying their foundational role in the Kubernetes ecosystem. As part of this graduation process, we also plan to introduce minCount mutability to unlock elastic jobs and allow dynamic workloads to scale efficiently. Multi-level Workload hierarchies: To support complex modern AI workloads like JobSet or Disaggregated Inference via LeaderWorkerSet (LWS), we are working on expanding the architecture to support multi-level hierarchies. We aim to introduce a new API that allows grouping multiple PodGroups into hierarchical structures, directly reflecting the organization of real-world workload controllers. Graduating advanced scheduling features: We are focused on driving the maturity of the broader workload-aware scheduling ecosystem. This includes bringing existing features, such as topology-aware scheduling and workload-aware preemption, to the Beta stage. Unified controller integration API: To streamline adoption, we’re working on a controller integration API. This will provide real-world workload controllers with a unified, standardized method for consuming workload-aware scheduling capabilities. The priority and implementation order of these focus areas are subject to change. 
Stay tuned for further updates.

Getting started

All of the workload-aware scheduling improvements described in this post are available as Alpha features in v1.36. To try them out, you must configure the following:

  • Prerequisite: Workload and PodGroup API support: Enable the GenericWorkload feature gate on both the kube-apiserver and kube-scheduler, and ensure the scheduling.k8s.io/v1alpha2 API group is enabled.

Once the prerequisite is met, you can enable specific features:

  • Gang scheduling: Enable the GangScheduling feature gate on the kube-scheduler.
  • Topology-aware scheduling: Enable the TopologyAwareWorkloadScheduling feature gate on the kube-scheduler.
  • Workload-aware preemption: Enable the WorkloadAwarePreemption feature gate on the kube-scheduler (requires GangScheduling to also be enabled).
  • DRA ResourceClaim support for workloads: Enable the DRAWorkloadResourceClaims feature gate on the kube-apiserver, kube-controller-manager, kube-scheduler and kubelet.
  • Workload API integration with the Job controller: Enable the WorkloadWithJob feature gate on the kube-apiserver and kube-controller-manager.

We encourage you to try out workload-aware scheduling in your test clusters and share your experiences to help shape the future of Kubernetes scheduling. You can send your feedback by:

  • Reaching out via Slack (#workload-aware-scheduling).
  • Joining the SIG Scheduling meetings.
  • Filing a new issue in the Kubernetes repository.

Learn more

To dive deeper into the architecture and design of these features, read the KEPs:

  • Workload API and gang scheduling
  • Topology-aware scheduling
  • Workload-aware preemption
  • DRA ResourceClaim support for workloads
  • Workload API support in Job controller
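Because the feature gates in the Getting started list are spread across several components, it can help to invert the list and see what each component needs. The sketch below is illustrative only: the gate-to-component mapping is copied from the list above, the helper name feature_gate_flags is hypothetical, and the output uses the standard --feature-gates=Name=true,... flag form. It does not cover enabling the scheduling.k8s.io/v1alpha2 API group, and enabling every gate at once is just one possible test-cluster setup.

```python
# Gate -> components, as stated in the "Getting started" list of this post.
FEATURE_GATES = {
    "GenericWorkload": ["kube-apiserver", "kube-scheduler"],
    "GangScheduling": ["kube-scheduler"],
    "TopologyAwareWorkloadScheduling": ["kube-scheduler"],
    "WorkloadAwarePreemption": ["kube-scheduler"],
    "DRAWorkloadResourceClaims": [
        "kube-apiserver",
        "kube-controller-manager",
        "kube-scheduler",
        "kubelet",
    ],
    "WorkloadWithJob": ["kube-apiserver", "kube-controller-manager"],
}


def feature_gate_flags(gates: dict) -> dict:
    """Invert the mapping and build one --feature-gates flag per component."""
    per_component = {}
    for gate, components in gates.items():
        for component in components:
            per_component.setdefault(component, []).append(gate)
    return {
        component: "--feature-gates=" + ",".join(f"{g}=true" for g in sorted(names))
        for component, names in per_component.items()
    }


if __name__ == "__main__":
    for component, flag in sorted(feature_gate_flags(FEATURE_GATES).items()):
        print(f"{component}: {flag}")
```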

  • Kubernetes v1.36: PSI Metrics for Kubernetes Graduates to GA
    on May 12, 2026 at 6:35 pm

    Since its original implementation in the Linux kernel in 2018, Pressure Stall Information (PSI) has provided users with the high-fidelity signals needed to identify resource saturation before it becomes an outage. Unlike traditional utilization metrics, PSI tells the story of tasks stalled and time lost, all in nicely packaged percentages of time across CPU, memory, and I/O. With the recent release of Kubernetes v1.36, users across the ecosystem have a stable, reliable interface to observe resource contention at the node, pod, and container levels. In this post, we will dive into the improvements and performance testing that proved its readiness for production.

    Beyond utilization: why PSI?

    Monitoring CPU or memory usage alone can be misleading. A node may report CPU utilization below 100% while certain tasks are experiencing severe latency due to scheduling delays. PSI fills this gap by providing:

      • Cumulative Totals: Absolute time spent in a stalled state.
      • Moving Averages: 10s, 60s, and 300s windows that allow operators to distinguish between transient spikes and sustained resource tension.

    Proving stability: performance testing at scale

    A common concern when graduating telemetry features is the resource overhead required to collect and serve the metrics. To address this, SIG Node conducted extensive performance validation on high-density workloads (80+ pods) across various machine types. Our testing focused on two primary scenarios to isolate the impact of the Kubelet and kernel-level collection respectively:

      • Kernel PSI ON / Kubelet Feature OFF vs Kernel PSI ON / Kubelet Feature ON (Kubelet overhead)
      • Kernel PSI OFF / Kubelet Feature ON vs Kernel PSI ON / Kubelet Feature ON (Kernel overhead)

    Scenario 1: The Kubelet Overhead

    First, we looked at the kubelet usage on 4-core machines (Case 1).
For these, the Linux kernel was already tracking pressure on both clusters by default (psi=1), but we toggled the KubeletPSI feature gate to see whether the Kubelet actively querying and exposing these metrics impacted resource usage. The synchronized bursts seen in the graph are practically identical in both magnitude and frequency, confirming that the Kubelet’s collection logic is lightweight and blends into standard housekeeping cycles. The feature does not measurably change pre-existing resource use: Kubelet CPU usage stays within the normal 0.1 cores, or 2.5% of the total node capacity, so the feature is safe for production-scale deployments.

Figure 2: (Case 1) Kubelet CPU Usage Rate Comparison.

Next, we evaluated the system overhead in the same run. As seen in the following graph, the System CPU usage line for the Kubelet PSI-enabled cluster (red) follows the same pattern as the line for the Kubelet PSI-disabled cluster (blue), with a slight expected increase from the baseline. This shows that once the OS is tracking PSI, at around 2.5 cores, the act of Kubernetes reading those cgroup metrics has a negligible effect on performance.

Figure 1: (Case 1) Node System CPU Usage Rate Comparison.

Scenario 2: The Kernel Overhead

Shifting gears, we evaluated the underlying overhead of enabling PSI in the Linux kernel, also on a 4-core machine. By comparing a cluster booted with psi=1 (COS default) against a cluster with psi=0, we isolated the exact cost of the OS-level bookkeeping. Even under heavy I/O and CPU load at an 80-pod density, the System CPU delta between the kernel-enabled and kernel-disabled clusters remained consistently between 0.037 cores and 0.125 cores, or 0.925% – 3.125% of the total node capacity. There was a single spike to 0.225 cores, or 5.6%, but usage came back down within a few seconds. This confirms that the internal kernel tracking is highly efficient under load.
Figure 3: (Case 2) Node System CPU Usage Rate Comparison.

Figure 4 zooms in on the kubelet process itself, which serves as the primary collector for these metrics. The results show that even while the kubelet performs periodic sweeps to aggregate data from the cgroup hierarchy, its CPU usage remains low, with interchangeable spikes and nothing exceeding 0.25 cores, or 6.25% of total capacity, for longer than a second.

Figure 4: (Case 2) Kubelet CPU Usage Rate Comparison.

Improvements between beta (1.34) and stable (1.36)

Smarter Metric Emission for GA: We improved how the Kubelet handles underlying OS support for PSI. Previously, if the feature was enabled in Kubernetes but the underlying Linux kernel didn’t support PSI (psi=0), the Kubelet would emit misleading zero-valued metrics. These could trigger false alarms when read as real metrics instead of missing values. In v1.36, the Kubelet detects OS-level PSI support via cgroup configurations before reporting. This ensures that pressure metrics are only collected and emitted when they are actually supported by the node, providing cleaner data for monitoring and alerting systems.

Getting started

To use PSI metrics in your Kubernetes cluster, your nodes must meet the following requirements:

  • Ensure your nodes are running a Linux kernel version 4.20 or later and are using cgroup v2.
  • Ensure PSI is enabled at the OS level (your kernel must be compiled with CONFIG_PSI=y and must not be booted with the psi=0 parameter).

As of v1.36, Kubelet PSI metrics are generally available and you do not need to opt in to any feature gate. Once the OS prerequisites are met, you can start scraping the /metrics/cadvisor endpoint with your Prometheus-compatible monitoring solution or query the Summary API to collect and visualize the new PSI metrics. Note that PSI is a Linux-kernel feature, so these metrics are not available on Windows nodes.
Your cluster can contain a mix of Linux and Windows nodes, and on the Windows nodes, the kubelet will simply omit the PSI metrics.

If your cluster is running a recent enough version of Kubernetes and you are a privileged node administrator, you can also proxy to the kubelet’s HTTP API via the control plane’s API server to see real-time pressure data from the Summary API.

Caution: Proxying to the kubelet is a privileged operation. Granting access to it is a security risk, so ensure you have the appropriate administrative permissions before executing these commands.

CONTAINER_NAME="example-container"
kubectl get --raw "/api/v1/nodes/$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')/proxy/stats/summary" | jq '.pods[].containers[] | select(.name=="'"$CONTAINER_NAME"'") | {name, cpu: .cpu.psi, memory: .memory.psi, io: .io.psi}'

Further reading

If you want to dive deeper into how these metrics are calculated and exposed, check out these resources:

  • The official Kernel documentation
  • Understanding PSI in the Kubernetes documentation
  • cAdvisor Metrics Implementation

Acknowledgements

Support for PSI metrics was developed through the collaborative efforts of SIG Node. Special thanks to all contributors who helped design, implement, test, review, and document this feature across its journey from alpha in v1.33, through beta in v1.34, to GA in v1.36. To provide feedback on this feature, join the Kubernetes Node Special Interest Group, participate in discussions on the public Slack channel (#sig-node), or file an issue on GitHub.

Feedback

If you have feedback and want to share your experience using this feature, join the discussion:

  • SIG Node community page
  • Kubernetes Slack in the #sig-node channel
  • SIG Node mailing list

SIG Node would love to hear about your experiences using this feature in production!
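To make the cumulative totals and moving averages described under "Beyond utilization: why PSI?" more concrete, here is a minimal Python sketch that parses the text format the Linux kernel uses for pressure files such as /proc/pressure/cpu or a cgroup v2 cpu.pressure file. This is the kernel's raw format, not the Kubelet's Summary API output. The sample values are invented for illustration, and the is_sustained helper with its 10% threshold is a hypothetical rule of thumb, not something defined by Kubernetes or the kernel.

```python
# Invented sample in the kernel's PSI text format; "total" is in microseconds.
SAMPLE = (
    "some avg10=12.50 avg60=3.10 avg300=0.80 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)


def parse_psi(text: str) -> dict:
    """Parse PSI lines into {"some": {...}, "full": {...}} dictionaries."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = dict(field.split("=", 1) for field in fields)
        result[kind] = {
            "avg10": float(values["avg10"]),    # % of time stalled, 10s window
            "avg60": float(values["avg60"]),    # % of time stalled, 60s window
            "avg300": float(values["avg300"]),  # % of time stalled, 300s window
            "total_us": int(values["total"]),   # cumulative stall time
        }
    return result


def is_sustained(entry: dict, threshold: float = 10.0) -> bool:
    """Hypothetical rule: pressure is sustained if every window is above threshold."""
    return min(entry["avg10"], entry["avg60"], entry["avg300"]) >= threshold


if __name__ == "__main__":
    psi = parse_psi(SAMPLE)
    print(psi["some"], "sustained:", is_sustained(psi["some"]))
```

In the sample, the 10s average is elevated while the 60s and 300s averages are low, which is the shape of a transient spike rather than sustained pressure.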
