Kubernetes Blog

The Kubernetes blog is used by the project to communicate new features, community reports, and any news that might be relevant to the Kubernetes community.
-
The Cloud Controller Manager Chicken and Egg Problem
on February 14, 2025 at 12:00 am
Kubernetes 1.31 completed the largest migration in Kubernetes history: removing the in-tree cloud providers. While the component migration is now done, it leaves some additional complexity for users and installer projects (for example, kOps or Cluster API). We will go over those additional steps and failure points and make recommendations for cluster owners. This migration was complex, and some logic had to be extracted from the core components into four new subsystems: the cloud controller manager (KEP-2392), the API server network proxy (KEP-1281), kubelet credential provider plugins (KEP-2133), and storage migration to CSI (KEP-625).

The cloud controller manager is part of the control plane. It is a critical component that replaces some functionality that previously existed in the kube-controller-manager and the kubelet.

(Figure: Components of Kubernetes)

One of the most critical functions of the cloud controller manager is the node controller, which is responsible for initializing nodes. As the following diagram shows, when the kubelet starts, it registers the Node object with the API server and taints the node so it can be processed first by the cloud-controller-manager. The initial Node is missing cloud-provider-specific information: the node addresses and the labels carrying cloud-provider-specific details such as the region and instance type.

(Figure: Chicken and egg problem sequence diagram)

This new initialization process adds some latency to node readiness. Previously, the kubelet was able to initialize the node at the same time it created the node. Since the logic has moved to the cloud-controller-manager, this can cause a chicken-and-egg problem during cluster bootstrapping for Kubernetes architectures that do not deploy the cloud-controller-manager in the same way as the other control plane components, commonly as static pods, standalone binaries, or daemonsets/deployments with tolerations for the taints and with hostNetwork enabled (more on this below).

Examples of the dependency problem

As noted above, it is possible during bootstrapping for the cloud-controller-manager to be unschedulable and, as a result, the cluster will not initialize properly. The following are a few concrete examples of how this problem can manifest and the root causes for why they might occur. These examples assume you are running your cloud-controller-manager using a Kubernetes resource (e.g. Deployment, DaemonSet, or similar) to control its lifecycle. Because these methods rely on Kubernetes to schedule the cloud-controller-manager, care must be taken to ensure it will schedule properly.

Example: Cloud controller manager not scheduling due to uninitialized taint

As noted in the Kubernetes documentation, when the kubelet is started with the command line flag --cloud-provider=external, its corresponding Node object will have a NoSchedule taint named node.cloudprovider.kubernetes.io/uninitialized added. Because the cloud-controller-manager is responsible for removing that taint, this can create a situation where a cloud-controller-manager that is being managed by a Kubernetes resource, such as a Deployment or DaemonSet, may not be able to schedule. If the cloud-controller-manager cannot be scheduled during the initialization of the control plane, then the resulting Node objects will all have the node.cloudprovider.kubernetes.io/uninitialized NoSchedule taint. It also means that this taint will not be removed, as the cloud-controller-manager is responsible for its removal. If the NoSchedule taint is not removed, then critical workloads, such as the container network interface controllers, will not be able to schedule, and the cluster will be left in an unhealthy state.
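For reference, this is roughly what such an uninitialized Node looks like; a minimal sketch using the taint key and effect described above (the node name is hypothetical, and the taint value shown is an assumption; the key and NoSchedule effect are what matter here):

apiVersion: v1
kind: Node
metadata:
  name: example-node                                  # hypothetical node name
spec:
  taints:
  - key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"                                     # value assumed for illustration
    effect: NoSchedule

Until the cloud-controller-manager removes this taint, only pods that explicitly tolerate it can be scheduled onto the node.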
Example: Cloud controller manager not scheduling due to not-ready taint

The next example is possible in situations where the container network interface (CNI) is waiting for IP address information from the cloud-controller-manager (CCM), and the CCM has not tolerated the taint which would be removed by the CNI. The Kubernetes documentation describes the node.kubernetes.io/not-ready taint as follows: "The Node controller detects whether a Node is ready by monitoring its health and adds or removes this taint accordingly." One of the conditions that can lead to a Node resource having this taint is when the container network has not yet been initialized on that node. Because the cloud-controller-manager is responsible for adding the IP addresses to a Node resource, and those IP addresses are needed by the container network controllers to properly configure the container network, it is possible in some circumstances for a node to become permanently stuck as not ready and uninitialized.

This situation occurs for a similar reason as the first example, although in this case the node.kubernetes.io/not-ready taint is used with the NoExecute effect and thus will cause the cloud-controller-manager not to run on the node with the taint. If the cloud-controller-manager cannot execute, it will not initialize the node. This cascades into the container network controllers not being able to run properly, and the node ends up carrying both the node.cloudprovider.kubernetes.io/uninitialized and node.kubernetes.io/not-ready taints, leaving the cluster in an unhealthy state.

Our Recommendations

There is no one "correct way" to run a cloud-controller-manager. The details will depend on the specific needs of the cluster administrators and users. When planning your clusters and the lifecycle of the cloud-controller-managers, please consider the following guidance.

For cloud-controller-managers running in the same cluster they are managing:

Use host network mode, rather than the pod network. In most cases, a cloud controller manager will need to communicate with an API service endpoint associated with the infrastructure. Setting "hostNetwork" to true will ensure that the cloud controller is using the host networking instead of the container network and, as such, will have the same network access as the host operating system. It will also remove the dependency on the networking plugin. This will ensure that the cloud controller has access to the infrastructure endpoint (always check your networking configuration against your infrastructure provider's instructions).

Use a scalable resource type. Deployments and DaemonSets are useful for controlling the lifecycle of a cloud controller. They allow easy access to running multiple copies for redundancy as well as using Kubernetes scheduling to ensure proper placement in the cluster. When using these primitives to control the lifecycle of your cloud controllers and running multiple replicas, you must remember to enable leader election, or else your controllers will collide with each other, which could lead to nodes not being initialized in the cluster.

Target the controller manager containers to the control plane.
There might exist other controllers which need to run outside the control plane (for example, Azure's node manager controller). Still, the controller managers themselves should be deployed to the control plane. Use a node selector or affinity stanza to direct the scheduling of cloud controllers to the control plane, to ensure that they are running in a protected space. Cloud controllers are vital to adding and removing nodes in a cluster, as they form a link between Kubernetes and the physical infrastructure. Running them on the control plane helps ensure that they run with a similar priority as other core cluster controllers and that they have some separation from non-privileged user workloads. It is worth noting that an anti-affinity stanza preventing cloud controllers from running on the same host is also very useful to ensure that a single node failure does not degrade cloud controller performance.

Ensure that the tolerations allow operation. Use tolerations on the manifest for the cloud controller container to ensure that it will schedule to the correct nodes and that it can run in situations where a node is initializing. This means that cloud controllers should tolerate the node.cloudprovider.kubernetes.io/uninitialized taint, and they should also tolerate any taints associated with the control plane (for example, node-role.kubernetes.io/control-plane or node-role.kubernetes.io/master). It can also be useful to tolerate the node.kubernetes.io/not-ready taint to ensure that the cloud controller can run even when the node is not yet available for health monitoring.

For cloud-controller-managers that will not be running on the cluster they manage (for example, in a hosted control plane on a separate cluster), the rules are much more constrained by the dependencies of the environment of the cluster running the cloud-controller-manager. The advice for running on a self-managed cluster may not be appropriate, as the types of conflicts and network constraints will be different. Please consult the architecture and requirements of your topology for these scenarios.

Example

This is an example of a Kubernetes Deployment highlighting the guidance shown above. It is important to note that this is for demonstration purposes only; for production uses, please consult your cloud provider's documentation.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: cloud-controller-manager
  name: cloud-controller-manager
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: cloud-controller-manager
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: cloud-controller-manager
      annotations:
        kubernetes.io/description: Cloud controller manager for my infrastructure
    spec:
      containers: # the container details will depend on your specific cloud controller manager
      - name: cloud-controller-manager
        command:
        - /bin/my-infrastructure-cloud-controller-manager
        - --leader-elect=true
        - -v=1
        image: registry/my-infrastructure-cloud-controller-manager@latest
        resources:
          requests:
            cpu: 200m
            memory: 50Mi
      hostNetwork: true
      # these Pods are part of the control plane
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: "kubernetes.io/hostname"
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: cloud-controller-manager
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 120
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 120
      - effect: NoSchedule
        key: node.cloudprovider.kubernetes.io/uninitialized
        operator: Exists
      - effect: NoSchedule
        key: node.kubernetes.io/not-ready
        operator: Exists

When deciding how to deploy your cloud controller manager, it is worth noting that cluster-proportional or resource-based pod autoscaling is not recommended. Running multiple replicas of a cloud controller manager is good practice for ensuring high availability and redundancy, but it does not contribute to better performance. In general, only a single instance of a cloud controller manager will be reconciling a cluster at any given time.
-
Spotlight on SIG Architecture: Enhancements
on January 21, 2025 at 12:00 am
This is the fourth interview of a SIG Architecture Spotlight series that covers its different subprojects, and this time we cover SIG Architecture: Enhancements. In this SIG Architecture spotlight we talked with Kirsten Garrison, lead of the Enhancements subproject.

The Enhancements subproject

Frederico (FSM): Hi Kirsten, very happy to have the opportunity to talk about the Enhancements subproject. Let's start with some quick information about yourself and your role.

Kirsten Garrison (KG): I'm a lead of the Enhancements subproject of SIG-Architecture and currently work at Google. I first got involved by contributing to the service-catalog project with the help of Carolyn Van Slyck. With time, I joined the Release team, eventually becoming the Enhancements Lead and a Release Lead shadow. While on the release team, I worked on some ideas to make the process better for the SIGs and Enhancements team (the opt-in process) based on my team's experiences. Eventually, I started attending Subproject meetings and contributing to the Subproject's work.

FSM: You mentioned the Enhancements subproject: how would you describe its main goals and areas of intervention?

KG: The Enhancements Subproject primarily concerns itself with the Kubernetes Enhancement Proposal (KEP for short): the "design" documents required for all features and significant changes to the Kubernetes project.

The KEP and its impact

FSM: The improvement of the KEP process was (and is) one in which SIG Architecture was heavily involved. Could you explain the process to those that aren't aware of it?

KG: Every release, the SIGs let the Release Team know which features they intend to work on to be put into the release. As mentioned above, the prerequisite for these changes is a KEP, a standardized design document that all authors must fill out and approve in the first weeks of the release cycle. Most features will move through three phases: alpha, beta and finally GA, so approving a feature represents a significant commitment for the SIG. The KEP serves as the full source of truth of a feature. The KEP template has different requirements based on what stage a feature is in, but it generally requires a detailed discussion of the design and the impact as well as providing artifacts of stability and performance. The KEP takes quite a bit of iterative work between authors, SIG reviewers, the API review team and the Production Readiness Review team [1] before it is approved. Each set of reviewers is looking to make sure that the proposal meets their standards in order to have a stable and performant Kubernetes release. Only after all approvals are secured can an author go forth and merge their feature into the Kubernetes code base.

FSM: I see, quite a bit of additional structure was added. Looking back, what were the most significant improvements of that approach?

KG: In general, I think that the improvements with the most impact had to do with focusing on the core intent of the KEP. KEPs exist not just to memorialize designs, but to provide a structured way to discuss and come to an agreement about different facets of the change. At the core of the KEP process is communication and consideration. To that end, some of the significant changes revolve around a more detailed and accessible KEP template. A significant amount of work was put in over time to get the k/enhancements repo into its current form: a directory structure organized by SIG with the contours of the modern KEP template (with Proposal/Motivation/Design Details subsections).
We might take that basic structure for granted today, but it really represents the work of many people trying to get the foundation of this process in place over time. As Kubernetes matures, we've needed to think about more than just the end goal of getting a single feature merged. We need to think about things like stability, performance, and setting and meeting user expectations. And as we've thought about those things, the template has grown more detailed. The addition of the Production Readiness Review was major, as were the enhanced testing requirements (varying at different stages of a KEP's lifecycle).

Current areas of focus

FSM: Speaking of maturing, we've recently released Kubernetes v1.31, and work on v1.32 has started. Are there any areas that the Enhancements subproject is currently addressing that might change the way things are done?

KG: We're currently working on two things:

Creating a Process KEP template. Sometimes people want to harness the KEP process for significant changes that are more process oriented rather than feature oriented. We want to support this because memorializing changes is important, and giving people a better tool to do so will only encourage more discussion and transparency.

KEP versioning. While our template changes aim to be as non-disruptive as possible, we believe that it will be easier to track and communicate those changes to the community with a versioned KEP template and the policies that go alongside such versioning.

Both features will take some time to get right and fully roll out (just like a KEP feature), but we believe that they will both provide improvements that will benefit the community at large.

FSM: You mentioned improvements: I remember when project boards for Enhancement tracking were introduced in recent releases, to great effect and unanimous applause from release team members. Was this a particular area of focus for the subproject?

KG: The Subproject provided support to the Release Team's Enhancements team in the migration away from using the spreadsheet to a project board. The collection and tracking of enhancements has always been a logistical challenge. During my time on the Release Team, I helped with the transition to an opt-in system of enhancements, whereby the SIG leads "opt in" KEPs for release tracking. This helped to enhance communication between authors and SIGs before any significant work was undertaken on a KEP, and removed toil from the Enhancements team. This change used the existing tools to avoid introducing too many changes at once to the community. Later, the Release Team approached the Subproject with an idea of leveraging GitHub Project Boards to further improve the collection process. This was a move away from the use of complicated spreadsheets to using repo-native labels on k/enhancements issues and project boards.

FSM: That surely has an impact on simplifying the workflow…

KG: Removing sources of friction and promoting clear communication is very important to the Enhancements Subproject. At the same time, it's important to give careful consideration to decisions that impact the community as a whole. We want to make sure that changes are balanced to give an upside while not causing any regressions and pain in the rollout. We supported the Release Team in ideation as well as through the actual migration to the project boards. It was a great success and exciting to see the team make high-impact changes that helped everyone involved in the KEP process!
Getting involved

FSM: For those reading that might be curious and interested in helping, how would you describe the required skills for participating in the subproject?

KG: Familiarity with KEPs, either via experience or by taking time to look through the kubernetes/enhancements repo, is helpful. All are welcome to participate if interested; we can take it from there.

FSM: Excellent! Many thanks for your time and insight. Any final comments you would like to share with our readers?

KG: The Enhancements process is one of the most important parts of Kubernetes and requires enormous amounts of coordination and collaboration of people and teams across the project to make it successful. I'm thankful and inspired by everyone's continued hard work and dedication to making the project great. This is truly a wonderful community.

[1] For more information, check the Production Readiness Review spotlight interview in this series.
-
Kubernetes 1.32: Moving Volume Group Snapshots to Beta
on December 18, 2024 at 12:00 am
Volume group snapshots were introduced as an alpha feature with the Kubernetes 1.27 release. The recent release of Kubernetes v1.32 moved that support to beta. The support for volume group snapshots relies on a set of extension APIs for group snapshots. These APIs allow users to take crash-consistent snapshots of a set of volumes. Behind the scenes, Kubernetes uses a label selector to group multiple PersistentVolumeClaims for snapshotting. A key aim is to allow you to restore that set of snapshots to new volumes and recover your workload based on a crash-consistent recovery point. This new feature is only supported for CSI volume drivers.

An overview of volume group snapshots

Some storage systems provide the ability to create a crash-consistent snapshot of multiple volumes. A group snapshot represents copies made from multiple volumes that are taken at the same point in time. A group snapshot can be used either to rehydrate new volumes (pre-populated with the snapshot data) or to restore existing volumes to a previous state (represented by the snapshots).

Why add volume group snapshots to Kubernetes?

The Kubernetes volume plugin system already provides a powerful abstraction that automates the provisioning, attaching, mounting, resizing, and snapshotting of block and file storage. Underpinning all these features is the Kubernetes goal of workload portability: Kubernetes aims to create an abstraction layer between distributed applications and underlying clusters so that applications can be agnostic to the specifics of the cluster they run on and application deployment requires no cluster-specific knowledge.

There was already a VolumeSnapshot API that provides the ability to take a snapshot of a persistent volume to protect against data loss or data corruption. However, there are other snapshotting functionalities not covered by the VolumeSnapshot API. Some storage systems support consistent group snapshots that allow a snapshot to be taken from multiple volumes at the same point in time to achieve write-order consistency. This can be useful for applications that contain multiple volumes. For example, an application may have data stored in one volume and logs stored in another volume. If snapshots for the data volume and the logs volume are taken at different times, the application will not be consistent and will not function properly if it is restored from those snapshots when a disaster strikes.

It is true that you can quiesce the application first, take an individual snapshot from each volume that is part of the application one after the other, and then unquiesce the application after all the individual snapshots are taken. This way, you would get application-consistent snapshots. However, sometimes the application quiesce can be so time consuming that you want to do it less frequently, or it may not be possible to quiesce an application at all. For example, a user may want to run weekly backups with application quiesce and nightly backups without application quiesce but with consistent group support, which provides crash consistency across all volumes in the group.

Kubernetes APIs for volume group snapshots

Kubernetes' support for volume group snapshots relies on three API kinds that are used for managing snapshots:

VolumeGroupSnapshot: Created by a Kubernetes user (or perhaps by your own automation) to request creation of a volume group snapshot for multiple persistent volume claims.
It contains information about the volume group snapshot operation, such as the timestamp when the volume group snapshot was taken and whether it is ready to use. The creation and deletion of this object represents a desire to create or delete a cluster resource (a group snapshot).

VolumeGroupSnapshotContent: Created by the snapshot controller for a dynamically created VolumeGroupSnapshot. It contains information about the volume group snapshot, including the volume group snapshot ID. This object represents a provisioned resource on the cluster (a group snapshot). The VolumeGroupSnapshotContent object binds to the VolumeGroupSnapshot for which it was created with a one-to-one mapping.

VolumeGroupSnapshotClass: Created by cluster administrators to describe how volume group snapshots should be created, including the driver information, the deletion policy, etc.

These three API kinds are defined as CustomResourceDefinitions (CRDs). These CRDs must be installed in a Kubernetes cluster for a CSI driver to support volume group snapshots.

What components are needed to support volume group snapshots

Volume group snapshots are implemented in the external-snapshotter repository. Implementing volume group snapshots meant adding or changing several components: new CustomResourceDefinitions for VolumeGroupSnapshot and two supporting APIs; volume group snapshot controller logic added to the common snapshot controller; and logic to make CSI calls added to the snapshotter sidecar controller.

The volume snapshot controller and CRDs are deployed once per cluster, while the sidecar is bundled with each CSI driver. Therefore, it makes sense to deploy the volume snapshot controller and CRDs as a cluster addon. The Kubernetes project recommends that Kubernetes distributors bundle and deploy the volume snapshot controller and CRDs as part of their Kubernetes cluster management process (independent of any CSI driver).

What's new in Beta?

The VolumeGroupSnapshot feature in the CSI spec moved to GA in the v1.11.0 release.

The snapshot validation webhook was deprecated in external-snapshotter v8.0.0 and has now been removed. Most of the validation webhook logic was added as validation rules in the CRDs; the minimum required Kubernetes version for these validation rules is 1.25. One part of the validation webhook not moved to the CRDs is the prevention of creating multiple default volume snapshot classes and multiple default volume group snapshot classes for the same CSI driver. With the removal of the validation webhook, an error will still be raised when dynamically provisioning a VolumeSnapshot or VolumeGroupSnapshot when multiple default volume snapshot classes or multiple default volume group snapshot classes exist for the same CSI driver.

The enable-volumegroup-snapshot flag in the snapshot-controller and the CSI snapshotter sidecar has been replaced by a feature gate. Since VolumeGroupSnapshot is a new API, the feature moves to beta but the feature gate is disabled by default. To use this feature, enable the feature gate by adding the flag --feature-gates=CSIVolumeGroupSnapshot=true when starting the snapshot-controller and the CSI snapshotter sidecar (a hedged example appears at the end of this section).

The logic to dynamically create the VolumeGroupSnapshot and its corresponding individual VolumeSnapshot and VolumeSnapshotContent objects has moved from the CSI snapshotter to the common snapshot-controller. New RBAC rules are added to the common snapshot-controller and some RBAC rules are removed from the CSI snapshotter sidecar accordingly.
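For illustration, here is a minimal, hedged sketch of how that feature gate might be passed to the snapshot-controller container in its Deployment manifest; the image tag and the other arguments shown are assumptions and will differ in your deployment:

# Fragment of a hypothetical snapshot-controller Deployment pod template
containers:
- name: snapshot-controller
  image: registry.k8s.io/sig-storage/snapshot-controller:v8.2.0   # tag assumed for illustration
  args:
  - --v=5
  - --leader-election=true
  - --feature-gates=CSIVolumeGroupSnapshot=true   # enables the beta volume group snapshot support

As noted above, the same --feature-gates=CSIVolumeGroupSnapshot=true flag also needs to be passed to the CSI snapshotter sidecar that ships with your CSI driver.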
How do I use Kubernetes volume group snapshots

Creating a new group snapshot with Kubernetes

Once a VolumeGroupSnapshotClass object is defined and you have volumes you want to snapshot together, you may request a new group snapshot by creating a VolumeGroupSnapshot object. The source of the group snapshot specifies whether the underlying group snapshot should be dynamically created or whether a pre-existing VolumeGroupSnapshotContent should be used.

A pre-existing VolumeGroupSnapshotContent is created by a cluster administrator. It contains the details of the real volume group snapshot on the storage system which is available for use by cluster users.

One of the following members in the source of the group snapshot must be set:

selector: a label query over PersistentVolumeClaims that are to be grouped together for snapshotting. This selector will be used to match the label added to a PVC.

volumeGroupSnapshotContentName: specifies the name of a pre-existing VolumeGroupSnapshotContent object representing an existing volume group snapshot.

Dynamically provision a group snapshot

In the following example, there are two PVCs:

NAME    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      VOLUMEATTRIBUTESCLASS   AGE
pvc-0   Bound    pvc-6e1f7d34-a5c5-4548-b104-01e72c72b9f2   100Mi      RWO            csi-hostpath-sc   <unset>                 2m15s
pvc-1   Bound    pvc-abc640b3-2cc1-4c56-ad0c-4f0f0e636efa   100Mi      RWO            csi-hostpath-sc   <unset>                 2m7s

Label the PVCs:

% kubectl label pvc pvc-0 group=myGroup
persistentvolumeclaim/pvc-0 labeled

% kubectl label pvc pvc-1 group=myGroup
persistentvolumeclaim/pvc-1 labeled

For dynamic provisioning, a selector must be set so that the snapshot controller can find PVCs with the matching labels to be snapshotted together:

apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshot
metadata:
  name: snapshot-daily-20241217
  namespace: demo-namespace
spec:
  volumeGroupSnapshotClassName: csi-groupSnapclass
  source:
    selector:
      matchLabels:
        group: myGroup

In the VolumeGroupSnapshot spec, a user can specify the VolumeGroupSnapshotClass, which has the information about which CSI driver should be used for creating the group snapshot. A VolumeGroupSnapshotClass is required for dynamic provisioning:

apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshotClass
metadata:
  name: csi-groupSnapclass
  annotations:
    kubernetes.io/description: "Example group snapshot class"
driver: example.csi.k8s.io
deletionPolicy: Delete

As a result of the volume group snapshot creation, a corresponding VolumeGroupSnapshotContent object will be created with a volumeGroupSnapshotHandle pointing to a resource on the storage system. Two individual volume snapshots will be created as part of the volume group snapshot creation:

NAME                                                                        READYTOUSE   SOURCEPVC   RESTORESIZE   SNAPSHOTCONTENT                                                                 AGE
snapshot-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0   true         pvc-0       100Mi         snapcontent-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0   16m
snapshot-da577d76bd2106c410616b346b2e72440f6ec7b12a75156263b989192b78caff   true         pvc-1       100Mi         snapcontent-da577d76bd2106c410616b346b2e72440f6ec7b12a75156263b989192b78caff   16m

Importing an existing group snapshot with Kubernetes

To import a pre-existing volume group snapshot into Kubernetes, you must also import the corresponding individual volume snapshots. Identify the individual volume snapshot handles, manually construct a VolumeSnapshotContent object first, then create a VolumeSnapshot object pointing to the VolumeSnapshotContent object. Repeat this for every individual volume snapshot.
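As a rough illustration of that per-snapshot step, here is a hedged sketch of a pre-provisioned VolumeSnapshotContent and VolumeSnapshot pair for one of the individual snapshot handles; the object names are made up for this example:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: static-snapshot-content-0                          # hypothetical name
spec:
  deletionPolicy: Delete
  driver: hostpath.csi.k8s.io
  source:
    snapshotHandle: e8779147-a93e-11ef-9549-66940726f2fd   # an individual handle on the storage system
  volumeSnapshotRef:
    name: static-snapshot-0
    namespace: demo-namespace
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: static-snapshot-0                                  # hypothetical name
  namespace: demo-namespace
spec:
  source:
    volumeSnapshotContentName: static-snapshot-content-0

Once each individual snapshot is imported this way, the group-level objects shown below can reference the same handles.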
Then manually create a VolumeGroupSnapshotContent object, specifying the volumeGroupSnapshotHandle and individual volumeSnapshotHandles already existing on the storage system:

apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshotContent
metadata:
  name: static-group-content
spec:
  deletionPolicy: Delete
  driver: hostpath.csi.k8s.io
  source:
    groupSnapshotHandles:
      volumeGroupSnapshotHandle: e8779136-a93e-11ef-9549-66940726f2fd
      volumeSnapshotHandles:
      - e8779147-a93e-11ef-9549-66940726f2fd
      - e8783cd0-a93e-11ef-9549-66940726f2fd
  volumeGroupSnapshotRef:
    name: static-group-snapshot
    namespace: demo-namespace

After that, create a VolumeGroupSnapshot object pointing to the VolumeGroupSnapshotContent object:

apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshot
metadata:
  name: static-group-snapshot
  namespace: demo-namespace
spec:
  source:
    volumeGroupSnapshotContentName: static-group-content

How to use group snapshot for restore in Kubernetes

At restore time, the user can request a new PersistentVolumeClaim to be created from a VolumeSnapshot object that is part of a VolumeGroupSnapshot. This will trigger provisioning of a new volume that is pre-populated with data from the specified snapshot. The user should repeat this until all volumes are created from all the snapshots that are part of a group snapshot.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: examplepvc-restored-2024-12-17
  namespace: demo-namespace
spec:
  storageClassName: example-foo-nearline
  dataSource:
    name: snapshot-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOncePod
  resources:
    requests:
      storage: 100Mi # must be enough storage to fit the existing snapshot

As a storage vendor, how do I add support for group snapshots to my CSI driver?

To implement the volume group snapshot feature, a CSI driver must:

Implement a new group controller service.

Implement group controller RPCs: CreateVolumeGroupSnapshot, DeleteVolumeGroupSnapshot, and GetVolumeGroupSnapshot.

Add the group controller capability CREATE_DELETE_GET_VOLUME_GROUP_SNAPSHOT.

See the CSI spec and the Kubernetes-CSI Driver Developer Guide for more details.

As mentioned earlier, it is strongly recommended that Kubernetes distributors bundle and deploy the volume snapshot controller and CRDs as part of their Kubernetes cluster management process (independent of any CSI driver). As part of this recommended deployment process, the Kubernetes team provides a number of sidecar (helper) containers, including the external-snapshotter sidecar container, which has been updated to support volume group snapshots. The external-snapshotter watches the Kubernetes API server for VolumeGroupSnapshotContent objects, and triggers CreateVolumeGroupSnapshot and DeleteVolumeGroupSnapshot operations against a CSI endpoint.

What are the limitations?

The beta implementation of volume group snapshots for Kubernetes has the following limitations:

It does not support reverting an existing PVC to an earlier state represented by a snapshot (it only supports provisioning a new volume from a snapshot).

It offers no application consistency guarantees beyond any guarantees provided by the storage system (e.g. crash consistency). See this doc for more discussion on application consistency.

What's next?

Depending on feedback and adoption, the Kubernetes project plans to push the volume group snapshot implementation to general availability (GA) in a future release.

How can I learn more?
The design spec for the volume group snapshot feature.

The code repository for volume group snapshot APIs and controller.

CSI documentation on the group snapshot feature.

How do I get involved?

This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to the contributors who stepped up these last few quarters to help the project reach beta:

Ben Swartzlander (bswartz)
Cici Huang (cici37)
Hemant Kumar (gnufied)
James Defelice (jdef)
Jan Šafránek (jsafrane)
Madhu Rajanna (Madhu-1)
Manish M Yathnalli (manishym)
Michelle Au (msau42)
Niels de Vos (nixpanic)
Leonardo Cecchi (leonardoce)
Rakshith R (Rakshith-R)
Raunak Shah (RaunakShah)
Saad Ali (saad-ali)
Xing Yang (xing-yang)
Yati Padia (yati1998)

For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors. We also hold regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.
-
Enhancing Kubernetes API Server Efficiency with API Streaming
on December 17, 2024 at 12:00 am
Managing Kubernetes clusters efficiently is critical, especially as they grow in size. A significant challenge with large clusters is the memory overhead caused by list requests.

In the existing implementation, the kube-apiserver processes list requests by assembling the entire response in memory before transmitting any data to the client. But what if the response body is substantial, say hundreds of megabytes? Additionally, imagine a scenario where multiple list requests flood in simultaneously, perhaps after a brief network outage. While API Priority and Fairness has proven to reasonably protect the kube-apiserver from CPU overload, its impact is visibly smaller for memory protection. This can be explained by the differing nature of resource consumption by a single API request: the CPU usage at any given time is capped by a constant, whereas memory, being incompressible, can grow proportionally with the number of processed objects and is unbounded. This situation poses a genuine risk, potentially overwhelming and crashing any kube-apiserver within seconds due to out-of-memory (OOM) conditions.

To better visualize the issue, consider the graph below, which shows the memory usage of a kube-apiserver during a synthetic test (see the synthetic test section for more details). The results clearly show that increasing the number of informers significantly boosts the server's memory consumption. Notably, at approximately 16:40, the server crashed when serving only 16 informers.

Why does kube-apiserver allocate so much memory for list requests?

Our investigation revealed that this substantial memory allocation occurs because, before sending the first byte to the client, the server must fetch data from the database, deserialize the data from its stored format, and finally construct the response by converting and serializing the data into the client-requested format. This sequence results in significant temporary memory consumption. The actual usage depends on many factors, such as the page size, applied filters (e.g. label selectors), query parameters, and the sizes of individual objects. Unfortunately, neither API Priority and Fairness nor Golang's garbage collection or Golang memory limits can prevent the system from exhausting memory under these conditions. The memory is allocated suddenly and rapidly, and just a few requests can quickly deplete the available memory, leading to resource exhaustion.

Depending on how the API server is run on the node, it might either be killed through OOM by the kernel when exceeding the configured memory limits during these uncontrolled spikes, or, if limits are not configured, it might have an even worse impact on the control plane node. And worse: after the first API server failure, the same requests will likely hit another control plane node in an HA setup, probably with the same impact. This is potentially a situation that is hard to diagnose and hard to recover from.

Streaming list requests

Today, we're excited to announce a major improvement. With the graduation of the watch list feature to beta in Kubernetes 1.32, client-go users can opt in (after explicitly enabling the WatchListClient feature gate) to streaming lists by switching from list to (a special kind of) watch requests. Watch requests are served from the watch cache, an in-memory cache designed to improve the scalability of read operations. By streaming each item individually instead of returning the entire collection, the new method maintains constant memory overhead.
The API server is bound by the maximum allowed size of an object in etcd plus a few additional allocations. This approach drastically reduces the temporary memory usage compared to traditional list requests, ensuring a more efficient and stable system, especially in clusters with a large number of objects of a given type or large average object sizes, where memory consumption used to be high despite paging.

Building on the insight gained from the synthetic test (see the synthetic test section), we developed an automated performance test to systematically evaluate the impact of the watch list feature. This test replicates the same scenario, generating a large number of Secrets with a large payload and scaling the number of informers to simulate heavy list request patterns. The automated test is executed periodically to monitor memory usage of the server with the feature enabled and disabled.

The results showed significant improvements with the watch list feature enabled. With the feature turned on, the kube-apiserver's memory consumption stabilized at approximately 2 GB. By contrast, with the feature disabled, memory usage increased to approximately 20 GB, a 10x increase! These results confirm the effectiveness of the new streaming API, which reduces the temporary memory footprint.

Enabling API Streaming for your component

Upgrade to Kubernetes 1.32.

Make sure your cluster uses etcd in version 3.4.31+ or 3.5.13+.

Change your client software to use watch lists. If your client code is written in Golang, you'll want to enable WatchListClient for client-go. For details on enabling that feature, read Introducing Feature Gates to Client-Go: Enhancing Flexibility and Control. (A hedged sketch of enabling the gate for a custom controller appears at the end of this article.)

What's next?

In Kubernetes 1.32, the feature is enabled in kube-controller-manager by default despite its beta state. This will eventually be expanded to other core components like kube-scheduler or the kubelet once the feature becomes generally available, if not earlier. Other third-party components are encouraged to opt in to the feature during the beta phase, especially when they are at risk of accessing a large number of resources or kinds with potentially large object sizes.

For the time being, API Priority and Fairness assigns a reasonably small cost to list requests. This is necessary to allow enough parallelism for the average case, where list requests are cheap enough. But it does not match the spiky, exceptional situation of many large objects. Once the majority of the Kubernetes ecosystem has switched to watch list, the list cost estimation can be changed to larger values without risking degraded performance in the average case, and with that, increasing the protection against this kind of request that can still hit the API server in the future.

The synthetic test

In order to reproduce the issue, we conducted a manual test to understand the impact of list requests on kube-apiserver memory usage. In the test, we created 400 Secrets, each containing 1 MB of data, and used informers to retrieve all Secrets. The results were alarming: only 16 informers were needed to cause the test server to run out of memory and crash, demonstrating how quickly memory consumption can grow under such conditions.

Special shout out to @deads2k for his help in shaping this feature.
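To complement the "Enabling API Streaming for your component" steps above, here is a minimal, hedged sketch of how a Golang-based controller deployed on Kubernetes might opt in to the WatchListClient client-go feature gate via an environment variable, the mechanism described in the client-go feature gates post linked above; the Deployment name, image, and container layout are assumptions for illustration only:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-controller                 # hypothetical controller name
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-controller
  template:
    metadata:
      labels:
        app: my-controller
    spec:
      containers:
      - name: controller
        image: example.com/my-controller:v0.1.0   # hypothetical image
        env:
        # client-go reads feature gate overrides from KUBE_FEATURE_<FeatureName>
        # environment variables; setting this opts the controller in to watch lists
        - name: KUBE_FEATURE_WatchListClient
          value: "true"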
-
Kubernetes v1.32 Adds A New CPU Manager Static Policy Option For Strict CPU Reservation
on December 16, 2024 at 12:00 am
In Kubernetes v1.32, after years of community discussion, we are excited to introduce a strict-cpu-reservation option for the CPU Manager static policy. This feature is currently in alpha, with the associated policy hidden by default. You can only use the policy if you explicitly enable the alpha behavior in your cluster.

Understanding the feature

The CPU Manager static policy is used to reduce latency or improve performance. The reservedSystemCPUs setting defines an explicit CPU set for OS system daemons and Kubernetes system daemons. This option is designed for Telco/NFV type use cases where uncontrolled interrupts/timers may impact workload performance. You can use this option to define the explicit cpuset for the system/Kubernetes daemons as well as the interrupts/timers, so the remaining CPUs on the system can be used exclusively for workloads, with less impact from uncontrolled interrupts/timers. More details of this parameter can be found on the Explicitly Reserved CPU List page.

If you want to protect your system daemons and interrupt processing, the obvious way is to use the reservedSystemCPUs option. However, until the Kubernetes v1.32 release, this isolation was only implemented for guaranteed pods that made requests for a whole number of CPUs. At pod admission time, the kubelet only compares the CPU requests against the allocatable CPUs. In Kubernetes, limits can be higher than the requests; the previous implementation allowed burstable and best-effort pods to use up the capacity of reservedSystemCPUs, which could then starve host OS services of CPU, and we know that people saw this in real-life deployments. The existing behavior also made benchmarking results (for both infrastructure and workloads) inaccurate.

When this new strict-cpu-reservation policy option is enabled, the CPU Manager static policy will not allow any workload to use the reserved system CPU cores.

Enabling the feature

To enable this feature, you need to turn on both the CPUManagerPolicyAlphaOptions feature gate and the strict-cpu-reservation policy option. You also need to remove the /var/lib/kubelet/cpu_manager_state file if it exists, and restart the kubelet.

With the following kubelet configuration:

kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
featureGates:
  ...
  CPUManagerPolicyOptions: true
  CPUManagerPolicyAlphaOptions: true
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  strict-cpu-reservation: "true"
reservedSystemCPUs: "0,32,1,33,16,48"
...

When strict-cpu-reservation is not set or set to false:

# cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"0-63","checksum":1058907510}

When strict-cpu-reservation is set to true:

# cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"2-15,17-31,34-47,49-63","checksum":4141502832}

Monitoring the feature

You can monitor the feature impact by checking the following CPU Manager counters:

cpu_manager_shared_pool_size_millicores: reports the shared pool size, in millicores (e.g. 13500m)

cpu_manager_exclusive_cpu_allocation_count: reports exclusively allocated cores, counting full cores (e.g. 16)

Your best-effort workloads may starve if the cpu_manager_shared_pool_size_millicores count is zero for a prolonged time. We believe any pod that is required for operational purposes, like a log forwarder, should not run as best-effort, but you can review and adjust the amount of CPU cores reserved as needed.
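To tie this back to the exclusive-CPU behavior described above, here is a hedged sketch of a Guaranteed QoS pod that requests a whole number of CPUs and would therefore receive exclusive cores from the CPU Manager static policy, while never being placed on the reserved cores once strict-cpu-reservation is enabled; the pod name and image are made up for illustration:

apiVersion: v1
kind: Pod
metadata:
  name: latency-sensitive-app                                   # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/latency-sensitive-app:1.0       # hypothetical image
    resources:
      # requests == limits with an integer CPU count makes this a Guaranteed pod,
      # which is eligible for exclusive CPUs under the static policy
      requests:
        cpu: "2"
        memory: 1Gi
      limits:
        cpu: "2"
        memory: 1Gi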
Conclusion

Strict CPU reservation is critical for Telco/NFV use cases. It is also a prerequisite for enabling the all-in-one type of deployments where workloads are placed on nodes serving combined control+worker+storage roles. We want you to start using the feature, and we look forward to your feedback.

Further reading

Please check out the Control CPU Management Policies on the Node task page to learn more about the CPU Manager and how it fits in relation to the other node-level resource managers.

Getting involved

This feature is driven by SIG Node. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please attend the SIG Node meeting for more details.