Kubernetes Blog

The Kubernetes blog is used by the project to communicate new features, community reports, and any news that might be relevant to the Kubernetes community.
-
Navigating Failures in Pods With Devices
on July 3, 2025 at 12:00 am
Kubernetes is the de facto standard for container orchestration, but when it comes to handling specialized hardware like GPUs and other accelerators, things get a bit complicated. This blog post dives into the challenges of managing failure modes when operating pods with devices in Kubernetes, based on insights from Sergey Kanzhelev and Mrunal Patel's talk at KubeCon NA 2024. You can follow the links to slides and recording.

The AI/ML boom and its impact on Kubernetes

The rise of AI/ML workloads has brought new challenges to Kubernetes. These workloads often rely heavily on specialized hardware, and any device failure can significantly impact performance and lead to frustrating interruptions. As highlighted in the 2024 Llama paper, hardware issues, particularly GPU failures, are a major cause of disruption in AI/ML training. You can also learn how much effort NVIDIA spends on handling device failures and maintenance in the KubeCon talk by Ryan Hallisey and Piotr Prokop, All-Your-GPUs-Are-Belong-to-Us: An Inside Look at NVIDIA's Self-Healing GeForce NOW Infrastructure (recording), as they see 19 remediation requests per 1000 nodes a day! We also see data centers offering spot consumption models and overcommitting on power, making device failures commonplace and a part of the business model.

However, Kubernetes's view on resources is still very static. The resource is either there or not, and if it is there, the assumption is that it will stay there, fully functional – Kubernetes lacks good support for handling full or partial hardware failures. These long-existing assumptions, combined with the overall complexity of a setup, lead to a variety of failure modes, which we discuss here.

Understanding AI/ML workloads

Generally, all AI/ML workloads require specialized hardware, have challenging scheduling requirements, and are expensive when idle. AI/ML workloads typically fall into two categories – training and inference. Here is an oversimplified view of those categories' characteristics, which are different from traditional workloads like web services:

Training

These workloads are resource-intensive, often consuming entire machines and running as gangs of pods. Training jobs are usually "run to completion" – but that could be days, weeks or even months. Any failure in a single pod can necessitate restarting the entire step across all the pods.

Inference

These workloads are usually long-running or run indefinitely, and can be small enough to consume a subset of a Node's devices or large enough to span multiple nodes. They often require downloading huge files with the model weights.

These workload types specifically break many past assumptions:

Workload assumptions before and now

| Before | Now |
| --- | --- |
| Can get a better CPU and the app will work faster. | Require a specific device (or class of devices) to run. |
| When something doesn't work, just recreate it. | Allocation or reallocation is expensive. |
| Any node will work. No need to coordinate between Pods. | Scheduled in a special way – devices often connected in a cross-node topology. |
| Each Pod can be plug-and-play replaced if failed. | Pods are a part of a larger task. Lifecycle of an entire task depends on each Pod. |
| Container images are slim and easily available. | Container images may be so big that they require special handling. |
| Long initialization can be offset by slow rollout. | Initialization may be long and should be optimized, sometimes across many Pods together. |
| Compute nodes are commoditized and relatively inexpensive, so some idle time is acceptable. | Nodes with specialized hardware can be an order of magnitude more expensive than those without, so idle time is very wasteful. |
The existing failure model was built on old assumptions. It may still work for the new workload types, but it has limited knowledge about devices and is very expensive for them – in some cases, even prohibitively expensive. You will see more examples later in this article.

Why Kubernetes still reigns supreme

This article does not go deeper into the question of why not start fresh for AI/ML workloads, since they are so different from traditional Kubernetes workloads. Despite many challenges, Kubernetes remains the platform of choice for AI/ML workloads. Its maturity, security, and rich ecosystem of tools make it a compelling option. While alternatives exist, they often lack the years of development and refinement that Kubernetes offers. And the Kubernetes developers are actively addressing the gaps identified in this article and beyond.

The current state of device failure handling

This section outlines different failure modes and the best practices and DIY (Do-It-Yourself) solutions used today. The next section describes a roadmap for improving things for those failure modes.

Failure modes: K8s infrastructure

In order to understand the failures related to the Kubernetes infrastructure, you need to understand how many moving parts are involved in scheduling a Pod on the node. The sequence of events when the Pod is scheduled on the Node is as follows:

1. Device plugin is scheduled on the Node
2. Device plugin is registered with the kubelet via local gRPC
3. Kubelet uses the device plugin to watch for devices and updates the capacity of the node
4. Scheduler places a user Pod on a Node based on the updated capacity
5. Kubelet asks the device plugin to Allocate devices for a user Pod
6. Kubelet creates a user Pod with the allocated devices attached to it

The original post includes a diagram of the actors involved. As there are so many actors interconnected, every one of them and every connection may experience interruptions. This leads to many exceptional situations that are often considered failures, and may cause serious workload interruptions:

- Pods failing admission at various stages of their lifecycle
- Pods unable to run on perfectly fine hardware
- Scheduling taking an unexpectedly long time

The goal for Kubernetes is to make the interactions between these components as reliable as possible. Kubelet already implements retries, grace periods, and other techniques to improve reliability. The roadmap section goes into details on other edge cases that the Kubernetes project tracks. However, all these improvements only work when these best practices are followed:

- Configure and restart kubelet and the container runtime (such as containerd or CRI-O) as early as possible to not interrupt the workload.
- Monitor device plugin health and carefully plan for upgrades.
- Do not overload the node with less-important workloads to prevent interruption of the device plugin and other components.
- Configure user Pod tolerations to handle node readiness flakes (see the sketch after this list).
- Configure and code graceful termination logic carefully to not block devices for too long.
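The "tolerations for node readiness flakes" practice can be expressed directly in the Pod spec. Below is a minimal sketch; the Pod name, container name, and image are hypothetical, the taint keys are the standard not-ready/unreachable node taints, and the 600-second window is an arbitrary illustration of extending the tolerance beyond the default:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-worker          # hypothetical name
spec:
  tolerations:
  # Ride out brief node readiness flakes instead of being evicted quickly.
  # The DefaultTolerationSeconds admission plugin normally adds these with
  # tolerationSeconds: 300; setting them explicitly lets you pick the window.
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 600
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 600
  containers:
  - name: trainer                # hypothetical container
    image: example.com/trainer:latest   # hypothetical image
```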
Another class of Kubernetes infra-related issues is driver-related. With traditional resources like CPU and memory, no compatibility checks between the application and hardware were needed. With special devices like hardware accelerators, there are new failure modes. Device drivers installed on the node:

- must match the hardware
- must be compatible with the app
- must work with other drivers (like nccl, etc.)

Best practices for handling driver versions:

- Monitor driver installer health
- Plan upgrades of infrastructure and Pods to match the version
- Have canary deployments whenever possible

Following the best practices in this section and using device plugins and device driver installers from trusted and reliable sources generally eliminates this class of failures. Kubernetes is tracking work to make this space even better.

Failure modes: device failed

There is very little handling of device failure in Kubernetes today. Device plugins report a device failure only by changing the count of allocatable devices, and Kubernetes relies on standard mechanisms like liveness probes or container failures to allow Pods to communicate the failure condition to the kubelet. However, Kubernetes does not correlate device failures with container crashes and does not offer any mitigation beyond restarting the container while it stays attached to the same device. This is why many plugins and DIY solutions exist to handle device failures based on various signals.

Health controller

In many cases a failed device will result in unrecoverable and very expensive nodes doing nothing. A simple DIY solution is a node health controller. The controller could compare the device allocatable count with the capacity and, if the capacity is greater, start a timer. Once the timer reaches a threshold, the health controller kills and recreates the node. There are problems with the health controller approach:

- The root cause of the device failure is typically not known.
- The controller is not workload aware.
- The failed device might not be in use, and you may want to keep other devices running.
- The detection may be too slow as it is very generic.
- The node may be part of a bigger set of nodes and simply cannot be deleted in isolation without the other nodes.

There are variations of the health controller solving some of the problems above. The overall theme here, though, is that to best handle failed devices, you need customized handling for the specific workload. Kubernetes doesn't yet offer enough abstraction to express how critical the device is for a node, for the cluster, and for the Pod it is assigned to.

Pod failure policy

Another DIY approach for device failure handling is a per-pod reaction to a failed device. This approach is applicable for training workloads that are implemented as Jobs. A Pod can define special error codes for device failures. For example, whenever unexpected device behavior is encountered, the Pod exits with a special exit code. Then the Pod failure policy can handle the device failure in a special way, as sketched below. Read more on Handling retriable and non-retriable pod failures with Pod failure policy. There are some problems with the Pod failure policy approach for Jobs:

- There is no well-known "device failed" condition, so this approach does not work for the generic Pod case.
- Error codes must be coded carefully and in some cases are hard to guarantee.
- It only works with Jobs with restartPolicy: Never, due to a limitation of the pod failure policy feature.

So, this solution has limited applicability.
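As a sketch of this pattern, the Job below treats one agreed-upon exit code as a retriable device failure. The exit code 42, the Job name, the container name, and the image are arbitrary illustrations, not a convention defined by Kubernetes:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job            # hypothetical name
spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
    # Treat the agreed-upon "device failed" exit code as retriable:
    # the failure does not count against backoffLimit and the Pod is recreated.
    - action: Ignore
      onExitCodes:
        containerName: trainer
        operator: In
        values: [42]
  template:
    spec:
      restartPolicy: Never      # required for podFailurePolicy
      containers:
      - name: trainer
        image: example.com/trainer:latest   # hypothetical image
```

A FailJob or Count action could be used instead if the special exit code should terminate the Job or count against the retry budget.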
Custom pod watcher

A slightly more generic approach is to implement a pod watcher as a DIY solution, or to use third-party tools offering this functionality. The pod watcher is most often used to handle device failures for inference workloads. Since Kubernetes just keeps a Pod assigned to a device, even if the device is reportedly unhealthy, the idea is to detect this situation with the pod watcher and apply some remediation. It often involves obtaining device health status and its mapping to the Pod using the Pod Resources API on the node. If a device fails, the watcher can then delete the attached Pod as a remediation, and the replica set will handle the Pod recreation on a healthy device. Other reasons to implement this watcher:

- Without it, the Pod will keep being assigned to the failed device forever.
- There is no descheduling for a pod with restartPolicy: Always.
- There are no built-in controllers that delete Pods in CrashLoopBackOff.

Problems with the custom pod watcher:

- The signal for the pod watcher is expensive to get, and involves some privileged actions.
- It is a custom solution and it assumes the importance of a device for a Pod.
- The pod watcher relies on external controllers to reschedule a Pod.

There are more variations of DIY solutions for handling device failures or upcoming maintenance. Overall, Kubernetes has enough extension points to implement these solutions. However, some extension points require higher privilege than users may be comfortable with, or are too disruptive. The roadmap section goes into more detail on specific improvements in handling device failures.

Failure modes: container code failed

When the container code fails or something bad happens to it, like out-of-memory conditions, Kubernetes knows how to handle those cases: either the container is restarted, or, if the Pod has restartPolicy: Never, the Pod fails and is scheduled on another node. Kubernetes has limited expressiveness on what counts as a failure (for example, a non-zero exit code or liveness probe failure) and how to react to such a failure (mostly either always restart or immediately fail the Pod). This level of expressiveness is often not enough for complicated AI/ML workloads. AI/ML pods are better rescheduled locally or even in-place, as that saves on image pulling time and device allocation. AI/ML pods are often interconnected and need to be restarted together. This adds another level of complexity, and optimizing it often brings major savings in running AI/ML workloads.

There are various DIY solutions to handle Pod failure orchestration. The most typical one is to wrap the main executable in a container with some orchestrator, and this orchestrator restarts the main executable whenever the job needs to be restarted because some other pod has failed. Solutions like this are very fragile and elaborate. They are often worth the money saved compared to a regular JobSet delete/recreate cycle when used in large training jobs. Making these solutions less fragile and more streamlined by developing new hooks and extension points in Kubernetes will make them easy to apply to smaller jobs, benefiting everybody.

Failure modes: device degradation

Not all device failures are terminal for the overall workload or batch job. As the hardware stack gets more and more complex, a misconfiguration on one of the hardware stack layers, or a driver failure, may result in devices that are functional but lagging on performance. One device that is lagging behind can slow down the whole training job. We see reports of such cases more and more often.
Kubernetes has no way to express this type of failure today, and since it is the newest type of failure mode, there is not much in the way of best practices offered by hardware vendors for detection, or third-party tooling for remediation. Typically, these failures are detected based on observed workload characteristics, for example the expected speed of AI/ML training steps on particular hardware. Remediation for these issues depends heavily on the workload's needs.

Roadmap

As outlined in the section above, Kubernetes offers a lot of extension points which are used to implement various DIY solutions. The space of AI/ML is developing very fast, with changing requirements and usage patterns. SIG Node is taking a measured approach of enabling more extension points to implement workload-specific scenarios over introducing new semantics to support specific scenarios. This means prioritizing making information about failures readily available over implementing automatic remediations for those failures that might only be suitable for a subset of workloads. This approach ensures there are no drastic changes for workload handling which may break existing, well-oiled DIY solutions or experiences with the existing, more traditional workloads.

Many error handling techniques used today work for AI/ML, but are very expensive. SIG Node will invest in extension points to make those cheaper, with the understanding that cutting costs for AI/ML is critical. The following is the set of specific investments we envision for various failure modes.

Roadmap for failure modes: K8s infrastructure

The area of Kubernetes infrastructure is the easiest to understand and very important to make right for the upcoming transition from Device Plugins to DRA. SIG Node is tracking many work items in this area, most notably the following:

- integrate kubelet with the systemd watchdog · Issue #127460
- DRA: detect stale DRA plugin sockets · Issue #128696
- Support takeover for devicemanager/device-plugin · Issue #127803
- Kubelet plugin registration reliability · Issue #127457
- Recreate the Device Manager gRPC server if failed · Issue #128167
- Retry pod admission on device plugin grpc failures · Issue #128043

Basically, every interaction of Kubernetes components must be reliable, via either kubelet improvements or best practices in plugin development and deployment.

Roadmap for failure modes: device failed

For device failures, some patterns are already emerging in common scenarios that Kubernetes can support. However, the very first step is to make information about failed devices easier to obtain. The first piece of work here is KEP 4680 (Add Resource Health Status to the Pod Status for Device Plugin and DRA). Longer-term ideas to be tested include:

- Integrate device failures into the Pod failure policy.
- Node-local retry policies, enabling pod failure policies for Pods with restartPolicy=OnFailure and possibly beyond that.
- Ability to deschedule a pod, including one with restartPolicy: Always, so it can get a new device allocated.
- Add device health to the ResourceSlice used to represent devices in DRA, rather than simply withdrawing an unhealthy device from the ResourceSlice.

Roadmap for failure modes: container code failed

The main improvements to handle container code failures for AI/ML workloads all target cheaper error handling and recovery. The cheapness mostly comes from reusing pre-allocated resources as much as possible.
This ranges from reusing Pods by restarting containers in-place, to node-local restarts of containers instead of rescheduling whenever possible, to snapshotting support, and to rescheduling that prioritizes the same node to save on image pulls.

Consider this scenario: a big training job needs 512 Pods to run, and one of the pods fails. This means that all Pods need to be interrupted and synced up to restart the failed step. The most efficient way to achieve this is generally to reuse as many Pods as possible by restarting them in-place, while replacing the failed pod to clear the error from it (the original post illustrates this with a diagram). It is possible to implement this scenario today, but all solutions implementing it are fragile due to the lack of certain extension points in Kubernetes. Adding these extension points to implement this scenario is on the Kubernetes roadmap.

Roadmap for failure modes: device degradation

There is very little done in this area – there is no clear detection signal, very limited troubleshooting tooling, and no built-in semantics to express a "degraded" device in Kubernetes. There has been discussion of adding data on device performance or degradation in the ResourceSlice used by DRA to represent devices, but it is not yet clearly defined. There are also projects like node-healthcheck-operator that can be used for some scenarios. We expect developments in this area from hardware vendors and cloud providers, and we expect to see mostly DIY solutions in the near future. As more users get exposed to AI/ML workloads, this is a space needing feedback on the patterns used.

Join the conversation

The Kubernetes community encourages feedback and participation in shaping the future of device failure handling. Join SIG Node and contribute to the ongoing discussions!

This blog post provides a high-level overview of the challenges and future directions for device failure management in Kubernetes. By addressing these issues, Kubernetes can solidify its position as the leading platform for AI/ML workloads, ensuring resilience and reliability for applications that depend on specialized hardware.
-
Image Compatibility In Cloud Native Environments
on June 25, 2025 at 12:00 am
In industries where systems must run very reliably and meet strict performance criteria, such as telecommunications, high-performance computing, or AI, containerized applications often need a specific operating system configuration or the presence of specific hardware. It is common practice to require the use of specific versions of the kernel, its configuration, device drivers, or system components. Despite the existence of the Open Container Initiative (OCI), a governing community that defines standards and specifications for container images, there has been a gap in expressing such compatibility requirements. The need to address this issue has led to different proposals and, ultimately, an implementation in Kubernetes' Node Feature Discovery (NFD).

NFD is an open source Kubernetes project that automatically detects and reports hardware and system features of cluster nodes. This information helps users schedule workloads on nodes that meet specific system requirements, which is especially useful for applications with strict hardware or operating system dependencies.

The need for image compatibility specification

Dependencies between containers and host OS

A container image is built on a base image, which provides a minimal runtime environment, often a stripped-down Linux userland, completely empty or distroless. When an application requires certain features from the host OS, compatibility issues arise. These dependencies can manifest in several ways:

- Drivers: Host driver versions must match the supported range of a library version inside the container to avoid compatibility problems. Examples include GPUs and network drivers.
- Libraries or software: The container must come with a specific version or range of versions of a library or software to run optimally in the environment. Examples from high performance computing are MPI, EFA, or Infiniband.
- Kernel modules or features: Specific kernel features or modules must be present. Examples include support for write-protected huge page faults, or the presence of VFIO.
- And more…

While containers in Kubernetes are the most likely unit of abstraction for these needs, the definition of compatibility can extend further to include other container technologies such as Singularity, and other OCI artifacts such as binaries from a Spack binary cache.

Multi-cloud and hybrid cloud challenges

Containerized applications are deployed across various Kubernetes distributions and cloud providers, where different host operating systems introduce compatibility challenges. Often those have to be pre-configured before workload deployment or are immutable. For instance, different cloud providers will include different operating systems like:

- RHCOS/RHEL
- Photon OS
- Amazon Linux 2
- Container-Optimized OS
- Azure Linux OS
- And more…

Each OS comes with unique kernel versions, configurations, and drivers, making compatibility a non-trivial issue for applications requiring specific features. It must be possible to quickly assess a container for its suitability to run on any specific environment.

Image compatibility initiative

An effort was made within the Open Containers Initiative Image Compatibility working group to introduce a standard for image compatibility metadata. A specification for compatibility would allow container authors to declare required host OS features, making compatibility requirements discoverable and programmable. The specification implemented in Kubernetes Node Feature Discovery is one of the discussed proposals.
It aims to:

- Define a structured way to express compatibility in OCI image manifests.
- Support a compatibility specification alongside container images in image registries.
- Allow automated validation of compatibility before scheduling containers.

The concept has since been implemented in the Kubernetes Node Feature Discovery project.

Implementation in Node Feature Discovery

The solution integrates compatibility metadata into Kubernetes via NFD features and the NodeFeatureGroup API. This interface enables users to match containers to nodes based on exposed hardware and software features, allowing for intelligent scheduling and workload optimization.

Compatibility specification

The compatibility specification is a structured list of compatibility objects containing Node Feature Groups. These objects define image requirements and facilitate validation against host nodes. The feature requirements are described using the list of available features from the NFD project. The schema has the following structure:

- version (string) – Specifies the API version.
- compatibilities (array of objects) – List of compatibility sets.
  - rules (object) – Specifies a NodeFeatureGroup to define image requirements.
  - weight (int, optional) – Node affinity weight.
  - tag (string, optional) – Categorization tag.
  - description (string, optional) – Short description.

An example might look like the following:

```yaml
version: v1alpha1
compatibilities:
- description: "My image requirements"
  rules:
  - name: "kernel and cpu"
    matchFeatures:
    - feature: kernel.loadedmodule
      matchExpressions:
        vfio-pci: {op: Exists}
    - feature: cpu.model
      matchExpressions:
        vendor_id: {op: In, value: ["Intel", "AMD"]}
  - name: "one of available nics"
    matchAny:
    - matchFeatures:
      - feature: pci.device
        matchExpressions:
          vendor: {op: In, value: ["0eee"]}
          class: {op: In, value: ["0200"]}
    - matchFeatures:
      - feature: pci.device
        matchExpressions:
          vendor: {op: In, value: ["0fff"]}
          class: {op: In, value: ["0200"]}
```

Client implementation for node validation

To streamline compatibility validation, we implemented a client tool that allows for node validation based on an image's compatibility artifact. In this workflow, the image author generates a compatibility artifact that points to the image it describes in a registry via the referrers API. When a need arises to assess the fit of an image to a host, the tool can discover the artifact and verify compatibility of an image to a node before deployment. The client can validate nodes both inside and outside a Kubernetes cluster, extending the utility of the tool beyond the single Kubernetes use case.

In the future, image compatibility could play a crucial role in creating specific workload profiles based on image compatibility requirements, aiding in more efficient scheduling. Additionally, it could potentially enable automatic node configuration to some extent, further optimizing resource allocation and ensuring seamless deployment of specialized workloads.

Examples of usage

Define image compatibility metadata

A container image can have metadata that describes its requirements based on features discovered from nodes, like kernel modules or CPU models. The compatibility specification example earlier in this article exemplifies this use case.

Attach the artifact to the image

The image compatibility specification is stored as an OCI artifact. You can attach this metadata to your container image using the oras tool. The registry only needs to support OCI artifacts; support for arbitrary types is not required.
Keep in mind that the container image and the artifact must be stored in the same registry. Use the following command to attach the artifact to the image:

```bash
oras attach \
  --artifact-type application/vnd.nfd.image-compatibility.v1alpha1 <image-url> \
  <path-to-spec>.yaml:application/vnd.nfd.image-compatibility.spec.v1alpha1+yaml
```

Validate image compatibility

After attaching the compatibility specification, you can validate whether a node meets the image's requirements. This validation can be done using the nfd client:

```bash
nfd compat validate-node --image <image-url>
```

Read the output from the client

Finally, you can read the report generated by the tool, or use your own tools to act on the generated JSON report.

Conclusion

The addition of image compatibility to Kubernetes through Node Feature Discovery underscores the growing importance of addressing compatibility in cloud native environments. It is only a start, as further work is needed to integrate compatibility into scheduling of workloads within and outside of Kubernetes. However, by integrating this feature into Kubernetes, mission-critical workloads can now define and validate host OS requirements more efficiently. Moving forward, the adoption of compatibility metadata within Kubernetes ecosystems will significantly enhance the reliability and performance of specialized containerized applications, ensuring they meet the stringent requirements of industries like telecommunications, high-performance computing, or any environment that requires special hardware or host OS configuration.

Get involved

Join the Kubernetes Node Feature Discovery project if you're interested in getting involved with the design and development of the Image Compatibility API and tools. We always welcome new contributors.
-
Changes to Kubernetes Slack
on June 16, 2025 at 12:00 am
UPDATE: We've received notice from Salesforce that our Slack workspace WILL NOT BE DOWNGRADED on June 20th. Stand by for more details, but for now, there is no urgency to back up private channels or direct messages.

Kubernetes Slack will lose its special status and will be changing into a standard free Slack on June 20, 2025. Sometime later this year, our community may move to a new platform. If you are responsible for a channel or private channel, or a member of a User Group, you will need to take some actions as soon as you can.

For the last decade, Slack has supported our project with a free customized enterprise account. They have let us know that they can no longer do so, particularly since our Slack is one of the largest and more active ones on the platform. As such, they will be downgrading it to a standard free Slack while we decide on, and implement, other options.

On Friday, June 20, we will be subject to the feature limitations of free Slack. The primary ones which will affect us will be only retaining 90 days of history, and having to disable several apps and workflows which we are currently using. The Slack Admin team will do their best to manage these limitations. Responsible channel owners, members of private channels, and members of User Groups should take some actions to prepare for the downgrade and preserve information as soon as possible.

The CNCF Projects Staff have proposed that our community look at migrating to Discord. Because of existing issues where we have been pushing the limits of Slack, they have already explored what a Kubernetes Discord would look like. Discord would allow us to implement new tools and integrations which would help the community, such as GitHub group membership synchronization. The Steering Committee will discuss and decide on our future platform.

Please see our FAQ, and check the kubernetes-dev mailing list and the #announcements channel for further news. If you have specific feedback on our Slack status, join the discussion on GitHub.
-
Enhancing Kubernetes Event Management with Custom Aggregation
on June 10, 2025 at 12:00 am
Kubernetes Events provide crucial insights into cluster operations, but as clusters grow, managing and analyzing these events becomes increasingly challenging. This blog post explores how to build custom event aggregation systems that help engineering teams better understand cluster behavior and troubleshoot issues more effectively.

The challenge with Kubernetes events

In a Kubernetes cluster, events are generated for various operations – from pod scheduling and container starts to volume mounts and network configurations. While these events are invaluable for debugging and monitoring, several challenges emerge in production environments:

- Volume: Large clusters can generate thousands of events per minute
- Retention: Default event retention is limited to one hour
- Correlation: Related events from different components are not automatically linked
- Classification: Events lack standardized severity or category classifications
- Aggregation: Similar events are not automatically grouped

To learn more about Events in Kubernetes, read the Event API reference.

Real-world value

Consider a production environment with tens of microservices where users report intermittent transaction failures.

With a traditional event-review process, engineers waste hours sifting through thousands of standalone events spread across namespaces. By the time they look into an issue, the older events have long since been purged, and correlating pod restarts to node-level issues is practically impossible.

With custom event aggregation, the system groups events across resources, instantly surfacing correlation patterns such as volume mount timeouts preceding pod restarts. History shows that this occurred during past record traffic spikes, highlighting a storage scalability issue in minutes rather than hours.

The benefit of this approach is that organizations that implement it commonly cut their troubleshooting time significantly, while increasing system reliability by detecting patterns early.

Building an Event aggregation system

This post explores how to build a custom event aggregation system that addresses these challenges, aligned to Kubernetes best practices. I've picked the Go programming language for my example.
Architecture overview

This event aggregation system consists of three main components:

- Event Watcher: Monitors the Kubernetes API for new events
- Event Processor: Processes, categorizes, and correlates events
- Storage Backend: Stores processed events for longer retention

Here's a sketch for how to implement the event watcher:

```go
package main

import (
	"context"

	eventsv1 "k8s.io/api/events/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

type EventWatcher struct {
	clientset *kubernetes.Clientset
}

func NewEventWatcher(config *rest.Config) (*EventWatcher, error) {
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		return nil, err
	}
	return &EventWatcher{clientset: clientset}, nil
}

func (w *EventWatcher) Watch(ctx context.Context) (<-chan *eventsv1.Event, error) {
	events := make(chan *eventsv1.Event)

	watcher, err := w.clientset.EventsV1().Events("").Watch(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}

	go func() {
		defer close(events)
		for {
			select {
			case event := <-watcher.ResultChan():
				if e, ok := event.Object.(*eventsv1.Event); ok {
					events <- e
				}
			case <-ctx.Done():
				watcher.Stop()
				return
			}
		}
	}()

	return events, nil
}
```

Event processing and classification

The event processor enriches events with additional context and classification:

```go
type EventProcessor struct {
	categoryRules    []CategoryRule
	correlationRules []CorrelationRule
}

type ProcessedEvent struct {
	Event         *eventsv1.Event
	Category      string
	Severity      string
	CorrelationID string
	Metadata      map[string]string
}

func (p *EventProcessor) Process(event *eventsv1.Event) *ProcessedEvent {
	processed := &ProcessedEvent{
		Event:    event,
		Metadata: make(map[string]string),
	}

	// Apply classification rules
	processed.Category = p.classifyEvent(event)
	processed.Severity = p.determineSeverity(event)

	// Generate correlation ID for related events
	processed.CorrelationID = p.correlateEvent(event)

	// Add useful metadata
	processed.Metadata = p.extractMetadata(event)

	return processed
}
```

Implementing Event correlation

One of the key features you could implement is a way of correlating related Events. Here's an example correlation strategy:

```go
func (p *EventProcessor) correlateEvent(event *eventsv1.Event) string {
	// Correlation strategies:
	// 1. Time-based: Events within a time window
	// 2. Resource-based: Events affecting the same resource
	// 3. Causation-based: Events with cause-effect relationships

	correlationKey := generateCorrelationKey(event)
	return correlationKey
}

func generateCorrelationKey(event *eventsv1.Event) string {
	// Example: Combine namespace, resource type, and name.
	// Note: the events.k8s.io/v1 API exposes the referenced object as Regarding.
	return fmt.Sprintf("%s/%s/%s",
		event.Regarding.Namespace,
		event.Regarding.Kind,
		event.Regarding.Name,
	)
}
```
Event storage and retention

For long-term storage and analysis, you'll probably want a backend that supports:

- Efficient querying of large event volumes
- Flexible retention policies
- Support for aggregation queries

Here's a sample storage interface:

```go
type EventStorage interface {
	Store(context.Context, *ProcessedEvent) error
	Query(context.Context, EventQuery) ([]ProcessedEvent, error)
	Aggregate(context.Context, AggregationParams) ([]EventAggregate, error)
}

type EventQuery struct {
	TimeRange     TimeRange
	Categories    []string
	Severity      []string
	CorrelationID string
	Limit         int
}

type AggregationParams struct {
	GroupBy    []string
	TimeWindow string
	Metrics    []string
}
```

Good practices for Event management

Resource efficiency

- Implement rate limiting for event processing
- Use efficient filtering at the API server level
- Batch events for storage operations

Scalability

- Distribute event processing across multiple workers
- Use leader election for coordination
- Implement backoff strategies for API rate limits

Reliability

- Handle API server disconnections gracefully
- Buffer events during storage backend unavailability
- Implement retry mechanisms with exponential backoff

Advanced features

Pattern detection

Implement pattern detection to identify recurring issues:

```go
type PatternDetector struct {
	patterns  map[string]*Pattern
	threshold int
}

func (d *PatternDetector) Detect(events []ProcessedEvent) []Pattern {
	// Group similar events
	groups := groupSimilarEvents(events)

	// Analyze frequency and timing
	patterns := identifyPatterns(groups)

	return patterns
}

func groupSimilarEvents(events []ProcessedEvent) map[string][]ProcessedEvent {
	groups := make(map[string][]ProcessedEvent)

	for _, event := range events {
		// Create similarity key based on event characteristics.
		// As above, events.k8s.io/v1 exposes the referenced object as Regarding.
		similarityKey := fmt.Sprintf("%s:%s:%s",
			event.Event.Reason,
			event.Event.Regarding.Kind,
			event.Event.Regarding.Namespace,
		)

		// Group events with the same key
		groups[similarityKey] = append(groups[similarityKey], event)
	}

	return groups
}

func identifyPatterns(groups map[string][]ProcessedEvent) []Pattern {
	var patterns []Pattern

	for key, events := range groups {
		// Only consider groups with enough events to form a pattern
		if len(events) < 3 {
			continue
		}

		// Sort events by time. The events.k8s.io/v1 type carries the legacy
		// timestamps as DeprecatedFirstTimestamp/DeprecatedLastTimestamp.
		sort.Slice(events, func(i, j int) bool {
			return events[i].Event.DeprecatedLastTimestamp.Time.Before(events[j].Event.DeprecatedLastTimestamp.Time)
		})

		// Calculate time range and frequency
		firstSeen := events[0].Event.DeprecatedFirstTimestamp.Time
		lastSeen := events[len(events)-1].Event.DeprecatedLastTimestamp.Time
		duration := lastSeen.Sub(firstSeen).Minutes()

		var frequency float64
		if duration > 0 {
			frequency = float64(len(events)) / duration
		}

		// Create a pattern if it meets threshold criteria
		if frequency > 0.5 { // More than 1 event per 2 minutes
			pattern := Pattern{
				Type:         key,
				Count:        len(events),
				FirstSeen:    firstSeen,
				LastSeen:     lastSeen,
				Frequency:    frequency,
				EventSamples: events[:min(3, len(events))], // Keep up to 3 samples
			}
			patterns = append(patterns, pattern)
		}
	}

	return patterns
}
```

With this implementation, the system can identify recurring patterns such as node pressure events, pod scheduling failures, or networking issues that occur with a specific frequency.
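Before moving on to alerting, here is a minimal sketch of how the watcher, processor, and storage interface defined above might be wired together. It assumes an in-cluster configuration, the types from the earlier snippets, and additional imports of log and k8s.io/client-go/rest; error handling and retries are intentionally simplified.

```go
// Sketch: wire the watcher, processor, and a storage backend together.
// The EventStorage implementation is supplied by the caller.
func run(ctx context.Context, processor *EventProcessor, storage EventStorage) error {
	config, err := rest.InClusterConfig() // or clientcmd when running out of cluster
	if err != nil {
		return err
	}

	watcher, err := NewEventWatcher(config)
	if err != nil {
		return err
	}

	events, err := watcher.Watch(ctx)
	if err != nil {
		return err
	}

	// Consume raw events, enrich them, and persist the result.
	for event := range events {
		processed := processor.Process(event)
		if err := storage.Store(ctx, processed); err != nil {
			// A production controller would retry with backoff instead of just logging.
			log.Printf("failed to store event %s/%s: %v", event.Namespace, event.Name, err)
		}
	}
	return ctx.Err()
}
```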
Real-time alerts

The following example provides a starting point for building an alerting system based on event patterns. It is not a complete solution but a conceptual sketch to illustrate the approach.

```go
type AlertManager struct {
	rules     []AlertRule
	notifiers []Notifier
}

func (a *AlertManager) EvaluateEvents(events []ProcessedEvent) {
	for _, rule := range a.rules {
		if rule.Matches(events) {
			alert := rule.GenerateAlert(events)
			a.notify(alert)
		}
	}
}
```

Conclusion

A well-designed event aggregation system can significantly improve cluster observability and troubleshooting capabilities. By implementing custom event processing, correlation, and storage, operators can better understand cluster behavior and respond to issues more effectively. The solutions presented here can be extended and customized based on specific requirements while maintaining compatibility with the Kubernetes API and following best practices for scalability and reliability.

Next steps

Future enhancements could include:

- Machine learning for anomaly detection
- Integration with popular observability platforms
- Custom event APIs for application-specific events
- Enhanced visualization and reporting capabilities

For more information on Kubernetes events and custom controllers, refer to the official Kubernetes documentation.
-
Introducing Gateway API Inference Extension
on June 5, 2025 at 12:00 am
Modern generative AI and large language model (LLM) services create unique traffic-routing challenges on Kubernetes. Unlike typical short-lived, stateless web requests, LLM inference sessions are often long-running, resource-intensive, and partially stateful. For example, a single GPU-backed model server may keep multiple inference sessions active and maintain in-memory token caches. Traditional load balancers focused on HTTP path or round-robin lack the specialized capabilities needed for these workloads. They also don't account for model identity or request criticality (e.g., interactive chat vs. batch jobs). Organizations often patch together ad-hoc solutions, but a standardized approach is missing.

Gateway API Inference Extension

Gateway API Inference Extension was created to address this gap by building on the existing Gateway API, adding inference-specific routing capabilities while retaining the familiar model of Gateways and HTTPRoutes. By adding an inference extension to your existing gateway, you effectively transform it into an Inference Gateway, enabling you to self-host GenAI/LLMs with a "model-as-a-service" mindset.

The project's goal is to improve and standardize routing to inference workloads across the ecosystem. Key objectives include enabling model-aware routing, supporting per-request criticalities, facilitating safe model roll-outs, and optimizing load balancing based on real-time model metrics. By achieving these, the project aims to reduce latency and improve accelerator (GPU) utilization for AI workloads.

How it works

The design introduces two new Custom Resources (CRDs) with distinct responsibilities, each aligning with a specific user persona in the AI/ML serving workflow:

- InferencePool: Defines a pool of pods (model servers) running on shared compute (e.g., GPU nodes). The platform admin can configure how these pods are deployed, scaled, and balanced. An InferencePool ensures consistent resource usage and enforces platform-wide policies. An InferencePool is similar to a Service but specialized for AI/ML serving needs and aware of the model-serving protocol.
- InferenceModel: A user-facing model endpoint managed by AI/ML owners. It maps a public name (e.g., "gpt-4-chat") to the actual model within an InferencePool. This lets workload owners specify which models (and optional fine-tuning) they want served, plus a traffic-splitting or prioritization policy.

In summary, the InferenceModel API lets AI/ML owners manage what is served, while the InferencePool lets platform operators manage where and how it's served (a hypothetical manifest sketch follows the request flow below).

Request flow

The flow of a request builds on the Gateway API model (Gateways and HTTPRoutes) with one or more extra inference-aware steps (extensions) in the middle. Here's a high-level example of the request flow with the Endpoint Selection Extension (ESE):

1. Gateway routing: A client sends a request (e.g., an HTTP POST to /completions). The Gateway (like Envoy) examines the HTTPRoute and identifies the matching InferencePool backend.
2. Endpoint selection: Instead of simply forwarding to any available pod, the Gateway consults an inference-specific routing extension – the Endpoint Selection Extension – to pick the best of the available pods. This extension examines live pod metrics (queue lengths, memory usage, loaded adapters) to choose the ideal pod for the request.
3. Inference-aware scheduling: The chosen pod is the one that can handle the request with the lowest latency or highest efficiency, given the user's criticality or resource needs. The Gateway then forwards traffic to that specific pod.

This extra step provides a smarter, model-aware routing mechanism that still feels like a normal single request to the client.
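To make the split in responsibilities concrete, here is a hypothetical manifest sketch. The API group, version, and field names below are assumptions that may not match the released Gateway API Inference Extension exactly, so treat this as an illustration of the two personas rather than a copy-paste example.

```yaml
# Hypothetical sketch: names and fields may differ from the current API version.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama2-pool                # owned by the platform admin
spec:
  targetPortNumber: 8000           # port the model servers listen on (assumed field)
  selector:
    app: vllm-llama2               # pods running the model servers
  extensionRef:
    name: endpoint-picker          # the Endpoint Selection Extension service (assumed name)
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chat-model                 # owned by the AI/ML workload owner
spec:
  modelName: gpt-4-chat            # public model name clients request
  criticality: Critical            # prioritize over less critical traffic (assumed field)
  poolRef:
    name: llama2-pool              # pool that actually serves the model
```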
Additionally, the design is extensible – any Inference Gateway can be enhanced with additional inference-specific extensions to handle new routing strategies, advanced scheduling logic, or specialized hardware needs. As the project continues to grow, contributors are encouraged to develop new extensions that are fully compatible with the same underlying Gateway API model, further expanding the possibilities for efficient and intelligent GenAI/LLM routing.

Benchmarks

We evaluated this extension against a standard Kubernetes Service for a vLLM-based model serving deployment. The test environment consisted of multiple H100 (80 GB) GPU pods running vLLM (version 1) on a Kubernetes cluster, with 10 Llama2 model replicas. The Latency Profile Generator (LPG) tool was used to generate traffic and measure throughput, latency, and other metrics. The ShareGPT dataset served as the workload, and traffic was ramped from 100 Queries per Second (QPS) up to 1000 QPS.

Key results

- Comparable throughput: Throughout the tested QPS range, the ESE delivered throughput roughly on par with a standard Kubernetes Service.
- Lower latency:
  - Per-output-token latency: The ESE showed significantly lower p90 latency at higher QPS (500+), indicating that its model-aware routing decisions reduce queueing and resource contention as GPU memory approaches saturation.
  - Overall p90 latency: Similar trends emerged, with the ESE reducing end-to-end tail latencies compared to the baseline, particularly as traffic increased beyond 400–500 QPS.

These results suggest that this extension's model-aware routing significantly reduced latency for GPU-backed LLM workloads. By dynamically selecting the least-loaded or best-performing model server, it avoids hotspots that can appear when using traditional load balancing methods for large, long-running inference requests.

Roadmap

As the Gateway API Inference Extension heads toward GA, planned features include:

- Prefix-cache aware load balancing for remote caches
- LoRA adapter pipelines for automated rollout
- Fairness and priority between workloads in the same criticality band
- HPA support for scaling based on aggregate, per-model metrics
- Support for large multi-modal inputs/outputs
- Additional model types (e.g., diffusion models)
- Heterogeneous accelerators (serving on multiple accelerator types with latency- and cost-aware load balancing)
- Disaggregated serving for independently scaling pools

Summary

By aligning model serving with Kubernetes-native tooling, Gateway API Inference Extension aims to simplify and standardize how AI/ML traffic is routed. With model-aware routing, criticality-based prioritization, and more, it helps ops teams deliver the right LLM services to the right users – smoothly and efficiently.

Ready to learn more? Visit the project docs to dive deeper, give an Inference Gateway extension a try with a few simple steps, and get involved if you're interested in contributing to the project!