A Quick Guide to Scaling AI/ML Workloads on Kubernetes

NGINX | January 11, 2024

When running artificial intelligence (AI) and machine learning (ML) model training and inference on Kubernetes, the ability to scale dynamically up and down becomes critical. In addition to requiring high-bandwidth storage and networking to ingest data, AI model training also needs substantial (and expensive) compute, mostly from GPUs or other specialized processors. Even when leveraging pre-trained models, tasks like model serving and fine-tuning in production are still more compute-intensive than most enterprise workloads.

Cloud-native Kubernetes is designed for rapid scalability – up and down. It’s also designed to deliver more agility and cost-efficient resource usage for dynamic workloads across hybrid, multi-cloud environments.

In this blog, we cover the three most common ways to scale AI/ML workloads on Kubernetes so you can achieve optimal performance, cost savings, and adaptability for dynamic scaling in diverse environments.

Three Scaling Modalities for AI/ML Workloads on Kubernetes

The three common ways Kubernetes scales a workload are with the Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), and Cluster Autoscaler.

Here is a breakdown of those three methods:

  • HPA – The equivalent of adding instances or pod replicas to an application, giving it more scale, capacity, and throughput.
  • VPA – The equivalent of resizing a pod to give it higher capacity with greater compute and memory.
  • Cluster Autoscaler – Automatically increases or decreases the number of nodes in a Kubernetes cluster depending on the current resource demand for the pods.

Each modality has its benefits for model training and inference, which you can explore in the use cases below.

HPA Use Cases

In many cases, distributed AI model training and inference workloads can scale horizontally (i.e., adding more pods to speed up the training process or request handling). This enables these workloads to benefit from HPA, which can scale out the number of pods based on metrics like CPU and memory usage, or even custom and external metrics relevant to the workload. In scenarios where the workload varies over time, HPA can dynamically adjust the number of pods to ensure optimal resource utilization.
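
As a minimal sketch of what this looks like in practice (the Deployment name, replica bounds, and utilization target below are hypothetical), an HPA that scales a model-serving Deployment on CPU utilization might be defined as:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving        # hypothetical Deployment handling inference requests
  minReplicas: 2               # keep a baseline for availability
  maxReplicas: 10              # cap scale-out to control cost
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70 # add replicas when average CPU use exceeds 70%
```

Custom or external metrics (for example, request queue depth) can be substituted for CPU when they better reflect the workload.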

Another aspect of horizontally scaling AI workloads in Kubernetes is load balancing. To ensure optimal performance and timely request processing, incoming requests must be distributed across multiple instances or pods. This is why an Ingress controller is one of the ideal tools to use in conjunction with HPA.
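
As a hedged sketch, assuming the NGINX Ingress Controller is installed in the cluster (the hostname, Service name, and port are hypothetical), an Ingress resource that spreads inference traffic across the HPA-scaled pods behind a Service might look like:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: model-serving-ingress
spec:
  ingressClassName: nginx            # assumes an NGINX Ingress controller is deployed
  rules:
  - host: inference.example.com      # hypothetical hostname
    http:
      paths:
      - path: /predict
        pathType: Prefix
        backend:
          service:
            name: model-serving      # Service fronting the HPA-scaled pods
            port:
              number: 80
```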

VPA Use Cases

AI model training tasks are often resource-intensive, requiring significant CPU, GPU, and memory resources. VPA can adjust these resource allocations dynamically. This helps ensure that each pod has enough resources to efficiently handle the training workload and that all assigned pods have sufficient compute capacity to perform calculations. In addition, memory requirements can fluctuate significantly during the training of large models. VPA can help prevent out-of-memory errors by increasing the memory allocation as needed.
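
A minimal sketch, assuming the Vertical Pod Autoscaler custom resource is installed in the cluster (the target name and resource bounds are hypothetical):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: model-training-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-training       # hypothetical training Deployment
  updatePolicy:
    updateMode: "Auto"         # apply recommendations by evicting and recreating pods
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: "1"
        memory: 2Gi
      maxAllowed:
        cpu: "8"
        memory: 32Gi           # headroom for memory spikes during training
```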

While it’s technically possible to use both HPA and VPA together, it requires careful configuration to avoid conflicts, as they might try to scale the same workload in different ways (i.e., horizontally versus vertically). It’s essential to clearly define the boundaries for each autoscaler, ensuring they complement rather than conflict with each other. An emerging approach is to use both with different scopes – for instance, HPA for scaling across multiple pods based on workload and VPA for fine-tuning the resource allocation of each pod within the limits set by HPA.
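
One hedged illustration of such a split (names hypothetical): let the HPA scale replica counts on CPU utilization, as in the earlier example, while the VPA is restricted to tuning only memory requests, so the two autoscalers never act on the same signal:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: model-serving-memory-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving                # same Deployment the HPA scales horizontally
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      controlledResources: ["memory"]  # leave CPU to the HPA's scaling signal
```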

Cluster Autoscaler Use Cases

Cluster Autoscaler can help dynamically adjust the overall pool of compute, storage, and networking infrastructure resources available cluster-wide to meet the demands of AI/ML workloads. By adjusting the number of nodes in a cluster based on current demand, an organization can load balance at the macro level. This is necessary to ensure optimal performance, as AI/ML workloads can demand significant computational resources unpredictably.
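
Configuration is provider-specific, but as a hedged sketch, the relevant flags on a cluster-autoscaler Deployment for an AWS node group of GPU instances (the group name and bounds are hypothetical) might include:

```yaml
# Excerpt from a cluster-autoscaler container spec (AWS example; values hypothetical)
command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --nodes=1:8:gpu-node-group              # min:max:name of the autoscaled node group
- --expander=least-waste                  # prefer node groups that minimize unused resources
- --scale-down-utilization-threshold=0.5  # remove nodes running below 50% utilization
```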

HPA, VPA, and Cluster Autoscaler Each Have a Role

In summary, these are the three ways that Kubernetes autoscaling works and benefits AI workloads:

  • HPA scales AI model serving endpoints that need to handle varying request rates.
  • VPA optimizes resource allocation for AI/ML workloads and ensures each pod has enough resources for efficient processing without over-provisioning.
  • Cluster Autoscaler adds nodes to a cluster to ensure it can accommodate resource-intensive AI jobs or removes nodes when the compute demands are low.

HPA, VPA, and Cluster Autoscaler complement each other in managing AI/ML workloads in Kubernetes. Cluster Autoscaler ensures there are enough nodes to meet workload demands, HPA efficiently distributes workloads across multiple pods, and VPA optimizes the resource allocation of these pods. Together, they provide a comprehensive scaling and resource management solution for AI/ML applications in Kubernetes environments.

Visit our Power and Protect Your AI Journey page to learn more about how F5 and NGINX can help deliver, secure, and optimize your AI/ML workloads.


About the Author

Ilya Krutov
Product Marketing Manager
