A Quick Guide to Scaling AI/ML Workloads on Kubernetes

NGINX | January 11, 2024

When running artificial intelligence (AI) and machine learning (ML) model training and inference on Kubernetes, the ability to scale dynamically up and down becomes critical. In addition to requiring high-bandwidth storage and networking to ingest data, AI model training also needs substantial (and expensive) compute, mostly from GPUs or other specialized processors. Even when leveraging pre-trained models, tasks like model serving and fine-tuning in production are still more compute-intensive than most enterprise workloads.

Cloud-native Kubernetes is designed for rapid scalability – up and down. It’s also designed to deliver more agility and cost-efficient resource usage for dynamic workloads across hybrid, multi-cloud environments.

In this blog, we cover the three most common ways to scale AI/ML workloads on Kubernetes so you can achieve optimal performance, cost savings, and adaptability for dynamic scaling in diverse environments.

Three Scaling Modalities for AI/ML Workloads on Kubernetes

The three common ways Kubernetes scales a workload are with the Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), and Cluster Autoscaler.

Here is a breakdown of those three methods:

  • HPA – The equivalent of adding instances or pod replicas to an application, giving it more scale, capacity, and throughput.
  • VPA – The equivalent of resizing a pod to give it higher capacity with greater compute and memory.
  • Cluster Autoscaler – Automatically increases or decreases the number of nodes in a Kubernetes cluster depending on the current resource demand for the pods.

Each modality has its benefits for model training and inference, which you can explore in the use cases below.

HPA Use Cases

In many cases, distributed AI model training and inference workloads can scale horizontally (i.e., adding more pods to speed up the training process or request handling). This enables these workloads to benefit from HPA, which can scale out the number of pods based on metrics like CPU and memory usage, or even custom and external metrics relevant to the workload. In scenarios where the workload varies over time, HPA can dynamically adjust the number of pods to ensure optimal resource utilization.
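
As a minimal sketch of what this looks like in practice (the Deployment name, replica bounds, and utilization target below are hypothetical), an HPA that scales a model-serving Deployment on CPU utilization might be defined as:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving        # hypothetical Deployment handling inference requests
  minReplicas: 2               # keep a baseline for availability
  maxReplicas: 10              # cap scale-out to control cost
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70 # add replicas when average CPU use exceeds 70%
```

Custom or external metrics (for example, request queue depth) can be substituted for CPU when they better reflect the workload.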

Another aspect of horizontally scaling AI workloads in Kubernetes is load balancing. To ensure optimal performance and timely request processing, incoming requests must be distributed across multiple instances or pods. This is why an Ingress controller is one of the ideal tools to use in conjunction with HPA.
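
As a hedged sketch, assuming the NGINX Ingress Controller is installed in the cluster (the hostname, Service name, and port are hypothetical), an Ingress resource that spreads inference traffic across the HPA-scaled pods behind a Service might look like:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: model-serving-ingress
spec:
  ingressClassName: nginx            # assumes an NGINX Ingress controller is deployed
  rules:
  - host: inference.example.com      # hypothetical hostname
    http:
      paths:
      - path: /predict
        pathType: Prefix
        backend:
          service:
            name: model-serving      # Service fronting the HPA-scaled pods
            port:
              number: 80
```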

VPA Use Cases

AI model training tasks are often resource-intensive, requiring significant CPU, GPU, and memory resources. VPA can adjust these resource allocations dynamically. This helps ensure that each pod has enough resources to efficiently handle the training workload and that all assigned pods have sufficient compute capacity to perform calculations. In addition, memory requirements can fluctuate significantly during the training of large models. VPA can help prevent out-of-memory errors by increasing the memory allocation as needed.
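
A minimal sketch, assuming the Vertical Pod Autoscaler custom resource is installed in the cluster (the target name and resource bounds are hypothetical):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: model-training-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-training       # hypothetical training Deployment
  updatePolicy:
    updateMode: "Auto"         # apply recommendations by evicting and recreating pods
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: "1"
        memory: 2Gi
      maxAllowed:
        cpu: "8"
        memory: 32Gi           # headroom for memory spikes during training
```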

While it’s technically possible to use both HPA and VPA together, it requires careful configuration to avoid conflicts, as they might try to scale the same workload in different ways (i.e., horizontally versus vertically). It’s essential to clearly define the boundaries for each autoscaler, ensuring they complement rather than conflict with each other. An emerging approach is to use both with different scopes – for instance, HPA for scaling across multiple pods based on workload and VPA for fine-tuning the resource allocation of each pod within the limits set by HPA.
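
One hedged illustration of such a split (names hypothetical): let the HPA scale replica counts on CPU utilization, as in the earlier example, while the VPA is restricted to tuning only memory requests, so the two autoscalers never act on the same signal:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: model-serving-memory-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving                # same Deployment the HPA scales horizontally
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      controlledResources: ["memory"]  # leave CPU to the HPA's scaling signal
```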

Cluster Autoscaler Use Cases

Cluster Autoscaler can help dynamically adjust the overall pool of compute, storage, and networking infrastructure resources available cluster-wide to meet the demands of AI/ML workloads. By adjusting the number of nodes in a cluster based on current demand, an organization can load balance at the macro level. This is necessary to ensure optimal performance, as AI/ML workloads can demand significant computational resources unpredictably.
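
Configuration is provider-specific, but as a hedged sketch, the relevant flags on a cluster-autoscaler Deployment for an AWS node group of GPU instances (the group name and bounds are hypothetical) might include:

```yaml
# Excerpt from a cluster-autoscaler container spec (AWS example; values hypothetical)
command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --nodes=1:8:gpu-node-group              # min:max:name of the autoscaled node group
- --expander=least-waste                  # prefer node groups that minimize unused resources
- --scale-down-utilization-threshold=0.5  # remove nodes running below 50% utilization
```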

HPA, VPA, and Cluster Autoscaler Each Have a Role

In summary, these are the three ways that Kubernetes autoscaling works and benefits AI workloads:

  • HPA scales AI model serving endpoints that need to handle varying request rates.
  • VPA optimizes resource allocation for AI/ML workloads and ensures each pod has enough resources for efficient processing without over-provisioning.
  • Cluster Autoscaler adds nodes to a cluster to ensure it can accommodate resource-intensive AI jobs or removes nodes when the compute demands are low.

HPA, VPA, and Cluster Autoscaler complement each other in managing AI/ML workloads in Kubernetes. Cluster Autoscaler ensures there are enough nodes to meet workload demands, HPA efficiently distributes workloads across multiple pods, and VPA optimizes the resource allocation of these pods. Together, they provide a comprehensive scaling and resource management solution for AI/ML applications in Kubernetes environments.

Visit our Power and Protect Your AI Journey page to learn more about how F5 and NGINX can help deliver, secure, and optimize your AI/ML workloads.


About the Author

Ilya Krutov
Product Marketing Manager
