It’s hard to predict the precise amount of resources a new Kubernetes workload requires for optimal performance. And, as your workload evolves, that initial resource consumption will also change. Fortunately, Kubernetes supports scalability by default.
Scalability enables you to expand your system’s capacity during high demand, and release that additional capacity when demand drops.
This scaling occurs at the cluster and pod levels in Kubernetes. When automated, this enables your K8s environment to respond to workload changes rapidly without tedious manual configuration.
What Are The Types Of Autoscaling In Kubernetes?
Kubernetes supports three types of autoscaling:
Cluster autoscaling
Cluster autoscaling increases or decreases the number of worker nodes within a cluster. The process adds nodes when one or more pods are pending scheduling, as long as the cluster stays within the parameters you specified.
Cluster autoscaling releases excess worker nodes as they become idle to match the number of pods currently in use.
Vertical pod autoscaling
Vertical pod autoscaling increases or decreases the amount of CPU or memory within your existing pods’ containers as needed. This adjusts your CPU and memory capacity without increasing or decreasing the number of pods you have at the time.
The process does this by scaling the resource requests and limits of a container within stateful or stateless pods, although it is best suited to stateful services.
Horizontal pod autoscaling
To meet demand, horizontal pod autoscaling (HPA) increases or decreases the number of pod replicas in an application. This is our focus for the rest of this guide.
What Is HPA In Kubernetes?
HPA is the Kubernetes object that manages horizontal pod autoscaling. Kubernetes implements it as both an API resource (HorizontalPodAutoscaler) and a controller that acts on that resource.
Horizontal pod autoscaling is the process that automatically increases or decreases the number of pods based on your workload’s CPU or memory usage, custom metrics from within Kubernetes, or external metrics (from outside the cluster).
- The CPU and memory option tracks real-time resource usage, and when usage exceeds a pre-set percentage (or a raw value), HPA triggers the addition of new pod replicas to bring pod utilization closer to your desired level.
- Custom metrics comprise other pod usage metrics besides CPU and memory utilization. Custom metrics include client requests per second, I/O writes per second, network traffic, or some other value related to the pod’s use case.
- External metrics track values unrelated to pods, such as the number of queued tasks.
While vertical pod autoscaling scales your pod’s container CPU and memory capacity up (increase) or down, Kubernetes HPA adjusts your pods out (increase) or in (decrease/shrink) based on demand.
So, what does horizontal pod autoscaling actually look like in Kubernetes?
Kubernetes HPA Explained: How Does Horizontal Pod Autoscaling Work?
Horizontal pod autoscaling only works for workloads that can scale, so it won’t work with DaemonSets.
By default, the HPA controller runs as part of the standard kube-controller-manager daemon. It manages only pods that belong to a scalable workload resource, such as a Deployment, ReplicaSet, StatefulSet, or ReplicationController.
You first define your desired autoscaling parameters. HPA stays within the minimum and maximum number of replicas you define (minReplicas and maxReplicas) when adjusting your pods.
Each workload has its own Horizontal Pod Autoscaler. Each Kubernetes HPA works through a control loop where it checks its workload’s metrics every 15 seconds against your set threshold. But you can customize that interval with the --horizontal-pod-autoscaler-sync-period flag of kube-controller-manager.
For example, when your workload decreases, and your pod count exceeds your predefined minimum, the HorizontalPodAutoscaler triggers the Deployment, StatefulSet, or other similar workload resources to remove unused pods.
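The control loop above can be sketched in a few lines. This is a simplified model of the replica calculation described in the Kubernetes documentation (desired replicas = ceil(current replicas × current metric / target metric)), assuming the default tolerance of 0.1. The function name and the sample numbers are hypothetical, and the real controller also accounts for pod readiness and missing metrics.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Suggest a replica count using the documented HPA ratio formula (simplified)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count unchanged.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * current_metric / target_metric)
    # Never go outside the minimum and maximum you configured.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical workload: 4 replicas, 60% CPU target, limits of 2 to 10 replicas.
print(desired_replicas(4, 90, 60, 2, 10))   # usage above target: scale out
print(desired_replicas(4, 62, 60, 2, 10))   # within tolerance: no change
print(desired_replicas(4, 20, 60, 2, 10))   # usage below target: scale in
print(desired_replicas(4, 300, 60, 2, 10))  # capped at the maximum
```

The clamp on the last line of the function is why HPA never removes pods below your minimum or adds them beyond your maximum, whatever the metric says.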
To detect when to scale out or in, HPA relies on metrics.
- Metrics Server collects CPU and memory consumption data for nodes and pods from the Summary API on each kubelet and exposes it through the metrics.k8s.io API. HPA’s autoscaling/v1 API only tracks average CPU utilization. The autoscaling/v2 API adds scaling based on memory usage, custom metrics, and multiple metrics within a single HPA object.
- To use external or custom metrics, you need to create an interface with a Kubernetes monitoring service or another source of metrics. You can do this through the custom.metrics.k8s.io API or external.metrics.k8s.io API.
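As an illustration of how these metric sources fit into one autoscaling/v2 object, here is a minimal sketch that builds an HPA manifest as a Python dictionary and prints it as JSON. The names web-hpa, web, and http_requests_per_second are hypothetical, and the Pods metric only works if a custom metrics adapter in your cluster actually serves a metric with that name.

```python
import json


def build_hpa_manifest(name, deployment, min_replicas, max_replicas, cpu_target_percent):
    """Build an autoscaling/v2 HorizontalPodAutoscaler manifest for a Deployment."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    # Resource metric served by Metrics Server (metrics.k8s.io).
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {
                            "type": "Utilization",
                            "averageUtilization": cpu_target_percent,
                        },
                    },
                },
                {
                    # Per-pod custom metric (custom.metrics.k8s.io); hypothetical name.
                    "type": "Pods",
                    "pods": {
                        "metric": {"name": "http_requests_per_second"},
                        "target": {"type": "AverageValue", "averageValue": "100"},
                    },
                },
            ],
        },
    }


manifest = build_hpa_manifest("web-hpa", "web", 2, 10, 60)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, you could save the printed output to a file and apply it with kubectl apply -f. When several metrics are listed, HPA evaluates each one and uses the largest resulting replica count.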
Overall, the HPA scaling process is fast, automatic, and continuous. That helps provide the following benefits.
What Are The Benefits Of HPA In Kubernetes?
Horizontal pod autoscaling helps:
- Continuously assess your metrics to keep your application available at all times.
- Automatically increase an application’s pods to meet increased workload, such as traffic volume or number of client requests, in order to sustain optimal performance.
- Automatically decrease the number of an application’s pods when demand reduces to save costs.
- Adjust the number of pod replicas according to the time of day you expect the highest or lowest demand.
- Set specific capacity needs for specific periods, such as weekends, holidays, or off-peak hours.
- Schedule capacity based on an event, for example, a code release.
- Apply a stabilization window (five minutes by default for scale-down) before removing pods, which prevents pods from terminating prematurely (thrashing).
Yet horizontal pod autoscaling in Kubernetes has its limits.
What Are The Limitations Of Kubernetes HPA?
Some limitations of the Kubernetes Horizontal Pod Autoscaler include:
- It doesn’t work with DaemonSets.
- It only works if your cluster can add more pods without exceeding your pre-set parameters.
- A pod may terminate frequently (thrashing) if its CPU and memory parameters are not set up correctly.
- Even with HPA enabled, you can still waste resources if you overprovision for CPU or memory capacity.
- While you can implement HPA and VPA together in the same cluster, it may lead to inaccurate metrics where memory and CPU scaling events happen simultaneously.
- Waiting out the scale-down stabilization window (five minutes by default) prevents thrashing, but limits the ability to handle real-time workload demands.
- You may need to re-architect your K8s application to support multi-server workload distribution.
- With resource metrics alone, Kubernetes HPA ignores disk reads/writes, network I/O, and storage consumption unless you expose them as custom metrics, which can lead to inefficient allocation.
- HPA’s dynamic scaling approach adds a layer of complexity to understanding Kubernetes costs. As such, you may need a robust Kubernetes cost optimization solution to help you track, analyze, and improve K8s cost visibility.
So, how can you make the most of Kubernetes HPA features?
18 Kubernetes HPA Best Practices To Apply Right Away (Checklist)
Take full advantage of the Kubernetes HorizontalPodAutoscaler by implementing the following best practices:
- Design or re-architect your application to take advantage of microservices architecture by default. You want it to support running parallel pods seamlessly.
- Set your desired threshold for pod resource requests before enabling HPA. These resource requests enable HPA to optimize pod scaling.
- Configure resource requests for each container in every pod. Missing resource request values in some containers may corrupt HPA calculations, resulting in inaccurate figures and poor scaling decisions.
- Install Metrics Server on your K8s cluster because Kubernetes HPA needs to access per-pod resource metrics in order to determine how to scale. The Metrics Server supports HPA by enabling it to retrieve the values from the metrics.k8s.io API.
- Each HPA appears in a cluster as a HorizontalPodAutoscaler object. You need to use commands like kubectl describe hpa HPA_NAME or kubectl get hpa to interact with the objects.
- Spend time monitoring your workload and understanding its requirements. That’s because for HPA to be effective, your cluster needs to be able to support it. Otherwise, HPA would just spin up pods that end up in a pending state instead of increasing capacity.
- Use HorizontalPodAutoscaler (HPA) alongside ClusterAutoscaler (CA) to automatically reduce the number of active nodes as the number of pods decreases. Besides improving performance, this is also an excellent way to optimize Kubernetes costs.
- Avoid using VerticalPodAutoscaler alongside HorizontalPodAutoscaler on memory and CPU metrics. Because HPA and VPA both use these metrics in scaling decisions, you may experience unexpected issues if VPA and HPA scaling events occur simultaneously.
- Use HPA with VPA for other metrics. This helps you rightsize your pods to your workload more effectively than when using just HPA or VPA in Kubernetes.
- Instead of directly attaching the HPA resource to a ReplicaSet or a ReplicationController, use it on a Deployment object. That’s because a rolling update on a Deployment replaces the underlying ReplicaSet with a new one, so an HPA bound to the old ReplicaSet would no longer control the pods that are actually running.
- Create HPA resources with the declarative form to enable version control. Over time, this approach helps you track configuration changes more efficiently.
- A custom metrics API exposes fewer metrics, so it is easier to secure than an external metrics API. Yet custom metrics only cover metrics from inside the cluster, which can be limiting if you need external data to provide broader performance and cost insights.
- Custom metrics can also limit scaling.
Using third-party solutions such as Prometheus or Grafana can help solve this problem by providing more precise scaling with additional features, like alerting and dashboards.
- Set your average pod utilization target at 50%-80%.
This range will ensure your app can handle unexpected peaks in traffic without scaling up. The lower your target value, the easier it is for HPA to kick in and scale your pods out, which may increase costs without necessarily enhancing performance.
- However, you do not want to raise the average pod utilization target above 80%.
Otherwise, you may have an inadequate buffer to handle the increased load while new replicas are spinning up.
- Monitor application performance to be sure.
APM metrics, such as CPU and memory usage, enable Kubernetes HPA to determine when and by how many pods to scale in and out. You can use open-source solutions like Grafana, ELK stack, and Prometheus or proprietary solutions like CloudZero (Kubernetes cost analysis), New Relic, and Sematext to check HPA’s accuracy and configure it accordingly.
- More doesn’t always mean better.
Some DevOps professionals recommend using a separate HPA for each deployment. The thinking is that this helps scale components or services individually, fine-tuning scaling. However, this can add costs, so consider what tradeoff to make, depending on your priorities.
- Define Pod Disruption Budgets for high-priority applications.
A PodDisruptionBudget limits voluntary disruptions to mission-critical pods running in your Kubernetes cluster. Specifying one for a particular application instructs eviction-based operations, such as Cluster Autoscaler node removal or node drains, not to evict pods when doing so would drop the application below the minimum availability you set in the disruption budget.
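To make the 50%-80% utilization guidance in the checklist above more concrete, here is a back-of-the-envelope sketch. It assumes a hypothetical workload that needs 4,000 millicores of CPU in total, with each pod requesting 500 millicores. The function names are illustrative, and the model ignores startup time and uneven load distribution.

```python
def replicas_for_load(total_cpu_millicores, request_millicores, target_percent):
    """Replicas needed so average usage per pod sits at the target percent of its request."""
    numerator = total_cpu_millicores * 100
    denominator = request_millicores * target_percent
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-numerator // denominator)


def headroom_factor(target_percent):
    """How far load can grow before average usage reaches 100% of the request."""
    return 100 / target_percent


# Compare a low, a recommended-maximum, and an aggressive target.
for target in (50, 80, 90):
    pods = replicas_for_load(4000, 500, target)
    print(f"target {target}%: {pods} pods, headroom x{headroom_factor(target):.2f}")
```

A lower target buys more headroom per pod at the price of more running replicas, which is the cost and performance tradeoff the checklist describes.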
Speaking of budgets, you don’t want scalability to drain your Kubernetes budget. Yet almost no monitoring platform offers granular information about Kubernetes costs, despite tracking performance and security metrics.
Although many cost monitoring tools exist, most struggle to pinpoint who and what is changing your Kubernetes costs, and why. Yet, if you want to optimize Kubernetes costs at any scale, you need to know which components are costing you more than you intended.
CloudZero can help.
CloudZero’s Kubernetes cost analysis uses a unique, code-driven approach to help you understand your K8s costs by key Kubernetes concepts, including:
- Cost per pod
- Cost per namespace
- Cost per container or microservice
- Cost per cluster, and more
Better yet, CloudZero gives you insight into what drives your costs by showing your costs per service, deployment, environment, individual customer, product, feature, team, project, and more.
By surfacing idle costs, CloudZero helps you increase your Kubernetes performance while reducing waste.
CloudZero also detects cost anomalies in real-time. It then notifies you about abnormal cost patterns before they become costly issues.
Kubernetes HPA FAQs
We’ve answered a few frequently asked questions about Kubernetes Horizontal Pod Autoscaler below.
What is the difference between VPA and HPA in Kubernetes?
While horizontal scaling (HPA) increases or decreases the number of pods in an application to accommodate demand, vertical scaling (VPA) increases or decreases the CPU and memory capacity of existing pod containers as needed.
How long does it take Kubernetes HPA to scale up?
It can take up to 90 seconds.
How long does it take Kubernetes HPA to scale down?
HPA re-evaluates metrics every 15 seconds by default, but it does not remove pods immediately. By default, it applies a five-minute scale-down stabilization window and acts on the highest replica recommendation from that period. Scaling down too quickly may cause issues such as thrashing or flapping, where the replica count swings up and down and pods terminate prematurely.
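One way to picture how a scale-down delay prevents flapping is the stabilization window described in the Kubernetes documentation: the controller remembers recent replica recommendations and acts on the highest one within the window. The sketch below is a simplified model of that idea, not the controller's actual code. The class name is hypothetical, and the window length is a parameter (upstream Kubernetes documents 300 seconds as the scale-down default).

```python
from collections import deque


class ScaleDownStabilizer:
    """Return the highest replica recommendation seen within a sliding time window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp_seconds, recommended_replicas)

    def recommend(self, now, raw_recommendation):
        self._history.append((now, raw_recommendation))
        # Forget recommendations that are older than the window.
        while self._history and self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        return max(replicas for _, replicas in self._history)


stabilizer = ScaleDownStabilizer(window_seconds=300)
print(stabilizer.recommend(0, 10))    # load is high: 10 replicas
print(stabilizer.recommend(15, 4))    # load dropped, but the window still holds 10
print(stabilizer.recommend(300, 4))   # the earlier 10 is still inside the window
print(stabilizer.recommend(315, 4))   # the 10 has aged out, so scale-in can proceed
```

Because the result is never lower than the newest recommendation, this model delays scale-in only. A higher recommendation passes through right away.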
What is the maximum size of HPA in Kubernetes?
You set the maximum and minimum sizes HPA should use for your Kubernetes deployment.
Must HPA use Metrics Server for autoscaling?
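Yes, for CPU and memory. Metrics Server (or another implementation of the metrics.k8s.io API) provides the CPU and memory metrics that HorizontalPodAutoscaler uses to calculate scaling decisions. Custom and external metrics come from a separate metrics adapter instead.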
Is HPA based on limit or request?
Kubernetes HPA uses resource requests to calculate and compare resource utilization.
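As a rough illustration of what "based on requests" means, the sketch below computes utilization as total usage divided by total requested CPU across a workload's pods. The function name and the sample figures are hypothetical, and this is a simplified model that leaves out details such as pods that are not yet ready.

```python
def average_utilization_percent(pods):
    """pods: list of (usage_millicores, request_millicores) tuples, one per pod."""
    total_usage = sum(usage for usage, _ in pods)
    total_request = sum(request for _, request in pods)
    if total_request == 0:
        # Without requests there is nothing to measure utilization against.
        raise ValueError("utilization is undefined when no CPU requests are set")
    return 100 * total_usage / total_request


# Three pods that each request 500 millicores and use 300, 400, and 200.
sample_pods = [(300, 500), (400, 500), (200, 500)]
print(average_utilization_percent(sample_pods))
```

Container limits do not appear anywhere in the calculation, which is why missing or unrealistic requests distort scaling decisions.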
Can I use VPA and HPA within the same Kubernetes pod?
Yes and no. You can employ HPA with VPA if HPA scales on metrics other than CPU and memory. If both autoscalers act on CPU and memory metrics, simultaneous scaling events could conflict and lead to errors.
How is a Kubernetes pod different from a node?
A pod is the smallest deployable unit in Kubernetes and comprises one or more containers that share storage and network resources. A node is a physical server or virtual machine that runs pods, and a Kubernetes cluster comprises one or more nodes.
How do I check if HPA is enabled in Kubernetes?
To check if you created the HPA resource, run the kubectl get hpa command.
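If you want to run the same check from a script, one option is to ask kubectl for JSON output and summarize it. The sketch below assumes kubectl is installed and configured for your cluster. The helper names and the sample payload are hypothetical, and only a few fields of each HorizontalPodAutoscaler object are read.

```python
import json
import subprocess


def summarize_hpas(kubectl_json):
    """Summarize each HPA found in the output of `kubectl get hpa -o json`."""
    data = json.loads(kubectl_json)
    summaries = []
    for item in data.get("items", []):
        spec = item.get("spec", {})
        status = item.get("status", {})
        summaries.append({
            "name": item["metadata"]["name"],
            "target": spec.get("scaleTargetRef", {}).get("name"),
            "min": spec.get("minReplicas"),
            "max": spec.get("maxReplicas"),
            "current": status.get("currentReplicas"),
            "desired": status.get("desiredReplicas"),
        })
    return summaries


def fetch_hpas(namespace="default"):
    """Call kubectl (must be on PATH with cluster access) and summarize its output."""
    result = subprocess.run(
        ["kubectl", "get", "hpa", "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return summarize_hpas(result.stdout)


# Hypothetical sample payload, so the parser can be tried without a cluster.
SAMPLE = json.dumps({
    "items": [{
        "metadata": {"name": "web-hpa"},
        "spec": {"scaleTargetRef": {"name": "web"}, "minReplicas": 2, "maxReplicas": 10},
        "status": {"currentReplicas": 3, "desiredReplicas": 4},
    }]
})
print(summarize_hpas(SAMPLE))
```

An empty result means no HPA exists in that namespace. A desired count that differs from the current count means a scaling action is in progress.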