Horizontal Autoscaling of Shoot Nodes
Every Shoot cluster that has at least one worker pool
configured with minimum < maximum nodes gets a cluster-autoscaler deployment.
Gardener leverages the upstream community Kubernetes
cluster-autoscaler component,
which means that the logging of the Kubernetes cluster-autoscaler is the same as upstream.
The scaling limits of the Shoot nodes are defined in the worker pool part of the Shoot's YAML configuration.
In our stack, the cluster-autoscaler deployment
in the Shoot namespace on the Seed cluster
is responsible for scaling the Shoot's worker nodes up and down.
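You can verify that this deployment exists by listing it in the Shoot's namespace on the Seed cluster; the namespace below follows the shoot--<project>--<shoot> pattern used in the examples, and the kubeconfig path is a placeholder:
# List the cluster-autoscaler deployment in the Shoot namespace on the Seed cluster
kubectl get deployment cluster-autoscaler \
  -n shoot--dev--mcm13 \
  --kubeconfig <kubeconfig of the Seed cluster>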
Scaling Up
Scale-up creates a watch on the API server looking for all pods.
It checks for any unschedulable pods every 10 seconds (configurable by --scan-interval
flag).
A pod is unschedulable when the Kubernetes scheduler is unable to find a node
that can accommodate the pod. For example, a pod can request more CPU
than is available on any of the cluster nodes. Unschedulable pods
are recognized by their PodCondition. Whenever the Kubernetes scheduler fails to find
a place to run a pod, it sets the PodScheduled PodCondition to False
and the reason to Unschedulable.
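For example, you can inspect this condition directly on a pending pod (the pod name task-perf is taken from the scale-up example below):
# Print the PodScheduled condition of a pending pod in the Shoot cluster
kubectl get pod task-perf -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")]}'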
If there are any items in the unschedulable pods list,
Cluster Autoscaler tries to find a new place to run them.
It is assumed that the underlying cluster is run on top of some kind of node groups. Inside a node group, all machines have identical capacity and the same set of assigned labels. Thus, increasing the size of a node group will create a new machine that is similar to those already in the cluster – it will just not have any user-created pods running (but will have all pods run from the node manifest and daemon sets).
It may take some time before the created nodes appear in Kubernetes.
It almost entirely depends on the cloud provider and the speed of node provisioning,
including the TLS bootstrapping process. Cluster Autoscaler expects requested nodes
to appear within 15 minutes (configurable by the --max-node-provision-time flag).
After this time, if they are still unregistered, it stops considering them in simulations
and may attempt to scale up a different group if the pods are still pending.
It will also attempt to remove any nodes left unregistered after this time.
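Both intervals are ordinary cluster-autoscaler command-line flags. Purely as an illustration (in a Gardener setup the cluster-autoscaler deployment is managed by Gardener, so these values should not be edited on the deployment directly), the corresponding container arguments would look like this:
# Illustrative cluster-autoscaler flags matching the defaults described above
--scan-interval=10s
--max-node-provision-time=15m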
The relevant scaling parameters of a worker pool in the Shoot configuration look like this:
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
…
maxSurge: 1        # how many nodes can be added in one scale-up iteration
maxUnavailable: 0
maximum: 3         # upper limit up to which the number of Shoot nodes can grow
minimum: 1         # minimum (initial) number of Shoot nodes
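These fields belong to a worker pool under spec.provider.workers. A more complete sketch of the relevant part of a Shoot manifest (the worker pool name and machine type are illustrative placeholders):
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  provider:
    workers:
      - name: wrk               # illustrative worker pool name
        machine:
          type: <machine type>  # cloud-provider specific machine type
        minimum: 1              # minimum (initial) number of nodes
        maximum: 3              # upper limit for scale-up
        maxSurge: 1             # nodes that can be added in one scale-up step
        maxUnavailable: 0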
Example
Shoot name: mcm13
Number of nodes: 1
Node resources: 2 CPU 4GB Ram
Scaling configuration:
maxSurge: 1
maxUnavailable: 0
maximum: 2
minimum: 1
Gardener project name: dev
Name of seed namespace for Shoot: shoot--dev--mcm13
Let's consider a Shoot cluster with one node that has 2 CPU and 4 GB of RAM. We will deploy a pod on this Shoot cluster whose resource requests are higher than what the Shoot node can offer.
apiVersion: v1
kind: Pod
metadata:
  name: task-perf
spec:
  containers:
    - name: ha-example
      image: nginx
      ports:
        - containerPort: 80
          name: "http-server"
      resources:
        requests:
          cpu: "1500m"
          memory: "2Gi"
The Shoot node is already running some load and cannot provide the resources requested in the pod's resources section, therefore a scale-up will be triggered.
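You can create the pod in the Shoot cluster with kubectl apply; the manifest file name is a placeholder:
# Create the pod in the Shoot cluster (file name is a placeholder)
kubectl apply -f task-perf-pod.yaml --kubeconfig <kubeconfig of the Shoot cluster>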
Below you can find the logs from the cluster-autoscaler-xxxxxx-xxx pod when it triggers a scale-up:
I0203 12:25:13.198285 1 klogx.go:86] Pod default/task-perf is unschedulable
I0203 12:25:13.198560 1 waste.go:55] Expanding Node Group Shoot--dev--mcm13-wrk-z1 would waste 25.00% CPU, 47.93% Memory, 36.47% Blended
I0203 12:25:13.198581 1 scale_up.go:468] Best option to resize: Shoot--dev--mcm13-wrk-z1
I0203 12:25:13.198589 1 scale_up.go:472] Estimated 1 nodes needed in Shoot--dev--mcm13-wrk-z1
I0203 12:25:13.198611 1 scale_up.go:595] Final scale-up plan: [{Shoot--dev--mcm13-wrk-z1 1->2 (max: 2)}]
I0203 12:25:13.198626 1 scale_up.go:691] Scale-up: setting group Shoot--dev--mcm13-wrk-z1 size to 2
When you check the status of the pending pod with kubectl get pod/task-perf -o yaml,
in the status section you should see the PodScheduled condition
with status False and reason Unschedulable:
status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2023-02-03T12:25:04Z'
      reason: Unschedulable
      message: >-
        0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory.
        preemption: 0/1 nodes are available: 1 No preemption victims found for
        incoming pod.
After some time you can see that the cluster has 2 nodes,
and the pod can then be scheduled.
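You can verify the node count from the Shoot cluster, for example:
# Expect two nodes in Ready state after the scale-up has finished
kubectl get nodes --kubeconfig <kubeconfig of the Shoot cluster>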
Scaling Down
Every 10 seconds (configurable by --scan-interval
flag), if no scale-up is needed,
Cluster Autoscaler checks which nodes are unneeded.
A node is considered for removal when all below conditions hold:
The sum of cpu and memory requests of all pods running on this node (DaemonSet pods
and Mirror pods are included by default but this is configurable with --ignore-daemonsets-utilization
and --ignore-mirror-pods-utilization
flags) is smaller than 50% of the node's allocatable.
(Before 1.1.0, node capacity was used instead of allocatable.)
Utilization threshold can be configured using --scale-down-utilization-threshold
flag.
All pods running on the node (except those that run on all nodes by default, like manifest-run pods or pods created by daemonsets) can be moved to other nodes. See What types of pods can prevent CA from removing a node? section for more details on what pods don't fulfill this condition, even if there is space for them elsewhere. While checking this condition, the new locations of all movable pods are memorized. With that, Cluster Autoscaler knows where each pod can be moved, and which nodes depend on which other nodes in terms of pod migration. Of course, it may happen that eventually the scheduler will place the pods somewhere else.
The node doesn't have the scale-down disabled annotation (see How can I prevent Cluster Autoscaler from scaling down a particular node? and the example below).
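The annotation referenced in the last condition is the upstream cluster-autoscaler annotation; to exclude a node from scale-down you can set it like this (the node name is a placeholder):
# Prevent Cluster Autoscaler from removing this node
kubectl annotate node <node-name> cluster-autoscaler.kubernetes.io/scale-down-disabled=true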
If a node is unneeded for more than 10 minutes, it will be terminated.
(This time can be configured by flags – please see I have a couple of nodes
with low utilization, but they are not scaled down.
Why? section for a more detailed explanation.)
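In a Gardener setup, the utilization threshold and the unneeded time do not have to be passed as raw flags; they can be set per Shoot. A sketch assuming the spec.kubernetes.clusterAutoscaler fields of the Shoot API (verify the field names against your Gardener version):
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubernetes:
    clusterAutoscaler:
      scanInterval: 10s                    # corresponds to --scan-interval
      scaleDownUtilizationThreshold: 0.5   # corresponds to --scale-down-utilization-threshold
      scaleDownUnneededTime: 10m           # corresponds to --scale-down-unneeded-time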
Cluster Autoscaler terminates one non-empty node at a time
to reduce the risk of creating new unschedulable pods.
The next node may possibly be terminated just after the first one,
if it was also unneeded for more than 10 min and didn't rely on the same nodes
in simulation (see below example scenario), but not together.
Empty nodes, on the other hand, can be terminated in bulk, up to 10 nodes at a time
(configurable by the --max-empty-bulk-delete flag).
What happens when a non-empty node is terminated? As mentioned above, all pods should be migrated elsewhere. Cluster Autoscaler does this by evicting them and tainting the node, so they aren't scheduled there again.
DaemonSet pods may also be evicted. This can be configured separately for empty
(i.e. containing only DaemonSet pods) and non-empty nodes with --daemonset-eviction-for-empty-nodes
and --daemonset-eviction-for-occupied-nodes
flags, respectively.
Note that the default behavior is different for each flag: by default, DaemonSet pods
are evicted only from occupied nodes. Individual DaemonSet pods
can also explicitly choose to be evicted (or not).
See How can I enable/disable eviction for a specific DaemonSet for more details.
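According to the upstream FAQ, this per-DaemonSet behavior is controlled with an annotation on the DaemonSet's pod template; a sketch (verify the annotation against your cluster-autoscaler version):
# In the DaemonSet's pod template metadata:
annotations:
  cluster-autoscaler.kubernetes.io/enable-ds-eviction: "true"   # or "false" to opt out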
Example 1
Shoot name: mcm13
Number of nodes: 1
Node resources: 2 CPU 4GB Ram
Scaling configuration:
maxSurge: 1
maxUnavailable: 0
maximum: 2
minimum: 1
Gardener project name: dev
Name of seed namespace for Shoot: shoot--dev--mcm13
Let's consider that this Shoot currently has 2 nodes after the scale-up procedure described in the Scaling Up section.
We will now remove the pod with the high resource requests:
kubectl delete pod/task-perf
(if needed, specify the kubeconfig
of the Shoot cluster,
and the namespace if the pod is not in default).
After at most 5 minutes you should see the following in the logs of cluster-autoscaler-xxxxxx-xxx:
I0203 13:10:26.954665 1 cluster.go:139] Shoot--dev--mcm13-wrk-z1-6c765-wk8gr for removal
I0203 13:10:26.955076 1 cluster.go:167] node Shoot--dev--mcm13-wrk-z1-6c765-wk8gr may be removed
I0203 13:10:36.967661 1 static_autoscaler.go:432] No unschedulable pods
I0203 13:10:36.967735 1 cluster.go:139] Shoot--dev--mcm13-wrk-z1-6c765-wk8gr for removal
I0203 13:10:36.968060 1 cluster.go:167] node Shoot--dev--mcm13-wrk-z1-6c765-wk8gr may be removed
I0203 13:10:46.980632 1 static_autoscaler.go:432] No unschedulable pods
I0203 13:10:46.980761 1 cluster.go:139] Shoot--dev--mcm13-wrk-z1-6c765-wk8gr for removal
I0203 13:10:46.981009 1 cluster.go:167] node Shoot--dev--mcm13-wrk-z1-6c765-wk8gr may be removed
From the moment these messages first appear in the logs, it takes approximately 20 minutes until the node is deleted and the cluster is scaled down.
After some time you can see in the logs that the autoscaler adds a taint to both nodes (it considers both of them candidates for deletion).
I0203 13:25:18.149211 1 delete.go:103] Successfully added DeletionCandidateTaint on node Shoot--dev--mcm13-wrk-z1-6c765-gbjwm
I0203 13:25:18.157509 1 delete.go:103] Successfully added DeletionCandidateTaint on node Shoot--dev--mcm13-wrk-z1-6c765-wk8gr
Within the next 10 minutes you can see that one of the nodes is removed and the taint on the second node is released.
I0203 13:36:39.153508 1 scale_down.go:1102] Scale-down: removing empty node Shoot--dev--mcm13-wrk-z1-6c765-gbjwm
I0203 13:36:39.177325 1 delete.go:103] Successfully added ToBeDeletedTaint on node Shoot--dev--mcm13-wrk-z1-6c765-gbjwm
I0203 13:36:39.221217 1 mcm_manager.go:546] MachineDeployment Shoot--dev--mcm13-wrk-z1 size decreased to 1 shoot--dev--mcm13-wrk-z1-6c765-wk8gr - node group min size reached
I0203 13:36:49.189295 1 pre_filtering_processor.go:66] Skipping Shoot--dev--mcm13-wrk-z1-6c765-gbjwm - node group min size reached
I0203 13:36:49.189330 1 scale_down.go:917] No candidates for scale down
I0203 13:36:49.189359 1 delete.go:197] Releasing taint {Key:DeletionCandidateOfClusterAutoscaler Value:1675430718 Effect:PreferNoSchedule TimeAdded:<nil>} on node shoot--dev--mcm13-wrk-z1-6c765-wk8gr
I0203 13:36:49.203160 1 delete.go:228] Successfully released DeletionCandidateTaint on node Shoot--dev--mcm13-wrk-z1-6c765-wk8gr
W0203 13:36:59.216194 1 clusterstate.go:594] Nodegroup is nil for openstack:///5aed4bed-e5db-4bca-bc70-d13640ae5260
I0203 13:36:59.216240 1 static_autoscaler.go:341] 1 unregistered nodes present
I0203 13:36:59.216267 1 static_autoscaler.go:432] No unschedulable pods
In the Shoot cluster you can now see only one node ready. Note that scaling down takes longer than scaling up, because scale-down is a critical operation and the autoscaler wants to be sure that the decrease in load is not just temporary.
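You can check the remaining node and confirm that no deletion-candidate taint is left, for example:
# List the nodes and their taints in the Shoot cluster
kubectl get nodes --kubeconfig <kubeconfig of the Shoot cluster>
kubectl describe nodes --kubeconfig <kubeconfig of the Shoot cluster> | grep -i taint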
Manual Scaling
Manual scaling means editing the YAML configuration manifest of the Shoot, or editing the Shoot object in the Garden cluster directly.
To scale up, set the field shoot.spec.provider.workers.minimum to the desired
number of nodes; the value must be lower than or equal to shoot.spec.provider.workers.maximum.
For scaling down, decrease the shoot.spec.provider.workers.minimum value
to the desired number of nodes (not lower than 1).
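Instead of editing the object interactively, the field can also be changed non-interactively with a JSON patch; the sketch below assumes the worker pool you want to change is the first entry in spec.provider.workers (the shoot name, project namespace, and kubeconfig are placeholders):
# Set the minimum of the first worker pool to 2 (adjust the index and value as needed)
kubectl patch shoot <shoot-name> -n garden-<project> --type json \
  -p '[{"op":"replace","path":"/spec/provider/workers/0/minimum","value":2}]' \
  --kubeconfig <kubeconfig of the garden cluster>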
Example 2
Shoot name: mcm13
Number of nodes: 1
Node resources: 2 CPU 4GB Ram
Scaling configuration:
maxSurge: 1
maxUnavailable: 0
maximum: 3
minimum: 1
Gardener project name: dev
Name of seed namespace for Shoot: shoot--dev--mcm13
Name of garden cluster: pg-mcm1
Then edit the Shoot object in the Garden cluster:
kubectl edit shoot/mcm13 -n garden-dev --kubeconfig <kubeconfig of pg-mcm1 garden cluster>
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: mcm13
…
      maxSurge: 1
      maxUnavailable: 0
      maximum: 3
      minimum: 1   # change 1 --> 2 to scale up, or 2 --> 1 to scale down
Once the change is applied by saving and closing the YAML manifest of the object, reconciliation starts immediately.
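You can follow the reconciliation from the garden cluster, for example:
# Watch the Shoot until the reconciliation has finished
kubectl get shoot mcm13 -n garden-dev -w --kubeconfig <kubeconfig of pg-mcm1 garden cluster>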