Shoot K8S version management
Downgrading
Currently not supported.
What to do if lower version is needed?
Deploy a new shoot with one of the lower versions that we support.
What to do if an upgrade is needed and we want to test it?
- Deploy another test shoot with the same version of K8S as on the shoot cluster you want to upgrade.
- Deploy the applications you want/need to test on the newer version to the shoot cluster from step 1.
- Upgrade the shoot cluster from step 1 to the desired K8S version.
- Delete the shoot cluster from step 1.
- Upgrade your existing cluster.
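To check which K8S version a shoot is currently running (for example before and after the test upgrade), you can query the garden cluster. This is just a convenience sketch based on the shoot fields used later in this guide:

```bash
# list all shoots together with their current Kubernetes version
kubectl get shoots.core.gardener.cloud -A \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,K8S VERSION:.spec.kubernetes.version'
```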
Upgrading
Before upgrading
Check if Sconified services, especially CAS, are running in your cluster. If yes, follow the steps described in Upgrade K8s on Shoot Cluster with running CAS.
Check the supported K8S versions via kubectl directly on the garden cluster or in the OSC Dashboard (described later).
How to check directly on Garden cluster via kubectl
Execute the following command:
```
kubectl get cloudprofiles.core.gardener.cloud onmetal -o=custom-columns='SUPPORTED K8S VERSIONS:.spec.kubernetes.versions[*].version' | sed s/,/\\n/g
```
Example output of the previous command:
```
SUPPORTED K8S VERSIONS
…some lines omitted…
1.28.10
1.28.11
1.28.14
1.28.15
1.29.9
1.29.12
1.30.8
1.31.4
```
Use grep to search for a specific version, as in the example below:
```
$ kubectl get cloudprofiles.core.gardener.cloud onmetal \
  -o=custom-columns='SUPPORTED K8S VERSIONS:.spec.kubernetes.versions[*].version' \
  | sed s/,/\\n/g | grep 1.25.9
# output after executing the command should be `1.25.9`
```
To upgrade the K8S version, you can safely do it through the OSC Dashboard or by updating the shoot yaml in the project namespace on the garden cluster.
Through OSC Dashboard
Open the OSC Dashboard and find the cluster you want to upgrade in the list of shoot clusters. Then, as you can see in the picture below, follow these steps:
- Open the Actions menu for the shoot you want to update.
- Select Update Cluster.
- From the drop-down menu Upgrade to version, choose from the available upgrade options based on your desired K8S version. You may need to upgrade several times; for example, if the shoot K8S version is 1.23.x and we want to upgrade to 1.26.x, the drop-down menu would offer something like the options below, and we would need to upgrade first to 1.24.x, then 1.25.x and then 1.26.x:
  - 1.23.x → 1.24.x
  - 1.24.x → 1.25.x
  - 1.25.x → 1.26.x
- Then write or copy/paste the shoot name for confirmation into the bottom-left field.
- Click the UPDATE button to trigger the update, which should be completed within a few minutes.
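If you want to follow the triggered update from the command line as well, a minimal sketch (assuming access to the garden cluster; the name and namespace are placeholders) is to watch the shoot resource until the LAST OPERATION column reports Reconcile Succeeded (100%):

```bash
# watch the shoot on the garden cluster until reconciliation completes
kubectl get shoots.core.gardener.cloud <your-shoot-cluster-name> \
  -n <your-shoot-cluster-namespace> -w
```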
Through Shoot Manifest
This can be done in the OSC Dashboard or via kubectl directly on the garden cluster.
In OSC Dashboard
Let's take a look at how to do it through the dashboard.
- Click on the shoot name in the list as shown in the picture below.
- Click on YAML.
- Search for version and change it to your desired version as shown in the picture below (the relevant field is shown in the snippet after this list).
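For orientation, the field in question sits under spec.kubernetes in the shoot manifest. A minimal sketch of the relevant excerpt; the version shown is only an example:

```yaml
spec:
  kubernetes:
    version: 1.30.10  # change this to the desired, supported K8S version
```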
On Garden Cluster using Kubectl
- List shoot clusters by using kubectl get shoots.core.gardener.cloud -A as in the example below:

  ```
  $ kubectl get shoots.core.gardener.cloud -A
  NAMESPACE    NAME         CLOUDPROFILE   PROVIDER   REGION   K8S VERSION   HIBERNATION   LAST OPERATION               STATUS      AGE
  garden-dev   shmr20-t00   onmetal        onmetal    mdb      1.26.2        Awake         Reconcile Succeeded (100%)   unhealthy   2d23h
  garden-dev   shmr20-t01   onmetal        onmetal    mdb      1.26.2        Awake         Reconcile Processing (88%)   unknown     6h31m
  garden-dev   shmr20-t02   onmetal        onmetal    mdb      1.26.2        Awake         Reconcile Processing (82%)   unhealthy   5h55m
  ```
- Edit the same field version: in the shoot manifest. To do this, we can use the following command:

  ``` { .zsh .copy }
  kubectl edit shoots.core.gardener.cloud <your-shoot-cluster-name> \
    -n <your-shoot-cluster-namespace-from-previous-command>
  ```
- Save with :wq! and wait for the reconciliation to finish.

  We could also do it locally by getting the yaml from the cluster:

  ```
  kubectl get shoots.core.gardener.cloud <your-shoot-cluster-name> \
    -n <your-shoot-cluster-namespace-from-previous-command> -o yaml > <your-shoot-cluster-name>.yaml
  ```

  then editing and saving the manifest and applying it on the cluster using:

  ```
  kubectl apply -f <your-shoot-cluster-name>.yaml
  ```
Upgrade K8s on Shoot Cluster with running CAS
Upgrading a shoot cluster with running CAS (Configuration and Attestation Service) instances requires special attention because the CAS is tightly coupled to the physical nodes it runs on. This coupling is due to the CAS database key being encrypted with a CPU-specific seal key: the key material needed to access the CAS database is only available on the original host's CPU. As a result, if a CAS instance is moved or restarted on a different node (as often happens during upgrades or node replacements in cloud-native environments), it cannot decrypt its own data, leading to potential data inaccessibility or service disruption. This tight binding to hardware complicates cloud-native management, where practices like automated scaling, rolling upgrades, and node replacements are common.
To ensure a smooth upgrade process, it is recommended to migrate the CAS instances to a new worker pool before upgrading the shoot cluster.
The steps to follow are:
- Create a backup of the CAS database
- Downscale nodes in origin worker pool
- Create new temporary worker pool
- Migrate CAS pod to a node in the newly created worker pool
- Upgrade the origin worker pool to target version
- Migrate CAS back to origin worker pool
- Delete temporary worker pool
- Upscale nodes in origin worker pool
Steps
Create a backup of the CAS database
- Before starting the upgrade process, it is crucial to create a backup of the CAS database. This ensures that you have a restore point in case anything goes wrong during the upgrade process.

  Please consult the Scone User Guide on how to create a backup of the CAS database.
Downscale nodes in origin worker pool
If the worker limit is reached in the shoot cluster, we need to downscale so that at least 3 nodes can be added to the new temporary worker pool.
- Check that at least 3 nodes can be created in the Shoot Cluster

  Following Determining Available IP Addresses in a Shoot Cluster, verify that at least 3 IP addresses are available in the Shoot Cluster.

- Scale down the worker group

  If there are not at least 3 IP addresses available, you need to decrease the number of nodes in the shoot cluster by scaling down first (see the sketch after this list).
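A minimal sketch of such a downscale, assuming the origin pool is named pool-worker and is managed through the minimum/maximum fields of the shoot manifest as shown later in this section; the numbers are only illustrative:

```yaml
# shoot manifest excerpt: temporarily reduce the node count of the origin worker pool
spec:
  provider:
    workers:
      - name: pool-worker
        minimum: 3   # e.g. lowered from the original value
        maximum: 3   # e.g. lowered from the original value
```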
Create a new temporary worker pool
Create a new worker pool with 3 nodes (1 node in each availability zone). This new worker pool will be used to migrate the CAS instances.
- Create an additional worker pool with 3 nodes (one node in each Availability Zone)

  Please duplicate the specification of the current worker pool and adjust the fields below. This can be done with the yaml-view of the worker group within the Gardener dashboard.

  - spec.provider.workers[].kubernetes.version: 1.XX.xx
    This version of the kubernetes nodes needs to be pinned to the current version. This is crucial: add this field if it is not present.
  - spec.provider.workers[].name: pool-tmp
  - spec.provider.workers[].minimum: 3
  - spec.provider.workers[].maximum: 3
  The yaml should look like this:

  ```
  spec:
    provider:
      workers:
        - cri:
            name: containerd
          # usually there is no need to set kubernetes version here which then uses the version from control plane.
          # kubernetes:
          #   version: 1.30.10
          name: pool-worker
          machine:
            type: vm-dedicated-sgx-4-16-16
            image:
              name: gardenlinux
              version: 1605.0.5
            architecture: amd64
          maximum: 6
          minimum: 3
          maxSurge: 1
          maxUnavailable: 0
          volume:
            type: fast
            size: 50Gi
          zones:
            - ffm1-dev
            - ffm2-dev
            - ffm3-dev
          systemComponents:
            allow: true
        - cri:
            name: containerd
          # the kubernetes for the new worker pool needs to be pinned to the current version,
          # so that the worker group will stick to the current version and not be upgraded later
          kubernetes:
            version: 1.30.10 # this version needs to be pinned to current version
          name: pool-tmp
          machine:
            type: vm-dedicated-sgx-4-16-16
            image:
              name: gardenlinux
              version: 1605.0.5 # this version needs to be pinned to current version
            architecture: amd64
          maximum: 3
          minimum: 3
          maxSurge: 1
          maxUnavailable: 0
          volume:
            type: fast
            size: 50Gi
          zones:
            - ffm1-dev
            - ffm2-dev
            - ffm3-dev
          systemComponents:
            allow: true
  ```
  This snippet can be added through the Gardener Dashboard or directly on the Garden Cluster into the Shoot Cluster YAML manifest; please see 08-Shoot-worker-pool-management.
- Save the shoot configuration and wait for reconciliation to finish.

  This should have created 3 new worker nodes in the temporary worker pool. Check that the worker nodes from the temporary worker pool were created and joined the cluster:

  ```
  kubectl get nodes -l worker.gardener.cloud/pool=pool-tmp
  ```

  The additional worker pool should be in Ready state.

- Check if CAS is healthy and the additional worker nodes are registered to host the CAS

  ```
  kubectl get cas cas -n osc-scone-system

  # format for cas-registered label: cas-registered-<cas-name>.<namespace>-<version>
  kubectl get nodes -L cas-registered-cas.osc-scone-system-5.9.0 \
    -L las.scontain.com/capable \
    -L las.scontain.com/ok
  ```
If you are running multiple CAS instances, verify that all instances are healthy and registered to the new worker pool.
Migrate CAS pod to a node in the newly created worker pool
Since CAS 5.9.0, CAS pods tolerate nodes tainted with node.kubernetes.io/unschedulable:NoSchedule, so simply cordoning/uncordoning the nodes in the origin worker pool is not enough. Instead, we taint the nodes in the origin worker pool and delete the cas pod to force it to reschedule on a node in the new temporary worker pool.
- Migrate cas pod to a node of the new temporary worker pool

  ```
  # taint nodes of the current worker pool
  kubectl taint node -l worker.gardener.cloud/pool=pool-worker \
    osc.tsystems.com/migrate-cas=true:NoSchedule

  # delete cas-0 pod to force it to reschedule on a node in the new temporary worker pool
  kubectl delete pod cas-0 -n osc-scone-system

  # wait for cas to become healthy again
  kubectl get cas cas -n osc-scone-system

  # check pod is running on a node of the new temporary worker pool
  kubectl get pods -n osc-scone-system -o wide | grep cas-0

  # remove the taint from nodes in origin worker pool
  kubectl taint node -l worker.gardener.cloud/pool=pool-worker \
    osc.tsystems.com/migrate-cas=true:NoSchedule-
  ```
Upgrade the origin worker pool to target version
Perform the upgrade on the original worker pool to the desired Kubernetes version or configuration. Ensure that the upgrade process does not disrupt the running CAS instances by keeping them on the temporary pool during this phase.
- Proceed with the update of the Shoot Cluster by following the steps in 08-Shoot-k8s-version-management.

  Wait for all the nodes in the pool-worker worker pool to be upgraded and healthy.

- Check if the worker nodes are healthy and running the desired version of Kubernetes

  ```
  kubectl get nodes -l worker.gardener.cloud/pool=pool-worker
  ```

- Check if CAS is healthy and all replaced worker nodes are registered to host the CAS

  ```
  kubectl get cas cas -n osc-scone-system

  # format for cas-registered label: cas-registered-<cas-name>.<namespace>-<version>
  kubectl get nodes -L cas-registered-cas.osc-scone-system-5.9.0 \
    -L las.scontain.com/capable \
    -L las.scontain.com/ok
  ```
If you are running multiple CAS instances, verify that all instances are healthy and all replaced nodes are registered to host a CAS in the origin worker pool.
Migrate CAS back to origin worker pool
Since CAS pods tolerate nodes tainted with node.kubernetes.io/unschedulable:NoSchedule, simply cordoning/uncordoning the nodes in the temporary worker pool is not enough. Instead, we taint the nodes in the temporary worker pool and delete the cas pod to force it to reschedule on a node in the upgraded origin worker pool.
- Migrate the cas pod to a node of the upgraded origin worker pool

  ```
  # taint nodes of the temporary worker pool
  kubectl taint node -l worker.gardener.cloud/pool=pool-tmp \
    osc.tsystems.com/migrate-cas=true:NoSchedule

  # delete cas-0 pod to force it to reschedule on a node in the updated origin worker pool
  kubectl delete pod cas-0 -n osc-scone-system

  # wait for cas to become healthy again
  kubectl get cas cas -n osc-scone-system

  # check pod is running on a node of the updated origin worker pool
  kubectl get pods -n osc-scone-system -o wide | grep cas-0

  # remove the taint from nodes in temporary worker pool
  kubectl taint node -l worker.gardener.cloud/pool=pool-tmp \
    osc.tsystems.com/migrate-cas=true:NoSchedule-
  ```
Delete temporary worker pool
Once the upgrade is complete and the original pool is stable, decommission the temporary worker pool to optimize resource usage and costs.
- Delete the temporary worker pool

  This can be done from within the Gardener dashboard, or remove the worker pool pool-tmp from the shoot manifest and save it. Wait for reconciliation to finish (a quick verification sketch follows below).
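To confirm the temporary pool is gone, you can for example check that no nodes with the pool label remain; this is just a verification sketch using the label shown earlier:

```bash
# should return no resources once the temporary worker pool has been removed
kubectl get nodes -l worker.gardener.cloud/pool=pool-tmp
```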
Upscale nodes in worker pool
Scale up the nodes in the original worker pool to the desired number of nodes, as it was before. This ensures that you have enough resources to handle your workloads after the upgrade.
- If there was a need to downscale the nodes in the original worker pool in order to create the new temporary worker pool, scale the nodes in the original worker pool back up to the number of nodes it had before (see the sketch below).
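A minimal sketch, mirroring the downscale example earlier; restore the minimum/maximum values the origin pool had before the migration (the numbers are only illustrative):

```yaml
# shoot manifest excerpt: restore the original node count of the origin worker pool
spec:
  provider:
    workers:
      - name: pool-worker
        minimum: 3   # back to the original value
        maximum: 6   # back to the original value
```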