High Availability on OSC Managed Kubernetes

Purpose

This document explains how high availability (HA) works on Open Sovereign Cloud (OSC) managed Kubernetes across multiple layers:

  • the OSC infrastructure and availability zones (AZs)
  • the Gardener-managed Shoot control plane
  • the Shoot worker-node layer
  • the application and pod layer
  • persistent storage behavior for stateful workloads

It is written for customer architects who need to design deployment concepts for business-critical or regulated workloads.

Executive summary

High availability in OSC must be understood as a layered concept.

A highly available Shoot control plane does not automatically mean that the application is highly available. Likewise, a multi-zone worker setup does not automatically mean that OSC can transparently shift failed-zone worker VMs into another zone.

In OSC, each selected Gardener worker zone maps to an AZ-local compute/VM domain. This means:

  • each zone is an intentional failure domain
  • worker capacity in one zone is provisioned independently and cannot transparently substitute for capacity in another zone
  • if an AZ-local compute or VM-cluster domain becomes unavailable, Gardener does not simply re-home the affected worker-node VMs into another zone
  • at the workload layer, Kubernetes can still recreate application replicas on healthy nodes in the remaining zones, but only if capacity, scheduling rules, and storage semantics allow it

At the same time, OSC persistent volumes are backed by storage distributed across all AZs. This is an important architectural characteristic:

  • a persistent volume is not tied to only one AZ from the workload perspective
  • a replacement pod in another healthy AZ can typically attach and use the same persistent volume there
  • this removes one of the common storage limitations seen on platforms with zone-local block storage

However, this still does not mean that every stateful workload is automatically active-active or zero-downtime:

  • the failed pod is replaced, not live-migrated
  • the volume may need detach/attach or recovery time
  • access mode semantics still matter
  • the application must tolerate restart or leader fail-over correctly

For critical workloads, architects must design HA on all relevant layers:

  • Shoot control plane
  • worker pools across AZs
  • pod placement and disruption handling
  • application replication or restart model
  • persistent-volume access mode and recovery behavior
  • fail-over capacity in the remaining healthy zones

HA dimensions in OSC

Platform-level HA

OSC itself is designed as a multi-cluster, multi-AZ platform. Core platform clusters distribute control-plane nodes across three availability zones in order to reduce blast radius and preserve quorum if one zone fails.

At this level, the objective is platform continuity: the regional services, APIs, networking, the storage control plane, and the Gardener hosting environment should remain available even when one AZ is impaired.

AZ-local compute / VM-cluster HA

For customer worker-node capacity, OSC uses AZ-local VM-cluster domains. A machine pool is represented by a separate VM-cluster per AZ.

This is important for architects:

  • the selected Gardener zones are not just labels; on OSC they correspond to concrete AZ-local infrastructure domains
  • each AZ-local VM-cluster is a failure domain from the worker-node provisioning perspective
  • local redundancy inside an AZ depends on the actual physical compute footprint in that AZ

In the OSC operating model described here, if a customer uses only one or two physical compute nodes in an AZ, that AZ should be treated as non-redundant from a local compute perspective. In other words, the zone may still be part of a multi-AZ design, but it is not locally fault tolerant.

If the AZ-local compute control plane or the AZ-local VM-cluster becomes unavailable, the entire compute pool of that AZ can become unavailable from Gardener's point of view.
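
As an illustration, a worker pool that spans several zones is declared in the Shoot specification roughly as sketched below. The Shoot name, zone names, and machine type are placeholders, not OSC values, and must be replaced with the values offered for the target region:

    apiVersion: core.gardener.cloud/v1beta1
    kind: Shoot
    metadata:
      name: example-shoot              # hypothetical name
    spec:
      provider:
        workers:
          - name: pool-ha
            minimum: 3                 # Gardener distributes the machines across the listed zones
            maximum: 6
            machine:
              type: std-4-16           # placeholder; use a machine type from the OSC cloud profile
            zones:                     # on OSC, each zone is backed by its own AZ-local VM-cluster
              - zone-a
              - zone-b
              - zone-c

Because each listed zone corresponds to its own AZ-local VM-cluster, the minimum and maximum counts should be sized so that the remaining zones can absorb the workload if one zone's pool becomes unavailable.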

Gardener Shoot control-plane HA

The Shoot control plane is separate from the worker nodes:

  • the Shoot control plane runs on the Seed
  • the worker nodes are VMs in the customer infrastructure domain

This separation is essential because Shoot control-plane HA and worker-node HA are configured and must be assessed independently of each other.
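
On the control-plane side, Gardener exposes an HA setting on the Shoot itself. A minimal sketch follows, assuming the underlying Seed supports zone-level failure tolerance; the options actually available depend on the Gardener version and the OSC offering:

    apiVersion: core.gardener.cloud/v1beta1
    kind: Shoot
    metadata:
      name: example-shoot              # hypothetical name
    spec:
      controlPlane:
        highAvailability:
          failureTolerance:
            type: zone                 # "node" tolerates node failures on the Seed; "zone" tolerates a Seed zone outage

This setting affects only the Shoot control plane running on the Seed; it changes nothing about the worker pools in the customer infrastructure domain.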

Persistent-volume HA semantics on OSC

For stateful workloads, OSC provides a major advantage: persistent volumes are backed by a storage platform distributed across all AZs.

For architects, this means:

  • a workload PV is not bound to the AZ in which the original pod was running
  • if a pod must be recreated in another AZ, the replacement pod can generally access the same PV from that AZ as well
  • a zone outage does not automatically make the PV unusable just because the original pod was located in the failed zone

This removes a common blocker for cross-AZ fail-over of stateful pods.

But it is important to understand what this does not imply:

  • it does not mean the same volume can be written from multiple AZs at the same time unless the access mode (for example ReadWriteMany) and the application design support that
  • it does not remove restart or fail-over time
  • it does not remove the need to validate application crash recovery, fencing, leader election, or journaling behavior

So OSC distributed PVs improve the fail-over posture for stateful workloads, but they do not eliminate the need for application architecture decisions.
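
To make the storage semantics concrete, the sketch below shows a claim for a stateful workload. The claim name is hypothetical, the storage class is deliberately left out because it is OSC-specific, and ReadWriteOnce is only an example access mode:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: db-data                    # hypothetical claim name
    spec:
      # storageClassName: ...          # set to the OSC storage class backing distributed volumes
      accessModes:
        - ReadWriteOnce                # mounted by one node at a time; re-attachable from another AZ after fail-over
      resources:
        requests:
          storage: 50Gi

With ReadWriteOnce, the volume is mounted on one node at a time, so a cross-AZ fail-over still implies detach, re-attach, and application recovery; concurrent multi-AZ access in the style of ReadWriteMany is only an option if both the storage class and the application support it.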

Workload HA inside the Shoot cluster

At the application layer, Kubernetes can replace failed pods, but only within the boundaries of:

  • healthy nodes
  • available resources
  • pod scheduling constraints
  • pod disruption budgets
  • storage access mode and attach behavior
  • application design

So the true availability of a critical service is determined by the combination of platform HA, worker HA, storage behavior, and workload architecture.
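
A minimal sketch of the workload-side building blocks discussed above, using hypothetical names: replicas spread across zones through a topology spread constraint, explicit resource requests so that replacement pods can be scheduled predictably, and a PodDisruptionBudget to limit voluntary disruptions:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: critical-api               # hypothetical workload
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: critical-api
      template:
        metadata:
          labels:
            app: critical-api
        spec:
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone   # spread replicas across AZs
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchLabels:
                  app: critical-api
          containers:
            - name: api
              image: registry.example.com/critical-api:1.0   # placeholder image
              resources:
                requests:
                  cpu: 250m
                  memory: 256Mi
    ---
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: critical-api-pdb
    spec:
      minAvailable: 2                  # keep at least two replicas available during voluntary disruptions
      selector:
        matchLabels:
          app: critical-api

None of this helps if the remaining zones lack free capacity, so the fail-over capacity planning described in the earlier sections still applies.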