OpenBao Deployment Guideline

Disclaimer — customer responsibility

This page is a non-binding guideline intended to help you get started with your own Key Management System for Volume Encryption. It is provided as-is, without warranty of any kind.

Deploying, operating, securing, upgrading, and backing up the KMS — including all key material it holds — is entirely your responsibility. T-Systems / Open Sovereign Cloud is neither responsible nor accountable for KMS deployments based on this guideline, and your KMS is not covered by OSC support or service level agreements.

OpenBao is a third-party open-source product; always consult the official OpenBao documentation for authoritative and up-to-date guidance.

OpenBao is the recommended first choice, but not the only one: volume encryption equally supports any KMIP-compatible KMS or HSM — the choice of key management stays yours.

This guideline deploys OpenBao as a 3-node High-Availability cluster spread across 3 availability zones inside your shoot cluster. It keeps serving through the loss of any one pod or one availability zone, planned or unplanned.

Mechanism	Setting	Effect
Integrated Storage (Raft)	3 replicas (odd)	Consensus quorum = 2; the cluster tolerates 1 node down. Data is replicated to every node.
Zone spread	`topologySpreadConstraints` on `topology.kubernetes.io/zone`	Exactly one replica per availability zone, so losing a whole zone removes only one Raft node — quorum survives.
Node spread	`podAntiAffinity` on `kubernetes.io/hostname`	No two replicas on the same node.
Voluntary-disruption guard	`PodDisruptionBudget` `maxUnavailable: 1`	Node drains and rolling upgrades evict at most one pod at a time, never breaking quorum.
Auto-unseal (recommended)	`seal` stanza	A restarted pod auto-rejoins and auto-unseals — hands-free recovery, not just survival.

To tolerate two simultaneous failures, use 5 replicas (quorum = 3). Keep the replica count odd (3, 5, 7): an even count raises the quorum size without tolerating any additional failure.
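
The quorum figures in the table and in the paragraph above follow from simple majority arithmetic. As a small illustration (the function names are made up for this sketch and are not part of OpenBao), the numbers can be reproduced in the shell:

```shell
# Majority quorum for a Raft cluster of n voters: floor(n/2) + 1.
raft_quorum() { echo $(( $1 / 2 + 1 )); }

# Number of voters that may fail while a majority still remains.
raft_tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 3 4 5; do
  echo "replicas=$n quorum=$(raft_quorum "$n") tolerated=$(raft_tolerated_failures "$n")"
done
```

For 3, 4 and 5 replicas this yields quorums of 2, 3 and 3 and tolerated failures of 1, 1 and 2, so a fourth replica adds cost without adding fault tolerance.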

What this protects

The key encryption key (KEK) lives only inside OpenBao (Transit key with exportable=false) — it never reaches a worker node or the Kubernetes API.
A dump of the cluster's Secrets yields only wrapped (encrypted) volume keys — useless without the KEK.
The driver's token can only wrap and unwrap with the one key — it cannot read the key, touch other keys, or administer OpenBao.
A holder of both the token and the wrapped-key Secrets could unwrap them until the token is revoked or expires. Treat the token Secret as sensitive, scope the token to least privilege, and rotate it.

Prerequisites

A shoot cluster with worker nodes in at least 3 availability zones (nodes labelled topology.kubernetes.io/zone).
kubectl and Helm 3 or later.
A private CA to issue the OpenBao server certificate — TLS is mandatory for production (step 8). Public or ACME certificate authorities cannot issue certificates for ClusterIPs or cluster-internal DNS names, so use the cert-manager extension with a self-signed CA (shown in step 8) or your own PKI.

Do not store OpenBao on encrypted volumes

OpenBao's own data volumes must use an unencrypted StorageClass (for example default). Using the encrypted StorageClass would create a circular dependency: the volumes could never be attached while OpenBao is down.

helm repo add openbao https://openbao.github.io/openbao-helm
helm repo update openbao
helm search repo openbao/openbao        # this guideline pins chart 0.28.4 / app v2.5.5

All bao commands below run inside a server pod. Define a helper once (it targets openbao-0; standby nodes forward to the leader):

bao() { kubectl -n openbao exec -i openbao-0 -- env BAO_ADDR="http://127.0.0.1:8200" BAO_TOKEN="$BAO_TOKEN" bao "$@"; }

After enabling TLS (step 8), the pods already point BAO_ADDR at https://…, but the CLI additionally needs the CA — change the helper's environment to BAO_CACERT="/openbao/userconfig/openbao-server-tls/ca.crt" (and drop the BAO_ADDR override).

1. Production values

Save as openbao-ha-values.yaml. The comments mark the two production knobs you wire in later: auto-unseal (step 3) and TLS (step 8).

global:
  # Bring-up only. TLS is MANDATORY for production — the connection carries
  # unwrapped volume keys. Enable it in step 8 before any productive use.
  tlsDisable: true

# The volume-encryption use case does not need the secrets sidecar injector.
injector:
  enabled: false

server:
  image:
    repository: openbao/openbao
    tag: "2.5.5"

  # Static-token auth only (step 7) — the Kubernetes auth delegator is unused.
  authDelegator:
    enabled: false

  # Create all three pods at once. A sealed pod is NotReady, and the chart default
  # (OrderedReady) would wait for each pod to become Ready before creating the
  # next — making the initial "unseal all three" loop in step 3 impossible.
  podManagementPolicy: Parallel

  ha:
    enabled: true
    replicas: 3                           # odd for quorum; tolerates 1 failure. Use 5 to tolerate 2.
    raft:
      enabled: true
      setNodeId: true                     # node_id = pod name (openbao-0/1/2)
      config: |
        ui = true

        listener "tcp" {
          address         = "[::]:8200"
          cluster_address = "[::]:8201"
          tls_disable     = true          # PRODUCTION: false + tls_cert_file/tls_key_file (step 8)
        }

        storage "raft" {
          path = "/openbao/data"
          # With TLS (step 8) change http to https AND add
          # `leader_ca_cert_file = "/openbao/userconfig/openbao-server-tls/ca.crt"`
          # to each stanza.
          retry_join { leader_api_addr = "http://openbao-0.openbao-internal:8200" }
          retry_join { leader_api_addr = "http://openbao-1.openbao-internal:8200" }
          retry_join { leader_api_addr = "http://openbao-2.openbao-internal:8200" }
        }
        # NOTE: Raft Autopilot cannot be set in this config — OpenBao silently
        # ignores an `autopilot` stanza here. Set it at runtime instead (step 4).

        service_registration "kubernetes" {}

        # Audit log on the persistent audit volume (auditStorage below).
        # Audit devices are configured declaratively here — OpenBao 2.5+
        # rejects the API-based `bao audit enable`.
        audit "file" "file" {
          options = {
            file_path = "/openbao/audit/audit.log"
          }
        }

        # PRODUCTION auto-unseal — uncomment ONE and configure BEFORE the first init
        # (step 3). Without it, every pod restart needs a manual unseal.
        # SECURITY: do not inline the PIN/token here (this config renders into a
        # ConfigMap) — inject it via environment variables from a Secret
        # (chart value `server.extraSecretEnvironmentVars`).
        # seal "pkcs11"  { lib = "/path/to/your-hsm-pkcs11.so" slot = "0" key_label = "openbao-unseal" }
        # seal "transit" { address = "https://bootstrap-bao:8200" key_name = "autounseal" mount_path = "transit" }

    disruptionBudget:
      enabled: true
      maxUnavailable: 1                    # never evict more than one voter at a time

  # Pin to ONE worker pool whose zones are the real availability zones
  # (see the note below). Adapt the pool name to your shoot.
  nodeSelector:
    worker.gardener.cloud/pool: <worker-pool-name>
  # One replica per zone (3 availability zones) ...
  topologySpreadConstraints: |
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: {{ include "openbao.name" . }}
          app.kubernetes.io/instance: {{ .Release.Name }}
          component: server
  # ... and never two on the same node.
  affinity: |
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: {{ include "openbao.name" . }}
              app.kubernetes.io/instance: {{ .Release.Name }}
              component: server
          topologyKey: kubernetes.io/hostname

  dataStorage:
    enabled: true
    size: 10Gi                            # Raft data; size for your usage
    storageClass: default                 # must be an UNENCRYPTED StorageClass

  # Persistent audit volume (step 10). Without it, audit logs are lost on restart.
  auditStorage:
    enabled: true
    size: 5Gi
    storageClass: default

  resources:
    requests: { cpu: 250m, memory: 256Mi }
    limits:   { memory: 512Mi }

  # Active and standby nodes report Ready; a sealed node does not. This drives the
  # PodDisruptionBudget and rolling updates correctly.
  readinessProbe:
    enabled: true
    path: "/v1/sys/health?standbyok=true"

  # With manual (Shamir) unseal, OnDelete lets you restart and unseal one pod at a
  # time so a rollout never seals the quorum. With auto-unseal, RollingUpdate is
  # fine — pods unseal themselves on restart.
  updateStrategyType: OnDelete

  standalone:
    enabled: false

Spread across real failure domains

topologySpreadConstraints only helps if each topology.kubernetes.io/zone value is a distinct physical availability zone. An additional worker pool (for example a GPU pool) can carry its own synthetic zone label while living in an existing zone — the scheduler would treat it as an extra zone and could place two Raft voters in one real zone. The nodeSelector above pins OpenBao to one worker pool whose zones are the real availability zones. Verify after the install with kubectl -n openbao get pods -o wide.

2. Install

helm install openbao openbao/openbao --version 0.28.4 -n openbao --create-namespace \
  -f openbao-ha-values.yaml
kubectl -n openbao get pods -o wide

All three pods come up Running but stay sealed and not Ready until they are initialized and unsealed (step 3); that is expected. Confirm the spread: one pod per zone, on three different nodes.
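
The spread check can be scripted rather than read off by eye. The following is a rough sketch, not part of the chart: pod_node_zone and check_spread are hypothetical helper names, and the label selector assumes the chart's default name openbao, matching the labels used in the values file.

```shell
# Prints one "pod node zone" line per OpenBao server pod.
# Assumption: the server pods carry the labels app.kubernetes.io/name=openbao
# and component=server (chart defaults; adapt if you overrode the name).
pod_node_zone() {
  kubectl -n openbao get pods -l app.kubernetes.io/name=openbao,component=server \
    -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName --no-headers |
  while read -r pod node; do
    zone=$(kubectl get node "$node" \
      -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}' </dev/null)
    echo "$pod $node $zone"
  done
}

# Reads "pod node zone" lines on stdin. Exits 0 only if no node and no zone
# hosts more than one pod; otherwise it names the shared node or zone.
check_spread() {
  awk '
    NF >= 3 { nodes[$2]++; zones[$3]++ }
    END {
      bad = 0
      for (n in nodes) if (nodes[n] > 1) { print "node shared: " n; bad = 1 }
      for (z in zones) if (zones[z] > 1) { print "zone shared: " z; bad = 1 }
      exit bad
    }'
}
```

Run it as `pod_node_zone | check_spread`; it stays silent and exits 0 when every pod has its own node and its own zone label. It only compares labels, so it cannot tell a synthetic zone from a real one.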

3. Initialize and unseal

Initialize once, on the first pod:

kubectl -n openbao exec openbao-0 -- bao operator init -key-shares=5 -key-threshold=3
# → 5 unseal keys + 1 initial root token.
# Store them OFFLINE, split across custodians. Never commit them anywhere.
export BAO_TOKEN=<initial-root-token>     # used only for setup (steps 4–7), revoked in step 7

Unseal every pod. Each node needs the threshold number of keys (3 of the 5 shares created above), and the Raft followers join here:

KEY1=<unseal-key-1>; KEY2=<unseal-key-2>; KEY3=<unseal-key-3>
for p in openbao-0 openbao-1 openbao-2; do
  for k in "$KEY1" "$KEY2" "$KEY3"; do kubectl -n openbao exec "$p" -- bao operator unseal "$k"; done
done

Auto-unseal (strongly recommended)

With manual unseal, a restarted pod stays sealed until a human unseals it — the cluster survives a failure but does not self-heal. For hands-free recovery, configure a seal stanza (step 1) before the first bao operator init:

seal "pkcs11" — an HSM or PKCS#11 token.
seal "transit" — a separate, already-running OpenBao/Vault instance (do not point it at this cluster — that is circular).

Switching an already-initialized cluster to auto-unseal is a seal migration with brief downtime — follow the OpenBao documentation exactly.

4. Verify the Raft cluster

bao operator raft list-peers      # 3 peers, State=leader/follower, Voter=true for all three
bao status                        # Sealed=false, HA Mode=active/standby
# Autopilot is configured at runtime (it cannot be set in the config file):
bao operator raft autopilot set-config -cleanup-dead-servers=true -min-quorum=3 -dead-server-last-contact-threshold=10m
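
For scripted checks, the peer list can be reduced to two facts: how many voters there are and whether exactly one leader exists. A small sketch, assuming the column order Node, Address, State, Voter that the State and Voter fields above suggest; the function names are hypothetical:

```shell
# Counts the peers whose Voter column (4th field) is "true".
count_voters() { awk '$4 == "true" { n++ } END { print n + 0 }'; }

# Exits 0 only if exactly one peer reports State (3rd field) "leader".
has_single_leader() { awk '$3 == "leader" { n++ } END { exit !(n == 1) }'; }
```

Both read the peer list on stdin, for example `bao operator raft list-peers | count_voters`, which should print 3 for this cluster.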

5. Enable Transit and create the key encryption key

bao secrets enable transit
bao write -f transit/keys/pvc-kek        # the KEK; keep exportable=false

Enable automatic key rotation (Transit keeps old key versions, so previously wrapped volume keys still decrypt):

bao write transit/keys/pvc-kek/config auto_rotate_period=2160h min_decryption_version=1   # ~90 days

6. Least-privilege policy

Grant only wrap and unwrap on the one key — nothing else:

cat <<'EOF' | bao policy write pvc-enc -
path "transit/encrypt/pvc-kek" { capabilities = ["update"] }
path "transit/decrypt/pvc-kek" { capabilities = ["update"] }
EOF
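
In Transit terms, wrap and unwrap are the encrypt and decrypt endpoints, which is all this policy allows. The round trip below is an illustrative sketch of those two calls, not the storage driver's actual implementation; kek_roundtrip is a hypothetical name, and it relies on the bao helper defined earlier.

```shell
# Wraps a throwaway value with pvc-kek, unwraps it again, and compares.
# Transit expects base64-encoded plaintext and returns base64 on decrypt.
kek_roundtrip() {
  local msg ct pt
  msg="roundtrip-check"
  ct=$(bao write -field=ciphertext transit/encrypt/pvc-kek \
        plaintext="$(printf '%s' "$msg" | base64)") || return 1
  pt=$(bao write -field=plaintext transit/decrypt/pvc-kek \
        ciphertext="$ct" | base64 -d) || return 1
  [ "$pt" = "$msg" ]
}
```

Once the driver token from step 7 exists, running `kek_roundtrip && echo ok` with BAO_TOKEN set to that token is one way to confirm that the policy grants enough, and a read of transit/keys/pvc-kek with the same token should be denied.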

7. Create the driver token

Mint a least-privilege, periodic token bound to the policy (the root token is used for setup only) and store it in the Secret referenced by your encrypted StorageClass:

bao token create -policy=pvc-enc -period=24h -display-name=csi-pvc-enc -orphan

kubectl -n kube-system create secret generic kms-token \
  --from-literal=token=<the-periodic-token>
# with TLS (step 8), additionally add the CA: --from-literal=ca.crt="$(cat openbao-ca.crt)"

Keep the token alive

A periodic token stays valid indefinitely, but only if it is renewed within each period. Renewal happens in OpenBao — the token value and the Kubernetes Secret stay unchanged, and no volume Secrets are involved. A minimal renewer:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: kms-token-renew
  namespace: kube-system
spec:
  schedule: "0 */8 * * *"                  # three times per 24h period
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: renew
              image: openbao/openbao:2.5.5
              command: ["bao", "token", "renew"]
              env:
                - name: BAO_ADDR
                  value: "http://<openbao-active-clusterip>:8200"
                - name: BAO_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: kms-token
                      key: token

If the token does expire: no data is lost and running Pods keep working, but attaching, rescheduling, and resizing encrypted volumes fail. Mint a new token and update the kms-token Secret to recover.

Revoke the root token

After setup, revoke the root token and use a scoped operator token for day-2 operations (step 10):

cat <<'EOF' | bao policy write bao-ops -
path "sys/storage/raft/snapshot"                { capabilities = ["read"] }
path "sys/storage/raft/snapshot-force"          { capabilities = ["create", "update"] }
path "sys/storage/raft/configuration"           { capabilities = ["read"] }
path "sys/storage/raft/autopilot/configuration" { capabilities = ["read", "update"] }
path "sys/audit"                                { capabilities = ["read", "sudo"] }   # verify audit devices (read requires sudo)
path "transit/encrypt/pvc-kek"                  { capabilities = ["update"] }   # disruption-test probe (step 10)
EOF
bao token create -policy=bao-ops -period=72h -orphan -display-name=bao-ops
bao token revoke <initial-root-token>      # then: export BAO_TOKEN=<ops-token>

Recovering root access

After the root token is revoked, admin operations (like changing policies) need a new root token. OpenBao disables the classic unauthenticated generate-root endpoints by default for security reasons, and the bao operator generate-root command of OpenBao 2.5 still targets them — it fails with "unsupported operation". To recover root access with your unseal keys, temporarily set disable_unauthed_generate_root_endpoints = false in the server configuration (restart required), run bao operator generate-root, and revert the setting afterwards. Keep the ops token renewed so this stays a rare event.

8. Network, endpoint and TLS

The storage driver cannot resolve cluster-internal DNS — use a Service ClusterIP as the encryptionKMSEndpoint, never a DNS name.
Use the openbao-active Service, not the plain openbao one: it always points at the current Raft leader and never routes to a sealed node.

TLS — mandatory for production

Every unwrap request returns the plaintext key of a volume over this connection — running the KMS on plain HTTP is acceptable for a first functional bring-up, but not an option for production. Enable TLS before the first encrypted volume holds real data.

The server certificate must cover the openbao-active ClusterIP (IP SAN), the pod DNS names openbao-{0,1,2}.openbao-internal (Raft join), and 127.0.0.1 (in-pod CLI). No public or ACME certificate authority can issue such a certificate, so create a private CA. With the cert-manager extension, bootstrap a self-signed CA (long-lived, here 10 years) and issue the server certificate from it (here 5 years):

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned
  namespace: openbao
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: openbao-ca
  namespace: openbao
spec:
  isCA: true
  commonName: openbao-ca
  duration: 87600h                    # 10 years
  secretName: openbao-ca
  issuerRef:
    name: selfsigned
    kind: Issuer
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: openbao-ca
  namespace: openbao
spec:
  ca:
    secretName: openbao-ca

Look up the ClusterIP the certificate must contain, then issue the server certificate:

kubectl -n openbao get svc openbao-active -o jsonpath='{.spec.clusterIP}'

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: openbao-server-tls
  namespace: openbao
spec:
  secretName: openbao-server-tls
  duration: 43800h                    # 5 years
  dnsNames:
    - openbao-0.openbao-internal
    - openbao-1.openbao-internal
    - openbao-2.openbao-internal
  ipAddresses:
    - <openbao-active-clusterip>
    - 127.0.0.1
  issuerRef:
    name: openbao-ca
    kind: Issuer

Enable TLS in the chart

In openbao-ha-values.yaml: set global.tlsDisable: false, mount the certificate Secret, and update the listener and the Raft retry_join stanzas:

server:
  extraVolumes:
    - type: secret
      name: openbao-server-tls        # mounted at /openbao/userconfig/openbao-server-tls

listener "tcp" {
  address         = "[::]:8200"
  cluster_address = "[::]:8201"
  tls_disable     = false
  tls_cert_file   = "/openbao/userconfig/openbao-server-tls/tls.crt"
  tls_key_file    = "/openbao/userconfig/openbao-server-tls/tls.key"
}

storage "raft" {
  path = "/openbao/data"
  retry_join {
    leader_api_addr     = "https://openbao-0.openbao-internal:8200"
    leader_ca_cert_file = "/openbao/userconfig/openbao-server-tls/ca.crt"
  }
  # ... same for openbao-1 / openbao-2
}

Apply with helm upgrade, then restart the pods one at a time (see Upgrades). The endpoint becomes https://<openbao-active-clusterip>:8200 — update the StorageClass accordingly and hand the CA to the driver via the kms-token Secret:

kubectl -n openbao get secret openbao-server-tls -o jsonpath='{.data.ca\.crt}' | base64 -d > openbao-ca.crt
kubectl -n kube-system create secret generic kms-token \
  --from-literal=token=<the-periodic-token> \
  --from-file=ca.crt=openbao-ca.crt \
  --dry-run=client -o yaml | kubectl apply -f -

The token-renewal CronJob (step 7) then also needs BAO_ADDR=https://… and the CA: mount the ca.crt key of the kms-token Secret and point the BAO_CACERT environment variable at it.

Certificate lifetimes and rotation

cert-manager renews the certificate Secret automatically before expiry, but OpenBao only reads the certificate files at startup. After a renewal, perform a controlled rolling restart, one pod at a time: with manual unseal, unseal each restarted pod again; with auto-unseal it is hands-free.

With a 5-year server certificate this is a rare, plannable event. Shortening the lifetime (for example to 1 year) is a good hardening step as long as you operate that controlled restart process; pair short lifetimes with auto-unseal.

One more trigger to plan for: rotating the CA itself requires updating the ca.crt in the kms-token Secret.

A hard rule: never delete and recreate the openbao-active Service. Its ClusterIP is baked into the server certificate and into every encrypted volume at creation time, and existing volumes cannot follow an endpoint change.

9. Wire it to the StorageClass

Use the openbao-active ClusterIP as the endpoint and pvc-kek as the key — see Volume Encryption for the full StorageClass example. The endpoint protocol must match the listener: http:// while tlsDisable=true, https:// once TLS is enabled.
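
Because the endpoint must be an IP address and its scheme must follow the listener, a small guard can catch both mistakes before they end up in a StorageClass. This is a sketch under two assumptions: the ClusterIP is IPv4, and the port is 8200 as in the listener configuration. kms_endpoint is a hypothetical helper.

```shell
# Builds the KMS endpoint URL from a scheme and a ClusterIP.
# Assumption: IPv4 only; anything that is not a dotted quad (for example a
# DNS name) is rejected. Octet ranges are not validated.
kms_endpoint() {
  local scheme="$1" ip="$2"
  case "$scheme" in
    http|https) ;;
    *) echo "scheme must be http or https: $scheme" >&2; return 1 ;;
  esac
  if ! printf '%s' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "endpoint must be a ClusterIP, not a DNS name: $ip" >&2
    return 1
  fi
  echo "$scheme://$ip:8200"
}
```

A possible call is `kms_endpoint https "$(kubectl -n openbao get svc openbao-active -o jsonpath='{.spec.clusterIP}')"`, using http instead while tlsDisable is still true.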

10. Operations

Disruption test

Prove the setup tolerates the loss of one pod — kill the leader, the hardest case:

VICTIM=$(bao operator raft list-peers | awk '$3=="leader"{print $1}')
SURVIVOR=$(for p in openbao-0 openbao-1 openbao-2; do [ "$p" != "$VICTIM" ] && echo "$p" && break; done)
kubectl -n openbao delete pod "$VICTIM" --wait=false
# run against the surviving pod: Transit keeps serving, a new leader is elected
# (BAO_CACERT below assumes TLS from step 8 is already enabled)
kubectl -n openbao exec "$SURVIVOR" -- env BAO_CACERT=/openbao/userconfig/openbao-server-tls/ca.crt BAO_TOKEN="$BAO_TOKEN" \
  bao write transit/encrypt/pvc-kek plaintext="$(echo test | base64)"

With manual unseal, unseal the restarted pod to restore full redundancy; with auto-unseal it self-heals.
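
Leader election takes a moment, so the first request after the kill may fail before a new leader answers. To make the test less timing-sensitive, the probe can be retried. A generic sketch; retry_until_ok is a hypothetical helper, not an OpenBao or kubectl feature:

```shell
# Runs the given command up to max_tries times, pausing one second between
# attempts. Prints the number of failed attempts before the first success;
# returns 1 if every attempt failed.
retry_until_ok() {
  local max_tries="$1" tries=0
  shift
  while ! "$@" >/dev/null 2>&1; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
      echo "gave up after $tries failed attempts" >&2
      return 1
    fi
    sleep 1
  done
  echo "$tries"
}
```

Prefix the probe command from the test with `retry_until_ok 30`; the printed number is a rough indication of how long Transit was unavailable during the failover.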

Backups

bao operator raft snapshot save /tmp/bao.snap
kubectl -n openbao cp openbao-0:/tmp/bao.snap ./bao-$(date +%F).snap   # store off-cluster, on a schedule

Danger

Losing the OpenBao storage (the KEK) means losing every encrypted volume. Snapshot on a schedule, store the snapshots off-cluster, and test the restore procedure.
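
A scheduled backup only helps if someone notices when it stops producing snapshots. As a minimal sketch (snapshot_fresh is a hypothetical helper, and find -mmin is assumed to be available, as it is in GNU and BSD find), a check like this could run wherever the snapshots are stored:

```shell
# Exits 0 only if the snapshot file is non-empty and was modified within
# roughly the last max_age_min minutes (default 1440, about one day).
snapshot_fresh() {
  local file="$1" max_age_min="${2:-1440}"
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi
  # find prints the path only if the file is older than max_age_min minutes.
  if [ -n "$(find "$file" -mmin +"$max_age_min")" ]; then
    echo "stale: $file" >&2
    return 1
  fi
}
```

For example, `snapshot_fresh "./bao-$(date +%F).snap"` fails if today's snapshot is missing or empty. It says nothing about whether the snapshot restores, so a real restore test is still needed.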

Audit

The audit device is configured declaratively in the server configuration (step 1) and writes to the persistent auditStorage volume — OpenBao 2.5+ does not allow enabling audit devices via the API. Every wrap and unwrap call is logged on the active node at /openbao/audit/audit.log. Verify and ship the file off-cluster:

bao audit list                # requires the sudo capability on sys/audit (bao-ops policy)
kubectl -n openbao exec <active-pod> -- tail /openbao/audit/audit.log

Upgrades

With updateStrategyType: OnDelete (manual unseal), roll one pod at a time: delete a pod, wait for the replacement to reach Running, unseal it, confirm it is Ready, and only then move on to the next. With auto-unseal, switch to RollingUpdate and let the StatefulSet controller roll them.

Production checklist

Availability

3 Raft replicas — one per availability zone, no two on the same node
PodDisruptionBudget with maxUnavailable: 1
Auto-unseal configured before the first initialization
Disruption test passed — one pod killed, volumes still attach

Security

TLS enabled with a private CA (~10 years) and a server certificate (~5 years, or shorter with a controlled restart process)
CA handed to the driver via the ca.crt key of the kms-token Secret
Key encryption key not exportable, automatic rotation enabled
Driver token least-privilege (wrap and unwrap only) and periodic, with a renewer or a rotation process
Root token revoked after setup

Operations

StorageClass endpoint is the openbao-active Service ClusterIP, protocol matching the listener (https:// with TLS)
Audit device configured declaratively in the server config (step 1), log on a persistent volume, shipped off-cluster
Raft snapshots scheduled, stored off-cluster, restore tested