OpenBao Deployment Guideline
Disclaimer — customer responsibility
This page is a non-binding guideline intended to help you get started with your own Key Management System for Volume Encryption. It is provided as-is, without warranty of any kind.
Deploying, operating, securing, upgrading, and backing up the KMS — including all key material it holds — is entirely your responsibility. T-Systems / Open Sovereign Cloud is neither responsible nor accountable for KMS deployments based on this guideline, and your KMS is not covered by OSC support or service level agreements.
OpenBao is a third-party open-source product; always consult the official OpenBao documentation for authoritative and up-to-date guidance.
OpenBao is the recommended first choice, but not the only one: volume encryption equally supports any KMIP-compatible KMS or HSM — the choice of key management stays yours.
This guideline deploys OpenBao as a 3-node High-Availability cluster spread across 3 availability zones inside your shoot cluster. It keeps serving through the loss of any one pod or one availability zone, planned or unplanned.
| Mechanism | Setting | Effect |
|---|---|---|
| Integrated Storage (Raft) | 3 replicas (odd) | Consensus quorum = 2; the cluster tolerates 1 node down. Data is replicated to every node. |
| Zone spread | topologySpreadConstraints on topology.kubernetes.io/zone | Exactly one replica per availability zone, so losing a whole zone removes only one Raft node; quorum survives. |
| Node spread | podAntiAffinity on kubernetes.io/hostname | No two replicas on the same node. |
| Voluntary-disruption guard | PodDisruptionBudget maxUnavailable: 1 | Node drains and rolling upgrades evict at most one pod at a time, never breaking quorum. |
| Auto-unseal (recommended) | seal stanza | A restarted pod auto-rejoins and auto-unseals: hands-free recovery, not just survival. |
To tolerate two simultaneous failures, use 5 replicas (quorum = 3). The replica count must stay odd (3, 5, 7) for a stable quorum.
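You can check the quorum arithmetic behind the table with a few lines of shell. This is a sketch: raft_quorum is a hypothetical helper. It applies quorum = floor(n/2) + 1 and reports how many voters may fail.

```shell
# Hypothetical helper: Raft quorum size and tolerated failures for n voters.
# quorum = floor(n/2) + 1; tolerated failures = n - quorum.
raft_quorum() {
  local n=$1
  local quorum=$(( n / 2 + 1 ))
  echo "voters=$n quorum=$quorum tolerates=$(( n - quorum ))"
}

for n in 3 4 5 7; do raft_quorum "$n"; done
# voters=3 quorum=2 tolerates=1
# voters=4 quorum=3 tolerates=1
# voters=5 quorum=3 tolerates=2
# voters=7 quorum=4 tolerates=3
```

With 4 voters the quorum rises to 3 while the tolerance stays at 1. An even replica count therefore adds load without adding resilience.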
What this protects
- The key encryption key (KEK) lives only inside OpenBao (Transit key with exportable=false); it never reaches a worker node or the Kubernetes API.
- A dump of the cluster's Secrets yields only wrapped (encrypted) volume keys, which are useless without the KEK.
- The driver's token can only wrap and unwrap with the one key — it cannot read the key, touch other keys, or administer OpenBao.
- A holder of both the token and the wrapped-key Secrets could unwrap them until the token is revoked or expires. Treat the token Secret as sensitive, scope it least-privilege, and rotate it.
Prerequisites
- A shoot cluster with worker nodes in at least 3 availability zones (nodes labelled topology.kubernetes.io/zone).
- kubectl and Helm 3 or later.
- A private CA to issue the OpenBao server certificate: TLS is mandatory for production (step 8). Public or ACME certificate authorities cannot issue certificates for ClusterIPs or cluster-internal DNS names, so use the cert-manager extension with a self-signed CA (shown in step 8) or your own PKI.
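You can check before installing whether the shoot really spans three zones. This is a sketch: count_zones is a hypothetical helper.

- It reads one zone label value per line and fails if there are fewer than three.
- Empty lines, which come from nodes without the label, are ignored.
- It counts label values only, so a pool with a synthetic zone label would still inflate the result.

```shell
# Hypothetical helper: count distinct, non-empty zone values read from stdin
# (one per line) and fail if fewer than 3 are present.
count_zones() {
  local zones
  zones=$(grep -v '^[[:space:]]*$' | sort -u | wc -l | tr -d ' ')
  echo "$zones"
  [ "$zones" -ge 3 ]
}

# Usage (one line per node; nodes without the label yield empty lines):
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}' \
#     | count_zones || echo "fewer than 3 availability zones" >&2
```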
Do not store OpenBao on encrypted volumes
OpenBao's own data volumes must use an unencrypted StorageClass
(for example default).
Using the encrypted StorageClass would create a circular dependency:
the volumes could never be attached while OpenBao is down.
helm repo add openbao https://openbao.github.io/openbao-helm
helm repo update openbao
helm search repo openbao/openbao # this guideline pins chart 0.28.4 / app v2.5.5
All bao commands below run inside a server pod.
Define a helper once (it targets openbao-0; standby nodes forward to the leader):
bao() { kubectl -n openbao exec -i openbao-0 -- env BAO_ADDR="http://127.0.0.1:8200" BAO_TOKEN="$BAO_TOKEN" bao "$@"; }
After enabling TLS (step 8), the pods already point BAO_ADDR at https://…,
but the CLI additionally needs the CA —
change the helper's environment to
BAO_CACERT="/openbao/userconfig/openbao-server-tls/ca.crt"
(and drop the BAO_ADDR override).
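Written out, the TLS variant of the helper could look like the sketch below, which follows the paragraph above:

- The pod's own https:// BAO_ADDR is kept.
- Only BAO_CACERT and BAO_TOKEN are passed in. BAO_CACERT is the path where the chart mounts the openbao-server-tls Secret.

```shell
# TLS variant of the helper: keep the pod's own https:// BAO_ADDR and
# pass the CA bundle mounted from the openbao-server-tls Secret.
bao() {
  kubectl -n openbao exec -i openbao-0 -- env \
    BAO_CACERT="/openbao/userconfig/openbao-server-tls/ca.crt" \
    BAO_TOKEN="$BAO_TOKEN" \
    bao "$@"
}
```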
1. Production values
Save as openbao-ha-values.yaml.
The comments mark the two production knobs (auto-unseal, TLS)
that you wire in steps 3 and 8.
global:
  # Bring-up only. TLS is MANDATORY for production: the connection carries
  # unwrapped volume keys. Enable it in step 8 before any productive use.
  tlsDisable: true
# The volume-encryption use case does not need the secrets sidecar injector.
injector:
  enabled: false
server:
  image:
    repository: openbao/openbao
    tag: "2.5.5"
  # Static-token auth only (step 7); the Kubernetes auth delegator is unused.
  authDelegator:
    enabled: false
  # Create all three pods at once. A sealed pod is NotReady, and the chart default
  # (OrderedReady) would wait for each pod to become Ready before creating the
  # next, making the initial "unseal all three" loop in step 3 impossible.
  podManagementPolicy: Parallel
  ha:
    enabled: true
    replicas: 3 # odd for quorum; tolerates 1 failure. Use 5 to tolerate 2.
    raft:
      enabled: true
      setNodeId: true # node_id = pod name (openbao-0/1/2)
      config: |
        ui = true
        listener "tcp" {
          address         = "[::]:8200"
          cluster_address = "[::]:8201"
          tls_disable     = true # PRODUCTION: false + tls_cert_file/tls_key_file (step 8)
        }
        storage "raft" {
          path = "/openbao/data"
          # With TLS (step 8) change http to https AND add
          # `leader_ca_cert_file = "/openbao/userconfig/openbao-server-tls/ca.crt"`
          # to each stanza.
          retry_join { leader_api_addr = "http://openbao-0.openbao-internal:8200" }
          retry_join { leader_api_addr = "http://openbao-1.openbao-internal:8200" }
          retry_join { leader_api_addr = "http://openbao-2.openbao-internal:8200" }
        }
        # NOTE: Raft Autopilot cannot be set in this config: OpenBao silently
        # ignores an `autopilot` stanza here. Set it at runtime instead (step 4).
        service_registration "kubernetes" {}
        # Audit log on the persistent audit volume (auditStorage below).
        # Audit devices are configured declaratively here; OpenBao 2.5+
        # rejects the API-based `bao audit enable`.
        audit "file" "file" {
          options = {
            file_path = "/openbao/audit/audit.log"
          }
        }
        # PRODUCTION auto-unseal: uncomment ONE and configure BEFORE the first init
        # (step 3). Without it, every pod restart needs a manual unseal.
        # SECURITY: do not inline the PIN/token here (this config renders into a
        # ConfigMap); inject it via environment variables from a Secret
        # (chart value `server.extraSecretEnvironmentVars`).
        # seal "pkcs11" {
        #   lib       = "/path/to/your-hsm-pkcs11.so"
        #   slot      = "0"
        #   key_label = "openbao-unseal"
        # }
        # seal "transit" {
        #   address    = "https://bootstrap-bao:8200"
        #   key_name   = "autounseal"
        #   mount_path = "transit"
        # }
    disruptionBudget:
      enabled: true
      maxUnavailable: 1 # never evict more than one voter at a time
  # Pin to ONE worker pool whose zones are the real availability zones
  # (see the note below). Adapt the pool name to your shoot.
  nodeSelector:
    worker.gardener.cloud/pool: <worker-pool-name>
  # One replica per zone (3 availability zones) ...
  topologySpreadConstraints: |
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: {{ include "openbao.name" . }}
          app.kubernetes.io/instance: {{ .Release.Name }}
          component: server
  # ... and never two on the same node.
  affinity: |
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: {{ include "openbao.name" . }}
              app.kubernetes.io/instance: {{ .Release.Name }}
              component: server
          topologyKey: kubernetes.io/hostname
  dataStorage:
    enabled: true
    size: 10Gi # Raft data; size for your usage
    storageClass: default # must be an UNENCRYPTED StorageClass
  # Persistent audit volume (step 10). Without it, audit logs are lost on restart.
  auditStorage:
    enabled: true
    size: 5Gi
    storageClass: default
  resources:
    requests: { cpu: 250m, memory: 256Mi }
    limits: { memory: 512Mi }
  # Active and standby nodes report Ready; a sealed node does not. This drives the
  # PodDisruptionBudget and rolling updates correctly.
  readinessProbe:
    enabled: true
    path: "/v1/sys/health?standbyok=true"
  # With manual (Shamir) unseal, OnDelete lets you restart and unseal one pod at a
  # time so a rollout never seals the quorum. With auto-unseal, RollingUpdate is
  # fine: pods unseal themselves on restart.
  updateStrategyType: OnDelete
  standalone:
    enabled: false
Spread across real failure domains
topologySpreadConstraints only helps if each topology.kubernetes.io/zone
value is a distinct physical availability zone.
An additional worker pool (for example a GPU pool) can carry its own
synthetic zone label while living in an existing zone —
the scheduler would treat it as an extra zone
and could place two Raft voters in one real zone.
The nodeSelector above pins OpenBao to one worker pool
whose zones are the real availability zones.
Verify after the install with kubectl -n openbao get pods -o wide.
2. Install
helm install openbao openbao/openbao --version 0.28.4 -n openbao --create-namespace \
-f openbao-ha-values.yaml
kubectl -n openbao get pods -o wide
All three pods come up Running but sealed, and they stay not Ready until they are initialized and unsealed (step 3); that is expected. Confirm the spread is one pod per zone on three different nodes.
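kubectl get pods -o wide shows each pod's node but not its zone. The following sketch joins the two:

- check_spread is a hypothetical helper.
- The label selector component=server is the pod label already used in the affinity rules of step 1.

```shell
# Hypothetical helper: read "pod node zone" lines on stdin and fail if two
# pods share a node or a zone.
check_spread() {
  awk '
    NF >= 3 {
      pods++
      if (node[$2]++) dupn = dupn " " $2
      if (zone[$3]++) dupz = dupz " " $3
    }
    END {
      if (dupn != "") print "shared node:" dupn
      if (dupz != "") print "shared zone:" dupz
      if (dupn != "" || dupz != "") exit 1
      print pods + 0 " pods on distinct nodes and zones"
    }'
}

# Usage: map each server pod to its node and the node's zone label.
#   kubectl -n openbao get pods -l component=server \
#     -o jsonpath='{range .items[*]}{.metadata.name} {.spec.nodeName}{"\n"}{end}' |
#   while read -r pod node; do
#     zone=$(kubectl get node "$node" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')
#     echo "$pod $node $zone"
#   done | check_spread
```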
3. Initialize and unseal
Initialize once, on the first pod:
kubectl -n openbao exec openbao-0 -- bao operator init -key-shares=5 -key-threshold=3
# → 5 unseal keys + 1 initial root token.
# Store them OFFLINE, split across custodians. Never commit them anywhere.
export BAO_TOKEN=<initial-root-token> # used only for setup (steps 4–7), revoked in step 7
Unseal every pod — each node needs the threshold of keys, and the Raft followers join here:
KEY1=<unseal-key-1>; KEY2=<unseal-key-2>; KEY3=<unseal-key-3>
for p in openbao-0 openbao-1 openbao-2; do
for k in "$KEY1" "$KEY2" "$KEY3"; do kubectl -n openbao exec "$p" -- bao operator unseal "$k"; done
done
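Before continuing, confirm that every pod is unsealed. This sketch rests on two assumptions:

- bao status keeps the exit-code convention of the Vault CLI it descends from: 0 unsealed, 2 sealed, 1 error.
- kubectl exec passes the remote exit code through.

check_unsealed is a hypothetical name.

```shell
# Report the seal state of each server pod from the exit code of `bao status`
# (assumed convention: 0 = unsealed, 2 = sealed, anything else = error).
check_unsealed() {
  local p rc failed=0
  for p in openbao-0 openbao-1 openbao-2; do
    rc=0
    kubectl -n openbao exec "$p" -- bao status >/dev/null 2>&1 || rc=$?
    case $rc in
      0) echo "$p: unsealed" ;;
      2) echo "$p: SEALED"; failed=1 ;;
      *) echo "$p: error (exit $rc)"; failed=1 ;;
    esac
  done
  return "$failed"
}
```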
Auto-unseal (strongly recommended)
With manual unseal, a restarted pod stays sealed until a human unseals it —
the cluster survives a failure but does not self-heal.
For hands-free recovery, configure a seal stanza (step 1)
before the first bao operator init:
- seal "pkcs11": an HSM or PKCS#11 token.
- seal "transit": a separate, already-running OpenBao/Vault instance (do not point it at this cluster; that would be circular).
Switching an already-initialized cluster to auto-unseal is a seal migration with brief downtime — follow the OpenBao documentation exactly.
4. Verify the Raft cluster
bao operator raft list-peers # 3 peers, State=leader/follower, Voter=true for all three
bao status # Sealed=false, HA Mode=active/standby
# Autopilot is configured at runtime (it cannot be set in the config file):
bao operator raft autopilot set-config -cleanup-dead-servers=true -min-quorum=3 -dead-server-last-contact-threshold=10m
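You can also script the peer check, for example in a monitoring job. This sketch uses a hypothetical helper, check_peers:

- It parses the list-peers table: columns Node, Address, State and Voter after two header lines. The disruption test in step 10 relies on the same layout.
- It requires exactly one leader and the expected number of voters.

```shell
# Hypothetical check: read `bao operator raft list-peers` table output on stdin,
# expect exactly one leader and the given number of voters (default 3).
check_peers() {
  local want=${1:-3}
  awk -v want="$want" '
    NR > 2 && NF >= 4 {
      if ($3 == "leader") leaders++
      if ($4 == "true") voters++
    }
    END {
      printf "leaders=%d voters=%d\n", leaders, voters
      ok = (leaders == 1 && voters == want)
      exit (ok ? 0 : 1)
    }'
}

# Usage: bao operator raft list-peers | check_peers 3
```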
5. Enable Transit and create the key encryption key
bao secrets enable transit
bao write -f transit/keys/pvc-kek # the KEK; keep exportable=false
Enable automatic key rotation (Transit keeps old key versions, so previously wrapped volume keys still decrypt):
bao write transit/keys/pvc-kek/config auto_rotate_period=2160h min_decryption_version=1 # ~90 days
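A quick wrap/unwrap round trip confirms the key works before any volume depends on it. This sketch uses the bao helper:

- -field selects a single response field.
- Transit expects base64 plaintext.
- transit_roundtrip is a hypothetical name.

Run it with the setup token, which is the root token until step 7.

```shell
# Encrypt a test string with pvc-kek, decrypt it again and compare.
# Transit takes and returns the plaintext base64-encoded.
transit_roundtrip() {
  local input=${1:-roundtrip-test} ct out
  ct=$(bao write -field=ciphertext transit/encrypt/pvc-kek \
        plaintext="$(printf '%s' "$input" | base64)") || return 1
  out=$(bao write -field=plaintext transit/decrypt/pvc-kek \
        ciphertext="$ct" | base64 -d) || return 1
  if [ "$out" = "$input" ]; then
    echo ok
  else
    echo "round trip mismatch" >&2
    return 1
  fi
}
```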
6. Least-privilege policy
Grant only wrap and unwrap on the one key — nothing else:
cat <<'EOF' | bao policy write pvc-enc -
path "transit/encrypt/pvc-kek" { capabilities = ["update"] }
path "transit/decrypt/pvc-kek" { capabilities = ["update"] }
EOF
7. Create the driver token
Mint a least-privilege, periodic token bound to the policy (the root token is used for setup only) and store it in the Secret referenced by your encrypted StorageClass:
bao token create -policy=pvc-enc -period=24h -display-name=csi-pvc-enc -orphan
kubectl -n kube-system create secret generic kms-token \
--from-literal=token=<the-periodic-token>
# with TLS (step 8), additionally add the CA: --from-literal=ca.crt="$(cat openbao-ca.crt)"
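Instead of copying the token by hand, you can combine the two commands above with a capability check, so that a mis-scoped token never reaches the Secret. This is a sketch:

- -field=token prints only the token.
- bao token capabilities reports what a token may do on a path.
- The probe path transit/keys/pvc-kek is an arbitrary example of a path the pvc-enc policy should deny.
- create_driver_token is a hypothetical name.

```shell
# Mint the driver token, verify it is least-privilege, then store it.
create_driver_token() {
  local token cap path
  token=$(bao token create -field=token -policy=pvc-enc -period=24h \
            -display-name=csi-pvc-enc -orphan) || return 1

  # Expected from the pvc-enc policy: "update" on the two Transit paths ...
  for path in transit/encrypt/pvc-kek transit/decrypt/pvc-kek; do
    cap=$(bao token capabilities "$token" "$path")
    [ "$cap" = "update" ] || { echo "unexpected capabilities on $path: $cap" >&2; return 1; }
  done
  # ... and "deny" on the key itself.
  cap=$(bao token capabilities "$token" transit/keys/pvc-kek)
  [ "$cap" = "deny" ] || { echo "token can reach transit/keys/pvc-kek: $cap" >&2; return 1; }

  kubectl -n kube-system create secret generic kms-token --from-literal=token="$token"
}
```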
Keep the token alive
A periodic token stays valid indefinitely, but only if it is renewed within each period. Renewal happens in OpenBao — the token value and the Kubernetes Secret stay unchanged, and no volume Secrets are involved. A minimal renewer:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kms-token-renew
  namespace: kube-system
spec:
  schedule: "0 */8 * * *" # three times per 24h period
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: renew
              image: openbao/openbao:2.5.5
              # Discard stdout: `bao token renew` prints the token value, which
              # would otherwise end up in the job's logs.
              command: ["/bin/sh", "-c", "bao token renew > /dev/null && echo renewed"]
              env:
                - name: BAO_ADDR
                  value: "http://<openbao-active-clusterip>:8200"
                - name: BAO_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: kms-token
                      key: token
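You can exercise the renewer once, instead of waiting for the schedule, by creating a Job from the CronJob. This is a sketch:

- test_renewer and the job-name prefix are hypothetical.
- The 120s timeout is arbitrary.
- The job is deleted after success.
- The job log is deliberately not printed, since bao token renew can include the token in its output.

```shell
# Run the renewer CronJob once and wait for the Job to complete.
test_renewer() {
  local job
  job="kms-token-renew-manual-$(date +%s)"
  kubectl -n kube-system create job "$job" --from=cronjob/kms-token-renew || return 1
  if kubectl -n kube-system wait --for=condition=complete "job/$job" --timeout=120s; then
    echo "renewer job $job completed"
    kubectl -n kube-system delete job "$job" >/dev/null
  else
    echo "renewer job $job did not complete; inspect job/$job in kube-system" >&2
    return 1
  fi
}
```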
If the token does expire:
no data is lost and running Pods keep working,
but attaching, rescheduling, and resizing encrypted volumes fail.
Mint a new token and update the kms-token Secret to recover.
Revoke the root token
After setup, revoke the root token and use a scoped operator token for day-2 operations (step 10):
cat <<'EOF' | bao policy write bao-ops -
path "sys/storage/raft/snapshot" { capabilities = ["read"] }
path "sys/storage/raft/snapshot-force" { capabilities = ["create", "update"] }
path "sys/storage/raft/configuration" { capabilities = ["read"] }
path "sys/storage/raft/autopilot/configuration" { capabilities = ["read", "update"] }
path "sys/audit" { capabilities = ["read", "sudo"] } # verify audit devices (read requires sudo)
path "transit/encrypt/pvc-kek" { capabilities = ["update"] } # disruption-test probe (step 10)
EOF
bao token create -policy=bao-ops -period=72h -orphan -display-name=bao-ops
bao token revoke <initial-root-token> # then: export BAO_TOKEN=<ops-token>
Recovering root access
After the root token is revoked,
admin operations (like changing policies) need a new root token.
OpenBao disables the classic unauthenticated generate-root endpoints
by default for security reasons,
and the bao operator generate-root command of OpenBao 2.5
still targets them — it fails with "unsupported operation".
To recover root access with your unseal keys,
temporarily set disable_unauthed_generate_root_endpoints = false
in the server configuration (restart required),
run bao operator generate-root,
and revert the setting afterwards.
Keep the ops token renewed so this stays a rare event.
8. Network, endpoint and TLS
- The storage driver cannot resolve cluster-internal DNS: use a Service ClusterIP as the encryptionKMSEndpoint, never a DNS name.
- Use the openbao-active Service, not the plain openbao one: it always points at the current Raft leader and never routes to a sealed node.
TLS — mandatory for production
Every unwrap request returns the plaintext key of a volume over this connection — running the KMS on plain HTTP is acceptable for a first functional bring-up, but not an option for production. Enable TLS before the first encrypted volume holds real data.
The server certificate must cover the openbao-active ClusterIP (IP SAN),
the pod DNS names openbao-{0,1,2}.openbao-internal (Raft join),
and 127.0.0.1 (in-pod CLI).
No public or ACME certificate authority can issue such a certificate,
so create a private CA.
With the cert-manager extension,
bootstrap a self-signed CA (long-lived, here 10 years)
and issue the server certificate from it (here 5 years):
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned
  namespace: openbao
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: openbao-ca
  namespace: openbao
spec:
  isCA: true
  commonName: openbao-ca
  duration: 87600h # 10 years
  secretName: openbao-ca
  issuerRef:
    name: selfsigned
    kind: Issuer
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: openbao-ca
  namespace: openbao
spec:
  ca:
    secretName: openbao-ca
Look up the ClusterIP the certificate must contain, then issue the server certificate:
kubectl -n openbao get svc openbao-active -o jsonpath='{.spec.clusterIP}'
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: openbao-server-tls
  namespace: openbao
spec:
  secretName: openbao-server-tls
  duration: 43800h # 5 years
  dnsNames:
    - openbao-0.openbao-internal
    - openbao-1.openbao-internal
    - openbao-2.openbao-internal
  ipAddresses:
    - <openbao-active-clusterip>
    - 127.0.0.1
  issuerRef:
    name: openbao-ca
    kind: Issuer
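To confirm that the issued certificate covers every required name, decode it and compare its Subject Alternative Names. This sketch assumes OpenSSL 1.1.1 or later, for the -ext option. san_covers is a hypothetical helper that reads the openssl output on stdin and reports each missing entry.

```shell
# Hypothetical helper: read `openssl x509 -noout -ext subjectAltName` output
# on stdin and check that every expected entry is present.
# Entries look like "DNS:openbao-0.openbao-internal" or "IP Address:127.0.0.1".
san_covers() {
  local sans entry missing=0
  sans=$(tr ',' '\n' | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
  for entry in "$@"; do
    if ! printf '%s\n' "$sans" | grep -qxF "$entry"; then
      echo "missing SAN: $entry" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   IP=$(kubectl -n openbao get svc openbao-active -o jsonpath='{.spec.clusterIP}')
#   kubectl -n openbao get secret openbao-server-tls -o jsonpath='{.data.tls\.crt}' | base64 -d |
#     openssl x509 -noout -ext subjectAltName |
#     san_covers "IP Address:$IP" "IP Address:127.0.0.1" \
#       DNS:openbao-0.openbao-internal DNS:openbao-1.openbao-internal DNS:openbao-2.openbao-internal
```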
Enable TLS in the chart
In openbao-ha-values.yaml:
set global.tlsDisable: false,
mount the certificate Secret,
and update the listener and the Raft retry_join stanzas:
global:
  tlsDisable: false
server:
  extraVolumes:
    - type: secret
      name: openbao-server-tls # mounted at /openbao/userconfig/openbao-server-tls
  ha:
    raft:
      config: |
        # ... keep the other stanzas from step 1 (ui, service_registration, audit, any seal) ...
        listener "tcp" {
          address         = "[::]:8200"
          cluster_address = "[::]:8201"
          tls_disable     = false
          tls_cert_file   = "/openbao/userconfig/openbao-server-tls/tls.crt"
          tls_key_file    = "/openbao/userconfig/openbao-server-tls/tls.key"
        }
        storage "raft" {
          path = "/openbao/data"
          retry_join {
            leader_api_addr     = "https://openbao-0.openbao-internal:8200"
            leader_ca_cert_file = "/openbao/userconfig/openbao-server-tls/ca.crt"
          }
          # ... same for openbao-1 / openbao-2
        }
Apply with helm upgrade, then restart the pods one at a time
(see Upgrades).
The endpoint becomes https://<openbao-active-clusterip>:8200 —
update the StorageClass accordingly
and hand the CA to the driver via the kms-token Secret:
kubectl -n openbao get secret openbao-server-tls -o jsonpath='{.data.ca\.crt}' | base64 -d > openbao-ca.crt
kubectl -n kube-system create secret generic kms-token \
--from-literal=token=<the-periodic-token> \
--from-file=ca.crt=openbao-ca.crt \
--dry-run=client -o yaml | kubectl apply -f -
The token-renewal CronJob (step 7) then also needs
BAO_ADDR=https://… and the CA:
mount the ca.crt key of the kms-token Secret
and point the BAO_CACERT environment variable at it.
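One way to make those two changes is a strategic-merge patch, which merges env entries and volumes by name. This is a sketch:

- renewer_tls_patch is a hypothetical helper that only prints the patch.
- The volume name kms-ca and the mount path /etc/kms-ca are illustrative choices.

```shell
# Hypothetical helper: print a strategic-merge patch that switches the renewer
# CronJob to https:// and mounts the ca.crt key of the kms-token Secret.
renewer_tls_patch() {
  local ip=$1
  cat <<EOF
{"spec":{"jobTemplate":{"spec":{"template":{"spec":{
  "volumes":[{"name":"kms-ca","secret":{"secretName":"kms-token",
    "items":[{"key":"ca.crt","path":"ca.crt"}]}}],
  "containers":[{"name":"renew",
    "env":[{"name":"BAO_ADDR","value":"https://${ip}:8200"},
           {"name":"BAO_CACERT","value":"/etc/kms-ca/ca.crt"}],
    "volumeMounts":[{"name":"kms-ca","mountPath":"/etc/kms-ca","readOnly":true}]}]
}}}}}}
EOF
}

# Usage:
#   IP=$(kubectl -n openbao get svc openbao-active -o jsonpath='{.spec.clusterIP}')
#   kubectl -n kube-system patch cronjob kms-token-renew --type=strategic -p "$(renewer_tls_patch "$IP")"
```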
Certificate lifetimes and rotation
cert-manager renews the certificate Secret automatically before expiry,
but OpenBao only reads the certificate files at startup:
after a renewal, perform a controlled rolling restart —
one pod at a time; with manual unseal, unseal each restarted pod again,
with auto-unseal it is hands-free.
With a 5-year server certificate this is a rare, plannable event.
Shortening the lifetime (for example to 1 year) is a good hardening step
as long as you operate that controlled restart process —
pair short lifetimes with auto-unseal.
One more trigger to plan for:
rotating the CA itself requires updating the ca.crt
in the kms-token Secret.
And a hard rule: never delete and recreate the openbao-active
Service — its ClusterIP is baked into the server certificate
and into every encrypted volume at creation time,
and existing volumes cannot follow an endpoint change.
9. Wire it to the StorageClass
Use the openbao-active ClusterIP as the endpoint and pvc-kek as the key —
see Volume Encryption
for the full StorageClass example.
The endpoint protocol must match the listener:
http:// while tlsDisable=true, https:// once TLS is enabled.
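You can read the endpoint value from the live Service instead of typing it by hand. This is a sketch: kms_endpoint is a hypothetical helper. Pass tls once step 8 is done, so that the scheme matches the listener.

```shell
# Hypothetical helper: print the encryptionKMSEndpoint value for the StorageClass.
# $1 = "tls" once TLS is enabled; anything else while tlsDisable=true.
kms_endpoint() {
  local scheme=http ip
  if [ "${1:-}" = tls ]; then scheme=https; fi
  ip=$(kubectl -n openbao get svc openbao-active -o jsonpath='{.spec.clusterIP}') || return 1
  if [ -z "$ip" ]; then
    echo "could not read the openbao-active ClusterIP" >&2
    return 1
  fi
  echo "${scheme}://${ip}:8200"
}
```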
10. Operations
Disruption test
Prove the setup tolerates the loss of one pod — kill the leader, the hardest case:
VICTIM=$(bao operator raft list-peers | awk '$3=="leader"{print $1}')
SURVIVOR=$(for p in openbao-0 openbao-1 openbao-2; do [ "$p" != "$VICTIM" ] && echo "$p" && break; done)
kubectl -n openbao delete pod "$VICTIM" --wait=false
# run against the surviving pod: Transit keeps serving, a new leader is elected
kubectl -n openbao exec "$SURVIVOR" -- env BAO_CACERT=/openbao/userconfig/openbao-server-tls/ca.crt BAO_TOKEN="$BAO_TOKEN" \
bao write transit/encrypt/pvc-kek plaintext="$(echo test | base64)"
With manual unseal, unseal the restarted pod to restore full redundancy; with auto-unseal it self-heals.
Backups
bao operator raft snapshot save /tmp/bao.snap
kubectl -n openbao cp openbao-0:/tmp/bao.snap ./bao-$(date +%F).snap # store off-cluster, on a schedule
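For a scheduled run, you can wrap the two commands together with clean-up and a simple retention rule. This sketch assumes the bao helper and an ops token in BAO_TOKEN.

- backup_snapshot, the local directory and the default of 14 retained snapshots are illustrative.
- Shipping the files off-cluster is still up to your own tooling.

```shell
# Take a Raft snapshot, copy it out of the pod, clean up, keep the newest N locally.
backup_snapshot() {
  local dir=${1:-./bao-backups} keep=${2:-14} name
  name="bao-$(date +%F-%H%M%S).snap"
  mkdir -p "$dir" || return 1
  bao operator raft snapshot save "/tmp/$name" || return 1
  kubectl -n openbao cp "openbao-0:/tmp/$name" "$dir/$name" || return 1
  kubectl -n openbao exec openbao-0 -- rm -f "/tmp/$name"
  if [ ! -s "$dir/$name" ]; then
    echo "snapshot is empty: $dir/$name" >&2
    return 1
  fi
  # Retention: list newest first (date-stamped names sort chronologically)
  # and delete everything after the first $keep entries.
  ls -1 "$dir"/bao-*.snap | sort -r | tail -n +"$((keep + 1))" |
    while read -r old; do rm -f "$old"; done
  echo "$dir/$name"
}
```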
Danger
Losing the OpenBao storage (the KEK) means losing every encrypted volume. Snapshot on a schedule, store the snapshots off-cluster, and test the restore procedure.
Audit
The audit device is configured declaratively in the server configuration
(step 1) and writes to the persistent auditStorage volume —
OpenBao 2.5+ does not allow enabling audit devices via the API.
Every wrap and unwrap call is logged on the active node
at /openbao/audit/audit.log. Verify and ship the file off-cluster:
bao audit list # requires the sudo capability on sys/audit (bao-ops policy)
kubectl -n openbao exec <active-pod> -- tail /openbao/audit/audit.log
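Because the file audit device writes one JSON object per line, you can summarise wrap and unwrap activity with standard tools. This is a rough sketch:

- count_kek_calls is hypothetical.
- It assumes compact JSON with a plain-text request path and a "type":"request" marker.

For anything beyond a quick look, use a JSON-aware tool.

```shell
# Count wrap (encrypt) and unwrap (decrypt) requests in an audit log read on stdin.
# Assumes compact JSON lines; response entries are skipped via the type marker.
count_kek_calls() {
  awk '
    /"type":"request"/ {
      if (index($0, "\"path\":\"transit/encrypt/pvc-kek\"")) wrap++
      if (index($0, "\"path\":\"transit/decrypt/pvc-kek\"")) unwrap++
    }
    END { printf "wrap=%d unwrap=%d\n", wrap, unwrap }'
}

# Usage:
#   kubectl -n openbao exec <active-pod> -- cat /openbao/audit/audit.log | count_kek_calls
```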
Upgrades
With updateStrategyType: OnDelete (manual unseal), roll one pod at a time:
delete a pod, wait for it to run, unseal it, confirm it is Ready, then the next.
With auto-unseal, switch to RollingUpdate and let the chart roll them.
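You can script the one-pod-at-a-time procedure, which is also needed after a certificate renewal. This is a sketch:

- rolling_restart is hypothetical.
- Pass the pods in restart order: followers first and the current leader last, to limit leader elections.
- It waits for each recreated pod to become Ready. With manual unseal, that only happens once someone has unsealed the pod.

```shell
# Restart OpenBao pods one at a time; each must be Ready before the next goes.
# Pass pod names in the desired order, e.g. followers first, leader last.
rolling_restart() {
  local p i
  for p in "$@"; do
    echo "restarting $p"
    kubectl -n openbao delete pod "$p" || return 1
    # The StatefulSet recreates the pod under the same name; wait until it exists.
    for i in $(seq 1 60); do
      kubectl -n openbao get pod "$p" >/dev/null 2>&1 && break
      sleep 5
    done
    # With manual unseal, unseal "$p" now (e.g. in another terminal); Ready follows.
    kubectl -n openbao wait --for=condition=Ready "pod/$p" --timeout=15m || return 1
  done
}

# Usage, with the current leader (found as in the disruption test) restarted last:
#   rolling_restart openbao-1 openbao-2 openbao-0
```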
Production checklist
Availability
- 3 Raft replicas — one per availability zone, no two on the same node
- PodDisruptionBudget with maxUnavailable: 1
- Auto-unseal configured before the first initialization
- Disruption test passed — one pod killed, volumes still attach
Security
- TLS enabled with a private CA (~10 years) and a server certificate (~5 years, or shorter with a controlled restart process)
- CA handed to the driver via the ca.crt key of the kms-token Secret
- Key encryption key not exportable, automatic rotation enabled
- Driver token least-privilege (wrap and unwrap only) and periodic, with a renewer or a rotation process
- Root token revoked after setup
Operations
- StorageClass endpoint is the openbao-active Service ClusterIP, protocol matching the listener (https:// with TLS)
- Audit device configured declaratively in the server config (step 1), log on a persistent volume, shipped off-cluster
- Raft snapshots scheduled, stored off-cluster, restore tested