
User Guide

Learn how to work with the Scone as-a-service operator in a Kubernetes (k8s) environment.

One novel aspect of confidential computing is that applications can be protected against privileged software such as the operating system and the hypervisor. Even if attackers gain access to the host machines, they will not be able to read data from a workload that runs confidentially.

We make this happen by using Intel Software Guard Extensions (Intel SGX), a capability of Intel processors to run program code encrypted in isolated sections of the processor, so-called enclaves. With this feature you can protect your data not only at rest and in transit - a widely used pattern - but also while it is being processed in memory and in the processor.

A second novel aspect of confidential computing is that one can attest all components needed to ensure the confidentiality and integrity of the application. These components include the CPU, its firmware, the application code and the application's data. The attestation ensures that these components are up-to-date and that no vulnerabilities are known for them.

In other words, one can establish trust in all components that are required to execute the application.

Our hardware not only enables protecting workloads: we also add services on top of Kubernetes that simplify building, deploying and running confidential workloads with respect to protection and attestation. This is achieved with Scone.

A good starting point to find out more about Intel SGX and Scone Services is:

  • Intel SGX8
  • Scone documentation2
  • Scone Operator User Manual6

The Scone runtime environment consists of several services:

  • CAS - Configuration and Attestation Service
  • LAS - Local Attestation Service
  • SGX Plugin
  • Scone Operator

The Scone Operator manages the lifecycle of all Scone service components. Beyond that, it is needed for operational tasks such as handling policies.

The scone-operator itself is a crucial component. Who manages the lifecycle of the scone-operator, though?

With the help of our service, we manage the lifecycle of the scone-operator. That includes:

  • deploying the services initially
  • integrating the monitoring of the Scone services into Prometheus

Scope

This document should help you understand the deployment of the Scone services with the help of the scone-service-operator. Provisioning and upgrading the CAS and the Vault are not described in depth in this documentation since they are not managed by the scone-service-operator.

This user guide does not tell you how to build and run confidential workloads either. Please refer to the official Scone sites1 for information on that. There you will find a number of tutorials and samples that help you get started.

Concepts

Let us start with an overview of the Scone services and then continue with the idea of the scone-service-operator. Afterwards we will see how to deploy the services needed to run confidential workloads in production.

SCONE helps developers to run their applications inside of SGX enclaves. An Intel SGX enclave enables an application to protect its data from access by all other software - even the operating system. In particular, an application can protect all its data against adversaries with root access. A root user cannot dump the main memory of an application to get access to all its keys. Often, configuration files of applications are only protected by the filesystem. Again, a user with root access can read these configuration files and all secrets that they might contain. SCONE uses SGX to help encrypt configuration files and thereby protect them.

SCONE comprises a number of services that make it easier for developers to create confidential workloads for Kubernetes.

Configuration and Attestation Service (CAS)

SCONE CAS is a central component of the SCONE infrastructure. Programs executing in enclaves connect to CAS to obtain their confidential configuration. CAS provisions this configuration only after it has verified the integrity and authenticity of the requesting enclave using remote attestation. Additionally, CAS checks that the requesting enclave is authorized to obtain the confidential configuration. One can run CAS instances on the same node as the application, the same cluster, or a different cluster. The CAS operator enables us to configure CAS policies remotely using kubectl without needing to expose CAS to any external network.
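
Once deployed, the CAS instance is represented by a custom resource that can be inspected with kubectl. A minimal sketch, assuming the resource name cas and the namespace osc-scone-system used later in this guide:

# List the CAS custom resource managed by the SCONE operator
$ kubectl get cas -n osc-scone-system

# Show the detailed status of the CAS instance named "cas"
$ kubectl describe cas cas -n osc-scone-system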

Local Attestation Service (LAS)

A LAS instance must run on each Kubernetes node that supports confidential computing. Developers will not have to know about LAS as long as the SCONE operator can keep LAS running. In conjunction with CAS, it enables remote attestation of enclaves by performing a local attestation. Currently, LAS supports DCAP and EPID-based quoting enclaves. Additionally, it provides an independent SCONE quoting enclave (QE): The SCONE QE enables the decoupling of the availability of an application from Intel's attestation services. A use case of the SCONE QE is the air-gapped deployment of applications without Internet connectivity.

SGX Device Plugin Service (SGX Plugin)

The SCONE SGX Plugin simplifies the deployment of confidential applications by providing any unprivileged container access to the SGX devices on hosts that support SGX. Developers will not have to know about the SGX Plugin as long as the SCONE operator can keep the SGX Plugin running. A confidential application can only run on hosts with SGX support. When running in a Kubernetes cluster, you need to make sure your application's workload is scheduled to such nodes and that the application gets access to the corresponding SGX devices. A device plugin advertises hardware resources to the Kubelet. The SCONE SGX Plugin provides access to the SGX devices. To access the SGX devices of a node, however, the Kubernetes containers usually need to run in privileged mode. The SCONE SGX Plugin will not only provide access to the SGX devices on exactly those nodes in the cluster that have support for the SGX version that you need. It will also allow your containers to run in non-privileged mode. However, the SGX Plugin must have permission to access the SGX devices.
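
To see whether a node actually advertises SGX resources through the plugin, you can inspect the node object. The grep pattern below is an assumption, since the exact resource and label names depend on the plugin version:

# Show SGX-related labels and allocatable resources advertised on a node (names may vary by plugin version)
$ kubectl describe node <node-name> | grep -i sgx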

CAS Backup Controller

The CAS Backup Controller automatically registers backup encryption keys from all the nodes in the Kubernetes cluster. This allows the CAS to open its encrypted database on all cluster nodes using the encryption key of the node on which it is running. In this way, the CAS can be freely migrated to nodes in the cluster (e.g. for recovery after a node failure). Note that the CAS Backup Controller does NOT backup the CAS database: You need to explicitly back up the Kubernetes volume used by CAS.

Scone Kubernetes Operator

The SCONE Kubernetes Operator automates the management of SCONE-related services. It monitors the behavior of these services and ensures that the services stay in the desired state. This state is described with the help of Kubernetes custom resources.

The SCONE Operator defines a set of custom resources to manage the following SCONE resources:

  • CAS: the SCONE Configuration and Attestation Service
  • LAS: the SCONE Local Attestation Service
  • SGX Plugin: a Kubernetes plugin that provides containers with access to SGX
  • a confidential Vault
  • signed and/or encrypted SCONE CAS policies
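
A quick way to confirm that these custom resource types exist in your cluster is to query the API resources; the name filter below is only an assumption and may need adjusting to your installation:

# List the SCONE-related custom resource types known to the API server
$ kubectl api-resources | grep -iE 'cas|las|sgxplugin|vault|scone'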

Idea

The idea is to enable users to run their workloads confidentially. The services that make confidential computing available should be managed by Kubernetes operators so that the operational impact on users is as small as possible.

How do we achieve that?

To simplify the initial rollout of scone services and also the lifecycle management of the scone-operator we introduce another service operator: the scone-service-operator.

The scone-service-operator's responsibility is to manage the lifecycle of the scone-operator, which in turn monitors the Scone services and manages their lifecycle. The scone-service-operator's responsibility includes:

  • creating, updating and deleting the scone-operator
  • integrating the scone-operator into monitoring
  • initially provisioning the SGX Plugin, LAS and a non-provisioned CAS
  • creating CAS database backups

Here it is important to note that OSC as a cloud provider - with the help of the scone-service-operator and the scone-operator - creates the basic non-confidential components. All security-relevant tasks must be run by the owner of the CAS. Only the owner can run security-relevant tasks such as:

  • provisioning a CAS
  • upgrading a CAS

Provisioning is the step in which a user takes ownership of the CAS. During this process the user obtains the credentials that enable them to drive security-relevant management tasks.

Architecture

Let us see what the architectural overview looks like and how the pieces fit together.

Deployment Diagram

Enabling the Scone Service from the OSC Service-Catalog deploys the scone-service-operator into your cluster, into the namespace osc-sec-scone-svc-operator.

The user, usually a service administrator on the customer's end, then creates a scone CR (a cluster-scoped CR of kind scone.confidential.security.osc.t-systems.com) to configure the service. The important properties to set here are:

  • the service version to be installed
  • a flag indicating whether CAS should be installed (by default CAS will be installed)

Since CAS is optional, its installation can be skipped. This is useful, for example, if another CAS - possibly in another cluster - is already in place to provide attestation and configuration services.

The scone-service-operator is triggered by the creation of this scone CR. The controller then creates all resources needed to run the scone-operator in the namespace osc-sec-us-scone-operator, as well as the configurations for the services to be installed, i.e. LAS, CAS and the SGX Plugin. These services are created in the namespace osc-scone-system.

Additionally, the scone-service-operator creates the scone-cas-db-backupper in the namespace osc-scone-system, which periodically pulls a snapshot of the CAS database and pushes it into an S3 bucket.
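
To verify the layout described above, you can list the workloads in the involved namespaces. This is only a sketch; the exact pod names depend on the installed version:

# The scone-service-operator itself
$ kubectl get pods -n osc-sec-scone-svc-operator

# The scone-operator
$ kubectl get pods -n osc-sec-us-scone-operator

# CAS, LAS, SGX Plugin and the CAS DB Backupper
$ kubectl get pods -n osc-scone-system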

Supported Scone Versions

Scone follows the same support policy as the Kubernetes project, an N-2 policy in which the three latest minor releases are maintained. Although previous versions may work, they are not tested and therefore no guarantees are made as to their full compatibility. Scone also follows a similar strategy for the supported Kubernetes versions. The table below shows the compatibility matrix.

Scone Support Matrix

| Scone Version | K8s Min | K8s Max | Upgrade to / New Install supported |
| ------------- | ------- | ------- | ---------------------------------- |
| 5.8.0-rc.19   | 1.23    | 1.25    | ✅                                 |
| 5.8.0         | 1.24    | 1.26    | ✅                                 |

Lifecycle Management

It is all about installing, upgrading, repairing and deleting the scone-operator.

This is all done by providing a Kubernetes resource of kind scone.confidential.security.osc.t-systems.com. This is a cluster-scoped resource. It reflects the desired state in its spec section, for example as follows:

The desired state of the following definition is to have the Scone services in version 5.8.0 in place, including a CAS installation (skipCas: false). The scone-service-operator takes care of reaching the desired state.

apiVersion: confidential.security.osc.t-systems.com/v1alpha1
kind: Scone
metadata:
  name: scone
spec:
  serviceVersion: 5.8.0
  managementState: Managed
  skipCas: false
  #serviceNodeSelector:
    #osc.services/enabled: "true"
  #operatorNodeSelector:
    #osc.sgx/enabled: "true"
  #backup:
    #interval: 3600
    #retention: 20
    #s3Creds:
      #secret: <name of secret>
      #namespace: <namespace of secret>

Scone Specification

| Field | Default | Description | Optional |
| ----- | ------- | ----------- | -------- |
| spec.serviceVersion | - | Version of the Scone services to be managed | no |
| spec.managementState | Managed | Managed: the operator manages the lifecycle. Unmanaged: set this state if you want to manage the Scone operator/services yourself | no |
| spec.skipCas | false | If you don't want CAS installed, set this flag to true; if you want CAS installed, set it to false. NOTE: An already installed CAS will be removed if skipCas is set to true afterwards. | yes |
| spec.serviceNodeSelector | - | NodeSelector for CAS, LAS, SGX Plugin and the CAS DB Backupper | yes |
| spec.operatorNodeSelector | - | NodeSelector for the Scone Operator | yes |
| spec.backup.interval | 14400 | Frequency of the CAS DB backup in seconds | yes |
| spec.backup.retention | 10 | Expiry in days for CAS DB backups | yes |
| spec.backup.s3Creds.secret | - | Name of the secret of an already existing S3 bucket. The secret will be used by the CAS DB Backupper and copied to the namespace osc-scone-system with the name s3-bucket-scone-cas-db-backupper. If s3Creds is empty, the operator creates an S3 bucket in the namespace osc-scone-system with the name scone-cas-db-backupper. | yes |
| spec.backup.s3Creds.namespace | - | Namespace of the secret of an already existing S3 bucket. See spec.backup.s3Creds.secret. | yes |

The status section of the scone custom resource reflects the current lifecycle state of the scone-service-operator landscape.

Scone Status Fields

| Field | Description |
| ----- | ----------- |
| status.deployedSpec | A hash value of the latest deployed specification. This hash is created from all fields of spec. |
| status.deployedVersion | The latest deployed version of the Scone services. This can differ from spec.serviceVersion in case an error occurred. |
| status.lastSuccessPhase | The previous phase. It can be any phase except Error. |
| status.lastUpdate | When the latest reconcile happened. |
| status.observedGeneration | The latest generation that has been reconciled - see also metadata.generation. |
| status.phase | The current state of the service-operator. |
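
The status fields can be read directly with jsonpath. A small sketch using the field paths from the table above:

# Current lifecycle phase of the scone CR
$ kubectl get scone scone -o jsonpath='{.status.phase}{"\n"}'

# Version of the Scone services that is actually deployed
$ kubectl get scone scone -o jsonpath='{.status.deployedVersion}{"\n"}'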

The following diagram depicts the possible transitions and phases of a scone.confidential.security.osc.t-systems.com resource.

Deployment Diagram

Let's walk through the phases.

A user creates a new Custom Resource (CR) of kind scone.confidential.security.osc.t-systems.com and applies it to the cluster. The initial phase is Empty. At this point the operator adds a finalizer to the resource and puts the CR into phase Creating. In phase Creating the service operator installs the scone-operator and the Scone services by applying the Helm charts. After a successful installation the CR moves to phase Running, which indicates that the scone-operator is ready to use. That should be the normal case.

When a user modifies the specification, e.g. changes spec.backup.interval or spec.backup.retention, the CR moves to phase Updating, which indicates that the change is being applied. After the scone-operator or the services have been updated successfully, the CR returns to phase Running.

If anything went wrong, for any reason, and the reconciliation to the desired state failed, the CR moves to phase Error. The service operator tries up to 5 times to reach the desired state from phase Error. If it cannot reach the desired state after 5 attempts, it stops trying. In that case manual intervention by the user is needed.

The user (or rather a human operator) then has to repair or troubleshoot the deployment. Changes to the scone CR put the scone-service-operator into phase Repairing, in which it tries to repair the scone-operator and the services.
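
Since any change to the scone CR re-triggers reconciliation, a hedged way to nudge the operator out of phase Error is to patch a documented spec field, for example spec.backup.interval (the value below is only an example):

# Touch a spec field to trigger a new reconciliation attempt
$ kubectl patch scone scone --type merge -p '{"spec":{"backup":{"interval":14400}}}'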

You can remove the whole stack by deleting the CR. This applies a helm uninstall of the installed package.

NOTE: Deletion of the scone CR not only removes the scone-operator, it also removes all Scone services that came with the installed Helm packages. So pay attention!

The created S3 bucket remains in the osc-scone-system namespace so that CAS DB snapshots stay available to recover a CAS from. To remove the bucket and all objects stored inside it, one has to remove the S3 bucket resource explicitly.

Installation

Installing the whole service stack is a three-step procedure:

  1. Enabling the services
  2. Installing the services initially
  3. Provisioning a production ready confidential CAS

Deployment Diagram

The following chapters describe each step in more detail.

Installing the scone-service-operator

The first step is to enable the Scone services from the Busola dashboard. This automatically deploys the scone-service-operator into the namespace osc-sec-scone-svc-operator.

This enables users to deploy the scone-operator and scone services initially. See next chapter.

Installing services initially

The user applies a resource of kind scone.confidential.security.osc.t-systems.com. Please check the table Scone Specification for valid properties.

Example: Add the scone-operator and the Scone services in version 5.8.0, including CAS.

cat <<EOF | kubectl apply -f -
apiVersion: confidential.security.osc.t-systems.com/v1alpha1
kind: Scone
metadata:
  name: scone
spec:
  serviceVersion: 5.8.0
  managementState: Managed
EOF

The creation of this resource triggers the scone-service-operator to create the scone-operator and the initial services for LAS, CAS and the SGX Plugin.
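
After applying the CR you can watch the operator converge towards the desired state. A sketch; the exact pod names depend on the release:

# Observe the lifecycle phase of the scone CR
$ kubectl get scone scone

# Watch the Scone services come up
$ kubectl get pods -n osc-scone-system -w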

Provisioning CAS

To take ownership of the CAS, one has to provision it.

A detailed description can be found in the scontain upstream documents3.

We outline the specific parameters that have to be used:

  • usage of MTR as registry
  • usage of mirrored images, prefixed with osc-

Prerequisites

Provisioning the CAS takes place from your local trusted machine. To provision the CAS you need the following tools on your local machine; a quick version check is sketched after the list.

Tools

  • docker
  • kubectl
  • kubectl-provision plugin
  • git
  • helm
  • jq
  • cosign
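
A quick sanity check that the tools are available might look like this (output formats differ per tool):

$ docker --version
$ kubectl version --client
$ git --version
$ helm version
$ jq --version
$ cosign version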

Credentials

Since the scone-operator repository hosted on GitHub is private, please make sure you have been granted access.

Install kubectl-provision plugin

# Version to provision
$ export VERSION=5.8.0

# Upstream URL of kubectl-provision plugin
$ export SCONE_KUBECTL_PROVISION_PLUGIN="https://raw.githubusercontent.com/scontain/SH/master/${VERSION}/kubectl-provision"

# Folder to hold your kubectl-provision plugin. It has to be in your PATH.
$ export SCONE_KUBECTL_PROVISION_PLUGIN_BIN=/usr/local/bin/kubectl-provision

# Download plugin and make it executable
$ curl -fsSL "${SCONE_KUBECTL_PROVISION_PLUGIN}"  -o "${SCONE_KUBECTL_PROVISION_PLUGIN_BIN}"
$ chmod a+x "${SCONE_KUBECTL_PROVISION_PLUGIN_BIN}"
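
You can verify that kubectl picks up the plugin, assuming the target directory is on your PATH:

# kubectl lists executables named kubectl-* found on the PATH
$ kubectl plugin list | grep provision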

DCAP API Key

It is recommended to subscribe to the Intel Provisioning Certification Service (Intel PCS9), which provides you with the API keys needed to query the service for Intel SGX attestation collateral.

Each subscription to the Intel PCS Service for ECDSA Attestation issues two API keys: a primary key and a secondary key. Either one can be used. The point of issuing two keys is to provide continuity of service in the event the active key needs to be regenerated.

If you do not use your personal DCAP API key when provisioning CAS later, a default key will be used. With that default DCAP API key there are no guarantees about validity and availability.

Provision CAS

To provision CAS, run the following commands.

# Version to provision
$ export VERSION=5.8.0
# use osc- prefix for images
$ export IMAGE_PREFIX=osc-
# fetch images from our MTR
$ export IMAGE_REPO=mtr.devops.telekom.de/osc/security/confidential-computing

$ export DCAP_APIKEY=<YOUR DCAP API-KEY>
$ kubectl provision cas cas --verbose --namespace osc-scone-system --dcap-api ${DCAP_APIKEY}

After a while CAS should reach the state healthy, provisioned and attestable. It is then ready to use.

NAMESPACE          NAME   STATUS    PHASE     PROVISIONED   ATTESTABLE   SNAPSHOTS   MIGRATION   VERSION                         AGE
osc-scone-system   cas    HEALTHY   HEALTHY   Yes           Yes          Persisted   2/2         5.7.0-645-gf7463d920-nikolaus   15m

Upgrade the Scone services

The upgrade of the Scone services is a two-step procedure:

  • Upgrade CAS
  • Upgrade Services LAS, SGX-Plugins

It starts with upgrading CAS. Afterwards SGX-Plugin and LAS can be upgraded.

ATTENTION: Before upgrading the Scone services to a new version, make sure that the version is supported by OSC. Check the Scone Support Matrix above.

Upgrade CAS

The upgrade of CAS to a new version is again done with the kubectl-provision plugin. To upgrade CAS, execute the command below. Make sure you meet the prerequisites for the new version to be installed: install the corresponding version of the kubectl-provision plugin first and make sure that you have all credentials in place to execute the CAS upgrade.

$ kubectl provision cas cas -n osc-scone-system --upgrade "<VERSION>"

After CAS has been successfully updated, the service controller upgrades the scone CR and all other Scone services.
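
To confirm that the new version has been rolled out, the CAS resource and the status fields described earlier can be checked. A sketch:

# The CAS resource should report the new version
$ kubectl get cas cas -n osc-scone-system

# The scone CR should eventually report the upgraded service version
$ kubectl get scone scone -o jsonpath='{.status.deployedVersion}{"\n"}'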

Delete the Scone service

To remove all scone services remove your scone CR from your cluster.

$ kubectl delete scone scone

ATTENTION: All Scone services and resources are removed when the scone CR is deleted.

Snapshot (Backup) and Recovery CAS

Snapshotting the CAS database is enabled by default. Snapshots are stored on a separate volume. Additionally, snapshots are periodically pushed to an S3 bucket. Refer to the section Backup CAS DB into S3 Bucket.

To recover CAS from a snapshot, follow the upstream documentation5. To recover the CAS database you need a snapshot of the CAS database on your local machine. This snapshot can come from the S3 bucket where snapshots are stored periodically or from a manually created backup made with the kubectl provision tool.

Find examples for backup and recovery below.

Backup

$ kubectl provision cas cas --local-backup -n osc-scone-system

Recovery

$ export DCAP_APIKEY=<your dcap api key>
$ kubectl provision cas cas --cas-database-recovery cas-osc-scone-system-last-snapshot-db --verbose --namespace osc-scone-system --dcap-api ${DCAP_APIKEY}

NOTE: A restore provisions a new CAS. Make sure that no CAS CR is applied and no CAS is running before running the recovery. Furthermore, during the recovery the scone-service-operator needs to be set to Unmanaged; otherwise the scone-service-operator would create a CAS resource again, which would conflict with the one coming from the recovery. After a successful recovery the scone-service-operator can be set back to the ManagementState Managed.
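
Switching the management state around the recovery could look like the following sketch, using the documented spec.managementState field:

# Hand over control before the recovery
$ kubectl patch scone scone --type merge -p '{"spec":{"managementState":"Unmanaged"}}'

# ... perform the recovery as described above ...

# Give control back to the scone-service-operator afterwards
$ kubectl patch scone scone --type merge -p '{"spec":{"managementState":"Managed"}}'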

Shoot K8s Cluster Management

Shoot cluster management activities have to be planned carefully and thoroughly with respect to the running Scone services, especially the CAS database, when it comes to removing a machine (a k8s node).

The CAS database is always encrypted to protect the confidentiality of the contained secrets. The database encryption key is stored as cipher text in a separate file with the same filename, but ending with .key-store.

The key used to encrypt the database encryption key is derived using SGX' seal key derivation feature. Therefore, CAS can only decrypt the database encryption key and access the database if it is executed on the same machine which has created the .key-store file, or the key is explicitly made available to a CAS on another machine with CAS' Backup feature.

In particular, this means that if the machine hosting CAS breaks, e.g. due to hardware failure, or is not available anymore, for example because it was a cloud machine, access to all data stored by CAS is lost!

Thanks to the CAS Backup feature, all SGX-capable machines in your cluster have been registered. That means that if, for any reason, CAS is scheduled onto another (registered) machine, it can still access the data because it is able to decrypt the database key with that machine's SGX seal key.

Have at least 2 nodes (better more) available to overcome an outage of the machine hosting your CAS.

It is not sufficient that all machines are registered to host a CAS. A machine must also be able to connect to the Persistent Volume (PV) where the CAS database is stored. Unfortunately, Open Sovereign Cloud currently does not support spanning a PV across multiple Availability Zones (AZ). That means: only a machine running in the same AZ as the CAS can take over the CAS.

WARNING: Make sure that you have multiple registered (migratable) machines in the same AZ where your CAS is running. Otherwise the CAS will not survive an outage of the machine it was running on.

The limitations mentioned above have an impact on actions such as:

  • Upgrading the shoot's k8s version
  • Resizing the shoot cluster
  • Removing a node hosting a CAS from service

CAS is protected by a Pod Disruption Budget (PDB). This ensures that there is always one instance running, since Gardener respects the PDB and evicts nodes safely. It also means that Gardener stops draining the node on which CAS is running.

An administrator/user then has to evict the CAS manually. This is described in the following chapter.

Evict CAS from a machine (k8s node)

This section covers steps to safely move a CAS instance from one machine to another machine in general.

The procedure of safely evicting a CAS is needed in case of:

  • Upgrading the shoot's k8s version
  • Resizing the shoot cluster
  • Draining a node hosting CAS for maintenance reasons
  • Removing a machine hosting a CAS; this also includes removing the underlying bare metal server
  • etc.

Before evicting CAS from a node, one has to make sure that another node is available which is capable of running the CAS. Once this is verified, one can transfer the CAS to the other node.

Assure another node is capable to host the CAS Instance

A node capable of hosting a CAS must fulfill several criteria.

The node must:

  • be registered to run CAS backups,
  • reside in the same AZ,
  • be schedulable, and
  • have certain labels set.

If there is no such node available, one has to add another node to the cluster which supports SGX workloads and which is located in the same AZ where CAS is currently running. The labeling of the node is done automatically by the scone-operator.

A potential machine to move CAS to must reside in the same AZ as the machine that is currently hosting the CAS. This is because it must be able to mount the Persistent Volume with the CAS database, which is currently not possible if the machine runs in a different AZ. In the near future this will change and cross-AZ connections will be enabled.

To figure out in which AZ the CAS is currently running, execute the following commands.

# get node where CAS is running
$ export CASNODE=$(kubectl get pod cas-0 -n osc-scone-system -o json | jq '.spec.nodeName' | tr -d '"' )
$ echo $CASNODE

# get AZ where node resides
$ export CASAZ=$(kubectl get node $CASNODE -o json | jq '.metadata.labels."topology.kubernetes.io/zone"' | tr -d '"')
$ echo $CASAZ

To find nodes which can potentially host the CAS, except for the node that is currently hosting it, run the command below.

This command considers:

  • the correct AZ,
  • registered nodes,
  • the other labels needed to run CAS, and
  • that the node is schedulable.

$ kubectl get nodes -o json | jq '.items[] |
  select(.metadata.labels."topology.kubernetes.io/zone"==$ENV.CASAZ) |
  select (.metadata.labels."cas-registered-cas.osc-scone-system-5.8.0"=="true") |
  select (.metadata.labels."las.scontain.com/capable" == "true")|
  select (.metadata.labels."las.scontain.com/ok" == "true") |
  select (.metadata.labels."sgx.intel.com/capable" == "true") |
  select (.spec.unschedulable != true) |
  .metadata.name' |
  tr -d '"' |
  grep -v $CASNODE 

If the query above does not return any nodes then one first has to add a new node into the cluster. Now that we have assured that there is at least one node which is capable to host a CAS we can continue.

Move CAS to another node

Follow these steps to move the CAS pod to another node.

Disable scheduling to CAS node

Since we want to move CAS to another node, we need to mark the node currently hosting it as unschedulable, so that CAS gets scheduled onto another node.

$ kubectl cordon $CASNODE

Delete CAS Pod

Finally the CAS pod can be deleted. Kubernetes will then place the pod onto another node.

$ kubectl delete pod cas-0 -n osc-scone-system

Verify CAS is healthy again

Check that the CAS pod is running and that CAS is in state healthy.

# CAS is running again - verify that CAS pod is running on another node than before 
$ kubectl get pod cas-0 -n osc-scone-system -o wide

# CAS is in state healthy
$ kubectl get cas cas -n osc-scone-system

Enable scheduling

If needed then set the CAS node back to schedulable.

$ kubectl uncordon $CASNODE

Upgrading the shoot k8s version

According to the Gardener documentation12 there is a difference between a minor version update (e.g. 1.25 to 1.26) and a patch version update (e.g. 1.26.0 to 1.26.1).

Patch version upgrade happens in-place. This means that the shoot worker nodes remain untouched and only the kubelet process restarts with the new Kubernetes version binary.

In contrast a Minor Version Upgrade happens in a rolling update fashion. The worker nodes will be terminated one after another and replaced by new machines. The existing workload is gracefully drained and evicted from the old worker nodes to new worker nodes, respecting the configured PodDisruptionBudgets (PDB).

CAS is protected by such a PDB. Gardener will not drain the node where CAS is currently placed unless someone removes CAS from the node.

Here manual interaction by the user/administrator comes into play. They have to make sure that another machine is available that is capable of running the CAS. Once this is assured, they can evict the CAS instance from the affected node. Gardener will then continue with upgrading the remaining nodes.

In the Gardener Dashboard you'll find a message that a Pod Disruption Budget prevents draining and evicting the node.

⚠ IMPORTANT: Before starting a k8s version upgrade, make sure that the CIDR of your subnet allows for the additional nodes that will be added during the rolling update. If needed, scale down the nodes first. E.g. if your subnetwork has a /28 CIDR, which allows a maximum of 16 nodes, make sure that at most 15 nodes are running before upgrading the cluster.
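
A simple way to check how many nodes are currently registered before starting the upgrade:

# Count the nodes currently in the cluster
$ kubectl get nodes --no-headers | wc -l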

Assure another node is capable to host a CAS Instance

Check that another node is capable of hosting the CAS instance so that CAS can be evicted safely (see Assure another node is capable to host the CAS Instance). Continue with the following step only if at least one such node is available.

Delete CAS Pod

Delete the CAS pod. Kubernetes will then place the pod onto another node.

$ kubectl delete pod cas-0 -n osc-scone-system

Verify CAS is in a healthy state

Refer to Verify CAS is healthy again to check if CAS is up and running again.

Resizing the shoot cluster

This chapter briefly discusses what needs to be considered with regard to CAS when adding or removing nodes from a cluster, i.e. when scaling the cluster up or down.

Scaling up the cluster means adding nodes to the cluster. This does not have a severe impact on CAS, as it stays on the node where it is running and does not have to be evicted. The added nodes are registered automatically by the CAS Backup feature.

Scaling down can have a severe impact on CAS. Since the cluster scaler selects the nodes to remove from the cluster randomly, it can happen that it selects the machine on which CAS is currently running. In that case the cluster scaler respects the Pod Disruption Budget, which prevents it from removing the node until an administrator/user explicitly drains and evicts the node. The administrator/user has to make sure that there is another node capable of running the CAS.

In the Gardener Dashboard you'll find a message that a Pod Disruption Budget prevents draining and evicting the node.

Assure another node is capable to host a CAS Instance

Refer to the chapter Assure another node is capable to host the CAS Instance to safely evict CAS. Continue with the following step only if at least one such node is available.

Delete CAS Pod

Delete the CAS pod. Kubernetes will then place the pod onto another node.

$ kubectl delete pod cas-0 -n osc-scone-system

Verify CAS is in a healthy state

Refer to Verify CAS is healthy again to check if CAS is up and running again.

Remove a node from a service

This chapter briefly discusses what needs to be considered when you want to remove a node hosting a CAS from service for any reason, e.g. maintenance of the VM or maintenance of the underlying bare metal server.

Check if CAS is running on node to remove

The first step is to double-check whether CAS is running on the node that should be removed from service.

# get node where CAS is running
$ export CASNODE=$(kubectl get pod cas-0 -n osc-scone-system -o json | jq '.spec.nodeName' | tr -d '"' )
$ echo $CASNODE

Verify whether $CASNODE is the node you want to remove. If it is, continue with the next step to make sure there is another node that can host CAS; otherwise you do not have to care about evicting CAS safely, since it is not running on this node.

Assure another node is capable to host a CAS Instance

Refer to the chapter Assure another node is capable to host the CAS Instance to safely evict CAS. Continue with the following steps only if at least one such node is available.

Move CAS to another node

Follow the description in Move CAS to another node to move the pod to another node.

Verify CAS is in a healthy state

Refer to Verify CAS is healthy again to check that CAS is up and running and healthy again.

Drain the node

Evict all other workloads gracefully to other nodes.

$ kubectl cordon $CASNODE
$ kubectl drain $CASNODE --force --ignore-daemonsets --delete-emptydir-data

Take node out of service

Now that CAS has been evicted successfully, you can remove the node from the cluster.

Vault

The Scone-Operator manages the lifecycle of a confidential HashiCorp Vault. Given that, you can let the scone-operator manage a confidential Vault for you.

Refer to upstream documentation on provisioning and deploying a confidential vault4.

Refer to the very detailed upstream documentation Confidential Hashicorp Vault on Kubernetes7, where you will find a step-by-step guide to provision a confidential Vault without using the kubectl-provision plugin.

Provision a vault

It is recommended to use mirrored images from MTR to build confidential Vault.

Use the following environment settings to use the mirrored images.

# Version to provision
$ export VERSION=5.8.0
# use osc- prefix for images
$ export IMAGE_PREFIX=osc-
# fetch images from our MTR
$ export IMAGE_REPO=mtr.devops.telekom.de/osc/security/confidential-computing

$ kubectl provision vault cas --verbose --namespace osc-scone-system

Monitoring

There are two components to be monitored:

  • scone-operator,
  • and scone-service-operator.

Scone-Operator Monitoring

The monitoring of the scone-operator11 automatically integrates into an available Prometheus monitoring stack, including a dashboard for Grafana.

Scone Service Operator Monitoring

The scone-service-operator itself integrates into monitoring for operational purposes. Alert rules are also applied automatically in order to inform the operations team about critical issues.

The monitoring of the Operator10 is deployed and activated automatically during the Operator deployment.
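
Assuming a Prometheus Operator based monitoring stack, you can check whether the monitoring objects were created; the name filter below is an assumption:

# Look for ServiceMonitors and PrometheusRules related to the Scone components
$ kubectl get servicemonitors,prometheusrules -A | grep -i scone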

Auditing - CAS Audit Logs

The CAS audit log is a cryptographically provable journal of the security-sensitive operations executed over the lifetime of a CAS instance. The audit log can be evaluated to learn - among other things -

  • how a SCONE application's configuration came about,
  • how its secrets were created, and
  • which services accessed these secrets.

This can be used to prove that an application's configuration has not been tampered with. The log is created and signed by the SCONE-secured CAS itself, therefore even untrusted CAS owners or operators cannot tamper with the log. It might also be used for billing session owners.

💡 Check the sample15 cas_audit.log to get an impression of the format.

This audit log gets written:

  • into ephemeral storage at /var/log/cas/audit/cas_audit.log in the CAS pod. Since this audit log file is written to ephemeral storage, a new log is started with every start of the CAS pod - the audit log from the previous container execution will be lost.
  • to stdout, which will usually be collected by a log collector and pushed into a logging system.

⚠ Note: If you require the audit log file to be available permanently, even after restarting CAS, you should collect and ship the audit log to object storage or the logging system of your choice.

One can download the cas_audit.log with the following command.

kubectl exec -n osc-scone-system cas-0 -c audit -- tar cf - /var/log/cas/audit/cas_audit.log | tar xf - -C .
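
Since the audit log is also written to stdout of the audit container (the same container name used in the command above), it can be streamed with kubectl logs:

# Stream the audit log from the audit container's stdout
$ kubectl logs cas-0 -n osc-scone-system -c audit -f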

More details on verification of audit log entries, logged audit events, implementation details, etc. can be found in the CAS Audit Log Section16 of the Scone Upstream documentation.

Backup CAS DB into S3 Bucket

The Scone Service Operator deploys the CAS DB Backupper in order to periodically store CAS DB snapshots in an S3 bucket.

Creation

  • If you want to use an already existing S3 bucket, add its secret's name and namespace to spec.backup.s3Creds. The secret will be copied to the namespace osc-scone-system with the name s3-bucket-scone-cas-db-backupper (see the sketch after this list).
  • If there is already a secret in the namespace osc-scone-system with the name s3-bucket-scone-cas-db-backupper, this secret will be used to connect to the S3 bucket.
  • If you don't have an S3 bucket available, leave spec.backup.s3Creds empty and the Scone Service Operator creates an S3 bucket in the namespace osc-scone-system with the name s3-bucket-scone-cas-db-backupper.
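
For illustration, a scone CR that points the backupper at an existing secret might look like the following sketch; the secret name and namespace are placeholders:

cat <<EOF | kubectl apply -f -
apiVersion: confidential.security.osc.t-systems.com/v1alpha1
kind: Scone
metadata:
  name: scone
spec:
  serviceVersion: 5.8.0
  managementState: Managed
  backup:
    s3Creds:
      secret: my-existing-s3-secret      # placeholder: name of your existing secret
      namespace: my-namespace            # placeholder: namespace of that secret
EOF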

Update

If you update the secret or the S3 bucket, the CAS DB Backupper deployment should be restarted manually to update the environment variables used to connect to the S3 bucket.

Deletion

During deletion only the CAS DB Backupper Helm chart is removed. The original secret, the secret in the namespace osc-scone-system and the S3 bucket remain unchanged. You have to remove them manually when they are no longer needed.

⚠ IMPORTANT: Deleting the s3-bucket resource will remove the bucket and all objects inside of it.

Best Practices

Scontain recommends security-related best practices14 for running a confidential application in production mode. A collection of these best practices can be found in the Scone upstream documentation14.

Additionally, Scontain publishes security-relevant bulletins for running services in production to help you harden your applications.

| Bulletin | Reference |
| -------- | --------- |
| DEBUG MODE AND PORTS | https://sconedocs.github.io/S5_debug_mode/ |

Known Issues

Migration of CAS only works in same Availability Zone

Description: The CAS Backup feature registers all nodes which are capable of running SGX workloads as potential nodes for taking over CAS (by encrypting the database key with their SGX seal key). It does not matter in which Availability Zone a node resides. Currently, however, the attached storage can only be connected to workloads running in the same AZ. That means a CAS must stay on nodes in one AZ; because of this storage constraint, the k8s scheduler currently cannot place the CAS on a node in another AZ.

References