
User Guide

Learn how to work with the Scone as-a-service operator in a Kubernetes (k8s) environment.

One novel aspect of confidential computing is that applications can be protected against privileged software such as the operating system and the hypervisor. Even if attackers gain access to the host machines, they will not be able to read data from a workload that runs confidentially.

We make this happen by using Intel Software Guard Extensions (Intel SGX), a capability of Intel processors to run program code encrypted in isolated sections of the processor, so-called enclaves. With this feature you can protect your data not only at rest and in transit - a widely used pattern - but also while it is being processed in memory and in the processor.

A second novel aspect of confidential computing is that one can attest all components needed to ensure the confidentiality and integrity of the application. These components include the CPU, its firmware, the application code and the application's data. The attestation ensures that these components are up-to-date and that no vulnerabilities are known for them.

In other words, one can establish trust in all components that are required to execute the application.

Our hardware not only enables protecting workloads: we also add services on top of Kubernetes that simplify building, deploying and running confidential workloads with respect to protection and attestation. This is achieved with Scone.

A good starting point to find out more about Intel SGX and Scone Services is:

  • Intel SGX8
  • Scone documentation2
  • Scone Operator User Manual6

The Scone runtime environment consists of several services:

  • CAS - Configuration and Attestation Service
  • LAS - Local Attestation Service
  • SGX Plugin
  • Scone Operator

The Scone Operator manages the lifecycle of all Scone service components. Beyond that, it is needed for operational tasks such as handling policies.

The scone-operator itself is a crucial component. Who manages the lifecycle of the scone-operator, though?

With the help of our service, we manage the lifecycle of the scone-operator. That includes:

  • deploying the services initially
  • integrating the monitoring of the Scone services into Prometheus

Scope

This document should help you understand the deployment of the Scone services with the help of the scone-service-operator. Provisioning and upgrading the CAS and the Vault are not described in depth in this documentation since they are not managed by the scone-service-operator.

This user guide does not tell you how to build and run confidential workloads either. Please refer to the official Scone sites1 for information on that. There you will find a number of tutorials and samples that help you get started.

Concepts

Let us start with an overview of the Scone services and then continue with the idea of the scone-service-operator. Afterwards we will see how to deploy the services needed to run confidential workloads in production.

SCONE helps developers to run their applications inside of SGX enclaves. An Intel SGX enclave enables an application to protect its data from access by all other software - even the operating system. In particular, an application can protect all its data against adversaries with root access. A root user cannot dump the main memory of an application to get access to all its keys. Often, configuration files of applications are only protected by the filesystem. Again, a user with root access can read these configuration files and all secrets that they might contain. SCONE uses SGX to help encrypt configuration files and thereby protect them.

SCONE comprises a number of services that make it easier for developers to create confidential workloads for Kubernetes.

Configuration and Attestation Service (CAS)

SCONE CAS is a central component of the SCONE infrastructure. Programs executing in enclaves connect to CAS to obtain their confidential configuration. CAS provisions this configuration only after it has verified the integrity and authenticity of the requesting enclave using remote attestation. Additionally, CAS checks that the requesting enclave is authorized to obtain the confidential configuration. One can run CAS instances on the same node as the application, the same cluster, or a different cluster. The CAS operator enables us to configure CAS policies remotely using kubectl without needing to expose CAS to any external network.
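
Once deployed, the CAS instance is represented by a custom resource that can be inspected with kubectl. A minimal sketch, assuming the resource name cas and the namespace osc-scone-system used later in this guide:

# List the CAS custom resource managed by the SCONE operator
$ kubectl get cas -n osc-scone-system

# Show the detailed status of the CAS instance named "cas"
$ kubectl describe cas cas -n osc-scone-system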

Local Attestation Service (LAS)

A LAS instance must run on each Kubernetes node that supports confidential computing. Developers will not have to know about LAS as long as the SCONE operator can keep LAS running. In conjunction with CAS, it enables remote attestation of enclaves by performing a local attestation. Currently, LAS supports DCAP and EPID-based quoting enclaves. Additionally, it provides an independent SCONE quoting enclave (QE): The SCONE QE enables the decoupling of the availability of an application from Intel's attestation services. A use case of the SCONE QE is the air-gapped deployment of applications without Internet connectivity.

SGX Device Plugin Service (SGX Plugin)

The SCONE SGX Plugin simplifies the deployment of confidential applications by providing any unprivileged container access to the SGX devices on hosts that support SGX. Developers will not have to know about the SGX Plugin as long as the SCONE operator can keep the SGX Plugin running. A confidential application can only run on hosts with SGX support. When running in a Kubernetes cluster, you need to make sure your application's workload is scheduled to such nodes and that the application gets access to the corresponding SGX devices. A device plugin advertises hardware resources to the Kubelet. The SCONE SGX Plugin provides access to the SGX devices. To access the SGX devices of a node, however, the Kubernetes containers usually need to run in privileged mode. The SCONE SGX Plugin will not only provide access to the SGX devices on exactly those nodes in the cluster that have support for the SGX version that you need. It will also allow your containers to run in non-privileged mode. However, the SGX Plugin must have permission to access the SGX devices.
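
To see whether a node actually advertises SGX resources through the plugin, you can inspect the node object. The grep pattern below is an assumption, since the exact resource and label names depend on the plugin version:

# Show SGX-related labels and allocatable resources advertised on a node (names may vary by plugin version)
$ kubectl describe node <node-name> | grep -i sgx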

CAS Backup Controller

The CAS Backup Controller automatically registers backup encryption keys from all the nodes in the Kubernetes cluster. This allows the CAS to open its encrypted database on all cluster nodes using the encryption key of the node on which it is running. In this way, the CAS can be freely migrated to nodes in the cluster (e.g. for recovery after a node failure). Note that the CAS Backup Controller does NOT backup the CAS database: You need to explicitly back up the Kubernetes volume used by CAS.

Scone Kubernetes Operator

The SCONE Kubernetes Operator automates the management of SCONE-related services. It monitors the behavior of these services and ensures that the services stay in the desired state. This state is described with the help of Kubernetes custom resources.

The SCONE Operator defines a set of custom resources to manage the following SCONE resources:

  • CAS: the SCONE Configuration and Attestation Service
  • LAS: the SCONE Local Attestation Service
  • SGX Plugin: a Kubernetes plugin that provides containers with access to SGX
  • a confidential Vault
  • signed and/or encrypted SCONE CAS policies
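
A quick way to confirm that these custom resource types exist in your cluster is to query the API resources; the name filter below is only an assumption and may need adjusting to your installation:

# List the SCONE-related custom resource types known to the API server
$ kubectl api-resources | grep -iE 'cas|las|sgxplugin|vault|scone'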

Idea

The idea is to enable users to run their workloads confidentially. The services that make confidential computing available should be managed by Kubernetes operators so that the operational impact on users is as small as possible.

How do we achieve that?

To simplify the initial rollout of scone services and also the lifecycle management of the scone-operator we introduce another service operator: the scone-service-operator.

The scone-service-operator's responsibility is to manage the lifecycle of the scone-operator, which in turn monitors the Scone services and manages their lifecycle. The scone-service-operator's responsibility includes:

  • creating, updating and deleting the scone-operator
  • integrating the scone-operator into monitoring
  • initially provisioning the SGX Plugin, LAS and a non-provisioned CAS
  • creating CAS database backups

Here it is important to note that OSC as a cloud provider - with the help of the scone-service-operator and the scone-operator - creates the basic non-confidential components. All security-relevant tasks must be run by the owner of the CAS. Only the owner can run security-relevant tasks such as:

  • provisioning a CAS
  • upgrading a CAS

Provisioning is the step in which a user takes ownership of the CAS. During this process the user obtains the credentials that enable them to drive security-relevant management tasks.

Architecture

Let us see what the architectural overview looks like and how the pieces fit together.

Deployment Diagram

Enabling the Scone Service from the OSC Service-Catalog deploys the scone-service-operator into your cluster, into the namespace osc-sec-scone-svc-operator.

The user, usually a service administrator on the customer's end, then creates a scone CR (a cluster-scoped CR of kind scone.confidential.security.osc.t-systems.com) to configure the service. The important properties to set here are:

  • the service version to be installed
  • a flag indicating whether CAS should be installed (by default CAS will be installed)

Since CAS is optional, its installation can be skipped. This is useful, for example, if another CAS - possibly in another cluster - is already in place to provide attestation and configuration services.

The scone-service-operator is triggered by the creation of this scone CR. The controller then creates all resources needed to run the scone-operator in the namespace osc-sec-us-scone-operator, as well as the configurations for the services to be installed, i.e. LAS, CAS and the SGX Plugin. These services are created in the namespace osc-scone-system.

Additionally, the scone-service-operator creates the scone-cas-db-backupper in the namespace osc-scone-system, which periodically pulls a snapshot of the CAS database and pushes it into an S3 bucket.
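
To verify the layout described above, you can list the workloads in the involved namespaces. This is only a sketch; the exact pod names depend on the installed version:

# The scone-service-operator itself
$ kubectl get pods -n osc-sec-scone-svc-operator

# The scone-operator
$ kubectl get pods -n osc-sec-us-scone-operator

# CAS, LAS, SGX Plugin and the CAS DB Backupper
$ kubectl get pods -n osc-scone-system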

Supported Scone Versions

Scone follows the same support policy as the Kubernetes project, an N-2 policy in which the three latest minor releases are maintained. Although previous versions may work, they are not tested and therefore no guarantees are made as to their full compatibility. Scone also follows a similar strategy for the supported Kubernetes versions. The table below shows the compatibility matrix.

Scone Support Matrix

| Scone Version | K8s Min | K8s Max | Upgrade to / New Install supported |
| ------------- | ------- | ------- | ---------------------------------- |
| 5.8.0-rc.19   | 1.23    | 1.25    | ✅                                 |
| 5.8.0         | 1.24    | 1.26    | ✅                                 |

Lifecycle Management

It is all about installing, upgrading, repairing and deleting the scone-operator.

This is all done by providing a Kubernetes resource of kind scone.confidential.security.osc.t-systems.com. This is a cluster-scoped resource. It reflects the desired state in its spec section, for example as follows:

The desired state of the following definition is to have the Scone services in version 5.8.0 in place, including a CAS installation (skipCas: false). The scone-service-operator takes care of reaching the desired state.

apiVersion: confidential.security.osc.t-systems.com/v1alpha1
kind: Scone
metadata:
  name: scone
spec:
  serviceVersion: 5.8.0
  managementState: Managed
  skipCas: false
  #serviceNodeSelector:
    #osc.services/enabled: "true"
  #operatorNodeSelector:
    #osc.sgx/enabled: "true"
  #backup:
    #interval: 3600
    #retention: 20
    #s3Creds:
      #secret: <name of secret>
      #namespace: <namespace of secret>

Scone Specification

| Field | Default | Description | Optional |
| ----- | ------- | ----------- | -------- |
| spec.serviceVersion | - | Version of the Scone services to be managed | no |
| spec.managementState | Managed | Managed: the operator manages the lifecycle. Unmanaged: set this state if you want to manage the Scone operator/services yourself | no |
| spec.skipCas | false | If you don't want CAS installed, set this flag to true; if you want CAS installed, set it to false. NOTE: An already installed CAS will be removed if skipCas is set to true afterwards. | yes |
| spec.serviceNodeSelector | - | NodeSelector for CAS, LAS, SGX Plugin and the CAS DB Backupper | yes |
| spec.operatorNodeSelector | - | NodeSelector for the Scone Operator | yes |
| spec.backup.interval | 14400 | Frequency of the CAS DB backup in seconds | yes |
| spec.backup.retention | 10 | Expiry in days for CAS DB backups | yes |
| spec.backup.s3Creds.secret | - | Name of the secret of an already existing S3 bucket. The secret will be used by the CAS DB Backupper and copied to the namespace osc-scone-system with the name s3-bucket-scone-cas-db-backupper. If s3Creds is empty, the operator creates an S3 bucket in the namespace osc-scone-system with the name scone-cas-db-backupper. | yes |
| spec.backup.s3Creds.namespace | - | Namespace of the secret of an already existing S3 bucket. See spec.backup.s3Creds.secret. | yes |

The status section of the scone custom resource reflects the current lifecycle state of the scone-service-operator landscape.

Scone Status Fields

| Field | Description |
| ----- | ----------- |
| status.deployedSpec | A hash value of the latest deployed specification. This hash is created from all fields of spec. |
| status.deployedVersion | The latest deployed version of the Scone services. This can differ from spec.serviceVersion in case an error occurred. |
| status.lastSuccessPhase | The previous phase. It can be any phase except Error. |
| status.lastUpdate | When the latest reconcile happened. |
| status.observedGeneration | The latest generation that has been reconciled - see also metadata.generation. |
| status.phase | The current state of the service-operator. |
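
The status fields can be read directly with jsonpath. A small sketch using the field paths from the table above:

# Current lifecycle phase of the scone CR
$ kubectl get scone scone -o jsonpath='{.status.phase}{"\n"}'

# Version of the Scone services that is actually deployed
$ kubectl get scone scone -o jsonpath='{.status.deployedVersion}{"\n"}'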

The following diagram depicts the possible transitions and phases of a scone.confidential.security.osc.t-systems.com resource.

Deployment Diagram

Let's walk through the phases.

A user creates a new Custom Resource (CR) of kind scone.confidential.security.osc.t-systems.com and applies it to the cluster. The initial phase is Empty. At this point the operator adds a finalizer to the resource and puts the CR into phase Creating. In phase Creating the service operator installs the scone-operator and the Scone services by applying the Helm charts. After a successful installation the CR moves to phase Running, which indicates that the scone-operator is ready to use. That should be the normal case.

When a user modifies the specification, e.g. changes spec.backup.interval or spec.backup.retention, the CR moves to phase Updating, which indicates that the change is being applied. After the scone-operator or the services have been updated successfully, the CR returns to phase Running.

If anything went wrong, for any reason, and the reconciliation to the desired state failed, the CR moves to phase Error. The service operator tries up to 5 times to reach the desired state from phase Error. If it cannot reach the desired state after 5 attempts, it stops trying. In that case manual intervention by the user is needed.

The user (or rather a human operator) then has to repair or troubleshoot the deployment. Changes to the scone CR put the scone-service-operator into phase Repairing, in which it tries to repair the scone-operator and the services.
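
Since any change to the scone CR re-triggers reconciliation, a hedged way to nudge the operator out of phase Error is to patch a documented spec field, for example spec.backup.interval (the value below is only an example):

# Touch a spec field to trigger a new reconciliation attempt
$ kubectl patch scone scone --type merge -p '{"spec":{"backup":{"interval":14400}}}'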

You can remove the whole stack by deleting the CR. This applies a helm uninstall of the installed package.

NOTE: Deletion of the scone CR not only removes the scone-operator, it also removes all Scone services that came with the installed Helm packages. So pay attention!

The created S3 bucket remains in the osc-scone-system namespace so that CAS DB snapshots stay available to recover a CAS from. To remove the bucket and all objects stored inside it, one has to remove the S3 bucket resource explicitly.

Installation

Installing the whole service stack is a three-step procedure:

  1. Enabling the services
  2. Installing the services initially
  3. Provisioning a production ready confidential CAS

Deployment Diagram

The following chapters describe each step in more detail.

Installing the scone-service-operator

The first step is to enable the Scone services from the Busola dashboard. This automatically deploys the scone-service-operator into the namespace osc-sec-scone-svc-operator.

This enables users to deploy the scone-operator and scone services initially. See next chapter.

Installing services initially

The user applies a resource of kind scone.confidential.security.osc.t-systems.com. Please check the table Scone Specification for valid properties.

Example: Add the scone-operator and the Scone services in version 5.8.0, including CAS.

cat <<EOF | kubectl apply -f -
apiVersion: confidential.security.osc.t-systems.com/v1alpha1
kind: Scone
metadata:
  name: scone
spec:
  serviceVersion: 5.8.0
  managementState: Managed
EOF

The creation of this resource triggers the scone-service-operator to create the scone-operator and the initial services for LAS, CAS and the SGX Plugin.
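
After applying the CR you can watch the operator converge towards the desired state. A sketch; the exact pod names depend on the release:

# Observe the lifecycle phase of the scone CR
$ kubectl get scone scone

# Watch the Scone services come up
$ kubectl get pods -n osc-scone-system -w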

Provisioning CAS

To take ownership of the CAS, one has to provision it.

A detailed description can be found in the scontain upstream documents3.

We outline the specific parameters that have to be used:

  • usage of MTR as registry
  • usage of mirrored images, prefixed with osc-

Prerequisites

Provisioning the CAS takes place from your local trusted machine. To provision the CAS you need the following tools on your local machine; a quick version check is sketched after the list.

Tools

  • docker
  • kubectl
  • kubectl-provision plugin
  • git
  • helm
  • jq
  • cosign
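
A quick sanity check that the tools are available might look like this (output formats differ per tool):

$ docker --version
$ kubectl version --client
$ git --version
$ helm version
$ jq --version
$ cosign version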

Credentials

Since the scone-operator repository hosted on GitHub is private, please make sure you have been granted access.

Install kubectl-provision plugin

# Version to provision
$ export VERSION=5.8.0

# Upstream URL of kubectl-provision plugin
$ export SCONE_KUBECTL_PROVISION_PLUGIN="https://raw.githubusercontent.com/scontain/SH/master/${VERSION}/kubectl-provision"

# Folder to hold your kubectl-provision plugin. It has to be in your PATH.
$ export SCONE_KUBECTL_PROVISION_PLUGIN_BIN=/usr/local/bin/kubectl-provision

# Download plugin and make it executable
$ curl -fsSL "${SCONE_KUBECTL_PROVISION_PLUGIN}"  -o "${SCONE_KUBECTL_PROVISION_PLUGIN_BIN}"
$ chmod a+x "${SCONE_KUBECTL_PROVISION_PLUGIN_BIN}"
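
You can verify that kubectl picks up the plugin, assuming the target directory is on your PATH:

# kubectl lists executables named kubectl-* found on the PATH
$ kubectl plugin list | grep provision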

DCAP API Key

It is recommended to subscribe to the Intel Provisioning Certification Service (Intel PCS9), which provides you with the API keys needed to query the service for Intel SGX attestation collateral.

Each subscription to the Intel PCS Service for ECDSA Attestation issues two API keys: a primary key and a secondary key. Either one can be used. The point of issuing two keys is to provide continuity of service in the event the active key needs to be regenerated.

If you do not use your personal DCAP API key when provisioning CAS later, a default key will be used. With that default DCAP API key there are no guarantees about validity and availability.

Provision CAS

To provision CAS, run the following commands.

# Version to provision
$ export VERSION=5.8.0
# use osc- prefix for images
$ export IMAGE_PREFIX=osc-
# fetch images from our MTR
$ export IMAGE_REPO=mtr.devops.telekom.de/osc/security/confidential-computing

$ export DCAP_APIKEY=<YOUR DCAP API-KEY>
$ kubectl provision cas cas --verbose --namespace osc-scone-system --dcap-api ${DCAP_APIKEY}

After a while CAS should reach the state healthy, provisioned and attestable. It is then ready to use.

NAMESPACE          NAME   STATUS    PHASE     PROVISIONED   ATTESTABLE   SNAPSHOTS   MIGRATION   VERSION                         AGE
osc-scone-system   cas    HEALTHY   HEALTHY   Yes           Yes          Persisted   2/2         5.7.0-645-gf7463d920-nikolaus   15m

Upgrade the Scone services

The upgrade of the Scone services is a two-step procedure:

  • Upgrade CAS
  • Upgrade Services LAS, SGX-Plugins

It starts with upgrading CAS. Afterwards SGX-Plugin and LAS can be upgraded.

ATTENTION: Before upgrading the Scone services to a new version, make sure that the version is supported by OSC. Check the Scone Support Matrix above.

Upgrade CAS

The upgrade of CAS to a new version is again done with the kubectl-provision plugin. To upgrade CAS, execute the command below. Make sure you meet the prerequisites for the new version to be installed: install the corresponding version of the kubectl-provision plugin first and make sure that you have all credentials in place to execute the CAS upgrade.

$ kubectl provision cas cas -n osc-scone-system --upgrade "<VERSION>"

After CAS has been successfully updated, the service controller upgrades the scone CR and all other Scone services.
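
To confirm that the new version has been rolled out, the CAS resource and the status fields described earlier can be checked. A sketch:

# The CAS resource should report the new version
$ kubectl get cas cas -n osc-scone-system

# The scone CR should eventually report the upgraded service version
$ kubectl get scone scone -o jsonpath='{.status.deployedVersion}{"\n"}'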

Delete the Scone service

To remove all scone services remove your scone CR from your cluster.

$ kubectl delete scone scone

ATTENTION: All Scone services and resources are removed when the scone CR is deleted.

Snapshot (Backup) and Recovery CAS

Snapshotting the CAS database is enabled by default. Snapshots are stored on a separate volume. Additionally, snapshots are periodically pushed to an S3 bucket. Refer to the section Backup CAS DB into S3 Bucket.

To recover CAS from a snapshot, follow the upstream documentation5. To recover the CAS database you need a snapshot of the CAS database on your local machine. This snapshot can come from the S3 bucket where snapshots are stored periodically or from a manually created backup made with the kubectl provision tool.

Find examples for backup and recovery below.

Backup

$ kubectl provision cas cas --local-backup -n osc-scone-system

Recovery

$ export DCAP_APIKEY=<your dcap api key>
$ kubectl provision cas cas --cas-database-recovery cas-osc-scone-system-last-snapshot-db --verbose --namespace osc-scone-system --dcap-api ${DCAP_APIKEY}

NOTE: A restore provisions a new CAS. Make sure that no CAS CR is applied and no CAS is running before running the recovery. Furthermore, during the recovery the scone-service-operator needs to be set to Unmanaged; otherwise the scone-service-operator would create a CAS resource again, which would conflict with the one coming from the recovery. After a successful recovery the scone-service-operator can be set back to the ManagementState Managed.
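
Switching the management state around the recovery could look like the following sketch, using the documented spec.managementState field:

# Hand over control before the recovery
$ kubectl patch scone scone --type merge -p '{"spec":{"managementState":"Unmanaged"}}'

# ... perform the recovery as described above ...

# Give control back to the scone-service-operator afterwards
$ kubectl patch scone scone --type merge -p '{"spec":{"managementState":"Managed"}}'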

Shoot K8s Cluster Management

Shoot cluster management activities have to be planned carefully and thoroughly with respect to the running Scone services, especially the CAS database, when it comes to removing a machine (a k8s node).

The CAS database is always encrypted to protect the confidentiality of the contained secrets. The database encryption key is stored as cipher text in a separate file with the same filename, but ending with .key-store.

The key used to encrypt the database encryption key is derived using SGX' seal key derivation feature. Therefore, CAS can only decrypt the database encryption key and access the database if it is executed on the same machine which has created the .key-store file, or the key is explicitly made available to a CAS on another machine with CAS' Backup feature.

In particular, this means that if the machine hosting CAS breaks, e.g. due to hardware failure, or is not available anymore, for example because it was a cloud machine, access to all data stored by CAS is lost!

Thanks to the CAS Backup feature, all SGX-capable machines in your cluster have been registered. That means that if, for any reason, CAS is scheduled onto another (registered) machine, it can still access the data because it is able to decrypt the database key with that machine's SGX seal key.

Have at least 2 nodes (better more) available to overcome an outage of the machine hosting your CAS.

It is not sufficient that all machines are registered to host a CAS. A machine must also be able to connect to the Persistent Volume (PV) where the CAS database is stored. Unfortunately, Open Sovereign Cloud currently does not support spanning a PV across multiple Availability Zones (AZ). That means: only a machine running in the same AZ as the CAS can take over the CAS.

WARNING: Make sure that you have multiple registered (migratable) machines in the same AZ where your CAS is running. Otherwise the CAS will not survive an outage of the machine it was running on.

The limitations mentioned above have an impact on actions such as:

  • Upgrading the shoot's k8s version
  • Resizing the shoot cluster
  • Removing a node hosting a CAS from service

CAS is protected by a Pod Disruption Budget (PDB). This ensures that there is always one instance running, since Gardener respects the PDB and evicts nodes safely. It also means that Gardener stops draining the node on which CAS is running.

An administrator/user then has to evict the CAS manually. This is described in the following chapter.

Evict CAS from a machine (k8s node)

This section covers steps to safely move a CAS instance from one machine to another machine in general.

The procedure of safely evicting a CAS is needed in case of:

  • Upgrading the shoot's k8s version
  • Resizing the shoot cluster
  • Draining a node hosting CAS for maintenance reasons
  • Removing a machine hosting a CAS; this also includes removing the underlying bare metal server
  • etc.

Before evicting CAS from a node, one has to make sure that another node is available which is capable of running the CAS. Once this is verified, one can transfer the CAS to the other node.

Assure another node is capable to host the CAS Instance

A node capable of hosting a CAS must fulfill several criteria.

The node must:

  • be registered to run CAS backups,
  • reside in the same AZ,
  • be schedulable, and
  • have certain labels set.

If there is no such node available, one has to add another node to the cluster which supports SGX workloads and which is located in the same AZ where CAS is currently running. The labeling of the node is done automatically by the scone-operator.

A potential machine to move CAS to must reside in the same AZ as the machine that is currently hosting the CAS. This is because it must be able to mount the Persistent Volume with the CAS database, which is currently not possible if the machine runs in a different AZ. In the near future this will change and cross-AZ connections will be enabled.

To figure out in which AZ the CAS is currently running, execute the following commands.

# get node where CAS is running
$ export CASNODE=$(kubectl get pod cas-0 -n osc-scone-system -o json | jq '.spec.nodeName' | tr -d '"' )
$ echo $CASNODE

# get AZ where node resides
$ export CASAZ=$(kubectl get node $CASNODE -o json | jq '.metadata.labels."topology.kubernetes.io/zone"' | tr -d '"')
$ echo $CASAZ

To find nodes which can potentially host the CAS, except for the node that is currently hosting it, run the command below.

This command considers:

  • the correct AZ,
  • registered nodes,
  • the other labels needed to run CAS, and
  • that the node is schedulable.

$ kubectl get nodes -o json | jq '.items[] |
  select(.metadata.labels."topology.kubernetes.io/zone"==$ENV.CASAZ) |
  select (.metadata.labels."cas-registered-cas.osc-scone-system-5.8.0"=="true") |
  select (.metadata.labels."las.scontain.com/capable" == "true")|
  select (.metadata.labels."las.scontain.com/ok" == "true") |
  select (.metadata.labels."sgx.intel.com/capable" == "true") |
  select (.spec.unschedulable != true) |
  .metadata.name' |
  tr -d '"' |
  grep -v $CASNODE 

If the query above does not return any nodes then one first has to add a new node into the cluster. Now that we have assured that there is at least one node which is capable to host a CAS we can continue.

Move CAS to another node

Follow these steps to move the CAS pod to another node.

Disable scheduling to CAS node

Since we want to move CAS to another node, we need to mark the node currently hosting it as unschedulable, so that CAS gets scheduled onto another node.

$ kubectl cordon $CASNODE

Delete CAS Pod

Finally the CAS pod can be deleted. Kubernetes will then place the pod onto another node.

$ kubectl delete pod cas-0 -n osc-scone-system

Verify CAS is healthy again

Check that the CAS pod is running and that CAS is in state healthy.

# CAS is running again - verify that CAS pod is running on another node than before 
$ kubectl get pod cas-0 -n osc-scone-system -o wide

# CAS is in state healthy
$ kubectl get cas cas -n osc-scone-system

Enable scheduling

If needed then set the CAS node back to schedulable.

$ kubectl uncordon $CASNODE

Upgrading the shoot k8s version

According to the Gardener documentation12 there is a difference between a minor version update (e.g. 1.25 to 1.26) and a patch version update (e.g. 1.26.0 to 1.26.1).

Patch version upgrade happens in-place. This means that the shoot worker nodes remain untouched and only the kubelet process restarts with the new Kubernetes version binary.

In contrast a Minor Version Upgrade happens in a rolling update fashion. The worker nodes will be terminated one after another and replaced by new machines. The existing workload is gracefully drained and evicted from the old worker nodes to new worker nodes, respecting the configured PodDisruptionBudgets (PDB).

CAS is protected by such a PDB. Gardener will not drain the node where CAS is currently placed unless someone removes CAS from the node.

Here manual interaction by the user/administrator comes into play. They have to make sure that another machine is available that is capable of running the CAS. Once this is assured, they can evict the CAS instance from the affected node. Gardener will then continue with upgrading the remaining nodes.

In the Gardener Dashboard you'll find a message that a Pod Disruption Budget prevents draining and evicting the node.

⚠ IMPORTANT: Before starting a k8s version upgrade, make sure that the CIDR of your subnet allows for the additional nodes that will be added during the rolling update. If needed, scale down the nodes first. E.g. if your subnetwork has a /28 CIDR, which allows a maximum of 16 nodes, make sure that at most 15 nodes are running before upgrading the cluster.
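
A simple way to check how many nodes are currently registered before starting the upgrade:

# Count the nodes currently in the cluster
$ kubectl get nodes --no-headers | wc -l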

Assure another node is capable to host a CAS Instance

Check that another node is capable of hosting the CAS instance so that CAS can be evicted safely (see Assure another node is capable to host the CAS Instance). Continue with the following step only if at least one such node is available.

Delete CAS Pod

Delete the CAS pod. Kubernetes will then place the pod onto another node.

$ kubectl delete pod cas-0 -n osc-scone-system

Verify CAS is in a healthy state

Refer to Verify CAS is healthy again to check if CAS is up and running again.

Resizing the shoot cluster

This chapter briefly discusses what needs to be considered with regard to CAS when adding or removing nodes from a cluster, i.e. when scaling the cluster up or down.

Scaling up the cluster means adding nodes to the cluster. This does not have a severe impact on CAS, as it stays on the node where it is running and does not have to be evicted. The added nodes are registered automatically by the CAS Backup feature.

Scaling down can have a severe impact on CAS. Since the cluster scaler selects the nodes to remove from the cluster randomly, it can happen that it selects the machine on which CAS is currently running. In that case the cluster scaler respects the Pod Disruption Budget, which prevents it from removing the node until an administrator/user explicitly drains and evicts the node. The administrator/user has to make sure that there is another node capable of running the CAS.

In the Gardener Dashboard you'll find a message that a Pod Disruption Budget prevents draining and evicting the node.

Assure another node is capable to host a CAS Instance

Refer to the chapter Assure another node is capable to host the CAS Instance to safely evict CAS. Continue with the following step only if at least one such node is available.

Delete CAS Pod

Delete the CAS pod. Kubernetes will then place the pod onto another node.

$ kubectl delete pod cas-0 -n osc-scone-system

Verify CAS is in a healthy state

Refer to Verify CAS is healthy again to check if CAS is up and running again.

Remove a node from a service

This chapter briefly discusses what needs to be considered when you want to remove a node hosting a CAS from service for any reason, e.g. maintenance of the VM or maintenance of the underlying bare metal server.

Check if CAS is running on node to remove

The first step is to double-check whether CAS is running on the node that should be removed from service.

# get node where CAS is running
$ export CASNODE=$(kubectl get pod cas-0 -n osc-scone-system -o json | jq '.spec.nodeName' | tr -d '"' )
$ echo $CASNODE

Verify whether $CASNODE is the node you want to remove. If it is, continue with the next step to make sure there is another node that can host CAS; otherwise you do not have to care about evicting CAS safely, since it is not running on this node.

Assure another node is capable to host a CAS Instance

Refer to the chapter Assure another node is capable to host the CAS Instance to safely evict CAS. Continue with the following steps only if at least one such node is available.

Move CAS to another node

Follow the description in Move CAS to another node to move the pod to another node.

Verify CAS is in a healthy state

Refer to Verify CAS is healthy again to check that CAS is up and running and healthy again.

Drain the node

Evict all other workloads gracefully to other nodes.

$ kubectl cordon $CASNODE
$ kubectl drain $CASNODE --force --ignore-daemonsets --delete-emptydir-data

Take node out of service

Now that CAS has been evicted successfully, you can remove the node from the cluster.

Vault

The Scone-Operator manages the lifecycle of a confidential HashiCorp Vault. Given that, you can let the scone-operator manage a confidential Vault for you.

Refer to upstream documentation on provisioning and deploying a confidential vault4.

Refer to the very detailed upstream documentation Confidential Hashicorp Vault on Kubernetes7, where you will find a step-by-step guide to provision a confidential Vault without using the kubectl-provision plugin.

Provision a vault

It is recommended to use mirrored images from MTR to build confidential Vault.

Use the following environment settings to use the mirrored images.

# Version to provision
$ export VERSION=5.8.0
# use osc- prefix for images
$ export IMAGE_PREFIX=osc-
# fetch images from our MTR
$ export IMAGE_REPO=mtr.devops.telekom.de/osc/security/confidential-computing

$ kubectl provision vault cas --verbose --namespace osc-scone-system

Monitoring

There are two components to be monitored:

  • scone-operator,
  • and scone-service-operator.

Scone-Operator Monitoring

The monitoring of the scone-operator11 automatically integrates into an available Prometheus monitoring stack, including a dashboard for Grafana.

Scone Service Operator Monitoring

The scone-service-operator itself integrates into monitoring for operational purposes. Alert rules are also applied automatically in order to inform the operations team about critical issues.

The monitoring of the Operator10 is deployed and activated automatically during the Operator deployment.
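
Assuming a Prometheus Operator based monitoring stack, you can check whether the monitoring objects were created; the name filter below is an assumption:

# Look for ServiceMonitors and PrometheusRules related to the Scone components
$ kubectl get servicemonitors,prometheusrules -A | grep -i scone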

Auditing - CAS Audit Logs

The CAS audit log is a cryptographically provable journal of the security-sensitive operations executed over the lifetime of a CAS instance. The audit log can be evaluated to learn - among other things -

  • how a SCONE application's configuration came about,
  • how its secrets were created, and
  • which services accessed these secrets.

This can be used to prove that an application's configuration has not been tampered with. The log is created and signed by the SCONE-secured CAS itself, therefore even untrusted CAS owners or operators cannot tamper with the log. It might also be used for billing session owners.

💡 Check the sample15 cas_audit.log to get an impression of the format.

This audit log gets written:

  • into ephemeral storage at /var/log/cas/audit/cas_audit.log in the CAS pod. Since this audit log file is written to ephemeral storage, a new log is started with every start of the CAS pod - the audit log from the previous container execution will be lost.
  • to stdout, which will usually be collected by a log collector and pushed into a logging system.

⚠ Note: If you require the audit log file to be available permanently, even after restarting CAS, you should collect and ship the audit log to object storage or the logging system of your choice.

One can download the cas_audit.log with the following command.

kubectl exec -n osc-scone-system cas-0 -c audit -- tar cf - /var/log/cas/audit/cas_audit.log | tar xf - -C .
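
Since the audit log is also written to stdout of the audit container (the same container name used in the command above), it can be streamed with kubectl logs:

# Stream the audit log from the audit container's stdout
$ kubectl logs cas-0 -n osc-scone-system -c audit -f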

More details on verification of audit log entries, logged audit events, implementation details, etc. can be found in the CAS Audit Log Section16 of the Scone Upstream documentation.

Backup CAS DB into S3 Bucket

The Scone Service Operator deploys the CAS DB Backupper in order to periodically store CAS DB snapshots in an S3 bucket.

Creation

  • If you want to use an already existing S3 bucket, add its secret's name and namespace to spec.backup.s3Creds. The secret will be copied to the namespace osc-scone-system with the name s3-bucket-scone-cas-db-backupper (see the sketch after this list).
  • If there is already a secret in the namespace osc-scone-system with the name s3-bucket-scone-cas-db-backupper, this secret will be used to connect to the S3 bucket.
  • If you don't have an S3 bucket available, leave spec.backup.s3Creds empty and the Scone Service Operator creates an S3 bucket in the namespace osc-scone-system with the name s3-bucket-scone-cas-db-backupper.
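
For illustration, a scone CR that points the backupper at an existing secret might look like the following sketch; the secret name and namespace are placeholders:

cat <<EOF | kubectl apply -f -
apiVersion: confidential.security.osc.t-systems.com/v1alpha1
kind: Scone
metadata:
  name: scone
spec:
  serviceVersion: 5.8.0
  managementState: Managed
  backup:
    s3Creds:
      secret: my-existing-s3-secret      # placeholder: name of your existing secret
      namespace: my-namespace            # placeholder: namespace of that secret
EOF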

Update

If you update the secret or the S3 bucket, the CAS DB Backupper deployment should be restarted manually to update the environment variables used to connect to the S3 bucket.

Deletion

During deletion only the CAS DB Backupper Helm chart is removed. The original secret, the secret in the namespace osc-scone-system and the S3 bucket remain unchanged. You have to remove them manually when they are no longer needed.

⚠ IMPORTANT: Deleting the s3-bucket resource will remove the bucket and all objects inside of it.

Best Practices

Scontain recommends security-related best practices14 for running a confidential application in production mode. A collection of these best practices can be found in the Scone upstream documentation14.

Additionally, Scontain publishes security-relevant bulletins for running services in production to help you harden your applications.

| Bulletin | Reference |
| -------- | --------- |
| DEBUG MODE AND PORTS | https://sconedocs.github.io/S5_debug_mode/ |

Known Issues

Migration of CAS only works in same Availability Zone

Description: The CAS Backup feature registers all nodes which are capable of running SGX workloads as potential nodes for taking over CAS (by encrypting the database key with their SGX seal key). It does not matter in which Availability Zone a node resides. Currently, however, the attached storage can only be connected to workloads running in the same AZ. That means a CAS must stay on nodes in one AZ; because of this storage constraint, the k8s scheduler currently cannot place the CAS on a node in another AZ.

References