
Monitoring SGX Memory

Overview

The cgroups-prometheus-exporter monitors and exposes resource consumption based on Linux cgroup information. Currently, it exports SGX EPC usage as Prometheus metrics.

Exported Metrics for Cgroups Prometheus Exporter

The cgroups-prometheus-exporter exports the following metrics:

| Metric Name | Description | Type | Labels |
| --- | --- | --- | --- |
| osc_cgroups_sgx_epc_used_bytes | Total amount of SGX EPC memory currently used by the cgroup, in bytes | Gauge | scope, qos, uid, node, namespace, pod, container |
| osc_cgroups_sgx_epc_limit_bytes | Total limit of SGX EPC memory that the cgroup can use, in bytes; 0 if no limit is set | Gauge | scope, qos, uid, node, namespace, pod, container |
| osc_cgroups_memory_used_bytes | Total amount of main memory currently used by the cgroup, in bytes | Gauge | scope, qos, uid, node, namespace, pod, container |
| osc_cgroups_sgx_epc_events | Events in which SGX EPC memory reaches the limit | Gauge | scope, qos, uid, node, namespace, pod, container |
| osc_cgroups_exporter_status | Populated if cgroup directories or files cannot be processed | Gauge | msg, path |

Explanation of the Metric Label Parameters

scope
Defines the level at which the value is reported. Possible values:
  • qos: Summarised value for the respective Kubernetes Quality of Service class.
  • pods: Summarised value for all pods running on the respective Kubernetes node.
  • pod: Value for an individual pod.
  • container: Value for an individual container.
qos
The QoS class of the respective pod/container.
uid
The unique identifier of the container or pod.
node
The name of the node on which the pod is deployed. This label helps to correlate the pod to the node.
msg
Type of processing error.
path
Path of the file where the processing problem occurred.
namespace
The namespace of the pod.
pod
The name of the pod.
container
The name of the container.

EPC memory consumption metrics

OscCgroupsSgxEpcUsedBytes and OscCgroupsSgxEpcLimitBytes are exported as the Prometheus metrics osc_cgroups_sgx_epc_used_bytes and osc_cgroups_sgx_epc_limit_bytes. They allow the usage and limit of SGX EPC memory to be monitored per scope (for example per QoS class, pod, or container) in a containerized environment.

OscCgroupsSgxEpcUsedBytes

This metric represents the amount of currently consumed SGX EPC memory for different scopes (indicated by the scope label). It is measured in bytes.

OscCgroupsSgxEpcLimitBytes

This metric represents the upper limit of SGX EPC memory that the cgroup can use. It is also measured in bytes.

Note

If no upper limit is defined, the metric is set to 0.
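Because a limit of 0 means "not set" rather than "zero bytes allowed", any ratio of usage to limit has to treat that value specially. The following is a minimal sketch of such a check; the function name `epc_utilization` is hypothetical and not part of the exporter.

```python
from typing import Optional


def epc_utilization(used_bytes: float, limit_bytes: float) -> Optional[float]:
    """Return used/limit as a fraction, or None when no limit is set.

    A limit of 0 means "no upper limit defined", so dividing by it
    would be meaningless (and would raise ZeroDivisionError).
    """
    if limit_bytes == 0:
        return None
    return used_bytes / limit_bytes


if __name__ == "__main__":
    print(epc_utilization(1048576, 2097152))  # 0.5
    print(epc_utilization(1048576, 0))        # None
```

Returning `None` instead of a number keeps "no limit" distinguishable from "0 % used" in whatever consumes the result.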

Example

Here's an example of how these metrics might appear in Prometheus:

# HELP osc_cgroups_sgx_epc_used_bytes The amount of SGX EPC memory in bytes currently being used by a container.
# TYPE osc_cgroups_sgx_epc_used_bytes gauge
osc_cgroups_sgx_epc_used_bytes{scope="pod",qos="guaranteed",uid="abc123",node="sgxnode",container="",namespace="kube-system",pod=""} 1048576

# HELP osc_cgroups_sgx_epc_limit_bytes The maximum limit of SGX EPC memory in bytes that a container can use.
# TYPE osc_cgroups_sgx_epc_limit_bytes gauge
osc_cgroups_sgx_epc_limit_bytes{scope="pod",qos="guaranteed",uid="abc123",node="sgxnode",container="",namespace="kube-system",pod=""} 2097152

In this example, the pod with UID "abc123" is currently using 1 MiB of its allocated 2 MiB of SGX EPC memory.
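To make the reading of these samples concrete, here is a minimal sketch that parses two sample lines and pairs usage with limit per UID. It uses the metric names from the table of exported metrics. The regular expressions are a simplification of the Prometheus text format (no escaped quotes, no timestamps), and the helper names `parse_samples` and `usage_by_uid` are hypothetical.

```python
import re

# Simplified patterns for one sample line: name{label="value",...} value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

EXPOSITION = '''\
# TYPE osc_cgroups_sgx_epc_used_bytes gauge
osc_cgroups_sgx_epc_used_bytes{scope="pod",qos="guaranteed",uid="abc123",node="sgxnode",container="",namespace="kube-system",pod=""} 1048576
# TYPE osc_cgroups_sgx_epc_limit_bytes gauge
osc_cgroups_sgx_epc_limit_bytes{scope="pod",qos="guaranteed",uid="abc123",node="sgxnode",container="",namespace="kube-system",pod=""} 2097152
'''


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified pattern cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def usage_by_uid(samples):
    """Map uid -> (used_bytes, limit_bytes) for samples with scope="pod"."""
    result = {}
    for name, labels, value in samples:
        if labels.get("scope") != "pod":
            continue
        uid = labels.get("uid", "")
        used, limit = result.get(uid, (None, None))
        if name == "osc_cgroups_sgx_epc_used_bytes":
            used = value
        elif name == "osc_cgroups_sgx_epc_limit_bytes":
            limit = value
        result[uid] = (used, limit)
    return result


if __name__ == "__main__":
    for uid, (used, limit) in usage_by_uid(parse_samples(EXPOSITION)).items():
        print(uid, used / 2**20, "MiB used of", limit / 2**20, "MiB")
```

For production use, a complete parser such as the one shipped with the Prometheus Python client is likely a better choice than these hand-written patterns.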

OscCgroupsMemoryUsedBytes

The osc_cgroups_memory_used_bytes metric reports the main memory consumption in bytes.

OscCgroupsSgxEpcEvents

The osc_cgroups_sgx_epc_events metric reports EPC memory paging events.

OscCgroupsExporterStatus

The osc_cgroups_exporter_status metric reports failures to process any of the supported memory files, which indicates a problem accessing the cgroups directory or reading its files. The msg label carries the error message and the path label carries the path of the file that failed. If an error occurs, the gauge is set to 1; otherwise, the value for a specific directory is 0. This metric is used for our Prometheus alerts.
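As an illustration of the 1/0 convention, the sketch below filters already-parsed status samples down to the ones that signal an error. The function name `failed_paths` and all label values are hypothetical; the real msg and path values depend on the exporter and the node.

```python
def failed_paths(status_samples):
    """Return (msg, path) pairs for status samples whose value is 1.

    Each sample is a (labels, value) pair, where labels is a dict holding
    the msg and path labels of osc_cgroups_exporter_status.
    """
    return [
        (labels.get("msg", ""), labels.get("path", ""))
        for labels, value in status_samples
        if value == 1
    ]


if __name__ == "__main__":
    # Hypothetical samples: label values are invented for illustration.
    samples = [
        ({"msg": "example-error", "path": "/example/cgroup/a"}, 1),
        ({"msg": "", "path": "/example/cgroup/b"}, 0),
    ]
    for msg, path in failed_paths(samples):
        print(f"{path}: {msg}")
```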

Examples of published Metrics with different Scopes

  • metric osc_cgroups_memory_used_bytes, scope pod [figure: stacked_pods]
  • metric osc_cgroups_memory_used_bytes, scope pods [figure: all_pods]
  • metric osc_cgroups_memory_used_bytes, scope qos [figure: qos]
  • metric osc_cgroups_sgx_epc_used_bytes, scope container [figure: all_pods]
  • metric osc_cgroups_sgx_epc_limit_bytes, scope container [figure: qos]
  • metric osc_cgroups_exporter_status [figure: qos]
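Each metric and scope combination listed above corresponds to a PromQL selector that filters on the scope label. The following sketch builds such selector strings; the helper name `selector` is hypothetical, and the exact queries used for the published examples are not stated here.

```python
def selector(metric, **labels):
    """Build a PromQL instant-vector selector such as name{label="value"}.

    Label values are inserted as-is; values containing quotes or
    backslashes would need escaping, which this sketch does not do.
    """
    if not labels:
        return metric
    # Sort the matchers so the output is deterministic.
    matchers = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{metric}{{{matchers}}}"


if __name__ == "__main__":
    combinations = [
        ("osc_cgroups_memory_used_bytes", "pod"),
        ("osc_cgroups_memory_used_bytes", "pods"),
        ("osc_cgroups_memory_used_bytes", "qos"),
        ("osc_cgroups_sgx_epc_used_bytes", "container"),
        ("osc_cgroups_sgx_epc_limit_bytes", "container"),
    ]
    for metric, scope in combinations:
        print(selector(metric, scope=scope))
    print(selector("osc_cgroups_exporter_status"))
```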

Node Feature Discovery

The Node Feature Discovery (NFD) extension allows the values of the cgroups-prometheus-exporter to be modified. Here is an example of a Shoot YAML manifest:

kind: Shoot
apiVersion: core.gardener.cloud/v1beta1
metadata:
  name: 
  namespace: 
spec:
  extensions:
    - type: osc-nfd-shoot-service
      providerConfig:
        apiVersion: nfd.osc.extensions.config.gardener.cloud/v1alpha1
        kind: Configuration
        cgroups-prometheus-exporter:
          enabled: true
          values: |
            image:
              repository: mtr.devops.telekom.de/osc/common/monitoring/cgroups-prometheus-exporter
              tag: v0.2.1
              pullPolicy: Always
            prometheus:
              enablePrometheusRule: false
              enableServiceMonitor: true
        node-feature-rule:
          enabled: true
        node-feature-discovery:
          values: |
            image:
              repository:  mtr.devops.telekom.de/osc/gardener/node-feature-discovery
              pullPolicy: IfNotPresent
              tag: v0.13.4-minimal
          enabled: true
      disabled: false
   

Support Matrix

The cgroups-prometheus-exporter was tested in the following configurations:

| Cgroups prometheus exporter version | Garden Linux version | Kubernetes version | Containerd version |
| --- | --- | --- | --- |
| v0.2.1 | 1510.0 | 1.26.8 | 1.7.11 |
| v0.2.1 | 1510.0 | 1.29.9 | 1.7.11 |
| v0.2.1 | 1605.0 | 1.28.14 | 1.7.20 |
| v0.2.1 | 1510.0 | 1.30.8 | 1.7.11 |
| v0.2.1 | 1510.0 | 1.31.4 | 1.7.11 |