Customer Responsibilities for Operating a Resilient Application on OSC
Purpose and scope
This document defines the minimum customer responsibilities and the recommended engineering practices for designing, deploying, and operating resilient production and business-critical applications on the Open Sovereign Cloud (OSC). It applies to workloads running on OSC Kubernetes clusters and using OSC networking and storage services.
OSC provides resilient platform building blocks, including multiple Availability Zones (AZs), managed Kubernetes capabilities, storage services, and network services. End-to-end application resilience remains a shared responsibility. Unless explicitly included in a managed service agreement, the customer is responsible for:
- Application architecture and workload design
- Capacity management
- Backup and restore
- Disaster recovery
- Data lifecycle and housekeeping
- Application-level security
Most obligations in this document are not unique to OSC. They reflect standard cloud-native and Kubernetes engineering responsibilities that customers must meet on any managed Kubernetes platform when they operate a resilient application. OSC-specific requirements mainly arise from the applicable OSC service description, supported-version matrix, upgrade windows, and OSC network or storage service characteristics.
Key messages
- Most of these obligations are standard Kubernetes and cloud-native operating requirements, not OSC-specific exceptions.
- Design workloads so that they can survive the failure of a full AZ, not only for normal operation that assumes components never fail or restart.
- Keep workloads, manifests, and integrations within the versions and API levels supported by the applicable OSC service description. Unsupported components become both stability and security risks.
- Keep enough spare capacity to absorb failover and maintenance.
- Expand and clean up storage before it becomes critical.
- Minimize hard runtime dependencies on OSC/Kubernetes APIs and shared services.
- Plan network CIDRs and NAT behavior up front so connectivity, peering, VPNs, and egress scale do not fail by design.
- Prove resilience through regular testing, restore exercises, and documented runbooks.
Shared responsibility model
| Topic | OSC responsibility | Customer responsibility |
|---|---|---|
| Platform availability | Provide regional and AZ-based platform building blocks, managed Kubernetes capabilities, storage and network services, and platform maintenance processes. | Design and operate the application so it actively uses those building blocks and tolerates their maintenance and failure modes. |
| Workload placement and replication | Provide AZs, node pools, scheduling primitives, and resilient infrastructure services. | Configure replicas, topology spread, anti-affinity, PodDisruptionBudgets, and data replication across failure domains. |
| Capacity | Provision agreed platform capacity in line with the service process. | Forecast demand, maintain failover headroom, order expansions in time, and validate that failover remains within safe utilization thresholds. |
| Storage | Provide block and object storage services with defined platform characteristics. | Monitor growth and performance, perform housekeeping, tune workload I/O patterns, and request expansion before capacity becomes critical. |
| Backup and DR | Provide platform primitives and, where agreed, backup targets or tooling. | Define backup scope and retention, keep independent copies, perform restore tests, and validate end-to-end disaster recovery. |
| API usage | Expose platform and Kubernetes APIs. | Reduce hard runtime dependence on those APIs, implement timeouts, retries, caching, and graceful degradation. |
| Lifecycle and deprecated versions | Publish supported versions, deprecation information, and upgrade windows. | Track lifecycle, upgrade in time, and remove deprecated Kubernetes/API usage before support deadlines. |
| Network architecture and IP planning | Provide networking primitives, routing constructs, VPN or peering options, and documented technical constraints. | Plan non-overlapping CIDRs, validate remote connectivity assumptions, and size node networks and NAT behavior for the application design. |
Mandatory customer responsibilities
Design for failure domains
- Critical production applications MUST be designed so that loss of one AZ does not create a full service outage.
- Customers MUST distribute replicas, traffic, and data across failure domains using replicas, topology spread constraints, pod anti-affinity, PodDisruptionBudgets, and multi-AZ replication where applicable.
- Customers MUST define degraded-mode behavior for AZ loss, node loss, and temporary network interruptions.
- Production and non-production workloads MUST NOT share the same hardware pool.
- For clusters that are expected to survive AZ loss while maintaining spreading flexibility, a practical starting point is 6 worker nodes (2 per AZ). Smaller footprints require explicit review because they often cannot maintain spread, disruption budgets, and headroom simultaneously.
- Applications MUST tolerate at least 60 seconds of transient network interruption or cluster networking restart (for example Cilium restart or reconvergence) without uncontrolled cascading failure.
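The requirements above reference the standard Kubernetes spreading primitives. As a minimal sketch, assuming a stateless workload with illustrative names, namespaces, replica counts, and image references (none of these are OSC defaults):

```yaml
# Minimal sketch: spread six replicas across zones and nodes and cap voluntary disruption.
# All names, labels, and sizes are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: resilient-app
  namespace: example
spec:
  replicas: 6
  selector:
    matchLabels:
      app: resilient-app
  template:
    metadata:
      labels:
        app: resilient-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone   # spread evenly across AZs
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: resilient-app
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname   # avoid stacking replicas on one node
                labelSelector:
                  matchLabels:
                    app: resilient-app
      containers:
        - name: app
          image: registry.example/app:1.0   # placeholder image
---
# Keep enough replicas available during node drains and upgrades.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: resilient-app
  namespace: example
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app: resilient-app
```

With six replicas spread across three AZs, loss of one AZ removes at most two replicas while four remain available; the same budget caps voluntary disruption during normal node maintenance.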
Capacity and utilization management
- Customers MUST maintain a rolling capacity plan for compute, storage, and critical stateful components. Forecasts SHOULD cover at least the next 90 days and be reviewed regularly for production workloads.
- Capacity MUST be modeled against failure scenarios, not only against normal-state averages. The relevant metric is requested or reserved capacity plus expected failover load.
- Customers MUST maintain sufficient spare capacity to absorb the loss of the largest failure domain and at least one additional node (N+1) where relevant.
- All production workloads MUST define CPU and memory requests. Limits MUST be defined or explicitly justified based on workload behavior and performance testing.
- Customers MUST configure priority classes for business-critical workloads.
- Memory overcommit MUST NOT be used for critical production components. CPU overcommit above 4:1 is not acceptable; jitter-sensitive or demanding workloads SHOULD remain at or below 2:1.
- Customers SHOULD prefer small and medium instance sizes and horizontal scaling over very large single nodes or single large VMs.
- Relying on ad hoc capacity procurement during an incident is NOT an acceptable resilience strategy.
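The request, limit, and priority requirements above translate directly into workload manifests. A hedged sketch, assuming an illustrative priority class, workload name, and resource sizes (the values are placeholders, not OSC-mandated figures, and must come from the customer's own load testing):

```yaml
# Illustrative only: the class name, priority value, and resource sizes are assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical
value: 1000000
globalDefault: false
description: "Business-critical customer workloads"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api          # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      priorityClassName: business-critical
      containers:
        - name: api
          image: registry.example/payments-api:1.0   # placeholder image
          resources:
            requests:
              cpu: "500m"      # requests MUST be set for all production workloads
              memory: "1Gi"
            limits:
              memory: "1Gi"    # memory limit equals request: no memory overcommit
              cpu: "1"         # CPU limit set explicitly or justified by performance testing
```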
Sizing note for 3-AZ designs: If a service must survive the loss of one of three equally sized AZs and the remaining AZs should stay at or below 80% requested utilization after failover, steady-state requested utilization will usually need to remain at or below about 53% of total regional capacity unless warm standby capacity is already reserved. Even with a relaxed 90% post-failover target, the steady-state ceiling only rises to about 60%.
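The figures in the sizing note follow from simple arithmetic. With N equally sized AZs and a post-failover utilization target T on the remaining capacity, steady-state requested utilization U of total regional capacity is bounded by:

```latex
% Post-failover, only (N-1)/N of total regional capacity remains, so:
U \le T \cdot \frac{N-1}{N}
% N = 3, T = 0.80:  U \le 0.80 \cdot 2/3 \approx 0.53
% N = 3, T = 0.90:  U \le 0.90 \cdot 2/3 = 0.60
```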
Storage capacity, performance, and expansion
- Customers MUST monitor storage used capacity, growth rate, latency, IOPS, queue depth, replication lag where applicable, and the load generated by backups and housekeeping jobs.
- Customers MUST implement housekeeping, retention, archival, and cleanup for databases, filesystems, and object storage. Keep primary production datasets as small as practical.
- Customers MUST initiate expansion or cleanup before storage becomes constrained. Waiting until a volume or bucket is almost full is not acceptable operational practice.
- Customers MUST correct workload patterns that create persistent maximum storage load or inefficient I/O, such as small-block single-thread I/O when the application could use caching, larger blocks, or parallel I/O.
- For high-volume temporary or session-like data, customers SHOULD use caches or memory-optimized stores where compatible with the application design instead of pushing avoidable load into primary transactional databases.
- Object storage used for backup MUST have retention and cleanup policies. Object storage is not a substitute for high-IO transactional storage and should not be used for very large populations of tiny files.
- If the customer deploys unmanaged RWX/shared storage or a customer-owned Ceph solution inside the cluster, the customer owns the design, operation, and support of that solution.
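Expanding a volume before it becomes critical is a standard Kubernetes operation where the storage class allows it. A minimal sketch, assuming a CSI driver and storage class that support online expansion (the provisioner, class, claim names, and sizes are illustrative assumptions):

```yaml
# Illustrative only: provisioner, class, claim names, and sizes are assumptions.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: expandable-block       # hypothetical class name
provisioner: csi.example.osc   # hypothetical CSI driver
allowVolumeExpansion: true
---
# Expansion is triggered by raising the requested size on the existing claim,
# well before the 80% used threshold in the operational thresholds table.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: orders-db-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: expandable-block
  resources:
    requests:
      storage: 200Gi           # increased from an original 100Gi request
```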
Reduce dependencies on APIs and shared services
- Customers MUST minimize hard runtime dependency on OSC APIs, the Kubernetes API, DNS, identity providers, and other shared control-plane services.
- Critical business transaction paths SHOULD NOT require synchronous access to the Kubernetes API unless that dependency is explicitly justified, tested, and protected by fallback behavior.
- Health probes MUST target the local application endpoint and MUST NOT depend on DNS, public FQDNs, or other application components. Probe configuration must reflect measured startup and failure behavior.
- API clients MUST use connection reuse, exponential backoff with jitter, sensible circuit breakers, and realistic timeouts. Very small 1-5 second timeouts are not acceptable for critical platform interactions; a default around 60 seconds is a better starting point unless testing justifies something else.
- Large data objects and high-churn application state MUST NOT be stored in the Kubernetes API or etcd. If the application requires etcd-like semantics or high request volume, the customer SHOULD run a dedicated application-owned data store inside the cluster.
- Customers SHOULD avoid broad cross-cluster peering or tightly coupled network meshes. Prefer namespace isolation and expose only the few endpoints that truly require cross-cluster reachability.
Platform lifecycle and deprecated version compliance
- Customers MUST comply with supported versions, upgrade windows, and deprecation rules. Unsupported components become both stability and security risks.
- Workloads MUST be kept compatible with supported Kubernetes versions, supported Garden Linux images, supported add-ons, and supported API versions.
- Customers MUST remediate use of deprecated or removed Kubernetes APIs, controllers, or integration patterns before the applicable platform upgrade deadline.
- Customer-managed components inside the cluster - such as operators, ingress controllers, service meshes, CSI drivers, admission webhooks, sidecars, or application SDKs - remain the customer's responsibility unless expressly covered by a managed service agreement.
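As one well-known upstream example of the remediation meant here (this reflects Kubernetes upstream history, not an OSC-specific rule): PodDisruptionBudget manifests written against `policy/v1beta1` stopped being served when that API was removed in Kubernetes 1.25 and must be migrated to `policy/v1`:

```yaml
# Before (deprecated, removed in upstream Kubernetes 1.25):
#   apiVersion: policy/v1beta1
#   kind: PodDisruptionBudget
#
# After (supported since Kubernetes 1.21):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: resilient-app
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app: resilient-app
```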
Network architecture and IP address planning
- Customers MUST own end-to-end IP planning for cluster node CIDRs, pod and service networks where relevant, remote on-premises networks, partner networks, and other cloud environments that may later be connected through VPNs, peerings, or transit routing.
- Address ranges that are intended to communicate with each other MUST NOT overlap. Overlapping CIDRs can break routing, peering, or VPN connectivity and often require disruptive remediation or avoidable NAT workarounds.
- Customers MUST validate IP overlap not only against today's connected networks but also against reasonably planned future connectivity, additional clusters, and disaster recovery environments.
- When cross-network connectivity is business-critical, customers SHOULD maintain an IP address management process and an approved addressing plan before provisioning new clusters or remote connections.
Network sizing and NAT gateway planning
- Customers MUST size node networks for both dimensions of the problem: enough addresses for projected peak node growth and enough NAT port capacity for the expected egress pattern, especially persistent connections to the same destination endpoints.
- Customers MUST understand that broader node networks provide more IP addresses but fewer NAT ports per network interface, while narrower node networks provide more NAT ports per interface but less room for node growth. This trade-off must be evaluated during design, not after saturation occurs.
- Customers MUST NOT rely on default NAT gateway settings. NAT gateway sizing, ports per interface, and network sizes MUST be validated against the expected number of nodes, load balancers, connection fan-out, and long-lived outbound connections.
- If the application depends on high outbound connection counts or many persistent connections to a small set of external endpoints, customers MUST explicitly review NAT gateway port sizing as part of capacity management.
| Example node CIDR | Approx. IP addresses | Example max NAT ports per network interface |
|---|---|---|
| /23 | 512 | 64 |
| /24 | 256 | 128 |
| /25 | 128 | 256 |
| /26 | 64 | 512 |
This table is intended to show the design trade-off only: more node IP space generally reduces NAT ports per interface, while smaller node ranges increase ports per interface but constrain node growth.
Kubernetes workload hardening
- Use startup, readiness, and liveness probes that reflect actual application health and are tuned based on measured startup time and recovery characteristics.
- Use replicas, PodDisruptionBudgets, topology spread, anti-affinity, and autoscaling where appropriate for the workload and business target.
- Critical components MUST NOT be singletons unless the business owner explicitly accepts the resulting single point of failure.
- Customers MUST validate that upgrades or restarts of pods, nodes, and network components do not trigger chain reactions across the application landscape.
- Dependencies on identity providers, external databases, object storage, or external APIs MUST be explicitly documented and protected with retries, caching, queues, or degraded-mode behavior.
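A minimal probe sketch reflecting the requirements above. The paths, ports, and timings are illustrative assumptions and must be replaced with values derived from measured startup and recovery behavior:

```yaml
# Illustrative standalone pod; in practice this container spec belongs to a Deployment.
apiVersion: v1
kind: Pod
metadata:
  name: probe-example
spec:
  containers:
    - name: app
      image: registry.example/app:1.0   # placeholder image
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet:
          path: /healthz                # local endpoint only, no DNS or external dependency
          port: 8080
        periodSeconds: 10
        failureThreshold: 30            # allows up to 300 s of measured startup time
      readinessProbe:
        httpGet:
          path: /ready                  # readiness reflects ability to serve traffic
          port: 8080
        periodSeconds: 5
        failureThreshold: 3
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 6             # roughly 60 s of failures before a restart is triggered
```

The liveness settings in this sketch align with the requirement to tolerate at least 60 seconds of transient network interruption without triggering restarts prematurely.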
Data protection, backup, and disaster recovery
- Customers MUST ensure compliance with relevant laws and regulations, such as data sovereignty and privacy requirements. This includes determining whether data is personal or otherwise regulated, ensuring that processing is permitted, signing the required data-processing agreements, and following applicable industry rules. Sensitive data MUST be protected through encryption and access controls.
- Customers MUST keep at least one backup copy outside the primary OSC storage domain and preferably outside the primary OSC environment for business-critical services.
- Restore testing MUST be performed regularly. The default expectation is quarterly restore validation and annual end-to-end disaster recovery rehearsal, plus additional validation after major architecture changes.
- Backup strategy SHOULD combine full and incremental backups where supported and should be aligned with business RPO and RTO targets.
- Backups MUST be encrypted, monitored, and covered by alerting and runbooks.
- Replication does NOT replace backup.
- Backup frequency MUST be balanced against storage load and operational impact; more backups are not automatically better if they jeopardize production stability.
- The customer owns and operates its Business Continuity Management framework and Business Continuity Plan, including disaster recovery planning, backup and restore procedures, recovery testing, escalation, and decision‑making authority; OSC does not assume responsibility for customer-specific business continuity management.
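One common pattern for keeping an independent copy outside the primary storage domain is a scheduled job that dumps application data and uploads the archive to a separate object storage location. A hedged sketch, assuming a PostgreSQL database, a customer-built backup image containing the dump and upload tooling, and credentials supplied through a secret (all names, endpoints, and the toolchain are illustrative assumptions, not OSC-provided components):

```yaml
# Illustrative only: image, bucket, endpoint, schedule, and secret names are assumptions.
# DATABASE_URL and OFFSITE_S3_ENDPOINT are expected to be provided by the referenced secret.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: orders-db-backup
spec:
  schedule: "0 2 * * *"          # daily; frequency must follow the business RPO target
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: registry.example/db-backup:1.0   # hypothetical backup image
              envFrom:
                - secretRef:
                    name: backup-credentials          # hypothetical secret
              command: ["/bin/sh", "-c"]
              args:
                - >
                  pg_dump "$DATABASE_URL" | gzip |
                  aws s3 cp - "s3://offsite-backup-bucket/orders/$(date +%F).sql.gz"
                  --endpoint-url "$OFFSITE_S3_ENDPOINT"
```

Restore testing against such backups still has to be exercised separately; the schedule alone proves nothing about recoverability.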
Security
- Security is not separate from resilience. Security failures frequently result in service outages, forced shutdowns, or data loss. Customers MUST design, deploy, and operate workloads so that security incidents can be prevented, detected, contained, and recovered without destabilizing the platform or other tenants.
- Customers own and protect all their tenant credentials (passwords, keys, tokens, certificates). Least‑privilege access, regular rotation, and immediate revocation after suspected compromise are mandatory.
- Customers are responsible for application‑level security controls, including authentication, authorization, and endpoint exposure.
- Customers MUST ensure workloads do not generate abusive or amplified traffic. Misconfigurations that could destabilize the platform or impact others are the customer’s responsibility. The OSC may immediately deactivate workloads that threaten platform stability.
- As stated under platform lifecycle, customers MUST comply with supported versions, upgrade windows, and deprecation rules; unsupported components are a security risk as well as a stability risk.
Testing, evidence, and operational governance
- Before go-live, customers MUST complete end-to-end resilience tests aligned with OSC failure scenarios, including AZ failure survival, node maintenance, transient networking interruption, backup restore, version-upgrade compatibility, IP overlap validation for required connectivity paths, and dependency outage handling.
- Customers MUST maintain runbooks, escalation contacts, ownership information, and a clear responsibility and communication matrix.
- Exceptions to these requirements MUST be documented, risk-assessed, and approved by the customer service owner and the OSC counterpart before go-live.
- Evidence of compliance SHOULD be retained and updated after major changes, scale events, or architecture changes.
Operational thresholds and trigger points
| Domain | Threshold / minimum standard | Required customer action |
|---|---|---|
| Compute failover model | Post-AZ-failure requested CPU and memory utilization should stay at or below 80% of remaining capacity. In a balanced 3-AZ design without reserved standby, this usually means steady-state requested utilization at or below about 53% of total capacity. | Scale out, reserve standby capacity, or redesign the application before go-live if the failover model exceeds the threshold. |
| Node or node-pool utilization | Sustained utilization above 70% requires review. Sustained utilization above 80% requires capacity action. Utilization above 90% is outside the normal safe operating envelope. | Forecast, rebalance, optimize, or scale before the headroom is consumed. |
| Storage used capacity | Begin expansion planning or data cleanup at 70% used or when 80% usage is forecast within 30 days. Execute expansion by 80%. More than 90% used is critical. | Expand storage, adjust retention, archive data, or remediate workload behavior before service degradation occurs. |
| Supported versions and deprecated APIs | No production workload may remain on deprecated or removed versions or APIs beyond the communicated deadline. | Plan and execute upgrades, update manifests and integrations, and validate compatibility before the deadline. |
| Network address planning | No overlap is acceptable between any cluster network CIDR and any remote network that must be reachable through VPN, peering, or routed connectivity. | Re-plan CIDRs before provisioning or before enabling connectivity; do not treat downstream NAT as the default fix. |
| Node CIDR and NAT gateway sizing | Node CIDR and NAT gateway settings must support both peak address consumption and projected egress or NAT port demand, including persistent connection patterns to shared destinations. | Recalculate node CIDR, NAT gateway sizing, or egress architecture before onboarding workloads that materially change node count or outbound connection behavior. |
| CPU and memory overcommit | CPU overcommit should remain at or below 4:1 overall and at or below 2:1 for jitter-sensitive workloads. No memory overcommit for critical production components. | Reduce density, resize nodes, or split workloads across additional capacity. |
| Platform API timeout and retry policy | Avoid 1-5 second timeouts for critical platform interactions. Use realistic timeouts, around 60 seconds by default unless testing justifies something else, plus exponential backoff and jitter. | Adjust client libraries, retry policies, and fallback behavior; rerun failure tests after changes. |
| Transient networking interruptions | Applications must tolerate at least 60 seconds of network control-plane interruption or CNI restart without uncontrolled cascading failure. | Use retries, queues, cache, dependency isolation, and validated probe settings. |
| Restore and DR validation | Quarterly restore validation and annual end-to-end DR rehearsal are the default minimum expectation for business-critical workloads. | Maintain evidence, remediate gaps, and repeat after major architecture changes. |
Go-live quality gates
| Gate | Minimum evidence |
|---|---|
| Architecture and dependency review | Approved design showing failure domains, critical dependencies, degraded-mode behavior, and ownership boundaries. |
| AZ-failure test | Test evidence showing that loss of one AZ does not create a full outage and that documented RTO and RPO targets remain achievable. |
| Node maintenance test | Successful rolling node, VM, or worker maintenance test without uncontrolled disruption. |
| Network interruption test | Successful simulation of transient network interruption or Cilium restart of at least 60 seconds. |
| Version compliance | Inventory of Kubernetes and API dependencies, confirmation that no deprecated versions or removed APIs remain within the go-live and upgrade horizon, and an upgrade plan aligned with the service description. |
| Network architecture and IP plan | Approved IP plan covering node, pod, service, and remote network ranges, overlap check for VPN or peering use cases, and reviewed egress or NAT design. |
| Capacity plan | Reviewed forecast, utilization thresholds, failover headroom calculation, and agreed expansion lead times. |
| Storage plan | Storage alerts, housekeeping jobs, retention policy, and storage expansion runbook are in place. |
| Backup and restore | Restore into a clean environment has been proven and an independent backup copy has been verified. |
| Runbooks and communication | Named service owner, support contacts, escalation matrix, and incident runbooks are available and current. |
Exceptions and risk acceptance
Any deviation from these requirements must be documented, risk-assessed, and approved by the customer service owner and the OSC counterpart before go-live. Accepted exceptions should define compensating controls, expected business impact, and a review date. Exceptions should not be treated as permanent defaults.