opendatahub-operator Failure Modes
Coverage
| Injection Type | Danger | Experiment | Description |
|---|---|---|---|
| FinalizerBlock | low | finalizer-block.yaml | When a stuck finalizer prevents a DataScienceCluster from being deleted, the ope... |
| NetworkPartition | medium | network-partition.yaml | When the operator pods are network-partitioned, the leader should lose its lease... |
| PodKill | low | pod-kill.yaml | When one operator pod is killed, the remaining replicas should maintain the lead... |
| RBACRevoke | high | rbac-revoke.yaml | When the operator ClusterRoleBinding subjects are revoked, the controller should... |
| WebhookDisrupt | high | webhook-disrupt.yaml | When the validating webhook failurePolicy is weakened from Fail to Ignore, inval... |
Experiment Details
opendatahub-operator-finalizer-block
- Type: FinalizerBlock
- Danger Level: low
- Component: opendatahub-operator
When a stuck finalizer prevents a DataScienceCluster from being deleted, the operator should handle the Terminating state gracefully, report the blocked deletion in its status, and not leak component deployments across managed namespaces. The chaos framework removes the finalizer via TTL-based cleanup after 300s.
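The deletion behaviour described above follows from how Kubernetes treats finalizers: an object with a deletion timestamp stays in the Terminating state until its finalizer list is empty. The following is a minimal sketch of that rule and of a TTL-based cleanup. The `Resource` class and the `ttl_cleanup` helper are hypothetical stand-ins for illustration, not the chaos framework's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    """Tiny stand-in for the deletion-related metadata of a Kubernetes object."""
    name: str
    finalizers: List[str] = field(default_factory=list)
    deletion_timestamp: Optional[float] = None  # set once deletion is requested


def request_delete(obj: Resource, now: float) -> None:
    """Mark the object for deletion; it remains visible while finalizers exist."""
    if obj.deletion_timestamp is None:
        obj.deletion_timestamp = now


def is_terminating(obj: Resource) -> bool:
    """True when deletion was requested but at least one finalizer remains."""
    return obj.deletion_timestamp is not None and bool(obj.finalizers)


def ttl_cleanup(obj: Resource, finalizer: str, injected_at: float,
                ttl_seconds: float, now: float) -> bool:
    """Remove the injected finalizer once the TTL has elapsed.

    Returns True if the finalizer was removed on this call.
    """
    if now - injected_at >= ttl_seconds and finalizer in obj.finalizers:
        obj.finalizers.remove(finalizer)
        return True
    return False
```

With a 300-second TTL, a cleanup attempt at 299 seconds leaves the object Terminating, while an attempt at 300 seconds removes the finalizer and lets deletion complete.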
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: opendatahub-operator-finalizer-block
spec:
  tier: 3
  target:
    operator: opendatahub-operator
    component: opendatahub-operator
    resource: DataScienceCluster
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: opendatahub-operator-controller-manager
        namespace: opendatahub-operator-system
        conditionType: Available
    timeout: "30s"
  injection:
    type: FinalizerBlock
    # IMPORTANT: "default-dsc" is a placeholder. Replace it with the name
    # of the actual DataScienceCluster resource deployed in the cluster
    # before running this experiment.
    parameters:
      apiVersion: datasciencecluster.opendatahub.io/v1
      kind: DataScienceCluster
      name: default-dsc
      finalizer: platform.opendatahub.io/finalizer
    ttl: "300s"
  hypothesis:
    description: >-
      When a stuck finalizer prevents a DataScienceCluster from being
      deleted, the operator should handle the Terminating state gracefully,
      report the blocked deletion in its status, and not leak component
      deployments across managed namespaces. The chaos framework removes
      the finalizer via TTL-based cleanup after 300s.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 3
    allowedNamespaces:
      - opendatahub-operator-system
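Every experiment on this page uses the same steady-state check: a `conditionTrue` check on the `Available` condition of the operator Deployment. A minimal sketch of what such a check plausibly evaluates is shown below, assuming the object is already available as a plain dictionary. The `condition_true` function is a hypothetical helper, not the framework's API.

```python
from typing import Any, Dict


def condition_true(obj: Dict[str, Any], condition_type: str) -> bool:
    """Return True if status.conditions contains the given type with status "True".

    Kubernetes reports condition status as the strings "True", "False",
    or "Unknown", so the comparison is against the string, not a boolean.
    """
    conditions = obj.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == condition_type and c.get("status") == "True"
        for c in conditions
    )
```

A Deployment with no `status.conditions` yet (for example, one that was just created) evaluates to `False` under this sketch, which matches treating an unknown state as not steady.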
opendatahub-operator-network-partition
- Type: NetworkPartition
- Danger Level: medium
- Component: opendatahub-operator
When the operator pods are network-partitioned, the leader should lose its lease and stop reconciling DSCInitialization and DataScienceCluster resources. Once connectivity is restored, a new leader election should occur and reconciliation should resume.
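The lease loss described above rests on the Kubernetes `Lease` object, whose `spec.renewTime` and `spec.leaseDurationSeconds` fields determine when another candidate may take over. The sketch below shows a simplified expiry test under that model. Real leader-election clients also account for clock skew and use their own observation times, so treat this as an illustration only; the 15-second duration in the usage note is an illustrative value, not one taken from this operator's configuration.

```python
from datetime import datetime, timedelta


def lease_expired(renew_time: datetime, lease_duration_seconds: int,
                  now: datetime) -> bool:
    """Return True once the holder has failed to renew within the lease duration.

    A partitioned leader cannot reach the API server to update renewTime,
    so the lease eventually expires and another replica can acquire it.
    """
    return now > renew_time + timedelta(seconds=lease_duration_seconds)
```

For a lease last renewed at 12:00:00 with a 15-second duration, the check is false at 12:00:10 and true at 12:00:16.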
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: opendatahub-operator-network-partition
spec:
  tier: 2
  target:
    operator: opendatahub-operator
    component: opendatahub-operator
    resource: Deployment/opendatahub-operator-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: opendatahub-operator-controller-manager
        namespace: opendatahub-operator-system
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: control-plane=controller-manager
    ttl: "300s"
  hypothesis:
    description: >-
      When the operator pods are network-partitioned, the leader should
      lose its lease and stop reconciling DSCInitialization and
      DataScienceCluster resources. Once connectivity is restored, a new
      leader election should occur and reconciliation should resume.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 3
    allowedNamespaces:
      - opendatahub-operator-system
opendatahub-operator-pod-kill
- Type: PodKill
- Danger Level: low
- Component: opendatahub-operator
When one operator pod is killed, the remaining replicas should maintain the leader lease. Kubernetes should recreate the killed pod within the recovery timeout. The 3-replica HA deployment ensures continuous reconciliation of DataScienceCluster resources.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: opendatahub-operator-pod-kill
spec:
  tier: 1
  target:
    operator: opendatahub-operator
    component: opendatahub-operator
    resource: Deployment/opendatahub-operator-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: opendatahub-operator-controller-manager
        namespace: opendatahub-operator-system
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: control-plane=controller-manager
    count: 1
    ttl: "300s"
  hypothesis:
    description: >-
      When one operator pod is killed, the remaining replicas should
      maintain the leader lease. Kubernetes should recreate the killed pod
      within the recovery timeout. The 3-replica HA deployment ensures
      continuous reconciliation of DataScienceCluster resources.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub-operator-system
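The `blastRadius` section bounds how much an injection may touch: here, at most one pod, and only in `opendatahub-operator-system`. How the framework enforces these limits is not shown on this page, so the following is only a sketch of one plausible pre-flight check. The `BlastRadius` class and `blast_radius_violations` function are hypothetical, and the sketch assumes that a namespace missing from the allow-list is always a violation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BlastRadius:
    """Hypothetical mirror of the blastRadius fields used on this page."""
    max_pods_affected: int
    allowed_namespaces: List[str] = field(default_factory=list)


def blast_radius_violations(radius: BlastRadius,
                            targets: List[Tuple[str, str]]) -> List[str]:
    """Return human-readable violations; an empty list means the plan is in bounds.

    `targets` holds (namespace, pod_name) pairs the injection would affect.
    """
    violations = []
    if len(targets) > radius.max_pods_affected:
        violations.append(
            f"{len(targets)} pods targeted, limit is {radius.max_pods_affected}"
        )
    for namespace, pod in targets:
        if namespace not in radius.allowed_namespaces:
            violations.append(f"pod {pod} is in disallowed namespace {namespace}")
    return violations
```

Under these assumptions, the pod-kill experiment's `count: 1` stays within `maxPodsAffected: 1`, whereas selecting two pods or a pod in another namespace would each produce one violation.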
opendatahub-operator-rbac-revoke
- Type: RBACRevoke
- Danger Level: high
- Component: opendatahub-operator
When the operator ClusterRoleBinding subjects are revoked, the controller should lose its ability to manage component deployments across namespaces and surface permission-denied errors. Once permissions are restored, reconciliation should resume without manual intervention.
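Revoking the subjects of a ClusterRoleBinding and restoring them afterwards can be pictured as a save, clear, and restore sequence on the binding's `subjects` list. The sketch below models that sequence on a plain dictionary. The `revoked_subjects` context manager is hypothetical, not the framework's implementation, and the ServiceAccount name in the usage note is an assumed example.

```python
import copy
from contextlib import contextmanager
from typing import Any, Dict, Iterator


@contextmanager
def revoked_subjects(binding: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Temporarily empty a binding's subjects, restoring them on exit.

    The original subjects are deep-copied first so that restoration does
    not depend on anything that happens to the binding while it is revoked.
    """
    original = copy.deepcopy(binding.get("subjects", []))
    binding["subjects"] = []
    try:
        yield binding
    finally:
        # Restore even if the body raised, mirroring a cleanup that must always run.
        binding["subjects"] = original
```

For a binding whose only subject is a ServiceAccount named, say, `opendatahub-operator-controller-manager`, the list is empty inside the `with` block and identical to the original once the block exits.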
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: opendatahub-operator-rbac-revoke
spec:
  tier: 4
  target:
    operator: opendatahub-operator
    component: opendatahub-operator
    resource: ClusterRoleBinding/opendatahub-operator-controller-manager-rolebinding
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: opendatahub-operator-controller-manager
        namespace: opendatahub-operator-system
        conditionType: Available
    timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: opendatahub-operator-controller-manager-rolebinding
      bindingType: ClusterRoleBinding
    ttl: "60s"
  hypothesis:
    description: >-
      When the operator ClusterRoleBinding subjects are revoked, the
      controller should lose its ability to manage component deployments
      across namespaces and surface permission-denied errors. Once
      permissions are restored, reconciliation should resume without
      manual intervention.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 3
    allowDangerous: true
opendatahub-operator-webhook-disrupt
- Type: WebhookDisrupt
- Danger Level: high
- Component: opendatahub-operator
When the validating webhook failurePolicy is weakened from Fail to Ignore, invalid DataScienceCluster and DSCInitialization resources can bypass admission validation. The chaos framework restores the original failurePolicy via TTL-based cleanup after 60s.
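In Kubernetes, `failurePolicy` governs only what happens when the webhook call itself errors or times out; a reachable webhook that explicitly denies a request is still enforced under either policy. The sketch below captures that decision rule. The `admission_decision` function is a hypothetical model for illustration, not an API server or framework function.

```python
from typing import Optional


def admission_decision(webhook_allowed: Optional[bool], failure_policy: str) -> bool:
    """Return True if the request is admitted.

    webhook_allowed is True or False when the webhook answered, and None
    when the call failed (for example, a timeout or connection error).
    """
    if failure_policy not in ("Fail", "Ignore"):
        raise ValueError(f"unknown failurePolicy: {failure_policy!r}")
    if webhook_allowed is not None:
        # The webhook responded, so its verdict stands regardless of policy.
        return webhook_allowed
    # The webhook could not be reached: Ignore admits, Fail rejects.
    return failure_policy == "Ignore"
```

Under this rule, the bypass the experiment probes occurs when the policy is `Ignore` and the webhook is unreachable: `admission_decision(None, "Ignore")` admits the request, while `admission_decision(None, "Fail")` rejects it.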
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: opendatahub-operator-webhook-disrupt
spec:
  tier: 4
  target:
    operator: opendatahub-operator
    component: opendatahub-operator
    resource: ValidatingWebhookConfiguration/validating.datasciencecluster.opendatahub.io
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: opendatahub-operator-controller-manager
        namespace: opendatahub-operator-system
        conditionType: Available
    timeout: "30s"
  injection:
    type: WebhookDisrupt
    dangerLevel: high
    parameters:
      webhookName: validating.datasciencecluster.opendatahub.io
      action: setFailurePolicy
      value: Ignore
    ttl: "60s"
  hypothesis:
    description: >-
      When the validating webhook failurePolicy is weakened from Fail to
      Ignore, invalid DataScienceCluster and DSCInitialization resources
      can bypass admission validation. The chaos framework restores the
      original failurePolicy via TTL-based cleanup after 60s.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 3
    allowDangerous: true