# training-operator Failure Modes
## Coverage
| Injection Type | Danger | Experiment | Description |
|---|---|---|---|
| FinalizerBlock | low | finalizer-block.yaml | When a stuck finalizer prevents a PyTorchJob from being deleted, the controller should handle the Terminating state gracefully and not leak associated worker pods or services. The chaos framework removes the finalizer via TTL-based cleanup after 300s. |
| NetworkPartition | medium | network-partition.yaml | When the training-operator is network-partitioned from the API server, job status stops updating but running worker pods continue training. Once the partition is removed, reconciliation resumes without manual intervention. |
| PodKill | low | pod-kill.yaml | When the training-operator pod is killed, running training jobs continue via worker pods. New job submissions queue until the controller recovers within the recovery timeout. |
| RBACRevoke | high | rbac-revoke.yaml | When the training-operator ClusterRoleBinding subjects are revoked, the controller can no longer manage PyTorchJob resources. API calls return 403 errors. Once permissions are restored, normal operation resumes without restart. |
## Experiment Details
### training-operator-finalizer-block
- Type: FinalizerBlock
- Danger Level: low
- Component: training-operator-controller-manager
When a stuck finalizer prevents a PyTorchJob from being deleted, the controller should handle the Terminating state gracefully and not leak associated worker pods or services. The chaos framework removes the finalizer via TTL-based cleanup after 300s.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: training-operator-finalizer-block
spec:
  tier: 3
  target:
    operator: training-operator
    component: training-operator-controller-manager
    resource: PyTorchJob
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: training-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: FinalizerBlock
    parameters:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      name: test-pytorchjob
      finalizer: training-operator.kubeflow.org/finalizer
      ttl: "300s"
  hypothesis:
    description: >-
      When a stuck finalizer prevents a PyTorchJob from being deleted, the
      controller should handle the Terminating state gracefully and not leak
      associated worker pods or services. The chaos framework removes the
      finalizer via TTL-based cleanup after 300s.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
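The hypothesis can be spot-checked outside the chaos framework with the Kubernetes Python client. The sketch below is illustrative only: it takes the job name and namespace from the manifest above, but it assumes the `kubernetes` client is installed and that the operator labels the job's worker pods and services with `training.kubeflow.org/job-name`, a label convention not stated in this document.

```python
# Hedged verification sketch for training-operator-finalizer-block.
# Assumes the kubernetes Python client and the training.kubeflow.org/job-name
# label on worker pods/services (an assumption, not stated in this document).
import time
from kubernetes import client, config

NAMESPACE = "opendatahub"
JOB_NAME = "test-pytorchjob"
SELECTOR = f"training.kubeflow.org/job-name={JOB_NAME}"

config.load_kube_config()
custom = client.CustomObjectsApi()
core = client.CoreV1Api()

# Delete the PyTorchJob; the injected finalizer keeps it in Terminating
# until the framework's TTL-based cleanup (300s) removes the finalizer.
custom.delete_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace=NAMESPACE,
    plural="pytorchjobs", name=JOB_NAME,
)

deadline = time.time() + 300 + 180  # injection TTL plus recoveryTimeout
while time.time() < deadline:
    try:
        custom.get_namespaced_custom_object(
            group="kubeflow.org", version="v1", namespace=NAMESPACE,
            plural="pytorchjobs", name=JOB_NAME,
        )
        time.sleep(10)  # object still present, i.e. still Terminating
    except client.exceptions.ApiException as err:
        if err.status == 404:
            break  # finalizer cleared, deletion completed
        raise

# Once deletion completes, no worker pods or services should be left behind.
pods = core.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items
svcs = core.list_namespaced_service(NAMESPACE, label_selector=SELECTOR).items
assert not pods and not svcs, f"leaked: {len(pods)} pods, {len(svcs)} services"
```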
### training-operator-network-partition
- Type: NetworkPartition
- Danger Level: medium
- Component: training-operator-controller-manager
When the training-operator is network-partitioned from the API server, job status stops updating but running worker pods continue training. Once the partition is removed, reconciliation resumes without manual intervention.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: training-operator-network-partition
spec:
  tier: 2
  target:
    operator: training-operator
    component: training-operator-controller-manager
    resource: Deployment/training-operator-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: training-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: control-plane=controller-manager,app.kubernetes.io/name=training-operator
      ttl: "300s"
  hypothesis:
    description: >-
      When the training-operator is network-partitioned from the API server,
      job status stops updating but running worker pods continue training.
      Once the partition is removed, reconciliation resumes without manual
      intervention.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
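A rough way to observe this hypothesis is to sample the worker pods and the PyTorchJob status before and after the partition window. The sketch below is a manual probe, not part of the chaos framework: it assumes the `kubernetes` Python client, the `training.kubeflow.org/job-name` label convention, a running job named `test-pytorchjob`, and `lastUpdateTime` fields on the job's conditions; none of these are guaranteed by the manifest above.

```python
# Hedged observation sketch for training-operator-network-partition.
# Job name, label selector, and condition fields are assumptions for illustration.
import time
from kubernetes import client, config

NAMESPACE = "opendatahub"
SELECTOR = "training.kubeflow.org/job-name=test-pytorchjob"

config.load_kube_config()
core = client.CoreV1Api()
custom = client.CustomObjectsApi()

def workers_running() -> bool:
    """True if every worker pod for the job is currently in phase Running."""
    pods = core.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items
    return bool(pods) and all(p.status.phase == "Running" for p in pods)

def last_condition_time() -> str:
    """Newest condition timestamp on the PyTorchJob, used to spot frozen status."""
    job = custom.get_namespaced_custom_object(
        "kubeflow.org", "v1", NAMESPACE, "pytorchjobs", "test-pytorchjob")
    conditions = job.get("status", {}).get("conditions", [])
    return max((c.get("lastUpdateTime", "") for c in conditions), default="")

# During the partition: training should continue even though status is frozen.
assert workers_running(), "worker pods stopped while the controller was partitioned"
before = last_condition_time()

time.sleep(300 + 180)  # partition TTL plus recoveryTimeout

after = last_condition_time()
print("condition timestamps before/after:", before, "->", after)
print("expected: frozen while partitioned, advancing again once the partition lifts")
```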
### training-operator-pod-kill
- Type: PodKill
- Danger Level: low
- Component: training-operator-controller-manager
When the training-operator pod is killed, running training jobs continue via worker pods. New job submissions queue until the controller recovers within the recovery timeout.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: training-operator-pod-kill
spec:
  tier: 1
  target:
    operator: training-operator
    component: training-operator-controller-manager
    resource: Deployment/training-operator-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: training-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: control-plane=controller-manager,app.kubernetes.io/name=training-operator
      count: 1
      ttl: "300s"
  hypothesis:
    description: >-
      When the training-operator pod is killed, running training jobs
      continue via worker pods. New job submissions queue until the
      controller recovers within the recovery timeout.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
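The recovery half of the hypothesis can be probed by polling the controller Deployment until it reports an available replica again. A minimal sketch with the `kubernetes` Python client, using the 120s recoveryTimeout above as the polling budget (this is an external check, not something the framework runs):

```python
# Hedged recovery probe for training-operator-pod-kill: after the controller pod
# is deleted, the Deployment should report an available replica again within 120s.
import time
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

def wait_for_available(name: str, namespace: str, timeout: int = 120) -> bool:
    """Poll the Deployment until at least one replica is available or time runs out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        dep = apps.read_namespaced_deployment(name, namespace)
        if (dep.status.available_replicas or 0) >= 1:
            return True
        time.sleep(5)
    return False

assert wait_for_available("training-operator-controller-manager", "opendatahub"), \
    "controller did not recover within the 120s recoveryTimeout"
```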
### training-operator-rbac-revoke
- Type: RBACRevoke
- Danger Level: high
- Component: training-operator-controller-manager
When the training-operator ClusterRoleBinding subjects are revoked, the controller can no longer manage PyTorchJob resources. API calls return 403 errors. Once permissions are restored, normal operation resumes without restart.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: training-operator-rbac-revoke
spec:
  tier: 4
  target:
    operator: training-operator
    component: training-operator-controller-manager
    resource: ClusterRoleBinding/training-operator-manager-rolebinding
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: training-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: training-operator-manager-rolebinding
      bindingType: ClusterRoleBinding
      ttl: "60s"
  hypothesis:
    description: >-
      When the training-operator ClusterRoleBinding subjects are revoked,
      the controller can no longer manage PyTorchJob resources. API calls
      return 403 errors. Once permissions are restored, normal operation
      resumes without restart.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
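Whether the controller's permissions are actually gone (and later restored) can be checked with a SubjectAccessReview rather than by grepping controller logs for 403s. The sketch below assumes the `kubernetes` Python client and that the controller runs as a `training-operator` ServiceAccount in `opendatahub`; the real subject should be read from the binding before running the experiment.

```python
# Hedged illustration for training-operator-rbac-revoke: ask the API server whether
# the controller's ServiceAccount may still act on PyTorchJobs. The ServiceAccount
# name below is an assumption; use the binding's original subjects for the real value.
from kubernetes import client, config

config.load_kube_config()
authz = client.AuthorizationV1Api()

def controller_can(verb: str) -> bool:
    """True if the (assumed) controller ServiceAccount may perform `verb` on pytorchjobs."""
    review = client.V1SubjectAccessReview(
        spec=client.V1SubjectAccessReviewSpec(
            user="system:serviceaccount:opendatahub:training-operator",  # assumed SA
            resource_attributes=client.V1ResourceAttributes(
                group="kubeflow.org",
                resource="pytorchjobs",
                verb=verb,
                namespace="opendatahub",
            ),
        )
    )
    return authz.create_subject_access_review(review).status.allowed

# Expected False while the binding subjects are revoked (the controller sees 403s),
# and True again after the 60s TTL restores them, within the 120s recoveryTimeout.
print("controller can list pytorchjobs:", controller_can("list"))
```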