
training-operator Failure Modes

Coverage

| Injection Type   | Danger | Experiment             | Description                                                                                                                                   |
|------------------|--------|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| FinalizerBlock   | low    | finalizer-block.yaml   | A stuck finalizer blocks deletion of a PyTorchJob; the controller should handle the Terminating state without leaking worker pods or services. |
| NetworkPartition | medium | network-partition.yaml | The controller is partitioned from the API server; job status stops updating while running worker pods continue training.                      |
| PodKill          | low    | pod-kill.yaml          | The controller pod is killed; running jobs continue via worker pods, and new submissions queue until the controller recovers.                  |
| RBACRevoke       | high   | rbac-revoke.yaml       | The controller's ClusterRoleBinding subjects are revoked; its API calls return 403 until permissions are restored.                             |

Experiment Details

training-operator-finalizer-block

  • Type: FinalizerBlock
  • Danger Level: low
  • Component: training-operator-controller-manager

When a stuck finalizer prevents a PyTorchJob from being deleted, the controller should handle the Terminating state gracefully and not leak associated worker pods or services. The chaos framework removes the finalizer via TTL-based cleanup after 300s.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: training-operator-finalizer-block
spec:
  tier: 3
  target:
    operator: training-operator
    component: training-operator-controller-manager
    resource: PyTorchJob
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: training-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: FinalizerBlock
    parameters:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      name: test-pytorchjob
      finalizer: training-operator.kubeflow.org/finalizer
    ttl: "300s"
  hypothesis:
    description: >-
      When a stuck finalizer prevents a PyTorchJob from being deleted, the
      controller should handle the Terminating state gracefully and not leak
      associated worker pods or services. The chaos framework removes the
      finalizer via TTL-based cleanup after 300s.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
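
For reference, the sketch below shows roughly what the targeted PyTorchJob looks like while the injection is active. The object name and finalizer come from the experiment parameters above; the namespace, deletionTimestamp value, replica count, and image are illustrative assumptions, not taken from a live cluster.

# Illustrative object state during the injection (not the experiment spec):
# a delete has been issued, the API server has set deletionTimestamp, and the
# object stays in Terminating until the framework strips the finalizer at TTL.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: test-pytorchjob
  namespace: opendatahub                       # assumed; matches the allowed namespace
  deletionTimestamp: "2024-01-01T00:00:00Z"    # example value set by the API server
  finalizers:
    - training-operator.kubeflow.org/finalizer
spec:
  pytorchReplicaSpecs:                         # minimal assumed job shape
    Worker:
      replicas: 2                              # hypothetical worker count
      template:
        spec:
          containers:
            - name: pytorch
              image: example.com/train:latest  # placeholder image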

training-operator-network-partition

  • Type: NetworkPartition
  • Danger Level: medium
  • Component: training-operator-controller-manager

When the training-operator is network-partitioned from the API server, job status stops updating but running worker pods continue training. Once the partition is removed, reconciliation resumes without manual intervention.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: training-operator-network-partition
spec:
  tier: 2
  target:
    operator: training-operator
    component: training-operator-controller-manager
    resource: Deployment/training-operator-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: training-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: control-plane=controller-manager,app.kubernetes.io/name=training-operator
    ttl: "300s"
  hypothesis:
    description: >-
      When the training-operator is network-partitioned from the API server,
      job status stops updating but running worker pods continue training.
      Once the partition is removed, reconciliation resumes without manual
      intervention.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
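
This document does not state how the partition is injected. One common way to realize it is a deny-all egress NetworkPolicy scoped to the same label selector, sketched below purely as an assumption; the policy name is hypothetical.

# Sketch only: a NetworkPolicy-based partition is an assumption, not a statement
# about how the chaos framework actually injects this fault. The podSelector
# mirrors the experiment's labelSelector parameter; while the policy exists,
# the selected pods cannot reach the API server.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: training-operator-partition            # hypothetical name
  namespace: opendatahub
spec:
  podSelector:
    matchLabels:
      control-plane: controller-manager
      app.kubernetes.io/name: training-operator
  policyTypes:
    - Egress
  egress: []                                    # empty egress list denies all outbound traffic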

training-operator-pod-kill

  • Type: PodKill
  • Danger Level: low
  • Component: training-operator-controller-manager

When the training-operator pod is killed, running training jobs continue via worker pods. New job submissions queue until the controller recovers within the recovery timeout.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: training-operator-pod-kill
spec:
  tier: 1
  target:
    operator: training-operator
    component: training-operator-controller-manager
    resource: Deployment/training-operator-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: training-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: control-plane=controller-manager,app.kubernetes.io/name=training-operator
    count: 1
    ttl: "300s"
  hypothesis:
    description: >-
      When the training-operator pod is killed, running training jobs
      continue via worker pods. New job submissions queue until the
      controller recovers within the recovery timeout.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
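
Running jobs survive the kill because worker pods are owned by the PyTorchJob object, not by the operator pod. The sketch below shows that ownership; the pod name, UID, and image are illustrative assumptions.

# Sketch, assuming a typical training-operator worker pod: the ownerReference
# points at the PyTorchJob, so the pod keeps running while the controller pod
# is replaced by its Deployment. All values below are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: test-pytorchjob-worker-0               # hypothetical worker pod name
  namespace: opendatahub
  ownerReferences:
    - apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      name: test-pytorchjob
      controller: true
      uid: 00000000-0000-0000-0000-000000000000  # placeholder UID
spec:
  containers:
    - name: pytorch
      image: example.com/train:latest          # placeholder image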

training-operator-rbac-revoke

  • Type: RBACRevoke
  • Danger Level: high
  • Component: training-operator-controller-manager

When the training-operator ClusterRoleBinding subjects are revoked, the controller can no longer manage PyTorchJob resources. API calls return 403 errors. Once permissions are restored, normal operation resumes without restart.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: training-operator-rbac-revoke
spec:
  tier: 4
  target:
    operator: training-operator
    component: training-operator-controller-manager
    resource: ClusterRoleBinding/training-operator-manager-rolebinding
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: training-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: training-operator-manager-rolebinding
      bindingType: ClusterRoleBinding
    ttl: "60s"
  hypothesis:
    description: >-
      When the training-operator ClusterRoleBinding subjects are revoked,
      the controller can no longer manage PyTorchJob resources. API calls
      return 403 errors. Once permissions are restored, normal operation
      resumes without restart.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
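
Restoring the binding is what brings the controller back, so it helps to know what is being revoked. The sketch below assumes a standard training-operator install; the roleRef and ServiceAccount names are inferred from the binding name and are not confirmed by this document.

# Sketch of the targeted binding. "Revoking subjects" is read here as removing
# the ServiceAccount entry below for the TTL window, after which it is restored.
# The ClusterRole and ServiceAccount names are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: training-operator-manager-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: training-operator-manager-role         # assumed ClusterRole name
subjects:
  - kind: ServiceAccount
    name: training-operator-controller-manager # assumed ServiceAccount name
    namespace: opendatahub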