# Ray Failure Modes
## Coverage
| Injection Type | Danger | Experiment | Description |
|---|---|---|---|
| FinalizerBlock | low | finalizer-block.yaml | When a stuck finalizer prevents a RayCluster from being deleted, the controller should handle the Terminating state gracefully and not leak associated head or worker pods. |
| NetworkPartition | medium | network-partition.yaml | When the ray-operator is network-partitioned from the API server, cluster scaling and health monitoring stop; existing RayClusters continue running workloads. |
| PodKill | low | pod-kill.yaml | When the ray-operator pod is killed, existing RayClusters keep running and serving workloads; new cluster requests queue until the controller recovers. |
| RBACRevoke | high | rbac-revoke.yaml | When the ray-operator ClusterRoleBinding subjects are revoked, the controller can no longer manage RayCluster resources and API calls return 403 errors. |
## Experiment Details
### ray-finalizer-block
- Type: FinalizerBlock
- Danger Level: low
- Component: ray-operator-controller-manager
When a stuck finalizer prevents a RayCluster from being deleted, the controller should handle the Terminating state gracefully and not leak associated head or worker pods. The chaos framework removes the finalizer via TTL-based cleanup after 300s.
**Experiment YAML**

```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: ray-finalizer-block
spec:
  tier: 3
  target:
    operator: ray
    component: ray-operator-controller-manager
    resource: RayCluster
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: ray-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: FinalizerBlock
    parameters:
      apiVersion: ray.io/v1
      kind: RayCluster
      name: test-raycluster
      finalizer: ray.io/finalizer
      ttl: "300s"
  hypothesis:
    description: >-
      When a stuck finalizer prevents a RayCluster from being deleted, the
      controller should handle the Terminating state gracefully and not leak
      associated head or worker pods. The chaos framework removes the
      finalizer via TTL-based cleanup after 300s.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
```
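If the 300s TTL cleanup does not fire, the blocking finalizer can also be cleared by hand. A minimal sketch with `kubectl`, using the resource name `test-raycluster` and namespace `opendatahub` from the spec above (adjust for your cluster); this mirrors what the framework's TTL cleanup does, not a documented framework command:

```shell
# Inspect the finalizers currently blocking deletion
kubectl get raycluster test-raycluster -n opendatahub \
  -o jsonpath='{.metadata.finalizers}'

# Clear the finalizer list so the stuck deletion can complete
# (merge-patching finalizers to null removes all entries)
kubectl patch raycluster test-raycluster -n opendatahub \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```

Clearing finalizers skips whatever cleanup the controller intended, so use this only after confirming the associated head and worker pods are already gone.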
### ray-network-partition
- Type: NetworkPartition
- Danger Level: medium
- Component: ray-operator-controller-manager
When the ray-operator is network-partitioned from the API server, cluster scaling and health monitoring stop. Existing RayClusters continue running workloads. Once the partition is removed, reconciliation resumes without manual intervention.
**Experiment YAML**

```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: ray-network-partition
spec:
  tier: 2
  target:
    operator: ray
    component: ray-operator-controller-manager
    resource: Deployment/ray-operator-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: ray-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: control-plane=controller-manager,app.kubernetes.io/name=kuberay-operator
      ttl: "300s"
  hypothesis:
    description: >-
      When the ray-operator is network-partitioned from the API server,
      cluster scaling and health monitoring stop. Existing RayClusters
      continue running workloads. Once the partition is removed,
      reconciliation resumes without manual intervention.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
```
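The spec does not show how the framework realizes the partition. One common mechanism, sketched here as an illustrative assumption rather than the framework's actual implementation, is a deny-all-egress NetworkPolicy scoped to the operator's labels (the same `labelSelector` the injection uses):

```yaml
# Illustrative only: cut all outbound traffic (including to the API
# server) from pods matching the operator's labels.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ray-operator-partition
  namespace: opendatahub
spec:
  podSelector:
    matchLabels:
      control-plane: controller-manager
      app.kubernetes.io/name: kuberay-operator
  policyTypes:
    - Egress
  egress: []   # no egress rules => all outbound traffic is dropped
```

Note that NetworkPolicy enforcement depends on the cluster's CNI plugin; on clusters without a policy-capable CNI this object is accepted but has no effect.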
### ray-pod-kill
- Type: PodKill
- Danger Level: low
- Component: ray-operator-controller-manager
When the ray-operator pod is killed, existing RayClusters keep running and serving workloads. New cluster requests queue until the controller recovers within the recovery timeout.
**Experiment YAML**

```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: ray-pod-kill
spec:
  tier: 1
  target:
    operator: ray
    component: ray-operator-controller-manager
    resource: Deployment/ray-operator-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: ray-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: control-plane=controller-manager,app.kubernetes.io/name=kuberay-operator
      count: 1
      ttl: "300s"
  hypothesis:
    description: >-
      When the ray-operator pod is killed, existing RayClusters keep
      running and serving workloads. New cluster requests queue until
      the controller recovers within the recovery timeout.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
```
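Because the controller runs under a Deployment, recovery is just the ReplicaSet scheduling a replacement pod. A hedged sketch of how to watch that happen within the 120s `recoveryTimeout`, using the deployment name and label selector from the spec above:

```shell
# Block until the Deployment reports its replacement pod available
kubectl -n opendatahub rollout status \
  deployment/ray-operator-controller-manager

# Confirm the new controller pod is Running (AGE should be recent)
kubectl -n opendatahub get pods \
  -l control-plane=controller-manager,app.kubernetes.io/name=kuberay-operator
```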
### ray-rbac-revoke
- Type: RBACRevoke
- Danger Level: high
- Component: ray-operator-controller-manager
When the ray-operator ClusterRoleBinding subjects are revoked, the controller can no longer manage RayCluster resources. API calls return 403 errors. Once permissions are restored, normal operation resumes without restart.
**Experiment YAML**

```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: ray-rbac-revoke
spec:
  tier: 4
  target:
    operator: ray
    component: ray-operator-controller-manager
    resource: ClusterRoleBinding/ray-operator-manager-rolebinding
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: ray-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: ray-operator-manager-rolebinding
      bindingType: ClusterRoleBinding
      ttl: "60s"
  hypothesis:
    description: >-
      When the ray-operator ClusterRoleBinding subjects are revoked, the
      controller can no longer manage RayCluster resources. API calls
      return 403 errors. Once permissions are restored, normal operation
      resumes without restart.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
```
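The 403 behavior can be checked from outside the pod by impersonating the operator's service account. A sketch with `kubectl auth can-i`; the service account name is an assumption here, so look it up from the Deployment first:

```shell
# Find the service account the controller actually runs as
kubectl -n opendatahub get deployment ray-operator-controller-manager \
  -o jsonpath='{.spec.template.spec.serviceAccountName}'

# While the binding subjects are revoked this should print "no";
# after the 60s TTL restores them, "yes". Substitute the service
# account name printed above.
kubectl auth can-i list rayclusters.ray.io \
  --as=system:serviceaccount:opendatahub:ray-operator-controller-manager
```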