workbenches Failure Modes

Coverage

| Injection Type | Danger | Experiment | Description |
|---|---|---|---|
| PodKill | low | dependency-dashboard-kill.yaml | Killing the dashboard (which workbenches integrates with for notebook management... |
| NetworkPartition | medium | network-partition.yaml | When the odh-notebook-controller pod is network-partitioned from the API server,... |
| PodKill | low | pod-kill.yaml | When the odh-notebook-controller pod is killed, Kubernetes should recreate it wi... |
| RBACRevoke | high | rbac-revoke.yaml | When the odh-notebook-controller ClusterRoleBinding subjects are revoked, the co... |
| WebhookDisrupt | high | webhook-disrupt.yaml | When the notebook mutating webhook failurePolicy is weakened from Fail to Ignore... |

Experiment Details

workbenches-dependency-dashboard-kill

  • Type: PodKill
  • Danger Level: low
  • Component: notebook-controller

Killing the dashboard (which workbenches integrates with for the notebook management UI) should not crash the notebook controller. Workbenches should continue managing existing notebooks and recover once the dashboard is restored.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: workbenches-dependency-dashboard-kill
spec:
  tier: 1
  target:
    operator: workbenches
    component: notebook-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: notebook-controller-deployment
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: "app=odh-dashboard"
    ttl: "300s"
  hypothesis:
    description: >-
      Killing the dashboard (which workbenches integrates with for notebook
      management UI) should not crash the notebook controller. Workbenches
      should continue managing existing notebooks and recover when dashboard
      is restored.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
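
The steadyState check in this experiment passes when the Deployment reports an Available condition whose status is True. A minimal Python sketch of that evaluation follows. It assumes the input is a Deployment object shaped like the output of `kubectl get deployment -o json`; the helper name `condition_is_true` is hypothetical and is not part of the chaos framework.

```python
from typing import Any, Mapping


def condition_is_true(obj: Mapping[str, Any], condition_type: str) -> bool:
    """Return True if status.conditions holds `condition_type` with status "True"."""
    conditions = obj.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == condition_type and c.get("status") == "True"
        for c in conditions
    )


# Trimmed-down Deployment object; only the fields the check reads are present.
deployment = {
    "kind": "Deployment",
    "metadata": {
        "name": "notebook-controller-deployment",
        "namespace": "opendatahub",
    },
    "status": {
        "conditions": [
            {"type": "Progressing", "status": "True"},
            {"type": "Available", "status": "True"},
        ]
    },
}

steady = condition_is_true(deployment, "Available")
```

Kubernetes stores condition status as the strings "True", "False", or "Unknown", so the comparison is against the string rather than a boolean.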

workbenches-network-partition

  • Type: NetworkPartition
  • Danger Level: medium
  • Component: odh-notebook-controller

When the odh-notebook-controller pod is network-partitioned from the API server, it should stop reconciling notebook resources. Running notebooks should continue operating independently. Once the partition is removed, the controller should resume normal operation.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: workbenches-network-partition
spec:
  tier: 2
  target:
    operator: workbenches
    component: odh-notebook-controller
    resource: Deployment/odh-notebook-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-notebook-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: app=odh-notebook-controller
    ttl: "300s"
  hypothesis:
    description: >-
      When the odh-notebook-controller pod is network-partitioned from
      the API server, it should stop reconciling notebook resources.
      Running notebooks should continue operating independently. Once the
      partition is removed, the controller should resume normal operation.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
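
This page does not state how the NetworkPartition injection is implemented. One common way to isolate pods selected by a label is a deny-all NetworkPolicy, sketched below in Python as an assumption rather than a description of the framework. The function name `deny_all_policy` and the policy name are hypothetical; the label and namespace are taken from the experiment above.

```python
import json


def deny_all_policy(name: str, namespace: str, match_labels: dict) -> dict:
    """Build a NetworkPolicy that isolates the selected pods in both directions."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": match_labels},
            # Both policy types are listed and no ingress/egress rules are
            # given, so all traffic to and from the selected pods is denied.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


policy = deny_all_policy(
    name="chaos-partition-odh-notebook-controller",  # hypothetical name
    namespace="opendatahub",
    match_labels={"app": "odh-notebook-controller"},
)

# kubectl accepts JSON manifests as well as YAML.
manifest = json.dumps(policy, indent=2)
```

Whether such a policy actually cuts the controller off from the API server depends on the cluster's network plugin enforcing NetworkPolicy, so treat this as an illustration of the idea only.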

workbenches-pod-kill

  • Type: PodKill
  • Danger Level: low
  • Component: odh-notebook-controller

When the odh-notebook-controller pod is killed, Kubernetes should recreate it within the recovery timeout. The controller should resume managing notebook workbenches without interrupting running notebook sessions.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: workbenches-pod-kill
spec:
  tier: 1
  target:
    operator: workbenches
    component: odh-notebook-controller
    resource: Deployment/odh-notebook-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-notebook-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app=odh-notebook-controller
    count: 1
    ttl: "300s"
  hypothesis:
    description: >-
      When the odh-notebook-controller pod is killed, Kubernetes should
      recreate it within the recovery timeout. The controller should
      resume managing notebook workbenches without interrupting running
      notebook sessions.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
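
The hypothesis here depends on the Deployment becoming Available again within recoveryTimeout. The sketch below shows one way such a wait could be written in Python: poll a check until it passes or the deadline is reached. The names `parse_seconds` and `wait_for_recovery`, the polling interval, and the restriction to durations of the form "120s" are all assumptions for illustration, not the framework's actual behavior.

```python
import re
import time
from typing import Callable


def parse_seconds(value: str) -> float:
    """Parse a duration written as whole seconds, e.g. "120s"."""
    match = re.fullmatch(r"(\d+)s", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return float(match.group(1))


def wait_for_recovery(
    check: Callable[[], bool],
    recovery_timeout: str,
    interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it returns True or the timeout elapses."""
    deadline = clock() + parse_seconds(recovery_timeout)
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The `clock` and `sleep` parameters exist so the loop can be exercised with a fake clock in tests. In real use, `check` would wrap a steady-state probe such as the Available condition on odh-notebook-controller-manager.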

workbenches-rbac-revoke

  • Type: RBACRevoke
  • Danger Level: high
  • Component: odh-notebook-controller

When the odh-notebook-controller ClusterRoleBinding subjects are revoked, the controller should lose its ability to manage notebook resources across namespaces and surface permission-denied errors. Once permissions are restored, reconciliation should resume.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: workbenches-rbac-revoke
spec:
  tier: 4
  target:
    operator: workbenches
    component: odh-notebook-controller
    resource: ClusterRoleBinding/odh-notebook-controller-manager-rolebinding
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-notebook-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: odh-notebook-controller-manager-rolebinding
      bindingType: ClusterRoleBinding
    ttl: "60s"
  hypothesis:
    description: >-
      When the odh-notebook-controller ClusterRoleBinding subjects are
      revoked, the controller should lose its ability to manage notebook
      resources across namespaces and surface permission-denied errors.
      Once permissions are restored, reconciliation should resume.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
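
The revoke-then-restore cycle described above can be pictured as two pure transformations on the ClusterRoleBinding object: clear `subjects` while remembering the originals, then put them back. The Python sketch below is an illustration under that assumption. The helper names, the ServiceAccount name, and the roleRef name are hypothetical, and the framework's real implementation is not shown on this page.

```python
import copy
from typing import List, Tuple


def revoke_subjects(binding: dict) -> Tuple[dict, List[dict]]:
    """Return a copy of the binding with no subjects, plus the saved originals."""
    saved = copy.deepcopy(binding.get("subjects", []))
    revoked = copy.deepcopy(binding)
    revoked["subjects"] = []
    return revoked, saved


def restore_subjects(binding: dict, saved: List[dict]) -> dict:
    """Return a copy of the binding with the saved subjects put back."""
    restored = copy.deepcopy(binding)
    restored["subjects"] = copy.deepcopy(saved)
    return restored


binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRoleBinding",
    "metadata": {"name": "odh-notebook-controller-manager-rolebinding"},
    "roleRef": {
        "apiGroup": "rbac.authorization.k8s.io",
        "kind": "ClusterRole",
        "name": "example-manager-role",  # hypothetical role name
    },
    "subjects": [
        {
            "kind": "ServiceAccount",
            "name": "example-controller-sa",  # hypothetical account name
            "namespace": "opendatahub",
        }
    ],
}

revoked, saved = revoke_subjects(binding)
restored = restore_subjects(revoked, saved)
```

Working on copies keeps the original object intact, which matters for cleanup: the saved subjects are the only record of what to restore once the 60s TTL expires.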

workbenches-webhook-disrupt

  • Type: WebhookDisrupt
  • Danger Level: high
  • Component: odh-notebook-controller

When the notebook mutating webhook failurePolicy is weakened from Fail to Ignore, new notebooks may be created without the required sidecar injection and OAuth proxy configuration. The chaos framework restores the original failurePolicy via TTL-based cleanup after 60s.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: workbenches-webhook-disrupt
spec:
  tier: 4
  target:
    operator: workbenches
    component: odh-notebook-controller
    resource: MutatingWebhookConfiguration/notebooks.opendatahub.io
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-notebook-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: WebhookDisrupt
    dangerLevel: high
    parameters:
      webhookName: notebooks.opendatahub.io
      webhookType: mutating
      action: setFailurePolicy
      value: Ignore
    ttl: "60s"
  hypothesis:
    description: >-
      When the notebook mutating webhook failurePolicy is weakened from
      Fail to Ignore, new notebooks may be created without the required
      sidecar injection and OAuth proxy configuration. The chaos framework
      restores the original failurePolicy via TTL-based cleanup after 60s.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
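
The setFailurePolicy action and its TTL-based cleanup amount to overwriting `failurePolicy` on the webhook configuration and later writing the original value back. The Python sketch below illustrates that round trip on a plain dictionary. It assumes every webhook entry in the configuration is patched, which this page does not confirm; the helper names and the webhook entry name are hypothetical.

```python
import copy
from typing import Dict, Tuple

VALID_POLICIES = ("Fail", "Ignore")


def set_failure_policy(config: dict, value: str) -> Tuple[dict, Dict[str, str]]:
    """Return a patched copy of the config and the original policy per webhook."""
    if value not in VALID_POLICIES:
        raise ValueError(f"failurePolicy must be one of {VALID_POLICIES}")
    patched = copy.deepcopy(config)
    originals: Dict[str, str] = {}
    for hook in patched.get("webhooks", []):
        # admissionregistration.k8s.io/v1 defaults failurePolicy to Fail.
        originals[hook["name"]] = hook.get("failurePolicy", "Fail")
        hook["failurePolicy"] = value
    return patched, originals


def restore_failure_policy(config: dict, originals: Dict[str, str]) -> dict:
    """Return a copy of the config with each recorded policy written back."""
    restored = copy.deepcopy(config)
    for hook in restored.get("webhooks", []):
        if hook["name"] in originals:
            hook["failurePolicy"] = originals[hook["name"]]
    return restored


config = {
    "apiVersion": "admissionregistration.k8s.io/v1",
    "kind": "MutatingWebhookConfiguration",
    "metadata": {"name": "notebooks.opendatahub.io"},
    "webhooks": [
        # Hypothetical entry name; only the fields used here are shown.
        {"name": "example.notebooks.opendatahub.io", "failurePolicy": "Fail"}
    ],
}

weakened, originals = set_failure_policy(config, "Ignore")
restored = restore_failure_policy(weakened, originals)
```

While the policy is Ignore, a failing webhook call no longer blocks admission, which is why notebooks created in that window may be missing the expected mutations.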