kueue Failure Modes¶

Coverage¶

Injection Type	Danger	Experiment	Description
FinalizerBlock	low	finalizer-block.yaml	When a stuck finalizer prevents a Workload from being deleted, the controller sh...
NetworkPartition	medium	network-partition.yaml	When kueue-controller-manager pods are network-partitioned from the API server, ...
PodKill	low	pod-kill.yaml	When the kueue-controller-manager pod is killed, pending workloads should queue ...
RBACRevoke	high	rbac-revoke.yaml	When the kueue ClusterRoleBinding subjects are revoked, the controller can no lo...
WebhookDisrupt	high	webhook-disrupt.yaml	When the kueue validating webhook failurePolicy is weakened from Fail to Ignore,...

Experiment Details¶

kueue-finalizer-block¶

Type: FinalizerBlock
Danger Level: low
Component: kueue-controller-manager

When a stuck finalizer prevents a Workload from being deleted, the controller should handle the Terminating state gracefully and not leak associated resource quota reservations. The chaos framework removes the finalizer via TTL-based cleanup after 300s.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kueue-finalizer-block
spec:
  tier: 3
  target:
    operator: kueue
    component: kueue-controller-manager
    resource: Workload
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kueue-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: FinalizerBlock
    parameters:
      apiVersion: kueue.x-k8s.io/v1beta1
      kind: Workload
      name: test-workload
      finalizer: kueue.x-k8s.io/managed-resources
    ttl: "300s"
  hypothesis:
    description: >-
      When a stuck finalizer prevents a Workload from being deleted, the
      controller should handle the Terminating state gracefully and not
      leak associated resource quota reservations. The chaos framework
      removes the finalizer via TTL-based cleanup after 300s.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

kueue-network-partition¶

Type: NetworkPartition
Danger Level: medium
Component: kueue-controller-manager

When kueue-controller-manager pods are network-partitioned from the API server, workload admission stops and no new workloads are scheduled. Existing admitted workloads continue running. Once the partition is removed, scheduling resumes without manual intervention.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kueue-network-partition
spec:
  tier: 2
  target:
    operator: kueue
    component: kueue-controller-manager
    resource: Deployment/kueue-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kueue-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: control-plane=controller-manager,app.kubernetes.io/name=kueue
    ttl: "300s"
  hypothesis:
    description: >-
      When kueue-controller-manager pods are network-partitioned from the
      API server, workload admission stops and no new workloads are
      scheduled. Existing admitted workloads continue running. Once the
      partition is removed, scheduling resumes without manual intervention.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

kueue-pod-kill¶

Type: PodKill
Danger Level: low
Component: kueue-controller-manager

When the kueue-controller-manager pod is killed, pending workloads should queue but not be admitted during downtime. Kubernetes should recreate the pod, and the controller should recover and resume scheduling within the recovery timeout.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kueue-pod-kill
spec:
  tier: 1
  target:
    operator: kueue
    component: kueue-controller-manager
    resource: Deployment/kueue-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kueue-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: control-plane=controller-manager,app.kubernetes.io/name=kueue
    count: 1
    ttl: "300s"
  hypothesis:
    description: >-
      When the kueue-controller-manager pod is killed, pending workloads
      should queue but not be admitted during downtime. Kubernetes should
      recreate the pod, and the controller should recover and resume
      scheduling within the recovery timeout.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

kueue-rbac-revoke¶

Type: RBACRevoke
Danger Level: high
Component: kueue-controller-manager

When the kueue ClusterRoleBinding subjects are revoked, the controller can no longer read or update Workloads, ClusterQueues, or LocalQueues. Admission stops with 403 errors. Once permissions are restored, normal scheduling resumes without restart.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kueue-rbac-revoke
spec:
  tier: 4
  target:
    operator: kueue
    component: kueue-controller-manager
    resource: ClusterRoleBinding/kueue-controller-manager-rolebinding
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kueue-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: kueue-controller-manager-rolebinding
      bindingType: ClusterRoleBinding
    ttl: "60s"
  hypothesis:
    description: >-
      When the kueue ClusterRoleBinding subjects are revoked, the controller
      can no longer read or update Workloads, ClusterQueues, or LocalQueues.
      Admission stops with 403 errors. Once permissions are restored, normal
      scheduling resumes without restart.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true

kueue-webhook-disrupt¶

Type: WebhookDisrupt
Danger Level: high
Component: kueue-controller-manager

When the kueue validating webhook failurePolicy is weakened from Fail to Ignore, invalid Workload and ClusterQueue specs can be submitted bypassing validation. The controller should handle invalid resources gracefully. The chaos framework restores the original failurePolicy via TTL-based cleanup after 60s.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kueue-webhook-disrupt
spec:
  tier: 4
  target:
    operator: kueue
    component: kueue-controller-manager
    resource: ValidatingWebhookConfiguration/vworkload.kb.io
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kueue-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: WebhookDisrupt
    dangerLevel: high
    parameters:
      webhookName: vworkload.kb.io
      action: setFailurePolicy
      value: Ignore
    ttl: "60s"
  hypothesis:
    description: >-
      When the kueue validating webhook failurePolicy is weakened from Fail
      to Ignore, invalid Workload and ClusterQueue specs can be submitted
      bypassing validation. The controller should handle invalid resources
      gracefully. The chaos framework restores the original failurePolicy
      via TTL-based cleanup after 60s.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true