Failure Modes Overview

Overview of all failure injection types available in Operator Chaos.

Quick Reference

| Type | Danger | Description |
|------|--------|-------------|
| CRDMutation | Medium | Mutates a spec field on a custom resource instance to test reconciliation of CR state. |
| ClientFault | Low | Injects errors, latency, or throttling into operator API calls via SDK integration. |
| ConfigDrift | Low | Modifies a key in a ConfigMap or Secret to test configuration reconciliation. |
| FinalizerBlock | Medium | Adds a stuck finalizer to a resource to test deletion handling and cleanup logic. |
| LabelStomping | Medium | Modifies or removes labels on operator-managed resources to test label-based reconciliation. |
| NamespaceDeletion | High | Deletes an entire namespace to test whether the operator recreates it and its managed resources. |
| NetworkPartition | Medium | Creates a deny-all NetworkPolicy isolating pods matching a label selector from all ingress and egress traffic. |
| OwnerRefOrphan | Medium | Removes ownerReferences from operator-managed resources to test re-adoption logic. |
| PodKill | Low | Force-deletes pods matching a label selector with zero grace period. |
| QuotaExhaustion | Medium | Creates a restrictive ResourceQuota to test operator behavior under resource pressure. |
| RBACRevoke | High | Clears all subjects from a ClusterRoleBinding or RoleBinding to test RBAC resilience. |
| WebhookDisrupt | High | Modifies failure policies on a ValidatingWebhookConfiguration to test webhook resilience. |
| WebhookLatency | High | Deploys a slow admission webhook to add latency to API server requests for specific resources. |
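
To make the NetworkPartition entry more concrete, the sketch below builds a deny-all NetworkPolicy manifest in Python. It shows the standard Kubernetes pattern (both policy types listed, no allow rules defined) and is not necessarily the exact object Operator Chaos creates. The function name, policy name, namespace, and labels are all hypothetical.

```python
import json


def deny_all_network_policy(name, namespace, match_labels):
    """Build a deny-all NetworkPolicy manifest (illustrative helper, not part of Operator Chaos)."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Only pods carrying these labels are isolated.
            "podSelector": {"matchLabels": dict(match_labels)},
            # Listing both policy types while defining no ingress or egress
            # rules denies all traffic in both directions for the selected pods.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


# Hypothetical names, chosen only for illustration.
policy = deny_all_network_policy(
    name="chaos-deny-all",
    namespace="example-ns",
    match_labels={"app": "example-operator"},
)
print(json.dumps(policy, indent=2))
```

Removing the policy restores traffic, which is what makes this kind of injection straightforward to revert after an experiment.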

Decision Tree

Which failure mode should I use?

graph TD
    A[What are you testing?] --> B{Pod lifecycle?}
    B -->|Yes| C[PodKill]
    A --> D{Network resilience?}
    D -->|Yes| E[NetworkPartition]
    A --> F{Config reconciliation?}
    F -->|Yes| G[ConfigDrift]
    A --> H{CR spec handling?}
    H -->|Yes| I[CRDMutation]
    A --> J{Webhook resilience?}
    J -->|Yes| K[WebhookDisrupt]
    A --> L{Permission handling?}
    L -->|Yes| M[RBACRevoke]
    A --> N{Deletion/cleanup?}
    N -->|Yes| O[FinalizerBlock]
    A --> P{API error handling?}
    P -->|Yes| Q[ClientFault]
    A --> R{Ownership/adoption?}
    R -->|Yes| S[OwnerRefOrphan]
    A --> T{Resource pressure?}
    T -->|Yes| U[QuotaExhaustion]
    A --> V{API latency?}
    V -->|Yes| W[WebhookLatency]
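
The decision tree above is a flat mapping from a testing concern to a failure mode, so it can also be expressed as a lookup table. The following is a minimal sketch of that idea; the dictionary and function names are hypothetical and do not correspond to any Operator Chaos API.

```python
# Concern -> failure mode, transcribed from the decision tree above.
FAILURE_MODE_BY_CONCERN = {
    "Pod lifecycle": "PodKill",
    "Network resilience": "NetworkPartition",
    "Config reconciliation": "ConfigDrift",
    "CR spec handling": "CRDMutation",
    "Webhook resilience": "WebhookDisrupt",
    "Permission handling": "RBACRevoke",
    "Deletion/cleanup": "FinalizerBlock",
    "API error handling": "ClientFault",
    "Ownership/adoption": "OwnerRefOrphan",
    "Resource pressure": "QuotaExhaustion",
    "API latency": "WebhookLatency",
}


def suggest_failure_mode(concern):
    """Return the failure mode for a concern (case-insensitive), or None if unknown."""
    wanted = concern.strip().lower()
    for known, mode in FAILURE_MODE_BY_CONCERN.items():
        if known.lower() == wanted:
            return mode
    return None


print(suggest_failure_mode("resource pressure"))
print(suggest_failure_mode("disk corruption"))
```

Returning `None` for an unrecognized concern keeps the caller responsible for deciding what to do when no branch of the tree applies.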

Coverage by Component

Number of failure modes covered per component, out of the 13 types (CRDMutation, ClientFault, ConfigDrift, FinalizerBlock, LabelStomping, NamespaceDeletion, NetworkPartition, OwnerRefOrphan, PodKill, QuotaExhaustion, RBACRevoke, WebhookDisrupt, WebhookLatency):

| Component | Total |
|-----------|-------|
| codeflare | 4 |
| dashboard | 6 |
| data-science-pipelines | 5 |
| feast | 3 |
| kserve | 6 |
| kueue | 5 |
| llamastack | 4 |
| model-registry | 6 |
| modelmesh | 5 |
| odh-model-controller | 13 |
| opendatahub-operator | 5 |
| ray | 4 |
| training-operator | 4 |
| trustyai | 3 |
| workbenches | 4 |
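
The per-component totals can be turned into a quick coverage ranking. The sketch below copies the Total column from the table above and compares each value against the 13 available failure modes; the variable and function names are hypothetical.

```python
TOTAL_FAILURE_MODES = 13

# Values copied from the Total column of the coverage table.
COVERED = {
    "codeflare": 4,
    "dashboard": 6,
    "data-science-pipelines": 5,
    "feast": 3,
    "kserve": 6,
    "kueue": 5,
    "llamastack": 4,
    "model-registry": 6,
    "modelmesh": 5,
    "odh-model-controller": 13,
    "opendatahub-operator": 5,
    "ray": 4,
    "training-operator": 4,
    "trustyai": 3,
    "workbenches": 4,
}


def fully_covered():
    """Components exercised by every failure mode."""
    return sorted(c for c, n in COVERED.items() if n == TOTAL_FAILURE_MODES)


def least_covered():
    """Components with the lowest coverage count."""
    lowest = min(COVERED.values())
    return sorted(c for c, n in COVERED.items() if n == lowest)


# Print components from most to least covered, ties broken alphabetically.
for name, count in sorted(COVERED.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{name:<24}{count:>2}/{TOTAL_FAILURE_MODES}  {count / TOTAL_FAILURE_MODES:.0%}")
```

A ranking like this is one way to spot which components might benefit from additional experiments first.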