Failure Modes Overview
This page summarizes the 13 failure injection types available in Operator Chaos.
Quick Reference
| Type | Danger | Description |
|---|---|---|
| CRDMutation | Medium | Mutates a spec field on a custom resource instance to test reconciliation of CR state. |
| ClientFault | Low | Injects errors, latency, or throttling into operator API calls via SDK integration. |
| ConfigDrift | Low | Modifies a key in a ConfigMap or Secret to test configuration reconciliation. |
| FinalizerBlock | Medium | Adds a stuck finalizer to a resource to test deletion handling and cleanup logic. |
| LabelStomping | Medium | Modifies or removes labels on operator-managed resources to test label-based reconciliation. |
| NamespaceDeletion | High | Deletes an entire namespace to test whether the operator recreates it and its managed resources. |
| NetworkPartition | Medium | Creates a deny-all NetworkPolicy isolating pods matching a label selector from all ingress and egress traffic. |
| OwnerRefOrphan | Medium | Removes ownerReferences from operator-managed resources to test re-adoption logic. |
| PodKill | Low | Force-deletes pods matching a label selector with zero grace period. |
| QuotaExhaustion | Medium | Creates a restrictive ResourceQuota to test operator behavior under resource pressure. |
| RBACRevoke | High | Clears all subjects from a ClusterRoleBinding or RoleBinding to test RBAC resilience. |
| WebhookDisrupt | High | Modifies failure policies on a ValidatingWebhookConfiguration to test webhook resilience. |
| WebhookLatency | High | Deploys a slow admission webhook to add latency to API server requests for specific resources. |
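
To make the NetworkPartition row more concrete, the following is a minimal sketch of what a deny-all NetworkPolicy manifest could look like, built as a plain Python dictionary. It illustrates the standard Kubernetes pattern (select pods by label, declare both policy types, define no allow rules) and is not necessarily the exact object Operator Chaos creates. The function name `deny_all_network_policy`, the policy name, the namespace, and the labels are all hypothetical.

```python
from typing import Dict


def deny_all_network_policy(name: str, namespace: str, match_labels: Dict[str, str]) -> dict:
    """Build a deny-all NetworkPolicy manifest for pods matching match_labels."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Only pods carrying these labels are isolated.
            "podSelector": {"matchLabels": dict(match_labels)},
            # Both policy types are listed but no ingress or egress rules
            # are defined, so no traffic is allowed in either direction.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


# Hypothetical values, for illustration only.
policy = deny_all_network_policy(
    name="chaos-partition",
    namespace="example-namespace",
    match_labels={"app": "example-operator"},
)
```

Deleting the policy restores connectivity, which is one reason this kind of injection is straightforward to roll back.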
Decision Tree
Which failure mode should I use?
graph TD
A[What are you testing?] --> B{Pod lifecycle?}
B -->|Yes| C[PodKill]
A --> D{Network resilience?}
D -->|Yes| E[NetworkPartition]
A --> F{Config reconciliation?}
F -->|Yes| G[ConfigDrift]
A --> H{CR spec handling?}
H -->|Yes| I[CRDMutation]
A --> J{Webhook resilience?}
J -->|Yes| K[WebhookDisrupt]
A --> L{Permission handling?}
L -->|Yes| M[RBACRevoke]
A --> N{Deletion/cleanup?}
N -->|Yes| O[FinalizerBlock]
A --> P{API error handling?}
P -->|Yes| Q[ClientFault]
A --> R{Ownership/adoption?}
R -->|Yes| S[OwnerRefOrphan]
A --> T{Resource pressure?}
T -->|Yes| U[QuotaExhaustion]
A --> V{API latency?}
V -->|Yes| W[WebhookLatency]
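
The branches of the decision tree can also be expressed as a simple lookup, which may be handy when choosing a failure mode from a script. This is only a sketch: the names `DECISION_TREE` and `suggest_failure_mode` are hypothetical and not part of Operator Chaos. The tree has eleven branches; LabelStomping and NamespaceDeletion do not appear in it, so they are absent from the lookup as well.

```python
from typing import Optional

# Question -> failure mode, copied from the decision tree branches.
DECISION_TREE = {
    "pod lifecycle": "PodKill",
    "network resilience": "NetworkPartition",
    "config reconciliation": "ConfigDrift",
    "cr spec handling": "CRDMutation",
    "webhook resilience": "WebhookDisrupt",
    "permission handling": "RBACRevoke",
    "deletion/cleanup": "FinalizerBlock",
    "api error handling": "ClientFault",
    "ownership/adoption": "OwnerRefOrphan",
    "resource pressure": "QuotaExhaustion",
    "api latency": "WebhookLatency",
}


def suggest_failure_mode(concern: str) -> Optional[str]:
    """Return the failure mode for a concern, or None if the tree has no branch for it."""
    # Normalize so "Pod lifecycle?" and "pod lifecycle" resolve to the same key.
    key = concern.strip().rstrip("?").strip().lower()
    return DECISION_TREE.get(key)
```

For example, `suggest_failure_mode("Pod lifecycle?")` resolves to `PodKill`, while a concern with no branch in the tree yields `None`.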
Coverage by Component
| Component | Failure modes covered (of 13) |
|---|---|
| codeflare | 4 |
| dashboard | 6 |
| data-science-pipelines | 5 |
| feast | 3 |
| kserve | 6 |
| kueue | 5 |
| llamastack | 4 |
| model-registry | 6 |
| modelmesh | 5 |
| odh-model-controller | 13 |
| opendatahub-operator | 5 |
| ray | 4 |
| training-operator | 4 |
| trustyai | 3 |
| workbenches | 4 |
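
The totals in the table above can be turned into a ranked list of coverage gaps, for instance to decide which component needs new experiments first. The sketch below copies the per-component totals from the table; the names `COVERAGE_TOTALS` and `coverage_gaps` are hypothetical helpers, not part of Operator Chaos.

```python
from typing import Dict, List, Tuple

TOTAL_FAILURE_MODES = 13  # number of failure modes in the Quick Reference table

# Per-component totals, copied from the coverage table.
COVERAGE_TOTALS = {
    "codeflare": 4,
    "dashboard": 6,
    "data-science-pipelines": 5,
    "feast": 3,
    "kserve": 6,
    "kueue": 5,
    "llamastack": 4,
    "model-registry": 6,
    "modelmesh": 5,
    "odh-model-controller": 13,
    "opendatahub-operator": 5,
    "ray": 4,
    "training-operator": 4,
    "trustyai": 3,
    "workbenches": 4,
}


def coverage_gaps(
    totals: Dict[str, int], available: int = TOTAL_FAILURE_MODES
) -> List[Tuple[str, int]]:
    """Return (component, uncovered count) pairs, largest gap first, ties by name."""
    gaps = {component: available - covered for component, covered in totals.items()}
    return sorted(gaps.items(), key=lambda item: (-item[1], item[0]))


gaps = coverage_gaps(COVERAGE_TOTALS)
```

With the totals above, feast and trustyai have the largest gap (10 of 13 failure modes uncovered), while odh-model-controller has none.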