Custom Failure Modes¶
This guide covers how to extend operator-chaos with custom failure modes. There are two paths depending on your use case:
- YAML Composition (no code): Write custom experiments using existing injection types
- Go Plugin Development: Add entirely new injection types to the framework
Most users will use YAML composition. Go plugins are only needed when the fault you want to inject isn't expressible through existing injection types.
Path A: YAML Composition (No Code Required)¶
This is the primary extensibility path. You can create complex, custom chaos experiments by composing the 11 built-in injection types with different parameters, targets, and steady-state checks.
Built-in Injection Types¶
The built-in injection types are:
| Type | What It Does | Danger Level |
|---|---|---|
| PodKill | Force-delete pods matching a label selector | Low |
| NetworkPartition | Inject NetworkPolicy to isolate pods | Medium |
| ConfigDrift | Modify ConfigMap or Secret keys | Low to High |
| CRDMutation | Modify custom resource fields | Medium to High |
| WebhookDisrupt | Corrupt webhook configurations | High |
| RBACRevoke | Remove RBAC permissions | Medium |
| FinalizerBlock | Add blocking finalizers to resources | Medium |
| ClientFault | Inject API server request failures | High |
| OwnerRefOrphan | Remove ownerReferences to test re-adoption | Medium |
| QuotaExhaustion | Create restrictive ResourceQuota | Medium |
| WebhookLatency | Deploy slow admission webhook | High |
See the Failure Modes reference for full details on each type.
Experiment YAML Structure¶
Every experiment follows this structure:
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: my-custom-experiment
  labels:
    component: my-component
    severity: standard
spec:
  # What to target
  target:
    operator: opendatahub-operator
    component: my-component
    resource: Deployment/my-controller
  # What fault to inject
  injection:
    type: PodKill
    parameters:
      signal: "SIGKILL"
      labelSelector: "app=my-controller"
    count: 1
    ttl: "300s"
  # What "healthy" looks like
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: my-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  # What you expect to happen
  hypothesis:
    description: "Controller recovers from pod termination within 60s"
    recoveryTimeout: "60s"
  # Safety limits
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    dryRun: false
Writing Custom Steady-State Checks¶
Steady-state checks define what "healthy" means for your component. There are three types:
1. conditionTrue Checks¶
Verify that a Kubernetes resource has a specific condition set to True:
steadyState:
  checks:
    - type: conditionTrue
      apiVersion: apps/v1
      kind: Deployment
      name: my-controller
      namespace: opendatahub
      conditionType: Available
Common conditions:
- Deployment: Available, Progressing
- StatefulSet: Available
- Pod: Ready, ContainersReady
- Custom resources: any condition defined in the CRD
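To make the semantics concrete, here is a minimal Go sketch of how a conditionTrue check could be evaluated. The `Condition` struct and the `conditionIsTrue` function are hypothetical stand-ins for illustration, not the framework's own code; a real check would read the conditions from the resource's status through the Kubernetes API. Treating a missing condition as a failure is an assumption of this sketch.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes status condition
// (hypothetical type, for illustration only).
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// conditionIsTrue reports whether the named condition exists and has
// status "True". A missing condition counts as not satisfied.
func conditionIsTrue(conds []Condition, conditionType string) bool {
	for _, c := range conds {
		if c.Type == conditionType {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	conds := []Condition{
		{Type: "Progressing", Status: "True"},
		{Type: "Available", Status: "False"},
	}
	fmt.Println("Available:", conditionIsTrue(conds, "Available"))
	fmt.Println("Progressing:", conditionIsTrue(conds, "Progressing"))
	fmt.Println("Ready:", conditionIsTrue(conds, "Ready")) // condition absent
}
```

Note that a status of "Unknown" is not treated as healthy here, which is the conservative choice for a steady-state gate.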
2. podReady Checks¶
Verify that pods matching a label selector are ready:
steadyState:
  checks:
    - type: podReady
      namespace: opendatahub
      labelSelector: "app=my-controller"
      minReadyPods: 1
3. customCommand Checks¶
Run arbitrary commands to verify state:
steadyState:
  checks:
    - type: customCommand
      command: "kubectl get inferenceservice -n test-ns my-model -o jsonpath='{.status.url}'"
      expectedOutput: "http://my-model.test-ns.svc.cluster.local"
Use custom commands sparingly. Prefer conditionTrue and podReady when possible, as they're more reliable and don't depend on shell availability.
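The shell dependency mentioned above is easy to see in code. The following Go sketch shows one way a customCommand check could work: run the command through `sh -c` and compare its trimmed output to the expected string. The function name `runCommandCheck`, the fixed timeout, and the exact-match comparison are all assumptions made for this example; the framework may compare output differently.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// checkTimeout bounds how long a single command may run (assumed value).
const checkTimeout = 10 * time.Second

// runCommandCheck runs command through "sh -c" and reports whether its
// trimmed stdout equals expected. It returns an error if the command
// cannot be started, exits non-zero, or exceeds checkTimeout.
func runCommandCheck(command, expected string) (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), checkTimeout)
	defer cancel()

	out, err := exec.CommandContext(ctx, "sh", "-c", command).Output()
	if err != nil {
		return false, fmt.Errorf("running %q: %w", command, err)
	}
	return strings.TrimSpace(string(out)) == expected, nil
}

func main() {
	ok, err := runCommandCheck("echo ready", "ready")
	fmt.Println("matched:", ok, "error:", err)
}
```

Because the check needs both a shell and the invoked binary (for example kubectl) on the machine running the experiment, it can fail for reasons unrelated to the component under test.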
Parameterizing for Different Environments¶
The framework supports any Kubernetes operator deployment. Different deployments may differ in namespaces, labels, and resource names. For example, ODH and RHOAI use different defaults:
Namespace Differences¶
- ODH: opendatahub
- RHOAI: redhat-ods-applications (for most components)
Using Knowledge Overlays¶
The knowledge/ directory contains topology models for each operator. There are overlay directories for environment-specific variations:
knowledge/
  dashboard.yaml        # ODH defaults
  rhoai/
    dashboard.yaml      # RHOAI-specific overrides
  odh/v2.10/
    dashboard.yaml      # ODH 2.10-specific overrides
  rhoai/v3.3/
    dashboard.yaml      # RHOAI 3.3-specific overrides
When you pass --knowledge more than once, later paths override earlier ones:
# ODH 2.10
operator-chaos run my-experiment.yaml \
--knowledge knowledge/*.yaml \
--knowledge knowledge/odh/v2.10/*.yaml
# RHOAI 3.3
operator-chaos run my-experiment.yaml \
--knowledge knowledge/*.yaml \
--knowledge knowledge/rhoai/*.yaml \
--knowledge knowledge/rhoai/v3.3/*.yaml
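The "later paths override earlier ones" rule can be pictured as a layered merge. The Go sketch below illustrates it with flat key/value maps; `mergeOverlays` and the key names are hypothetical, and the real knowledge models are structured YAML whose merge behavior (for example for nested fields or lists) may differ.

```go
package main

import "fmt"

// mergeOverlays applies layers in order. A key set in a later layer
// replaces the value from any earlier layer; keys that a later layer
// does not mention are inherited unchanged.
func mergeOverlays(layers ...map[string]string) map[string]string {
	merged := make(map[string]string)
	for _, layer := range layers {
		for k, v := range layer {
			merged[k] = v
		}
	}
	return merged
}

func main() {
	// Hypothetical flattened view of knowledge/dashboard.yaml.
	base := map[string]string{
		"namespace":     "opendatahub",
		"labelSelector": "app.kubernetes.io/part-of=odh-dashboard",
		"kind":          "Deployment",
	}
	// Hypothetical flattened view of knowledge/rhoai/dashboard.yaml.
	rhoai := map[string]string{
		"namespace":     "redhat-ods-applications",
		"labelSelector": "app.kubernetes.io/part-of=rhods-dashboard",
	}

	effective := mergeOverlays(base, rhoai)
	fmt.Println(effective["namespace"])
	fmt.Println(effective["labelSelector"])
	fmt.Println(effective["kind"]) // inherited from the base layer
}
```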
Label Differences¶
Some components use different labels in RHOAI:
# ODH
labelSelector: "app.kubernetes.io/part-of=odh-dashboard"
# RHOAI
labelSelector: "app.kubernetes.io/part-of=rhods-dashboard"
Check the knowledge YAML for your target component to see the correct labels.
Chaining Experiments in Suites¶
The operator-chaos suite command runs multiple experiments sequentially or in parallel:
# Run all experiments in a directory
operator-chaos suite experiments/dashboard/ --namespace opendatahub
# Dry-run to validate without executing
operator-chaos suite experiments/dashboard/ --dry-run
# Run in parallel (max 4 concurrent)
operator-chaos suite experiments/dashboard/ --parallel 4
# Generate JUnit report
operator-chaos suite experiments/dashboard/ \
--report-dir reports/ \
--namespace opendatahub
Custom suite directories for targeted testing:
experiments/
  dashboard/
    podkill.yaml
    config-drift.yaml
    network-partition.yaml
  kserve/
    webhook-disrupt.yaml
    crd-mutation.yaml
  smoke-tests/          # Custom suite for CI
    critical-path.yaml
    basic-recovery.yaml
Suite-level features:
- Sequential execution (guarantees order)
- Parallel execution (max concurrency limit)
- Aggregate reporting (JUnit XML, JSON)
- Early termination on first failure
- Timeout per experiment
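A concurrency limit such as `--parallel 4` is commonly implemented with a counting semaphore. The Go sketch below shows that general pattern, not the suite runner's actual code: `runSuite`, `errFailed`, and the `run` callback are hypothetical names, and the callback stands in for whatever executes a single experiment.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errFailed is a placeholder error for a failed experiment.
var errFailed = errors.New("experiment failed")

// runSuite runs every experiment with at most maxParallel in flight and
// returns one result per experiment, in input order.
func runSuite(experiments []string, maxParallel int, run func(name string) error) []error {
	if maxParallel < 1 {
		maxParallel = 1 // fall back to sequential execution
	}
	results := make([]error, len(experiments))
	sem := make(chan struct{}, maxParallel) // counting semaphore
	var wg sync.WaitGroup

	for i, name := range experiments {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxParallel experiments are running
		go func(i int, name string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			results[i] = run(name)   // each goroutine writes only its own index
		}(i, name)
	}
	wg.Wait()
	return results
}

func main() {
	experiments := []string{"podkill.yaml", "config-drift.yaml", "network-partition.yaml"}
	results := runSuite(experiments, 2, func(name string) error {
		if name == "config-drift.yaml" {
			return errFailed
		}
		return nil
	})
	for i, err := range results {
		fmt.Printf("%s: %v\n", experiments[i], err)
	}
}
```

With `maxParallel` set to 1 the same loop degenerates to sequential execution, which is one way a runner can offer both modes from a single code path.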
Example: Complete Custom Experiment¶
Here's a full example testing ConfigMap corruption handling:
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: custom-config-resilience
  labels:
    component: kserve
    severity: high
    test-type: config-validation
spec:
  target:
    operator: kserve
    component: kserve-controller-manager
    resource: ConfigMap/inferenceservice-config
  injection:
    type: ConfigDrift
    parameters:
      resourceType: ConfigMap
      name: inferenceservice-config
      key: deploy
      value: '{"defaultDeploymentMode": "InvalidMode"}'
    ttl: "300s"
  steadyState:
    checks:
      # Verify controller stays healthy
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kserve-controller-manager
        namespace: opendatahub
        conditionType: Available
      # Verify pods don't crash
      - type: podReady
        namespace: opendatahub
        labelSelector: "control-plane=kserve-controller-manager"
        minReadyPods: 1
    timeout: "30s"
  hypothesis:
    description: >
      KServe controller should detect invalid config and either:
      (a) reconcile back to valid state, or
      (b) log validation errors without crashing.
      New InferenceService creations should fail with clear error messages.
    recoveryTimeout: "60s"
  blastRadius:
    maxPodsAffected: 5
    allowedNamespaces:
      - opendatahub
    dryRun: false
    allowDangerous: true  # Required for high danger level
Run it:
# Validate first
operator-chaos run experiments/custom-config-resilience.yaml --dry-run
# Execute
operator-chaos run experiments/custom-config-resilience.yaml \
--namespace opendatahub \
--knowledge knowledge/*.yaml \
--verbose
Using Existing Experiments as Templates¶
The experiments/ directory contains production-tested experiments for all operators. Copy and modify them for your use case:
# Find experiments for your component
ls experiments/kserve/
# Copy and customize
cp experiments/kserve/kserve-podkill.yaml \
experiments/custom/my-kserve-test.yaml
# Edit parameters, labels, steady-state checks
vi experiments/custom/my-kserve-test.yaml
Path B: Go Plugin (New Injection Types)¶
Use this path only when you need to inject a fault that's not expressible through the 11 built-in types.
When to Write a New Injector¶
Use existing types when:
- You can express the fault through PodKill, ConfigDrift, NetworkPartition, etc.
- The fault is a composition of multiple existing types
Write a new injector when:
- The fault requires custom API interactions not covered by existing types
- You're injecting a completely new category of failure (e.g., storage disruption, or DNS faults that cannot be expressed as a ConfigMap change)
- You need specialized rollback logic that existing types don't provide
Examples:
| Fault | Use Existing Type | Or Write New Injector? |
|---|---|---|
| Kill pod with SIGTERM instead of SIGKILL | Use PodKill with signal: "SIGTERM" | No new code needed |
| Corrupt Secret value | Use ConfigDrift with resourceType: Secret | No new code needed |
| Block traffic to API server | Use NetworkPartition | No new code needed |
| Inject disk I/O errors on PVCs | Not covered by existing types | Yes, write new injector |
| Corrupt DNS entries in CoreDNS ConfigMap | Use ConfigDrift targeting CoreDNS | No new code needed |
| Simulate OOM kills | Not covered (needs cgroup manipulation) | Yes, write new injector |
The Injector Interface¶
All injectors implement this interface:
type Injector interface {
	// Validate checks that parameters are correct before injection
	Validate(spec v1alpha1.InjectionSpec, blast v1alpha1.BlastRadiusSpec) error

	// Inject performs the fault and returns cleanup function + events
	Inject(ctx context.Context, spec v1alpha1.InjectionSpec, namespace string) (CleanupFunc, []v1alpha1.InjectionEvent, error)

	// Revert restores the system to pre-injection state (crash-safe)
	Revert(ctx context.Context, spec v1alpha1.InjectionSpec, namespace string) error
}
Key requirements:
- Idempotency: Inject and Revert must be idempotent (safe to call multiple times)
- Crash safety: Revert must work even after a process crash (persist rollback data in Kubernetes)
- Blast radius enforcement: Validate blastRadius limits before modifying resources
- Event emission: Return structured events describing what was changed
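As a rough illustration of these requirements, the Go sketch below implements the interface shape with a toy injector. Everything in it is a simplified assumption: the spec and event types are local stand-ins for the v1alpha1 types, `labelInjector` and the `LabelDisrupt` type name are hypothetical, and an in-memory map stands in for the cluster. A real injector would persist its rollback data in Kubernetes so that Revert still works after a crash, which this sketch does not attempt.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Local stand-ins for the framework's v1alpha1 types (simplified).
type InjectionSpec struct {
	Type       string
	Parameters map[string]string
}
type BlastRadiusSpec struct{ AllowedNamespaces []string }
type InjectionEvent struct{ Action, Target string }
type CleanupFunc func(ctx context.Context) error

// Injector repeats the interface above using the stand-in types.
type Injector interface {
	Validate(spec InjectionSpec, blast BlastRadiusSpec) error
	Inject(ctx context.Context, spec InjectionSpec, namespace string) (CleanupFunc, []InjectionEvent, error)
	Revert(ctx context.Context, spec InjectionSpec, namespace string) error
}

// labelInjector "injects" by marking a resource in an in-memory store.
type labelInjector struct {
	labels map[string]string // key: namespace/name
}

var _ Injector = (*labelInjector)(nil) // compile-time interface check

func (l *labelInjector) Validate(spec InjectionSpec, blast BlastRadiusSpec) error {
	if spec.Parameters["name"] == "" {
		return errors.New(`parameter "name" is required`)
	}
	if len(blast.AllowedNamespaces) == 0 {
		return errors.New("blastRadius.allowedNamespaces must not be empty")
	}
	return nil
}

func (l *labelInjector) Inject(ctx context.Context, spec InjectionSpec, namespace string) (CleanupFunc, []InjectionEvent, error) {
	key := namespace + "/" + spec.Parameters["name"]
	l.labels[key] = "injected" // idempotent: repeating the write changes nothing
	cleanup := func(ctx context.Context) error { return l.Revert(ctx, spec, namespace) }
	return cleanup, []InjectionEvent{{Action: "label-added", Target: key}}, nil
}

func (l *labelInjector) Revert(ctx context.Context, spec InjectionSpec, namespace string) error {
	delete(l.labels, namespace+"/"+spec.Parameters["name"]) // no-op if already removed
	return nil
}

func main() {
	inj := &labelInjector{labels: map[string]string{}}
	spec := InjectionSpec{Type: "LabelDisrupt", Parameters: map[string]string{"name": "my-controller"}}
	blast := BlastRadiusSpec{AllowedNamespaces: []string{"opendatahub"}}

	if err := inj.Validate(spec, blast); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	ctx := context.Background()
	cleanup, events, err := inj.Inject(ctx, spec, "opendatahub")
	fmt.Println("events:", events, "error:", err)

	_ = cleanup(ctx)
	_ = inj.Revert(ctx, spec, "opendatahub") // a second revert is harmless
	fmt.Println("remaining injected resources:", len(inj.labels))
}
```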
Full Development Guide¶
The complete guide to implementing a new injection type covers:
- Step-by-step implementation walkthrough
- Code examples (ResourceQuotaDisrupt injector)
- Testing patterns
- Rollback data persistence
- CRD updates and registration
- Best practices and common patterns
See: Adding Failure Modes
Best Practices¶
1. Start with PodKill¶
PodKill is rated low danger and gives fast feedback, which makes it a good first experiment:
# First experiment for any new component
operator-chaos run experiments/my-component-podkill.yaml --dry-run
operator-chaos run experiments/my-component-podkill.yaml --namespace opendatahub
If PodKill passes, move on to more disruptive types (ConfigDrift, NetworkPartition, etc.).
2. Always Use Dry-Run First¶
Dry-run mode validates experiments without executing them:
# Validates YAML syntax, injection parameters, blast radius
operator-chaos run my-experiment.yaml --dry-run
# For suites
operator-chaos suite experiments/dashboard/ --dry-run
Dry-run checks:
- YAML syntax
- Required parameters present
- Blast radius within limits
- Knowledge model exists for target operator
- Steady-state checks are well-formed
3. Set Appropriate Blast Radius Limits¶
Always constrain the blast radius to prevent runaway failures:
blastRadius:
  maxPodsAffected: 1      # Limit pod deletions
  allowedNamespaces:
    - opendatahub         # Only this namespace
  dryRun: false
  allowDangerous: false   # Require explicit opt-in for high-danger
For high-danger injections (ConfigDrift, WebhookDisrupt), you must set allowDangerous: true:
injection:
  type: ConfigDrift
  # ... parameters ...
blastRadius:
  allowDangerous: true  # Required for high-danger types
  allowedNamespaces:
    - opendatahub
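The limits described in this section amount to three independent checks: namespace, pod count, and danger opt-in. The Go sketch below shows how such a gate could look. The `BlastRadius` struct and `checkBlastRadius` function are hypothetical stand-ins, and the error messages are illustrative rather than the framework's actual output.

```go
package main

import (
	"errors"
	"fmt"
)

// BlastRadius mirrors the blastRadius fields used in this guide
// (simplified stand-in, not the framework's type).
type BlastRadius struct {
	MaxPodsAffected   int
	AllowedNamespaces []string
	AllowDangerous    bool
}

// checkBlastRadius returns an error describing the first violated limit,
// or nil when the planned injection stays within all of them.
func checkBlastRadius(b BlastRadius, namespace string, podsAffected int, highDanger bool) error {
	allowed := false
	for _, ns := range b.AllowedNamespaces {
		if ns == namespace {
			allowed = true
			break
		}
	}
	if !allowed {
		return fmt.Errorf("namespace %q is not in allowedNamespaces", namespace)
	}
	if podsAffected > b.MaxPodsAffected {
		return fmt.Errorf("%d pods affected exceeds maxPodsAffected=%d", podsAffected, b.MaxPodsAffected)
	}
	if highDanger && !b.AllowDangerous {
		return errors.New("high-danger injection requires allowDangerous: true")
	}
	return nil
}

func main() {
	limits := BlastRadius{MaxPodsAffected: 1, AllowedNamespaces: []string{"opendatahub"}}

	fmt.Println(checkBlastRadius(limits, "opendatahub", 1, false)) // within limits
	fmt.Println(checkBlastRadius(limits, "opendatahub", 1, true))  // needs opt-in
	fmt.Println(checkBlastRadius(limits, "default", 1, false))     // wrong namespace
}
```

Running the gate before any resource is modified means a rejected experiment leaves the cluster untouched.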
4. Test on Non-Production Clusters¶
Never run chaos experiments on production clusters unless you have:
- Tested the experiment on dev/staging first
- Reviewed the blast radius with your team
- Scheduled the test during a maintenance window
- Prepared rollback procedures
Use dedicated test clusters for experiment development.
5. Document Your Experiments¶
Add metadata to help others understand your experiments:
metadata:
  name: descriptive-experiment-name
  labels:
    component: kserve
    severity: high
    test-type: config-validation
    owner: platform-team
    jira-ticket: RHOAIENG-12345
  annotations:
    description: |
      Tests KServe controller's handling of invalid configuration.
      Expected behavior: controller logs errors but does not crash.
    runbook: "https://docs.internal/chaos/kserve-config-tests"
Troubleshooting¶
Experiment Fails to Load¶
Symptom: Error: failed to load experiment: ...
Causes:
- YAML syntax errors (indentation, missing quotes)
- Invalid injection type
- Missing required parameters
Fix:
# Validate YAML syntax
yamllint my-experiment.yaml
# Check experiment structure
operator-chaos run my-experiment.yaml --dry-run --verbose
Steady-State Check Fails Before Injection¶
Symptom: Error: steady-state check failed before injection
Causes:
- Component is not healthy to begin with
- Knowledge model specifies wrong resource names/namespaces
- Steady-state check timeout too short
Fix:
# Verify component is healthy
kubectl get deployment -n opendatahub
kubectl describe deployment my-controller -n opendatahub
# Check knowledge model matches reality
cat knowledge/my-operator.yaml
# Increase steady-state timeout
spec:
  steadyState:
    timeout: "60s"  # Increase from 30s
Cleanup Doesn't Complete¶
Symptom: Resources still have chaos annotations/labels after experiment
Causes:
- Experiment interrupted (Ctrl+C during injection)
- Controller crashed during cleanup
- TTL expired but cleanup job didn't run
Fix:
# Manual cleanup
operator-chaos clean --namespace opendatahub
# Check for resources with chaos metadata
kubectl get all -n opendatahub \
-l "chaos.operatorchaos.io/injected=true"
# Remove chaos annotations manually (last resort)
kubectl annotate deployment my-controller \
chaos.operatorchaos.io/rollback-data- \
-n opendatahub
Permission Denied Errors¶
Symptom: Error: ... is forbidden: User "..." cannot ...
Causes:
- Insufficient RBAC permissions for operator-chaos CLI
- ServiceAccount missing required roles
Fix:
# Verify your permissions
kubectl auth can-i delete pods -n opendatahub
kubectl auth can-i update configmaps -n opendatahub
# Check operator-chaos ServiceAccount (if running in-cluster)
kubectl get clusterrole operator-chaos-role -o yaml
kubectl get clusterrolebinding operator-chaos-binding -o yaml
Blast Radius Violations¶
Symptom: Error: blast radius check failed: ...
Causes:
- Injection would affect more pods than maxPodsAffected
- Target namespace not in allowedNamespaces
- High-danger injection without allowDangerous: true
Fix:
# Adjust blast radius
blastRadius:
  maxPodsAffected: 5      # Increase limit
  allowedNamespaces:
    - opendatahub
    - test-namespace      # Add namespace
  allowDangerous: true    # Enable dangerous injections
Next Steps¶
- Browse built-in failure modes for examples
- Explore experiments directory for production-tested experiments
- Read Adding Failure Modes to write Go plugins
- Check Architecture: Injection Engine to understand internals