Custom Failure Modes¶

This guide covers how to extend operator-chaos with custom failure modes. There are two paths depending on your use case:

YAML Composition (no code): Write custom experiments using existing injection types
Go Plugin Development: Add entirely new injection types to the framework

Most users will use YAML composition. Go plugins are only needed when the fault you want to inject isn't expressible through existing injection types.

Path A: YAML Composition (No Code Required)¶

This is the primary extensibility path. You can create complex, custom chaos experiments by composing the 11 built-in injection types with different parameters, targets, and steady-state checks.

Built-in Injection Types¶

The framework provides 11 injection types out of the box:

Type	What It Does	Danger Level
`PodKill`	Force-delete pods matching a label selector	Low
`NetworkPartition`	Inject NetworkPolicy to isolate pods	Medium
`ConfigDrift`	Modify ConfigMap or Secret keys	Low to High
`CRDMutation`	Modify custom resource fields	Medium to High
`WebhookDisrupt`	Corrupt webhook configurations	High
`RBACRevoke`	Remove RBAC permissions	Medium
`FinalizerBlock`	Add blocking finalizers to resources	Medium
`ClientFault`	Inject API server request failures	High
`OwnerRefOrphan`	Remove ownerReferences to test re-adoption	Medium
`QuotaExhaustion`	Create restrictive ResourceQuota	Medium
`WebhookLatency`	Deploy slow admission webhook	High

See the Failure Modes reference for full details on each type.

Experiment YAML Structure¶

Every experiment follows this structure:

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: my-custom-experiment
  labels:
    component: my-component
    severity: standard
spec:
  # What to target
  target:
    operator: opendatahub-operator
    component: my-component
    resource: Deployment/my-controller

  # What fault to inject
  injection:
    type: PodKill
    parameters:
      signal: "SIGKILL"
      labelSelector: "app=my-controller"
    count: 1
    ttl: "300s"

  # What "healthy" looks like
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: my-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"

  # What you expect to happen
  hypothesis:
    description: "Controller recovers from pod termination within 60s"
    recoveryTimeout: "60s"

  # Safety limits
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    dryRun: false

Writing Custom Steady-State Checks¶

Steady-state checks define what "healthy" means for your component. There are three types:

1. `conditionTrue` Checks¶

Verify that a Kubernetes resource has a specific condition set to True:

steadyState:
  checks:
    - type: conditionTrue
      apiVersion: apps/v1
      kind: Deployment
      name: my-controller
      namespace: opendatahub
      conditionType: Available

Common conditions:

- Deployment: Available, Progressing
- StatefulSet: Available
- Pod: Ready, ContainersReady
- Custom resources: any condition defined in the CRD
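As a rough illustration of what a conditionTrue check looks for, the sketch below scans a resource's status conditions for the named type and tests whether its status is "True". The `Condition` struct and the `conditionTrue` function are local stand-ins invented for this example, not framework types, and the choice to treat a missing condition as unhealthy is an assumption.

```go
package main

import "fmt"

// Condition mirrors the type/status pair found under .status.conditions
// on most Kubernetes resources. It is a local stand-in, not a framework type.
type Condition struct {
	Type   string
	Status string
}

// conditionTrue reports whether the named condition exists and has
// status "True". A missing condition counts as not healthy (an assumption).
func conditionTrue(conds []Condition, conditionType string) bool {
	for _, c := range conds {
		if c.Type == conditionType {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	deployment := []Condition{
		{Type: "Progressing", Status: "True"},
		{Type: "Available", Status: "False"},
	}
	fmt.Println(conditionTrue(deployment, "Available"))   // false
	fmt.Println(conditionTrue(deployment, "Progressing")) // true
	fmt.Println(conditionTrue(deployment, "Ready"))       // false: condition absent
}
```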

2. `podReady` Checks¶

Verify that pods matching a label selector are ready:

steadyState:
  checks:
    - type: podReady
      namespace: opendatahub
      labelSelector: "app=my-controller"
      minReadyPods: 1
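
The logic of a podReady check can be approximated as counting ready pods that match the selector and comparing the count with minReadyPods. The sketch below is a simplification under stated assumptions: `Pod` and `podReady` are hypothetical names, and only a single `key=value` equality selector is handled, whereas real Kubernetes label selectors are richer.

```go
package main

import (
	"fmt"
	"strings"
)

// Pod is a stand-in holding only the fields this check needs.
type Pod struct {
	Labels map[string]string
	Ready  bool
}

// podReady reports whether at least minReady pods matching a single
// "key=value" selector are ready. Set-based selectors are not handled.
func podReady(pods []Pod, selector string, minReady int) bool {
	key, value, ok := strings.Cut(selector, "=")
	if !ok {
		return false // not a simple equality selector
	}
	ready := 0
	for _, p := range pods {
		if p.Labels[key] == value && p.Ready {
			ready++
		}
	}
	return ready >= minReady
}

func main() {
	pods := []Pod{
		{Labels: map[string]string{"app": "my-controller"}, Ready: true},
		{Labels: map[string]string{"app": "my-controller"}, Ready: false},
		{Labels: map[string]string{"app": "other"}, Ready: true},
	}
	fmt.Println(podReady(pods, "app=my-controller", 1)) // true
	fmt.Println(podReady(pods, "app=my-controller", 2)) // false
}
```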

3. `customCommand` Checks¶

Run arbitrary commands to verify state:

steadyState:
  checks:
    - type: customCommand
      command: "kubectl get inferenceservice -n test-ns my-model -o jsonpath='{.status.url}'"
      expectedOutput: "http://my-model.test-ns.svc.cluster.local"

Use custom commands sparingly. Prefer conditionTrue and podReady when possible, as they're more reliable and don't depend on shell availability.

Parameterizing for Different Environments¶

The framework works with any Kubernetes operator deployment, but deployments can differ in namespaces, labels, and resource names. For example, ODH and RHOAI use different defaults:

Namespace Differences¶

ODH: opendatahub
RHOAI: redhat-ods-applications (for most components)

Using Knowledge Overlays¶

The knowledge/ directory contains topology models for each operator. There are overlay directories for environment-specific variations:

knowledge/
  dashboard.yaml                   # ODH defaults
  rhoai/
    dashboard.yaml                 # RHOAI-specific overrides
  odh/v2.10/
    dashboard.yaml                 # ODH 2.10-specific overrides
  rhoai/v3.3/
    dashboard.yaml                 # RHOAI 3.3-specific overrides

When you pass --knowledge to the CLI more than once, later paths override earlier ones:

# ODH 2.10
operator-chaos run my-experiment.yaml \
  --knowledge knowledge/*.yaml \
  --knowledge knowledge/odh/v2.10/*.yaml

# RHOAI 3.3
operator-chaos run my-experiment.yaml \
  --knowledge knowledge/*.yaml \
  --knowledge knowledge/rhoai/*.yaml \
  --knowledge knowledge/rhoai/v3.3/*.yaml
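
The "later paths override earlier ones" rule can be pictured as a layered merge. The sketch below is only an illustration: it assumes a flat key/value model, and the keys `namespace`, `partOf`, and `deployment` are hypothetical. The framework's actual knowledge schema and merge depth may differ.

```go
package main

import "fmt"

// mergeOverlays applies each layer in order onto a fresh map, so a key set
// by a later layer replaces the value from an earlier one.
func mergeOverlays(layers ...map[string]string) map[string]string {
	merged := make(map[string]string)
	for _, layer := range layers {
		for k, v := range layer {
			merged[k] = v
		}
	}
	return merged
}

func main() {
	// Hypothetical flattened views of three knowledge files.
	base := map[string]string{"namespace": "opendatahub", "partOf": "odh-dashboard"}
	rhoai := map[string]string{"namespace": "redhat-ods-applications", "partOf": "rhods-dashboard"}
	rhoai33 := map[string]string{"deployment": "rhods-dashboard"}

	merged := mergeOverlays(base, rhoai, rhoai33)
	fmt.Println(merged["namespace"], merged["partOf"], merged["deployment"])
}
```

Keys that a later layer does not mention keep the value from the earlier layer, which is why the base files are listed first on the command line.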

Label Differences¶

Some components use different labels in RHOAI:

# ODH
labelSelector: "app.kubernetes.io/part-of=odh-dashboard"

# RHOAI
labelSelector: "app.kubernetes.io/part-of=rhods-dashboard"

Check the knowledge YAML for your target component to see the correct labels.

Chaining Experiments in Suites¶

The operator-chaos suite command runs multiple experiments sequentially or in parallel:

# Run all experiments in a directory
operator-chaos suite experiments/dashboard/ --namespace opendatahub

# Dry-run to validate without executing
operator-chaos suite experiments/dashboard/ --dry-run

# Run in parallel (max 4 concurrent)
operator-chaos suite experiments/dashboard/ --parallel 4

# Generate JUnit report
operator-chaos suite experiments/dashboard/ \
  --report-dir reports/ \
  --namespace opendatahub

Custom suite directories for targeted testing:

experiments/
  dashboard/
    podkill.yaml
    config-drift.yaml
    network-partition.yaml
  kserve/
    webhook-disrupt.yaml
    crd-mutation.yaml
  smoke-tests/           # Custom suite for CI
    critical-path.yaml
    basic-recovery.yaml

Suite-level features:

- Sequential execution (guarantees order)
- Parallel execution (max concurrency limit)
- Aggregate reporting (JUnit XML, JSON)
- Early termination on first failure
- Timeout per experiment
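
To make the sequential and parallel modes concrete, here is a minimal sketch of both scheduling strategies. It is not the suite command's implementation: `Experiment`, `runSequential`, `runParallel`, and `errRecovery` are hypothetical names, and each experiment is reduced to a function that returns an error.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errRecovery is a sample failure used by the demo below.
var errRecovery = errors.New("recovery timed out")

// Experiment is a hypothetical stand-in: a name plus a function that runs it.
type Experiment struct {
	Name string
	Run  func() error
}

// runSequential runs experiments in order and stops at the first failure.
// It returns the names that were started and the first error, if any.
func runSequential(exps []Experiment) ([]string, error) {
	var ran []string
	for _, e := range exps {
		ran = append(ran, e.Name)
		if err := e.Run(); err != nil {
			return ran, fmt.Errorf("%s: %w", e.Name, err)
		}
	}
	return ran, nil
}

// runParallel runs every experiment with at most limit in flight at once
// and returns how many failed.
func runParallel(exps []Experiment, limit int) int {
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	var mu sync.Mutex
	failed := 0
	for _, e := range exps {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit experiments are running
		go func(e Experiment) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := e.Run(); err != nil {
				mu.Lock()
				failed++
				mu.Unlock()
			}
		}(e)
	}
	wg.Wait()
	return failed
}

func main() {
	ok := func() error { return nil }
	bad := func() error { return errRecovery }
	exps := []Experiment{
		{Name: "podkill", Run: ok},
		{Name: "config-drift", Run: bad},
		{Name: "network-partition", Run: ok},
	}
	ran, err := runSequential(exps)
	fmt.Println("sequential started:", ran, "error:", err)
	fmt.Println("parallel failures:", runParallel(exps, 4))
}
```

In the sequential run the third experiment never starts, which is the trade-off between guaranteed ordering with early termination and the shorter wall-clock time of the parallel mode.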

Example: Complete Custom Experiment¶

Here's a full example testing ConfigMap corruption handling:

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: custom-config-resilience
  labels:
    component: kserve
    severity: high
    test-type: config-validation
spec:
  target:
    operator: kserve
    component: kserve-controller-manager
    resource: ConfigMap/inferenceservice-config

  injection:
    type: ConfigDrift
    parameters:
      resourceType: ConfigMap
      name: inferenceservice-config
      key: deploy
      value: '{"defaultDeploymentMode": "InvalidMode"}'
    ttl: "300s"

  steadyState:
    checks:
      # Verify controller stays healthy
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kserve-controller-manager
        namespace: opendatahub
        conditionType: Available
      # Verify pods don't crash
      - type: podReady
        namespace: opendatahub
        labelSelector: "control-plane=kserve-controller-manager"
        minReadyPods: 1
    timeout: "30s"

  hypothesis:
    description: >
      KServe controller should detect invalid config and either:
      (a) reconcile back to valid state, or
      (b) log validation errors without crashing.
      New InferenceService creations should fail with clear error messages.
    recoveryTimeout: "60s"

  blastRadius:
    maxPodsAffected: 5
    allowedNamespaces:
      - opendatahub
    dryRun: false
    allowDangerous: true  # Required for high danger level

Run it:

# Validate first
operator-chaos run experiments/custom-config-resilience.yaml --dry-run

# Execute
operator-chaos run experiments/custom-config-resilience.yaml \
  --namespace opendatahub \
  --knowledge knowledge/*.yaml \
  --verbose

Using Existing Experiments as Templates¶

The experiments/ directory contains production-tested experiments for all operators. Copy and modify them for your use case:

# Find experiments for your component
ls experiments/kserve/

# Copy and customize
cp experiments/kserve/kserve-podkill.yaml \
   experiments/custom/my-kserve-test.yaml

# Edit parameters, labels, steady-state checks
vi experiments/custom/my-kserve-test.yaml

Path B: Go Plugin (New Injection Types)¶

Use this path only when you need to inject a fault that's not expressible through the 11 built-in types.

When to Write a New Injector¶

Use existing types when:

- You can express the fault through PodKill, ConfigDrift, NetworkPartition, etc.
- The fault is a composition of multiple existing types

Write a new injector when:

- The fault requires custom API interactions not covered by existing types
- You're injecting a completely new category of failure (e.g., storage disruption, OOM kills)
- You need specialized rollback logic that existing types don't provide

Examples:

Fault	Existing Type	New Injector Needed?
Kill pod with SIGTERM instead of SIGKILL	Use `PodKill` with `signal: "SIGTERM"`	No new code needed
Corrupt Secret value	Use `ConfigDrift` with `resourceType: Secret`	No new code needed
Block traffic to API server	Use `NetworkPartition`	No new code needed
Inject disk I/O errors on PVCs	Not covered by existing types	Yes, write new injector
Corrupt DNS entries in CoreDNS ConfigMap	Use `ConfigDrift` targeting CoreDNS	No new code needed
Simulate OOM kills	Not covered (needs cgroup manipulation)	Yes, write new injector

The Injector Interface¶

All injectors implement this interface:

type Injector interface {
    // Validate checks that parameters are correct before injection
    Validate(spec v1alpha1.InjectionSpec, blast v1alpha1.BlastRadiusSpec) error

    // Inject performs the fault and returns cleanup function + events
    Inject(ctx context.Context, spec v1alpha1.InjectionSpec, namespace string) (CleanupFunc, []v1alpha1.InjectionEvent, error)

    // Revert restores the system to pre-injection state (crash-safe)
    Revert(ctx context.Context, spec v1alpha1.InjectionSpec, namespace string) error
}

Key requirements:

Idempotency: Inject and Revert must be idempotent (safe to call multiple times)
Crash safety: Revert must work even after a process crash (persist rollback data in Kubernetes)
Blast radius enforcement: Validate blastRadius limits before modifying resources
Event emission: Return structured events describing what was changed
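
The requirements above can be illustrated with a toy injector. This is a sketch under heavy assumptions, not framework code: the `InjectionSpec`, `BlastRadiusSpec`, `InjectionEvent`, and `CleanupFunc` types are simplified local stand-ins for the real `v1alpha1` definitions, `KeyOverwrite` and `fakeConfigMap` are hypothetical, and an in-memory object replaces the cluster. The annotation key matches the one used in the Troubleshooting section, but storing the raw original value in it is an assumed format.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Simplified stand-ins for the v1alpha1 types (assumptions for this sketch).
type InjectionSpec struct {
	Type       string
	Parameters map[string]string
}
type BlastRadiusSpec struct{ AllowedNamespaces []string }
type InjectionEvent struct{ Action, Target string }
type CleanupFunc func(ctx context.Context) error

type Injector interface {
	Validate(spec InjectionSpec, blast BlastRadiusSpec) error
	Inject(ctx context.Context, spec InjectionSpec, namespace string) (CleanupFunc, []InjectionEvent, error)
	Revert(ctx context.Context, spec InjectionSpec, namespace string) error
}

// fakeConfigMap stands in for a cluster object so the sketch runs offline.
type fakeConfigMap struct {
	Data        map[string]string
	Annotations map[string]string
}

const rollbackKey = "chaos.operatorchaos.io/rollback-data"

// KeyOverwrite is a hypothetical injector that overwrites one key.
type KeyOverwrite struct{ cm *fakeConfigMap }

var _ Injector = (*KeyOverwrite)(nil) // compile-time interface check

// Validate rejects bad input before anything is modified.
func (k *KeyOverwrite) Validate(spec InjectionSpec, blast BlastRadiusSpec) error {
	if spec.Parameters["key"] == "" {
		return errors.New(`parameter "key" is required`)
	}
	if len(blast.AllowedNamespaces) == 0 {
		return errors.New("allowedNamespaces must not be empty")
	}
	return nil
}

// Inject saves the original value on the object itself, then overwrites it.
func (k *KeyOverwrite) Inject(ctx context.Context, spec InjectionSpec, namespace string) (CleanupFunc, []InjectionEvent, error) {
	key := spec.Parameters["key"]
	// Save only once, so a repeated Inject cannot record the corrupted value.
	if _, saved := k.cm.Annotations[rollbackKey]; !saved {
		k.cm.Annotations[rollbackKey] = k.cm.Data[key]
	}
	k.cm.Data[key] = spec.Parameters["value"]
	cleanup := func(ctx context.Context) error { return k.Revert(ctx, spec, namespace) }
	return cleanup, []InjectionEvent{{Action: "overwrite", Target: namespace + "/" + key}}, nil
}

// Revert relies only on data stored with the object, so it still works
// after a restart. With no rollback data present it is a no-op.
func (k *KeyOverwrite) Revert(ctx context.Context, spec InjectionSpec, namespace string) error {
	original, saved := k.cm.Annotations[rollbackKey]
	if !saved {
		return nil
	}
	k.cm.Data[spec.Parameters["key"]] = original
	delete(k.cm.Annotations, rollbackKey)
	return nil
}

func main() {
	cm := &fakeConfigMap{Data: map[string]string{"deploy": "valid"}, Annotations: map[string]string{}}
	inj := &KeyOverwrite{cm: cm}
	spec := InjectionSpec{Type: "KeyOverwrite", Parameters: map[string]string{"key": "deploy", "value": "broken"}}
	if err := inj.Validate(spec, BlastRadiusSpec{AllowedNamespaces: []string{"opendatahub"}}); err != nil {
		fmt.Println("validate:", err)
		return
	}
	ctx := context.Background()
	cleanup, events, _ := inj.Inject(ctx, spec, "opendatahub")
	_, _, _ = inj.Inject(ctx, spec, "opendatahub") // repeat call is harmless
	fmt.Println("after inject:", cm.Data["deploy"], events)
	_ = cleanup(ctx)
	_ = inj.Revert(ctx, spec, "opendatahub") // second revert is a no-op
	fmt.Println("after revert:", cm.Data["deploy"])
}
```

Keeping the rollback data on the mutated object, rather than in process memory, is what lets Revert run from a fresh process. The sketch does not distinguish a key that was absent from one that was empty, which a real injector would need to handle.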

Full Development Guide¶

The complete guide to implementing a new injection type covers:

Step-by-step implementation walkthrough
Code examples (ResourceQuotaDisrupt injector)
Testing patterns
Rollback data persistence
CRD updates and registration
Best practices and common patterns

See: Adding Failure Modes

Best Practices¶

1. Start with PodKill¶

PodKill is the safest injection type and provides the fastest feedback loop:

# First experiment for any new component
operator-chaos run experiments/my-component-podkill.yaml --dry-run
operator-chaos run experiments/my-component-podkill.yaml --namespace opendatahub

If PodKill passes, move on to more disruptive types (ConfigDrift, NetworkPartition, etc.).

2. Always Use Dry-Run First¶

Dry-run mode validates experiments without executing them:

# Validates YAML syntax, injection parameters, blast radius
operator-chaos run my-experiment.yaml --dry-run

# For suites
operator-chaos suite experiments/dashboard/ --dry-run

Dry-run checks:

- YAML syntax
- Required parameters present
- Blast radius within limits
- Knowledge model exists for target operator
- Steady-state checks are well-formed

3. Set Appropriate Blast Radius Limits¶

Always constrain the blast radius to prevent runaway failures:

blastRadius:
  maxPodsAffected: 1              # Limit pod deletions
  allowedNamespaces:
    - opendatahub                 # Only this namespace
  dryRun: false
  allowDangerous: false           # Require explicit opt-in for high-danger

For high-danger injections (ConfigDrift, WebhookDisrupt), you must set allowDangerous: true:

injection:
  type: ConfigDrift
  # ... parameters ...

blastRadius:
  allowDangerous: true            # Required for high-danger types
  allowedNamespaces:
    - opendatahub
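
The three limits interact roughly as in the sketch below, which rejects a planned injection on the first limit it would break. The `BlastRadius` struct mirrors the YAML fields shown above, but `checkBlastRadius`, its parameters, and the error wording are assumptions made for illustration. The framework's own validation may check more or report differently.

```go
package main

import (
	"errors"
	"fmt"
)

// BlastRadius mirrors the blastRadius fields shown in the YAML above.
type BlastRadius struct {
	MaxPodsAffected   int
	AllowedNamespaces []string
	AllowDangerous    bool
}

// checkBlastRadius returns an error for the first limit that the planned
// injection would break, or nil if the injection is within bounds.
func checkBlastRadius(br BlastRadius, namespace string, podsAffected int, highDanger bool) error {
	allowed := false
	for _, ns := range br.AllowedNamespaces {
		if ns == namespace {
			allowed = true
			break
		}
	}
	if !allowed {
		return fmt.Errorf("namespace %q is not in allowedNamespaces", namespace)
	}
	if podsAffected > br.MaxPodsAffected {
		return fmt.Errorf("%d pods affected exceeds maxPodsAffected=%d", podsAffected, br.MaxPodsAffected)
	}
	if highDanger && !br.AllowDangerous {
		return errors.New("high-danger injection requires allowDangerous: true")
	}
	return nil
}

func main() {
	br := BlastRadius{MaxPodsAffected: 1, AllowedNamespaces: []string{"opendatahub"}}
	fmt.Println(checkBlastRadius(br, "opendatahub", 1, false))
	fmt.Println(checkBlastRadius(br, "default", 1, false))
	fmt.Println(checkBlastRadius(br, "opendatahub", 3, false))
	fmt.Println(checkBlastRadius(br, "opendatahub", 1, true))
}
```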

4. Test on Non-Production Clusters¶

Never run chaos experiments on production clusters unless you have:

- Tested the experiment on dev/staging first
- Reviewed the blast radius with your team
- Scheduled the test during a maintenance window
- Prepared rollback procedures

Use dedicated test clusters for experiment development.

5. Document Your Experiments¶

Add metadata to help others understand your experiments:

metadata:
  name: descriptive-experiment-name
  labels:
    component: kserve
    severity: high
    test-type: config-validation
    owner: platform-team
    jira-ticket: RHOAIENG-12345
  annotations:
    description: |
      Tests KServe controller's handling of invalid configuration.
      Expected behavior: controller logs errors but does not crash.
    runbook: "https://docs.internal/chaos/kserve-config-tests"

Troubleshooting¶

Experiment Fails to Load¶

Symptom: Error: failed to load experiment: ...

Causes:

- YAML syntax errors (indentation, missing quotes)
- Invalid injection type
- Missing required parameters

Fix:

# Validate YAML syntax
yamllint my-experiment.yaml

# Check experiment structure
operator-chaos run my-experiment.yaml --dry-run --verbose

Steady-State Check Fails Before Injection¶

Symptom: Error: steady-state check failed before injection

Causes:

- Component is not healthy to begin with
- Knowledge model specifies wrong resource names/namespaces
- Steady-state check timeout too short

Fix:

# Verify component is healthy
kubectl get deployment -n opendatahub
kubectl describe deployment my-controller -n opendatahub

# Check knowledge model matches reality
cat knowledge/my-operator.yaml

# Increase steady-state timeout (edit the experiment YAML)
spec:
  steadyState:
    timeout: "60s"  # Increase from 30s

Cleanup Doesn't Complete¶

Symptom: Resources still have chaos annotations/labels after experiment

Causes:

- Experiment interrupted (Ctrl+C during injection)
- Controller crashed during cleanup
- TTL expired but cleanup job didn't run

Fix:

# Manual cleanup
operator-chaos clean --namespace opendatahub

# Check for resources with chaos metadata
kubectl get all -n opendatahub \
  -l "chaos.operatorchaos.io/injected=true"

# Remove chaos annotations manually (last resort)
kubectl annotate deployment my-controller \
  chaos.operatorchaos.io/rollback-data- \
  -n opendatahub

Permission Denied Errors¶

Symptom: Error: ... is forbidden: User "..." cannot ...

Causes:

- Insufficient RBAC permissions for operator-chaos CLI
- ServiceAccount missing required roles

Fix:

# Verify your permissions
kubectl auth can-i delete pods -n opendatahub
kubectl auth can-i update configmaps -n opendatahub

# Check operator-chaos ServiceAccount (if running in-cluster)
kubectl get clusterrole operator-chaos-role -o yaml
kubectl get clusterrolebinding operator-chaos-binding -o yaml

Blast Radius Violations¶

Symptom: Error: blast radius check failed: ...

Causes:

- Injection would affect more pods than maxPodsAffected
- Target namespace not in allowedNamespaces
- High-danger injection without allowDangerous: true

Fix:

# Adjust blast radius
blastRadius:
  maxPodsAffected: 5              # Increase limit
  allowedNamespaces:
    - opendatahub
    - test-namespace              # Add namespace
  allowDangerous: true            # Enable dangerous injections

Next Steps¶

Browse built-in failure modes for examples
Explore experiments directory for production-tested experiments
Read Adding Failure Modes to write Go plugins
Check Architecture: Injection Engine to understand internals
