# Operator Chaos
Chaos engineering for Kubernetes operators.
Test reconciliation semantics, not just pod restarts.
## Why Operator Chaos?
Existing chaos tools (Krkn, Litmus, Chaos Mesh) test infrastructure resilience: kill a pod, verify it comes back. But Kubernetes operators manage complex resource graphs — Deployments, Services, ConfigMaps, CRDs — where the real question is:
"When something breaks, does the operator put everything back the way it should be?"
Operator Chaos answers this by testing reconciliation: verifying operators restore resources to their intended state after operator-semantic faults like CRD mutation, config drift, and RBAC revocation.
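To make "operator-semantic" concrete: a config-drift fault is simply an out-of-band edit to a resource the operator owns. Here is a minimal sketch of such an injection using a plain controller-runtime client; the namespace, ConfigMap name, and helper function are illustrative assumptions, not part of the tool:

```go
package chaosfaults

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// injectConfigDrift makes an out-of-band edit to a ConfigMap the operator
// owns. A resilient operator's next reconcile should revert it.
func injectConfigDrift(ctx context.Context, c client.Client) error {
	var cm corev1.ConfigMap
	key := types.NamespacedName{Namespace: "demo", Name: "managed-config"}
	if err := c.Get(ctx, key, &cm); err != nil {
		return err
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data["log_level"] = "drifted" // a change the operator did not make
	return c.Update(ctx, &cm)
}
```

Operator Chaos automates injecting faults like this one and then checking that the revert actually happens.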
## How It Works
```mermaid
flowchart LR
    A["Define<br/>Experiment"] --> B["Verify<br/>Baseline"]
    B --> C["Inject<br/>Fault"]
    C --> D["Observe<br/>Recovery"]
    D --> E{"Render<br/>Verdict"}
    E -->|recovered| R["Resilient"]
    E -->|partial| G["Degraded"]
    E -->|not recovered| F["Failed"]
    style A fill:#bbdefb,stroke:#1565c0
    style B fill:#ce93d8,stroke:#6a1b9a
    style C fill:#ffcc80,stroke:#e65100
    style D fill:#a5d6a7,stroke:#2e7d32
    style E fill:#b0bec5,stroke:#37474f
    style R fill:#a5d6a7,stroke:#2e7d32
    style G fill:#ffcc80,stroke:#e65100
    style F fill:#ef9a9a,stroke:#c62828
```
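The same flow expressed in code: a minimal sketch of the orchestration loop, using hypothetical `Experiment`, `Recovery`, and `Verdict` types rather than the tool's actual internals.

```go
package chaos

import (
	"context"
	"time"
)

// Verdict and Recovery are hypothetical types mirroring the flowchart.
type Verdict int

const (
	Resilient Verdict = iota
	Degraded
	Failed
	Inconclusive
)

type Recovery int

const (
	NotRecovered Recovery = iota
	PartiallyRecovered
	FullyRecovered
)

// Experiment is a hypothetical interface over the lifecycle phases.
type Experiment interface {
	Baseline(ctx context.Context) bool    // verify steady state
	Inject(ctx context.Context) error     // apply the fault
	Observe(ctx context.Context) Recovery // compare cluster state to intent
	Timeout() time.Duration
}

// Run orchestrates baseline -> inject -> observe -> verdict.
func Run(ctx context.Context, e Experiment) Verdict {
	if !e.Baseline(ctx) {
		return Inconclusive // untrusted baseline: the experiment cannot run
	}
	if err := e.Inject(ctx); err != nil {
		return Inconclusive
	}
	deadline := time.Now().Add(e.Timeout())
	for time.Now().Before(deadline) {
		if e.Observe(ctx) == FullyRecovered {
			return Resilient // everything restored within the timeout
		}
		time.Sleep(2 * time.Second) // poll interval
	}
	if e.Observe(ctx) == PartiallyRecovered {
		return Degraded // recovered, but with deviations from expected state
	}
	return Failed
}
```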
## Testing Fidelity
Operator Chaos is a test harness, not a fixed-fidelity tool. The fidelity of your chaos tests depends on the environment you point the harness at:
| Environment | Fidelity | What You Learn |
|---|---|---|
| Fake client (fuzz mode) | Unit-level | Reconciler logic handles faults correctly |
| `kind` / `minikube` | Integration | Operator recovers resources on a real API server |
| Staging OpenShift | System | Operator works with real RBAC, webhooks, network policies |
| Production-like OCP | Production | Operator handles real workloads under real constraints |
The tool itself is lightweight (single static binary, ~20MB container image). What changes is the target: same experiments, same verdicts, different confidence levels. Start with fuzz tests during development, graduate to live cluster tests for release qualification.
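To make "same experiments, different targets" concrete, here is a hedged sketch in Go: one shared assertion that can run against a fake client (unit fidelity) or a real API server reached via kubeconfig (integration fidelity and up). The resource names and helper functions are illustrative assumptions, not the tool's API.

```go
package chaostest

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"
)

// recovered is the shared assertion: after a fault, the operator-managed
// Deployment should be back at its intended replica count.
func recovered(ctx context.Context, c client.Client, want int32) (bool, error) {
	var d appsv1.Deployment
	key := types.NamespacedName{Namespace: "demo", Name: "managed-app"}
	if err := c.Get(ctx, key, &d); err != nil {
		return false, err
	}
	return d.Spec.Replicas != nil && *d.Spec.Replicas == want, nil
}

// unitTarget returns a fake client: unit-level fidelity, no cluster needed.
func unitTarget() client.Client {
	return fake.NewClientBuilder().Build()
}

// liveTarget returns a client for a real API server (kind, staging,
// production-like): same assertion, higher confidence.
func liveTarget() (client.Client, error) {
	cfg, err := ctrl.GetConfig()
	if err != nil {
		return nil, err
	}
	return client.New(cfg, client.Options{})
}
```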
## Offline vs Live Capabilities
Many operator-chaos commands work without any cluster connection:
| Command | Cluster Required? | What It Does |
|---|---|---|
| `operator-chaos validate` | No | Validates experiment and knowledge YAML syntax |
| `operator-chaos types` | No | Lists all available injection types |
| `operator-chaos init` | No | Scaffolds new experiment files |
| `operator-chaos preflight --local` | No | Validates knowledge YAML structure without a cluster |
| `operator-chaos run` | Yes | Executes experiments against a live cluster |
| `operator-chaos suite` | Yes | Runs experiment suites against a live cluster |
| `operator-chaos preflight` (no `--local`) | Yes | Checks that declared resources exist on the cluster |
| `operator-chaos clean` | Yes | Removes leftover chaos artifacts from the cluster |
This means you can validate experiments, lint knowledge models, and scaffold new tests entirely offline: in CI without a cluster, or during development before you have access to a test environment.
## Four Usage Modes
| Mode | What It Tests | Cluster? | When to Use |
|---|---|---|---|
| CLI Experiments | Full operator recovery on a live cluster | Yes | Pre-release validation, CI/CD |
| SDK Middleware | Operator behavior under API-level faults | Yes (or fake client) | Integration tests |
| Fuzz Testing | Reconciler correctness under random faults | No | Development, unit tests, CI |
| Upgrade Testing | Structural changes between operator versions | Yes | Release qualification, upgrade testing |
- **CLI Experiments**: Run structured chaos experiments against a live cluster. Orchestrates the full lifecycle: steady state, inject, observe, evaluate.
- **SDK Middleware**: Wrap a controller-runtime client with fault injection; no code changes to your reconciler are needed (see the sketch after this list).
- **Fuzz Testing**: Test reconciler correctness under random faults. No cluster needed; uses a fake client.
- **Upgrade Testing**: Auto-generate chaos experiments from version diffs. Test CRD schema changes, resource ownership shifts, and dependency mutations.
## Verdicts
Every experiment produces a structured verdict:
| Verdict | Meaning |
|---|---|
| Resilient | Operator restored all resources correctly within the timeout |
| Degraded | Operator recovered but with deviations from expected state |
| Failed | Operator did not recover within the timeout |
| Inconclusive | Baseline check failed; the experiment could not run |
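In CI, a verdict typically gates a pass/fail decision. Below is a small hypothetical helper, reusing the `Verdict` type from the lifecycle sketch above (the import path is a placeholder), that treats anything short of Resilient as a test failure:

```go
package chaos_test

import (
	"testing"

	"example.com/chaos" // hypothetical: the package from the lifecycle sketch
)

// requireResilient gates CI on the experiment outcome: Resilient passes,
// Degraded and everything else fail the test with context.
func requireResilient(t *testing.T, v chaos.Verdict) {
	t.Helper()
	switch v {
	case chaos.Resilient:
		return // all resources restored within the timeout
	case chaos.Degraded:
		t.Fatalf("operator recovered with deviations (Degraded)")
	default:
		t.Fatalf("operator did not recover: verdict %v", v)
	}
}
```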