
# Operator Chaos

Chaos engineering for Kubernetes operators.
Test reconciliation semantics, not just pod restarts.


## Why Operator Chaos?

Existing chaos tools (Krkn, Litmus, Chaos Mesh) test infrastructure resilience: kill a pod, verify it comes back. But Kubernetes operators manage complex resource graphs — Deployments, Services, ConfigMaps, CRDs — where the real question is:

"When something breaks, does the operator put everything back the way it should be?"

Operator Chaos answers this by testing reconciliation: verifying operators restore resources to their intended state after operator-semantic faults like CRD mutation, config drift, and RBAC revocation.
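To make that concrete, the sketch below expresses a single config-drift check directly against controller-runtime's fake client. `AppReconciler` is a hypothetical stub standing in for a real operator's reconciler; Operator Chaos experiments automate this same inject-and-observe pattern rather than requiring you to hand-write it.

```go
package myoperator

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// AppReconciler is a stub standing in for your operator: it recreates or
// rewrites its managed ConfigMap so the cluster matches the desired state.
type AppReconciler struct{ client.Client }

func (r *AppReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	desired := corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Namespace: req.Namespace, Name: req.Name},
		Data:       map[string]string{"log-level": "info"},
	}
	var cur corev1.ConfigMap
	if err := r.Get(ctx, req.NamespacedName, &cur); apierrors.IsNotFound(err) {
		return reconcile.Result{}, r.Create(ctx, &desired)
	} else if err != nil {
		return reconcile.Result{}, err
	}
	cur.Data = desired.Data
	return reconcile.Result{}, r.Update(ctx, &cur)
}

// A unit-level chaos test: inject config drift, reconcile once, and assert
// the operator restored the intended state.
func TestOperatorRevertsConfigDrift(t *testing.T) {
	ctx := context.Background()
	key := types.NamespacedName{Namespace: "app", Name: "app-config"}

	// Baseline: the desired state, as the operator would have created it.
	c := fake.NewClientBuilder().WithObjects(&corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Namespace: key.Namespace, Name: key.Name},
		Data:       map[string]string{"log-level": "info"},
	}).Build()
	r := &AppReconciler{Client: c}

	// Inject: drift the ConfigMap away from the desired state.
	var drifted corev1.ConfigMap
	if err := c.Get(ctx, key, &drifted); err != nil {
		t.Fatal(err)
	}
	drifted.Data["log-level"] = "debug"
	if err := c.Update(ctx, &drifted); err != nil {
		t.Fatal(err)
	}

	// Observe: one reconcile pass should undo the drift.
	if _, err := r.Reconcile(ctx, reconcile.Request{NamespacedName: key}); err != nil {
		t.Fatal(err)
	}
	var got corev1.ConfigMap
	if err := c.Get(ctx, key, &got); err != nil {
		t.Fatal(err)
	}
	if got.Data["log-level"] != "info" {
		t.Fatalf("drift not reverted: log-level = %q", got.Data["log-level"])
	}
}
```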

## How It Works

```mermaid
flowchart LR
    A["Define<br/>Experiment"] --> B["Verify<br/>Baseline"]
    B --> C["Inject<br/>Fault"]
    C --> D["Observe<br/>Recovery"]
    D --> E{"Render<br/>Verdict"}
    E -->|recovered| R["Resilient"]
    E -->|partial| G["Degraded"]
    E -->|not recovered| F["Failed"]

    style A fill:#bbdefb,stroke:#1565c0
    style B fill:#ce93d8,stroke:#6a1b9a
    style C fill:#ffcc80,stroke:#e65100
    style D fill:#a5d6a7,stroke:#2e7d32
    style E fill:#b0bec5,stroke:#37474f
    style R fill:#a5d6a7,stroke:#2e7d32
    style G fill:#ffcc80,stroke:#e65100
    style F fill:#ef9a9a,stroke:#c62828
```

## Testing Fidelity

Operator Chaos is a test harness, not a fixed-fidelity tool. The fidelity of your chaos tests depends on the environment you point it at:

| Environment | Fidelity | What You Learn |
|-------------|----------|----------------|
| Fake client (fuzz mode) | Unit-level | Reconciler logic handles faults correctly |
| kind / minikube | Integration | Operator recovers resources on a real API server |
| Staging OpenShift | System | Operator works with real RBAC, webhooks, network policies |
| Production-like OCP | Production | Operator handles real workloads under real constraints |

The tool itself is lightweight (a single static binary, ~20MB container image). What changes is the target: same experiments, same verdicts, different confidence levels. Start with fuzz tests during development, then graduate to live-cluster tests for release qualification.
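As an illustration of the fuzz end of that spectrum, the sketch below randomizes faults over the same fake-client setup, reusing the hypothetical `AppReconciler` stub from the earlier drift test (with `math/rand` added to the imports). Operator-chaos fuzz mode automates a loop of this shape for you:

```go
// Illustrative fuzz loop: pick a random operator-semantic fault, reconcile,
// and check the invariant, many times over. The fault set here is
// deliberately tiny; a real run would draw from many injection types.
func TestReconcilerUnderRandomFaults(t *testing.T) {
	ctx := context.Background()
	key := types.NamespacedName{Namespace: "app", Name: "app-config"}
	rng := rand.New(rand.NewSource(1)) // fixed seed so failures reproduce

	faults := []func(c client.Client) error{
		func(c client.Client) error { // config drift
			var cm corev1.ConfigMap
			if err := c.Get(ctx, key, &cm); err != nil {
				return err
			}
			cm.Data["log-level"] = "debug"
			return c.Update(ctx, &cm)
		},
		func(c client.Client) error { // outright deletion
			return c.Delete(ctx, &corev1.ConfigMap{
				ObjectMeta: metav1.ObjectMeta{Namespace: key.Namespace, Name: key.Name},
			})
		},
	}

	for i := 0; i < 100; i++ {
		// Fresh baseline for every iteration.
		c := fake.NewClientBuilder().WithObjects(&corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{Namespace: key.Namespace, Name: key.Name},
			Data:       map[string]string{"log-level": "info"},
		}).Build()
		r := &AppReconciler{Client: c}

		// Inject one random fault, then let the reconciler react.
		if err := faults[rng.Intn(len(faults))](c); err != nil {
			t.Fatal(err)
		}
		if _, err := r.Reconcile(ctx, reconcile.Request{NamespacedName: key}); err != nil {
			t.Fatalf("iteration %d: reconcile errored: %v", i, err)
		}

		// Invariant: the desired state is always restored.
		var got corev1.ConfigMap
		if err := c.Get(ctx, key, &got); err != nil || got.Data["log-level"] != "info" {
			t.Fatalf("iteration %d: desired state not restored (err: %v)", i, err)
		}
	}
}
```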

## Offline vs Live Capabilities

Many `operator-chaos` commands work without any cluster connection:

| Command | Cluster Required? | What It Does |
|---------|-------------------|--------------|
| `operator-chaos validate` | No | Validates experiment and knowledge YAML syntax |
| `operator-chaos types` | No | Lists all available injection types |
| `operator-chaos init` | No | Scaffolds new experiment files |
| `operator-chaos preflight --local` | No | Validates knowledge YAML structure without a cluster |
| `operator-chaos run` | Yes | Executes experiments against a live cluster |
| `operator-chaos suite` | Yes | Runs experiment suites against a live cluster |
| `operator-chaos preflight` (without `--local`) | Yes | Checks that declared resources exist on the cluster |
| `operator-chaos clean` | Yes | Removes leftover chaos artifacts from the cluster |

This means you can validate experiments, lint knowledge models, and scaffold new tests entirely offline: in CI without a cluster, or during development before you have access to a test environment.

## Four Usage Modes

| Mode | What It Tests | Cluster? | When to Use |
|------|---------------|----------|-------------|
| CLI Experiments | Full operator recovery on a live cluster | Yes | Pre-release validation, CI/CD |
| SDK Middleware | Operator behavior under API-level faults | Yes (or fake client) | Integration tests |
| Fuzz Testing | Reconciler correctness under random faults | No | Development, unit tests, CI |
| Upgrade Testing | Structural changes between operator versions | Yes | Release qualification, upgrade testing |
- **CLI Experiments**

  Run structured chaos experiments against a live cluster. Orchestrates the full lifecycle: steady state, inject, observe, evaluate.

  CLI Quick Start

- **SDK Middleware**

  Wrap a controller-runtime client with fault injection. No code changes to your reconciler needed; a minimal wrapper sketch follows this list.

  SDK Quick Start

- **Fuzz Testing**

  Test reconciler correctness under random faults. No cluster needed: uses the fake client.

  Fuzz Quick Start

- **Upgrade Testing**

  Auto-generate chaos experiments from version diffs. Test CRD schema changes, resource ownership shifts, and dependency mutations.

  Upgrade Testing Guide
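Here is the middleware idea sketched in plain controller-runtime terms. This is not the operator-chaos SDK API, just a minimal illustration of why no reconciler changes are needed: a wrapper that embeds a real `client.Client` satisfies the whole interface and only intercepts the calls it wants to disturb.

```go
package chaossketch

import (
	"context"
	"math/rand"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// faultInjectingClient embeds a real client.Client, so every method
// delegates to the wrapped client unless overridden here.
type faultInjectingClient struct {
	client.Client
	failureRate float64 // probability of injecting a fault per call
	rng         *rand.Rand
}

// WithFaults wraps any client (real or fake) with probabilistic faults.
func WithFaults(inner client.Client, rate float64, seed int64) client.Client {
	return &faultInjectingClient{
		Client:      inner,
		failureRate: rate,
		rng:         rand.New(rand.NewSource(seed)),
	}
}

// Get occasionally returns a server timeout instead of delegating, which
// exercises the reconciler's error handling and requeue behavior.
func (f *faultInjectingClient) Get(ctx context.Context, key client.ObjectKey, obj client.Object, opts ...client.GetOption) error {
	if f.rng.Float64() < f.failureRate {
		return apierrors.NewServerTimeout(schema.GroupResource{}, "get", 1)
	}
	return f.Client.Get(ctx, key, obj, opts...)
}
```

Because the wrapper still satisfies `client.Client`, it can be handed to an existing reconciler unchanged, whether the inner client talks to a live cluster or is a fake.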

## Verdicts

Every experiment produces a structured verdict:

| Verdict | Meaning |
|---------|---------|
| **Resilient** | Operator restored all resources correctly within the timeout |
| **Degraded** | Operator recovered, but with deviations from the expected state |
| **Failed** | Operator did not recover within the timeout |
| **Inconclusive** | Baseline check failed; the experiment could not run |
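For CI gating, a structured verdict maps naturally onto a small result type. The shape below is a hypothetical illustration, not the tool's actual output schema:

```go
package chaossketch

import "fmt"

// Verdict values mirror the table above; the Go types themselves are a
// hypothetical sketch, not operator-chaos's real output format.
type Verdict string

const (
	VerdictResilient    Verdict = "resilient"
	VerdictDegraded     Verdict = "degraded"
	VerdictFailed       Verdict = "failed"
	VerdictInconclusive Verdict = "inconclusive"
)

type ExperimentResult struct {
	Experiment string   `json:"experiment"`
	Verdict    Verdict  `json:"verdict"`
	Deviations []string `json:"deviations,omitempty"` // populated for degraded runs
}

// Gate fails a pipeline unless every experiment came back resilient.
func Gate(results []ExperimentResult) error {
	for _, res := range results {
		if res.Verdict != VerdictResilient {
			return fmt.Errorf("experiment %s: verdict %s", res.Experiment, res.Verdict)
		}
	}
	return nil
}
```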