Architecture Overview

Operator Chaos is a chaos engineering framework designed specifically for testing Kubernetes operator resilience. It combines declarative experiment definitions (CRDs) with a pluggable injection engine and observation-based evaluation.

Design Principles

  1. Declarative Experiments — Experiments are Kubernetes resources, versioned in Git and applied with kubectl
  2. Safety First — Multi-layer blast radius controls, danger level gating, and automatic rollback
  3. Crash-Safe Cleanup — All injections store rollback data in annotations, enabling recovery after controller restarts
  4. Blackboard Pattern — Multiple observers contribute evidence to a shared board for holistic evaluation
  5. Operator-Aware — Knowledge-driven understanding of operator resource ownership and dependencies
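To make the declarative model concrete, here is a minimal experiment manifest. This is an illustrative sketch: the apiVersion group, field names, and values are assumptions, not the authoritative CRD schema.

```yaml
# Hypothetical ChaosExperiment manifest; fields are illustrative.
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kill-operator-pod
spec:
  target:
    operator: opendatahub-operator
    component: controller-manager
  injection:
    type: PodKill
  blastRadius:
    allowedNamespaces: [opendatahub]
    maxPodsAffected: 1
```

Because the experiment is an ordinary Kubernetes resource, it can be versioned in Git and applied with kubectl like any other manifest.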

System Architecture

graph TB
    subgraph ui["User Interface"]
        CLI[CLI Tool]
        CRD[ChaosExperiment CRD]
        SDK[Go SDK]
    end

    subgraph cp["Control Plane"]
        Controller[Experiment Controller]
        Orchestrator[Lifecycle Orchestrator]
        Lock[Experiment Lock]
    end

    subgraph ie["Injection Engine"]
        Registry[Injector Registry]
        subgraph injectors[" "]
            PodKill[PodKill]
            Network[Network]
            Config[Config]
            CRD_Mut[CRD Mutation]
            Webhook[Webhook]
            RBAC[RBAC]
            Finalizer[Finalizer]
            Client[Client Fault]
            OwnerRef[OwnerRef Orphan]
            Quota[Quota Exhaust]
            WebhookLat[Webhook Latency]
        end
    end

    subgraph obs["Observation System"]
        Board[Observation Board]
        Recon[Reconciliation]
        Steady[Steady-State]
        Collateral[Collateral Damage]
    end

    subgraph eval["Evaluation"]
        Evaluator[Evaluator]
        Reporter[Reporter]
    end

    CLI --> CRD
    CRD --> Controller
    SDK --> Client

    Controller --> Orchestrator
    Orchestrator --> Lock
    Orchestrator --> Registry
    Registry --> injectors

    Orchestrator --> Board
    Recon --> Board
    Steady --> Board
    Collateral --> Board

    Board --> Evaluator
    Evaluator --> Reporter
    Reporter --> Controller

    style ui fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#0d47a1
    style cp fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#4a148c
    style ie fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#bf360c
    style injectors fill:#fff8e1,stroke:#f9a825,stroke-width:1px
    style obs fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1b5e20
    style eval fill:#fce4ec,stroke:#c62828,stroke-width:2px,color:#b71c1c

    style Controller fill:#ce93d8,stroke:#6a1b9a
    style Orchestrator fill:#ce93d8,stroke:#6a1b9a
    style Registry fill:#ffcc80,stroke:#e65100
    style Board fill:#a5d6a7,stroke:#2e7d32
    style Evaluator fill:#ef9a9a,stroke:#c62828

Upgrade Diff Engine

Analyzes structural changes between operator releases to auto-generate upgrade test suites.

Responsibilities:

  • Compare versioned knowledge models using two-pass matching (exact + fuzzy)
  • Walk CRD OpenAPI v3 schemas to detect breaking changes, warnings, and safe migrations
  • Map detected diffs to targeted chaos experiments
  • Generate upgrade simulation suites

Key Algorithms:

  • Component Matching: Weighted similarity scoring (resource kinds, labels, controller, count)
  • Schema Walking: Recursive traversal with severity classification (breaking, warning, info)
  • Experiment Generation: Diff-to-injection mapping (CRD changes → CRDMutation, webhook changes → WebhookDisrupt)
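The weighted similarity scoring used for component matching can be sketched as follows. The weights, field names, and scoring functions here are illustrative assumptions, not the engine's actual implementation.

```go
package main

import "fmt"

// Component is a simplified stand-in for an entry in the versioned
// knowledge model (fields and weights here are illustrative).
type Component struct {
	Kinds      []string
	Labels     map[string]string
	Controller string
	Count      int
}

// similarity returns a weighted score in [0,1] combining the signals
// listed above: resource kinds, labels, controller, and count.
func similarity(a, b Component) float64 {
	score := 0.0
	score += 0.4 * overlap(a.Kinds, b.Kinds)
	score += 0.3 * labelOverlap(a.Labels, b.Labels)
	if a.Controller == b.Controller {
		score += 0.2
	}
	if a.Count == b.Count {
		score += 0.1
	}
	return score
}

// overlap is Jaccard similarity over string sets.
func overlap(a, b []string) float64 {
	if len(a) == 0 && len(b) == 0 {
		return 1
	}
	set := map[string]bool{}
	for _, k := range a {
		set[k] = true
	}
	shared := 0
	for _, k := range b {
		if set[k] {
			shared++
		}
	}
	union := len(a) + len(b) - shared
	if union == 0 {
		return 0
	}
	return float64(shared) / float64(union)
}

// labelOverlap is Jaccard similarity over key=value label pairs.
func labelOverlap(a, b map[string]string) float64 {
	if len(a) == 0 && len(b) == 0 {
		return 1
	}
	shared := 0
	for k, v := range a {
		if b[k] == v {
			shared++
		}
	}
	union := len(a) + len(b) - shared
	if union == 0 {
		return 0
	}
	return float64(shared) / float64(union)
}

func main() {
	prev := Component{Kinds: []string{"Deployment"}, Controller: "dashboard", Count: 1}
	cur := Component{Kinds: []string{"Deployment"}, Controller: "dashboard", Count: 1}
	fmt.Printf("%.2f\n", similarity(prev, cur))
}
```

In a two-pass scheme, exact name matches are paired first, and a score like this one resolves the remaining (fuzzy) candidates above some threshold.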

CLI Commands:

  • operator-chaos diff — Compare two versioned directories
  • operator-chaos diff-crds — Deep CRD schema analysis
  • operator-chaos validate-version — Validate versioned knowledge model structure
  • operator-chaos simulate-upgrade — Generate and run upgrade test suites

See Upgrade Diff Engine Deep Dive for implementation details.

Component Breakdown

1. Control Plane

Experiment Controller

Kubernetes controller that watches ChaosExperiment CRDs and manages their lifecycle.

Responsibilities:

  • Watch CRD creation/updates
  • Validate experiments before execution
  • Update CRD status with phase, verdict, and findings
  • Coordinate with orchestrator for execution
  • Handle cleanup on experiment deletion

Key Phases:

Pending → SteadyStatePre → Injecting → Observing →
SteadyStatePost → Evaluating → Complete

Errors transition to Aborted.

Lifecycle Orchestrator

Coordinates the experiment state machine and delegates work to specialized engines.

Key Methods:

  • ValidateExperiment() — Pre-flight checks (blast radius, danger level, namespace restrictions)
  • RunPreCheck() — Verify baseline steady state
  • InjectFault() — Lookup injector and execute injection
  • RunPostCheck() — Run observation board with multiple contributors
  • EvaluateExperiment() — Compute verdict from findings
  • RevertFault() — Stateless cleanup via injector

Experiment Lock

Prevents concurrent experiments targeting the same operator. The lock is implemented as a Kubernetes Lease resource.

Features:

  • Operator-scoped locking (not namespace-scoped)
  • Configurable lease duration (default: 2x recovery timeout)
  • Automatic renewal during experiment execution
  • Force-override with --force flag

2. Injection Engine

Pluggable architecture for fault injection. Each injection type implements the Injector interface:

type Injector interface {
    Validate(spec InjectionSpec, blast BlastRadiusSpec) error
    Inject(ctx context.Context, spec InjectionSpec, namespace string) (CleanupFunc, []InjectionEvent, error)
    Revert(ctx context.Context, spec InjectionSpec, namespace string) error
}

Registry Pattern:

registry := injection.NewRegistry()
registry.Register(v1alpha1.PodKill, injection.NewPodKillInjector(client))
registry.Register(v1alpha1.NetworkPartition, injection.NewNetworkPartitionInjector(client))
// ... register all 11 injection types

Crash-Safe Cleanup:

All injectors store rollback data in resource annotations with integrity checksums:

annotations:
  chaos.operatorchaos.io/rollback: |
    {"data":"<base64-encoded-rollback-info>","checksum":"sha256:..."}

This enables Revert() to be called after controller restarts without relying on in-memory state.
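The encode/verify round trip for the rollback annotation can be sketched as below. The payload shape follows the annotation format shown above; the function names are illustrative.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// rollbackPayload mirrors the annotation value shown above:
// base64-encoded rollback data plus a sha256 integrity checksum.
type rollbackPayload struct {
	Data     string `json:"data"`
	Checksum string `json:"checksum"`
}

// encodeRollback packs raw rollback info into the annotation value.
func encodeRollback(raw []byte) (string, error) {
	sum := sha256.Sum256(raw)
	p := rollbackPayload{
		Data:     base64.StdEncoding.EncodeToString(raw),
		Checksum: "sha256:" + hex.EncodeToString(sum[:]),
	}
	out, err := json.Marshal(p)
	return string(out), err
}

// decodeRollback unpacks and verifies the annotation value; a checksum
// mismatch means the annotation was corrupted and must not be replayed.
func decodeRollback(val string) ([]byte, error) {
	var p rollbackPayload
	if err := json.Unmarshal([]byte(val), &p); err != nil {
		return nil, err
	}
	raw, err := base64.StdEncoding.DecodeString(p.Data)
	if err != nil {
		return nil, err
	}
	sum := sha256.Sum256(raw)
	if p.Checksum != "sha256:"+hex.EncodeToString(sum[:]) {
		return nil, fmt.Errorf("rollback checksum mismatch")
	}
	return raw, nil
}

func main() {
	val, _ := encodeRollback([]byte(`{"replicas":3}`))
	raw, err := decodeRollback(val)
	fmt.Println(string(raw), err)
}
```

Because both the data and its checksum live on the target resource itself, a restarted controller can verify and replay the rollback with no in-memory state.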

3. Observation System (Blackboard Pattern)

The observation system uses the Blackboard architectural pattern to collect evidence from multiple independent observers.

Core Components:

  • ObservationBoard — Thread-safe shared data structure where observers write findings
  • ObservationContributor — Interface for observers (Observe(ctx, board) error)
  • Finding — Structured evidence (source, passed/failed, details)

Contributors:

  1. ReconciliationContributor (Phase 1, blocking)

    • Monitors target operator's reconciliation behavior
    • Counts reconcile cycles during recovery window
    • Detects stuck or thrashing reconcilers
  2. SteadyStateContributor (Phase 2, concurrent)

    • Runs user-defined steady-state checks (conditions, resource existence)
    • Verifies system returned to baseline
  3. CollateralContributor (Phase 2, concurrent)

    • Checks dependent operators/components
    • Detects cascading failures

Execution Flow:

board := observer.NewObservationBoard()

// Phase 1: Reconciliation (blocking)
reconContributor.Observe(ctx, board)

// Phase 2: Steady-state + collateral (concurrent)
observer.RunContributors(ctx, board, []ObservationContributor{
    steadyStateContributor,
    collateralContributor,
})

findings := board.Findings()  // All evidence collected

4. Evaluation & Reporting

Evaluator

Computes experiment verdict from collected findings using a decision tree:

Pre-check failed? → INCONCLUSIVE (system not ready)
Post-check failed? → FAILED (did not recover)
Reconciliation < 3 cycles? → DEGRADED (slow recovery)
Collateral damage? → DEGRADED (cascade)
All checks passed? → RESILIENT

Confidence Levels: high, medium, low based on observation quality.
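The decision tree above translates directly into code. This is a sketch: the type and field names are assumptions, but the branch order and thresholds follow the docs.

```go
package main

import "fmt"

// Verdict values follow the decision tree above.
type Verdict string

const (
	Inconclusive Verdict = "INCONCLUSIVE"
	Failed       Verdict = "FAILED"
	Degraded     Verdict = "DEGRADED"
	Resilient    Verdict = "RESILIENT"
)

// Evidence is a simplified summary of the board's findings.
type Evidence struct {
	PrePassed        bool
	PostPassed       bool
	ReconcileCycles  int
	CollateralDamage bool
}

// Evaluate applies the branches in the documented order: pre-check,
// post-check, reconciliation count, collateral damage.
func Evaluate(e Evidence) Verdict {
	switch {
	case !e.PrePassed:
		return Inconclusive // system was not ready to begin with
	case !e.PostPassed:
		return Failed // system did not recover
	case e.ReconcileCycles < 3:
		return Degraded // slow recovery
	case e.CollateralDamage:
		return Degraded // cascade into dependents
	default:
		return Resilient
	}
}

func main() {
	fmt.Println(Evaluate(Evidence{PrePassed: true, PostPassed: true, ReconcileCycles: 5}))
}
```

Branch order matters: a failed pre-check short-circuits to INCONCLUSIVE so that an unhealthy baseline is never misreported as a failed recovery.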

Reporter

Generates structured reports in JSON format:

{
  "experiment": "kill-operator-pod",
  "timestamp": "2024-03-30T10:00:00Z",
  "target": {"operator": "opendatahub-operator", "component": "controller"},
  "injection": {"type": "PodKill", "targets": ["pod-abc123"]},
  "steadyState": {"pre": {...}, "post": {...}},
  "evaluation": {"verdict": "Resilient", "confidence": "high"},
  "reconciliation": {"cycles": 5, "duration": "12.3s"},
  "collateral": []
}

Reports are stored:

  • As ConfigMaps in the cluster (chaos-result-<experiment-name>)
  • As JSON files (if --report-dir specified in CLI mode)

Execution Modes

1. Controller Mode (CRD-Driven)

# Deploy controller
kubectl apply -f deploy/controller.yaml

# Submit experiment
kubectl apply -f experiments/my-test.yaml

# Watch progress
kubectl get chaosexperiment my-test -w

Use Case: GitOps workflows, CI/CD integration, scheduled experiments

2. CLI Mode (Standalone)

operator-chaos run experiments/my-test.yaml --report-dir=./reports

Use Case: Local development, one-off tests, debugging

3. SDK Mode (Programmatic)

orch := orchestrator.New(orchestrator.OrchestratorConfig{...})
result, err := orch.Run(ctx, experiment)

Use Case: Custom test harnesses, fuzzing frameworks

Safety Mechanisms

Multi-Layer Blast Radius Control

  1. Namespace Restrictions

    • Forbidden namespaces: kube-system, kube-public, default, openshift-*
    • Experiments must explicitly list allowedNamespaces
  2. Resource Limits

    • maxPodsAffected enforced per experiment
    • Forbidden resources (e.g., etcd, API server) blocked by default
  3. Danger Level Gating

    • low, medium, high levels
    • High-danger injections require allowDangerous: true
  4. Dry Run Mode

    • Validates experiment without executing injection
    • Useful for CI pre-flight checks
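The layered checks above can be sketched as one validation pass. The struct fields and helper name are illustrative assumptions; the forbidden-namespace list and danger-level rule come from the docs.

```go
package main

import (
	"fmt"
	"strings"
)

// BlastRadius captures the multi-layer controls described above
// (field names are illustrative).
type BlastRadius struct {
	AllowedNamespaces []string
	MaxPodsAffected   int
	DangerLevel       string // "low", "medium", "high"
	AllowDangerous    bool
}

// forbiddenNamespaces lists exact names and wildcard prefixes
// (a trailing "-" marks a prefix match, as in "openshift-*").
var forbiddenNamespaces = []string{"kube-system", "kube-public", "default", "openshift-"}

// validate applies the layers in order: forbidden namespaces, the
// explicit allow-list, resource limits, then danger-level gating.
func validate(namespace string, podsAffected int, br BlastRadius) error {
	for _, f := range forbiddenNamespaces {
		if namespace == f || (strings.HasSuffix(f, "-") && strings.HasPrefix(namespace, f)) {
			return fmt.Errorf("namespace %q is forbidden", namespace)
		}
	}
	allowed := false
	for _, ns := range br.AllowedNamespaces {
		if ns == namespace {
			allowed = true
		}
	}
	if !allowed {
		return fmt.Errorf("namespace %q not in allowedNamespaces", namespace)
	}
	if podsAffected > br.MaxPodsAffected {
		return fmt.Errorf("would affect %d pods, limit is %d", podsAffected, br.MaxPodsAffected)
	}
	if br.DangerLevel == "high" && !br.AllowDangerous {
		return fmt.Errorf("high-danger injection requires allowDangerous: true")
	}
	return nil
}

func main() {
	br := BlastRadius{AllowedNamespaces: []string{"opendatahub"}, MaxPodsAffected: 1, DangerLevel: "low"}
	fmt.Println(validate("kube-system", 1, br))
	fmt.Println(validate("opendatahub", 1, br))
}
```

Dry-run mode would stop after this validation pass, reporting the result without ever reaching the injector.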

TTL-Based Auto-Cleanup

Injections can specify a TTL for automatic cleanup:

injection:
  type: NetworkPartition
  ttl: 5m  # Auto-cleanup after 5 minutes

A background cleanup controller periodically scans for TTL-annotated resources whose deadline has passed and reverts their injections.

Experiment Locking

Only one experiment per operator can run at a time, preventing:

  • Conflicting injections
  • Observer confusion (which fault caused which behavior?)
  • Cascading failures

Dependency Graph

The framework maintains a dependency graph of ODH components:

components:
  - name: dashboard
    operator: opendatahub-operator
    dependsOn:
      - {operator: opendatahub-operator, component: model-controller}
      - {operator: opendatahub-operator, component: notebook-controller}

This enables:

  • Collateral damage detection — Did a fault on component A break dependent component B?
  • Targeted testing — Focus on high-impact components with many dependents
  • Root cause analysis — Trace failures back to dependencies
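Collateral damage detection amounts to inverting the dependsOn edges: for a faulted component, which components depend on it and should therefore be checked? A minimal sketch, with components keyed by name for brevity:

```go
package main

import "fmt"

// dependents inverts a dependsOn graph: for each component, it returns
// the components that would be at risk if that component failed.
func dependents(dependsOn map[string][]string) map[string][]string {
	out := map[string][]string{}
	for comp, deps := range dependsOn {
		for _, d := range deps {
			out[d] = append(out[d], comp)
		}
	}
	return out
}

func main() {
	// Mirrors the YAML above: dashboard depends on both controllers.
	graph := map[string][]string{
		"dashboard": {"model-controller", "notebook-controller"},
	}
	rev := dependents(graph)
	// A fault on model-controller puts dashboard at risk.
	fmt.Println(rev["model-controller"])
}
```

The same inverted index also ranks components for targeted testing: the more dependents a component has, the higher the impact of faulting it.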

Operator Knowledge Base

Built-in knowledge of ODH operators:

operators:
  - name: opendatahub-operator
    components:
      - name: controller-manager
        labelSelector: control-plane=controller-manager
        reconciliationResource:
          apiVersion: opendatahub.io/v1
          kind: DataScienceCluster

Used for:

  • Auto-detecting reconciliation targets
  • Validating experiment targets
  • Suggesting experiments based on operator structure

Next Steps