Architecture Overview

Operator Chaos is a chaos engineering framework designed specifically for testing Kubernetes operator resilience. It combines declarative experiment definitions (CRDs) with a pluggable injection engine and observation-based evaluation.

Design Principles

  1. Declarative Experiments — Experiments are Kubernetes resources, versioned in Git and applied with kubectl
  2. Safety First — Multi-layer blast radius controls, danger level gating, and automatic rollback
  3. Crash-Safe Cleanup — All injections store rollback data in annotations, enabling recovery after controller restarts
  4. Blackboard Pattern — Multiple observers contribute evidence to a shared board for holistic evaluation
  5. Operator-Aware — Knowledge-driven understanding of operator resource ownership and dependencies
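To make the declarative model concrete, here is a minimal experiment manifest. This is an illustrative sketch: the apiVersion group, field names, and values are assumptions, not the authoritative CRD schema.

```yaml
# Hypothetical ChaosExperiment manifest; fields are illustrative.
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kill-operator-pod
spec:
  target:
    operator: opendatahub-operator
    component: controller-manager
  injection:
    type: PodKill
  blastRadius:
    allowedNamespaces: [opendatahub]
    maxPodsAffected: 1
```

Because the experiment is an ordinary Kubernetes resource, it can be versioned in Git and applied with kubectl like any other manifest.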

System Architecture

graph TB
    subgraph ui["User Interface"]
        CLI[CLI Tool]
        CRD[ChaosExperiment CRD]
        SDK[Go SDK]
    end

    subgraph cp["Control Plane"]
        Controller[Experiment Controller]
        Orchestrator[Lifecycle Orchestrator]
        Lock[Experiment Lock]
    end

    subgraph ie["Injection Engine"]
        Registry[Injector Registry]
        subgraph injectors[" "]
            PodKill[PodKill]
            Network[Network]
            Config[Config]
            CRD_Mut[CRD Mutation]
            Webhook[Webhook]
            RBAC[RBAC]
            Finalizer[Finalizer]
            Client[Client Fault]
            OwnerRef[OwnerRef Orphan]
            Quota[Quota Exhaust]
            WebhookLat[Webhook Latency]
        end
    end

    subgraph obs["Observation System"]
        Board[Observation Board]
        Recon[Reconciliation]
        Steady[Steady-State]
        Collateral[Collateral Damage]
    end

    subgraph eval["Evaluation"]
        Evaluator[Evaluator]
        Reporter[Reporter]
    end

    CLI --> CRD
    CRD --> Controller
    SDK --> Client

    Controller --> Orchestrator
    Orchestrator --> Lock
    Orchestrator --> Registry
    Registry --> injectors

    Orchestrator --> Board
    Recon --> Board
    Steady --> Board
    Collateral --> Board

    Board --> Evaluator
    Evaluator --> Reporter
    Reporter --> Controller

    style ui fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#0d47a1
    style cp fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#4a148c
    style ie fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#bf360c
    style injectors fill:#fff8e1,stroke:#f9a825,stroke-width:1px
    style obs fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1b5e20
    style eval fill:#fce4ec,stroke:#c62828,stroke-width:2px,color:#b71c1c

    style Controller fill:#ce93d8,stroke:#6a1b9a
    style Orchestrator fill:#ce93d8,stroke:#6a1b9a
    style Registry fill:#ffcc80,stroke:#e65100
    style Board fill:#a5d6a7,stroke:#2e7d32
    style Evaluator fill:#ef9a9a,stroke:#c62828

Upgrade Diff Engine

Analyzes structural changes between operator releases to auto-generate upgrade test suites.

Responsibilities:

  • Compare versioned knowledge models using two-pass matching (exact + fuzzy)
  • Walk CRD OpenAPI v3 schemas to detect breaking changes, warnings, and safe migrations
  • Map detected diffs to targeted chaos experiments
  • Generate upgrade simulation suites

Key Algorithms:

  • Component Matching: Weighted similarity scoring (resource kinds, labels, controller, count)
  • Schema Walking: Recursive traversal with severity classification (breaking, warning, info)
  • Experiment Generation: Diff-to-injection mapping (CRD changes → CRDMutation, webhook changes → WebhookDisrupt)
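The weighted similarity scoring used for component matching can be sketched as follows. The weights, field names, and scoring functions here are illustrative assumptions, not the engine's actual implementation.

```go
package main

import "fmt"

// Component is a simplified stand-in for an entry in the versioned
// knowledge model (fields and weights here are illustrative).
type Component struct {
	Kinds      []string
	Labels     map[string]string
	Controller string
	Count      int
}

// similarity returns a weighted score in [0,1] combining the signals
// listed above: resource kinds, labels, controller, and count.
func similarity(a, b Component) float64 {
	score := 0.0
	score += 0.4 * overlap(a.Kinds, b.Kinds)
	score += 0.3 * labelOverlap(a.Labels, b.Labels)
	if a.Controller == b.Controller {
		score += 0.2
	}
	if a.Count == b.Count {
		score += 0.1
	}
	return score
}

// overlap is Jaccard similarity over string sets.
func overlap(a, b []string) float64 {
	if len(a) == 0 && len(b) == 0 {
		return 1
	}
	set := map[string]bool{}
	for _, k := range a {
		set[k] = true
	}
	shared := 0
	for _, k := range b {
		if set[k] {
			shared++
		}
	}
	union := len(a) + len(b) - shared
	if union == 0 {
		return 0
	}
	return float64(shared) / float64(union)
}

// labelOverlap is Jaccard similarity over key=value label pairs.
func labelOverlap(a, b map[string]string) float64 {
	if len(a) == 0 && len(b) == 0 {
		return 1
	}
	shared := 0
	for k, v := range a {
		if b[k] == v {
			shared++
		}
	}
	union := len(a) + len(b) - shared
	if union == 0 {
		return 0
	}
	return float64(shared) / float64(union)
}

func main() {
	prev := Component{Kinds: []string{"Deployment"}, Controller: "dashboard", Count: 1}
	cur := Component{Kinds: []string{"Deployment"}, Controller: "dashboard", Count: 1}
	fmt.Printf("%.2f\n", similarity(prev, cur))
}
```

In a two-pass scheme, exact name matches are paired first, and a score like this one resolves the remaining (fuzzy) candidates above some threshold.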

CLI Commands:

  • operator-chaos diff — Compare two versioned directories
  • operator-chaos diff-crds — Deep CRD schema analysis
  • operator-chaos validate-version — Validate versioned knowledge model structure
  • operator-chaos simulate-upgrade — Generate and run upgrade test suites

See Upgrade Diff Engine Deep Dive for implementation details.

Component Breakdown

1. Control Plane

Experiment Controller

Kubernetes controller that watches ChaosExperiment CRDs and manages their lifecycle.

Responsibilities:

  • Watch CRD creation/updates
  • Validate experiments before execution
  • Update CRD status with phase, verdict, and findings
  • Coordinate with orchestrator for execution
  • Handle cleanup on experiment deletion

Key Phases:

Pending → SteadyStatePre → Injecting → Observing →
SteadyStatePost → Evaluating → Complete

Errors transition to Aborted.

Lifecycle Orchestrator

Coordinates the experiment state machine and delegates work to specialized engines.

Key Methods:

  • ValidateExperiment() — Pre-flight checks (blast radius, danger level, namespace restrictions)
  • RunPreCheck() — Verify baseline steady state
  • InjectFault() — Lookup injector and execute injection
  • RunPostCheck() — Run observation board with multiple contributors
  • EvaluateExperiment() — Compute verdict from findings
  • RevertFault() — Stateless cleanup via injector

Experiment Lock

Prevents concurrent experiments targeting the same operator. The lock is implemented as a Kubernetes Lease resource.

Features:

  • Operator-scoped locking (not namespace-scoped)
  • Configurable lease duration (default: 2x recovery timeout)
  • Automatic renewal during experiment execution
  • Force-override with --force flag

2. Injection Engine

Pluggable architecture for fault injection. Each injection type implements the Injector interface:

type Injector interface {
    Validate(spec InjectionSpec, blast BlastRadiusSpec) error
    Inject(ctx context.Context, spec InjectionSpec, namespace string) (CleanupFunc, []InjectionEvent, error)
    Revert(ctx context.Context, spec InjectionSpec, namespace string) error
}

Registry Pattern:

registry := injection.NewRegistry()
registry.Register(v1alpha1.PodKill, injection.NewPodKillInjector(client))
registry.Register(v1alpha1.NetworkPartition, injection.NewNetworkPartitionInjector(client))
// ... register all 11 injection types

Crash-Safe Cleanup:

All injectors store rollback data in resource annotations with integrity checksums:

annotations:
  chaos.operatorchaos.io/rollback: |
    {"data":"<base64-encoded-rollback-info>","checksum":"sha256:..."}

This enables Revert() to be called after controller restarts without relying on in-memory state.
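The encode/verify round trip for the rollback annotation can be sketched as below. The payload shape follows the annotation format shown above; the function names are illustrative.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// rollbackPayload mirrors the annotation value shown above:
// base64-encoded rollback data plus a sha256 integrity checksum.
type rollbackPayload struct {
	Data     string `json:"data"`
	Checksum string `json:"checksum"`
}

// encodeRollback packs raw rollback info into the annotation value.
func encodeRollback(raw []byte) (string, error) {
	sum := sha256.Sum256(raw)
	p := rollbackPayload{
		Data:     base64.StdEncoding.EncodeToString(raw),
		Checksum: "sha256:" + hex.EncodeToString(sum[:]),
	}
	out, err := json.Marshal(p)
	return string(out), err
}

// decodeRollback unpacks and verifies the annotation value; a checksum
// mismatch means the annotation was corrupted and must not be replayed.
func decodeRollback(val string) ([]byte, error) {
	var p rollbackPayload
	if err := json.Unmarshal([]byte(val), &p); err != nil {
		return nil, err
	}
	raw, err := base64.StdEncoding.DecodeString(p.Data)
	if err != nil {
		return nil, err
	}
	sum := sha256.Sum256(raw)
	if p.Checksum != "sha256:"+hex.EncodeToString(sum[:]) {
		return nil, fmt.Errorf("rollback checksum mismatch")
	}
	return raw, nil
}

func main() {
	val, _ := encodeRollback([]byte(`{"replicas":3}`))
	raw, err := decodeRollback(val)
	fmt.Println(string(raw), err)
}
```

Because both the data and its checksum live on the target resource itself, a restarted controller can verify and replay the rollback with no in-memory state.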

3. Observation System (Blackboard Pattern)

The observation system uses the Blackboard architectural pattern to collect evidence from multiple independent observers.

Core Components:

  • ObservationBoard — Thread-safe shared data structure where observers write findings
  • ObservationContributor — Interface for observers (Observe(ctx, board) error)
  • Finding — Structured evidence (source, passed/failed, details)

Contributors:

  1. ReconciliationContributor (Phase 1, blocking)

    • Monitors target operator's reconciliation behavior
    • Counts reconcile cycles during recovery window
    • Detects stuck or thrashing reconcilers
  2. SteadyStateContributor (Phase 2, concurrent)

    • Runs user-defined steady-state checks (conditions, resource existence)
    • Verifies system returned to baseline
  3. CollateralContributor (Phase 2, concurrent)

    • Checks dependent operators/components
    • Detects cascading failures

Execution Flow:

board := observer.NewObservationBoard()

// Phase 1: Reconciliation (blocking)
reconContributor.Observe(ctx, board)

// Phase 2: Steady-state + collateral (concurrent)
observer.RunContributors(ctx, board, []ObservationContributor{
    steadyStateContributor,
    collateralContributor,
})

findings := board.Findings()  // All evidence collected

4. Evaluation & Reporting

Evaluator

Computes experiment verdict from collected findings using a decision tree:

Pre-check failed? → INCONCLUSIVE (system not ready)
Post-check failed? → FAILED (did not recover)
Reconciliation < 3 cycles? → DEGRADED (slow recovery)
Collateral damage? → DEGRADED (cascade)
All checks passed? → RESILIENT

Confidence Levels: high, medium, low based on observation quality.
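The decision tree above translates directly into code. This is a sketch: the type and field names are assumptions, but the branch order and thresholds follow the docs.

```go
package main

import "fmt"

// Verdict values follow the decision tree above.
type Verdict string

const (
	Inconclusive Verdict = "INCONCLUSIVE"
	Failed       Verdict = "FAILED"
	Degraded     Verdict = "DEGRADED"
	Resilient    Verdict = "RESILIENT"
)

// Evidence is a simplified summary of the board's findings.
type Evidence struct {
	PrePassed        bool
	PostPassed       bool
	ReconcileCycles  int
	CollateralDamage bool
}

// Evaluate applies the branches in the documented order: pre-check,
// post-check, reconciliation count, collateral damage.
func Evaluate(e Evidence) Verdict {
	switch {
	case !e.PrePassed:
		return Inconclusive // system was not ready to begin with
	case !e.PostPassed:
		return Failed // system did not recover
	case e.ReconcileCycles < 3:
		return Degraded // slow recovery
	case e.CollateralDamage:
		return Degraded // cascade into dependents
	default:
		return Resilient
	}
}

func main() {
	fmt.Println(Evaluate(Evidence{PrePassed: true, PostPassed: true, ReconcileCycles: 5}))
}
```

Branch order matters: a failed pre-check short-circuits to INCONCLUSIVE so that an unhealthy baseline is never misreported as a failed recovery.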

Reporter

Generates structured reports in JSON format:

{
  "experiment": "kill-operator-pod",
  "timestamp": "2024-03-30T10:00:00Z",
  "target": {"operator": "opendatahub-operator", "component": "controller"},
  "injection": {"type": "PodKill", "targets": ["pod-abc123"]},
  "steadyState": {"pre": {...}, "post": {...}},
  "evaluation": {"verdict": "Resilient", "confidence": "high"},
  "reconciliation": {"cycles": 5, "duration": "12.3s"},
  "collateral": []
}

Reports are stored:

  • As ConfigMaps in the cluster (chaos-result-<experiment-name>)
  • As JSON files (if --report-dir specified in CLI mode)

Execution Modes

1. Controller Mode (CRD-Driven)

# Deploy controller
kubectl apply -f deploy/controller.yaml

# Submit experiment
kubectl apply -f experiments/my-test.yaml

# Watch progress
kubectl get chaosexperiment my-test -w

Use Case: GitOps workflows, CI/CD integration, scheduled experiments

2. CLI Mode (Standalone)

operator-chaos run experiments/my-test.yaml --report-dir=./reports

Use Case: Local development, one-off tests, debugging

3. SDK Mode (Programmatic)

orch := orchestrator.New(orchestrator.OrchestratorConfig{...})
result, err := orch.Run(ctx, experiment)

Use Case: Custom test harnesses, fuzzing frameworks

Safety Mechanisms

Multi-Layer Blast Radius Control

  1. Namespace Restrictions

    • Forbidden namespaces: kube-system, kube-public, default, openshift-*
    • Experiments must explicitly list allowedNamespaces
  2. Resource Limits

    • maxPodsAffected enforced per experiment
    • Forbidden resources (e.g., etcd, API server) blocked by default
  3. Danger Level Gating

    • low, medium, high levels
    • High-danger injections require allowDangerous: true
  4. Dry Run Mode

    • Validates experiment without executing injection
    • Useful for CI pre-flight checks
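The layered checks above can be sketched as one validation pass. The struct fields and helper name are illustrative assumptions; the forbidden-namespace list and danger-level rule come from the docs.

```go
package main

import (
	"fmt"
	"strings"
)

// BlastRadius captures the multi-layer controls described above
// (field names are illustrative).
type BlastRadius struct {
	AllowedNamespaces []string
	MaxPodsAffected   int
	DangerLevel       string // "low", "medium", "high"
	AllowDangerous    bool
}

// forbiddenNamespaces lists exact names and wildcard prefixes
// (a trailing "-" marks a prefix match, as in "openshift-*").
var forbiddenNamespaces = []string{"kube-system", "kube-public", "default", "openshift-"}

// validate applies the layers in order: forbidden namespaces, the
// explicit allow-list, resource limits, then danger-level gating.
func validate(namespace string, podsAffected int, br BlastRadius) error {
	for _, f := range forbiddenNamespaces {
		if namespace == f || (strings.HasSuffix(f, "-") && strings.HasPrefix(namespace, f)) {
			return fmt.Errorf("namespace %q is forbidden", namespace)
		}
	}
	allowed := false
	for _, ns := range br.AllowedNamespaces {
		if ns == namespace {
			allowed = true
		}
	}
	if !allowed {
		return fmt.Errorf("namespace %q not in allowedNamespaces", namespace)
	}
	if podsAffected > br.MaxPodsAffected {
		return fmt.Errorf("would affect %d pods, limit is %d", podsAffected, br.MaxPodsAffected)
	}
	if br.DangerLevel == "high" && !br.AllowDangerous {
		return fmt.Errorf("high-danger injection requires allowDangerous: true")
	}
	return nil
}

func main() {
	br := BlastRadius{AllowedNamespaces: []string{"opendatahub"}, MaxPodsAffected: 1, DangerLevel: "low"}
	fmt.Println(validate("kube-system", 1, br))
	fmt.Println(validate("opendatahub", 1, br))
}
```

Dry-run mode would stop after this validation pass, reporting the result without ever reaching the injector.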

TTL-Based Auto-Cleanup

Injections can specify a TTL for automatic cleanup:

injection:
  type: NetworkPartition
  ttl: 5m  # Auto-cleanup after 5 minutes

A background cleanup controller periodically scans for TTL-annotated resources whose deadline has passed and reverts their injections.

Experiment Locking

Only one experiment per operator can run at a time, preventing:

  • Conflicting injections
  • Observer confusion (which fault caused which behavior?)
  • Cascading failures

Dependency Graph

The framework maintains a dependency graph of ODH components:

components:
  - name: dashboard
    operator: opendatahub-operator
    dependsOn:
      - {operator: opendatahub-operator, component: model-controller}
      - {operator: opendatahub-operator, component: notebook-controller}

This enables:

  • Collateral damage detection — Did a fault on component A break dependent component B?
  • Targeted testing — Focus on high-impact components with many dependents
  • Root cause analysis — Trace failures back to dependencies
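Collateral damage detection amounts to inverting the dependsOn edges: for a faulted component, which components depend on it and should therefore be checked? A minimal sketch, with components keyed by name for brevity:

```go
package main

import "fmt"

// dependents inverts a dependsOn graph: for each component, it returns
// the components that would be at risk if that component failed.
func dependents(dependsOn map[string][]string) map[string][]string {
	out := map[string][]string{}
	for comp, deps := range dependsOn {
		for _, d := range deps {
			out[d] = append(out[d], comp)
		}
	}
	return out
}

func main() {
	// Mirrors the YAML above: dashboard depends on both controllers.
	graph := map[string][]string{
		"dashboard": {"model-controller", "notebook-controller"},
	}
	rev := dependents(graph)
	// A fault on model-controller puts dashboard at risk.
	fmt.Println(rev["model-controller"])
}
```

The same inverted index also ranks components for targeted testing: the more dependents a component has, the higher the impact of faulting it.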

Operator Knowledge Base

Built-in knowledge of ODH operators:

operators:
  - name: opendatahub-operator
    components:
      - name: controller-manager
        labelSelector: control-plane=controller-manager
        reconciliationResource:
          apiVersion: opendatahub.io/v1
          kind: DataScienceCluster

Used for:

  • Auto-detecting reconciliation targets
  • Validating experiment targets
  • Suggesting experiments based on operator structure

Next Steps