Architecture Overview¶
Operator Chaos is a chaos engineering framework designed specifically for testing Kubernetes operator resilience. It combines declarative experiment definitions (CRDs) with a pluggable injection engine and observation-based evaluation.
Design Principles¶
- Declarative Experiments: Experiments are Kubernetes resources, versioned in Git and applied with kubectl
- Safety First: Multi-layer blast radius controls, danger level gating, and automatic rollback
- Crash-Safe Cleanup: All injections store rollback data in annotations, enabling recovery after controller restarts
- Blackboard Pattern: Multiple observers contribute evidence to a shared board for holistic evaluation
- Operator-Aware: Knowledge-driven understanding of operator resource ownership and dependencies
System Architecture¶
graph TB
subgraph ui["User Interface"]
CLI[CLI Tool]
CRD[ChaosExperiment CRD]
SDK[Go SDK]
end
subgraph cp["Control Plane"]
Controller[Experiment Controller]
Orchestrator[Lifecycle Orchestrator]
Lock[Experiment Lock]
end
subgraph ie["Injection Engine"]
Registry[Injector Registry]
subgraph injectors[" "]
PodKill[PodKill]
Network[Network]
Config[Config]
CRD_Mut[CRD Mutation]
Webhook[Webhook]
RBAC[RBAC]
Finalizer[Finalizer]
Client[Client Fault]
OwnerRef[OwnerRef Orphan]
Quota[Quota Exhaust]
WebhookLat[Webhook Latency]
end
end
subgraph obs["Observation System"]
Board[Observation Board]
Recon[Reconciliation]
Steady[Steady-State]
Collateral[Collateral Damage]
end
subgraph eval["Evaluation"]
Evaluator[Evaluator]
Reporter[Reporter]
end
CLI --> CRD
CRD --> Controller
SDK --> Client
Controller --> Orchestrator
Orchestrator --> Lock
Orchestrator --> Registry
Registry --> injectors
Orchestrator --> Board
Recon --> Board
Steady --> Board
Collateral --> Board
Board --> Evaluator
Evaluator --> Reporter
Reporter --> Controller
style ui fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#0d47a1
style cp fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#4a148c
style ie fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#bf360c
style injectors fill:#fff8e1,stroke:#f9a825,stroke-width:1px
style obs fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1b5e20
style eval fill:#fce4ec,stroke:#c62828,stroke-width:2px,color:#b71c1c
style Controller fill:#ce93d8,stroke:#6a1b9a
style Orchestrator fill:#ce93d8,stroke:#6a1b9a
style Registry fill:#ffcc80,stroke:#e65100
style Board fill:#a5d6a7,stroke:#2e7d32
style Evaluator fill:#ef9a9a,stroke:#c62828
Upgrade Diff Engine¶
Analyzes structural changes between operator releases to auto-generate upgrade test suites.
Responsibilities:
- Compare versioned knowledge models using two-pass matching (exact + fuzzy)
- Walk CRD OpenAPI v3 schemas to detect breaking changes, warnings, and safe migrations
- Map detected diffs to targeted chaos experiments
- Generate upgrade simulation suites
Key Algorithms:
- Component Matching: Weighted similarity scoring (resource kinds, labels, controller, count)
- Schema Walking: Recursive traversal with severity classification (breaking, warning, info)
- Experiment Generation: Diff-to-injection mapping (CRD changes → CRDMutation, webhook changes → WebhookDisrupt)
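The component-matching step can be pictured with a small sketch. The `Component` struct, the Jaccard set measure, the weights (0.4, 0.3, 0.2, 0.1), and the 0.6 threshold are all assumptions chosen for illustration; the engine's real scoring is not documented here.

```go
package main

import (
	"fmt"
	"math"
)

// Component is a hypothetical, simplified view of one knowledge-model entry.
type Component struct {
	Name       string
	Kinds      []string          // resource kinds the component manages
	Labels     map[string]string // identifying labels
	Controller string            // owning controller
}

// jaccard returns |a ∩ b| / |a ∪ b|, treating two empty sets as identical.
func jaccard(a, b []string) float64 {
	seen := map[string]int{} // bit 1: present in a, bit 2: present in b
	for _, s := range a {
		seen[s] |= 1
	}
	for _, s := range b {
		seen[s] |= 2
	}
	if len(seen) == 0 {
		return 1
	}
	inter := 0
	for _, bits := range seen {
		if bits == 3 {
			inter++
		}
	}
	return float64(inter) / float64(len(seen))
}

// labelPairs flattens a label map into "key=value" strings for set comparison.
func labelPairs(m map[string]string) []string {
	out := make([]string, 0, len(m))
	for k, v := range m {
		out = append(out, k+"="+v)
	}
	return out
}

// Similarity blends the four signals; the weights are illustrative and sum to 1.
func Similarity(a, b Component) float64 {
	controller := 0.0
	if a.Controller == b.Controller {
		controller = 1
	}
	count := 1.0 // ratio of the smaller kind count to the larger one
	if la, lb := float64(len(a.Kinds)), float64(len(b.Kinds)); la != lb {
		count = math.Min(la, lb) / math.Max(la, lb)
	}
	return 0.4*jaccard(a.Kinds, b.Kinds) +
		0.3*jaccard(labelPairs(a.Labels), labelPairs(b.Labels)) +
		0.2*controller + 0.1*count
}

// Match runs two passes: exact name first, then the best fuzzy score that
// reaches threshold. ok is false when neither pass finds a counterpart.
func Match(old Component, candidates []Component, threshold float64) (best Component, ok bool) {
	for _, c := range candidates {
		if c.Name == old.Name {
			return c, true
		}
	}
	bestScore := threshold
	for _, c := range candidates {
		if s := Similarity(old, c); s >= bestScore {
			best, bestScore, ok = c, s, true
		}
	}
	return best, ok
}

func main() {
	old := Component{Name: "dashboard", Kinds: []string{"Deployment", "Service"},
		Labels: map[string]string{"app": "dashboard"}, Controller: "odh"}
	renamed := old
	renamed.Name = "odh-dashboard"
	got, ok := Match(old, []Component{renamed}, 0.6)
	fmt.Println(got.Name, ok)
}
```

A renamed component with unchanged kinds, labels, and controller fails the exact pass but still matches in the fuzzy pass, which is the case two-pass matching is meant to catch.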
CLI Commands:
- operator-chaos diff: Compare two versioned directories
- operator-chaos diff-crds: Deep CRD schema analysis
- operator-chaos validate-version: Validate versioned knowledge model structure
- operator-chaos simulate-upgrade: Generate and run upgrade test suites
See Upgrade Diff Engine Deep Dive for implementation details.
Component Breakdown¶
1. Control Plane¶
Experiment Controller¶
Kubernetes controller that watches ChaosExperiment CRDs and manages their lifecycle.
Responsibilities:
- Watch CRD creation/updates
- Validate experiments before execution
- Update CRD status with phase, verdict, and findings
- Coordinate with orchestrator for execution
- Handle cleanup on experiment deletion
Key Phases:
An error in any phase transitions the experiment to Aborted.
Lifecycle Orchestrator¶
Coordinates the experiment state machine and delegates work to specialized engines.
Key Methods:
- ValidateExperiment(): Pre-flight checks (blast radius, danger level, namespace restrictions)
- RunPreCheck(): Verify baseline steady state
- InjectFault(): Look up the injector and execute the injection
- RunPostCheck(): Run the observation board with multiple contributors
- EvaluateExperiment(): Compute verdict from findings
- RevertFault(): Stateless cleanup via injector
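The order in which these methods run can be sketched as a simple sequence. The `Steps` interface below uses simplified, hypothetical signatures (the real methods take experiment arguments), and the `recorder` type is only a test double. The sketch assumes that a revert is always attempted once a fault has been injected, and it treats a failed pre-check as an error for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Steps is a hypothetical interface mirroring the orchestrator methods by name.
type Steps interface {
	ValidateExperiment() error
	RunPreCheck() error
	InjectFault() error
	RunPostCheck() error
	EvaluateExperiment() string
	RevertFault() error
}

// Run drives one experiment through the lifecycle and returns its verdict.
func Run(s Steps) (verdict string, err error) {
	if err = s.ValidateExperiment(); err != nil {
		return "", fmt.Errorf("validate: %w", err)
	}
	if err = s.RunPreCheck(); err != nil {
		return "", fmt.Errorf("pre-check: %w", err)
	}
	if err = s.InjectFault(); err != nil {
		return "", fmt.Errorf("inject: %w", err)
	}
	// From here on a fault is live, so a revert is attempted on every path.
	defer func() {
		if rerr := s.RevertFault(); rerr != nil {
			err = errors.Join(err, fmt.Errorf("revert: %w", rerr))
		}
	}()
	if err = s.RunPostCheck(); err != nil {
		return "", fmt.Errorf("post-check: %w", err)
	}
	return s.EvaluateExperiment(), nil
}

// recorder is a test double that logs each call and fails at one named step.
type recorder struct {
	calls  []string
	failAt string
}

func (r *recorder) step(name string) error {
	r.calls = append(r.calls, name)
	if name == r.failAt {
		return errors.New(name + " failed")
	}
	return nil
}

func (r *recorder) ValidateExperiment() error  { return r.step("ValidateExperiment") }
func (r *recorder) RunPreCheck() error         { return r.step("RunPreCheck") }
func (r *recorder) InjectFault() error         { return r.step("InjectFault") }
func (r *recorder) RunPostCheck() error        { return r.step("RunPostCheck") }
func (r *recorder) RevertFault() error         { return r.step("RevertFault") }
func (r *recorder) EvaluateExperiment() string { r.step("EvaluateExperiment"); return "Resilient" }

func main() {
	r := &recorder{failAt: "RunPostCheck"}
	_, err := Run(r)
	fmt.Println(r.calls, err)
}
```

Placing the revert in a deferred call keeps cleanup on the same path whether the post-check passes or fails.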
Experiment Lock¶
Prevents concurrent experiments targeting the same operator. Implemented as Kubernetes Lease resources.
Features:
- Operator-scoped locking (not namespace-scoped)
- Configurable lease duration (default: 2x recovery timeout)
- Automatic renewal during experiment execution
- Force-override with the --force flag
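The locking rules above can be illustrated with an in-memory stand-in. The real lock is backed by Kubernetes Lease resources; `LockTable`, `Acquire`, and the demo values below are hypothetical names used only to show operator-scoped keys, the 2x recovery timeout default, renewal, and force-override.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Demo values so the example is deterministic.
const demoRecovery = 30 * time.Second

var t0 = time.Date(2024, 3, 30, 10, 0, 0, 0, time.UTC)

type lease struct {
	holder  string
	expires time.Time
}

// LockTable is an in-memory stand-in for the Lease-backed experiment lock.
type LockTable struct {
	mu       sync.Mutex
	leases   map[string]lease // keyed by operator name, not by namespace
	duration time.Duration
}

// NewLockTable sets the lease duration to twice the recovery timeout.
func NewLockTable(recoveryTimeout time.Duration) *LockTable {
	return &LockTable{leases: map[string]lease{}, duration: 2 * recoveryTimeout}
}

// Acquire takes the lock for operator, or renews it when holder already owns it.
// A live lease held by someone else is only replaced when force is set.
func (l *LockTable) Acquire(operator, holder string, now time.Time, force bool) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	cur, held := l.leases[operator]
	if held && cur.holder != holder && now.Before(cur.expires) && !force {
		return fmt.Errorf("operator %q is locked by %q until %s",
			operator, cur.holder, cur.expires.Format(time.RFC3339))
	}
	l.leases[operator] = lease{holder: holder, expires: now.Add(l.duration)}
	return nil
}

// Release drops the lock if holder still owns it.
func (l *LockTable) Release(operator, holder string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if cur, ok := l.leases[operator]; ok && cur.holder == holder {
		delete(l.leases, operator)
	}
}

func main() {
	locks := NewLockTable(demoRecovery)
	fmt.Println(locks.Acquire("opendatahub-operator", "exp-a", t0, false))
	fmt.Println(locks.Acquire("opendatahub-operator", "exp-b", t0.Add(time.Second), false))
}
```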
2. Injection Engine¶
Pluggable architecture for fault injection. Each injection type implements the Injector interface:
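type Injector interface {
	// Validate checks the spec against the blast radius before anything is changed.
	Validate(spec InjectionSpec, blast BlastRadiusSpec) error
	// Inject applies the fault and returns a cleanup function plus the events it produced.
	Inject(ctx context.Context, spec InjectionSpec, namespace string) (CleanupFunc, []InjectionEvent, error)
	// Revert undoes the fault without relying on in-memory state.
	Revert(ctx context.Context, spec InjectionSpec, namespace string) error
}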
Registry Pattern:
registry := injection.NewRegistry()
registry.Register(v1alpha1.PodKill, injection.NewPodKillInjector(client))
registry.Register(v1alpha1.NetworkPartition, injection.NewNetworkPartitionInjector(client))
// ... register all 11 injection types
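A registry of this kind is typically a map from injection type to implementation. The following is a minimal sketch, not the framework's code: `InjectionType`, the trimmed one-method `Injector`, and `noopInjector` are simplified stand-ins for the real types.

```go
package main

import (
	"context"
	"fmt"
)

// InjectionType and the constants below are simplified stand-ins for the v1alpha1 types.
type InjectionType string

const (
	PodKill          InjectionType = "PodKill"
	NetworkPartition InjectionType = "NetworkPartition"
)

// Injector is trimmed to one method here; the real interface also has Validate and Revert.
type Injector interface {
	Inject(ctx context.Context, namespace string) error
}

// Registry maps each injection type to the injector that handles it.
type Registry struct {
	injectors map[InjectionType]Injector
}

func NewRegistry() *Registry {
	return &Registry{injectors: map[InjectionType]Injector{}}
}

// Register adds or replaces the injector for one type.
func (r *Registry) Register(t InjectionType, inj Injector) {
	r.injectors[t] = inj
}

// Get looks up the injector for a type and fails for unregistered types.
func (r *Registry) Get(t InjectionType) (Injector, error) {
	inj, ok := r.injectors[t]
	if !ok {
		return nil, fmt.Errorf("no injector registered for type %q", t)
	}
	return inj, nil
}

// noopInjector is a placeholder implementation used only for the demo.
type noopInjector struct{}

func (noopInjector) Inject(context.Context, string) error { return nil }

func main() {
	registry := NewRegistry()
	registry.Register(PodKill, noopInjector{})
	_, err := registry.Get(NetworkPartition)
	fmt.Println(err)
}
```

Failing loudly on an unknown type lets the orchestrator reject an experiment before any fault is applied.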
Crash-Safe Cleanup:
All injectors store rollback data in resource annotations with integrity checksums:
annotations:
  chaos.operatorchaos.io/rollback: |
    {"data":"<base64-encoded-rollback-info>","checksum":"sha256:..."}
This enables Revert() to be called after controller restarts without relying on in-memory state.
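As a rough illustration of how such an annotation value could be written and verified, the sketch below encodes rollback bytes into the JSON envelope shown above and rejects values whose checksum does not match. It assumes the checksum is a hex SHA-256 over the raw (pre-base64) bytes; the function names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
)

// envelope mirrors the JSON shape stored in the rollback annotation.
type envelope struct {
	Data     string `json:"data"`
	Checksum string `json:"checksum"`
}

// ErrChecksumMismatch signals that the stored rollback data was altered or corrupted.
var ErrChecksumMismatch = errors.New("rollback data failed its integrity check")

// checksum hashes the raw bytes (assumption: the hash covers the data before base64 encoding).
func checksum(raw []byte) string {
	sum := sha256.Sum256(raw)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// EncodeRollback builds the annotation value for the given rollback bytes.
func EncodeRollback(raw []byte) (string, error) {
	out, err := json.Marshal(envelope{
		Data:     base64.StdEncoding.EncodeToString(raw),
		Checksum: checksum(raw),
	})
	if err != nil {
		return "", err
	}
	return string(out), nil
}

// DecodeRollback parses an annotation value and verifies it before returning the bytes.
func DecodeRollback(annotation string) ([]byte, error) {
	var env envelope
	if err := json.Unmarshal([]byte(annotation), &env); err != nil {
		return nil, fmt.Errorf("parse rollback annotation: %w", err)
	}
	raw, err := base64.StdEncoding.DecodeString(env.Data)
	if err != nil {
		return nil, fmt.Errorf("decode rollback data: %w", err)
	}
	if checksum(raw) != env.Checksum {
		return nil, ErrChecksumMismatch
	}
	return raw, nil
}

func main() {
	value, _ := EncodeRollback([]byte(`{"replicas":3}`))
	restored, err := DecodeRollback(value)
	fmt.Println(string(restored), err)
}
```

Because everything needed for the revert travels with the resource, a freshly restarted controller can decode the annotation and restore the original state.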
3. Observation System (Blackboard Pattern)¶
The observation system uses the Blackboard architectural pattern to collect evidence from multiple independent observers.
Core Components:
- ObservationBoard: Thread-safe shared data structure where observers write findings
- ObservationContributor: Interface for observers (Observe(ctx, board) error)
- Finding: Structured evidence (source, passed/failed, details)
Contributors:
- ReconciliationContributor (Phase 1, blocking)
  - Monitors target operator's reconciliation behavior
  - Counts reconcile cycles during recovery window
  - Detects stuck or thrashing reconcilers
- SteadyStateContributor (Phase 2, concurrent)
  - Runs user-defined steady-state checks (conditions, resource existence)
  - Verifies system returned to baseline
- CollateralContributor (Phase 2, concurrent)
  - Checks dependent operators/components
  - Detects cascading failures
Execution Flow:
board := observer.NewObservationBoard()

// Phase 1: Reconciliation (blocking)
reconContributor.Observe(ctx, board)

// Phase 2: Steady-state + collateral (concurrent)
observer.RunContributors(ctx, board, []ObservationContributor{
	steadyStateContributor,
	collateralContributor,
})

findings := board.Findings() // All evidence collected
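For readers unfamiliar with the pattern, here is a self-contained sketch of what a thread-safe board and a concurrent runner might look like. The type and function names follow the ones used above, but the bodies, the `Add` method, and the `staticContributor` demo observer are assumptions rather than the framework's implementation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Finding is one piece of structured evidence.
type Finding struct {
	Source  string
	Passed  bool
	Details string
}

// ObservationBoard is the shared, mutex-guarded blackboard.
type ObservationBoard struct {
	mu       sync.Mutex
	findings []Finding
}

func NewObservationBoard() *ObservationBoard { return &ObservationBoard{} }

// Add appends a finding; it is safe to call from several goroutines.
func (b *ObservationBoard) Add(f Finding) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.findings = append(b.findings, f)
}

// Findings returns a copy so callers cannot mutate the board's slice.
func (b *ObservationBoard) Findings() []Finding {
	b.mu.Lock()
	defer b.mu.Unlock()
	return append([]Finding(nil), b.findings...)
}

type ObservationContributor interface {
	Observe(ctx context.Context, board *ObservationBoard) error
}

// RunContributors runs all contributors concurrently and joins their errors.
func RunContributors(ctx context.Context, board *ObservationBoard, cs []ObservationContributor) error {
	errs := make([]error, len(cs))
	var wg sync.WaitGroup
	for i, c := range cs {
		wg.Add(1)
		go func(i int, c ObservationContributor) {
			defer wg.Done()
			errs[i] = c.Observe(ctx, board)
		}(i, c)
	}
	wg.Wait()
	return errors.Join(errs...)
}

// staticContributor is a demo observer that always reports one fixed finding.
type staticContributor struct{ finding Finding }

func (s staticContributor) Observe(_ context.Context, board *ObservationBoard) error {
	board.Add(s.finding)
	return nil
}

// runDemo mirrors the two-phase flow: reconciliation first, then the rest concurrently.
func runDemo() ([]Finding, error) {
	ctx, board := context.Background(), NewObservationBoard()
	recon := staticContributor{Finding{Source: "reconciliation", Passed: true}}
	if err := recon.Observe(ctx, board); err != nil {
		return nil, err
	}
	err := RunContributors(ctx, board, []ObservationContributor{
		staticContributor{Finding{Source: "steady-state", Passed: true}},
		staticContributor{Finding{Source: "collateral", Passed: false, Details: "dashboard degraded"}},
	})
	return board.Findings(), err
}

func main() {
	findings, err := runDemo()
	fmt.Println(len(findings), err)
}
```

Each goroutine writes to its own slot in `errs`, so only the board itself needs a mutex.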
4. Evaluation & Reporting¶
Evaluator¶
Computes experiment verdict from collected findings using a decision tree:
Pre-check failed? → INCONCLUSIVE (system not ready)
Post-check failed? → FAILED (did not recover)
Reconciliation < 3 cycles? → DEGRADED (slow recovery)
Collateral damage? → DEGRADED (cascade)
All checks passed? → RESILIENT
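The decision tree translates almost directly into an ordered chain of checks. In this sketch the `Inputs` struct and its field names are hypothetical; the rule order and the three-cycle threshold are taken from the tree above.

```go
package main

import "fmt"

// Inputs is a hypothetical summary of the findings the evaluator looks at.
type Inputs struct {
	PreCheckPassed  bool
	PostCheckPassed bool
	ReconcileCycles int
	CollateralCount int // number of collateral-damage findings
}

// Verdict walks the decision tree top to bottom; the first matching rule wins.
func Verdict(in Inputs) string {
	switch {
	case !in.PreCheckPassed:
		return "INCONCLUSIVE" // system was not ready before injection
	case !in.PostCheckPassed:
		return "FAILED" // system did not recover
	case in.ReconcileCycles < 3:
		return "DEGRADED" // slow recovery
	case in.CollateralCount > 0:
		return "DEGRADED" // cascade into dependent components
	default:
		return "RESILIENT"
	}
}

func main() {
	fmt.Println(Verdict(Inputs{PreCheckPassed: true, PostCheckPassed: true, ReconcileCycles: 5}))
}
```

Because the rules are ordered, a failed pre-check yields INCONCLUSIVE even when the post-check also failed.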
Confidence Levels: high, medium, low based on observation quality.
Reporter¶
Generates structured reports in JSON format:
{
  "experiment": "kill-operator-pod",
  "timestamp": "2024-03-30T10:00:00Z",
  "target": {"operator": "opendatahub-operator", "component": "controller"},
  "injection": {"type": "PodKill", "targets": ["pod-abc123"]},
  "steadyState": {"pre": {...}, "post": {...}},
  "evaluation": {"verdict": "Resilient", "confidence": "high"},
  "reconciliation": {"cycles": 5, "duration": "12.3s"},
  "collateral": []
}
Reports are stored:
- As ConfigMaps in the cluster (chaos-result-<experiment-name>)
- As JSON files (if --report-dir is specified in CLI mode)
Execution Modes¶
1. Controller Mode (CRD-Driven)¶
# Deploy controller
kubectl apply -f deploy/controller.yaml
# Submit experiment
kubectl apply -f experiments/my-test.yaml
# Watch progress
kubectl get chaosexperiment my-test -w
Use Case: GitOps workflows, CI/CD integration, scheduled experiments
2. CLI Mode (Standalone)¶
Use Case: Local development, one-off tests, debugging
3. SDK Mode (Programmatic)¶
// Use a distinct variable name so the orchestrator package is not shadowed.
orch := orchestrator.New(orchestrator.OrchestratorConfig{...})
result, err := orch.Run(ctx, experiment)
Use Case: Custom test harnesses, fuzzing frameworks
Safety Mechanisms¶
Multi-Layer Blast Radius Control¶
- Namespace Restrictions
  - Forbidden namespaces: kube-system, kube-public, default, openshift-*
  - Experiments must explicitly list allowedNamespaces
- Resource Limits
  - maxPodsAffected enforced per experiment
  - Forbidden resources (e.g., etcd, API server) blocked by default
- Danger Level Gating
  - low, medium, high levels
  - High-danger injections require allowDangerous: true
- Dry Run Mode
  - Validates experiment without executing injection
  - Useful for CI pre-flight checks
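The layered checks can be sketched as a single validation function that stops at the first violated rule. The `Experiment` struct here is a hypothetical flattening of the relevant spec fields, and glob matching via `path.Match` is an assumption about how the `openshift-*` pattern is evaluated.

```go
package main

import (
	"fmt"
	"path"
)

// forbiddenNamespaces lists the defaults named above; entries may be glob patterns.
var forbiddenNamespaces = []string{"kube-system", "kube-public", "default", "openshift-*"}

// Experiment is a hypothetical flattening of the fields the safety checks need.
type Experiment struct {
	Namespace         string
	AllowedNamespaces []string
	PodsTargeted      int
	MaxPodsAffected   int
	DangerLevel       string // "low", "medium", or "high"
	AllowDangerous    bool
}

// Validate applies each safety layer in turn and reports the first violation.
func Validate(e Experiment) error {
	for _, pattern := range forbiddenNamespaces {
		// The patterns are static and well-formed, so the match error is ignored.
		if ok, _ := path.Match(pattern, e.Namespace); ok {
			return fmt.Errorf("namespace %q is forbidden (matches %q)", e.Namespace, pattern)
		}
	}
	allowed := false
	for _, ns := range e.AllowedNamespaces {
		if ns == e.Namespace {
			allowed = true
			break
		}
	}
	if !allowed {
		return fmt.Errorf("namespace %q is not listed in allowedNamespaces", e.Namespace)
	}
	if e.PodsTargeted > e.MaxPodsAffected {
		return fmt.Errorf("%d pods targeted, maxPodsAffected is %d", e.PodsTargeted, e.MaxPodsAffected)
	}
	if e.DangerLevel == "high" && !e.AllowDangerous {
		return fmt.Errorf("high-danger injection requires allowDangerous: true")
	}
	return nil
}

func main() {
	e := Experiment{Namespace: "openshift-monitoring", AllowedNamespaces: []string{"openshift-monitoring"}}
	fmt.Println(Validate(e))
}
```

Checking the forbidden list before the allow list means an experiment cannot opt in to a protected namespace by listing it explicitly.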
TTL-Based Auto-Cleanup¶
Injections can specify a TTL for automatic cleanup. A background cleanup controller scans for expired resources marked with TTL annotations.
Experiment Locking¶
Only one experiment per operator can run at a time, preventing:
- Conflicting injections
- Observer confusion (which fault caused which behavior?)
- Cascading failures
Dependency Graph¶
The framework maintains a dependency graph of Open Data Hub (ODH) components:
components:
  - name: dashboard
    operator: opendatahub-operator
    dependsOn:
      - {operator: opendatahub-operator, component: model-controller}
      - {operator: opendatahub-operator, component: notebook-controller}
This enables:
- Collateral damage detection — Did a fault on component A break dependent component B?
- Targeted testing — Focus on high-impact components with many dependents
- Root cause analysis — Trace failures back to dependencies
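Collateral damage detection amounts to asking which components sit downstream of the faulted one. A minimal sketch, assuming the graph is stored as a reverse index and walked breadth-first; `Ref`, `Graph`, and `Impacted` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// Ref identifies a component by its operator and component name.
type Ref struct{ Operator, Component string }

// Graph stores the reverse edges: component -> components that depend on it.
type Graph struct {
	dependents map[Ref][]Ref
}

func NewGraph() *Graph { return &Graph{dependents: map[Ref][]Ref{}} }

// AddDependency records that component from depends on component on.
func (g *Graph) AddDependency(from, on Ref) {
	g.dependents[on] = append(g.dependents[on], from)
}

// Impacted returns every component that directly or transitively depends on
// faulted, sorted for stable output. The faulted component itself is excluded.
func (g *Graph) Impacted(faulted Ref) []Ref {
	visited := map[Ref]bool{faulted: true}
	queue := []Ref{faulted}
	var out []Ref
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, dep := range g.dependents[cur] {
			if !visited[dep] {
				visited[dep] = true
				out = append(out, dep)
				queue = append(queue, dep)
			}
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Operator != out[j].Operator {
			return out[i].Operator < out[j].Operator
		}
		return out[i].Component < out[j].Component
	})
	return out
}

func main() {
	g := NewGraph()
	dashboard := Ref{"opendatahub-operator", "dashboard"}
	g.AddDependency(dashboard, Ref{"opendatahub-operator", "model-controller"})
	g.AddDependency(dashboard, Ref{"opendatahub-operator", "notebook-controller"})
	fmt.Println(g.Impacted(Ref{"opendatahub-operator", "model-controller"}))
}
```

The length of `Impacted` for each component also gives a simple ranking for targeted testing: components with many dependents are the high-impact ones.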
Operator Knowledge Base¶
Built-in knowledge of ODH operators:
operators:
- name: opendatahub-operator
components:
- name: controller-manager
labelSelector: control-plane=controller-manager
reconcilationResource:
apiVersion: opendatahub.io/v1
kind: DataScienceCluster
Used for:
- Auto-detecting reconciliation targets
- Validating experiment targets
- Suggesting experiments based on operator structure
Next Steps¶
- Injection Engine Deep Dive — How injectors work
- Observer Blackboard Pattern — Observation architecture
- Development Setup — Build from source