Controller Advanced Guide
This guide covers advanced controller mode topics: additional experiment examples, detailed status fields, safety mechanisms, scheduled experiments, and GitOps integration. To get started with controller mode, see Controller Mode.
Additional Experiment Examples
ConfigDrift Experiment
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: inferenceservice-config-drift
  namespace: operator-chaos-experiments
spec:
  target:
    operator: odh-model-controller
    component: odh-model-controller
  injection:
    type: ConfigDrift
    parameters:
      name: inferenceservice-config
      key: deploy
      value: "corrupted-config-data"
    ttl: "300s"
  hypothesis:
    description: >-
      When the inferenceservice-config ConfigMap is corrupted, the
      controller should detect and restore the correct configuration.
    recoveryTimeout: 180s
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    allowDangerous: true
RBACRevoke Experiment (High Danger)
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: rbac-revoke-test
  namespace: operator-chaos-experiments
spec:
  target:
    operator: odh-model-controller
    component: odh-model-controller
  injection:
    type: RBACRevoke
    parameters:
      bindingName: odh-model-controller-rolebinding-opendatahub
      bindingType: ClusterRoleBinding
    ttl: "300s"
  hypothesis:
    description: >-
      When the controller's RBAC binding is revoked, it should detect
      permission errors and recover when RBAC is restored.
    recoveryTimeout: 240s
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    allowDangerous: true
High-Danger Injections
Injection types like RBACRevoke and WebhookDisrupt are high-danger. You must set blastRadius.allowDangerous: true or the controller will reject the experiment.
Viewing Results
Status Fields
The controller updates .status with experiment results:
status:
  phase: Complete
  verdict: Resilient
  observedGeneration: 1
  message: "Experiment completed successfully"
  startTime: "2024-03-30T12:00:00Z"
  endTime: "2024-03-30T12:02:15Z"
  injectionStartedAt: "2024-03-30T12:00:05Z"
  steadyStatePre:
    passed: true
    checksRun: 1
    checksPassed: 1
    details:
      - check:
          type: conditionTrue
          apiVersion: apps/v1
          kind: Deployment
          name: odh-model-controller
          namespace: opendatahub
          conditionType: Available
        passed: true
        value: "True"
    timestamp: "2024-03-30T12:00:02Z"
  steadyStatePost:
    passed: true
    checksRun: 1
    checksPassed: 1
    details: [...]
    timestamp: "2024-03-30T12:02:06Z"
  injectionLog:
    - timestamp: "2024-03-30T12:00:05Z"
      type: PodKill
      target: opendatahub/odh-model-controller-5c7d8f9b-xz4k2
      action: deleted
      details:
        signal: SIGTERM
  evaluationResult:
    verdict: Resilient
    confidence: high
    recoveryTime: 115s
    reconcileCycles: 2
    deviations: []
  conditions:
    - type: SteadyStateEstablished
      status: "True"
      lastTransitionTime: "2024-03-30T12:00:02Z"
      reason: PreCheckPassed
      message: "Baseline steady-state established"
    - type: FaultInjected
      status: "True"
      lastTransitionTime: "2024-03-30T12:00:05Z"
      reason: InjectionSucceeded
      message: "Fault injected successfully"
    - type: RecoveryObserved
      status: "True"
      lastTransitionTime: "2024-03-30T12:02:05Z"
      reason: RecoveryComplete
      message: "Recovery timeout elapsed, all resources reconciled"
    - type: Complete
      status: "True"
      lastTransitionTime: "2024-03-30T12:02:15Z"
      reason: EvaluationComplete
      message: "Experiment complete, verdict: Resilient"
Key status fields:
| Field | Type | Description |
|---|---|---|
| phase | string | Current phase (see lifecycle diagram) |
| verdict | string | Experiment verdict: Resilient, Degraded, Failed, Inconclusive |
| message | string | Human-readable status message |
| startTime | timestamp | When the experiment started |
| endTime | timestamp | When the experiment completed (phase Complete or Aborted) |
| injectionStartedAt | timestamp | When the fault was injected |
| steadyStatePre | object | Pre-injection check results |
| steadyStatePost | object | Post-recovery check results |
| injectionLog | array | Detailed log of injection actions |
| evaluationResult | object | Verdict, recovery time, reconcile cycles, deviations |
| conditions | array | Kubernetes-native status conditions |
Query Experiments
# List all experiments
kubectl get chaosexperiments -A
# Filter by verdict
kubectl get chaosexperiments -A -o json | \
jq '.items[] | select(.status.verdict == "Failed") | .metadata.name'
# Show experiments in progress
kubectl get chaosexperiments -A -o json | \
jq '.items[] | select(.status.phase != "Complete" and .status.phase != "Aborted")'
# Get detailed results
kubectl get chaosexperiment my-experiment -o yaml
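For scripts that need to block until an experiment finishes, one option is kubectl wait with a JSONPath condition (supported by kubectl 1.23 and later). The sketch below is an illustration only and is not part of the operator-chaos tooling: wait_for_verdict is a hypothetical helper name, and the 300s default timeout is an arbitrary choice. An experiment that ends in Aborted never reaches Complete, so the wait times out in that case.

```shell
# Hypothetical helper: block until the experiment reaches phase Complete,
# then print its verdict.
# Usage: wait_for_verdict <name> <namespace> [timeout]
wait_for_verdict() {
  # kubectl wait exits non-zero if the phase is not Complete before the timeout
  kubectl wait "chaosexperiment/$1" -n "$2" \
    --for=jsonpath='{.status.phase}'=Complete \
    --timeout="${3:-300s}" >/dev/null || return 1
  kubectl get chaosexperiment "$1" -n "$2" -o jsonpath='{.status.verdict}'
}

# Example (requires cluster access):
#   wait_for_verdict my-experiment operator-chaos-experiments 10m
```

A CI job could compare the printed verdict against Resilient and fail the pipeline otherwise.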
Events
The controller emits events at each phase transition:
kubectl get events --field-selector involvedObject.kind=ChaosExperiment
# LAST SEEN TYPE REASON OBJECT MESSAGE
# 2m Normal PhaseTransition chaosexperiment/my-experiment Phase: Pending -> SteadyStatePre
# 2m Normal PhaseTransition chaosexperiment/my-experiment Phase: SteadyStatePre -> Injecting
# 2m Normal FaultInjected chaosexperiment/my-experiment Deleted pod odh-model-controller-5c7d8f9b-xz4k2
# 30s Normal PhaseTransition chaosexperiment/my-experiment Phase: Injecting -> Observing
# 5s Normal PhaseTransition chaosexperiment/my-experiment Phase: Observing -> SteadyStatePost
# 3s Normal PhaseTransition chaosexperiment/my-experiment Phase: SteadyStatePost -> Evaluating
# 2s Normal VerdictRendered chaosexperiment/my-experiment Verdict: Resilient (recovery: 115s, cycles: 2)
# 1s Normal PhaseTransition chaosexperiment/my-experiment Phase: Evaluating -> Complete
Safety Mechanisms
Distributed Locking
The controller uses Kubernetes Leases to prevent concurrent experiments on the same operator:
- Before injecting, the controller acquires a lease for the target operator
- Lease name: chaos-lock-<operator-name>
- Lease namespace: operator-chaos-system (configurable via --lock-namespace)
- If another experiment holds the lease, the controller requeues with backoff
View active locks:
kubectl get leases -n operator-chaos-system
# NAME HOLDER AGE
# chaos-lock-odh-model-controller my-experiment 45s
The lock is released when the experiment reaches Complete or Aborted.
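To check who holds the lock for a single operator, the lease can be read directly. This is a minimal sketch, assuming the default lock namespace and that the holder shown in the HOLDER column is stored in the standard Lease field spec.holderIdentity. The helper names lock_name and lock_holder are hypothetical.

```shell
# Lease name for an operator, following the chaos-lock-<operator-name> pattern.
lock_name() {
  printf 'chaos-lock-%s\n' "$1"
}

# Print the current holder of an operator's lock.
# Usage: lock_holder <operator-name> [lock-namespace]
lock_holder() {
  kubectl get lease "$(lock_name "$1")" \
    -n "${2:-operator-chaos-system}" \
    -o jsonpath='{.spec.holderIdentity}'
}

# Example (requires cluster access):
#   lock_holder odh-model-controller
```

If the controller was started with a non-default --lock-namespace, pass that namespace as the second argument.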
Finalizers
The controller adds a finalizer (chaos.operatorchaos.io/cleanup) during the Injecting phase. This ensures:
- If the CR is deleted mid-experiment, the controller reverts the fault before deleting
- If the controller crashes, the finalizer prevents orphaned faults
Crash recovery: If the controller crashes during an experiment, on restart it:
- Resumes from the last recorded phase
- Re-runs cleanup logic if phase is Aborted
- Removes the finalizer on terminal phases
TTL-Based Auto-Cleanup
Faults have a time-to-live (injection.ttl). Even if the controller crashes, the framework's TTL cleanup logic (running in the Observer) will eventually revert the fault.
Blast Radius Limits
The controller enforces blast radius constraints before injection:
- maxPodsAffected: Maximum pods that can be affected
- allowedNamespaces: Injection restricted to these namespaces
- forbiddenResources: Resources that must not be touched
- allowDangerous: High-danger injections require explicit opt-in
Experiments that violate a constraint are rejected: the phase is set to Aborted and the status message explains the violation.
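To find rejected experiments and read why they were rejected, the phase and message fields can be queried together. This is a sketch using kubectl's JSONPath filter syntax, and list_aborted is a hypothetical helper name. It lists every experiment in phase Aborted, so the output can include experiments aborted for reasons other than a blast radius violation.

```shell
# Hypothetical helper: list experiments in phase Aborted with their status message,
# one per line as <namespace>/<name><TAB><message>.
list_aborted() {
  kubectl get chaosexperiments -A \
    -o jsonpath='{range .items[?(@.status.phase=="Aborted")]}{.metadata.namespace}/{.metadata.name}{"\t"}{.status.message}{"\n"}{end}'
}

# Example (requires cluster access):
#   list_aborted
```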
Scheduled Experiments with CronJobs
Run experiments on a schedule using Kubernetes CronJobs:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-chaos
  namespace: operator-chaos-experiments
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: chaos-job-runner
          containers:
            - name: create-experiment
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  # Use create, not apply: kubectl apply rejects generateName
                  cat <<EOF | kubectl create -f -
                  apiVersion: chaos.operatorchaos.io/v1alpha1
                  kind: ChaosExperiment
                  metadata:
                    generateName: nightly-podkill-
                    namespace: operator-chaos-experiments
                  spec:
                    target:
                      operator: odh-model-controller
                      component: odh-model-controller
                    injection:
                      type: PodKill
                      parameters:
                        labelSelector: control-plane=odh-model-controller
                    hypothesis:
                      description: "Nightly pod kill test"
                      recoveryTimeout: 120s
                    steadyState:
                      checks:
                        - type: conditionTrue
                          apiVersion: apps/v1
                          kind: Deployment
                          name: odh-model-controller
                          namespace: opendatahub
                          conditionType: Available
                      timeout: "30s"
                    blastRadius:
                      maxPodsAffected: 1
                      allowedNamespaces:
                        - opendatahub
                  EOF
          restartPolicy: OnFailure
Note: The ServiceAccount needs RBAC to create ChaosExperiment CRs.
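One way to grant that access is with kubectl create role and kubectl create rolebinding. This is a minimal sketch under stated assumptions: the role name chaosexperiment-creator and the helper name grant_experiment_create are hypothetical, and the plural resource name chaosexperiments in API group chaos.operatorchaos.io is inferred from the manifests in this guide. The get verb is included for clients that read the object before writing; it can be dropped if the job only creates new objects.

```shell
# Hypothetical helper: let a ServiceAccount create ChaosExperiment CRs in one namespace.
# Usage: grant_experiment_create <namespace> <serviceaccount>
grant_experiment_create() {
  # Role limited to ChaosExperiment objects in the given namespace
  kubectl create role chaosexperiment-creator -n "$1" \
    --verb=create,get \
    --resource=chaosexperiments.chaos.operatorchaos.io || return 1
  # Bind the Role to the ServiceAccount used by the CronJob
  kubectl create rolebinding chaosexperiment-creator -n "$1" \
    --role=chaosexperiment-creator \
    --serviceaccount="$1:$2"
}

# Example (requires cluster access):
#   grant_experiment_create operator-chaos-experiments chaos-job-runner
```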
GitOps Integration
Store experiments in Git and sync with Argo CD or Flux:
# argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-experiments
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/chaos-experiments
    path: experiments/odh-model-controller
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: operator-chaos-experiments
  syncPolicy:
    automated:
      prune: true
      selfHeal: false  # Don't auto-heal to preserve experiment history
Emergency Stop
If experiments are stuck or the controller is misbehaving:
# Delete the controller deployment (stops new experiments)
kubectl delete deployment operator-chaos-controller -n operator-chaos-system
# Use the CLI to clean up faults manually
operator-chaos clean --namespace <namespace>
Troubleshooting
Experiment stuck in Pending
Check controller logs:
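A minimal sketch for pulling recent log lines, assuming the controller runs as the operator-chaos-controller Deployment in the operator-chaos-system namespace (the names used in the Emergency Stop section). The helper name controller_logs and the default of 100 lines are hypothetical choices.

```shell
# Hypothetical helper: print the most recent controller log lines.
# Usage: controller_logs [tail-lines]
controller_logs() {
  kubectl logs deployment/operator-chaos-controller \
    -n operator-chaos-system --tail="${1:-100}"
}

# Example (requires cluster access). Filtering by experiment name assumes
# the controller includes that name in its log lines:
#   controller_logs 500 | grep my-experiment
```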
Common causes:
- Validation error (missing knowledge model, unknown injection type)
- Failed to acquire lock (another experiment is running on same operator)
- RBAC permissions missing
Experiment stuck in Observing
The controller is waiting for the recovery timeout to elapse. Check:
kubectl get chaosexperiment my-experiment -o jsonpath='{.spec.hypothesis.recoveryTimeout}'
# 120s
# Check how long we've been observing
kubectl get chaosexperiment my-experiment -o jsonpath='{.status.injectionStartedAt}'
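To turn injectionStartedAt into an elapsed time that can be compared with recoveryTimeout, the timestamps can be converted to epoch seconds. The sketch below assumes GNU date (its -d option parses the RFC 3339 timestamps shown in the status); seconds_between and observing_for are hypothetical helper names.

```shell
# Seconds between two RFC 3339 timestamps (GNU date syntax).
# Usage: seconds_between <start> <end>
seconds_between() {
  echo $(( $(date -u -d "$2" +%s) - $(date -u -d "$1" +%s) ))
}

# Seconds elapsed since the fault was injected.
# Usage: observing_for <name> <namespace>
observing_for() {
  started=$(kubectl get chaosexperiment "$1" -n "$2" \
    -o jsonpath='{.status.injectionStartedAt}') || return 1
  # An empty value means the fault has not been injected yet
  [ -n "$started" ] || return 1
  seconds_between "$started" "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

# Example (requires cluster access):
#   observing_for my-experiment operator-chaos-experiments
```

If the printed value is well above the configured recoveryTimeout, the controller logs are the next place to look.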
Verdict is Inconclusive
The pre-injection steady-state check failed. Check:
kubectl get chaosexperiment my-experiment -o jsonpath='{.status.steadyStatePre}'
# {"passed":false,"checksRun":1,"checksPassed":0,"details":[...]}
Verify the target resource is healthy before running the experiment.
Finalizer not removed
If an experiment is stuck deleting with a finalizer:
# Check phase
kubectl get chaosexperiment my-experiment -o jsonpath='{.status.phase}'
# If phase is Complete or Aborted, force-remove finalizer
kubectl patch chaosexperiment my-experiment -p '{"metadata":{"finalizers":[]}}' --type=merge
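The two steps above can be combined so the finalizer is only removed when the phase is terminal. This is a sketch, not a supported tool: remove_finalizer_if_terminal is a hypothetical helper name, and it takes the experiment's namespace as an explicit argument.

```shell
# Hypothetical helper: force-remove finalizers only if the experiment is in a
# terminal phase (Complete or Aborted).
# Usage: remove_finalizer_if_terminal <name> <namespace>
remove_finalizer_if_terminal() {
  phase=$(kubectl get chaosexperiment "$1" -n "$2" \
    -o jsonpath='{.status.phase}') || return 1
  case "$phase" in
    Complete|Aborted)
      kubectl patch chaosexperiment "$1" -n "$2" --type=merge \
        -p '{"metadata":{"finalizers":[]}}'
      ;;
    *)
      # A fault may still be active; leave cleanup to the controller
      echo "phase is '$phase'; leaving finalizer in place" >&2
      return 1
      ;;
  esac
}

# Example (requires cluster access):
#   remove_finalizer_if_terminal my-experiment operator-chaos-experiments
```

Refusing to patch in non-terminal phases avoids deleting a CR whose fault has not yet been reverted.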
Next Steps
- Learn about Knowledge Models to define operator semantics
- See Failure Modes for all available fault types
- Read CI Integration Guide for pipeline integration