# Troubleshooting
Common issues and solutions when running chaos experiments.
## Experiment Fails to Load

Symptom: `Error: failed to load experiment: ...`
Causes:
- YAML syntax errors (indentation, missing quotes)
- Invalid injection type name
- Missing required parameters
Fix:

```shell
# Validate YAML syntax
yamllint my-experiment.yaml

# Dry-run validates structure without executing
operator-chaos run my-experiment.yaml --dry-run --verbose
```
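If the dry-run reports missing fields, compare your file against a minimal skeleton. The sketch below is assembled from the fields shown elsewhere on this page (`steadyState.timeout`, `blastRadius`, `hypothesis.recoveryTimeout`, and the `ChaosExperiment` kind); the `apiVersion` and any field not quoted on this page are illustrative assumptions, so check the Failure Modes Reference for the authoritative schema.

```yaml
# Minimal experiment sketch. Only steadyState, blastRadius, and
# hypothesis.recoveryTimeout appear elsewhere in these docs; the
# apiVersion below is a hypothetical placeholder.
apiVersion: operatorchaos.io/v1alpha1   # assumed API group/version
kind: ChaosExperiment
metadata:
  name: my-experiment
spec:
  steadyState:
    timeout: "30s"
  blastRadius:
    maxPodsAffected: 5
    allowedNamespaces:
      - opendatahub
  hypothesis:
    recoveryTimeout: "120s"
```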
## Steady-State Check Fails Before Injection

Symptom: `Error: steady-state check failed before injection`
The target component isn't healthy before the experiment starts. The framework won't inject faults into an already-broken system.
Fix:

```shell
# Verify the component is healthy
kubectl get deployment -n opendatahub
kubectl describe deployment my-controller -n opendatahub

# Check that the knowledge model matches your cluster
cat knowledge/my-operator.yaml

# Increase the steady-state timeout if the component is slow to report ready.
# In your experiment YAML:
#   steadyState:
#     timeout: "60s"  # increase from the default 30s
```
## Blast Radius Violation

Symptom: `Error: blast radius check failed: ...`
Common causes:

- Injection would affect more pods than `maxPodsAffected`
- Target namespace not listed in `allowedNamespaces`
- High-danger injection (RBACRevoke, WebhookDisrupt, CRDMutation on Routes) without `allowDangerous: true`
Fix:

```yaml
blastRadius:
  maxPodsAffected: 5
  allowedNamespaces:
    - opendatahub
    - test-namespace
  allowDangerous: true  # required for high-danger injections
```
## Permission Denied

Symptom: `Error: ... is forbidden: User "..." cannot ...`
The CLI or controller ServiceAccount lacks RBAC permissions for the injection type.
Fix:

```shell
# Check your permissions
kubectl auth can-i delete pods -n opendatahub
kubectl auth can-i update configmaps -n opendatahub

# For controller mode, check the ServiceAccount
kubectl get clusterrole operator-chaos-role -o yaml
kubectl get clusterrolebinding operator-chaos-binding -o yaml
```
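If `kubectl auth can-i` answers `no`, the ClusterRole bound to the ServiceAccount is missing rules for the verbs that injection type needs. The sketch below uses the `operator-chaos-role` name from the command above, but the rule contents are illustrative assumptions covering only the two checks shown; the verbs actually required depend on which injection types you run.

```yaml
# Illustrative sketch, not the shipped role: grants the pod-delete and
# configmap-update permissions checked with `kubectl auth can-i` above.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: operator-chaos-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "update"]
```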
## Cleanup Doesn't Complete

Symptom: resources still carry chaos annotations or labels after an experiment completes or is interrupted.
Fix:

```shell
# Automated cleanup scans for orphaned chaos artifacts
operator-chaos clean --namespace opendatahub

# Check for resources with chaos metadata
kubectl get all -n opendatahub \
  -l "chaos.operatorchaos.io/injected=true"

# Manual annotation removal (last resort)
kubectl annotate deployment my-controller \
  chaos.operatorchaos.io/rollback-data- \
  -n opendatahub
```
## Controller Mode: Experiment Stuck in Pending
Common causes:
- Missing knowledge model for the target operator
- Another experiment holds the distributed lock on the same operator
- Controller RBAC missing
Fix:

```shell
# Check controller logs
kubectl logs -n operator-chaos-system deployment/operator-chaos-controller

# Check for active locks
kubectl get leases -n operator-chaos-system
```
See the Controller Advanced Guide for more controller-specific troubleshooting.
## Controller Mode: Experiment Stuck in Observing
The controller is waiting for the recovery timeout to elapse. This is normal behavior.
```shell
# Check the configured timeout
kubectl get chaosexperiment my-experiment -o jsonpath='{.spec.hypothesis.recoveryTimeout}'

# Check when injection started
kubectl get chaosexperiment my-experiment -o jsonpath='{.status.injectionStartedAt}'
```
## Knowledge Model Mismatch
Symptom: Preflight passes but experiments fail because resource names, namespaces, or labels don't match what's actually on the cluster.
Fix:

```shell
# Run preflight to validate the knowledge model against the cluster
operator-chaos preflight --knowledge knowledge/my-operator.yaml

# Compare expected vs. actual resource names
kubectl get deployment -n opendatahub -o name
kubectl get configmap -n opendatahub -o name
```
Knowledge models have environment-specific overlays in `knowledge/rhoai/` for RHOAI deployments. Make sure you're using the right overlay for your environment.
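The knowledge model schema is project-specific, so the excerpt below is purely a hypothetical illustration of the kind of values that must agree with the live cluster; every field name and value here is an assumption, not the real format.

```yaml
# Hypothetical knowledge model excerpt -- illustrates the kind of
# drift preflight catches, not the actual schema.
operator:
  name: my-operator
  namespace: opendatahub        # must match the installed namespace
  deployments:
    - name: my-controller       # must match `kubectl get deployment -o name`
  selector:
    labels:
      app: my-controller        # must match the pod labels on the cluster
```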
## Getting Help
- GitHub Issues for bug reports
- CLI Commands Reference for command syntax
- Failure Modes Reference for injection type parameters