# Troubleshooting
Common issues and solutions when running chaos experiments.
## Experiment Fails to Load

Symptom: `Error: failed to load experiment: ...`
Causes:
- YAML syntax errors (indentation, missing quotes)
- Invalid injection type name
- Missing required parameters
Fix:

```shell
# Validate YAML syntax
yamllint my-experiment.yaml

# Dry-run validates structure without executing
operator-chaos run my-experiment.yaml --dry-run --verbose
```
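If the dry-run reports missing fields, compare your file against a minimal skeleton. The sketch below is assembled from the fields shown elsewhere on this page (`steadyState.timeout`, `blastRadius`, `hypothesis.recoveryTimeout`, and the `ChaosExperiment` kind); the `apiVersion` and any field not quoted on this page are illustrative assumptions, so check the Failure Modes Reference for the authoritative schema.

```yaml
# Minimal experiment sketch. Only steadyState, blastRadius, and
# hypothesis.recoveryTimeout appear elsewhere in these docs; the
# apiVersion below is a hypothetical placeholder.
apiVersion: operatorchaos.io/v1alpha1   # assumed API group/version
kind: ChaosExperiment
metadata:
  name: my-experiment
spec:
  steadyState:
    timeout: "30s"
  blastRadius:
    maxPodsAffected: 5
    allowedNamespaces:
      - opendatahub
  hypothesis:
    recoveryTimeout: "120s"
```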
## Steady-State Check Fails Before Injection

Symptom: `Error: steady-state check failed before injection`
The target component isn't healthy before the experiment starts. The framework won't inject faults into an already-broken system.
Fix:

```shell
# Verify the component is healthy
kubectl get deployment -n opendatahub
kubectl describe deployment my-controller -n opendatahub

# Check that the knowledge model matches your cluster
cat knowledge/my-operator.yaml

# Increase the steady-state timeout if the component is slow to report ready.
# In your experiment YAML:
#   steadyState:
#     timeout: "60s"  # increase from the default 30s
```
## Blast Radius Violation

Symptom: `Error: blast radius check failed: ...`
Common causes:

- Injection would affect more pods than `maxPodsAffected`
- Target namespace not listed in `allowedNamespaces`
- High-danger injection (RBACRevoke, WebhookDisrupt, CRDMutation on Routes) without `allowDangerous: true`
Fix:

```yaml
blastRadius:
  maxPodsAffected: 5
  allowedNamespaces:
    - opendatahub
    - test-namespace
  allowDangerous: true  # required for high-danger injections
```
## Permission Denied

Symptom: `Error: ... is forbidden: User "..." cannot ...`
The CLI or controller ServiceAccount lacks RBAC permissions for the injection type.
Fix:

```shell
# Check your permissions
kubectl auth can-i delete pods -n opendatahub
kubectl auth can-i update configmaps -n opendatahub

# For controller mode, check the ServiceAccount
kubectl get clusterrole operator-chaos-role -o yaml
kubectl get clusterrolebinding operator-chaos-binding -o yaml
```
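If `kubectl auth can-i` answers `no`, the ClusterRole bound to the ServiceAccount is missing rules for the verbs that injection type needs. The sketch below uses the `operator-chaos-role` name from the command above, but the rule contents are illustrative assumptions covering only the two checks shown; the verbs actually required depend on which injection types you run.

```yaml
# Illustrative sketch, not the shipped role: grants the pod-delete and
# configmap-update permissions checked with `kubectl auth can-i` above.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: operator-chaos-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "update"]
```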
## Cleanup Doesn't Complete

Symptom: resources still carry chaos annotations or labels after an experiment completes or is interrupted.
Fix:

```shell
# Automated cleanup scans for orphaned chaos artifacts
operator-chaos clean --namespace opendatahub

# Check for resources with chaos metadata
kubectl get all -n opendatahub \
  -l "chaos.operatorchaos.io/injected=true"

# Manual annotation removal (last resort)
kubectl annotate deployment my-controller \
  chaos.operatorchaos.io/rollback-data- \
  -n opendatahub
```
## Controller Mode: Experiment Stuck in Pending
Common causes:
- Missing knowledge model for the target operator
- Another experiment holds the distributed lock on the same operator
- Controller RBAC missing
Fix:

```shell
# Check controller logs
kubectl logs -n operator-chaos-system deployment/operator-chaos-controller

# Check for active locks
kubectl get leases -n operator-chaos-system
```
See the Controller Advanced Guide for more controller-specific troubleshooting.
## Controller Mode: Experiment Stuck in Observing
The controller is waiting for the recovery timeout to elapse. This is normal behavior.
```shell
# Check the configured timeout
kubectl get chaosexperiment my-experiment -o jsonpath='{.spec.hypothesis.recoveryTimeout}'

# Check when injection started
kubectl get chaosexperiment my-experiment -o jsonpath='{.status.injectionStartedAt}'
```
## Knowledge Model Mismatch
Symptom: Preflight passes but experiments fail because resource names, namespaces, or labels don't match what's actually on the cluster.
Fix:

```shell
# Run preflight to validate the knowledge model against the cluster
operator-chaos preflight --knowledge knowledge/my-operator.yaml

# Compare expected vs. actual resource names
kubectl get deployment -n opendatahub -o name
kubectl get configmap -n opendatahub -o name
```
Knowledge models have environment-specific overlays in `knowledge/rhoai/` for RHOAI deployments. Make sure you're using the right overlay for your environment.
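The knowledge model schema is project-specific, so the excerpt below is purely a hypothetical illustration of the kind of values that must agree with the live cluster; every field name and value here is an assumption, not the real format.

```yaml
# Hypothetical knowledge model excerpt -- illustrates the kind of
# drift preflight catches, not the actual schema.
operator:
  name: my-operator
  namespace: opendatahub        # must match the installed namespace
  deployments:
    - name: my-controller       # must match `kubectl get deployment -o name`
  selector:
    labels:
      app: my-controller        # must match the pod labels on the cluster
```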
## Getting Help
- GitHub Issues for bug reports
- CLI Commands Reference for command syntax
- Failure Modes Reference for injection type parameters