# dashboard Failure Modes

## Coverage

| Injection Type | Danger | Experiment | Description |
|---|---|---|---|
| ConfigDrift | high | config-drift.yaml | When the kube-rbac-proxy configuration is corrupted, the RBAC proxy sidecar should reject or misconfigure authorization decisions. |
| NetworkPartition | medium | network-partition.yaml | When odh-dashboard pods are network-partitioned from the API server, the dashboard UI should become unavailable as the kube-rbac-proxy sidecar cannot verify authentication. |
| PodKill | low | pod-kill.yaml | When one odh-dashboard pod is killed, the remaining replica should continue serving traffic. |
| QuotaExhaustion | medium | quota-exhaustion.yaml | Exhausting pod quota in the dashboard namespace should prevent new pods from being created. |
| RBACRevoke | high | rbac-revoke.yaml | When the odh-dashboard ClusterRoleBinding subjects are revoked, the dashboard should lose access to cluster-scoped resources like storage classes, nodes, and namespaces. |
| CRDMutation | high | route-backend-disruption.yaml | Changing the Route backend service name to a non-existent service simulates route admission denial at the backend level. |
| CRDMutation | high | route-host-collision.yaml | Mutating the dashboard Route host to a non-matching domain simulates a host collision or DNS misconfiguration. |
| CRDMutation | high | route-host-deletion.yaml | Deleting the Route host field via null merge patch removes the host assignment from the Route. |
| CRDMutation | high | route-shard-mismatch.yaml | Setting spec.host to a domain that does not match any configured IngressController's domain simulates a router shard misconfiguration. |
| CRDMutation | high | route-tls-mutation.yaml | Changing the TLS termination mode from "edge" or "reencrypt" to "passthrough" forces the router to stop terminating TLS and forward encrypted traffic directly to the backend pod. |
## Experiment Details

### dashboard-config-drift
- Type: ConfigDrift
- Danger Level: high
- Component: odh-dashboard
When the kube-rbac-proxy configuration is corrupted, the RBAC proxy sidecar should reject or misconfigure authorization decisions. The dashboard pods may need to be restarted to pick up the restored config. The operator should detect the drift and reconcile.
Experiment YAML

```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: dashboard-config-drift
spec:
  tier: 2
  target:
    operator: dashboard
    component: odh-dashboard
    resource: ConfigMap/kube-rbac-proxy-config
  steadyState:
    checks:
      - type: resourceExists
        apiVersion: v1
        kind: ConfigMap
        name: kube-rbac-proxy-config
        namespace: opendatahub
        timeout: "30s"
  injection:
    type: ConfigDrift
    dangerLevel: high
    parameters:
      name: kube-rbac-proxy-config
      key: config-file.yaml
      value: '{"authorization":{"static":[{"verb":"*","resource":"invalid"}]}}'
      resourceType: ConfigMap
    ttl: "300s"
  hypothesis:
    description: >-
      When the kube-rbac-proxy configuration is corrupted, the RBAC proxy
      sidecar should reject or misconfigure authorization decisions. The
      dashboard pods may need to be restarted to pick up the restored
      config. The operator should detect the drift and reconcile.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 2
    allowedNamespaces:
      - opendatahub
    allowDangerous: true
```
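For reference, the drifted ConfigMap produced by the `name`, `key`, and `value` parameters above would look roughly like this (a sketch; any other keys in the real ConfigMap are left untouched and are not shown):

```yaml
# State of the ConfigMap while the injection is active (sketch).
# Only the targeted key is overwritten with the corrupt value.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-rbac-proxy-config
  namespace: opendatahub
data:
  config-file.yaml: '{"authorization":{"static":[{"verb":"*","resource":"invalid"}]}}'
```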
### dashboard-network-partition
- Type: NetworkPartition
- Danger Level: medium
- Component: odh-dashboard
When odh-dashboard pods are network-partitioned from the API server, the dashboard UI should become unavailable as the kube-rbac-proxy sidecar cannot verify authentication. Once the partition is removed, the dashboard should resume serving without manual intervention.
Experiment YAML

```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: dashboard-network-partition
spec:
  tier: 2
  target:
    operator: dashboard
    component: odh-dashboard
    resource: Deployment/odh-dashboard
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-dashboard
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: app=odh-dashboard
    ttl: "300s"
  hypothesis:
    description: >-
      When odh-dashboard pods are network-partitioned from the API server,
      the dashboard UI should become unavailable as the kube-rbac-proxy
      sidecar cannot verify authentication. Once the partition is removed,
      the dashboard should resume serving without manual intervention.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 2
    allowedNamespaces:
      - opendatahub
```
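The experiment does not show how the partition is implemented. One common mechanism is a deny-all egress NetworkPolicy applied to the selected pods; a sketch under that assumption (the policy name is hypothetical):

```yaml
# Hypothetical deny-all egress policy implementing the partition.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-partition-odh-dashboard   # hypothetical name
  namespace: opendatahub
spec:
  podSelector:
    matchLabels:
      app: odh-dashboard   # mirrors the injection's labelSelector
  policyTypes:
    - Egress
  egress: []   # no egress rules listed, so all outbound traffic is denied
```

Whether an egress policy actually blocks traffic to the API server depends on the cluster's CNI plugin and its policy enforcement.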
### dashboard-pod-kill
- Type: PodKill
- Danger Level: low
- Component: odh-dashboard
When one odh-dashboard pod is killed, the remaining replica should continue serving traffic. Kubernetes should recreate the killed pod within the recovery timeout and the deployment should return to 2/2 ready replicas.
Experiment YAML

```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: dashboard-pod-kill
spec:
  tier: 1
  target:
    operator: dashboard
    component: odh-dashboard
    resource: Deployment/odh-dashboard
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-dashboard
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app=odh-dashboard
      count: 1
    ttl: "300s"
  hypothesis:
    description: >-
      When one odh-dashboard pod is killed, the remaining replica should
      continue serving traffic. Kubernetes should recreate the killed pod
      within the recovery timeout and the deployment should return to 2/2
      ready replicas.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
```
### dashboard-quota-exhaustion
- Type: QuotaExhaustion
- Danger Level: medium
- Component: odh-dashboard
Exhausting pod quota in the dashboard namespace should prevent new pods from being created. The operator should handle quota errors gracefully and recover when the quota is removed.
Experiment YAML

```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: dashboard-quota-exhaustion
spec:
  tier: 5
  target:
    operator: dashboard
    component: odh-dashboard
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-dashboard
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: QuotaExhaustion
    parameters:
      quotaName: "chaos-quota-dashboard"
      pods: "0"
    ttl: "120s"
  hypothesis:
    description: >-
      Exhausting pod quota in the dashboard namespace should prevent new pods
      from being created. The operator should handle quota errors gracefully
      and recover when the quota is removed.
    recoveryTimeout: 60s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
```
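The `quotaName` and `pods` parameters suggest the injection creates a ResourceQuota like the following (a sketch; the object the chaos operator actually creates may carry additional labels or annotations):

```yaml
# Hard pod quota of 0: no new pods can be created in the namespace,
# but pods that already exist keep running.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: chaos-quota-dashboard
  namespace: opendatahub
spec:
  hard:
    pods: "0"
```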
### dashboard-rbac-revoke
- Type: RBACRevoke
- Danger Level: high
- Component: odh-dashboard
When the odh-dashboard ClusterRoleBinding subjects are revoked, the dashboard should lose access to cluster-scoped resources like storage classes, nodes, and namespaces. API calls from the UI should return 403 errors. Once permissions are restored, normal operation should resume without restart.
Experiment YAML

```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: dashboard-rbac-revoke
spec:
  tier: 4
  target:
    operator: dashboard
    component: odh-dashboard
    resource: ClusterRoleBinding/odh-dashboard
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-dashboard
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: odh-dashboard
      bindingType: ClusterRoleBinding
    ttl: "60s"
  hypothesis:
    description: >-
      When the odh-dashboard ClusterRoleBinding subjects are revoked, the
      dashboard should lose access to cluster-scoped resources like storage
      classes, nodes, and namespaces. API calls from the UI should return
      403 errors. Once permissions are restored, normal operation should
      resume without restart.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 2
    allowDangerous: true
```
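Revoking the subjects presumably leaves the binding in place with an empty subject list, since `roleRef` on a ClusterRoleBinding is immutable and cannot be cleared in place. A sketch of the binding while the injection is active (the `roleRef` name is an assumption):

```yaml
# ClusterRoleBinding during the injection (sketch): subjects removed,
# roleRef preserved as-is.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: odh-dashboard
subjects: []   # service account removed; dashboard API calls now get 403s
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: odh-dashboard   # assumed role name; the real roleRef is untouched
```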
### dashboard-route-backend-disruption
- Type: CRDMutation
- Danger Level: high
- Component: odh-dashboard
Changing the Route backend service name to a non-existent service simulates route admission denial at the backend level. The Route is still admitted by the router, but all requests return 503 because the backend cannot be found. The operator should detect the broken backend reference and reconcile the Route to point to the correct service. Expected verdict: Resilient if the operator restores the backend, Vulnerable if requests continue to fail.
Experiment YAML

```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: dashboard-route-backend-disruption
spec:
  tier: 3
  target:
    operator: dashboard
    component: odh-dashboard
    resource: Route/odh-dashboard
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-dashboard
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: CRDMutation
    dangerLevel: high
    parameters:
      apiVersion: "route.openshift.io/v1"
      kind: "Route"
      name: "odh-dashboard"
      path: "spec.to.name"
      value: "chaos-nonexistent-service"
    ttl: "300s"
  hypothesis:
    description: >-
      Changing the Route backend service name to a non-existent service
      simulates route admission denial at the backend level. The Route
      is still admitted by the router, but all requests return 503
      because the backend cannot be found. The operator should detect
      the broken backend reference and reconcile the Route to point to
      the correct service. Expected verdict: Resilient if the operator
      restores the backend, Vulnerable if requests continue to fail.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    allowDangerous: true
```
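In merge-patch terms, the `path`/`value` pair above corresponds to a patch body like the following (assuming the CRDMutation injection applies JSON merge-patch semantics):

```yaml
# Effective patch on the Route: spec.to.name now references a service
# that does not exist, so the router returns 503 for every request.
spec:
  to:
    name: chaos-nonexistent-service
```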
### dashboard-route-host-collision
- Type: CRDMutation
- Danger Level: high
- Component: odh-dashboard
Mutating the dashboard Route host to a non-matching domain simulates a host collision or DNS misconfiguration. The OpenShift router will re-evaluate the Route and may reject or de-admit it. The RHOAI operator should detect the Route drift and reconcile the host back to its correct value. Expected verdict: Resilient if the operator restores the Route, Vulnerable if the Route remains misconfigured.
Experiment YAML

```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: dashboard-route-host-collision
spec:
  tier: 3
  target:
    operator: dashboard
    component: odh-dashboard
    resource: Route/odh-dashboard
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-dashboard
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: CRDMutation
    dangerLevel: high
    parameters:
      apiVersion: "route.openshift.io/v1"
      kind: "Route"
      name: "odh-dashboard"
      path: "spec.host"
      value: "chaos-collision.apps.cluster.invalid"
    ttl: "300s"
  hypothesis:
    description: >-
      Mutating the dashboard Route host to a non-matching domain simulates
      a host collision or DNS misconfiguration. The OpenShift router will
      re-evaluate the Route and may reject or de-admit it. The RHOAI
      operator should detect the Route drift and reconcile the host back
      to its correct value. Expected verdict: Resilient if the operator
      restores the Route, Vulnerable if the Route remains misconfigured.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    allowDangerous: true
```
### dashboard-route-host-deletion
- Type: CRDMutation
- Danger Level: high
- Component: odh-dashboard
Deleting the Route host field via null merge patch removes the host assignment from the Route. The OpenShift router de-admits the Route since it has no host to serve, making the dashboard unreachable. The operator should detect the missing host and restore the Route to its original configuration. This indirectly tests status clearing: without a host, the router clears the Route's admission status. Expected verdict: Resilient if the operator restores the host, Vulnerable if the Route remains without a host.
Experiment YAML

```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: dashboard-route-host-deletion
spec:
  tier: 3
  target:
    operator: dashboard
    component: odh-dashboard
    resource: Route/odh-dashboard
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-dashboard
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: CRDMutation
    dangerLevel: high
    parameters:
      apiVersion: "route.openshift.io/v1"
      kind: "Route"
      name: "odh-dashboard"
      path: "spec.host"
      value: "null"
    ttl: "300s"
  hypothesis:
    description: >-
      Deleting the Route host field via null merge patch removes the
      host assignment from the Route. The OpenShift router de-admits
      the Route since it has no host to serve, making the dashboard
      unreachable. The operator should detect the missing host and
      restore the Route to its original configuration. This indirectly
      tests status clearing: without a host, the router clears the
      Route's admission status. Expected verdict: Resilient if the
      operator restores the host, Vulnerable if the Route remains
      without a host.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    allowDangerous: true
```
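The string value `"null"` is interpreted as a JSON null, which under JSON merge-patch semantics (RFC 7386) deletes the field rather than setting it to a value. The effective patch body (assuming the CRDMutation injection uses merge patch):

```yaml
# JSON merge patch: a null value removes spec.host from the Route.
spec:
  host: null
```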
### dashboard-route-shard-mismatch
- Type: CRDMutation
- Danger Level: high
- Component: odh-dashboard
Setting spec.host to a domain that does not match any configured IngressController's domain simulates a router shard misconfiguration. Unlike host-collision (which uses a cluster-like domain), this targets a completely non-routable local domain. No IngressController will claim the Route, making the dashboard unreachable through the orphaned host. The operator should detect the Route drift and restore the original host. Expected verdict: Resilient if the operator restores the host, Vulnerable if the Route remains orphaned on a non-existent shard.
Experiment YAML

```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: dashboard-route-shard-mismatch
spec:
  tier: 3
  target:
    operator: dashboard
    component: odh-dashboard
    resource: Route/odh-dashboard
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-dashboard
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: CRDMutation
    dangerLevel: high
    parameters:
      apiVersion: "route.openshift.io/v1"
      kind: "Route"
      name: "odh-dashboard"
      path: "spec.host"
      value: "dashboard.nonexistent-shard.local"
    ttl: "300s"
  hypothesis:
    description: >-
      Setting spec.host to a domain that does not match any configured
      IngressController's domain simulates a router shard misconfiguration.
      Unlike host-collision (which uses a cluster-like domain), this
      targets a completely non-routable local domain. No IngressController
      will claim the Route, making the dashboard unreachable through the
      orphaned host. The operator should detect the Route drift and
      restore the original host. Expected verdict: Resilient if the
      operator restores the host, Vulnerable if the Route remains
      orphaned on a non-existent shard.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    allowDangerous: true
```
### dashboard-route-tls-mutation
- Type: CRDMutation
- Danger Level: high
- Component: odh-dashboard
Changing the TLS termination mode from "edge" or "reencrypt" to "passthrough" forces the router to stop terminating TLS and forward encrypted traffic directly to the backend pod. Since the dashboard pod likely does not serve TLS on its own, this breaks HTTPS access. The operator should detect the TLS config drift and restore the correct termination mode. Expected verdict: Resilient if the operator reconciles the TLS settings, Vulnerable if the Route remains broken.
Experiment YAML

```yaml
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: dashboard-route-tls-mutation
spec:
  tier: 3
  target:
    operator: dashboard
    component: odh-dashboard
    resource: Route/odh-dashboard
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-dashboard
        namespace: opendatahub
        conditionType: Available
        timeout: "30s"
  injection:
    type: CRDMutation
    dangerLevel: high
    parameters:
      apiVersion: "route.openshift.io/v1"
      kind: "Route"
      name: "odh-dashboard"
      path: "spec.tls.termination"
      value: "passthrough"
    ttl: "300s"
  hypothesis:
    description: >-
      Changing the TLS termination mode from "edge" or "reencrypt" to
      "passthrough" forces the router to stop terminating TLS and forward
      encrypted traffic directly to the backend pod. Since the dashboard
      pod likely does not serve TLS on its own, this breaks HTTPS access.
      The operator should detect the TLS config drift and restore the
      correct termination mode. Expected verdict: Resilient if the
      operator reconciles the TLS settings, Vulnerable if the Route
      remains broken.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    allowDangerous: true
```