model-registry Failure Modes¶
Coverage¶
| Injection Type | Danger | Experiment | Description |
|---|---|---|---|
| FinalizerBlock | low | finalizer-block.yaml | When a stuck finalizer prevents a ModelRegistry from being deleted, the operator... |
| NetworkPartition | medium | network-partition.yaml | When the model-registry-operator pod is network-partitioned from the API server,... |
| PodKill | low | pod-kill.yaml | When the model-registry-operator pod is killed, Kubernetes should recreate it wi... |
| RBACRevoke | high | rbac-revoke.yaml | When the model-registry-operator ClusterRoleBinding subjects are revoked, the op... |
| CRDMutation | high | route-backend-disruption.yaml | Changing the model-registry Route backend service to a non-existent service simu... |
| CRDMutation | high | route-host-collision.yaml | Mutating the model-registry REST API Route host simulates a host collision or DN... |
| CRDMutation | high | route-tls-mutation.yaml | Changing the TLS termination mode on the model-registry REST API Route from edge... |
| WebhookDisrupt | high | webhook-disrupt.yaml | When the ModelRegistry validating webhook failurePolicy is weakened from Fail to... |
Experiment Details¶
model-registry-finalizer-block¶
- Type: FinalizerBlock
- Danger Level: low
- Component: model-registry-operator
When a stuck finalizer prevents a ModelRegistry from being deleted, the operator should handle the Terminating state gracefully, report the blocked deletion in its status, and not leak associated database or service resources. The chaos framework removes the finalizer via TTL-based cleanup after 300s.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: model-registry-finalizer-block
spec:
tier: 3
target:
operator: model-registry
component: model-registry-operator
resource: ModelRegistry
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: model-registry-operator-controller-manager
namespace: odh-model-registries
conditionType: Available
timeout: "30s"
injection:
type: FinalizerBlock
# IMPORTANT: "test-registry" is a placeholder. Replace it with the name
# of an actual ModelRegistry resource deployed in the target namespace
# before running this experiment.
parameters:
apiVersion: modelregistry.opendatahub.io/v1alpha1
kind: ModelRegistry
name: test-registry
finalizer: modelregistry.opendatahub.io/finalizer
ttl: "300s"
hypothesis:
description: >-
When a stuck finalizer prevents a ModelRegistry from being deleted,
the operator should handle the Terminating state gracefully, report
the blocked deletion in its status, and not leak associated database
or service resources. The chaos framework removes the finalizer via
TTL-based cleanup after 300s.
recoveryTimeout: 180s
blastRadius:
maxPodsAffected: 1
allowedNamespaces:
- odh-model-registries
model-registry-network-partition¶
- Type: NetworkPartition
- Danger Level: medium
- Component: model-registry-operator
When the model-registry-operator pod is network-partitioned from the API server, it should lose its leader lease and stop reconciling. Once the partition is removed, the operator should re-acquire the lease and resume normal operation.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: model-registry-network-partition
spec:
tier: 2
target:
operator: model-registry
component: model-registry-operator
resource: Deployment/model-registry-operator-controller-manager
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: model-registry-operator-controller-manager
namespace: odh-model-registries
conditionType: Available
timeout: "30s"
injection:
type: NetworkPartition
parameters:
labelSelector: control-plane=model-registry-operator
ttl: "300s"
hypothesis:
description: >-
When the model-registry-operator pod is network-partitioned from
the API server, it should lose its leader lease and stop reconciling.
Once the partition is removed, the operator should re-acquire the
lease and resume normal operation.
recoveryTimeout: 180s
blastRadius:
maxPodsAffected: 1
allowedNamespaces:
- odh-model-registries
model-registry-pod-kill¶
- Type: PodKill
- Danger Level: low
- Component: model-registry-operator
When the model-registry-operator pod is killed, Kubernetes should recreate it within the recovery timeout. The operator should resume reconciling ModelRegistry resources without data loss or registry downtime.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: model-registry-pod-kill
spec:
tier: 1
target:
operator: model-registry
component: model-registry-operator
resource: Deployment/model-registry-operator-controller-manager
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: model-registry-operator-controller-manager
namespace: odh-model-registries
conditionType: Available
timeout: "30s"
injection:
type: PodKill
parameters:
labelSelector: control-plane=model-registry-operator
count: 1
ttl: "300s"
hypothesis:
description: >-
When the model-registry-operator pod is killed, Kubernetes should
recreate it within the recovery timeout. The operator should resume
reconciling ModelRegistry resources without data loss or registry
downtime.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowedNamespaces:
- odh-model-registries
model-registry-rbac-revoke¶
- Type: RBACRevoke
- Danger Level: high
- Component: model-registry-operator
When the model-registry-operator ClusterRoleBinding subjects are revoked, the operator should lose its ability to manage ModelRegistry instances and surface permission-denied errors. Once permissions are restored, reconciliation should resume without manual intervention.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: model-registry-rbac-revoke
spec:
tier: 4
target:
operator: model-registry
component: model-registry-operator
resource: ClusterRoleBinding/model-registry-operator-manager-rolebinding
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: model-registry-operator-controller-manager
namespace: odh-model-registries
conditionType: Available
timeout: "30s"
injection:
type: RBACRevoke
dangerLevel: high
parameters:
bindingName: model-registry-operator-manager-rolebinding
bindingType: ClusterRoleBinding
ttl: "60s"
hypothesis:
description: >-
When the model-registry-operator ClusterRoleBinding subjects are
revoked, the operator should lose its ability to manage ModelRegistry
instances and surface permission-denied errors. Once permissions are
restored, reconciliation should resume without manual intervention.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowDangerous: true
model-registry-route-backend-disruption¶
- Type: CRDMutation
- Danger Level: high
- Component: model-registry-operator
Changing the model-registry Route backend service to a non-existent service simulates backend disruption. All API requests return 503. The operator should detect the broken backend reference and reconcile the Route to point to the correct service. Expected verdict: Resilient if the operator restores the backend, Vulnerable if the REST API continues to fail.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: model-registry-route-backend-disruption
spec:
tier: 3
target:
operator: model-registry
component: model-registry-operator
resource: Route/model-registry-operator-rest
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: model-registry-operator-controller-manager
namespace: odh-model-registries
conditionType: Available
timeout: "30s"
injection:
type: CRDMutation
dangerLevel: high
parameters:
apiVersion: "route.openshift.io/v1"
kind: "Route"
name: "model-registry-operator-rest"
path: "spec.to.name"
value: "chaos-nonexistent-service"
ttl: "300s"
hypothesis:
description: >-
Changing the model-registry Route backend service to a non-existent
service simulates backend disruption. All API requests return 503.
The operator should detect the broken backend reference and reconcile
the Route to point to the correct service. Expected verdict:
Resilient if the operator restores the backend, Vulnerable if the
REST API continues to fail.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowedNamespaces:
- odh-model-registries
allowDangerous: true
model-registry-route-host-collision¶
- Type: CRDMutation
- Danger Level: high
- Component: model-registry-operator
Mutating the model-registry REST API Route host simulates a host collision or DNS misconfiguration. The model-registry operator should detect the Route drift and reconcile the host back to its correct value. Expected verdict: Resilient if the operator restores the Route host, Vulnerable if the Route remains misconfigured and the REST API becomes unreachable.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: model-registry-route-host-collision
spec:
tier: 3
target:
operator: model-registry
component: model-registry-operator
resource: Route/model-registry-operator-rest
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: model-registry-operator-controller-manager
namespace: odh-model-registries
conditionType: Available
timeout: "30s"
injection:
type: CRDMutation
dangerLevel: high
parameters:
apiVersion: "route.openshift.io/v1"
kind: "Route"
name: "model-registry-operator-rest"
path: "spec.host"
value: "chaos-collision.apps.cluster.invalid"
ttl: "300s"
hypothesis:
description: >-
Mutating the model-registry REST API Route host simulates a
host collision or DNS misconfiguration. The model-registry operator
should detect the Route drift and reconcile the host back to its
correct value. Expected verdict: Resilient if the operator restores
the Route host, Vulnerable if the Route remains misconfigured and
the REST API becomes unreachable.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowedNamespaces:
- odh-model-registries
allowDangerous: true
model-registry-route-tls-mutation¶
- Type: CRDMutation
- Danger Level: high
- Component: model-registry-operator
Changing the TLS termination mode on the model-registry REST API Route from edge/reencrypt to passthrough breaks HTTPS access to the model-registry API. The operator should detect the TLS config drift and restore the correct termination mode. Expected verdict: Resilient if the operator reconciles the TLS settings, Vulnerable if the REST API becomes unreachable over HTTPS.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: model-registry-route-tls-mutation
spec:
tier: 3
target:
operator: model-registry
component: model-registry-operator
resource: Route/model-registry-operator-rest
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: model-registry-operator-controller-manager
namespace: odh-model-registries
conditionType: Available
timeout: "30s"
injection:
type: CRDMutation
dangerLevel: high
parameters:
apiVersion: "route.openshift.io/v1"
kind: "Route"
name: "model-registry-operator-rest"
path: "spec.tls.termination"
value: "passthrough"
ttl: "300s"
hypothesis:
description: >-
Changing the TLS termination mode on the model-registry REST API
Route from edge/reencrypt to passthrough breaks HTTPS access to
the model-registry API. The operator should detect the TLS config
drift and restore the correct termination mode. Expected verdict:
Resilient if the operator reconciles the TLS settings, Vulnerable
if the REST API becomes unreachable over HTTPS.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowedNamespaces:
- odh-model-registries
allowDangerous: true
model-registry-webhook-disrupt¶
- Type: WebhookDisrupt
- Danger Level: high
- Component: model-registry-operator
When the ModelRegistry validating webhook failurePolicy is weakened from Fail to Ignore, invalid ModelRegistry resources can bypass admission validation. The chaos framework restores the original failurePolicy via TTL-based cleanup after 60s.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
name: model-registry-webhook-disrupt
spec:
tier: 4
target:
operator: model-registry
component: model-registry-operator
resource: ValidatingWebhookConfiguration/vmodelregistry.opendatahub.io
steadyState:
checks:
- type: conditionTrue
apiVersion: apps/v1
kind: Deployment
name: model-registry-operator-controller-manager
namespace: odh-model-registries
conditionType: Available
timeout: "30s"
injection:
type: WebhookDisrupt
dangerLevel: high
parameters:
webhookName: vmodelregistry.opendatahub.io
action: setFailurePolicy
value: Ignore
ttl: "60s"
hypothesis:
description: >-
When the ModelRegistry validating webhook failurePolicy is weakened
from Fail to Ignore, invalid ModelRegistry resources can bypass
admission validation. The chaos framework restores the original
failurePolicy via TTL-based cleanup after 60s.
recoveryTimeout: 120s
blastRadius:
maxPodsAffected: 1
allowDangerous: true