odh-model-controller Failure Modes¶

Coverage¶

Injection Type	Danger	Experiment	Description
ConfigDrift	high	config-drift.yaml	When the inferenceservice-config ConfigMap is corrupted with an invalid deployme...
ClientFault	low	cr-deletion-mid-reconcile.yaml	Injecting intermittent "not found" errors with 2s delay on GET operations simula...
CRDMutation	medium	crd-mutation.yaml	InferenceService has no scalar top-level spec fields, so this experiment injects...
PodKill	low	dependency-kserve-kill.yaml	Killing the kserve-controller-manager (a dependency of odh-model-controller) sho...
FinalizerBlock	low	finalizer-block.yaml	When a stuck finalizer prevents an InferenceService from being deleted, the odh-...
ConfigDrift	high	ingress-config-corruption.yaml	When the ingress key in inferenceservice-config is emptied, the odh-model-contro...
LabelStomping	high	label-stomping.yaml	When a label used for resource discovery is overwritten on the odh-model-control...
CRDMutation	high	leader-lease-corrupt.yaml	Controller detects corrupted leader lease holderIdentity and re-elects leader wi...
NamespaceDeletion	high	namespace-deletion.yaml	When the operator's namespace is deleted, the operator should detect the loss an...
NetworkPartition	medium	network-partition.yaml	When the odh-model-controller pod is network-partitioned from the API server, it...
OwnerRefOrphan	medium	ownerref-orphan.yaml	Removing ownerReferences from the odh-model-controller Deployment should trigger...
PodKill	low	pod-kill.yaml	When the odh-model-controller pod is killed, Kubernetes should recreate it withi...
QuotaExhaustion	medium	quota-exhaustion.yaml	Creating a restrictive ResourceQuota that prevents pod creation should cause the...
RBACRevoke	high	rbac-revoke.yaml	When the odh-model-controller ClusterRoleBinding subjects are revoked, the contr...
ClientFault	low	sdk-api-throttle.yaml	When 30% of Get and 20% of List operations are throttled with 500ms-1s delays, t...
ClientFault	high	sdk-conflict-storm.yaml	When 70% of Update and 50% of Patch operations fail with conflict errors, the co...
ClientFault	low	sdk-watch-disconnect.yaml	When 40% of reconcile operations encounter watch channel closures, the controlle...
ConfigDrift	high	webhook-cert-corrupt.yaml	All 7 webhooks fail after TLS cert corruption; cert-manager or operator restores...
WebhookDisrupt	high	webhook-disrupt.yaml	When the validating webhook failurePolicy is weakened from Fail to Ignore, inval...
WebhookLatency	high	webhook-latency.yaml	Deploying a slow admission webhook (25s delay, just under the 30s API server tim...

Experiment Details¶

odh-model-controller-config-drift¶

Type: ConfigDrift
Danger Level: high
Component: odh-model-controller

When the inferenceservice-config ConfigMap is corrupted with an invalid deployment mode, the odh-model-controller should detect the misconfiguration and either fall back to defaults or surface clear error conditions rather than silently failing.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-config-drift
spec:
  tier: 2
  target:
    operator: odh-model-controller
    component: odh-model-controller
    resource: ConfigMap/inferenceservice-config
  steadyState:
    checks:
      - type: resourceExists
        apiVersion: v1
        kind: ConfigMap
        name: inferenceservice-config
        namespace: opendatahub
    timeout: "30s"
  injection:
    type: ConfigDrift
    dangerLevel: high
    parameters:
      name: inferenceservice-config
      key: deploy
      value: '{"defaultDeploymentMode":"INVALID_MODE"}'
      resourceType: ConfigMap
    ttl: "300s"
  hypothesis:
    description: >-
      When the inferenceservice-config ConfigMap is corrupted with an
      invalid deployment mode, the odh-model-controller should detect
      the misconfiguration and either fall back to defaults or surface
      clear error conditions rather than silently failing.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    allowDangerous: true

odh-model-controller-cr-deletion-mid-reconcile¶

Type: ClientFault
Danger Level: low
Component: odh-model-controller

Injecting intermittent "not found" errors with 2s delay on GET operations simulates CR deletion during active reconciliation. The controller should handle nil-pointer scenarios gracefully without panicking or crash-looping. This is a common source of bugs in poorly written controllers. Requires ChaosClient SDK integration in the target operator.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-cr-deletion-mid-reconcile
spec:
  tier: 3
  target:
    operator: odh-model-controller
    component: odh-model-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: ClientFault
    parameters:
      faults: '{"get":{"errorRate":0.5,"error":"not found","delay":"2s"}}'
      configMapName: "operator-chaos-cr-deletion"
    ttl: "120s"
  hypothesis:
    description: >-
      Injecting intermittent "not found" errors with 2s delay on GET operations
      simulates CR deletion during active reconciliation. The controller should
      handle nil-pointer scenarios gracefully without panicking or crash-looping.
      This is a common source of bugs in poorly written controllers. Requires
      ChaosClient SDK integration in the target operator.
    recoveryTimeout: 60s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

odh-model-controller-crd-mutation¶

Type: CRDMutation
Danger Level: medium
Component: odh-model-controller

InferenceService has no scalar top-level spec fields, so this experiment injects an unknown field ("chaosTest") via merge patch. The controller should reconcile without error and not propagate the unknown field to downstream resources. The expected verdict is Resilient — the controller ignores unknown fields gracefully. The chaos framework removes the injected field via TTL-based cleanup after 300s.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-crd-mutation
spec:
  tier: 3
  target:
    operator: odh-model-controller
    component: odh-model-controller
    resource: InferenceService
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: CRDMutation
    # NOTE: InferenceService has no scalar top-level spec fields that can be
    # trivially patched. Instead we inject an unknown field ("chaosTest") into
    # spec to trigger reconciliation. The controller should treat the unknown
    # field gracefully and not propagate it to downstream resources.
    #
    # IMPORTANT: Replace "test-isvc" with the name of an actual InferenceService
    # resource deployed in the target namespace before running this experiment.
    parameters:
      apiVersion: "serving.kserve.io/v1beta1"
      kind: "InferenceService"
      name: "test-isvc"
      field: "chaosTest"
      value: "injected"
    ttl: "300s"
  hypothesis:
    description: >-
      InferenceService has no scalar top-level spec fields, so this
      experiment injects an unknown field ("chaosTest") via merge patch.
      The controller should reconcile without error and not propagate
      the unknown field to downstream resources. The expected verdict is
      Resilient — the controller ignores unknown fields gracefully.
      The chaos framework removes the injected field via TTL-based cleanup
      after 300s.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

odh-model-controller-dependency-kserve-kill¶

Type: PodKill
Danger Level: low
Component: odh-model-controller

Killing the kserve-controller-manager (a dependency of odh-model-controller) should cause odh-model-controller to degrade gracefully instead of crash-looping. The controller should report appropriate status conditions and recover once kserve is restored.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-dependency-kserve-kill
spec:
  tier: 1
  target:
    operator: odh-model-controller
    component: odh-model-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: "control-plane=kserve-controller-manager"
    ttl: "300s"
  hypothesis:
    description: >-
      Killing the kserve-controller-manager (a dependency of odh-model-controller)
      should cause odh-model-controller to degrade gracefully instead of
      crash-looping. The controller should report appropriate status conditions
      and recover once kserve is restored.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

odh-model-controller-finalizer-block¶

Type: FinalizerBlock
Danger Level: low
Component: odh-model-controller

When a stuck finalizer prevents an InferenceService from being deleted, the odh-model-controller should handle the Terminating state gracefully, report the blocked deletion in its status, and not leak associated resources such as Routes or Services.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-finalizer-block
spec:
  tier: 3
  target:
    operator: odh-model-controller
    component: odh-model-controller
    resource: InferenceService
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: FinalizerBlock
    # IMPORTANT: "test-isvc" is a placeholder. Replace it with the name of an
    # actual InferenceService resource deployed in the target namespace before
    # running this experiment. The experiment targets a specific CR instance,
    # so a real resource name is required.
    parameters:
      apiVersion: serving.kserve.io/v1beta1
      kind: InferenceService
      name: test-isvc
      finalizer: odh.inferenceservice.finalizers
    ttl: "300s"
  hypothesis:
    description: >-
      When a stuck finalizer prevents an InferenceService from being
      deleted, the odh-model-controller should handle the Terminating
      state gracefully, report the blocked deletion in its status, and
      not leak associated resources such as Routes or Services.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

odh-model-controller-ingress-config-corruption¶

Type: ConfigDrift
Danger Level: high
Component: odh-model-controller

When the ingress key in inferenceservice-config is emptied, the odh-model-controller should detect the invalid configuration and surface error conditions rather than silently failing. The ConfigMap is not owned by this controller, so recovery depends on the DSCI/DSC operator or manual restoration. The chaos framework restores the original value via TTL-based cleanup after 300s.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-ingress-config-corruption
spec:
  tier: 2
  target:
    operator: odh-model-controller
    component: odh-model-controller
    resource: ConfigMap/inferenceservice-config
  steadyState:
    checks:
      - type: resourceExists
        apiVersion: v1
        kind: ConfigMap
        name: inferenceservice-config
        namespace: opendatahub
    timeout: "30s"
  injection:
    type: ConfigDrift
    dangerLevel: high
    parameters:
      name: "inferenceservice-config"
      key: "ingress"
      value: "{}"
      resourceType: "ConfigMap"
    ttl: "300s"
  hypothesis:
    description: >-
      When the ingress key in inferenceservice-config is emptied, the
      odh-model-controller should detect the invalid configuration and
      surface error conditions rather than silently failing. The ConfigMap
      is not owned by this controller, so recovery depends on the
      DSCI/DSC operator or manual restoration. The chaos framework
      restores the original value via TTL-based cleanup after 300s.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    allowDangerous: true

odh-model-controller-label-stomping¶

Type: LabelStomping
Danger Level: high
Component: odh-model-controller

When a label used for resource discovery is overwritten on the odh-model-controller Deployment, the operator should detect the label drift and restore the correct label value.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-label-stomping
spec:
  tier: 3
  target:
    operator: odh-model-controller
    component: odh-model-controller
    resource: Deployment/odh-model-controller
  steadyState:
    checks:
      - type: resourceExists
        apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
    timeout: "30s"
  injection:
    type: LabelStomping
    dangerLevel: high
    parameters:
      apiVersion: apps/v1
      kind: Deployment
      name: odh-model-controller
      labelKey: app.kubernetes.io/name
      action: overwrite
    ttl: "300s"
  hypothesis:
    description: >-
      When a label used for resource discovery is overwritten on the
      odh-model-controller Deployment, the operator should detect the
      label drift and restore the correct label value.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

odh-model-controller-leader-lease-corrupt¶

Type: CRDMutation
Danger Level: high
Component: odh-model-controller

Controller detects corrupted leader lease holderIdentity and re-elects leader within 60s, resuming reconciliation. CLEANUP RISK: The TTL-based cleanup restores the original holderIdentity value, which may overwrite a legitimately re-elected leader and cause a brief second disruption. The controller should recover from this via a second re-election.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-leader-lease-corrupt
spec:
  tier: 3
  target:
    operator: odh-model-controller
    component: odh-model-controller
    resource: Lease/odh-model-controller.opendatahub.io
  steadyState:
    checks:
      - type: resourceExists
        apiVersion: coordination.k8s.io/v1
        kind: Lease
        name: odh-model-controller.opendatahub.io
        namespace: opendatahub
    timeout: "30s"
  injection:
    type: CRDMutation
    dangerLevel: high
    parameters:
      apiVersion: "coordination.k8s.io/v1"
      kind: "Lease"
      name: "odh-model-controller.opendatahub.io"
      field: "holderIdentity"
      value: "chaos-injected-invalid"
    ttl: "120s"
  hypothesis:
    description: >-
      Controller detects corrupted leader lease holderIdentity and
      re-elects leader within 60s, resuming reconciliation.
      CLEANUP RISK: The TTL-based cleanup restores the original
      holderIdentity value, which may overwrite a legitimately
      re-elected leader and cause a brief second disruption. The
      controller should recover from this via a second re-election.
    recoveryTimeout: 60s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    allowDangerous: true

odh-model-controller-namespace-deletion¶

Type: NamespaceDeletion
Danger Level: high
Component: odh-model-controller

When the operator's namespace is deleted, the operator should detect the loss and recreate the namespace along with all managed resources. This tests the most destructive failure mode: complete namespace loss.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-namespace-deletion
spec:
  tier: 5
  target:
    operator: odh-model-controller
    component: odh-model-controller
    resource: Namespace/opendatahub
  steadyState:
    checks:
      - type: resourceExists
        apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
    timeout: "30s"
  injection:
    type: NamespaceDeletion
    dangerLevel: high
    parameters:
      namespace: opendatahub
    ttl: "300s"
  hypothesis:
    description: >-
      When the operator's namespace is deleted, the operator should detect
      the loss and recreate the namespace along with all managed resources.
      This tests the most destructive failure mode: complete namespace loss.
    recoveryTimeout: 300s
  blastRadius:
    maxPodsAffected: 10
    allowedNamespaces:
      - opendatahub
    allowDangerous: true

odh-model-controller-network-partition¶

Type: NetworkPartition
Danger Level: medium
Component: odh-model-controller

When the odh-model-controller pod is network-partitioned from the API server, it should lose its leader lease and stop reconciling. Once the partition is removed, the controller should re-acquire the lease and resume normal operation without duplicate work.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-network-partition
spec:
  tier: 2
  target:
    operator: odh-model-controller
    component: odh-model-controller
    resource: Deployment/odh-model-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: control-plane=odh-model-controller
    ttl: "300s"
  hypothesis:
    description: >-
      When the odh-model-controller pod is network-partitioned from the
      API server, it should lose its leader lease and stop reconciling.
      Once the partition is removed, the controller should re-acquire the
      lease and resume normal operation without duplicate work.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

odh-model-controller-ownerref-orphan¶

Type: OwnerRefOrphan
Danger Level: medium
Component: odh-model-controller

Removing ownerReferences from the odh-model-controller Deployment should trigger the operator to re-adopt it within the recovery timeout. Verifies the controller's ownership reconciliation logic.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-ownerref-orphan
spec:
  tier: 3
  target:
    operator: odh-model-controller
    component: odh-model-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: OwnerRefOrphan
    parameters:
      apiVersion: "apps/v1"
      kind: "Deployment"
      name: "odh-model-controller"
    ttl: "120s"
  hypothesis:
    description: >-
      Removing ownerReferences from the odh-model-controller Deployment
      should trigger the operator to re-adopt it within the recovery timeout.
      Verifies the controller's ownership reconciliation logic.
    recoveryTimeout: 60s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

odh-model-controller-pod-kill¶

Type: PodKill
Danger Level: low
Component: odh-model-controller

When the odh-model-controller pod is killed, Kubernetes should recreate it within the recovery timeout and the controller should resume reconciling InferenceService resources without data loss.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-pod-kill
spec:
  tier: 1
  target:
    operator: odh-model-controller
    component: odh-model-controller
    resource: Deployment/odh-model-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: control-plane=odh-model-controller
    count: 1
    ttl: "300s"
  hypothesis:
    description: >-
      When the odh-model-controller pod is killed, Kubernetes should
      recreate it within the recovery timeout and the controller should
      resume reconciling InferenceService resources without data loss.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

odh-model-controller-quota-exhaustion¶

Type: QuotaExhaustion
Danger Level: medium
Component: odh-model-controller

Creating a restrictive ResourceQuota that prevents pod creation should cause the operator to report quota-related errors and retry gracefully instead of crash-looping. When the quota is removed, the operator should resume normal operation.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-quota-exhaustion
spec:
  tier: 5
  target:
    operator: odh-model-controller
    component: odh-model-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: QuotaExhaustion
    parameters:
      quotaName: "chaos-quota-odh-model-controller"
      pods: "0"
      cpu: "1m"
      memory: "1Mi"
    ttl: "120s"
  hypothesis:
    description: >-
      Creating a restrictive ResourceQuota that prevents pod creation should
      cause the operator to report quota-related errors and retry gracefully
      instead of crash-looping. When the quota is removed, the operator
      should resume normal operation.
    recoveryTimeout: 90s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

odh-model-controller-rbac-revoke¶

Type: RBACRevoke
Danger Level: high
Component: odh-model-controller

When the odh-model-controller ClusterRoleBinding subjects are revoked, the controller should lose its ability to reconcile cluster-scoped resources and surface permission-denied errors in its logs. Once permissions are restored, normal reconciliation should resume without manual intervention.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-rbac-revoke
spec:
  tier: 4
  target:
    operator: odh-model-controller
    component: odh-model-controller
    resource: ClusterRoleBinding/odh-model-controller-rolebinding-opendatahub
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: odh-model-controller-rolebinding-opendatahub
      bindingType: ClusterRoleBinding
    ttl: "60s"
  hypothesis:
    description: >-
      When the odh-model-controller ClusterRoleBinding subjects are
      revoked, the controller should lose its ability to reconcile
      cluster-scoped resources and surface permission-denied errors in
      its logs. Once permissions are restored, normal reconciliation
      should resume without manual intervention.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true

odh-model-controller-sdk-api-throttle¶

Type: ClientFault
Danger Level: low
Component: odh-model-controller

When 30% of Get and 20% of List operations are throttled with 500ms-1s delays, the controller should retry with backoff and eventually converge. InferenceService status should recover within the recovery timeout.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-sdk-api-throttle
spec:
  tier: 3
  target:
    operator: odh-model-controller
    component: odh-model-controller
    resource: Deployment/odh-model-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: ClientFault
    parameters:
      faults: '{"get":{"errorRate":0.3,"error":"api server throttled","delay":"500ms"},"list":{"errorRate":0.2,"error":"api server throttled","delay":"1s"}}'
    ttl: "120s"
  hypothesis:
    description: >-
      When 30% of Get and 20% of List operations are throttled with
      500ms-1s delays, the controller should retry with backoff and
      eventually converge. InferenceService status should recover
      within the recovery timeout.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

odh-model-controller-sdk-conflict-storm¶

Type: ClientFault
Danger Level: high
Component: odh-model-controller

When 70% of Update and 50% of Patch operations fail with conflict errors, the controller should handle optimistic concurrency failures gracefully, re-read the resource, and retry. The controller must not enter an infinite retry loop or leak goroutines.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-sdk-conflict-storm
spec:
  tier: 3
  target:
    operator: odh-model-controller
    component: odh-model-controller
    resource: Deployment/odh-model-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: ClientFault
    dangerLevel: high
    parameters:
      faults: '{"update":{"errorRate":0.7,"error":"conflict: the object has been modified"},"patch":{"errorRate":0.5,"error":"conflict: the object has been modified"}}'
    ttl: "120s"
  hypothesis:
    description: >-
      When 70% of Update and 50% of Patch operations fail with conflict
      errors, the controller should handle optimistic concurrency failures
      gracefully, re-read the resource, and retry. The controller must not
      enter an infinite retry loop or leak goroutines.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    allowDangerous: true

odh-model-controller-sdk-watch-disconnect¶

Type: ClientFault
Danger Level: low
Component: odh-model-controller

When 40% of reconcile operations encounter watch channel closures, the controller-runtime informer should re-establish the watch and the controller should resume processing events. No resources should be orphaned during the disruption window.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-sdk-watch-disconnect
spec:
  tier: 3
  target:
    operator: odh-model-controller
    component: odh-model-controller
    resource: Deployment/odh-model-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: ClientFault
    parameters:
      faults: '{"reconcile":{"errorRate":0.4,"error":"watch channel closed"}}'
    ttl: "120s"
  hypothesis:
    description: >-
      When 40% of reconcile operations encounter watch channel closures,
      the controller-runtime informer should re-establish the watch and
      the controller should resume processing events. No resources should
      be orphaned during the disruption window.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

odh-model-controller-webhook-cert-corrupt¶

Type: ConfigDrift
Danger Level: high
Component: odh-model-controller

All 7 webhooks fail after TLS cert corruption; cert-manager or operator restores cert within 120s.

Experiment YAML

# NOTE: The Secret name 'odh-model-controller-webhook-cert' must be verified
# against the actual deployment. Controller-runtime webhook cert Secrets may
# follow different naming conventions depending on cert-manager or OLM setup.
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-webhook-cert-corrupt
spec:
  tier: 2
  target:
    operator: odh-model-controller
    component: odh-model-controller
    resource: Secret/odh-model-controller-webhook-cert
  steadyState:
    checks:
      - type: resourceExists
        apiVersion: v1
        kind: Secret
        name: odh-model-controller-webhook-cert
        namespace: opendatahub
    timeout: "30s"
  injection:
    type: ConfigDrift
    dangerLevel: high
    parameters:
      name: odh-model-controller-webhook-cert
      key: tls.crt
      value: corrupted
      resourceType: Secret
    ttl: "300s"
  hypothesis:
    description: >-
      All 7 webhooks fail after TLS cert corruption; cert-manager or operator
      restores cert within 120s.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    allowDangerous: true

odh-model-controller-webhook-disrupt¶

Type: WebhookDisrupt
Danger Level: high
Component: odh-model-controller

When the validating webhook failurePolicy is weakened from Fail to Ignore, invalid resources can bypass admission validation. The chaos framework restores the original failurePolicy via TTL-based cleanup after 60s. During the disruption window, the controller itself remains operational but admission guardrails are absent.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-webhook-disrupt
spec:
  tier: 4
  target:
    operator: odh-model-controller
    component: odh-model-controller
    resource: ValidatingWebhookConfiguration/validating.odh-model-controller.opendatahub.io
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: WebhookDisrupt
    dangerLevel: high
    parameters:
      webhookName: validating.odh-model-controller.opendatahub.io
      action: setFailurePolicy
      value: Ignore
    ttl: "60s"
  hypothesis:
    description: >-
      When the validating webhook failurePolicy is weakened from Fail to
      Ignore, invalid resources can bypass admission validation. The chaos
      framework restores the original failurePolicy via TTL-based cleanup
      after 60s. During the disruption window, the controller itself
      remains operational but admission guardrails are absent.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true

odh-model-controller-webhook-latency¶

Type: WebhookLatency
Danger Level: high
Component: odh-model-controller

Deploying a slow admission webhook (25s delay, just under the 30s API server timeout) intercepting InferenceService resources should not cause the operator to hang or crash. The operator should handle slow API responses with appropriate timeouts.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: odh-model-controller-webhook-latency
spec:
  tier: 4
  target:
    operator: odh-model-controller
    component: odh-model-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: WebhookLatency
    dangerLevel: high
    parameters:
      resources: "inferenceservices"
      apiGroups: "serving.kserve.io"
      delay: "25s"
    ttl: "180s"
  hypothesis:
    description: >-
      Deploying a slow admission webhook (25s delay, just under the 30s API
      server timeout) intercepting InferenceService resources should not
      cause the operator to hang or crash. The operator should handle slow
      API responses with appropriate timeouts.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    allowDangerous: true