Skip to content

modelmesh Failure Modes

Coverage

Injection Type Danger Experiment Description
ConfigDrift high config-drift.yaml When the modelmesh serving configuration is corrupted, new model deployments rec...
NetworkPartition medium network-partition.yaml When the modelmesh-controller is network-partitioned from the API server, model ...
PodKill low pod-kill.yaml When the modelmesh-controller pod is killed, existing model endpoints keep servi...
RBACRevoke high rbac-revoke.yaml When the modelmesh ClusterRoleBinding subjects are revoked, the controller can n...
WebhookDisrupt high webhook-disrupt.yaml When the modelmesh ServingRuntime validating webhook failurePolicy is weakened f...

Experiment Details

modelmesh-config-drift

  • Type: ConfigDrift
  • Danger Level: high
  • Component: modelmesh-controller

When the modelmesh serving configuration is corrupted, new model deployments receive wrong serving parameters. Existing deployments remain unaffected. The operator should detect the drift and reconcile the correct configuration.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: modelmesh-config-drift
spec:
  tier: 2
  target:
    operator: modelmesh
    component: modelmesh-controller
    resource: ConfigMap/modelmesh-serving-config
  steadyState:
    checks:
      - type: resourceExists
        apiVersion: v1
        kind: ConfigMap
        name: modelmesh-serving-config
        namespace: opendatahub
    timeout: "30s"
  injection:
    type: ConfigDrift
    dangerLevel: high
    parameters:
      name: modelmesh-serving-config
      key: config.yaml
      value: '{"modelServing":{"grpcMaxMessageSize":"-1","restProxy":"invalid://broken"}}'
      resourceType: ConfigMap
    ttl: "300s"
  hypothesis:
    description: >-
      When the modelmesh serving configuration is corrupted, new model
      deployments receive wrong serving parameters. Existing deployments
      remain unaffected. The operator should detect the drift and reconcile
      the correct configuration.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    allowDangerous: true

modelmesh-network-partition

  • Type: NetworkPartition
  • Danger Level: medium
  • Component: modelmesh-controller

When the modelmesh-controller is network-partitioned from the API server, model routing stops updating but existing routes continue working. Once the partition is removed, reconciliation resumes without manual intervention.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: modelmesh-network-partition
spec:
  tier: 2
  target:
    operator: modelmesh
    component: modelmesh-controller
    resource: Deployment/modelmesh-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: modelmesh-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: control-plane=modelmesh-controller,app.kubernetes.io/name=modelmesh-controller
    ttl: "300s"
  hypothesis:
    description: >-
      When the modelmesh-controller is network-partitioned from the API
      server, model routing stops updating but existing routes continue
      working. Once the partition is removed, reconciliation resumes
      without manual intervention.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

modelmesh-pod-kill

  • Type: PodKill
  • Danger Level: low
  • Component: modelmesh-controller

When the modelmesh-controller pod is killed, existing model endpoints keep serving via existing serving runtime pods. New ServingRuntime deployments queue until the controller recovers within the recovery timeout.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: modelmesh-pod-kill
spec:
  tier: 1
  target:
    operator: modelmesh
    component: modelmesh-controller
    resource: Deployment/modelmesh-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: modelmesh-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: control-plane=modelmesh-controller,app.kubernetes.io/name=modelmesh-controller
    count: 1
    ttl: "300s"
  hypothesis:
    description: >-
      When the modelmesh-controller pod is killed, existing model endpoints
      keep serving via existing serving runtime pods. New ServingRuntime
      deployments queue until the controller recovers within the recovery
      timeout.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

modelmesh-rbac-revoke

  • Type: RBACRevoke
  • Danger Level: high
  • Component: modelmesh-controller

When the modelmesh ClusterRoleBinding subjects are revoked, the controller can no longer manage ServingRuntimes. API calls return 403 errors. Once permissions are restored, normal operation resumes without restart.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: modelmesh-rbac-revoke
spec:
  tier: 4
  target:
    operator: modelmesh
    component: modelmesh-controller
    resource: ClusterRoleBinding/modelmesh-controller-rolebinding
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: modelmesh-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: modelmesh-controller-rolebinding
      bindingType: ClusterRoleBinding
    ttl: "60s"
  hypothesis:
    description: >-
      When the modelmesh ClusterRoleBinding subjects are revoked, the
      controller can no longer manage ServingRuntimes. API calls return
      403 errors. Once permissions are restored, normal operation resumes
      without restart.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true

modelmesh-webhook-disrupt

  • Type: WebhookDisrupt
  • Danger Level: high
  • Component: modelmesh-controller

When the modelmesh ServingRuntime validating webhook failurePolicy is weakened from Fail to Ignore, invalid ServingRuntime specs can be submitted bypassing validation. The chaos framework restores the original failurePolicy via TTL-based cleanup after 60s.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: modelmesh-webhook-disrupt
spec:
  tier: 4
  target:
    operator: modelmesh
    component: modelmesh-controller
    resource: ValidatingWebhookConfiguration/vservingruntime.modelmesh.io
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: modelmesh-controller
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: WebhookDisrupt
    dangerLevel: high
    parameters:
      webhookName: vservingruntime.modelmesh.io
      action: setFailurePolicy
      value: Ignore
    ttl: "60s"
  hypothesis:
    description: >-
      When the modelmesh ServingRuntime validating webhook failurePolicy is
      weakened from Fail to Ignore, invalid ServingRuntime specs can be
      submitted bypassing validation. The chaos framework restores the
      original failurePolicy via TTL-based cleanup after 60s.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true