kserve Failure Modes¶

Coverage¶

Injection Type	Danger	Experiment	Description
PodKill	low	dependency-odh-model-controller-kill.yaml	Killing odh-model-controller (which kserve depends on for model serving routing)...
ConfigDrift	high	isvc-config-corruption.yaml	When the deploy key in the inferenceservice-config ConfigMap is overwritten with...
WebhookDisrupt	high	isvc-validator-disrupt.yaml	When the ValidatingWebhookConfiguration for InferenceService has its failurePoli...
NetworkPartition	medium	llm-controller-isolation.yaml	When the llmisvc-controller-manager is network-partitioned from the API server, ...
PodKill	low	main-controller-kill.yaml	When the kserve-controller-manager pod is killed, the Deployment controller recr...
OwnerRefOrphan	medium	ownerref-orphan.yaml	Removing ownerReferences from the kserve-controller-manager Deployment should tr...
CRDMutation	high	route-host-collision.yaml	Mutating a KServe InferenceService Route host simulates a DNS misconfiguration t...
CRDMutation	high	route-tls-mutation.yaml	Changing TLS termination on a KServe InferenceService Route from edge/reencrypt ...

Experiment Details¶

kserve-dependency-odh-model-controller-kill¶

Type: PodKill
Danger Level: low
Component: kserve-controller

Killing odh-model-controller (which kserve depends on for model serving routing) should not crash kserve. Kserve should continue operating in degraded mode and recover when the dependency is restored.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kserve-dependency-odh-model-controller-kill
spec:
  tier: 1
  target:
    operator: kserve
    component: kserve-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kserve-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: "app=odh-model-controller"
    ttl: "300s"
  hypothesis:
    description: >-
      Killing odh-model-controller (which kserve depends on for model serving
      routing) should not crash kserve. Kserve should continue operating in
      degraded mode and recover when the dependency is restored.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

kserve-isvc-config-corruption¶

Type: ConfigDrift
Danger Level: high
Component: kserve-controller-manager

When the deploy key in the inferenceservice-config ConfigMap is overwritten with an empty JSON object, kserve should detect the partial config corruption and recover within 120s. Existing InferenceService resources should continue serving, and the controller should either fall back to built-in defaults or surface clear error conditions rather than silently failing.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kserve-isvc-config-corruption
  namespace: kserve
spec:
  tier: 2
  target:
    operator: kserve
    component: kserve-controller-manager
    resource: ConfigMap/inferenceservice-config
  steadyState:
    checks:
      - type: resourceExists
        apiVersion: v1
        kind: ConfigMap
        name: inferenceservice-config
        namespace: kserve
    timeout: "30s"
  injection:
    type: ConfigDrift
    dangerLevel: high
    parameters:
      name: inferenceservice-config
      key: deploy
      value: "{}"
      resourceType: ConfigMap
    ttl: "300s"
  hypothesis:
    description: >-
      When the deploy key in the inferenceservice-config ConfigMap is
      overwritten with an empty JSON object, kserve should detect the
      partial config corruption and recover within 120s. Existing
      InferenceService resources should continue serving, and the
      controller should either fall back to built-in defaults or
      surface clear error conditions rather than silently failing.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - kserve
    allowDangerous: true

kserve-isvc-validator-disrupt¶

Type: WebhookDisrupt
Danger Level: high
Component: kserve-controller-manager

When the ValidatingWebhookConfiguration for InferenceService has its failurePolicy weakened from Fail to Ignore, invalid InferenceService resources can bypass validation. This tests the blast radius of a weakened admission policy. Recovery is provided by the chaos framework's TTL-based cleanup after 60s, since kserve does not self-heal webhook configuration drift.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kserve-isvc-validator-disrupt
  namespace: kserve
spec:
  tier: 4
  target:
    operator: kserve
    component: kserve-controller-manager
    resource: ValidatingWebhookConfiguration/inferenceservice.serving.kserve.io
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kserve-controller-manager
        namespace: kserve
        conditionType: Available
    timeout: "30s"
  injection:
    type: WebhookDisrupt
    dangerLevel: high
    parameters:
      webhookName: "inferenceservice.serving.kserve.io"
      action: "setFailurePolicy"
      value: "Ignore"
    ttl: "60s"
  hypothesis:
    description: >-
      When the ValidatingWebhookConfiguration for InferenceService has
      its failurePolicy weakened from Fail to Ignore, invalid
      InferenceService resources can bypass validation. This tests the
      blast radius of a weakened admission policy. Recovery is provided
      by the chaos framework's TTL-based cleanup after 60s, since kserve
      does not self-heal webhook configuration drift.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true

kserve-llm-controller-isolation¶

Type: NetworkPartition
Danger Level: medium
Component: llmisvc-controller-manager

When the llmisvc-controller-manager is network-partitioned from the API server, it should lose its leader lease and stop reconciling LLM resources. The main kserve-controller-manager must remain unaffected. Once the partition is lifted after 60s, the LLM controller should re-acquire its lease and resume normal operation within 120s.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kserve-llm-controller-isolation
  namespace: kserve
spec:
  tier: 2
  target:
    operator: kserve
    component: llmisvc-controller-manager
    resource: Deployment/llmisvc-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: llmisvc-controller-manager
        namespace: kserve
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: control-plane=llmisvc-controller-manager
    ttl: "60s"
  hypothesis:
    description: >-
      When the llmisvc-controller-manager is network-partitioned from the
      API server, it should lose its leader lease and stop reconciling LLM
      resources. The main kserve-controller-manager must remain unaffected.
      Once the partition is lifted after 60s, the LLM controller should
      re-acquire its lease and resume normal operation within 120s.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - kserve

kserve-main-controller-kill¶

Type: PodKill
Danger Level: low
Component: kserve-controller-manager

When the kserve-controller-manager pod is killed, the Deployment controller recreates it and leader election completes recovery. InferenceService reconciliation should resume within 60s without data loss or duplicate work.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kserve-main-controller-kill
  namespace: kserve
spec:
  tier: 1
  target:
    operator: kserve
    component: kserve-controller-manager
    resource: Deployment/kserve-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kserve-controller-manager
        namespace: kserve
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: control-plane=kserve-controller-manager
    count: 1
    ttl: "300s"
  hypothesis:
    description: >-
      When the kserve-controller-manager pod is killed, the Deployment
      controller recreates it and leader election completes recovery.
      InferenceService reconciliation should resume within 60s without
      data loss or duplicate work.
    recoveryTimeout: 60s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - kserve

kserve-ownerref-orphan¶

Type: OwnerRefOrphan
Danger Level: medium
Component: kserve-controller

Removing ownerReferences from the kserve-controller-manager Deployment should trigger the operator to re-adopt it within the recovery timeout.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kserve-ownerref-orphan
spec:
  tier: 3
  target:
    operator: kserve
    component: kserve-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kserve-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: OwnerRefOrphan
    parameters:
      apiVersion: "apps/v1"
      kind: "Deployment"
      name: "kserve-controller-manager"
    ttl: "120s"
  hypothesis:
    description: >-
      Removing ownerReferences from the kserve-controller-manager Deployment
      should trigger the operator to re-adopt it within the recovery timeout.
    recoveryTimeout: 60s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub

kserve-route-host-collision¶

Type: CRDMutation
Danger Level: high
Component: kserve-controller-manager

Mutating a KServe InferenceService Route host simulates a DNS misconfiguration that makes the model endpoint unreachable. KServe or the RHOAI operator should detect the Route drift and reconcile the host. NOTE: KServe creates Routes per InferenceService; the Route name in parameters must be customized for each deployment. Expected verdict: Resilient if the Route is restored, Vulnerable if the model endpoint remains unreachable.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kserve-route-host-collision
spec:
  tier: 3
  target:
    operator: kserve
    component: kserve-controller-manager
    resource: Route/kserve-isvc-route
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kserve-controller-manager
        namespace: kserve
        conditionType: Available
    timeout: "30s"
  injection:
    type: CRDMutation
    dangerLevel: high
    # NOTE: Replace "kserve-isvc-route" with the actual Route name for
    # your InferenceService. KServe creates Routes per InferenceService
    # with names like "<isvc-name>-predictor" in the model namespace.
    parameters:
      apiVersion: "route.openshift.io/v1"
      kind: "Route"
      name: "kserve-isvc-route"
      path: "spec.host"
      value: "chaos-collision.apps.cluster.invalid"
    ttl: "300s"
  hypothesis:
    description: >-
      Mutating a KServe InferenceService Route host simulates a DNS
      misconfiguration that makes the model endpoint unreachable. KServe
      or the RHOAI operator should detect the Route drift and reconcile
      the host. NOTE: KServe creates Routes per InferenceService; the
      Route name in parameters must be customized for each deployment.
      Expected verdict: Resilient if the Route is restored, Vulnerable
      if the model endpoint remains unreachable.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - kserve
    allowDangerous: true

kserve-route-tls-mutation¶

Type: CRDMutation
Danger Level: high
Component: kserve-controller-manager

Changing TLS termination on a KServe InferenceService Route from edge/reencrypt to passthrough breaks HTTPS inference endpoints. The KServe controller or RHOAI operator should detect the TLS drift and restore the correct termination mode. NOTE: The Route name must be customized for each InferenceService deployment. Expected verdict: Resilient if restored, Vulnerable if inference endpoints stay broken.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kserve-route-tls-mutation
spec:
  tier: 3
  target:
    operator: kserve
    component: kserve-controller-manager
    resource: Route/kserve-isvc-route
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kserve-controller-manager
        namespace: kserve
        conditionType: Available
    timeout: "30s"
  injection:
    type: CRDMutation
    dangerLevel: high
    # NOTE: Replace "kserve-isvc-route" with the actual Route name for
    # your InferenceService.
    parameters:
      apiVersion: "route.openshift.io/v1"
      kind: "Route"
      name: "kserve-isvc-route"
      path: "spec.tls.termination"
      value: "passthrough"
    ttl: "300s"
  hypothesis:
    description: >-
      Changing TLS termination on a KServe InferenceService Route from
      edge/reencrypt to passthrough breaks HTTPS inference endpoints.
      The KServe controller or RHOAI operator should detect the TLS
      drift and restore the correct termination mode. NOTE: The Route
      name must be customized for each InferenceService deployment.
      Expected verdict: Resilient if restored, Vulnerable if inference
      endpoints stay broken.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - kserve
    allowDangerous: true