kserve Failure Modes

Coverage

| Injection Type | Danger | Experiment | Description |
|---|---|---|---|
| PodKill | low | dependency-odh-model-controller-kill.yaml | Killing odh-model-controller (which kserve depends on for model serving routing)... |
| ConfigDrift | high | isvc-config-corruption.yaml | When the deploy key in the inferenceservice-config ConfigMap is overwritten with... |
| WebhookDisrupt | high | isvc-validator-disrupt.yaml | When the ValidatingWebhookConfiguration for InferenceService has its failurePoli... |
| NetworkPartition | medium | llm-controller-isolation.yaml | When the llmisvc-controller-manager is network-partitioned from the API server, ... |
| PodKill | low | main-controller-kill.yaml | When the kserve-controller-manager pod is killed, the Deployment controller recr... |
| OwnerRefOrphan | medium | ownerref-orphan.yaml | Removing ownerReferences from the kserve-controller-manager Deployment should tr... |
| CRDMutation | high | route-host-collision.yaml | Mutating a KServe InferenceService Route host simulates a DNS misconfiguration t... |
| CRDMutation | high | route-tls-mutation.yaml | Changing TLS termination on a KServe InferenceService Route from edge/reencrypt ... |

Experiment Details

kserve-dependency-odh-model-controller-kill

  • Type: PodKill
  • Danger Level: low
  • Component: kserve-controller

Killing odh-model-controller (which kserve depends on for model serving routing) should not crash kserve. KServe should continue operating in degraded mode and recover when the dependency is restored.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kserve-dependency-odh-model-controller-kill
spec:
  tier: 1
  target:
    operator: kserve
    component: kserve-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kserve-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: "app=odh-model-controller"
    ttl: "300s"
  hypothesis:
    description: >-
      Killing odh-model-controller (which kserve depends on for model serving
      routing) should not crash kserve. Kserve should continue operating in
      degraded mode and recover when the dependency is restored.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
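
The conditionTrue steady-state check above can be read as a lookup in the Deployment's status.conditions list. The following is a minimal Python sketch of that reading, not the chaos framework's actual implementation; the function name condition_true and the sample Deployment object are assumptions made for illustration.

```python
from typing import Any, Dict


def condition_true(resource: Dict[str, Any], condition_type: str) -> bool:
    """Return True if status.conditions holds the given type with status "True"."""
    conditions = resource.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") == condition_type:
            # Kubernetes reports condition status as the strings
            # "True", "False" or "Unknown", not as booleans.
            return cond.get("status") == "True"
    return False  # a missing condition is treated as not satisfied


# Hypothetical Deployment object, trimmed to the fields the check reads.
deployment = {
    "kind": "Deployment",
    "metadata": {"name": "kserve-controller-manager", "namespace": "opendatahub"},
    "status": {
        "conditions": [
            {"type": "Progressing", "status": "True"},
            {"type": "Available", "status": "True"},
        ]
    },
}

print(condition_true(deployment, "Available"))
```

Treating a missing condition as a failure is a design choice of this sketch: a Deployment that has not yet reported Available should not count as steady.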

kserve-isvc-config-corruption

  • Type: ConfigDrift
  • Danger Level: high
  • Component: kserve-controller-manager

When the deploy key in the inferenceservice-config ConfigMap is overwritten with an empty JSON object, kserve should detect the partial config corruption and recover within 120s. Existing InferenceService resources should continue serving, and the controller should either fall back to built-in defaults or surface clear error conditions rather than silently failing.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kserve-isvc-config-corruption
  namespace: kserve
spec:
  tier: 2
  target:
    operator: kserve
    component: kserve-controller-manager
    resource: ConfigMap/inferenceservice-config
  steadyState:
    checks:
      - type: resourceExists
        apiVersion: v1
        kind: ConfigMap
        name: inferenceservice-config
        namespace: kserve
    timeout: "30s"
  injection:
    type: ConfigDrift
    dangerLevel: high
    parameters:
      name: inferenceservice-config
      key: deploy
      value: "{}"
      resourceType: ConfigMap
    ttl: "300s"
  hypothesis:
    description: >-
      When the deploy key in the inferenceservice-config ConfigMap is
      overwritten with an empty JSON object, kserve should detect the
      partial config corruption and recover within 120s. Existing
      InferenceService resources should continue serving, and the
      controller should either fall back to built-in defaults or
      surface clear error conditions rather than silently failing.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - kserve
    allowDangerous: true
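
The ConfigDrift parameters above (key: deploy, value: "{}") amount to overwriting one entry in the ConfigMap's data map while keeping the original so it can be put back. The sketch below shows that idea on a plain dictionary. It is an illustration under assumptions, not the framework's code: the helper names inject_config_drift and restore_config are hypothetical, and the sample content of the deploy key is a placeholder.

```python
import copy
import json
from typing import Any, Dict, Optional, Tuple


def inject_config_drift(
    config_map: Dict[str, Any], key: str, value: str
) -> Tuple[Dict[str, Any], Optional[str]]:
    """Return (mutated copy, original value) without touching the input."""
    mutated = copy.deepcopy(config_map)
    data = mutated.setdefault("data", {})
    original = data.get(key)
    data[key] = value
    return mutated, original


def restore_config(
    config_map: Dict[str, Any], key: str, original: Optional[str]
) -> Dict[str, Any]:
    """Put the saved value back, or drop the key if it did not exist before."""
    restored = copy.deepcopy(config_map)
    data = restored.setdefault("data", {})
    if original is None:
        data.pop(key, None)
    else:
        data[key] = original
    return restored


# Placeholder ConfigMap; the real contents of the deploy key are not shown on this page.
config_map = {
    "kind": "ConfigMap",
    "metadata": {"name": "inferenceservice-config", "namespace": "kserve"},
    "data": {"deploy": json.dumps({"defaultDeploymentMode": "RawDeployment"})},
}

drifted, saved = inject_config_drift(config_map, "deploy", "{}")
print(json.loads(drifted["data"]["deploy"]))  # the controller now sees an empty object
print(restore_config(drifted, "deploy", saved) == config_map)
```

Note that the injected value is still valid JSON, which is why the hypothesis calls this a partial corruption: parsing succeeds, but every expected field is absent.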

kserve-isvc-validator-disrupt

  • Type: WebhookDisrupt
  • Danger Level: high
  • Component: kserve-controller-manager

When the ValidatingWebhookConfiguration for InferenceService has its failurePolicy weakened from Fail to Ignore, invalid InferenceService resources can bypass validation. This tests the blast radius of a weakened admission policy. Recovery is provided by the chaos framework's TTL-based cleanup after 60s, since kserve does not self-heal webhook configuration drift.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kserve-isvc-validator-disrupt
  namespace: kserve
spec:
  tier: 4
  target:
    operator: kserve
    component: kserve-controller-manager
    resource: ValidatingWebhookConfiguration/inferenceservice.serving.kserve.io
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kserve-controller-manager
        namespace: kserve
        conditionType: Available
    timeout: "30s"
  injection:
    type: WebhookDisrupt
    dangerLevel: high
    parameters:
      webhookName: "inferenceservice.serving.kserve.io"
      action: "setFailurePolicy"
      value: "Ignore"
    ttl: "60s"
  hypothesis:
    description: >-
      When the ValidatingWebhookConfiguration for InferenceService has
      its failurePolicy weakened from Fail to Ignore, invalid
      InferenceService resources can bypass validation. This tests the
      blast radius of a weakened admission policy. Recovery is provided
      by the chaos framework's TTL-based cleanup after 60s, since kserve
      does not self-heal webhook configuration drift.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
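
The setFailurePolicy action above changes the failurePolicy field on the entries of a ValidatingWebhookConfiguration. A rough Python sketch of the mutation and of the TTL-time restore follows. The helper names and the webhook entry name validator.example are hypothetical; only the webhooks list and the Fail/Ignore values come from the Kubernetes admission API.

```python
import copy
from typing import Any, Dict, Tuple

VALID_POLICIES = ("Fail", "Ignore")


def set_failure_policy(
    webhook_config: Dict[str, Any], policy: str
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Return (mutated copy, {webhook name: original policy})."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"unsupported failurePolicy: {policy!r}")
    mutated = copy.deepcopy(webhook_config)
    originals: Dict[str, str] = {}
    for hook in mutated.get("webhooks", []):
        # admissionregistration.k8s.io/v1 defaults failurePolicy to Fail.
        originals[hook["name"]] = hook.get("failurePolicy", "Fail")
        hook["failurePolicy"] = policy
    return mutated, originals


def restore_failure_policy(
    webhook_config: Dict[str, Any], originals: Dict[str, str]
) -> Dict[str, Any]:
    """Undo the mutation, as the TTL-based cleanup is expected to do."""
    restored = copy.deepcopy(webhook_config)
    for hook in restored.get("webhooks", []):
        if hook["name"] in originals:
            hook["failurePolicy"] = originals[hook["name"]]
    return restored


# Trimmed, hypothetical configuration object.
config = {
    "kind": "ValidatingWebhookConfiguration",
    "metadata": {"name": "inferenceservice.serving.kserve.io"},
    "webhooks": [{"name": "validator.example", "failurePolicy": "Fail"}],
}

weakened, saved = set_failure_policy(config, "Ignore")
print(weakened["webhooks"][0]["failurePolicy"])
print(restore_failure_policy(weakened, saved) == config)
```

Because the hypothesis states that kserve does not self-heal this drift, the restore step in the sketch stands in for the framework's cleanup rather than for any controller behavior.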

kserve-llm-controller-isolation

  • Type: NetworkPartition
  • Danger Level: medium
  • Component: llmisvc-controller-manager

When the llmisvc-controller-manager is network-partitioned from the API server, it should lose its leader lease and stop reconciling LLM resources. The main kserve-controller-manager must remain unaffected. Once the partition is lifted after 60s, the LLM controller should re-acquire its lease and resume normal operation within 120s.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kserve-llm-controller-isolation
  namespace: kserve
spec:
  tier: 2
  target:
    operator: kserve
    component: llmisvc-controller-manager
    resource: Deployment/llmisvc-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: llmisvc-controller-manager
        namespace: kserve
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: control-plane=llmisvc-controller-manager
    ttl: "60s"
  hypothesis:
    description: >-
      When the llmisvc-controller-manager is network-partitioned from the
      API server, it should lose its leader lease and stop reconciling LLM
      resources. The main kserve-controller-manager must remain unaffected.
      Once the partition is lifted after 60s, the LLM controller should
      re-acquire its lease and resume normal operation within 120s.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - kserve
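
The hypothesis above depends on lease timing: a leader that cannot reach the API server cannot renew its Lease, so the lease lapses once renewTime plus leaseDurationSeconds has passed. The sketch below works through that arithmetic for the 60s partition. The 15-second lease duration and all timestamps are assumed values for illustration; this page does not state the controller's actual lease settings.

```python
from datetime import datetime, timedelta, timezone


def lease_expired(renew_time: datetime, lease_duration_seconds: int, now: datetime) -> bool:
    """A lease counts as expired once now is past renewTime + leaseDurationSeconds."""
    return now > renew_time + timedelta(seconds=lease_duration_seconds)


# Assumed values; not taken from the llmisvc-controller-manager configuration.
LEASE_DURATION_SECONDS = 15
partition_start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
last_renew = partition_start  # last successful renewal just before the partition

# 10s into the partition the lease has not yet lapsed.
print(lease_expired(last_renew, LEASE_DURATION_SECONDS, partition_start + timedelta(seconds=10)))

# At the end of the 60s ttl the lease has lapsed, so reconciliation should have stopped.
print(lease_expired(last_renew, LEASE_DURATION_SECONDS, partition_start + timedelta(seconds=60)))

# After the partition lifts, a fresh renewal makes the lease current again.
renewed_at = partition_start + timedelta(seconds=62)
print(lease_expired(renewed_at, LEASE_DURATION_SECONDS, partition_start + timedelta(seconds=65)))
```

With any lease duration shorter than the 60s ttl, the partition is long enough for the lease to lapse; with a longer one, the experiment would not exercise the lease-loss path at all.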

kserve-main-controller-kill

  • Type: PodKill
  • Danger Level: low
  • Component: kserve-controller-manager

When the kserve-controller-manager pod is killed, the Deployment controller recreates it and leader election completes recovery. InferenceService reconciliation should resume within 60s without data loss or duplicate work.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kserve-main-controller-kill
  namespace: kserve
spec:
  tier: 1
  target:
    operator: kserve
    component: kserve-controller-manager
    resource: Deployment/kserve-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kserve-controller-manager
        namespace: kserve
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: control-plane=kserve-controller-manager
    count: 1
    ttl: "300s"
  hypothesis:
    description: >-
      When the kserve-controller-manager pod is killed, the Deployment
      controller recreates it and leader election completes recovery.
      InferenceService reconciliation should resume within 60s without
      data loss or duplicate work.
    recoveryTimeout: 60s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - kserve
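
The recoveryTimeout of 60s above implies a polling loop: re-run the steady-state check until it passes or the deadline is reached. The following sketch shows one way such a loop could look. It is an assumption about the general technique, not the framework's implementation; wait_for_recovery and FakeClock are hypothetical names, and the fake clock exists only so the example runs instantly.

```python
import time
from typing import Callable


def wait_for_recovery(
    check: Callable[[], bool],
    timeout_seconds: float,
    interval_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it passes or timeout_seconds have elapsed."""
    deadline = clock() + timeout_seconds
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_seconds)


class FakeClock:
    """Simulated clock so the example does not actually wait."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeClock()
# Hypothetical stand-in for "the Deployment is Available again": true from t=20s.
recovered = wait_for_recovery(
    check=lambda: fake.now >= 20,
    timeout_seconds=60,
    interval_seconds=5,
    clock=fake.time,
    sleep=fake.sleep,
)
print(recovered, fake.now)
```

The check runs once more at the deadline before giving up, so a recovery that lands exactly at 60s still counts as a pass in this sketch.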

kserve-ownerref-orphan

  • Type: OwnerRefOrphan
  • Danger Level: medium
  • Component: kserve-controller

Removing ownerReferences from the kserve-controller-manager Deployment should trigger the operator to re-adopt it within the 60s recovery timeout.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kserve-ownerref-orphan
spec:
  tier: 3
  target:
    operator: kserve
    component: kserve-controller
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kserve-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: OwnerRefOrphan
    parameters:
      apiVersion: "apps/v1"
      kind: "Deployment"
      name: "kserve-controller-manager"
    ttl: "120s"
  hypothesis:
    description: >-
      Removing ownerReferences from the kserve-controller-manager Deployment
      should trigger the operator to re-adopt it within the recovery timeout.
    recoveryTimeout: 60s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
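
The OwnerRefOrphan injection above clears metadata.ownerReferences on the target, and re-adoption means the owner's reference reappears. A small Python sketch of both halves follows. The helper names, the owner kind ExampleOwner, and the UID are hypothetical; the page does not say which resource owns the Deployment.

```python
import copy
from typing import Any, Dict, List, Tuple


def orphan(resource: Dict[str, Any]) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Return (copy without ownerReferences, the removed references)."""
    mutated = copy.deepcopy(resource)
    removed = mutated.get("metadata", {}).pop("ownerReferences", [])
    return mutated, removed


def is_adopted(resource: Dict[str, Any], owner_uid: str) -> bool:
    """True if any ownerReference points at the given owner UID."""
    refs = resource.get("metadata", {}).get("ownerReferences", [])
    return any(ref.get("uid") == owner_uid for ref in refs)


# Hypothetical owner; kind, name and uid are placeholders.
OWNER_UID = "00000000-0000-0000-0000-000000000001"
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {
        "name": "kserve-controller-manager",
        "ownerReferences": [
            {
                "apiVersion": "example.io/v1",
                "kind": "ExampleOwner",
                "name": "example-owner",
                "uid": OWNER_UID,
                "controller": True,
            }
        ],
    },
}

orphaned, removed = orphan(deployment)
print(is_adopted(orphaned, OWNER_UID))    # state right after injection
print(is_adopted(deployment, OWNER_UID))  # state the operator is expected to restore
```

Matching on the UID rather than the name is deliberate: a recreated owner with the same name has a different UID and would not count as the original owner.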

kserve-route-host-collision

  • Type: CRDMutation
  • Danger Level: high
  • Component: kserve-controller-manager

Mutating a KServe InferenceService Route host simulates a DNS misconfiguration that makes the model endpoint unreachable. KServe or the RHOAI operator should detect the Route drift and reconcile the host.

NOTE: KServe creates Routes per InferenceService; the Route name in parameters must be customized for each deployment.

Expected verdict: Resilient if the Route is restored, Vulnerable if the model endpoint remains unreachable.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kserve-route-host-collision
spec:
  tier: 3
  target:
    operator: kserve
    component: kserve-controller-manager
    resource: Route/kserve-isvc-route
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kserve-controller-manager
        namespace: kserve
        conditionType: Available
    timeout: "30s"
  injection:
    type: CRDMutation
    dangerLevel: high
    # NOTE: Replace "kserve-isvc-route" with the actual Route name for
    # your InferenceService. KServe creates Routes per InferenceService
    # with names like "<isvc-name>-predictor" in the model namespace.
    parameters:
      apiVersion: "route.openshift.io/v1"
      kind: "Route"
      name: "kserve-isvc-route"
      path: "spec.host"
      value: "chaos-collision.apps.cluster.invalid"
    ttl: "300s"
  hypothesis:
    description: >-
      Mutating a KServe InferenceService Route host simulates a DNS
      misconfiguration that makes the model endpoint unreachable. KServe
      or the RHOAI operator should detect the Route drift and reconcile
      the host. NOTE: KServe creates Routes per InferenceService; the
      Route name in parameters must be customized for each deployment.
      Expected verdict: Resilient if the Route is restored, Vulnerable
      if the model endpoint remains unreachable.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - kserve
    allowDangerous: true
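
The CRDMutation parameters above address a field by dotted path (path: "spec.host"). One plausible reading is a walk through nested maps, sketched below in Python. The helpers get_path and set_path are hypothetical, as are the Route name example-predictor and its original host; how the framework actually resolves paths is not documented on this page.

```python
import copy
from typing import Any, Dict, Optional


def get_path(obj: Dict[str, Any], path: str) -> Optional[Any]:
    """Read a dotted path from nested dicts; None if any segment is missing."""
    current: Any = obj
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def set_path(obj: Dict[str, Any], path: str, value: Any) -> Dict[str, Any]:
    """Return a copy with the dotted path set, creating intermediate maps."""
    mutated = copy.deepcopy(obj)
    parts = path.split(".")
    current = mutated
    for part in parts[:-1]:
        current = current.setdefault(part, {})
    current[parts[-1]] = value
    return mutated


# Hypothetical Route following the "<isvc-name>-predictor" naming noted above.
route = {
    "apiVersion": "route.openshift.io/v1",
    "kind": "Route",
    "metadata": {"name": "example-predictor"},
    "spec": {"host": "example-predictor.apps.cluster.example"},
}

mutated = set_path(route, "spec.host", "chaos-collision.apps.cluster.invalid")
original_host = get_path(route, "spec.host")
print(get_path(mutated, "spec.host"))
print(get_path(mutated, "spec.host") == original_host)  # drift is present until reconciled
```

Reading the original value before mutating gives the comparison point for the verdict: the Route counts as restored only when spec.host matches it again.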

kserve-route-tls-mutation

  • Type: CRDMutation
  • Danger Level: high
  • Component: kserve-controller-manager

Changing TLS termination on a KServe InferenceService Route from edge/reencrypt to passthrough breaks HTTPS inference endpoints. The KServe controller or RHOAI operator should detect the TLS drift and restore the correct termination mode.

NOTE: The Route name must be customized for each InferenceService deployment.

Expected verdict: Resilient if restored, Vulnerable if inference endpoints stay broken.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: kserve-route-tls-mutation
spec:
  tier: 3
  target:
    operator: kserve
    component: kserve-controller-manager
    resource: Route/kserve-isvc-route
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: kserve-controller-manager
        namespace: kserve
        conditionType: Available
    timeout: "30s"
  injection:
    type: CRDMutation
    dangerLevel: high
    # NOTE: Replace "kserve-isvc-route" with the actual Route name for
    # your InferenceService.
    parameters:
      apiVersion: "route.openshift.io/v1"
      kind: "Route"
      name: "kserve-isvc-route"
      path: "spec.tls.termination"
      value: "passthrough"
    ttl: "300s"
  hypothesis:
    description: >-
      Changing TLS termination on a KServe InferenceService Route from
      edge/reencrypt to passthrough breaks HTTPS inference endpoints.
      The KServe controller or RHOAI operator should detect the TLS
      drift and restore the correct termination mode. NOTE: The Route
      name must be customized for each InferenceService deployment.
      Expected verdict: Resilient if restored, Vulnerable if inference
      endpoints stay broken.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - kserve
    allowDangerous: true
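
The expected verdict above (Resilient if restored, Vulnerable otherwise) can be expressed as a classification over timed observations of spec.tls.termination. The sketch below is one possible interpretation, not the framework's scoring logic. It assumes that a restore only counts if it is observed within recoveryTimeout, so a value put back later by the 300s TTL cleanup does not make the operator Resilient. The function name classify and the sample observations are hypothetical.

```python
from typing import Iterable, Tuple


def classify(
    original: str,
    observations: Iterable[Tuple[int, str]],
    recovery_timeout_seconds: int,
) -> str:
    """observations are (seconds since injection, observed value) pairs."""
    for elapsed, value in sorted(observations):
        if elapsed <= recovery_timeout_seconds and value == original:
            return "Resilient"
    return "Vulnerable"


RECOVERY_TIMEOUT_SECONDS = 120  # from recoveryTimeout in the experiment above

# Operator reverts the termination mode 90s after injection.
reconciled = [(30, "passthrough"), (90, "edge")]
print(classify("edge", reconciled, RECOVERY_TIMEOUT_SECONDS))

# Field is only restored after the 300s ttl, i.e. by cleanup rather than by the operator.
cleanup_only = [(30, "passthrough"), (120, "passthrough"), (310, "edge")]
print(classify("edge", cleanup_only, RECOVERY_TIMEOUT_SECONDS))
```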