llamastack Failure Modes

Coverage

Injection Type	Danger Level	Experiment	Description
ConfigDrift	high	config-drift.yaml	When the llamastack serving configuration is corrupted, new LLM deployments rece...
NetworkPartition	medium	network-partition.yaml	When the llamastack-controller-manager is network-partitioned from the API serve...
PodKill	low	pod-kill.yaml	When the llamastack-controller-manager pod is killed, existing LlamaStack distri...
RBACRevoke	high	rbac-revoke.yaml	When the llamastack ClusterRoleBinding subjects are revoked, the controller can ...

Experiment Details

llamastack-config-drift

Type: ConfigDrift
Danger Level: high
Component: llamastack-controller-manager

When the llamastack serving configuration is corrupted, new LLM deployments receive invalid config and fail to start. Existing deployments remain unaffected. The operator should detect the drift and reconcile the correct configuration.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: llamastack-config-drift
spec:
  tier: 2
  target:
    operator: llamastack
    component: llamastack-controller-manager
    resource: ConfigMap/llamastack-serving-config
  steadyState:
    checks:
      - type: resourceExists
        apiVersion: v1
        kind: ConfigMap
        name: llamastack-serving-config
        namespace: opendatahub
    timeout: "30s"
  injection:
    type: ConfigDrift
    dangerLevel: high
    parameters:
      name: llamastack-serving-config
      key: config.yaml
      value: '{"serving":{"endpoint":"invalid://broken","timeout":"-1"}}'
      resourceType: ConfigMap
    ttl: "300s"
  hypothesis:
    description: >-
      When the llamastack serving configuration is corrupted, new LLM
      deployments receive invalid config and fail to start. Existing
      deployments remain unaffected. The operator should detect the drift
      and reconcile the correct configuration.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
    allowDangerous: true
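
The detect-and-reconcile behaviour in the hypothesis can be sketched as a comparison between the desired ConfigMap data and the live data. This is a minimal illustration, not the operator's actual implementation: `DESIRED`, `drifted_keys`, and `reconcile` are hypothetical names, and the known-good content of `config.yaml` is a placeholder because it is not shown in this document.

```python
# Value injected by the experiment (copied from spec.injection.parameters.value).
INJECTED = '{"serving":{"endpoint":"invalid://broken","timeout":"-1"}}'

# Hypothetical desired state; the real known-good config.yaml is not documented here.
DESIRED = {"config.yaml": "<known-good content>"}


def drifted_keys(desired: dict[str, str], live: dict[str, str]) -> list[str]:
    """Return the keys whose live value is missing or differs from the desired value."""
    return sorted(key for key, value in desired.items() if live.get(key) != value)


def reconcile(desired: dict[str, str], live: dict[str, str]) -> dict[str, str]:
    """Return a copy of the live data with every drifted key reset to its desired value."""
    restored = dict(live)
    for key in drifted_keys(desired, live):
        restored[key] = desired[key]
    return restored


# Simulate the ConfigMap after injection, then reconcile it.
live_after_injection = {"config.yaml": INJECTED}
print(drifted_keys(DESIRED, live_after_injection))  # ['config.yaml']
print(reconcile(DESIRED, live_after_injection) == DESIRED)  # True
```

Keys that exist only in the live data are left untouched in this sketch; whether the real operator prunes them is not stated here.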

llamastack-network-partition

Type: NetworkPartition
Danger Level: medium
Component: llamastack-controller-manager

When the llamastack-controller-manager is network-partitioned from the API server, reconciliation stops. Running LLM endpoints remain unaffected. Once the partition is removed, reconciliation resumes without manual intervention.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: llamastack-network-partition
spec:
  tier: 2
  target:
    operator: llamastack
    component: llamastack-controller-manager
    resource: Deployment/llamastack-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: llamastack-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: control-plane=controller-manager,app.kubernetes.io/name=llamastack
    ttl: "300s"
  hypothesis:
    description: >-
      When the llamastack-controller-manager is network-partitioned from
      the API server, reconciliation stops. Running LLM endpoints remain
      unaffected. Once the partition is removed, reconciliation resumes
      without manual intervention.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
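
"Resumes without manual intervention" implies that the controller keeps retrying while the API server is unreachable. The sketch below simulates a generic retry loop with exponential backoff to show that behaviour. It is an assumption for illustration only: the controller's real retry policy is not described in this document, and `simulate`, `base_delay`, and `max_delay` are hypothetical names.

```python
def simulate(reachability: list[bool], base_delay: float = 1.0, max_delay: float = 30.0):
    """Simulate a reconcile loop over a sequence of API-server reachability samples.

    Returns one (attempt, outcome, wait_seconds) tuple per attempt, where
    wait_seconds is the pause applied after that attempt.
    """
    delay = base_delay
    log = []
    for attempt, reachable in enumerate(reachability):
        if reachable:
            # A successful reconcile resets the backoff.
            delay = base_delay
            log.append((attempt, "reconciled", delay))
        else:
            # The API server is unreachable: skip this attempt, then back off further.
            log.append((attempt, "skipped", delay))
            delay = min(delay * 2, max_delay)
    return log


# One healthy attempt, three attempts during the partition, then recovery.
for entry in simulate([True, False, False, False, True]):
    print(entry)
```

The loop never exits on failure, which is the property the hypothesis depends on: once the partition is lifted, the next attempt succeeds and the wait drops back to `base_delay`.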

llamastack-pod-kill

Type: PodKill
Danger Level: low
Component: llamastack-controller-manager

When the llamastack-controller-manager pod is killed, existing LlamaStack distributions continue serving LLM endpoints. New deployments queue until the controller recovers, which is expected to happen within the recovery timeout (120s).

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: llamastack-pod-kill
spec:
  tier: 1
  target:
    operator: llamastack
    component: llamastack-controller-manager
    resource: Deployment/llamastack-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: llamastack-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: control-plane=controller-manager,app.kubernetes.io/name=llamastack
    count: 1
    ttl: "300s"
  hypothesis:
    description: >-
      When the llamastack-controller-manager pod is killed, existing
      LlamaStack distributions continue serving LLM endpoints. New
      deployments queue until the controller recovers within the recovery
      timeout.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
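
The pass condition for this hypothesis amounts to comparing the observed recovery time against `recoveryTimeout`. A rough sketch of that check follows, assuming durations are written as an integer followed by `s`, `m`, or `h` (only the `s` form appears in this document). `parse_duration` and `recovered_in_time` are hypothetical helpers, not part of the chaos framework.

```python
import re

# Seconds per supported unit suffix (the m and h suffixes are an assumption).
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert a duration string such as '120s' into whole seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNITS[unit]


def recovered_in_time(killed_at: float, available_at: float, recovery_timeout: str) -> bool:
    """Return True if the controller became Available within the recovery timeout.

    killed_at and available_at are timestamps in seconds on the same clock.
    """
    elapsed = available_at - killed_at
    return 0 <= elapsed <= parse_duration(recovery_timeout)


# The pod is killed at t=0 and the Deployment reports Available again at t=45.
print(recovered_in_time(0.0, 45.0, "120s"))  # True
```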

llamastack-rbac-revoke

Type: RBACRevoke
Danger Level: high
Component: llamastack-controller-manager

When the llamastack ClusterRoleBinding subjects are revoked, the controller can no longer manage LlamaStack distributions, and its API calls return 403 (Forbidden) errors. Once permissions are restored, normal operation resumes without a restart.

Experiment YAML

apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: llamastack-rbac-revoke
spec:
  tier: 4
  target:
    operator: llamastack
    component: llamastack-controller-manager
    resource: ClusterRoleBinding/llamastack-controller-manager-rolebinding
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: llamastack-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: llamastack-controller-manager-rolebinding
      bindingType: ClusterRoleBinding
    ttl: "60s"
  hypothesis:
    description: >-
      When the llamastack ClusterRoleBinding subjects are revoked, the
      controller can no longer manage LlamaStack distributions. API calls
      return 403 errors. Once permissions are restored, normal operation
      resumes without restart.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
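
One way to picture the revoke-and-restore cycle is as a pair of pure functions over the binding object: save the subjects, empty the list, and put the saved subjects back when the TTL expires. This is only a sketch of the idea, not the framework's actual injection code. `revoke_subjects` and `restore_subjects` are hypothetical helpers, and the ServiceAccount subject and `roleRef` values in the example are invented for illustration.

```python
import copy


def revoke_subjects(binding: dict) -> tuple[dict, list]:
    """Return (a copy of the binding with no subjects, the saved original subjects)."""
    saved = copy.deepcopy(binding.get("subjects", []))
    revoked = copy.deepcopy(binding)
    revoked["subjects"] = []
    return revoked, saved


def restore_subjects(binding: dict, saved: list) -> dict:
    """Return a copy of the binding with the saved subjects put back."""
    restored = copy.deepcopy(binding)
    restored["subjects"] = copy.deepcopy(saved)
    return restored


# Hypothetical binding; only metadata.name comes from the experiment above.
binding = {
    "kind": "ClusterRoleBinding",
    "metadata": {"name": "llamastack-controller-manager-rolebinding"},
    "roleRef": {"kind": "ClusterRole", "name": "example-role"},  # assumed
    "subjects": [
        {"kind": "ServiceAccount", "name": "example-sa", "namespace": "opendatahub"}  # assumed
    ],
}

revoked, saved = revoke_subjects(binding)
print(revoked["subjects"])  # []
print(restore_subjects(revoked, saved) == binding)  # True
```

Both functions work on copies, so the original object is never mutated and the restore step returns exactly the pre-injection state.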