feast Failure Modes
Coverage
| Injection Type | Danger | Experiment | Description |
|---|---|---|---|
| NetworkPartition | medium | network-partition.yaml | When the feast-operator is network-partitioned from the API server, FeatureStore... |
| PodKill | low | pod-kill.yaml | When the feast-operator pod is killed, existing FeatureStore instances continue ... |
| RBACRevoke | high | rbac-revoke.yaml | When the feast-operator ClusterRoleBinding subjects are revoked, the operator lo... |
Experiment Details
feast-network-partition
- Type: NetworkPartition
- Danger Level: medium
- Component: feast-operator-controller-manager
When the feast-operator is network-partitioned from the API server, FeatureStore reconciliation stops. Existing feature servers remain available and continue serving features. Once the partition is removed, reconciliation resumes without manual intervention.
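How the chaos tool enforces the partition is not stated here. One common way to isolate pods on Kubernetes is a NetworkPolicy that selects them and allows no traffic. The sketch below builds such a manifest from the experiment's `labelSelector`. The helper names (`parse_label_selector`, `build_partition_policy`) and the policy name are hypothetical, and enforcement depends on the cluster's network plugin supporting NetworkPolicy.

```python
import json
from typing import Any, Dict


def parse_label_selector(selector: str) -> Dict[str, str]:
    """Turn 'k1=v1,k2=v2' into a dict. Only simple key=value terms are handled."""
    labels: Dict[str, str] = {}
    for term in selector.split(","):
        key, sep, value = term.strip().partition("=")
        if not sep or not key or "=" in value or key.endswith("!"):
            raise ValueError(f"unsupported selector term: {term!r}")
        labels[key] = value
    return labels


def build_partition_policy(name: str, namespace: str, selector: str) -> Dict[str, Any]:
    """Build a NetworkPolicy that denies all ingress and egress for matching pods."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": parse_label_selector(selector)},
            # Declaring both policy types with no allow rules blocks all traffic,
            # including traffic to the API server.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


policy = build_partition_policy(
    "feast-network-partition",  # hypothetical policy name
    "opendatahub",
    "control-plane=controller-manager,app.kubernetes.io/name=feast-operator",
)
print(json.dumps(policy, indent=2))
```

Deleting the policy ends the partition, which matches the expectation that reconciliation resumes without manual intervention once the partition is removed.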
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: feast-network-partition
spec:
  tier: 2
  target:
    operator: feast
    component: feast-operator-controller-manager
    resource: Deployment/feast-operator-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: feast-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: control-plane=controller-manager,app.kubernetes.io/name=feast-operator
    ttl: "300s"
  hypothesis:
    description: >-
      When the feast-operator is network-partitioned from the API server,
      FeatureStore reconciliation stops. Existing feature servers remain
      available and continue serving features. Once the partition is
      removed, reconciliation resumes without manual intervention.
    recoveryTimeout: 180s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
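The `steadyState` check in the experiment above is of type `conditionTrue` on the Deployment's `Available` condition. As a minimal sketch of what such a check amounts to, assuming the object has already been fetched as a plain dictionary, the hypothetical helper `condition_is_true` below scans `status.conditions` for the named type. It is an illustration, not the chaos tool's own implementation.

```python
from typing import Any, Dict


def condition_is_true(obj: Dict[str, Any], condition_type: str) -> bool:
    """Return True only if the named condition exists and its status is "True"."""
    conditions = (obj.get("status") or {}).get("conditions") or []
    for cond in conditions:
        if cond.get("type") == condition_type:
            # Kubernetes reports condition status as the strings
            # "True", "False" or "Unknown", not as booleans.
            return cond.get("status") == "True"
    return False  # a missing condition is treated as not satisfied


healthy = {
    "kind": "Deployment",
    "status": {
        "conditions": [
            {"type": "Progressing", "status": "True"},
            {"type": "Available", "status": "True"},
        ]
    },
}
degraded = {
    "kind": "Deployment",
    "status": {"conditions": [{"type": "Available", "status": "False"}]},
}

print(condition_is_true(healthy, "Available"))
print(condition_is_true(degraded, "Available"))
```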
feast-pod-kill
- Type: PodKill
- Danger Level: low
- Component: feast-operator-controller-manager
When the feast-operator pod is killed, existing FeatureStore instances continue serving features. New FeatureStore deployments queue until the operator recovers and processes the backlog within the recovery timeout.
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: feast-pod-kill
spec:
  tier: 1
  target:
    operator: feast
    component: feast-operator-controller-manager
    resource: Deployment/feast-operator-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: feast-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: control-plane=controller-manager,app.kubernetes.io/name=feast-operator
    count: 1
    ttl: "300s"
  hypothesis:
    description: >-
      When the feast-operator pod is killed, existing FeatureStore instances
      continue serving features. New FeatureStore deployments queue until
      the operator recovers and processes the backlog within the recovery
      timeout.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
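The hypothesis above only holds if the operator is healthy again within `recoveryTimeout` (120s for this experiment). A minimal sketch of that kind of bounded wait follows. The function `wait_for_recovery`, its polling interval, and the `FakeClock` used to make the example run instantly are all hypothetical, and the chaos tool may verify recovery differently.

```python
import time
from typing import Callable


def wait_for_recovery(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


class FakeClock:
    """Deterministic clock: sleeping just advances the current time."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeClock()
# Pretend the operator Deployment becomes Available again 40 s after the kill.
recovered = wait_for_recovery(
    check=lambda: fake.now >= 40.0,
    timeout_s=120.0,
    clock=fake.time,
    sleep=fake.sleep,
)
print(recovered, fake.now)
```

Injecting the clock and sleep functions keeps the loop testable without waiting in real time. In a real run, `check` would query the cluster, for example for the Deployment's `Available` condition.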
feast-rbac-revoke
- Type: RBACRevoke
- Danger Level: high
- Component: feast-operator-controller-manager
When the feast-operator ClusterRoleBinding subjects are revoked, the operator loses cluster access and can no longer manage FeatureStore resources. API calls return 403 errors. Once permissions are restored, normal operation resumes without restart.
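Revoking the subjects of a ClusterRoleBinding can be pictured as emptying its `subjects` list while keeping a copy for the restore step. The sketch below shows that round trip on a plain dictionary. The helpers `revoke_subjects` and `restore_subjects`, the `roleRef` name, and the ServiceAccount name are assumptions made for illustration and are not taken from the operator's manifests.

```python
import copy
from typing import Any, Dict, List, Tuple


def revoke_subjects(binding: Dict[str, Any]) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Return (copy of the binding with no subjects, saved subjects for restore)."""
    saved = copy.deepcopy(binding.get("subjects") or [])
    revoked = copy.deepcopy(binding)
    revoked["subjects"] = []  # nobody is bound to the role any more
    return revoked, saved


def restore_subjects(binding: Dict[str, Any], saved: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return a copy of the binding with the saved subjects put back."""
    restored = copy.deepcopy(binding)
    restored["subjects"] = copy.deepcopy(saved)
    return restored


binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRoleBinding",
    "metadata": {"name": "feast-operator-manager-rolebinding"},
    # roleRef and subject names below are hypothetical placeholders.
    "roleRef": {
        "apiGroup": "rbac.authorization.k8s.io",
        "kind": "ClusterRole",
        "name": "feast-operator-manager-role",
    },
    "subjects": [
        {
            "kind": "ServiceAccount",
            "name": "feast-operator-controller-manager",
            "namespace": "opendatahub",
        }
    ],
}

revoked, saved = revoke_subjects(binding)
restored = restore_subjects(revoked, saved)
print(len(revoked["subjects"]), restored == binding)
```

While the binding has no subjects, requests made with the operator's service account are no longer authorized by this binding, which is where the 403 responses come from. Restoring the saved list returns the binding to its original state.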
Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: feast-rbac-revoke
spec:
  tier: 4
  target:
    operator: feast
    component: feast-operator-controller-manager
    resource: ClusterRoleBinding/feast-operator-manager-rolebinding
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: feast-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: feast-operator-manager-rolebinding
      bindingType: ClusterRoleBinding
    ttl: "60s"
  hypothesis:
    description: >-
      When the feast-operator ClusterRoleBinding subjects are revoked, the
      operator loses cluster access and can no longer manage FeatureStore
      resources. API calls return 403 errors. Once permissions are restored,
      normal operation resumes without restart.
    recoveryTimeout: 120s
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
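The `blastRadius` fields read like guard rails that are evaluated before injection. The chaos tool's actual validation rules are not described here, so the sketch below is only one plausible reading of the field names: it assumes `dangerLevel` sits under `injection`, that `allowDangerous` sits under `blastRadius`, and that a `high` danger level requires `allowDangerous: true`. The function `blast_radius_violations` is hypothetical.

```python
from typing import Any, Dict, List, Optional


def blast_radius_violations(
    spec: Dict[str, Any],
    pods_to_affect: int,
    namespace: Optional[str] = None,
) -> List[str]:
    """Return human-readable reasons why an injection should be refused."""
    injection = spec.get("injection") or {}
    radius = spec.get("blastRadius") or {}
    problems: List[str] = []

    max_pods = radius.get("maxPodsAffected")
    if max_pods is not None and pods_to_affect > max_pods:
        problems.append(f"{pods_to_affect} pods exceeds maxPodsAffected={max_pods}")

    allowed = radius.get("allowedNamespaces")
    if allowed is not None and namespace is not None and namespace not in allowed:
        problems.append(f"namespace {namespace!r} is not in allowedNamespaces")

    # Assumed rule: high-danger injections need an explicit opt-in.
    if injection.get("dangerLevel") == "high" and not radius.get("allowDangerous", False):
        problems.append("dangerLevel is high but allowDangerous is not set")

    return problems


rbac_spec = {
    "injection": {"type": "RBACRevoke", "dangerLevel": "high"},
    "blastRadius": {"maxPodsAffected": 1, "allowDangerous": True},
}
pod_kill_spec = {
    "injection": {"type": "PodKill", "count": 1},
    "blastRadius": {"maxPodsAffected": 1, "allowedNamespaces": ["opendatahub"]},
}

print(blast_radius_violations(rbac_spec, pods_to_affect=1))
print(blast_radius_violations(pod_kill_spec, pods_to_affect=2, namespace="default"))
```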