Skip to content

PodKill

Danger Level: Low

Force-deletes pods matching a label selector with zero grace period.

Spec Fields

Field Type Required Default Description
labelSelector string Yes - Equality-based label selector to match target pods (e.g., app=my-controller)
count int No 1 Number of pods to kill
ttl duration No 300s Auto-cleanup duration

How It Works

PodKill uses the Kubernetes API to delete pods matching the specified label selector with a zero grace period (GracePeriodSeconds: 0). When multiple pods match, the injector shuffles the list and kills up to count pods.

API calls: 1. List pods matching labelSelector in the target namespace 2. Delete each selected pod with GracePeriodSeconds: 0

Cleanup: No cleanup needed. The owning controller (Deployment, StatefulSet, DaemonSet) automatically recreates deleted pods.

Crash safety: Fully crash-safe. If the chaos tool crashes mid-injection, the owning controller still recreates pods. No rollback annotations needed.

Disruption Rubric

Expected behavior on a healthy operator: The operator's Deployment controller recreates the pod within seconds. The new pod passes readiness probes and the Deployment returns to Available condition within the recoveryTimeout. If the operator uses leader election, the new pod acquires the lease and resumes reconciliation.

Contract violation indicators: - Pod is not recreated (missing owning controller or broken replica count) - New pod enters CrashLoopBackOff (indicates a startup dependency issue) - Deployment stays unavailable beyond recoveryTimeout (indicates slow readiness) - Reconciliation does not resume after pod restart (indicates state loss)

Collateral damage risks: - Minimal. PodKill only affects pods matching the exact label selector - If the controller manages stateful resources (Leases, PVCs), verify they are not orphaned - On resource-constrained clusters, the new pod may be slow to schedule

Recovery expectations: - Recovery time: typically 10-30 seconds for a healthy Deployment - Reconcile cycles: 1 (the Deployment controller's standard behavior) - What "recovered" means: Deployment has Available=True condition

Cross-Component Results

Component Experiment Danger Description
codeflare codeflare-pod-kill low When the codeflare-operator pod is killed, existing Ray clusters remain unaffect...
dashboard dashboard-pod-kill low When one odh-dashboard pod is killed, the remaining replica should continue serv...
data-science-pipelines data-science-pipelines-pod-kill low When the data-science-pipelines-operator pod is killed, Kubernetes should recrea...
feast feast-pod-kill low When the feast-operator pod is killed, existing FeatureStore instances continue ...
kserve kserve-dependency-odh-model-controller-kill low Killing odh-model-controller (which kserve depends on for model serving routing)...
kserve kserve-main-controller-kill low When the kserve-controller-manager pod is killed, the Deployment controller recr...
kueue kueue-pod-kill low When the kueue-controller-manager pod is killed, pending workloads should queue ...
llamastack llamastack-pod-kill low When the llamastack-controller-manager pod is killed, existing LlamaStack distri...
model-registry model-registry-pod-kill low When the model-registry-operator pod is killed, Kubernetes should recreate it wi...
modelmesh modelmesh-pod-kill low When the modelmesh-controller pod is killed, existing model endpoints keep servi...
odh-model-controller odh-model-controller-dependency-kserve-kill low Killing the kserve-controller-manager (a dependency of odh-model-controller) sho...
odh-model-controller odh-model-controller-pod-kill low When the odh-model-controller pod is killed, Kubernetes should recreate it withi...
opendatahub-operator opendatahub-operator-pod-kill low When one operator pod is killed, the remaining replicas should maintain the lead...
ray ray-pod-kill low When the ray-operator pod is killed, existing RayClusters keep running and servi...
training-operator training-operator-pod-kill low When the training-operator pod is killed, running training jobs continue via wor...
trustyai trustyai-pod-kill low When the trustyai-service-operator pod is killed, existing TrustyAI services kee...
workbenches workbenches-dependency-dashboard-kill low Killing the dashboard (which workbenches integrates with for notebook management...
workbenches workbenches-pod-kill low When the odh-notebook-controller pod is killed, Kubernetes should recreate it wi...