PodKill¶

Danger Level: Low

Force-deletes pods matching a label selector with zero grace period.

Spec Fields¶

Field	Type	Required	Default	Description
`labelSelector`	`string`	Yes	-	Equality-based label selector to match target pods (e.g., app=my-controller)
`count`	`int`	No	`1`	Number of pods to kill
`ttl`	`duration`	No	`300s`	Auto-cleanup duration

How It Works¶

PodKill uses the Kubernetes API to delete pods matching the specified label selector with a zero grace period (GracePeriodSeconds: 0). When multiple pods match, the injector shuffles the list and kills up to count pods.

API calls: 1. List pods matching labelSelector in the target namespace 2. Delete each selected pod with GracePeriodSeconds: 0

Cleanup: No cleanup needed. The owning controller (Deployment, StatefulSet, DaemonSet) automatically recreates deleted pods.

Crash safety: Fully crash-safe. If the chaos tool crashes mid-injection, the owning controller still recreates pods. No rollback annotations needed.

Disruption Rubric¶

Expected behavior on a healthy operator: The operator's Deployment controller recreates the pod within seconds. The new pod passes readiness probes and the Deployment returns to Available condition within the recoveryTimeout. If the operator uses leader election, the new pod acquires the lease and resumes reconciliation.

Contract violation indicators: - Pod is not recreated (missing owning controller or broken replica count) - New pod enters CrashLoopBackOff (indicates a startup dependency issue) - Deployment stays unavailable beyond recoveryTimeout (indicates slow readiness) - Reconciliation does not resume after pod restart (indicates state loss)

Collateral damage risks: - Minimal. PodKill only affects pods matching the exact label selector - If the controller manages stateful resources (Leases, PVCs), verify they are not orphaned - On resource-constrained clusters, the new pod may be slow to schedule

Recovery expectations: - Recovery time: typically 10-30 seconds for a healthy Deployment - Reconcile cycles: 1 (the Deployment controller's standard behavior) - What "recovered" means: Deployment has Available=True condition

Cross-Component Results¶

Component	Experiment	Danger	Description
codeflare	codeflare-pod-kill	low	When the codeflare-operator pod is killed, existing Ray clusters remain unaffect...
dashboard	dashboard-pod-kill	low	When one odh-dashboard pod is killed, the remaining replica should continue serv...
data-science-pipelines	data-science-pipelines-pod-kill	low	When the data-science-pipelines-operator pod is killed, Kubernetes should recrea...
feast	feast-pod-kill	low	When the feast-operator pod is killed, existing FeatureStore instances continue ...
kserve	kserve-dependency-odh-model-controller-kill	low	Killing odh-model-controller (which kserve depends on for model serving routing)...
kserve	kserve-main-controller-kill	low	When the kserve-controller-manager pod is killed, the Deployment controller recr...
kueue	kueue-pod-kill	low	When the kueue-controller-manager pod is killed, pending workloads should queue ...
llamastack	llamastack-pod-kill	low	When the llamastack-controller-manager pod is killed, existing LlamaStack distri...
model-registry	model-registry-pod-kill	low	When the model-registry-operator pod is killed, Kubernetes should recreate it wi...
modelmesh	modelmesh-pod-kill	low	When the modelmesh-controller pod is killed, existing model endpoints keep servi...
odh-model-controller	odh-model-controller-dependency-kserve-kill	low	Killing the kserve-controller-manager (a dependency of odh-model-controller) sho...
odh-model-controller	odh-model-controller-pod-kill	low	When the odh-model-controller pod is killed, Kubernetes should recreate it withi...
opendatahub-operator	opendatahub-operator-pod-kill	low	When one operator pod is killed, the remaining replicas should maintain the lead...
ray	ray-pod-kill	low	When the ray-operator pod is killed, existing RayClusters keep running and servi...
training-operator	training-operator-pod-kill	low	When the training-operator pod is killed, running training jobs continue via wor...
trustyai	trustyai-pod-kill	low	When the trustyai-service-operator pod is killed, existing TrustyAI services kee...
workbenches	workbenches-dependency-dashboard-kill	low	Killing the dashboard (which workbenches integrates with for notebook management...
workbenches	workbenches-pod-kill	low	When the odh-notebook-controller pod is killed, Kubernetes should recreate it wi...