NetworkPartition

Danger Level: Medium

Creates a deny-all NetworkPolicy isolating pods matching a label selector from all ingress and egress traffic.

Spec Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| labelSelector | string | Yes | - | Equality-based label selector for target pods (set-based selectors not supported) |
| ttl | duration | No | 300s | Auto-cleanup duration for the NetworkPolicy |
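
The field rules above (a required labelSelector and a ttl that defaults to 300s) can be sketched as a small validation helper. This is an illustration only: `partitionSpec` and `newPartitionSpec` are hypothetical names, and the sketch assumes ttl is written in Go duration syntax such as `300s` or `5m`, which the page does not state explicitly.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// defaultTTL mirrors the documented default of 300s.
const defaultTTL = 300 * time.Second

// partitionSpec is a hypothetical in-memory form of the two spec fields.
type partitionSpec struct {
	LabelSelector string
	TTL           time.Duration
}

// newPartitionSpec validates the raw field values and applies the default TTL.
// Assumption: ttl uses Go duration syntax ("300s", "5m").
func newPartitionSpec(labelSelector, ttl string) (partitionSpec, error) {
	if labelSelector == "" {
		return partitionSpec{}, errors.New("labelSelector is required")
	}
	spec := partitionSpec{LabelSelector: labelSelector, TTL: defaultTTL}
	if ttl == "" {
		return spec, nil // ttl omitted: keep the default
	}
	d, err := time.ParseDuration(ttl)
	if err != nil {
		return partitionSpec{}, fmt.Errorf("invalid ttl %q: %w", ttl, err)
	}
	if d <= 0 {
		return partitionSpec{}, fmt.Errorf("ttl must be positive, got %s", d)
	}
	spec.TTL = d
	return spec, nil
}

func main() {
	spec, err := newPartitionSpec("app=kueue", "")
	if err != nil {
		panic(err)
	}
	fmt.Println(spec.LabelSelector, spec.TTL) // prints "app=kueue 5m0s"
}
```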

How It Works

NetworkPartition creates a Kubernetes NetworkPolicy that blocks all ingress and egress traffic for pods matching the label selector. The policy name is sanitized (truncated to 63 chars with a hash suffix for uniqueness) and labeled with app.kubernetes.io/managed-by: operator-chaos for cleanup.
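
One way to picture the name sanitization is the sketch below. The page only states that names are truncated to 63 characters with a hash suffix; the exact scheme is not documented here, so the SHA-256 digest, the 8-character suffix, the character filtering, and the helper name `sanitizePolicyName` are all assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
	"strings"
)

// maxNameLen is the 63-character limit mentioned above.
const maxNameLen = 63

// invalidChars matches anything outside lowercase alphanumerics and '-'.
var invalidChars = regexp.MustCompile(`[^a-z0-9-]+`)

// sanitizePolicyName is a hypothetical helper: it lowercases the input,
// replaces invalid characters with '-', truncates the result, and appends a
// short hash of the ORIGINAL input so that distinct inputs stay distinct
// even when their truncated prefixes collide.
func sanitizePolicyName(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	suffix := hex.EncodeToString(sum[:])[:8] // assumed suffix length

	base := invalidChars.ReplaceAllString(strings.ToLower(raw), "-")
	base = strings.Trim(base, "-")

	maxBase := maxNameLen - len(suffix) - 1 // leave room for "-" + suffix
	if len(base) > maxBase {
		base = strings.TrimRight(base[:maxBase], "-")
	}
	if base == "" {
		return suffix
	}
	return base + "-" + suffix
}

func main() {
	fmt.Println(sanitizePolicyName("chaos-netpart-app=kueue,component=controller"))
	fmt.Println(len(sanitizePolicyName(strings.Repeat("x", 200)))) // prints 63
}
```

Hashing the original input rather than the truncated one is a deliberate choice in this sketch: two long selectors that share their first 54 characters still produce different policy names.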

API calls:

1. Parse labelSelector into metav1.LabelSelector
2. Create a NetworkPolicy with deny-all ingress and egress rules
3. On cleanup: Delete the NetworkPolicy by name
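
The first two steps can be sketched without a cluster by building the manifest by hand. This is a stdlib-only approximation, not the tool's code: the real implementation parses into `metav1.LabelSelector` and talks to the API server, while the structs below only mirror the JSON shape of a `networking.k8s.io/v1` NetworkPolicy. The helper names, the policy name, and the namespace are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Minimal local mirrors of the NetworkPolicy JSON shape (assumption: only
// the fields needed for a deny-all policy are modelled).
type metadata struct {
	Name      string            `json:"name"`
	Namespace string            `json:"namespace"`
	Labels    map[string]string `json:"labels"`
}

type podSelector struct {
	MatchLabels map[string]string `json:"matchLabels"`
}

type policySpec struct {
	PodSelector podSelector `json:"podSelector"`
	PolicyTypes []string    `json:"policyTypes"`
}

type networkPolicy struct {
	APIVersion string     `json:"apiVersion"`
	Kind       string     `json:"kind"`
	Metadata   metadata   `json:"metadata"`
	Spec       policySpec `json:"spec"`
}

// parseEqualitySelector accepts "k=v" and "k==v" terms separated by commas.
// Set-based terms ("env in (a,b)") and "!=" are rejected, since matchLabels
// cannot express them.
func parseEqualitySelector(selector string) (map[string]string, error) {
	labels := map[string]string{}
	for _, term := range strings.Split(selector, ",") {
		key, value, found := strings.Cut(term, "=")
		key = strings.TrimSpace(key)
		value = strings.TrimSpace(strings.TrimPrefix(value, "=")) // allow "=="
		if !found || key == "" || strings.HasSuffix(key, "!") ||
			strings.ContainsAny(key+value, " ()") {
			return nil, fmt.Errorf("unsupported selector term %q", strings.TrimSpace(term))
		}
		labels[key] = value
	}
	return labels, nil
}

// denyAllPolicy lists both policy types and defines no ingress or egress
// rules, which blocks all traffic for the selected pods.
func denyAllPolicy(name, namespace string, matchLabels map[string]string) networkPolicy {
	return networkPolicy{
		APIVersion: "networking.k8s.io/v1",
		Kind:       "NetworkPolicy",
		Metadata: metadata{
			Name:      name,
			Namespace: namespace,
			Labels:    map[string]string{"app.kubernetes.io/managed-by": "operator-chaos"},
		},
		Spec: policySpec{
			PodSelector: podSelector{MatchLabels: matchLabels},
			PolicyTypes: []string{"Ingress", "Egress"},
		},
	}
}

func main() {
	labels, err := parseEqualitySelector("app=kueue,control-plane=controller-manager")
	if err != nil {
		panic(err)
	}
	// Name and namespace are placeholders for illustration.
	policy := denyAllPolicy("chaos-netpart-example", "example-namespace", labels)
	out, err := json.MarshalIndent(policy, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The printed JSON can be compared against the policy the tool actually creates (for example with `kubectl get networkpolicy -o json`) to confirm what is being applied to the target pods.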

Cleanup: Deletes the created NetworkPolicy. Traffic resumes immediately after deletion (no pod restart needed).

Crash safety: If the chaos tool crashes, the NetworkPolicy persists. Use operator-chaos clean to find and remove orphaned policies by the managed-by label.

Disruption Rubric

Expected behavior on a healthy operator: The operator's pods lose network connectivity. API server calls from the controller fail. Once the NetworkPolicy is removed, the controller reconnects and resumes reconciliation. The Deployment should return to Available within recoveryTimeout.

Contract violation indicators:

- Controller enters CrashLoopBackOff after network is restored (indicates no retry/backoff logic)
- Controller does not resume reconciliation after partition ends (indicates lost watch connections without reconnect)
- Data corruption or inconsistent state after recovery (indicates missing conflict resolution)

Collateral damage risks:

- High. The NetworkPolicy affects ALL pods matching the selector, not just the controller
- If the selector matches data-plane pods (e.g., serving pods), user traffic is disrupted
- On resource-constrained clusters (fewer than 4 CPUs per node), recovery may be slow due to scheduling pressure
- The NetworkPolicy requires a CNI that enforces policies (e.g., Calico, Cilium). Without enforcement, the policy has no effect and the test is meaningless.

Recovery expectations:

- Recovery time: 30-120 seconds depending on watch reconnection and leader election
- Reconcile cycles: 1-3 (initial reconnect, catch-up reconciliation, steady state)
- What "recovered" means: Deployment has Available=True, controller is actively reconciling
- Known failure scenario: On 2-node clusters with high CPU utilization (>80%), NetworkPartition consistently produces Failed verdicts due to recovery timeout pressure
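
The recovery check described above amounts to polling until the Deployment reports Available=True or recoveryTimeout elapses. The sketch below shows that loop in isolation; `waitForRecovery` and `errRecoveryTimeout` are hypothetical names, and the `isAvailable` callback stands in for whatever query the tool really uses to read the Deployment condition.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errRecoveryTimeout signals that the target never became Available in time.
var errRecoveryTimeout = errors.New("deployment not Available within recoveryTimeout")

// waitForRecovery polls isAvailable every interval until it returns true or
// the next poll would fall past recoveryTimeout. It returns the elapsed time
// so callers can compare it against the expected 30-120 second window.
func waitForRecovery(recoveryTimeout, interval time.Duration, isAvailable func() bool) (time.Duration, error) {
	start := time.Now()
	for {
		if isAvailable() {
			return time.Since(start), nil
		}
		if time.Since(start)+interval > recoveryTimeout {
			return time.Since(start), errRecoveryTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulated check: reports Available on the third poll.
	polls := 0
	check := func() bool { polls++; return polls >= 3 }

	elapsed, err := waitForRecovery(2*time.Second, 10*time.Millisecond, check)
	fmt.Println(polls, err, elapsed < 2*time.Second) // prints "3 <nil> true"
}
```

A timeout from a loop like this corresponds to the Failed verdict mentioned in the known failure scenario: the controller may still recover later, just not within recoveryTimeout.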

Cross-Component Results

| Component | Experiment | Danger | Description |
|---|---|---|---|
| codeflare | codeflare-network-partition | medium | When the codeflare-operator is network-partitioned from the API server, AppWrapp... |
| dashboard | dashboard-network-partition | medium | When odh-dashboard pods are network-partitioned from the API server, the dashboa... |
| data-science-pipelines | data-science-pipelines-network-partition | medium | When the DSPO pod is network-partitioned from the API server, it should lose its... |
| feast | feast-network-partition | medium | When the feast-operator is network-partitioned from the API server, FeatureStore... |
| kserve | kserve-llm-controller-isolation | medium | When the llmisvc-controller-manager is network-partitioned from the API server, ... |
| kueue | kueue-network-partition | medium | When kueue-controller-manager pods are network-partitioned from the API server, ... |
| llamastack | llamastack-network-partition | medium | When the llamastack-controller-manager is network-partitioned from the API serve... |
| model-registry | model-registry-network-partition | medium | When the model-registry-operator pod is network-partitioned from the API server,... |
| modelmesh | modelmesh-network-partition | medium | When the modelmesh-controller is network-partitioned from the API server, model ... |
| odh-model-controller | odh-model-controller-network-partition | medium | When the odh-model-controller pod is network-partitioned from the API server, it... |
| opendatahub-operator | opendatahub-operator-network-partition | medium | When the operator pods are network-partitioned, the leader should lose its lease... |
| ray | ray-network-partition | medium | When the ray-operator is network-partitioned from the API server, cluster scalin... |
| training-operator | training-operator-network-partition | medium | When the training-operator is network-partitioned from the API server, job statu... |
| trustyai | trustyai-network-partition | medium | When the trustyai-service-operator is network-partitioned from the API server, e... |
| workbenches | workbenches-network-partition | medium | When the odh-notebook-controller pod is network-partitioned from the API server,... |