NetworkPartition¶
Danger Level: Medium
Creates a deny-all NetworkPolicy isolating pods matching a label selector from all ingress and egress traffic.
Spec Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
labelSelector |
string |
Yes | - | Equality-based label selector for target pods (set-based selectors not supported) |
ttl |
duration |
No | 300s |
Auto-cleanup duration for the NetworkPolicy |
How It Works¶
NetworkPartition creates a Kubernetes NetworkPolicy that blocks all ingress and egress traffic for pods matching the label selector. The policy name is sanitized (truncated to 63 chars with a hash suffix for uniqueness) and labeled with app.kubernetes.io/managed-by: operator-chaos for cleanup.
API calls:
1. Parse labelSelector into metav1.LabelSelector
2. Create a NetworkPolicy with deny-all ingress and egress rules
3. On cleanup: Delete the NetworkPolicy by name
Cleanup: Deletes the created NetworkPolicy. Traffic resumes immediately after deletion (no pod restart needed).
Crash safety: If the chaos tool crashes, the NetworkPolicy persists. Use operator-chaos clean to find and remove orphaned policies by the managed-by label.
Disruption Rubric¶
Expected behavior on a healthy operator:
The operator's pods lose network connectivity. API server calls from the controller fail. Once the NetworkPolicy is removed, the controller reconnects and resumes reconciliation. The Deployment should return to Available within recoveryTimeout.
Contract violation indicators: - Controller enters CrashLoopBackOff after network is restored (indicates no retry/backoff logic) - Controller does not resume reconciliation after partition ends (indicates lost watch connections without reconnect) - Data corruption or inconsistent state after recovery (indicates missing conflict resolution)
Collateral damage risks: - High. NetworkPolicy affects ALL pods matching the selector, not just the controller - If the selector matches data-plane pods (e.g., serving pods), user traffic is disrupted - On resource-constrained clusters (< 4 CPU per node), recovery may be slow due to scheduling pressure - The NetworkPolicy requires a CNI that enforces policies (Calico, Cilium). Without enforcement, this test is meaningless.
Recovery expectations:
- Recovery time: 30-120 seconds depending on watch reconnection and leader election
- Reconcile cycles: 1-3 (initial reconnect, catch-up reconciliation, steady state)
- What "recovered" means: Deployment has Available=True, controller is actively reconciling
- Known failure scenario: On 2-node clusters with high CPU utilization (>80%), NetworkPartition consistently produces Failed verdicts due to recovery timeout pressure
Cross-Component Results¶
| Component | Experiment | Danger | Description |
|---|---|---|---|
| codeflare | codeflare-network-partition | medium | When the codeflare-operator is network-partitioned from the API server, AppWrapp... |
| dashboard | dashboard-network-partition | medium | When odh-dashboard pods are network-partitioned from the API server, the dashboa... |
| data-science-pipelines | data-science-pipelines-network-partition | medium | When the DSPO pod is network-partitioned from the API server, it should lose its... |
| feast | feast-network-partition | medium | When the feast-operator is network-partitioned from the API server, FeatureStore... |
| kserve | kserve-llm-controller-isolation | medium | When the llmisvc-controller-manager is network-partitioned from the API server, ... |
| kueue | kueue-network-partition | medium | When kueue-controller-manager pods are network-partitioned from the API server, ... |
| llamastack | llamastack-network-partition | medium | When the llamastack-controller-manager is network-partitioned from the API serve... |
| model-registry | model-registry-network-partition | medium | When the model-registry-operator pod is network-partitioned from the API server,... |
| modelmesh | modelmesh-network-partition | medium | When the modelmesh-controller is network-partitioned from the API server, model ... |
| odh-model-controller | odh-model-controller-network-partition | medium | When the odh-model-controller pod is network-partitioned from the API server, it... |
| opendatahub-operator | opendatahub-operator-network-partition | medium | When the operator pods are network-partitioned, the leader should lose its lease... |
| ray | ray-network-partition | medium | When the ray-operator is network-partitioned from the API server, cluster scalin... |
| training-operator | training-operator-network-partition | medium | When the training-operator is network-partitioned from the API server, job statu... |
| trustyai | trustyai-network-partition | medium | When the trustyai-service-operator is network-partitioned from the API server, e... |
| workbenches | workbenches-network-partition | medium | When the odh-notebook-controller pod is network-partitioned from the API server,... |