Injection Engine¶
The injection engine is a pluggable system for executing fault injections. It provides a registry pattern for managing injection types, a standardized interface for injectors, and crash-safe cleanup mechanisms.
Core Interface¶
All injection types implement the Injector interface:
type Injector interface {
Validate(spec v1alpha1.InjectionSpec, blast v1alpha1.BlastRadiusSpec) error
Inject(ctx context.Context, spec v1alpha1.InjectionSpec, namespace string) (CleanupFunc, []v1alpha1.InjectionEvent, error)
Revert(ctx context.Context, spec v1alpha1.InjectionSpec, namespace string) error
}
Injection Lifecycle¶
sequenceDiagram
participant O as Orchestrator
participant R as Registry
participant I as Injector
participant K as K8s Resource
rect rgb(227, 242, 253)
Note over O,R: Lookup
O->>R: Get(injectionType)
R-->>O: injector
end
rect rgb(243, 229, 245)
Note over O,I: Validation
O->>I: Validate(spec, blast)
I-->>O: OK
end
rect rgb(255, 243, 224)
Note over O,K: Injection
O->>I: Inject(ctx, spec, ns)
I->>K: Apply fault
I->>K: Store rollback annotation
I-->>O: CleanupFunc + Events
end
rect rgb(255, 235, 238)
Note over O,K: Recovery window (hypothesis timeout)
end
rect rgb(232, 245, 233)
Note over O,K: Revert
O->>I: Revert(ctx, spec, ns)
I->>K: Read rollback annotation
I->>K: Restore original state
I->>K: Remove chaos metadata
I-->>O: OK
end
Method Contracts¶
Validate(spec, blast) error¶
Pre-flight validation before injection:
- Required parameters — Check that all required parameters are present
- Parameter format — Validate label selectors, resource names, field paths
- Blast radius — Verify injection respects
maxPodsAffected,allowedNamespaces - Danger level — Ensure high-danger injections have
allowDangerous: true
Returns: Error if validation fails, preventing injection execution
Example (PodKill):
func (p *PodKillInjector) Validate(spec v1alpha1.InjectionSpec, blast v1alpha1.BlastRadiusSpec) error {
if spec.Parameters["labelSelector"] == "" {
return fmt.Errorf("labelSelector parameter is required")
}
selector, err := labels.Parse(spec.Parameters["labelSelector"])
if err != nil {
return fmt.Errorf("invalid labelSelector: %w", err)
}
if spec.Count <= 0 {
return fmt.Errorf("count must be positive")
}
return nil
}
Inject(ctx, spec, namespace) (CleanupFunc, []InjectionEvent, error)¶
Execute the fault injection:
- Perform injection — Apply the fault (create NetworkPolicy, delete Pod, etc.)
- Record events — Return structured logs of what was injected
- Return cleanup — Function to reverse the injection (or no-op if not reversible)
Returns:
CleanupFunc— Function to undo injection (may be no-op)[]InjectionEvent— List of injection events for audit logerror— If injection failed
Example (NetworkPartition):
func (n *NetworkPartitionInjector) Inject(ctx context.Context, spec v1alpha1.InjectionSpec, namespace string) (CleanupFunc, []v1alpha1.InjectionEvent, error) {
selector, _ := labels.Parse(spec.Parameters["labelSelector"])
// Create deny-all NetworkPolicy
policy := &networkingv1.NetworkPolicy{
ObjectMeta: metav1.ObjectMeta{
Name: "operator-chaos-np-" + sanitizeName(spec.Parameters["labelSelector"]),
Namespace: namespace,
Labels: safety.ChaosLabels(string(v1alpha1.NetworkPartition)),
},
Spec: networkingv1.NetworkPolicySpec{
PodSelector: metav1.LabelSelector{MatchLabels: matchLabels(selector)},
PolicyTypes: []networkingv1.PolicyType{
networkingv1.PolicyTypeIngress,
networkingv1.PolicyTypeEgress,
},
// Empty ingress/egress = deny all
},
}
if err := n.client.Create(ctx, policy); err != nil {
return nil, nil, fmt.Errorf("creating NetworkPolicy: %w", err)
}
events := []v1alpha1.InjectionEvent{
injection.NewEvent(v1alpha1.NetworkPartition, policy.Name, "created",
map[string]string{"namespace": namespace, "selector": spec.Parameters["labelSelector"]}),
}
cleanup := func(ctx context.Context) error {
return n.client.Delete(ctx, policy)
}
return cleanup, events, nil
}
Revert(ctx, spec, namespace) error¶
Stateless cleanup that can run after controller restarts:
- Read rollback annotation — Retrieve original state from resource annotations
- Restore original state — Apply rollback data
- Remove chaos metadata — Clean up labels and annotations
- Idempotent — Safe to call multiple times (return nil if already reverted)
Returns: Error if revert failed
Example (ConfigDrift):
func (d *ConfigDriftInjector) Revert(ctx context.Context, spec v1alpha1.InjectionSpec, namespace string) error {
key := types.NamespacedName{Name: spec.Parameters["name"], Namespace: namespace}
cm := &corev1.ConfigMap{}
if err := d.client.Get(ctx, key, cm); err != nil {
if apierrors.IsNotFound(err) {
return nil // Already gone, nothing to revert
}
return err
}
// Check for rollback annotation
rollbackStr, ok := cm.GetAnnotations()[safety.RollbackAnnotationKey]
if !ok {
return nil // No chaos metadata, already reverted
}
var rollbackInfo map[string]string
if err := safety.UnwrapRollbackData(rollbackStr, &rollbackInfo); err != nil {
return err
}
// Restore original value
dataKey := spec.Parameters["key"]
if rollbackInfo["keyExists"] == "false" {
delete(cm.Data, dataKey)
} else {
cm.Data[dataKey] = rollbackInfo["originalValue"]
}
// Remove chaos metadata
safety.RemoveChaosMetadata(cm, string(v1alpha1.ConfigDrift))
return d.client.Update(ctx, cm)
}
Registry Pattern¶
The Registry manages all registered injectors:
type Registry struct {
injectors map[v1alpha1.InjectionType]Injector
}
func NewRegistry() *Registry {
return &Registry{injectors: make(map[v1alpha1.InjectionType]Injector)}
}
func (r *Registry) Register(t v1alpha1.InjectionType, i Injector) {
r.injectors[t] = i
}
func (r *Registry) Get(t v1alpha1.InjectionType) (Injector, error) {
if inj, ok := r.injectors[t]; ok {
return inj, nil
}
return nil, fmt.Errorf("unknown injection type %q", t)
}
Initializing the Registry¶
func NewDefaultRegistry(client client.Client) *injection.Registry {
registry := injection.NewRegistry()
registry.Register(v1alpha1.PodKill, injection.NewPodKillInjector(client))
registry.Register(v1alpha1.NetworkPartition, injection.NewNetworkPartitionInjector(client))
registry.Register(v1alpha1.ConfigDrift, injection.NewConfigDriftInjector(client))
registry.Register(v1alpha1.CRDMutation, injection.NewCRDMutationInjector(client))
registry.Register(v1alpha1.WebhookDisrupt, injection.NewWebhookDisruptInjector(client))
registry.Register(v1alpha1.RBACRevoke, injection.NewRBACRevokeInjector(client))
registry.Register(v1alpha1.FinalizerBlock, injection.NewFinalizerBlockInjector(client))
registry.Register(v1alpha1.ClientFault, injection.NewClientFaultInjector(client))
registry.Register(v1alpha1.OwnerRefOrphan, injection.NewOwnerRefOrphanInjector(client))
registry.Register(v1alpha1.QuotaExhaustion, injection.NewQuotaExhaustionInjector(client))
registry.Register(v1alpha1.WebhookLatency, injection.NewWebhookLatencyInjector(client))
return registry
}
Crash-Safe Cleanup¶
All injectors use a standardized rollback annotation pattern to enable cleanup after controller restarts.
Rollback Annotation Format¶
annotations:
chaos.operatorchaos.io/rollback: |
{"data":"<base64-json>","checksum":"sha256:abcd1234..."}
The data field contains base64-encoded JSON with injection-specific rollback information. The checksum provides integrity verification.
Safety Helper Functions¶
import "github.com/opendatahub-io/operator-chaos/pkg/safety"
// Store rollback data with integrity checksum
rollbackInfo := map[string]string{"originalValue": "foo"}
rollbackStr, _ := safety.WrapRollbackData(rollbackInfo)
safety.ApplyChaosMetadata(resource, rollbackStr, "ConfigDrift")
// Retrieve and verify rollback data
rollbackStr := resource.GetAnnotations()[safety.RollbackAnnotationKey]
var rollbackInfo map[string]string
safety.UnwrapRollbackData(rollbackStr, &rollbackInfo)
// Remove chaos metadata
safety.RemoveChaosMetadata(resource, "ConfigDrift")
Chaos Labels¶
All chaos-managed resources are labeled for traceability:
labels := safety.ChaosLabels("PodKill")
// Returns:
// {
// "chaos.operatorchaos.io/managed": "true",
// "chaos.operatorchaos.io/type": "PodKill",
// }
Query resources managed by chaos:
Injection Type Taxonomy¶
Reversible vs. Irreversible¶
| Type | Reversible | Cleanup Strategy |
|---|---|---|
| PodKill | No | No-op (Kubernetes recreates pods) |
| NetworkPartition | Yes | Delete NetworkPolicy |
| ConfigDrift | Yes | Restore from annotation |
| CRDMutation | Yes | Restore via merge patch |
| WebhookDisrupt | Yes | Restore original policies |
| RBACRevoke | Yes | Restore subjects |
| FinalizerBlock | Yes | Remove finalizer |
| ClientFault | Yes | Delete/restore ConfigMap |
| OwnerRefOrphan | Yes | Restore ownerReferences from annotation |
| QuotaExhaustion | Yes | Delete injected ResourceQuota |
| WebhookLatency | Yes | Delete slow admission webhook |
Stateful vs. Stateless Cleanup¶
Stateful (CleanupFunc):
- Uses closure variables from
Inject() - Valid only for immediate cleanup in standalone mode
- Example: Deleting a NetworkPolicy by reference
cleanup := func(ctx context.Context) error {
return client.Delete(ctx, policy) // 'policy' captured from Inject scope
}
Stateless (Revert):
- Reads rollback data from annotations
- Works after controller restarts
- Example: Restoring ConfigMap data
Controller Mode Requirement
Controllers MUST use Revert() for cleanup, not CleanupFunc, since they can restart between injection and cleanup.
TTL-Based Auto-Cleanup¶
Injections can specify a TTL for automatic cleanup:
TTL Annotation¶
Cleanup Controller¶
A background controller scans for expired resources:
func (c *CleanupController) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
// List all resources with TTL annotation
list := &unstructured.UnstructuredList{}
c.client.List(ctx, list, client.MatchingLabels{
"chaos.operatorchaos.io/managed": "true",
})
now := time.Now()
for _, item := range list.Items {
expiryStr := item.GetAnnotations()["chaos.operatorchaos.io/ttl-expiry"]
if expiryStr == "" {
continue
}
expiry, _ := time.Parse(time.RFC3339, expiryStr)
if now.After(expiry) {
// TTL expired, delete resource
c.client.Delete(ctx, &item)
}
}
return reconcile.Result{RequeueAfter: 30 * time.Second}, nil
}
Safety Mechanisms¶
1. Parameter Validation¶
All injectors validate parameters before execution:
func validatePodKillParams(spec v1alpha1.InjectionSpec, blast v1alpha1.BlastRadiusSpec) error {
if spec.Parameters["labelSelector"] == "" {
return errors.New("labelSelector is required")
}
selector, err := labels.Parse(spec.Parameters["labelSelector"])
if err != nil {
return fmt.Errorf("invalid labelSelector: %w", err)
}
if spec.Count > blast.MaxPodsAffected {
return fmt.Errorf("count %d exceeds maxPodsAffected %d", spec.Count, blast.MaxPodsAffected)
}
return nil
}
2. Danger Level Enforcement¶
High-danger injections require explicit acknowledgment:
if spec.Injection.DangerLevel == v1alpha1.DangerLevelHigh {
if !blast.AllowDangerous {
return fmt.Errorf("dangerLevel 'high' requires allowDangerous: true")
}
}
3. Namespace Restrictions¶
forbiddenNamespaces := []string{"kube-system", "kube-public", "default"}
for _, ns := range forbiddenNamespaces {
if namespace == ns {
return fmt.Errorf("namespace %q is forbidden", ns)
}
}
4. Dry Run Support¶
All injectors skip execution in dry-run mode:
Event Logging¶
Injectors return structured events for audit trails:
event := injection.NewEvent(
v1alpha1.PodKill,
"pod-abc123",
"deleted",
map[string]string{
"namespace": "opendatahub",
"node": "worker-1",
},
)
Events are stored in the CRD status:
status:
injectionLog:
- timestamp: "2024-03-30T10:00:00Z"
type: PodKill
target: pod-abc123
action: deleted
details:
namespace: opendatahub
node: worker-1
Best Practices¶
1. Idempotent Cleanup¶
Always make Revert() idempotent:
func (i *MyInjector) Revert(ctx, spec, namespace) error {
// Check if rollback annotation exists
rollbackStr, ok := resource.GetAnnotations()[safety.RollbackAnnotationKey]
if !ok {
return nil // Already reverted, nothing to do
}
// Perform cleanup...
}
2. Descriptive Error Messages¶
if err != nil {
return fmt.Errorf("creating NetworkPolicy %q in namespace %q: %w", policyName, namespace, err)
}
3. Resource Naming¶
Use deterministic naming for created resources:
This allows Revert() to reconstruct resource names without stored state.
4. Label Selectors¶
Parse and validate label selectors early:
selector, err := labels.Parse(spec.Parameters["labelSelector"])
if err != nil {
return fmt.Errorf("parsing labelSelector: %w", err)
}
Next Steps¶
- Observer Blackboard Pattern — How observations are collected
- Contributing: Adding Failure Modes — Implement a new injector
- Failure Modes Reference — All available failure modes