Fuzz Mode

Use Go's native fuzz engine to automatically explore fault combinations your reconciler might encounter. Instead of writing one test per error scenario, the fuzz harness generates thousands of random fault configurations and runs your reconciler against each one, catching panics, unhandled errors, and state corruption.

When to use Fuzz mode

Use this when you want to find edge cases in your reconciler's error handling during development. Manual tests cover the cases you think of. Fuzz testing covers the ones you don't.

What Does It Actually Do?

Traditional chaos testing requires you to define each fault scenario manually: "inject a connection refused on Get, verify recovery." Fuzz mode inverts this. You provide:

A reconciler factory (a function that creates your reconciler given a client.Client)
Seed objects (the Kubernetes resources your reconciler expects to exist)
Invariants (conditions that must always hold after reconciliation)

The fuzz engine then:

Generates random fault configurations (which operations fail, what error, how often)
Wraps a fake client with those faults using the SDK ChaosClient
Runs your reconciler
Reports any panics, unexpected errors, or invariant violations as test failures

No cluster is needed. Everything runs in-process with a fake client.

How It Works

flowchart TD
    F["Go Fuzz Engine"] -->|"generates opMask,<br/>faultType, intensity"| D[DecodeFaultConfig]
    D --> FC[FaultConfig]
    FC --> CC["ChaosClient<br/>(wraps Fake Client)"]
    CC --> R

    subgraph harness ["Harness.Run()"]
        R[Run Reconciler]
        R -->|panic| FAIL1["FAIL: always a bug"]
        R -->|non-chaos error| FAIL2["FAIL: real bug"]
        R -->|chaos error| OK1[Expected]
        OK1 --> INV[Check Invariants]
        INV -->|violation| FAIL3["FAIL: state corrupted"]
        INV -->|pass| PASS["PASS"]
    end

    style F fill:#bbdefb,stroke:#1565c0
    style D fill:#ce93d8,stroke:#6a1b9a
    style FC fill:#ffcc80,stroke:#e65100
    style CC fill:#ffcc80,stroke:#e65100
    style harness fill:#f5f5f5,stroke:#616161,stroke-width:2px,color:#424242
    style R fill:#ce93d8,stroke:#6a1b9a
    style OK1 fill:#bbdefb,stroke:#1565c0
    style INV fill:#ce93d8,stroke:#6a1b9a
    style FAIL1 fill:#ef9a9a,stroke:#c62828
    style FAIL2 fill:#ef9a9a,stroke:#c62828
    style FAIL3 fill:#ef9a9a,stroke:#c62828
    style PASS fill:#a5d6a7,stroke:#2e7d32

The key point: chaos-injected errors are expected, so the harness does not report them as failures. If your reconciler returns a ChaosError, it propagated the injected fault instead of crashing or masking it. The harness reports failures only for:

Panics: Your code crashed. Always a bug.
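Non-chaos errors: Your reconciler returned an error that didn't originate from the ChaosClient. Examples include an error your own code created, or a new error that replaced an injected fault instead of propagating it. Either way, it points to a bug in your reconciler rather than an expected fault.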
Invariant violations: Your reconciler ran to completion, but post-reconcile state is wrong (e.g., a resource that should always exist was deleted).

Prerequisites

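Go 1.18+ for native fuzzing support (recent controller-runtime releases require a newer Go version; check the go directive in its go.mod)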
controller-runtime v0.23+
No Kubernetes cluster needed (uses fake client)

Step-by-Step Walkthrough

Step 1: Write a reconciler factory

The fuzz harness needs a function that creates your reconciler given a client.Client. This decouples your reconciler construction from any specific client implementation:

import (
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func myFactory(c client.Client) reconcile.Reconciler {
    return &MyReconciler{client: c}
}
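Because the harness receives a factory rather than a finished reconciler, it can give the same reconciler code a plain fake client in one run and a fault-injecting client in the next. The stdlib-only sketch below shows that shape. Client and Reconciler are hypothetical simplified interfaces, not the real method sets of client.Client and reconcile.Reconciler, and fakeClient and brokenClient are illustrative stand-ins.

```go
package main

import "errors"

// Client is a hypothetical one-method stand-in for client.Client.
type Client interface {
	Get(key string) (string, error)
}

// Reconciler is a hypothetical stand-in for reconcile.Reconciler.
type Reconciler interface {
	Reconcile(key string) error
}

// MyReconciler depends only on the Client interface, never on a concrete client.
type MyReconciler struct{ client Client }

func (r *MyReconciler) Reconcile(key string) error {
	_, err := r.client.Get(key)
	return err // propagate any client error unchanged
}

// myFactory has the same shape as the real factory: client in, reconciler out.
func myFactory(c Client) Reconciler { return &MyReconciler{client: c} }

// fakeClient serves reads from a map.
type fakeClient map[string]string

func (f fakeClient) Get(key string) (string, error) {
	v, ok := f[key]
	if !ok {
		return "", errors.New("not found")
	}
	return v, nil
}

// brokenClient fails every call, like a client with all faults enabled.
type brokenClient struct{}

func (brokenClient) Get(string) (string, error) { return "", errors.New("connection refused") }
```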

Step 2: Write the fuzz test

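Create a file named fuzz_test.go in your controller's package, and put myFactory from Step 1 in the same file. The file declares the internal test package (package mycontroller, not mycontroller_test). This is required because myFactory sets MyReconciler's unexported client field, which an external test package cannot access:

package mycontroller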

import (
    "testing"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"

    "github.com/opendatahub-io/operator-chaos/pkg/sdk/fuzz"
)

func FuzzMyReconciler(f *testing.F) {
    // Seed corpus: starting values for the fuzz engine to mutate.
    // Each f.Add() call provides one (opMask, faultType, intensity) tuple.
    f.Add(uint16(0x01FF), uint8(0), uint16(32768)) // all ops, conflict error, 50%
    f.Add(uint16(0x0001), uint8(1), uint16(65535)) // Get only, not found, 100%
    f.Add(uint16(0), uint8(0), uint16(0))          // no faults (baseline)

    scheme := runtime.NewScheme()
    _ = corev1.AddToScheme(scheme)

    f.Fuzz(func(t *testing.T, opMask uint16, faultType uint8, intensity uint16) {
        // Seed objects: the initial cluster state before reconciliation.
        cm := &corev1.ConfigMap{
            ObjectMeta: metav1.ObjectMeta{
                Name:      "my-config",
                Namespace: "default",
            },
            Data: map[string]string{"key": "value"},
        }

        // The reconcile request targeting this ConfigMap.
        req := reconcile.Request{
            NamespacedName: types.NamespacedName{
                Name:      "my-config",
                Namespace: "default",
            },
        }

        // Create harness with factory, scheme, request, and seed objects.
        h := fuzz.NewHarness(myFactory, scheme, req, cm)

        // Add invariant: ConfigMap must still exist after reconciliation.
        h.AddInvariant(fuzz.ObjectExists(
            types.NamespacedName{Name: "my-config", Namespace: "default"},
            &corev1.ConfigMap{},
        ))

        // Decode fuzz bytes into a FaultConfig and run.
        fc := fuzz.DecodeFaultConfig(opMask, faultType, intensity)
        if err := h.Run(t, fc); err != nil {
            t.Fatal(err)
        }
    })
}

Step 3: Run the fuzz test

# Quick smoke test (30 seconds)
$ go test ./pkg/mycontroller/ -fuzz=FuzzMyReconciler -fuzztime=30s
fuzz: elapsed: 0s, gathering baseline coverage: 0/3 completed
fuzz: elapsed: 0s, gathering baseline coverage: 3/3 completed, now fuzzing with 8 workers
fuzz: elapsed: 3s, execs: 1842 (614/sec), new interesting: 12 (total: 15)
fuzz: elapsed: 6s, execs: 4291 (816/sec), new interesting: 14 (total: 17)
fuzz: elapsed: 9s, execs: 6830 (846/sec), new interesting: 15 (total: 18)
fuzz: elapsed: 12s, execs: 9402 (857/sec), new interesting: 15 (total: 18)
fuzz: elapsed: 15s, execs: 12091 (896/sec), new interesting: 16 (total: 19)
fuzz: elapsed: 18s, execs: 14673 (860/sec), new interesting: 16 (total: 19)
fuzz: elapsed: 21s, execs: 17284 (870/sec), new interesting: 16 (total: 19)
fuzz: elapsed: 24s, execs: 19901 (872/sec), new interesting: 16 (total: 19)
fuzz: elapsed: 27s, execs: 22486 (862/sec), new interesting: 16 (total: 19)
fuzz: elapsed: 30s, execs: 25107 (874/sec), new interesting: 16 (total: 19)
PASS
ok      github.com/example/my-operator/pkg/mycontroller 30.124s

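The engine executed the reconciler about 25,000 times in 30 seconds, each time with a fault configuration mutated from its corpus. "New interesting" counts inputs that expanded code coverage. The engine adds these to its generated corpus and keeps mutating them.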

# Thorough exploration (5 minutes)
$ go test ./pkg/mycontroller/ -fuzz=FuzzMyReconciler -fuzztime=5m

# Run indefinitely until a failure is found
$ go test ./pkg/mycontroller/ -fuzz=FuzzMyReconciler

Step 4: Interpret failures

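When the fuzz engine finds a bug, it reports the failure and the input that triggered it. The listing below is abridged, with the raw input values annotated. Here the failing input is the second f.Add entry from Step 2 (seed#1, because seed indices start at 0):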

--- FAIL: FuzzMyReconciler (0.03s)
    --- FAIL: FuzzMyReconciler/seed#1 (0.00s)
        harness.go:87: reconciler panicked: runtime error: invalid memory address
                        or nil pointer dereference

        Failing input:
            opMask:    0x0001 (Get only)
            faultType: 1 (not found)
            intensity: 65535 (100% error rate)

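This tells you that when every Get call fails with "not found", your reconciler panics with a nil pointer dereference. The fix is to check the error from Get before using the result. Return early (or handle the missing object) instead of dereferencing a nil pointer.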

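When an input the engine discovers on its own fails, go test writes it to testdata/fuzz/FuzzMyReconciler/ and prints the -run command that reproduces it. Every file in that directory is replayed on subsequent go test runs, even without -fuzz, so the bug becomes a permanent regression test. A failing f.Add seed, like seed#1 above, already lives in your test code and is replayed the same way.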

Step 5: Fix and re-run

After fixing the bug, re-run to verify the fix and continue exploring:

# Replay the saved failure (regression test)
$ go test ./pkg/mycontroller/ -run=FuzzMyReconciler
PASS
ok      github.com/example/my-operator/pkg/mycontroller 0.015s

# Continue fuzzing to find more issues
$ go test ./pkg/mycontroller/ -fuzz=FuzzMyReconciler -fuzztime=2m

Auto-Generate from Knowledge Models

If you have an operator knowledge model, you can skip writing the fuzz test manually:

$ operator-chaos generate fuzz-targets \
    --knowledge knowledge/kserve.yaml \
    --output fuzz_kserve_test.go

The generated file contains:

One FuzzXxx function per component (e.g., FuzzKserveControllerManager)
Seed objects derived from managedResources (Deployments, ConfigMaps, RBAC)
Invariants from steadyState.checks and Deployment replicas
Seed corpus entries derived from architectural traits (webhooks, finalizers, leader election)

You only need to replace the placeholder reconcilerFactory function with your actual reconciler constructor.

DecodeFaultConfig Reference

The DecodeFaultConfig function maps three fuzz primitives to a valid *sdk.FaultConfig:

fc := fuzz.DecodeFaultConfig(opMask, faultType, intensity)

opMask (uint16): Which operations get faults

Each bit enables faults for one operation:

Bit	Hex	Operation
0	`0x0001`	Get
1	`0x0002`	List
2	`0x0004`	Create
3	`0x0008`	Update
4	`0x0010`	Delete
5	`0x0020`	Patch
6	`0x0040`	DeleteAllOf
7	`0x0080`	Reconcile
8	`0x0100`	Apply

Examples:

0x0001 = only Get is faulted
0x0009 = Get + Update
0x01FF = all 9 operations
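The bit layout above can be decoded with a loop over the table. This is a hypothetical re-implementation for illustration only; opNames and decodeOps are not part of the fuzz package. It assumes that bits 9 to 15, which have no operation in the table, are ignored.

```go
package main

// opNames lists the operations in bit order, following the opMask table.
var opNames = [...]string{
	"Get", "List", "Create", "Update", "Delete",
	"Patch", "DeleteAllOf", "Reconcile", "Apply",
}

// decodeOps returns the names of the operations whose bit is set in mask.
// Bits 9-15 have no operation in the table and are ignored (an assumption).
func decodeOps(mask uint16) []string {
	var ops []string
	for bit, name := range opNames {
		if mask&(1<<bit) != 0 {
			ops = append(ops, name)
		}
	}
	return ops
}
```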

faultType (uint8): What error to inject

Index into 11 realistic Kubernetes error messages:

Index	Error Message
0	`the object has been modified; please apply your changes to the latest version and try again`
1	`not found`
2	`context deadline exceeded`
3	`Internal error occurred: unexpected response: 500`
4	`etcdserver: request timed out`
5	`rate limit exceeded, retry after 5s`
6	`connection refused`
7	`the server could not find the requested resource (HTTP 410: Gone)`
8	`admission webhook denied the request`
9	`exceeded quota`
10	`Service Unavailable`

Values > 10 wrap around (modulo 11).
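The wrap-around is a plain modulo over the message list. The sketch below uses hypothetical identifiers (faultMessages, faultMessage) rather than the package's own. It shows why, for example, faultType 12 selects the same message as 1.

```go
package main

// faultMessages copies the table above, in index order.
var faultMessages = [...]string{
	"the object has been modified; please apply your changes to the latest version and try again",
	"not found",
	"context deadline exceeded",
	"Internal error occurred: unexpected response: 500",
	"etcdserver: request timed out",
	"rate limit exceeded, retry after 5s",
	"connection refused",
	"the server could not find the requested resource (HTTP 410: Gone)",
	"admission webhook denied the request",
	"exceeded quota",
	"Service Unavailable",
}

// faultMessage maps any uint8 onto a valid index, wrapping values above 10.
func faultMessage(faultType uint8) string {
	return faultMessages[int(faultType)%len(faultMessages)]
}
```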

intensity (uint16): How often the fault fires

Maps linearly to error rate: 0 = never fire, 65535 = always fire, 32768 = ~50%.

Invariants

Invariants are conditions that must hold after every reconciliation, regardless of what faults were injected.

Built-in: ObjectExists

Checks that a specific object still exists:

h.AddInvariant(fuzz.ObjectExists(
    types.NamespacedName{Name: "my-config", Namespace: "default"},
    &corev1.ConfigMap{},
))

Built-in: ObjectCount

Checks that the number of objects of a given type matches an expected count:

// Exactly 3 ConfigMaps should exist
h.AddInvariant(fuzz.ObjectCount(&corev1.ConfigMapList{}, 3))

// Exactly 1 ConfigMap with label "app=my-app"
h.AddInvariant(fuzz.ObjectCount(
    &corev1.ConfigMapList{},
    1,
    client.MatchingLabels{"app": "my-app"},
))

Custom invariants

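Write any check as a function that takes a context.Context and a client.Client and returns a non-nil error when the invariant is violated. This example also needs the context and fmt imports: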

h.AddInvariant(func(ctx context.Context, c client.Client) error {
    cm := &corev1.ConfigMap{}
    key := types.NamespacedName{Name: "my-config", Namespace: "default"}
    if err := c.Get(ctx, key, cm); err != nil {
        return fmt.Errorf("ConfigMap missing: %w", err)
    }
    if cm.Data["key"] != "value" {
        return fmt.Errorf("ConfigMap data corrupted: got %q, want %q",
            cm.Data["key"], "value")
    }
    return nil
})
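Conceptually, each invariant is a function from cluster state to an error, and the harness evaluates them after the run. The sketch below models that over an in-memory map. objectExists and objectCount are hypothetical analogues of the built-ins. A checkInvariants helper collects every violation with errors.Join (Go 1.20+). Whether the real harness stops at the first violation or reports all of them is not specified here.

```go
package main

import (
	"errors"
	"fmt"
)

// Store is a hypothetical in-memory cluster: "namespace/name" -> object labels.
type Store map[string]map[string]string

// Invariant plays the role of the func(ctx, client) error checks above,
// minus the context and client.
type Invariant func(Store) error

// objectExists is a hypothetical analogue of fuzz.ObjectExists.
func objectExists(key string) Invariant {
	return func(s Store) error {
		if _, ok := s[key]; !ok {
			return fmt.Errorf("%s missing", key)
		}
		return nil
	}
}

// matches reports whether labels contain every key/value pair in selector;
// an empty selector matches everything.
func matches(labels, selector map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// objectCount is a hypothetical analogue of fuzz.ObjectCount.
func objectCount(want int, selector map[string]string) Invariant {
	return func(s Store) error {
		got := 0
		for _, labels := range s {
			if matches(labels, selector) {
				got++
			}
		}
		if got != want {
			return fmt.Errorf("got %d matching objects, want %d", got, want)
		}
		return nil
	}
}

// checkInvariants runs every invariant and joins all violations into one error.
// errors.Join returns nil when no invariant failed.
func checkInvariants(s Store, invs ...Invariant) error {
	var errs []error
	for _, inv := range invs {
		if err := inv(s); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}
```

Joining the errors lets one failing run report every broken invariant at once, rather than only the first.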

CI Integration

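Add fuzz tests to your CI pipeline. Even a short run (for example, 2 minutes per target) can catch issues. go test accepts -fuzz for only one package at a time, and the pattern must match exactly one fuzz target, so add one step per target:

# GitHub Actions
- name: Fuzz test FuzzMyReconciler
  run: go test ./pkg/mycontroller/ -fuzz='^FuzzMyReconciler$' -fuzztime=2m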

Commit the testdata/fuzz/ directory to your repository so discovered failures become permanent regression tests.

Comparison with Other Modes

	Fuzz Mode	SDK Mode	CLI Mode
Cluster needed?	No (fake client)	No (fake or real)	Yes (live cluster)
Fault configuration	Auto-generated	Manual	Manual (YAML)
Coverage	Thousands of combinations	One scenario per test	One scenario per experiment
What it finds	Panics, nil pointers, logic bugs	Specific error handling paths	Real recovery behavior
Speed	Hundreds of runs/sec (about 870/sec in the example above; varies with reconciler and hardware)	Milliseconds per test	Seconds to minutes per experiment

Next Steps

Learn about SDK mode for targeted error injection
Run full cluster experiments with CLI mode