# Knowledge Models
Knowledge models are YAML files that describe what an operator manages. The chaos framework uses these models to understand which resources to monitor during experiments, what "healthy" looks like, and how to validate recovery.
## Why Knowledge Models Matter
Traditional chaos tools test infrastructure resilience: kill a pod, verify it restarts. But Kubernetes operators manage complex resource graphs—Deployments, Services, ConfigMaps, CRDs, webhooks, RBAC bindings—where the real question is:
"When something breaks, does the operator restore everything to its intended state?"
Knowledge models answer this by encoding:
- Component inventory: What resources does this operator manage?
- Steady-state definition: What conditions indicate the component is healthy?
- Dependency graph: Which components depend on each other?
- Recovery expectations: How long should the operator take to reconcile?
Without a knowledge model, the chaos framework can't distinguish between:
- A pod that's intentionally scaled to zero
- A pod that failed to recover from chaos
- A resource that doesn't exist yet because the operator hasn't reconciled
Knowledge models provide the semantic layer that makes operator-aware chaos testing possible.
## YAML Schema
### Top-Level Structure
```yaml
operator:
  name: string              # required: operator name
  namespace: string         # required: namespace where the operator runs
  repository: string        # optional: source repository URL
  version: string           # optional: operator version (e.g., "2.21.0") - required for versioned knowledge models
  platform: string          # optional: platform name (e.g., "OpenShift", "Kubernetes") - required for versioned knowledge models
  olmChannel: string        # optional: OLM channel (e.g., "stable", "fast") - required for versioned knowledge models

components:
  - name: string            # required: unique component name
    controller: string      # required: controller that manages this component
    managedResources: []    # required: at least one resource
    dependencies: []        # optional: other component names
    webhooks: []            # optional: webhook configurations
    finalizers: []          # optional: finalizers this component manages
    steadyState: {}         # optional: steady-state checks

recovery:
  reconcileTimeout: string  # required: e.g. "300s"
  maxReconcileCycles: int   # required: e.g. 10
```
### Operator Metadata
The `operator` section identifies the operator being modeled:
```yaml
operator:
  name: odh-model-controller
  namespace: opendatahub
  repository: https://github.com/opendatahub-io/odh-model-controller
  version: "2.21.0"
  platform: "OpenShift"
  olmChannel: "stable"
```
| Field | Required | Description |
|---|---|---|
| `name` | Yes | Operator name (must be unique across knowledge files) |
| `namespace` | Yes | Namespace where the operator's control plane runs |
| `repository` | No | Source repository URL, for documentation purposes |
| `version` | No* | Operator version (e.g., `"2.21.0"`); required for versioned knowledge models |
| `platform` | No* | Platform name (e.g., `"OpenShift"`, `"Kubernetes"`); required for versioned knowledge models |
| `olmChannel` | No* | OLM channel (e.g., `"stable"`, `"fast"`); required for versioned knowledge models |
\* Versioned knowledge models (used with the upgrade diff engine) require the `version`, `platform`, and `olmChannel` fields.
### Components
Each component represents a logical unit managed by the operator:
```yaml
components:
  - name: odh-model-controller
    controller: DataScienceCluster
    managedResources:
      - apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
```
| Field | Required | Description |
|---|---|---|
| `name` | Yes | Unique component identifier |
| `controller` | Yes | Controller that reconciles this component (e.g., CRD name or controller type) |
| `managedResources` | Yes | List of Kubernetes resources managed by this component (at least one) |
| `dependencies` | No | Other components this component depends on (see Dependencies) |
| `webhooks` | No | Admission webhooks managed by this component |
| `finalizers` | No | Finalizers this component adds to resources |
| `steadyState` | No | Steady-state verification checks |
### Managed Resources
Managed resources describe the Kubernetes objects the component creates and maintains:
```yaml
managedResources:
  - apiVersion: apps/v1
    kind: Deployment
    name: odh-model-controller
    namespace: opendatahub
    labels:
      control-plane: odh-model-controller
      app: odh-model-controller
    expectedSpec:
      replicas: 1
  - apiVersion: v1
    kind: ConfigMap
    name: inferenceservice-config
    namespace: opendatahub
  - apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    name: odh-model-controller-rolebinding-opendatahub
```
| Field | Required | Description |
|---|---|---|
| `apiVersion` | Yes | Kubernetes API version (e.g., `apps/v1`, `v1`) |
| `kind` | Yes | Resource kind (e.g., `Deployment`, `ConfigMap`) |
| `name` | Yes | Resource name |
| `namespace` | No | Resource namespace (omit for cluster-scoped resources) |
| `labels` | No | Expected labels (used for pod selection in fault injection) |
| `ownerRef` | No | Expected owner reference kind |
| `expectedSpec` | No | Expected spec fields (checked during steady-state verification) |
The framework uses `labels` to target resources for injection types like `PodKill` and `NetworkPartition`. It uses `expectedSpec` to verify the operator restored the correct configuration after chaos.
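To preview which pods a `labels` entry selects (and therefore roughly what a `PodKill` injection would target), query the cluster with the same selector:

```bash
# Pods matching the labels declared on the Deployment above
kubectl get pods -n opendatahub \
  -l control-plane=odh-model-controller,app=odh-model-controller
```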
### Webhooks
Webhook configurations are critical for testing admission control resilience:
```yaml
webhooks:
  - name: mutating.pod.odh-model-controller.opendatahub.io
    type: mutating
    path: /mutate--v1-pod
  - name: validating.isvc.odh-model-controller.opendatahub.io
    type: validating
    path: /validate-serving-kserve-io-v1beta1-inferenceservice
```
| Field | Required | Description |
|---|---|---|
| `name` | Yes | Webhook configuration name (matches the ValidatingWebhookConfiguration or MutatingWebhookConfiguration object) |
| `type` | Yes | Webhook type: `validating` or `mutating` |
| `path` | Yes | HTTP path the webhook server listens on |
These are used by the `WebhookDisrupt` injection type to test how the operator handles admission control failures.
### Finalizers
Document finalizers the component manages:
```yaml
finalizers:
  - odh.inferenceservice.finalizers
  - modelregistry.opendatahub.io/finalizer
  - runtimes.opendatahub.io/nim-cleanup-finalizer
```
The `FinalizerBlock` injection type uses this list to test finalizer handling by adding blocking finalizers to resources.
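For intuition, the manual equivalent of a blocking finalizer looks like the patch below; the resource and finalizer names are hypothetical, and `FinalizerBlock` automates both the injection and the cleanup. Note that a merge patch overwrites the existing finalizer list, so treat this as illustration only:

```bash
# Append a blocking finalizer (hypothetical name), then delete the object;
# it will hang in Terminating until the finalizer is removed
kubectl patch inferenceservice my-model -n opendatahub --type=merge \
  -p '{"metadata":{"finalizers":["chaos.example.io/block"]}}'
```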
### Dependencies
Dependencies define the relationship between components, enabling collateral damage detection:
```yaml
components:
  - name: odh-model-controller
    controller: DataScienceCluster
    dependencies:
      - kserve                      # cross-operator dependency (operator name)
    managedResources: [...]

# In the kserve knowledge model:
  - name: llmisvc-controller-manager
    controller: KServe
    dependencies:
      - kserve-controller-manager   # intra-operator dependency (component name)
    managedResources: [...]
```
Two dependency types:
```mermaid
graph TD
    subgraph intra["kserve.yaml · intra-operator dependencies"]
        A["kserve-controller-manager<br/>(primary controller)"]
        B["llmisvc-controller-manager<br/>(LLM inference)"]
        C["kserve-localmodel-controller<br/>(model caching)"]
        B -->|depends on| A
        C -->|depends on| A
    end
    subgraph cross["odh-model-controller.yaml · cross-operator dependency"]
        D["odh-model-controller<br/>(model serving lifecycle)"]
    end
    D -->|"cross-operator<br/>depends on"| A

    style intra fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#0d47a1
    style cross fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#4a148c
    style A fill:#bbdefb,stroke:#1565c0
    style B fill:#90caf9,stroke:#1565c0
    style C fill:#90caf9,stroke:#1565c0
    style D fill:#ce93d8,stroke:#6a1b9a
```
- **Intra-operator**: A component name within the same knowledge file (e.g., `llmisvc-controller-manager` depends on `kserve-controller-manager`)
- **Cross-operator**: An operator name across knowledge files (e.g., `odh-model-controller` depends on `kserve`)
When you inject chaos into a component, the framework automatically checks the steady state of all dependent components after recovery. If a dependent component fails its checks, the verdict downgrades from `Resilient` to `Degraded` (never to `Failed`, since collateral damage is a side effect, not the target's failure).
To enable dependency tracking, load multiple knowledge files:
```bash
# Load all knowledge files from a directory
operator-chaos run experiment.yaml --knowledge-dir knowledge/

# Or load multiple individual files
operator-chaos run experiment.yaml --knowledge kserve.yaml --knowledge odh-model-controller.yaml
```
### Steady State
Steady-state checks define what "healthy" means for this component:
```yaml
steadyState:
  checks:
    - type: conditionTrue
      apiVersion: apps/v1
      kind: Deployment
      name: odh-model-controller
      namespace: opendatahub
      conditionType: Available
    - type: resourceExists
      apiVersion: v1
      kind: ConfigMap
      name: inferenceservice-config
      namespace: opendatahub
      timeout: "60s"
```
Check types:
| Type | Description | Required Fields |
|---|---|---|
| `conditionTrue` | Verify a Kubernetes resource condition is `True` | `apiVersion`, `kind`, `name`, `namespace`, `conditionType` |
| `resourceExists` | Verify a resource exists | `apiVersion`, `kind`, `name`, `namespace` (if namespaced) |
The framework runs these checks:
- **Pre-injection**: Establish the baseline (if checks fail, the experiment is `Inconclusive`)
- **Post-recovery**: Verify the operator restored the component (if checks fail, the verdict is `Failed` or `Degraded`)
### Recovery Expectations
Recovery expectations define how long the framework should wait for the operator to reconcile:
```yaml
recovery:
  reconcileTimeout: "300s"  # Maximum time to wait for full recovery
  maxReconcileCycles: 10    # Maximum reconcile cycles to tolerate
```
| Field | Required | Description |
|---|---|---|
| `reconcileTimeout` | Yes | Maximum time for the operator to restore all resources (e.g., `"300s"`, `"5m"`) |
| `maxReconcileCycles` | Yes | Maximum reconcile cycles before the verdict downgrades to `Degraded` |
If the operator takes longer than `reconcileTimeout`, the verdict is `Failed`. If recovery completes but takes more than `maxReconcileCycles`, the verdict is `Degraded` (recovered, but inefficiently).
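A worked example under the settings above, following the verdict rules just described:

```yaml
recovery:
  reconcileTimeout: "300s"  # recovery finishes at 320s          -> Failed
  maxReconcileCycles: 10    # recovery at 180s but 14 cycles     -> Degraded
                            # recovery at 180s within 10 cycles  -> eligible for Resilient
```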
## Complete Example: odh-model-controller
```yaml
operator:
  name: odh-model-controller
  namespace: opendatahub
  repository: https://github.com/opendatahub-io/odh-model-controller

components:
  - name: odh-model-controller
    controller: DataScienceCluster
    managedResources:
      - apiVersion: apps/v1
        kind: Deployment
        name: odh-model-controller
        namespace: opendatahub
        labels:
          control-plane: odh-model-controller
          app: odh-model-controller
        expectedSpec:
          replicas: 1
      - apiVersion: v1
        kind: ConfigMap
        name: inferenceservice-config
        namespace: opendatahub
      - apiVersion: v1
        kind: ServiceAccount
        name: odh-model-controller
        namespace: opendatahub
      - apiVersion: v1
        kind: Secret
        name: odh-model-controller-webhook-cert
        namespace: opendatahub
      - apiVersion: rbac.authorization.k8s.io/v1
        kind: ClusterRoleBinding
        name: odh-model-controller-rolebinding-opendatahub
      - apiVersion: coordination.k8s.io/v1
        kind: Lease
        name: odh-model-controller.opendatahub.io
        namespace: opendatahub
    webhooks:
      - name: mutating.pod.odh-model-controller.opendatahub.io
        type: mutating
        path: /mutate--v1-pod
      - name: minferencegraph-v1alpha1.odh-model-controller.opendatahub.io
        type: mutating
        path: /mutate-serving-kserve-io-v1alpha1-inferencegraph
      - name: minferenceservice-v1beta1.odh-model-controller.opendatahub.io
        type: mutating
        path: /mutate-serving-kserve-io-v1beta1-inferenceservice
      - name: validating.nim.account.odh-model-controller.opendatahub.io
        type: validating
        path: /validate-nim-opendatahub-io-v1-account
      - name: validating.llmisvc.odh-model-controller.opendatahub.io
        type: validating
        path: /validate-serving-kserve-io-v1alpha1-llminferenceservice
      - name: vinferencegraph-v1alpha1.odh-model-controller.opendatahub.io
        type: validating
        path: /validate-serving-kserve-io-v1alpha1-inferencegraph
      - name: validating.isvc.odh-model-controller.opendatahub.io
        type: validating
        path: /validate-serving-kserve-io-v1beta1-inferenceservice
    finalizers:
      - odh.inferenceservice.finalizers
      - modelregistry.opendatahub.io/finalizer
      - runtimes.opendatahub.io/nim-cleanup-finalizer
    dependencies:
      - kserve
    steadyState:
      checks:
        - type: conditionTrue
          apiVersion: apps/v1
          kind: Deployment
          name: odh-model-controller
          namespace: opendatahub
          conditionType: Available
          timeout: "60s"

recovery:
  reconcileTimeout: "300s"
  maxReconcileCycles: 10
```
## Writing a Knowledge Model
### 1. Identify Components
Start by identifying the logical components your operator manages. For most operators, this maps to:
- Controllers (one component per controller binary)
- Webhooks (may be part of the controller component or separate)
- Optional components (features that can be enabled/disabled)
Example: KServe has 4 components:

- `kserve-controller-manager` (main controller)
- `llmisvc-controller-manager` (LLM inference controller)
- `kserve-localmodel-controller-manager` (local model caching)
- `kserve-localmodelnode-agent` (node-level DaemonSet)
### 2. Enumerate Managed Resources
For each component, list all resources it creates:
```bash
# Find Deployments
kubectl get deployments -n <namespace> -l app.kubernetes.io/part-of=<operator>

# Find ConfigMaps
kubectl get configmaps -n <namespace> -l app.kubernetes.io/managed-by=<operator>

# Find RBAC (cluster-wide)
kubectl get clusterrolebindings | grep <operator>

# Find webhooks
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep <operator>
```
Document each resource's:
- API version and kind
- Name and namespace
- Expected labels (for pod selection during chaos)
- Expected spec fields (for post-recovery validation)
### 3. Define Steady State
Choose checks that prove the component is functional:
For controllers with Deployments:
```yaml
steadyState:
  checks:
    - type: conditionTrue
      apiVersion: apps/v1
      kind: Deployment
      name: my-controller
      namespace: my-namespace
      conditionType: Available
      timeout: "60s"
```
For DaemonSets or optional resources:
```yaml
steadyState:
  checks:
    - type: resourceExists
      apiVersion: apps/v1
      kind: DaemonSet
      name: my-agent
      namespace: my-namespace
      timeout: "60s"
```
For CRD-managed resources:
```yaml
steadyState:
  checks:
    - type: conditionTrue
      apiVersion: myapi.io/v1
      kind: MyCustomResource
      name: my-instance
      namespace: my-namespace
      conditionType: Ready
      timeout: "60s"
```
> **Steady-state timeout**: Set `timeout` to the maximum time you expect the resource to become ready after creation. For most Deployments, 60 seconds is sufficient. For CRDs with complex initialization, use longer timeouts.
### 4. Document Dependencies
Map out which components depend on each other:
```yaml
# Component A provides an API that Component B consumes
- name: component-b
  dependencies:
    - component-a      # intra-operator dependency

# Component depends on another operator
- name: my-component
  dependencies:
    - other-operator   # cross-operator dependency (operator name, not component)
```
### 5. Set Recovery Expectations
Base these on your operator's SLO:
```yaml
recovery:
  reconcileTimeout: "300s"  # 5 minutes for full recovery
  maxReconcileCycles: 10    # Tolerate up to 10 reconcile cycles
```
Guidelines:

- `reconcileTimeout`: Set it to 2-3x your normal reconcile time for the most complex resource
- `maxReconcileCycles`: Operators should converge in 1-2 cycles normally; allow 10 for chaos scenarios
### 6. Validate
```bash
# Local validation (no cluster access)
operator-chaos validate knowledge.yaml --knowledge

# Pre-flight checks (validates against live cluster)
operator-chaos preflight --knowledge knowledge.yaml
```
## How Knowledge Models Are Used
### During Steady-State Checks
**Pre-injection (baseline):**

1. The framework loads the knowledge model for the target operator
2. Finds the component specified in the experiment
3. Runs all `steadyState.checks` for that component
4. If any check fails, the experiment is `Inconclusive` (baseline not established)
**Post-recovery:**

1. Waits for `recovery.reconcileTimeout`
2. Runs the same steady-state checks
3. If checks pass, evaluates the verdict based on recovery time and reconcile cycles
4. If checks fail, the verdict is `Failed`
### During Recovery Validation
The framework uses `managedResources` to verify the operator restored everything:
1. For each managed resource, query the cluster
2. Compare current state to `expectedSpec` fields
3. Count how many reconcile cycles occurred
4. If all resources exist and match expectations within `reconcileTimeout`, mark the component as recovered
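To spot-check one of these comparisons by hand, query the field that `expectedSpec` pins (here `replicas: 1` from the earlier example):

```bash
# Should print 1 once the operator has restored the Deployment
kubectl get deployment odh-model-controller -n opendatahub \
  -o jsonpath='{.spec.replicas}'
```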
### During Collateral Detection
When multiple knowledge models are loaded:
1. The framework builds a dependency graph
2. During experiment evaluation, it resolves which components depend on the faulted target
3. Runs steady-state checks for all dependent components
4. If a dependent component fails, downgrades the verdict to `Degraded` and reports collateral findings
Example scenario:
```yaml
# odh-model-controller knowledge model
- name: odh-model-controller
  dependencies:
    - kserve   # cross-operator dependency

# Experiment kills the kserve-controller-manager pod.
# After kserve recovers, the framework also checks odh-model-controller.
# If odh-model-controller is degraded, the verdict is Degraded (collateral damage).
```
## Best Practices
### Do:
- ✅ **Model reality**: Describe what your operator actually creates, not what you wish it created
- ✅ **Test incrementally**: Start with one component, validate, add more
- ✅ **Use specific labels**: Include all identifying labels for accurate pod targeting
- ✅ **Document webhooks**: Critical for testing admission control resilience
- ✅ **Version your models**: Keep knowledge models in the operator repo alongside code
### Don't:
- ❌ **Over-specify**: Only include `expectedSpec` fields that matter for correctness (not cosmetic fields like annotations)
- ❌ **Assume reconciliation is instant**: Set realistic `reconcileTimeout` values
- ❌ **Ignore dependencies**: Collateral damage detection only works if dependencies are modeled
- ❌ **Skip validation**: Always run `validate --knowledge` before using a model in experiments
## Troubleshooting
"Unknown dependency: X"¶
You referenced a component or operator name that doesn't exist in any loaded knowledge model.
Fix: Load all required knowledge files:
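```bash
# Load every knowledge file so cross-operator dependency names resolve
operator-chaos run experiment.yaml --knowledge-dir knowledge/
```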
"Steady-state check failed: condition Available not found"¶
The resource exists but doesn't have the expected condition.
Fix: Check the actual resource:
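One way to list the conditions the resource actually reports:

```bash
# Print each condition type and its status
kubectl get deployment odh-model-controller -n opendatahub \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
```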
Adjust your knowledge model to match reality.
"Resource not found during pre-check"¶
The resource declared in `managedResources` doesn't exist on the cluster.
Fix: Either:
- Create the resource (if it should exist)
- Remove it from the knowledge model (if it's optional)
- Change the check type from `conditionTrue` to `resourceExists` (if existence is sufficient), as sketched below
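A minimal sketch of that relaxation, reusing the DaemonSet example from earlier:

```yaml
steadyState:
  checks:
    - type: resourceExists   # relaxed from conditionTrue; existence alone passes
      apiVersion: apps/v1
      kind: DaemonSet
      name: my-agent
      namespace: my-namespace
      timeout: "60s"
```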
"maxReconcileCycles exceeded"¶
The operator is reconciling too many times, indicating inefficient recovery.
Fix: This is a real finding—investigate why the operator needed so many cycles. Common causes:
- Finalizer loops
- Race conditions in status updates
- Missing owner references causing orphaned resources
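One way to gather evidence, assuming the operator logs each reconcile (log wording varies by operator, so adjust the pattern):

```bash
# Count reconcile-related log lines during the recovery window
kubectl logs deployment/odh-model-controller -n opendatahub --since=10m \
  | grep -ci reconcile
```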
## Generating Fuzz Tests from Knowledge Models
Knowledge models can automatically generate fuzz test files that exercise your operator's reconciler with architecturally relevant fault combinations.
The generator reads your knowledge model and produces:
- **Seed objects** from `managedResources` (typed Go objects with labels and expected spec)
- **Invariants** from `steadyState.checks` (verify resources survive reconciliation)
- **Seed corpus entries** from architectural traits (webhooks produce webhook-denial seeds, finalizers produce conflict seeds, etc.)
See the Fuzz Quick Start for details on the generated test structure, and the CLI Reference for command options.
## Versioned Knowledge Models
Versioned knowledge models enable the upgrade diff engine to detect structural changes between operator releases. They follow a directory layout convention with version metadata.
### Directory Layout
```
knowledge/
├── v2.20/
│   ├── kserve.yaml
│   ├── odh-model-controller.yaml
│   └── crds/
│       ├── inferenceservice.yaml
│       └── llminferenceservice.yaml
├── v2.21/
│   ├── kserve.yaml
│   ├── odh-model-controller.yaml
│   └── crds/
│       ├── inferenceservice.yaml
│       └── llminferenceservice.yaml
└── v2.22/
    └── ...
```
Conventions:
- The directory name matches the operator version (e.g., `v2.20`, `v2.21`)
- Each directory contains knowledge YAML files for all operators in that release
- CRD definitions live in a `crds/` subdirectory
- All knowledge models in a versioned directory must include `version`, `platform`, and `olmChannel` metadata
### Metadata Fields
Versioned knowledge models require three additional fields in the `operator` section:
```yaml
operator:
  name: kserve-operator
  namespace: kserve
  repository: https://github.com/kserve/kserve
  version: "2.21.0"       # Semantic version matching the directory name
  platform: "OpenShift"   # Platform (OpenShift, Kubernetes, etc.)
  olmChannel: "stable"    # OLM channel (stable, fast, candidate)
```
| Field | Format | Description |
|---|---|---|
| `version` | Semantic version | Operator version (e.g., `"2.21.0"`); must match the directory name prefix |
| `platform` | String | Target platform (e.g., `"OpenShift"`, `"Kubernetes"`); used for platform-specific diffs |
| `olmChannel` | String | OLM channel (e.g., `"stable"`, `"fast"`); tracks the release stream |
### CRD Files
Place CRD manifests in the `crds/` subdirectory. The diff engine uses these for deep schema analysis.
```yaml
# knowledge/v2.21/crds/inferenceservice.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: inferenceservices.serving.kserve.io
spec:
  versions:
    - name: v1beta1
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                predictor:
                  type: object
                  # ... full CRD schema
```
### Validation
Validate versioned knowledge models before using them.
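A plausible invocation, assuming the `validate` subcommand accepts a versioned directory the same way it accepts a single file (check the CLI Reference for the exact form):

```bash
# Assumed form: validate every model in a versioned directory
operator-chaos validate knowledge/v2.21/ --knowledge
```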
Checks:
- All knowledge models have the required metadata fields
- CRD files exist and are valid
- The directory name matches the `version` field
- No duplicate operator names
### Using Versioned Models with Diff Engine
```bash
# Compare two versions
operator-chaos diff --old knowledge/v2.20/ --new knowledge/v2.21/

# Generate upgrade test suite
operator-chaos simulate-upgrade \
  --from knowledge/v2.20/ \
  --to knowledge/v2.21/ \
  --output experiments/upgrade-suite/
```
See Upgrade Testing Guide for full workflow.
## Next Steps
- Learn about Controller Mode to run experiments as CRDs
- Generate fuzz tests with the Fuzz Quick Start
- Test operator upgrades with the Upgrade Testing Guide
- Explore the `knowledge/` directory for real-world examples