Upgrade Testing
Test and validate RHOAI/ODH operator upgrades before they hit production. The upgrade testing workflow detects breaking changes between versions and simulates upgrade-like disruptions using the existing chaos injection engine.
Why Upgrade Testing Matters
Operator upgrades are where architectural assumptions break. RHOAI 2.x to 3.x migrations caused real production incidents:
- Component renaming: odh-dashboard became rhods-dashboard, breaking label selectors and monitoring queries
- Namespace moves: Components migrated from opendatahub to redhat-ods-applications, breaking NetworkPolicies and RBAC bindings
- CRD schema changes: Required fields added to InferenceService specs, failing existing workloads
- Webhook additions: New validation webhooks rejected previously-valid resources
- Dependency ordering changes: KServe initialization timing shifted, causing race conditions in dependent operators
Traditional upgrade testing catches runtime errors (pod crash loops, API server rejections), but misses resilience degradation. The upgrade simulation workflow tests what happens when:
- Old-name pods are terminated and only new-name pods exist
- Cross-namespace communication is disrupted during namespace migrations
- Webhooks are suddenly removed or added mid-flight
- New dependencies become unavailable
These are the failure modes that surface in production after the upgrade completes, when chaos occurs naturally.
Workflow Overview
flowchart LR
A["Create Knowledge<br/>Snapshots"] --> B["operator-chaos<br/>diff"]
B --> C["operator-chaos<br/>diff-crds"]
C --> D["operator-chaos<br/>simulate-upgrade<br/>--dry-run"]
D --> E["Review<br/>Experiments"]
E --> F["operator-chaos<br/>simulate-upgrade"]
F --> G["operator-chaos<br/>validate-version"]
style A fill:#bbdefb,stroke:#1565c0
style B fill:#ce93d8,stroke:#6a1b9a
style C fill:#ce93d8,stroke:#6a1b9a
style D fill:#ffcc80,stroke:#e65100
style E fill:#b0bec5,stroke:#37474f
style F fill:#ef9a9a,stroke:#c62828
style G fill:#a5d6a7,stroke:#2e7d32
Step 1: Snapshot knowledge models for both versions (source and target)
Step 2: Detect architectural differences with operator-chaos diff
Step 3: Analyze CRD schema changes with operator-chaos diff-crds
Step 4: Preview upgrade experiments with --dry-run
Step 5: Review generated experiments for architectural correctness
Step 6: Execute upgrade simulation and collect results
Step 7: Validate cluster state matches target version
Versioned Knowledge Models
Knowledge models now support version metadata to enable diff-based upgrade testing. Versioned models live in a structured directory hierarchy:
knowledge/
├── odh/
│   └── v2.10/
│       ├── dashboard.yaml
│       ├── kserve.yaml
│       └── odh-model-controller.yaml
└── rhoai/
    └── v3.3/
        ├── dashboard.yaml
        ├── kserve.yaml
        └── odh-model-controller.yaml
Version Metadata Fields
Three new fields in the operator section identify the version:
operator:
  name: dashboard
  namespace: redhat-ods-applications
  repository: https://github.com/opendatahub-io/odh-dashboard
  version: "3.3.1"          # Operator version
  platform: rhoai           # Platform identifier (rhoai, odh, custom)
  olmChannel: stable-3.3    # OLM subscription channel
| Field | Required | Description |
|---|---|---|
| version | Yes | Semantic version string (e.g., "3.3.1", "2.10.0") |
| platform | Yes | Platform identifier used in diff output (e.g., "rhoai", "odh") |
| olmChannel | No | OLM channel for subscription-based deployments |
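A minimal sketch of how the required fields above could be checked before a snapshot is committed. It assumes the operator section has already been parsed into a dictionary and that "semantic version" means a three-part MAJOR.MINOR.PATCH string; the function name and the exact rules are illustrative, not part of operator-chaos.

```python
import re

# Assumption: a semantic version here is MAJOR.MINOR.PATCH with numeric parts.
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")
REQUIRED_FIELDS = ("version", "platform")


def check_version_metadata(operator: dict) -> list[str]:
    """Return a list of problems found in a parsed `operator` section."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not operator.get(field):
            problems.append(f"missing required field: {field}")

    version = operator.get("version")
    if version is not None and not isinstance(version, str):
        # Unquoted YAML values such as 3.3 parse as numbers, not strings.
        problems.append("version must be a quoted string")
    elif version and not SEMVER.match(version):
        problems.append(f"version is not MAJOR.MINOR.PATCH: {version!r}")
    return problems
```

An empty list means the metadata is usable for diffing; olmChannel is optional and is deliberately not checked.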
Creating a Version Snapshot
To create a versioned knowledge snapshot from a live cluster:
1. Copy current knowledge files to a versioned directory:
2. Add version metadata to each file:
operator:
  name: dashboard
  namespace: redhat-ods-applications
  version: "3.3.1"
  platform: rhoai
  olmChannel: stable-3.3
3. Validate the snapshot:
Version snapshots are immutable
Once created, versioned knowledge directories should not be modified. They represent the architectural state at a specific release. To update for a new version, create a new versioned directory.
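The copy step and the immutability rule can be sketched together as follows. This is an illustration only: the helper name, its parameters, and the choice to refuse overwriting an existing directory are assumptions, and the version metadata still has to be added to each copied file afterwards.

```python
import shutil
from pathlib import Path


def create_snapshot(src: Path, root: Path, platform: str, release: str) -> Path:
    """Copy knowledge files from `src` into <root>/<platform>/v<release>/.

    `release` is the directory label (for example "3.3"), not the full
    operator version stored inside each file.
    """
    dest = root / platform / f"v{release}"
    if dest.exists():
        # Snapshots are treated as immutable: never overwrite an existing one.
        raise FileExistsError(f"snapshot already exists: {dest}")
    dest.mkdir(parents=True)
    for knowledge_file in sorted(src.glob("*.yaml")):
        shutil.copy2(knowledge_file, dest / knowledge_file.name)
    return dest
```

For example, create_snapshot(Path("knowledge"), Path("knowledge"), "rhoai", "3.3") would copy the top-level YAML files into knowledge/rhoai/v3.3/ and fail if that directory is already present.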
Comparing Versions
The operator-chaos diff command detects architectural changes between two knowledge snapshots:
operator-chaos diff \
--source knowledge/odh/v2.10/ \
--target knowledge/rhoai/v3.3/
Example Output
Comparing knowledge models:
Source: knowledge/odh/v2.10/ (platform: odh, 15 operators)
Target: knowledge/rhoai/v3.3/ (platform: rhoai, 11 operators)
Component Changes:
dashboard.odh-dashboard → dashboard.rhods-dashboard [BREAKING: Component rename]
- Deployment: odh-dashboard → rhods-dashboard (namespace: opendatahub → redhat-ods-applications)
- Service: odh-dashboard → rhods-dashboard (namespace: opendatahub → redhat-ods-applications)
- ConfigMap: odh-dashboard-config → rhods-dashboard-config (namespace: opendatahub → redhat-ods-applications)
kserve.kserve-controller-manager [BREAKING: Namespace move]
- Namespace: opendatahub → redhat-ods-applications
odh-model-controller.odh-model-controller [BREAKING: Webhook added]
+ Webhook: validating.nim.account.odh-model-controller.opendatahub.io (type: validating)
Operator Changes:
- codeflare [BREAKING: Operator removed]
- modelmesh [BREAKING: Operator removed]
- data-science-pipelines [BREAKING: Operator removed]
Summary:
15 breaking changes detected
3 warnings
8 informational changes
Filtering by Severity
Use --breaking to show only breaking changes:
operator-chaos diff \
--source knowledge/odh/v2.10/ \
--target knowledge/rhoai/v3.3/ \
--breaking
Machine-Readable Output
For CI integration, use JSON format:
{
  "source": {
    "path": "knowledge/odh/v2.10/",
    "platform": "odh",
    "operatorCount": 15
  },
  "target": {
    "path": "knowledge/rhoai/v3.3/",
    "platform": "rhoai",
    "operatorCount": 11
  },
  "changes": [
    {
      "type": "ComponentRename",
      "severity": "Breaking",
      "operator": "dashboard",
      "source": "odh-dashboard",
      "target": "rhods-dashboard",
      "details": {
        "namespaceChanged": true,
        "sourceNamespace": "opendatahub",
        "targetNamespace": "redhat-ods-applications"
      }
    }
  ],
  "summary": {
    "breaking": 15,
    "warnings": 3,
    "info": 8
  }
}
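One way a CI job might consume this report is to fail the pipeline whenever breaking changes are present. The sketch below assumes the JSON shape shown above and nothing more; the function names and the exit-code convention are illustrative choices, not behavior of operator-chaos itself.

```python
import json


def breaking_changes(report: dict) -> list[dict]:
    """Return the entries in `changes` whose severity is Breaking."""
    return [c for c in report.get("changes", []) if c.get("severity") == "Breaking"]


def gate_exit_code(report_text: str) -> int:
    """Return 1 if the report lists or counts breaking changes, else 0."""
    report = json.loads(report_text)
    found = breaking_changes(report)
    for change in found:
        # `source` and `target` may be absent for some change types.
        print(
            f"BREAKING {change.get('type')}: {change.get('operator')} "
            f"({change.get('source')} -> {change.get('target')})"
        )
    counted = report.get("summary", {}).get("breaking", 0)
    return 1 if found or counted > 0 else 0
```

A pipeline step could pass the saved report text to gate_exit_code and hand the result to sys.exit, so that a non-zero value blocks the upgrade job.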
CRD Schema Diffing
The operator-chaos diff-crds command analyzes Custom Resource Definition schema changes between versions:
# Extract CRDs from source cluster
oc get crd inferenceservices.serving.kserve.io -o yaml > crds-source/inferenceservice.yaml
# Extract CRDs from target cluster
oc get crd inferenceservices.serving.kserve.io -o yaml > crds-target/inferenceservice.yaml
# Diff the schemas
operator-chaos diff-crds --source crds-source/ --target crds-target/
Example Output
CRD Schema Changes:
inferenceservices.serving.kserve.io
API Version: v1beta1
spec.predictor.llm [BREAKING: Field removed]
spec.predictor.runtime [INFO: Field added]
spec.predictor.serviceAccountName [BREAKING: Required added]
spec.transformer.timeout [WARNING: Default changed] (60s → 120s)
API Version: v1alpha1 [BREAKING: API version removed]
Summary:
3 breaking changes
1 warning
1 informational change
Severity Levels
| Severity | Change Type | Impact |
|---|---|---|
| Breaking | Field removal | Existing resources with this field will fail validation |
| Breaking | Type change | Field type incompatibility (e.g., string → integer) |
| Breaking | Required field added | Resources without this field will be rejected |
| Breaking | Enum value removed | Resources using removed values will fail |
| Breaking | API version removed | Resources using removed version become inaccessible |
| Warning | Default value changed | Behavior changes for resources that don't specify the field |
| Info | Field added | New optional field available |
| Info | Enum value added | New valid option available |
Breaking changes require migration
CRD breaking changes typically require updating existing custom resources before or during the upgrade. Plan for migration scripts or admission webhooks that transform old resources to the new schema.
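To make the severity rules concrete, here is a simplified classifier over two OpenAPI-style object schemas. It is a sketch under stated assumptions, not the diff-crds implementation: it handles only properties, required, type, default, and enum, ignores API version removal, and compares enums only when both sides declare one.

```python
def diff_schema(old: dict, new: dict, path: str = "spec") -> list[tuple[str, str, str]]:
    """Return (severity, field path, change type) tuples for two object schemas."""
    changes = []
    old_props, new_props = old.get("properties", {}), new.get("properties", {})
    old_req, new_req = set(old.get("required", [])), set(new.get("required", []))

    for name in sorted(old_props.keys() | new_props.keys()):
        field = f"{path}.{name}"
        if name not in new_props:
            changes.append(("Breaking", field, "Field removed"))
            continue
        if name not in old_props:
            # A new field is harmless only when it is optional.
            kind = ("Breaking", "Required field added") if name in new_req else ("Info", "Field added")
            changes.append((kind[0], field, kind[1]))
            continue

        o, n = old_props[name], new_props[name]
        if o.get("type") != n.get("type"):
            changes.append(("Breaking", field, "Type change"))
            continue
        if name in new_req and name not in old_req:
            changes.append(("Breaking", field, "Required field added"))
        if o.get("default") != n.get("default"):
            changes.append(("Warning", field, "Default value changed"))
        if "enum" in o and "enum" in n:
            if set(o["enum"]) - set(n["enum"]):
                changes.append(("Breaking", field, "Enum value removed"))
            if set(n["enum"]) - set(o["enum"]):
                changes.append(("Info", field, "Enum value added"))
        if n.get("type") == "object":
            changes.extend(diff_schema(o, n, field))  # recurse into nested objects
    return changes
```

Sorting the field names keeps the output stable between runs, which makes the result easier to compare in CI logs.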
Simulating Upgrades
The operator-chaos simulate-upgrade command generates and executes chaos experiments based on detected architectural differences:
# Preview experiments without executing
operator-chaos simulate-upgrade \
--source knowledge/odh/v2.10/ \
--target knowledge/rhoai/v3.3/ \
--dry-run
# Execute upgrade simulation
operator-chaos simulate-upgrade \
--source knowledge/odh/v2.10/ \
--target knowledge/rhoai/v3.3/
Diff-to-Experiment Mapping
The simulation engine maps architectural changes to chaos injection types that test upgrade resilience:
| Change Type | Injection Type | What It Tests |
|---|---|---|
| Component rename | PodKill | Recovery after old-name pods are terminated |
| Namespace move | NetworkPartition | Cross-namespace communication during migration |
| Webhook removal | WebhookDisrupt | Behavior when validation is removed |
| Webhook addition | WebhookDisrupt | Impact of new validation enforcement |
| Dependency added | PodKill | Resilience when new dependency is unavailable |
| Dependency removed | NetworkPartition | Isolation from removed dependencies |
| Resource moved | PodKill + NetworkPartition | Recovery after resource relocation |
Example: For the odh-dashboard → rhods-dashboard rename, the simulation:
- Terminates all odh-dashboard pods (simulating pod name change)
- Disrupts network to old namespace opendatahub (simulating namespace move)
- Validates that dependent components recover correctly
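The mapping table and the dashboard example can be expressed as a small lookup. This is a sketch, not the simulation engine: only the ComponentRename key appears in the JSON diff output, so the other change-type keys are hypothetical names, and the rule that a rename with a namespace change also gets a NetworkPartition is inferred from the example above.

```python
# Change type -> injection types, following the mapping table.
# Keys other than "ComponentRename" are hypothetical identifiers.
INJECTIONS = {
    "ComponentRename": ["PodKill"],
    "NamespaceMove": ["NetworkPartition"],
    "WebhookRemoval": ["WebhookDisrupt"],
    "WebhookAddition": ["WebhookDisrupt"],
    "DependencyAdded": ["PodKill"],
    "DependencyRemoved": ["NetworkPartition"],
    "ResourceMoved": ["PodKill", "NetworkPartition"],
}


def plan_experiments(changes: list[dict]) -> list[dict]:
    """Expand detected changes into a flat list of planned injections."""
    plan = []
    for change in changes:
        injections = list(INJECTIONS.get(change["type"], []))
        # A rename that also moved namespaces is tested with a partition too.
        moved = change.get("details", {}).get("namespaceChanged", False)
        if moved and "NetworkPartition" not in injections:
            injections.append("NetworkPartition")
        for injection in injections:
            plan.append({
                "operator": change["operator"],
                "injection": injection,
                "reason": change["type"],
            })
    return plan
```

Change types that have no entry in the table produce no experiments, which keeps unknown diff entries from silently triggering injections.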
Dry-Run Mode
Use --dry-run to preview generated experiments without executing them:
operator-chaos simulate-upgrade \
--source knowledge/odh/v2.10/ \
--target knowledge/rhoai/v3.3/ \
--dry-run
Output:
Upgrade Simulation Plan (DRY RUN):
Experiment 1: dashboard-rename-recovery
Type: PodKill
Target: dashboard/rhods-dashboard
Reason: Component rename detected (odh-dashboard → rhods-dashboard)
Injection: Kill all pods matching label deployment=rhods-dashboard
Validation: Verify rhods-dashboard Deployment reaches Available=True within 300s
Experiment 2: dashboard-namespace-migration
Type: NetworkPartition
Target: dashboard/rhods-dashboard
Reason: Namespace move detected (opendatahub → redhat-ods-applications)
Injection: Partition network between opendatahub and redhat-ods-applications
Validation: Verify cross-namespace communication recovers
Experiment 3: odh-model-controller-webhook-enforcement
Type: WebhookDisrupt
Target: odh-model-controller/odh-model-controller
Reason: Webhook added (validating.nim.account.odh-model-controller.opendatahub.io)
Injection: Inject webhook failures for 60s
Validation: Verify resources are validated correctly after webhook recovery
Total experiments: 3
Estimated runtime: 15m
Review the experiments for architectural correctness before executing.
Customize generated experiments
The simulate-upgrade command writes experiment YAML files to ./upgrade-experiments/ by default. Use --output-dir to specify a custom location. You can edit these files before execution to adjust timeouts, injection parameters, or validation criteria.
Validating Cluster Version
The operator-chaos validate-version command verifies that a cluster matches a specific knowledge snapshot:
# Validate pre-upgrade state (cluster should match source version)
operator-chaos validate-version --knowledge-dir knowledge/odh/v2.10/
# Validate post-upgrade state (cluster should match target version)
operator-chaos validate-version --knowledge-dir knowledge/rhoai/v3.3/
Example Output
Validating cluster against knowledge snapshot:
Path: knowledge/rhoai/v3.3/
Platform: rhoai
Expected version: 3.3.1
Operator: dashboard (version: 3.3.1) [MATCH]
Component: rhods-dashboard [OK]
✓ Deployment rhods-dashboard exists in redhat-ods-applications
✓ Deployment rhods-dashboard is Available
✓ Service rhods-dashboard exists
Operator: kserve (version: 0.13.1) [MATCH]
Component: kserve-controller-manager [OK]
✓ Deployment kserve-controller-manager exists in redhat-ods-applications
✓ Deployment kserve-controller-manager is Available
Validation: PASS
All components match target version 3.3.1
Use Cases
Pre-upgrade gate: Validate that the source cluster is in the expected state before starting an upgrade. If validation fails, the cluster has drifted from the source version and may have unknown state.
Post-upgrade verification: Confirm that the upgrade completed successfully and the cluster now matches the target version. Validation failures indicate incomplete upgrades or rollback scenarios.
Version drift detection
If validate-version fails on a cluster that should match the knowledge snapshot, investigate drift. Common causes: manual changes, partially-applied hotfixes, incomplete rollbacks, or OLM subscription channel mismatches.
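Conceptually, drift detection compares the resources a snapshot expects with what the cluster reports. The sketch below models only that comparison: the expected-resource dictionaries and the observed set of (kind, namespace, name) tuples are hypothetical shapes chosen for illustration, and a real check would also consider conditions such as Available.

```python
def find_drift(expected: list[dict], observed: set[tuple[str, str, str]]) -> list[str]:
    """Return one finding per expected resource that the cluster lacks.

    An empty list corresponds to a PASS verdict.
    """
    findings = []
    for resource in expected:
        key = (resource["kind"], resource["namespace"], resource["name"])
        if key not in observed:
            findings.append(
                f"{resource['kind']} {resource['name']} missing in {resource['namespace']}"
            )
    return findings
```

Used as a pre-upgrade gate, any non-empty result would stop the workflow before the upgrade starts; used after the upgrade, it would flag an incomplete rollout.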
End-to-End Example
Full workflow for testing an ODH 2.10 → RHOAI 3.3 upgrade:
# Step 1: Compare versions to understand changes
operator-chaos diff \
--source knowledge/odh/v2.10/ \
--target knowledge/rhoai/v3.3/
# Step 2: Analyze CRD schema changes
operator-chaos diff-crds \
--source crds-odh-2.10/ \
--target crds-rhoai-3.3/
# Step 3: Preview upgrade experiments
operator-chaos simulate-upgrade \
--source knowledge/odh/v2.10/ \
--target knowledge/rhoai/v3.3/ \
--dry-run
# Step 4: Validate pre-upgrade state
operator-chaos validate-version --knowledge-dir knowledge/odh/v2.10/
# Step 5: Run upgrade simulation
operator-chaos simulate-upgrade \
--source knowledge/odh/v2.10/ \
--target knowledge/rhoai/v3.3/
# Step 6: Validate post-upgrade state
operator-chaos validate-version --knowledge-dir knowledge/rhoai/v3.3/
Expected Results
Step 1 (diff): Identifies 15 breaking changes, 3 warnings. Key changes are component renames and namespace moves.
Step 2 (diff-crds): Detects 1 breaking CRD change (required field added to InferenceService). Plan to update existing InferenceService resources.
Step 3 (dry-run): Generates 12 experiments covering renames, namespace migrations, and webhook changes. Estimated runtime 45 minutes.
Step 4 (validate pre-upgrade): All components match ODH 2.10 baseline. Ready for upgrade.
Step 5 (simulate-upgrade): Executes 12 experiments. Results:
- 10 experiments: Resilient (components recovered within SLO)
- 2 experiments: Degraded (odh-model-controller required 15 reconcile cycles after webhook addition)
Step 6 (validate post-upgrade): All components match RHOAI 3.3 target. Upgrade successful.
Findings: The webhook addition to odh-model-controller causes inefficient reconciliation. Investigate controller logic for optimization before production rollout.
Run simulations in pre-production
Upgrade simulations inject real chaos into the cluster. Run them in dedicated test environments that mirror production topology (same namespace layout, RBAC policies, NetworkPolicies). Do not run simulations in production.
Upgrade Playbooks
Upgrade playbooks automate multi-step upgrade paths by orchestrating OLM channel hops, pre-migration tasks, chaos-based resilience validation, and post-upgrade verification. Instead of manually triggering each step, define a YAML playbook that executes the entire upgrade workflow with resume-on-failure support.
Playbook Format
An upgrade playbook is a YAML file defining the source version, target version, the OLM channel hop path, and a sequence of steps to execute:
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: UpgradePlaybook
metadata:
name: rhoai-2.10-to-3.3
description: "Upgrade RHOAI from 2.10 to 3.3 with pre-migration and post-validation"
upgrade:
source:
knowledgeDir: knowledge/rhoai/v2.10/
version: "2.10"
target:
knowledgeDir: knowledge/rhoai/v3.3/
version: "3.3"
path:
operator: rhods-operator
namespace: redhat-ods-operator
hops:
- channel: stable-2.16
maxWait: 15m
- channel: stable-3.0
maxWait: 20m
- channel: stable-3.3
maxWait: 20m
steps:
- name: validate-source-version
type: validate-version
- name: migrate-dashboard-configmaps
type: kubectl
commands:
- "oc get configmap -n opendatahub -l app=odh-dashboard --no-headers"
- name: trigger-upgrade
type: olm
- name: post-upgrade-resilience
type: chaos
experiments:
- experiments/dashboard/pod-kill.yaml
- name: validate-target-version
type: validate-version
Step Types
Playbooks support five step types:
| Step Type | Description | Parameters |
|---|---|---|
| validate-version | Verify cluster state matches knowledge snapshot | Uses source/target knowledgeDir from playbook metadata |
| kubectl | Run kubectl/oc commands | commands (array of shell commands to execute) |
| manual | Pause for operator intervention | description (instructions for operator), optional autoCheck (command to auto-verify completion) |
| olm | Trigger OLM channel hops | Uses path.hops from playbook metadata |
| chaos | Execute chaos experiments | experiments (array of experiment YAML file paths) |
Running a Playbook
Execute an upgrade playbook with operator-chaos upgrade run:
operator-chaos upgrade run --playbook playbook.yaml
What happens:
- Creates a state file in .upgrade-state/ to track progress
- Executes each step sequentially
- For validate-version steps, compares cluster state to knowledge snapshot
- For kubectl steps, runs commands and captures output
- For manual steps, pauses and waits for operator confirmation (or runs autoCheck in CI mode)
- For olm steps, patches Subscription channel and monitors CSV installation
- For chaos steps, runs experiments and collects verdicts
- Saves state after each step for resume-on-failure
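The run loop described above can be sketched roughly as follows. The state-file layout (a JSON object with completed and failed keys), the handler callables, and the function name are all assumptions made for illustration; the real state format is not documented here.

```python
import json
from pathlib import Path


def run_playbook(steps, handlers, state_file: Path, resume_from=None) -> bool:
    """Run steps in order, saving progress after each one.

    `handlers` maps a step type to a callable that returns True on success.
    """
    if state_file.exists():
        state = json.loads(state_file.read_text())
    else:
        state = {"completed": []}

    def save():
        state_file.parent.mkdir(parents=True, exist_ok=True)
        state_file.write_text(json.dumps(state, indent=2))

    names = [step["name"] for step in steps]
    start = names.index(resume_from) if resume_from else 0

    for step in steps[start:]:
        if not handlers[step["type"]](step):
            state["failed"] = step["name"]  # remembered so a later run can resume here
            save()
            return False
        state["completed"].append(step["name"])
        state.pop("failed", None)
        save()
    return True
```

Writing the state after every step, rather than once at the end, is what allows a later invocation to pick up at the failed step without repeating completed work.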
Discovering Available Channels
Before writing a playbook, discover available OLM channels for an operator:
Output:
Available channels for rhods-operator:
stable-2.10 (installed: 2.10.0)
stable-2.16 (latest: 2.16.3)
stable-3.0 (latest: 3.0.2)
stable-3.3 (latest: 3.3.1)
Current subscription: stable-2.10
Use --format json for machine-readable output.
Triggering Single Hops
For simple upgrades (single channel hop), use operator-chaos upgrade trigger instead of writing a full playbook:
operator-chaos upgrade trigger \
--operator rhods-operator \
--namespace redhat-ods-operator \
--channel stable-3.3
This patches the Subscription and monitors the upgrade until the new CSV is installed or the timeout expires.
Monitoring In-Progress Upgrades
Attach to an already-running upgrade (triggered externally or by another process):
operator-chaos upgrade monitor \
--operator rhods-operator \
--namespace redhat-ods-operator \
--timeout 30m
The monitor command watches CSV transitions and prints status updates until the upgrade completes or fails.
Resume on Failure
If an upgrade playbook fails mid-execution, resume from the failed step:
# Resume from the last failed step
operator-chaos upgrade run --playbook playbook.yaml --resume-from migrate-dashboard-configmaps
# Force resume (skip step validation and start from specified step)
operator-chaos upgrade run --playbook playbook.yaml --force-resume --resume-from trigger-upgrade
State files are stored in .upgrade-state/<playbook-name>/ with timestamps. Use --state-dir to customize the location.
CI Mode
For unattended execution in CI pipelines:
operator-chaos upgrade run --playbook playbook.yaml --skip-manual --allow-shell
Flags:
| Flag | Behavior |
|---|---|
| --skip-manual | Auto-proceed through manual steps using autoCheck commands |
| --allow-shell | Allow shell command execution (kubectl, oc) without prompts |
If a manual step lacks an autoCheck command, it is skipped with a warning in --skip-manual mode.
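The manual-step behavior in both modes can be illustrated with a short sketch. The function name, the prompt wording, and the use of the command's exit status as the verdict are assumptions; only the description and autoCheck fields come from the playbook format.

```python
import subprocess


def handle_manual_step(step: dict, skip_manual: bool, confirm=input) -> bool:
    """Return True when the playbook may proceed past a manual step."""
    if not skip_manual:
        # Interactive mode: show the instructions and wait for the operator.
        answer = confirm(f"{step['description']}\nDone? [y/N] ")
        return answer.strip().lower() == "y"

    check = step.get("autoCheck")
    if not check:
        print(f"warning: manual step {step['name']!r} has no autoCheck; skipping")
        return True

    # shell=True because autoCheck commands may use pipes and redirection.
    return subprocess.run(check, shell=True).returncode == 0
```

Because the verdict depends on the exit status, an autoCheck command should exit non-zero when the condition it verifies does not hold.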
Dry-Run Mode
Preview playbook execution without making changes:
Dry-run mode prints the execution plan and validates all steps, but does not patch Subscriptions, run kubectl commands, or execute chaos experiments.
Example Playbook
The following playbook upgrades RHOAI from 2.10 to 3.3 with pre-migration and post-validation:
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: UpgradePlaybook
metadata:
name: rhoai-2.10-to-3.3
description: "Upgrade RHOAI from 2.10 to 3.3 with pre-migration and post-validation"
upgrade:
source:
knowledgeDir: knowledge/rhoai/v2.10/
version: "2.10"
target:
knowledgeDir: knowledge/rhoai/v3.3/
version: "3.3"
path:
operator: rhods-operator
namespace: redhat-ods-operator
hops:
- channel: stable-2.16
maxWait: 15m
- channel: stable-3.0
maxWait: 20m
- channel: stable-3.3
maxWait: 20m
steps:
# Pre-upgrade: verify cluster is at expected source version
- name: validate-source-version
type: validate-version
# Pre-upgrade: migrate dashboard ConfigMaps to new namespace
- name: migrate-dashboard-configmaps
type: kubectl
commands:
- "oc get configmap -n opendatahub -l app=odh-dashboard --no-headers 2>/dev/null | wc -l | xargs -I{} echo '{} configmaps found in opendatahub namespace'"
verify:
type: resourceExists
apiVersion: v1
kind: ConfigMap
namespace: redhat-ods-applications
labelSelector: app=odh-dashboard
# Pre-upgrade: verify InferenceServices are healthy before touching kserve
- name: confirm-inferenceservices-healthy
type: manual
description: "Verify all InferenceServices are in Ready state before proceeding. Run: oc get isvc -A"
autoCheck: "! oc get isvc -A -o jsonpath='{.items[*].status.conditions[?(@.type==\"Ready\")].status}' 2>/dev/null | grep -q False"
# Trigger the OLM upgrade through 3 channel hops
- name: trigger-upgrade
type: olm
# Post-upgrade: run chaos experiments to verify resilience
- name: post-upgrade-resilience
type: chaos
experiments:
- experiments/dashboard/pod-kill.yaml
- experiments/kserve/main-controller-kill.yaml
- experiments/odh-model-controller/pod-kill.yaml
# Post-upgrade: verify cluster matches target version
- name: validate-target-version
type: validate-version
OLM Upgrade Path Considerations
When planning OLM-based upgrade testing, be aware that the RHOAI operator catalog has limited upgrade graph connectivity. Real-world testing on OCP 4.20 with RHOAI revealed:
Channel Structure
| Channel | Available Versions | Notes |
|---|---|---|
| fast-3.x | 3.0.0, 3.2.0, 3.2.1, 3.3.0, 3.3.1 | Only channel containing 3.0.0 |
| stable-3.3 | 3.3.0, 3.3.1 | |
| stable-3.x | 3.3.0, 3.3.1 | |
Upgrade Path Gaps
The RHOAI catalog entries lack explicit replaces fields in the package manifest. OLM resolves upgrades using semver ordering within a channel, but this has limitations:
- Within fast-3.x: OLM resolves 3.0.0 → 3.2.1 automatically (skips 3.2.0). The installed CSV sets replaces: rhods-operator.3.0.0 at install time, even though the catalog metadata doesn't declare it.
- 3.2.1 → 3.3.x dead-end: OLM does not propose an upgrade from 3.2.1 to 3.3.x on any channel, because no 3.3.x entry declares replaces: rhods-operator.3.2.1.
- Cross-channel upgrades: Switching from fast-3.x to stable-3.3 does not trigger an upgrade if the current version (3.2.1) is not in the target channel.
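The dead-end can be reasoned about by treating replaces declarations as edges in a graph and searching for a path. The sketch below is a deliberately simplified model: it considers only replaces edges (real OLM resolution also honors skips and skipRange), and the edge from 3.3.1 to 3.3.0 is an assumption added for illustration.

```python
from collections import deque


def upgrade_path(replaces: dict[str, str], current: str, target: str):
    """Return the list of versions from current to target, or None if unreachable.

    `replaces` maps each version to the version it replaces.
    """
    # Invert the mapping into successor edges: old version -> newer versions.
    successors = {}
    for new, old in replaces.items():
        successors.setdefault(old, []).append(new)

    queue = deque([[current]])
    seen = {current}
    while queue:  # breadth-first search yields the path with the fewest hops
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in successors.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# 3.2.1 replaces 3.0.0 (observed); 3.3.1 replaces 3.3.0 (assumed).
EDGES = {"3.2.1": "3.0.0", "3.3.1": "3.3.0"}
```

With these edges, upgrade_path(EDGES, "3.0.0", "3.2.1") finds a one-hop path, while upgrade_path(EDGES, "3.2.1", "3.3.1") returns None, which mirrors the gap described above.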
Workaround: Subscription Recreation
To force an upgrade across the gap, delete the Subscription and CSV, then create a new Subscription targeting the desired version:
# Delete current subscription and CSV
oc delete subscription rhods-operator -n redhat-ods-operator
oc delete csv rhods-operator.3.2.1 -n redhat-ods-operator
# Re-subscribe to target version
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  channel: stable-3.3
  installPlanApproval: Manual
  name: rhods-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: rhods-operator.3.3.1
EOF
# Approve the install plan
oc get installplan -n redhat-ods-operator
oc patch installplan <plan-name> -n redhat-ods-operator --type merge -p '{"spec":{"approved":true}}'
Validated Upgrade Results
Real cluster validation (OCP 4.20 HyperShift, 2x m5.xlarge) confirmed operator resilience across versions:
| Version | Upgrade Method | PodKill Verdict | Recovery Time |
|---|---|---|---|
| RHOAI 3.0.0 | Fresh install (fast-3.x) | Resilient | 1.36s |
| RHOAI 3.2.1 | OLM auto-upgrade from 3.0.0 | Resilient | 1.27s |
| RHOAI 3.3.1 | Subscription recreation | Resilient | 1.79s |
Use --namespace for RHOAI clusters
Experiment YAML files default to the default namespace. When running on specific platforms, use --namespace to target the correct namespace (e.g., opendatahub for ODH, redhat-ods-applications for RHOAI). This overrides all namespace references in the experiment, including steady-state checks, blast radius validation, and reconciliation checks.
Next Steps
- Learn about Knowledge Models to create version snapshots
- Integrate upgrade testing into CI with CI Integration
- Explore example knowledge snapshots in the knowledge/ directory