Skip to content

Dashboard (Chaos Operator)

The chaos operator dashboard provides a web interface for monitoring chaos experiments, tracking operator resilience over time, and visualizing operator dependency graphs with chaos coverage overlays.

Overview

The dashboard runs as a single Go binary that:

  1. Watches ChaosExperiment CRs via the Kubernetes API (polling at a configurable interval)
  2. Persists experiment snapshots to SQLite (WAL mode) for historical queries
  3. Serves a REST API consumed by the React frontend
  4. Streams live experiment updates via Server-Sent Events (SSE)
  5. Embeds the compiled React build as static assets (single binary deployment)
graph TB
    subgraph Cluster["Kubernetes Cluster"]
        API["K8s API Server"]
        CRs["ChaosExperiment CRs"]
        CRs --> API
    end

    subgraph Binary["Dashboard Binary (single Go process)"]
        direction TB
        Watcher["K8s Watcher<br/>(poll interval)"]
        Store["SQLite Store<br/>(WAL mode)"]
        REST["REST API<br/>/api/v1/*"]
        SSE["SSE Stream<br/>/api/v1/experiments/live"]
        Static["Embedded Static Assets<br/>(go:embed)"]

        Watcher --> Store
        Store --> REST
        Store --> SSE
    end

    subgraph Frontend["React Frontend (TypeScript)"]
        Overview["Overview"]
        Experiments["Experiments"]
        Operators["Operators"]
        Knowledge["Knowledge"]
        Live["Live Monitor"]
        Suites["Suites"]
    end

    API -- "list/watch" --> Watcher
    REST -- "JSON" --> Frontend
    SSE -- "events" --> Live
    Static -- "serves" --> Frontend

    KnowledgeDir["knowledge/*.yaml"] -. "loaded at startup" .-> Binary

    style Cluster fill:#e8f0fe,stroke:#4285f4
    style Binary fill:#fef7e0,stroke:#f9ab00
    style Frontend fill:#e6f4ea,stroke:#34a853

Prerequisites

  • Go 1.25+
  • Node.js 18+ and npm (for building the frontend)
  • Kubernetes/OpenShift cluster with ChaosExperiment CRs
  • kubeconfig with read access to ChaosExperiment resources

Building

# 1. Build the React frontend
cd dashboard/ui
npm ci
npm run build
cd ../..

# 2. Build the Go binary (embeds the compiled UI)
go build -o bin/chaos-dashboard ./dashboard/cmd/dashboard/

The npm run build step outputs to dashboard/ui-dist/, which is embedded into the Go binary via go:embed.

Running

# Basic usage (in-cluster or with default kubeconfig)
bin/chaos-dashboard

# With all options
bin/chaos-dashboard \
  -addr :8080 \
  -db dashboard.db \
  -kubeconfig ~/.kube/config \
  -knowledge-dir knowledge/ \
  -sync-interval 30s

Then open http://localhost:8080.

Flags

Flag Description Default
-addr HTTP listen address :8080
-db SQLite database path dashboard.db
-kubeconfig Path to kubeconfig (uses in-cluster config if empty)
-knowledge-dir Directory containing operator knowledge YAML files
-sync-interval How often to poll K8s for ChaosExperiment updates 30s

Knowledge Models

The -knowledge-dir flag loads operator knowledge YAML files at startup. These power the Knowledge view's dependency graph and coverage overlays. Without this flag, the Knowledge page will show no data.

The repository ships with knowledge models for 7 operators in the knowledge/ directory:

Operator File Components
odh-model-controller odh-model-controller.yaml odh-model-controller
kserve kserve.yaml kserve-controller-manager, llmisvc-controller-manager, kserve-localmodel-controller-manager, kserve-localmodelnode-agent
opendatahub-operator opendatahub-operator.yaml opendatahub-operator
dashboard dashboard.yaml odh-dashboard
data-science-pipelines data-science-pipelines.yaml data-science-pipelines-operator, ds-pipelines-webhook
model-registry model-registry.yaml model-registry-operator
workbenches workbenches.yaml odh-notebook-controller, kf-notebook-controller

Views

Overview (/)

Cluster-wide resilience health at a glance:

  • Summary cards: Total, Resilient, Degraded, Failed, Running experiment counts
  • Trend indicators: Comparison to previous period (up/down arrows with delta)
  • Verdict timeline: 30-day sparkline of daily verdict counts
  • Recovery metrics: Average recovery time by injection type
  • Running experiments: Currently active experiments with phase and component
  • Recent experiments: Latest completed experiments table

Overview

Live (/live)

Real-time monitoring of running experiments via SSE:

  • Phase stepper: Horizontal dot-and-line visualization showing experiment progress through 7 phases (Pending, Pre-check, Injecting, Observing, Post-check, Evaluating, Complete)
  • Active phase pulse: Blue pulsing animation on the current phase
  • Aborted state: Red indicator on the phase where the experiment was aborted
  • Elapsed time: Auto-updating timer showing how long the experiment has been running
  • Reconnection banner: Warning when SSE connection drops, with automatic exponential backoff reconnection (1s, 2s, 4s, ... up to 30s max)

Live

All Experiments (/experiments)

Filterable, sortable table of all experiments:

  • Filters: Namespace, Operator, Component, Type, Verdict, Phase
  • Search: Name substring search with 300ms debounce
  • Sorting: By name, date, or recovery time (ascending/descending)
  • Pagination: Configurable page size (10, 25, 50)
  • Verdict badges: Color-coded badges (green=Resilient, yellow=Degraded, red=Failed, purple=Inconclusive)

Experiments

Experiment Detail (/experiments/:namespace/:name)

Deep dive into a single experiment across 7 tabs:

Tab Content
Summary Key-value metadata: operator, component, type, recovery time, hypothesis, blast radius
Evaluation Verdict, confidence, recovery time, reconcile cycles, deviations
Steady State Pre-check and post-check results with pass/fail per check
Injection Log Timestamped inject/revert events with target and details
Conditions Status conditions table (type, status, reason, message)
YAML Full CR YAML with copy and download buttons
Debug observedGeneration, cleanupError, raw status JSON

Experiment Detail

Suites (/suites)

Suite run history and version comparison:

  • Suite cards: Each suite run shows name, version, experiment count, and a stacked progress bar (green/yellow/red proportional to Resilient/Degraded/Failed)
  • Expandable table: Click a suite card to see its experiments with verdict and recovery time
  • Version comparison: Select two runs of the same suite to compare results side-by-side with delta indicators (improved, regressed, no change)

Suites

Operators (/operators)

Per-operator resilience insights:

  • Operator cards: Health bar showing Resilient/Degraded/Failed proportions
  • Component accordion: Expandable list of components per operator
  • Coverage matrix: 11-column grid (one per injection type) showing best verdict per type (green=all Resilient, yellow=any Degraded, red=any Failed, gray=untested)
  • Recent experiments: Latest 5 experiments per component with links to detail

Operators

Knowledge (/knowledge)

Interactive dependency graph visualization:

  • Operator/Component selectors: Dropdown toolbars to navigate the knowledge model
  • SVG dependency graph: Deterministic layout with central controller node and managed resources arranged around it
  • Coverage coloring: Nodes colored by chaos test coverage (green=Resilient, yellow=Degraded, red=Failed, gray=untested)
  • Experiment count badges: Number of experiments run against each resource
  • Detail panel: Side panel showing managed resources list, coverage tags, and chaos coverage summary
  • Zoom controls: +/- buttons for graph navigation

Knowledge

REST API Reference

All endpoints are read-only (GET), prefixed with /api/v1/. The dashboard is strictly read-only: it cannot create, modify, or delete experiments.

Experiments

GET /api/v1/experiments

List experiments with optional filters and pagination.

Parameter Type Description
namespace string Filter by namespace
operator string Filter by operator name
component string Filter by component name
type string Filter by injection type
verdict string Filter by verdict
phase string Filter by phase
search string Name substring search
since string ISO 8601 datetime or Go duration (e.g., 24h)
sort string Sort field: name, date, recovery
order string asc or desc
page int Page number (1-based)
pageSize int Items per page (default 10, max 500)

Response:

{
  "items": [...],
  "totalCount": 30
}

GET /api/v1/experiments/:namespace/:name

Get a single experiment (latest run by start time).

GET /api/v1/experiments/live

SSE stream (text/event-stream). Each event is a full experiment JSON object. Events are broadcast on every status change detected by the K8s watcher.

Overview

GET /api/v1/overview/stats

Aggregated dashboard statistics.

Parameter Type Description
since string ISO 8601 datetime or Go duration

Response:

{
  "total": 30,
  "resilient": 23,
  "degraded": 4,
  "failed": 1,
  "inconclusive": 0,
  "running": 2,
  "trends": { "total": 5, "resilient": 3, "degraded": 1, "failed": -1 },
  "verdictTimeline": [{ "date": "2026-03-01", "resilient": 3, "degraded": 1, "failed": 0 }],
  "avgRecoveryByType": { "PodKill": 12000, "ConfigDrift": 28000 },
  "runningExperiments": [{ "name": "omc-podkill", "namespace": "opendatahub", "phase": "Observing", "component": "odh-model-controller", "type": "PodKill" }]
}

Operators

GET /api/v1/operators

List operator names (distinct from experiment data).

GET /api/v1/operators/:operator/components

List component names for an operator.

Knowledge

GET /api/v1/knowledge/:operator/:component

Returns the ComponentModel from the loaded knowledge YAML files. Requires -knowledge-dir to be set at startup.

Response:

{
  "name": "odh-model-controller",
  "controller": "DataScienceCluster",
  "managedResources": [
    { "apiVersion": "apps/v1", "kind": "Deployment", "name": "odh-model-controller", "namespace": "opendatahub" }
  ],
  "webhooks": [
    { "name": "validating.odh-model-controller.opendatahub.io", "type": "validating", "path": "/validate" }
  ],
  "finalizers": ["odh.inferenceservice.finalizers"]
}

Suites

Suite runs are identified by well-known labels on ChaosExperiment CRs:

Label Description
chaos.operatorchaos.io/suite-name Suite definition name
chaos.operatorchaos.io/suite-run-id Unique run ID
chaos.operatorchaos.io/operator-version Operator version under test

GET /api/v1/suites

List suite runs with verdict counts.

GET /api/v1/suites/:runId

List experiments in a suite run.

GET /api/v1/suites/compare?suite=X&runA=Y&runB=Z

Compare two runs of the same suite. Returns { "runA": [...], "runB": [...] }.

SQLite Database

The dashboard persists experiment data in SQLite with WAL mode for concurrent read/write access. The database is automatically created and migrated on first run.

Schema

The experiments table stores one row per experiment run, keyed by {namespace}/{name}/{startTime}. Re-running an experiment with the same name creates a new row rather than overwriting history.

Indexed columns: namespace, operator, component, verdict, phase, injection_type, start_time, suite_run_id, suite_name.

Backup

The database file is a single file at the path specified by -db. To back up:

sqlite3 dashboard.db ".backup backup.db"

Deployment

In-Cluster (Kubernetes)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: chaos-dashboard
  namespace: opendatahub
spec:
  replicas: 1
  selector:
    matchLabels:
      app: chaos-dashboard
  template:
    metadata:
      labels:
        app: chaos-dashboard
    spec:
      serviceAccountName: chaos-dashboard
      containers:
        - name: dashboard
          image: quay.io/opendatahub/chaos-dashboard:latest
          args:
            - -addr=:8080
            - -db=/data/dashboard.db
            - -knowledge-dir=/knowledge
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: data
              mountPath: /data
            - name: knowledge
              mountPath: /knowledge
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: chaos-dashboard-data
        - name: knowledge
          configMap:
            name: chaos-knowledge
---
apiVersion: v1
kind: Service
metadata:
  name: chaos-dashboard
  namespace: opendatahub
spec:
  selector:
    app: chaos-dashboard
  ports:
    - port: 8080
      targetPort: 8080

The service account needs get, list, and watch permissions on chaosexperiments.chaos.operatorchaos.io resources.

Local Development

# Terminal 1: Run the backend (proxies to Vite dev server for HMR)
go run ./dashboard/cmd/dashboard/ -knowledge-dir knowledge/

# Terminal 2: Run the Vite dev server
cd dashboard/ui && npm run dev

Configure the Vite dev server to proxy /api/ requests to the Go backend (port 8080) in vite.config.ts.

Tech Stack

Layer Technology
Frontend React 18, TypeScript, Vite, Vitest + React Testing Library
Backend Go, net/http, k8s.io/client-go
Storage SQLite (via modernc.org/sqlite, pure Go, no CGO)
Build Vite (frontend), go:embed (serve static assets)