Observer Blackboard Pattern¶
The observation system uses the Blackboard architectural pattern to collect evidence from multiple independent observers. This enables holistic evaluation of experiment outcomes by combining reconciliation metrics, steady-state checks, and collateral damage detection.
What is the Blackboard Pattern?¶
The Blackboard pattern is a software architecture where multiple specialized "knowledge sources" (observers) contribute information to a shared data structure (blackboard). Each observer:
- Reads the current state from the blackboard
- Performs its specialized analysis
- Writes findings back to the blackboard
- Does NOT directly communicate with other observers
This decouples observers and makes the system extensible — new observers can be added without modifying existing ones.
Why Use It for Chaos Engineering?¶
Chaos experiments require multi-dimensional observation:
- Did the target operator recover? (Reconciliation)
- Did the system return to baseline? (Steady-state)
- Did other components break? (Collateral damage)
Traditional single-observer systems miss the full picture. The blackboard pattern allows:
- Parallel observation — Multiple observers run concurrently
- Incremental evidence — Observers contribute at different times
- Holistic evaluation — Evaluator sees all evidence, not just one observer's view
Architecture¶
graph TB
subgraph orch["Orchestrator"]
Orchestrator[Lifecycle Orchestrator]
end
subgraph bb["Blackboard"]
Board["ObservationBoard<br/>(thread-safe shared state)"]
end
subgraph p1["Phase 1 · Blocking"]
Recon["Reconciliation<br/>Contributor"]
end
subgraph p2["Phase 2 · Concurrent"]
Steady["Steady-State<br/>Contributor"]
Collateral["Collateral Damage<br/>Contributor"]
end
subgraph ev["Evaluation"]
Evaluator[Evaluator]
end
Orchestrator -->|"1. Create board"| Board
Orchestrator -->|"2. Run Phase 1"| Recon
Recon -->|"3. Write findings"| Board
Orchestrator -->|"4. Run Phase 2"| Steady
Orchestrator -->|"4. Run Phase 2"| Collateral
Steady -->|"5. Write findings"| Board
Collateral -->|"5. Write findings"| Board
Board -->|"6. Read all findings"| Evaluator
style orch fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#4a148c
style bb fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#bf360c
style p1 fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#0d47a1
style p2 fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1b5e20
style ev fill:#fce4ec,stroke:#c62828,stroke-width:2px,color:#b71c1c
style Orchestrator fill:#ce93d8,stroke:#6a1b9a
style Board fill:#ffcc80,stroke:#e65100
style Recon fill:#bbdefb,stroke:#1565c0
style Steady fill:#a5d6a7,stroke:#2e7d32
style Collateral fill:#a5d6a7,stroke:#2e7d32
style Evaluator fill:#ef9a9a,stroke:#c62828
Core Components¶
ObservationBoard¶
Thread-safe data structure for storing findings:
type ObservationBoard struct {
mu sync.Mutex
findings []Finding
}
func (b *ObservationBoard) AddFinding(f Finding) {
b.mu.Lock()
defer b.mu.Unlock()
b.findings = append(b.findings, f)
}
func (b *ObservationBoard) Findings() []Finding {
b.mu.Lock()
defer b.mu.Unlock()
return append([]Finding(nil), b.findings...) // Return copy
}
Key Features:
- Thread-safe for concurrent writes
- Returns defensive copies to prevent external mutation
- Supports filtering by finding source
Finding¶
Structured evidence unit:
type Finding struct {
Source FindingSource // reconciliation, steady_state, collateral
Component string // Target component
Operator string // Target operator
Passed bool // Did observation pass?
Details string // Human-readable details
Checks *v1alpha1.CheckResult // Steady-state check results
ReconciliationResult *ReconciliationResult // Reconciliation metrics
}
Finding Sources:
const (
SourceReconciliation FindingSource = "reconciliation"
SourceSteadyState FindingSource = "steady_state"
SourceCollateral FindingSource = "collateral"
)
ObservationContributor¶
Interface for all observers:
type ObservationContributor interface {
Observe(ctx context.Context, board *ObservationBoard) error
}
Contract:
- Perform observation (watch resources, run checks, etc.)
- Write findings to board via
board.AddFinding(finding) - Return error if observation fails (does NOT cancel other contributors)
Observation Contributors¶
1. ReconciliationContributor¶
Monitors the target operator's reconciliation behavior during recovery.
What It Observes:
- Count of reconcile cycles triggered
- Duration of reconciliation window
- Whether reconciliation stabilized or thrashed
Implementation:
type ReconciliationContributor struct {
checker ReconciliationCheckerInterface
component *model.Component
namespace string
timeout time.Duration
}
func (r *ReconciliationContributor) Observe(ctx context.Context, board *ObservationBoard) error {
startTime := time.Now()
result, err := r.checker.WaitForReconciliation(ctx, r.component, r.namespace, r.timeout)
if err != nil {
return fmt.Errorf("reconciliation check failed: %w", err)
}
board.AddFinding(Finding{
Source: SourceReconciliation,
Component: r.component.Name,
Operator: r.component.Operator,
Passed: result.Cycles > 0,
ReconciliationResult: result,
})
return nil
}
Reconciliation Result:
type ReconciliationResult struct {
Cycles int // Number of reconcile events observed
Duration time.Duration // Time to stabilization
FirstEventTime time.Time // First reconcile event
LastEventTime time.Time // Last reconcile event
}
Evaluation:
- < 3 cycles →
DEGRADED(slow or failed recovery) - 3-10 cycles →
RESILIENT(normal recovery) - > 10 cycles →
DEGRADED(thrashing)
2. SteadyStateContributor¶
Runs user-defined checks to verify system baseline.
What It Observes:
- Resource existence (
CheckResourceExists) - Condition status (
CheckConditionTrue)
Implementation:
type SteadyStateContributor struct {
observer Observer
checks []v1alpha1.SteadyStateCheck
namespace string
}
func (s *SteadyStateContributor) Observe(ctx context.Context, board *ObservationBoard) error {
checkResult, err := s.observer.CheckSteadyState(ctx, s.checks, s.namespace)
if err != nil {
return fmt.Errorf("steady-state check failed: %w", err)
}
board.AddFinding(Finding{
Source: SourceSteadyState,
Passed: checkResult.Passed,
Checks: checkResult,
})
return nil
}
Check Types:
steadyState:
checks:
- type: resourceExists
kind: Deployment
name: my-operator
namespace: opendatahub
- type: conditionTrue
kind: DataScienceCluster
name: default
conditionType: Ready
CheckResult:
type CheckResult struct {
Passed bool // Overall pass/fail
ChecksRun int32 // Total checks executed
ChecksPassed int32 // Checks that passed
Details []CheckDetail // Per-check results
Timestamp metav1.Time // When check ran
}
type CheckDetail struct {
Check SteadyStateCheck
Passed bool
Value string // Actual value observed
Error string // Error message if failed
}
3. CollateralContributor¶
Checks dependent components for cascading failures.
What It Observes:
- Steady-state of components that depend on the experiment target
- Uses dependency graph to identify dependents
Implementation:
type CollateralContributor struct {
observer Observer
dependents []model.ComponentRef
}
func (c *CollateralContributor) Observe(ctx context.Context, board *ObservationBoard) error {
for _, dep := range c.dependents {
// Derive steady-state checks for dependent component
checks := deriveChecksForComponent(dep)
result, err := c.observer.CheckSteadyState(ctx, checks, dep.Namespace)
board.AddFinding(Finding{
Source: SourceCollateral,
Component: dep.Component,
Operator: dep.Operator,
Passed: result.Passed && err == nil,
Checks: result,
})
}
return nil
}
Example:
Target: model-controller
Dependents: [dashboard, notebook-controller]
CollateralContributor checks:
1. Is dashboard's Deployment Ready?
2. Is notebook-controller's StatefulSet Ready?
Evaluation:
- If any dependent fails →
DEGRADED(collateral damage detected)
Execution Phases¶
The orchestrator runs contributors in two phases:
Phase 1: Reconciliation (Blocking)¶
board := observer.NewObservationBoard()
reconContributor := observer.NewReconciliationContributor(
reconciler,
component,
namespace,
recoveryTimeout,
)
if err := reconContributor.Observe(ctx, board); err != nil {
log.Warn("reconciliation contributor error", "error", err)
}
Why Blocking?
Reconciliation observation must complete before steady-state checks run. We need to wait for the recovery window to elapse to accurately count reconcile cycles.
Phase 2: Steady-State & Collateral (Concurrent)¶
var contributors []observer.ObservationContributor
// Steady-state contributor
contributors = append(contributors, observer.NewSteadyStateContributor(
observer, checks, namespace))
// Collateral contributor (if dependents exist)
if len(dependents) > 0 {
contributors = append(contributors, observer.NewCollateralContributor(
observer, dependents))
}
// Run all contributors concurrently
errs := observer.RunContributors(ctx, board, contributors)
for _, err := range errs {
log.Warn("contributor error", "error", err)
}
Why Concurrent?
Steady-state and collateral checks are independent and can run in parallel, reducing total observation time.
RunContributors Implementation¶
func RunContributors(ctx context.Context, board *ObservationBoard, contributors []ObservationContributor) []error {
var (
wg sync.WaitGroup
mu sync.Mutex
errs []error
)
for _, c := range contributors {
wg.Add(1)
go func(contrib ObservationContributor) {
defer wg.Done()
if err := contrib.Observe(ctx, board); err != nil {
mu.Lock()
errs = append(errs, err)
mu.Unlock()
}
}(c)
}
wg.Wait()
return errs
}
Key Design Decision:
Errors from one contributor do NOT cancel other contributors. Each observer runs to completion, maximizing evidence collection.
Evaluation from Findings¶
The evaluator consumes all findings to compute a verdict:
func (e *Evaluator) EvaluateFromFindings(findings []Finding, hypothesis HypothesisSpec) *EvaluationResult {
var (
reconPassed bool
steadyPassed bool
collateralOK = true
reconCycles int
recoveryTime time.Duration
)
for _, f := range findings {
switch f.Source {
case SourceReconciliation:
reconPassed = f.Passed
if f.ReconciliationResult != nil {
reconCycles = f.ReconciliationResult.Cycles
recoveryTime = f.ReconciliationResult.Duration
}
case SourceSteadyState:
steadyPassed = f.Passed
case SourceCollateral:
if !f.Passed {
collateralOK = false
}
}
}
// Decision tree
if !steadyPassed {
return &EvaluationResult{Verdict: Failed, Confidence: "high"}
}
if !collateralOK {
return &EvaluationResult{Verdict: Degraded, Confidence: "medium", Deviations: []string{"collateral damage"}}
}
if reconCycles < 3 {
return &EvaluationResult{Verdict: Degraded, Confidence: "low", Deviations: []string{"slow recovery"}}
}
return &EvaluationResult{Verdict: Resilient, Confidence: "high"}
}
Extensibility¶
Adding a new contributor is straightforward:
1. Define the Contributor¶
type MetricsContributor struct {
promClient prometheus.Client
queries []string
}
func (m *MetricsContributor) Observe(ctx context.Context, board *ObservationBoard) error {
for _, query := range m.queries {
result, err := m.promClient.Query(ctx, query)
if err != nil {
return err
}
board.AddFinding(Finding{
Source: "metrics",
Passed: result.Value > threshold,
Details: fmt.Sprintf("metric %q: %v", query, result.Value),
})
}
return nil
}
2. Register with Orchestrator¶
contributors = append(contributors, &MetricsContributor{
promClient: promClient,
queries: []string{"rate(http_requests_total[5m])"},
})
observer.RunContributors(ctx, board, contributors)
3. Update Evaluator (Optional)¶
for _, f := range findings {
if f.Source == "metrics" && !f.Passed {
deviations = append(deviations, "performance degradation")
}
}
Benefits of the Blackboard Pattern¶
- Decoupling — Observers don't know about each other
- Incremental Evidence — Findings accumulate over time
- Parallel Execution — Contributors run concurrently
- Extensibility — New observers added without modifying existing code
- Fault Isolation — One observer's failure doesn't crash others
- Holistic View — Evaluator sees all evidence, not fragmented results
Real-World Example¶
// Experiment: Kill model-controller pod
board := observer.NewObservationBoard()
// Phase 1: Wait for reconciliation
reconContributor.Observe(ctx, board)
// Writes: {Source: reconciliation, Passed: true, Cycles: 5, Duration: 12s}
// Phase 2: Check baseline + dependents
observer.RunContributors(ctx, board, []ObservationContributor{
steadyStateContributor, // Writes: {Source: steady_state, Passed: true}
collateralContributor, // Writes: {Source: collateral, Component: "dashboard", Passed: false}
})
findings := board.Findings()
// [
// {Source: reconciliation, Passed: true, Cycles: 5},
// {Source: steady_state, Passed: true},
// {Source: collateral, Component: "dashboard", Passed: false}
// ]
result := evaluator.EvaluateFromFindings(findings, hypothesis)
// Verdict: DEGRADED (collateral damage on dashboard)
Next Steps¶
- Injection Engine — How faults are executed
- Architecture Overview — Full system design
- Contributing: Development Setup — Build from source