data-science-pipelines Failure Modes

Coverage

Injection Type    Danger  Experiment              Description
FinalizerBlock    low     finalizer-block.yaml    A stuck finalizer blocks deletion of a DataSciencePipelinesApplication; the DSPO should handle the Terminating state gracefully without leaking resources.
NetworkPartition  medium  network-partition.yaml  The DSPO pod is partitioned from the API server; it should lose its leader lease, stop reconciling, and recover once the partition lifts.
PodKill           low     pod-kill.yaml           The operator pod is killed; Kubernetes should recreate it and reconciliation should resume without data loss.
RBACRevoke        high    rbac-revoke.yaml        The DSPO ClusterRoleBinding subjects are revoked; the operator should surface permission-denied errors and resume once permissions return.
WebhookDisrupt    high    webhook-disrupt.yaml    The pipeline version validating webhook failurePolicy is weakened from Fail to Ignore; invalid pipeline versions can bypass admission validation.

Experiment Details

data-science-pipelines-finalizer-block

  • Type: FinalizerBlock
  • Danger Level: low
  • Component: data-science-pipelines-operator

When a stuck finalizer prevents a DataSciencePipelinesApplication from being deleted, the DSPO should handle the Terminating state gracefully, report the blocked deletion in its status, and not leak associated pipeline resources. The chaos framework removes the finalizer via TTL-based cleanup after 300s.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: data-science-pipelines-finalizer-block
spec:
  tier: 3
  target:
    operator: data-science-pipelines
    component: data-science-pipelines-operator
    resource: DataSciencePipelinesApplication
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: data-science-pipelines-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: FinalizerBlock
    # IMPORTANT: "test-dspa" is a placeholder. Replace it with the name of an
    # actual DataSciencePipelinesApplication resource deployed in the target
    # namespace before running this experiment.
    parameters:
      apiVersion: datasciencepipelinesapplications.opendatahub.io/v1alpha1
      kind: DataSciencePipelinesApplication
      name: test-dspa
      finalizer: datasciencepipelinesapplications.opendatahub.io/finalizer
    ttl: "300s"
  hypothesis:
    description: >-
      When a stuck finalizer prevents a DataSciencePipelinesApplication
      from being deleted, the DSPO should handle the Terminating state
      gracefully, report the blocked deletion in its status, and not leak
      associated pipeline resources. The chaos framework removes the
      finalizer via TTL-based cleanup after 300s.
    recoveryTimeout: "180s"
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
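
If the TTL-based cleanup fails to fire, the blocking finalizer can be cleared by hand. A minimal sketch, assuming kubectl access to the cluster and the placeholder resource name "test-dspa" from the experiment parameters:

```shell
# Strip the finalizers from the stuck DataSciencePipelinesApplication so
# the pending deletion can complete. Only run this during cleanup.
kubectl patch datasciencepipelinesapplications.opendatahub.io test-dspa \
  -n opendatahub \
  --type=json \
  -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
```

Removing the whole finalizers array also drops any other finalizers on the object; to remove only the experiment's finalizer, target its index in the array instead.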

data-science-pipelines-network-partition

  • Type: NetworkPartition
  • Danger Level: medium
  • Component: data-science-pipelines-operator

When the DSPO pod is network-partitioned from the API server, it should lose its leader lease and stop reconciling. Once the partition is removed, the operator should re-acquire the lease and resume normal operation without duplicate pipeline runs.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: data-science-pipelines-network-partition
spec:
  tier: 2
  target:
    operator: data-science-pipelines
    component: data-science-pipelines-operator
    resource: Deployment/data-science-pipelines-operator-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: data-science-pipelines-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: NetworkPartition
    parameters:
      labelSelector: app.kubernetes.io/name=data-science-pipelines-operator
    ttl: "300s"
  hypothesis:
    description: >-
      When the DSPO pod is network-partitioned from the API server, it
      should lose its leader lease and stop reconciling. Once the partition
      is removed, the operator should re-acquire the lease and resume
      normal operation without duplicate pipeline runs.
    recoveryTimeout: "180s"
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
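
To observe the expected lease loss during the partition, the operator's leader-election Lease can be polled. The Lease name below is a placeholder (controller-runtime derives it at build time); list the Leases in the namespace first to confirm it:

```shell
# Find the operator's leader-election Lease, then read its holder and
# renew time; renewTime should stop advancing while the partition holds.
kubectl get lease -n opendatahub
kubectl get lease <lease-name> -n opendatahub \
  -o jsonpath='{.spec.holderIdentity}{"\n"}{.spec.renewTime}{"\n"}'
```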

data-science-pipelines-pod-kill

  • Type: PodKill
  • Danger Level: low
  • Component: data-science-pipelines-operator

When the data-science-pipelines-operator pod is killed, Kubernetes should recreate it within the recovery timeout. The operator should resume reconciling DataSciencePipelinesApplication resources without data loss or pipeline run interruption.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: data-science-pipelines-pod-kill
spec:
  tier: 1
  target:
    operator: data-science-pipelines
    component: data-science-pipelines-operator
    resource: Deployment/data-science-pipelines-operator-controller-manager
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: data-science-pipelines-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: PodKill
    parameters:
      labelSelector: app.kubernetes.io/name=data-science-pipelines-operator
    count: 1
    ttl: "300s"
  hypothesis:
    description: >-
      When the data-science-pipelines-operator pod is killed, Kubernetes
      should recreate it within the recovery timeout. The operator should
      resume reconciling DataSciencePipelinesApplication resources without
      data loss or pipeline run interruption.
    recoveryTimeout: "120s"
  blastRadius:
    maxPodsAffected: 1
    allowedNamespaces:
      - opendatahub
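
The same injection can be reproduced by hand outside the framework; a sketch assuming the label selector from the experiment parameters matches the operator pod:

```shell
# Kill the operator pod by label selector, then watch the Deployment's
# ReplicaSet bring a replacement up within the recovery timeout.
kubectl delete pod -n opendatahub \
  -l app.kubernetes.io/name=data-science-pipelines-operator
kubectl get pods -n opendatahub \
  -l app.kubernetes.io/name=data-science-pipelines-operator -w
```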

data-science-pipelines-rbac-revoke

  • Type: RBACRevoke
  • Danger Level: high
  • Component: data-science-pipelines-operator

When the DSPO ClusterRoleBinding subjects are revoked, the operator should lose its ability to reconcile DataSciencePipelinesApplication resources across namespaces and surface permission-denied errors. Once permissions are restored, reconciliation should resume.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: data-science-pipelines-rbac-revoke
spec:
  tier: 4
  target:
    operator: data-science-pipelines
    component: data-science-pipelines-operator
    resource: ClusterRoleBinding/data-science-pipelines-operator-manager-rolebinding
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: data-science-pipelines-operator-controller-manager
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: RBACRevoke
    dangerLevel: high
    parameters:
      bindingName: data-science-pipelines-operator-manager-rolebinding
      bindingType: ClusterRoleBinding
    ttl: "60s"
  hypothesis:
    description: >-
      When the DSPO ClusterRoleBinding subjects are revoked, the operator
      should lose its ability to reconcile DataSciencePipelinesApplication
      resources across namespaces and surface permission-denied errors.
      Once permissions are restored, reconciliation should resume.
    recoveryTimeout: "120s"
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
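
While the binding's subjects are revoked, the permission loss can be confirmed by impersonating the operator's service account. The service-account name below is an assumption; read it from the Deployment's spec.template.spec.serviceAccountName to be sure:

```shell
# Expect "no" while the RBACRevoke injection is active and "yes" after
# the 60s TTL restores the ClusterRoleBinding subjects.
kubectl auth can-i list datasciencepipelinesapplications.opendatahub.io \
  --all-namespaces \
  --as=system:serviceaccount:opendatahub:data-science-pipelines-operator-controller-manager
```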

data-science-pipelines-webhook-disrupt

  • Type: WebhookDisrupt
  • Danger Level: high
  • Component: ds-pipelines-webhook

When the pipeline version validating webhook failurePolicy is weakened from Fail to Ignore, invalid pipeline versions can bypass admission validation. The chaos framework restores the original failurePolicy via TTL-based cleanup after 60s.

Experiment YAML
apiVersion: chaos.operatorchaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: data-science-pipelines-webhook-disrupt
spec:
  tier: 4
  target:
    operator: data-science-pipelines
    component: ds-pipelines-webhook
    resource: ValidatingWebhookConfiguration/validating.pipelineversions.pipelines.kubeflow.org
  steadyState:
    checks:
      - type: conditionTrue
        apiVersion: apps/v1
        kind: Deployment
        name: ds-pipelines-webhook
        namespace: opendatahub
        conditionType: Available
    timeout: "30s"
  injection:
    type: WebhookDisrupt
    dangerLevel: high
    parameters:
      webhookName: validating.pipelineversions.pipelines.kubeflow.org
      action: setFailurePolicy
      value: Ignore
    ttl: "60s"
  hypothesis:
    description: >-
      When the pipeline version validating webhook failurePolicy is
      weakened from Fail to Ignore, invalid pipeline versions can bypass
      admission validation. The chaos framework restores the original
      failurePolicy via TTL-based cleanup after 60s.
    recoveryTimeout: "120s"
  blastRadius:
    maxPodsAffected: 1
    allowDangerous: true
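
If the 60s TTL cleanup does not restore the webhook, its failurePolicy can be inspected and reset manually. The webhook index 0 below is an assumption; a ValidatingWebhookConfiguration can hold several webhooks, so check the list first:

```shell
# Read the current failurePolicy, then patch it back to Fail if the
# chaos framework's TTL-based restore did not run.
kubectl get validatingwebhookconfiguration \
  validating.pipelineversions.pipelines.kubeflow.org \
  -o jsonpath='{.webhooks[0].failurePolicy}{"\n"}'
kubectl patch validatingwebhookconfiguration \
  validating.pipelineversions.pipelines.kubeflow.org \
  --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Fail"}]'
```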