Pipeline Status Alert

Description

This alert monitors pipeline status and triggers when a pipeline has remained in a specified status (Erroring or Throttled) for a sustained minimum duration. It surfaces pipelines with prolonged operational issues that need attention.

The alert evaluates pipelines at regular intervals and generates alerts for each pipeline where the specified status has been sustained for the entire configured time window.
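The "sustained for the entire time window" rule can be illustrated with a minimal sketch. It is not Monad's implementation: the function name `is_sustained`, the `(timestamp, status)` sample format, and the requirement that at least one sample predates the window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def is_sustained(samples, status, window, now):
    """Return True if `status` held for the whole of [now - window, now].

    `samples` is a list of (timestamp, status) tuples (hypothetical format).
    """
    start = now - window
    ordered = sorted(samples, key=lambda sample: sample[0])
    # Statuses observed at or before the window opened.
    before = [s for t, s in ordered if t <= start]
    # Statuses observed inside the window.
    inside = [s for t, s in ordered if start < t <= now]
    if not before:
        return False  # history does not cover the full window
    # The status when the window opened and every later sample must match.
    return before[-1] == status and all(s == status for s in inside)

now = datetime(2024, 1, 1, 12, 0)
steady = [(now - timedelta(minutes=m), "Erroring") for m in (6, 4, 2, 0)]
# Same history, but the status changed once inside the window.
flapping = steady[:2] + [(now - timedelta(minutes=2), "Throttled")] + steady[3:]

print(is_sustained(steady, "Erroring", timedelta(minutes=5), now))    # True
print(is_sustained(flapping, "Erroring", timedelta(minutes=5), now))  # False
```

A single status change inside the window is enough to suppress the alert, which matches the "no status changes during the window" behavior described under Limitations.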

Prerequisites

  1. Active pipelines in your Monad organization
  2. Understanding of the pipeline status values you want to monitor (Erroring or Throttled)

Setup Instructions

  1. Choose the Pipeline Status to monitor (Erroring or Throttled)
  2. Set the Time Window for the minimum sustained duration (e.g., 5m, 15m, 1h)
  3. Select the pipelines to monitor (leave empty to monitor all organization pipelines)

Configuration Options

Settings

| Setting | Type | Required | Default | Description |
|---|---|---|---|---|
| status | string | Yes | - | Pipeline status to monitor: Erroring or Throttled. The alert triggers when the pipeline sustains this status for the entire time window. |
| time_window | string | Yes | 5m | Minimum duration the status must be sustained (e.g., 5m, 15m, 1h). The alert only triggers if the status remains constant throughout this period. |

Status Options

  • Erroring: Pipeline is experiencing errors
  • Throttled: Pipeline has been throttled due to backpressure, indicating stream capacity issues

Time Window Format

The time window follows PromQL duration format:

  • 5m - 5 minutes
  • 15m - 15 minutes
  • 1h - 1 hour
  • 6h - 6 hours
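To make the format concrete, the following sketch converts a time window string into a Python `timedelta`. The helper `parse_time_window` is hypothetical, and it deliberately accepts only the `m` and `h` suffixes shown on this page; PromQL itself defines further units, but whether this alert accepts them is not stated here.

```python
import re
from datetime import timedelta

# Only the suffixes documented on this page (assumption: others are rejected).
_UNITS = {"m": "minutes", "h": "hours"}

def parse_time_window(value: str) -> timedelta:
    """Parse strings such as '5m' or '1h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([mh])", value)
    if match is None:
        raise ValueError(f"unsupported time window: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if amount == 0:
        raise ValueError("time window must be greater than zero")
    return timedelta(**{_UNITS[unit]: amount})

print(parse_time_window("15m"))  # 0:15:00
print(parse_time_window("6h"))   # 6:00:00
```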

Alert JSON Format

When a pipeline sustains the specified status for the configured duration, the alert generates the following JSON structure:

{
  "rule_id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "Pipeline Status Alert",
  "organization_id": "org-123",
  "severity": "critical",
  "description": "Pipeline pipeline-abc-123 has been in Erroring status for at least 5m",
  "metadata": {
    "pipeline_id": "pipeline-abc-123",
    "status": "Erroring",
    "time_window": "5m"
  },
  "resource": {
    "resource_type": "pipeline",
    "resource_id": "pipeline-abc-123"
  }
}

Alert Metadata Fields

  • pipeline_id: The ID of the pipeline that triggered the alert
  • status: The pipeline status that was sustained (Erroring or Throttled)
  • time_window: The time window used to determine sustained status
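A downstream consumer might read these fields as in the sketch below. The function `summarize_alert` and the one-line summary format are hypothetical; only the field names come from the alert structure documented above, and the payload is trimmed to the fields the sketch uses.

```python
import json

# Trimmed copy of the documented alert structure, used as sample input.
SAMPLE_ALERT = """
{
  "name": "Pipeline Status Alert",
  "severity": "critical",
  "metadata": {
    "pipeline_id": "pipeline-abc-123",
    "status": "Erroring",
    "time_window": "5m"
  },
  "resource": {"resource_type": "pipeline", "resource_id": "pipeline-abc-123"}
}
"""

def summarize_alert(payload: str) -> str:
    """Build a short, human-readable line from an alert payload."""
    alert = json.loads(payload)
    meta = alert["metadata"]
    return (
        f"[{alert['severity']}] pipeline {meta['pipeline_id']} "
        f"{meta['status']} for >= {meta['time_window']}"
    )

print(summarize_alert(SAMPLE_ALERT))
```

A summary like this could be forwarded to a chat channel or ticketing system; how alerts are actually delivered is outside the scope of this page.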

Use Cases

  • Error Recovery Monitoring: Alert when pipelines are erroring for extended periods, indicating a problem that needs manual intervention
  • Throttle Detection: Detect pipelines being throttled for sustained periods that may indicate capacity or resource issues
  • Operational Awareness: Get notified when critical pipelines are experiencing status issues continuously
  • Incident Response: Enable faster response to operational problems by alerting when status conditions persist
  • SLA Compliance: Detect pipelines that remain in an Erroring or Throttled status longer than your operational requirements allow

Limitations

  • Status must be one of the valid options: Erroring or Throttled
  • Time window format must follow PromQL conventions (m, h suffixes)
  • The alert triggers only when the status is sustained for the entire time window (no status changes during the window)
  • Requires pipeline status metrics to be available in Prometheus
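The first two limitations can be checked before a configuration is submitted. The sketch below is an illustrative client-side check, not part of the product: `validate_settings` is a hypothetical helper, and treating status names as case-sensitive and rejecting a zero-length window are assumptions.

```python
import re

VALID_STATUSES = {"Erroring", "Throttled"}

def validate_settings(settings: dict) -> list[str]:
    """Return a list of problems found in an alert settings object."""
    errors = []
    status = settings.get("status")
    # Assumption: status names are case-sensitive.
    if not isinstance(status, str) or status not in VALID_STATUSES:
        errors.append(f"status must be one of {sorted(VALID_STATUSES)}, got {status!r}")
    # 5m is the documented default for time_window.
    window = settings.get("time_window", "5m")
    if not isinstance(window, str) or re.fullmatch(r"[1-9]\d*[mh]", window) is None:
        errors.append(f"time_window must look like 5m or 1h, got {window!r}")
    return errors

print(validate_settings({"status": "Erroring", "time_window": "5m"}))  # []
print(validate_settings({"status": "Throttled", "time_window": "30"}))
```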

Example Configurations

Alert on Sustained Errors (5 minutes)

{
  "status": "Erroring",
  "time_window": "5m"
}

Alerts when a pipeline has been in Erroring status continuously for 5 minutes.

Alert on Extended Throttling (30 minutes)

{
  "status": "Throttled",
  "time_window": "30m"
}

Alerts when a pipeline has been throttled for at least 30 minutes, indicating a sustained capacity issue.

Alert on Long-Running Errors (1 hour)

{
  "status": "Erroring",
  "time_window": "1h"
}

Alerts only when a pipeline has been in Erroring status continuously for a full hour, reducing noise from transient issues.