
Pipeline Error Rate Alert

Description

This alert monitors the error rate of pipelines by calculating the ratio of errors to ingested records over a specified time window. It triggers when the error rate exceeds a configured percentage threshold, helping detect data quality issues and pipeline degradation early.

The error rate is calculated as: (errors / ingested_records) * 100

The alert evaluates pipelines at regular intervals and generates a separate alert for each pipeline whose error rate exceeds the threshold. A configurable minimum records threshold prevents false positives on low-volume pipelines.
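The evaluation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Monad's implementation: the function name `evaluate_pipeline` and its return convention are hypothetical, "exceeds" is read as strictly greater than, and a pipeline with exactly `min_records` ingested records is treated as eligible.

```python
from typing import Optional


def evaluate_pipeline(errors: int, ingested: int, threshold: float,
                      min_records: int = 100) -> Optional[float]:
    """Return the error rate (percent) if an alert should fire, else None.

    Hypothetical helper; the name and return convention are assumptions.
    """
    # Skip low-volume pipelines, and never divide by zero.
    if ingested < min_records or ingested <= 0:
        return None
    error_rate = (errors / ingested) * 100
    # Fire only when the rate strictly exceeds the threshold.
    return error_rate if error_rate > threshold else None


print(evaluate_pipeline(errors=75, ingested=1000, threshold=5.0))  # 7.5
print(evaluate_pipeline(errors=5, ingested=50, threshold=5.0))     # None (below min_records)
```

The second call shows the low-volume guard: 5 errors in 50 records is a 10% error rate, but the pipeline is skipped because it ingested fewer than 100 records.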

Compatible with all Monad tiers

Prerequisites

  1. Active pipelines generating metrics in your Monad organization
  2. Pipelines processing data (ingesting records)
  3. Error tracking enabled on your pipelines

Setup Instructions

  1. Set the Error Rate Threshold as a percentage (e.g., 5.0 for 5%)
  2. Specify the Time Window for metric aggregation (e.g., 5m, 1h, 30s)
  3. Configure the Minimum Records threshold to prevent alerts on low-volume pipelines (defaults to 100)
  4. Select the pipelines to monitor (leave empty to monitor all organization pipelines)

Configuration Options

Settings

Setting | Type | Required | Default | Description
threshold | float | Yes | - | Error rate percentage threshold (e.g., 5.0 for 5%). Alert triggers when error rate exceeds this value. Must be between 0 and 100.
time_window | string | Yes | - | Time window for metric aggregation using PromQL format (e.g., 5m, 1h, 30s)
min_records | integer | No | 100 | Minimum number of ingested records required to evaluate the error rate. Pipelines below this threshold are skipped to prevent false positives on low-volume data.

Time Window Format

The time window follows PromQL duration format:

  • 30s - 30 seconds
  • 5m - 5 minutes
  • 1h - 1 hour
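To make the format concrete, here is a minimal sketch that converts such a duration string into seconds. The helper `window_to_seconds` is hypothetical and deliberately handles only the three suffixes listed above. PromQL itself accepts further units (such as d and w) and compound durations, but whether this alert accepts them is not stated here.

```python
import re

# Seconds per unit for the three suffixes listed above.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_DURATION_RE = re.compile(r"([1-9][0-9]*)([smh])")


def window_to_seconds(window: str) -> int:
    """Convert a duration such as '5m' into seconds (hypothetical helper)."""
    match = _DURATION_RE.fullmatch(window)
    if match is None:
        raise ValueError(f"unsupported time window: {window!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


print(window_to_seconds("30s"))  # 30
print(window_to_seconds("5m"))   # 300
print(window_to_seconds("1h"))   # 3600
```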

Alert JSON Format

When the error rate exceeds the threshold, the alert generates the following JSON structure:

{
  "rule_id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "High Error Rate Alert",
  "organization_id": "org-123",
  "severity": "critical",
  "description": "Pipeline pipeline-abc-123 error rate 7.50% exceeds threshold of 5.00%",
  "metadata": {
    "pipeline_id": "pipeline-abc-123",
    "error_rate": 7.5,
    "error_count": 75,
    "ingested_count": 1000,
    "threshold": 5.0,
    "time_window": "5m",
    "min_records": 100
  },
  "resource": {
    "resource_type": "pipeline",
    "resource_id": "pipeline-abc-123"
  }
}

Alert Metadata Fields

  • pipeline_id: The ID of the pipeline that triggered the alert
  • error_rate: The calculated error rate as a percentage
  • error_count: The total number of errors in the time window
  • ingested_count: The total number of ingested records in the time window
  • threshold: The configured error rate threshold percentage
  • time_window: The time window used for metric aggregation
  • min_records: The minimum records threshold configured
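A consumer of this alert might read the metadata fields and cross-check the reported rate against the raw counts. The sketch below assumes the payload arrives as a JSON string shaped like the example alert. The helper `summarize_alert`, the 0.01 tolerance, and the summary format are all hypothetical.

```python
import json

# Abbreviated copy of the example alert payload.
ALERT_JSON = """
{
  "name": "High Error Rate Alert",
  "severity": "critical",
  "metadata": {
    "pipeline_id": "pipeline-abc-123",
    "error_rate": 7.5,
    "error_count": 75,
    "ingested_count": 1000,
    "threshold": 5.0,
    "time_window": "5m",
    "min_records": 100
  }
}
"""


def summarize_alert(raw: str) -> str:
    """Build a one-line summary from an alert payload (hypothetical helper)."""
    meta = json.loads(raw)["metadata"]
    # Recompute the rate from the raw counts as a sanity check.
    recomputed = meta["error_count"] / meta["ingested_count"] * 100
    if abs(recomputed - meta["error_rate"]) > 0.01:
        raise ValueError("error_rate does not match error_count / ingested_count")
    return (f"{meta['pipeline_id']}: {meta['error_rate']:.2f}% errors "
            f"({meta['error_count']}/{meta['ingested_count']}) "
            f"over {meta['time_window']}, threshold {meta['threshold']:.2f}%")


print(summarize_alert(ALERT_JSON))
# pipeline-abc-123: 7.50% errors (75/1000) over 5m, threshold 5.00%
```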

Use Cases

  • Data Quality Monitoring: Detect degradation in data quality when error rates spike
  • Pipeline Health: Monitor pipeline reliability and catch processing issues early
  • SLA Compliance: Ensure error rates stay within acceptable service level agreements
  • Anomaly Detection: Identify unusual patterns that might indicate upstream data issues or configuration problems
  • Production Monitoring: Alert on-call teams when pipelines are experiencing elevated error rates
  • Low-Volume Protection: Use min_records to avoid noisy alerts on pipelines with sporadic or minimal traffic

Limitations

  • Threshold must be between 0 and 100 (percentage)
  • Time window format must follow PromQL conventions (s, m, h suffixes)
  • Requires both error and ingestion metrics to be available
  • Pipelines with ingested record count below min_records are skipped (no alert generated)
  • Error rate cannot be calculated when no records were ingested in the time window (requires ingested_count > 0)
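The constraints above could be checked before a configuration is submitted. The following is a rough sketch, not the validation Monad performs: `validate_config` is a hypothetical helper, the 0 to 100 range is assumed to be inclusive, min_records is assumed to be non-negative, and only the s, m, and h suffixes are accepted.

```python
import re
from typing import Any, Dict, List


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in an alert config (empty if none)."""
    problems = []

    threshold = config.get("threshold")
    # Assumption: the range is inclusive at both ends.
    if not _is_number(threshold) or not 0 <= threshold <= 100:
        problems.append("threshold must be a number between 0 and 100")

    window = config.get("time_window")
    if not isinstance(window, str) or re.fullmatch(r"[1-9][0-9]*[smh]", window) is None:
        problems.append("time_window must look like 30s, 5m, or 1h")

    # min_records is optional and defaults to 100.
    min_records = config.get("min_records", 100)
    if not _is_number(min_records) or not isinstance(min_records, int) or min_records < 0:
        problems.append("min_records must be a non-negative integer")

    return problems


print(validate_config({"threshold": 5.0, "time_window": "15m"}))  # []
```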

Example Configurations

High Sensitivity for Critical Pipelines

{
  "threshold": 1.0,
  "time_window": "5m",
  "min_records": 50
}

Alerts when error rate exceeds 1% over 5 minutes, with a low minimum records threshold for quick detection.

Standard Production Monitoring

{
  "threshold": 5.0,
  "time_window": "15m",
  "min_records": 100
}

Alerts when error rate exceeds 5% over 15 minutes, filtering out low-volume pipelines.

Low-Volume Pipeline Monitoring

{
  "threshold": 10.0,
  "time_window": "1h",
  "min_records": 10
}

Alerts when error rate exceeds 10% over 1 hour, suitable for pipelines with lower traffic volumes.