Pipeline Error Rate Alert
Description
This alert monitors the error rate of pipelines by calculating the ratio of errors to ingested records over a specified time window. It triggers when the error rate exceeds a configured percentage threshold, helping detect data quality issues and pipeline degradation early.
The error rate is calculated as: (errors / ingested_records) * 100
The alert evaluates pipelines at regular intervals and generates an alert for each pipeline whose error rate exceeds the threshold. A configurable minimum records threshold (min_records) prevents false positives on low-volume pipelines.
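The evaluation described above can be sketched in a few lines of Python. This is only an illustration of the documented formula and the minimum records check, not Monad's implementation; the function names `error_rate_percent` and `should_alert` are hypothetical, and treating "exceeds" as a strict greater-than comparison is an assumption.

```python
def error_rate_percent(errors: int, ingested_records: int) -> float:
    """Return the error rate as a percentage.

    Algebraically the same as (errors / ingested_records) * 100; multiplying
    first keeps simple cases exact in floating point.
    """
    return 100 * errors / ingested_records


def should_alert(errors: int, ingested_records: int,
                 threshold: float, min_records: int = 100) -> bool:
    """Hypothetical check for one pipeline over one time window."""
    # Skip pipelines with no traffic (the ratio is undefined) or with
    # fewer ingested records than min_records.
    if ingested_records <= 0 or ingested_records < min_records:
        return False
    # Assumption: "exceeds" means strictly greater than the threshold.
    return error_rate_percent(errors, ingested_records) > threshold
```

With 75 errors out of 1000 ingested records and a threshold of 5.0, the rate is 7.5% and the check returns True. With 9 errors out of 10 records and the default min_records of 100, the pipeline is skipped and the check returns False.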
Compatible with all Monad tiers
Prerequisites
- Active pipelines generating metrics in your Monad organization
- Pipelines processing data (ingesting records)
- Error tracking enabled on your pipelines
Setup Instructions
- Set the Error Rate Threshold as a percentage (e.g., 5.0 for 5%)
- Specify the Time Window for metric aggregation (e.g., 5m, 1h, 30s)
- Configure the Minimum Records threshold to prevent alerts on low-volume pipelines (defaults to 100)
- Select the pipelines to monitor (leave empty to monitor all organization pipelines)
Configuration Options
Settings
| Setting | Type | Required | Default | Description |
|---|---|---|---|---|
| threshold | float | Yes | - | Error rate percentage threshold (e.g., 5.0 for 5%). Alert triggers when error rate exceeds this value. Must be between 0 and 100. |
| time_window | string | Yes | - | Time window for metric aggregation, in PromQL duration format (e.g., 5m, 1h, 30s). |
| min_records | integer | No | 100 | Minimum number of ingested records required to evaluate the error rate. Pipelines below this threshold are skipped to prevent false positives on low-volume data. |
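As a rough illustration of the constraints in the table, the settings could be validated client-side before being submitted. The class name `ErrorRateAlertConfig` is hypothetical. Treating the 0 and 100 bounds as inclusive and rejecting negative min_records values are assumptions that this page does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRateAlertConfig:
    """Hypothetical container mirroring the settings table."""

    threshold: float          # required, percentage
    time_window: str          # required, e.g. "5m"
    min_records: int = 100    # optional, documented default

    def __post_init__(self) -> None:
        # Assumption: the documented 0-100 range is inclusive.
        if not 0 <= self.threshold <= 100:
            raise ValueError("threshold must be between 0 and 100")
        if not self.time_window:
            raise ValueError("time_window is required")
        # Assumption: a negative record count is never meaningful.
        if self.min_records < 0:
            raise ValueError("min_records must not be negative")
```

For example, `ErrorRateAlertConfig(threshold=5.0, time_window="15m")` picks up the default min_records of 100, while a threshold of 150.0 raises a ValueError.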
Time Window Format
The time window follows PromQL duration format:
- 30s: 30 seconds
- 5m: 5 minutes
- 1h: 1 hour
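To check a time window value before saving a configuration, a small parser along these lines may help. It is a sketch that accepts only the three suffixes listed on this page (s, m, h). PromQL itself defines further units and compound durations, which this hypothetical `parse_time_window` helper deliberately rejects.

```python
import re

# Seconds per unit for the suffixes documented on this page.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_DURATION_RE = re.compile(r"(\d+)([smh])")


def parse_time_window(value: str) -> int:
    """Convert a duration such as "5m" into a number of seconds."""
    match = _DURATION_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported time window: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]
```

Here `parse_time_window("30s")` yields 30, `parse_time_window("5m")` yields 300, and `parse_time_window("1h")` yields 3600.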
Alert JSON Format
When the error rate exceeds the threshold, the alert generates the following JSON structure:
{
  "rule_id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "High Error Rate Alert",
  "organization_id": "org-123",
  "severity": "critical",
  "description": "Pipeline pipeline-abc-123 error rate 7.50% exceeds threshold of 5.00%",
  "metadata": {
    "pipeline_id": "pipeline-abc-123",
    "error_rate": 7.5,
    "error_count": 75,
    "ingested_count": 1000,
    "threshold": 5.0,
    "time_window": "5m",
    "min_records": 100
  },
  "resource": {
    "resource_type": "pipeline",
    "resource_id": "pipeline-abc-123"
  }
}
Alert Metadata Fields
- pipeline_id: The ID of the pipeline that triggered the alert
- error_rate: The calculated error rate as a percentage
- error_count: The total number of errors in the time window
- ingested_count: The total number of ingested records in the time window
- threshold: The configured error rate threshold percentage
- time_window: The time window used for metric aggregation
- min_records: The minimum records threshold configured
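A consumer of this alert, such as a notification handler, might read the metadata fields roughly as follows. The field names come from the sample payload above. The `summarize_alert` function and the output format are hypothetical and only show one way to use the fields.

```python
import json


def summarize_alert(payload: str) -> str:
    """Build a one-line summary from an alert JSON string."""
    alert = json.loads(payload)
    meta = alert["metadata"]
    return (
        f"[{alert['severity']}] {meta['pipeline_id']}: "
        f"{meta['error_count']}/{meta['ingested_count']} errors "
        f"({meta['error_rate']:.2f}%) over {meta['time_window']}, "
        f"threshold {meta['threshold']:.2f}%"
    )
```

For the sample alert shown earlier, this produces `[critical] pipeline-abc-123: 75/1000 errors (7.50%) over 5m, threshold 5.00%`.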
Use Cases
- Data Quality Monitoring: Detect degradation in data quality when error rates spike
- Pipeline Health: Monitor pipeline reliability and catch processing issues early
- SLA Compliance: Ensure error rates stay within acceptable service level agreements
- Anomaly Detection: Identify unusual patterns that might indicate upstream data issues or configuration problems
- Production Monitoring: Alert on-call teams when pipelines are experiencing elevated error rates
- Low-Volume Protection: Use min_records to avoid noisy alerts on pipelines with sporadic or minimal traffic
Limitations
- Threshold must be between 0 and 100 (percentage)
- Time window format must follow PromQL conventions (s, m, h suffixes)
- Requires both error and ingestion metrics to be available
- Pipelines whose ingested record count is below min_records are skipped (no alert generated)
- Error rate calculation requires at least one ingested record (ingested_count > 0); with zero ingested records the ratio is undefined
Example Configurations
High Sensitivity for Critical Pipelines
{
  "threshold": 1.0,
  "time_window": "5m",
  "min_records": 50
}
Alerts when the error rate exceeds 1% over 5 minutes, with a low minimum records threshold for quick detection.
Standard Production Monitoring
{
  "threshold": 5.0,
  "time_window": "15m",
  "min_records": 100
}
Alerts when the error rate exceeds 5% over 15 minutes, skipping pipelines with fewer than 100 ingested records in the window.
Low-Volume Pipeline Monitoring
{
  "threshold": 10.0,
  "time_window": "1h",
  "min_records": 10
}
Alerts when the error rate exceeds 10% over 1 hour, suitable for pipelines with lower traffic volumes.