Sample

Passes through only a percentage of records, either randomly or deterministically by key.

Overview

The sample condition selects a percentage of records to pass through. It supports two modes:

  • Random sampling: When no key is specified (or * is used), each record has an independent random chance of being selected based on the configured percentage.
  • Hash-based sampling: When a key is specified, the condition hashes the value of that key to deterministically decide whether the record passes. Records with the same key value always produce the same result, providing consistent sampling across evaluations.
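
The two modes described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual implementation: the condition uses xxhash (see Behavior Details), but hashlib.sha256 stands in here so the sketch is self-contained.

```python
import hashlib
import random

def sample(record, percent, key=None):
    """Sketch of the sample condition: random mode when no key,
    deterministic hash mode when a key path is given.
    hashlib.sha256 stands in for the xxhash used in practice."""
    if key is None or key == "*":
        # Random mode: each record gets an independent chance.
        return random.random() * 100 < percent
    # Hash mode: walk the dot-notation path to the key's value.
    value = record
    for part in key.split("."):
        if not isinstance(value, dict) or part not in value:
            return False  # missing key: the condition is false
        value = value[part]
    # Hash the string representation into a bucket in [0, 10000).
    digest = hashlib.sha256(str(value).encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000
    return bucket < percent * 100
```

Note that the hash mode never consults a random source, which is what makes the decision repeatable for a given key value.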

Use Cases

  • Cost Reduction: Reduce data volume sent to expensive destinations by sampling a fraction of records
  • Load Management: Limit the volume of data flowing to downstream systems
  • Development and Testing: Sample a small percentage of production data for debugging or testing pipelines
  • Statistical Analysis: Collect a representative subset of records for analysis
  • Consistent Routing: Use hash-based sampling to ensure records with the same key are always included or excluded together

Configuration

Setting   Type     Required   Description
percent   number   Yes        Percentage of records to pass through. Must be greater than 0 and at most 100.
key       string   No         Field path to hash for deterministic sampling. If empty or *, random sampling is used. Supports dot notation for nested fields.

Examples

Random 10% Sample

Pass through approximately 10% of all records at random:

{
  "type_id": "sample",
  "config": {
    "percent": 10
  }
}

Each record has an independent 10% chance of being selected. Over a large number of records, the output converges to roughly 10% of the input.
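
The convergence is easy to check empirically. A quick simulation of the 10% random mode (a hypothetical model of the behavior, not the product's code):

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
n = 100_000
# Each trial mimics one record's independent 10% chance.
kept = sum(1 for _ in range(n) if random.random() * 100 < 10)
ratio = kept / n  # close to 0.10 for large n; small n varies more
```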

Half of All Records

Pass through approximately 50% of records:

{
  "type_id": "sample",
  "config": {
    "percent": 50
  }
}

Hash-Based Sampling by User

Consistently sample 25% of users by hashing the user_id field:

{
  "type_id": "sample",
  "config": {
    "percent": 25,
    "key": "user_id"
  }
}

All records with the same user_id value will either pass or be excluded together. This is useful when you need a consistent subset of users rather than a random scattering of individual records.
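
The cohort property can be demonstrated with a small sketch. As elsewhere on this page, hashlib.sha256 is a stand-in for the condition's actual xxhash:

```python
import hashlib

def in_sample(value, percent):
    # Map the key's string value to a bucket in [0, 10000) and
    # pass when the bucket falls below percent * 100.
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") % 10000 < percent * 100

# Every record for the same user gets the same decision:
records = [{"user_id": "u42", "event": i} for i in range(5)]
decisions = {in_sample(r["user_id"], 25) for r in records}
assert len(decisions) == 1  # all True or all False, never mixed
```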

Hash-Based Sampling by Nested Field

Sample by a nested field using dot notation:

{
  "type_id": "sample",
  "config": {
    "percent": 20,
    "key": "event.source"
  }
}

Combined with Filters

Sample 5% of production debug logs:

{
  "operator": "and",
  "conditions": [
    {"type_id": "equals", "config": {"key": "environment", "value": "production"}},
    {"type_id": "equals", "config": {"key": "log_level", "value": "debug"}},
    {"type_id": "sample", "config": {"percent": 5}}
  ]
}

Common Patterns

Tail Sampling

Keep all errors while sampling a small percentage of normal traffic. This ensures you never miss an error but avoids overloading your log destination with routine records:

{
  "operator": "or",
  "conditions": [
    {"type_id": "equals", "config": {"key": "log_level", "value": "error"}},
    {"type_id": "sample", "config": {"percent": 10}}
  ]
}

Every error record passes through because the or operator short-circuits on the first matching condition. The remaining non-error records each have a 10% chance of being included.
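
The tail-sampling logic reduces to a simple rule, sketched here as a hypothetical evaluator (the function name and structure are illustrative, not the product's API):

```python
import random

def passes_tail_sample(record, percent=10):
    # The "or" short-circuits: errors always pass; everything
    # else falls through to the random sample.
    if record.get("log_level") == "error":
        return True
    return random.random() * 100 < percent

errors = [{"log_level": "error"} for _ in range(100)]
assert all(passes_tail_sample(r) for r in errors)  # never drop an error
```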

Cost-Effective Monitoring

Send a fraction of records to an expensive analytics destination while routing all records to cheaper storage:

Analytics edge (sampled):

{
  "operator": "and",
  "conditions": [
    {"type_id": "sample", "config": {"percent": 10}}
  ]
}

Storage edge (all records):

{
  "operator": "always",
  "conditions": []
}

Consistent User Sampling

Sample a stable cohort of users for A/B analysis:

{
  "operator": "and",
  "conditions": [
    {"type_id": "key_exists", "config": {"key": "user_id"}},
    {"type_id": "sample", "config": {"percent": 10, "key": "user_id"}}
  ]
}

Using key_exists ensures that records without a user_id are excluded rather than randomly sampled.

Tiered Sampling by Severity

Apply different sample rates depending on severity:

Critical events (keep all):

{
  "operator": "and",
  "conditions": [
    {"type_id": "equals", "config": {"key": "severity", "value": "critical"}}
  ]
}

Info events (keep 1%):

{
  "operator": "and",
  "conditions": [
    {"type_id": "equals", "config": {"key": "severity", "value": "info"}},
    {"type_id": "sample", "config": {"percent": 1}}
  ]
}

Best Practices

  1. Use hash-based sampling for consistency: When you need the same records to be selected across multiple evaluations, specify a key. This ensures deterministic behavior.

  2. Start with higher percentages: Begin with a larger sample and reduce as you understand your data volume and downstream capacity.

  3. Use random sampling for statistical representativeness: When you need an unbiased sample, omit the key to get independent random selection per record.

  4. Account for upstream filtering: If other conditions in your pipeline already reduce volume, adjust your sample percentage accordingly.

Behavior Details

Random Sampling (no key)

  • Each record is independently evaluated with a random number
  • Over large volumes, the output will converge to the configured percentage
  • Small batches may show variance from the target percentage
  • No state is maintained between evaluations

Hash-Based Sampling (with key)

  • Uses a fast, stable hash (xxhash) on the string representation of the key's value
  • The same key value always produces the same pass/fail result
  • Different key values are distributed uniformly across the hash space
  • If the specified key does not exist on a record, the condition returns false
  • Numeric, boolean, and other types are hashed by their string representation

Limitations

  • Cannot guarantee an exact count of records — only a statistical percentage
  • Hash-based sampling depends on the distribution of key values; highly skewed keys may produce uneven results
  • Does not support the not parameter
  • The percent value must be greater than 0 and at most 100
  • Random sampling has no memory — it does not guarantee uniform spacing between selected records

Troubleshooting

Output volume is higher or lower than expected:

  • Random sampling is probabilistic; small volumes will show more variance
  • Verify the percent value is set correctly (e.g., 10 for 10%, not 0.1)
  • Check whether other conditions in the pipeline are also filtering records

Hash-based sampling excludes all records:

  • Verify the key exists in your records
  • Check the key path is correct (use dot notation for nested fields)
  • If the key is missing, the condition returns false

Different records for the same key are getting different results:

  • This can happen if the key value varies (e.g., trailing whitespace, different casing)
  • Hash-based sampling hashes the exact string representation of the value
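
Because the exact string representation is hashed, values that look equivalent to a human can still land in different buckets. A small sketch (again using hashlib.sha256 as a stand-in for xxhash):

```python
import hashlib

def key_repr(value):
    # The condition hashes the exact string representation of the
    # key's value; nothing is trimmed or lowercased first.
    return hashlib.sha256(str(value).encode()).hexdigest()

# These look like "the same user" but hash differently:
assert key_repr("u42") != key_repr("u42 ")  # trailing whitespace
assert key_repr("u42") != key_repr("U42")   # different casing
assert key_repr("u42") == key_repr("u42")   # identical values agree
```

Normalizing the key value upstream (trimming whitespace, fixing casing) before the sample condition restores consistent results.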