Amazon S3
Writes data to Amazon S3 buckets in various file formats. Supports configurable partitioning, compression, and batching options for optimal data organization and performance.
Requirements
To configure S3 as an output destination for Monad, complete the following steps:
- Create an IAM Role: Create an IAM role that allows Monad to assume it and access your S3 bucket. You need to set up a trust relationship that permits Monad to assume the role. For more details, refer to the AWS IAM role creation guide.
Copy and paste the trust relationship below:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AssumeRoleWithExternalId",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::339712996529:role/monad-app"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "{your-org-id}"
        }
      }
    },
    {
      "Sid": "TagSession",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::339712996529:role/monad-app"
      },
      "Action": "sts:TagSession"
    }
  ]
}
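If you manage IAM resources in code, the trust policy above can also be generated programmatically. This is a minimal sketch: the `build_trust_policy` helper is illustrative (not part of Monad), the `"org-1234"` value is a placeholder for your actual organization ID, and the principal ARN is the Monad role from the policy above.

```python
import json

# Monad's principal, as given in the trust policy above
MONAD_PRINCIPAL = "arn:aws:iam::339712996529:role/monad-app"

def build_trust_policy(org_id: str) -> dict:
    """Return the trust relationship above with your org ID substituted."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AssumeRoleWithExternalId",
                "Effect": "Allow",
                "Principal": {"AWS": MONAD_PRINCIPAL},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": org_id}},
            },
            {
                "Sid": "TagSession",
                "Effect": "Allow",
                "Principal": {"AWS": MONAD_PRINCIPAL},
                "Action": "sts:TagSession",
            },
        ],
    }

# Serialize for use with, e.g.,
# `aws iam create-role --role-name monad-s3 --assume-role-policy-document file://trust.json`
policy_json = json.dumps(build_trust_policy("org-1234"), indent=2)
```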
- Set IAM Role Permissions:
The role must have permissions to perform these S3 actions:
`s3:PutObject`, `s3:GetObject`, and `s3:ListBucket`. For a full list of S3 permissions, refer to the AWS documentation.
You can either set the permissions via UI or copy the permission policy below:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::{bucket_name}",
        "arn:aws:s3:::{bucket_name}/*"
      ]
    }
  ]
}
Note: The Resource array needs both the bucket ARN itself (for `ListBucket`) and the bucket ARN with `/*` appended (for `PutObject`/`GetObject`).
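As a sketch of the same point in code, the helper below (illustrative, not part of Monad) builds the permission policy scoped to a single bucket, emitting both Resource ARNs:

```python
def build_s3_policy(bucket_name: str) -> dict:
    """Permission policy scoped to one bucket; note the two Resource ARNs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",      # bucket itself, for ListBucket
                    f"arn:aws:s3:::{bucket_name}/*",    # objects in it, for Put/GetObject
                ],
            }
        ],
    }
```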
Functionality
The output continuously sends data to your specified S3 path, formatted as prefix/partition/filename.format.compression, where:
- The partition structure depends on your chosen partition format (simple date or Hive-compliant)
- Files are created based on batching configuration (record count, data size, or time elapsed)
- Data is compressed using your selected compression method before storage
Batching Behavior
Monad batches records before sending to S3 based on three configurable limits:
- Record Count: Maximum number of records per file (default: 100,000, range: 500-1,000,000)
- Data Size: Maximum uncompressed size per file (default: 10 MB, range: 1-25 MB)
- Time Interval: Maximum time before flushing a batch (default: 45 seconds, range: 1-60 seconds)
Whichever limit is reached first triggers the batch to be written to S3. This ensures timely delivery while optimizing file sizes for downstream processing.
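The "whichever limit is reached first" rule can be sketched as a simple predicate. This is an illustrative model of the documented behavior, not Monad's actual implementation; the class and function names are hypothetical.

```python
class BatchLimits:
    """The three documented limits with their default values."""
    def __init__(self, record_count=100_000, data_size=10 * 1024 * 1024, publish_rate=45):
        self.record_count = record_count    # max records per file
        self.data_size = data_size          # max uncompressed bytes per file
        self.publish_rate = publish_rate    # max seconds before a flush

def should_flush(records: int, uncompressed_bytes: int, batch_age_seconds: float,
                 limits: BatchLimits) -> bool:
    """A batch is written to S3 as soon as ANY one limit is reached."""
    return (records >= limits.record_count
            or uncompressed_bytes >= limits.data_size
            or batch_age_seconds >= limits.publish_rate)
```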
Output Formats
The output format depends on your configuration:
- JSON Array Format: Records are stored as a standard JSON array
- JSON Nested Format: Records are wrapped under your specified key (e.g., `{"records": [...]}`)
- JSON Line Format: Each record is on its own line (JSONL format)
- Delimited Format: Records in CSV or other delimited formats
- Parquet Format: Columnar storage format for efficient analytics
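To make the three JSON variants concrete, here is how the same two records would be serialized under each (a sketch using the standard library; the `"records"` wrapper key is just an example of a configurable key):

```python
import json

records = [{"id": 1}, {"id": 2}]

# JSON Array Format: a single top-level array
array_body = json.dumps(records)

# JSON Nested Format: the array wrapped under the configured key
nested_body = json.dumps({"records": records})

# JSON Line Format: one record per line
jsonl_body = "\n".join(json.dumps(r) for r in records)
```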
Configuration
Settings
| Setting | Type | Required | Default | Description |
|---|---|---|---|---|
| AWS IAM Role ARN | string | Yes | - | The Amazon Resource Name (ARN) of the IAM role to assume which grants access to the S3 bucket |
| S3 Bucket Name | string | Yes | - | The name of the S3 bucket where data will be stored |
| AWS Region | string | Yes | - | The AWS region where the S3 bucket is located (e.g., us-east-1, eu-west-1) |
| S3 Object Prefix | string | No | - | An optional prefix for S3 object keys to organize data within the bucket |
| Format Configuration | object | Yes | - | The format configuration for output data - see Format Options below |
| Compression Method | string | Yes | - | The compression method to be applied to the data before storing (e.g., gzip, snappy, none) |
| Partition Format | string | Yes | simple_date | The format for organizing data into partitions within the S3 bucket |
| Batch Configuration | object | No | See defaults | Controls when batches are written to S3 |
Format Options
The output format determines how your data is structured in the storage files. You must configure exactly one format type; see the Formats documentation for details.
Partition Format Options
- Simple Date Format (`simple_date`):
  - Structure: `{prefix}/{YYYY}/{MM}/{DD}/{filename}`
  - Example: `my-data/2024/01/15/20240115T123045Z-uuid.json.gz`
  - Use case: Straightforward date-based organization
- Hive-Compliant Format (`hive_compliant`):
  - Structure: `{prefix}/year={YYYY}/month={MM}/day={DD}/{filename}`
  - Example: `my-data/year=2024/month=01/day=15/20240115T123045Z-uuid.parquet`
  - Use case: Compatible with Athena, Hive, and other query engines that expect this partitioning scheme
Both partition formats use UTC time for consistency across different time zones.
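A sketch of how the two partition schemes map a timestamp to a key prefix (the `partition_path` helper is illustrative, not part of Monad; it normalizes to UTC as described above):

```python
from datetime import datetime, timezone

def partition_path(prefix: str, fmt: str, ts: datetime) -> str:
    """Build the partition portion of an object key from a timestamp."""
    ts = ts.astimezone(timezone.utc)  # both formats use UTC
    if fmt == "simple_date":
        return f"{prefix}/{ts:%Y}/{ts:%m}/{ts:%d}"
    if fmt == "hive_compliant":
        return f"{prefix}/year={ts:%Y}/month={ts:%m}/day={ts:%d}"
    raise ValueError(f"unknown partition format: {fmt}")
```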
Batch Configuration
| Setting | Type | Default | Min | Max | Description |
|---|---|---|---|---|---|
| record_count | integer | 100,000 | 500 | 1,000,000 | Maximum number of records per file |
| data_size | integer | 10,485,760 (10 MB) | 1,048,576 (1 MB) | 26,214,400 (25 MB) | Maximum uncompressed data size per file in bytes |
| publish_rate | integer | 45 | 1 | 60 | Maximum seconds before flushing a batch |
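If you generate `batch_config` blocks in code, the documented ranges above can be checked before deployment. This validator is a hypothetical convenience, not part of Monad:

```python
def validate_batch_config(record_count=100_000, data_size=10_485_760, publish_rate=45):
    """Raise ValueError if a value falls outside the documented range."""
    if not 500 <= record_count <= 1_000_000:
        raise ValueError("record_count must be between 500 and 1,000,000")
    if not 1_048_576 <= data_size <= 26_214_400:
        raise ValueError("data_size must be between 1 MB and 25 MB in bytes")
    if not 1 <= publish_rate <= 60:
        raise ValueError("publish_rate must be between 1 and 60 seconds")
```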
Secrets
None required - authentication is handled through the IAM role assumption.
Examples
JSON Array Format
{
  "iam_role_arn": "arn:aws:iam::123456789012:role/monad-s3-access",
  "bucket": "my-data-bucket",
  "region": "us-east-1",
  "prefix": "events/raw",
  "format_config": {
    "format": "json",
    "json": {
      "type": "array"
    }
  },
  "compression": "gzip",
  "partition_format": "simple_date"
}
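Under this configuration, a batch would end up as a gzip-compressed JSON array at a `simple_date` key. The sketch below models that outcome with the standard library; the filename scheme mirrors the examples above, and the final upload (shown only as a comment) would be a `boto3` `put_object` call with your assumed-role credentials.

```python
import gzip
import json
import uuid
from datetime import datetime, timezone

def build_object(records, prefix="events/raw"):
    """Serialize a batch as a gzip-compressed JSON array and build its key."""
    now = datetime.now(timezone.utc)
    filename = f"{now:%Y%m%dT%H%M%SZ}-{uuid.uuid4()}.json.gz"
    key = f"{prefix}/{now:%Y}/{now:%m}/{now:%d}/{filename}"
    body = gzip.compress(json.dumps(records).encode("utf-8"))
    return key, body

# The upload itself would then be roughly:
# boto3.client("s3").put_object(Bucket="my-data-bucket", Key=key, Body=body)
```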
CSV Format with Custom Headers
{
  "iam_role_arn": "arn:aws:iam::123456789012:role/monad-s3-access",
  "bucket": "analytics-bucket",
  "region": "eu-west-1",
  "prefix": "processed/csv",
  "format_config": {
    "format": "delimited",
    "delimited": {
      "delimiter": ",",
      "headers": ["timestamp", "user_id", "event_type", "value"]
    }
  },
  "compression": "snappy",
  "partition_format": "hive_compliant",
  "batch_config": {
    "record_count": 50000,
    "data_size": 5242880,
    "publish_rate": 30
  }
}
Parquet Format for Analytics
{
  "iam_role_arn": "arn:aws:iam::123456789012:role/monad-s3-access",
  "bucket": "data-warehouse",
  "region": "us-west-2",
  "prefix": "fact_tables/events",
  "format_config": {
    "format": "parquet",
    "parquet": {
      "schema": "{\"Tag\": \"name=events\", \"Fields\": [{\"Tag\": \"name=timestamp, type=INT64, repetitiontype=REQUIRED\"}, {\"Tag\": \"name=event_type, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED\"}, {\"Tag\": \"name=user_id, type=INT64, repetitiontype=REQUIRED\"}, {\"Tag\": \"name=properties, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL\"}]}"
    }
  },
  "compression": "snappy",
  "partition_format": "hive_compliant"
}
JSONL Format for Streaming
{
  "iam_role_arn": "arn:aws:iam::123456789012:role/monad-s3-access",
  "bucket": "streaming-data",
  "region": "ap-southeast-1",
  "prefix": "realtime",
  "format_config": {
    "format": "json",
    "json": {
      "type": "line"
    }
  },
  "compression": "none",
  "partition_format": "simple_date",
  "batch_config": {
    "record_count": 1000,
    "data_size": 1048576,
    "publish_rate": 5
  }
}
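Because this configuration uses JSONL with no compression, downstream consumers can parse each object line by line. A minimal sketch of such a consumer (the function name is illustrative; fetching the object body from S3 is left out):

```python
import json

def parse_jsonl(body: bytes):
    """Parse an uncompressed JSONL object body back into a list of records."""
    return [json.loads(line) for line in body.decode("utf-8").splitlines() if line]
```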
Best Practices
- File Size Optimization: Balance between file size and query performance. Larger files (closer to the 10 MB default) are generally better for analytics workloads.
- Compression Selection:
  - gzip: Best compression ratio, slower write speed
  - snappy: Balanced compression and speed, good for Parquet files
  - none: Fastest writes, largest file sizes
- Partition Strategy:
  - Use `hive_compliant` when querying with Athena, Redshift Spectrum, or similar services
  - Use `simple_date` for simpler directory structures or custom processing pipelines
- Format Selection:
  - Parquet: Best for analytics, columnar queries, and data warehousing
  - JSON: Best for flexibility and human readability
  - CSV: Best for compatibility with traditional tools and spreadsheets
- IAM Best Practices:
  - Use least-privilege principles; only grant access to specific buckets
  - Consider using bucket policies in addition to IAM roles for defense in depth
  - Regularly audit and rotate credentials