
S3

Enables seamless streaming of data posted to an S3 bucket into the Monad solution.

Sync Type: Incremental

Requirements

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3BucketLevelListPermissions",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::{bucket_name}"
    },
    {
      "Sid": "S3ObjectLevelPermissions",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::{bucket_name}/*"
    }
  ]
}
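As a quick sanity check before configuring the input, a short script can confirm that a policy document grants both permission sets above. This is a minimal sketch, not part of the product; the function name is illustrative:

```python
# Required permissions from the policy above, keyed by resource ARN.
REQUIRED = {
    "arn:aws:s3:::{bucket_name}": {"s3:ListBucket"},
    "arn:aws:s3:::{bucket_name}/*": {"s3:PutObject", "s3:GetObject"},
}

def policy_grants_required_actions(policy: dict) -> bool:
    """Return True if every required action is allowed on its resource."""
    granted = {}
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt["Action"]
        if isinstance(actions, str):       # "Action" may be a string or a list
            actions = [actions]
        resources = stmt["Resource"]
        if isinstance(resources, str):
            resources = [resources]
        for res in resources:
            granted.setdefault(res, set()).update(actions)
    return all(actions <= granted.get(res, set())
               for res, actions in REQUIRED.items())
```

Run it against the policy JSON attached to the role the input will assume.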

Details

When the input is run for the first time, it performs a full sync of all files in the specified bucket-prefix.

On subsequent runs, the processor performs an incremental sync starting from the timestamp of the last checkpointed object. After a failure of any kind, the processor resumes from the day prefix of the last checkpointed timestamp. A checkpoint is written for every page within a prefix, so if a failure occurs while processing a prefix, the processor restarts from the last completed page's checkpoint. No records are lost, but after a catastrophic failure you may re-process some data in the S3 objects on the page where the failure occurred.

  • To avoid this re-processing, we recommend publishing S3 data to an SQS queue.

  • Please also note that on every sync within a day prefix, we rescan all data in that prefix and drop duplicates using our deduplication logic. For larger buckets this can lead to hitting rate limits, since the same data is scanned many times per day. Again, publishing S3 data to an SQS queue avoids this.

  • Prefixes must always follow either the hive-compliant or simple-date layout (see Partition Format below). Anything else can cause unexpected behavior in the input.

  • Each object's last-updated time must fall on the same date as its logical prefix: any object that lands in the 2025/08/10 prefix should have a last-updated date of 2025-08-10 (in ISO 8601 format). Violating this can cause unexpected behavior in the input.

  • To avoid these tight date boundaries altogether, we recommend publishing S3 data to an SQS queue.
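The date-alignment rule above can be verified before objects are ingested. A minimal sketch, assuming the simple-date YYYY/MM/DD layout; the key layout and function name are illustrative:

```python
from datetime import datetime, timezone

def matches_day_prefix(key: str, last_modified: datetime) -> bool:
    """Check that an object's last-modified date equals the YYYY/MM/DD
    day prefix embedded in its key, e.g. 'logs/2025/08/10/app.json'."""
    parts = key.split("/")
    for i in range(len(parts) - 3):
        y, m, d = parts[i:i + 3]
        if len(y) == 4 and y.isdigit() and m.isdigit() and d.isdigit():
            return (last_modified.date().isoformat()
                    == f"{y}-{int(m):02d}-{int(d):02d}")
    return False
```

An object whose last-modified timestamp lands on a different date than its prefix would fail this check and should be re-published under the correct prefix.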

The processor polls for new data every 10 seconds, processing objects as they appear in the bucket.
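The page-level checkpoint and resume behavior described above can be sketched as follows, with an in-memory stand-in for the paginated object listing. The helper is illustrative, not the actual implementation:

```python
def sync_day_prefix(pages, checkpoint=None):
    """Process a day prefix page by page, checkpointing after each page.

    `pages` is a list of pages, each page a list of object keys.
    `checkpoint` is the index of the last fully processed page; on
    restart, processing resumes at the page after it, so a mid-page
    failure re-processes that page's objects but never skips any.
    """
    processed = []
    start = 0 if checkpoint is None else checkpoint + 1
    for page_index in range(start, len(pages)):
        for key in pages[page_index]:
            processed.append(key)          # ingest the object
        checkpoint = page_index            # checkpoint the completed page
    return processed, checkpoint
```

A first run passes no checkpoint and processes every page; a restart passes the last persisted checkpoint and picks up at the following page.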

Configuration

The following configuration defines the input parameters. Each field's specifications, such as type, requirements, and descriptions, are detailed below.

Settings

| Setting | Type | Required | Description |
| --- | --- | --- | --- |
| Region | string | No | The region of the S3 bucket. If left blank, the region will be auto-detected. |
| Bucket | string | Yes | The name of the S3 bucket. |
| Prefix | string | No | Prefix of the S3 object keys to read. |
| Compression | string | Yes | Compression format of the S3 objects. |
| Format | string | Yes | File format of the S3 objects. |
| Partition Format | string | Yes | The existing partition format used in your S3 bucket. |
| Role ARN | string | Yes | Role ARN to assume when reading from S3. |
| Record Location | string | No | Location of the record in the JSON object. Applies only if the format is JSON. Leave empty if you want the entire record. |
| Backfill Start Time | string | No | The date to start fetching data from. If not specified, no past records will be fetched. |
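For illustration only, a filled-in configuration might look like the following. The field names mirror the table above, but the surrounding JSON shape and all values are hypothetical:

```json
{
  "region": "us-east-1",
  "bucket": "my-security-logs",
  "prefix": "logs/",
  "compression": "gzip",
  "format": "json",
  "partition_format": "simple date",
  "role_arn": "arn:aws:iam::123456789012:role/monad-s3-reader",
  "record_location": ".Records",
  "backfill_start_time": "2024-01-01T00:00:00Z"
}
```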

Partition Format

The Partition Format setting specifies the existing organization of data within your S3 bucket. This is crucial for the system to correctly navigate and read your data. Select the option that matches your current S3 bucket structure:

  1. Simple Date Format ('simple date'):

    • Structure: YYYY/MM/DD
    • Example: 2024/01/01
    • Use case: For buckets using basic chronological organization of data.
  2. Hive-compatible Format ('hive compliant'):

    • Structure: year=YYYY/month=MM/day=DD
    • Example: year=2024/month=01/day=01
    • Use case: For buckets set up in a Hive-compatible format, common in data lake configurations.

Selecting the correct Partition Format ensures that the system can efficiently locate and process your existing data by matching your S3 bucket's current structure. This setting does not change your bucket's organization; it tells the system how to navigate it.
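Both layouts encode the same date; a small helper shows the day prefixes each format produces for a given date (the function name is illustrative):

```python
from datetime import date

def day_prefix(d: date, partition_format: str) -> str:
    """Build the S3 day prefix for a date in either supported layout."""
    if partition_format == "simple date":
        return f"{d.year:04d}/{d.month:02d}/{d.day:02d}"
    if partition_format == "hive compliant":
        return f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}"
    raise ValueError(f"unknown partition format: {partition_format}")

# day_prefix(date(2024, 1, 1), "simple date")    -> "2024/01/01"
# day_prefix(date(2024, 1, 1), "hive compliant") -> "year=2024/month=01/day=01"
```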

Secrets (Static Credentials Only)

| Setting | Type | Required | Description |
| --- | --- | --- | --- |
| Access Key | string | Conditional | AWS Access Key ID |
| Secret Key | string | Conditional | AWS Secret Access Key |

⚠️ Authentication: Choose either Role ARN (recommended) or static credentials. See AWS Authentication Guide for setup instructions.

Custom Schema Handling

If the source data doesn't align with any of the OpenSecurityControlFramework (OSCF) schemas, you can create a custom transformation using our JQ transform pipeline. For example:

{
  metadata: {
    schema_version: "1.0.0",
    custom_framework: "my_framework"
  },
  controls: .[]
}

For more information on JQ and how to write your own JQ transformations, see the JQ docs.
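Note that `controls: .[]` fans out: the transform emits one output record per element of the input array, each carrying the same fixed metadata. A rough Python equivalent of that behavior, for illustration only:

```python
def apply_transform(records):
    """Mimic the JQ transform above: wrap each element of the input
    array in an object carrying fixed metadata."""
    metadata = {"schema_version": "1.0.0", "custom_framework": "my_framework"}
    return [{"metadata": metadata, "controls": record} for record in records]
```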

If you believe this data source should be included in the standard OSCF schema set, please reach out to our team at support@monad.com. We're always looking to expand our coverage of security control frameworks based on community needs.