Object Storage

Enables real-time ingestion of objects from any S3-compatible object storage service for continuous data processing.

Sync Type: Incremental

Overview

The Object Storage input connector allows you to stream data from any S3-compatible storage service into Monad. This includes services like:

  • MinIO
  • Wasabi
  • DigitalOcean Spaces
  • Backblaze B2
  • Google Cloud Storage (S3-compatible mode)
  • Any other S3-compatible storage service

Requirements

  • Access credentials (Access Key and Secret Key) for your object storage service
  • Read permissions on the bucket and objects you want to ingest
  • Objects should be organized using date-based partitioning for incremental sync functionality

Details

When the input runs for the first time, it performs a full sync of all objects under the specified bucket and prefix. After each successfully processed page of objects, the processor checkpoints its state by saving:

  • The highest LastModified timestamp encountered
  • The lexicographically greatest Blob key at that timestamp

On subsequent runs, the processor performs an incremental sync starting from the last checkpointed timestamp. If a run fails for any reason, the processor resumes from the day prefix of that checkpoint. Because a checkpoint is taken after every page within a prefix, a failure mid-prefix restarts processing from the last completed page's checkpoint. No records are lost, but records from the in-flight page may be re-processed after a catastrophic failure.

  • To avoid re-processing records downstream, we recommend publishing Blob data to a queue that consumers can read from.

  • Also note that on every sync within a day prefix, the processor rescans the prefix and drops already-seen data via its deduplication logic. For larger containers, scanning the same data many times per day can hit service rate limits. Publishing Blob data to a queue avoids this as well.

  • Prefixes must always be either Hive-compliant or simple-date (see Partition Format below). Any other layout can cause unexpected behavior in the input.

  • Each object's last-modified time must fall on the same date as its prefix: an object that lands in the 2025/08/10 prefix should have a last-modified time on 2025-08-10 (in ISO 8601 terms). Violating this can cause unexpected behavior in the input.

  • If these date boundaries are too restrictive for your pipeline, publishing Blob data to a queue sidesteps them.
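The checkpoint and resume behavior described above can be sketched as follows. This is a simplified illustration, not the connector's actual internals: `dedupe_page` and the in-memory object dicts (shaped like S3 `ListObjectsV2` entries) are assumptions for the example.

```python
from datetime import datetime, timezone

def dedupe_page(objects, checkpoint):
    """Return only objects newer than the checkpoint watermark.

    The checkpoint is a (LastModified, Key) tuple: the highest timestamp
    seen plus the lexicographically greatest key at that timestamp.
    Objects at or below the watermark are dropped, which is why a restart
    may re-process (but never lose) records from the in-flight page.
    """
    return sorted(
        (o for o in objects if (o["LastModified"], o["Key"]) > checkpoint),
        key=lambda o: (o["LastModified"], o["Key"]),
    )

def ts(s):
    return datetime.fromisoformat(s).replace(tzinfo=timezone.utc)

page = [
    {"Key": "logs/2025/08/10/a.json", "LastModified": ts("2025-08-10T01:00:00")},
    {"Key": "logs/2025/08/10/b.json", "LastModified": ts("2025-08-10T02:00:00")},
    {"Key": "logs/2025/08/10/c.json", "LastModified": ts("2025-08-10T02:00:00")},
]
# Watermark from the last completed page: a.json has already been seen.
checkpoint = (ts("2025-08-10T01:00:00"), "logs/2025/08/10/a.json")
fresh = dedupe_page(page, checkpoint)  # a.json dropped, b.json and c.json kept
new_checkpoint = (fresh[-1]["LastModified"], fresh[-1]["Key"])
```

After the page is processed, `new_checkpoint` holds the highest timestamp and the greatest key at that timestamp, matching the two fields the processor saves.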

Configuration

The following settings define the input parameters. Each field's type, requirement status, default value, and description are detailed below.

Settings

| Setting | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| Endpoint | string | Yes | - | Endpoint URL for the object storage service (e.g., https://minio.example.com, https://s3.amazonaws.com) |
| Skip SSL Verification | boolean | No | false | Skip SSL verification for self-signed certificates. Only use this for development/testing environments. |
| Use Path Style | boolean | No | true | Whether to use path-style URLs (endpoint.com/bucket/object) vs virtual-hosted-style (bucket.endpoint.com/object). Most S3-compatible services require this to be true. |
| Bucket | string | Yes | - | Name of the storage bucket |
| Prefix | string | No | - | Prefix that leads to the start of the expected partition. For example, if your objects are at /logs/year=2024/month=01/day=01/, the prefix would be logs. |
| Region | string | No | us-east-1 | Optional region for the object storage service. This is often required for services like AWS S3. |
| Compression | string | Yes | - | Compression format of the objects. Options include: none, gzip, zstd, snappy, lz4 |
| Format | string | Yes | json | File format of the objects. Options include: json, csv, parquet, avro |
| Partition Format | string | Yes | Simple Date | Specifies the partition format of your bucket. See the Partition Format section below. |
| Record Location | string | Yes* | @this | Location of the record in the JSON object. Required only for JSON format. Use @this if records are at the root level or in an array. Use JSONPath notation for nested records (e.g., $.data.records). |

*Required only when Format is set to json
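The Record Location setting can be pictured with a small sketch. The `extract_records` helper below is hypothetical and supports only simple dotted paths; the connector's actual JSONPath handling is likely richer.

```python
import json

def extract_records(blob, record_location):
    """Pick the record list out of a parsed JSON object.

    "@this" means the root is already a record (or an array of records);
    a path such as "$.Records" names a nested field. This is a simplified
    sketch supporting only dotted paths, not full JSONPath.
    """
    doc = json.loads(blob)
    if record_location == "@this":
        return doc if isinstance(doc, list) else [doc]
    node = doc
    for part in record_location.lstrip("$.").split("."):
        node = node[part]
    return node if isinstance(node, list) else [node]

# Records at the root level: use "@this".
root_array = '[{"id": 1}, {"id": 2}]'
a = extract_records(root_array, "@this")

# Records nested under a field: use JSONPath notation.
nested = '{"Records": [{"eventName": "PutObject"}]}'
b = extract_records(nested, "$.Records")
```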

Secrets

| Secret | Type | Required | Description |
| --- | --- | --- | --- |
| Access Key | string | Yes | Access key for object storage authentication |
| Secret Key | string | Yes | Secret key for object storage authentication |

Partition Format

The Partition Format setting specifies the existing organization of data within your object storage bucket. This is crucial for the system to correctly navigate and read your data. Select the option that matches your current bucket structure:

  1. Simple Date Format (simple date):
    • Structure: YYYY/MM/DD
    • Example: 2024/01/01
    • Use case: For buckets using basic chronological organization of data
  2. Hive-compatible Format (hive compliant):
    • Structure: year=YYYY/month=MM/day=DD
    • Example: year=2024/month=01/day=01
    • Use case: For buckets set up in a Hive-compatible format, common in data lake configurations
Selecting the correct Partition Format ensures that the system can efficiently locate and process your existing data by matching your bucket's current structure. This setting does not change your bucket's organization; it tells the system how to navigate it. NOTE: Your data MUST be partitioned in one of the above formats or subsequent syncs (after the initial sync) will not be able to find your data.
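The two layouts can be illustrated by building the day prefix the system would scan for a given date. This is a sketch of the key construction, assuming a configured prefix of `logs`; the connector's exact internal behavior may differ.

```python
from datetime import date

def day_prefix(prefix, d, partition_format):
    """Build the day prefix scanned for a given date under each layout."""
    if partition_format == "simple date":
        part = f"{d:%Y/%m/%d}"                        # YYYY/MM/DD
    elif partition_format == "hive compliant":
        part = f"year={d:%Y}/month={d:%m}/day={d:%d}"  # key=value directories
    else:
        raise ValueError(f"unsupported partition format: {partition_format}")
    return f"{prefix}/{part}/" if prefix else f"{part}/"

d = date(2024, 1, 1)
simple = day_prefix("logs", d, "simple date")     # logs/2024/01/01/
hive = day_prefix("logs", d, "hive compliant")    # logs/year=2024/month=01/day=01/
```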

Configuration Examples

MinIO (Self-hosted)

{
  "settings": {
    "endpoint": "https://minio.company.internal:9000",
    "skip_ssl_verification": true,
    "use_path_style": true,
    "bucket": "security-logs",
    "prefix": "cloudtrail",
    "compression": "none",
    "format": "json",
    "partition_format": "hive compliant",
    "record_location": "$.Records"
  },
  "secrets": {
    "access_key": "minioadmin",
    "secret_key": "minioadmin123"
  }
}

Wasabi

{
  "settings": {
    "endpoint": "https://s3.wasabisys.com",
    "skip_ssl_verification": false,
    "use_path_style": true,
    "bucket": "backup-logs",
    "prefix": "security/events",
    "region": "us-east-1",
    "compression": "zstd",
    "format": "parquet",
    "partition_format": "simple date"
  },
  "secrets": {
    "access_key": "WASABI_ACCESS_KEY",
    "secret_key": "WASABI_SECRET_KEY"
  }
}
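The Use Path Style setting in the examples above determines how the object URL is formed. This sketch shows the difference using the Wasabi endpoint from the example; real S3 clients also handle signing, escaping, and region resolution.

```python
from urllib.parse import urlparse

def object_url(endpoint, bucket, key, use_path_style=True):
    """Form the request URL for an object under each addressing style."""
    u = urlparse(endpoint)
    if use_path_style:
        # Path style: endpoint.com/bucket/object
        return f"{u.scheme}://{u.netloc}/{bucket}/{key}"
    # Virtual-hosted style: bucket.endpoint.com/object
    return f"{u.scheme}://{bucket}.{u.netloc}/{key}"

key = "security/events/2024/01/01/log.parquet"
path_style = object_url("https://s3.wasabisys.com", "backup-logs", key)
virtual = object_url("https://s3.wasabisys.com", "backup-logs", key,
                     use_path_style=False)
```

Most S3-compatible services expect path-style addressing, which is why the setting defaults to true.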

Supported Formats and Compression

File Formats

  • JSON: Supports nested JSON with configurable record location
  • CSV: Comma-separated values
  • Parquet: Columnar storage format
  • Avro: Row-based storage format with schema

Compression Types

  • none: Uncompressed files
  • gzip: GNU zip compression
  • zstd: Zstandard compression
  • snappy: Snappy compression
  • lz4: LZ4 compression

Troubleshooting

Common Issues

  1. Connection Errors
    • Verify the endpoint URL is correct and includes the protocol (https:// or http://)
    • Check if SSL verification needs to be disabled for self-signed certificates
    • Ensure the service is accessible from Monad's infrastructure
  2. Authentication Failures
    • Verify the access key and secret key are correct
    • Check that the credentials have the necessary permissions
  3. Path Style Issues
    • If you get "bucket not found" errors, try toggling the "Use Path Style" setting
  4. Missing Data
    • Verify the partition format matches your bucket structure
    • Your data MUST be partitioned properly by date; see the Partition Format section above for reference
    • Check that the prefix is correctly specified. Leading and trailing slashes are stripped automatically.