Backblaze B2 Cloud Storage
Write data to Backblaze B2 Cloud Storage buckets.
Requirements
To configure Backblaze B2 as an output destination for Monad, complete the following steps:
Step 1: Create Application Keys in Backblaze B2
1. Log in to your Backblaze B2 Console
   Go to the Backblaze B2 Console.
2. Navigate to App Keys
   In the sidebar, click "App Keys".
3. Create a New Application Key
   Click "Add a New Application Key".
4. Configure the Key:
   - Name: Give your key a descriptive name (e.g., "Monad Integration")
   - Capabilities: Select the following permissions:
     - `listBuckets` - to list available buckets
     - `listFiles` - to list files within the bucket
     - `writeFiles` - to write data to the bucket
     - `readFiles` - to read file contents (for verification)
   - Bucket Access: Choose either:
     - "All" for access to all buckets, or
     - "Specific bucket" and select your target bucket
5. Create and Store the Key
   Click "Create New Key" and immediately copy both the `keyID` and `applicationKey` - you won't be able to see the `applicationKey` again.
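Once you have the `keyID` and `applicationKey`, you can verify them against B2's native `b2_authorize_account` endpoint, which takes them as HTTP Basic credentials. A minimal stdlib-only sketch (the credential values shown are placeholders, not real keys):

```python
import base64
from urllib import request

def authorize_request(key_id: str, app_key: str) -> request.Request:
    # b2_authorize_account authenticates with Basic auth over keyID:applicationKey.
    creds = base64.b64encode(f"{key_id}:{app_key}".encode()).decode()
    return request.Request(
        "https://api.backblazeb2.com/b2api/v2/b2_authorize_account",
        headers={"Authorization": f"Basic {creds}"},
    )

if __name__ == "__main__":
    # Placeholder credentials - substitute the keyID/applicationKey you saved.
    req = authorize_request("your-keyID", "your-applicationKey")
    # import json; print(json.load(request.urlopen(req)))  # performs the network call
```

A successful response confirms the key pair is valid; a 401 points to the "Authentication Failed" items in Troubleshooting below.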
Step 2: Create or Configure Your B2 Bucket
If you don't already have a bucket:
- In the B2 Console, click "Create a Bucket"
- Choose a unique bucket name
- Select your preferred region (note this for configuration)
- Configure bucket settings as needed
Functionality
The output continuously sends data to your specified B2 path, formatted as `prefix/partition/filename.format.compression`, where:
- The partition structure depends on your chosen partition format (simple date or Hive-compliant)
- Files are created based on batching configuration (record count, data size, or time elapsed)
- Data is compressed using your selected compression method before storage
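The key layout above can be sketched as a small helper. The timestamp-plus-UUID filename mirrors the examples later in this page; the exact filename pattern and compression suffix (e.g., `.gz` vs. `.gzip`) are assumptions for illustration:

```python
from datetime import datetime, timezone
from uuid import uuid4

def object_key(prefix: str, partition: str, fmt: str, compression: str) -> str:
    # Assemble prefix/partition/filename.format.compression (filename pattern assumed).
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    filename = f"{ts}-{uuid4()}"
    parts = [p for p in (prefix, partition) if p]  # the prefix is optional
    key = "/".join(parts + [filename]) + f".{fmt}"
    if compression != "none":
        key += f".{compression}"  # suffix naming is an assumption
    return key
```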
Batching Behavior
Monad batches records before sending to B2 based on three configurable limits:
- Record Count: Maximum number of records per file (default: 100,000, range: 500-1,000,000)
- Data Size: Maximum uncompressed size per file (default: 10 MB, range: 1-25 MB)
- Time Interval: Maximum time before flushing a batch (default: 45 seconds, range: 1-60 seconds)
Whichever limit is reached first triggers the batch to be written to B2. This ensures timely delivery while optimizing file sizes for downstream processing.
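The "whichever limit is reached first" rule can be sketched as follows; class and method names are illustrative, not Monad's internals:

```python
import time

class Batcher:
    """Flush when any limit is hit: record count, byte size, or elapsed time."""

    def __init__(self, max_records=100_000, max_bytes=10 * 1024 * 1024, max_seconds=45):
        self.max_records = max_records
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records, self.size = [], 0
        self.started = time.monotonic()

    def add(self, record: bytes) -> None:
        self.records.append(record)
        self.size += len(record)  # uncompressed size, matching the data_size limit

    def should_flush(self) -> bool:
        # Any one limit being reached triggers the write to B2.
        return (len(self.records) >= self.max_records
                or self.size >= self.max_bytes
                or time.monotonic() - self.started >= self.max_seconds)
```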
Output Formats
The output format depends on your configuration:
- JSON Array Format: Records are stored as a standard JSON array
- JSON Nested Format: Records are wrapped under your specified key (e.g., `{"records": [...]}`)
- JSON Line Format: Each record is on its own line (JSONL format)
- Delimited Format: Records in CSV or other delimited formats
- Parquet Format: Columnar storage format for efficient analytics
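The three JSON variants differ only in how records are wrapped. A sketch of the distinction (the style names and the `records` key are illustrative; configure the actual key in your format settings):

```python
import json

def serialize(records: list, style: str, nested_key: str = "records") -> str:
    # Illustrative style names; "records" matches the nested example above.
    if style == "json_array":
        return json.dumps(records)
    if style == "json_nested":
        return json.dumps({nested_key: records})
    if style == "jsonl":
        return "\n".join(json.dumps(r) for r in records)
    raise ValueError(f"unknown style: {style}")
```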
Configuration
Settings
| Setting | Type | Required | Default | Description |
|---|---|---|---|---|
| B2 Bucket Name | string | Yes | - | The name of the B2 bucket where data will be stored |
| B2 Region | string | Yes | us-west-001 | The B2 region endpoint (e.g., us-west-001, us-west-002, eu-central-003) |
| B2 Object Prefix | string | No | - | An optional prefix for B2 object keys to organize data within the bucket |
| Format Configuration | object | Yes | - | The format configuration for output data - see Format Options below |
| Compression Method | string | Yes | - | The compression method to be applied to the data before storing (e.g., gzip, snappy, none) |
| Partition Format | string | Yes | simple_date | The format for organizing data into partitions within the B2 bucket |
| Batch Configuration | object | No | See defaults below | Controls when batches are written to B2 |
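Putting the settings table together, a configuration might look like the following sketch. The field names and nesting are illustrative - follow your Monad workspace's actual schema:

```json
{
  "bucket_name": "my-app-logs",
  "region": "us-west-001",
  "object_prefix": "my-data",
  "format": { "type": "json_lines" },
  "compression": "gzip",
  "partition_format": "simple_date",
  "batch": {
    "record_count": 100000,
    "data_size": 10485760,
    "publish_rate": 45
  }
}
```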
Format Options
The output format determines how your data is structured in the storage files. You must configure exactly one format type; see the documentation on formats here: Formats.
Partition Format Options
- Simple Date Format (`simple_date`):
  - Structure: `{prefix}/{YYYY}/{MM}/{DD}/{filename}`
  - Example: `my-data/2024/01/15/20240115T123045Z-uuid.json.gz`
  - Use case: Straightforward date-based organization
- Hive-Compliant Format (`hive_compliant`):
  - Structure: `{prefix}/year={YYYY}/month={MM}/day={DD}/{filename}`
  - Example: `my-data/year=2024/month=01/day=15/20240115T123045Z-uuid.parquet`
  - Use case: Compatible with Athena, Hive, and other query engines that expect this partitioning scheme
Both partition formats use UTC time for consistency across different time zones.
Batch Configuration
| Setting | Type | Default | Min | Max | Description |
|---|---|---|---|---|---|
| record_count | integer | 100,000 | 500 | 1,000,000 | Maximum number of records per file |
| data_size | integer | 10,485,760 (10 MB) | 1,048,576 (1 MB) | 26,214,400 (25 MB) | Maximum uncompressed data size per file in bytes |
| publish_rate | integer | 45 | 1 | 60 | Maximum seconds before flushing a batch |
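A sketch of validating a batch configuration against the ranges in the table (the helper itself is illustrative; Monad performs its own validation):

```python
LIMITS = {  # key: (min, max, default), from the table above
    "record_count": (500, 1_000_000, 100_000),
    "data_size": (1_048_576, 26_214_400, 10_485_760),
    "publish_rate": (1, 60, 45),
}

def validate_batch_config(cfg: dict) -> dict:
    # Fill in defaults and reject out-of-range values.
    out = {}
    for key, (lo, hi, default) in LIMITS.items():
        val = cfg.get(key, default)
        if not lo <= val <= hi:
            raise ValueError(f"{key}={val} outside [{lo}, {hi}]")
        out[key] = val
    return out
```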
Secrets
| Secret | Type | Required | Description |
|---|---|---|---|
| Application Key ID | string | Yes | Backblaze B2 Application Key ID for authentication |
| Application Key | string | Yes | Backblaze B2 Application Key for authentication |
Best Practices
- File Size Optimization: Balance between file size and query performance. Larger files are generally better for analytics workloads and reduce per-request costs.
- Compression Selection:
  - gzip: Best compression ratio, slower write speed, excellent for long-term storage
  - none: Fastest writes, largest file sizes, use when compression happens downstream
- Partition Strategy:
  - Use `hive_compliant` when querying with Athena, Presto, or similar services
  - Use `simple_date` for simpler directory structures or custom processing pipelines
- Format Selection:
  - Parquet: Best for analytics, columnar queries, and data warehousing
  - JSON: Best for flexibility and human readability
  - CSV: Best for compatibility with traditional tools and spreadsheets
- Cost Optimization:
  - Larger batch sizes reduce the number of write calls
  - Use appropriate compression to minimize storage costs
  - Consider B2's lifecycle rules for automatic data archiving
- Security Best Practices:
  - Use application keys with minimal required permissions
  - Regularly rotate application keys
  - Consider bucket-specific keys for different environments
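To see the compression trade-off concretely, you can gzip a sample JSONL batch with the standard library; the records here are synthetic and the ratio will vary with your real data:

```python
import gzip
import json

# Synthetic, repetitive records - typical of structured log data.
records = [{"id": i, "event": "login", "ok": True} for i in range(1000)]
raw = "\n".join(json.dumps(r) for r in records).encode()
packed = gzip.compress(raw)

# Repeated JSON keys compress well, so packed is much smaller than raw.
print(f"raw={len(raw)} bytes, gzip={len(packed)} bytes")
```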
Troubleshooting
Common Issues
- Authentication Failed:
  - Verify your Application Key ID and Application Key are correct
  - Ensure the key has the required capabilities (`listBuckets`, `listFiles`, `writeFiles`, `readFiles`)
  - Check that the key has access to the specified bucket
- Bucket Not Found:
  - Verify the bucket name is correct and exists
  - Ensure the application key has access to the bucket
  - Check that the bucket is in the specified region
- Connection Timeout:
  - Verify the region setting matches your bucket's actual region
  - Check network connectivity to Backblaze B2 endpoints
  - Ensure no firewall rules are blocking the connection
- Permission Denied:
  - Verify the application key has the `writeFiles` capability
  - Check that bucket permissions allow the operation
  - Ensure the key is not expired
- Parse Errors:
  - Ensure the file format setting matches your data structure
  - Verify the record format for JSON files
  - Check that the compression setting is supported
- Performance Issues:
  - Consider increasing batch size to reduce API calls
  - Use appropriate compression for your use case
  - Verify region selection for optimal latency