Cloud Storage

Collects and ingests data from a Google Cloud Storage bucket.

Sync Type: Incremental

Details

The Google Cloud Storage input allows you to collect and ingest data from a Google Cloud Storage bucket. You can specify which bucket to monitor and configure how data should be processed based on its format and organization.

Requirements

Before setting up the Google Cloud Storage input, you need to:

  1. Have a Google Cloud Platform account with access to the desired project.
  2. Create a service account with the necessary permissions.
  3. Generate a JSON key for the service account.

Setup Instructions

You can set up the Google Cloud Storage input using either the Google Cloud Console UI or the gcloud command-line interface.

Option 1: Using Google Cloud Console

  1. Navigate to the Google Cloud Console
  2. Select your project
  3. Open "IAM & Admin" > "Service Accounts"
  4. Create a new service account:
    • Click "Create Service Account"
    • Provide a name for the service account
    • Click "Create"
  5. Assign the required role:
    • Add the "Storage Object Viewer" role
    • Click "Continue"
    • Click "Done"
  6. Generate credentials:
    • Select the newly created service account
    • Go to the "Keys" tab
    • Click "Add Key" > "Create new key"
    • Select JSON format
    • Click "Create" to download the key file
    • Store this file securely - you'll need its contents later

Option 2: Using Command Line

  1. Set your project ID:

     export PROJECT_ID="your-project-id"
     gcloud config set project $PROJECT_ID

  2. Create a service account:

     gcloud iam service-accounts create monad-gcs-input-connector \
       --display-name="Monad GCS Input Connector" \
       --description="Service account for GCS input connector"

  3. Assign the required roles:

     gcloud projects add-iam-policy-binding $PROJECT_ID \
       --member="serviceAccount:monad-gcs-input-connector@$PROJECT_ID.iam.gserviceaccount.com" \
       --role="roles/storage.objectViewer"

     gcloud projects add-iam-policy-binding $PROJECT_ID \
       --member="serviceAccount:monad-gcs-input-connector@$PROJECT_ID.iam.gserviceaccount.com" \
       --role="roles/serviceusage.serviceUsageConsumer"

  4. Generate and download the service account key:

     gcloud iam service-accounts keys create monad-gcs-key.json \
       --iam-account=monad-gcs-input-connector@$PROJECT_ID.iam.gserviceaccount.com

This creates a monad-gcs-key.json file in your current directory. Use the contents of this file as the value for the credentials_json secret in your input configuration.

Important: Store this credentials file securely and never commit it to version control.
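Before pasting the key into the input configuration, you can sanity-check it locally. The sketch below is illustrative (the helper name is ours, not part of Monad); the field names it checks are the ones present in every Google service account JSON key:

```python
import json

def check_service_account_key(raw: str) -> list:
    """Return a list of problems found in a service-account key JSON string."""
    try:
        key = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    problems = []
    # Fields present in every Google service-account JSON key.
    for field in ("type", "project_id", "private_key", "client_email"):
        if field not in key:
            problems.append(f"missing field: {field}")
    if key.get("type") != "service_account":
        problems.append("'type' should be 'service_account'")
    return problems
```

An empty list means the key at least has the expected shape; it does not prove the key is valid or the roles are assigned.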

Bucket Structure

Your bucket should follow one of these partition formats:

  1. Simple Date format (YYYY/MM/DD):

     bucket/
       2026/
         06/
           02/
             data.json

  2. Hive format (year=YYYY/month=MM/day=DD):

     bucket/
       year=2026/
         month=06/
           day=02/
             data.json

You can optionally include a prefix for better organization:

bucket/
  data/
    device-logs/
      2026/
        06/
          02/
            data.json

Details

When the input runs for the first time, it performs a full sync of all files under the specified bucket and prefix. State is checkpointed only after an entire date prefix has been successfully processed, saving:

  • The highest LastModified timestamp encountered
  • The lexicographically greatest Blob key at that timestamp

On subsequent runs, the processor performs an incremental sync starting from the last checkpointed timestamp. In the event of a failure, the processor resumes from the start of the last checkpointed date prefix. This means that if a failure occurs mid-prefix, the entire date prefix will be reprocessed from the beginning, which can represent a large number of objects or blobs.

  • To avoid large-scale reprocessing on failure, we recommend publishing blob data to a queue and consuming from the queue instead.

  • Note also that on every sync within a day prefix, the input rescans the entire prefix and drops already-ingested data via its deduplication logic. For larger buckets, this means scanning the same objects many times per day, which can trigger rate limits. Publishing blob data to a queue avoids this scenario as well.

  • Prefixes must always follow the Simple Date or Hive layout. Any other layout can cause unexpected behavior in the input.

  • Each object's last-modified time should fall on the same date as its prefix: any object that lands under the 2025/08/10 prefix should have a last-modified date of 2025-08-10 (in ISO 8601 form). Objects that violate this can cause unexpected behavior in the input.
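The last requirement can be checked before (or while) writing objects. A minimal sketch, assuming object keys laid out as described above (the helper name is illustrative, not part of the connector):

```python
from datetime import datetime, timezone

def prefix_matches_last_modified(key: str, last_modified: datetime) -> bool:
    """Check that an object's date prefix agrees with its last-modified date.

    Supports both Simple Date (2025/08/10/...) and Hive
    (year=2025/month=08/day=10/...) layouts, with or without a leading prefix.
    """
    parts = key.split("/")
    for i in range(len(parts) - 3):
        a, b, c = parts[i], parts[i + 1], parts[i + 2]
        # Hive style: strip the year=/month=/day= markers.
        if a.startswith("year=") and b.startswith("month=") and c.startswith("day="):
            a, b, c = a[5:], b[6:], c[4:]
        if a.isdigit() and len(a) == 4 and b.isdigit() and c.isdigit():
            return (int(a), int(b), int(c)) == (
                last_modified.year, last_modified.month, last_modified.day)
    return False
```

Running this over a sample of keys before enabling the input can surface objects that would otherwise be silently mis-handled.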

Configuration

Settings

Setting | Type | Required | Default | Description
--------|------|----------|---------|------------
project_id | string | Yes | - | The Google Cloud project ID to use
bucket_name | string | Yes | - | The name of the Google Cloud Storage bucket to use
compression | string | Yes | - | Compression format of the objects (e.g., "gzip", "none")
partition_format | string | Yes | "Simple Date" | Specifies how data is organized in the bucket. Options are "Simple Date" and "Hive".
format | string | Yes | "json" | The format of the files in the bucket (e.g., "json", "csv")
prefix | string | No | - | The prefix to filter objects within the bucket
record_location | string | No | "" | Location of the record in the JSON object. Applies only if the format is JSON. Leave empty to ingest the entire object.

Secrets

Setting | Type | Required | Description
--------|------|----------|------------
credentials_json | string | Yes | Service account JSON key file contents, as a string
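Taken together, a completed configuration might look like the following. This is an illustrative sketch only; the exact serialization depends on how Monad stores input settings, and every value shown is a placeholder:

```
project_id: "my-project"
bucket_name: "my-logs-bucket"
prefix: "data/device-logs"
compression: "gzip"
partition_format: "Simple Date"
format: "json"
record_location: "data.events"

secrets:
  credentials_json: "<contents of monad-gcs-key.json>"
```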

Setting up the Input

  1. In the Monad UI, go to the "Inputs" section.
  2. Click "Add Input" and select "Google Cloud Storage".
  3. Configure the input settings:
    • Project ID: Your Google Cloud project ID
    • Bucket Name: The name of the bucket you want to monitor
    • Prefix (optional): Filter objects in the bucket by prefix (e.g., "data/2023/")
    • Compression: Select the compression format of your files
    • Partition Format: Choose how your data is organized (e.g., "Simple Date" or "Hive")
    • Format: Select the format of your files (e.g., "json", "csv")
    • Record Location: Specify where to find records in JSON files (default: "")
  4. In the "Secrets" section, provide the contents of your service account JSON key file.

Working with Prefix and Partition Format

The combination of prefix and partition_format determines how the input navigates your bucket's folder structure to find files.

Simple Date Format

The Simple Date format uses a date-based folder structure in the format YYYY/MM/DD.

  • Without Prefix: Files are fetched directly from date-formatted folders

    bucket/
      2026/
        06/
          02/
            data.json
  • With Prefix: Files are fetched from date-formatted folders under the specified prefix (e.g. data/device-logs)

    bucket/
      data/
        device-logs/
          2026/
            06/
              02/
                data.json

Hive Format

The Hive format uses a more explicit folder structure in the format year=YYYY/month=MM/day=DD.

  • Without Prefix: Files are fetched from Hive-formatted folders

    bucket/
      year=2026/
        month=06/
          day=02/
            data.json
  • With Prefix: Files are fetched from Hive-formatted folders under the specified prefix

    bucket/
      data/
        device-logs/
          year=2026/
            month=06/
              day=02/
                data.json
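The scan path for a given day is just the concatenation of the optional prefix and the date layout. A sketch of that combination (the function name is illustrative, not the connector's actual internals):

```python
from datetime import date

def day_prefix(day: date, partition_format: str, prefix: str = "") -> str:
    """Build the object-key prefix scanned for one day's worth of data."""
    if partition_format == "Hive":
        datepath = f"year={day:%Y}/month={day:%m}/day={day:%d}"
    else:  # "Simple Date"
        datepath = f"{day:%Y}/{day:%m}/{day:%d}"
    return f"{prefix.rstrip('/')}/{datepath}/" if prefix else f"{datepath}/"
```

Listing your bucket with such a prefix (e.g., via `gsutil ls gs://bucket/<prefix>`) is a quick way to confirm the input will find your files.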

Working with Record Location

The record_location setting helps you specify where to find the array of records within a JSON object. This is particularly useful when your data is nested within the JSON structure.

Example Usage

If your JSON files have the following structure:

{
  "metadata": {
    "timestamp": "2026-06-02T10:00:00Z",
    "version": "1.0"
  },
  "data": {
    "events": [
      { "id": 1, "type": "login" },
      { "id": 2, "type": "logout" }
    ]
  }
}

To process the events array, set:

record_location = "data.events"

Nested Objects

You can access deeply nested arrays using dot notation:

{
  "store": {
    "location": {
      "transactions": {
        "daily": [
          { "id": 1, "amount": 100 },
          { "id": 2, "amount": 200 }
        ]
      }
    }
  }
}

To process the daily transactions, set:

record_location = "store.location.transactions.daily"

If record_location is not specified (empty string), the input treats the entire JSON object as a single record, or as multiple records if the root of the file is an array.
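The dot-notation lookup described above amounts to walking the JSON object key by key. A minimal sketch of that behavior (the helper name is ours, not the connector's API):

```python
def extract_records(obj, record_location: str = ""):
    """Return the records found at `record_location` (dot notation).

    With an empty location, the object itself is used: a root-level
    array yields many records, anything else yields a single record.
    """
    node = obj
    if record_location:
        for part in record_location.split("."):
            node = node[part]  # raises KeyError if the path does not exist
    return node if isinstance(node, list) else [node]
```

Testing your record_location string against a sample file this way can catch path typos before they surface as parse errors in the input.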

Troubleshooting

Common Issues

  1. Access Denied: Ensure your service account has the correct IAM roles assigned.
  2. No Files Found: Verify that the bucket name, prefix, and partition format match your bucket structure.
  3. Invalid Credentials: Verify that the credentials_json secret contains the complete, unmodified contents of your service account key file.
  4. Parse Errors: Ensure the file format and record location settings match your data structure.