Cloud Storage
Collects and ingests data from a Google Cloud Storage bucket.
Sync Type: Incremental
Details
The Google Cloud Storage input allows you to collect and ingest data from a Google Cloud Storage bucket. You can specify which bucket to monitor and configure how data should be processed based on its format and organization.
Requirements
Before setting up the Google Cloud Storage input, you need to:
- Have a Google Cloud Platform account with access to the desired project.
- Create a service account with the necessary permissions.
- Generate a JSON key for the service account.
Setup Instructions
You can set up the Google Cloud Storage input using either the Google Cloud Console UI or the command line.
Option 1: Using Google Cloud Console
- Navigate to the Google Cloud Console
- Select your project
- Open "IAM & Admin" > "Service Accounts"
- Create a new service account:
  - Click "Create Service Account"
  - Provide a name for the service account
  - Click "Create"
- Assign the required role:
  - Add the "Storage Object Viewer" role
  - Click "Continue"
  - Click "Done"
- Generate credentials:
  - Select the newly created service account
  - Go to the "Keys" tab
  - Click "Add Key" > "Create new key"
  - Select JSON format
  - Click "Create" to download the key file
  - Store this file securely; you'll need its contents later
Option 2: Using Command Line
- Set your project ID:

    export PROJECT_ID="your-project-id"
    gcloud config set project $PROJECT_ID

- Create a service account:

    gcloud iam service-accounts create monad-gcs-input-connector \
      --display-name="Monad GCS Input Connector" \
      --description="Service account for GCS input connector"

- Assign the required roles:

    gcloud projects add-iam-policy-binding $PROJECT_ID \
      --member="serviceAccount:monad-gcs-input-connector@$PROJECT_ID.iam.gserviceaccount.com" \
      --role="roles/storage.objectViewer"

    gcloud projects add-iam-policy-binding $PROJECT_ID \
      --member="serviceAccount:monad-gcs-input-connector@$PROJECT_ID.iam.gserviceaccount.com" \
      --role="roles/serviceusage.serviceUsageConsumer"

- Generate and download the service account key:

    gcloud iam service-accounts keys create monad-gcs-key.json \
      --iam-account=monad-gcs-input-connector@$PROJECT_ID.iam.gserviceaccount.com
This creates a monad-gcs-key.json file in your current directory. Use the contents of this file as the value for the credentials_json secret in your input configuration.
Important: Store this credentials file securely and never commit it to version control.
Bucket Structure
Your bucket should follow one of these partition formats:
- Simple Date format (YYYY/MM/DD):

    bucket/
      2026/
        06/
          02/
            data.json

- Hive format (year=YYYY/month=MM/day=DD):

    bucket/
      year=2026/
        month=06/
          day=02/
            data.json

You can optionally include a prefix for better organization:

    bucket/
      data/
        device-logs/
          2026/
            06/
              02/
                data.json
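As an illustrative sketch (not the connector's actual implementation), the two supported layouts, with an optional leading prefix, can be checked against an object key with regular expressions like these:

```python
import re

# Illustrative patterns for the two supported layouts, allowing an optional
# leading prefix of one or more path segments.
SIMPLE_DATE = re.compile(r"^(?:.+/)?\d{4}/\d{2}/\d{2}/[^/]+$")
HIVE = re.compile(r"^(?:.+/)?year=\d{4}/month=\d{2}/day=\d{2}/[^/]+$")

print(bool(SIMPLE_DATE.match("data/device-logs/2026/06/02/data.json")))  # True
print(bool(HIVE.match("year=2026/month=06/day=02/data.json")))           # True
print(bool(SIMPLE_DATE.match("logs/data.json")))                         # False
```

Keys that match neither pattern fall under the "anything else can cause unexpected behavior" caveat below.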
Details
When the input is run for the first time, it performs a full sync of all files under the specified bucket and prefix. State is checkpointed only after an entire date prefix has been successfully processed, saving:
- The highest LastModified timestamp encountered
- The lexicographically greatest Blob key at that timestamp
On subsequent runs, the processor performs an incremental sync starting from the last checkpointed timestamp. In the event of a failure, the processor resumes from the start of the last checkpointed date prefix. This means that if a failure occurs mid-prefix, the entire date prefix will be reprocessed from the beginning, which can represent a large number of objects or blobs.
- To avoid large-scale reprocessing on failure, we recommend publishing blob data to a queue and consuming from that instead.
- Every sync within a day prefix rescans that prefix and drops already-seen data via our deduplication logic. For larger buckets this can hit rate limits, since the same data is scanned many times per day; here too we recommend publishing blob data to a queue and consuming from that.
- Prefixes must always follow the Hive or Simple Date layout. Anything else can cause unexpected behavior in the input.
- Each object's last-updated time should fall on the same date as its logical prefix: any object that lands in the 2025/08/10 prefix should have a last-updated time on 2025-08-10 (in ISO 8601 form). Violating this can cause unexpected behavior in the input.
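The checkpoint semantics described above can be sketched as a simple tuple comparison, alongside the prefix/last-modified consistency check from the last note. This is an illustration of the documented behavior, not the connector's actual code:

```python
from datetime import datetime

# A blob is "new" relative to the checkpoint if its LastModified is later,
# or equal with a lexicographically greater key. ISO 8601 UTC timestamps in
# a uniform format compare correctly as strings.
def is_new(blob_modified, blob_key, ckpt_modified, ckpt_key):
    return (blob_modified, blob_key) > (ckpt_modified, ckpt_key)

# An object's LastModified date must match its Simple Date prefix.
def matches_prefix(last_modified_iso, simple_date_prefix):
    day = datetime.fromisoformat(last_modified_iso.replace("Z", "+00:00"))
    return day.strftime("%Y/%m/%d") == simple_date_prefix

print(is_new("2026-06-02T10:00:00Z", "b.json",
             "2026-06-02T10:00:00Z", "a.json"))              # True
print(matches_prefix("2025-08-10T14:32:00Z", "2025/08/10"))  # True
print(matches_prefix("2025-08-11T00:05:00Z", "2025/08/10"))  # False
```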
Configuration
Settings
| Setting | Type | Required | Default | Description |
|---|---|---|---|---|
| project_id | string | Yes | - | The Google Cloud project ID to use |
| bucket_name | string | Yes | - | The name of the Google Cloud Storage bucket to use |
| compression | string | Yes | - | Compression format of the objects (e.g., "gzip", "none") |
| partition_format | string | Yes | "Simple Date" | Specifies how data is organized in the bucket. Options include Hive-compatible format and simple date format. |
| format | string | Yes | "json" | The format of the files in the bucket (e.g., "json", "csv") |
| prefix | string | No | - | The prefix to filter objects within the bucket |
| record_location | string | No | "" | Location of the record in the JSON object. Applies only if the format is JSON. Leave empty if you want the entire record. |
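For illustration, a complete settings block might look like the following; the keys mirror the table above, while the values are hypothetical:

```python
# Hypothetical values; only the keys come from the settings table above.
settings = {
    "project_id": "my-gcp-project",
    "bucket_name": "my-log-bucket",
    "compression": "none",
    "partition_format": "Simple Date",
    "format": "json",
    "prefix": "data/device-logs",      # optional
    "record_location": "data.events",  # only applies when format is "json"
}
print(all(settings[k] for k in ("project_id", "bucket_name", "compression")))  # True
```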
Secrets
| Setting | Type | Required | Description |
|---|---|---|---|
| credentials_json | string | Yes | Service account JSON key file contents as a string |
Setting up the Input
- In the Monad UI, go to the "Inputs" section.
- Click "Add Input" and select "Google Cloud Storage".
- Configure the input settings:
  - Project ID: Your Google Cloud project ID
  - Bucket Name: The name of the bucket you want to monitor
  - Prefix (optional): Filter objects in the bucket by prefix (e.g., "data/2023/")
  - Compression: Select the compression format of your files
  - Partition Format: Choose how your data is organized (e.g., "Simple Date" or "Hive")
  - Format: Select the format of your files (e.g., "json", "csv")
  - Record Location: Specify where to find records in JSON files (default: "")
- In the "Secrets" section, provide the contents of your service account JSON key file.
Working with Prefix and Partition Format
The combination of prefix and partition_format determines how the input navigates your bucket's folder structure to find files.
Simple Date Format
The Simple Date format uses a date-based folder structure in the format YYYY/MM/DD.
- Without Prefix: Files are fetched directly from date-formatted folders

    bucket/
      2026/
        06/
          02/
            data.json

- With Prefix: Files are fetched from date-formatted folders under the specified prefix (e.g., data/device-logs)

    bucket/
      data/
        device-logs/
          2026/
            06/
              02/
                data.json
Hive Format
The Hive format uses a more explicit folder structure in the format year=YYYY/month=MM/day=DD.
- Without Prefix: Files are fetched from Hive-formatted folders

    bucket/
      year=2026/
        month=06/
          day=02/
            data.json

- With Prefix: Files are fetched from Hive-formatted folders under the specified prefix

    bucket/
      data/
        device-logs/
          year=2026/
            month=06/
              day=02/
                data.json
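Putting the two settings together, the day-level prefix the input scans can be sketched like this; it is a simplified illustration, not the connector's actual code:

```python
# Build the day-level prefix scanned for a given date, combining the optional
# object prefix with the chosen partition format.
def listing_prefix(prefix, partition_format, year, month, day):
    if partition_format == "Hive":
        date_part = f"year={year:04d}/month={month:02d}/day={day:02d}"
    else:  # "Simple Date"
        date_part = f"{year:04d}/{month:02d}/{day:02d}"
    return f"{prefix}/{date_part}/" if prefix else f"{date_part}/"

print(listing_prefix("data/device-logs", "Hive", 2026, 6, 2))
# data/device-logs/year=2026/month=06/day=02/
print(listing_prefix("", "Simple Date", 2026, 6, 2))
# 2026/06/02/
```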
Working with Record Location
The record_location setting helps you specify where to find the array of records within a JSON object. This is particularly useful when your data is nested within the JSON structure.
Example Usage
If your JSON files have the following structure:
    {
      "metadata": {
        "timestamp": "2026-06-02T10:00:00Z",
        "version": "1.0"
      },
      "data": {
        "events": [
          { "id": 1, "type": "login" },
          { "id": 2, "type": "logout" }
        ]
      }
    }
To process the events array, set:
    record_location = "data.events"
Nested Objects
You can access deeply nested arrays using dot notation:
    {
      "store": {
        "location": {
          "transactions": {
            "daily": [
              { "id": 1, "amount": 100 },
              { "id": 2, "amount": 200 }
            ]
          }
        }
      }
    }
To process the daily transactions, set:
    record_location = "store.location.transactions.daily"
If no record_location is specified (empty string), the input will treat the entire JSON object as a single record or expect an array at the root level.
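The lookup that record_location implies can be sketched as a dot-notation traversal. The helper below is illustrative only, assuming the path splits on "." with no escaping:

```python
import json

# Walk the JSON object along the dot-separated path; an empty path returns
# the root (wrapped in a list when it is a single object).
def extract_records(obj, record_location=""):
    if not record_location:
        return obj if isinstance(obj, list) else [obj]
    for key in record_location.split("."):
        obj = obj[key]
    return obj

doc = json.loads('{"data": {"events": [{"id": 1}, {"id": 2}]}}')
print(extract_records(doc, "data.events"))  # [{'id': 1}, {'id': 2}]
```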
Troubleshooting
Common Issues
- Access Denied: Ensure your service account has the correct IAM roles assigned.
- No Files Found: Verify that the bucket name, prefix, and partition format match your bucket structure.
- Invalid Credentials: Make sure the credentials_json is correct.
- Parse Errors: Ensure the file format and record location settings match your data structure.