Google Cloud Storage
Easily write Monad pipeline outputs to Google Cloud Storage for archiving or downstream processing.
Details
The Google Cloud Storage Output allows you to write data from Monad pipelines directly to a Google Cloud Storage (GCS) bucket. This is useful for archiving, sharing, or further processing your pipeline outputs in the cloud.
Requirements
- A Google Cloud Platform (GCP) account
- Access to a GCS bucket (create one if needed)
- Service account credentials with the `Storage Object Creator` role
Setup Instructions
Option 1: Using Google Cloud Console
1. Create a GCS Bucket
   Go to the Cloud Storage Console and click "Create bucket". Follow the prompts to set up your bucket.
2. Create a Service Account
   Navigate to IAM & Admin > Service Accounts. Click "Create Service Account", assign a name, and grant the "Storage Object Creator" role.
3. Download Service Account Key
   After creating the service account, go to "Keys" and add a new key (JSON). Download and securely store this file.
Option 2: Using Command Line
1. Create a GCS Bucket

   ```shell
   gsutil mb gs://YOUR_BUCKET_NAME/
   ```

2. Create a Service Account

   ```shell
   gcloud iam service-accounts create monad-gcs-writer \
     --display-name="Monad GCS Writer"
   ```

3. Grant Permissions

   ```shell
   gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
     --member="serviceAccount:monad-gcs-writer@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
     --role="roles/storage.objectCreator"
   ```

4. Create and Download Key

   ```shell
   gcloud iam service-accounts keys create monad-gcs-key.json \
     --iam-account=monad-gcs-writer@YOUR_PROJECT_ID.iam.gserviceaccount.com
   ```
This creates a `monad-gcs-key.json` file in your current directory. Use the contents of this file as the value for the `credentials_json` secret in your output configuration.
Important: Store this credentials file securely and never commit it to version control.
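Before pasting the key into your pipeline secrets, it can help to confirm the file is a complete, well-formed service account key. The sketch below checks for the standard fields every GCP service account key contains (the `validate_key` helper is illustrative, not part of Monad):

```python
import json

# Fields present in every GCP service account key file.
REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email"}

def validate_key(raw: str) -> dict:
    """Parse a service account key and check that it looks usable."""
    key = json.loads(raw)
    missing = REQUIRED_FIELDS - key.keys()
    if missing:
        raise ValueError(f"key is missing fields: {sorted(missing)}")
    if key["type"] != "service_account":
        raise ValueError("not a service account key")
    return key

# Redacted example; in practice, read the contents of monad-gcs-key.json.
sample = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "monad-gcs-writer@my-project.iam.gserviceaccount.com",
})
key = validate_key(sample)
print(key["client_email"])
```

A key that fails this check (for example, a file path pasted by mistake) is a common cause of the "Invalid Credentials Format" error described in Troubleshooting below.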
Configuration
Add the following to your Monad pipeline configuration to enable the GCS output.
Settings
| Setting | Type | Required | Default | Description |
|---|---|---|---|---|
| Bucket Name | string | Yes | — | Name of your GCS bucket |
| Prefix | string | No | — | Path prefix for objects in the bucket |
| Compression | string | Yes | none | Compression type for output files (e.g. gzip) |
| Partition Format | string | Yes | simple date | Date/time partitioning format for object storage |
| Format Config | object | Yes | — | Output format options (e.g. Delimited, JSON) |
| Batch Config | object | Yes | — | Batch write options (e.g. size, rate, count) |
Secrets
| Secret | Type | Required | Default | Description |
|---|---|---|---|---|
| Credentials Json | string | Yes | — | Contents of your service account JSON key |
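Monad's exact configuration schema may differ from this sketch; the key names below are assumptions derived from the tables above, so check your pipeline reference for the authoritative field names:

```yaml
# Hypothetical GCS output block -- field names follow the tables above.
outputs:
  - type: gcs
    settings:
      bucket_name: YOUR_BUCKET_NAME
      prefix: monad/exports
      compression: gzip              # or "none"
      partition_format: simple date  # or "hive compliant"
      format_config:
        type: json                   # see the Formats documentation
      batch_config:
        size: 10MB
        count: 10000
    secrets:
      credentials_json: "{contents of monad-gcs-key.json}"
```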
Setting up the Output
To configure Google Cloud Storage as an output in your Monad pipeline:
1. Complete the setup steps above to create a GCS bucket and a service account with the required permissions.
2. Add the GCS output configuration to your pipeline's configuration file, specifying the appropriate settings and secrets as described above.
3. Ensure the `credentials_json` secret contains the full JSON key (not just a file path).
4. Adjust settings such as `prefix`, `compression`, partition format, format config, and batch config to match your data and workflow requirements.
5. Deploy or run your pipeline. Output files will be written to your specified GCS bucket according to your configuration.
If you want to use data stored in GCS as an input for your pipeline, refer to the Google Cloud Storage Input documentation.
Partition Format
The Partition Format setting controls how output files are organized within your GCS bucket based on date or time. Supported formats are:
- `simple date` (default): organizes files by `YYYY/MM/DD`.
- `hive compliant`: uses Hive-style partitioning, e.g., `year=YYYY/month=MM/date=DD/`.
Choose a format that matches your data retention and access patterns.
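As an illustration of how the two formats lay out object paths (the `partition_path` function below is a sketch, not Monad's actual implementation):

```python
from datetime import date

def partition_path(day: date, style: str = "simple date") -> str:
    """Build the date-based object path for a given partition format."""
    if style == "simple date":
        return f"{day.year:04d}/{day.month:02d}/{day.day:02d}"
    if style == "hive compliant":
        return f"year={day.year:04d}/month={day.month:02d}/date={day.day:02d}/"
    raise ValueError(f"unknown partition format: {style}")

d = date(2024, 3, 7)
print(partition_path(d))                    # 2024/03/07
print(partition_path(d, "hive compliant"))  # year=2024/month=03/date=07/
```

Hive-style paths let query engines such as BigQuery or Spark prune partitions by the `year`/`month`/`date` keys, while the simple date layout is easier to browse by hand.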
Format Options
The output format determines how your data is structured in the storage files. You must configure exactly one format type; see the Formats documentation for the available options.
Batch Config
The Batch Config object controls how data is grouped and written to GCS. Typical options include:
- `size`: maximum size of each batch (e.g., 10MB)
- `count`: maximum number of records per batch
- `rate`: throttling or write rate limits
Tuning batch settings can help optimize performance and cost for your workload.
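To make the size and count limits concrete, here is a simplified sketch of how such batching logic typically works (Monad's real batcher also applies rate limits and runs as part of the pipeline, so treat this as an assumption-laden illustration):

```python
def batch_records(records, max_count=3, max_bytes=64):
    """Group records into batches, flushing when either limit is reached."""
    batch, size = [], 0
    for rec in records:
        rec_size = len(rec.encode("utf-8"))
        # Flush if adding this record would exceed the count or size limit.
        if batch and (len(batch) >= max_count or size + rec_size > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(rec)
        size += rec_size
    if batch:
        yield batch  # Flush the final partial batch.

rows = [f'{{"event": {i}}}' for i in range(7)]
batches = list(batch_records(rows))
print([len(b) for b in batches])  # prints [3, 3, 1]
```

Larger batches mean fewer, bigger objects in GCS (cheaper per-request costs, higher latency per file); smaller batches deliver data sooner at the cost of more objects.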
Troubleshooting
Common Issues
- Permission Denied: Ensure your service account has the correct permissions (Storage Object Creator) and that the credentials are valid.
- Bucket Not Found: Double-check the bucket name and ensure the bucket exists in your GCP project.
- Invalid Credentials Format: Make sure the `credentials_json` secret contains the full JSON key, not just a file path.