
Google Cloud Storage

Easily write Monad pipeline outputs to Google Cloud Storage for archiving or downstream processing.

Details

The Google Cloud Storage Output allows you to write data from Monad pipelines directly to a Google Cloud Storage (GCS) bucket. This is useful for archiving, sharing, or further processing your pipeline outputs in the cloud.

Requirements

  • A Google Cloud Platform (GCP) account
  • Access to a GCS bucket (create one if needed)
  • Service account credentials with Storage Object Creator permissions

Setup Instructions

Option 1: Using Google Cloud Console

  1. Create a GCS Bucket
    Go to the Cloud Storage Console and click "Create bucket". Follow the prompts to set up your bucket.

  2. Create a Service Account
    Navigate to IAM & Admin > Service Accounts. Click "Create Service Account", assign a name, and grant the "Storage Object Creator" role.

  3. Download Service Account Key
    After creating the service account, go to "Keys" and add a new key (JSON). Download and securely store this file.

Option 2: Using Command Line

  1. Create a GCS Bucket

    gsutil mb gs://YOUR_BUCKET_NAME/

  2. Create a Service Account

    gcloud iam service-accounts create monad-gcs-writer \
      --display-name="Monad GCS Writer"

  3. Grant Permissions

    gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
      --member="serviceAccount:monad-gcs-writer@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
      --role="roles/storage.objectCreator"

  4. Create and Download Key

    gcloud iam service-accounts keys create monad-gcs-key.json \
      --iam-account=monad-gcs-writer@YOUR_PROJECT_ID.iam.gserviceaccount.com

This creates a monad-gcs-key.json file in your current directory. Use the contents of this file as the value for the credentials_json secret in your output configuration.

Important: Store this credentials file securely and never commit it to version control.
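Because the secret must hold the key file's contents rather than its path, a quick local check before pasting the value can catch mistakes early. The following is a minimal sketch, not part of Monad: the function name looks_like_service_account_key is hypothetical, and the check only confirms that the value parses as JSON and carries the fields a downloaded service account key normally contains.

```python
import json

# Fields found in a key file produced by
# `gcloud iam service-accounts keys create`.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def looks_like_service_account_key(secret_value: str) -> bool:
    """Return True if secret_value is the JSON content of a service account key.

    Hypothetical helper for a local sanity check; it does not contact GCP
    and does not prove the key is still active.
    """
    try:
        data = json.loads(secret_value)
    except json.JSONDecodeError:
        # A file path such as "monad-gcs-key.json" is not valid JSON.
        return False
    if not isinstance(data, dict):
        return False
    if data.get("type") != "service_account":
        return False
    # Every required field must be present and non-empty.
    return all(data.get(field) for field in REQUIRED_FIELDS)
```

To use it, read monad-gcs-key.json into a string and pass that string to the function; a False result for a value you believed was correct usually means a path or a truncated copy was supplied.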

Configuration

Add the following to your Monad pipeline configuration to enable the GCS output.

Settings

| Setting | Type | Required | Default | Description |
|---|---|---|---|---|
| Bucket Name | string | Yes | | Name of your GCS bucket |
| Prefix | string | No | | Path prefix for objects in the bucket |
| Compression | string | Yes | none | Compression type for output files (e.g. gzip) |
| Partition Format | string | Yes | simple date | Date/time partitioning format for object storage |
| Format Config | object | Yes | | Output format options (e.g. Delimited, JSON) |
| Batch Config | object | Yes | | Batch write options (e.g. size, rate, count) |

Secrets

| Secret | Type | Required | Default | Description |
|---|---|---|---|---|
| Credentials Json | string | Yes | | Contents of your service account JSON key |

Setting up the Output

To configure Google Cloud Storage as an output in your Monad pipeline:

  1. Complete the setup steps above to create a GCS bucket and service account with the required permissions.
  2. Add the GCS output configuration to your pipeline's configuration file, specifying the appropriate settings and secrets as described above.
  3. Ensure the credentials_json secret contains the full JSON key (not just a file path).
  4. Adjust settings such as prefix, compression, partition format, format config, and batch config to match your data and workflow requirements.
  5. Deploy or run your pipeline. Output files will be written to your specified GCS bucket according to your configuration.
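As a pre-flight check for the steps above, the required settings and secrets can be verified before deploying. This is only a sketch under stated assumptions: the key names are hypothetical snake_case renderings of the names in the Settings and Secrets tables, the placeholder values are invented for illustration, and Monad's actual configuration schema may differ. It also assumes that a setting with a documented default counts as satisfied when omitted.

```python
from typing import Dict, List

# Hypothetical key names derived from the Settings and Secrets tables;
# they are NOT confirmed Monad configuration keys.
REQUIRED_SETTINGS = (
    "bucket_name",
    "compression",
    "partition_format",
    "format_config",
    "batch_config",
)
REQUIRED_SECRETS = ("credentials_json",)

# Defaults listed in the Settings table.
DEFAULTS = {"compression": "none", "partition_format": "simple date"}


def missing_fields(settings: Dict[str, object], secrets: Dict[str, str]) -> List[str]:
    """Return the required settings and secrets that are absent.

    Assumption: a required setting with a documented default is treated
    as present even when the caller omits it.
    """
    merged = {**DEFAULTS, **settings}
    missing = [name for name in REQUIRED_SETTINGS if name not in merged]
    missing += [name for name in REQUIRED_SECRETS if not secrets.get(name)]
    return missing


# Placeholder values for illustration only.
example_settings = {
    "bucket_name": "YOUR_BUCKET_NAME",
    "prefix": "monad/",
    "format_config": {},
    "batch_config": {},
}
example_secrets = {"credentials_json": "<contents of monad-gcs-key.json>"}
```

Calling missing_fields(example_settings, example_secrets) returns an empty list, while an empty configuration reports every field that has no default.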

If you want to use data stored in GCS as an input for your pipeline, refer to the Google Cloud Storage Input documentation.

Partition Format

The Partition Format setting controls how output files are organized within your GCS bucket based on date or time. Supported formats are:

  • simple date (default): Organizes files by YYYY/MM/DD.
  • hive compliant: Uses Hive-style partitioning, e.g., year=YYYY/month=MM/date=DD/.

Choose a format that matches your data retention and access patterns.
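To make the two layouts concrete, the sketch below builds the partition portion of an object path for a given date. It is an illustration of the patterns listed above, not Monad's implementation; the function name partition_path is hypothetical, and how the Prefix setting and file name are combined with this path is not covered here.

```python
from datetime import date


def partition_path(day: date, partition_format: str = "simple date") -> str:
    """Build the date-based partition path for one of the documented formats.

    Hypothetical helper: it mirrors the patterns in the list above and
    nothing more.
    """
    if partition_format == "simple date":
        # YYYY/MM/DD
        return day.strftime("%Y/%m/%d")
    if partition_format == "hive compliant":
        # year=YYYY/month=MM/date=DD/
        return day.strftime("year=%Y/month=%m/date=%d/")
    raise ValueError(f"unsupported partition format: {partition_format!r}")
```

For 7 March 2024, the simple date format yields 2024/03/07 and the hive compliant format yields year=2024/month=03/date=07/. The Hive-style layout is convenient when query engines need to discover partitions from key=value path segments.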

Format Options

The output format determines how your data is structured in the storage files. You must configure exactly one format type. For the available formats and their options, see the Formats documentation.

Batch Config

The Batch Config object controls how data is grouped and written to GCS. Typical options include:

  • size: Maximum size of each batch (e.g., 10MB)
  • count: Maximum number of records per batch
  • rate: Throttling or write rate limits

Tuning batch settings can help optimize performance and cost for your workload.
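The interaction between the size and count limits can be pictured with a generic batching loop. This is a sketch of the general technique, not Monad's batching code: the function name batch_records and its parameters are hypothetical, sizes are measured in bytes, and the rate option is left out.

```python
from typing import Iterable, Iterator, List


def batch_records(
    records: Iterable[bytes], max_count: int, max_bytes: int
) -> Iterator[List[bytes]]:
    """Yield batches holding at most max_count records and max_bytes bytes.

    A single record larger than max_bytes is emitted in a batch by itself.
    Generic illustration only; not Monad's implementation.
    """
    batch: List[bytes] = []
    batch_bytes = 0
    for record in records:
        # Flush first if adding this record would break either limit.
        if batch and (
            len(batch) >= max_count or batch_bytes + len(record) > max_bytes
        ):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += len(record)
    # Emit whatever is left once the input is exhausted.
    if batch:
        yield batch
```

Whichever limit is reached first closes the batch, so a low count with a high size produces many small objects, while the reverse produces fewer, larger ones. Fewer, larger writes generally mean fewer storage operations at the cost of higher latency before data appears in the bucket.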

Troubleshooting

Common Issues

  • Permission Denied:
    Ensure your service account has the correct permissions (Storage Object Creator) and the credentials are valid.

  • Bucket Not Found:
    Double-check the bucket name and ensure it exists in your GCP project.

  • Invalid Credentials Format:
    Make sure the credentials_json secret contains the full JSON key, not just a path.