BigQuery

The BigQuery Output allows you to load data from your pipeline into Google BigQuery tables.

This output supports efficient batch loading of data and automatically handles schema evolution.

Requirements

Beyond a Google Cloud project and service account credentials (set up below), the BigQuery Output handles resource creation automatically:

  • The specified dataset will be created if it doesn't exist
  • The specified table will be created if it doesn't exist
  • Table schema is automatically detected from your data

Setting up a Service Account

  1. Create a new service account:
# Set your project ID
export PROJECT_ID="your-project-id"

# Create the service account
gcloud iam service-accounts create bigquery-loader \
--description="Service account for BigQuery data loading" \
--display-name="BigQuery Loader"
  2. Grant the required permissions:
# Grant BigQuery Data Owner role (allows creating datasets and tables)
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:bigquery-loader@$PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/bigquery.dataOwner"
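
To confirm the binding took effect, you can list the roles granted to the service account. These are standard gcloud flags, though the exact output format may vary by gcloud version:

```shell
# List the roles bound to the loader service account in this project
gcloud projects get-iam-policy $PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:bigquery-loader@$PROJECT_ID.iam.gserviceaccount.com" \
  --format="table(bindings.role)"
```

You should see roles/bigquery.dataOwner in the output.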
  3. Create and save the service account key (credentials):
# Create and save the key file to your current directory
gcloud iam service-accounts keys create bigquery-credentials.json \
--iam-account=bigquery-loader@$PROJECT_ID.iam.gserviceaccount.com

# Verify the file was created
ls -l bigquery-credentials.json

This command creates a file named bigquery-credentials.json in your current working directory. The contents of this file should be used as the value for the credentials secret in your output configuration.

Note: Keep this credentials file secure and never commit it to version control. After you've configured your output, you can safely delete this file.
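Before wiring the key into your pipeline, you can sanity-check it by authenticating as the service account and listing datasets. These are standard gcloud/bq commands; the dataset list may be empty on a fresh project:

```shell
# Authenticate as the service account using the downloaded key
gcloud auth activate-service-account --key-file=bigquery-credentials.json

# List datasets in the project to confirm BigQuery access
bq ls --project_id=$PROJECT_ID
```

If the second command fails with a permissions error, re-check the IAM binding from the previous step.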

Configuration

The BigQuery Output is configured using the following settings:

Settings

| Setting | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| project_id | string | Yes | - | The Google Cloud project ID that contains the target BigQuery dataset |
| dataset | string | Yes | - | The name of the BigQuery dataset where the table resides |
| table | string | Yes | - | The name of the table where the data will be written |
| kms_key_name | string | No | - | Optional Cloud KMS key name used to encrypt your data; the service account must have access to the key |

Secrets

| Setting | Type | Required | Description |
| --- | --- | --- | --- |
| credentials | text | Yes | The Google Cloud service account credentials in JSON format |
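
Putting the settings and secret together, an output configuration might look like the following sketch. The field names follow the tables above, but the surrounding structure (and the exact mechanism for referencing secrets) depends on your pipeline tool, so treat this as illustrative only:

```yaml
# Hypothetical sketch; structure and secret syntax depend on your pipeline tool
output:
  type: bigquery
  project_id: your-project-id
  dataset: analytics
  table: events
  # Optional; omit if you don't use customer-managed encryption
  kms_key_name: projects/your-project-id/locations/us/keyRings/my-ring/cryptoKeys/my-key
  secrets:
    # Contents of bigquery-credentials.json
    credentials: ${BIGQUERY_CREDENTIALS}
```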

Data Loading and Compute Resources

BigQuery offers two types of compute resources for data loading:

  1. Default Pipeline (Free)

    • Automatically used when no reservation is assigned to your project
    • No charge for data loading
    • Shares resources with other users
    • May have variable performance based on overall system load
    • Best for small to medium workloads where consistent performance isn't critical
  2. Reserved Compute (Paid)

    • Provides dedicated compute resources
    • Guarantees consistent performance
    • Configured at the project level in BigQuery Admin
    • Best for high-volume workloads requiring predictable performance

Using Reserved Compute

To use reserved compute resources:

  1. Purchase a BigQuery reservation through Google Cloud Console
  2. Assign your project to the reservation using BigQuery Admin
  3. No code changes needed - jobs will automatically use the reservation
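
If you prefer the CLI over the console, steps 1 and 2 can also be done with the bq tool. A sketch, assuming the Reservations API is enabled in your project; the reservation name, location, and slot count are placeholders:

```shell
# Create a 100-slot reservation (placeholder name and location)
bq mk --reservation --location=US --slots=100 my_reservation

# Assign this project to the reservation for load (PIPELINE) jobs
bq mk --reservation_assignment \
  --reservation_id=$PROJECT_ID:US.my_reservation \
  --job_type=PIPELINE \
  --assignee_type=PROJECT \
  --assignee_id=$PROJECT_ID
```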

Your BigQuery Output will automatically use whatever compute resources are assigned to your project. If your data loading is experiencing performance issues with the default pipeline, consider purchasing a reservation through the BigQuery Admin Console.

For more information, see the BigQuery documentation on batch loading data and slot reservations.

Best Practices

  1. Batch Loading

    • The output automatically batches records for efficient loading
    • Default batch settings are optimized for most use cases
  2. Schema Evolution

    • Schema updates are automatically handled
    • New fields can be added without configuration changes
    • Existing fields can be relaxed (e.g., INT64 to FLOAT64)
  3. Performance

    • Start with the free default pipeline
    • Monitor load job latency and throughput
    • If performance is insufficient, consider a reservation
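
One lightweight way to monitor load-job latency and throughput (the Performance point above) is the bq CLI's job listing. These are standard bq commands, though the fields shown can vary by version:

```shell
# Show the most recent jobs in the project, including load jobs
bq ls -j --max_results=10 --project_id=$PROJECT_ID

# Inspect a specific job's timing and statistics (replace JOB_ID)
bq show -j JOB_ID
```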

Limitations

  • Individual records must be valid JSON
  • Empty objects ({}) are automatically converted to NULL
  • Maximum batch size is determined by BigQuery's load job limits
  • Schema changes that would lose data are not allowed
  • Default pipeline may have variable performance under load
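
Since individual records must be valid JSON, it can be worth validating a newline-delimited file before loading. A minimal check using only Python's standard json.tool module; the filename is a placeholder:

```shell
# Count invalid JSON lines in records.ndjson (placeholder filename)
invalid=0
while IFS= read -r line; do
  printf '%s' "$line" | python3 -m json.tool >/dev/null 2>&1 || invalid=$((invalid+1))
done < records.ndjson
echo "invalid records: $invalid"
```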

A complete configuration specifies the project, dataset, and table to load into, with authentication handled through the credentials secret. If a KMS key name is provided, data is encrypted with that key; otherwise the setting can be omitted.

Remember to replace the project ID, dataset, table name, and credentials with your actual values.