# BigQuery

The BigQuery Output loads data from your pipeline into Google BigQuery tables. It supports efficient batch loading and automatically handles schema evolution.
## Requirements

The BigQuery Output automatically handles resource creation:

- The specified dataset will be created if it doesn't exist
- The specified table will be created if it doesn't exist
- The table schema is automatically detected from your data
### Setting up a Service Account

- Create a new service account:

  ```shell
  # Set your project ID
  export PROJECT_ID="your-project-id"

  # Create the service account
  gcloud iam service-accounts create bigquery-loader \
    --description="Service account for BigQuery data loading" \
    --display-name="BigQuery Loader"
  ```
- Grant the required permissions:

  ```shell
  # Grant the BigQuery Data Owner role (allows creating datasets and tables)
  gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:bigquery-loader@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/bigquery.dataOwner"
  ```
- Create and save the service account key (credentials):

  ```shell
  # Create and save the key file to your current directory
  gcloud iam service-accounts keys create bigquery-credentials.json \
    --iam-account=bigquery-loader@$PROJECT_ID.iam.gserviceaccount.com

  # Verify the file was created
  ls -l bigquery-credentials.json
  ```
This command creates a file named `bigquery-credentials.json` in your current working directory. Use the contents of this file as the value for the `credentials` secret in your output configuration.

**Note:** Keep this credentials file secure and never commit it to version control. After you've configured your output, you can safely delete the local copy.
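Before pasting the key into your configuration, you can optionally confirm it grants BigQuery access by authenticating as the service account and listing datasets. This is a sanity check, not a required step; the `bq` CLI ships with the Google Cloud SDK:

```shell
# Authenticate as the service account using the downloaded key
gcloud auth activate-service-account --key-file=bigquery-credentials.json

# List datasets in the project to confirm BigQuery access
bq ls --project_id="$PROJECT_ID"
```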
## Configuration

The BigQuery Output is configured using the following settings:

### Settings
| Setting | Type | Required | Default | Description |
|---|---|---|---|---|
| project_id | string | Yes | - | The Google Cloud project ID that contains the BigQuery dataset |
| dataset | string | Yes | - | The name of the BigQuery dataset where the table resides |
| table | string | Yes | - | The name of the table where the data will be written |
| kms_key_name | string | No | - | The optional KMS key name to use to encrypt your data. The service account provided must have access to the KMS key |
### Secrets
| Setting | Type | Required | Description |
|---|---|---|---|
| credentials | text | Yes | The Google Cloud service account credentials in JSON format |
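Putting the settings and secret together, a configuration might look like the following sketch. The YAML layout and the `secrets` block name are assumptions; match the actual configuration format your pipeline uses:

```yaml
# Illustrative BigQuery Output configuration (field layout is an assumption)
project_id: your-project-id
dataset: analytics
table: events
# Optional; omit if you don't need customer-managed encryption
kms_key_name: projects/your-project-id/locations/us/keyRings/your-ring/cryptoKeys/your-key
secrets:
  # Paste the full contents of bigquery-credentials.json here
  credentials: |
    {"type": "service_account", "project_id": "your-project-id", "...": "..."}
```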
## Data Loading and Compute Resources

BigQuery offers two types of compute resources for data loading:
- **Default Pipeline (Free)**
  - Automatically used when no reservation is assigned to your project
  - No charge for data loading
  - Shares resources with other users
  - May have variable performance depending on overall system load
  - Best for small to medium workloads where consistent performance isn't critical
- **Reserved Compute (Paid)**
  - Provides dedicated compute resources
  - Guarantees consistent performance
  - Configured at the project level in BigQuery Admin
  - Best for high-volume workloads requiring predictable performance
### Using Reserved Compute

To use reserved compute resources:

- Purchase a BigQuery reservation through the Google Cloud Console
- Assign your project to the reservation using BigQuery Admin
- Make no code changes - load jobs will automatically use the reservation
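The same steps can also be performed from the command line with the `bq` tool's reservation commands. The reservation name, location, and slot count below are placeholders; adjust them for your workload:

```shell
# Purchase a reservation (here: 100 slots in the US multi-region)
bq mk --reservation --project_id="$PROJECT_ID" \
  --location=US --slots=100 pipeline-loads

# Assign this project's PIPELINE (load) jobs to the reservation
bq mk --reservation_assignment \
  --reservation_id="$PROJECT_ID:US.pipeline-loads" \
  --job_type=PIPELINE \
  --assignee_type=PROJECT \
  --assignee_id="$PROJECT_ID"
```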
Your BigQuery Output will automatically use whatever compute resources are assigned to your project. If your data loading is experiencing performance issues with the default pipeline, consider purchasing a reservation through the BigQuery Admin Console.
For more information about batch loading and compute resources, see the Google Cloud documentation on batch loading data and BigQuery reservations.
## Best Practices

- **Batch Loading**
  - The output automatically batches records for efficient loading
  - Default batch settings are optimized for most use cases
- **Schema Evolution**
  - Schema updates are handled automatically
  - New fields can be added without configuration changes
  - Existing fields can be relaxed (e.g., INT64 to FLOAT64)
- **Performance**
  - Start with the free default pipeline
  - Monitor load job latency and throughput
  - If performance is insufficient, consider a reservation
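One way to monitor load job latency is to inspect recent jobs with the `bq` CLI; this is an optional check, and the job ID below is a placeholder:

```shell
# List the most recent jobs (includes load jobs) for the project
bq ls -j --max_results=10 --project_id="$PROJECT_ID"

# Show timing and status details for a specific job
bq show -j your-load-job-id
```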
## Limitations

- Individual records must be valid JSON
- Empty objects (`{}`) are automatically converted to NULL
- Maximum batch size is determined by BigQuery's load job limits
- Schema changes that would lose data are not allowed
- The default pipeline may have variable performance under load
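Because each record must be valid JSON, a quick local check before loading can save a failed job. The sketch below validates a newline-delimited file using Python's standard `json` module; the filename and sample records are illustrative:

```shell
# Write two sample records, one JSON object per line (NDJSON)
printf '%s\n' '{"id": 1, "name": "a"}' '{"id": 2, "name": "b"}' > records.jsonl

# Parse every line; the script exits non-zero on the first invalid record
python3 - <<'EOF'
import json

with open("records.jsonl") as f:
    for lineno, line in enumerate(f, 1):
        json.loads(line)  # raises ValueError on invalid JSON
print("all records valid")
EOF
```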
A complete configuration loads data into the specified project, dataset, and table, optionally encrypts the data with the given KMS key, and authenticates with the provided service account credentials.
Remember to replace the project ID, dataset, table name, and credentials with your actual values. The KMS key name is optional and can be omitted if not needed.