
Databricks

Stream data from your Monad pipeline into Databricks Delta Lake tables via Unity Catalog, with automatic table creation, schema inference, and gzip-compressed staging.

Overview

The Databricks output enables you to:

  • Load data into Delta Lake tables with automatic schema inference
  • Automatically create tables on first data load if they don't exist
  • Stage compressed JSONL files via Unity Catalog Volumes for efficient bulk loading
  • Authenticate using OAuth M2M (service principal)
  • Handle schema evolution with mergeSchema support
  • Validate permissions during connection testing before any data is written

Requirements

  1. Databricks Workspace with Unity Catalog enabled
  2. SQL Warehouse running and accessible
  3. Catalog and Schema must already exist in your workspace (see the example after this list)
  4. Volume for staging files - Monad will create it if it doesn't exist
  5. Authentication credentials (see Secrets below)
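
Monad does not create the catalog or schema for you. If they don't exist yet, you (or a workspace admin with the appropriate privileges) can create them ahead of time from a SQL editor. The names below are placeholders:

-- Replace my_catalog / my_schema / my_volume with your own names
CREATE CATALOG IF NOT EXISTS my_catalog;
CREATE SCHEMA IF NOT EXISTS my_catalog.my_schema;

-- Optional: pre-create the staging volume; Monad can also create it on first use
CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.my_volume;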

Setting Up Permissions

The Databricks output requires the following permissions on the target schema. Grant them to your user or service principal:

-- Replace placeholders with your actual values
GRANT USE CATALOG ON CATALOG <your_catalog> TO `<principal>`;
GRANT USE SCHEMA ON SCHEMA <your_catalog>.<your_schema> TO `<principal>`;
GRANT CREATE TABLE ON SCHEMA <your_catalog>.<your_schema> TO `<principal>`;
GRANT SELECT ON SCHEMA <your_catalog>.<your_schema> TO `<principal>`;
GRANT MODIFY ON SCHEMA <your_catalog>.<your_schema> TO `<principal>`;
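-- The next two grants cover staging: READ/WRITE on the volume itself, plus CREATE VOLUME so Monad can create the volume if it doesn't exist yet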
GRANT READ VOLUME, WRITE VOLUME ON VOLUME <your_catalog>.<your_schema>.<your_volume> TO `<principal>`;
GRANT CREATE VOLUME ON SCHEMA <your_catalog>.<your_schema> TO `<principal>`;

Where <principal> is your service principal's application ID.

Monad verifies all of these permissions during Test Connection and will report any missing grants.
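
If Test Connection reports missing grants, you can also inspect what a principal currently holds from a SQL editor. A quick check, using the same placeholders as above:

SHOW GRANTS `<principal>` ON SCHEMA <your_catalog>.<your_schema>;
SHOW GRANTS `<principal>` ON VOLUME <your_catalog>.<your_schema>.<your_volume>;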

Configuration

Settings

Setting         | Type   | Required | Default   | Description
----------------|--------|----------|-----------|------------
Server Hostname | string | Yes      | -         | The Databricks workspace hostname (e.g. adb-1234567890.azuredatabricks.net)
HTTP Path       | string | Yes      | -         | The SQL warehouse HTTP path from connection details (e.g. /sql/1.0/warehouses/abc123)
Catalog         | string | Yes      | -         | The Unity Catalog name
Schema          | string | Yes      | -         | The target schema within the catalog
Table           | string | Yes      | -         | The target Delta table name. Created automatically if it doesn't exist
Volume          | string | Yes      | -         | The Unity Catalog Volume used for staging JSONL files before loading
Batch Config    | object | No       | See below | Batching configuration
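
To confirm the Catalog, Schema, Table, and Volume values before configuring the output, you can list them from a SQL editor on the warehouse. The names below are placeholders:

SHOW SCHEMAS IN <your_catalog>;
SHOW TABLES IN <your_catalog>.<your_schema>;
SHOW VOLUMES IN <your_catalog>.<your_schema>;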

Batch Configuration

Defaults are tuned for bulk COPY INTO throughput -- larger batches mean fewer load operations.

Setting      | Default | Min  | Max     | Description
-------------|---------|------|---------|------------
record_count | 50,000  | 500  | 100,000 | Maximum records per batch
data_size    | 10 MB   | 1 MB | 50 MB   | Maximum batch size
publish_rate | 30s     | 5s   | 120s    | Maximum time before sending a batch
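
Each batch is staged to the Volume as compressed JSONL and then bulk-loaded into the target table. The statement below is only an illustrative sketch of what such a COPY INTO load looks like -- the paths and names are placeholders, not the exact statement Monad issues:

COPY INTO <your_catalog>.<your_schema>.<your_table>
FROM '/Volumes/<your_catalog>/<your_schema>/<your_volume>/<staging_path>/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');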

Secrets

Setting       | Type   | Required | Description
--------------|--------|----------|------------
Client ID     | string | Yes*     | OAuth M2M client ID for service principal authentication
Client Secret | string | Yes*     | OAuth M2M client secret for service principal authentication

Generate Client ID and Client Secret (OAuth Machine-to-Machine - Service Principal)

Recommended for production. Uses a service principal with client credentials:

  1. In the Databricks Account Console, go to User management > Service principals
  2. Click Add service principal and create one
  3. Select the service principal, go to Secrets > Generate secret
  4. Copy the Client ID and Client Secret
  5. Add the service principal to your workspace and grant it the required permissions

Use the client ID and client secret as the client_id and client_secret secrets.

Where to Find Connection Details

  1. In your Databricks workspace, go to SQL Warehouses
  2. Select your warehouse and open the Connection details tab
  3. Copy the Server hostname and HTTP path

Troubleshooting

Connection Issues

  • Server hostname: Ensure the hostname is correct and accessible (e.g. adb-1234567890.azuredatabricks.net)
  • HTTP path: Verify the SQL warehouse HTTP path from the connection details tab
  • SQL warehouse: Make sure your warehouse is running -- Monad cannot start a stopped warehouse

Authentication Errors

  • 401 Unauthorized: Check that your OAuth credentials are valid and not expired
  • OAuth M2M: Ensure the service principal is added to the workspace and has the correct grants

Permission Errors

  • USE SCHEMA denied: Grant USE SCHEMA on the target schema to your principal
  • Missing permissions: Run Test Connection to see which specific permissions are missing
  • Volume access denied: Grant READ VOLUME and WRITE VOLUME on the volume

Data Loading Issues

  • COPY INTO failures: Check that the volume exists and is accessible (see the checks after this list)
  • Schema mismatch: mergeSchema is enabled, so new fields are added automatically. However, incompatible type changes may cause errors
  • Large batch failures: If uploads fail with 413 errors, reduce the data_size in batch configuration
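
To check the staging volume manually, you can describe it and list its contents from a SQL editor (placeholder names):

DESCRIBE VOLUME <your_catalog>.<your_schema>.<your_volume>;
LIST '/Volumes/<your_catalog>/<your_schema>/<your_volume>/';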

Limitations

  • Catalog and schema must exist before configuring the output
  • Volume is created automatically if it doesn't exist
  • Table schema is inferred from the data -- explicit schema definition is not supported, but you can pre-create the table with your desired schema (see the sketch after this list)
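
If you want explicit control over column types rather than relying on inference, pre-create the Delta table before the first load. A minimal sketch -- the column names and types here are hypothetical placeholders for your own schema:

CREATE TABLE IF NOT EXISTS <your_catalog>.<your_schema>.<your_table> (
  event_id STRING,
  event_time TIMESTAMP,
  payload STRING
);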

Best Practices

  1. Use default batch settings -- they are optimized for COPY INTO throughput
  2. Share volumes across connectors -- multiple tables can safely stage files in the same volume
  3. Pre-create catalog and schema -- Monad expects these to exist
  4. Use dedicated service principals with only the required permissions
  5. Monitor warehouse usage -- each COPY INTO consumes SQL warehouse compute