Databricks
Stream data from your Monad pipeline into Databricks Delta Lake tables via Unity Catalog, with automatic table creation, schema inference, and gzip-compressed staging.
Overview
The Databricks output supports two write modes:
- Copy Into -- Stages compressed JSONL files to a Unity Catalog Volume and uses `COPY INTO` to load them into a Delta table. Monad handles table creation, schema inference, and file cleanup automatically.
- Autoloader -- Stages compressed JSONL files to a Unity Catalog Volume for Databricks Autoloader (`cloudFiles`) to ingest. You configure the Autoloader job in Databricks to pick up files from the volume.
Both modes support OAuth M2M (service principal) authentication and validate permissions during connection testing.
Requirements
- Databricks Workspace with Unity Catalog enabled
- SQL Warehouse running and accessible
- Catalog and Schema must already exist in your workspace
- Volume for staging files - Monad will create it if it doesn't exist
- Authentication credentials (see Authentication Methods)
Setting Up Permissions
The required permissions depend on which write mode you use.
Copy Into mode
Code
Autoloader mode
Autoloader only needs volume access -- table permissions are managed by your Autoloader job:
Code
Where <principal> is your service principal's application ID.
Monad verifies all of these permissions during Test Connection and will report any missing grants.
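As a rough illustration of how the grants differ by mode, the sketch below generates Unity Catalog `GRANT` statements for a principal. The privilege lists are an assumption: `USE SCHEMA`, `READ VOLUME`, and `WRITE VOLUME` are named on this page, while `USE CATALOG` and `CREATE TABLE` are guesses based on common Unity Catalog setups. The helper name `grant_statements` is hypothetical. Treat the output of Test Connection as the authoritative list.

```python
# Hypothetical helper: builds GRANT statements for a service principal.
# The privilege lists are assumptions, not Monad's documented requirements.

# Privileges assumed for both modes: reach the schema, read/write the volume.
VOLUME_GRANTS = [
    ("USE CATALOG", "CATALOG", "{catalog}"),
    ("USE SCHEMA", "SCHEMA", "{catalog}.{schema}"),
    ("READ VOLUME", "VOLUME", "{catalog}.{schema}.{volume}"),
    ("WRITE VOLUME", "VOLUME", "{catalog}.{schema}.{volume}"),
]

# Extra privilege assumed for copy_into, where Monad may create the table.
TABLE_GRANTS = [
    ("CREATE TABLE", "SCHEMA", "{catalog}.{schema}"),
]


def grant_statements(mode, principal, catalog, schema, volume):
    """Return a list of GRANT statements for the given write mode."""
    if mode not in ("copy_into", "autoloader"):
        raise ValueError(f"unknown write mode: {mode}")
    names = {"catalog": catalog, "schema": schema, "volume": volume}
    grants = list(VOLUME_GRANTS)
    if mode == "copy_into":
        grants += TABLE_GRANTS
    return [
        f"GRANT {privilege} ON {kind} {target.format(**names)} TO `{principal}`;"
        for privilege, kind, target in grants
    ]


if __name__ == "__main__":
    for statement in grant_statements(
        "copy_into", "<principal>", "main", "security", "monad_stage"
    ):
        print(statement)
```

The volume-level grants can only be applied once the volume exists, so they may need to be run after the first successful connection or against a pre-created volume.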
Configuration
Settings
| Setting | Type | Required | Default | Description |
|---|---|---|---|---|
| Server Hostname | string | Yes | - | The Databricks workspace hostname (e.g. adb-1234567890.azuredatabricks.net) |
| HTTP Path | string | Yes | - | The SQL warehouse HTTP path from connection details (e.g. /sql/1.0/warehouses/abc123) |
| Write Mode | object | Yes | - | How data is loaded (see Write Modes) |
| Catalog | string | Yes | - | The Unity Catalog name |
| Schema | string | Yes | - | The target schema within the catalog |
| Volume | string | Yes | - | The Unity Catalog Volume used for staging JSONL files |
| Batch Config | object | No | See below | Batching configuration |
Write Modes
| Mode | Description |
|---|---|
| copy_into | Stages files to a Volume and uses COPY INTO to load data into a Delta table |
| autoloader | Stages files to a Volume for Databricks Autoloader (cloudFiles) to ingest |
Copy Into requires an additional Table Name setting -- the target Delta table name. If the table doesn't exist, Monad will create it automatically.
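To make the Copy Into flow concrete, here is a minimal sketch of the kind of SQL such a load involves: an empty placeholder table followed by a `COPY INTO` from the staging volume with `mergeSchema` enabled. This is not necessarily the statement Monad issues. The function name `copy_into_sql` and the example catalog, schema, table, and volume names are placeholders.

```python
# Sketch only: approximates a COPY INTO load from a Unity Catalog Volume.
# The exact statements Monad runs are not documented here.


def copy_into_sql(catalog, schema, table, volume):
    """Return (create_stmt, copy_stmt) for loading staged JSON files."""
    target = f"{catalog}.{schema}.{table}"
    source = f"/Volumes/{catalog}/{schema}/{volume}/"
    # A schemaless placeholder table lets COPY INTO infer columns on first load.
    create_stmt = f"CREATE TABLE IF NOT EXISTS {target}"
    copy_stmt = (
        f"COPY INTO {target}\n"
        f"FROM '{source}'\n"
        "FILEFORMAT = JSON\n"
        "FORMAT_OPTIONS ('mergeSchema' = 'true')\n"
        "COPY_OPTIONS ('mergeSchema' = 'true')"
    )
    return create_stmt, copy_stmt


if __name__ == "__main__":
    create_stmt, copy_stmt = copy_into_sql("main", "security", "events", "monad_stage")
    print(create_stmt)
    print(copy_stmt)
```

Running the two statements by hand in a SQL editor against a staged file can help separate permission problems from data problems when a load fails.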
Autoloader has no additional settings. Files are uploaded to the volume and left for your Autoloader job to pick up.
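For Autoloader mode, the Databricks-side job might look roughly like the PySpark sketch below, intended for a Databricks notebook or job where `spark` is already defined. The function names, the `state_dir` location, and the target table are hypothetical. The sketch also assumes the staged files carry a gzip extension that Spark's JSON reader decompresses transparently.

```python
# Sketch of an Autoloader job reading Monad's staged files.
# Assumes a Databricks runtime; `spark` is the notebook's SparkSession.


def volume_path(catalog, schema, volume):
    """Path of the staging volume that Monad writes into."""
    return f"/Volumes/{catalog}/{schema}/{volume}/"


def start_autoloader(spark, catalog, schema, volume, table, state_dir):
    """Ingest staged JSON files into a Delta table and return the query.

    state_dir (hypothetical) holds schema-tracking and checkpoint data and
    should live outside the staging volume path.
    """
    stream = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", f"{state_dir}/schema")
        .load(volume_path(catalog, schema, volume))
    )
    return (
        stream.writeStream
        .option("checkpointLocation", f"{state_dir}/checkpoint")
        .option("mergeSchema", "true")   # allow new columns in the target
        .trigger(availableNow=True)      # process what is staged, then stop
        .toTable(f"{catalog}.{schema}.{table}")
    )
```

With `availableNow=True` the job behaves like a scheduled batch; removing the trigger keeps the stream running continuously.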
Batch Configuration
Defaults are tuned for bulk loading throughput -- larger batches mean fewer load operations.
| Setting | Default | Min | Max | Description |
|---|---|---|---|---|
| record_count | 50,000 | 10,000 | 100,000 | Maximum records per batch |
| data_size | 20 MB | 10 MB | 50 MB | Maximum batch size |
| publish_rate | 300s | 300s | 600s | Maximum time before sending a batch |
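The three limits in the table can be read as "send the batch as soon as any one of them is reached". The sketch below models that reading along with the documented defaults and bounds. It is an interpretation, not Monad's implementation: the names `BatchConfig` and `should_flush` are illustrative, and 1 MB is assumed to mean 1024 * 1024 bytes.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: the docs may instead mean 10**6 bytes

# (min, max) bounds taken from the table above.
LIMITS = {
    "record_count": (10_000, 100_000),
    "data_size": (10 * MB, 50 * MB),
    "publish_rate": (300, 600),  # seconds
}


@dataclass
class BatchConfig:
    record_count: int = 50_000
    data_size: int = 20 * MB
    publish_rate: int = 300

    def validate(self):
        """Raise ValueError if any setting falls outside its bounds."""
        for name, (low, high) in LIMITS.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")


def should_flush(cfg, records, size_bytes, age_seconds):
    """True once any of the three limits has been reached."""
    return (
        records >= cfg.record_count
        or size_bytes >= cfg.data_size
        or age_seconds >= cfg.publish_rate
    )
```

Under this reading, a low-volume pipeline with default settings sends a batch every 300 seconds, while a high-volume one is bounded by record count or size first.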
Secrets
| Setting | Type | Required | Description |
|---|---|---|---|
| Client ID | string | Yes | OAuth M2M client ID for service principal authentication |
| Client Secret | string | Yes | OAuth M2M client secret for service principal authentication |
Generate Client ID and Client Secret (OAuth Machine-to-Machine - Service Principal)
Recommended for production. Uses a service principal with client credentials:
- In the Databricks Account Console, go to User management > Service principals
- Click Add service principal and create one
- Select the service principal, go to Secrets > Generate secret
- Copy the Client ID and Client Secret
- Add the service principal to your workspace and grant it the required permissions
Use the client ID and client secret as the client_id and client_secret secrets.
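To check a client ID and secret outside of Monad, you can request a token yourself. The sketch below builds a client-credentials request against the workspace token endpoint (`/oidc/v1/token` with scope `all-apis`), which is how Databricks documents OAuth M2M. Whether Monad uses this exact endpoint is an assumption, and `build_token_request` is a hypothetical helper name.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(server_hostname, client_id, client_secret):
    """Build (but do not send) an OAuth M2M client-credentials request."""
    # Client credentials travel as HTTP Basic auth.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    ).encode()
    return urllib.request.Request(
        f"https://{server_hostname}/oidc/v1/token",
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends the request. A successful response is JSON containing an `access_token`, while an HTTP 401 usually points to a wrong or expired secret.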
Where to Find Connection Details
- In your Databricks workspace, go to SQL Warehouses
- Select your warehouse and open the Connection details tab
- Copy the Server hostname and HTTP path
Troubleshooting
Connection Issues
- Server hostname: Ensure the hostname is correct and accessible (e.g. `adb-1234567890.azuredatabricks.net`)
- HTTP path: Verify the SQL warehouse HTTP path from the connection details tab
- SQL warehouse: Make sure your warehouse is running -- Monad cannot start a stopped warehouse
Authentication Errors
- 401 Unauthorized: Check that your OAuth credentials are valid and not expired
- OAuth M2M: Ensure the service principal is added to the workspace and has the correct grants
Permission Errors
- USE SCHEMA denied: Grant `USE SCHEMA` on the target schema to your principal
- Missing permissions: Run Test Connection to see which specific permissions are missing
- Volume access denied: Grant `READ VOLUME` and `WRITE VOLUME` on the volume
Data Loading Issues
- COPY INTO failures (copy_into mode): Check that the volume exists and is accessible
- Schema mismatch (copy_into mode): `mergeSchema` is enabled, so new fields are added automatically. However, incompatible type changes may cause errors
- Large batch failures: If uploads fail with 413 errors, reduce the `data_size` in batch configuration
- Autoloader not picking up files: Verify your Autoloader job is configured to read from the correct volume path (`/Volumes/<catalog>/<schema>/<volume>/`)
Limitations
- Catalog and schema must exist before configuring the output
- Volume is created automatically if it doesn't exist
- In copy_into mode, table schema is inferred from the data -- Monad has no setting for declaring a schema, so pre-create the table yourself if you need to control it
- In autoloader mode, Monad only stages files -- you are responsible for configuring the Autoloader job in Databricks
Best Practices
- Use default batch settings -- they are optimized for bulk loading throughput
- Share volumes across connectors -- multiple tables can safely stage files in the same volume
- Pre-create catalog and schema -- Monad expects these to exist
- Use dedicated service principals with only the required permissions
- Monitor warehouse usage (copy_into mode) -- each `COPY INTO` consumes SQL warehouse compute
- Use autoloader mode when you want Databricks to control the ingestion schedule and schema evolution