Databricks
Stream data from your Monad pipeline into Databricks Delta Lake tables via Unity Catalog, with automatic table creation, schema inference, and gzip-compressed staging.
Overview
The Databricks output supports two write modes:
- Copy Into -- Stages compressed JSONL files to a Unity Catalog Volume and uses `COPY INTO` to load them into a Delta table. Monad handles table creation, schema inference, and file cleanup automatically.
- Autoloader -- Stages compressed JSONL files to a Unity Catalog Volume for Databricks Autoloader (`cloudFiles`) to ingest. You configure the Autoloader job in Databricks to pick up files from the volume.
Both modes support OAuth M2M (service principal) authentication and validate permissions during connection testing.
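
As an illustration of the staging format both modes rely on, the sketch below serializes records as newline-delimited JSON and gzips the result. It is not Monad's implementation: the function names `encode_jsonl_gz` and `decode_jsonl_gz` are hypothetical, and the exact file naming and layout Monad uses inside the volume are not specified here.

```python
import gzip
import io
import json


def encode_jsonl_gz(records):
    """Serialize records as newline-delimited JSON and gzip the result."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
        for record in records:
            # One compact JSON object per line (the JSONL convention).
            line = json.dumps(record, separators=(",", ":")) + "\n"
            gz.write(line.encode("utf-8"))
    return buffer.getvalue()


def decode_jsonl_gz(payload):
    """Inverse of encode_jsonl_gz, handy for inspecting a staged file."""
    text = gzip.decompress(payload).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]
```

Decoding a file downloaded from the volume this way can help confirm what was staged before looking at the load step.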
Requirements
- Databricks Workspace with Unity Catalog enabled
- SQL Warehouse running and accessible (only required for `copy_into` mode)
- Catalog and Schema must already exist in your workspace
- Volume for staging files - Monad will create it if it doesn't exist
- Authentication credentials (see Authentication Methods)
Setting Up Permissions
The required permissions depend on which write mode you use.
Copy Into mode
Code
Autoloader mode
Autoloader only needs volume access -- table permissions are managed by your Autoloader job:
Code
Where `<principal>` is your service principal's application ID.
Monad verifies all of these permissions during Test Connection and will report any missing grants.
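
The sketch below generates Unity Catalog `GRANT` statements for a service principal, as a starting point for scripting the setup. The privilege lists are assumptions for illustration: only `USE SCHEMA`, `READ VOLUME`, and `WRITE VOLUME` are named elsewhere on this page, while `USE CATALOG`, `SELECT`, and `MODIFY` are added here as likely companions. The helper `build_grants` is hypothetical, and Test Connection remains the authority on which grants are actually required.

```python
def build_grants(principal, catalog, schema, volume, table=None):
    """Return GRANT statements for a service principal.

    Assumed privilege set, not an official list. Pass `table` only for
    copy_into mode and only if the table already exists.
    """
    who = f"`{principal}`"  # application IDs contain hyphens, so quote them
    schema_fqn = f"{catalog}.{schema}"
    statements = [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {who}",
        f"GRANT USE SCHEMA ON SCHEMA {schema_fqn} TO {who}",
        f"GRANT READ VOLUME, WRITE VOLUME ON VOLUME {schema_fqn}.{volume} TO {who}",
    ]
    if table is not None:
        # Table-level privileges (assumed) for loading into an existing table.
        statements.append(
            f"GRANT SELECT, MODIFY ON TABLE {schema_fqn}.{table} TO {who}"
        )
    return statements
```

If Monad is expected to create the volume or table itself, schema-level create privileges are probably needed as well; run Test Connection after applying the grants to see what is still missing.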
Configuration
Settings
| Setting | Type | Required | Default | Description |
|---|---|---|---|---|
| Server Hostname | string | Yes | - | The Databricks workspace hostname (e.g. adb-1234567890.azuredatabricks.net) |
| Write Mode | object | Yes | - | How data is loaded (see Write Modes) |
| Catalog | string | Yes | - | The Unity Catalog name |
| Schema | string | Yes | - | The target schema within the catalog |
| Volume | string | Yes | - | The Unity Catalog Volume used for staging JSONL files |
| Batch Config | object | No | See below | Batching configuration |
Write Modes
| Mode | Description |
|---|---|
| `copy_into` | Stages files to a Volume and uses COPY INTO to load data into a Delta table |
| `autoloader` | Stages files to a Volume for Databricks Autoloader (cloudFiles) to ingest |
Copy Into requires two additional settings:
- Table Name -- the target Delta table name. If the table doesn't exist, Monad will create it automatically.
- HTTP Path -- the SQL warehouse HTTP path from connection details (e.g. `/sql/1.0/warehouses/abc123`).
Autoloader has no additional settings. Files are uploaded to the volume and left for your Autoloader job to pick up -- no SQL warehouse is required.
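
To make the `copy_into` load step more concrete, here is a rough sketch of the kind of statement that loads staged JSON files from a volume into a Delta table. The exact statement Monad issues is not documented on this page, so treat the options as assumptions; `build_copy_into` is a hypothetical helper, and the `mergeSchema` copy option reflects the schema evolution behavior described under Troubleshooting.

```python
def build_copy_into(catalog, schema, table, volume):
    """Build a COPY INTO statement that reads JSON files from a volume."""
    source = f"/Volumes/{catalog}/{schema}/{volume}/"
    return (
        f"COPY INTO {catalog}.{schema}.{table}\n"
        f"FROM '{source}'\n"
        "FILEFORMAT = JSON\n"
        # Assumed option: lets new fields in the files extend the table schema.
        "COPY_OPTIONS ('mergeSchema' = 'true')"
    )


print(build_copy_into("main", "security", "events", "monad_stage"))
```

Running a statement like this by hand against the SQL warehouse can help separate permission problems from data problems when a load fails.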
Batch Configuration
Defaults are tuned for bulk loading throughput -- larger batches mean fewer load operations.
| Setting | Default | Min | Max | Description |
|---|---|---|---|---|
| `record_count` | 50,000 | 10,000 | 100,000 | Maximum records per batch |
| `data_size` | 20 MB | 10 MB | 50 MB | Maximum batch size |
| `publish_rate` | 300s | 300s | 600s | Maximum time before sending a batch |
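
The three limits work together: a batch is sent as soon as any one of them is reached. The sketch below models that rule under stated assumptions. `BatchConfig` and `Batcher` are hypothetical names, "20 MB" is interpreted as 20 × 1024 × 1024 bytes, and the Min/Max bounds from the table are not enforced.

```python
import time
from dataclasses import dataclass, field


@dataclass
class BatchConfig:
    record_count: int = 50_000           # maximum records per batch
    data_size: int = 20 * 1024 * 1024    # maximum batch size in bytes (assumed unit)
    publish_rate: float = 300.0          # maximum seconds before sending a batch


@dataclass
class Batcher:
    config: BatchConfig = field(default_factory=BatchConfig)
    records: list = field(default_factory=list)
    size_bytes: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def add(self, record: bytes) -> None:
        self.records.append(record)
        self.size_bytes += len(record)

    def should_flush(self, now=None) -> bool:
        """True when any limit is reached; an empty batch never flushes on time alone."""
        now = time.monotonic() if now is None else now
        return (
            len(self.records) >= self.config.record_count
            or self.size_bytes >= self.config.data_size
            or (bool(self.records) and now - self.started_at >= self.config.publish_rate)
        )
```

With the defaults, a low-volume pipeline flushes every 300 seconds, while a high-volume one flushes whenever it hits 50,000 records or 20 MB first.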
Secrets
| Setting | Type | Required | Description |
|---|---|---|---|
| Client ID | string | Yes | OAuth M2M client ID for service principal authentication |
| Client Secret | string | Yes | OAuth M2M client secret for service principal authentication |
Generate Client ID and Client Secret (OAuth Machine-to-Machine - Service Principal)
Recommended for production. Uses a service principal with client credentials:
- In the Databricks Account Console, go to User management > Service principals
- Click Add service principal and create one
- Select the service principal, go to Secrets > Generate secret
- Copy the Client ID and Client Secret
- Add the service principal to your workspace and grant it the required permissions
Use the client ID and client secret as the client_id and client_secret secrets.
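
To check the credentials independently of Monad, you can request a token with the OAuth client credentials flow. The sketch below only builds the request. It assumes the workspace-level token endpoint `/oidc/v1/token` and the `all-apis` scope described in the Databricks OAuth M2M documentation; `build_token_request` is a hypothetical helper, and this is not necessarily how Monad authenticates internally.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(server_hostname, client_id, client_secret):
    """Build (but do not send) a client-credentials token request."""
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    ).encode("ascii")
    return urllib.request.Request(
        url=f"https://{server_hostname}/oidc/v1/token",
        data=body,
        method="POST",
        headers={
            # HTTP Basic auth carrying the service principal's client ID and secret.
            "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )
```

Passing the request to `urllib.request.urlopen` should return a JSON body containing an access token if the credentials are valid; a 401 response points to a wrong or expired secret.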
Where to Find Connection Details
- In your Databricks workspace, go to SQL Warehouses
- Select your warehouse and open the Connection details tab
- Copy the Server hostname and HTTP path
Troubleshooting
Connection Issues
- Server hostname: Ensure the hostname is correct and accessible (e.g. `adb-1234567890.azuredatabricks.net`)
- HTTP path: Verify the SQL warehouse HTTP path from the connection details tab
- SQL warehouse: Make sure your warehouse is running -- Monad cannot start a stopped warehouse
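
A quick format check can catch the most common copy-and-paste mistakes before you run Test Connection. The sketch below is a heuristic based only on the examples on this page: it assumes the hostname is a bare host without a scheme and that the HTTP path follows the `/sql/1.0/warehouses/<id>` shape. `check_connection_details` is a hypothetical helper, not part of Monad.

```python
import re

# Shape inferred from the example path on this page (an assumption).
WAREHOUSE_PATH = re.compile(r"^/sql/1\.0/warehouses/[A-Za-z0-9]+$")


def check_connection_details(server_hostname, http_path=None):
    """Return a list of likely formatting problems (empty if none found)."""
    problems = []
    if "://" in server_hostname or "/" in server_hostname:
        problems.append("server hostname should be a bare host, without scheme or path")
    # http_path is only relevant for copy_into mode, so it is optional here.
    if http_path is not None and not WAREHOUSE_PATH.match(http_path):
        problems.append("HTTP path should look like /sql/1.0/warehouses/<id>")
    return problems
```

An empty result only means the values are well-formed; it does not confirm that the workspace is reachable or that the warehouse is running.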
Authentication Errors
- 401 Unauthorized: Check that your OAuth credentials are valid and not expired
- OAuth M2M: Ensure the service principal is added to the workspace and has the correct grants
Permission Errors
- USE SCHEMA denied: Grant `USE SCHEMA` on the target schema to your principal
- Missing permissions: Run Test Connection to see which specific permissions are missing
- Volume access denied: Grant `READ VOLUME` and `WRITE VOLUME` on the volume
Data Loading Issues
- COPY INTO failures (copy_into mode): Check that the volume exists and is accessible
- Schema mismatch (copy_into mode): `mergeSchema` is enabled, so new fields are added automatically. However, incompatible type changes may cause errors
- Large batch failures: If uploads fail with 413 errors, reduce the `data_size` in batch configuration
- Autoloader not picking up files: Verify your Autoloader job is configured to read from the correct volume path (`/Volumes/<catalog>/<schema>/<volume>/`)
Limitations
- Catalog and schema must exist before configuring the output
- Volume is created automatically if it doesn't exist
- In copy_into mode, table schema is inferred from the data -- explicit schema definition is not supported, but you can pre-create the table with your desired schema
- In autoloader mode, Monad only stages files -- you are responsible for configuring the Autoloader job in Databricks
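
For autoloader mode, the Databricks side of the setup might look like the sketch below, which derives the source path and reader options for a `cloudFiles` stream. `autoloader_settings` is a hypothetical helper, the schema location is a path you choose yourself, and the commented usage assumes a Databricks notebook or job where `spark` is already defined.

```python
def autoloader_settings(catalog, schema, volume, schema_location):
    """Return the source path and reader options for an Autoloader stream."""
    source_path = f"/Volumes/{catalog}/{schema}/{volume}/"
    options = {
        "cloudFiles.format": "json",                   # staged files are JSONL
        "cloudFiles.schemaLocation": schema_location,  # where Autoloader tracks the inferred schema
    }
    return source_path, options


# Usage inside Databricks (not runnable outside a Spark session):
#   path, opts = autoloader_settings("main", "security", "monad_stage", "<schema-location>")
#   stream = spark.readStream.format("cloudFiles").options(**opts).load(path)
```

The target table, trigger, and checkpoint location are up to your job, which is what lets Databricks control the ingestion schedule in this mode.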
Best Practices
- Use default batch settings -- they are optimized for bulk loading throughput
- Share volumes across connectors -- multiple tables can safely stage files in the same volume
- Pre-create catalog and schema -- Monad expects these to exist
- Use dedicated service principals with only the required permissions
- Monitor warehouse usage (copy_into mode) -- each `COPY INTO` consumes SQL warehouse compute
- Use autoloader mode when you want Databricks to control the ingestion schedule and schema evolution