Commit 7294430

docs: databricks
1 parent d4c535e commit 7294430

File tree: 1 file changed (+125, -0)

  • site/docs/reference/Connectors/materialization-connectors

# Databricks

This connector materializes Flow collections into tables in a Databricks SQL Warehouse.
It supports both standard and [delta updates](#delta-updates).

The connector first uploads data changes to a [Databricks Unity Catalog Volume](https://docs.databricks.com/en/sql/language-manual/sql-ref-volumes.html).
From there, it transactionally applies the changes to the Databricks tables.

[`ghcr.io/estuary/materialize-databricks:dev`](https://ghcr.io/estuary/materialize-databricks:dev) provides the latest connector image. You can also follow the link in your browser to see past image versions.
## Prerequisites

To use this connector, you'll need:

* A Databricks account that includes:
  * A Unity Catalog
  * A SQL Warehouse
  * A [schema](https://docs.databricks.com/api/workspace/schemas) — a logical grouping of tables in a catalog
  * A user with a role that grants the appropriate access levels to these resources
* At least one Flow collection

:::tip
If you haven't yet captured your data from its external source, start at the beginning of the [guide to create a dataflow](../../../guides/create-dataflow.md). You'll be referred back to this connector-specific documentation at the appropriate steps.
:::
### Setup

First, create a SQL Warehouse if you don't already have one in your account. See the [Databricks documentation](https://docs.databricks.com/en/sql/admin/create-sql-warehouse.html) on configuring a Databricks SQL Warehouse. After creating a SQL Warehouse, you can find the details necessary for connecting to it under the **Connection Details** tab.

You also need an access token for your user, which the connector uses to authenticate. See the Databricks [documentation](https://docs.databricks.com/en/administration-guide/access-control/tokens.html) on how to create an access token.
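The values from the warehouse's **Connection Details** tab and the access token map directly onto the connector's endpoint configuration, described below. A minimal sketch with placeholder values (your hostname, warehouse ID, and token will differ):

```yaml
# Placeholder values shown; copy your own from the SQL Warehouse's
# Connection Details tab and from the access token you created.
address: dbc-abcdefgh-a12b.cloud.databricks.com  # Server hostname; port 443 is assumed
http_path: /sql/1.0/warehouses/abcd123efgh4567   # HTTP path
catalog_name: main                               # Your Unity Catalog
schema_name: default                             # Default schema to materialize to
credentials:
  auth_type: PAT                                 # Personal access token authentication
  personal_access_token: <your-access-token>
```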
## Configuration

To use this connector, begin with data in one or more Flow collections.
Use the properties below to configure a Databricks materialization, which will direct one or more of your Flow collections to new Databricks tables.

### Properties

#### Endpoint
| Property | Title | Description | Type | Required/Default |
|------------------------------------------|-----------------------|--------------------------------------------------------------------------------------------------------------------------------------|--------|--------------------------|
| **`/address`** | Address | Host and port of the SQL warehouse (in the form of `host[:port]`). Port 443 is used as the default if no specific port is provided. | string | Required |
| **`/http_path`** | HTTP Path | HTTP path of your SQL warehouse | string | Required |
| **`/catalog_name`** | Catalog Name | Name of your Unity Catalog | string | Required |
| **`/schema_name`** | Schema Name | Default schema to materialize to | string | `default` schema is used |
| **`/credentials`** | Credentials | Authentication credentials | object | Required |
| **`/credentials/auth_type`** | Authentication Type | Authentication type; set to `PAT` for personal access token | string | Required |
| **`/credentials/personal_access_token`** | Personal Access Token | Personal access token for your user | string | Required |
#### Bindings

| Property | Title | Description | Type | Required/Default |
|------------------|--------------------|-------------------------------------------------------------|---------|------------------|
| **`/table`** | Table | Table name | string | Required |
| `/schema` | Alternative Schema | Alternative schema for this table | string | |
| `/delta_updates` | Delta updates | Whether to use standard or [delta updates](#delta-updates) | boolean | `false` |
### Sample

```yaml
materializations:
  ${PREFIX}/${mat_name}:
    endpoint:
      connector:
        config:
          address: dbc-abcdefgh-a12b.cloud.databricks.com
          catalog_name: main
          http_path: /sql/1.0/warehouses/abcd123efgh4567
          schema_name: default
          credentials:
            auth_type: PAT
            personal_access_token: secret
        image: ghcr.io/estuary/materialize-databricks:dev
    # If you have multiple collections you need to materialize, add a binding for each one
    # to ensure complete data flow-through
    bindings:
      - resource:
          table: ${table_name}
          schema: default
        source: ${PREFIX}/${source_collection}
```
## Delta updates

This connector supports both standard (merge) and [delta updates](../../../concepts/materialization.md#delta-updates).
The default is to use standard updates.

Enabling delta updates will prevent Flow from querying for documents in your Databricks table, which can reduce latency and costs for large datasets.
If you're certain that all events will have unique keys, enabling delta updates is a simple way to improve
performance with no effect on the output.
However, enabling delta updates is not suitable for all workflows, as the resulting table in Databricks won't be fully reduced.

You can enable delta updates on a per-binding basis:

```yaml
bindings:
  - resource:
      table: ${table_name}
      schema: default
      delta_updates: true
    source: ${PREFIX}/${source_collection}
```
## Reserved words

Databricks has a list of reserved words that must be quoted in order to be used as an identifier. Flow automatically quotes fields that are in the reserved words list. You can find this list in the Databricks documentation [here](https://docs.databricks.com/en/sql/language-manual/sql-ref-reserved-words.html) and in the table below.

:::caution
In Databricks, objects created with quoted identifiers must always be referenced exactly as created, including the quotes. Otherwise, SQL statements and queries can result in errors. See the [Databricks docs](https://docs.databricks.com/en/sql-reference/identifiers-syntax.html#double-quoted-identifiers).
:::

| Reserved words |           |
|----------------|-----------|
| ANTI           | CROSS     |
| EXCEPT         | FULL      |
| INNER          | INTERSECT |
| JOIN           | LATERAL   |
| LEFT           | MINUS     |
| NATURAL        | ON        |
| RIGHT          | SEMI      |
| USING          | NULL      |
| DEFAULT        | TRUE      |
| FALSE          |           |
