Commit 540dae2

tobymao and vchan authored
first draft at comparisons (#558)
* first draft at comparisons
* Apply suggestions from code review

Co-authored-by: Vincent Chan <vchan@users.noreply.github.com>
1 parent 97985f9 commit 540dae2

2 files changed

Lines changed: 120 additions & 1 deletion

File tree

docs/comparisons.md

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
1+
# Comparisons
2+
3+
There are many tools and frameworks in the data ecosystem. This page tries to make sense of it all.
4+
5+
## dbt
6+
[dbt](https://www.getdbt.com/) is a tool for data transformations. It is a pioneer in this space and has shown how valuable transformation frameworks can be. Although dbt is a fantastic tool, it has trouble scaling with data and organizational size.
7+
8+
SQLMesh aims to be compatible with the dbt project format. Importing existing dbt projects with minor changes is currently supported in alpha.
9+
10+
### Feature Comparisons
11+
| Feature | dbt | SQLMesh
12+
| ------- | --- | -------
13+
| `SQL models` | ✅ | ✅
14+
| `Python models` | ✅ | ✅
15+
| `Seed models` | ✅ | ✅
16+
| `Jinja support` | ✅ | ✅
17+
| `Views / Embedded Models` | ✅ | ✅
18+
| `Incremental Models` | ✅ | ✅
20+
| `Snapshot Models` | ✅ | ❌
21+
| `Documentation generation` | ✅ | ❌
22+
| `Package Manager` | ✅ | ❌
23+
| `Semantic validation` | ❌ | ✅
24+
| `Transpilation` | ❌ | ✅
25+
| `Unit tests` | ❌ | ✅
26+
| `Column level lineage` | ❌ | ✅
27+
| `Accessible incremental models` | ❌ | ✅
28+
| `Downstream impact planner` | ❌ | ✅
29+
| `Change categorization` | ❌ | ✅
30+
| `Native Airflow integration` | ❌ | ✅
31+
| `Data leakage protection` | ❌ | ✅
32+
| `Data gap detection/repair` | ❌ | ✅
33+
| `Batched backfills` | ❌ | ✅
34+
| `Table reuse across environments` | ❌ | ✅
35+
| `Local Python execution` | ❌ | ✅
36+
| `Open-source CI/CD Bot` | ❌ | ✅
37+
| `Open-source IDE (UI)` | ❌ | ✅
38+
39+
40+
### Incremental Models
41+
Implementing an incremental model is difficult and error-prone in dbt because it does not keep track of state. Since there is no state in dbt, the user must write subqueries to find missing date boundaries.
42+
43+
#### Complexity
44+
```sql
45+
-- dbt incremental
46+
SELECT *
47+
FROM raw.events e
48+
JOIN raw.event_dims d
49+
ON e.id = d.id
50+
-- must specify the is_incremental flag because this predicate will fail if the model has never run before
51+
{% if is_incremental() %}
52+
-- this filter dynamically scans the current model to find the date boundary
53+
AND d.ds >= (SELECT MAX(ds) FROM {{ this }})
54+
{% endif %}
55+
{% if is_incremental() %}
56+
WHERE e.ds >= (SELECT MAX(ds) FROM {{ this }})
57+
{% endif %}
58+
```
59+
60+
Having to manually specify macros to find date boundaries is repetitive and error-prone. As incremental models become more complex, the cognitive burden of maintaining two run modes, "first-time full refresh" and "subsequent incremental", increases.
61+
62+
SQLMesh keeps track of which date ranges exist so the query can be simplified as follows.
63+
64+
```sql
65+
-- sqlmesh incremental
66+
SELECT *
67+
FROM raw.events e
67+
JOIN raw.event_dims d
68+
-- date ranges are handled automatically by sqlmesh
69+
ON e.id = d.id AND d.ds BETWEEN @start_ds AND @end_ds
70+
WHERE e.ds BETWEEN @start_ds AND @end_ds
72+
```
73+
74+
#### Data leakage
75+
dbt does not verify that the rows inserted into an incremental table belong to the batch being loaded. This can cause consistency problems, such as late-arriving data overwriting past partitions. Under the hood, SQLMesh wraps every query in a subquery with a time filter so that only rows within the batch's date range are inserted.
76+
77+
dbt also supports the 'insert/overwrite' incremental load pattern only on systems that natively support it. SQLMesh enables 'insert/overwrite' on any system because it is the most robust way to build incremental pipelines. 'Append' pipelines are extremely dangerous because they can leak data and introduce duplicates.
78+
79+
80+
```sql
81+
-- original query
82+
SELECT *
83+
FROM raw.events e
83+
JOIN raw.event_dims d
84+
ON e.id = d.id AND d.ds BETWEEN @start_ds AND @end_ds
85+
WHERE e.ds BETWEEN @start_ds AND @end_ds
87+
88+
-- with data leakage guard
89+
SELECT *
90+
FROM (
91+
SELECT *
92+
FROM raw.events e
92+
JOIN raw.event_dims d
93+
ON e.id = d.id AND d.ds BETWEEN @start_ds AND @end_ds
94+
WHERE e.ds BETWEEN @start_ds AND @end_ds
96+
)
97+
WHERE ds BETWEEN @start_ds AND @end_ds
98+
```
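
The effect of the guard can be tried out in a small, self-contained sketch. It uses Python's built-in `sqlite3` module as a stand-in engine; the `events` table, the `guarded` helper, and the named parameters `:start_ds` / `:end_ds` are illustrative assumptions, not SQLMesh's actual implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (id INTEGER, ds TEXT);
    INSERT INTO events VALUES
        (1, '2022-01-01'),
        (2, '2022-01-02'),
        (3, '2022-01-03');
    """
)

# A model query whose author forgot to add a time filter.
model_query = "SELECT id, ds FROM events"


def guarded(query: str) -> str:
    """Wrap a query in a subquery that keeps only rows inside the batch (hypothetical helper)."""
    return f"SELECT * FROM ({query}) AS _q WHERE ds BETWEEN :start_ds AND :end_ds ORDER BY id"


# The batch being loaded covers a single day.
batch = {"start_ds": "2022-01-02", "end_ds": "2022-01-02"}

unguarded_rows = conn.execute(model_query + " ORDER BY id").fetchall()
guarded_rows = conn.execute(guarded(model_query), batch).fetchall()

print(unguarded_rows)  # rows from every partition, including ones outside the batch
print(guarded_rows)    # only rows whose ds falls inside the batch
```

Without the wrapper, rows from `2022-01-01` and `2022-01-03` would be written into the `2022-01-02` batch; with it, only the row that belongs to the batch survives.
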
99+
100+
#### Data gaps
101+
The main pattern used in incremental models checks for MAX(ds). This pattern does not catch missing data from the past or data gaps.
102+
103+
```
104+
Expected dates: 2022-01-01, 2022-01-02, 2022-01-03
105+
Missing past data: ?, 2022-01-02, 2022-01-03
106+
Data gap: 2022-01-01, ?, 2022-01-03
107+
```
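
A minimal sketch of the data gap case, using Python's built-in `sqlite3` module with hypothetical `source` and `target` tables: the `MAX(ds)` filter only looks forward from the latest loaded date, so a hole at `2022-01-02` is never revisited.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source (ds TEXT);
    INSERT INTO source VALUES ('2022-01-01'), ('2022-01-02'), ('2022-01-03');

    -- The target already holds 2022-01-01 and 2022-01-03; 2022-01-02 is a gap.
    CREATE TABLE target (ds TEXT);
    INSERT INTO target VALUES ('2022-01-01'), ('2022-01-03');
    """
)

# The MAX(ds) pattern: select everything at or after the latest loaded date.
incremental_rows = conn.execute(
    "SELECT ds FROM source WHERE ds >= (SELECT MAX(ds) FROM target) ORDER BY ds"
).fetchall()

# What is actually missing from the target.
missing_rows = conn.execute(
    "SELECT ds FROM source EXCEPT SELECT ds FROM target ORDER BY ds"
).fetchall()

print(incremental_rows)  # dates the incremental run would reprocess
print(missing_rows)      # dates that are absent from the target
```

The incremental run reprocesses only the latest date, while the missing date never enters its filter range.
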
108+
109+
SQLMesh records every date interval for which a model has run, so it knows exactly which dates are missing.
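
The idea of tracking processed intervals can be sketched in a few lines of Python. This is only an illustration under simple assumptions (daily, inclusive intervals); the `missing_dates` function and its arguments are hypothetical and do not reflect SQLMesh's internal state store.

```python
from __future__ import annotations

from datetime import date, timedelta


def missing_dates(
    start: date,
    end: date,
    processed_intervals: list[tuple[date, date]],
) -> list[date]:
    """Return the dates in [start, end] not covered by any processed interval."""
    covered: set[date] = set()
    for lo, hi in processed_intervals:
        current = lo
        while current <= hi:
            covered.add(current)
            current += timedelta(days=1)

    total_days = (end - start).days + 1
    expected = [start + timedelta(days=i) for i in range(total_days)]
    return [d for d in expected if d not in covered]


start, end = date(2022, 1, 1), date(2022, 1, 3)

# Data gap: the middle day was never processed.
gap = missing_dates(start, end, [(date(2022, 1, 1), date(2022, 1, 1)),
                                 (date(2022, 1, 3), date(2022, 1, 3))])

# Missing past data: only the two most recent days were processed.
past = missing_dates(start, end, [(date(2022, 1, 2), date(2022, 1, 3))])

print(gap)
print(past)
```

Because the processed intervals are stored explicitly, both the gap and the missing past date are found without scanning the target table.
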
110+
111+
#### Performance
112+
The subqueries that look for MAX(date) could have a performance impact on the query. SQLMesh is able to avoid these extra subqueries.
113+
114+
Additionally, dbt expects an incremental model to be able to fully refresh the first time it runs. For some large-scale datasets, this is cost-prohibitive or infeasible. SQLMesh is able to [batch](../concepts/models/overview#batch_size) up backfills into more manageable chunks.
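
As a rough illustration of batching, the sketch below splits a backfill range into chunks of at most `batch_size` days. The `split_into_batches` helper is hypothetical and assumes daily intervals; it is not SQLMesh's scheduler, only a picture of what chunking a backfill looks like.

```python
from __future__ import annotations

from datetime import date, timedelta


def split_into_batches(
    start: date, end: date, batch_size: int
) -> list[tuple[date, date]]:
    """Split the inclusive range [start, end] into chunks of at most batch_size days."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    batches: list[tuple[date, date]] = []
    current = start
    while current <= end:
        # The last batch may be shorter than batch_size.
        batch_end = min(current + timedelta(days=batch_size - 1), end)
        batches.append((current, batch_end))
        current = batch_end + timedelta(days=1)
    return batches


# Ten days of history, processed four days at a time.
batches = split_into_batches(date(2022, 1, 1), date(2022, 1, 10), batch_size=4)
for batch_start, batch_end in batches:
    print(batch_start, batch_end)
```

Each chunk can then be run as its own bounded query instead of one full refresh over the entire history.
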
115+
116+
### SQL unaware
117+
dbt does not parse or understand SQL. It relies heavily on Jinja, which is essentially string manipulation. As a result, syntax errors are difficult to debug and are discovered only at runtime.
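
The problem with pure string templating can be shown with a small sketch. It uses Python's `string.Template` as a stand-in for Jinja and `sqlite3` as a stand-in engine; the template, the `events` table, and the deliberately broken `extra_filter` value are assumptions made for illustration.

```python
import sqlite3
from string import Template

# The template engine only substitutes text; it has no notion of SQL grammar.
template = Template("SELECT id FROM events WHERE ds >= '$start_ds' $extra_filter")

# Rendering succeeds even though the result contains a doubled AND.
rendered = template.substitute(start_ds="2022-01-01", extra_filter="AND AND id > 0")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, ds TEXT)")

# The mistake only surfaces once the engine tries to run the query.
try:
    conn.execute(rendered)
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)

print(rendered)
print(error)
```

Rendering raises no error; the broken SQL is reported only when the database executes it. A tool that parses the query could reject it before anything runs.
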

mkdocs.yml

Lines changed: 3 additions & 1 deletion
@@ -45,8 +45,9 @@ nav:
4545
- Best practices:
4646
- best_practices/recommended_workflow.md
4747
- Resources:
48-
- release_notes.md
48+
- comparisons.md
4949
- development.md
50+
- release_notes.md
5051
theme:
5152
name: material
5253
logo: sqlmesh.png
@@ -74,6 +75,7 @@ plugins:
7475
- include-markdown
7576
- search
7677
markdown_extensions:
78+
- tables
7779
- pymdownx.highlight:
7880
anchor_linenums: true
7981
- pymdownx.inlinehilite
