This repository was archived by the owner on Dec 4, 2024. It is now read-only.
Merged
add back old version under disabled v1 folder
dataders committed Sep 16, 2022
commit c96365475d886f6ced08af392d607539874ad4eb
21 changes: 18 additions & 3 deletions README.md
@@ -4,15 +4,21 @@ It's without question that us dbters [stan](https://www.urbandictionary.com/defi

This dbt project shows a trivial example of fuzzy string matching in Snowflake using dbt-snowflake Python models in Snowpark. [thefuzz](https://github.com/seatgeek/thefuzz) is the de facto package. While Snowflake SQL has the `EDITDISTANCE()` ([docs](https://docs.snowflake.com/en/sql-reference/functions/editdistance.html)) function, what we're after is "give me the best match for this string, as long as it's 'close enough'".

This is easily accomplished with `process.extractOne()` ([source](https://github.com/seatgeek/thefuzz/blob/791c0bd18c77b4d9911f234c70808dbf24f74152/thefuzz/process.py#L200-L225))
This is easily accomplished with `thefuzz.process.extractOne()` ([source](https://github.com/seatgeek/thefuzz/blob/791c0bd18c77b4d9911f234c70808dbf24f74152/thefuzz/process.py#L200-L225))
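
To make "best match, as long as it's close enough" concrete, here is a minimal sketch that runs with only the standard library. It swaps in `difflib.get_close_matches` as a stand-in for `thefuzz.process.extractOne()`, so the scoring differs: difflib ratios run from 0 to 1, while thefuzz scores run from 0 to 100. The helper name `best_match` and the sample fruit list are assumptions made up for illustration.

```python
import difflib


def best_match(user_input, choices, cutoff=0.6):
    """Return the closest entry in `choices`, or None if nothing clears `cutoff`.

    Hypothetical helper: difflib stands in for thefuzz here, and `cutoff`
    is a 0-1 similarity ratio rather than thefuzz's 0-100 score.
    """
    # Compare case-insensitively, but hand back the original spelling.
    lowered = {choice.lower(): choice for choice in choices}
    hits = difflib.get_close_matches(
        user_input.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None


fruits = ["apple", "banana", "strawberry"]  # made-up sample data
print(best_match("bananna", fruits))  # banana
print(best_match("zzzz", fruits))     # None
```

The `None` return for inputs that clear no candidate plays the same role as `score_cutoff` does in `extractOne()`: a bad match is dropped rather than forced.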


## Imaginary Scenario

### Video Walkthrough

If you'd prefer to hear a rambling overview, check out the [video walkthrough]()

### Shut up and show me the code!

- [fuzzer.ipynb](fuzzer.ipynb): A notebook that shows you the code on your local machine
- [/models/fruit_join.py](/models/fruit_join.py): A notebook that shows you the code on your local machine
- [/models/v1/fruit_join.py](/models/v1/fruit_join.py): A Python model that does the majority of the transformation
- [models/stage/stg_fruit_user_input.py](models/stage/stg_fruit_user_input.py) a Python


### Background

@@ -110,4 +116,13 @@ df_final = (df_input

return df_final
```
3. to run this DAG, simply call `dbt build`!
3. to run this DAG, simply call `dbt build`!


#### Making the code more dbtonic

All we're really doing is adding a new column to a raw dataset. This kind of transformation is also known as a staging model. So for v2, [models/stage/stg_fruit_user_input.py](models/stage/stg_fruit_user_input.py), the new column calculation is the only thing that's done in the staging model, and it is done in Python. Everything else happens in SQL in downstream models, as usual.


From [dbt's best practices](https://docs.getdbt.com/guides/legacy/best-practices):
> Source-centric transformations to transform data from different sources into a consistent structure, for example, re-aliasing and recasting columns, or unioning, joining or deduplicating source data to ensure your model has the correct grain.
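
The "add one column, leave everything else alone" shape of a staging step can be sketched in plain Python. This is not the contents of `stg_fruit_user_input.py`, which this diff doesn't show. The helper `add_best_match_column`, the sample rows, and the toy dictionary matcher are all assumptions for illustration. A list of dicts stands in for the DataFrame.

```python
def add_best_match_column(rows, matcher,
                          source_col="fruit_user_input",
                          new_col="fruit_name"):
    """Return copies of `rows` with one extra column; other fields are untouched."""
    return [{**row, new_col: matcher(row[source_col])} for row in rows]


# Made-up rows standing in for the raw user-input table.
raw_rows = [
    {"user_name": "ana", "fruit_user_input": "bananna", "quantity": 2},
    {"user_name": "bo", "fruit_user_input": "zzzz", "quantity": 1},
]

# Toy matcher for illustration only; a real model would call a fuzzy matcher here.
known_typos = {"bananna": "banana"}
staged = add_best_match_column(raw_rows, known_typos.get)

for row in staged:
    print(row)
```

Nothing is joined, filtered, or aggregated here. Those steps are left to downstream models, which is the point of keeping the staging model this small.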
3 changes: 3 additions & 0 deletions dbt_project.yml
@@ -37,6 +37,9 @@ models:
    stage:
      stg_fruit_user_input:
        +materialized: table
    v1:
      fruit_join:
        +enabled: false


seeds:
40 changes: 40 additions & 0 deletions models/v1/fruit_join.py
@@ -0,0 +1,40 @@
from fuzzywuzzy import process


def model(dbt, session):
    dbt.config(
        materialized="table",
        packages=["fuzzywuzzy"]
    )

    df_input = dbt.ref("fruit_user_input").to_pandas()

    df_price = dbt.ref("fruit_prices_fact").to_pandas()

    def custom_scorer(string):
        '''
        for a given string
        return the best match out of the `fruit_name` column in the df_price table
        (or None when no candidate scores at least 60)
        '''

        x = process.extractOne(string, df_price["fruit_name"], score_cutoff=60)

        if x is not None:
            return x[0]
        else:
            return None

    df_final = (df_input
        # make new col, `fruit_name`, with best match against actual table
        .assign(fruit_name=lambda df: df["fruit_user_input"].apply(custom_scorer))
        # join the actual fruit price table
        .merge(df_price, on="fruit_name")
        # calculate subtotal
        .assign(total=lambda df: df.quantity * df.cost)
        # find total for each user
        .groupby("user_name")["total"].sum()
        .reset_index()
        .sort_values("total", ascending=False)
    )

    return df_final