add back old version under disabled v1 folder

dbt-labs · dataders · Sep 16, 2022
commit c96365475d886f6ced08af392d607539874ad4eb
diff --git a/README.md b/README.md
@@ -4,15 +4,21 @@ It's without question that us dbters [stan](https://www.urbandictionary.com/defi
 
 This dbt project shows a trivial example of fuzzy string matching in Snowflake using dbt-snowflake Python models in Snowpark. [thefuzz](https://github.com/seatgeek/thefuzz) is the de facto package for this. While Snowflake SQL has the `EDITDISTANCE()` ([docs](https://docs.snowflake.com/en/sql-reference/functions/editdistance.html)) function, what we're after is "give me the best match for this string, as long as it's 'close enough'".
 
-This is easily accomplished with `process.extractOne()` ([source](https://github.com/seatgeek/thefuzz/blob/791c0bd18c77b4d9911f234c70808dbf24f74152/thefuzz/process.py#L200-L225))
+This is easily accomplished with `thefuzz.process.extractOne()` ([source](https://github.com/seatgeek/thefuzz/blob/791c0bd18c77b4d9911f234c70808dbf24f74152/thefuzz/process.py#L200-L225))
 
 
 ## Imaginary Scenario
 
+### Video Walkthrough
+
+If you'd prefer to hear a rambling overview, check out the [video walkthrough]().
+
 ### Shut up and show me the code!
 
 - [fuzzer.ipynb](fuzzer.ipynb): A notebook that shows you the code on your local machine 
-- [/models/fruit_join.py](/models/fruit_join.py): A notebook that shows you the code on your local machine 
+- [/models/v1/fruit_join.py](/models/v1/fruit_join.py): A Python model that does most of the transformation
+- [models/stage/stg_fruit_user_input.py](models/stage/stg_fruit_user_input.py): A Python staging model that only adds the fuzzy-matched column
+
 
 ### Background
 
@@ -110,4 +116,13 @@ df_final = (df_input
 
         return df_final
     ```
-3. to run this DAG, simply call `dbt build`!
+3. to run this DAG, simply call `dbt build`!
+
+
+#### Making the code more dbtonic
+
+All we're really doing is adding a new column to a raw dataset, which is the job of what dbt calls a staging model. So for v2, [models/stage/stg_fruit_user_input.py](models/stage/stg_fruit_user_input.py), the new column calculation is the only thing done in the staging model, and it is done in Python. Everything else happens in SQL in downstream models, as usual.
+
+
+From [dbt's best practices](https://docs.getdbt.com/guides/legacy/best-practices):
+> Source-centric transformations to transform data from different sources into a consistent structure, for example, re-aliasing and recasting columns, or unioning, joining or deduplicating source data to ensure your model has the correct grain.
diff --git a/dbt_project.yml b/dbt_project.yml
@@ -37,6 +37,9 @@ models:
     stage:
       stg_fruit_user_input:
         +materialized: table
+    v1:
+      fruit_join:
+        +enabled: false
 
 
 seeds:

diff --git a/models/v1/fruit_join.py b/models/v1/fruit_join.py
@@ -0,0 +1,40 @@
+from fuzzywuzzy import process
+
+
+def model(dbt, session):
+    dbt.config(
+        materialized="table",
+        packages=["fuzzywuzzy"]
+    )
+
+    df_input = dbt.ref("fruit_user_input").to_pandas()
+
+    df_price = dbt.ref("fruit_prices_fact").to_pandas()
+
+    def custom_scorer(string):
+        '''
+        For a given string, return the best match from the `fruit_name`
+        column of `df_price`, or None if nothing clears the score cutoff.
+        '''
+
+        x = process.extractOne(string, df_price["fruit_name"], score_cutoff=60)
+
+        if x is not None:
+            return x[0]
+        else:
+            return None
+
+    df_final = (df_input
+                # make new col, `fruit_name`, with best match against actual table
+                .assign(fruit_name=lambda df: df["fruit_user_input"].apply(custom_scorer))
+                # join the actual fruit price table
+                .merge(df_price, on="fruit_name")
+                # calculate subtotal
+                .assign(total=lambda df: df.quantity * df.cost)
+                # find total for each user
+                .groupby("user_name")["total"].sum()
+                .reset_index()
+                .sort_values("total", ascending=False)
+                )
+
+    return df_final
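
The "best match, as long as it's close enough" behaviour that `custom_scorer` relies on can be sketched without any third-party package. This is only an illustration, not the code in this commit: it swaps in `difflib.SequenceMatcher` from the standard library as a stand-in scorer, so its scores will differ from what `fuzzywuzzy` or `thefuzz` return. The names `similarity`, `best_match`, and the sample `fruit_names` list are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (stand-in for a fuzzy-matching scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(
    query: str, choices: Iterable[str], score_cutoff: float = 60
) -> Optional[str]:
    """Return the highest-scoring choice, or None if nothing reaches the cutoff."""
    scored = [(similarity(query, choice), choice) for choice in choices]
    if not scored:
        return None
    score, choice = max(scored, key=lambda pair: pair[0])
    return choice if score >= score_cutoff else None


# Hypothetical reference values, standing in for the `fruit_name` column.
fruit_names = ["apple", "banana", "cherry"]

print(best_match("appel", fruit_names))
print(best_match("xyz", fruit_names))
```

With these inputs, "appel" resolves to "apple", while "xyz" has no candidate above the cutoff and yields `None`. In the v1 model above, rows whose match is `None` have no counterpart in the price table, so the default inner `merge` drops them before totals are computed.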