|
373 | 373 | } |
374 | 374 | }, |
375 | 375 | "source": [ |
376 | | - "Great, so we maybe we can consider restaurant pairs with **high name similarity** as **matches**. But it's useful to use the other restaurant fields as well." |
| 376 | + "Great, so maybe we can consider restaurant pairs with **high name similarity** as **matches**. But it's useful to use the **fields** other than name as well." |
377 | 377 | ] |
378 | 378 | }, |
379 | 379 | { |
|
897 | 897 | } |
898 | 898 | }, |
899 | 899 | "source": [ |
900 | | - "Why? Becuase we'll create **scoring functions** to compare the fields of each record pair:" |
| 900 | + "We do that because we'll create **scoring functions** to compare the fields of each record pair:" |
901 | 901 | ] |
902 | 902 | }, |
903 | 903 | { |
904 | 904 | "cell_type": "code", |
905 | | - "execution_count": 17, |
| 905 | + "execution_count": 60, |
906 | 906 | "metadata": { |
907 | 907 | "slideshow": { |
908 | 908 | "slide_type": "fragment" |
|
917 | 917 | "\n", |
918 | 918 | "\n", |
919 | 919 | "def _compare_latlng(x, y):\n", |
920 | | - " return haversine.haversine(x, y, unit=haversine.Unit.MILES)\n", |
| 920 | + " return haversine.haversine(x, y, unit=haversine.Unit.KILOMETERS)\n", |
921 | 921 | "\n", |
922 | 922 | "\n", |
923 | 923 | "def compare_pair(record_x, record_y):\n", |
|
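For reference, the kilometer-based great-circle distance that `_compare_latlng` gets from the `haversine` package can be sketched with only the standard library (the `haversine_km` name is ours, not part of the notebook):

```python
import math


def haversine_km(p, q):
    """Great-circle distance in kilometers between two (lat, lng) points."""
    lat1, lng1 = map(math.radians, p)
    lat2, lng2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0088 * math.asin(math.sqrt(a))  # mean Earth radius in km
```

One degree of longitude at the equator is roughly 111.2 km, which is a quick sanity check for any haversine implementation.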
1396 | 1396 | "cell_type": "markdown", |
1397 | 1397 | "metadata": { |
1398 | 1398 | "slideshow": { |
1399 | | - "slide_type": "fragment" |
| 1399 | + "slide_type": "slide" |
1400 | 1400 | } |
1401 | 1401 | }, |
1402 | 1402 | "source": [ |
|
1632 | 1632 | "cell_type": "markdown", |
1633 | 1633 | "metadata": { |
1634 | 1634 | "slideshow": { |
1635 | | - "slide_type": "slide" |
| 1635 | + "slide_type": "fragment" |
1636 | 1636 | } |
1637 | 1637 | }, |
1638 | 1638 | "source": [ |
|
1896 | 1896 | } |
1897 | 1897 | }, |
1898 | 1898 | "source": [ |
1899 | | - "That can work well on simple and small datasets, but for complex and larger ones, a **Machine Learning Classifier** can help us to define which pairs are matches or not:" |
| 1899 | + "That can work well on small and simple datasets, but for larger and more complex ones, a **Machine Learning Classifier** will probably work better:" |
1900 | 1900 | ] |
1901 | 1901 | }, |
1902 | 1902 | { |
|
1931 | 1931 | } |
1932 | 1932 | }, |
1933 | 1933 | "source": [ |
1934 | | - "The problem is: how to train that classifier? \n", |
| 1934 | + "The problem is: how to train that classifier?\n", |
| 1935 | + "\n", |
1935 | 1936 | "It can be challenging to **manually find matching pairs** in a gigantic dataset, because the number of matching pairs tends to be much smaller than the number of non-matching pairs." |
1936 | 1937 | ] |
1937 | 1938 | }, |
|
2129 | 2130 | "df_with_truth.head(9)" |
2130 | 2131 | ] |
2131 | 2132 | }, |
| 2133 | + { |
| 2134 | + "cell_type": "markdown", |
| 2135 | + "metadata": { |
| 2136 | + "slideshow": { |
| 2137 | + "slide_type": "slide" |
| 2138 | + } |
| 2139 | + }, |
| 2140 | + "source": [ |
| 2141 | + "The dataset comes with the **true matches** indicated by the `cluster` column. We use that to compute the `golden_pairs_set`:" |
| 2142 | + ] |
| 2143 | + }, |
2132 | 2144 | { |
2133 | 2145 | "cell_type": "code", |
2134 | 2146 | "execution_count": 30, |
2135 | 2147 | "metadata": { |
2136 | 2148 | "slideshow": { |
2137 | | - "slide_type": "slide" |
| 2149 | + "slide_type": "fragment" |
2138 | 2150 | } |
2139 | 2151 | }, |
2140 | 2152 | "outputs": [ |
|
2158 | 2170 | "len(golden_pairs_set)" |
2159 | 2171 | ] |
2160 | 2172 | }, |
2161 | | - { |
2162 | | - "cell_type": "markdown", |
2163 | | - "metadata": { |
2164 | | - "slideshow": { |
2165 | | - "slide_type": "fragment" |
2166 | | - } |
2167 | | - }, |
2168 | | - "source": [ |
2169 | | - "The dataset comes with the **true matches** indicated by the `cluster` column." |
2170 | | - ] |
2171 | | - }, |
2172 | 2173 | { |
2173 | 2174 | "cell_type": "markdown", |
2174 | 2175 | "metadata": { |
|
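As a toy sketch of how golden pairs come out of a cluster column (record ids and cluster ids here are made up, not the notebook's data): records sharing a cluster id are true duplicates, so every within-cluster pair is golden.

```python
import itertools
from collections import defaultdict

# Toy stand-in for the `cluster` column: record id -> cluster id.
clusters = {0: "a", 1: "a", 2: "b", 3: "b", 4: "b"}

# Group record ids by cluster, then take all pairs within each cluster.
by_cluster = defaultdict(list)
for record_id, cluster_id in clusters.items():
    by_cluster[cluster_id].append(record_id)

golden_pairs_set = {
    frozenset(pair)
    for members in by_cluster.values()
    for pair in itertools.combinations(members, 2)
}
```

Using `frozenset` makes each pair order-independent, so `(0, 1)` and `(1, 0)` count once.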
2177 | 2178 | } |
2178 | 2179 | }, |
2179 | 2180 | "source": [ |
2180 | | - "We'll remove the `phone` and `type` to makes things more **difficult**:" |
| 2181 | + "We'll remove the `phone` and `type` fields to make things more **difficult**:" |
2181 | 2182 | ] |
2182 | 2183 | }, |
2183 | 2184 | { |
|
2555 | 2556 | }, |
2556 | 2557 | { |
2557 | 2558 | "cell_type": "code", |
2558 | | - "execution_count": 36, |
| 2559 | + "execution_count": 61, |
2559 | 2560 | "metadata": { |
2560 | 2561 | "slideshow": { |
2561 | 2562 | "slide_type": "fragment" |
|
2588 | 2589 | " 'type': 'LatLong'\n", |
2589 | 2590 | " },\n", |
2590 | 2591 | "]\n", |
2591 | | - "\n", |
2592 | 2592 | "deduper = RFDedupe(fields, num_cores=os.cpu_count())" |
2593 | 2593 | ] |
2594 | 2594 | }, |
|
2600 | 2600 | } |
2601 | 2601 | }, |
2602 | 2602 | "source": [ |
2603 | | - "Our `RFDedupe` is a bit different than the original `Dedupe`, because we changed it to use a [Random Forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from scikit-learn. By default Dedupe uses a simpler logistic regression model.\n", |
| 2603 | + "Our `RFDedupe` is a bit different from the original `Dedupe`, because we changed it to use a [**Random Forest classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from scikit-learn. By default, Dedupe uses a simpler logistic regression model.\n", |
2604 | 2604 | "\n", |
2605 | 2605 | "Use our code as a base and try different classifiers!" |
2606 | 2606 | ] |
|
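The swap itself amounts to fitting a scikit-learn `RandomForestClassifier` on the per-field distance vectors that Dedupe computes for each pair. A toy version with made-up features (the real feature vectors come from Dedupe's comparators):

```python
from sklearn.ensemble import RandomForestClassifier

# Toy training data: one row of per-field distances per record pair.
X = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]]
y = [1, 0, 1, 0]  # 1 = match, 0 = non-match

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Probability of the "match" class for a new pair's distance vector.
match_probability = clf.predict_proba([[0.15, 0.15]])[0, 1]
```

Any scikit-learn classifier exposing `fit`/`predict_proba` could be dropped in the same way.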
2647 | 2647 | "execution_count": 38, |
2648 | 2648 | "metadata": { |
2649 | 2649 | "slideshow": { |
2650 | | - "slide_type": "slide" |
| 2650 | + "slide_type": "fragment" |
2651 | 2651 | } |
2652 | 2652 | }, |
2653 | 2653 | "outputs": [], |
|
2675 | 2675 | "cell_type": "code", |
2676 | 2676 | "execution_count": 39, |
2677 | 2677 | "metadata": { |
| 2678 | + "scrolled": true, |
2678 | 2679 | "slideshow": { |
2679 | 2680 | "slide_type": "fragment" |
2680 | 2681 | } |
|
3082 | 3083 | } |
3083 | 3084 | }, |
3084 | 3085 | "source": [ |
3085 | | - "After training, we can see which **blocking predicates** (indexing rules) the deduper learned from our training input. It's good to do that to check if we trained enough:" |
| 3086 | + "After training, we can see which **blocking fingerprints** the deduper learned from our training input.\n", |
| 3087 | + "\n", |
| 3088 | + "Checking them is a good way to see whether we've trained enough:" |
3086 | 3089 | ] |
3087 | 3090 | }, |
3088 | 3091 | { |
|
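A single blocking fingerprint can be as simple as "first word of the name". This toy sketch (with invented restaurant names) groups record ids by fingerprint so that only records sharing a block get compared pairwise:

```python
from collections import defaultdict

# Toy records: id -> restaurant name.
records = {1: "arnie morton's of chicago", 2: "arnie morton's", 3: "art's deli"}

blocks = defaultdict(set)
for record_id, name in records.items():
    blocks[name.split()[0]].add(record_id)  # fingerprint = first word of name

# Only pairs inside the same block become candidates for scoring,
# which avoids comparing every record with every other record.
```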
3120 | 3123 | } |
3121 | 3124 | }, |
3122 | 3125 | "source": [ |
3123 | | - "The deduper selected those predicates from this extense list of **possible predicates**:" |
| 3126 | + "The deduper selected those fingerprints from this extensive list of **possible fingerprints**:" |
3124 | 3127 | ] |
3125 | 3128 | }, |
3126 | 3129 | { |
|
3311 | 3314 | } |
3312 | 3315 | }, |
3313 | 3316 | "source": [ |
3314 | | - "Then we **score** those blocked pairs. Dedupe calls for us the similarity functions based on the types we've passed on constructor and then passes the similarities/distances to the classifier.\n", |
| 3317 | + "Then we **score** those blocked pairs. Dedupe calls the similarity functions for us, based on the field types we've set in the constructor, and then passes the similarities/distances to the classifier.\n", |
3315 | 3318 | "\n", |
3316 | 3319 | "Internally, it looks like this:\n", |
3317 | 3320 | "\n", |
|
3336 | 3339 | }, |
3337 | 3340 | { |
3338 | 3341 | "cell_type": "code", |
3339 | | - "execution_count": 63, |
| 3342 | + "execution_count": 45, |
3340 | 3343 | "metadata": { |
3341 | 3344 | "slideshow": { |
3342 | 3345 | "slide_type": "fragment" |
|
3349 | 3352 | }, |
3350 | 3353 | { |
3351 | 3354 | "cell_type": "code", |
3352 | | - "execution_count": 64, |
| 3355 | + "execution_count": 46, |
3353 | 3356 | "metadata": { |
3354 | 3357 | "slideshow": { |
3355 | 3358 | "slide_type": "fragment" |
|
3371 | 3374 | " ([ 9, 10], 1.)]" |
3372 | 3375 | ] |
3373 | 3376 | }, |
3374 | | - "execution_count": 64, |
| 3377 | + "execution_count": 46, |
3375 | 3378 | "metadata": {}, |
3376 | 3379 | "output_type": "execute_result" |
3377 | 3380 | } |
|
3392 | 3395 | "source": [ |
3393 | 3396 | "Note there are records with very low similarity in our `scored_pairs` result. Like `([6, 7], 0.04)`.\n", |
3394 | 3397 | "\n", |
3395 | | - "We need to use a `threshold` to filter out low similarity pairs.\n", |
3396 | | - "\n", |
3397 | | - "Understand the `threshold` allows us to **trade-off between [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall)**, i.e., if you want to be more or less sensitive on matching records, at the risk of introducing false positives (if more sensitive) or false negatives (if less sensitive)." |
| 3398 | + "We need to use a `threshold` to filter out low similarity pairs." |
3398 | 3399 | ] |
3399 | 3400 | }, |
3400 | 3401 | { |
3401 | 3402 | "cell_type": "code", |
3402 | | - "execution_count": 47, |
| 3403 | + "execution_count": 65, |
3403 | 3404 | "metadata": { |
3404 | 3405 | "slideshow": { |
3405 | 3406 | "slide_type": "fragment" |
|
3421 | 3422 | " ([ 9, 12], 0.95)]" |
3422 | 3423 | ] |
3423 | 3424 | }, |
3424 | | - "execution_count": 47, |
| 3425 | + "execution_count": 65, |
3425 | 3426 | "metadata": {}, |
3426 | 3427 | "output_type": "execute_result" |
3427 | 3428 | } |
|
3432 | 3433 | "list(threshold_pairs)[:10]" |
3433 | 3434 | ] |
3434 | 3435 | }, |
| 3436 | + { |
| 3437 | + "cell_type": "markdown", |
| 3438 | + "metadata": { |
| 3439 | + "slideshow": { |
| 3440 | + "slide_type": "fragment" |
| 3441 | + } |
| 3442 | + }, |
| 3443 | + "source": [ |
| 3444 | + "Note that the `threshold` lets us **trade off between [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall)**, i.e., decide how sensitive the matching should be, at the risk of introducing false positives (if more sensitive) or false negatives (if less sensitive)." |
| 3445 | + ] |
| 3446 | + }, |
3435 | 3447 | { |
3436 | 3448 | "cell_type": "markdown", |
3437 | 3449 | "metadata": { |
|
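The thresholding step is just a filter on the classifier's scores. A sketch echoing the example scores above (the `(3, 8)` pair is invented for illustration):

```python
# Scored pairs as (record id pair, match probability from the classifier).
scored_pairs = [((6, 7), 0.04), ((3, 8), 0.40), ((9, 10), 1.0), ((9, 12), 0.95)]

threshold = 0.5  # raise it for more precision, lower it for more recall
threshold_pairs = [(ids, score) for ids, score in scored_pairs if score >= threshold]
```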
3535 | 3547 | "execution_count": 52, |
3536 | 3548 | "metadata": { |
3537 | 3549 | "slideshow": { |
3538 | | - "slide_type": "slide" |
| 3550 | + "slide_type": "skip" |
3539 | 3551 | } |
3540 | 3552 | }, |
3541 | 3553 | "outputs": [ |
|
4200 | 4212 | "- By deduplicating, we find:\n", |
4201 | 4213 | "- `(A, B)` match\n", |
4202 | 4214 | "- `(B, C)` match\n", |
4203 | | - "- `(A, C)` nonmatch\n", |
| 4215 | + "- `(A, C)` non-match\n", |
4204 | 4216 | "- And that doesn't make sense!\n", |
4205 | 4217 | "\n", |
4206 | 4218 | "The solution for that ambiguity is computing the **Transitive Closure** through clustering." |
|
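One standard way to compute that transitive closure is union-find over the matched pairs. A minimal sketch (not Dedupe's actual clustering code): `A~B` and `B~C` put `A`, `B`, and `C` in one cluster even though the `(A, C)` pair scored as a non-match.

```python
parent = {}


def find(x):
    """Return the cluster representative for x, creating a singleton if new."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x


def union(x, y):
    """Merge the clusters containing x and y."""
    parent[find(x)] = find(y)


for a, b in [("A", "B"), ("B", "C")]:  # matched pairs only
    union(a, b)
```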