|
373 | 373 | } |
374 | 374 | }, |
375 | 375 | "source": [ |
376 | | - "Great, so we maybe we can consider restaurant pairs with **high name similarity** as **matches**. But it's useful to use the other restaurant fields as well." |
| 376 | + "Great, so maybe we can consider restaurant pairs with **high name similarity** as **matches**. But it's useful to use the **fields** other than name as well." |
377 | 377 | ] |
378 | 378 | }, |
379 | 379 | { |
|
897 | 897 | } |
898 | 898 | }, |
899 | 899 | "source": [ |
900 | | - "Why? Becuase we'll create **scoring functions** to compare the fields of each record pair:" |
| 900 | + "We do that because we'll create **scoring functions** to compare the fields of each record pair:" |
901 | 901 | ] |
902 | 902 | }, |
903 | 903 | { |
904 | 904 | "cell_type": "code", |
905 | | - "execution_count": 17, |
| 905 | + "execution_count": 60, |
906 | 906 | "metadata": { |
907 | 907 | "slideshow": { |
908 | 908 | "slide_type": "fragment" |
|
917 | 917 | "\n", |
918 | 918 | "\n", |
919 | 919 | "def _compare_latlng(x, y):\n", |
920 | | - " return haversine.haversine(x, y, unit=haversine.Unit.MILES)\n", |
| 920 | + " return haversine.haversine(x, y, unit=haversine.Unit.KILOMETERS)\n", |
921 | 921 | "\n", |
922 | 922 | "\n", |
923 | 923 | "def compare_pair(record_x, record_y):\n", |
|
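For reference, the kilometer-based great-circle distance that `_compare_latlng` gets from the `haversine` package can be sketched with only the standard library (the `haversine_km` name is ours, not part of the notebook):

```python
import math


def haversine_km(p, q):
    """Great-circle distance in kilometers between two (lat, lng) points."""
    lat1, lng1 = map(math.radians, p)
    lat2, lng2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0088 * math.asin(math.sqrt(a))  # mean Earth radius in km
```

One degree of longitude at the equator is roughly 111.2 km, which is a quick sanity check for any haversine implementation.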
1396 | 1396 | "cell_type": "markdown", |
1397 | 1397 | "metadata": { |
1398 | 1398 | "slideshow": { |
1399 | | - "slide_type": "fragment" |
| 1399 | + "slide_type": "slide" |
1400 | 1400 | } |
1401 | 1401 | }, |
1402 | 1402 | "source": [ |
|
1632 | 1632 | "cell_type": "markdown", |
1633 | 1633 | "metadata": { |
1634 | 1634 | "slideshow": { |
1635 | | - "slide_type": "slide" |
| 1635 | + "slide_type": "fragment" |
1636 | 1636 | } |
1637 | 1637 | }, |
1638 | 1638 | "source": [ |
|
1896 | 1896 | } |
1897 | 1897 | }, |
1898 | 1898 | "source": [ |
1899 | | - "That can work well on simple and small datasets, but for complex and larger ones, a **Machine Learning Classifier** can help us to define which pairs are matches or not:" |
| 1899 | + "That can work well on small and simple datasets, but for larger and more complex ones, a **Machine Learning Classifier** will probably work better:" |
1900 | 1900 | ] |
1901 | 1901 | }, |
1902 | 1902 | { |
|
1931 | 1931 | } |
1932 | 1932 | }, |
1933 | 1933 | "source": [ |
1934 | | - "The problem is: how to train that classifier? \n", |
| 1934 | + "The problem is: how to train that classifier?\n", |
| 1935 | + "\n", |
1935 | 1936 | "It can be challenging to **manually find matching pairs** in a gigantic dataset, because the number of matching pairs tends to be much smaller than the number of non-matching pairs." |
1936 | 1937 | ] |
1937 | 1938 | }, |
|
2129 | 2130 | "df_with_truth.head(9)" |
2130 | 2131 | ] |
2131 | 2132 | }, |
| 2133 | + { |
| 2134 | + "cell_type": "markdown", |
| 2135 | + "metadata": { |
| 2136 | + "slideshow": { |
| 2137 | + "slide_type": "slide" |
| 2138 | + } |
| 2139 | + }, |
| 2140 | + "source": [ |
| 2141 | + "The dataset comes with the **true matches** indicated by the `cluster` column. We use that to compute the `golden_pairs_set`:" |
| 2142 | + ] |
| 2143 | + }, |
2132 | 2144 | { |
2133 | 2145 | "cell_type": "code", |
2134 | 2146 | "execution_count": 30, |
2135 | 2147 | "metadata": { |
2136 | 2148 | "slideshow": { |
2137 | | - "slide_type": "slide" |
| 2149 | + "slide_type": "fragment" |
2138 | 2150 | } |
2139 | 2151 | }, |
2140 | 2152 | "outputs": [ |
|
2158 | 2170 | "len(golden_pairs_set)" |
2159 | 2171 | ] |
2160 | 2172 | }, |
2161 | | - { |
2162 | | - "cell_type": "markdown", |
2163 | | - "metadata": { |
2164 | | - "slideshow": { |
2165 | | - "slide_type": "fragment" |
2166 | | - } |
2167 | | - }, |
2168 | | - "source": [ |
2169 | | - "The dataset comes with the **true matches** indicated by the `cluster` column." |
2170 | | - ] |
2171 | | - }, |
2172 | 2173 | { |
2173 | 2174 | "cell_type": "markdown", |
2174 | 2175 | "metadata": { |
|
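As a toy sketch of how golden pairs come out of a cluster column (record ids and cluster ids here are made up, not the notebook's data): records sharing a cluster id are true duplicates, so every within-cluster pair is golden.

```python
import itertools
from collections import defaultdict

# Toy stand-in for the `cluster` column: record id -> cluster id.
clusters = {0: "a", 1: "a", 2: "b", 3: "b", 4: "b"}

# Group record ids by cluster, then take all pairs within each cluster.
by_cluster = defaultdict(list)
for record_id, cluster_id in clusters.items():
    by_cluster[cluster_id].append(record_id)

golden_pairs_set = {
    frozenset(pair)
    for members in by_cluster.values()
    for pair in itertools.combinations(members, 2)
}
```

Using `frozenset` makes each pair order-independent, so `(0, 1)` and `(1, 0)` count once.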
2177 | 2178 | } |
2178 | 2179 | }, |
2179 | 2180 | "source": [ |
2180 | | - "We'll remove the `phone` and `type` to makes things more **difficult**:" |
| 2181 | + "We'll remove the `phone` and `type` fields to make things more **difficult**:" |
2181 | 2182 | ] |
2182 | 2183 | }, |
2183 | 2184 | { |
|
2555 | 2556 | }, |
2556 | 2557 | { |
2557 | 2558 | "cell_type": "code", |
2558 | | - "execution_count": 36, |
| 2559 | + "execution_count": 61, |
2559 | 2560 | "metadata": { |
2560 | 2561 | "slideshow": { |
2561 | 2562 | "slide_type": "fragment" |
|
2588 | 2589 | " 'type': 'LatLong'\n", |
2589 | 2590 | " },\n", |
2590 | 2591 | "]\n", |
2591 | | - "\n", |
2592 | 2592 | "deduper = RFDedupe(fields, num_cores=os.cpu_count())" |
2593 | 2593 | ] |
2594 | 2594 | }, |
|
2600 | 2600 | } |
2601 | 2601 | }, |
2602 | 2602 | "source": [ |
2603 | | - "Our `RFDedupe` is a bit different than the original `Dedupe`, because we changed it to use a [Random Forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from scikit-learn. By default Dedupe uses a simpler logistic regression model.\n", |
| 2603 | + "Our `RFDedupe` is a bit different from the original `Dedupe`, because we changed it to use a [**Random Forest classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from scikit-learn. By default, Dedupe uses a simpler logistic regression model.\n", |
2604 | 2604 | "\n", |
2605 | 2605 | "Use our code as a base and try different classifiers!" |
2606 | 2606 | ] |
|
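The swap itself amounts to fitting a scikit-learn `RandomForestClassifier` on the per-field distance vectors that Dedupe computes for each pair. A toy version with made-up features (the real feature vectors come from Dedupe's comparators):

```python
from sklearn.ensemble import RandomForestClassifier

# Toy training data: one row of per-field distances per record pair.
X = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]]
y = [1, 0, 1, 0]  # 1 = match, 0 = non-match

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Probability of the "match" class for a new pair's distance vector.
match_probability = clf.predict_proba([[0.15, 0.15]])[0, 1]
```

Any scikit-learn classifier exposing `fit`/`predict_proba` could be dropped in the same way.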
2647 | 2647 | "execution_count": 38, |
2648 | 2648 | "metadata": { |
2649 | 2649 | "slideshow": { |
2650 | | - "slide_type": "slide" |
| 2650 | + "slide_type": "fragment" |
2651 | 2651 | } |
2652 | 2652 | }, |
2653 | 2653 | "outputs": [], |
|
2675 | 2675 | "cell_type": "code", |
2676 | 2676 | "execution_count": 39, |
2677 | 2677 | "metadata": { |
| 2678 | + "scrolled": true, |
2678 | 2679 | "slideshow": { |
2679 | 2680 | "slide_type": "fragment" |
2680 | 2681 | } |
|
3082 | 3083 | } |
3083 | 3084 | }, |
3084 | 3085 | "source": [ |
3085 | | - "After training, we can see which **blocking predicates** (indexing rules) the deduper learned from our training input. It's good to do that to check if we trained enough:" |
| 3086 | + "After training, we can see which **blocking fingerprints** the deduper learned from our training input.\n", |
| 3087 | + "\n", |
| 3088 | + "Checking them is a good way to see whether we've trained enough:" |
3086 | 3089 | ] |
3087 | 3090 | }, |
3088 | 3091 | { |
|
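A single blocking fingerprint can be as simple as "first word of the name". This toy sketch (with invented restaurant names) groups record ids by fingerprint so that only records sharing a block get compared pairwise:

```python
from collections import defaultdict

# Toy records: id -> restaurant name.
records = {1: "arnie morton's of chicago", 2: "arnie morton's", 3: "art's deli"}

blocks = defaultdict(set)
for record_id, name in records.items():
    blocks[name.split()[0]].add(record_id)  # fingerprint = first word of name

# Only pairs inside the same block become candidates for scoring,
# which avoids comparing every record with every other record.
```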
3120 | 3123 | } |
3121 | 3124 | }, |
3122 | 3125 | "source": [ |
3123 | | - "The deduper selected those predicates from this extense list of **possible predicates**:" |
| 3126 | + "The deduper selected those fingerprints from this extensive list of **possible fingerprints**:" |
3124 | 3127 | ] |
3125 | 3128 | }, |
3126 | 3129 | { |
|
3311 | 3314 | } |
3312 | 3315 | }, |
3313 | 3316 | "source": [ |
3314 | | - "Then we **score** those blocked pairs. Dedupe calls for us the similarity functions based on the types we've passed on constructor and then passes the similarities/distances to the classifier.\n", |
| 3317 | + "Then we **score** those blocked pairs. Dedupe calls the similarity functions for us, based on the field types we've set in the constructor, and then passes the similarities/distances to the classifier.\n", |
3315 | 3318 | "\n", |
3316 | 3319 | "Internally, it looks like this:\n", |
3317 | 3320 | "\n", |
|
3336 | 3339 | }, |
3337 | 3340 | { |
3338 | 3341 | "cell_type": "code", |
3339 | | - "execution_count": 63, |
| 3342 | + "execution_count": 45, |
3340 | 3343 | "metadata": { |
3341 | 3344 | "slideshow": { |
3342 | 3345 | "slide_type": "fragment" |
|
3349 | 3352 | }, |
3350 | 3353 | { |
3351 | 3354 | "cell_type": "code", |
3352 | | - "execution_count": 64, |
| 3355 | + "execution_count": 46, |
3353 | 3356 | "metadata": { |
3354 | 3357 | "slideshow": { |
3355 | 3358 | "slide_type": "fragment" |
|
3371 | 3374 | " ([ 9, 10], 1.)]" |
3372 | 3375 | ] |
3373 | 3376 | }, |
3374 | | - "execution_count": 64, |
| 3377 | + "execution_count": 46, |
3375 | 3378 | "metadata": {}, |
3376 | 3379 | "output_type": "execute_result" |
3377 | 3380 | } |
|
3392 | 3395 | "source": [ |
3393 | 3396 | "Note there are records with very low similarity in our `scored_pairs` result. Like `([6, 7], 0.04)`.\n", |
3394 | 3397 | "\n", |
3395 | | - "We need to use a `threshold` to filter out low similarity pairs.\n", |
3396 | | - "\n", |
3397 | | - "Understand the `threshold` allows us to **trade-off between [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall)**, i.e., if you want to be more or less sensitive on matching records, at the risk of introducing false positives (if more sensitive) or false negatives (if less sensitive)." |
| 3398 | + "We need to use a `threshold` to filter out low similarity pairs." |
3398 | 3399 | ] |
3399 | 3400 | }, |
3400 | 3401 | { |
3401 | 3402 | "cell_type": "code", |
3402 | | - "execution_count": 47, |
| 3403 | + "execution_count": 65, |
3403 | 3404 | "metadata": { |
3404 | 3405 | "slideshow": { |
3405 | 3406 | "slide_type": "fragment" |
|
3421 | 3422 | " ([ 9, 12], 0.95)]" |
3422 | 3423 | ] |
3423 | 3424 | }, |
3424 | | - "execution_count": 47, |
| 3425 | + "execution_count": 65, |
3425 | 3426 | "metadata": {}, |
3426 | 3427 | "output_type": "execute_result" |
3427 | 3428 | } |
|
3432 | 3433 | "list(threshold_pairs)[:10]" |
3433 | 3434 | ] |
3434 | 3435 | }, |
| 3436 | + { |
| 3437 | + "cell_type": "markdown", |
| 3438 | + "metadata": { |
| 3439 | + "slideshow": { |
| 3440 | + "slide_type": "fragment" |
| 3441 | + } |
| 3442 | + }, |
| 3443 | + "source": [ |
| 3444 | + "Note that the `threshold` lets us **trade off between [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall)**, i.e., decide how sensitive the matching should be, at the risk of introducing false positives (if more sensitive) or false negatives (if less sensitive)." |
| 3445 | + ] |
| 3446 | + }, |
3435 | 3447 | { |
3436 | 3448 | "cell_type": "markdown", |
3437 | 3449 | "metadata": { |
|
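The thresholding step is just a filter on the classifier's scores. A sketch echoing the example scores above (the `(3, 8)` pair is invented for illustration):

```python
# Scored pairs as (record id pair, match probability from the classifier).
scored_pairs = [((6, 7), 0.04), ((3, 8), 0.40), ((9, 10), 1.0), ((9, 12), 0.95)]

threshold = 0.5  # raise it for more precision, lower it for more recall
threshold_pairs = [(ids, score) for ids, score in scored_pairs if score >= threshold]
```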
3535 | 3547 | "execution_count": 52, |
3536 | 3548 | "metadata": { |
3537 | 3549 | "slideshow": { |
3538 | | - "slide_type": "slide" |
| 3550 | + "slide_type": "skip" |
3539 | 3551 | } |
3540 | 3552 | }, |
3541 | 3553 | "outputs": [ |
|
4200 | 4212 | "- By deduplicating, we find:\n", |
4201 | 4213 | "- `(A, B)` match\n", |
4202 | 4214 | "- `(B, C)` match\n", |
4203 | | - "- `(A, C)` nonmatch\n", |
| 4215 | + "- `(A, C)` non-match\n", |
4204 | 4216 | "- And that doesn't make sense!\n", |
4205 | 4217 | "\n", |
4206 | 4218 | "The solution for that ambiguity is computing the **Transitive Closure** through clustering." |
|
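One standard way to compute that transitive closure is union-find over the matched pairs. A minimal sketch (not Dedupe's actual clustering code): `A~B` and `B~C` put `A`, `B`, and `C` in one cluster even though the `(A, C)` pair scored as a non-match.

```python
parent = {}


def find(x):
    """Return the cluster representative for x, creating a singleton if new."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x


def union(x, y):
    """Merge the clusters containing x and y."""
    parent[find(x)] = find(y)


for a, b in [("A", "B"), ("B", "C")]:  # matched pairs only
    union(a, b)
```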