
Commit ddb736d

Additional slides fixes on slides-with-dedupe-2.ipynb after rehearsal

1 parent ad1ef6d commit ddb736d

File tree

2 files changed: +51 −40 lines

rise.css

Lines changed: 1 addition & 2 deletions
@@ -2,8 +2,7 @@
   background-color: #fff;
 }
 
-
-.rendered_html table, .rendered_html th, .rendered_html tr, .rendered_html td {
+.rise-enabled .rendered_html table, .rise-enabled .rendered_html th, .rise-enabled .rendered_html tr, .rise-enabled .rendered_html td {
   font-size: 100%;
 }

slides-with-dedupe-2.ipynb

Lines changed: 50 additions & 38 deletions
@@ -373,7 +373,7 @@
 }
 },
 "source": [
-"Great, so we maybe we can consider restaurant pairs with **high name similarity** as **matches**. But it's useful to use the other restaurant fields as well."
+"Great, so maybe we can consider restaurant pairs with **high name similarity** as **matches**. But it's useful to use the **fields** other than name as well."
 ]
 },
 {
@@ -897,12 +897,12 @@
 }
 },
 "source": [
-"Why? Becuase we'll create **scoring functions** to compare the fields of each record pair:"
+"We do that because we'll create **scoring functions** to compare the fields of each record pair:"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 17,
+"execution_count": 60,
 "metadata": {
 "slideshow": {
 "slide_type": "fragment"
@@ -917,7 +917,7 @@
 "\n",
 "\n",
 "def _compare_latlng(x, y):\n",
-"    return haversine.haversine(x, y, unit=haversine.Unit.MILES)\n",
+"    return haversine.haversine(x, y, unit=haversine.Unit.KILOMETERS)\n",
 "\n",
 "\n",
 "def compare_pair(record_x, record_y):\n",
@@ -1396,7 +1396,7 @@
 "cell_type": "markdown",
 "metadata": {
 "slideshow": {
-"slide_type": "fragment"
+"slide_type": "slide"
 }
 },
 "source": [
@@ -1632,7 +1632,7 @@
 "cell_type": "markdown",
 "metadata": {
 "slideshow": {
-"slide_type": "slide"
+"slide_type": "fragment"
 }
 },
 "source": [
@@ -1896,7 +1896,7 @@
 }
 },
 "source": [
-"That can work well on simple and small datasets, but for complex and larger ones, a **Machine Learning Classifier** can help us to define which pairs are matches or not:"
+"That can work well on simple and small datasets, but for complex and larger ones, a **Machine Learning Classifier** will probably work better:"
 ]
 },
 {
@@ -1931,7 +1931,8 @@
 }
 },
 "source": [
-"The problem is: how to train that classifier? \n",
+"The problem is: how to train that classifier?\n",
+"\n",
 "It can be challenging to **manually find matching pairs** in a gigantic dataset, because the number of matching pairs tends to be much smaller than the number of non-matching pairs."
 ]
 },
@@ -2129,12 +2130,23 @@
 "df_with_truth.head(9)"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {
+"slideshow": {
+"slide_type": "slide"
+}
+},
+"source": [
+"The dataset comes with the **true matches** indicated by the `cluster` column. We use that to compute the `golden_pairs_set`:"
+]
+},
 {
 "cell_type": "code",
 "execution_count": 30,
 "metadata": {
 "slideshow": {
-"slide_type": "slide"
+"slide_type": "fragment"
 }
 },
 "outputs": [
@@ -2158,17 +2170,6 @@
 "len(golden_pairs_set)"
 ]
 },
-{
-"cell_type": "markdown",
-"metadata": {
-"slideshow": {
-"slide_type": "fragment"
-}
-},
-"source": [
-"The dataset comes with the **true matches** indicated by the `cluster` column."
-]
-},
 {
 "cell_type": "markdown",
 "metadata": {
21772178
}
21782179
},
21792180
"source": [
2180-
"We'll remove the `phone` and `type` to makes things more **difficult**:"
2181+
"We'll remove the `phone` and `type` fields to makes things more **difficult**:"
21812182
]
21822183
},
21832184
{
@@ -2555,7 +2556,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 36,
+"execution_count": 61,
 "metadata": {
 "slideshow": {
 "slide_type": "fragment"
@@ -2588,7 +2589,6 @@
 "    'type': 'LatLong'\n",
 "  },\n",
 "]\n",
-"\n",
 "deduper = RFDedupe(fields, num_cores=os.cpu_count())"
 ]
 },
@@ -2600,7 +2600,7 @@
 }
 },
 "source": [
-"Our `RFDedupe` is a bit different than the original `Dedupe`, because we changed it to use a [Random Forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from scikit-learn. By default Dedupe uses a simpler logistic regression model.\n",
+"Our `RFDedupe` is a bit different from the original `Dedupe`, because we changed it to use a [**Random Forest classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from scikit-learn. By default Dedupe uses a simpler logistic regression model.\n",
 "\n",
 "Use our code as a base and try different classifiers!"
 ]
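The `RFDedupe` implementation itself is not shown in this diff. As a hedged sketch, it could be as small as swapping Dedupe's `classifier` attribute, which only needs a scikit-learn-style `fit`/`predict_proba` interface; that this matches the slides' actual code is an assumption.

import dedupe
from sklearn.ensemble import RandomForestClassifier

class RFDedupe(dedupe.Dedupe):
    """Dedupe variant that classifies record pairs with a Random Forest."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Replace the default logistic-regression learner (assumed
        # replaceable; any object with fit/predict_proba should work)
        self.classifier = RandomForestClassifier(n_estimators=100)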
@@ -2647,7 +2647,7 @@
 "execution_count": 38,
 "metadata": {
 "slideshow": {
-"slide_type": "slide"
+"slide_type": "fragment"
 }
 },
 "outputs": [],
@@ -2675,6 +2675,7 @@
 "cell_type": "code",
 "execution_count": 39,
 "metadata": {
+"scrolled": true,
 "slideshow": {
 "slide_type": "fragment"
 }
@@ -3082,7 +3083,9 @@
 }
 },
 "source": [
-"After training, we can see which **blocking predicates** (indexing rules) the deduper learned from our training input. It's good to do that to check if we trained enough:"
+"After training, we can see which **blocking fingerprints** the deduper learned from our training input.\n",
+"\n",
+"It's good to do that to check if we trained enough:"
 ]
 },
 {
@@ -3120,7 +3123,7 @@
 }
 },
 "source": [
-"The deduper selected those predicates from this extense list of **possible predicates**:"
+"The deduper selected those fingerprints from this extensive list of **possible fingerprints**:"
 ]
 },
 {
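For readers unfamiliar with blocking: a fingerprint (predicate) maps a record to a key so that only records sharing a key get compared pairwise. The first-token predicate below is a hypothetical example for illustration, not necessarily one the deduper learned here.

def first_name_token(record):
    # Key a record by the first word of its name; records sharing the
    # key land in the same block and get compared pairwise
    return (record["name"].split()[0],)

print(first_name_token({"name": "arnie morton's of chicago"}))  # ('arnie',)
print(first_name_token({"name": "arnie mortons"}))              # ('arnie',) -> same block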
@@ -3311,7 +3314,7 @@
 }
 },
 "source": [
-"Then we **score** those blocked pairs. Dedupe calls for us the similarity functions based on the types we've passed on constructor and then passes the similarities/distances to the classifier.\n",
+"Then we **score** those blocked pairs. Dedupe calls the similarity functions for us, based on the types we've set on the constructor, and then passes the similarities/distances to the classifier.\n",
 "\n",
 "Internally, it looks like this:\n",
 "\n",
@@ -3336,7 +3339,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 63,
+"execution_count": 45,
 "metadata": {
 "slideshow": {
 "slide_type": "fragment"
@@ -3349,7 +3352,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 64,
+"execution_count": 46,
 "metadata": {
 "slideshow": {
 "slide_type": "fragment"
@@ -3371,7 +3374,7 @@
 "  ([ 9, 10], 1.)]"
 ]
 },
-"execution_count": 64,
+"execution_count": 46,
 "metadata": {},
 "output_type": "execute_result"
 }
@@ -3392,14 +3395,12 @@
 "source": [
 "Note there are records with very low similarity in our `scored_pairs` result. Like `([6, 7], 0.04)`.\n",
 "\n",
-"We need to use a `threshold` to filter out low similarity pairs.\n",
-"\n",
-"Understand the `threshold` allows us to **trade-off between [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall)**, i.e., if you want to be more or less sensitive on matching records, at the risk of introducing false positives (if more sensitive) or false negatives (if less sensitive)."
+"We need to use a `threshold` to filter out low similarity pairs."
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 47,
+"execution_count": 65,
 "metadata": {
 "slideshow": {
 "slide_type": "fragment"
@@ -3421,7 +3422,7 @@
 "  ([ 9, 12], 0.95)]"
 ]
 },
-"execution_count": 47,
+"execution_count": 65,
 "metadata": {},
 "output_type": "execute_result"
 }
@@ -3432,6 +3433,17 @@
 "list(threshold_pairs)[:10]"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {
+"slideshow": {
+"slide_type": "fragment"
+}
+},
+"source": [
+"Understanding the `threshold` allows us to **trade off between [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall)**, i.e., if you want to be more or less sensitive on matching records, at the risk of introducing false positives (if more sensitive) or false negatives (if less sensitive)."
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {
@@ -3535,7 +3547,7 @@
 "execution_count": 52,
 "metadata": {
 "slideshow": {
-"slide_type": "slide"
+"slide_type": "skip"
 }
 },
 "outputs": [
@@ -4200,7 +4212,7 @@
 "- By deduplicating, we find:\n",
 "- `(A, B)` match\n",
 "- `(B, C)` match\n",
-"- `(A, C)` nonmatch\n",
+"- `(A, C)` non-match\n",
 "- And that doesn't make sense!\n",
 "\n",
 "The solution for that ambiguity is computing the **Transitive Closure** through clustering."
