diff --git a/docs/tutorial/Permutations.ipynb b/docs/tutorial/Permutations.ipynb
index e30adabc..c815debd 100644
--- a/docs/tutorial/Permutations.ipynb
+++ b/docs/tutorial/Permutations.ipynb
@@ -23,15 +23,15 @@
"### Steps\n",
"These steps are usually run by different companies, but for illustration everything is carried out in this one file. The participants providing data are *Alice* and *Bob*, with the *Analyst* acting as the integration authority.\n",
"\n",
- "* [Check connection to Entity Service](#check_con)\n",
- "* [Data preparation](#data_prep)\n",
+ "* [Check connection to Entity Service](#Check-Connection)\n",
+ "* [Data preparation](#Data-preparation)\n",
" * Write CSV files with PII\n",
- " * [Create a Linkage Schema](#schema_prep)\n",
- "* [Create Linkage Project](#create_pro)\n",
- "* [Generate CLKs from PII](#hash_n_up)\n",
- "* [Upload the PII](#hash_n_up)\n",
- "* [Create a run](#create_run)\n",
- "* [Retrieve and analyse results](#results)"
+ " * [Create a Linkage Schema](#Schema-Preparation)\n",
+ "* [Create Linkage Project](#Create-Linkage-Project)\n",
+ "* [Generate CLKs from PII](#Hash-and-Upload)\n",
+ "* [Upload the PII](#Hash-and-Upload)\n",
+ "* [Create a run](#Create-a-run)\n",
+ "* [Retrieve and analyse results](#Results)"
]
},
{
@@ -40,7 +40,6 @@
"pycharm": {}
},
"source": [
- "\n",
"## Check Connection\n",
"\n",
"> If you're connecting to a custom entity service, change the address here."
@@ -82,7 +81,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "{\"project_count\": 6534, \"rate\": 2504556, \"status\": \"ok\"}\r\n"
+ "{\"project_count\": 7050, \"rate\": 2824020, \"status\": \"ok\"}\r\n"
]
}
],
@@ -96,7 +95,6 @@
"pycharm": {}
},
"source": [
- "\n",
"## Data preparation\n",
"\n",
"Following the [clkhash tutorial](http://clkhash.readthedocs.io/en/latest/tutorial_cli.html) we will use a dataset from the `recordlinkage` library. We will just write both datasets out to temporary CSV files.\n"
@@ -173,7 +171,7 @@
" \n",
"
\n",
"
\n",
- "
rec-1070-org
\n",
+ "
rec-1070-org
\n",
"
michaela
\n",
"
neumann
\n",
"
8
\n",
@@ -186,7 +184,7 @@
"
5304218
\n",
"
\n",
"
\n",
- "
rec-1016-org
\n",
+ "
rec-1016-org
\n",
"
courtney
\n",
"
painter
\n",
"
12
\n",
@@ -199,7 +197,7 @@
"
4066625
\n",
"
\n",
"
\n",
- "
rec-4405-org
\n",
+ "
rec-4405-org
\n",
"
charles
\n",
"
green
\n",
"
38
\n",
@@ -262,9 +260,7 @@
"pycharm": {}
},
"source": [
- "\n",
"## Schema Preparation\n",
- "\n",
"The linkage schema must be agreed on by the two parties. A hashing schema instructs clkhash how to treat each column when generating CLKs. A detailed description of the hashing schema can be found in the [API docs](http://clkhash.readthedocs.io/en/latest/schema.html). We will ignore the columns ‘rec_id’ and ‘soc_sec_id’ for CLK generation."
]
},
@@ -294,7 +290,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "Overwriting /tmp/tmptm0w938k\n"
+ "Overwriting /var/folders/mw/21b9jb5d1c9_3_z0dq7hpx1m00j_0b/T/tmp3jpcxxrs\n"
]
}
],
@@ -518,7 +514,6 @@
"pycharm": {}
},
"source": [
- "\n",
"## Create Linkage Project\n",
"\n",
"The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.\n"
@@ -537,17 +532,17 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "Credentials will be saved in /tmp/tmptneh9xy1\n",
+ "Credentials will be saved in /var/folders/mw/21b9jb5d1c9_3_z0dq7hpx1m00j_0b/T/tmp_tz_feve\n",
"\u001b[31mProject created\u001b[0m\n"
]
},
{
"data": {
"text/plain": [
- "{'project_id': '12256e29a8ad92c9016ba3e7650888f13d3bfb3bd23cc98a',\n",
- " 'result_token': '1a588d384f651e9430ac1bb42196f9fe393ff10e8ec65f48',\n",
- " 'update_tokens': ['6111c582a0d6a649480c719adcd258b811da17887849ee00',\n",
- " '4239370ce8868a9eb3dc85a85eca243bf593a0cc637a5be8']}"
+ "{'project_id': '7c942add9259b0c61fc06ce24afc6ee9c99355cc5a5eae7a',\n",
+ " 'result_token': '4552074bebabf66a19e707ef64aa35638fc1eb2cd3b9a768',\n",
+ " 'update_tokens': ['1045c9dda873d3cccf37181bcff7c61a5e82c6051d0da2c0',\n",
+ " 'fc27160c4e4736c1dbbecbedd6bc5e4117a3626c1f2eda9c']}"
]
},
"execution_count": 7,
@@ -559,7 +554,12 @@
"creds = NamedTemporaryFile('wt')\n",
"print(\"Credentials will be saved in\", creds.name)\n",
"\n",
- "!clkutil create-project --schema \"{schema.name}\" --output \"{creds.name}\" --type \"permutations\" --server \"{url}\"\n",
+ "!clkutil create-project \\\n",
+ " --schema \"{schema.name}\" \\\n",
+ " --output \"{creds.name}\" \\\n",
+ " --type \"permutations\" \\\n",
+ " --server \"{url}\"\n",
+ "\n",
"creds.seek(0)\n",
"\n",
"import json\n",
@@ -578,7 +578,6 @@
"source": [
"**Note:** the analyst will need to pass on the `project_id` (the id of the linkage project) and one of the two `update_tokens` to each data provider.\n",
"\n",
- "\n",
"## Hash and Upload\n",
"\n",
"At the moment both data providers have *raw* personally identifiable information. We first have to generate CLKs from the raw entity information. We need:\n",
@@ -602,8 +601,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "\u001b[31mCLK data written to /tmp/tmp9vdauwh4.json\u001b[0m\n",
- "\u001b[31mCLK data written to /tmp/tmpgspffags.json\u001b[0m\n"
+ "\u001b[31mCLK data written to /var/folders/mw/21b9jb5d1c9_3_z0dq7hpx1m00j_0b/T/tmppybfm62c.json\u001b[0m\n",
+ "\u001b[31mCLK data written to /var/folders/mw/21b9jb5d1c9_3_z0dq7hpx1m00j_0b/T/tmpu4jx4mjv.json\u001b[0m\n"
]
}
],
@@ -743,7 +742,6 @@
"pycharm": {}
},
"source": [
- "\n",
"## Create a run\n",
"\n",
"Now that the project has been created and the CLK data has been uploaded, we can carry out some privacy-preserving record linkage. Try with a few different threshold values:"
@@ -776,7 +774,6 @@
"pycharm": {}
},
"source": [
- "\n",
"## Results\n",
"\n",
"Now after some delay (depending on the size) we can fetch the mask.\n",
@@ -964,7 +961,7 @@
{
"data": {
"text/plain": [
- "[2418, 3590, 2340, 1226, 1323, 251, 4696, 2598, 4019, 301]"
+ "[3645, 1068, 4371, 465, 1533, 987, 343, 53, 3298, 2515]"
]
},
"execution_count": 20,
@@ -998,7 +995,7 @@
{
"data": {
"text/plain": [
- "[3183, 4293, 3406, 2808, 4528, 2446, 4606, 1601, 1641, 2062]"
+ "[3857, 4827, 3267, 4934, 1958, 3682, 4576, 4895, 4867, 1188]"
]
},
"execution_count": 21,
@@ -1072,16 +1069,16 @@
{
"data": {
"text/plain": [
- "['rec-3933-org,joshua,rigley,19,east place,kergunyah,kingaroy,3665,vic,19670613,4096438\\n',\n",
- " 'rec-1057-org,samara,pringle,7,allan street,bonnie doon,campbelltown,5073,nsw,19560429,3493586\\n',\n",
- " 'rec-4035-org,chloe,worm,6,brentnall place,donna valley,karloo,3128,nsw,19000814,9383057\\n',\n",
- " 'rec-3793-org,lucy,mccarthy,29,charlton street,warrah lea,bundaberg,4061,qld,19940917,6596660\\n',\n",
- " 'rec-27-org,angelina,campbell,161,jackie howe crescent,bugoren,woorim,6052,nsw,19531108,8948230\\n',\n",
- " 'rec-2303-org,tahlia,hage,3,maclaurin crescent,,ormond,4740,tas,19190517,6174860\\n',\n",
- " 'rec-658-org,david,hobson,14,vagabond crescent,dugout 65,patterson lakes,4880,wa,19010305,7666240\\n',\n",
- " 'rec-4484-org,alexandra,clarke,15,parnell road,rsdb 284,nedlands,4014,sa,19890608,7235143\\n',\n",
- " 'rec-702-org,barnaby,fleet,4,martley circuit,peak view,ascot vale,3930,sa,19360907,9383837\\n',\n",
- " 'rec-3252-org,,campbell,4,dunbar street,delicate nobby street,cloverdale,2528,vic,19480406,8607518\\n']"
+ "['rec-3302-org,blaize,koopman,17,allison place,aldersyde estate,balwyn north,4650,nsw,19110608,7823755\\n',\n",
+ " 'rec-1385-org,joel,bishop,10,french street,cedarview,orange,3223,nt,,1324854\\n',\n",
+ " 'rec-190-org,,alias,24,elkington street,pangani,isle of capri,2145,sa,19650429,8261472\\n',\n",
+ " 'rec-4781-org,jacob,waller,89,dalley crescent,the willows,mosman,2480,qld,19580408,6317326\\n',\n",
+ " 'rec-4881-org,alexandra,nguyen,44,colebatch place,langley flats,freshwater,3242,nsw,19511004,6416159\\n',\n",
+ " 'rec-4770-org,tegan,rosendale,1,sherbrooke street,nazareth village,innaloo,2250,wa,19801011,9351309\\n',\n",
+ " 'rec-3385-org,shanaye,carbone,41,haystack crescent,st vincents hospital,matong,3690,nsw,19300519,1632237\\n',\n",
+ " 'rec-3738-org,imogen,carlington,45,mcinnes street,parish talowahl,girilambone,2154,nsw,19781117,7912921\\n',\n",
+ " 'rec-831-org,laura,flannery,54,sid barnes crescent,weemilah,winston hills,5073,qld,19581023,9712180\\n',\n",
+ " 'rec-815-org,holly,campbell,21,casey crescent,nestor,westmead,4573,qld,19911007,4424335\\n']"
]
},
"execution_count": 24,
@@ -1105,16 +1102,16 @@
{
"data": {
"text/plain": [
- "['rec-3933-dup-0,joshua,rigly,19,east place,kergunyah,kingaroy,3665,vic,19670613,4096438\\n',\n",
- " 'rec-1057-dup-0,pringle,samara,7,allan street,bonnie doon,campbelltown,5073,nsw,19560429,3493586\\n',\n",
- " 'rec-4035-dup-0,chooe,worm,6,brentnal place,donna valley,karloo,3128,nsw,19000814,9383057\\n',\n",
- " 'rec-3793-dup-0,mccarthy,lucy,29,charltonstreet,warrahlea,bundaverg,4061,qld,19940917,6596660\\n',\n",
- " 'rec-27-dup-0,angelina,campbell,190,jackie howe crescent,bugoren,woorim,6352,nsw,19531108,8948230\\n',\n",
- " 'rec-2303-dup-0,peter,ha ge,3,maclaurin crescent,,ormond,4704,tas,19190517,6174860\\n',\n",
- " 'rec-658-dup-0,david,hobsson,14,vagabond cfescent,dugout 65,patterson lakes,4880,wa,19010305,7666240\\n',\n",
- " 'rec-4484-dup-0,alexandra,clarke,15,rsd b 284,parnell roa,,4014,sa,19890608,7235143\\n',\n",
- " 'rec-702-dup-0,barnay,fleet,4,martley circuit,peak view,ascot vale,3930,sa,19360907,9383837\\n',\n",
- " 'rec-3252-dup-0,,campbell,4,dunbar svtreet,delicate nobby street,cloverdale,2528,vic,19480406,8607518\\n']"
+ "['rec-3302-dup-0,blaize,koopman,17,allison place,aldersydeestate,balwyn north,4650,nsw,19110608,7823755\\n',\n",
+ " 'rec-1385-dup-0,elton,bishop,10,french street,,orange,3223,nt,,1324854\\n',\n",
+ " 'rec-190-dup-0,,alias,24,elkington street,panganu,isle of capri,2145,sa,19650429,8261472\\n',\n",
+ " 'rec-4781-dup-0,jacob,waliler,89,dalley crescent,the ui llows,mosman,2487,qld,19580408,6317326\\n',\n",
+ " 'rec-4881-dup-0,nguyen,alexandra,44,colebatch place,langley flats,freshwater,3242,nsw,19511004,6416159\\n',\n",
+ " 'rec-4770-dup-0,tegan,rosendale,1,sherbrooke street,nazareth village,innaloo,2550,nsw,19801011,9351309\\n',\n",
+ " 'rec-3385-dup-0,shanaye,lonto,41,haystack crescent,,leetob,3680,nsw,19300519,1632237\\n',\n",
+ " 'rec-3738-dup-0,imogen,carlington,45,mcinnes treet,parish talowahl,girilabmone,2154,nsw,19781117,7912921\\n',\n",
+ " 'rec-831-dup-0,laura,flannery,54,sid barnes crescent,,winstonhills,5073,qld,19581023,9712180\\n',\n",
+ " 'rec-815-dup-0,holyl,campbell,21,casey crescent,,westmead,4573,qld,19911007,4424335\\n']"
]
},
"execution_count": 25,
@@ -1152,16 +1149,16 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "Joshua Rigley (rec-3933-org) =? Joshua Rigly (rec-3933-dup-0)\n",
- "Samara Pringle (rec-1057-org) =? Pringle Samara (rec-1057-dup-0)\n",
- "Chloe Worm (rec-4035-org) =? Chooe Worm (rec-4035-dup-0)\n",
- "Lucy Mccarthy (rec-3793-org) =? Mccarthy Lucy (rec-3793-dup-0)\n",
- "Angelina Campbell (rec-27-org) =? Angelina Campbell (rec-27-dup-0)\n",
- "Tahlia Hage (rec-2303-org) =? Peter Ha Ge (rec-2303-dup-0)\n",
- "David Hobson (rec-658-org) =? David Hobsson (rec-658-dup-0)\n",
- "Alexandra Clarke (rec-4484-org) =? Alexandra Clarke (rec-4484-dup-0)\n",
- "Barnaby Fleet (rec-702-org) =? Barnay Fleet (rec-702-dup-0)\n",
- " Campbell (rec-3252-org) =? Campbell (rec-3252-dup-0)\n"
+ "Blaize Koopman (rec-3302-org) =? Blaize Koopman (rec-3302-dup-0)\n",
+ "Joel Bishop (rec-1385-org) =? Elton Bishop (rec-1385-dup-0)\n",
+ " Alias (rec-190-org) =? Alias (rec-190-dup-0)\n",
+ "Jacob Waller (rec-4781-org) =? Jacob Waliler (rec-4781-dup-0)\n",
+ "Alexandra Nguyen (rec-4881-org) =? Nguyen Alexandra (rec-4881-dup-0)\n",
+ "Tegan Rosendale (rec-4770-org) =? Tegan Rosendale (rec-4770-dup-0)\n",
+ "Shanaye Carbone (rec-3385-org) =? Shanaye Lonto (rec-3385-dup-0)\n",
+ "Imogen Carlington (rec-3738-org) =? Imogen Carlington (rec-3738-dup-0)\n",
+ "Laura Flannery (rec-831-org) =? Laura Flannery (rec-831-dup-0)\n",
+ "Holly Campbell (rec-815-org) =? Holyl Campbell (rec-815-dup-0)\n"
]
}
],
@@ -1230,6 +1227,27 @@
"print(\"Precision: {:.1f}%\".format(100*precision))\n",
"print(\"Recall: {:.1f}%\".format(100*recall))"
]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[31mProject deleted\u001b[0m\r\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Deleting the project\n",
+ "!clkutil delete-project \\\n",
+ " --project=\"{credentials['project_id']}\" \\\n",
+ " --apikey=\"{credentials['result_token']}\" \\\n",
+ " --server=\"{url}\""
+ ]
}
],
"metadata": {
@@ -1248,18 +1266,18 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.7.3"
+ "version": "3.7.4"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
+ "source": [],
"metadata": {
"collapsed": false
- },
- "source": []
+ }
}
}
},
"nbformat": 4,
"nbformat_minor": 2
-}
+}
\ No newline at end of file
diff --git a/docs/tutorial/Record Linkage API.ipynb b/docs/tutorial/Record Linkage API.ipynb
index b5e074f1..71e87353 100644
--- a/docs/tutorial/Record Linkage API.ipynb
+++ b/docs/tutorial/Record Linkage API.ipynb
@@ -694,7 +694,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.7.3"
+ "version": "3.7.4"
},
"pycharm": {
"stem_cell": {
diff --git a/docs/tutorial/Similarity Scores.ipynb b/docs/tutorial/Similarity Scores.ipynb
index 583718df..5a2c4c9f 100644
--- a/docs/tutorial/Similarity Scores.ipynb
+++ b/docs/tutorial/Similarity Scores.ipynb
@@ -47,6 +47,7 @@
"import json\n",
"import os\n",
"import time\n",
+ "import pandas as pd\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import requests\n",
@@ -100,7 +101,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "{\"project_count\": 1689, \"rate\": 2267284, \"status\": \"ok\"}\r\n"
+ "{\"project_count\": 4, \"rate\": 32036360, \"status\": \"ok\"}\r\n"
]
}
],
@@ -312,7 +313,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "Overwriting /tmp/tmpw_n8wu8g\n"
+ "Overwriting /tmp/tmp23q54lqu\n"
]
}
],
@@ -344,7 +345,7 @@
" },\n",
" \"hashing\": {\n",
" \"strategy\": {\n",
- " \"bitsPerToken\": 30\n",
+ " \"bitsPerFeature\": 200\n",
" },\n",
" \"hash\": {\n",
" \"type\": \"doubleHash\"\n",
@@ -364,7 +365,7 @@
" },\n",
" \"hashing\": {\n",
" \"strategy\": {\n",
- " \"bitsPerToken\": 30\n",
+ " \"bitsPerFeature\": 200\n",
" },\n",
" \"hash\": {\n",
" \"type\": \"doubleHash\"\n",
@@ -386,7 +387,7 @@
" \"sentinel\": \"\"\n",
" },\n",
" \"strategy\": {\n",
- " \"bitsPerToken\": 30\n",
+ " \"bitsPerFeature\": 100\n",
" },\n",
" \"hash\": {\n",
" \"type\": \"doubleHash\"\n",
@@ -406,7 +407,7 @@
" },\n",
" \"hashing\": {\n",
" \"strategy\": {\n",
- " \"bitsPerToken\": 30\n",
+ " \"bitsPerFeature\": 100\n",
" },\n",
" \"hash\": {\n",
" \"type\": \"doubleHash\"\n",
@@ -426,7 +427,7 @@
" },\n",
" \"hashing\": {\n",
" \"strategy\": {\n",
- " \"bitsPerToken\": 30\n",
+ " \"bitsPerFeature\": 100\n",
" },\n",
" \"hash\": {\n",
" \"type\": \"doubleHash\"\n",
@@ -446,7 +447,7 @@
" },\n",
" \"hashing\": {\n",
" \"strategy\": {\n",
- " \"bitsPerToken\": 30\n",
+ " \"bitsPerFeature\": 100\n",
" },\n",
" \"hash\": {\n",
" \"type\": \"doubleHash\"\n",
@@ -467,7 +468,7 @@
" },\n",
" \"hashing\": {\n",
" \"strategy\": {\n",
- " \"bitsPerToken\": 30\n",
+ " \"bitsPerFeature\": 100\n",
" },\n",
" \"hash\": {\n",
" \"type\": \"doubleHash\"\n",
@@ -488,7 +489,7 @@
" },\n",
" \"hashing\": {\n",
" \"strategy\": {\n",
- " \"bitsPerToken\": 30\n",
+ " \"bitsPerFeature\": 100\n",
" },\n",
" \"hash\": {\n",
" \"type\": \"doubleHash\"\n",
@@ -510,7 +511,7 @@
" \"sentinel\": \"\"\n",
" },\n",
" \"strategy\": {\n",
- " \"bitsPerToken\": 30\n",
+ " \"bitsPerFeature\": 200\n",
" },\n",
" \"hash\": {\n",
" \"type\": \"doubleHash\"\n",
@@ -554,17 +555,17 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "Credentials will be saved in /tmp/tmp2eppf_dc\n",
+ "Credentials will be saved in /tmp/tmp6c2zwr2b\n",
"\u001b[31mProject created\u001b[0m\n"
]
},
{
"data": {
"text/plain": [
- "{'project_id': 'fc8f8216e33a7b8ffd4b967c27f8ce8e5d7371cf8f52bcdb',\n",
- " 'result_token': '6423ccee1e634a390a12e3de1a57e7bd322621111c119351',\n",
- " 'update_tokens': ['ef0404a7c23ea25c9f922f4c254f80dd6fa644d7d906efa9',\n",
- " '46a71922c19a75eae2dd75ec59db0eac453842123514c22a']}"
+ "{'project_id': '4d499f0fd3fb41c7dca684ee923ee056daff1d1d0dea0e69',\n",
+ " 'result_token': '644530073d94cec15ee0b6955192e6ec66e4d5b6a7c59ec4',\n",
+ " 'update_tokens': ['aaec135b6729e8234b2d974e99b47df48a5d2b83b1e0e5fb',\n",
+ " 'f154012b6f6d48700490f964525633ff8efaa18f200ec7c5']}"
]
},
"execution_count": 8,
@@ -576,7 +577,12 @@
"creds = NamedTemporaryFile('wt')\n",
"print(\"Credentials will be saved in\", creds.name)\n",
"\n",
- "!clkutil create-project --schema \"{schema.name}\" --output \"{creds.name}\" --type \"similarity_scores\" --server \"{url}\"\n",
+ "!clkutil create-project \\\n",
+ " --schema \"{schema.name}\" \\\n",
+ " --output \"{creds.name}\" \\\n",
+ " --type \"similarity_scores\" \\\n",
+ " --server \"{url}\"\n",
+ "\n",
"creds.seek(0)\n",
"\n",
"with open(creds.name, 'r') as f:\n",
@@ -612,8 +618,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "\u001b[31mCLK data written to /tmp/tmpjlx4bxil.json\u001b[0m\n",
- "\u001b[31mCLK data written to /tmp/tmpz2ykuhep.json\u001b[0m\n"
+ "\u001b[31mCLK data written to /tmp/tmp75ho6ywb.json\u001b[0m\n",
+ "\u001b[31mCLK data written to /tmp/tmp1rw5bksd.json\u001b[0m\n"
]
}
],
@@ -726,7 +732,7 @@
" --project=\"{project_id}\" \\\n",
" --apikey=\"{credentials['result_token']}\" \\\n",
" --server \"{url}\" \\\n",
- " --threshold 0.9 \\\n",
+ " --threshold 0.75 \\\n",
" --output \"{f.name}\"\n",
" \n",
" run_id = json.load(open(f.name))['run_id']"
@@ -740,7 +746,7 @@
"source": [
"## Results\n",
"\n",
- "Now after some delay (depending on the size) we can fetch the mask.\n",
+ "Now after some delay (depending on the size) we can fetch the result.\n",
"This can be done with clkutil:\n",
"\n",
" !clkutil results --server \"{url}\" \\\n",
@@ -851,7 +857,7 @@
{
"data": {
"text/plain": [
- "1150393"
+ "280116"
]
},
"execution_count": 16,
@@ -883,7 +889,7 @@
"outputs": [
{
"data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD4CAYAAAAAczaOAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAPGElEQVR4nO3df6xfd13H8eeLjmEUyDpbm9lt3KklsfzhnHUMFZkQtm6LFtDwI1HKXKyEkUgif1T5Y2aEpGrAsIALVSobEciMII0rjlohqGGwImPsh9DL6LLWshYL6LJEAd/+8f0UvnT3trf3+6t3n+cj+eZ7vp/zOef7efd7+zrnnnO+56aqkCT14WmzHoAkaXoMfUnqiKEvSR0x9CWpI4a+JHXknFkP4FTWrFlTc3Nzsx6GJK0on/vc575eVWsXmndWh/7c3Bz79++f9TAkaUVJ8shi8zy8I0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTmrv5E7qrntdy7YfnDHdVMeiSSdHdzTl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SerIaUM/yUVJPpHkwSQPJPm91n5+kr1JDrTn1a09SW5JMp/kviSXDa1ra+t/IMnWyZUlSVrIUvb0vwP8flVtBK4AbkyyEdgO7KuqDcC+9hrgGmBDe2wDboXBRgK4CXg+cDlw04kNhSRpOk4b+lV1pKr+rU3/N/AQsB7YAtzWut0GvKxNbwFur4G7gfOSXABcDeytquNV9Q1gL7B5rNVIkk7pjI7pJ5kDfhb4DLCuqo60WV8D1rXp9cCjQ4sdam2LtZ/8HtuS7E+y/9ixY2cyPEnSaSw59JM8E/hb4E1V9V/D86qqgBrHgKpqZ1VtqqpNa9euHccqJUnNkkI/ydMZBP5fV9WHW/Nj7bAN7floaz8MXDS0+IWtbbF2SdKULOXqnQDvBR6qqncMzdoNnLgCZyvw0aH217areK4AvtUOA90FXJVkdTuBe1VrkyRNyTlL6POLwG8BX0xyb2v7Q2AHcEeSG4BHgFe2eXuAa4F54AngeoCqOp7krcA9rd/NVXV8LFVIkpbktKFfVf8CZJHZL1mgfwE3LrKuXcCuMxmgJGl8/EauJHXE0Jekjhj6ktQRQ1+SOrKUq3eecua237lg+8Ed1015JJI0Xe7pS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI6cM+sBnE3mtt+5YPvBHddNeSSSNBmn3dNPsivJ0ST3D7X9UZLDSe5tj2uH5v1BkvkkX0py9VD75tY2n2T7+EuRJJ3OUg7vvA/YvED7n1XVpe2xByDJRuDVwPPaMn+eZFWSVcC7gWuAjcBrWl9J0hSd9vBOVX0qydwS17cF+FBV/Q/w1STzwOVt3nxVPQyQ5EOt74NnPGJJ0rKNciL3jUnua4d/Vre29cCjQ30OtbbF2p8kybYk+5PsP3bs2AjDkySdbLmhfyvwk8ClwBHg7eMaUFXtrKpNVbVp7dq141qtJIllXr1TVY+dmE7yF8Dft5eHgYuGul7Y2jhFuyRpSpa1p5/kgqGXLwdOXNmzG3h1kmckuQTYAHwWuAfYkOSSJOcyONm7e/nDliQtx2n39JN8ELgSWJPkEHATcGWSS4ECDgK/C1BVDyS5g8EJ2u8AN1bVd9t63gjcBawCdlXVA2OvRpJ0Sku5euc1CzS/9xT93wa8bYH2PcCeMxqd
JGmsvA2DJHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHVnWn0vszdz2OxdsP7jjuimPRJJG456+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SO+OcSR+CfUZS00rinL0kdMfQlqSOnDf0ku5IcTXL/UNv5SfYmOdCeV7f2JLklyXyS+5JcNrTM1tb/QJKtkylHknQqS9nTfx+w+aS27cC+qtoA7GuvAa4BNrTHNuBWGGwkgJuA5wOXAzed2FBIkqbntKFfVZ8Cjp/UvAW4rU3fBrxsqP32GrgbOC/JBcDVwN6qOl5V3wD28uQNiSRpwpZ7TH9dVR1p018D1rXp9cCjQ/0OtbbF2p8kybYk+5PsP3bs2DKHJ0layMgncquqgBrDWE6sb2dVbaqqTWvXrh3XaiVJLD/0H2uHbWjPR1v7YeCioX4XtrbF2iVJU7Tc0N8NnLgCZyvw0aH217areK4AvtUOA90FXJVkdTuBe1VrkyRN0Wm/kZvkg8CVwJokhxhchbMDuCPJDcAjwCtb9z3AtcA88ARwPUBVHU/yVuCe1u/mqjr55LAkacJOG/pV9ZpFZr1kgb4F3LjIenYBu85odJKksfIbuZLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkdOe+8dnbm57Xcu2H5wx3VTHokk/SD39CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjriXTanyLtvSpo19/QlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdWSk0E9yMMkXk9ybZH9rOz/J3iQH2vPq1p4ktySZT3JfksvGUYAkaenGsaf/K1V1aVVtaq+3A/uqagOwr70GuAbY0B7bgFvH8N6SpDMwicM7W4Db2vRtwMuG2m+vgbuB85JcMIH3lyQtYtS7bBbw8SQFvKeqdgLrqupIm/81YF2bXg88OrTsodZ2ZKiNJNsY/CbAxRdfPOLwVgbvvilpWkYN/V+qqsNJfgzYm+Tfh2dWVbUNwpK1DcdOgE2bNp3RspKkUxvp8E5VHW7PR4GPAJcDj504bNOej7buh4GLhha/sLVJkqZk2aGf5EeSPOvENHAVcD+wG9jaum0FPtqmdwOvbVfxXAF8a+gwkCRpCkY5vLMO+EiSE+v5QFX9Q5J7gDuS3AA8Aryy9d8DXAvMA08A14/w3pKkZVh26FfVw8DPLND+n8BLFmgv4Mblvp8kaXR+I1eSOmLoS1JHDH1J6oihL0kdMfQlqSOjfiNXE+TtGSSNm3v6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xOv0V6DFrt8Hr+GXdGru6UtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOeMnmU4y3Y5Z0Ku7pS1JHDH1J6oihL0kdMfQlqSOeyO2EJ3glgXv6ktQVQ1+SOmLoS1JHDH1J6ogncjvnCV6pL+7pS1JH3NPXgvwNQHpqck9fkjpi6EtSRzy8ozPiYR9pZTP0NRZuDKSVYeqhn2Qz8E5gFfCXVbVj2mPQ9LgxkM4uUw39JKuAdwMvBQ4B9yTZXVUPTnMcmj03BtJsTHtP/3JgvqoeBkjyIWALYOgLWHxjME5uWNSzaYf+euDRodeHgOcPd0iyDdjWXj6e5EsjvN8a4OsjLL8S9VbzGdebP57QSKant88YrPlMPWexGWfdidyq2gnsHMe6kuyvqk3jWNdK0VvNvdUL1tyLSdU87ev0DwMXDb2+sLVJkqZg2qF/D7AhySVJzgVeDeye8hgkqVtTPbxTVd9J8kbgLgaX
bO6qqgcm+JZjOUy0wvRWc2/1gjX3YiI1p6omsV5J0lnIe+9IUkcMfUnqyIoM/SSbk3wpyXyS7QvMf06SfUnuS/LJJBcOzdua5EB7bJ3uyJdvuTUnuTTJp5M80Oa9avqjX55RPuc2/9lJDiV51/RGPZoRf7YvTvLxJA8leTDJ3DTHvlwj1vwn7Wf7oSS3JMl0R3/mkuxKcjTJ/YvMT6tlvtV82dC80fOrqlbUg8EJ4K8APwGcC3wB2HhSn78BtrbpFwPvb9PnAw+359VtevWsa5pwzc8FNrTpHweOAOfNuqZJ1jw0/53AB4B3zbqeadQMfBJ4aZt+JvDDs65pkjUDvwD8a1vHKuDTwJWzrmkJNf8ycBlw/yLzrwU+BgS4AvhMax9Lfq3EPf3v3cqhqv4XOHErh2EbgX9q058Ymn81sLeqjlfVN4C9wOYpjHlUy665qr5cVQfa9H8AR4G1Uxn1aEb5nEnyc8A64ONTGOu4LLvmJBuBc6pqL0BVPV5VT0xn2CMZ5XMu4IcYbCyeATwdeGziIx5RVX0KOH6KLluA22vgbuC8JBcwpvxaiaG/0K0c1p/U5wvAK9r0y4FnJfnRJS57Nhql5u9JcjmD/yBfmdA4x2nZNSd5GvB24M0TH+V4jfI5Pxf4ZpIPJ/l8kj9tNzg82y275qr6NIONwJH2uKuqHprweKdhsX+TseTXSgz9pXgz8KIknwdexOBbv9+d7ZAm7pQ1tz2F9wPXV9X/zWaIY7dYzW8A9lTVoVkObkIWq/kc4IVt/s8zOFzyuhmNcdwWrDnJTwE/zeCb/euBFyd54eyGuTKcdffeWYLT3sqhHcZ4BUCSZwK/XlXfTHIYuPKkZT85ycGOybJrbq+fDdwJvKX9urgSjPI5vwB4YZI3MDi2fW6Sx6vqSScJzzKj1HwIuLe+fwfbv2NwPPi90xj4CEap+XeAu6vq8TbvY8ALgH+exsAnaLF/k/Hk16xPaizjJMg5DE5gXML3T/w876Q+a4Cntem3ATcPnQj5KoOTIKvb9PmzrmnCNZ8L7APeNOs6plXzSX1ex8o5kTvK57yq9V/bXv8VcOOsa5pwza8C/rGt4+nt5/xXZ13TEuueY/ETudfxgydyP9vax5JfMy9+mf9g1wJfZnBs+i2t7Wbg19r0bwAHWp+/BJ4xtOxvA/Ptcf2sa5l0zcBvAt8G7h16XDrreib9OQ+tY8WE/qg1M/jjRPcBXwTeB5w763omWTODDd17gIcY/E2Od8y6liXW+0EG5yC+zeC4/A3A64HXt/lh8MemvtI+y01Dy46cX96GQZI68lQ9kStJWoChL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjry/11JeywGTfuPAAAAAElFTkSuQmCC\n",
+ "image/png": "\n",
"text/plain": [
"
"
]
@@ -895,7 +901,10 @@
}
],
"source": [
- "plt.hist([score for _, _, score in data[::100]], bins=50);"
+ "plt.style.use('seaborn-deep')\n",
+ "plt.hist([score for _, _, score in data], bins=50)\n",
+ "plt.xlabel('similarity score')\n",
+ "plt.show()"
]
},
{
@@ -904,7 +913,7 @@
"pycharm": {}
},
"source": [
- "The vast majority of these similarity scores are for non matches. Let's zoom into the right side of the distribution."
+ "The vast majority of these similarity scores are for non-matches. We expect true matches to have high similarity scores, so let's zoom into the right side of the distribution."
]
},
{
@@ -918,7 +927,7 @@
"outputs": [
{
"data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD4CAYAAAAXUaZHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAARWklEQVR4nO3df4zkdX3H8eeL41cVW0C25DzApYppz7ai3VKstfIjVoS2h61FbKKHkp5GTGqiSdH+obUlwbZiaWxJz0IFIyitWkjFCqLE2gp4KPKz6oFHuOsJp4BKjVbw3T/me2FYdm9md3Zmdz88H8lkv/P5fGfm/dnZe+1nP9/vfC9VhSSpLXstdwGSpKVnuEtSgwx3SWqQ4S5JDTLcJalBey93AQCHHHJITU9PL3cZkrSq3HTTTd+uqqm5+lZEuE9PT7Nly5blLkOSVpUk98zX57KMJDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1aEV8QnUU02d/ct6+beeeMsFKJGnlcOYuSQ0y3CWpQQPDPcn+SW5M8tUktyf5s679yCQ3JNma5KNJ9u3a9+vub+36p8c7BEnSbMPM3H8EnFBVzwOOBk5KcizwHuB9VfVs4EHgzG7/M4EHu/b3dftJkiZoYLhXz8Pd3X26WwEnAP/StV8MnNptb+ju0/WfmCRLVrEkaaCh1tyTrElyM3A/cA1wF/BQVT3S7bIdWNdtrwPuBej6vws8fY7n3JRkS5Itu3btGm0UkqTHGSrcq+rRqjoaOAw4Bvj5UV+4qjZX1UxVzUxNzfkfiUiSFmlBZ8tU1UPA54AXAgcm2X2e/GHAjm57B3A4QNf/M8B3lqRaSdJQhjlbZirJgd32TwEvBe6kF/Kv7HbbCFzRbV/Z3afr/2xV1VIWLUnas2E+oboWuDjJGnq/DC6vqn9LcgfwkSR/AXwFuLDb/0LgQ0m2Ag8Ap4+hbknSHgwM96q6BXj+HO1301t/n93+Q+APlqQ6SdKi+AlVSWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQcNcOGzVmj77k3O2bzv3lAlXIkmT5cxdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQQPDPcnhST6X5I4ktyf54679XUl2JLm5u53c95i3J9ma5GtJXjbOAUiSnmiYS/4+Ary1qr6c5GnATUmu6freV1V/3b9zkvXA6cBzgWcAn0nynKp6dCkLlyTNb+DMvap2VtWXu+3vA3cC6/bwkA3AR6rqR1X1TWArcMxSFCtJGs6C1tyTTAPPB27omt6c5JYkFyU5qGtbB9zb97DtzPHLIMmmJFuSbNm1a9eCC5ckzW/ocE9yAPAx4C1V9T3gAuBZwNHATuC9C3nhqtpcVTNVNTM1NbWQh0qSBhgq3JPsQy/YP1xVHweoqvuq6tGq+gnwAR5betkBHN738MO6NknShAxztkyAC4E7q+q8vva1fbu9Arit274SOD3JfkmOBI4Cbly6kiVJgwxztsyLgNcAtya5uWt7B/DqJEcDBWwD3gBQVbcnuRy4g96ZNmd5powkTdbAcK+qLwCZo+uqPTzmHOCcEeqSJI3AT6hKUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBhrskNchwl6QGGe6S1CDDXZIatPegHZIcDlwCHAoUsLmqzk9yMPBRYBrYBpxWVQ8mCXA+cDLwA+CMqvryeMpfnOmzPzln+7ZzT5lwJZI0HsPM3B8B3lpV64FjgbOSrAfOBq6tqqOAa7v7AC8Hjupum4ALlrxqSdIeDQz3qtq5
e+ZdVd8H7gTWARuAi7vdLgZO7bY3AJdUz/XAgUnWLnnlkqR5LWjNPck08HzgBuDQqtrZdX2L3rIN9IL/3r6Hbe/aZj/XpiRbkmzZtWvXAsuWJO3J0OGe5ADgY8Bbqup7/X1VVfTW44dWVZuraqaqZqamphbyUEnSAEOFe5J96AX7h6vq413zfbuXW7qv93ftO4DD+x5+WNcmSZqQgeHenf1yIXBnVZ3X13UlsLHb3ghc0df+2vQcC3y3b/lGkjQBA0+FBF4EvAa4NcnNXds7gHOBy5OcCdwDnNb1XUXvNMit9E6FfN2SVixJGmhguFfVF4DM033iHPsXcNaIdUmSRuAnVCWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUoGGuCilJK5L/2f38nLlLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIs2X6eORdUiucuUtSg5y5S2rOfH+Fz6fFv86duUtSg5y5S1rxFjoTl+EuaQUxxJfOwGWZJBcluT/JbX1t70qyI8nN3e3kvr63J9ma5GtJXjauwiVJ8xtmzf2DwElztL+vqo7ublcBJFkPnA48t3vM3ydZs1TFSpKGMzDcq+rzwANDPt8G4CNV9aOq+iawFThmhPokSYswytkyb05yS7dsc1DXtg64t2+f7V2bJGmCFntA9QLgz4Hqvr4XeP1CniDJJmATwBFHHLHIMiStRh44Hb9Fzdyr6r6qerSqfgJ8gMeWXnYAh/fteljXNtdzbK6qmaqamZqaWkwZkqR5LGrmnmRtVe3s7r4C2H0mzZXApUnOA54BHAXcOHKVkjRGLV5XamC4J7kMOA44JMl24J3AcUmOprcssw14A0BV3Z7kcuAO4BHgrKp6dDylS5LmMzDcq+rVczRfuIf9zwHOGaUoSdJovLaMJDXIcJekBhnuktQgw12SGmS4S1KDvOSvpLHxk6jLx5m7JDXIcJekBhnuktQgw12SGmS4S1KDPFtmCC1eMU5S25y5S1KDnLlL0jz2dJ7+Sv/L3XCX9DirOdD0GMNd0sj8JOrK45q7JDXIcJekBhnuktQg19wlDc219dXDcJeepAzqtrksI0kNMtwlqUEuy4zAa85IWqmcuUtSgwbO3JNcBPw2cH9V/WLXdjDwUWAa2AacVlUPJglwPnAy8APgjKr68nhKlzQMD5w+OQ0zc/8gcNKstrOBa6vqKODa7j7Ay4Gjutsm4IKlKVOStBADZ+5V9fkk07OaNwDHddsXA9cBf9K1X1JVBVyf5MAka6tq51IVLD3ZeaxHw1jsmvuhfYH9LeDQbnsdcG/fftu7tidIsinJliRbdu3atcgyJElzGfmAajdLr0U8bnNVzVTVzNTU1KhlSJL6LDbc70uyFqD7en/XvgM4vG+/w7o2SdIELTbcrwQ2dtsbgSv62l+bnmOB77reLkmTN8ypkJfRO3h6SJLtwDuBc4HLk5wJ3AOc1u1+Fb3TILfSOxXydWOoWZI0wDBny7x6nq4T59i3gLNGLUqSNBo/oSpJDTLcJalBXjhMWiQ/TKSVzHCXBlgt12ZZLXVqMlyWkaQGOXMfA/9cl7TcDHdpifnLXSuB4S4tM9fKNQ6uuUtSg5y5a6zGvUThEog0N8N9BTCgJC01w12aENfWNUmuuUtSg5y5a0VxiUpaGoa7nlT2tDTiLxC1xHBfwZzFTpZr4mqJ4S5JEzDpyZrhPkHODB8z7u+F32s92Xm2jCQ1yJm75rTS1vudiUsL48xdkhrkzP1JwNP/pKW30v66nc2ZuyQ1yJn7k9xSrWW7Ji6tLCOFe5JtwPeBR4FHqmomycHAR4FpYBtwWlU9OFqZWikMcWl1WIplmeOr6uiqmununw1cW1VHAdd29yVJEzSONfcNwMXd9sXAqWN4DUnSHowa7gVcneSmJJu6tkOrame3/S3g0LkemGRTki1JtuzatWvEMiRJ/UY9oPobVbUjyc8C1yT57/7O
qqokNdcDq2ozsBlgZmZmzn20MK6HS9ptpHCvqh3d1/uTfAI4Brgvydqq2plkLXD/EtSpPoa4pEEWvSyT5KlJnrZ7G/gt4DbgSmBjt9tG4IpRi5QkLcwoM/dDgU8k2f08l1bVvyf5EnB5kjOBe4DTRi9TkrQQiw73qrobeN4c7d8BThylKEnSaLz8gCQ1yHCXpAYZ7pLUIC8cJklLaKWcquzMXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQYa7JDXIcJekBhnuktQgw12SGmS4S1KDDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUIMNdkhpkuEtSgwx3SWqQ4S5JDTLcJalBYwv3JCcl+VqSrUnOHtfrSJKeaCzhnmQN8HfAy4H1wKuTrB/Ha0mSnmhcM/djgK1VdXdV/R/wEWDDmF5LkjTL3mN63nXAvX33twO/1r9Dkk3Apu7uw0m+tsjXOgT49iIfu9I4lpWplbG0Mg5oaCx5z0hjeeZ8HeMK94GqajOwedTnSbKlqmaWoKRl51hWplbG0so4wLEMY1zLMjuAw/vuH9a1SZImYFzh/iXgqCRHJtkXOB24ckyvJUmaZSzLMlX1SJI3A58G1gAXVdXt43gtlmBpZwVxLCtTK2NpZRzgWAZKVY3jeSVJy8hPqEpSgwx3SWrQig73QZcwSPLMJNcmuSXJdUkOm9X/00m2J3n/5Kqe2yhjSfJokpu727IemB5xHEckuTrJnUnuSDI9ydpnW+xYkhzf937cnOSHSU6d/AgeV+so78tfJrm9e1/+NkkmW/0Tah1lLO9Jclt3e9VkK39CnRcluT/JbfP0p/t+b+3G8oK+vo1JvtHdNi6qgKpakTd6B2LvAn4O2Bf4KrB+1j7/DGzstk8APjSr/3zgUuD9q3kswMPL/X4s0TiuA17abR8APGW1jqVvn4OBB1brWIBfB/6ze441wBeB41bpWE4BrqF3oshT6Z2199PLOJbfBF4A3DZP/8nAp4AAxwI39P1M3d19PajbPmihr7+SZ+7DXMJgPfDZbvtz/f1JfgU4FLh6ArUOMtJYVpBFj6O7ttDeVXUNQFU9XFU/mEzZc1qq9+SVwKdW8VgK2J9ekO4H7APcN/aK5zfKWNYDn6+qR6rqf4FbgJMmUPOcqurz9H7xz2cDcEn1XA8cmGQt8DLgmqp6oKoepPcLa8HjWMnhPtclDNbN2uerwO91268Anpbk6Un2At4LvG3sVQ5n0WPp7u+fZEuS65f5z/9RxvEc4KEkH0/ylSR/1V1gbrmM+p7sdjpw2VgqHN6ix1JVX6QXkDu726er6s4x17sno7wvXwVOSvKUJIcAx/P4D1OuNPONdZjvwUArOdyH8TbgJUm+AryE3qdgHwXeBFxVVduXs7gFmm8sAM+s3seT/xD4myTPWqYahzHfOPYGXtz1/yq9P7vPWKYah7Wn94RulvVL9D7PsdLNOZYkzwZ+gd6nyNcBJyR58fKVOZQ5x1JVVwNXAf9F7xfuF+l7v55slu3aMkMYeAmDqvofut/gSQ4Afr+qHkryQuDFSd5Eb2133yQPV9VyXVd+0WPp+nZ0X+9Och3wfHrrkpM2ynuyHbi5qu7u+v6V3jrjhZMofA4jvSed04BPVNWPx1zrIKO8L38EXF9VD3d9nwJeCPzHJAqfw6j/Vs4Bzun6LgW+PoGaF2u+se4AjpvVft2Cn325DjYMcTBib3oHEo7ksQMrz521zyHAXt32OcC753ieM1j+A6qLHgu9Ayr79e3zDWYdYFol41jT7T/V3f8n4KzV+J709V8PHL+cP1tL8L68CvhM9xz7ANcCv7NKx7IGeHq3/cvAbfSO8yznezPN/AdUT+HxB1Rv7NoPBr7Z/ds/qNs+eMGvvdw/mAO+MSfT+817F/CnXdu7gd/ttl/Zhd3XgX/cHYKznuMMljncRxkLvbMZbu1+yG8FzlyN4+j6XkrvINetwAeBfVfxWKbpzbD2Wu6frRF/vtYA/wDcCdwB
nLeKx7J/N4Y76P3iPXqZx3EZveMYP6a3bn4m8EbgjV1/6P2nRnd1/yZm+h77emBrd3vdYl7fyw9IUoNW+wFVSdIcDHdJapDhLkkNMtwlqUGGuyQ1yHCXpAYZ7pLUoP8H1xfysAfPXP0AAAAASUVORK5CYII=\n",
+ "image/png": "[base64-encoded PNG omitted: histogram of similarity scores]\n",
CUmenuS/J7k5yW1J/lkrvzbJxrb9WJJ3Jrk9yV8kOaUdvyvJK1ud05J8aorHPzvJ9W3Bs79IsqqVvy3JB5N8gcH6Oacl+VSSdQxuOPk3bc3xFyf5epKntfOeObw/9DzntvbfnOTzreywJL/fym9J8sZWfnprz60ZrP29/67Nu5O8I8mXgHOTPCfJp5PckOR/JXneQvw3UF8cuWupOAP4RlW9AiDJ0VPUeTpwTVX9dpJPAG8HXsrgbtrtPHUJhWF/CZxaVZXk14HfAS5sxzYAP1dV305yGkBV3Z3BF5E8VlW/39p0LfAK4L8xuN3+z6vq+wc8z1uBl1fV3iQrWtlWBuvgn1SDL5A/NslRwAeA06vqq0kuA34TeE8758GqOrk9704GdzXemeSFwH9hsD66NC1H7loqbgVe2kasL66qR6eo8z3g00P1P9fC9VYG4Xkwa4DPJLkV+G3gHw0d21FV3x6hjZcA57ft84E/naLOF4APJPmXDL7QAuAXgT+pqicAquoh4CeBr1fVV1ud7Qy+3GG/ywGS/DiD2/U/luQm4E+A40doq5Y5w11LQgu5kxkE9duTvHWKat+vJ9fL+CHw3XbuD5n5X6F/CPxRVf008BsMFkLb7/+O2MYvAOva6P6wqnrK16dV1RuA/8BgVb8b8uT65bO1v00/BjxSVScN/fzUHB9Ty4jhriUhyT8AHq+qPwPeySDox+lonlw2ddTvpPw74BkHlF0GfJipR+0keU5VXV9Vb2Ww0ugJDL5J5zeSHN7qHMtgYbp1SZ7bTn0t8LkDH6+q/hb4epJz27lJ8vwR269lzHDXUvHTwBfb1MPvMphPH6e3MZjauAH41ojnfBL45f0fqLayDzH4XsuPTHPOO9sHpLcxuMrmZgbTOX8D3JLkZuBXq+o7DKZ2Ptamin4ITPdl478GbGnn3s4y+7o6zY2rQkqzkORVwKaqeu1it0U6GK+WkUaU5A+BMxmsYy4taY7cJalDzrlLUocMd0nqkOEuSR0y3CWpQ4a7JHXo/wGLDuOcFHH2SgAAAABJRU5ErkJggg==\n",
"text/plain": [
"
"
]
@@ -930,104 +939,76 @@
}
],
"source": [
- "plt.hist([score for _, _, score in data[::1] if score > 0.94], bins=50);"
+ "plt.hist([score for _, _, score in data if score >= 0.79], bins=50);\n",
+ "plt.xlabel('similarity score')\n",
+ "plt.show()"
]
},
{
"cell_type": "markdown",
- "metadata": {
- "pycharm": {}
- },
+ "metadata": {},
"source": [
- "Now it looks like a good threshold should be above `0.95`. Let's have a look at some of the candidate matches around there."
+ "Indeed, there is a cluster of scores between 0.9 and 1.0. To confirm that these are the scores of the matches, we will now extract the true matches (`true_matches`) from the datasets and split the similarity scores into two groups: those for the matches and those for the non-matches. We can do this because we know the ground truth of the dataset."
]
},
{
"cell_type": "code",
"execution_count": 19,
- "metadata": {
- "pycharm": {
- "is_executing": false
- }
- },
+ "metadata": {},
"outputs": [],
"source": [
- "def sample(data, threshold, num_samples, epsilon=0.01):\n",
- " samples = []\n",
- " for row in data:\n",
- " if abs(row[2] - threshold) <= epsilon:\n",
- " samples.append(row)\n",
- " if len(samples) >= num_samples:\n",
- " break\n",
- " return samples\n",
- "\n",
- "def lookup_originals(candidate_pair):\n",
- " a, b, score = candidate_pair\n",
- " a_index, b_index = [x[1] for x in sorted([a, b])]\n",
- " a = dfA.iloc[a_index]\n",
- " b = dfB.iloc[b_index]\n",
- " return a, b"
+ "# rec_id in dfA has the form 'rec-1070-org'. We only want the number. Additionally, as we are\n",
+ "# interested in the position of the records, we create a new index which contains the row numbers.\n",
+ "dfA_ = dfA.rename(lambda x: x[4:-4], axis='index').reset_index()\n",
+ "dfB_ = dfB.rename(lambda x: x[4:-6], axis='index').reset_index()\n",
+ "# now we can merge dfA_ and dfB_ on the record_id.\n",
+ "a = pd.DataFrame({'ida': dfA_.index, 'rec_id': dfA_['rec_id']})\n",
+ "b = pd.DataFrame({'idb': dfB_.index, 'rec_id': dfB_['rec_id']})\n",
+ "dfj = a.merge(b, on='rec_id', how='inner').drop(columns=['rec_id'])\n",
+ "# and build a set of the corresponding row numbers.\n",
+ "true_matches = set((row[0], row[1]) for row in dfj.itertuples(index=False))"
]
},
{
"cell_type": "code",
- "execution_count": 20,
- "metadata": {
- "pycharm": {
- "is_executing": false
- }
- },
+ "execution_count": 21,
+ "metadata": {},
"outputs": [],
"source": [
- "def look_at_per_field_accuracy(threshold = 0.999, num_samples = 100):\n",
- " results = []\n",
- " for i, candidate in enumerate(sample(data, threshold, num_samples, 0.01), start=1):\n",
- " record_a, record_b = lookup_originals(candidate)\n",
- " results.append(record_a == record_b)\n",
- "\n",
- " print(\"Proportion of exact matches for each field using threshold: {}\".format(threshold))\n",
- " print(sum(results)/num_samples)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "pycharm": {}
- },
- "source": [
- "So we should expect a very high proportion of matches across all fields for high thresholds:"
+ "scores_matches = []\n",
+ "scores_non_matches = []\n",
+ "for (_, a), (_, b), score in data:\n",
+ " if score < 0.79:\n",
+ " continue\n",
+ " if (a, b) in true_matches:\n",
+ " scores_matches.append(score)\n",
+ " else:\n",
+ " scores_non_matches.append(score)"
]
},
{
"cell_type": "code",
- "execution_count": 21,
- "metadata": {
- "pycharm": {
- "is_executing": false
- }
- },
+ "execution_count": 22,
+ "metadata": {},
"outputs": [
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Proportion of exact matches for each field using threshold: 0.999\n",
- "given_name 0.95\n",
- "surname 0.94\n",
- "street_number 0.85\n",
- "address_1 0.93\n",
- "address_2 0.75\n",
- "suburb 0.95\n",
- "postcode 0.97\n",
- "state 1.00\n",
- "date_of_birth 0.98\n",
- "soc_sec_id 0.38\n",
- "dtype: float64\n"
- ]
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ "
"
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
}
],
"source": [
- "look_at_per_field_accuracy(threshold = 0.999, num_samples = 100)"
+ "plt.hist([scores_matches, scores_non_matches], bins=50, label=['matches', 'non-matches'])\n",
+ "plt.legend(loc='upper right')\n",
+ "plt.xlabel('similarity score')\n",
+ "plt.show()"
]
},
{
@@ -1036,39 +1017,29 @@
"pycharm": {}
},
"source": [
- "But if we look at a threshold which is closer to the boundary between real matches we should see a lot more errors:"
+ "We can see that the similarity scores for the matches and the ones for the non-matches form two different distributions. With a suitable linkage schema, these two distributions hardly overlap. \n",
+ "\n",
+ "When choosing a similarity threshold for solving, the valley between these two distributions is a good starting point. In this example, it is around 0.82. Almost all similarity scores above 0.82 come from matches, so the solver will produce a linkage result with high precision. However, recall will not be optimal, as some scores from matches still lie below 0.82. Raising the threshold favours precision, while lowering it favours recall."
]
},
{
"cell_type": "code",
- "execution_count": 22,
- "metadata": {
- "pycharm": {
- "is_executing": false
- }
- },
+ "execution_count": 23,
+ "metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "Proportion of exact matches for each field using threshold: 0.95\n",
- "given_name 0.58\n",
- "surname 0.59\n",
- "street_number 0.73\n",
- "address_1 0.67\n",
- "address_2 0.53\n",
- "suburb 0.71\n",
- "postcode 0.89\n",
- "state 0.95\n",
- "date_of_birth 0.75\n",
- "soc_sec_id 0.92\n",
- "dtype: float64\n"
+ "\u001b[31mProject deleted\u001b[0m\r\n"
]
}
],
"source": [
- "look_at_per_field_accuracy(threshold = 0.95, num_samples = 100)"
+ "# Deleting the project\n",
+ "!clkutil delete-project --project=\"{credentials['project_id']}\" \\\n",
+ " --apikey=\"{credentials['result_token']}\" \\\n",
+ " --server=\"{url}\""
]
}
],
@@ -1089,6 +1060,15 @@
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
+ },
+ "pycharm": {
+ "stem_cell": {
+ "cell_type": "raw",
+ "metadata": {
+ "collapsed": false
+ },
+ "source": []
+ }
}
},
"nbformat": 4,
diff --git a/docs/tutorial/index.rst b/docs/tutorial/index.rst
index 2998dbe3..d5dd159b 100644
--- a/docs/tutorial/index.rst
+++ b/docs/tutorial/index.rst
@@ -15,12 +15,14 @@ Tutorials
Usage
-----
+You can download the tutorials from `github `_.
+The dependencies are listed in ``tutorial-requirements.txt``.
The code is often evolving and may include some breaking changes not yet deployed in our testing deployment (at the
-URL https://testing.es.data61.xyz ). So to run the tutorials, you can either:
+URL ``_). So to run the tutorials, you can either:
- - use the tutorials from the `master` branch of this repository which will work with the currently deployed testing service,
+ - use the tutorials from the ``master`` branch of this repository which will work with the currently deployed testing service,
- or build and deploy the service from the same branch as the tutorials you would like to run, providing its URL to
- the tutorials via the environment variable `SERVER` (e.g. `SERVER=http://0.0.0.0:8851` if deployed locally).
+ the tutorials via the environment variable ``SERVER`` (e.g. ``SERVER=http://0.0.0.0:8851`` if deployed locally).
Other use-cases are not supported and may fail for non-obvious reasons.
@@ -28,5 +30,5 @@ External Tutorials
------------------
The ``clkhash`` library includes a tutorial of carrying out record linkage on perturbed data.
-
+``_
diff --git a/docs/tutorial/multiparty-linkage-in-entity-service.ipynb b/docs/tutorial/multiparty-linkage-in-entity-service.ipynb
index a5a5e5f6..b4ea0f79 100644
--- a/docs/tutorial/multiparty-linkage-in-entity-service.ipynb
+++ b/docs/tutorial/multiparty-linkage-in-entity-service.ipynb
@@ -13,6 +13,7 @@
"import csv\n",
"import itertools\n",
"import os\n",
+ "import pandas as pd\n",
"\n",
"import requests"
]
@@ -26,7 +27,262 @@
"# Entity Service: Multiparty linkage demo\n",
"This notebook is a demonstration of the multiparty linkage capability that has been implemented in the Entity Service.\n",
"\n",
- "We show how five parties may upload their hashed data to the Entity Service to obtain a multiparty linkage result. This result identifies each entity across all datasets in which they are included."
+ "We show how five parties may upload their hashed data to the Entity Service to obtain a multiparty linkage result. This result identifies each entity across all datasets in which they are included.\n",
+ "\n",
+ "Each party has a dataset of the following form:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
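+ "<div>\n",
+ "\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>givenname</th>\n",
+ "      <th>surname</th>\n",
+ "      <th>dob</th>\n",
+ "      <th>gender</th>\n",
+ "      <th>city</th>\n",
+ "      <th>income</th>\n",
+ "      <th>phone number</th>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>id</th>\n",
+ "      <th></th>\n",
+ "      <th></th>\n",
+ "      <th></th>\n",
+ "      <th></th>\n",
+ "      <th></th>\n",
+ "      <th></th>\n",
+ "      <th></th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>tara</td>\n",
+ "      <td>hilton</td>\n",
+ "      <td>27-08-1941</td>\n",
+ "      <td>male</td>\n",
+ "      <td>canberra</td>\n",
+ "      <td>84052.973</td>\n",
+ "      <td>08 2210 0298</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>saJi</td>\n",
+ "      <td>vernre</td>\n",
+ "      <td>22-12-2972</td>\n",
+ "      <td>mals</td>\n",
+ "      <td>perth</td>\n",
+ "      <td>50104.118</td>\n",
+ "      <td>02 1090 1906</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>7</th>\n",
+ "      <td>sliver</td>\n",
+ "      <td>paciorek</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>mals</td>\n",
+ "      <td>sydney</td>\n",
+ "      <td>31750.893</td>\n",
+ "      <td>NaN</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>9</th>\n",
+ "      <td>ruby</td>\n",
+ "      <td>george</td>\n",
+ "      <td>09-05-1939</td>\n",
+ "      <td>male</td>\n",
+ "      <td>sydney</td>\n",
+ "      <td>135099.875</td>\n",
+ "      <td>07 4698 6255</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>10</th>\n",
+ "      <td>eyrinm</td>\n",
+ "      <td>campbell</td>\n",
+ "      <td>29-1q-1983</td>\n",
+ "      <td>male</td>\n",
+ "      <td>perth</td>\n",
+ "      <td>NaN</td>\n",
+ "      <td>08 299y 1535</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"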
+ ],
+ "text/plain": [
+ " givenname surname dob gender city income phone number\n",
+ "id \n",
+ "0 tara hilton 27-08-1941 male canberra 84052.973 08 2210 0298\n",
+ "3 saJi vernre 22-12-2972 mals perth 50104.118 02 1090 1906\n",
+ "7 sliver paciorek NaN mals sydney 31750.893 NaN\n",
+ "9 ruby george 09-05-1939 male sydney 135099.875 07 4698 6255\n",
+ "10 eyrinm campbell 29-1q-1983 male perth NaN 08 299y 1535"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pd.read_csv('data/dataset-1.csv', index_col='id').head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Comparing the beginning of the first dataset with that of the second, we can see that the data quality is poor: there are many spelling mistakes and missing values. Let's see how well the Entity Service does at linking these entities."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "