148 changes: 83 additions & 65 deletions docs/tutorial/Permutations.ipynb
@@ -23,15 +23,15 @@
"### Steps\n",
"These steps are usually run by different companies - but for illustration all is carried out in this one file. The participants providing data are _Alice_ and *Bob*, and the *Analyst* acting the integration authority.\n",
"\n",
"* [Check connection to Entity Service](#check_con)\n",
"* [Data preparation](#data_prep)\n",
"* [Check connection to Entity Service](#Check-Connection)\n",
"* [Data preparation](#Data-preparation)\n",
" * Write CSV files with PII\n",
" * [Create a Linkage Schema](#schema_prep)\n",
"* [Create Linkage Project](#create_pro)\n",
"* [Generate CLKs from PII](#hash_n_up)\n",
"* [Upload the PII](#hash_n_up)\n",
"* [Create a run](#create_run)\n",
"* [Retrieve and analyse results](#results)"
" * [Create a Linkage Schema](#Schema-Preparation)\n",
"* [Create Linkage Project](#Create-Linkage-Project)\n",
"* [Generate CLKs from PII](#Hash-and-Upload)\n",
"* [Upload the PII](#Hash-and-Upload)\n",
"* [Create a run](#Create-a-run)\n",
"* [Retrieve and analyse results](#Results)"
]
},
{
@@ -40,7 +40,6 @@
"pycharm": {}
},
"source": [
"<a id=\"check_con\"></a>\n",
"## Check Connection\n",
"\n",
"> If you're connecting to a custom entity service, change the address here."
@@ -82,7 +81,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"{\"project_count\": 6534, \"rate\": 2504556, \"status\": \"ok\"}\r\n"
"{\"project_count\": 7050, \"rate\": 2824020, \"status\": \"ok\"}\r\n"
]
}
],
@@ -96,7 +95,6 @@
"pycharm": {}
},
"source": [
"<a id=\"data_prep\"></a>\n",
"## Data preparation\n",
"\n",
"Following the [clkhash tutorial](http://clkhash.readthedocs.io/en/latest/tutorial_cli.html) we will use a dataset from the `recordlinkage` library. We will just write both datasets out to temporary CSV files.\n"
@@ -173,7 +171,7 @@
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>rec-1070-org</th>\n",
" <td>rec-1070-org</td>\n",
" <td>michaela</td>\n",
" <td>neumann</td>\n",
" <td>8</td>\n",
@@ -186,7 +184,7 @@
" <td>5304218</td>\n",
" </tr>\n",
" <tr>\n",
" <th>rec-1016-org</th>\n",
" <td>rec-1016-org</td>\n",
" <td>courtney</td>\n",
" <td>painter</td>\n",
" <td>12</td>\n",
@@ -199,7 +197,7 @@
" <td>4066625</td>\n",
" </tr>\n",
" <tr>\n",
" <th>rec-4405-org</th>\n",
" <td>rec-4405-org</td>\n",
" <td>charles</td>\n",
" <td>green</td>\n",
" <td>38</td>\n",
@@ -262,9 +260,7 @@
"pycharm": {}
},
"source": [
"<a id=\"schema_prep\"></a>\n",
"## Schema Preparation\n",
"\n",
"The linkage schema must be agreed on by the two parties. A hashing schema instructs clkhash how to treat each column for generating CLKs. A detailed description of the hashing schema can be found in the [api docs](http://clkhash.readthedocs.io/en/latest/schema.html). We will ignore the columns ‘rec_id’ and ‘soc_sec_id’ for CLK generation."
]
},
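To make the idea of turning columns into a CLK more concrete, here is a toy sketch of a Bloom-filter style encoding: each field is split into character bigrams, each bigram sets a few bit positions chosen by a keyed hash, and two encodings are compared with the Dice coefficient. This is not clkhash's implementation; the function names, the key `shared-secret`, the filter size and the number of hashes are all assumptions chosen for illustration, and a real schema controls these choices per column.

```python
import hashlib
import hmac


def bigrams(value: str) -> set:
    """Split a field into overlapping two-character tokens, padded with spaces."""
    padded = f" {value.lower()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def toy_clk(fields, secret=b"shared-secret", num_bits=1024, num_hashes=2):
    """Return the set of bit positions switched on for a record (toy encoding)."""
    bits = set()
    for field in fields:
        for gram in bigrams(field):
            for seed in range(num_hashes):
                # Keyed hash so that only parties holding the secret can encode.
                message = f"{seed}:{gram}".encode("utf-8")
                digest = hmac.new(secret, message, hashlib.sha256).digest()
                bits.add(int.from_bytes(digest[:8], "big") % num_bits)
    return bits


def dice(a: set, b: set) -> float:
    """Dice coefficient between two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


original = toy_clk(["joshua", "rigley"])
with_typo = toy_clk(["joshua", "rigly"])
unrelated = toy_clk(["david", "hobson"])

print("typo similarity:     ", round(dice(original, with_typo), 3))
print("unrelated similarity:", round(dice(original, unrelated), 3))
```

Because both parties must produce comparable bit patterns, they have to agree on the schema and the secret beforehand, which is why the linkage schema is shared between them.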
@@ -294,7 +290,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Overwriting /tmp/tmptm0w938k\n"
"Overwriting /var/folders/mw/21b9jb5d1c9_3_z0dq7hpx1m00j_0b/T/tmp3jpcxxrs\n"
]
}
],
@@ -518,7 +514,6 @@
"pycharm": {}
},
"source": [
"<a id=\"create_pro\"></a>\n",
"## Create Linkage Project\n",
"\n",
"The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.\n"
@@ -537,17 +532,17 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Credentials will be saved in /tmp/tmptneh9xy1\n",
"Credentials will be saved in /var/folders/mw/21b9jb5d1c9_3_z0dq7hpx1m00j_0b/T/tmp_tz_feve\n",
"\u001b[31mProject created\u001b[0m\n"
]
},
{
"data": {
"text/plain": [
"{'project_id': '12256e29a8ad92c9016ba3e7650888f13d3bfb3bd23cc98a',\n",
" 'result_token': '1a588d384f651e9430ac1bb42196f9fe393ff10e8ec65f48',\n",
" 'update_tokens': ['6111c582a0d6a649480c719adcd258b811da17887849ee00',\n",
" '4239370ce8868a9eb3dc85a85eca243bf593a0cc637a5be8']}"
"{'project_id': '7c942add9259b0c61fc06ce24afc6ee9c99355cc5a5eae7a',\n",
" 'result_token': '4552074bebabf66a19e707ef64aa35638fc1eb2cd3b9a768',\n",
" 'update_tokens': ['1045c9dda873d3cccf37181bcff7c61a5e82c6051d0da2c0',\n",
" 'fc27160c4e4736c1dbbecbedd6bc5e4117a3626c1f2eda9c']}"
]
},
"execution_count": 7,
@@ -559,7 +554,12 @@
"creds = NamedTemporaryFile('wt')\n",
"print(\"Credentials will be saved in\", creds.name)\n",
"\n",
"!clkutil create-project --schema \"{schema.name}\" --output \"{creds.name}\" --type \"permutations\" --server \"{url}\"\n",
"!clkutil create-project \\\n",
" --schema \"{schema.name}\" \\\n",
" --output \"{creds.name}\" \\\n",
" --type \"permutations\" \\\n",
" --server \"{url}\"\n",
"\n",
"creds.seek(0)\n",
"\n",
"import json\n",
@@ -578,7 +578,6 @@
"source": [
"**Note:** the analyst will need to pass on the `project_id` (the id of the linkage project) and one of the two `update_tokens` to each data provider.\n",
"\n",
"<a id=\"hash_n_up\"></a>\n",
"## Hash and Upload\n",
"\n",
"At the moment both data providers have *raw* personally identiy information. We first have to generate CLKs from the raw entity information. We need:\n",
@@ -602,8 +601,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[31mCLK data written to /tmp/tmp9vdauwh4.json\u001b[0m\n",
"\u001b[31mCLK data written to /tmp/tmpgspffags.json\u001b[0m\n"
"\u001b[31mCLK data written to /var/folders/mw/21b9jb5d1c9_3_z0dq7hpx1m00j_0b/T/tmppybfm62c.json\u001b[0m\n",
"\u001b[31mCLK data written to /var/folders/mw/21b9jb5d1c9_3_z0dq7hpx1m00j_0b/T/tmpu4jx4mjv.json\u001b[0m\n"
]
}
],
@@ -743,7 +742,6 @@
"pycharm": {}
},
"source": [
"<a id=\"create_run\"></a>\n",
"## Create a run\n",
"\n",
"Now the project has been created and the CLK data has been uploaded we can carry out some privacy preserving record linkage. Try with a few different threshold values:"
@@ -776,7 +774,6 @@
"pycharm": {}
},
"source": [
"<a id=\"results\"></a>\n",
"## Results\n",
"\n",
"Now after some delay (depending on the size) we can fetch the mask.\n",
@@ -964,7 +961,7 @@
{
"data": {
"text/plain": [
"[2418, 3590, 2340, 1226, 1323, 251, 4696, 2598, 4019, 301]"
"[3645, 1068, 4371, 465, 1533, 987, 343, 53, 3298, 2515]"
]
},
"execution_count": 20,
@@ -998,7 +995,7 @@
{
"data": {
"text/plain": [
"[3183, 4293, 3406, 2808, 4528, 2446, 4606, 1601, 1641, 2062]"
"[3857, 4827, 3267, 4934, 1958, 3682, 4576, 4895, 4867, 1188]"
]
},
"execution_count": 21,
@@ -1072,16 +1069,16 @@
{
"data": {
"text/plain": [
"['rec-3933-org,joshua,rigley,19,east place,kergunyah,kingaroy,3665,vic,19670613,4096438\\n',\n",
" 'rec-1057-org,samara,pringle,7,allan street,bonnie doon,campbelltown,5073,nsw,19560429,3493586\\n',\n",
" 'rec-4035-org,chloe,worm,6,brentnall place,donna valley,karloo,3128,nsw,19000814,9383057\\n',\n",
" 'rec-3793-org,lucy,mccarthy,29,charlton street,warrah lea,bundaberg,4061,qld,19940917,6596660\\n',\n",
" 'rec-27-org,angelina,campbell,161,jackie howe crescent,bugoren,woorim,6052,nsw,19531108,8948230\\n',\n",
" 'rec-2303-org,tahlia,hage,3,maclaurin crescent,,ormond,4740,tas,19190517,6174860\\n',\n",
" 'rec-658-org,david,hobson,14,vagabond crescent,dugout 65,patterson lakes,4880,wa,19010305,7666240\\n',\n",
" 'rec-4484-org,alexandra,clarke,15,parnell road,rsdb 284,nedlands,4014,sa,19890608,7235143\\n',\n",
" 'rec-702-org,barnaby,fleet,4,martley circuit,peak view,ascot vale,3930,sa,19360907,9383837\\n',\n",
" 'rec-3252-org,,campbell,4,dunbar street,delicate nobby street,cloverdale,2528,vic,19480406,8607518\\n']"
"['rec-3302-org,blaize,koopman,17,allison place,aldersyde estate,balwyn north,4650,nsw,19110608,7823755\\n',\n",
" 'rec-1385-org,joel,bishop,10,french street,cedarview,orange,3223,nt,,1324854\\n',\n",
" 'rec-190-org,,alias,24,elkington street,pangani,isle of capri,2145,sa,19650429,8261472\\n',\n",
" 'rec-4781-org,jacob,waller,89,dalley crescent,the willows,mosman,2480,qld,19580408,6317326\\n',\n",
" 'rec-4881-org,alexandra,nguyen,44,colebatch place,langley flats,freshwater,3242,nsw,19511004,6416159\\n',\n",
" 'rec-4770-org,tegan,rosendale,1,sherbrooke street,nazareth village,innaloo,2250,wa,19801011,9351309\\n',\n",
" 'rec-3385-org,shanaye,carbone,41,haystack crescent,st vincents hospital,matong,3690,nsw,19300519,1632237\\n',\n",
" 'rec-3738-org,imogen,carlington,45,mcinnes street,parish talowahl,girilambone,2154,nsw,19781117,7912921\\n',\n",
" 'rec-831-org,laura,flannery,54,sid barnes crescent,weemilah,winston hills,5073,qld,19581023,9712180\\n',\n",
" 'rec-815-org,holly,campbell,21,casey crescent,nestor,westmead,4573,qld,19911007,4424335\\n']"
]
},
"execution_count": 24,
@@ -1105,16 +1102,16 @@
{
"data": {
"text/plain": [
"['rec-3933-dup-0,joshua,rigly,19,east place,kergunyah,kingaroy,3665,vic,19670613,4096438\\n',\n",
" 'rec-1057-dup-0,pringle,samara,7,allan street,bonnie doon,campbelltown,5073,nsw,19560429,3493586\\n',\n",
" 'rec-4035-dup-0,chooe,worm,6,brentnal place,donna valley,karloo,3128,nsw,19000814,9383057\\n',\n",
" 'rec-3793-dup-0,mccarthy,lucy,29,charltonstreet,warrahlea,bundaverg,4061,qld,19940917,6596660\\n',\n",
" 'rec-27-dup-0,angelina,campbell,190,jackie howe crescent,bugoren,woorim,6352,nsw,19531108,8948230\\n',\n",
" 'rec-2303-dup-0,peter,ha ge,3,maclaurin crescent,,ormond,4704,tas,19190517,6174860\\n',\n",
" 'rec-658-dup-0,david,hobsson,14,vagabond cfescent,dugout 65,patterson lakes,4880,wa,19010305,7666240\\n',\n",
" 'rec-4484-dup-0,alexandra,clarke,15,rsd b 284,parnell roa,,4014,sa,19890608,7235143\\n',\n",
" 'rec-702-dup-0,barnay,fleet,4,martley circuit,peak view,ascot vale,3930,sa,19360907,9383837\\n',\n",
" 'rec-3252-dup-0,,campbell,4,dunbar svtreet,delicate nobby street,cloverdale,2528,vic,19480406,8607518\\n']"
"['rec-3302-dup-0,blaize,koopman,17,allison place,aldersydeestate,balwyn north,4650,nsw,19110608,7823755\\n',\n",
" 'rec-1385-dup-0,elton,bishop,10,french street,,orange,3223,nt,,1324854\\n',\n",
" 'rec-190-dup-0,,alias,24,elkington street,panganu,isle of capri,2145,sa,19650429,8261472\\n',\n",
" 'rec-4781-dup-0,jacob,waliler,89,dalley crescent,the ui llows,mosman,2487,qld,19580408,6317326\\n',\n",
" 'rec-4881-dup-0,nguyen,alexandra,44,colebatch place,langley flats,freshwater,3242,nsw,19511004,6416159\\n',\n",
" 'rec-4770-dup-0,tegan,rosendale,1,sherbrooke street,nazareth village,innaloo,2550,nsw,19801011,9351309\\n',\n",
" 'rec-3385-dup-0,shanaye,lonto,41,haystack crescent,,leetob,3680,nsw,19300519,1632237\\n',\n",
" 'rec-3738-dup-0,imogen,carlington,45,mcinnes treet,parish talowahl,girilabmone,2154,nsw,19781117,7912921\\n',\n",
" 'rec-831-dup-0,laura,flannery,54,sid barnes crescent,,winstonhills,5073,qld,19581023,9712180\\n',\n",
" 'rec-815-dup-0,holyl,campbell,21,casey crescent,,westmead,4573,qld,19911007,4424335\\n']"
]
},
"execution_count": 25,
@@ -1152,16 +1149,16 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Joshua Rigley (rec-3933-org) =? Joshua Rigly (rec-3933-dup-0)\n",
"Samara Pringle (rec-1057-org) =? Pringle Samara (rec-1057-dup-0)\n",
"Chloe Worm (rec-4035-org) =? Chooe Worm (rec-4035-dup-0)\n",
"Lucy Mccarthy (rec-3793-org) =? Mccarthy Lucy (rec-3793-dup-0)\n",
"Angelina Campbell (rec-27-org) =? Angelina Campbell (rec-27-dup-0)\n",
"Tahlia Hage (rec-2303-org) =? Peter Ha Ge (rec-2303-dup-0)\n",
"David Hobson (rec-658-org) =? David Hobsson (rec-658-dup-0)\n",
"Alexandra Clarke (rec-4484-org) =? Alexandra Clarke (rec-4484-dup-0)\n",
"Barnaby Fleet (rec-702-org) =? Barnay Fleet (rec-702-dup-0)\n",
" Campbell (rec-3252-org) =? Campbell (rec-3252-dup-0)\n"
"Blaize Koopman (rec-3302-org) =? Blaize Koopman (rec-3302-dup-0)\n",
"Joel Bishop (rec-1385-org) =? Elton Bishop (rec-1385-dup-0)\n",
" Alias (rec-190-org) =? Alias (rec-190-dup-0)\n",
"Jacob Waller (rec-4781-org) =? Jacob Waliler (rec-4781-dup-0)\n",
"Alexandra Nguyen (rec-4881-org) =? Nguyen Alexandra (rec-4881-dup-0)\n",
"Tegan Rosendale (rec-4770-org) =? Tegan Rosendale (rec-4770-dup-0)\n",
"Shanaye Carbone (rec-3385-org) =? Shanaye Lonto (rec-3385-dup-0)\n",
"Imogen Carlington (rec-3738-org) =? Imogen Carlington (rec-3738-dup-0)\n",
"Laura Flannery (rec-831-org) =? Laura Flannery (rec-831-dup-0)\n",
"Holly Campbell (rec-815-org) =? Holyl Campbell (rec-815-dup-0)\n"
]
}
],
@@ -1230,6 +1227,27 @@
"print(\"Precision: {:.1f}%\".format(100*precision))\n",
"print(\"Recall: {:.1f}%\".format(100*recall))"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[31mProject deleted\u001b[0m\r\n"
]
}
],
"source": [
"# Deleting the project\n",
"!clkutil delete-project \\\n",
" --project=\"{credentials['project_id']}\" \\\n",
" --apikey=\"{credentials['result_token']}\" \\\n",
" --server=\"{url}\""
]
}
],
"metadata": {
@@ -1248,18 +1266,18 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
"version": "3.7.4"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"source": [],
"metadata": {
"collapsed": false
},
"source": []
}
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
}
2 changes: 1 addition & 1 deletion docs/tutorial/Record Linkage API.ipynb
@@ -694,7 +694,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
"version": "3.7.4"
},
"pycharm": {
"stem_cell": {