Feature pull client data from object store #549
Changes from all commits: 3dfa25d, 80d99fa, db9464f, 9d18ec4, 8ccd27b, a6d3715, 3dbb49b, b54423f, 1d27bbb, f9cc8c3, 54eb646
```diff
@@ -2,3 +2,4 @@
 .git/
 data
 .env
+*.log
```
@@ -321,16 +321,21 @@ paths: | |
| '/projects/{project_id}/clks': | ||
| post: | ||
| operationId: entityservice.views.project.project_clks_post | ||
| summary: Upload encoded PII data to a linkage project. | ||
| summary: Upload encoded data to a linkage project. | ||
| tags: | ||
| - Project | ||
| description: | | ||
| Called by each of the data providers with their calculated `CLK` vectors. | ||
| The project must have been created, and the caller must have both the | ||
| `project_id` and a valid `upload_token` in order to contribute data. | ||
| Called by each data provider with their encodings and optional blocking | ||
| information. | ||
|
|
||
| The data uploaded must be of one of the following formats. | ||
| - CLKs only upload: An array of base64 encoded [CLKs](./concepts.html#cryptographic-longterm-keys), one per | ||
| The caller must have both the `project_id` and a valid `upload_token` in order to contribute data, | ||
| both of these are generated when a project is created. | ||
| This endpoint can directly accept uploads up to several hundred MiB, and can pull encoding data from | ||
| an object store for larger uploads. | ||
|
|
||
| The data uploaded must be of one of the following formats: | ||
|
|
||
| - Encodings only: An array of base64 encoded [CLKs](./concepts.html#cryptographic-longterm-keys), one per | ||
| entity. | ||
| - CLKs with blocking information upload: An array of base64 encoded CLKs with corresponding blocking | ||
| information. One element in this array is an array with the first element being a base64 encoded CLK followed | ||
|
|
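A minimal sketch of the "encodings only" upload format described above. The values here are toy 16-byte encodings, not real CLKs, and the variable names are illustrative:

```python
import base64

# Two toy 16-byte "encodings" (real CLKs are typically 128 bytes each).
raw_encodings = [bytes([i] * 16) for i in range(2)]

# The "encodings only" format: an array of base64 strings, one per entity.
clks = [base64.b64encode(e).decode('ascii') for e in raw_encodings]

print(len(clks))  # 2
```

Decoding any element recovers the original fixed-length binary encoding, which is what the service validates on upload.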
@@ -342,7 +347,7 @@ paths: | |
| The uploaded encodings must all have the same length in bytes. If the project's linkage schema | ||
| specifes an encoding size it will be checked and enforced before any runs are computed. Note a | ||
| minimum and maximum encoding size can be set at the server level at deployment time. | ||
| Currently anonlink requires this _encoding size_ to be a multiple of 8. An example value is 128 Bytes. | ||
| Currently anonlink requires this _encoding size_ to be a multiple of 8. A common value is `128 Bytes`. | ||
|
|
||
| Note in the default deployment the maximum request size is set to `~10 GB`, which __should__ | ||
| translate to just over 20 million entities. | ||
|
|
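The size rules described in this hunk can be sketched as a small validator. The bounds used here are illustrative, not the service's actual deployment defaults:

```python
def check_encoding_size(size: int, min_size: int = 8, max_size: int = 1024) -> None:
    """Validate an encoding size in bytes: a multiple of 8, within server bounds."""
    if size % 8 != 0:
        raise ValueError(f"encoding size {size} is not a multiple of 8")
    if not (min_size <= size <= max_size):
        raise ValueError(f"encoding size {size} outside [{min_size}, {max_size}]")

check_encoding_size(128)  # a common value - passes
```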
@@ -361,12 +366,13 @@ paths: | |
| - $ref: '#/components/parameters/project_id' | ||
| - $ref: '#/components/parameters/token' | ||
| requestBody: | ||
| description: the encoded PII | ||
| description: Data to upload | ||
| required: true | ||
| content: | ||
| application/json: | ||
| schema: | ||
| oneOf: | ||
| - $ref: '#/components/schemas/EncodingUpload' | ||
| - $ref: '#/components/schemas/CLKUpload' | ||
| - $ref: '#/components/schemas/CLKnBlockUpload' | ||
| responses: | ||
|
|
@@ -1081,17 +1087,91 @@ components: | |
| required: | ||
| - number | ||
|
|
||
| EncodingUpload: | ||
| description: Object that contains one data provider's encodings | ||
| type: object | ||
| required: [encodings] | ||
| properties: | ||
| encodings: | ||
| oneOf: | ||
| - $ref: '#/components/schemas/EncodingArray' | ||
| - $ref: '#/components/schemas/ExternalData' | ||
| blocks: | ||
| oneOf: | ||
| - $ref: '#/components/schemas/BlockMap' | ||
| ## TODO may be useful to handle external blocking data too | ||
|
> **Collaborator:** definitely. With very small blocks, the blocking info is in the same order of size as the encodings.
>
> **Author:** This will be part of a follow up PR.
```diff
+            #- $ref: '#/ExternalData'
+
+    EncodingArray:
+      description: Array of encodings, base64 encoded.
+      type: array
+      items:
+        - type: string
+          format: byte
+          description: Base64 encoded CLK data
+
+    BlockMap:
```
> **Contributor:** Is `blocks` required?
>
> **Author:** No it isn't required, see above where it is referenced - only the `encodings` property is required. For example:
>
> ```json
> {
>   "1": ["block1", "block2"],
>   "2": ["block2"]
> }
> ```
```diff
+      description: Blocking information for encodings. A mapping from encoding id (a string) to a list of block ids
+      type: object
+      additionalProperties:
+        type: array
+        items:
+          - type: string
+            description: Block ID
+      example:
+        "1": ["block1", "block2"]
+        "2": []
+        "3": ["block1"]
+
+    ExternalData:
+      description: A file in an object store.
+      type: object
+      required: [file]
+      properties:
+        credentials:
+          type: object
+          description: |
+            Optional credentials to pull the file from the object store.
+
+            Not required if using the Anonlink Entity Service's own object store.
+          properties:
+            AccessKeyId:
+              type: string
+            SecretAccessKey:
+              type: string
+            SessionToken:
+              type: string
+        file:
+          type: object
+          required: [bucket, path]
+          properties:
+            bucket:
+              type: string
+              example: anonlink-uploads
+            path:
+              type: string
+              description: The object name in the bucket.
+              example: project-foo/encodings.bin
+            endpoint:
+              type: string
+              description: |
+                Object store endpoint - usually a public endpoint for a MinIO as part of an Anonlink deployment e.g.
+                `minio.anonlink.easd.data61.xyz`, or a public (region specific) endpoint for AWS S3:
+                `s3.ap-southeast-2.amazonaws.com`.
+
+                If not given the Anonlink Entity Service's own object store will be assumed.
+              example: s3.ap-southeast-2.amazonaws.com
+            secure:
+              type: boolean
+              default: true
+              description: If this object store should be connected to only over a secure connection.
+
     CLKUpload:
       description: Object that contains this party's Bloom Filters
       type: object
       required: [clks]
       properties:
         clks:
-          type: array
-          items:
-            type: string
-            format: byte
-            description: Base64 encoded CLK data
+          $ref: '#/components/schemas/EncodingArray'

     CLKnBlockUpload:
       description: Object that contains this party's Bloom Filters including blocking information
```
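An upload referencing an object store, built to match the `EncodingUpload`/`ExternalData` schemas above, could look like the following. This is a sketch: the bucket, path, and endpoint values are illustrative, and the optional `credentials` block is omitted (as when using the service's own object store):

```python
import json

# An EncodingUpload whose encodings live in an object store rather than
# inline in the request body (field names follow the schema above).
payload = {
    "encodings": {
        "file": {
            "bucket": "anonlink-uploads",
            "path": "project-foo/encodings.bin",
            "endpoint": "s3.ap-southeast-2.amazonaws.com",
            "secure": True,
        }
    }
}

body = json.dumps(payload)
```

The same `encodings` key could instead hold a plain `EncodingArray` of base64 strings; the `oneOf` in the schema is what lets the endpoint accept either form.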
@@ -3,10 +3,19 @@ | |
| from typing import Iterator, List, Tuple | ||
|
|
||
| import ijson | ||
| import opentracing | ||
| from flask import g | ||
| from structlog import get_logger | ||
|
|
||
| from entityservice import database as db | ||
| from entityservice.database import insert_encodings_into_blocks, get_encodingblock_ids, \ | ||
| get_chunk_of_encodings | ||
| get_chunk_of_encodings, DBConn | ||
| from entityservice.serialization import deserialize_bytes, binary_format, binary_unpack_filters | ||
| from entityservice.utils import fmt_bytes | ||
|
|
||
| logger = get_logger() | ||
|
|
||
| DEFAULT_BLOCK_ID = '1' | ||
|
|
||
|
|
||
| def stream_json_clksnblocks(f): | ||
|
|
@@ -110,3 +119,47 @@ def get_encoding_chunk(conn, chunk_info, encoding_size=128): | |
| chunk_data = binary_unpack_filters(encoding_iter, encoding_size=encoding_size) | ||
| return chunk_data, len(chunk_data) | ||
|
|
||
|
|
||
| def upload_clk_data_binary(project_id, dp_id, encoding_iter, receipt_token, count, size=128): | ||
| """ | ||
| Save the user provided binary-packed CLK data. | ||
|
|
||
| """ | ||
| filename = None | ||
| # Set the state to 'pending' in the uploads table | ||
| with DBConn() as conn: | ||
| db.insert_encoding_metadata(conn, filename, dp_id, receipt_token, encoding_count=count, block_count=1) | ||
|
> **Collaborator:** what's with the `filename = None`?
>
> **Author:** It is just a way of keeping the column in the `uploads` table.
```diff
+        db.update_encoding_metadata_set_encoding_size(conn, dp_id, size)
+    num_bytes = binary_format(size).size * count
+
+    logger.debug("Directly storing binary file with index, base64 encoded CLK, popcount")
+
+    # Upload to database
+    logger.info(f"Uploading {count} binary encodings to database. Total size: {fmt_bytes(num_bytes)}")
+    parent_span = g.flask_tracer.get_span()
+
+    with DBConn() as conn:
+        with opentracing.tracer.start_span('create-default-block-in-db', child_of=parent_span):
+            db.insert_blocking_metadata(conn, dp_id, {DEFAULT_BLOCK_ID: count})
+
+        with opentracing.tracer.start_span('upload-encodings-to-db', child_of=parent_span):
+            store_encodings_in_db(conn, dp_id, encoding_iter, size)
+
+        with opentracing.tracer.start_span('update-encoding-metadata', child_of=parent_span):
+            db.update_encoding_metadata(conn, filename, dp_id, 'ready')
+
+
+def include_encoding_id_in_binary_stream(stream, size, count):
+    """
+    Inject an encoding_id and default block into a binary stream of encodings.
+    """
+    binary_formatter = binary_format(size)
+
+    def encoding_iterator(filter_stream):
+        # Assumes encoding id and block info not provided (yet)
+        for entity_id in range(count):
+            yield str(entity_id), binary_formatter.pack(entity_id, filter_stream.read(size)), [DEFAULT_BLOCK_ID]
+
+    return encoding_iterator(stream)
```
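The iterator added above relies on the service's internal `binary_format`; a self-contained analogue using the standard `struct` module can show the idea. The layout here (a 4-byte little-endian encoding id followed by the raw encoding bytes) is a hypothetical stand-in, not necessarily the service's real wire format:

```python
import io
import struct

def make_formatter(size: int) -> struct.Struct:
    # Hypothetical layout: 4-byte little-endian encoding id + `size` bytes
    # of encoding data. The service's actual binary_format may differ.
    return struct.Struct(f'<I{size}s')

def encoding_iterator(stream, size: int, count: int):
    """Yield (encoding_id, packed_bytes, block_ids) triples from a binary stream."""
    formatter = make_formatter(size)
    for entity_id in range(count):
        # Every encoding lands in a single default block, mirroring the PR's
        # DEFAULT_BLOCK_ID = '1' behaviour.
        yield str(entity_id), formatter.pack(entity_id, stream.read(size)), ['1']

size = 8
stream = io.BytesIO(bytes(range(size * 2)))  # two consecutive 8-byte encodings
rows = list(encoding_iterator(stream, size, count=2))
print(rows[0][0])  # '0'
```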
> **Reviewer:** Nice 💯