Store uploaded encodings in database #516
Conversation
pipeline = convert_encodings_from_base64_to_binary(stream_json_clksnblocks(raw_data))
with DBConn() as db:
    for entity_id, encoding_data, blocks in pipeline:
        # write this encoding to files or database
        insert_encoding_into_blocks(db, dp_id, blocks, entity_id, encoding_data)
It would be fair to say this should be improved. Doing insertions one at a time (in fact 2 SQL queries per encoding) is sub-optimal.
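A hedged sketch of one way to batch this, assuming psycopg2 is the driver and using `execute_values` from `psycopg2.extras`; the `encodings` table and its columns are illustrative placeholders rather than this PR's actual schema, and the per-encoding block rows are omitted for brevity:

```python
from psycopg2.extras import execute_values

def insert_encodings_batched(db, dp_id, pipeline, page_size=10_000):
    """Insert encodings in pages rather than one at a time.

    One multi-row INSERT per page replaces the two queries issued per
    encoding in the current loop. Table/column names are illustrative.
    """
    sql = "INSERT INTO encodings (dp, entity_id, encoding) VALUES %s"
    with db.cursor() as cur:
        batch = []
        for entity_id, encoding_data, blocks in pipeline:
            batch.append((dp_id, entity_id, encoding_data))
            if len(batch) >= page_size:
                execute_values(cur, sql, batch)
                batch = []
        if batch:
            execute_values(cur, sql, batch)
    db.commit()
```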
wilko77 left a comment:
Looks pretty good. Have a look at my comments, though.
if not rows:
    break
for row in rows:
    yield row[0]
Wait a minute, does that mean we will hold on to the db connection until all the yielding is done?
Won't you call this like:

    with DBConn() as db:
        for id in get_encodingblock_ids(...):
            ...

That doesn't seem like a good idea.
Why don't you think it would be good to keep the db connection while streaming through the blocks? Establishing a db connection is not free. The point of this sneaky Python cache is that we might not have the memory to store all (e.g. millions of) blocks if we used fetchall(), but the network overhead of fetchone(), fetching and yielding each row one at a time, would be a killer.
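For context, the pattern under discussion is roughly a generator that keeps one cursor open and pages through the result set with `fetchmany()` — a minimal sketch assuming psycopg2, with the query, table name, and page size as placeholders rather than the PR's actual code:

```python
def get_encodingblock_ids(db, dp_id, page_size=10_000):
    """Yield encoding ids for one data provider without loading them all.

    fetchmany() keeps memory bounded (unlike fetchall()) while avoiding a
    network round trip per row (unlike fetchone()). The cursor, and
    therefore the connection, stays in use until the caller exhausts the
    generator.
    """
    with db.cursor() as cur:
        cur.execute(
            "SELECT encoding_id FROM encodingblocks WHERE dp = %s",
            (dp_id,))
        while True:
            rows = cur.fetchmany(page_size)
            if not rows:
                break
            for row in rows:
                yield row[0]
```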
4 B encoding id, and assuming `k` multiple unique blocks of 64 B will be a transaction
of approximately k*64 + 132 * n. For k = 10 and n = 100_000 this gives a transaction
size under 100MiB.
"""
What happens if those assumptions don't hold? Let's say we run a comparison with wildly bigger encoding sizes. Will it all explode? Should we make n dependent on the encoding size?
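One possible way to do that, sketched under the assumption that the upload task knows the encoding size up front; the helper name, target transaction size, and bounds are all illustrative:

```python
def choose_page_size(encoding_size, target_tx_bytes=50 * 2**20,
                     id_overhead=4, min_page=1_000, max_page=100_000):
    """Pick how many encodings go into one transaction.

    Scales n down as encodings get bigger, so the transaction stays
    roughly under target_tx_bytes instead of relying on a fixed n.
    """
    per_row = encoding_size + id_overhead
    n = target_tx_bytes // per_row
    return max(min_page, min(max_page, n))

# e.g. 128 B encodings keep the current n, 4 KiB encodings shrink it
assert choose_page_size(128) == 100_000
assert choose_page_size(4096) == 12_787
```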
count INT NOT NULL,

-- Number of blocks uploaded
block_count INT NOT NULL DEFAULT 1
Both count and block_count are redundant info in the database. These values are also present in the blocks and encodings tables.
We should be clearer on why we also need those values. What's their meaning?
They are a bit redundant - they can be computed by the database easily enough.
I think the count of uploaded encodings is only used in scheduling/chunking the comparison tasks - the block_count won't be used anywhere yet (well, it is notionally checked in the encoding_upload task).
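For example, assuming tables roughly named `encodings` and `encodingblocks` keyed by data provider (guesses, not this PR's actual schema), both values could be derived on demand rather than stored:

```python
def fetch_upload_counts(db, dp_id):
    """Derive the encoding and block counts instead of storing them.

    Table and column names are illustrative; the real schema may differ.
    """
    with db.cursor() as cur:
        cur.execute(
            """
            SELECT
              (SELECT COUNT(*) FROM encodings WHERE dp = %(dp)s) AS encoding_count,
              (SELECT COUNT(DISTINCT block_id)
                 FROM encodingblocks WHERE dp = %(dp)s) AS block_count
            """,
            {"dp": dp_id})
        return cur.fetchone()
```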
This PR modifies the postgres database to store the raw encoding and blocking information.
I've modified the upload data task to insert the encoding data into the database, but have kept the existing object-store, file-based approach in place. We now rewrite any user supplied JSON in the clks format into the clksnblocks format so our background task only has to deal with one format. The comparison tasks have not been touched - they will have to be modified to pull data from the db instead of the object store.
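As a rough illustration of that rewrite (the exact field layout is an assumption here, not taken from this PR): if the clks format is a plain list of base64 encodings and clksnblocks pairs each encoding with its block ids, then assigning every encoding to a single default block might look like this:

```python
def clks_to_clksnblocks(clks_json):
    """Rewrite {"clks": [...]} as {"clksnblocks": [[encoding, block_id], ...]}.

    Every encoding is assigned to one default block ("1"), so downstream
    tasks only ever see the clksnblocks shape. Field names are assumptions.
    """
    return {
        "clksnblocks": [[encoding, "1"] for encoding in clks_json["clks"]]
    }

# e.g. {"clks": ["AAEC..."]} -> {"clksnblocks": [["AAEC...", "1"]]}
```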
The database now has these tables to store the encodings and blocks: