
Conversation

@hardbyte commented on Feb 24, 2020

This PR modifies the Postgres database to store the raw encoding and blocking information.

I've modified the upload data task to insert the encoding data into the database, but have kept the existing object-store, file-based approach in place. We now rewrite any user-supplied JSON in the clks format into the clksnblocks format so our background task only has to deal with one format.
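Roughly, the two upload formats look like this (the field layout and the single default block id below are illustrative assumptions, not taken from this PR's code):

```python
# Sketch only: a plain clks upload carries no block information, so during the
# rewrite each encoding is paired with a block list (here a default block "1").
clks_upload = {"clks": ["<base64 encoding 1>", "<base64 encoding 2>"]}

clksnblocks_upload = {
    "clksnblocks": [
        ["<base64 encoding 1>", "1"],  # encoding followed by its block id(s)
        ["<base64 encoding 2>", "1"],
    ]
}
```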

The comparison tasks have not been touched - they will have to be modified to pull data from the db instead of the object store.

The database now has these tables to store the encodings and blocks:

[Image: database schema diagram of the new encoding and block tables]

@hardbyte force-pushed the feature-store-encodings-in-db branch 2 times, most recently from 8b2e696 to 84e787a on February 24, 2020 at 22:27
@hardbyte requested a review from wilko77 on February 24, 2020 at 23:56
Comment on lines 44 to 48
```python
pipeline = convert_encodings_from_base64_to_binary(stream_json_clksnblocks(raw_data))
with DBConn() as db:
    for entity_id, encoding_data, blocks in pipeline:
        # write this encoding to files or database
        insert_encoding_into_blocks(db, dp_id, blocks, entity_id, encoding_data)
```
Collaborator Author (@hardbyte):

It would be fair to say this should be improved. Doing insertions one at a time (in fact 2 SQL queries per encoding) is sub-optimal.
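For the record, a sketch of one way to batch these inserts with psycopg2's execute_values (the table and column names, and the flat one-row-per-(encoding, block) layout, are assumptions for illustration rather than the schema in this PR):

```python
from psycopg2.extras import execute_values

def insert_encodings_batched(cur, dp_id, pipeline, page_size=10_000):
    # One multi-row INSERT per page instead of two queries per encoding.
    # Note this materialises the rows in memory; chunking the pipeline
    # would avoid that for very large uploads.
    rows = [
        (dp_id, entity_id, block, encoding_data)
        for entity_id, encoding_data, blocks in pipeline
        for block in blocks
    ]
    execute_values(
        cur,
        "INSERT INTO encodingblocks (dp, entity_id, block_id, encoding) VALUES %s",
        rows,
        page_size=page_size,
    )
```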

@wilko77 (Collaborator) left a comment:

Looks pretty good. Have a look at my comments though.

```python
    if not rows:
        break
    for row in rows:
        yield row[0]
```
Collaborator:

Wait a minute, does that mean we will hold on to the db connection until all the yielding is done?
Won't you call this like:

```python
with DBConn() as db:
    for id in get_encodingblock_ids(...):
        ...
```

That doesn't seem like a good idea.

Collaborator Author (@hardbyte):

Why don't you think it would be good to keep the db connection while streaming through the blocks? Establishing a db connection is not free. The point of this sneaky Python cache is that we might not have the memory to hold all the blocks (e.g. millions of them) if we used fetchall(), but the network overhead of fetching and yielding each row one at a time with fetchone() would be a killer.
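For context, the pattern being described looks roughly like this (a minimal sketch; the helper name, the batch size, and the use of a named server-side cursor on a psycopg2 connection are my assumptions, not necessarily what this PR does):

```python
def stream_ids(db, query, params, batch_size=10_000):
    # db is assumed to be a psycopg2 connection. A named (server-side) cursor
    # means the full result set never has to fit in client memory, while
    # fetchmany() pulls batch_size rows per round trip instead of one row at
    # a time as fetchone() would.
    with db.cursor(name='stream_ids') as cur:
        cur.execute(query, params)
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                break
            for row in rows:
                yield row[0]
```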

```python
    4 B encoding id, and assuming `k` multiple unique blocks of 64 B will be a transaction
    of approximately k*64 + 132 * n. For k = 10 and n = 100_000 this gives a transaction
    size under 100MiB.
    """
```
Collaborator:

What happens if those assumptions don't hold? Let's say we run a comparison with wildly bigger encoding sizes. Will it all explode? Should we make n dependent on the encoding size?
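One option along those lines (purely illustrative; the helper name, the byte budget, and the per-row cost model are assumptions, not from this PR) would be to derive n from the encoding size rather than hard-coding it:

```python
def rows_per_transaction(encoding_size: int,
                         max_transaction_bytes: int = 100 * 1024 * 1024,
                         id_size: int = 4) -> int:
    # Choose n so that n * (encoding bytes + id bytes) stays under the byte
    # budget, instead of baking a fixed per-encoding size into the constants
    # in the docstring above.
    return max(1, max_transaction_bytes // (encoding_size + id_size))
```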

Comment on lines +142 to +145
```sql
    count INT NOT NULL,

    -- Number of blocks uploaded
    block_count INT NOT NULL DEFAULT 1
```
Collaborator:

Both count and block_count are redundant info in the database. These values are also present in the blocks and encodings tables.
We should be clearer on why we also need those values. What's their meaning?

Collaborator Author (@hardbyte):

They are a bit redundant - they can be computed by the database easily enough.

I think the count of uploaded encodings is only used in scheduling/chunking the comparison tasks - the block_count won't be used anywhere yet (well it is notionally checked in the encoding_upload task).
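For reference, a sketch of how those values could be computed on demand instead of stored (the table and column names here are guesses based on the schema description above, not the actual ones in this PR):

```python
def uploaded_counts(db, dp_id):
    # Derive the encoding count and distinct block count for a data provider
    # straight from the uploaded encoding/block rows.
    with db.cursor() as cur:
        cur.execute(
            """
            SELECT count(DISTINCT entity_id) AS encoding_count,
                   count(DISTINCT block_id)  AS block_count
            FROM encodingblocks
            WHERE dp = %s
            """,
            (dp_id,),
        )
        return cur.fetchone()
```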

@hardbyte force-pushed the feature-store-encodings-in-db branch from 5f23688 to fa17c91 on March 2, 2020 at 03:18
@hardbyte force-pushed the feature-store-encodings-in-db branch from fa17c91 to 5ae5c57 on March 2, 2020 at 03:36
@hardbyte force-pushed the feature-store-encodings-in-db branch from 29fe3d2 to b03bf1f on March 2, 2020 at 04:24
@hardbyte merged commit 79a7e89 into develop on Mar 2, 2020
@hardbyte deleted the feature-store-encodings-in-db branch on March 2, 2020 at 04:52