
Conversation

@hardbyte commented on Feb 24, 2020

This PR modifies the Postgres database to store the raw encoding and blocking information.

I've modified the upload data task to insert the encoding data into the database, but have kept the existing object-store, file-based approach in place. We now rewrite any user-supplied JSON in the clks format into the clksnblocks format so our background task only has to deal with one format.
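Roughly, the two upload formats look like this (the field layout and the single default block id below are illustrative assumptions, not taken from this PR's code):

```python
# Sketch only: a plain clks upload carries no block information, so during the
# rewrite each encoding is paired with a block list (here a default block "1").
clks_upload = {"clks": ["<base64 encoding 1>", "<base64 encoding 2>"]}

clksnblocks_upload = {
    "clksnblocks": [
        ["<base64 encoding 1>", "1"],  # encoding followed by its block id(s)
        ["<base64 encoding 2>", "1"],
    ]
}
```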

The comparison tasks have not been touched - they will have to be modified to pull data from the db instead of the object store.

The database now has these tables to store the encodings and blocks:

[Image: database schema diagram of the new encoding and block tables]

@hardbyte force-pushed the feature-store-encodings-in-db branch 2 times, most recently from 8b2e696 to 84e787a on February 24, 2020 at 22:27
@hardbyte requested a review from wilko77 on February 24, 2020 at 23:56
Comment on lines 44 to 48
```python
pipeline = convert_encodings_from_base64_to_binary(stream_json_clksnblocks(raw_data))
with DBConn() as db:
    for entity_id, encoding_data, blocks in pipeline:
        # write this encoding to files or database
        insert_encoding_into_blocks(db, dp_id, blocks, entity_id, encoding_data)
```
Collaborator Author (@hardbyte):

It would be fair to say this should be improved. Doing insertions one at a time (in fact 2 SQL queries per encoding) is sub-optimal.
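For the record, a sketch of one way to batch these inserts with psycopg2's execute_values (the table and column names, and the flat one-row-per-(encoding, block) layout, are assumptions for illustration rather than the schema in this PR):

```python
from psycopg2.extras import execute_values

def insert_encodings_batched(cur, dp_id, pipeline, page_size=10_000):
    # One multi-row INSERT per page instead of two queries per encoding.
    # Note this materialises the rows in memory; chunking the pipeline
    # would avoid that for very large uploads.
    rows = [
        (dp_id, entity_id, block, encoding_data)
        for entity_id, encoding_data, blocks in pipeline
        for block in blocks
    ]
    execute_values(
        cur,
        "INSERT INTO encodingblocks (dp, entity_id, block_id, encoding) VALUES %s",
        rows,
        page_size=page_size,
    )
```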

@wilko77 (Collaborator) left a comment:

Looks pretty good. Have a look at my comments though.

```python
    if not rows:
        break
    for row in rows:
        yield row[0]
```
Collaborator:

Wait a minute, does that mean we will hold on to the db connection until all the yielding is done?
Won't you call this like:

```python
with DBConn() as db:
    for id in get_encodingblock_ids(...):
        ...
```

That doesn't seem like a good idea.

Collaborator Author (@hardbyte):

Why don't you think it would be good to keep the db connection while streaming through the blocks? Establishing a db connection is not free. The point of this sneaky Python cache is that we might not have the memory to hold all the blocks (e.g. millions of them) if we used fetchall(), but the network overhead of fetching and yielding each row one at a time with fetchone() would be a killer.
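For context, the pattern being described looks roughly like this (a minimal sketch; the helper name, the batch size, and the use of a named server-side cursor on a psycopg2 connection are my assumptions, not necessarily what this PR does):

```python
def stream_ids(db, query, params, batch_size=10_000):
    # db is assumed to be a psycopg2 connection. A named (server-side) cursor
    # means the full result set never has to fit in client memory, while
    # fetchmany() pulls batch_size rows per round trip instead of one row at
    # a time as fetchone() would.
    with db.cursor(name='stream_ids') as cur:
        cur.execute(query, params)
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                break
            for row in rows:
                yield row[0]
```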

```python
    4 B encoding id, and assuming `k` multiple unique blocks of 64 B will be a transaction
    of approximately k*64 + 132 * n. For k = 10 and n = 100_000 this gives a transaction
    size under 100MiB.
    """
```
Collaborator:

What happens if those assumptions don't hold? Let's say we run a comparison with wildly bigger encoding sizes. Will it all explode? Should we make n dependent on the encoding size?
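One option along those lines (purely illustrative; the helper name, the byte budget, and the per-row cost model are assumptions, not from this PR) would be to derive n from the encoding size rather than hard-coding it:

```python
def rows_per_transaction(encoding_size: int,
                         max_transaction_bytes: int = 100 * 1024 * 1024,
                         id_size: int = 4) -> int:
    # Choose n so that n * (encoding bytes + id bytes) stays under the byte
    # budget, instead of baking a fixed per-encoding size into the constants
    # in the docstring above.
    return max(1, max_transaction_bytes // (encoding_size + id_size))
```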

Comment on lines +142 to +145
```sql
    count INT NOT NULL,

    -- Number of blocks uploaded
    block_count INT NOT NULL DEFAULT 1
```
Collaborator:

Both count and block_count are redundant info in the database. These values are also present in the blocks and encodings tables.
We should be clearer on why we also need those values. What's their meaning?

Collaborator Author (@hardbyte):

They are a bit redundant - they can be computed by the database easily enough.

I think the count of uploaded encodings is only used in scheduling/chunking the comparison tasks - the block_count won't be used anywhere yet (well it is notionally checked in the encoding_upload task).
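For reference, a sketch of how those values could be computed on demand instead of stored (the table and column names here are guesses based on the schema description above, not the actual ones in this PR):

```python
def uploaded_counts(db, dp_id):
    # Derive the encoding count and distinct block count for a data provider
    # straight from the uploaded encoding/block rows.
    with db.cursor() as cur:
        cur.execute(
            """
            SELECT count(DISTINCT entity_id) AS encoding_count,
                   count(DISTINCT block_id)  AS block_count
            FROM encodingblocks
            WHERE dp = %s
            """,
            (dp_id,),
        )
        return cur.fetchone()
```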

@hardbyte force-pushed the feature-store-encodings-in-db branch from 5f23688 to fa17c91 on March 2, 2020 at 03:18
@hardbyte force-pushed the feature-store-encodings-in-db branch from fa17c91 to 5ae5c57 on March 2, 2020 at 03:36
@hardbyte force-pushed the feature-store-encodings-in-db branch from 29fe3d2 to b03bf1f on March 2, 2020 at 04:24
@hardbyte merged commit 79a7e89 into develop on Mar 2, 2020
@hardbyte deleted the feature-store-encodings-in-db branch on March 2, 2020 at 04:52