Comparison tasks now based on blocks #527
Merged
27 commits
065b11a Add function to fetch block ids and sizes from db (hardbyte)
6f7c75c Retrieve blocking info in create_comparison_jobs task (hardbyte)
ebcb248 WIP - identify blocks that need to be broken up further (hardbyte)
a066ccc Query for getting encodings in a block (hardbyte)
fb550de Split tasks into chunks using blocking information (hardbyte)
610b3bb Refactor create comparison jobs function (hardbyte)
d838fe4 More refactoring of chunk creation (hardbyte)
ec36e8d Add a few unit tests for chunking (hardbyte)
ddcbcc3 Add database index on encodings table (hardbyte)
4ab16e6 clknblocks not clksnblocks and other minor cleanup (hardbyte)
d66bf58 cleanup (hardbyte)
1e5151f Add blocking concept to docs (hardbyte)
aec9b5c Deduplicate candidate pairs before solving (hardbyte)
f30c819 Catch the empty candidate pair case (hardbyte)
9dc59e1 Simplify solver task by using anonlink's _merge_similarities function (hardbyte)
6219e44 Update celery (hardbyte)
0b6a4c2 Address code review feedback (hardbyte)
5467362 Bump version to beta2 (hardbyte)
f342d5a Celery concurrency defaults (hardbyte)
2add5ef Add another layer of tracing into the comparison task (hardbyte)
e2ebe99 Update task names in celery routing (hardbyte)
38b624f Faster encoding retrieval by using COPY. (hardbyte)
7ec7fef Pass on stored size when retrieving encodings from DB (hardbyte)
24caa79 Increase time on test (hardbyte)
dc1983b Refactor binary copy into own function for easier reuse and testing (hardbyte)
8bae410 Add more detailed tracing around binary encoding insertions. (hardbyte)
88e968d Add tests for binary copy function (hardbyte)
    @@ -1 +1 @@
    -v1.13.0-beta
    +v1.13.0-beta2
    @@ -0,0 +1,13 @@
    import psycopg2

    from entityservice.settings import Config as config


    def _get_conn_and_cursor():
        db = config.DATABASE
        host = config.DATABASE_SERVER
        user = config.DATABASE_USER
        password = config.DATABASE_PASSWORD
        conn = psycopg2.connect(host=host, dbname=db, user=user, password=password)
        cursor = conn.cursor()
        return conn, cursor
backend/entityservice/integrationtests/dbtests/conftest.py: 57 additions & 0 deletions
    @@ -0,0 +1,57 @@
    import pytest
    import psycopg2.extras  # binds psycopg2 too; extras is not loaded by a bare "import psycopg2"

    from entityservice.settings import Config as config


    @pytest.fixture
    def conn():
        db = config.DATABASE
        host = config.DATABASE_SERVER
        user = config.DATABASE_USER
        password = config.DATABASE_PASSWORD
        conn = psycopg2.connect(host=host, dbname=db, user=user, password=password)
        yield conn
        conn.close()

    @pytest.fixture
    def cur(conn):
        return conn.cursor()




    @pytest.fixture()
    def prepopulated_binary_test_data(conn, cur, num_bytes=4, num_rows=100):
        creation_sql = """
            DROP TABLE IF EXISTS binary_test;
            CREATE TABLE binary_test
            (
                id integer not null,
                encoding bytea not null
            );"""
        cur.execute(creation_sql)
        conn.commit()

        # Add data using execute_values
        data = [(i, bytes([i % 128] * num_bytes)) for i in range(num_rows)]
        psycopg2.extras.execute_values(cur, """
            INSERT INTO binary_test (id, encoding) VALUES %s
            """, data)

        conn.commit()

        # quick check data is there
        cur.execute("select count(*) from binary_test")
        res = cur.fetchone()[0]
        assert res == num_rows

        cur.execute("select encoding from binary_test where id = 1")
        assert bytes(cur.fetchone()[0]) == data[1][1]

        yield data

        # delete test table
        deletion_sql = "drop table if exists binary_test cascade;"
        cur.execute(deletion_sql)
        conn.commit()
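
To make the fixture's test rows concrete, the sketch below rebuilds the same list without touching a database. The helper name `make_binary_test_data` is hypothetical and not part of the PR; only the list comprehension is taken from the fixture.

```python
def make_binary_test_data(num_bytes=4, num_rows=100):
    """Build (id, encoding) rows the same way the fixture does.

    Hypothetical helper for illustration only; not part of the PR.
    """
    # Each encoding is num_bytes copies of the single byte value (i % 128),
    # so the byte values wrap around once the row id reaches 128.
    return [(i, bytes([i % 128] * num_bytes)) for i in range(num_rows)]


data = make_binary_test_data()
first_rows = data[:2]

# With more than 128 rows the byte value wraps: row 129 repeats byte 0x01.
wrapped_row = make_binary_test_data(num_rows=130)[129]
```

Because the encoding depends only on the row id, a test can recompute the expected bytes for any id instead of keeping a second copy of the data.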
The 'one' argument is a bit confusing: from the name alone it is not obvious what it does.
Do we really need it?
Wouldn't it be cleaner to always yield the full rows? That's what the function name says.
A caller that only needs the first column could then unpack it and discard the rest, like this:
    for block_name, _ in iterate_cursor_results(cur):
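
A minimal sketch of that suggestion, under stated assumptions: the real body and signature of `iterate_cursor_results` are not shown in this PR excerpt, so the generator below, its `page_size` parameter, and the `blocks` table are illustrative guesses. It uses the standard library's `sqlite3` in place of psycopg2 so it runs without a database server; both cursors provide `fetchmany`.

```python
import sqlite3


def iterate_cursor_results(cur, page_size=4096):
    """Yield every full row from an already executed cursor, one page at a time.

    Assumed shape only; the project's actual function may differ.
    """
    while True:
        rows = cur.fetchmany(page_size)
        if not rows:
            break
        yield from rows


# Stand-in data: an in-memory table with (block_name, size) rows.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE blocks (block_name TEXT, size INTEGER)")
cur.executemany("INSERT INTO blocks VALUES (?, ?)", [("a", 3), ("b", 5), ("c", 1)])
cur.execute("SELECT block_name, size FROM blocks ORDER BY block_name")

# The caller unpacks each row and ignores the column it does not need,
# so no 'one' flag is required on the generator itself.
block_names = [block_name for block_name, _ in iterate_cursor_results(cur, page_size=2)]
conn.close()
```

Always yielding whole rows keeps the generator's behaviour matched to its name and leaves column selection to the call site.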