Skip to content

Conversation

@wilko77
Copy link
Collaborator

@wilko77 wilko77 commented Nov 1, 2019

It all started with this strange failure in a clkhash test:

State: queued
Stage (1/3): waiting for CLKs
Progress: 100.00%
State: queued
Stage (1/3): waiting for CLKs
Progress: 100.00%
State: queued
Stage (2/3): compute similarity scores
State: running
Stage (3/3): compute output
State: running
Stage (4/3): there is no description for this stage

It turns out, that the entity service is naughty.

The problem is around the detection of the race condition to start a run twice.
The idea is that a run's state is set to created after creation, and only once all data provider have uploaded the necessary clks, the run state gets promoted to queued and, consequently, the run execution gets started.

However, some naughty code in views/run/list.py:post did some premature clk checking and run state changing. After all that, it would still call the check_for_executable_runs task, which will then do exactly the same again.
Now, as we previously "fixed" the get_created_runs_and_queue by allowing to return runs that are either created OR queued, together with the fact that check_for_executable_runs can be called twice almost in parallel, led to double queuing of the run and double stage progressing.

To stop all this nonsense, I propose to not include run logic in the views of ngnix. If we always call the same task as entrypoint (check_for_executable_runs), it will, most likely, keep as saner for longer. Hopefully.

wilko added 2 commits November 1, 2019 17:02
did half of the work of `check_for_executable_runs`. And as a side effect,
run stages were increased too many times, broke the race condition disabler
code.
@wilko77 wilko77 requested a review from gusmith November 1, 2019 06:28
Copy link
Contributor

@gusmith gusmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely spotted.
Only things to close this PR is to update the changelog.

state = 'queued'
WHERE
state IN ('created', 'queued') AND project = %s
state = 'created' AND project = %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

PyYAML==5.1
redis==3.2.1
requests==2.21.0
requests==2.22.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add this dependency update in the changelog.

@wilko77 wilko77 merged commit 7f2dad0 into develop Nov 4, 2019
@wilko77 wilko77 deleted the fix_starting_runs branch November 4, 2019 10:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants