[SPARK-24697][SS] Fix the reported start offsets in streaming query progress #21673
Test build #92484 has finished for PR 21673 at commit
|
|
@arunmahadevan It would be better to follow the style guide for pull requests: please change the title to include the JIRA issue number, as described at http://spark.apache.org/contributing.html
|
@arunmahadevan
The code change looks great, but the patch would be better if we modified a test to verify the change (one that fails on current master but succeeds with the proposed patch).
It would be a one-line change: HeartSaVioR@020d93b
No credit needed, feel free to apply it to your PR.
|
@HeartSaVioR, thanks for the input. Please check again.
HeartSaVioR
left a comment
LGTM
|
Test build #92652 has finished for PR 21673 at commit
|
|
Thanks @arunmahadevan for making this PR. However, I don't like the solution of adding another field as a workaround, which makes the control flow harder to reason about. I think the fundamental problem is the original design of ProgressReporter, which sees all the internal details of StreamExecution (e.g. availableOffsets and committedOffsets), so it is very hard to reason about what information is read when. I want to refactor this a little to improve the underlying problem. I am working on a PR myself for that and will post it shortly.
|
@tdas, thanks for your comments. Yes, there is a problem with the current abstraction. I didn't consider refactoring it because there have been multiple changes to this class without changing the underlying structure, and the fields of ExecutionStats are already accessed from multiple places within StreamExecution. I did not think adding an extra field would increase the code complexity. However, if you plan to do a major refactoring to simplify the logic and address the issues, I am happy to discard this PR and help review your changes.
|
@arunmahadevan I made this PR as an attempt to incrementally improve the control flow in ProgressReporter while fixing the bug here. |
What changes were proposed in this pull request?
A streaming query reports progress during each trigger (e.g. after runBatch in MicroBatchExecution). However, the reported progress has the wrong offsets, because the offsets are committed first and committedOffsets is updated to availableOffsets before the progress is reported.
This leads to misleading progress reports in which startOffset and endOffset are always the same.
The fix remembers the last committed offsets before running the batch and updating committedOffsets, and reports those remembered offsets as the start offsets in the streaming query progress.
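The ordering problem can be illustrated with a small toy model. This is only a sketch, not Spark's actual code: `OffsetProgressSketch`, `Progress`, `runBuggy`, and `runFixed` are hypothetical names, and a single `Long` stands in for the per-source offset maps that StreamExecution really keeps in committedOffsets and availableOffsets.

```scala
// Toy model of a micro-batch loop. None of these types are Spark classes;
// they are hypothetical stand-ins used only to show the ordering issue.
object OffsetProgressSketch {
  // Simplified per-trigger progress entry (assumption: one source, Long offsets).
  final case class Progress(startOffset: Long, endOffset: Long)

  // Buggy ordering: the commit happens before the progress is built,
  // so the start offset is read after it was already advanced.
  def runBuggy(availablePerTrigger: Seq[Long]): Seq[Progress] = {
    var committed = 0L
    availablePerTrigger.map { available =>
      // ... the batch would run here ...
      committed = available            // commit first
      Progress(committed, available)   // start == end on every trigger
    }
  }

  // Fixed ordering: snapshot the last committed offset before the commit
  // and report that snapshot as the start offset.
  def runFixed(availablePerTrigger: Seq[Long]): Seq[Progress] = {
    var committed = 0L
    availablePerTrigger.map { available =>
      val lastCommitted = committed    // remembered before the batch commits
      committed = available
      Progress(lastCommitted, available)
    }
  }

  def main(args: Array[String]): Unit = {
    val available = Seq(10L, 25L, 40L)
    println(runBuggy(available)) // List(Progress(10,10), Progress(25,25), Progress(40,40))
    println(runFixed(available)) // List(Progress(0,10), Progress(10,25), Progress(25,40))
  }
}
```

In the fixed variant each trigger's start offset equals the previous trigger's end offset, which is the relationship a reader of the progress report would expect.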
How was this patch tested?
Existing Unit tests and running sample programs.