Skip to content

Conversation

@arunmahadevan
Copy link
Contributor

What changes were proposed in this pull request?

Streaming query reports progress during each trigger (e.g. after runBatch in MicrobatchExcecution). However the reported progress has wrong offsets since the offsets are first committed and committedOffsets is updated to the availableOffsets before the progress is reported.

This leads to weird progress where startOffset and endOffsets are always the same.

{
 "id" : "76bf5515-55be-46af-bc79-9fc92cc6d856",
 "runId" : "b526f0f4-24bf-4ddc-b6e8-7b0cc83bdbe8",
...
"sources" : [ {
 "description" : "KafkaV2[Subscribe[topic2]]",
 "startOffset" : {
 "topic2" : {
 "0" : 44
 }
 },
 "endOffset" : {
 "topic2" : {
 "0" : 44
 }
 },
 "numInputRows" : 11,
 "inputRowsPerSecond" : 1.099670098970309,
 "processedRowsPerSecond" : 1.8829168093118795
 } ],
...
}

Remember the last committed offset before running the batch and updating the committed offsets and report the last committed offsets in the Streaming query progress.

How was this patch tested?

Existing Unit tests and running sample programs.

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

@arunmahadevan
Copy link
Contributor Author

@SparkQA
Copy link

SparkQA commented Jun 30, 2018

Test build #92484 has finished for PR 21673 at commit 24d75ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor

@arunmahadevan We'd be better to respect style guide on pull request: please change title to include let JIRA issue number being guided with [] and also add [SS].

http://spark.apache.org/contributing.html

The PR title should be of the form [SPARK-xxxx][COMPONENT] Title, where SPARK-xxxx is the relevant JIRA number, COMPONENT is one of the PR categories shown at spark-prs.appspot.com and Title may be the JIRA’s title or a more specific title describing the PR itself.

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arunmahadevan
The code change looks great, but the patch would be better if we modify test to verify the change. (fail on current master but succeed on proposed patch)

It would be one liner change: HeartSaVioR@020d93b

No credit needed, feel free to apply it to your PR.

@arunmahadevan arunmahadevan changed the title SPARK-24697: Fix the reported start offsets in streaming query progress [SPARK-24697][SS] Fix the reported start offsets in streaming query progress Jul 5, 2018
@arunmahadevan
Copy link
Contributor Author

@HeartSaVioR , thanks for the inputs. Please check again.

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SparkQA
Copy link

SparkQA commented Jul 5, 2018

Test build #92652 has finished for PR 21673 at commit aeadd5a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Copy link
Contributor

tdas commented Jul 5, 2018

Thanks @arunmahadevan for making this PR. However, I dont like the solution of adding another field as a workaround thus making the control flow harder to reason about. I think the fundamental problem is that the original design of the ProgressReport that sees all the internal details of StreamExecution (e.g. availableOffsets and committedOffsets) and its very reason what informatio is read when. I want to refactor this a little bit towards improving this underlying problem. I am working on a PR myself for that. I will post it shortly.

@arunmahadevan
Copy link
Contributor Author

@tdas , thanks for your comments. Yes theres problem with the current abstraction, and I didn't consider refactoring it since there have been multiple changes to this class without changing the underlying structure and the fields of the ExecutionStats are accessed from multiple places within StreamExecution already.

I did not think adding an extra field would increase the code complexity, however if you plan to do major refactoring to simplify the logic and address the issues, I am happy to discard this PR and help review your changes.

@tdas
Copy link
Contributor

tdas commented Jul 11, 2018

@arunmahadevan I made this PR as an attempt to incrementally improve the control flow in ProgressReporter while fixing the bug here.

#21744

@arunmahadevan
Copy link
Contributor Author

@tdas Closing this in favor of #21744 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants