@HeartSaVioR
Contributor

Introduction: this PR is part of SPARK-10816 (EventTime based sessionization (session window)). Please refer to #31937 for an overall view of the code change. (Note that the code diff may have diverged a bit.)

What changes were proposed in this pull request?

This PR introduces MergingSortWithSessionWindowStateIterator, which performs a "merge sort" between input rows and sessions in state, based on the group key and the session's start time.

Note that the iterator merge-sorts input rows and sessions per grouping key. It does not provide sessions in state whose keys don't exist in the input rows. For input rows, the iterator provides all rows regardless of whether matching sessions exist in state.

MergingSortWithSessionWindowStateIterator works on the precondition that the given iterator is sorted by "group keys + start time of session window", and the resulting iterator retains that sort order.
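
To make the contract above concrete, here is a minimal sketch in plain Scala. It is not the code in this PR: `SketchRow`, `MergeSketch`, and the `sessionsInState` lookup are hypothetical stand-ins for Spark's internal rows and the state store, and the tie-breaking rule on equal start times is an assumption made only for this sketch.

```scala
// Simplified stand-in for a row: grouping key, session start time, and origin.
// The real iterator works on Spark's internal row types; this class is hypothetical.
final case class SketchRow(key: String, sessionStart: Long, fromState: Boolean)

object MergeSketch {
  // `input` must already be sorted by (key, sessionStart).
  // `sessionsInState` is a hypothetical lookup standing in for the state store;
  // it must return the sessions of one key sorted by sessionStart.
  def merge(
      input: Iterator[SketchRow],
      sessionsInState: String => Seq[SketchRow]): Iterator[SketchRow] = {
    val in = input.buffered
    new Iterator[SketchRow] {
      private var fetchedKey: Option[String] = None
      private var state: BufferedIterator[SketchRow] = Iterator[SketchRow]().buffered

      override def hasNext: Boolean = in.hasNext || state.hasNext

      override def next(): SketchRow = {
        if (!hasNext) throw new NoSuchElementException("no more rows")
        val inputOnFetchedKey = in.hasNext && fetchedKey.contains(in.head.key)
        if (state.hasNext && !inputOnFetchedKey) {
          // Input moved on to another key (or ended): drain the remaining sessions.
          state.next()
        } else {
          if (!inputOnFetchedKey) {
            // First input row of a new key: look up its sessions in state once.
            fetchedKey = Some(in.head.key)
            state = sessionsInState(in.head.key).iterator.buffered
          }
          // Assumption: on equal start times the input row is emitted first.
          if (state.hasNext && state.head.sessionStart < in.head.sessionStart) state.next()
          else in.next()
        }
      }
    }
  }
}

object MergeSketchDemo extends App {
  val input = Iterator(
    SketchRow("a", 10L, fromState = false),
    SketchRow("a", 40L, fromState = false),
    SketchRow("c", 5L, fromState = false))
  val state = Map(
    "a" -> Seq(SketchRow("a", 20L, fromState = true), SketchRow("a", 50L, fromState = true)),
    "b" -> Seq(SketchRow("b", 1L, fromState = true)))
  MergeSketch.merge(input, k => state.getOrElse(k, Seq.empty)).foreach(println)
}
```

In the demo, the rows for key "a" come out interleaved by start time (10, 20, 40, 50), the row for key "c" passes through even though it has no session in state, and the session for key "b" is never emitted because no input row carries that key.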

Why are the changes needed?

This is one of the parts required to implement SPARK-10816.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New UT added.

@HeartSaVioR
Contributor Author

Same here; I marked this as a draft, as other PRs have to be reviewed and merged earlier. I'll rebase this PR once all other PRs are merged.

@SparkQA

SparkQA commented Jun 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44837/

@SparkQA

SparkQA commented Jun 25, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44837/

@HeartSaVioR force-pushed the SPARK-34892-SPARK-10816-PR-31570-part-4 branch from eb433d7 to 83cc78a on June 25, 2021 07:43
@SparkQA

SparkQA commented Jun 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44843/

@SparkQA

SparkQA commented Jun 25, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44843/

@SparkQA

SparkQA commented Jun 25, 2021

Test build #140306 has finished for PR 33077 at commit eb433d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class MergingSortWithSessionWindowStateIterator(

@SparkQA

SparkQA commented Jun 25, 2021

Test build #140311 has finished for PR 33077 at commit 83cc78a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class MergingSortWithSessionWindowStateIterator(

@HeartSaVioR force-pushed the SPARK-34892-SPARK-10816-PR-31570-part-4 branch from 83cc78a to abab1e8 on July 13, 2021 19:54
@HeartSaVioR marked this pull request as ready for review on July 13, 2021 19:54
@HeartSaVioR
Contributor Author

cc. @viirya @xuanyuanking Please take a look. Thanks!

@viirya
Member

viirya commented Jul 13, 2021

Thanks @HeartSaVioR. I will review this today.

@SparkQA

SparkQA commented Jul 13, 2021

Test build #140981 has finished for PR 33077 at commit 60f6114.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 13, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45495/

@SparkQA

SparkQA commented Jul 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45501/

@SparkQA

SparkQA commented Jul 14, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45501/

… sorting input rows and rows in state efficiently
@HeartSaVioR force-pushed the SPARK-34892-SPARK-10816-PR-31570-part-4 branch from b540632 to e4a74a3 on July 14, 2021 01:22
@SparkQA

SparkQA commented Jul 14, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45503/

@SparkQA

SparkQA commented Jul 14, 2021

Test build #140987 has finished for PR 33077 at commit b540632.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 14, 2021

Test build #140989 has finished for PR 33077 at commit e4a74a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class MergingSortWithSessionWindowStateIterator(

Comment on lines 68 to 71
private var currentRow: SessionRowInformation = _
private var currentStateRow: SessionRowInformation = _
private var currentStateIter: Iterator[InternalRow] = _
private var currentStateFetchedKey: UnsafeRow = _
Member

A few suggestions:

currentRow -> currentRowFromInput
currentStateRow -> currentRowFromState
currentStateIter -> sessionIterFromState
currentStateFetchedKey -> currentSessionKey

Member

And maybe add a few comments explaining these variables for readability.

} else {
// compare
if (currentRow.keys != currentStateRow.keys) {
// state row cannot advance to row in input, so state row should be lower
Member

Does this case mean the input iterator has advanced to a new key, different from the key of the current sessions from the state? So we should output the current sessions until they end, and then retrieve new sessions from the state again?

Contributor Author

@HeartSaVioR HeartSaVioR Jul 14, 2021

Exactly. We retrieve state rows for a specific key only when there's a new key from the input side, so the state side can never advance past the input side. If the keys differ, there are still rows to process on the state side. The opposite case is not possible.
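
The branch discussed in this thread can be illustrated with a small, self-contained sketch. It is not the PR's code: `Current`, `CompareSketch`, and `pick` are hypothetical names, rows are reduced to a key and a session start time, and the handling of equal start times is an assumption.

```scala
// Hypothetical, simplified view of a "current" row on either side:
// just the grouping key and the session start time.
final case class Current(key: String, sessionStart: Long)

object CompareSketch {
  sealed trait Source
  case object FromInput extends Source
  case object FromState extends Source

  // Decides which side supplies the next output row when both sides have a current row.
  def pick(input: Current, state: Current): Source =
    if (input.key != state.key) {
      // State is fetched only when the input reaches a new key, so a key mismatch
      // can only mean the state row is a leftover session of the previous key.
      FromState
    } else if (state.sessionStart < input.sessionStart) {
      FromState
    } else {
      // Assumption for this sketch: ties on start time go to the input row.
      FromInput
    }
}
```

For example, `CompareSketch.pick(Current("b", 1L), Current("a", 99L))` yields `FromState`: the start times are not compared at all, because the session for key "a" has to be emitted before any row of key "b".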

rowAttributes)

val actual = iter.map(_.copy()).toList
assert(actual.isEmpty)
Member

Hm? Why is it empty? If the input is empty, doesn't it output the sorted sessions in the state?

Member

Oh, nvm. I see.

Member

@viirya viirya left a comment

Looks good. I have a few suggestions about variable names.

@viirya
Member

viirya commented Jul 14, 2021

I will merge this tomorrow. Thanks @HeartSaVioR

@SparkQA

SparkQA commented Jul 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45520/

@SparkQA

SparkQA commented Jul 14, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45520/

@HeartSaVioR
Contributor Author

Please bear with me on merging this one now to unblock the last PR, #33081. We can do post-review for any part of the session window changes even if it's merged during the QA period (and even the RC period, if someone catches a defect).

HeartSaVioR added a commit that referenced this pull request Jul 14, 2021
… sorting input rows and rows in state efficiently

Closes #33077 from HeartSaVioR/SPARK-34892-SPARK-10816-PR-31570-part-4.

Authored-by: Jungtaek Lim <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
(cherry picked from commit 12a576f)
Signed-off-by: Jungtaek Lim <[email protected]>
@HeartSaVioR
Contributor Author

HeartSaVioR commented Jul 14, 2021

Thanks @viirya for reviewing! I merged this to master/3.2. I'll rebase #33081 to reflect the latest master branch.

@SparkQA

SparkQA commented Jul 14, 2021

Test build #141006 has finished for PR 33077 at commit c674046.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
… sorting input rows and rows in state efficiently

Closes apache#33077 from HeartSaVioR/SPARK-34892-SPARK-10816-PR-31570-part-4.

Authored-by: Jungtaek Lim <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>