
Conversation

@tdas tdas (Contributor) commented Mar 29, 2018

What changes were proposed in this pull request?

Currently, the requiredChildDistribution does not specify the number of partitions. This can cause a weird corner case where the child's distribution is `SinglePartition`, which satisfies the required distribution of `ClusteredDistribution` (with no requirement on the number of partitions), thus eliminating the shuffle needed to repartition the input data into the required number of partitions (i.e. the same as the state stores). That can lead to "file not found" errors on the state store delta files, as the micro-batch-with-no-shuffle will not run certain tasks and therefore will not generate the expected state store delta files.

This PR adds the required constraint on the number of partitions.
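
To make the corner case concrete, here is a minimal standalone sketch of the distribution-satisfaction check. The classes below are simplified stand-ins that borrow the names `ClusteredDistribution` and `SinglePartition`, not Spark's actual `org.apache.spark.sql.catalyst.plans.physical` implementations; the logic is an illustrative assumption about how the requirement behaves before and after adding a partition-count constraint.

```scala
// Simplified model of the distribution/partitioning contract (illustrative only;
// the real Spark classes differ in structure and detail).
sealed trait Distribution { def requiredNumPartitions: Option[Int] }

case class ClusteredDistribution(
    clustering: Seq[String],
    requiredNumPartitions: Option[Int] = None) extends Distribution

sealed trait Partitioning {
  def numPartitions: Int
  def satisfies(required: Distribution): Boolean
}

case object SinglePartition extends Partitioning {
  val numPartitions = 1
  // A single partition trivially co-locates all rows for any clustering keys,
  // so it satisfies the requirement unless a specific partition count is demanded.
  def satisfies(required: Distribution): Boolean =
    required.requiredNumPartitions.forall(_ == numPartitions)
}

object Demo extends App {
  // With no partition-count requirement, SinglePartition "satisfies" the clustered
  // requirement and no shuffle is planned.
  println(SinglePartition.satisfies(ClusteredDistribution(Seq("key"))))            // true
  // With the state store's partition count required (e.g. 200), SinglePartition
  // no longer satisfies it, so the planner must insert a shuffle.
  println(SinglePartition.satisfies(ClusteredDistribution(Seq("key"), Some(200)))) // false
}
```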

How was this patch tested?

Modified the test harness to always check that any stateful operator has a constraint on the number of partitions. As part of that, the existing opt-in checks on child output partitioning were removed, as they are now redundant.

@tdas tdas changed the title from "Spark 23827" to "[SPARK-23827] [SS] StreamingJoinExec should ensure that input data is partitioned into specific number of partitions" Mar 29, 2018
@tdas tdas (Contributor, Author) commented Mar 29, 2018

@brkyvz @zsxwing can one of you take a look?

@brkyvz brkyvz (Contributor) left a comment

LGTM. This is great. Left one comment

assert(d.requiredNumPartitions.isDefined)
assert(d.requiredNumPartitions.get >= 1)
if (d != AllTuples) {
  assert(d.requiredNumPartitions.get == s.stateInfo.get.numPartitions)

can you also verify that this is equal to the number of partitions in the metadata?

assert(s.stateInfo.isDefined)
assert(s.stateInfo.get.numPartitions >= 1)
assert(
  s.stateInfo.map(_.numPartitions).contains(currentStream.lastExecution.numStateStores))

why this change?

@SparkQA commented Mar 30, 2018

Test build #88721 has finished for PR 20941 at commit c162f8d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DeduplicateSuite extends StateStoreMetricsTest with BeforeAndAfterAll

@SparkQA commented Mar 30, 2018

Test build #88727 has finished for PR 20941 at commit 555eeb3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  override def requiredChildDistribution: Seq[Distribution] =
-   ClusteredDistribution(leftKeys) :: ClusteredDistribution(rightKeys) :: Nil
+   ClusteredDistribution(leftKeys, stateInfo.map(_.numPartitions)) ::
+     ClusteredDistribution(rightKeys, stateInfo.map(_.numPartitions)) :: Nil

+1

@tdas tdas (Contributor, Author) commented Mar 30, 2018

Started more tests to test for flakiness.

@SparkQA commented Mar 30, 2018

Test build #88761 has finished for PR 20941 at commit aeec843.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 30, 2018

Test build #4146 has finished for PR 20941 at commit aeec843.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 30, 2018

Test build #4145 has finished for PR 20941 at commit aeec843.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 30, 2018

Test build #4147 has finished for PR 20941 at commit aeec843.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 30, 2018

Test build #4148 has finished for PR 20941 at commit aeec843.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Mar 30, 2018
…partitioned into specific number of partitions

Author: Tathagata Das <[email protected]>

Closes #20941 from tdas/SPARK-23827.

(cherry picked from commit 15298b9)
Signed-off-by: Tathagata Das <[email protected]>
@asfgit asfgit closed this in 15298b9 Mar 30, 2018
mshtelma pushed a commit to mshtelma/spark that referenced this pull request Apr 5, 2018
…partitioned into specific number of partitions

Author: Tathagata Das <[email protected]>

Closes apache#20941 from tdas/SPARK-23827.
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
…partitioned into specific number of partitions

Author: Tathagata Das <[email protected]>

Closes apache#20941 from tdas/SPARK-23827.

(cherry picked from commit 15298b9)
Signed-off-by: Tathagata Das <[email protected]>

Change-Id: I9dc225a765afb198e3e8719bdb3dfffd2cff95b9