[SPARK-23880][SQL] Do not trigger any jobs for caching data #21018
Conversation
Test build #89083 has finished for PR 21018 at commit
I'll fix the failure tonight (JST).
Test build #89162 has finished for PR 21018 at commit
why do we do this?
The current approach of this PR is: cache() just registers an entry in CacheManager without building an RDD, and then InMemoryTableScanExec re-registers the entry in CacheManager to build (materialize) the RDD. So, I added this function so that InMemoryTableScanExec can re-register these entries in CacheManager. I don't think this is the best approach, though, so I'd welcome any suggestions.
Why do we need to involve CacheManager? Shall we just make the creation of the RDD lazy in InMemoryRelation and trigger the materialization in InMemoryTableScanExec?
I thought that, since InMemoryRelation is sometimes copied within a tree, the lazy update of _cachedColumnBuffers wouldn't always materialize the corresponding cache entry in CacheManager (maybe...). If so, subsequent queries might repeatedly do unnecessary materialization. Therefore, I thought we needed to update the entry in CacheManager directly.
something like
class InMemoryRelation(private var _cachedColumnBuffers: RDD[CachedBatch] = null) {
  def cachedColumnBuffers = {
    if (_cachedColumnBuffers == null) {
      // Double-checked locking: skip the lock on the hot path,
      // and build the buffers at most once.
      synchronized {
        if (_cachedColumnBuffers == null) {
          _cachedColumnBuffers = buildBuffer()
        }
      }
    }
    _cachedColumnBuffers
  }
}
once it's materialized, it's still materialized after copy
I'll update today
I checked the suggested approach, but some queries didn't work well: the query in the description was OK, but the query below wrongly cached two different RDDs (I checked the Storage tab in the web UI):
scala> sql("SET spark.sql.crossJoin.enabled=true")
scala> val df = spark.range(100000000L).cache()
scala> df.join(df).show
This is because the Analyzer copies the InMemoryRelation node (with _cachedColumnBuffers = null) via newInstance, and then each copy builds its own RDD. Thoughts?
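To illustrate the failure mode with a minimal, self-contained sketch (hypothetical classes, not the actual Spark code): when the lazily-built state lives directly on the relation and the relation is copied before first access, each copy materializes independently.

// Illustrative reproduction of the duplication problem.
class Relation(private var buf: String = null) {
  def buffers: String = {
    if (buf == null) buf = s"built by ${System.identityHashCode(this)}"
    buf
  }
  def newInstance: Relation = new Relation(buf) // copies whatever is built so far
}

val r = new Relation()
val copy = r.newInstance           // copied while buf is still null
println(r.buffers == copy.buffers) // false: each copy built its own buffer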
How about something like this:

class CachedRDDBuilder(private var _cachedColumnBuffers: RDD[CachedBatch] = null) {
  def cachedColumnBuffers = {
    if (_cachedColumnBuffers == null) {
      synchronized {
        if (_cachedColumnBuffers == null) {
          _cachedColumnBuffers = buildBuffer()
        }
      }
    }
    _cachedColumnBuffers
  }
}

class InMemoryRelation(cacheBuilder: CachedRDDBuilder = new CachedRDDBuilder()) {
  // newInstance should keep the existing CachedRDDBuilder
  def newInstance() ...
}
Then, in the physical plan and the cache manager, just call relation.cacheBuilder.cachedColumnBuffers.
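A minimal, self-contained sketch of why sharing the builder fixes the duplication (hypothetical classes, not the actual Spark code): the builder is the single owner of the lazy state, so copies of the relation share one materialization.

class Builder {
  private var buf: String = _
  def buffers: String = synchronized {
    if (buf == null) buf = s"built by ${System.identityHashCode(this)}"
    buf
  }
}
class Relation(val cacheBuilder: Builder = new Builder) {
  def newInstance: Relation = new Relation(cacheBuilder) // share the builder, don't copy state
}

val r = new Relation()
val copy = r.newInstance
println(r.cacheBuilder.buffers == copy.cacheBuilder.buffers) // true: built once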
aha, I'll fix it that way. Thanks!
Test build #89265 has finished for PR 21018 at commit
Force-pushed from 7ae9a5c to 50dc700.
Test build #89537 has finished for PR 21018 at commit
retest this please
Test build #89563 has finished for PR 21018 at commit
retest this please
Test build #89598 has finished for PR 21018 at commit
This reverts commit 2b58189.
Test build #89632 has finished for PR 21018 at commit
Test build #89666 has finished for PR 21018 at commit
@cloud-fan @viirya could you check this? Thanks!
if (_cachedColumnBuffers != null) {
  synchronized {
    if (_cachedColumnBuffers != null) {
      _cachedColumnBuffers.unpersist(blocking)
shall we also do _cachedColumnBuffers = null so that unpersist won't be called twice?
ok
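A minimal sketch of the suggested change, inside CachedRDDBuilder (clearCache and blocking are illustrative names, assuming the double-checked pattern above):

def clearCache(blocking: Boolean = true): Unit = {
  if (_cachedColumnBuffers != null) {
    synchronized {
      if (_cachedColumnBuffers != null) {
        _cachedColumnBuffers.unpersist(blocking)
        // Drop the reference so a second clearCache call is a no-op
        // and unpersist is never invoked twice on the same RDD.
        _cachedColumnBuffers = null
      }
    }
  }
}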
nodes.forall(_.relation.cacheBuilder._cachedColumnBuffers != null)
}

test("SPARK-23880 table cache should be lazy and don't trigger any jobs") {
how does this test prove we don't trigger jobs?
I feel it's clearer to create a listener and explicitly show that we don't trigger any jobs after calling Dataset.cache.
ok
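A sketch of such a listener-based check (assuming it runs inside Spark's own test sources, since listenerBus is private[spark]; numJobs and the query are illustrative):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Count jobs started after Dataset.cache; caching should trigger none.
var numJobs = 0
val jobListener = new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = numJobs += 1
}
spark.sparkContext.addSparkListener(jobListener)
try {
  val df = spark.range(10).cache() // should not submit any job
  // Drain the listener bus so no in-flight event is missed by the assert.
  spark.sparkContext.listenerBus.waitUntilEmpty(10000)
  assert(numJobs == 0)
} finally {
  spark.sparkContext.removeSparkListener(jobListener)
}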
Don't forget to update PR description too. :)
@transient val partitionStatistics = new PartitionStatistics(output)

val child: SparkPlan = cacheBuilder.child
Do we need to expose child: SparkPlan? As it is a logical.LeafNode, it's a bit weird to have it.
Since InMemoryTableScanExec and other places reference this variable, I kept it public. But yeah, I agree the name is a little weird, so I renamed child to cachedPlan.
Statistics(sizeInBytes = sizeInBytesStats.value.longValue)

def cachedColumnBuffers: RDD[CachedBatch] = {
  if (_cachedColumnBuffers == null) {
    synchronized {
_cachedColumnBuffers is private[sql], so I'm not sure if this synchronized can be very effective.
I feel thread contention is low here, so I prefer simpler code. But I welcome suggestions for code that is both more efficient and simpler.
We should either not care about thread-safety at all, or do it right. Please either prove that CachedRDDBuilder will never be accessed by multiple threads and remove these synchronized blocks, or make _cachedColumnBuffers private.
ok, I'll recheck and update.
In this PR, without synchronized, I found that multi-threaded queries wrongly built four RDDs for a single cache:
val cachedDf = spark.range(1000000).selectExpr("id AS k", "id AS v").cache
for (i <- 0 to 3) {
  val thread = new Thread {
    override def run(): Unit = {
      // Start a job in each thread
      val df = cachedDf.filter('k > 5).groupBy().sum("v")
      df.collect()
    }
  }
  thread.start()
}
Either way, I think we should make _cachedColumnBuffers private, so I fixed it.
Test build #89710 has finished for PR 21018 at commit
Test build #89718 has finished for PR 21018 at commit
Test build #89762 has finished for PR 21018 at commit
retest this please.
@transient val partitionStatistics = new PartitionStatistics(output)

val cachedPlan: SparkPlan = cacheBuilder.cachedPlan
this should be def, or it will be serialized.
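A minimal sketch of the distinction (hypothetical classes, not the Spark code): a val becomes a field of the object and is written out by Java serialization, while a def is just a method and adds no serialized state.

class WithVal(builder: () => Array[Byte]) extends Serializable {
  val payload: Array[Byte] = builder() // stored as a field, hence serialized
}
class WithDef(builder: () => Array[Byte]) extends Serializable {
  def payload: Array[Byte] = builder() // recomputed on demand, never serialized
}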
sparkContext.addSparkListener(jobListener)
try {
  val df = f
  assert(numJobTrigered === 0)
before this assert, we should make sure the event queue is empty, via sparkContext.listenerBus.waitUntilEmpty
LGTM
Test build #89768 has finished for PR 21018 at commit
}
}

test("SPARK-23880 table cache should be lazy and don't trigger any jobs") {
Without the changes in this PR, this test can still pass. :)
oh, I'll recheck. Thanks!
Test build #89812 has finished for PR 21018 at commit
retest this please
Test build #89814 has finished for PR 21018 at commit
retest this please
Test build #89823 has finished for PR 21018 at commit
retest this please
Test build #89828 has finished for PR 21018 at commit
Test build #89829 has finished for PR 21018 at commit
thanks, merging to master!
What changes were proposed in this pull request?
This PR fixed the code so that cache does not trigger any jobs. For example, in the current master, an operation like the one below triggers an actual job:
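For instance, the following kind of query (an illustrative stand-in, since building a sort's range partitioner requires a sampling job):

val df = spark.range(10000000L).sort("id").cache()
// Before this fix, cache() built the cached RDD eagerly, so planning the sort
// submitted a range-partitioner sampling job immediately; with the fix, no job
// runs until the cached data is first used.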
This triggers a job even though the cache should be lazy. The problem is that, when creating InMemoryRelation, we build the RDD, which calls SparkPlan.execute and may trigger jobs, like a sampling job for a range partitioner, or a broadcast job. This PR removed the code that builds the cached RDD in the constructor of InMemoryRelation and added CachedRDDBuilder to lazily build the RDD in InMemoryRelation. Then, the first call of CachedRDDBuilder.cachedColumnBuffers triggers a job to materialize the cache in InMemoryTableScanExec.

How was this patch tested?
Added tests in CachedTableSuite.