Conversation

@maropu
Member

@maropu maropu commented Apr 10, 2018

What changes were proposed in this pull request?

This PR fixes the code so that cache() no longer triggers any jobs.
For example, in the current master, the operation below triggers an actual job:

val df = spark.range(10000000000L)
  .filter('id > 1000)
  .orderBy('id.desc)
  .cache()

This triggers a job even though caching should be lazy. The problem is that, when creating InMemoryRelation, we build the RDD, which calls SparkPlan.execute and may trigger jobs, such as a sampling job for a range partitioner or a broadcast job.

This PR removes the code that builds the cached RDD in the constructor of InMemoryRelation and adds CachedRDDBuilder to lazily build the RDD inside InMemoryRelation. The first call to CachedRDDBuilder.cachedColumnBuffers (from InMemoryTableScanExec) then triggers a job to materialize the cache.
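
With this change, the example above behaves like the following sketch (illustrative only; count() is just one way to force materialization and is not code from this PR):

val df = spark.range(10000000000L)
  .filter('id > 1000)
  .orderBy('id.desc)
  .cache()   // no job is triggered here anymore

df.count()   // the first action materializes the cache and triggers the jobs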

How was this patch tested?

Added tests in CachedTableSuite.

@gatorsmile
Member

cc @mengxr @cloud-fan

@SparkQA

SparkQA commented Apr 10, 2018

Test build #89083 has finished for PR 21018 at commit 01d75d7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Apr 10, 2018

I'll fix the failure tonight (JST).

@SparkQA

SparkQA commented Apr 11, 2018

Test build #89162 has finished for PR 21018 at commit 313f44b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

why do we do this?

Member Author

The current approach of this PR is: cache() just registers an entry in CacheManager without building an RDD, and then InMemoryTableScanExec re-registers the entry to build (materialize) the RDD in CacheManager. So, I added this function so that InMemoryTableScanExec can re-register those entries in CacheManager. I don't think this is the best approach, though, so I'd welcome any suggestions.

Contributor

Why do we need to involve CacheManager? Shall we just make the creation of the RDD lazy in InMemoryRelation and trigger the materialization in InMemoryTableScanExec?

Member Author

I thought that, since InMemoryRelation is sometimes copied within a tree, the lazy update of _cachedColumnBuffers would not always lead to the materialization of the corresponding cache entry in CacheManager (maybe...). If so, subsequent queries might repeat unnecessary materialization? That's why I thought we needed to update the entry in CacheManager directly.

Contributor

something like

class InMemoryRelation(private var _cachedColumnBuffers: RDD[CachedBatch] = null) {
  def cachedColumnBuffers = {
    if (_cachedColumnBuffers == null) {
      // Double-checked locking: build (and persist) the RDD only on the first access.
      synchronized {
        if (_cachedColumnBuffers == null) {
          _cachedColumnBuffers = buildBuffer()
        }
      }
    }
    _cachedColumnBuffers
  }
}

Contributor

once it's materialized, it stays materialized after a copy

Member Author

I'll update today

Member Author

@maropu maropu Apr 19, 2018

I checked the suggested approach, but some queries didn't work well;
the query in the description was OK, but the query below wrongly cached two different RDDs (I checked the Storage tab in the web UI):

scala> sql("SET spark.sql.crossJoin.enabled=true")
scala> val df = spark.range(100000000L).cache()
scala> df.join(df).show

This is because the Analyzer copies the InMemoryRelation node (with _cachedColumnBuffers = null) via newInstance, and then each copy builds its own RDD. Thoughts?

Contributor

class CachedRDDBuilder(private var _cachedColumnBuffers: RDD[CachedBatch] = null) {
  def cachedColumnBuffers = {
    if (_cachedColumnBuffers == null) {
      synchronized {
        if (_cachedColumnBuffers == null) {
          _cachedColumnBuffers = buildBuffer()
        }
      }
    }
    _cachedColumnBuffers
  } 
}

class InMemoryRelation(cacheBuilder: CachedRDDBuilder = new CachedRDDBuilder()) {
  // newInstance should keep the existing CachedRDDBuilder
  def newInstance()...
}

Then, in the physical plan and the cache manager, just call relation.cacheBuilder.cachedColumnBuffers.
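
For example, the call site could look roughly like this (a sketch; relation and the lazy val name are illustrative, not the exact code):

// In InMemoryTableScanExec: the first access materializes the cache. Copies of the
// InMemoryRelation created via newInstance() share the same CachedRDDBuilder, so
// they also share the single materialized RDD.
private lazy val cachedBatches: RDD[CachedBatch] = relation.cacheBuilder.cachedColumnBuffers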

Member Author

aha, I'll fix that way. Thanks!

@SparkQA

SparkQA commented Apr 12, 2018

Test build #89265 has finished for PR 21018 at commit fb77f39.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu force-pushed the SPARK-23880 branch 2 times, most recently from 7ae9a5c to 50dc700 on April 19, 2018 01:45
@SparkQA

SparkQA commented Apr 19, 2018

Test build #89537 has finished for PR 21018 at commit 50dc700.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Apr 19, 2018

retest this please

@SparkQA

SparkQA commented Apr 19, 2018

Test build #89563 has finished for PR 21018 at commit 50dc700.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Apr 19, 2018

retest this please

@SparkQA

SparkQA commented Apr 20, 2018

Test build #89598 has finished for PR 21018 at commit 50dc700.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

maropu added 2 commits April 20, 2018 15:48
This reverts commit 2b58189.
@SparkQA

SparkQA commented Apr 20, 2018

Test build #89632 has finished for PR 21018 at commit 6ace545.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CachedRDDBuilder(
  • case class InMemoryRelation(

@SparkQA

SparkQA commented Apr 21, 2018

Test build #89666 has finished for PR 21018 at commit 80f3b34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CachedRDDBuilder(
  • case class InMemoryRelation(

@maropu
Member Author

maropu commented Apr 21, 2018

@cloud-fan @viirya could you check this? Thanks!

if (_cachedColumnBuffers != null) {
  synchronized {
    if (_cachedColumnBuffers != null) {
      _cachedColumnBuffers.unpersist(blocking)
Contributor

shall we also do _cachedColumnBuffers = null so that unpersist won't be called twice?
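
Something like this, for example (a sketch only, reusing the double-checked pattern above; clearCache is an assumed method name):

def clearCache(blocking: Boolean = true): Unit = {
  if (_cachedColumnBuffers != null) {
    synchronized {
      if (_cachedColumnBuffers != null) {
        _cachedColumnBuffers.unpersist(blocking)
        // Drop the reference so a second call becomes a no-op instead of
        // unpersisting the same RDD again.
        _cachedColumnBuffers = null
      }
    }
  }
}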

Member Author

ok

nodes.forall(_.relation.cacheBuilder._cachedColumnBuffers != null)
}

test("SPARK-23880 table cache should be lazy and don't trigger any jobs") {
Contributor

how does this test prove we don't trigger jobs?

Contributor

I feel it's clearer to create a listener and explicitly show that we don't trigger any jobs after calling Dataset.cache.

Member Author

ok

@viirya
Member

viirya commented Apr 23, 2018

Don't forget to update the PR description too. :)


@transient val partitionStatistics = new PartitionStatistics(output)

val child: SparkPlan = cacheBuilder.child
Member

Do we need to expose child: SparkPlan? As it is a logical.LeafNode, it's a bit weird to have it.

Member Author

Since InMemoryTableScanExec and other places reference this variable, I kept it public. But, yeah, I agree the name is a little weird, so I renamed child to cachedPlan.

Statistics(sizeInBytes = sizeInBytesStats.value.longValue)
def cachedColumnBuffers: RDD[CachedBatch] = {
  if (_cachedColumnBuffers == null) {
    synchronized {
Member

_cachedColumnBuffers is private[sql], so I'm not sure if this synchronized can be very effective.

Member Author

I feel thread contention is low here, so I prefer simpler code. But I welcome suggestions for more efficient and simpler code.

Contributor

@cloud-fan cloud-fan Apr 23, 2018

We should either not care about thread-safety at all, or do it right. Please either prove that CachedRDDBuilder will never be accessed by multiple threads and remove these synchronized blocks, or make _cachedColumnBuffers private.

Member Author

@maropu maropu Apr 23, 2018

ok, I'll recheck and update.

Member Author

With this PR but without synchronized, I found that multi-threaded queries wrongly built four RDDs for a single cache:

// Without synchronization, each of the four threads can end up building
// (and persisting) its own RDD for the same cache entry.
val cachedDf = spark.range(1000000).selectExpr("id AS k", "id AS v").cache()
for (i <- 0 to 3) {
  val thread = new Thread {
    override def run(): Unit = {
      // Start a job in each thread
      val df = cachedDf.filter('k > 5).groupBy().sum("v")
      df.collect()
    }
  }
  thread.start()
}

Either way, I think we should make _cachedColumnBuffers private, so I fixed it accordingly.

@SparkQA

SparkQA commented Apr 23, 2018

Test build #89710 has finished for PR 21018 at commit 9c9f9c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 23, 2018

Test build #89718 has finished for PR 21018 at commit f5f8fbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 24, 2018

Test build #89762 has finished for PR 21018 at commit c17c5fb.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Apr 24, 2018

retest this please.


@transient val partitionStatistics = new PartitionStatistics(output)

val cachedPlan: SparkPlan = cacheBuilder.cachedPlan
Contributor

this should be a def, or it will be serialized.
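
i.e., roughly (a sketch of the suggested change):

// As a val, the SparkPlan is stored as a field of this node and serialized with it;
// a def just delegates to the builder when called.
def cachedPlan: SparkPlan = cacheBuilder.cachedPlan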

sparkContext.addSparkListener(jobListener)
try {
  val df = f
  assert(numJobTrigered === 0)
Contributor

before this assert, we should make sure the event queue is empty, via sparkContext.listenerBus.waitUntilEmpty
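
Putting these suggestions together, the check could look roughly like this (a sketch; the helper name verifyNumberOfJobs and the other names are assumptions, not the exact test code in this PR):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

private def verifyNumberOfJobs(expected: Int)(f: => Unit): Unit = {
  var numJobsTriggered = 0
  val jobListener = new SparkListener {
    override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
      numJobsTriggered += 1
    }
  }
  sparkContext.addSparkListener(jobListener)
  try {
    f
    // Drain the listener bus before asserting; otherwise a job-start event
    // may still be in flight and the count would be unreliable.
    sparkContext.listenerBus.waitUntilEmpty(10000)
    assert(numJobsTriggered === expected)
  } finally {
    sparkContext.removeSparkListener(jobListener)
  }
}

// Usage: calling cache() itself must not trigger any jobs.
verifyNumberOfJobs(expected = 0) {
  spark.range(1000000L).filter('id > 100).orderBy('id.desc).cache()
}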

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Apr 24, 2018

Test build #89768 has finished for PR 21018 at commit c17c5fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

test("SPARK-23880 table cache should be lazy and don't trigger any jobs") {
Member

Without the changes in this PR, this test can still pass. :)

Member Author

oh, I'll recheck. Thanks!

@SparkQA

SparkQA commented Apr 25, 2018

Test build #89812 has finished for PR 21018 at commit a3cce89.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Apr 25, 2018

retest this please

@SparkQA

SparkQA commented Apr 25, 2018

Test build #89814 has finished for PR 21018 at commit a3cce89.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Apr 25, 2018

Test build #89823 has finished for PR 21018 at commit a3cce89.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Apr 25, 2018

Test build #89828 has finished for PR 21018 at commit a3cce89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 25, 2018

Test build #89829 has finished for PR 21018 at commit a3cce89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!
