
Conversation

@viirya
Member

@viirya viirya commented Apr 11, 2015

@SparkQA

SparkQA commented Apr 11, 2015

Test build #30085 has finished for PR 5475 at commit 1c3b06e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@marmbrus
Contributor

Thanks for working on this and for adding a test! I have two issues with the way this is being implemented:

  • We are still leaking accumulators on a per table scan basis, and so you are still going to have a problem if you never explicitly uncache the table
  • It seems like we should be able to do this without adding a global hash map. In particular, when you are uncaching you have a handle to the InMemoryRelation. Perhaps that class should define an uncache method that takes care of both unpersisting the data and unregistering the accumulators.

The ones inside of the table scan are clearly harder to deal with, but perhaps we can just make those lazy vals that are only initialized during testing due to some conf being set?
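The suggestion above can be sketched without Spark at all. In this minimal model, the relation itself owns an uncache() method that releases both the cached data and the accumulator registration, instead of a global map tracking them. Every name here (InMemoryRelationLike, registry, accId) is illustrative, not the real Spark 1.x API.

```scala
object UncacheSketch {
  // Stand-in for the global accumulator registry that would otherwise leak.
  val registry = scala.collection.mutable.Map[Long, String]()
  private var nextId = 0L

  class InMemoryRelationLike {
    private val accId: Long = {
      nextId += 1
      registry(nextId) = s"stats-$nextId" // register a stats "accumulator"
      nextId
    }
    var persisted = true

    // The proposed single cleanup entry point.
    def uncache(): Unit = {
      persisted = false        // would call cachedColumnBuffers.unpersist()
      registry.remove(accId)   // would unregister the stats accumulator
    }
  }
}
```

The point of the design is that whoever holds the relation can release everything through one call, so nothing depends on a side table staying in sync.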

@viirya
Member Author

viirya commented Apr 12, 2015

Thanks for commenting.

For the first issue, I registered the accumulators of the table scan (InMemoryColumnarTableScan). If users explicitly uncache the table, these accumulators are released, as I verified in testing. If the table is never explicitly uncached, they do leak, as you said. Using a configuration to control their initialization sounds good; I will try it.

For the second one, I tried to do this without a global hash map. However, the cache manager calls InMemoryRelation.withOutput to create a new InMemoryRelation each time it replaces a logical plan with its cached version, and accumulators are created in each of these copies. But the cache manager only keeps the original InMemoryRelation, and uncaching is done against that original. So I added the global hash map to deal with this problem.
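The situation described above can be modeled in a few lines, Spark-free: each withOutput-style copy allocates its own accumulator, but the cache manager only holds the original relation, so a shared map keyed by the cached table collects every copy's accumulator ids so that uncaching via the original can release them all. All names here are illustrative stand-ins.

```scala
object CopyTracking {
  // The "global hash map": table name -> accumulator ids from every copy.
  val accsByTable = scala.collection.mutable.Map[String, List[Long]]()
  private var nextId = 0L

  case class Relation(table: String) {
    // Each instance (original or copy) allocates a fresh accumulator id
    // and records it under the shared key.
    val accId: Long = { nextId += 1; nextId }
    accsByTable(table) = accId :: accsByTable.getOrElse(table, Nil)

    def withOutput(): Relation = Relation(table) // a copy allocates again
  }

  // Uncaching via the original relation can still find every copy's ids.
  def uncache(rel: Relation): List[Long] =
    accsByTable.remove(rel.table).getOrElse(Nil)
}
```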

@SparkQA

SparkQA commented Apr 12, 2015

Test build #30108 has finished for PR 5475 at commit 26c9bb6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@marmbrus
Contributor

I see, thanks for clarifying. It seems bad that we are creating accumulators per copy anyway then. Can we just pass in the accumulator with the buffers and statistics instead so that there is only ever one copy?

Honestly, the best solution would probably be to make the InMemoryRelation a BaseRelation so that it does not have to handle newInstance calls at all. Unfortunately this API did not exist when that class was created. However, I'm not sure it's worth refactoring at this point. This code was ported from Shark and I'd like to get rid of it completely in the next release or two.

For the second part, it would be great if we could just get rid of the accumulators completely. It feels bad to have all this extra logic only for a test. Can you think of any other way that we could test this?

@viirya
Member Author

viirya commented Apr 14, 2015

I just found that we could allocate the accumulator only for the first InMemoryRelation instead of for every following one, since the following relations only need the pre-computed statistics of the first relation.

However, because the statistics are computed lazily, that approach doesn't work. Instead, I put the accumulator in an array that is propagated to the following relations, so they do not allocate new accumulators when the array is passed along.

The accumulators in InMemoryColumnarTableScan are harder to remove completely if we want to collect these statistics, because they are computed during the RDD operations, and accumulators seem to be the only way to pass that information out. Currently I still use a configuration to control whether the accumulators are allocated, but now they live in an array in InMemoryRelation instead of the earlier global map. Of course, this array needs to be propagated to the following relations, because InMemoryColumnarTableScan depends on the later relations, not the original one.

I will update the code later.
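The fix described above can be sketched as follows, with illustrative names only: the statistics accumulators are allocated once, in an array on the first relation, and that same array is handed to every withOutput copy so no copy ever reallocates.

```scala
object SharedStats {
  final class StatsAccumulator { var value = 0L } // stand-in for an Accumulator

  case class Relation(output: Seq[String], stats: Array[StatsAccumulator]) {
    // Copies share the original's accumulator array instead of reallocating.
    def withOutput(newOutput: Seq[String]): Relation = copy(output = newOutput)
  }

  // Only the original relation allocates the accumulators.
  def original(output: Seq[String]): Relation =
    Relation(output, Array(new StatsAccumulator))
}
```

Because every copy aliases the same array, a table scan that runs against a later copy still feeds the accumulators the cache manager can reach through the original.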

@marmbrus
Contributor

For InMemoryColumnarTableScan, if the accumulators are only ever allocated during testing, I think it is okay if they leak.

@viirya
Member Author

viirya commented Apr 15, 2015

OK, updated.

@SparkQA

SparkQA commented Apr 15, 2015

Test build #30290 has finished for PR 5475 at commit dc1d5d5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 15, 2015

Test build #30291 has finished for PR 5475 at commit 0b41235.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@marmbrus
Contributor

Thanks! Merged to master.

@asfgit asfgit closed this in cf38fe0 Apr 15, 2015
mengxr pushed a commit to mengxr/spark that referenced this pull request Apr 23, 2015
Added in apache#5475. Pointed as broken in apache#5639.
/cc marmbrus

Author: Liang-Chi Hsieh <[email protected]>

Closes apache#5640 from viirya/fix_cached_test and squashes the following commits:

c0cf69a [Liang-Chi Hsieh] Fix broken cached test.
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 14, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
// Enable in-memory partition pruning
setConf(SQLConf.IN_MEMORY_PARTITION_PRUNING, "true")
// Enable in-memory table scan accumulators
setConf("spark.sql.inMemoryTableScanStatistics.enable", "true")
Member

@gatorsmile gatorsmile Jan 12, 2018

Why not set spark.sql.inMemoryTableScanStatistics.enable back to false in afterAll?
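What the reviewer asks for is the usual save-and-restore pattern, sketched here without the test harness: remember the previous value of the flag, set it for the body, and restore (or unset) it afterwards so the setting does not leak into other suites. The conf key is real; the Map-backed conf and the withConf helper are stand-ins for the actual SQLConf and test-suite API.

```scala
object ConfRestore {
  val conf = scala.collection.mutable.Map[String, String]()

  def withConf[T](key: String, value: String)(body: => T): T = {
    val old = conf.get(key) // remember the previous setting
    conf(key) = value
    try body
    finally old match {
      case Some(v) => conf(key) = v    // restore the old value
      case None    => conf.remove(key) // or unset if it was never set
    }
  }
}
```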

val accsSize = Accumulators.originals.size

sql("SELECT key FROM testData LIMIT 10").registerTempTable("t1")
sql("SELECT key FROM testData LIMIT 5").registerTempTable("t2")
Member


We also need to drop these two temp views after the tests.

@viirya viirya deleted the cache_memory_leak branch December 27, 2023 18:31