
Conversation

@viirya
Member

@viirya viirya commented Apr 11, 2015

@SparkQA

SparkQA commented Apr 11, 2015

Test build #30085 has finished for PR 5475 at commit 1c3b06e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@marmbrus
Contributor

Thanks for working on this and for adding a test! I have two issues with the way this is being implemented:

  • We are still leaking accumulators on a per table scan basis, and so you are still going to have a problem if you never explicitly uncache the table
  • It seems like we should be able to do this without adding a global hash map. In particular, when you are uncaching you have a handle to the InMemoryRelation. Perhaps that class should define an uncache method that takes care of both unpersisting the data and unregistering the accumulators.

The ones inside of the table scan are clearly harder to deal with, but perhaps we can just make those lazy vals that are only initialized during testing due to some conf being set?
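The suggestion above can be sketched without Spark at all. In this minimal model, the relation itself owns an uncache() method that releases both the cached data and the accumulator registration, instead of a global map tracking them. Every name here (InMemoryRelationLike, registry, accId) is illustrative, not the real Spark 1.x API.

```scala
object UncacheSketch {
  // Stand-in for the global accumulator registry that would otherwise leak.
  val registry = scala.collection.mutable.Map[Long, String]()
  private var nextId = 0L

  class InMemoryRelationLike {
    private val accId: Long = {
      nextId += 1
      registry(nextId) = s"stats-$nextId" // register a stats "accumulator"
      nextId
    }
    var persisted = true

    // The proposed single cleanup entry point.
    def uncache(): Unit = {
      persisted = false        // would call cachedColumnBuffers.unpersist()
      registry.remove(accId)   // would unregister the stats accumulator
    }
  }
}
```

The point of the design is that whoever holds the relation can release everything through one call, so nothing depends on a side table staying in sync.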

@viirya
Member Author

viirya commented Apr 12, 2015

Thanks for commenting.

For the first issue, I registered the accumulators of the table scan (InMemoryColumnarTableScan). If users explicitly uncache the table, these accumulators are released, as I verified in testing. If the table is never explicitly uncached, they do leak, as you said. Using a configuration to control their initialization sounds good; I will try it.

For the second one, I tried to do this without a global hash map. However, the cache manager calls InMemoryRelation.withOutput to create a new InMemoryRelation each time it replaces a logical plan with its cached version, and accumulators are created in each of these copies. But the cache manager only keeps the original InMemoryRelation, and uncaching is done against that original. So I added the global hash map to deal with this problem.
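The situation described above can be modeled in a few lines, Spark-free: each withOutput-style copy allocates its own accumulator, but the cache manager only holds the original relation, so a shared map keyed by the cached table collects every copy's accumulator ids so that uncaching via the original can release them all. All names here are illustrative stand-ins.

```scala
object CopyTracking {
  // The "global hash map": table name -> accumulator ids from every copy.
  val accsByTable = scala.collection.mutable.Map[String, List[Long]]()
  private var nextId = 0L

  case class Relation(table: String) {
    // Each instance (original or copy) allocates a fresh accumulator id
    // and records it under the shared key.
    val accId: Long = { nextId += 1; nextId }
    accsByTable(table) = accId :: accsByTable.getOrElse(table, Nil)

    def withOutput(): Relation = Relation(table) // a copy allocates again
  }

  // Uncaching via the original relation can still find every copy's ids.
  def uncache(rel: Relation): List[Long] =
    accsByTable.remove(rel.table).getOrElse(Nil)
}
```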

@SparkQA

SparkQA commented Apr 12, 2015

Test build #30108 has finished for PR 5475 at commit 26c9bb6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@marmbrus
Contributor

I see, thanks for clarifying. It seems bad that we are creating accumulators per copy anyway then. Can we just pass in the accumulator with the buffers and statistics instead so that there is only ever one copy?

Honestly, the best solution would probably be to make the InMemoryRelation a BaseRelation so that it does not have to handle newInstance calls at all. Unfortunately this API did not exist when that class was created. However, I'm not sure it's worth refactoring at this point. This code was ported from Shark and I'd like to get rid of it completely in the next release or two.

For the second part, it would be great if we could just get rid of the accumulators completely. It feels bad to have all this extra logic only for a test. Can you think of any other way that we could test this?

@viirya
Member Author

viirya commented Apr 14, 2015

I just found that we could allocate the accumulator only for the first InMemoryRelation instead of for every following one, since the following relations only need the pre-computed statistics of the first relation.

However, because the statistics are computed lazily, that approach doesn't work. Instead, I put the accumulator in an array that is propagated to the following relations, so they do not allocate new accumulators when the array is passed along.

The accumulators in InMemoryColumnarTableScan are harder to remove completely if we want to collect these statistics, because they are computed during the RDD operations, and accumulators seem to be the only way to pass that information out. Currently I still use a configuration to control whether the accumulators are allocated, but now they live in an array in InMemoryRelation instead of the earlier global map. Of course, this array needs to be propagated to the following relations, because InMemoryColumnarTableScan depends on the later relations, not the original one.

I will update the code later.
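The fix described above can be sketched as follows, with illustrative names only: the statistics accumulators are allocated once, in an array on the first relation, and that same array is handed to every withOutput copy so no copy ever reallocates.

```scala
object SharedStats {
  final class StatsAccumulator { var value = 0L } // stand-in for an Accumulator

  case class Relation(output: Seq[String], stats: Array[StatsAccumulator]) {
    // Copies share the original's accumulator array instead of reallocating.
    def withOutput(newOutput: Seq[String]): Relation = copy(output = newOutput)
  }

  // Only the original relation allocates the accumulators.
  def original(output: Seq[String]): Relation =
    Relation(output, Array(new StatsAccumulator))
}
```

Because every copy aliases the same array, a table scan that runs against a later copy still feeds the accumulators the cache manager can reach through the original.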

@marmbrus
Contributor

For InMemoryColumnarTableScan, if the accumulators are only ever allocated during testing, I think it is okay if they leak.

@viirya
Member Author

viirya commented Apr 15, 2015

OK, updated.

@SparkQA

SparkQA commented Apr 15, 2015

Test build #30290 has finished for PR 5475 at commit dc1d5d5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 15, 2015

Test build #30291 has finished for PR 5475 at commit 0b41235.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@marmbrus
Contributor

Thanks! Merged to master.

@asfgit asfgit closed this in cf38fe0 Apr 15, 2015
mengxr pushed a commit to mengxr/spark that referenced this pull request Apr 23, 2015
Added in apache#5475. Pointed as broken in apache#5639.
/cc marmbrus

Author: Liang-Chi Hsieh <[email protected]>

Closes apache#5640 from viirya/fix_cached_test and squashes the following commits:

c0cf69a [Liang-Chi Hsieh] Fix broken cached test.
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 14, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
// Enable in-memory partition pruning
setConf(SQLConf.IN_MEMORY_PARTITION_PRUNING, "true")
// Enable in-memory table scan accumulators
setConf("spark.sql.inMemoryTableScanStatistics.enable", "true")
Member

@gatorsmile gatorsmile Jan 12, 2018

Why not set spark.sql.inMemoryTableScanStatistics.enable back to false in afterAll?
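What the reviewer asks for is the usual save-and-restore pattern, sketched here without the test harness: remember the previous value of the flag, set it for the body, and restore (or unset) it afterwards so the setting does not leak into other suites. The conf key is real; the Map-backed conf and the withConf helper are stand-ins for the actual SQLConf and test-suite API.

```scala
object ConfRestore {
  val conf = scala.collection.mutable.Map[String, String]()

  def withConf[T](key: String, value: String)(body: => T): T = {
    val old = conf.get(key) // remember the previous setting
    conf(key) = value
    try body
    finally old match {
      case Some(v) => conf(key) = v    // restore the old value
      case None    => conf.remove(key) // or unset if it was never set
    }
  }
}
```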

val accsSize = Accumulators.originals.size

sql("SELECT key FROM testData LIMIT 10").registerTempTable("t1")
sql("SELECT key FROM testData LIMIT 5").registerTempTable("t2")
Member


We also need to drop these two temp views after the tests.

@viirya viirya deleted the cache_memory_leak branch December 27, 2023 18:31