Skip to content

Conversation

@ianlcsd
Copy link

@ianlcsd ianlcsd commented Jul 21, 2017

The problem originated from the fact that spark can not read a cache block from disk store with size over 2G. When data is distributed highly skewed over partitions, we see problems recorded in SPY-1394, with any RDD cached on disk.

The ultimate solution of being to adapt the partitions to skew is a long shot,
but imposing a CSD policy on cache block size is feasible.

This PR is proposing the following policy:

  1. We cache block in memory if it is less than 2G and there is available storage memory.
  2. We don't cache any block over 2G size
  3. We drop the block to disk store when it is smaller than 2G and there is no sufficient storage memory.
  4. Introduced a spark configuration, spark.storage.MemoryStore.csdCacheBlockSizeLimit.
    A negative value should disable the block policy.
    The default is set to Integer.MAX_VALUE, but we could choose to make it smaller.

By applying the above policy, we ensure all the cache block is less 2G in size either in memory or disk. This braces us against skewed data and other type of abnormal caching pattern.

@davidnavas @markhamstra

@ianlcsd ianlcsd changed the title SPY-1394: Csd policy on caching block size SPY-1394: CSD caching policy at block level Jul 21, 2017
@ianlcsd ianlcsd closed this Jul 21, 2017
@ianlcsd ianlcsd reopened this Jul 21, 2017
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This api is challenging as we need to keep all the seen values until reaching 2G size limit. it is a potential OOM thread and hard to test.

@ianlcsd
Copy link
Author

ianlcsd commented Jul 24, 2017

jttp

@alteryx alteryx deleted a comment from ianlcsd Jul 25, 2017
if (!shouldCache) {
logBlockSizeLimitMessage(blockId, vector.estimateSize())
} else {
// We ran out of space while unrolling the values for this block

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I understand this comment for this case. Wouldn't this be the case where it's still cacheable?
oic, this is the original comment. Maybe need to put a line before this to separate cache from unroll.

@markhamstra markhamstra merged commit 2b5e422 into alteryx:csd-1.6 Jul 25, 2017
@ianlcsd ianlcsd deleted the csd-1.6 branch April 22, 2018 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants