Conversation

@zsxwing
Member

@zsxwing zsxwing commented May 28, 2015

The previous growing strategy always doubles the capacity.

This PR adjusts the growing strategy: double the capacity, but if doubling would overflow, use the maximum capacity as the new capacity. It increases the maximum capacity of PartitionedPairBuffer from 2^29 to 2^30 - 1, the maximum capacity of PartitionedSerializedPairBuffer from 2^28 to (2^29) - 1, and the maximum capacity of AppendOnlyMap from 0.7 * (2^29) to (2^29).
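The overflow-aware doubling described above can be sketched roughly as follows; `GrowthSketch`, `MaximumCapacity`, and `nextCapacity` are illustrative names, not Spark's exact code, and the limit shown is PartitionedPairBuffer's new 2^30 - 1 (each collection has its own):

```scala
object GrowthSketch {
  // Illustrative limit: PartitionedPairBuffer's new maximum of 2^30 - 1.
  val MaximumCapacity: Int = (1 << 30) - 1

  def nextCapacity(capacity: Int): Int = {
    val doubled = capacity * 2
    // capacity * 2 overflows Int (goes negative) once capacity exceeds
    // 2^30 - 1; in that case, or when doubling merely passes the limit,
    // fall back to the maximum capacity instead of failing.
    if (doubled < 0 || doubled > MaximumCapacity) MaximumCapacity else doubled
  }
}
```

With the old strategy, a collection at half the maximum could never grow again; here it still gets one final growth step to the maximum.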

@zsxwing
Member Author

zsxwing commented May 28, 2015

cc @JoshRosen

@SparkQA

SparkQA commented May 28, 2015

Test build #33658 has finished for PR 6456 at commit 64fe227.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor


If we have too many records, we'll end up failing when we try to do a put, right? Can we make this fail with a more explicit message?
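The explicit failure being requested might look something like this; `ExplicitGrowError`, `MAXIMUM_CAPACITY`, and `growArray` are hypothetical names for illustration, not the code the PR actually added:

```scala
object ExplicitGrowError {
  val MAXIMUM_CAPACITY: Int = (1 << 29) - 1  // illustrative limit

  def growArray(capacity: Int): Int = {
    if (capacity >= MAXIMUM_CAPACITY) {
      // Fail with a clear message instead of a confusing downstream error
      // (e.g. a NegativeArraySizeException from an overflowed size).
      throw new IllegalStateException(
        s"Can't grow buffer beyond $MAXIMUM_CAPACITY elements")
    }
    math.min(capacity * 2, MAXIMUM_CAPACITY)
  }
}
```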

Member Author


Updated. How about now?

@SparkQA

SparkQA commented May 29, 2015

Test build #33692 has finished for PR 6456 at commit 05b6420.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Increase the maximum capacity of AppendOnlyMap from 0.7 * (2 ^ 29) to (2 ^ 29)
@zsxwing zsxwing changed the title [SPARK-7913][Core]Increase the maximum capacity of PartitionedPairBuffer and PartitionedSerializedPairBuffer [SPARK-7913][Core]Increase the maximum capacity of PartitionedPairBuffe, PartitionedSerializedPairBuffer and AppendOnlyMap May 29, 2015
@SparkQA

SparkQA commented May 29, 2015

Test build #33729 has finished for PR 6456 at commit e30b61b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor


I think we prefer `min(capacity * 2, MAXIMUM_CAPACITY)` over this kind of Scala syntax.

Member Author


I think you mean `(capacity * 2).min(MAXIMUM_CAPACITY)`, right? Updated.
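For the record, the infix `.min` form is Scala's `RichInt.min`, equivalent to `math.min`; a minimal illustration (`MinIdiom` and the limit are illustrative, not Spark's code):

```scala
object MinIdiom {
  val MAXIMUM_CAPACITY: Int = (1 << 29) - 1  // illustrative limit

  // Assumes capacity never exceeds 2^29, so capacity * 2 cannot overflow Int.
  def grow(capacity: Int): Int = (capacity * 2).min(MAXIMUM_CAPACITY)
}
```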

@sryza
Contributor

sryza commented May 29, 2015

This LGTM. @JoshRosen do you want to take a look before I merge?

Contributor


can you add a note on how you arrived at this number?

@andrewor14
Contributor

LGTM2

@SparkQA

SparkQA commented May 30, 2015

Test build #33781 has finished for PR 6456 at commit abcb932.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member Author

zsxwing commented Jun 11, 2015

ping @andrewor14

@andrewor14
Contributor

Sorry, I slipped on this a little. I'm merging it into master now. Thanks Ryan!

@asfgit asfgit closed this in a411a40 Jun 17, 2015
@zsxwing zsxwing deleted the SPARK-7913 branch June 18, 2015 02:29
@srowen
Member

srowen commented Jun 18, 2015

@zsxwing @andrewor14 this is very related to another recent change:
c13da20

These classes do similar things but with different approaches. For example, why is 2^29 the max size here instead of 2^30? They check args and fail on growth differently too. If Spark is maintaining these custom collection classes I think they need to at least be more consistent.

@zsxwing
Member Author

zsxwing commented Jun 18, 2015

@srowen I will make AppendOnlyMap consistent with OpenHashMap later.

As for 2^29: unlike OpenHashMap, AppendOnlyMap uses a single array to store both keys and values: `private var data = new Array[AnyRef](2 * capacity)`. Since the maximum length of an array is 2^31 - 1, `2 * capacity` can be at most 2^31 - 1, and because the capacity must be a power of two, the maximum capacity is 2^29.

PartitionedPairBuffer and PartitionedSerializedPairBuffer are array-like; their growth strategy is to grow the buffer when it reaches capacity.
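The bound explained above can be sketched as a quick check; `CapacityBound` and `fits` are illustrative names, not Spark code:

```scala
object CapacityBound {
  // JVM array length is capped at Int.MaxValue = 2^31 - 1. AppendOnlyMap
  // interleaves keys and values (data(2*i) = key, data(2*i + 1) = value),
  // so its backing array needs length 2 * capacity.
  val MaxArrayLength: Long = Int.MaxValue.toLong

  // Widen to Long before doubling so the check itself cannot overflow.
  def fits(capacity: Int): Boolean = 2L * capacity <= MaxArrayLength
}
```

2^30 already fails this check (2 * 2^30 = 2^31 exceeds the array limit), so the largest power-of-two capacity that fits is 2^29.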

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
…uffe, PartitionedSerializedPairBuffer and AppendOnlyMap

The previous growing strategy always doubles the capacity.

This PR adjusts the growing strategy: double the capacity, but if doubling would overflow, use the maximum capacity as the new capacity. It increases the maximum capacity of PartitionedPairBuffer from `2 ^ 29` to `2 ^ 30 - 1`, the maximum capacity of PartitionedSerializedPairBuffer from `2 ^ 28` to `(2 ^ 29) - 1`, and the maximum capacity of AppendOnlyMap from `0.7 * (2 ^ 29)` to `(2 ^ 29)`.

Author: zsxwing <[email protected]>

Closes apache#6456 from zsxwing/SPARK-7913 and squashes the following commits:

abcb932 [zsxwing] Address comments
e30b61b [zsxwing] Increase the maximum capacity of AppendOnlyMap
05b6420 [zsxwing] Update the exception message
64fe227 [zsxwing] Increase the maximum capacity of PartitionedPairBuffer and PartitionedSerializedPairBuffer
asfgit pushed a commit that referenced this pull request Jun 19, 2015
…f OpenHashSet and consistent exception message

This is a follow up PR for #6456 to make AppendOnlyMap consistent with OpenHashSet.

/cc srowen andrewor14

Author: zsxwing <[email protected]>

Closes #6879 from zsxwing/append-only-map and squashes the following commits:

912c0ad [zsxwing] Fix the doc
dd4385b [zsxwing] Make AppendOnlyMap use the same growth strategy of OpenHashSet and consistent exception message
