Resolved review comments
SlavikBaranov committed Jun 15, 2015
commit 4d5b9543b60611a54a1c1023783d3078f4b0b214
@@ -45,7 +45,8 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
     loadFactor: Double)
   extends Serializable {

-  require(initialCapacity <= (1 << 30), "Can't make capacity bigger than 2^30 elements")
+  require(initialCapacity <= OpenHashSet.MAX_CAPACITY,
+    s"Can't make capacity bigger than ${OpenHashSet.MAX_CAPACITY} elements")
   require(initialCapacity >= 1, "Invalid initial capacity")
   require(loadFactor < 1.0, "Load factor must be less than 1.0")
   require(loadFactor > 0.0, "Load factor must be greater than 0.0")
@@ -223,7 +224,8 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
    */
   private def rehash(k: T, allocateFunc: (Int) => Unit, moveFunc: (Int, Int) => Unit) {
     val newCapacity = _capacity * 2
-    require(newCapacity <= (1 << 30), "Can't make capacity bigger than 2^30 elements")
+    require(newCapacity <= OpenHashSet.MAX_CAPACITY,
srowen (Member) commented:

I agree this is the theoretically largest number of elements that can be in the set. The failure will occur any time that twice the grow threshold exceeds MAX_CAPACITY, which can happen when the collection is less full than this. So I am actually not sure what's clearer here. Up to you.

I think we still have a little problem here: when capacity reaches 2^30, twice that number becomes negative, and newCapacity <= OpenHashSet.MAX_CAPACITY is still true because of overflow. Check whether the existing capacity is <= OpenHashSet.MAX_CAPACITY / 2 first?

SlavikBaranov (Contributor, Author) commented:

@srowen Integer overflow is not possible with the current MAX_CAPACITY setting, since 2^31 is still a positive number. Anyway, I've added a check for positive capacity. IMO it's a clearer way to guard against overflow.

srowen (Member) commented:

2^31 is positive, but it is not representable as a 32-bit signed Int:

scala> val _capacity = 1 << 30
_capacity: Int = 1073741824

scala> val newCapacity = _capacity * 2
newCapacity: Int = -2147483648

So that's an important check.

SlavikBaranov (Contributor, Author) commented:

Oh, yes. Sorry :)

s"Can't contain more than ${(loadFactor * OpenHashSet.MAX_CAPACITY).toInt} elements")
allocateFunc(newCapacity)
val newBitset = new BitSet(newCapacity)
val newData = new Array[T](newCapacity)
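
Putting the thread above together, the rehash guard presumably ends up combining the positivity check SlavikBaranov mentions with the upper bound. A minimal sketch of that combined check (the exact wording in the final commit may differ):

  // Sketch only: guarding capacity doubling against Int overflow.
  // If _capacity is already 1 << 30, then _capacity * 2 wraps to
  // Int.MinValue, so the upper-bound check alone would still pass;
  // requiring newCapacity > 0 as well catches the wraparound.
  val newCapacity = _capacity * 2
  require(newCapacity > 0 && newCapacity <= OpenHashSet.MAX_CAPACITY,
    s"Can't contain more than ${(loadFactor * OpenHashSet.MAX_CAPACITY).toInt} elements")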
@@ -277,9 +279,10 @@ class OpenHashSet[@specialized(Long, Int) T: ClassTag](
 private[spark]
 object OpenHashSet {

+  val MAX_CAPACITY = 1 << 30
   val INVALID_POS = -1
-  val NONEXISTENCE_MASK = 0x80000000
-  val POSITION_MASK = 0x7FFFFFFF
+  val NONEXISTENCE_MASK = 1 << 31
+  val POSITION_MASK = (1 << 31) - 1

   /**
    * A set of specialized hash function implementation to avoid boxing hash code computation
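The constants change in this last hunk is behaviorally a no-op: the shift expressions are bit-for-bit identical to the old hex literals (note that (1 << 31) - 1 relies on Int wraparound to produce Int.MaxValue). A quick illustrative sketch of the equivalence, plus the flag-packing pattern such masks are typically used for (the usage below is an illustration, not a quote from the file):

  // The rewritten constants equal the old hex literals exactly.
  assert((1 << 31) == 0x80000000)          // sign bit: NONEXISTENCE_MASK
  assert(((1 << 31) - 1) == 0x7FFFFFFF)    // lower 31 bits: POSITION_MASK

  // Typical use: pack a "not previously present" flag into the sign bit
  // of a position, then strip it off to recover the raw position.
  val pos = 42
  val tagged = pos | (1 << 31)             // flag set
  val raw = tagged & ((1 << 31) - 1)       // raw == 42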