Swamidass & Baldi approx. items in intersection of two Bloom filters. Also function to create union (non-mutation) of two Bloom filters.
Bcpoole committed Feb 9, 2017
commit 7a3ad46ff86bd3d2d47f6a56bace1a0c4fd171c8
@@ -80,6 +80,11 @@ int getVersionNumber() {
   */
  public abstract long bitSize();

  /**
   * Swamidass & Baldi (2007) approximation for number of items in a Bloom filter
Contributor

Please describe the method and its properties (approximation error) first. Then put the reference in an @see tag with a permanent link to the paper: https://dx.doi.org/10.1021%2Fci600526a

   */
  public abstract double approxItems();
Contributor

Shouldn't this return a long rather than a double?

Author

I was debating this due to possible rounding errors.

Contributor

Yeah, but that would only be off by 1. I wouldn't worry about that, since it is approximate anyway.

Contributor

It might be easier to keep it as a double, because the estimate could be out of bounds if all the bits are set.
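
To make the trade-off in this thread concrete, here is a minimal standalone sketch of the estimate n ≈ -(m / k) · ln(1 - X / m), where X is the number of set bits, m the bit size, and k the number of hash functions. The class `SaturationSketch` and its `estimate` helper are hypothetical names, not part of the PR; they only mirror the formula under discussion. When every bit is set the argument of the logarithm is 0, so the estimate becomes positive infinity, which a `double` can represent and a `long` cannot.

```java
class SaturationSketch {
  // Hypothetical helper mirroring the Swamidass & Baldi estimate:
  // n ≈ -(m / k) * ln(1 - X / m), with X set bits, m total bits, k hash functions.
  static double estimate(long setBits, long numBits, int numHashFunctions) {
    double m = numBits;
    return -(m / numHashFunctions) * Math.log(1.0 - setBits / m);
  }

  public static void main(String[] args) {
    // Half-full filter: the estimate is finite.
    System.out.println(estimate(512, 1024, 5));

    // Saturated filter: ln(0) is -Infinity, so the estimate is +Infinity.
    double saturated = estimate(1024, 1024, 5);
    System.out.println(Double.isInfinite(saturated));

    // Narrowing the infinite estimate to long clamps it to Long.MAX_VALUE.
    System.out.println((long) saturated == Long.MAX_VALUE);
  }
}
```

Returning a `double` leaves it to the caller to decide how to treat a saturated filter, whereas a `long` return type would silently turn the infinite estimate into `Long.MAX_VALUE`.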


  /**
   * Puts an item into this {@code BloomFilter}. Ensures that subsequent invocations of
   * {@linkplain #mightContain(Object)} with the same item will always return {@code true}.
@@ -220,6 +220,62 @@ public BloomFilter mergeInPlace(BloomFilter other) throws IncompatibleMergeExcep
    return this;
  }

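  @Override
  public double approxItems() {
    double m = bitSize();
    // Swamidass & Baldi estimate: n ≈ -(m / k) * ln(1 - X / m), where X is the number of set bits.
    // The leading minus sign is needed because ln(1 - X / m) is never positive.
    return -(m / numHashFunctions) * Math.log(1 - (bits.cardinality() / m));
  }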

  /**
   * Returns a new Bloom filter that is the union of two Bloom filters.
   * Unlike {@code mergeInPlace}, this does not mutate either input.
   * Callers must ensure the Bloom filters are appropriately sized to avoid saturating them.
   *
   * @throws IncompatibleUnionException if either filter is null, or if the filters differ in
   *         class, bit size, or number of hash functions
   */
  public static BloomFilterImpl createUnionBloomFilter(BloomFilter bf1, BloomFilter bf2)
      throws IncompatibleUnionException {
    // Duplicates the logic of `isCompatible` here to provide a better error message.
    if (bf1 == null || bf2 == null) {
      throw new IncompatibleUnionException("Cannot union null bloom filters");
    }

    if (!(bf1 instanceof BloomFilterImpl)) {
      throw new IncompatibleUnionException(
        "Cannot union bloom filter of class " + bf1.getClass().getName()
      );
    } else if (!(bf2 instanceof BloomFilterImpl)) {
      throw new IncompatibleUnionException(
        "Cannot union bloom filter of class " + bf2.getClass().getName()
      );
    }

    BloomFilterImpl bfImpl1 = (BloomFilterImpl) bf1;
    BloomFilterImpl bfImpl2 = (BloomFilterImpl) bf2;

    if (bfImpl1.bitSize() != bfImpl2.bitSize()) {
      throw new IncompatibleUnionException("Cannot union bloom filters with different bit size");
    }

    if (bfImpl1.numHashFunctions != bfImpl2.numHashFunctions) {
      throw new IncompatibleUnionException(
        "Cannot union bloom filters with different number of hash functions");
    }

    // `BloomFilter.create(long)` takes an expected item count, not a bit size, so it would
    // generally yield a filter with a different size and number of hash functions. Build the
    // result with the inputs' parameters instead (assumes the package-private
    // `BloomFilterImpl(int numHashFunctions, long numBits)` constructor is available here).
    BloomFilterImpl bfUnion = new BloomFilterImpl(bfImpl1.numHashFunctions, bfImpl1.bitSize());

    bfUnion.bits.putAll(bfImpl1.bits);
    bfUnion.bits.putAll(bfImpl2.bits);
    return bfUnion;
  }

  /**
   * Swamidass & Baldi (2007) approximation for number of items in the intersection of two
   * Bloom filters
   */
  public static double approxItemsInIntersection(BloomFilterImpl bf1, BloomFilterImpl bf2)
      throws IncompatibleUnionException {
    BloomFilterImpl union = createUnionBloomFilter(bf1, bf2);

    return bf1.approxItems() + bf2.approxItems() - union.approxItems();
  }

  @Override
  public void writeTo(OutputStream out) throws IOException {
    DataOutputStream dos = new DataOutputStream(out);
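
The inclusion-exclusion step behind `approxItemsInIntersection` in the hunk above can be illustrated outside Spark. The following is a minimal sketch under stated assumptions: it uses `java.util.BitSet` as a stand-in for the PR's bit array, and `IntersectionSketch`, `estimate`, and `approxIntersection` are hypothetical names. It ORs the two bit sets into a copy, so neither input is mutated, and then combines the three size estimates as |A| + |B| - |A ∪ B|.

```java
import java.util.BitSet;

class IntersectionSketch {
  // Hypothetical helper: n ≈ -(m / k) * ln(1 - X / m), where X is the number of set bits.
  static double estimate(BitSet bits, int numBits, int numHashFunctions) {
    double m = numBits;
    return -(m / numHashFunctions) * Math.log(1.0 - bits.cardinality() / m);
  }

  // Inclusion-exclusion on the estimates: |A ∩ B| ≈ |A| + |B| - |A ∪ B|.
  static double approxIntersection(BitSet a, BitSet b, int numBits, int numHashFunctions) {
    BitSet union = (BitSet) a.clone(); // copy, so that `a` is left untouched
    union.or(b);
    return estimate(a, numBits, numHashFunctions)
        + estimate(b, numBits, numHashFunctions)
        - estimate(union, numBits, numHashFunctions);
  }

  public static void main(String[] args) {
    BitSet a = new BitSet(64);
    BitSet b = new BitSet(64);
    a.set(0, 20);  // bits 0..19
    b.set(10, 30); // bits 10..29
    System.out.println(approxIntersection(a, b, 64, 3));
    // The inputs keep their original 20 set bits each.
    System.out.println(a.cardinality() + " " + b.cardinality());
  }
}
```

Because the estimate is nonlinear in the number of set bits, the result is only approximate; for two non-empty bit sets that share no set bits it comes out slightly below zero rather than exactly zero, so callers may want to clamp it.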
@@ -0,0 +1,24 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.util.sketch;

public class IncompatibleUnionException extends Exception {
Contributor

We need some Javadoc here.

  public IncompatibleUnionException(String message) {
    super(message);
  }
}
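
As an illustration of the Javadoc requested in the review comment, here is one possible shape for a documented version of the exception. The wording of the comments is an assumption drawn from the `@throws` description in `createUnionBloomFilter`, not text from the PR, and `JavadocSketchDemo` is a hypothetical class that exists only to exercise the exception.

```java
// Hypothetical demo class, placed first so the file can be run directly.
class JavadocSketchDemo {
  public static void main(String[] args) {
    try {
      throw new IncompatibleUnionException("Cannot union null bloom filters");
    } catch (IncompatibleUnionException e) {
      System.out.println(e.getMessage());
    }
  }
}

/**
 * Thrown when two Bloom filters cannot be combined into a union, for example because one of
 * them is null, they are of different classes, or they differ in bit size or number of hash
 * functions.
 */
class IncompatibleUnionException extends Exception {
  /**
   * @param message description of why the two filters are incompatible
   */
  IncompatibleUnionException(String message) {
    super(message);
  }
}
```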