Skip to content

Conversation

@boaz-codota
Copy link

Which issue does this PR close?

Closes #240 .

Rationale for this change

Described in the issue.

What changes are included in this PR?

Described in the issue.

Are there any user-facing changes?

No user facing changes.

@boaz-codota
Copy link
Author

@Dandandan I hope I understood the requested change correctly. Was not familiar with bitvec before, but I used the docs and I think I implemented it correctly

@Dandandan
Copy link
Contributor

@Dandandan I hope I understood the requested change correctly. Was not familiar with bitvec before, but I used the docs and I think I implemented it correctly

Yes, this is awesome, exactly what I meant! Impressive how little is changed.

I will do some benchmarking tonight or tomorrow to see if it's changing anything (but I believe this part is not the most performance sensitive anyway).

@alamb @andygrove @jorgecarleitao
What do you think about adding bitvec as dependency? It might be useful later for other datastructures / algorithms too.

@codecov-commenter
Copy link

Codecov Report

Merging #342 (00b4a84) into master (1702d6c) will decrease coverage by 0.00%.
The diff coverage is 71.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #342      +/-   ##
==========================================
- Coverage   75.72%   75.71%   -0.01%     
==========================================
  Files         143      143              
  Lines       23832    23835       +3     
==========================================
+ Hits        18046    18047       +1     
- Misses       5786     5788       +2     
Impacted Files Coverage Δ
datafusion/src/physical_plan/hash_join.rs 86.14% <71.42%> (-0.26%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1702d6c...00b4a84. Read the comment docs.

@jorgecarleitao
Copy link
Member

Thanks a lot @boaz-codota !

Is there any performance difference? imo It would be good to measure before adding a dependency.

Note that arrow also has an BooleanBufferBuilder or something that can create a bitmap.

@Dandandan
Copy link
Contributor

Thanks a lot @boaz-codota !

Is there any performance difference? imo It would be good to measure before adding a dependency.

Note that arrow also has an BooleanBufferBuilder or something that can create a bitmap.

The idea is here is not performance per se, but memory savings. As this is an in-memory join, any saved memory is welcome, as it will increase the chance a given join will fit in memory.

Using a boolean array buffer builder makes sense to me to avoid the dependency.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given we already have Arrow as dependency which has bitmaps, I like @jorgecarleitao 's idea of using code from that crate instead (such as https://docs.rs/arrow/4.0.0/arrow/bitmap/struct.Bitmap.html).

@boaz-codota
Copy link
Author

@alamb @jorgecarleitao I'm trying to replace the usage of bitvec with arrow's structures. Both BooleanBufferBuilder & Bitmap seems to behave like immutable data structures, which does not support the required use case (specifically: https://github.com/apache/arrow-datafusion/pull/342/files#diff-44d49c7778aa0c300afacdd7d89b0729ffaedd932d1ac34f3ef8db6b6cdfd73aR1084). So I'd love to change my code to use arrow's structures, but those two does not seem to support what is required.
I looked on other array structures under arrow, but none of them seem to fit as well (https://docs.rs/arrow/4.0.0/arrow/array/index.html).
Let me know how I should advance from here

@Dandandan
Copy link
Contributor

I agree both the datastructures don't seem very useful for this use case. I don't think there is a mutable structure like bitvec in Arrow...
WDYT @alamb @jorgecarleitao

@jorgecarleitao
Copy link
Member

It is just buried in the catacombs of the builder.rs 💀💀💀 BooleanBufferBuilder is its name.

@jorgecarleitao
Copy link
Member

Ah, sorry, I see; you need mutable access to an item in the middle of the builder; yeap, that is currently not supported. It assumes that it is always pushed/advanced.

@Dandandan
Copy link
Contributor

Yes, I think access to the middle only could be done by accessing / mutating the contents, in which case you are already close to rolling your own bitvec API.

@jorgecarleitao
Copy link
Member

To be fair, it is fairly easy to enable that on the BooleanBufferBuilder; we need to add a method to set the bit using bit_utils::set_bit. Buffer for validity is effectively an immutable bitvec, and BooleanBufferBuilder is the corresponding mutable version; we just haven't added that functionality.

I find the bitvec a bit overkill for a single use-case of a bitmap, specially when we already have a "bitvec" on arrow, but I am also fine pushing this through and leave this problem for later.

@alamb
Copy link
Contributor

alamb commented May 17, 2021

I also am hesitant to add a new dependency to DataFusion without some sort of compelling performance or memory improvement

@boazberman
Copy link
Contributor

I'll try to tackle adding the set_bit to Arrow's BooleanBufferBuilder later this week, hopefully contributing to arrow-rs will be as easy as contributing to this repo

@boazberman
Copy link
Contributor

I'm stuck on this, I've implemented the code which I think that should work, but it is not working as expected. I opened the PR to get help.
apache/arrow-rs#383
cc @alamb @Dandandan @jorgecarleitao

@alamb
Copy link
Contributor

alamb commented Jun 14, 2021

FYI arrow 4.3.0 has been released with the code in BooleanBufferBuilder

@boaz-codota
Copy link
Author

I'll refactor this as soon as I have some free time. Thanks @alamb !

@boaz-codota
Copy link
Author

@alamb tried to refactor the code now, noticed that there is no get_bit equivalent of set_bit in BooleanBufferBuilder and calling finish to make receive a Buffer is consuming the content of the BooleanBufferBuilder. I'm not sure whether or not I can consume the BooleanBufferBuilder within HashJoinStream.poll_next. Can I? If not, I guess I'd add another PR to BooleanBufferBuilder that exposes get_bit

@alamb
Copy link
Contributor

alamb commented Jun 29, 2021

I'm not sure whether or not I can consume the BooleanBufferBuilder within HashJoinStream.poll_next. Can I?

I am not sure

If not, I guess I'd add another PR to BooleanBufferBuilder that exposes get_bit

One thing that might be helpful would be to prototype out this PR using a local checkout / fork of the arrow-rs repo so that you can get everything working and find other potential "gotchas" without having to wait on the arrow-rs release cycle

Once you figured out what you needed then we could merge it according in parts

@boaz-codota
Copy link
Author

Implemented in an arrow native way here: #884

unkloud pushed a commit to unkloud/datafusion that referenced this pull request Mar 23, 2025
H0TB0X420 pushed a commit to H0TB0X420/datafusion that referenced this pull request Oct 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Left join could use bitmap for left join instead of Vec<bool>

6 participants