-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Left join could use bitmap for left join instead of Vec<bool> #342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@Dandandan I hope I understood the requested change correctly. Was not familiar with bitvec before, but I used the docs and I think I implemented it correctly |
Yes, this is awesome, exactly what I meant! Impressive how little is changed. I will do some benchmarking tonight or tomorrow to see if it's changing anything (but I believe this part is not the most performance sensitive anyway). @alamb @andygrove @jorgecarleitao |
Codecov Report
@@ Coverage Diff @@
## master #342 +/- ##
==========================================
- Coverage 75.72% 75.71% -0.01%
==========================================
Files 143 143
Lines 23832 23835 +3
==========================================
+ Hits 18046 18047 +1
- Misses 5786 5788 +2
Continue to review full report at Codecov.
|
|
Thanks a lot @boaz-codota ! Is there any performance difference? imo It would be good to measure before adding a dependency. Note that arrow also has an |
The idea is here is not performance per se, but memory savings. As this is an in-memory join, any saved memory is welcome, as it will increase the chance a given join will fit in memory. Using a boolean array buffer builder makes sense to me to avoid the dependency. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given we already have Arrow as dependency which has bitmaps, I like @jorgecarleitao 's idea of using code from that crate instead (such as https://docs.rs/arrow/4.0.0/arrow/bitmap/struct.Bitmap.html).
|
@alamb @jorgecarleitao I'm trying to replace the usage of bitvec with arrow's structures. Both |
|
I agree both the datastructures don't seem very useful for this use case. I don't think there is a mutable structure like bitvec in Arrow... |
|
It is just buried in the catacombs of the |
|
Ah, sorry, I see; you need mutable access to an item in the middle of the builder; yeap, that is currently not supported. It assumes that it is always pushed/advanced. |
|
Yes, I think access to the middle only could be done by accessing / mutating the contents, in which case you are already close to rolling your own bitvec API. |
|
To be fair, it is fairly easy to enable that on the I find the bitvec a bit overkill for a single use-case of a bitmap, specially when we already have a "bitvec" on arrow, but I am also fine pushing this through and leave this problem for later. |
|
I also am hesitant to add a new dependency to DataFusion without some sort of compelling performance or memory improvement |
|
I'll try to tackle adding the set_bit to Arrow's |
|
I'm stuck on this, I've implemented the code which I think that should work, but it is not working as expected. I opened the PR to get help. |
|
FYI arrow 4.3.0 has been released with the code in BooleanBufferBuilder |
|
I'll refactor this as soon as I have some free time. Thanks @alamb ! |
|
@alamb tried to refactor the code now, noticed that there is no |
I am not sure
One thing that might be helpful would be to prototype out this PR using a local checkout / fork of the arrow-rs repo so that you can get everything working and find other potential "gotchas" without having to wait on the arrow-rs release cycle Once you figured out what you needed then we could merge it according in parts |
|
Implemented in an arrow native way here: #884 |
Which issue does this PR close?
Closes #240 .
Rationale for this change
Described in the issue.
What changes are included in this PR?
Described in the issue.
Are there any user-facing changes?
No user facing changes.