Left join could use bitmap for left join instead of Vec<bool> #1291

Conversation
```rust
let indices =
    UInt64Array::from_iter_values((0..visited_left_side.len()).filter_map(|v| {
        // XOR flips the selection: keep unvisited rows when `unmatched` is true,
        // visited rows otherwise.
        (unmatched ^ visited_left_side.get_bit(v)).then(|| v as u64)
    }));
```
Are we sure this isn't a degradation in performance by doing the check per item?
You mean the xor? I'm not sure how big of an impact it might have, but I can remove it if you believe it is worth it 🤷♂️
It would be good to create a simple hash join micro-benchmark in datafusion/benches to validate this.
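A minimal sketch of what such a benchmark could look like (hypothetical file datafusion/benches/visited_bitmap.rs; it only measures the per-bit read overhead in question, not a full hash join, and assumes the criterion crate used by the existing benches):

```rust
use arrow::array::BooleanBufferBuilder;
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_visited_scan(c: &mut Criterion) {
    let len = 1 << 20;

    let vec: Vec<bool> = vec![false; len];
    let mut bitmap = BooleanBufferBuilder::new(len);
    bitmap.append_n(len, false);

    // Count unmatched rows via a Vec<bool> versus a packed bitmap.
    c.bench_function("scan Vec<bool>", |b| {
        b.iter(|| (0..len).filter(|&i| !vec[black_box(i)]).count())
    });
    c.bench_function("scan BooleanBufferBuilder", |b| {
        b.iter(|| (0..len).filter(|&i| !bitmap.get_bit(black_box(i))).count())
    });
}

criterion_group!(benches, bench_visited_scan);
criterion_main!(benches);
```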
I will try to find the time to write such a benchmark, but could I do it in a separate PR? Are we really so unsure that this PR will improve performance (even if only by a bit)?
Can't speak for @Dandandan. At least for me the memory saving is obvious 👍 But it's unclear to me how much overhead `unmatched ^ visited_left_side.get_bit(v)` adds. If we can look into the generated machine code and compare, then that would be another way to prove whether it is more efficient than what we have right now.
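One low-effort way to do that comparison (a sketch, not code from this PR): isolate the two loop shapes as free functions over a plain `&[bool]` and inspect each in Compiler Explorer or via `cargo rustc --release -- --emit=asm`.

```rust
// Hypothetical standalone versions of the two strategies being discussed,
// kept free of arrow types so the generated code is easy to read.
pub fn xor_per_item(visited: &[bool], unmatched: bool) -> Vec<u64> {
    (0..visited.len())
        .filter_map(|v| (unmatched ^ visited[v]).then(|| v as u64))
        .collect()
}

pub fn branch_hoisted(visited: &[bool], unmatched: bool) -> Vec<u64> {
    if unmatched {
        (0..visited.len())
            .filter_map(|v| (!visited[v]).then(|| v as u64))
            .collect()
    } else {
        (0..visited.len())
            .filter_map(|v| visited[v].then(|| v as u64))
            .collect()
    }
}
```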
I ran the performance tests against this branch (merged to master) and against master itself.

For q13 (which has a left join), at TPC-H scale factor (SF) 10 (around 10 GB of data), on a GCP 16-core 64 GB RAM machine running Ubuntu 20.04, my results show no measurable difference.

Converted to Parquet via:

```shell
cargo run --release --bin tpch -- convert --input ./data --output ./tpch-parquet --format parquet
```

Then benchmarked using:

```shell
cargo run --release --bin tpch -- benchmark datafusion --mem-table --format parquet --path ./tpch-parquet --query 13
```

On master:
Query 13 iteration 0 took 10017.1 ms
Query 13 iteration 1 took 10638.8 ms
Query 13 iteration 2 took 10010.3 ms
Query 13 avg time: 10222.08 ms
On this branch merged to master:

```shell
git checkout boazberman/master
git merge origin/master
```

Query 13 iteration 0 took 10438.6 ms
Query 13 iteration 1 took 10409.1 ms
Query 13 iteration 2 took 10030.8 ms
Query 13 avg time: 10292.82 ms
When I ran the same test again a few times, it reported average times with significant deviation:
...
Query 13 avg time: 10750.95 ms
...
Query 13 avg time: 10325.13 ms
...
Query 13 avg time: 10460.80 ms
This leads me to conclude the very small reported difference is noise.
Note that Q13 is:
```sql
select
    c_count,
    count(*) as custdist
from
    (
        select
            c_custkey,
            count(o_orderkey)
        from
            customer left outer join orders on
                c_custkey = o_custkey
                and o_comment not like '%special%requests%'
        group by
            c_custkey
    ) as c_orders (c_custkey, c_count)
group by
    c_count
order by
    custdist desc,
    c_count desc;
```
Maybe, to remove all doubt, we could skip the check on each iteration. Something like (untested):

```rust
let indices = if unmatched {
    UInt64Array::from_iter_values((0..visited_left_side.len()).filter_map(|v| {
        (!visited_left_side.get_bit(v)).then(|| v as u64)
    }))
} else {
    UInt64Array::from_iter_values((0..visited_left_side.len()).filter_map(|v| {
        visited_left_side.get_bit(v).then(|| v as u64)
    }))
};
```
Thanks @alamb for running the benchmark. I agree with your simplification as well 👍
@boazberman are you willing to make the proposed change? I am also happy to do so
@alamb I made the proposed changes. Thanks.
houqp left a comment:
LGTM, thank you @boazberman
Thanks for sticking with this @boazberman ❤️ I know it was a long trip

Yay, at last 🚀

Thank you @boaz-codota
Which issue does this PR close?
Closes #240.
This is a new version (arrow-native) of #884 (which went stale because I didn't have the time to finish it until now).
Rationale for this change
Described in the issue.
What changes are included in this PR?
Described in the issue.
Are there any user-facing changes?