
Conversation

@boazberman (Contributor) commented Nov 13, 2021

Which issue does this PR close?

Closes #240.
This is a new version (Arrow-native) of #884, which went stale because I didn't have the time to finish it until now.

Rationale for this change

Described in the issue.

What changes are included in this PR?

Described in the issue.

Are there any user-facing changes?

@houqp added the performance (Make DataFusion faster) label on Nov 14, 2021
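
The review discussion below centers on this excerpt from the PR, which collects left-side row indices by scanning the visited bitmap: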
let indices = UInt64Array::from_iter_values(
    (0..visited_left_side.len())
        .filter_map(|v| (unmatched ^ visited_left_side.get_bit(v)).then(|| v as u64)),
);
Contributor:

Are we sure this isn't a degradation in performance by doing the check per item?

Contributor Author:


You mean the xor? I'm not sure how big of an impact it might have, but I can remove it if you believe it is worth it 🤷‍♂️
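
For context on the XOR: unmatched ^ bit acts as a conditional inversion, selecting the indices whose visited bit is unset when unmatched is true, and the set bits when it is false. A minimal standalone sketch of that selection logic (a plain bool slice and hypothetical names, not the PR's actual types):

// `unmatched ^ bit` is true either when we want unmatched rows
// (unmatched == true) and the bit is unset, or when we want matched
// rows (unmatched == false) and the bit is set.
fn select_indices(bits: &[bool], unmatched: bool) -> Vec<u64> {
    (0..bits.len())
        .filter_map(|i| (unmatched ^ bits[i]).then(|| i as u64))
        .collect()
}

fn main() {
    let visited = [true, false, true, false];
    assert_eq!(select_indices(&visited, true), vec![1, 3]); // unmatched rows
    assert_eq!(select_indices(&visited, false), vec![0, 2]); // matched rows
}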

Member:


It would be good to create a simple hash join micro-benchmark in datafusion/benches to validate this.
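
A sketch of what such a micro-benchmark could look like, assuming criterion as a dev-dependency and simulating the visited bitmap with a plain Vec<bool> (the file and function names here are hypothetical, not the benchmark the thread settles on):

// benches/left_join_indices.rs (hypothetical bench target)
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// The per-item XOR variant under discussion, over a simulated bitmap.
fn collect_indices(bits: &[bool], unmatched: bool) -> Vec<u64> {
    (0..bits.len())
        .filter_map(|i| (unmatched ^ bits[i]).then(|| i as u64))
        .collect()
}

fn bench_indices(c: &mut Criterion) {
    // 1M rows, roughly a third of them "visited".
    let bits: Vec<bool> = (0..1_000_000).map(|i| i % 3 == 0).collect();
    c.bench_function("collect unmatched indices (xor per item)", |b| {
        b.iter(|| collect_indices(black_box(&bits), black_box(true)))
    });
}

criterion_group!(benches, bench_indices);
criterion_main!(benches);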

Contributor Author:


I will try to find the time to write such a benchmark, but could I do it in a separate PR? Are we really that unconfident that this PR will improve performance (even if only by a bit)?

Member:


Can't speak for @Dandandan. At least for me, the memory saving is obvious 👍 But it's unclear to me how much overhead unmatched ^ visited_left_side.get_bit(v) adds. If we could look at the generated machine code and compare, that would be another way to prove whether it is more efficient than what we have right now.
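
As an aside, one way to do that comparison (a tooling suggestion, not something settled in this thread) is to isolate the loop in a small public function and dump its assembly with the cargo-asm subcommand:

cargo install cargo-asm
# my_crate::collect_indices is a hypothetical path to the isolated function
cargo asm my_crate::collect_indices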

Contributor:


I ran the performance tests against this branch (merged to master) and master:

For q13 (which has a left join), at TPC-H Scale Factor (SF) 10 (around 10 GB of data), on a GCP 16-core, 64 GB RAM machine running Ubuntu 20.04:

My results show no measurable difference.

Converted to Parquet via:

cargo run --release --bin tpch -- convert --input ./data --output ./tpch-parquet --format parquet

Then benchmarked using

cargo run --release --bin tpch -- benchmark datafusion --mem-table --format parquet --path ./tpch-parquet --query 13

On master:

Query 13 iteration 0 took 10017.1 ms
Query 13 iteration 1 took 10638.8 ms
Query 13 iteration 2 took 10010.3 ms
Query 13 avg time: 10222.08 ms

On this branch merged to master:

git checkout boazberman/master
git merge origin/master
Query 13 iteration 0 took 10438.6 ms
Query 13 iteration 1 took 10409.1 ms
Query 13 iteration 2 took 10030.8 ms
Query 13 avg time: 10292.82 ms

When I ran the same test a few more times, it reported average times with significant deviation:

...
Query 13 avg time: 10750.95 ms
...
Query 13 avg time: 10325.13 ms
...
Query 13 avg time: 10460.80 ms

This leads me to conclude the very small reported difference is noise.

Note that Q13 is:

select
    c_count,
    count(*) as custdist
from
    (
        select
            c_custkey,
            count(o_orderkey)
        from
            customer left outer join orders on
                        c_custkey = o_custkey
                    and o_comment not like '%special%requests%'
        group by
            c_custkey
    ) as c_orders (c_custkey, c_count)
group by
    c_count
order by
    custdist desc,
    c_count desc;

Contributor:


Maybe, to remove all doubt, we could skip the check on each iteration. Something like this (untested):

let indices = if unmatched {
    // Take every left-side index whose visited bit is still unset.
    UInt64Array::from_iter_values(
        (0..visited_left_side.len())
            .filter_map(|v| (!visited_left_side.get_bit(v)).then(|| v as u64)),
    )
} else {
    // Take every left-side index whose visited bit is set.
    UInt64Array::from_iter_values(
        (0..visited_left_side.len())
            .filter_map(|v| visited_left_side.get_bit(v).then(|| v as u64)),
    )
};
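
Hoisting the unmatched check out of the loop trades a little code duplication for a hot loop with no per-item flag test, which should make it easier for the compiler to keep the inner loop branch-free.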

Member:


Thanks @alamb for running the benchmark. I agree with your simplification as well 👍

Contributor:


@boazberman are you willing to make the proposed change? I am also happy to do so.

Contributor Author:


@alamb I made the proposed changes. Thanks.

@houqp (Member) left a comment:

LGTM, thank you @boazberman

@alamb merged commit 5ef42eb into apache:master on Dec 21, 2021
@alamb (Contributor) commented Dec 21, 2021

Thanks for sticking with this @boazberman ❤️ I know it was a long trip

@boaz-codota:

Yay, at last 🚀

@Dandandan (Contributor):

Thank you @boaz-codota


Labels

performance (Make DataFusion faster)

Linked issues

Left join could use bitmap for left join instead of Vec<bool> (#240)