@szehon-ho (Member)

What changes were proposed in this pull request?

Add docs for SPJ

Why are the changes needed?

There are no docs describing SPJ, even though it is mentioned in migration notes: #46673

Does this PR introduce any user-facing change?

No

How was this patch tested?

Checked the new text

Was this patch authored or co-authored using generative AI tooling?

No

@HyukjinKwon (Member)

But it would be great to have eyes from other folks like @sunchao and @cloud-fan

@sunchao (Member) left a comment:

LGTM too with a few nits, thanks @szehon-ho !


Storage Partition Join (SPJ) is an optimization technique in Spark SQL that makes use of the existing storage layout to avoid the shuffle phase.

This is a generalization of the concept of Bucket Joins, which are only applicable to [bucketed](sql-data-sources-load-save-functions.html#bucketing-sorting-and-partitioning) tables, to tables partitioned by functions registered in FunctionCatalog. Storage Partition Joins are currently supported for compatible V2 DataSources.

Member:

nit: this bucketed link doesn't work

Member Author:

Hm, I built the site and it seems to work, but let me know if another one is better here?


This is a generalization of the concept of Bucket Joins, which are only applicable to [bucketed](sql-data-sources-load-save-functions.html#bucketing-sorting-and-partitioning) tables, to tables partitioned by functions registered in FunctionCatalog. Storage Partition Joins are currently supported for compatible V2 DataSources.
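
As a rough illustration of the idea (a toy sketch in plain Python, not Spark code and not how Spark implements SPJ), assume both sides of a join are already stored in partitions produced by the same partition function. The names `bucket`, `write_partitioned`, and `partition_wise_join` are hypothetical and exist only for this example. Under that assumption, rows with equal keys always land in partitions with the same index, so each pair of partitions can be joined locally and no rows need to move between partitions.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # assumption: both tables use the same partition count


def bucket(key: int, n: int = NUM_PARTITIONS) -> int:
    """Hypothetical partition function shared by both tables."""
    return key % n


def write_partitioned(rows):
    """Simulate a storage layout: group (key, value) rows by bucket(key)."""
    parts = defaultdict(list)
    for key, value in rows:
        parts[bucket(key)].append((key, value))
    return parts


def partition_wise_join(left_parts, right_parts):
    """Inner-join partition i of the left side only with partition i of the right side."""
    out = []
    for i in range(NUM_PARTITIONS):
        # Build a lookup table from the right-hand rows of this partition only.
        lookup = defaultdict(list)
        for key, right_value in right_parts.get(i, []):
            lookup[key].append(right_value)
        # Probe it with the left-hand rows of the same partition.
        for key, left_value in left_parts.get(i, []):
            for right_value in lookup.get(key, []):
                out.append((key, left_value, right_value))
    return out


orders = write_partitioned([(1, "order-a"), (2, "order-b"), (5, "order-c")])
users = write_partitioned([(1, "alice"), (5, "eve"), (7, "zed")])
joined = partition_wise_join(orders, users)
```

Here `joined` contains only the rows for keys 1 and 5, found without ever comparing rows from different partitions. The real feature depends on the data source reporting its partitioning to Spark; this sketch only shows why a shared layout removes the need for a shuffle.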

The following SQL properties enable Storage Partition Join.

Member:

nit: perhaps "The following SQL properties enable Storage Partition Join and various optimizations of it"?

Member Author:

done, added 'in different join queries with various optimizations.' (as some flags are about different scenarios)

@HyukjinKwon (Member)

Merged to master.

@szehon-ho (Member Author)

Thank you @HyukjinKwon @sunchao
