[SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations #38004
Conversation
I did not add any doc for inherited methods as it would mostly overlap with the parent doc. I could add a few sentences and reference the parent doc, though.
I added a default implementation to match the parent interface.
In the future, we may also override `toStreaming`.
I had to override to avoid inheriting the default implementation from the parent interface.
@cloud-fan @rdblue @huaxingao @dongjoon-hyun @sunchao @viirya, could you take a look? This is the API from the design doc we discussed earlier. I have also created PR #38005 that shows how this API will be consumed.
 * the schema of the input metadata from Spark to data source.
 */
default StructType metadataSchema() {
  return null;
The default implementation is purely for compatibility.
Nit: I usually really want to avoid returning null. Even throwing an exception is better than returning null. This is probably just my opinion, though.
I am open to discussing what value should indicate the absence of a row ID or metadata schema.
I see `Optional` in a few other places in the connector API, so I could switch to that.
That said, using `Optional` is always debatable, so it is usually up to a particular project to decide.
Let me see how the implementations use this API, if there are any.
The risky case is an implementation that forgets to override this method and falls back to the default null; that is usually where NPEs come from.
Returning null is pretty risky, as you don't know when/where an NPE will happen. I'd prefer throwing `UnsupportedOperationException` by default.
@amaliujia @cloud-fan, we can do something like this.
/**
 * the schema of the ID columns from Spark to data source.
 */
default Optional<StructType> rowIdSchema() {
  throw new UnsupportedOperationException(
    getClass().getName() + " does not implement rowIdSchema");
}

/**
 * the schema of the input metadata from Spark to data source.
 */
default Optional<StructType> metadataSchema() {
  throw new UnsupportedOperationException(
    getClass().getName() + " does not implement metadataSchema");
}
Now the question is what to report in `schema()` for delta-based DELETE operations, where we do not pass the row; we pass only the row ID and metadata. One option is to report an empty struct, but let me know if you have other ideas.
The way I was approaching it initially:
`schema()` -> the row schema for new records (only MERGE adds new records)
`rowIdSchema()` -> the schema for the row ID passed to data sources to mark a record as deleted/updated
`metadataSchema()` -> the schema of projected metadata columns that contain some extra info about the row that is being deleted/updated
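To make this concrete, here is a minimal sketch (not part of this PR) of how those three schemas could look for the delta-based DELETE case in a hypothetical source keyed by a `pk` column with a `_file` metadata column; the names and types are assumptions:

```
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical schemas for a delta-based DELETE; column names/types are illustrative.
class DeleteSchemasSketch {
  StructType schema()         { return new StructType(); }                                    // empty: DELETE passes no new rows
  StructType rowIdSchema()    { return new StructType().add("pk", DataTypes.LongType); }      // identifies the row to delete
  StructType metadataSchema() { return new StructType().add("_file", DataTypes.StringType); } // extra info to encode the delete
}
```

Under this split, keeping `schema()` empty for DELETE matches the empty-struct option mentioned above.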
Please proceed to update this PR according to your proposed way, @aokolnychyi.
I've updated this PR and the reference implementation in PR #38005.
Sorry for the delay! Will take a look this week.
amaliujia left a comment:
A general question about the PR:
I see that the purpose of this PR is to add APIs, but would it make sense to include a reference implementation, or some test code that implements the APIs, to exercise them a bit?
@amaliujia, I have linked #38005, which adds test coverage and an implementation. I've split this work to reduce the scope of each PR and simplify reviewing. Also, converging on the implementation usually takes more time, so having smaller chunks of work helps to make some progress.
Thanks for the link to the implementation!
 * @param id a row ID to delete
 * @throws IOException if failure happens during disk/network IO like writing files
 */
void delete(T metadata, T id) throws IOException;
Sorry, it's been a while and I can't recall all the context. Do we have a concrete example where we need both metadata columns and ID columns to delete/update rows?
No problem. It has been a while for me too.
A partition tuple can be one example. A data source may be able to handle operations on a primary key, but knowing the old record's partition can help the data source encode a delete efficiently. In that case, the partition tuple is not really part of the row ID but rather extra metadata that allows the data source to encode a delete.
Another example is a metadata column that would carry the old version of the row.
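As a purely illustrative sketch of that point (not code from this PR or #38005), a writer handed both projections might use them like this; the `_partition` and `pk` columns are hypothetical:

```
import java.io.IOException;

import org.apache.spark.sql.catalyst.InternalRow;

// Purely illustrative: assumes the metadata projection is (_partition: string)
// and the row ID projection is (pk: long).
class PartitionAwareDeleteSketch {
  void delete(InternalRow metadata, InternalRow id) throws IOException {
    String partition = metadata.getUTF8String(0).toString(); // old record's partition tuple
    long pk = id.getLong(0);                                 // primary key identifying the row
    // a real writer would buffer (partition, pk) and flush it as a delete file/tombstone on commit
    System.out.println("delete pk=" + pk + " in partition=" + partition);
  }
}
```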
/**
 * the schema of the input data from Spark to data source.
 */
StructType schema();
In delta-based DELETE operations, I plan to return an empty struct type here. See the discussion above for context.
Any ideas are welcome.
+1, LGTM from my side. Thank you, @aokolnychyi.
Please review once more if you have some time, @cloud-fan, @amaliujia, @viirya, @sunchao, @huaxingao.
LGTM!

LGTM

Thank you, @aokolnychyi, @cloud-fan, @amaliujia, @viirya, @sunchao, @huaxingao.
 * @since 3.4.0
 */
@Experimental
public interface DeltaWriter<T> extends DataWriter<T> {
One more comment: we use the type parameter `T` because it can be in either row or columnar format. However, I think this new delta writer can only work with rows. We should probably do `DeltaWriter extends DataWriter<InternalRow>`.
Let me take a closer look on Monday.
@cloud-fan, could you elaborate a bit on why you think we can only work with rows here?
Because we have `void delete(T metadata, T id) throws IOException;`. Are we going to perform a delete with a batch of metadata and ID rows?
Well, I can see us passing batches of deletes and metadata at some point in the future. We can assume values at the same index will belong to the same row.
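For what it's worth, a hedged sketch of that batched scenario, assuming `T` were `ColumnarBatch`; this is speculative and not part of the PR:

```
import java.io.IOException;

import org.apache.spark.sql.vectorized.ColumnarBatch;

// Speculative: a batched delete where the metadata and id batches are aligned by row index.
class BatchedDeleteSketch {
  void delete(ColumnarBatch metadata, ColumnarBatch ids) throws IOException {
    for (int i = 0; i < ids.numRows(); i++) {
      long pk = ids.column(0).getLong(i);                           // hypothetical pk column
      String file = metadata.column(0).getUTF8String(i).toString(); // hypothetical _file column
      System.out.println("delete pk=" + pk + " from file=" + file); // stand-in for real encoding logic
    }
  }
}
```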
[SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations

Closes apache#38004 from aokolnychyi/spark-40551.

Lead-authored-by: Anton Okolnychyi <[email protected]>
Co-authored-by: aokolnychyi <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…sed sources

### What changes were proposed in this pull request?

This PR adds support for DELETE commands for delta-based sources and implements the API added in PR #38004.

Suppose there is a data source capable of encoding deletes using primary keys (`pk`). Also, this data source requires knowing the source file, which can be projected via a metadata column (`_file`), to encode deletes efficiently. As an example, there will be a table with 1 file that contains 3 records.

```
pk | salary | department
------------------------
1, 100, hr
2, 50, software
3, 150, hardware
```

This PR would rewrite `DELETE FROM t WHERE salary <= 100` to perform the following steps:

- find records that need to be removed by scanning the table with the delete condition;
- project required columns to encode deletes (`pk` + `_file` in our case);
- form a set of changes by adding a new `__row_operation` column with value `delete`;
- write the set of changes to the table using `WriteDeltaExec` and `DeltaWriter`.

The set of changes to encode for the DELETE statement above will look like this:

```
__row_operation | pk | _file
----------------------------
delete, 1, file_a.parquet
delete, 2, file_a.parquet
```

As opposed to group-based deletes that Spark already supports, the new logic will be able to discard records that did not change in the file that had matches (i.e. the record with `pk = 3` did not match the condition and was discarded). Then `WriteDeltaExec` will handle this set of changes and translate them into `delete()` calls on `DeltaWriter`.

In the future, this logic will be extended to also cover UPDATEs and MERGEs by adding `update` and `insert` row operations to the set of changes supported by `WriteDeltaExec`.

### Why are the changes needed?

These changes are needed as per SPIP SPARK-35801.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR comes with tests.

Closes #38005 from aokolnychyi/spark-40550-proto.

Authored-by: aokolnychyi <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
This PR adds DS v2 APIs for handling row-level operations for data sources that support deltas of rows.
Why are the changes needed?
These changes are part of the approved SPIP in SPARK-35801.
Does this PR introduce any user-facing change?
Yes, this PR adds new DS v2 APIs per the design doc (https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60).
How was this patch tested?
Tests will be part of the implementation PR.
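For readers skimming this page, here is a rough sketch of how a source might implement the new writer. Only `delete(T metadata, T id)` and the `DeltaWriter<T> extends DataWriter<T>` declaration are quoted verbatim in the review above; the other method names, signatures, and bodies are assumptions and may differ from the merged API:

```
import java.io.IOException;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.write.DeltaWriter;
import org.apache.spark.sql.connector.write.WriterCommitMessage;

// Hedged sketch of a no-op DeltaWriter; everything beyond delete() is assumed.
class NoopDeltaWriter implements DeltaWriter<InternalRow> {
  @Override
  public void delete(InternalRow metadata, InternalRow id) throws IOException {
    // record a delete keyed by the projected row ID (plus any metadata columns)
  }

  @Override
  public void update(InternalRow metadata, InternalRow id, InternalRow row) throws IOException {
    // record an update: drop the old row identified by id, add the new row
  }

  @Override
  public void insert(InternalRow row) throws IOException {
    // record a newly added row (e.g. produced by MERGE)
  }

  @Override
  public void write(InternalRow row) throws IOException {
    insert(row); // plain writes are treated as inserts in this sketch
  }

  @Override
  public WriterCommitMessage commit() throws IOException {
    return null; // a real writer would describe the written delta files here
  }

  @Override
  public void abort() throws IOException {
    // clean up any partially written state
  }

  @Override
  public void close() throws IOException {
    // release resources
  }
}
```

A real implementation would buffer the deletes, updates, and inserts and flush them as data/delete files on `commit()`.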