[SPARK-29461][SQL] Measure the number of records being updated for JDBC writer #26109
Conversation
    assert(expected === runAndReturnMetrics(job, _.taskMetrics.outputMetrics.recordsWritten))
  }

  private def runAndReturnMetrics(job: => Unit, collector: (SparkListenerTaskEnd) => Long): Long = {
This is copied from InputOutputMetricsSuite - please let me know if it should be extracted into some shared utility class/object.
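For context, a rough sketch of what such a helper looks like, following the signature in the diff and the InputOutputMetricsSuite pattern (the listener-bus wait and the buffer details are assumptions, not the exact code):

```scala
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

private def runAndReturnMetrics(
    job: => Unit,
    collector: (SparkListenerTaskEnd) => Long): Long = {
  val taskMetrics = new ArrayBuffer[Long]()

  // Collect the requested metric from every finished task.
  val listener = new SparkListener() {
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
      taskMetrics += collector(taskEnd)
    }
  }
  sparkContext.addSparkListener(listener)

  job

  // Make sure all task-end events were delivered before reading the buffer.
  sparkContext.listenerBus.waitUntilEmpty(10000)
  sparkContext.removeSparkListener(listener)
  taskMetrics.sum
}
```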
Test build #112011 has finished for PR 26109 at commit
Test build #112020 has finished for PR 26109 at commit
cc @maropu @cloud-fan @wangyum initially, based on the commit history.
Test build #112061 has finished for PR 26109 at commit
      dialect: JdbcDialect,
      isolationLevel: Int,
-     options: JDBCOptions): Iterator[Byte] = {
+     options: JDBCOptions): Long = {
Do we need to update this here, instead of updating the metric inside this method?
I guessed it would be cleaner to handle the metric outside of the method, since then the metric is not updated if savePartition throws an exception. To do the same inside savePartition we would have to put the metric update at the end of the finally block, which doesn't seem any cleaner.
In other words, this approach doesn't support incremental updates of the metric, and it records nothing for a partition that was partially written and then failed. Skipping the update makes total sense when the database supports transactions, but if it doesn't and some records are left behind on failure, I'm not sure whether we should update the metric or not. How do we deal with partial output elsewhere?
I've just revisited SparkHadoopWriter and realized it updates the metric regardless of whether the task succeeds. Got it. I'll move the metric update into the savePartition method. Thanks!
I thought we took care of the metric only if the transaction committed, like this:
    } finally {
      if (!committed) {
        // The stage must fail. We got here through an exception path, so
        // let the exception through unless rollback() or close() want to
        // tell the user about another problem.
        if (supportsTransactions) {
          conn.rollback()
        }
        conn.close()
      } else {
        // If the transaction committed, updates the metric
        outputMetrics.setRecordsWritten(recordsWritten)
        // The stage must succeed. We cannot propagate any exception close() might throw.
        try {
          conn.close()
        } catch {
          case e: Exception => logWarning("Transaction succeeded, but closing failed", e)
        }
      }
cc: @HyukjinKwon @wangyum
Yeah, but it looks like SparkHadoopWriter just updates the metric for any output being written - maybe that's because there is no notion of a transaction there. If we take transactions into account, it makes sense to update the metric only when the transaction is committed, but we might also want to update it when both committed and supportsTransactions are false, to reflect the dirty output. WDYT?
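A minimal sketch of that policy, reusing the names from the snippet quoted above (committed, supportsTransactions, outputMetrics, recordsWritten); where exactly the update lands in the finally block is an assumption, not the final patch:

```scala
} finally {
  if (!committed) {
    if (supportsTransactions) {
      // Rolled back: nothing was persisted, so leave the metric untouched.
      conn.rollback()
    } else {
      // No transaction support: some rows may already be in the table,
      // so report the dirty output that was actually written.
      outputMetrics.setRecordsWritten(recordsWritten)
    }
    conn.close()
  } else {
    // Committed: report everything that was written.
    outputMetrics.setRecordsWritten(recordsWritten)
    conn.close()
  }
}
```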
Yeah, that looks reasonable to me. So, can you brush up the code based on that?
OK let me update the patch. Thanks!
    sparkContext.removeSparkListener(listener)
    taskMetrics.sum
  }

nit: remove this blank.
        }
        if (rowCount > 0) {
-         stmt.executeBatch()
+         totalUpdatedRows += stmt.executeBatch().sum
Can't we just sum up rowCount?
I took this approach to ensure we only count actual updates, but I'm not sure how Spark handles this for other writers. Same for the number of bytes written: I was actually asked to update that as well, but there's no way to get the actual value from the JDBC API, so I skipped it.
Please let me know how Spark has been updating these metrics - I'll follow that approach. Thanks!
SparkHadoopWriter uses a row count as the metric:

    recordsWritten += 1

Since the values returned by stmt.executeBatch seem to be JDBC-implementation specific, IMHO it's OK to just do the same as SparkHadoopWriter.
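For reference, a minimal sketch of that counting style (the writer and variable names are illustrative, not the actual SparkHadoopWriter code):

```scala
var recordsWritten = 0L
while (iterator.hasNext) {
  val record = iterator.next()
  writer.write(record)   // hypothetical per-record write call
  recordsWritten += 1    // count what was handed to the sink, not what the sink reports back
}
outputMetrics.setRecordsWritten(recordsWritten)
```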
OK. Thanks for the guidance! What about the number of bytes? Reading the length of a file is easy, but measuring the size of every row seems nontrivial.
      val totalUpdatedRows = savePartition(
        getConnection, table, iterator, rddSchema, insertStmt, batchSize, dialect, isolationLevel,
        options)
      outMetrics.setRecordsWritten(outMetrics.recordsWritten + totalUpdatedRows)
outMetrics.setRecordsWritten(totalUpdatedRows)?
Test build #112536 has finished for PR 26109 at commit
Test build #112582 has finished for PR 26109 at commit
retest this, please
Test build #112595 has finished for PR 26109 at commit
          i = i + 1
        }
        stmt.addBatch()
        rowCount += 1
Can't we move rowCount outside the try and then just use it for the metric?
It's used to determine whether one more flush is needed at the end of the iteration. It could just be a boolean flag, but we need a dedicated variable to track this either way.
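To illustrate, a rough sketch of the batching loop as it appears in the quoted diffs (column binding omitted; the executeBatch-sum accumulation follows the diff above and may differ from the final merged code):

```scala
var rowCount = 0            // rows added to the current batch since the last flush
var totalUpdatedRows = 0L   // accumulated for outputMetrics.recordsWritten

while (iterator.hasNext) {
  val row = iterator.next()
  // ... bind each column of `row` into `stmt` using the dialect's setters ...
  stmt.addBatch()
  rowCount += 1
  if (rowCount % batchSize == 0) {
    // Flush a full batch; executeBatch() returns per-statement update counts.
    totalUpdatedRows += stmt.executeBatch().sum
    rowCount = 0
  }
}
if (rowCount > 0) {
  // A non-zero rowCount here means a partially filled batch still needs flushing.
  totalUpdatedRows += stmt.executeBatch().sum
}
```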
Ur, I see.
Can you leave some comments somewhere about the policy to collect metrics?
Just added it. 6e908d1
Thanks!
cc: @HyukjinKwon
HyukjinKwon left a comment
Thanks for cc'ing me, @maropu. Looks good to me too.
Test build #112634 has finished for PR 26109 at commit
Thanks, @HeartSaVioR and @HyukjinKwon! Merged to master.
Thanks all for reviewing and merging!
Referenced follow-up commit (SPARK-29687, #26346):

### What changes were proposed in this pull request?
Fix the JDBC metrics counter data type. Related pull request: [26109](#26109).

### Why are the changes needed?
Avoid overflow.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #26346 from ulysses-you/SPARK-29687.

Authored-by: ulysses <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
What changes were proposed in this pull request?
This patch adds the functionality to measure the number of records being written by the JDBC writer. In practice, the value is the number of records reported as updated by the executed statements, since per the JDBC spec they return update counts.
Why are the changes needed?
Output metrics for the JDBC writer are currently missing. The value of "bytesWritten" is also missing, but we can't measure it via the JDBC API.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit test added.
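As an illustration of how the new metric can be observed from an application (a hypothetical example; the H2 URL, driver, and table name are placeholders, and the H2 driver is assumed to be on the classpath):

```scala
import java.util.concurrent.atomic.AtomicLong

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

object JdbcRecordsWrittenExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("jdbc-metrics").getOrCreate()

    val recordsWritten = new AtomicLong(0L)
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        // With this patch, JDBC writes populate outputMetrics.recordsWritten.
        Option(taskEnd.taskMetrics).foreach { m =>
          recordsWritten.addAndGet(m.outputMetrics.recordsWritten)
        }
      }
    })

    spark.range(100).toDF("id").write
      .format("jdbc")
      .option("url", "jdbc:h2:mem:testdb")   // placeholder URL
      .option("dbtable", "people")           // placeholder table
      .option("driver", "org.h2.Driver")     // placeholder driver
      .mode("overwrite")
      .save()

    Thread.sleep(2000)  // crude wait for the asynchronous listener bus in this sketch
    println(s"records written = ${recordsWritten.get()}")  // should report 100 with this patch

    spark.stop()
  }
}
```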