
Conversation

@HeartSaVioR
Contributor

What changes were proposed in this pull request?

This patch adds the functionality to measure the number of records written by the JDBC writer. In practice the value is the number of records updated by the executed statements, since per the JDBC spec the driver returns update counts.
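As a rough illustration of where that number comes from (the `stmt` and `totalUpdatedRows` names here are placeholders, not necessarily the identifiers in the patch):

    // Per the JDBC spec, executeBatch() returns one update count per batched
    // statement; summing them yields the number of records written/updated.
    val updateCounts: Array[Int] = stmt.executeBatch()
    totalUpdatedRows += updateCounts.sum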

Why are the changes needed?

Output metrics for the JDBC writer are currently missing. The value of "bytesWritten" is also missing, but it cannot be measured via the JDBC API.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test added.

assert(expected === runAndReturnMetrics(job, _.taskMetrics.outputMetrics.recordsWritten))
}

private def runAndReturnMetrics(job: => Unit, collector: (SparkListenerTaskEnd) => Long): Long = {
Contributor Author

This is copied from InputOutputMetricsSuite - please let me know if it should be extracted into some utility class/object.
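For reference, a helper along these lines (a sketch assuming the shape used in InputOutputMetricsSuite inside a Spark test suite; `listenerBus` is Spark-internal, and this is not necessarily the exact code in this patch):

    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    private def runAndReturnMetrics(
        job: => Unit,
        collector: (SparkListenerTaskEnd) => Long): Long = {
      val taskMetrics = new ArrayBuffer[Long]()
      // Collect the requested task metric from every finished task.
      val listener = new SparkListener() {
        override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
          taskMetrics += collector(taskEnd)
        }
      }
      sparkContext.addSparkListener(listener)
      job
      // Make sure all listener events have been processed before reading the results.
      sparkContext.listenerBus.waitUntilEmpty()
      sparkContext.removeSparkListener(listener)
      taskMetrics.sum
    }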

@SparkQA

SparkQA commented Oct 14, 2019

Test build #112011 has finished for PR 26109 at commit 11852c7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 14, 2019

Test build #112020 has finished for PR 26109 at commit 298d968.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

cc @maropu @cloud-fan @wangyum as an initial set of reviewers, based on the commit history.

@SparkQA

SparkQA commented Oct 15, 2019

Test build #112061 has finished for PR 26109 at commit 5f4c9e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR changed the title from "[SPARK-29461][SQL] Measure records being updated for JDBC writer" to "[SPARK-29461][SQL] Measure the number of records being updated for JDBC writer" on Oct 15, 2019
    dialect: JdbcDialect,
    isolationLevel: Int,
-   options: JDBCOptions): Iterator[Byte] = {
+   options: JDBCOptions): Long = {
Member

Do we need to update this, instead of updating the metric inside this method?

Contributor Author

I guess it would be cleaner to handle the metric outside of the method, as the metric then won't be updated if savePartition throws an exception. Doing the same inside savePartition would mean adding the metric update at the end of the finally block, which doesn't seem cleaner.

In other words, this approach doesn't support incremental updates of the metric, and there is no update for partially written, failed tasks. Not updating makes total sense when the database supports transactions, but if it doesn't and some records are left behind on failure, I'm not sure whether we should update the metric. How are we dealing with partial output?

Contributor Author

I've just revisited SparkHadoopWriter and realized it updates the metric regardless of whether the task succeeds. Got it. I'll move the metric update into the savePartition method. Thanks!

Member

I thought we took care of the metric only if the transaction committed, like this:

    } finally {
      if (!committed) {
        // The stage must fail.  We got here through an exception path, so
        // let the exception through unless rollback() or close() want to
        // tell the user about another problem.
        if (supportsTransactions) {
          conn.rollback()
        }
        conn.close()
      } else {
        // If the transaction committed, updates the metric
        outputMetrics.setRecordsWritten(recordsWritten)

        // The stage must succeed.  We cannot propagate any exception close() might throw.
        try {
          conn.close()
        } catch {
          case e: Exception => logWarning("Transaction succeeded, but closing failed", e)
        }
      }

cc: @HyukjinKwon @wangyum

Contributor Author

Yeah, but it looks like SparkHadoopWriter just updates the metric for any output being written - maybe because there's no notion of a transaction there. If we take transactions into account, it would make sense to only update the metric when the transaction is committed, but we might also want to update it when both committed and supportsTransactions are false, to reflect the metric for dirty outputs. WDYT?
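Something like this condition, assuming the existing `committed` / `supportsTransactions` flags and a `recordsWritten` counter in savePartition (a sketch of the proposal, not the merged code):

    // Record the metric if the transaction committed, or if the database has no
    // transaction support at all, so dirty/partial output is still reflected.
    if (committed || !supportsTransactions) {
      outputMetrics.setRecordsWritten(recordsWritten)
    }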

Member

Yeah, that looks reasonable to me. So, can you brush up the code based on that?

Contributor Author

OK let me update the patch. Thanks!

sparkContext.removeSparkListener(listener)
taskMetrics.sum
}

Member

nit: remove this blank line.

  }
  if (rowCount > 0) {
-   stmt.executeBatch()
+   totalUpdatedRows += stmt.executeBatch().sum
Member

Can't we just sum up rowCount?

Contributor Author

I took this approach to ensure we only count the actual updates, but I'm not sure how Spark has been handling this elsewhere. Same for the number of bytes written. I was actually asked to update the number of bytes written as well, but there's no way to get the actual value from JDBC, so I skipped it.

Please let me know how Spark has been updating these metrics - I'll follow the approach. Thanks!

Member

SparkHadoopWriter uses a row count as the metric.

Since the returned values of stmt.executeBatch seem to be JDBC-implementation specific, IMHO it's OK to just do the same as SparkHadoopWriter.
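For illustration, the row-count style of accounting looks roughly like this (a generic sketch; `writer` is a stand-in for the actual sink, not the SparkHadoopWriter source):

    // Count every row handed to the writer, regardless of what the sink reports
    // back, then publish the total as the output metric.
    var recordsWritten = 0L
    while (iterator.hasNext) {
      writer.write(iterator.next())
      recordsWritten += 1
    }
    outputMetrics.setRecordsWritten(recordsWritten)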

Contributor Author

OK. Thanks for the guidance! What about the number of bytes? Reading the length of a file is easy, but measuring the size of every row seems nontrivial.

    val totalUpdatedRows = savePartition(
      getConnection, table, iterator, rddSchema, insertStmt, batchSize, dialect, isolationLevel,
      options)
    outMetrics.setRecordsWritten(outMetrics.recordsWritten + totalUpdatedRows)
Member

outMetrics.setRecordsWritten(totalUpdatedRows)?

@SparkQA

SparkQA commented Oct 23, 2019

Test build #112536 has finished for PR 26109 at commit 7bae87e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 24, 2019

Test build #112582 has finished for PR 26109 at commit 620d111.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Oct 24, 2019

Test build #112595 has finished for PR 26109 at commit 620d111.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

      i = i + 1
    }
    stmt.addBatch()
    rowCount += 1
Member

Can't we move rowCount outside the try and then just use it for the metric?

Contributor Author

@HeartSaVioR commented Oct 24, 2019

It's used to determine whether one more flush is needed at the end of iterating. It could just be a boolean flag, but we'd need a dedicated variable to track this anyway.
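Roughly, the batching loop looks like this (a sketch; `setStatementParameters` is a hypothetical stand-in for the per-column setters), which shows why a leftover non-zero `rowCount` means one more flush is needed:

    var rowCount = 0
    while (iterator.hasNext) {
      val row = iterator.next()
      setStatementParameters(stmt, row)  // hypothetical helper, for brevity
      stmt.addBatch()
      rowCount += 1
      if (rowCount % batchSize == 0) {
        stmt.executeBatch()
        rowCount = 0
      }
    }
    if (rowCount > 0) {
      // Records added since the last executeBatch() still need to be flushed.
      stmt.executeBatch()
    }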

Member

Ur, I see.

Member

Can you leave some comments somewhere about the policy to collect metrics?

Contributor Author

Just added it. 6e908d1

Member

Thanks!

@maropu
Member

maropu commented Oct 25, 2019

cc: @HyukjinKwon

Member

@HyukjinKwon left a comment

Thanks for cc'ing me @maropu. Looks good to me too

@SparkQA

SparkQA commented Oct 25, 2019

Test build #112634 has finished for PR 26109 at commit 6e908d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu closed this in cfbdd9d on Oct 25, 2019
@maropu
Member

maropu commented Oct 25, 2019

Thanks, @HeartSaVioR and @HyukjinKwon ! Merged to master.

@HeartSaVioR
Contributor Author

Thanks all for reviewing and merging!

@HeartSaVioR deleted the SPARK-29461 branch on October 25, 2019 08:22
maropu pushed a commit that referenced this pull request Oct 31, 2019
### What changes were proposed in this pull request?

Fix JDBC metrics counter data type. Related pull request [26109](#26109).

### Why are the changes needed?

Avoid overflow.
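A minimal sketch of the kind of change involved (assumed shape only; see the actual diff in #26346):

    // Widen the running counter to Long so that accumulating many batch results
    // cannot overflow an Int.
    var totalUpdatedRows: Long = 0L
    totalUpdatedRows += stmt.executeBatch().sum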

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UTs.

Closes #26346 from ulysses-you/SPARK-29687.

Authored-by: ulysses <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>