[SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files #28363
Conversation
|
This PR is just a revival of #24128, as the problem definition and the solution still apply. |
|
Test build #121886 has finished for PR 28363 at commit
|
|
I've read the discussion on #24128 and I agree that TTL would be the way to go. I like, for instance, how Kafka handles the situation (even if retention generates some confusion on the Spark user side when retention deleted data that Spark wanted to process and couldn't find). I think first the metadata must be compacted (remove file entries whose TTL expired), but what I miss is deleting the files themselves. There are two types of files without this patch:
With this change this will be extended with a third one:
If we want to do full TTL then a separate GC would be good to delete files matching the 2nd and 3rd bullet points (of course, only after they are removed from the metadata). What I see as a potential problem is that the FS timestamp may differ from local time (I haven't yet checked how Hadoop handles time). |
Yeah, I didn't deal with this because there may be some reader queries which still read from an old version of the metadata, which may contain excluded files. (A batch query would read all available files, so there's still a chance of a race condition.)
While I'm not sure it's a real problem (as we rely on the last modified time while reading files), I eliminated the case by adding a "commit time" to each entry and applying retention based on commit time. So I guess the concern is no longer valid. |
That's a valid consideration. Cleaning junk files doesn't necessarily have to belong to this feature; it can be put behind another flag. I've been thinking about this for a long time (though the initial idea was to delete only the generated junk). Of course, this must be done in a separate thread because directory listing can be pathologically slow in some cases. This could reduce storage costs for users significantly, in an automatic way...
I've played with HDFS and read the docs of the other filesystems and haven't found any glitches. |
gaborgsomogyi
left a comment
First round
<code>path</code>: path to the output directory, must be specified.<br/>
<code>outputRetentionMs</code>: time to live (TTL) for output files. Output files whose batches were
committed longer ago than the TTL will eventually be excluded from the metadata log. This means reader
queries which read the sink's output directory may not process them.
I would mention the default, like other params.
<td>
<code>path</code>: path to the output directory, must be specified.<br/>
<code>outputRetentionMs</code>: time to live (TTL) for output files. Output files whose batches were
There are 2 time fields in SinkFileStatus, modificationTime and commitTime. Maybe it's worth mentioning the exact field used for the comparison, to make it 100% clear.
I guess we avoid exposing implementation details in the docs. E.g., if I'm not mistaken, there's no explanation of the format of the metadata, hence it would be confusing which field is being used, because end users don't even know what the fields are.
You're right, and we don't explain metadata details to users. What users would like to understand, though, is what the TTL reference is bound to. As half developer and half user, I was a bit confused about which field of SinkFileStatus we refer to. Since we've removed the (in my view) duplicate field, I'm fine here.
    blockSize: Long,
    action: String,
    commitTime: Long) {
Since we're supporting this feature in append mode only, isn't it possible to use modificationTime?
So the introduction of "commit time" came from the concern about the uncertainty of HDFS file timestamps in the previous PR. If we are sure about the modification time, there's no need to use "commit time".
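For context, here is roughly what the sink log entry looks like with the proposed field. This is a sketch: the first seven fields mirror Spark's actual SinkFileStatus, while commitTime is the addition being debated (and, per this thread, it was ultimately dropped in favor of modificationTime).

```scala
// Sketch of the sink metadata entry under discussion (not the merged code).
case class SinkFileStatus(
    path: String,            // output file path
    size: Long,              // file size in bytes
    isDir: Boolean,
    modificationTime: Long,  // FS last-modified time; retention ends up comparing against this
    blockReplication: Int,
    blockSize: Long,
    action: String,          // "add" or "delete"
    commitTime: Long)        // proposed: wall-clock time at which the batch committed
```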
|
Test build #122955 has finished for PR 28363 at commit
|
|
Test build #122961 has finished for PR 28363 at commit
|
|
retest this, please |
|
Test build #122975 has finished for PR 28363 at commit
|
val curTime = System.currentTimeMillis()
val deletedFiles = logs.filter { log =>
  log.action == FileStreamSinkLog.DELETE_ACTION || (curTime - log.modificationTime) > ttlMs
}.map(_.path).toSet
Can we add some debug information about why a certain entry was deleted from the log? I was just thinking that a user comes and reports that certain files have been deleted and shouldn't have been, or the opposite. Without debug information it's hard to say anything about such an issue. If you have an idea w/o debug, that's also fine with me...
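A minimal sketch of what such a debug line could look like, layered on the filter from the diff above; the message and placement are illustrative, not the merged code.

```scala
val curTime = System.currentTimeMillis()
val deletedFiles = logs.filter { log =>
  val expired = (curTime - log.modificationTime) > ttlMs
  if (expired) {
    // Record why the entry is being dropped, so operators can reason about
    // "file missing" or "file unexpectedly retained" reports.
    logDebug(s"Excluding ${log.path} from the compact batch: " +
      s"modificationTime=${log.modificationTime}, ttlMs=$ttlMs")
  }
  log.action == FileStreamSinkLog.DELETE_ACTION || expired
}.map(_.path).toSet
```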
private def withFileStreamSinkLog(f: FileStreamSinkLog => Unit): Unit =
  withFileStreamSinkLog(None, f)

private def withFileStreamSinkLog(ttl: Option[Long], f: FileStreamSinkLog => Unit): Unit = {
If the order of the two params were flipped, with a default value, then we wouldn't need a method overload.
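In other words, something like the following sketch (the helper body is elided in the quoted diff, so it is only hinted at here):

```scala
// With the TTL parameter moved last and given a default value, the
// single-argument overload above becomes unnecessary.
private def withFileStreamSinkLog(
    f: FileStreamSinkLog => Unit,
    ttl: Option[Long] = None): Unit = {
  // ... construct the FileStreamSinkLog with `ttl` and pass it to `f` ...
}
```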
private def newFakeSinkFileStatus(
    path: String,
    action: String,
    modificationTime: Long): SinkFileStatus = {
Maybe a default here?
|
Test build #123108 has finished for PR 28363 at commit
|
|
Looks good apart from the style issue. |
|
Given it didn't fail on my local machine, I'll try to rebase to see how the code has been affected by the automatic merge. |
Force-pushed from 06be0e4 to 9383fcb
|
OK just fixed it. Let's see the build result. |
|
Test build #123116 has finished for PR 28363 at commit
|
gaborgsomogyi
left a comment
LGTM.
|
retest this, please |
|
Test build #123986 has finished for PR 28363 at commit
|
|
retest this, please |
|
Test build #123999 has finished for PR 28363 at commit
|
Force-pushed from 9383fcb to b648156
|
Test build #124953 has finished for PR 28363 at commit
|
|
retest this, please |
|
retest this, please |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #130287 has finished for PR 28363 at commit
|
|
retest this, please |
|
cc. @tdas @zsxwing @gaborgsomogyi @viirya @xuanyuanking Just a final reminder. I'll merge this early next week if there are no further comments, per the feedback from the dev@ mailing list. |
|
Test build #131890 has finished for PR 28363 at commit
|
|
retest this, please |
|
Test build #131898 has finished for PR 28363 at commit
|
|
retest this, please |
|
Test build #131903 has finished for PR 28363 at commit
|
zsxwing
left a comment
Left some minor comments. An ideal solution would be building file compaction for the file sink, but that would be a large effort. This option at least provides a workaround for people hitting the large-metadata issue, so I'm +1 for adding this.
    sparkSession.sessionState.conf)
private val fileLog =
  new FileStreamSinkLog(FileStreamSinkLog.VERSION, sparkSession, logPath.toString)
private val outputTimeToLive = options.get("outputRetentionMs").map(_.toLong)
Can we use Utils.timeStringAsMs to parse this? Users will likely set this to multiple days, and asking them to calculate milliseconds is not user friendly.
Nit: regarding the option name, can we call it retention? It's obvious that the query is outputting files, so output sounds redundant to me.
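For reference, Utils.timeStringAsMs accepts human-friendly duration strings; a quick illustration (expected values shown as comments):

```scala
import org.apache.spark.util.Utils

Utils.timeStringAsMs("7d")   // 604800000: seven days in milliseconds
Utils.timeStringAsMs("12h")  // 43200000
Utils.timeStringAsMs("3600") // 3600: a bare number is interpreted as milliseconds
```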
Nit: any reason to use a different name, outputTimeToLive? Using the same name as the option would help other people read the code.
Nit: it would be great to output an info log for this value if it's set. It might be useful when debugging data issues caused by the retention.
Good suggestions! I'll apply all of the inputs.
private val ttlMs = outputTimeToLiveMs.getOrElse(Long.MaxValue)

override def shouldRetain(log: SinkFileStatus): Boolean = {
  val curTime = System.currentTimeMillis()
It would be great to avoid calling System.currentTimeMillis() if the option is not set, considering we'd otherwise make this call (a JNI call) once for each log entry.
Probably we could change the method signature a bit to provide a "context" which would be the same for the same compact batch. We changed the shouldRetain method in SPARK-30462, which is not yet released (3.1.0), hence changing it again doesn't break backward compatibility. (We decided to break the interface, but we only break it once across these changes.)
Once we call System.currentTimeMillis() only once per compact batch, the overhead should be negligible.
Ah also CompactibleFileStreamLog is not a public API (its package is org.apache.spark.sql.execution.streaming), so it shouldn't matter much.
|
Btw, I'm also concerned (probably more so) about the metadata log growing in FileStreamSource (#28422). The format of each entry in FileStreamSource is much smaller than FileStreamSink's, so it's more resilient to the memory issue, but while there are 3rd-party alternatives for FileStreamSink (as we all know), there's no alternative to FileStreamSource for reading from files. That said, users are forced to introduce an external process to produce fewer files in order to put less pressure on the metadata log in FileStreamSource, or to use other data sources as the input of SS. Unlike FileStreamSink, it's not that simple to remove a log entry, just because of what we support. I've also raised a discussion thread but didn't get any committers' voice, though I see some voices wanting FileStreamSource to work just like the Kafka stream source. WDYT? |
 * to change the behavior.
 */
def shouldRetain(log: T, context: Map[String, Any]): Boolean = true
I feel adding context is overkill. How about passing the now timestamp into this instead? We can add the retain context like this in future if that's necessary:
def shouldRetain(log: T, currentTime: Long): Boolean = true
def shouldRetain(log: T, context: Map[String, Any]): Boolean = {
  shouldRetain(log, context("currentTime").asInstanceOf[Long])
}
Actually, the change was made to cover a possible general case, because when changing shouldRetain I got a review comment saying "please avoid breaking backward compatibility." I disagreed, as it's a private API based on the package, but it took considerable time and effort to persuade.
If we agree that this is not a public API and we shouldn't bother with backward compatibility here, I definitely agree this is pretty much overkill as of now. WDYT?
Yep, this is not a public API. It's okay to change it.
Thanks! Will make a quick fix.
Fixed. Please take a look again. Thanks!
|
Test build #131937 has finished for PR 28363 at commit
|
|
Test build #131938 has finished for PR 28363 at commit
|
|
Test build #131940 has finished for PR 28363 at commit
|
zsxwing
left a comment
LGTM except one nit.
val logs =
  getAllValidBatches(latestId, compactInterval).flatMap { id =>
    filterInBatch(id)(shouldRetain).getOrElse {
      val curTime = System.currentTimeMillis()
nit: we can move this out of the flatMap function.
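A sketch of the nit, reusing the names from the diff above: hoisting the call gives one wall-clock read per compaction instead of one per batch.

```scala
// Evaluate the wall clock once, outside the per-batch closure.
val curTime = System.currentTimeMillis()
val logs =
  getAllValidBatches(latestId, compactInterval).flatMap { id =>
    filterInBatch(id)(shouldRetain(_, curTime)).getOrElse {
      // ... fall back to reading and filtering the batch directly (elided
      // in the quoted diff) ...
    }
  }
```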
Could you update this in the PR description? We are using the file modification time now. |
|
Thanks for the detailed review. Just applied both. |
|
LGTM. Thanks for your patience. |
|
Thanks all for the thoughtful reviews! Merging to master. |
What changes were proposed in this pull request?
This patch proposes to provide a new option to specify a time-to-live (TTL) for output file entries in FileStreamSink. The TTL is defined as the current timestamp minus the file's last modified time.
This patch filters outdated output files out of the metadata while compacting batches (other batches don't have the functionality to clean entries), which keeps the metadata from growing linearly; the filtered-out files will "eventually" no longer be seen by reader queries which leverage File(Stream)Source.
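Concretely, the retention check described above boils down to a comparison like this (a simplified, self-contained illustration, not the exact merged code):

```scala
// An entry is dropped at compaction time once its file's last-modified
// time is older than the configured retention.
def isExpired(modificationTime: Long, retentionMs: Long, now: Long): Boolean =
  (now - modificationTime) > retentionMs

val sevenDaysMs = 7L * 24 * 60 * 60 * 1000  // 604800000
isExpired(modificationTime = 0L, retentionMs = sevenDaysMs,
  now = System.currentTimeMillis())         // true: far older than 7 days
```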
Why are the changes needed?
The metadata log greatly helps to achieve exactly-once semantics easily, but given that the output path is open to arbitrary readers, there's no way to compact the metadata log, which ends up growing the metadata files as the query runs for a long time, especially the compacted batches.
Lots of end users have been reporting the issue: see the comments in SPARK-24295, SPARK-29995, and SPARK-30462.
(There are some reports from end users which include their workarounds: SPARK-24295.)
Does this PR introduce any user-facing change?
No, as the configuration is new and is not applied by default.
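For completeness, a hedged usage sketch reflecting the final shape after review (the option ended up named retention and is parsed with Utils.timeStringAsMs; the paths and the DataFrame df are placeholders):

```scala
// Opt in to retention on the file sink: entries whose files were last
// modified more than 7 days ago are dropped when the metadata log compacts.
val query = df.writeStream
  .format("parquet")
  .option("path", "/tmp/ss-output")
  .option("checkpointLocation", "/tmp/ss-checkpoint")
  .option("retention", "7d")
  .start()
```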
How was this patch tested?
New UT.