
Conversation

@HeartSaVioR (Contributor) commented Apr 27, 2020

What changes were proposed in this pull request?

This patch proposes to provide a new option to specify the time-to-live (TTL) for output file entries in FileStreamSink. TTL is defined as the current timestamp minus the last modified time of the file.

This patch filters outdated output file entries out of the metadata while compacting batches (non-compact batches have no functionality to clean entries), which keeps the metadata from growing linearly; filtered-out files will also "eventually" no longer be seen by reader queries which leverage File(Stream)Source.
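
A minimal, self-contained sketch of the idea (illustrative names only, not the actual SinkFileStatus / FileStreamSinkLog code): while writing a compact batch, entries whose files are older than the retention are simply not carried over.

case class FileEntry(path: String, modificationTime: Long, action: String)

def entriesToKeep(entries: Seq[FileEntry], retentionMs: Long): Seq[FileEntry] = {
  val curTime = System.currentTimeMillis()
  entries.filter { e =>
    // drop delete-action entries and entries whose files exceeded the retention
    e.action != "delete" && (curTime - e.modificationTime) <= retentionMs
  }
}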

Why are the changes needed?

The metadata log greatly helps to achieve exactly-once semantics easily, but given that the output path is open to arbitrary readers, there's no way to compact the metadata log, which ends up growing the metadata files as the query runs for a long time, especially the compact batch files.

Lots of end users have been reporting the issue: see the comments in SPARK-24295, SPARK-29995, and SPARK-30462.
(There are some reports from end users which include their workarounds: SPARK-24295.)

Does this PR introduce any user-facing change?

No, as the configuration is new and by default it is not applied.

How was this patch tested?

New UT.

@HeartSaVioR (Contributor Author)

This PR is just a revival of #24128, as the problem definition and the solution still apply.

@SparkQA commented Apr 27, 2020

Test build #121886 has finished for PR 28363 at commit 31603b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi (Contributor)

I've read the discussion on #24128 and I agree that TTL would be the way to go. I like the idea; for instance, that's how Kafka handles the situation (even if retention generates some confusion on the Spark user side when retention has deleted data that Spark wanted to process but couldn't find).

I think first the metadata must be compacted (removing file entries whose TTL has expired), but what I miss is deleting the files themselves. There are 2 types of files without this patch:

  • Name exists in the metadata file
  • Name doesn't exist in the metadata file (it's junk)

With this change this will be extended with a third one:

  • Name doesn't exist in the metadata file (TTL expired)

If we want to do full TTL then a separate GC would be good to delete files matching the 2nd and 3rd bullet points (of course only after they are removed from the metadata).

What I see as a potential problem is that the FS timestamp may differ from local time (I haven't yet checked how Hadoop handles time).

@HeartSaVioR (Contributor Author) commented May 8, 2020

If we want to do full TTL then a separate GC would be good to delete files matching the 2nd and 3rd bullet points (of course only after they are removed from the metadata).

Yeah, I didn't deal with this because there may be some reader queries which still read from an old version of the metadata which may contain the excluded files. (A batch query would read all available files, so there's still a chance of a race condition.)

What I see as a potential problem is that the FS timestamp may differ from local time (I haven't yet checked how Hadoop handles time).

While I'm not sure it's a real problem (as we rely on the last modified time while reading files), I eliminated the case by adding a "commit time" to the entry and applying retention based on commit time. So I guess the concern is no longer valid.

@gaborgsomogyi (Contributor)

Yeah, I didn't deal with this because there may be some reader queries which still read from an old version of the metadata which may contain the excluded files. (A batch query would read all available files, so there's still a chance of a race condition.)

That's a valid consideration. Cleaning junk files doesn't necessarily have to belong to this feature; it can be put behind another flag. I've been thinking about this for a long time (though the initial idea was to delete only the generated junk). Of course this must be done in a separate thread because directory listing can be pathologically slow in some cases. This could significantly reduce the storage cost to users in an automatic way...

While I'm not sure it's a real problem (as we rely on the last modified time while reading files), I eliminated the case by adding a "commit time" to the entry and applying retention based on commit time. So I guess the concern is no longer valid.

I've played with HDFS and read the docs of the other filesystems and haven't found any glitches.

@gaborgsomogyi (Contributor) left a comment

First round

<code>path</code>: path to the output directory, must be specified.<br/>
<code>outputRetentionMs</code>: time to live (TTL) for output files. Output files which batches were
committed older than TTL will be eventually excluded in metadata log. This means reader queries which read
the sink's output directory may not process them.
Contributor

I would mention the default, like other params.

<td>
<code>path</code>: path to the output directory, must be specified.
<code>path</code>: path to the output directory, must be specified.<br/>
<code>outputRetentionMs</code>: time to live (TTL) for output files. Output files which batches were
Contributor

There are 2 time fields in SinkFileStatus, modificationTime and commitTime. Maybe it's worth mentioning the exact field that is used for the comparison, to make it 100% clear.

Contributor Author

I guess we avoid exposing implementation details in the docs. e.g. if I'm not mistaken, there's no explanation of the format of the metadata, hence it would be confusing which field is being used, because end users don't even know what these fields are.

Contributor

You're right, and we're not explaining metadata details to users. What users would like to understand, though, is what the reference TTL is bound to. As half developer and half user, I was a bit confused about which field of SinkFileStatus we refer to. Since we've removed the (in my view) duplicate field, I'm fine here.

blockSize: Long,
action: String) {
action: String,
commitTime: Long) {
Contributor

Since we're supporting this feature in append mode only, isn't it possible to use modificationTime?

Contributor Author

So the introduction of "commit time" came from the concern about the uncertainty of the HDFS file timestamp in the previous PR. If we are sure about the modification time, there's no need to use "commit time".

@SparkQA commented May 22, 2020

Test build #122955 has finished for PR 28363 at commit d2e7ab3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 22, 2020

Test build #122961 has finished for PR 28363 at commit fb4ce2c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor Author)

retest this, please

@SparkQA commented May 22, 2020

Test build #122975 has finished for PR 28363 at commit fb4ce2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val deletedFiles = logs.filter(_.action == FileStreamSinkLog.DELETE_ACTION).map(_.path).toSet
val curTime = System.currentTimeMillis()
val deletedFiles = logs.filter { log =>
log.action == FileStreamSinkLog.DELETE_ACTION || (curTime - log.modificationTime) > ttlMs
Contributor

Can we add some debug information about why a certain entry was deleted from the log? I was just thinking that a user comes and tells us that certain files have been deleted when they mustn't have been, or the opposite. Without debug information it's hard to say anything about such an issue. If you have an idea without debug logging, that's also fine with me...

private def withFileStreamSinkLog(f: FileStreamSinkLog => Unit): Unit =
withFileStreamSinkLog(None, f)

private def withFileStreamSinkLog(ttl: Option[Long], f: FileStreamSinkLog => Unit): Unit = {
Contributor

If the order of the 2 params is flipped and a default is provided, then we don't need the extra method overload.
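
A signature-only sketch of that suggestion (the body is elided with ???; FileStreamSinkLog is the class from the quoted test code), giving the TTL a default so the no-TTL overload above becomes unnecessary:

// sketch only: same helper, TTL moved last with a default value
private def withFileStreamSinkLog(
    f: FileStreamSinkLog => Unit,
    ttl: Option[Long] = None): Unit = ???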

private def newFakeSinkFileStatus(
path: String,
action: String,
modificationTime: Long): SinkFileStatus = {
Contributor

Maybe a default here?

@SparkQA commented May 26, 2020

Test build #123108 has finished for PR 28363 at commit 06be0e4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi (Contributor)

Looks good apart from the style issue.

@HeartSaVioR (Contributor Author)

Given it didn't fail on my local machine, I'll try to rebase to see how the code has been affected by the automatic merge.

@HeartSaVioR (Contributor Author)

OK just fixed it. Let's see the build result.

@SparkQA commented May 26, 2020

Test build #123116 has finished for PR 28363 at commit 9383fcb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi (Contributor) left a comment

LGTM.

@HeartSaVioR (Contributor Author)

retest this, please

@SparkQA commented Jun 14, 2020

Test build #123986 has finished for PR 28363 at commit 9383fcb.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor Author)

retest this, please

@SparkQA commented Jun 14, 2020

Test build #123999 has finished for PR 28363 at commit 9383fcb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 5, 2020

Test build #124953 has finished for PR 28363 at commit b648156.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor Author)

retest this, please

@HeartSaVioR (Contributor Author)

retest this, please

@SparkQA commented Oct 26, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34888/

@SparkQA commented Oct 26, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34888/

@SparkQA commented Oct 26, 2020

Test build #130287 has finished for PR 28363 at commit 686fc6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor Author)

retest this, please

@HeartSaVioR (Contributor Author) commented Nov 28, 2020

cc. @tdas @zsxwing @gaborgsomogyi @viirya @xuanyuanking

Just a final reminder. I'll merge this early next week if there are no further comments, per the feedback from the dev@ mailing list.

@SparkQA commented Nov 28, 2020

Test build #131890 has finished for PR 28363 at commit 686fc6d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor Author)

retest this, please

@SparkQA commented Nov 28, 2020

Test build #131898 has finished for PR 28363 at commit 686fc6d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor Author)

retest this, please

@SparkQA commented Nov 28, 2020

Test build #131903 has finished for PR 28363 at commit 686fc6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) left a comment

Left some minor comments. An ideal solution would be building file compaction for the file sink, but that would be a large effort. This option at least provides a workaround for people hitting the large-metadata issue, so I'm +1 for adding this.

sparkSession.sessionState.conf)
private val fileLog =
new FileStreamSinkLog(FileStreamSinkLog.VERSION, sparkSession, logPath.toString)
private val outputTimeToLive = options.get("outputRetentionMs").map(_.toLong)
Member

Can we use Utils.timeStringAsMs to parse this? Users will likely set this to multiple days, and asking them to calculate milliseconds is not user friendly.

Nit: regarding the option name, can we call it retention? It's obvious that the query is outputting files, so output sounds redundant to me.
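
A minimal sketch of the two suggestions combined, assuming the option ends up being named retention as proposed; Utils.timeStringAsMs is Spark-internal (visible from the sink's own package) and accepts values such as "30d", "12h", or plain milliseconds:

import org.apache.spark.util.Utils

// illustrative helper, not the actual FileStreamSink code
def parseRetentionMs(options: Map[String, String]): Option[Long] =
  options.get("retention").map(Utils.timeStringAsMs)

An end user would then set something like .option("retention", "30d") on the writeStream query instead of hand-computing milliseconds.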

Member

Nit: any reason to use a different name, outputTimeToLive? Using the same name as the option would help other people read the code.

Member

Nit: it would be great to output an info log for this value if it's set. It might be useful when debugging data issues caused by the retention.

Contributor Author

Good suggestions! I'll apply all of the inputs.

private val ttlMs = outputTimeToLiveMs.getOrElse(Long.MaxValue)

override def shouldRetain(log: SinkFileStatus): Boolean = {
val curTime = System.currentTimeMillis()
Member

It would be great to avoid calling System.currentTimeMillis() if the option is not set, considering we need to call this method once (a JNI call) for each log entry.
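
A self-contained illustration of the point (hypothetical signature, not the actual method): with forall over the optional retention, the clock is never read when the option is unset.

// None => always retain, clock untouched; Some(ttl) => retain while within the TTL
def shouldRetainEntry(modificationTime: Long, retentionMs: Option[Long]): Boolean =
  retentionMs.forall(ttl => System.currentTimeMillis() - modificationTime <= ttl)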

@HeartSaVioR (Contributor Author), Nov 29, 2020

Probably we could change the method signature a bit to provide a "context" which would be the same for the same compact batch. We already changed the shouldRetain method in SPARK-30462, which is not yet released (3.1.0), hence changing it again shouldn't cause a backward-compatibility issue. (We decided to break the interface, but we break it only once across these changes.)

Once we call System.currentTimeMillis() only once per compact batch, the overhead should be negligible.

Contributor Author

Ah also CompactibleFileStreamLog is not a public API (its package is org.apache.spark.sql.execution.streaming), so it shouldn't matter much.

@HeartSaVioR (Contributor Author) commented Nov 29, 2020

Btw, I'm also concerned (probably more concerned) about the metadata log growing in FileStreamSource (#28422).

The format of each entry in FileStreamSource is much smaller than FileStreamSink's, so it's more resilient to the memory issue, but while there are 3rd-party alternatives to FileStreamSink (as we all know), there's no alternative to FileStreamSource for reading from files. That said, users are forced to introduce an external process to reduce the number of files, so as to put less pressure on the metadata log in FileStreamSource, or to use other data sources as the input of Structured Streaming.

Unlike FileStreamSink, it's not that simple to remove log entries, simply because we support latestFirst. We didn't need to consider such an option in SPARK-20568 (#22952), but we'll never be able to have a threshold for removing log entries if we keep supporting latestFirst, as the meaning of "latest" keeps changing and it reads backward without a lower bound (maxFileAge is ignored), hence "any" file could be read, even an ancient one, in a later batch.

I've also raised a discussion thread but didn't get any input from committers.

http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-quot-latestFirst-quot-option-and-metadata-growing-issue-in-File-stream-source-td29853.html

Though I see some voices that want FileStreamSource to work just like the Kafka stream source, which would mean replacing latestFirst with a start offset (the last modified time, for the file stream source). That would mean we only support forward scanning. I think this is the right way to go, unless anyone shows there are lots of users leveraging latestFirst whose use case is not covered by a start offset.

WDYT?

* to change the behavior.
*/
def shouldRetain(log: T): Boolean = true
def shouldRetain(log: T, context: Map[String, Any]): Boolean = true
Member

I feel adding context is overkill. How about passing the current timestamp into this instead? We can add a retain context like this in the future if that's necessary:

def shouldRetain(log: T, currentTime: Long): Boolean = true

def shouldRetain(log: T, context: Map[String, Any]): Boolean = {
  shouldRetain(log, context("currentTime").asInstanceOf[Long])
}

Contributor Author

Actually the change was made to cover the possible general case, because when changing shouldRetain I got a review comment saying "please avoid breaking backward compatibility." I disagreed, as it's a private API based on the package, but it took considerable time and effort to persuade.

If we agree that this is not a public API and we shouldn't bother with backward compatibility here, I definitely agree this is pretty much overkill as of now. WDYT?

Member

Yep, this is not a public API. It's okay to change it.

Contributor Author

Thanks! Will make a quick fix.

Contributor Author

Fixed. Please take a look again. Thanks!

@SparkQA commented Nov 30, 2020

Test build #131937 has finished for PR 28363 at commit 61f9089.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 30, 2020

Test build #131938 has finished for PR 28363 at commit 5739592.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 30, 2020

Test build #131940 has finished for PR 28363 at commit c8b6d24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) left a comment

LGTM except one nit.

val logs =
getAllValidBatches(latestId, compactInterval).flatMap { id =>
filterInBatch(id)(shouldRetain).getOrElse {
val curTime = System.currentTimeMillis()
Member

nit: we can move this out of the flatMap function.

@zsxwing (Member) commented Dec 1, 2020

TTL is defined via current timestamp - commit time (the time ManifestFileCommitProtocol.commitJob is called to write streaming file sink metadata log).

Could you update this in the PR description? We are using the file modification time now.

@HeartSaVioR (Contributor Author)

Thanks for the detailed review. Just applied both.

@zsxwing (Member) commented Dec 1, 2020

LGTM. Thanks for your patience.

@HeartSaVioR (Contributor Author)

Thanks all for the thoughtful reviews! Merging to master.

@HeartSaVioR deleted the SPARK-27188-v2 branch December 1, 2020 05:43