Conversation

@HeartSaVioR
Contributor

@HeartSaVioR HeartSaVioR commented Apr 30, 2020

What changes were proposed in this pull request?

This PR introduces a new option, inputRetention, to provide a way to specify a retention period on input files.

maxAgeMs acts as a soft limit (it doesn't apply under some conditions, such as the first batch, and it is applied relative to the modification time of input files). Since it is not applied consistently across the matrix of configurations, Spark cannot purge entries based on it. (A streaming query can be relaunched with changed configurations.)

inputRetention acts as a hard limit: Spark will not include files older than the retention as input files, and it also tries to exclude file entries older than the retention (this actually happens during compaction, as that is the only phase where entries are removed).

Unlike maxAgeMs, inputRetention is relative to the system timestamp, which is easier for end users to reason about. This requires end users to keep the nodes' clocks correctly set, but in most cases they would do so for other reasons anyway. Note that this option also filters out old files when the query intends to replay from input files, so that case should be considered as well.
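As a rough sketch of the intended semantics (illustrative only; the function and parameter names below are hypothetical, not the actual Spark implementation), a file is excluded once its modification time falls behind the current system time minus the retention:

```python
import time

def apply_input_retention(entries, retention_ms, now_ms=None):
    """Keep only entries newer than the retention threshold.

    entries: list of (path, modification_time_ms) tuples.
    The threshold is relative to the current system time, unlike
    maxAgeMs, which is relative to the latest file's timestamp.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    threshold = now_ms - retention_ms
    return [(p, m) for p, m in entries if m >= threshold]

# With a 1-hour retention evaluated at t=4_000_000ms, the threshold is
# 400_000ms, so the file modified at t=1_000ms is treated as nonexistent.
entries = [("a.txt", 1_000), ("b.txt", 4_000_000)]
assert apply_input_retention(entries, 3_600_000, now_ms=4_000_000) == [("b.txt", 4_000_000)]
```

Under this sketch, the same filter would apply both when listing new input files and when rewriting entries during compaction.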

Why are the changes needed?

This has been a pain point: metadata keeps growing in both the file stream source and the file stream sink. For the file stream source, all processed input files are tracked, so the metadata grows continuously, and there is no way to reduce its size or number of entries. In a compact batch, Spark reads all previous input file entries to write a new compact file, which introduces major latency.

Does this PR introduce any user-facing change?

This doesn't change any behavior by default, as the new configuration is optional. (The default value is set to an unrealistically large one, making it effectively disabled.)

This adds a new configuration; the previous sections describe the behavior.

How was this patch tested?

New unit tests, each verifying the following two behaviors:

  1. old files should not be included as input files if input retention is specified
  2. when compacting, outdated entries should be filtered out

I've manually tested with above two behaviors as well.

@SparkQA

SparkQA commented Apr 30, 2020

Test build #122127 has finished for PR 28422 at commit 738caa1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 2, 2020

Test build #122203 has finished for PR 28422 at commit 2af1df1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented May 27, 2020

Test build #123141 has finished for PR 28422 at commit 2af1df1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented May 27, 2020

Test build #123161 has finished for PR 28422 at commit 2af1df1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented May 27, 2020

Test build #123172 has finished for PR 28422 at commit 2af1df1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor

Just picked this up. Maybe the user group reference can be added as a comment and not as commit msg.

I've manually tested with above two behaviors as well.

How do you mean that exactly?

NOTE 1: Please be careful to set the value if the query replays from the old input files.
NOTE 2: Please make sure the timestamp is in sync between nodes which run the query.

"file:///dataset.txt"

While reviewing this, I pinpointed a parameter description split and opened #28739.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Jun 8, 2020

Maybe the user group reference can be added as a comment and not as commit msg.

I'll move it to the JIRA description. Personally I'd be OK with leaving the rationale in the commit message, but I agree it's redundant and makes the commit message verbose.

(EDIT: I already did that, skipping)

I've manually tested with above two behaviors as well.

How do you mean that exactly?

I meant that both behaviors were manually tested:

  1. old files should not be included as input files if input retention is specified
  2. when compacting, outdated entries should be filtered out

@SparkQA

SparkQA commented Jun 8, 2020

Test build #123621 has finished for PR 28422 at commit 06ee53d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor

retest this please

@gaborgsomogyi
Contributor

I see that maxFileAge doesn't apply in all cases, which means it can't be used as an exclusion criterion (at least I presume that's what you've considered). Please correct me if I'm wrong.

First, I've looked at this purely from a user perspective. I'm a user and I've realized that the compact file is growing infinitely. Taking a first glance at this solution:

  • I see that all of a sudden there are 2 parameters to configure the maximum age of a file. This raised a lot of questions in my head, and most probably I'd fall into several edge cases which may end up in data loss.
  • I've started to wonder why I need to configure an additional parameter to influence something I don't want to care about. My expectation as a user is that Spark saves a couple of things to recover from, but I don't want to know what they are; it should just work by default.

From the engineering side, I think that having 2 params meant to configure almost the same thing is something we should take a closer look at: "Maximum age of a file that can be found in this directory, before it is ignored."

I have the feeling something is not fully consistent with the general approach.

@SparkQA

SparkQA commented Jun 8, 2020

Test build #123632 has finished for PR 28422 at commit 06ee53d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

I agree that adding another, similar option feels tricky.

As you've already indicated, there are some cases where maxFileAge has to be ignored, which means Spark is never able to drop entries from metadata (e.g. when latestFirst is true and maxFilesPerTrigger is set). Given that all of these options can be changed for further runs, I wasn't sure whether it would be safe to drop entries based on the current set of options and the status of entries. There appeared to be an edge case where input files could be processed more than once.

I also felt it's less intuitive to reason about the way the max age is specified: it is relative to the timestamp of the latest file as determined by Spark, not the current system time. (But, well, that might be just me.)

The new option ensures that the behavior is consistent regardless of these options. It simply acts as a hard limit: in no case will Spark handle files older than the threshold. (Think of these files as having been deleted by a retention policy, though not physically.) It applies to both forward and backward reads, no matter how many files Spark reads in a batch.
(Personally, I think maxFileAge itself should work this way, and then we wouldn't have such confusion.)
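A minimal sketch of the "hard limit regardless of read direction" idea (hypothetical names, not the actual FileStreamSource code): the retention filter runs before the latestFirst ordering and the per-batch cap, so neither option can bring an expired file back into scope.

```python
def select_batch(files, retention_ms, now_ms, latest_first, max_files):
    """files: list of (path, mtime_ms). Returns the files for one batch."""
    threshold = now_ms - retention_ms
    # Hard limit first: expired files are invisible to every later step.
    alive = [(p, m) for p, m in files if m >= threshold]
    # Ordering and the per-batch cap apply only to surviving files.
    alive.sort(key=lambda e: e[1], reverse=latest_first)
    return alive[:max_files]

files = [("old", 100), ("mid", 6_000), ("new", 9_000)]
# With a 5s retention at t=10_000ms, "old" can never be selected,
# whether we read latest-first or oldest-first.
assert select_batch(files, 5_000, 10_000, True, 2) == [("new", 9_000), ("mid", 6_000)]
assert select_batch(files, 5_000, 10_000, False, 2) == [("mid", 6_000), ("new", 9_000)]
```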

@gaborgsomogyi
Contributor

I've analyzed this further. I have the same opinion about maxFileAge, namely that the way it's programmed is unintuitive. I think it should work like this:

  • maxFileAge should behave like inputRetention. Retention is normally based on the current timestamp; we needn't look far, Kafka and similar components do exactly that.
  • The current feature should depend on maxFileAge.

If the user wants to operate a query with latestFirst in the long term, then I see these options:

  • The user sets maxFileAge properly => no file loss, just some fluctuation in the number of unprocessed files.
  • The user doesn't set maxFileAge properly, but the cluster is sized properly => a configuration issue, because with a proper value all files must be processed within maxFileAge.
  • The user doesn't set maxFileAge properly, and the cluster is sized badly => a sizing and configuration issue. Cluster computation power must be increased to leave room for the old, not-yet-processed files. As in the previous case, choosing an appropriate maxFileAge is important.

The last point can be problematic and can end up in data loss, but this is exactly what happens when processing data from Kafka: if retention fires, the data just disappears without any notification. This situation is better, though, because if the query is not able to catch up, it can be restarted with a bigger maxFileAge and cluster, allowing it to catch up properly.

@HeartSaVioR
Contributor Author

I can even tolerate the fact that maxFileAge is derived from the paths' latest timestamp. If we don't trust the nodes' wall clocks (though I suspect other logic would also break in that case), then yes, the file timestamps might be the source of truth across nodes.

I feel all the confusion comes from the behavior of latestFirst. Yes, in some cases we'd like to read from the latest files because those are the only ones we're interested in. But should we really keep open the possibility of tracing back to older files? Could we simply do what we do with Kafka's "latest" option, which only affects the first batch and is a no-op in further batches?
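The Kafka-"latest" analogy above could be sketched like this (purely illustrative, hypothetical names): the option would only pick the starting position for the first batch, and every later batch moves strictly forward, so older files never come back into scope.

```python
def plan_batches(listings, latest_only):
    """listings: one directory listing per batch, each a list of (path, mtime_ms).
    With Kafka-'latest'-style semantics, latest_only skips everything
    already present at query start; afterwards the watermark only advances.
    """
    watermark = max((m for _, m in listings[0]), default=0) if latest_only else 0
    batches = []
    for listing in listings:
        new_files = [(p, m) for p, m in listing if m > watermark]
        batches.append(new_files)
        if new_files:
            watermark = max(m for _, m in new_files)
    return batches

# Files a and b exist at query start; c arrives before the second batch.
listings = [[("a", 1), ("b", 2)], [("a", 1), ("b", 2), ("c", 3)]]
assert plan_batches(listings, latest_only=True) == [[], [("c", 3)]]
assert plan_batches(listings, latest_only=False) == [[("a", 1), ("b", 2)], [("c", 3)]]
```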

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Jun 14, 2020

Test build #123985 has finished for PR 28422 at commit 06ee53d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Jun 14, 2020

Test build #123998 has finished for PR 28422 at commit 06ee53d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 16, 2020

Test build #125975 has finished for PR 28422 at commit 06ee53d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

FYI, I've started a discussion on the dev@ mailing list about how to deal with the "latestFirst" option and metadata growth.
https://lists.apache.org/thread.html/r08e3a8d7df74354b38d19ffdebe1afe7fa73c2f611f0a812a867dffb%40%3Cdev.spark.apache.org%3E

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Aug 18, 2020

Test build #127531 has finished for PR 28422 at commit 06ee53d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Aug 18, 2020

Test build #127547 has finished for PR 28422 at commit 06ee53d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

Just rebased. I haven't reached consensus in the dev mailing list discussion yet, though. I'll bump it again.

@SparkQA

SparkQA commented Aug 25, 2020

Test build #127882 has finished for PR 28422 at commit b7d94f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Sep 15, 2020

Test build #128682 has finished for PR 28422 at commit b7d94f7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Sep 15, 2020

Test build #128702 has finished for PR 28422 at commit b7d94f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 25, 2020
@github-actions github-actions bot closed this Dec 26, 2020
@HeartSaVioR HeartSaVioR removed the Stale label Dec 26, 2020
@HeartSaVioR HeartSaVioR reopened this Dec 26, 2020
@SparkQA

SparkQA commented Dec 26, 2020

Test build #133389 has finished for PR 28422 at commit b7d94f7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Dec 28, 2020

Test build #133435 has finished for PR 28422 at commit b7d94f7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2021

Test build #135034 has finished for PR 28422 at commit b7d94f7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@darxriggs

The PR for FileStreamSink was already merged quite some time ago.
So I'm wondering whether the goal for FileStreamSource here is to also reach consensus and integrate it, or whether it has been abandoned.
