[SPARK-17604][SS] FileStreamSource: provide a new option to have retention on input files #28422
Conversation
Test build #122127 has finished for PR 28422 at commit
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
Test build #122203 has finished for PR 28422 at commit
retest this, please
Test build #123141 has finished for PR 28422 at commit
retest this, please
Test build #123161 has finished for PR 28422 at commit
retest this, please
Test build #123172 has finished for PR 28422 at commit
Just picked this up. Maybe the user group reference can be added as a comment and not as commit msg.
How do you mean that exactly?
NOTE 1: Please be careful to set the value if the query replays from the old input files.
NOTE 2: Please make sure the timestamp is in sync between nodes which run the query.

"file:///dataset.txt"
While I was reviewing this, I pinpointed a parameter description split and opened #28739.
I'll move it to the JIRA description - personally I'd be OK with leaving the rationalization in the commit message, but I agree it's redundant and makes the commit message verbose. (EDIT: I already did that, skipping.)
I meant that both behaviors were manually tested.
Test build #123621 has finished for PR 28422 at commit
retest this please
I see that. First, I've taken a look at this purely from the user's perspective: I'm a user who has realized that the compact file is growing infinitely, and I've just taken a look at this solution. At first glance:

From the engineering side, I think having two params meant to configure almost the same thing is something we should take a closer look at: I have the feeling something is not fully consistent with the general approach.
Test build #123632 has finished for PR 28422 at commit
I agree that adding a similar option feels tricky. Maybe you've already indicated there are some cases. Also, I felt it's less intuitive to reason about the way the max age is specified: it is relative to the timestamp of the latest file as figured out by Spark, not the timestamp of the current system. (But well... that might be only me.) The new option ensures that the behavior is consistent regardless of these options. It simply plays as a "hard" limit: in any case, Spark won't handle files which are older than the threshold. (Suppose these files are simply deleted due to the retention policy, though not physically.) It applies to both forward and backward reads, no matter how many files Spark will read in a batch.
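A minimal sketch of the two threshold semantics being contrasted here; names follow the options under discussion, and the logic is illustrative rather than Spark's actual code:

```scala
// Sketch, not Spark's implementation: contrasting the two limits discussed above.
object RetentionSemantics {
  // "soft" limit (maxFileAge): measured against the newest file timestamp Spark has tracked
  def softThreshold(latestFileTsMs: Long, maxFileAgeMs: Long): Long =
    latestFileTsMs - maxFileAgeMs

  // proposed "hard" limit (inputRetention): measured against the current system clock
  def hardThreshold(nowMs: Long, inputRetentionMs: Long): Long =
    nowMs - inputRetentionMs

  // a file stays eligible only if it survives both limits
  def isEligible(fileTsMs: Long, latestFileTsMs: Long, nowMs: Long,
                 maxFileAgeMs: Long, inputRetentionMs: Long): Boolean =
    fileTsMs >= softThreshold(latestFileTsMs, maxFileAgeMs) &&
      fileTsMs >= hardThreshold(nowMs, inputRetentionMs)
}
```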
I've analyzed this further. I have the same opinion about

If the user wants to operate a query with

The last point can be problematic and can end up in data loss, but this is exactly the same as when processing data from Kafka: if retention fires, the data just disappears without any notification. This situation is better, though, because if the query is not able to catch up, it can be restarted with a bigger
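To make that mitigation concrete, a hedged sketch of restarting the same query from its checkpoint with a larger retention window; the option name is taken from this PR's description, and `query`, `spark`, the paths, and the values are all illustrative assumptions:

```scala
// Illustrative recovery: widen the proposed hard limit and resume from the same checkpoint.
query.stop()                        // assumes `query` is the running StreamingQuery
val recovered = spark.readStream
  .format("json")
  .option("inputRetention", "14d")  // was e.g. "7d"; lagging files stay eligible longer
  .load("/data/in")
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "/chk/files-query")  // unchanged, so processing resumes
  .start("/data/out")
```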
I can even tolerate the fact that maxFileAge originates from the path's latest timestamp. If we don't trust the node's wall time (though I suspect other logic works well in such a case), then yes, it might be the source of truth across nodes. I feel all the confusion comes from the behavior of
retest this, please
Test build #123985 has finished for PR 28422 at commit
retest this, please
Test build #123998 has finished for PR 28422 at commit
Test build #125975 has finished for PR 28422 at commit
FYI, I've initiated a discussion on the dev@ mailing list about how to deal with the "latestFirst" option and metadata growth.
retest this, please
Test build #127531 has finished for PR 28422 at commit
retest this, please
Test build #127547 has finished for PR 28422 at commit
…ntion on input files: force-pushed from 06ee53d to b7d94f7
Just rebased. I haven't reached consensus in the dev mailing list discussion yet, though. I'll bump it again.
Test build #127882 has finished for PR 28422 at commit
retest this, please
Test build #128682 has finished for PR 28422 at commit
retest this, please
Test build #128702 has finished for PR 28422 at commit
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
Test build #133389 has finished for PR 28422 at commit
retest this, please
Test build #133435 has finished for PR 28422 at commit
Test build #135034 has finished for PR 28422 at commit
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
The PR for
What changes were proposed in this pull request?
This PR introduces a new option, `inputRetention`, to provide a way to specify retention on input files.

`maxAgeMs` plays as a "soft" limit: it doesn't apply under some conditions, like the first batch, and it's applied relative to the modified time of input files. Given it's not consistently applied across the matrix of configurations, Spark cannot purge the entries based on the configuration. (A streaming query can change the configurations and be relaunched.)

`inputRetention` plays as a "hard" limit: Spark will not include files older than the retention as input files, and it also tries to exclude file entries older than the retention (this actually happens on compaction, as that's the only phase where entries are removed).

Unlike `maxAgeMs`, `inputRetention` is relative to the system timestamp, which is easier for end users to reason about. This requires end users to keep the nodes' timestamps correct, but in most cases they would do so for other reasons as well. Note that this would also filter out old files when the query intends to replay from input files, so this should be considered too.
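A minimal sketch of the "exclude old entries on compaction" behavior described above, under the assumption that each metadata entry carries the file's modification timestamp; this is illustrative, not the PR's actual code:

```scala
// Sketch only: purge entries older than the hard retention threshold while compacting.
case class FileEntry(path: String, timestampMs: Long)

def compactWithRetention(entries: Seq[FileEntry],
                         inputRetentionMs: Long,
                         nowMs: Long = System.currentTimeMillis()): Seq[FileEntry] =
  entries.filter(_.timestampMs >= nowMs - inputRetentionMs)
```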
Why are the changes needed?

Dealing with metadata growth has been a pain in both the file stream source and the file stream sink. For the file stream source, all processed input files are tracked, so the metadata size grows continuously, and there's no way to reduce the size or number of entries. In a compact batch, Spark reads all previous entries to write the new compact file, which brings major latency.
Does this PR introduce any user-facing change?
This doesn't bring any change "by default", as the new configuration is optional. (The default value is set to an unrealistically large one, making it effectively disabled.)

This adds a new configuration - the previous sections describe the behavior.
How was this patch tested?
New UTs verifying the two behaviors per test.

I've manually tested the above two behaviors as well.