[WIP][DISCUSSION_NEEDED][SPARK-24295][SS] Add option to retain only last batch in file stream sink metadata #23840
Conversation
href="api/R/read.stream.html">R</a>).
E.g. for "parquet" format options see <code>DataStreamReader.parquet()</code>.
<br/><br/>
<code>ignoreFileStreamSinkMetadata</code>: whether to ignore metadata information left by the file stream sink, which leads to always using the in-memory file index. (default: false)
I couldn't find a place to document this for the (batch) file source. Please let me know if there is a proper place to put it. At least we document the need to enable this option when the "retainOnlyLastBatchInMetadata" option is enabled, so it might be safe for the batch case as well.
<code>retainOnlyLastBatchInMetadata</code>: whether to retain metadata information only for the last succeeded batch.
<br/><br/>
This option greatly reduces the overhead of compacting metadata files, which can be non-trivial when the query processes lots of files in each batch.<br/>
NOTE: As it only retains the last batch in metadata, the metadata is not readable from the file source: you must set the "ignoreFileStreamSinkMetadata" option
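The retention behavior proposed in this diff can be sketched independently of Spark: a tiny file-based log that, after committing a batch, deletes every batch file except the newest. This is a minimal Python simulation; the class name and one-file-per-batch layout are illustrative, not Spark's actual `SinkFileStatus` format.

```python
import json
import os
import tempfile


class LastBatchOnlyFileLog:
    """Sketch of a sink metadata log that, when retain_only_last_batch
    is set, purges every batch file except the most recent one, so no
    compaction is ever needed."""

    def __init__(self, log_dir, retain_only_last_batch=True):
        self.log_dir = log_dir
        self.retain_only_last_batch = retain_only_last_batch
        os.makedirs(log_dir, exist_ok=True)

    def add(self, batch_id, file_entries):
        # Write the new batch's metadata (the list of output files).
        path = os.path.join(self.log_dir, str(batch_id))
        with open(path, "w") as f:
            json.dump(file_entries, f)
        if self.retain_only_last_batch:
            # Purge every older batch file.
            for name in os.listdir(self.log_dir):
                if name.isdigit() and int(name) < batch_id:
                    os.remove(os.path.join(self.log_dir, name))

    def batch_ids(self):
        return sorted(int(n) for n in os.listdir(self.log_dir) if n.isdigit())


log = LastBatchOnlyFileLog(tempfile.mkdtemp())
for batch_id in range(5):
    log.add(batch_id, [f"part-{batch_id}.parquet"])
print(log.batch_ids())  # only the last batch remains: [4]
```

This also makes the documented caveat visible: with only one batch file left, a reader cannot reconstruct the full list of committed output files from the log, which is why the source-side "ignoreFileStreamSinkMetadata" option is needed.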
I feel this is not ideal, but given that the file stream sink itself also relies on the file log, it cannot be made entirely optional.
If we wanted to avoid leaving a file log in this case, we would need another metadata format (storing minimal information, such as the last succeeded batch ID) and store that instead when the option is turned on. It should be stored in another directory (like _spark_metadata), and file listing should ignore that directory too.
In practice, end users would have a data retention policy, and output files could be removed based on that policy. So it would be ideal if the metadata could reflect changes to the output files, but from Spark's point of view this doesn't look easy to do. For example, if we periodically check the existence of the files in the metadata list (say, every X batches, to avoid concurrent modification), that becomes another huge overhead that slows things down. Specifying the retention policy in the Spark query (for files that will be removed outside of Spark) is also really odd, so neither approach is pretty. If it's OK for the file stream sink to periodically check file existence and remove deleted files from the file log (fewer side effects, but I'm not sure about performance), I'll apply the change.
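The "periodically check existence and purge" idea from the comment above can be sketched outside of Spark as follows. The function name and the metadata-as-a-list-of-paths shape are made up for illustration; the point is that the existence check runs only every N batches to bound its cost.

```python
import os
import tempfile


def purge_missing_files(metadata, check_interval, batch_id):
    """Every `check_interval` batches, drop metadata entries whose
    underlying output files were removed (e.g. by an external
    retention policy)."""
    if batch_id % check_interval != 0:
        return metadata  # skip the check on most batches to bound overhead
    return [entry for entry in metadata if os.path.exists(entry)]


out_dir = tempfile.mkdtemp()
files = [os.path.join(out_dir, f"part-{i}.parquet") for i in range(4)]
for path in files:
    open(path, "w").close()

metadata = list(files)
os.remove(files[0])  # simulate external retention deleting an output file
metadata = purge_missing_files(metadata, check_interval=10, batch_id=7)
print(len(metadata))  # 4: batch 7 skips the check
metadata = purge_missing_files(metadata, check_interval=10, batch_id=10)
print(len(metadata))  # 3: batch 10 performs the purge
```

As the comment notes, this only helps while the query is running, and the check itself costs one filesystem stat per tracked file on the batches where it runs.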
Test build #102529 has finished for PR 23840 at commit
Test build #102531 has finished for PR 23840 at commit
Now this approach conflicts with #23850, so I would rather have us choose one of the alternatives.
Taking a look at this, as it could fix a longstanding critical performance bug in our ingestion pipeline. A couple of comments:
FileStreamSink only reads the last batch of metadata, to determine which batch the sink wrote successfully; the huge metadata on FileStreamSink is actually not for FileStreamSink itself. Now I think the ideal way to address this is to incorporate the last successful batch ID into the query checkpoint and only write sink metadata when an option is enabled (though maybe we should make it default to true, since the change breaks backward compatibility).
I'm now seeing that the metadata path (within the checkpoint root) is injected only into Sources, which requires a DSv2 change on the Sink side if we really want to incorporate Sink metadata into the query checkpoint. I guess this won't happen unless we have a concrete and compelling use case, and even if it does, we can only get the change into Spark 3.0.0 and upwards. Let's see how another sink (KafkaSink) is implemented:
spark/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSink.scala Lines 26 to 43 in 4baa2d4
It only keeps latestBatchId in memory, which means the rows could be requested to be written again when the query restarts. That's OK because the Kafka sink supports at-least-once. I guess we couldn't take the same approach to achieve exactly-once in FileStreamSink. It would at least achieve a weak form of exactly-once (at-least-once plus idempotence) if rewriting a batch is idempotent, but I'm not 100% sure about that. Might it be better to initiate a discussion on the dev mailing list?
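The "at-least-once plus idempotent rewrite" idea can be illustrated outside of Spark: if a replayed batch writes to a path derived only from its batch ID, re-running it after a restart overwrites rather than duplicates. This is a minimal Python sketch with made-up names; note that Spark's actual sink writes task output files with unique names per attempt, which is exactly why replay is not naturally idempotent there and why the doubt in the comment is warranted.

```python
import os
import tempfile


def write_batch(out_dir, batch_id, rows):
    """Idempotent batch write: the output path is derived solely from
    the batch id, so replaying the same batch after a restart overwrites
    the same file instead of duplicating data."""
    path = os.path.join(out_dir, f"batch-{batch_id}.txt")
    with open(path, "w") as f:  # "w" truncates: replay-safe
        f.write("\n".join(rows))
    return path


out_dir = tempfile.mkdtemp()
write_batch(out_dir, 7, ["a", "b"])
write_batch(out_dir, 7, ["a", "b"])  # replay after a simulated restart
print(sorted(os.listdir(out_dir)))   # still a single file: ['batch-7.txt']
```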
Posted a discussion thread on the dev mailing list about the necessary DSv2 API change.
I was facing this issue (SPARK-24295), so I was looking into the commit. I have two questions.
Yes. Actually there's no way to achieve both. It looks like the only way is to let the sink check and purge file entries when files are removed outside of the query by some retention policy, but that will only work while the query is running (which might be OK, since the metadata only grows while the query is running).
I guess so; the proposed patch is not a complete solution as of now. We may need to focus on the alternatives I've suggested as well, or raise a new idea.
@alfredo-gimenez @NamanMahor
I've raised a patch for one of the alternatives, #24128, which applies retention on FileStreamSink.
Once we have a new patch that is less intrusive, I'll close this one. Please follow up on #24128 and review it. Thanks all!
What changes were proposed in this pull request?
This patch proposes adding an option to the file stream sink to retain only the last batch in the file log (metadata). This helps in cases where the query outputs plenty of files per batch, where compacting the metadata files into one can bring non-trivial overhead.
Please refer to the comment in the JIRA issue for more details on the overhead the current file stream sink metadata and file stream source metadata file index can bring to high-volume, long-running queries.
Because this patch purges old batches and retains only the last batch in metadata, the metadata file index fails to construct the list of files when this option is enabled, and as a result the file (stream) source cannot read the output directory. To re-enable reading from the output directory, this patch also proposes adding an option to the file (stream) source to ignore metadata information when reading the directory. With this option, end users can also choose the faster of the in-memory file index and the metadata file index when the metadata file gets much bigger.
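To make the compaction overhead concrete, here is a simplified, non-Spark cost model: if the compact log rewrites every entry seen so far once per compaction interval, total metadata entries written grow quadratically with the number of batches. The model and numbers are illustrative; Spark's real CompactibleFileStreamLog additionally deletes old files and honors a minimum number of batches to retain.

```python
def compaction_writes(num_batches, files_per_batch, compact_interval):
    """Count metadata entries written when every `compact_interval`-th
    batch rewrites all entries accumulated so far (a compact file),
    while other batches write only their own delta."""
    total_entries_written = 0
    accumulated = 0
    for batch_id in range(num_batches):
        accumulated += files_per_batch
        if (batch_id + 1) % compact_interval == 0:
            total_entries_written += accumulated  # rewrite everything so far
        else:
            total_entries_written += files_per_batch  # delta file only
    return total_entries_written


print(compaction_writes(100, 1000, 10))  # 640000 entries with compaction
print(100 * 1000)                        # 100000 entries are the raw deltas
```

With 100 batches of 1,000 files each and a compaction interval of 10, roughly 6.4x the raw metadata volume gets written, which is the overhead the retain-only-last-batch option removes.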
How was this patch tested?
Added unit tests.