[WIP][DISCUSSION_NEEDED][SPARK-24295][SS] Add option to retain only last batch in file stream sink metadata #23840
Changes from all commits:
```diff
@@ -536,6 +536,11 @@ Here are the details of all the sources in Spark.
     href="api/R/read.stream.html">R</a>).
     E.g. for "parquet" format options see <code>DataStreamReader.parquet()</code>.
     <br/><br/>
+    <code>ignoreFileStreamSinkMetadata</code>: whether to ignore the metadata left behind by a file stream sink, which makes the source always use an in-memory file index. (default: false)
+    <br/>
+    This option is useful when the metadata grows so large that reading it is slower than listing the files from the filesystem.<br/>
+    NOTE: This option must be set to "true" if the file source reads output files written by a file stream sink whose "retainOnlyLastBatchInMetadata" option is set to "true".
+    <br/><br/>
     In addition, there are session configurations that affect certain file formats. See the <a href="sql-programming-guide.html">SQL Programming Guide</a> for more details. E.g., for "parquet", see the <a href="sql-data-sources-parquet.html#configuration">Parquet configuration</a> section.
     </td>
     <td>Yes</td>
```
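To make the trade-off concrete, here is a plain-Python sketch (a toy model with hypothetical helpers, not Spark code) of the two listing strategies the option chooses between: trusting the sink's `_spark_metadata` log versus building an in-memory file index directly from the filesystem:

```python
import json
import os
import tempfile

def list_output_files(output_dir, ignore_sink_metadata):
    """Toy sketch of how a file source could pick its input files.

    A file stream sink records each committed batch's files in
    <output_dir>/_spark_metadata. A reader normally trusts that log;
    with the proposed ignoreFileStreamSinkMetadata option it instead
    lists the filesystem directly, skipping the metadata directory.
    """
    meta_dir = os.path.join(output_dir, "_spark_metadata")
    if not ignore_sink_metadata and os.path.isdir(meta_dir):
        # Metadata path: only files recorded by committed batches are visible.
        files = []
        for batch_file in sorted(os.listdir(meta_dir)):
            with open(os.path.join(meta_dir, batch_file)) as f:
                files.extend(json.load(f))
        return sorted(files)
    # "In-memory file index" path: list the directory, ignoring the log.
    return sorted(
        name for name in os.listdir(output_dir) if name != "_spark_metadata"
    )

# Build a toy output directory: two data files, but the metadata log only
# mentions the first one (as if later batch entries had been dropped).
out = tempfile.mkdtemp()
os.mkdir(os.path.join(out, "_spark_metadata"))
for name in ("part-0.parquet", "part-1.parquet"):
    open(os.path.join(out, name), "w").close()
with open(os.path.join(out, "_spark_metadata", "0"), "w") as f:
    json.dump(["part-0.parquet"], f)

print(list_output_files(out, ignore_sink_metadata=False))
print(list_output_files(out, ignore_sink_metadata=True))
```

The second call returns both files: once the metadata log stops describing every output file, only a full filesystem listing sees everything, which is why the NOTE above ties the two options together.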
```diff
@@ -1812,6 +1817,12 @@ Here are the details of all the sinks in Spark.
     (<a href="api/scala/index.html#org.apache.spark.sql.DataFrameWriter">Scala</a>/<a href="api/java/org/apache/spark/sql/DataFrameWriter.html">Java</a>/<a href="api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter">Python</a>/<a
     href="api/R/write.stream.html">R</a>).
     E.g. for "parquet" format options see <code>DataFrameWriter.parquet()</code>
+    <br/>
+    <code>retainOnlyLastBatchInMetadata</code>: whether to retain metadata only for the last succeeded batch.
+    <br/><br/>
+    This option greatly reduces the overhead of compacting metadata files, which can be non-trivial when a query processes many files in each batch.<br/>
+    NOTE: Because only the last batch is retained in the metadata, the metadata is not readable by the file source: you must set the "ignoreFileStreamSinkMetadata" option
+    to "true" when reading the sink's output files from another query, whether as a batch or streaming source.
     </td>
     <td>Yes (exactly-once)</td>
     <td>Supports writes to partitioned tables. Partitioning by time may be useful.</td>
```

Author (Contributor) commented on the NOTE above:

I feel this is not ideal, but given that the file stream sink itself also leverages the file log, it cannot be made entirely optional. If we wanted to avoid leaving a file log in this case, we would need another kind of metadata (storing minimized information, such as the id of the last succeeded batch) and store that instead when the option is turned on. It should be stored in another directory (_spark_metadata), and file listing should ignore that directory too.
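A plain-Python sketch (a toy model, not the actual `FileStreamSinkLog` implementation) of what retaining only the last batch means for the sink's metadata directory:

```python
import os
import tempfile

class ToySinkLog:
    """Toy model of the file stream sink's metadata log.

    By default every batch adds a file under _spark_metadata and the
    log is periodically compacted into one large file, which gets more
    expensive as batches accumulate. With the proposed
    retainOnlyLastBatchInMetadata option, only the entry for the most
    recent successful batch is kept, so there is nothing to compact,
    but the log no longer describes all output files.
    """

    def __init__(self, meta_dir, retain_only_last_batch=False):
        self.meta_dir = meta_dir
        self.retain_only_last_batch = retain_only_last_batch
        os.makedirs(meta_dir, exist_ok=True)

    def commit(self, batch_id, files):
        if self.retain_only_last_batch:
            # Drop the entries of all earlier batches before committing.
            for old in os.listdir(self.meta_dir):
                os.remove(os.path.join(self.meta_dir, old))
        with open(os.path.join(self.meta_dir, str(batch_id)), "w") as f:
            f.write("\n".join(files))

    def batches(self):
        return sorted(os.listdir(self.meta_dir), key=int)

log = ToySinkLog(tempfile.mkdtemp(), retain_only_last_batch=True)
log.commit(0, ["part-0.parquet"])
log.commit(1, ["part-1.parquet"])
print(log.batches())  # only the entry for batch 1 survives
```

With `retain_only_last_batch=False` the same sequence would leave entries for both batches, which is the behavior a downstream file source relies on when it reads the sink's metadata.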
Author (Contributor) commented:

I couldn't find a place to document this for the (batch) file source. Please let me know if there is a proper place to put it. At least we document the need to enable this option when the "retainOnlyLastBatchInMetadata" option is enabled, so the batch case might be safe as well.