Conversation

@HeartSaVioR
Contributor

@HeartSaVioR HeartSaVioR commented Feb 18, 2020

What changes were proposed in this pull request?

This patch caches the fetched list of files in FileStreamSource to avoid re-fetching whenever possible.

This improvement is effective when the source options are set as follows:

  • maxFilesPerTrigger is set
  • latestFirst is set to false (default)

because:

  • if maxFilesPerTrigger is unset, Spark processes all new files within a single batch
  • if latestFirst is set to true, Spark intends to process the "latest" files, so the listing has to be refreshed for every batch

The fetched list of files is filtered against SeenFilesMap before caching, so unnecessary files are dropped in this phase. Once a file is cached, we don't check it again with isNewFile: Spark processes files in timestamp order, so cached files have a timestamp equal to or later than latestTimestamp in SeenFilesMap.
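As a rough sketch of the filtering step described above (FileEntry and SeenFiles here are simplified, hypothetical stand-ins, not Spark's real classes):

```scala
import scala.collection.mutable

// Illustrative stand-ins for FileStreamSource internals.
case class FileEntry(path: String, timestamp: Long)

class SeenFiles {
  private val entries = mutable.HashMap.empty[String, Long]
  var latestTimestamp: Long = Long.MinValue

  def add(e: FileEntry): Unit = {
    entries(e.path) = e.timestamp
    if (e.timestamp > latestTimestamp) latestTimestamp = e.timestamp
  }

  def isNewFile(e: FileEntry): Boolean = !entries.contains(e.path)
}

// Filter once, before caching; cached entries are never re-checked, since
// files are processed in timestamp order and thus carry timestamps equal to
// or later than the latestTimestamp recorded in the seen-files map.
def filterBeforeCaching(fetched: Seq[FileEntry], seen: SeenFiles): Seq[FileEntry] =
  fetched.filter(seen.isNewFile)
```

A file already recorded in the seen-files map is dropped in this single pass; everything that survives can be cached as-is.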

The cache is kept only in memory to simplify the logic; if we supported restoring the cache when restarting a query, we would have to deal with changes to the source options.

To avoid a batch with a tiny set of inputs when only a few unread files remain (which can happen when the list operation returns slightly more than the max files), this patch employs a "lower bar" to decide whether retaining unread files is worthwhile: Spark discards the unread files and performs a new listing in the next batch if the number of unread files is below a specific ratio (20% for now) of the max files.
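The lower-bar rule above reduces to a single predicate (the 0.2 ratio is the value this patch hardcodes; the function name is illustrative):

```scala
val DISCARD_UNSEEN_FILES_RATIO = 0.2

// Retain the unread files for the next batch only if there are enough of
// them relative to maxFilesPerTrigger; otherwise discard them and perform
// a fresh listing in the next batch.
def shouldRetainUnreadFiles(numUnread: Int, maxFilesPerTrigger: Int): Boolean =
  numUnread >= maxFilesPerTrigger * DISCARD_UNSEEN_FILES_RATIO
```

With maxFilesPerTrigger = 100, for example, 19 leftover files would be discarded in favor of a fresh listing, while 20 or more would be retained.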

This patch has synergy with SPARK-20568: while this patch avoids the redundant cost of listing, SPARK-20568 removes the listing cost for processed files. Once the query has processed all files in the initial load, the listing cost for those files is gone.

Why are the changes needed?

Spark pays a huge cost to fetch the list of files from the input paths, yet uses only a limited part of that list in each batch. If the streaming query starts from a huge amount of input data for whatever reason (initial load, reprocessing, etc.), the cost of fetching the files applies to every batch, as it is unusual to let the first microbatch process the entire initial load.

SPARK-20568 will help to reduce the fetch cost, as processed files will be either deleted or moved outside of the input paths, but it still won't help in the early phase.

Does this PR introduce any user-facing change?

Yes. The driver process may require more memory than before to cache the fetched files, if maxFilesPerTrigger is set and latestFirst is "false". Previously Spark only took some files from the front of the list and discarded the rest, so technically the peak memory is the same, but that memory could be freed sooner.

It may not hurt much, as the peak memory is still similar, and a similar amount of memory would be required anyway when maxFilesPerTrigger is unset.

How was this patch tested?

New unit tests. Manually tested under the test environment:

  • input files
    • 171,839 files distributed evenly into 24 directories
    • each file contains 200 lines
  • query: read from the file stream source, repartition to 50, and write to the file stream sink
    • maxFilesPerTrigger is set to 100

before applying the patch

[screenshot: Screen Shot 2020-02-18 at 11 53 12 PM]

after applying the patch

[screenshot: Screen Shot 2020-02-18 at 11 56 01 PM]

The brown area represents "latestOffset", where the listing operation is performed for FileStreamSource. After the patch the cost of listing is paid only once, whereas before the patch it was paid for every batch.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Feb 18, 2020

The patch is actually very straightforward in how it works and how it helps (the changeset, excluding test code, is very small).

I'll attach the test result for the "initial load" use case to the "How was this patch tested?" section soon. I already have screenshots of the UI, but would like to run against the latest master.

EDIT: Just updated the description of PR.

@HeartSaVioR
Contributor Author

cc. @tdas @zsxwing @gaborgsomogyi

@SparkQA

SparkQA commented Feb 18, 2020

Test build #118640 has finished for PR 27620 at commit b417911.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Apr 14, 2020

Test build #121231 has finished for PR 27620 at commit b417911.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Apr 15, 2020

Test build #121303 has finished for PR 27620 at commit b417911.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@gaborgsomogyi gaborgsomogyi left a comment

Just wondering what would happen in the following scenario?

  • "latestFirst" -> "true"
  • "maxFilesPerTrigger" -> "5"
  • 6 files are available and 5 processed in batch0 -> 1 stored in unreadFiles
  • 1 new file arrives
  • batch1 processed in next round

The question is with what content will be batch1 executed?

}

override def listStatus(f: Path): Array[FileStatus] = {
val path = f.toUri.getPath
Contributor

Nit: f.toUri.getPath can be inlined.

}

object CountListingLocalFileSystem {
val scheme = s"FileStreamSourceSuite${math.abs(Random.nextInt)}fs"
Contributor

Maybe we can use the object name since there are multiple filesystems declared here?

Contributor Author

Ah yes good point. Will do.


source.latestOffset(FileStreamSourceOffset(-1L), ReadLimit.maxFiles(5))
.asInstanceOf[FileStreamSourceOffset]
assert(1 === CountListingLocalFileSystem.pathToNumListStatusCalled
Contributor

Maybe it's worth checking that nothing irrelevant is inside. This probably indicates the need for some reset functionality for pathToNumListStatusCalled...

Contributor

What I've meant here is that the test should fail if some nasty code puts irrelevant data into the map. For example when I put (just for the sake of representation) the following:

        CountListingLocalFileSystem.resetCount()
        CountListingLocalFileSystem.pathToNumListStatusCalled.put("foo", new AtomicLong(1))

it would be good to fail.

Contributor Author

Your example now fails because I added a check that counts the elements of pathToNumListStatusCalled. Does that address your comment?

Contributor Author

Sigh I realized I didn't push the change. Sorry about it. Will push.

Contributor Author

Sorry, I have to revert it. My bad. I remembered why I only checked the directory: this would require verifying all input files, which is redundant, as we already verified that behavior in the UT "Caches and leverages unread files".

Contributor

Even if it's checked in the positive case, the check still holds value in this negative case, unless we find a good reason why it's not possible. Negative-case code paths can list unnecessary dirs/files.

Contributor Author

@HeartSaVioR HeartSaVioR Apr 21, 2020

I'm not sure we want to verify the whole behavior of the file stream source in this PR. This test only makes sure the calls that list the input directory (and input files) are as expected; other checks are redundant and error-prone. E.g. suppose the file stream source changes something on the read side; then this test would fail unintentionally.

EDIT: that might be true for input files as well, but those may be one of the important things we want to watch. (And we checked it in another test I've added.) Other paths are relatively less important.

Contributor

This test only makes sure the calls of listing input directory (and input files as well) are expected

Making sure that the modified code doesn't introduce further unintended directory listing is also important, but I agree it's not worth the price of test failures whenever somebody modifies the stream source code. All in all I agree not to add it, since we've double-checked that no further unintended directory listing is introduced.

var lastModified = 0
val inputFiles = (0 to 19).map { idx =>
val f = createFile(idx.toString, new File(src, idx.toString), tmp)
f.setLastModified(lastModified)
Contributor

Maybe idx * 10000?

Contributor Author

Nice finding. I guess I used the variable and forgot to clean up when the variable was no longer needed.

var lastModified = 0
(0 to 19).map { idx =>
val f = createFile(idx.toString, new File(src, idx.toString), tmp)
f.setLastModified(lastModified)
Contributor

Same here.

@HeartSaVioR
Contributor Author

Just wondering what would happen in the following scenario?

"latestFirst" -> "true"
"maxFilesPerTrigger" -> "5"
6 files are available and 5 processed in batch0 -> 1 stored in unreadFiles
1 new file arrives
batch1 processed in next round
The question is with what content will be batch1 executed?

I've explained the conditions under which the functionality takes effect in the PR description: it won't cache the list of files if latestFirst is true, so the behavior stays the same as before.

@SparkQA

SparkQA commented Apr 16, 2020

Test build #121348 has finished for PR 27620 at commit 07eed68.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Apr 16, 2020

Test build #121353 has finished for PR 27620 at commit 07eed68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor

Just wondering what would happen in the following scenario?
"latestFirst" -> "true"
"maxFilesPerTrigger" -> "5"
6 files are available and 5 processed in batch0 -> 1 stored in unreadFiles
1 new file arrives
batch1 processed in next round
The question is with what content will be batch1 executed?

I've explained the condition when the functionality takes effect in the description of PR - it won't cache the list of files if latestFirst is true, so it should be same as it is.

I meant to write "latestFirst" -> "false", but with the corrected config my question still stands.

@HeartSaVioR
Contributor Author

Only the one file left in unread will be used for the batch in that case.

It's designed to avoid calling the list operation whenever possible, but in some cases it might be valid to drop the unread files and call the list operation, if the number of remaining files is relatively small compared to the max files per trigger. I think it affects only a few batches, though.

@gaborgsomogyi
Contributor

I've double-checked the maxFilesPerTrigger semantics and it's only a maximum to consider, so this doesn't break it. Since it affects only a small number of batches, I agree the overall gain is positive.

@HeartSaVioR
Contributor Author

Hmm... I thought about that more, and maybe it's good to add a lower bar to avoid the weird case where listing provides slightly more files than maxFilesPerTrigger. The tricky part is deciding the condition for discarding unread files (a ratio based on maxFilesPerTrigger? a static number?); the logic to add would be straightforward.

@SparkQA

SparkQA commented Apr 17, 2020

Test build #121395 has finished for PR 27620 at commit 57981cd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor

gaborgsomogyi commented Apr 17, 2020

Hmm, seems the issue is relevant.

maybe it's good to add a lower bar to avoid the weird case, listing files provides slightly more than maxFilesPerTrigger.

+1 on this. Maybe we can add a new test to cover this case.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Apr 17, 2020

Just addressed the lower bar for unseen files: the threshold ratio is set to 0.2 (20%) of max files for now, and we can adjust it later if we find a better value (or condition).

@SparkQA

SparkQA commented Apr 17, 2020

Test build #121415 has finished for PR 27620 at commit 8251b74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor

Basically looks good; only one thing is under discussion.

@gaborgsomogyi
Contributor

Does the modified code behave the same way as shown in the attached pictures?

@SparkQA

SparkQA commented Jun 14, 2020

Test build #123989 has finished for PR 27620 at commit 8251b74.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Jun 14, 2020

Test build #124000 has finished for PR 27620 at commit 8251b74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

After looking at a couple more issues in the file stream source, I feel we also need an upper bound on the cache, as the file stream source already contributes to memory usage on the driver and this adds a (possibly) unbounded amount of memory.

I guess 10,000 entries is good enough: it covers 100 batches when maxFilesPerTrigger is set to 100, and 10 batches when it is set to 1,000. If we find that a higher value is OK for memory usage and helpful for the majority of workloads, we can make it configurable with a higher default value.
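The back-of-the-envelope math in the comment above can be written out (the helper is hypothetical, not Spark code):

```scala
val MAX_CACHED_UNSEEN_FILES = 10000

// A cache of maxCached entries feeds maxCached / maxFilesPerTrigger batches
// before the source must perform a fresh listing.
def batchesCoveredByCache(maxCached: Int, maxFilesPerTrigger: Int): Int =
  maxCached / maxFilesPerTrigger
```

So 10,000 cached entries cover 100 batches at maxFilesPerTrigger = 100, and 10 batches at maxFilesPerTrigger = 1,000, matching the figures above.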

@SparkQA

SparkQA commented Jun 29, 2020

Test build #124638 has started for PR 27620 at commit 0e972fc.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Jun 30, 2020

Test build #124675 has finished for PR 27620 at commit 0e972fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Jul 1, 2020

Test build #124725 has finished for PR 27620 at commit 0e972fc.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Jul 1, 2020

Test build #124763 has finished for PR 27620 at commit 0e972fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

HiveSuite failed

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Jul 1, 2020

Test build #124773 has finished for PR 27620 at commit 0e972fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Jul 13, 2020

Test build #125728 has finished for PR 27620 at commit 0e972fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

Member

@dongjoon-hyun dongjoon-hyun left a comment

For the test case, could you rebase this to the master and deduplicate CountListingLocalFileSystem?

@SparkQA

SparkQA commented Aug 9, 2020

Test build #127239 has finished for PR 27620 at commit 0e972fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

I just revisited deduplicating CountListingLocalFileSystem and felt that deduplication isn't safe: the instance refers to a singleton object that would be shared once we deduplicate, and calling reset may affect other tests, depending on the parallelism of the test suite run. If test suites are guaranteed never to run in parallel in the same JVM, the change would be safe; otherwise it wouldn't.

@dongjoon-hyun
Member

Got it, @HeartSaVioR .

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @HeartSaVioR and @gaborgsomogyi .
Merged to master for Apache Spark 3.1.0 in December 2020.

@dongjoon-hyun
Member

cc @tdas , @zsxwing , @jose-torres , @dbtsai

type Timestamp = Long

val DISCARD_UNSEEN_FILES_RATIO = 0.2
val MAX_CACHED_UNSEEN_FILES = 10000
Member

Any reason for keeping these two parameters hardcoded instead of making them configurable? Is it too much detail to expose to the end user?

Contributor Author

I just wanted to avoid the Spark configuration becoming an "airplane control panel": end users already have a bunch of things to tune. It's completely OK to make them configurable if we find a case where the default values don't work.

HeartSaVioR pushed a commit that referenced this pull request May 20, 2024
### What changes were proposed in this pull request?
This change adds configuration options for the streaming input File Source for `maxCachedFiles` and `discardCachedInputRatio`.  These values were originally introduced with #27620 but were hardcoded to 10,000 and 0.2, respectively.

### Why are the changes needed?
Under certain workloads with large `maxFilesPerTrigger` settings, the performance gain from caching the input files capped at 10,000 can cause a cluster to be underutilized and jobs to take longer to finish if each batch takes a while to finish.  For example, a job with `maxFilesPerTrigger` set to 100,000 would do all 100k in batch 1, then only 10k in batch 2, but both batches could take just as long since some of the files cause skewed processing times.  This results in a cluster spending nearly the same amount of time while processing only 1/10 of the files it could have.
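A toy simulation of that pattern (assumptions: the cap is the hardcoded 10,000, no new files arrive, and the discard ratio is ignored):

```scala
// Returns the number of files processed per batch. Batch 1 lists and reads
// maxFilesPerTrigger files, batch 2 drains the cached leftovers (at most
// maxCached), then the cycle repeats with a fresh listing.
def simulateBatchSizes(totalFiles: Int, maxFilesPerTrigger: Int, maxCached: Int): List[Int] = {
  var remaining = totalFiles
  var cached = 0
  val sizes = scala.collection.mutable.ListBuffer.empty[Int]
  while (remaining > 0) {
    val visible = if (cached > 0) cached else remaining   // reuse cache, else list
    val processed = math.min(visible, maxFilesPerTrigger)
    cached = math.min(visible - processed, maxCached)      // leftovers, capped
    remaining -= processed
    sizes += processed
  }
  sizes.toList
}
```

With 210,000 input files and maxFilesPerTrigger = 100,000, the simulated batch sizes alternate between 100,000 and the 10,000-file cache drain, which is the underutilization described above.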

### Does this PR introduce _any_ user-facing change?
Updated documentation for structured streaming sources to describe new configurations options

### How was this patch tested?
New and existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45362 from ragnarok56/filestream-cached-files-config.

Authored-by: ragnarok56 <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>