[SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log #27694
Conversation
Honestly I have been thinking about larger changes, like:
but I would like to prioritize from the perspective of fewer changes and bigger impact, and make changes incrementally. This patch brings the fewest changes but a great impact on performance. The items above are orthogonal to this improvement, so they can be addressed on demand later.
Test build #118924 has finished for PR 27694 at commit

retest this, please

Test build #121228 has finished for PR 27694 at commit
Suppose there is a streaming job pipeline, and these streaming jobs come from different end users or departments; if an end user in the middle of the pipeline upgrades their Spark and uses
@uncleGen That seems to be a clear case where end users don't want the version to be upgraded implicitly, but would there also be a case where they start to write the output and want it to be compatible with Spark 2.x? If we want the latter, it's worth having an option which specifies the "version" of the format. (The default value would ideally be the same as the existing one in the file, or the highest known version.) If we can live with just the former, then a flag option would be fine.
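The "version" idea above can be sketched roughly like this (hypothetical helper names, not Spark's actual code; Spark's CompactibleFileStreamLog does write a version line such as `v1` at the top of each log file, and a reader can dispatch on it):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class VersionedLogDemo {
    // Write a metadata log whose first line carries the format version,
    // mimicking the "v1" prefix CompactibleFileStreamLog already writes.
    static byte[] write(int version, String entryLine) {
        return ("v" + version + "\n" + entryLine).getBytes(StandardCharsets.UTF_8);
    }

    // Read the version line first, then dispatch to the matching serde.
    static String read(byte[] bytes) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8))) {
            int version = Integer.parseInt(reader.readLine().substring(1));
            switch (version) {
                case 1: // the existing JSON-per-line serde would be used here
                    return "v1:" + reader.readLine();
                case 2: // a newer (e.g. binary/compressed) serde would be used here
                    return "v2:" + reader.readLine();
                default:
                    throw new IllegalArgumentException("Unsupported version: " + version);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] log = write(2, "entry");
        System.out.println(read(log));
    }
}
```

A writer pinned to an older version stays readable by older Spark, while new readers can handle every version they know about.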
I've just picked this up, but it seems like it has a conflict and the linter job failed. Could you resolve them?
Test build #122403 has finished for PR 27694 at commit

Test build #123150 has finished for PR 27694 at commit

retest this please.

Test build #123156 has finished for PR 27694 at commit
One lesson I learned from the past is that UnsafeRow is not designed to be persisted across Spark versions. I'd avoid introducing another place that persists data in the UnsafeRow format. Do you know what the performance numbers are if we just compress the text files?
This sounds like a blocker for SS, as we leverage it to store state, which should be usable across Spark versions and shouldn't be corrupted or lost, since it's time-consuming or sometimes simply not possible to replay from scratch to reconstruct the same state due to retention policies. Do you have references to the discussion/gotchas for that issue? I think we should really fix this for storing state; probably state must not be stored in the UnsafeRow format then.
I haven't experimented, but we can easily imagine that the file size would be smaller, whereas processing time may be affected both positively (less to read from remote storage) and negatively (we still have to serialize/deserialize with JSON, plus the additional cost of compression/decompression). The point of this PR is that we know the schema (and even the versioning) of the file in advance, hence we don't (and shouldn't) pay a huge cost to make the file backward-compatible by itself. We don't do versioning for the data structures used by the event log, so we pay a huge cost there to keep it backward/forward compatible. If we are not sure about the UnsafeRow format for storage, we may be able to just try traditional approaches.
I've spent some time experimenting with more approaches. This is the experiment branch: https://github.com/HeartSaVioR/spark/tree/SPARK-30946-experiments
I've also implemented simple apps to 1) prepare metadata (so that we can experiment on a specific batch) and 2) run a simple test with various versions: HeartSaVioR/spark-delegation-token-experiment@bea7680 The numbers are recorded below: https://docs.google.com/spreadsheets/d/1D5P103F_sKOjkDpNr9PaCC8Ehk4Y4dRtH3oEdytM4_c/edit?usp=sharing
The numbers show that applying compression to the existing format doesn't help reduce the time, while the size is reduced similarly to the other alternatives. The alternatives that integrate directly with the data structure greatly reduce the time, roughly 10 times faster. The size of compact files is similar across alternatives.
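As a rough illustration of why the schema-aware alternatives win (a sketch with hypothetical field names loosely modeled on SinkFileStatus, not Spark's actual code): with JSON, the field names are paid for in every entry, while a fixed binary layout with a known schema writes only the values.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class SerdeSizeDemo {
    // Hypothetical fields resembling one file stream sink metadata entry.
    static final String PATH = "/output/part-00000-abc.parquet";
    static final long SIZE = 1024L;
    static final long MOD_TIME = 1583000000000L;

    // JSON-style serde: field names are repeated in every entry.
    static int jsonSize() {
        String json = String.format(
            "{\"path\":\"%s\",\"size\":%d,\"modificationTime\":%d}",
            PATH, SIZE, MOD_TIME);
        return json.getBytes(StandardCharsets.UTF_8).length;
    }

    // Binary serde: the schema is known up front, so only values are written.
    static int binarySize() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(PATH);
            out.writeLong(SIZE);
            out.writeLong(MOD_TIME);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    public static void main(String[] args) {
        System.out.println("JSON: " + jsonSize() + " bytes, binary: " + binarySize() + " bytes");
    }
}
```

The binary form is smaller per entry even before compression, and it also skips the cost of JSON parsing on read.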
Hi all, are these numbers enough to persuade? Given the concern about "compatibility", I think version 4 is ideal, as Java guarantees its compatibility across architectures. I'll update the patch to use version 4 as the new version (version 2, of course) for now.
Test build #123555 has finished for PR 27694 at commit

Test build #123559 has finished for PR 27694 at commit

Failed UTs are unrelated.

retest this, please

Test build #123565 has finished for PR 27694 at commit

Test build #123564 has finished for PR 27694 at commit
@zsxwing I think addressing all of these would require me to redesign the metadata log as well as the file stream source itself, which wasn't the goal here. As I commented previously, building a holistic solution was not a goal, because I already indicated it would take considerable time, and someone might claim that it's reinventing the wheel (I know these functionalities are extracted from alternatives). That said, shall we do the following?
If it's not desirable to go with 2, I'd ask to at least make the file source/sink metadata log class pluggable (I'll craft a PR soon if that's the way to go), so that someone can move forward with a restricted use case instead of trying to be too general. WDYT?
(IMHO this might still be a good chance to leverage this PR to establish a proper approach to versioning, so that version 2 can be used as an interim with best practices on versioning, and we can get version 3 with such major features included without headaches over versioning.)
Probably we can go with supporting relative paths "as well" in this metadata version, which would heavily reduce the size even without compression. I'll experiment with this approach as well.
Never mind. I simply missed that the metadata log should be readable by older Spark versions, which would have no idea how to handle relative paths. This change would also need to bump up the version.
retest this, please

Test build #125726 has finished for PR 27694 at commit

retest this, please

@zsxwing Would you mind sharing your opinion on #27694 (comment)?

Test build #125737 has finished for PR 27694 at commit

retest this, please

Test build #125748 has finished for PR 27694 at commit

Test build #125876 has finished for PR 27694 at commit

retest this, please

Test build #125900 has finished for PR 27694 at commit

…th LZ4 compression on FileStream(Source/Sink)Log

Test build #125934 has finished for PR 27694 at commit

retest this, please

Test build #125957 has finished for PR 27694 at commit

retest this, please

Test build #127534 has finished for PR 27694 at commit

retest this, please

Test build #127549 has finished for PR 27694 at commit

retest this, please

Test build #127573 has finished for PR 27694 at commit
Closing, as I don't see any support for this, and I'm now in favor of one of the alternatives.
What changes were proposed in this pull request?
This patch uses DataInputStream / DataOutputStream to serialize/deserialize entries in FileStream(Source/Sink)Log, and applies LZ4 compression to the serde.
The ground idea is that both FileEntry and SinkFileStatus haven't changed for 3 years, yet we have been using JSON serde to prepare for possible changes, spending a bunch of overhead on the JSON format. CompactibleFileStreamLog has metadata version information, but it has only ever been version 1.
This patch introduces metadata version 2 of CompactibleFileStreamLog, leveraging DataInputStream / DataOutputStream with LZ4 compression. While we introduce version 2, CompactibleFileStreamLog remains compatible with version 1 for both serialization and deserialization, so we can read from version 2 and write to version 2 normally, and we can also read from version 1 and write to version 2 for smooth migration.
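A minimal, dependency-free sketch of the version-2 serde shape (hypothetical helper names, not the patch's actual code; the JDK ships no LZ4 codec, so GZIP from java.util.zip stands in here for the LZ4 compression the patch actually uses):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedSerdeDemo {
    // Serialize a hypothetical entry (path, size, modification time) with
    // DataOutputStream over a compressed stream. GZIP stands in for LZ4
    // so this sketch needs no external dependency.
    static byte[] writeEntry(String path, long size, long modTime) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(buf))) {
            out.writeUTF(path);
            out.writeLong(size);
            out.writeLong(modTime);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Deserialize in the same field order; the schema is fixed and known,
    // so no field names or JSON parsing are needed.
    static String readEntry(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(bytes)))) {
            return in.readUTF() + "," + in.readLong() + "," + in.readLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = writeEntry("/output/part-00000.parquet", 2048L, 1583000000000L);
        System.out.println(readEntry(bytes));
    }
}
```

Because writer and reader agree on the field order and types, the on-disk format needs no self-description, which is where the size and speed wins over JSON come from.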
There's an exceptional case for the file stream sink, whose outputs may be read by an older version of Spark. To support such a case, a new configuration (spark.sql.streaming.fileSink.log.writeMetadataVersion) is introduced to specify the write version of the metadata for the file stream sink. If specified, further batch log files will be written with that version. This doesn't rewrite the existing batch log files, so if the query is started from existing metadata, the query should run at least until the next compact batch so that the metadata log of the compact batch is written with the specified version.
Why are the changes needed?
Multiple JIRA issues have been filed reporting that huge metadata logs make queries very slow. While this patch won't stop the metadata log from growing, it may heavily reduce the size of the metadata log and the time to process compaction.
Please find the numbers in the section "How was this patch tested?".
As this also reduces the latency of reading all files from the file stream sink metadata, it also helps speed up queries, both batch and streaming, when the input path points to the output of a file stream sink.
Does this PR introduce any user-facing change?
Further metadata log files for the file stream sink will be written with the configured write version. For other compatible metadata logs, further metadata log files will be written as version 2.
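As a hedged illustration of how the proposed configuration would have been set (the conf name comes from this PR and was never merged into Spark; the application class and jar name are placeholders):

```shell
# Hypothetical: pin the file sink metadata write version to 1 so that the
# output stays readable by older Spark versions.
spark-submit \
  --conf spark.sql.streaming.fileSink.log.writeMetadataVersion=1 \
  --class com.example.StreamingApp \
  app.jar
```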
How was this patch tested?
Note that version 4 in the experiment is taken as version 2 in this patch.
This is the experiment branch: https://github.com/HeartSaVioR/spark/tree/SPARK-30946-experiments
HeartSaVioR@406670a
HeartSaVioR@7c66516
I've also implemented simple apps to 1) prepare metadata (so that we can experiment on a specific batch) and 2) run a simple test with various versions:
HeartSaVioR/spark-delegation-token-experiment@bea7680
The numbers are recorded below:
https://docs.google.com/spreadsheets/d/1D5P103F_sKOjkDpNr9PaCC8Ehk4Y4dRtH3oEdytM4_c/edit?usp=sharing