Conversation

@HeartSaVioR
Contributor

@HeartSaVioR HeartSaVioR commented Feb 25, 2020

What changes were proposed in this pull request?

This patch uses DataInputStream / DataOutputStream to serialize/deserialize entries in FileStream(Source/Sink)Log, and applies LZ4 compression to the serialized output.

The underlying idea is that neither FileEntry nor SinkFileStatus has changed for 3 years, yet we have been using JSON serde to prepare for possible changes and paying a significant overhead for the JSON format. CompactibleFileStreamLog carries metadata version information, but it has only ever been version 1.

This patch introduces metadata version 2 of CompactibleFileStreamLog, leveraging DataInputStream / DataOutputStream and LZ4 compression. While version 2 is introduced, CompactibleFileStreamLog remains compatible with version 1 for both serialization and deserialization: in the normal case we read and write version 2, and for smooth migration we can read version 1 and write version 2.

There is an exceptional case for the file stream sink, whose outputs may be read by older versions of Spark. To support that case, a new configuration (spark.sql.streaming.fileSink.log.writeMetadataVersion) is introduced to specify the write version of the file stream sink metadata. If specified, subsequent batch log files will be written with that version. This does not rewrite existing batch log files, so if the query starts from existing metadata, it should run at least until the next compact batch so that the compact batch's metadata log is written with the configured write version.
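To make the format change concrete, here is a minimal sketch (not the actual patch code) of writing/reading a SinkFileStatus-like entry with DataOutputStream / DataInputStream wrapped in LZ4 block streams from lz4-java (which Spark already bundles); the case class, field set, and object name are illustrative only.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import net.jpountz.lz4.{LZ4BlockInputStream, LZ4BlockOutputStream}

// Illustrative stand-in for SinkFileStatus; the real class has more fields.
case class SinkFileStatusLike(path: String, size: Long, modificationTime: Long, action: String)

object BinaryLogSerde {
  // Write all entries of a batch as one LZ4-compressed binary blob.
  def serialize(entries: Seq[SinkFileStatusLike]): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new DataOutputStream(new LZ4BlockOutputStream(buffer))
    out.writeInt(entries.size)
    entries.foreach { e =>
      out.writeUTF(e.path)
      out.writeLong(e.size)
      out.writeLong(e.modificationTime)
      out.writeUTF(e.action)
    }
    out.close()               // closing flushes the final LZ4 block
    buffer.toByteArray
  }

  // Read entries back in the same field order.
  def deserialize(bytes: Array[Byte]): Seq[SinkFileStatusLike] = {
    val in = new DataInputStream(new LZ4BlockInputStream(new ByteArrayInputStream(bytes)))
    val count = in.readInt()
    (0 until count).map { _ =>
      SinkFileStatusLike(in.readUTF(), in.readLong(), in.readLong(), in.readUTF())
    }
  }
}
```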

Why are the changes needed?

Multiple JIRA issues have been filed reporting that huge metadata logs make queries very slow. While this patch won't stop the metadata log from growing, it can substantially reduce the metadata log size and the time spent on compaction.

Please find the numbers in the section "How was this patch tested?".

Since this also reduces the latency of reading all files from the file stream sink metadata, it helps speed up both batch and streaming queries whose input path points to the output of a file stream sink.

Does this PR introduce any user-facing change?

For the file stream sink, subsequent metadata log files will be written with the configured write version. For other compatible metadata logs, subsequent metadata log files will be written with version 2.
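A hedged usage illustration of the proposed configuration (it exists only in this PR, not in released Spark): a query writing to a file sink whose output must stay readable by an older Spark could pin the write version to 1. Paths and the source format are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("file-sink-metadata-version").getOrCreate()

// Config name as proposed in this PR (not available in released Spark):
// keep writing the sink metadata with version 1 so older Spark readers still work.
spark.conf.set("spark.sql.streaming.fileSink.log.writeMetadataVersion", "1")

val query = spark.readStream
  .format("rate")                                     // any streaming source works here
  .load()
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/checkpoint")    // illustrative paths
  .start("/tmp/file-sink-output")
```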

How was this patch tested?

  • Existing UTs (the UTs checking compatibility with 2.1.0 are effectively checking V1 compatibility, as I don't find any change)
  • A simple manual performance test

Note that version 4 in the experiment corresponds to version 2 in this patch.

This is the experiment branch: https://github.com/HeartSaVioR/spark/tree/SPARK-30946-experiments

Version 3 only applies compression (LZ4) to the existing format. See the commit below:

HeartSaVioR@406670a

Version 4 serializes/deserializes entries via DataInputStream / DataOutputStream. See the commit below:

HeartSaVioR@7c66516

I've also implemented simple apps to 1) prepare metadata (so that we can experiment on a specific batch) and 2) run a simple test with the various versions:

HeartSaVioR/spark-delegation-token-experiment@bea7680

The numbers are recorded below:

https://docs.google.com/spreadsheets/d/1D5P103F_sKOjkDpNr9PaCC8Ehk4Y4dRtH3oEdytM4_c/edit?usp=sharing

version | elapsed time | elapsed time (ratio of v1) | size     | size (ratio of v1)
1       | 10628.75     | 100.00%                    | 57265744 | 100.00%
2       | 939.25       | 8.84%                      | 16655736 | 29.08%
3       | 10116        | 95.18%                     | 17259852 | 30.14%
4       | 837          | 7.87%                      | 15285626 | 26.69%

@HeartSaVioR
Contributor Author

HeartSaVioR commented Feb 25, 2020

Honestly I have been thinking about larger changes, like:

  • avoid rewriting all entries during the compact operation
  • support retention (ideally designed together with the above item, as we won't be able to read all existing entries during the compact operation)
  • use a tree structure, or at least two kinds of entries ("directory" and "file"), to greatly reduce the path strings stored in entries
  • streamline the compaction - instead of loading all entries before doing the next operation, iterate over "load an entry -> transform/filter -> store the entry if not filtered out". This would help reduce driver memory usage during the compact operation. (A rough sketch of this loop follows at the end of this comment.)

but I would like to prioritize from the perspective of fewer changes and bigger impact, and make changes incrementally.

This patch brings the fewest changes but a large impact on performance. The items above are orthogonal to this improvement, so they can be addressed on demand later.
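Purely illustrative sketch of the "streamed" compaction idea listed above (not part of this patch): process entries one at a time instead of materializing the whole batch, so driver memory stays bounded. The type parameter and function parameters are assumptions for illustration.

```scala
def compactStreaming[T](
    entries: Iterator[T],            // entries read lazily from the logs being compacted
    shouldKeep: T => Boolean,        // e.g. drops entries removed by a retention policy
    writeEntry: T => Unit): Unit = {
  entries.filter(shouldKeep).foreach(writeEntry)
}

// Example: keep only "add" entries and emit them as they stream through.
compactStreaming[String](
  Iterator("add:/data/part-0", "delete:/data/part-1", "add:/data/part-2"),
  _.startsWith("add:"),
  println)
```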

@SparkQA

SparkQA commented Feb 25, 2020

Test build #118924 has finished for PR 27694 at commit 55d238f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

cc. @tdas @zsxwing @gaborgsomogyi

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Apr 14, 2020

Test build #121228 has finished for PR 27694 at commit 55d238f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@uncleGen
Contributor

Suppose there is a pipeline of streaming jobs owned by different end users or departments. If an end user in the middle of the pipeline upgrades their Spark and starts using CompactibleFileStreamLog V2 (as you said, by default we read version 1 and write version 2 for smooth migration), then the downstream jobs will fail. Is there something I misunderstand? If not, it would be better to keep the version of CompactibleFileStreamLog as-is, or to add a config for upgrading the version of CompactibleFileStreamLog that defaults to false.

@HeartSaVioR
Contributor Author

@uncleGen
Right, I hadn't thought about the case where Spark versions are mixed.

That seems to be a clear case where end users don't want the version upgraded implicitly. But would there also be a case where they start writing output and want it to be compatible with Spark 2.x?

If we want the latter, it's worth having an option that specifies the "version" of the format. (Ideally, the default value would be the same as the existing version in the file, or the highest known version.)

If we can live with just the former, then a boolean flag option would be fine.
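To make the compatibility discussion concrete, here is a rough sketch (not the patch code) of version-aware reading that dispatches on the "v<N>" first line which compact metadata log files already start with; readJsonEntries / readBinaryEntries are hypothetical placeholders for the existing JSON-per-line path and the binary path proposed in this PR.

```scala
import java.nio.charset.StandardCharsets

def readJsonEntries(text: String): Seq[String] =
  text.split("\n").toSeq                         // placeholder: one JSON entry per line

def readBinaryEntries(bytes: Array[Byte]): Seq[String] =
  sys.error("binary (v2) decoding omitted in this sketch")

def deserializeAnyVersion(bytes: Array[Byte]): Seq[String] = {
  val newline = bytes.indexOf('\n'.toByte)
  require(newline > 0, "missing version line")
  val version = new String(bytes, 0, newline, StandardCharsets.UTF_8).trim
  val body = java.util.Arrays.copyOfRange(bytes, newline + 1, bytes.length)
  version match {
    case "v1" => readJsonEntries(new String(body, StandardCharsets.UTF_8)) // old readers keep working on v1 files
    case "v2" => readBinaryEntries(body)                                   // new format; older Spark cannot read this
    case other => sys.error(s"Unsupported metadata log version: $other")
  }
}
```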

@gaborgsomogyi
Contributor

I've just picked this up, but it seems there is a conflict and the linter job failed. Could you resolve them?

@SparkQA

SparkQA commented May 7, 2020

Test build #122403 has finished for PR 27694 at commit 74163d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 27, 2020

Test build #123150 has finished for PR 27694 at commit 3231611.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@uncleGen
Contributor

retest this please.

@SparkQA

SparkQA commented May 27, 2020

Test build #123156 has finished for PR 27694 at commit 3231611.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented May 28, 2020

One lesson I learned from the past is that UnsafeRow is not designed to be persisted across Spark versions. I'd avoid introducing another place that persists data in the UnsafeRow format. Do you know what the performance numbers would be if we just compressed the text files?

@HeartSaVioR
Contributor Author

HeartSaVioR commented May 28, 2020

One lesson I learned from the past is that UnsafeRow is not designed to be persisted across Spark versions.

This sounds like a blocker for SS, as we leverage UnsafeRow to store state, which needs to be usable across Spark versions and must not be corrupted or lost, since replaying from scratch to reconstruct the same state is time-consuming or sometimes simply impossible due to retention policies. Do you have references to the discussion/gotchas around that issue? I think we should really fix this for state storage - probably state must not be stored in the UnsafeRow format then.

Do you know what the performance numbers would be if we just compressed the text files?

I haven't experimented, but we can easily imagine the file size would be smaller, whereas processing time may be affected both positively (less to read from remote storage) and negatively (we still have to serialize/deserialize with JSON, plus the additional cost of compression/decompression).

The point of this PR is that we know the schema (and even the versioning) of the file up front, hence we don't (and shouldn't) pay a huge cost to make the file backward-compatible by itself. We don't version the data structures used by the event log, so there we pay a huge cost to keep them backward/forward compatible. If we are not sure about the UnsafeRow format for storage, we can just try more traditional approaches.

@HeartSaVioR
Contributor Author

I've spent some time experimenting with more approaches.

This is the experiment branch: https://github.com/HeartSaVioR/spark/tree/SPARK-30946-experiments

Version 3 only applies compression (LZ4) to the existing format. See the commit below:

HeartSaVioR@406670a

Version 4 serializes/deserializes entries via DataInputStream / DataOutputStream. See the commit below:

HeartSaVioR@7c66516

I've also implemented simple apps to 1) prepare metadata (so that we can experiment on a specific batch) and 2) run a simple test with the various versions:

HeartSaVioR/spark-delegation-token-experiment@bea7680

The numbers are recorded below:

https://docs.google.com/spreadsheets/d/1D5P103F_sKOjkDpNr9PaCC8Ehk4Y4dRtH3oEdytM4_c/edit?usp=sharing

version | elapsed time | elapsed time (ratio of v1) | size     | size (ratio of v1)
1       | 10628.75     | 100.00%                    | 57265744 | 100.00%
2       | 939.25       | 8.84%                      | 16655736 | 29.08%
3       | 10116        | 95.18%                     | 17259852 | 30.14%
4       | 837          | 7.87%                      | 15285626 | 26.69%

The numbers show that applying compression to the existing format doesn't help reduce the time, while the size is reduced to a level similar to the other alternatives. The alternatives that integrate directly with the data structure greatly reduce the time - roughly 10 times faster. The size of the compact files is similar across the alternatives.

@HeartSaVioR
Contributor Author

Hi all, are these numbers persuasive enough? Given the concern about compatibility, I think version 4 is ideal, since Java guarantees compatibility of the format across architectures. I'll now update the patch to use version 4 as the new version (version 2, of course).

@HeartSaVioR HeartSaVioR changed the title [SPARK-30946][SS] Serde entry with UnsafeRow on FileStream(Source/Sink)Log with LZ4 compression [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log Jun 5, 2020
@SparkQA

SparkQA commented Jun 5, 2020

Test build #123555 has finished for PR 27694 at commit 737cc60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

I've also addressed the case of reading the metadata log from a lower version of Spark.

@uncleGen Does the approach make sense?
@zsxwing Could we revisit the approach from a performance & compatibility point of view?

@SparkQA

SparkQA commented Jun 5, 2020

Test build #123559 has finished for PR 27694 at commit 6cd9bfd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

Failed UTs are unrelated.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Jun 5, 2020

Test build #123565 has finished for PR 27694 at commit 3933018.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 5, 2020

Test build #123564 has finished for PR 27694 at commit 6cd9bfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Jun 30, 2020

@zsxwing
Thanks a lot for your detailed comment!

I think taking all of these into account would require redesigning the metadata log as well as the file stream source itself, which wasn't the goal here. As I commented previously, building a holistic solution was not a goal, because, as I already indicated, it would take considerable time, and someone might claim it's reinventing the wheel (I know these functionalities are extracted from alternatives).

That said, shall we do the following?

  1. Review and merge the other PRs which don't incur a metadata version upgrade, before proceeding to 2.
  2. File an issue to address these requirements altogether, with one of the PMC members as shepherd. (It would be amazing if you could volunteer.) Once someone volunteers and commits to moving things forward, I'll try to persuade my employer to allow me to take on the work.

If going with 2 isn't desirable, I'd ask that we at least make the file source/sink metadata log class pluggable (I'll craft a PR soon if that's the way to go), so that someone can move forward by restricting the use case instead of having to think too generally.

WDYT?

@HeartSaVioR
Contributor Author

HeartSaVioR commented Jun 30, 2020

(IMHO this PR might still be a good opportunity to establish proper versioning - so that version 2 can serve as an interim with versioning best practices in place, and version 3 can then add such major features without a versioning headache.)

@HeartSaVioR
Contributor Author

We could probably also support relative paths in this metadata version, which would greatly reduce the size even without compression. I'll experiment with this approach as well.

@HeartSaVioR
Contributor Author

We could probably also support relative paths in this metadata version, which would greatly reduce the size even without compression. I'll experiment with this approach as well.

Never mind. I simply missed that the metadata log should be readable by older Spark versions, which would have no idea how to handle relative paths. This change would also require bumping the version.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Jul 12, 2020

Test build #125726 has finished for PR 27694 at commit 3933018.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@HeartSaVioR
Contributor Author

@zsxwing Would you mind sharing your opinion on #27694 (comment)?

@SparkQA

SparkQA commented Jul 13, 2020

Test build #125737 has finished for PR 27694 at commit 3933018.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Jul 13, 2020

Test build #125748 has finished for PR 27694 at commit 3933018.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 15, 2020

Test build #125876 has finished for PR 27694 at commit 86131af.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Jul 16, 2020

Test build #125900 has finished for PR 27694 at commit 86131af.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 16, 2020

Test build #125934 has finished for PR 27694 at commit 2559928.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Jul 16, 2020

Test build #125957 has finished for PR 27694 at commit 2559928.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Aug 18, 2020

Test build #127534 has finished for PR 27694 at commit 2559928.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Aug 18, 2020

Test build #127549 has finished for PR 27694 at commit 2559928.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Aug 18, 2020

Test build #127573 has finished for PR 27694 at commit 2559928.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

Closing, as I don't see any support for this and I'm now in favor of one of the alternatives.
