[SPARK-7263] Add new shuffle manager which stores shuffle blocks in Parquet #7265
Changes from all commits
This commit adds a new Spark shuffle manager which reads and writes shuffle data to Apache Parquet files. Parquet has a File interface (not a streaming interface) because it is column-oriented and seeks within a file for metadata information, e.g. schemas and statistics. As such, this implementation fetches remote data into local, temporary blocks before the data is passed to Parquet for reading.

This manager uses the following Spark configuration parameters to configure Parquet: spark.shuffle.parquet.{compression, blocksize, pagesize, enabledictionary}. There is also a spark.shuffle.parquet.fallback configuration option which allows users to specify a fallback shuffle manager. If the Parquet manager finds that the classes being shuffled have no schema information, and therefore can't be used, it will fall back to the specified fallback manager. If no spark.shuffle.parquet.fallback is defined, any shuffle objects that are not compatible with Parquet will cause an error to be thrown which lists the incompatible objects.

With this PR, only Avro IndexedRecords are supported in the Parquet shuffle; however, it is straightforward to extend this to other serialization systems that Parquet supports, e.g. Apache Thrift.

Because the ShuffleDependency forwards the key, value, and combiner class information, a full schema can be generated before the first read/write. This allows for fewer errors (since reflection isn't used) and makes support for null values possible without complex code.

The ExternalSorter, if needed, is set up not to spill to disk when Parquet is used. In the future, an ExternalSorter would need to be created that can read/write Parquet.

Only record-level metrics are supported at this time. Byte-level metrics are not currently supported and are complicated somewhat by column compression.
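As a sketch, the options described above could be set in spark-defaults.conf. The property names are the ones listed in this PR; the values shown, and the choice of the sort-based manager as the fallback, are illustrative assumptions rather than documented defaults:

```
# Parquet writer tuning properties named in this PR
# (the values here are example assumptions, not defaults from the PR)
spark.shuffle.parquet.compression        snappy
spark.shuffle.parquet.blocksize          134217728
spark.shuffle.parquet.pagesize           1048576
spark.shuffle.parquet.enabledictionary   true

# Fall back to another shuffle manager for classes without schema
# information (fallback manager choice is an assumption)
spark.shuffle.parquet.fallback           sort
```

Without the fallback entry, shuffling any class that lacks schema information raises an error listing the incompatible objects, as described above.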