[SPARK-23243][SQL] Shuffle+Repartition on an RDD could lead to incorrect answers #20414
Conversation
This will necessarily be a breaking change [1] - though definitely worth doing given the data corruption/loss issue.
One alternative is to force a checkpoint() which will result in predictability, while avoiding the output skew.
Unfortunately, since checkpoint support is optional (based on whether checkpoint directory is configured), we cannot force this.
Perhaps a flag for users who want to preserve the 'earlier' behavior via checkpoint?
[1] Breaking change because in a lot of cases, coalesce with shuffle is done explicitly to prevent skew :-(
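For concreteness, the user-level workaround being alluded to might look like the following - only a sketch of my reading of the suggestion, assuming a checkpoint directory is available, and using placeholder names (`sc`, `someRdd`):

```scala
// Checkpointing the input materializes it once; if a retry later recomputes the
// repartition map tasks, they re-read identical, stably ordered data instead of a
// possibly non-deterministic lineage, so the round-robin assignment is repeatable.
sc.setCheckpointDir("/tmp/spark-checkpoints")   // hypothetical directory
val input = someRdd                             // placeholder for the RDD being repartitioned
input.checkpoint()
input.count()                                   // force the checkpoint before shuffling
val balanced = input.repartition(100)
```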
|
In addition, any use of random in Spark code will be affected by this - unless the input is an idempotent source; even if random initialization is done predictably with the partition index (which we were doing here anyway).
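For readers following along, the round-robin distribution under discussion is roughly the following - a simplified sketch of what coalesce with shuffle = true does, not the exact code:

```scala
import scala.util.Random

// The random starting offset is seeded with the partition index, so it is
// reproducible across retries; what is NOT reproducible is the order of `items`
// itself when the upstream data gets recomputed, and that order drives the
// final partition assignment of every element.
def distributePartition[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
  var position = new Random(index).nextInt(numPartitions)
  items.map { t =>
    position += 1
    (position % numPartitions, t)
  }
}
```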
|
Test build #86728 has finished for PR 20414 at commit
|
|
Just for context, I'm seeing RDD.repartition being used a lot - at the scale of almost every single job.
|
@jiangxb1987 @mridulm Could we have a special case that uses the sort-based approach when the RDD type is comparable? I think that should cover a bunch of the common cases, and the hash version would only be used when keys are not comparable. Also @mridulm, your point about more things other than repartition being affected is definitely true (there are other examples just in this file).
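A rough sketch of what such a special case could look like, assuming an Ordering is available for the element type (illustrative names, not from the patch):

```scala
import scala.util.Random

// Sorting each input partition locally before assigning round-robin positions
// makes the assignment a function of the data only, not of the (possibly
// non-deterministic) order in which records were produced, so a re-run of the
// map task emits exactly the same (position, element) pairs.
def sortedRoundRobin[T: Ordering](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
  var position = new Random(index).nextInt(numPartitions)
  items.toList.sorted.iterator.map { t =>
    position += 1
    (position % numPartitions, t)
  }
}
```

The hash-based proposal in this PR would then only need to kick in for element types without an ordering.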
|
Talked to @yanboliang offline; he said that the major use cases of RDD/DataFrame.repartition() in ML workloads he has observed are:
Actually for the first case, you could use coalesce instead. Another approach is that we may let the shuffle fetch produce output in a deterministic order.
Not quite - coalesce will not combine partitions across executors (i.e. without a shuffle), so you could still end up with many, many files. I have seen that quite a bit with large-scale ML. But FWIW, my comment earlier was for both "regular" use cases and ML use cases.
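For reference, the difference being discussed is just the standard RDD API behaviour (`rdd` is a placeholder):

```scala
// coalesce without shuffle only merges partitions that are already co-located,
// so data spread across many executors can still produce many small outputs.
val narrowed = rdd.coalesce(10)          // no shuffle, narrow dependency
// repartition is coalesce(n, shuffle = true): a full round-robin shuffle that
// evens out partition sizes - exactly the code path affected by this issue.
val balanced = rdd.repartition(10)
```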
|
@felixcheung You are right - I didn't make it clear that there would still be many shuffle blocks, and if the read task gets retried it could be slower. Now I tend to fix the issue following the latter, fix-shuffle-fetch-order way, since it may resolve the general case.
I'm not sure if I follow here. I think we can special case repartition (and similar order-sensitive operations) instead of changing the general shuffle path.
|
@cloud-fan Yeah, you provide a clearer statement here, and I totally agree!
|
@shivaram Thinking more, this might affect everything that does a zip (or variants/similar idioms like limit K, etc.) on a partition - with random + index in coalesce + shuffle=true being one special case. Essentially anything which assumes that the order of records in a partition will always be the same - which, currently, is not guaranteed.
The more I think about it, I like @sameeragarwal's suggestion in #20393: a general solution for this could be to introduce deterministic output for shuffle fetch - when enabled, it takes a more expensive but repeatable iteration over the fetched shuffle data. This assumes that Spark shuffle is always repeatable given the same input (I have yet to look into this in detail when spills are involved - any thoughts @sameeragarwal?), which is currently an implementation detail; but we could make it a requirement for shuffle. Note that we might be able to avoid this additional cost for most current use cases (otherwise we would have faced this problem two major releases ago!); so the actual user impact, hopefully, might not be that high.
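To make the order dependence concrete (plain RDD API, `data` is a placeholder RDD): both results below are defined by the order in which records happen to arrive in each post-shuffle partition, so a re-fetched partition with a different order yields a different answer.

```scala
// Indices are assigned by within-partition position: a retry that changes the
// fetch order changes which element gets which index.
val indexed = data.repartition(8).zipWithIndex()

// The "limit K per partition" idiom mentioned above: which K elements survive
// depends entirely on the within-partition order.
val firstK = data.repartition(8).mapPartitions(_.take(10))
```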
|
@mridulm I also agree we should follow @sameeragarwal's suggestion to let shuffle fetch produce deterministic output, and only do this for a few operations (e.g. repartition/zipWithIndex - do we have more?). IIUC spill should NOT affect the result, but if you find any suspects, please kindly share them with us. :)
|
@jiangxb1987 Unfortunately I am unable to analyze this in detail, but hopefully I can give some pointers, which I hope help! One example I can think of is a shuffle which uses an Aggregator (like combineByKey), via ExternalAppendOnlyMap. Similarly, with sort-based shuffle, depending on the length of the data array in AppendOnlyMap (which is determined by whether we spill or not) we can get different sort orders. There might be other cases where this is happening - I have not regularly looked at this part of the codebase in a while now, unfortunately. Please note that in all the cases above, there is no ordering defined.
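A toy illustration of the capacity dependence (this is not Spark's AppendOnlyMap, just the same open-addressing idea in miniature): the very same keys are scanned back out in a different order depending on the table size, which is what a spill/no-spill difference could translate into when no key ordering is defined.

```scala
// Keys land in slot hash % capacity (with linear probing) and are emitted in
// slot order, so the emitted order depends on the capacity, not just on the keys.
def scanOrder(keys: Seq[Int], capacity: Int): Seq[Int] = {
  val slots = Array.fill[Option[Int]](capacity)(None)
  for (k <- keys) {
    var i = Math.floorMod(k.hashCode, capacity)
    while (slots(i).exists(_ != k)) i = (i + 1) % capacity   // linear probing
    slots(i) = Some(k)
  }
  slots.flatten.toSeq
}

println(scanOrder(Seq(4, 9, 18, 23), 8))    // List(9, 18, 4, 23)  -> slots 1, 2, 4, 7
println(scanOrder(Seq(4, 9, 18, 23), 16))   // List(18, 4, 23, 9)  -> slots 2, 4, 7, 9
```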
|
Hey, I searched the ExternalAppendOnlyMap/AppendOnlyMap code mentioned above - the sorted iterators order entries by key hash code, so the output order looks deterministic. We may need to check all the other places where we spill/compare objects to ensure we generate a deterministic output sequence everywhere, though.
|
@jiangxb1987 You are correct when the sizes of the maps are the same - with different capacities, keys whose hash codes collide can come back in different relative orders.
|
Ouch... Yeah, we have to work out a way to make it deterministic under hash collisions.
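A quick way to see the collision problem (standard JVM string hashing, nothing Spark-specific): two distinct keys can share a hash code, so ordering by hash code alone cannot decide their relative position, and an explicit tie-breaker is needed to stay deterministic.

```scala
// "Aa" and "BB" are the classic pair of distinct strings with equal hash codes.
println("Aa".hashCode == "BB".hashCode)          // true - both are 2112

// A stable sort keyed only on hashCode keeps whatever input order it was given,
// so the result reflects arrival order rather than the keys themselves.
println(Seq("BB", "Aa").sortBy(_.hashCode))      // List(BB, Aa)
println(Seq("Aa", "BB").sortBy(_.hashCode))      // List(Aa, BB)
```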
|
Thanks @mridulm, all great points! We should investigate what's needed to guarantee ordering for spilled fetches.
|
Test build #93558 has finished for PR 20414 at commit
|
|
Hi, @jiangxb1987. Could you close this PR?
What changes were proposed in this pull request?
RDD.repartition() also uses a round-robin approach to distribute data, so it can also cause incorrect answers on RDD workloads, in a similar way as in #20393.
However, the approach that fixes DataFrame.repartition() doesn't apply to the RDD repartition issue, because the input data can be non-comparable, as discussed in #20393 (comment).
Here, I propose a quick fix that distributes elements using their hashes. This will cause a performance regression if you have highly skewed input data, but it ensures result correctness.
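Roughly, the proposed distribution looks like this (a simplified sketch rather than the exact patch; hashDistribute is an illustrative name):

```scala
// The target partition is a pure function of the element itself, so a recomputed
// map task routes every element to the same reducer again, and a retry can no
// longer lose or duplicate records. Elements with equal hashes always share a
// partition, which is where the skew regression mentioned above comes from.
def hashDistribute[T](items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] =
  items.map { t =>
    val h = if (t == null) 0 else t.hashCode
    (Math.floorMod(h, numPartitions), t)
  }
```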
How was this patch tested?
Added a test case in RDDSuite to ensure RDD.repartition() generates consistent answers.
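A much simplified sketch of what such a test can look like (this is not the actual RDDSuite test, which also has to provoke a shuffle fetch failure so the map stage is re-run; here a plain task retry is forced via the local[2, 4] master, which allows up to 4 task failures):

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[2, 4]").setAppName("repartition-retry-sketch"))
try {
  val result = sc.parallelize(1 to 1000, 10)
    .repartition(4)
    .mapPartitions { iter =>
      val ctx = TaskContext.get()
      // Fail the first attempt of partition 0 so the stage reading the
      // repartitioned data gets retried.
      if (ctx.partitionId() == 0 && ctx.attemptNumber() == 0) {
        throw new RuntimeException("simulated failure to trigger a retry")
      }
      iter
    }
    .collect()
    .sorted
  assert(result.sameElements(1 to 1000), "records lost or duplicated after the retry")
} finally {
  sc.stop()
}
```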