-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-9899][SQL] log warning for direct output committer with speculation enabled #8687
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #42248 has finished for PR 8687 at commit
|
|
retest this please. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little bit confused here. jobCommitter was retrieved via jobFormat.getOutputCommitter(), which should return the default output committer bundled with the output format. Does this mean saveAsNewAPIHadoopDataset() never respects user-defined output committers (I mean the version before this change)? If that's true, we should just leave this method as is. Restricting output committer class to be FileOutputCommitter is too strong a limitation.
cc @yhuai
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is a too strong limitation, but my concern is that: there is possibility that a user-defined output formatter provides a direct output committer which we should reject. However, these is no simple way to see if a output committer is direct or not, so I just restrict it to FileOutputCommitter only here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jobFormat.getOutputCommitter() will return the output committer associated with the output format. Basically, in normal cases, you cannot specify output committer for mapreduce api (the new api). So, I think we should not make change at here. Also, jobCommitter.getClass != classOf[org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter] is a strong condition, which is not always the case. Every output format implemented with mapreduce API can have its own output committer (e.g. org.apache.parquet.hadoop.ParquetOutputCommitter).
Let's leave this part unchanged.
|
As commented above, I feel that there probably isn't a reliable way to disallow users to use customized output committer classes together with speculation at the level of Spark Core. Maybe we should just leave a big red alert in the documentation. |
|
Test build #42262 has finished for PR 8687 at commit
|
|
Yeah, I think a warming message in the log and mention it in the document is good. @cloud-fan Can you also take a look at the place that we save data to Hive? |
9987c86 to
e2bda91
Compare
|
Test build #42331 has finished for PR 8687 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should just log a warning at here, right?
|
Let's also update the title to reflect the change. |
|
Test build #42368 has finished for PR 8687 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about
val outputCommitterClass = hadoopConf.get("mapred.output.committer.class", "")
if (speculationEnabled && outputCommitterClass.contains("Direct")) {
val warningMessage =
s"$outputCommitterClass may be a output committer that writes data directly to the final location. " +
"Because speculation is enabled, this output committer may cause data loss (see the case in SPARK-10063). " +
"If possible, please use a output committer that does not have this behavior (e.g. FileOutputCommitter)."
logWarning(warningMessage)
}
db59c25 to
69b7d65
Compare
|
LGTM. Will merge to master once it passes jenkins. |
|
Test build #42429 has finished for PR 8687 at commit
|
|
Thanks. Merging to master. |
|
I test it locally with these code:
And the warning message do get logged. The hive write path should be similar. |
This is a follow-up of #8317.
When speculation is enabled, there may be multiply tasks writing to the same path. Generally it's OK as we will write to a temporary directory first and only one task can commit the temporary directory to target path.
However, when we use direct output committer, tasks will write data to target path directly without temporary directory. This causes problems like corrupted data. Please see PR comment for more details.
Unfortunately, we don't have a simple flag to tell if a output committer will write to temporary directory or not, so for safety, we have to disable any customized output committer when
speculationis true.