[SPARK-18394][SQL] Make an AttributeSet.toSeq output order consistent #18959
// We need to keep a deterministic output order for `baseSet` because this affects a variable
// order in generated code (e.g., `GenerateColumnAccessor`).
// See SPARK-18394 for details.
baseSet.map(_.a).toArray.sortBy { a => (a.name, a.exprId.id) }
If it needs to be a Seq, then should the toArray be toSeq? Maybe I missed why it has to be an array first.
Yeah, I initially did what you suggested. But I kept the original code because I was afraid the change might wrongly affect other callers. cc: @marmbrus
I thought map should always return a strict collection. I think it is safe to sort immediately after that.
OK, I'll fix it that way. Thanks!
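The ordering discussed in this thread can be sketched outside Spark as follows. This is a minimal illustration only: `Attr` and `DeterministicOrder` are hypothetical stand-ins, not Catalyst classes, and the sketch sorts a `Seq` directly rather than going through `toArray` as the diff does.

```scala
// Hypothetical stand-in for an attribute; NOT Spark's AttributeReference.
final case class Attr(name: String, exprId: Long)

object DeterministicOrder {
  // Sort by name first, then by expression id, using the same
  // (name, id) sort key that appears in the diff above.
  def toSortedSeq(attrs: Set[Attr]): Seq[Attr] =
    attrs.toSeq.sortBy(a => (a.name, a.exprId))
}
```

Because the result depends only on the sort key, two sets with the same elements yield the same sequence no matter how the set iterates internally.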
Test build #80725 has finished for PR 18959 at commit
In title and description,
oh..ya, my bad.... thanks.
Test build #80736 has finished for PR 18959 at commit
    assert(aSet == AttributeSet(aUpper :: Nil))
  }

  test("SPARK-18394 keep a deterministic output order along with attribute names") {
Modify this test case. Add a scenario in which the attribute set has two columns with the same name but different ids?
ok
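One possible shape for the scenario requested here (same column name, different ids) is sketched below. It is not the test that was added to `AttributeSetSuite`; `Col` and `SameNameOrderCheck` are hypothetical names, and the check simply verifies that shuffling the input never changes the sorted output.

```scala
import scala.util.Random

// Hypothetical stand-in for an attribute; NOT a Spark class.
final case class Col(name: String, id: Long)

object SameNameOrderCheck {
  // Order by name, breaking ties between equal names with the id.
  def sortedCols(cols: Iterable[Col]): Seq[Col] =
    cols.toSeq.sortBy(c => (c.name, c.id))

  // Returns true if every shuffled copy of the input sorts to the same sequence.
  def stableUnderShuffle(cols: Seq[Col], rounds: Int, seed: Long): Boolean = {
    val rng = new Random(seed)
    val expected = sortedCols(cols)
    (1 to rounds).forall(_ => sortedCols(rng.shuffle(cols)) == expected)
  }
}
```

With two columns named `x`, the id is what decides their relative position, which is the case the reviewer asks to cover.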
    assert(actualOutputColumns === expectedOutputColumns, "Output columns mismatch")
    assert(actualScannedColumns === expectedScannedColumns, "Scanned columns mismatch")
    assert(actualScannedColumns.sorted === expectedScannedColumns.sorted,
Could you add a comment to explain where we call AttributeSet.toSeq?
LGTM except two minor comments.
Jenkins, retest this please.
// Scanned columns in `HiveTableScanExec` are generated by the `pruneFilterProject` method
// in `SparkPlanner` that internally uses `AttributeSet.toSeq`.
// Since we change an output order of `AttributeSet.toSeq` in SPARK-18394,
// we need to sort column names for a test below.
How about?
Scanned columns in `HiveTableScanExec` are generated by the `pruneFilterProject` method in `SparkPlanner`. This method internally uses `AttributeSet.toSeq`, in which the returned output columns are sorted by the names and expression ids.
Looks good, I'll update soon.
Test build #80763 has finished for PR 18959 at commit
Test build #80770 has finished for PR 18959 at commit
Jenkins, retest this please.
Test build #80781 has finished for PR 18959 at commit
Merging to master. Thanks!
What changes were proposed in this pull request?
This PR sorts output attributes by their name and `exprId` in `AttributeSet.toSeq` to make the order consistent. If the order differs between runs, Spark may generate different code and then miss the cache in `CodeGenerator`; for example, `GenerateColumnAccessor` generates code that depends on the input attribute order.
How was this patch tested?
Added tests in `AttributeSetSuite` and manually checked that the cache worked well for the query given in the JIRA.
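To illustrate why attribute order matters for caching, here is a toy sketch that assumes a cache keyed on the generated source text. The names `CodeCacheSketch`, `generate`, and `CompileCache` are hypothetical and do not reflect Spark's actual `CodeGenerator` implementation.

```scala
import scala.collection.mutable

object CodeCacheSketch {
  // Toy "code generator": the emitted text depends on the order of the columns.
  def generate(columns: Seq[String]): String =
    columns.zipWithIndex.map { case (c, i) => s"accessor$i = $c;" }.mkString("\n")

  // Toy cache keyed by generated source; counts how often a "compilation" happens.
  final class CompileCache {
    private val cache = mutable.HashMap.empty[String, Int]
    var compilations: Int = 0

    def compile(source: String): Int =
      cache.getOrElseUpdate(source, { compilations += 1; source.hashCode })
  }
}
```

In this sketch, feeding the same columns in two different orders produces two distinct cache keys and therefore two compilations, while sorting the columns first collapses them into a single entry.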