Conversation

@maropu
Member

@maropu maropu commented Aug 16, 2017

What changes were proposed in this pull request?

This PR sorts the output attributes of AttributeSet.toSeq by name and exprId so that the order is consistent. If the order varies, Spark may generate different code and then miss the cache in CodeGenerator; for example, GenerateColumnAccessor generates code that depends on the order of its input attributes.
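
To illustrate the failure mode described above, here is a minimal, Spark-free sketch. `Attr`, `generate`, and the string-keyed cache are hypothetical stand-ins, not Catalyst's `Attribute` or the real `CodeGenerator` cache; the only thing taken from this PR is the sort key (name, then expression id).

```scala
// Hypothetical stand-in for a Catalyst attribute: a name plus an expression id.
final case class Attr(name: String, exprId: Long)

object DeterministicOrderDemo {
  // Sort by name first and break ties by expression id, as the PR description states.
  def sortedAttrs(attrs: Set[Attr]): Seq[Attr] =
    attrs.toSeq.sortBy(a => (a.name, a.exprId))

  // Toy "code generator": the emitted source text depends on the attribute order.
  def generate(attrs: Seq[Attr]): String =
    attrs.zipWithIndex
      .map { case (a, i) => s"val v$i = read(${a.name}#${a.exprId});" }
      .mkString("\n")

  def main(args: Array[String]): Unit = {
    // Toy compile cache keyed by the generated source text (an assumption for this sketch).
    val cache = scala.collection.mutable.Map.empty[String, Int]
    var compilations = 0
    def compile(code: String): Int =
      cache.getOrElseUpdate(code, { compilations += 1; compilations })

    val attrs = Set(Attr("b", 2L), Attr("a", 1L), Attr("c", 3L))
    // The same attributes, built in a different insertion order.
    val sameAttrs = attrs.toSeq.reverse.toSet

    compile(generate(sortedAttrs(attrs)))
    compile(generate(sortedAttrs(sameAttrs)))
    println(s"compilations = $compilations") // prints "compilations = 1"
  }
}
```

Because both calls see the same sorted order, the toy generator produces identical source text and the second call is a cache hit.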

How was this patch tested?

Added tests in AttributeSetSuite and manually checked that the cache worked for the query given in the JIRA.

// We need to keep a deterministic output order for `baseSet` because this affects a variable
// order in generated code (e.g., `GenerateColumnAccessor`).
// See SPARK-18394 for details.
baseSet.map(_.a).toArray.sortBy { a => (a.name, a.exprId.id) }
Member

If it needs to be a Seq, then should the toArray be toSeq? Maybe I missed why it has to be an array first.

Member Author

Yeah, I initially did what you suggest. But I kept the original code because I was afraid this change might wrongly affect other code. cc: @marmbrus

Contributor

@hvanhovell hvanhovell Aug 16, 2017

I thought map should always return a strict collection. I think it is safe to sort immediately after that.
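
As a small illustration of the strictness point, the sketch below uses only standard Scala collections, not Spark's `AttributeSet`; the object and method names are hypothetical. It counts how often the mapping function has run before anything consumes the result, and it shows that a mapped `Set` still needs a conversion to a sequence because `sortBy` is defined on sequences.

```scala
object StrictMapDemo {
  // `map` on a strict Set runs the function for every element immediately.
  def callsAfterStrictMap(): Int = {
    var calls = 0
    val mapped = Set(3, 1, 2).map { x => calls += 1; x * 10 }
    calls
  }

  // `map` on a view is deferred until the result is traversed.
  def callsAfterViewMap(): Int = {
    var calls = 0
    val mapped = Set(3, 1, 2).view.map { x => calls += 1; x * 10 }
    calls
  }

  // A Set has no `sortBy`, so convert to a Seq right after the strict `map`.
  def sortedTimesTen(source: Set[Int]): Seq[Int] =
    source.map(_ * 10).toSeq.sortBy(identity)

  def main(args: Array[String]): Unit = {
    println(callsAfterStrictMap())          // prints 3
    println(callsAfterViewMap())            // prints 0
    println(sortedTimesTen(Set(3, 1, 2)))   // prints List(10, 20, 30)
  }
}
```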

Member Author

@maropu maropu Aug 16, 2017

OK, I'll fix it that way. Thanks!

@SparkQA

SparkQA commented Aug 16, 2017

Test build #80725 has finished for PR 18959 at commit 3201f0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Aug 16, 2017

In the title and description, should Attribute.toSeq be AttributeSet.toSeq?

@maropu
Member Author

maropu commented Aug 16, 2017

oh..ya, my bad.... thanks.

@maropu maropu changed the title from "[SPARK-18394][SQL] Make an Attribute.toSeq output order consistent" to "[SPARK-18394][SQL] Make an AttributeSet.toSeq output order consistent" on Aug 16, 2017
@SparkQA

SparkQA commented Aug 16, 2017

Test build #80736 has finished for PR 18959 at commit eba844e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(aSet == AttributeSet(aUpper :: Nil))
}

test("SPARK-18394 keep a deterministic output order along with attribute names") {
Member

@gatorsmile gatorsmile Aug 16, 2017

Modify this test case. Add a scenario in which the attribute set has two columns with the same name but different ids?
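
A possible shape for such a scenario, sketched without Spark: `Col` is a hypothetical stand-in for an attribute, and this is not the test that was added to AttributeSetSuite. Scala's `sortBy` is stable, so sorting by name alone keeps same-named columns in their input order, which is exactly what can vary; adding the id as a tie-breaker removes that dependence.

```scala
// Hypothetical stand-in for an attribute with a name and an expression id.
final case class Col(name: String, exprId: Long)

object TieBreakDemo {
  // Sorting by name only: ties keep their input order (stable sort).
  def byNameOnly(cols: Seq[Col]): Seq[Col] = cols.sortBy(_.name)

  // Sorting by (name, id): ties are resolved by the expression id.
  def byNameAndId(cols: Seq[Col]): Seq[Col] = cols.sortBy(c => (c.name, c.exprId))

  def main(args: Array[String]): Unit = {
    // Two columns share the name "a" but have different ids.
    val first = Seq(Col("a", 7L), Col("b", 1L), Col("a", 2L))
    val second = first.reverse

    println(byNameOnly(first) == byNameOnly(second))   // prints false
    println(byNameAndId(first) == byNameAndId(second)) // prints true
  }
}
```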

Member Author

ok


assert(actualOutputColumns === expectedOutputColumns, "Output columns mismatch")
assert(actualScannedColumns === expectedScannedColumns, "Scanned columns mismatch")
assert(actualScannedColumns.sorted === expectedScannedColumns.sorted,
Member

Could you add a comment to explain where we call AttributeSet.toSeq?

@gatorsmile
Member

LGTM except for two minor comments.

@maropu
Member Author

maropu commented Aug 17, 2017

Jenkins, retest this please.

// Scanned columns in `HiveTableScanExec` are generated by the `pruneFilterProject` method
// in `SparkPlanner` that internally uses `AttributeSet.toSeq`.
// Since we change an output order of `AttributeSet.toSeq` in SPARK-18394,
// we need to sort column names for a test below.
Member

How about?

Scanned columns in HiveTableScanExec are generated by the pruneFilterProject method in SparkPlanner. This method internally uses AttributeSet.toSeq, in which the returned output columns are sorted by the names and expression ids.
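
The consequence for the test can be sketched as follows. This is an illustration with plain strings and a hypothetical helper name, not the Hive suite's code: once the producer returns columns in name order, a test whose expected list was written in some other order has to compare sorted copies of both sides.

```scala
object ColumnOrderAssertionDemo {
  // Order-insensitive comparison: sort both sides before checking equality.
  def sameColumnsIgnoringOrder(actual: Seq[String], expected: Seq[String]): Boolean =
    actual.sorted == expected.sorted

  def main(args: Array[String]): Unit = {
    val expected = Seq("value", "key") // order as written in a hypothetical test
    val actual = Seq("key", "value")   // order after sorting by name

    println(actual == expected)                         // prints false
    println(sameColumnsIgnoringOrder(actual, expected)) // prints true
  }
}
```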

Member Author

Looks good, I'll update soon.

@SparkQA

SparkQA commented Aug 17, 2017

Test build #80763 has finished for PR 18959 at commit b33fde8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2017

Test build #80770 has finished for PR 18959 at commit 973402b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Aug 17, 2017

Jenkins, retest this please.

@SparkQA

SparkQA commented Aug 17, 2017

Test build #80781 has finished for PR 18959 at commit 973402b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

Merging to master. Thanks!

@asfgit asfgit closed this in 6aad02d Aug 17, 2017

6 participants