Skip to content

Conversation

@dilipbiswal
Copy link
Contributor

What changes were proposed in this pull request?

This is a alternate way to convert SQL from analyzed logical plans containing Generate operator.
In this PR , generators in projection list are expressed as LATERAL VIEW.

Sample Plan :

GlobalLimit 3
+- LocalLimit 3
   +- Project [gencol2#204]
      +- Generate explode(gencol1#203), true, false, Some(gentab2), [gencol2#204]
         +- Generate explode(array(array(1, 2, 3))), true, false, Some(gentab1), [gencol1#203]
            +- MetastoreRelation default, t4, None

Generated Query:

SELECT `gentab2`.`gencol2` FROM `default`.`t4` LATERAL VIEW explode(array(array(1, 2, 3))) `gentab1` AS `gencol1` LATERAL VIEW explode(`gentab1`.`gencol1`) `gentab2` AS `gencol2` LIMIT 3

How was this patch tested?

Tests added to LogicalPlanToSQLSuite

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

@dilipbiswal
Copy link
Contributor Author

@rxin @cloud-fan @liancheng @gatorsmile Hi Reynold and Wenchen,
I have made change to generate LATERAL view for all generators specified in projection list as well as in from clause. I am still testing it and adding more tests. Please let me know what you think..

@cloud-fan
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented Mar 9, 2016

Test build #52719 has finished for PR 11596 at commit e10883a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

hi @dilipbiswal , thanks for working on this! Some high-level questions:

  1. Can any Generate operator be converted to a LATERAL VIEW? Does the plan tree have to match some kind of pattern?
  2. Can any Generate operator be converted to a "SELECT ...udtf..." format?

Generally speaking, we should use one unified format for Generator's SQL string if possible, whatever the original SQL string is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add an assert here.

assert(join == true || plan.projectList == 1)

@dilipbiswal
Copy link
Contributor Author

@cloud-fan Thank you.

  1. Can any Generate operator be converted to a LATERAL VIEW? Does the plan tree have to match some kind of pattern?

Yes, based on the tests i have tried. I am testing this more to see if i uncover any issue. Now i am converting

select explode(array(1,2,3)) from t1

to

select tablequalifier.colname from t1 LATERAL VIEW explode(array(1,2,3)) tablequalifier as colname
  1. Can any Generate operator be converted to a "SELECT ...udtf..." format?

No. As we have a restriction of allowing one generate in a projection list. Thats why we try to
use the LATERAL VIEW clause.

@cloud-fan
Copy link
Contributor

But a Generate operator only has one generator, so I think we can convert any kind of Generate to SELECT?

@dilipbiswal
Copy link
Contributor Author

@cloud-fan Yeah.. that should be possible, You are thinking to represent each Generate as a sub select clause ?

@cloud-fan
Copy link
Contributor

I'm not sure which one is better, it depends on which one is easier to understand and implement. I'm ok to generate verbose SQL string, but the generator itself should be as simple and robust as possible.

@SparkQA
Copy link

SparkQA commented Mar 9, 2016

Test build #52725 has finished for PR 11596 at commit d7f0207.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Copy link
Contributor Author

@cloud-fan Thanks !! Actually i remember exploring this option. There is a generate syntax for outer lateral view like following -

SELECT * FROM src LATERAL VIEW OUTER explode(array()) C AS a limit 10;

I thought if we converted this generate operator as a sub select where the generator is in projection
list , we will loose the outer semantics. Thats why i felt safer to generate as a LATERAL VIEW
clause.

@cloud-fan
Copy link
Contributor

ah makes sense, then we should always generate a LATERAL VIEW format SQL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just do plan.generatorOutput.map(_.sql).mkString(", ")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Thx

ghost pushed a commit to dbtsai/spark that referenced this pull request Mar 9, 2016
#### What changes were proposed in this pull request?

As shown in another PR: apache#11596, we are using `SELECT 1` as a dummy table, when the table is used for SQL statements in which a table reference is required, but the contents of the table are not important. For example,

```SQL
SELECT value FROM (select 1) dummyTable Lateral View explode(array(1,2,3)) adTable as value
```
Before the PR, the optimized plan contains a useless `Project` after Optimizer executing the `ColumnPruning` rule, as shown below:

```
== Analyzed Logical Plan ==
value: int
Project [value#22]
+- Generate explode(array(1, 2, 3)), true, false, Some(adtable), [value#22]
   +- SubqueryAlias dummyTable
      +- Project [1 AS 1#21]
         +- OneRowRelation$

== Optimized Logical Plan ==
Generate explode([1,2,3]), false, false, Some(adtable), [value#22]
+- Project
   +- OneRowRelation$
```

After the fix, the optimized plan removed the useless `Project`, as shown below:
```
== Optimized Logical Plan ==
Generate explode([1,2,3]), false, false, Some(adtable), [value#22]
+- OneRowRelation$
```

This PR is to remove `Project` when its Child's output is Nil

#### How was this patch tested?

Added a new unit test case into the suite `ColumnPruningSuite.scala`

Author: gatorsmile <[email protected]>

Closes apache#11599 from gatorsmile/projectOneRowRelation.
@SparkQA
Copy link

SparkQA commented Mar 11, 2016

Test build #52915 has finished for PR 11596 at commit bea871f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Copy link
Contributor Author

@cloud-fan Can you please help trigger a retest ? This does not seem related to my changes.

@gatorsmile
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Mar 11, 2016

Test build #52930 has finished for PR 11596 at commit bea871f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

Hi @dilipbiswal , sorry that recently I made a major refactor to SQLBuilder, and it changed some assumptions to the Generate SQL generation, so I took over your PR and send a new one with some simplification: #11696, do you mind taking a look?

@dilipbiswal
Copy link
Contributor Author

Closing this in favor of
#11696

asfgit pushed a commit that referenced this pull request Mar 17, 2016
## What changes were proposed in this pull request?

This PR adds SQL generation support for `Generate` operator. It always converts `Generate` operator into `LATERAL VIEW` format as there are many limitations to put UDTF in project list.

This PR is based on #11658, please see the last commit to review the real changes.

Thanks dilipbiswal for his initial work! Takes over #11596

## How was this patch tested?

new tests in `LogicalPlanToSQLSuite`

Author: Wenchen Fan <[email protected]>

Closes #11696 from cloud-fan/generate.
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
#### What changes were proposed in this pull request?

As shown in another PR: apache#11596, we are using `SELECT 1` as a dummy table, when the table is used for SQL statements in which a table reference is required, but the contents of the table are not important. For example,

```SQL
SELECT value FROM (select 1) dummyTable Lateral View explode(array(1,2,3)) adTable as value
```
Before the PR, the optimized plan contains a useless `Project` after Optimizer executing the `ColumnPruning` rule, as shown below:

```
== Analyzed Logical Plan ==
value: int
Project [value#22]
+- Generate explode(array(1, 2, 3)), true, false, Some(adtable), [value#22]
   +- SubqueryAlias dummyTable
      +- Project [1 AS 1#21]
         +- OneRowRelation$

== Optimized Logical Plan ==
Generate explode([1,2,3]), false, false, Some(adtable), [value#22]
+- Project
   +- OneRowRelation$
```

After the fix, the optimized plan removed the useless `Project`, as shown below:
```
== Optimized Logical Plan ==
Generate explode([1,2,3]), false, false, Some(adtable), [value#22]
+- OneRowRelation$
```

This PR is to remove `Project` when its Child's output is Nil

#### How was this patch tested?

Added a new unit test case into the suite `ColumnPruningSuite.scala`

Author: gatorsmile <[email protected]>

Closes apache#11599 from gatorsmile/projectOneRowRelation.
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
## What changes were proposed in this pull request?

This PR adds SQL generation support for `Generate` operator. It always converts `Generate` operator into `LATERAL VIEW` format as there are many limitations to put UDTF in project list.

This PR is based on apache#11658, please see the last commit to review the real changes.

Thanks dilipbiswal for his initial work! Takes over apache#11596

## How was this patch tested?

new tests in `LogicalPlanToSQLSuite`

Author: Wenchen Fan <[email protected]>

Closes apache#11696 from cloud-fan/generate.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants