[SPARK-28612][SQL] Add DataFrameWriterV2 API #25354
Conversation
|
@cloud-fan, please take a look at this PR with the new DSv2 write API, as discussed on the InsertInto thread. @brkyvz, @mccheah, @jzhuge, and @dongjoon-hyun, you may also be interested in reviewing. Thanks! |
|
Test build #108630 has finished for PR 25354 at commit
|
|
Test build #108632 has finished for PR 25354 at commit
|
|
Thank you for pinging me, @rdblue. Could you fix the doc generation issue in order to pass Jenkins? |
|
Test build #108674 has finished for PR 25354 at commit
|
}

@scala.annotation.varargs
override def partitionedBy(column: Column, columns: Column*): CreateTableWriter[T] = {
Is this name intended to be different from DataFrameWriter.partitionBy?
The intent is to match CREATE TABLE SQL, which uses PARTITIONED BY.
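For illustration, a hedged sketch of the parity being described; the table and column names are made up, and it assumes the transform functions added in this PR (years, days, ...) and org.apache.spark.sql.functions.col are in scope:

// SQL form this API mirrors:
//   CREATE TABLE prod.db.events (ts timestamp, data string) PARTITIONED BY (days(ts))
// DataFrameWriterV2 form of the same layout (hypothetical names):
df.writeTo("prod.db.events")
  .partitionedBy(days(col("ts")))
  .create()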
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/PartitionTransforms.scala
|
How do convert … BTW, a new JdbcCatalog CatalogPlugin type may store url and connectionProperties in catalog properties, making it easier for users to access JDBC tables. |
|
@jzhuge, options like connection properties and JDBC URL belong at the catalog level, not at the table level. Those are table-level configuration in v1 because there is no catalog that holds common options like the database connection URL. |
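A hedged sketch of what catalog-level configuration could look like, using the spark.sql.catalog.&lt;name&gt; convention for v2 catalog plugins; the plugin class and property keys below are illustrative, not an existing Spark catalog:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.catalog.mysql_cat", "com.example.JdbcCatalog")         // hypothetical plugin class
  .config("spark.sql.catalog.mysql_cat.url", "jdbc:mysql://host:3306/db")   // shared connection URL
  .config("spark.sql.catalog.mysql_cat.user", "app")                        // shared connection property
  .getOrCreate()

// Tables under that catalog can then be addressed by name, with no per-table URL:
// df.writeTo("mysql_cat.db.events").append()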
33dde6d to 409a0bc
|
@rdblue Catalog level properties sounds good to me. Also realized that JdbcCatalog plugin type is not needed if the only things we want are the provider common options. Each provider can pick up common options from catalog properties automatically, e.g.: |
|
Test build #108676 has finished for PR 25354 at commit
|
}

/**
 * Configuration methods common to create/replace operations and insert/overwrite operations.
Reading this, it sounds like there is another Writer for insert/overwrite extending WriteConfigMethods, like CreateTableWriter?
Yes. DataFrameWriterV2 and CreateTableWriter implement these methods. When a CreateTableWriter method is called, like partitionedBy, the result is always a CreateTableWriter and not a DataFrameWriterV2 so that append can't be called with unsupported options.
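A simplified, hedged sketch of the relationship being described (not the actual Spark source):

import org.apache.spark.sql.Column

trait WriteConfigMethods[R] {
  def option(key: String, value: String): R             // shared by create/replace and insert/overwrite
  def tableProperty(property: String, value: String): R
}

trait CreateTableWriter[T] extends WriteConfigMethods[CreateTableWriter[T]] {
  def partitionedBy(column: Column, columns: Column*): CreateTableWriter[T]
  def create(): Unit
  def replace(): Unit
}

// DataFrameWriterV2 also implements these config methods and adds append()/overwrite(),
// but once partitionedBy is called the caller only holds a CreateTableWriter, so
// append can't be invoked with create-only configuration.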
brkyvz
left a comment
Is it possible to use this API with paths or URIs instead of table names?
Something like:
writeTo.table("catalog.db.tbl")
writeTo.path("file:/tmp/abc") // or
writeTo.uri("jdbc:mysql://...")
 * @group partition_transforms
 * @since 3.0.0
 */
def years(e: Column): Column = withExpr { Years(e.expr) }
I'm worried that these are a single letter away from existing function names. With one small typo, you either get your function to be unevaluable, or return a runtime exception.
Can you think of a solution to this besides supporting year(ts) as though it were years? I'm concerned that would cause confusion for functions like hour that have different meanings.
Maybe we should catch these and throw more helpful exceptions instead?
can we hide them by a namespace?
object partitionBy {
def years: ...
def months: ...
}
So this would be partitionBy.years? I find partitionBy a little awkward. Is there a better name we could use? partitioner.years maybe?
I'm not too stuck up on the name. What do you think of the idea in general? Other names I can think of:
- partitioner
- logical
- transformer
- partitioning
- partition
could be like partition.byYears, partition.byDays ...
I think the names should match the SQL names, which are years, months, etc.
I think I like "partitioning" the best, since it qualifies the function. partitioning.years. Should I make this change?
I'm fine with that. Let's discuss at the sync in 30 mins and gather feedback. Wouldn't want you to waste work.
@scala.annotation.varargs
override def partitionedBy(column: Column, columns: Column*): CreateTableWriter[T] = {
  val asTransforms = (column +: columns).map(_.expr).map {
    case Years(attr: Attribute) =>
Can we not use the existing:
year
month
dayofmonth
hour
functions that already exist? The closeness of the function names worries me. I understand the separation of concerns, but it is something to consider. Maybe these were already discussed in the SPIP.
No, we can't.
First, those are concrete functions that have different meanings. hour(ts) is hour of day, not hourly partitions, and day of month is not a function you would partition on either.
Second, using those functions would not correspond to the transform names that are supported in SQL, which are years, months, days, and hours.
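To make the distinction concrete, a small hedged illustration; the table and column names are made up, and it assumes the hours transform from this PR is in scope alongside the existing functions:

import org.apache.spark.sql.functions.{col, hour}

// hour(...) is an existing datetime function: it extracts the hour of day (0-23) per row.
val hourOfDay = df.select(hour(col("ts")))

// hours(...) is a partition transform: it describes the table's layout and is only
// meaningful inside partitionedBy, matching SQL's PARTITIONED BY (hours(ts)).
df.writeTo("catalog.db.events")
  .partitionedBy(hours(col("ts")))
  .create()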
identifier,
partitioning.getOrElse(Seq.empty),
logicalPlan,
properties = provider.map(p => properties + ("provider" -> p)).getOrElse(properties).toMap,
what about location, if this is meant to be an external table
We can add a location(String) method to this that is translated to a location property. Does this need to be in the initial version?
Doesn't need to be in the initial version. We should however standardize on whether:
- option("path", ...) would/should show up as "location" as a table property
- or should it be set by tableProperty("location", ...)
- or tableProperty("path", ...)
I think it should be tableProperty("location") because that's what we've standardized on elsewhere.
In the DFWriter we have a config:
df.sparkSession.sessionState.conf.defaultDataSourceName
do you want to use that here, or do you think the catalog is free to create whatever datasource if the provider isn't available?
I don't think so. It is up to the catalog what to use for the provider when one isn't specified. This API doesn't need to do that -- it should be filled in by the v2 session catalog.
Can this be documented somewhere?
Yes, this will be in the v2 documentation because we have to explain how USING is passed.
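A hedged sketch of how the provider travels, based on the plan construction in the diff above; the using(...) setter and table name are assumptions for illustration:

// The writer does not choose a default source itself. If a provider is given, it is
// carried as a "provider" table property (and surfaced as USING in the docs); otherwise
// the v2 session catalog is free to fill in its own default.
df.writeTo("catalog.db.events")
  .using("parquet")   // assumed setter; ends up as properties("provider" -> "parquet")
  .create()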
@brkyvz, Yes, it is. Instead of passing names to The v2 identifiers SPIP includes how to represent a path-based table: When we have a spec for how path-based tables behave, we can update |
|
Test build #108788 has finished for PR 25354 at commit
|
|
Test build #108947 has finished for PR 25354 at commit
|
|
Retest this please. |
|
Test build #108986 has finished for PR 25354 at commit
|
|
Test build #109429 has finished for PR 25354 at commit
|
17fe1fd to fb0fafc
|
Test build #109446 has finished for PR 25354 at commit
|
/**
 * Expression for the v2 partition transform hours.
 */
case class Hours(child: Expression) extends PartitionTransformExpression {
Hi Ryan, maybe a naive question: why not support granularity down to minutes or seconds?
I've not seen an example of a table that requires partitioning down to minutes or seconds. I'm not opposed to adding them, but it seems to me that those would not be very useful and would probably get people that use them into trouble by over-partitioning.
Got it, thanks. I agree that if there aren't many use cases for minutes or seconds, we can ignore them.
case class OverwritePartitionsDynamic(
    table: NamedRelation,
    query: LogicalPlan,
    writeOptions: Map[String, String],
Just curious, is it coding style to have a boolean parameter as the last one, like isByName?
Here writeOptions is 2nd from last, and writeOptions is the last parameter in OverwritePartitionsDynamic.
Yes, it is for style. Boolean parameters should be passed by name, like isByName = false. Although you can pass positional parameters after a named parameter, the expectation is usually that named parameters are not necessarily in the correct position and can be omitted or reordered.
thanks Ryan 👍
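A tiny hedged example of the style point above; the case class is made up, not a Spark plan:

// Hypothetical plan-like case class: write options before the trailing boolean flag.
case class AppendExample(table: String, writeOptions: Map[String, String], isByName: Boolean)

// Call sites pass the boolean by name so the intent stays readable and the argument
// can later be defaulted or reordered without breaking callers.
val plan = AppendExample("db.t", Map.empty, isByName = false)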
 * @group basic
 * @since 3.0.0
 */
def writeTo(table: String): DataFrameWriterV2[T] = {
I see there is write for v1:
/**
* Interface for saving the content of the non-streaming Dataset out into external storage.
*
* @group basic
* @since 1.6.0
*/
def write: DataFrameWriter[T] = {
if (isStreaming) {
logicalPlan.failAnalysis(
"'write' can not be called on streaming Dataset/DataFrame")
}
new DataFrameWriter[T](this)
}
Why not name it writeV2 to be self-explanatory? Or overload write with a different return type, DataFrameWriterV2[T]?
We can't change the behavior of write because we don't want to break older jobs. And we need to pass the table name or path somewhere. I think this works, but if everyone prefers writeV2, we can rename it.
I like writeTo, since we're explaining exactly what we're writing to at that point.
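For comparison, a hedged sketch of how the two entry points read side by side (table name and format are illustrative):

// v1: configuration first, the target comes last in save()/saveAsTable()
df.write.format("parquet").mode("overwrite").saveAsTable("db.t")

// v2: the target is named up front, then the verb says what happens to it
df.writeTo("catalog.db.t").append()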
case lit @ Literal(_, IntegerType) =>
  Bucket(lit, e.expr)
case _ =>
  throw new AnalysisException(s"Invalid number of buckets: $numBuckets")
also add column information in exception msg for debugging, like s"Invalid number of buckets: $numBuckets, for column: $e"?
Fixed.
  Seq(Row(1L, "a"), Row(2L, "b"), Row(3L, "c"), Row(4L, "d"), Row(5L, "e"), Row(6L, "f")))
}

test("Append: by name not position")
Minor: is it better to make the test case name, like line 88, state the expected result? e.g. "fail if by name not position"
This tests that the validation is by name and not by position, so "failing if by name" would be incorrect. The failure tests that a name violation (can't find "data") is generated, even though the number of columns and column types match by position.
oh I see. no issues then. thanks!
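A hedged sketch of the scenario being tested; the table name, column names, and exact error are illustrative:

import org.apache.spark.sql.functions.{col, lit}

// Target table schema: (id bigint, data string). The incoming frame matches by position
// and type, but its second column is named "d" instead of "data", so the by-name
// validation should fail analysis rather than silently writing by position.
val byPositionOnly = spark.range(3).select(col("id"), lit("a").as("d"))
byPositionOnly.writeTo("testcat.table_name").append()   // expected to fail: cannot find "data"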
  Seq(Row(1L, "a"), Row(2L, "b"), Row(4L, "d"), Row(5L, "e"), Row(6L, "f")))
}

test("Overwrite: by name not position")
Same here: "fail if by name not position".
fb0fafc to 12097ec
|
Test build #109505 has finished for PR 25354 at commit
|
|
Test build #109506 has finished for PR 25354 at commit
|
brkyvz
left a comment
LGTM, but I still have worries about the partitioning expressions. I'm also fine with throwing better error messages, but curious what you think around namespacing.
// TODO: streaming could be adapted to use this interface
if (isStreaming) {
  logicalPlan.failAnalysis(
    "'writeTo' can not be called on streaming Dataset/DataFrame")
may be good to add: use 'writeStream' instead
I didn't include this because I'd rather not have the v2 API recommend using the v1 API. That seems confusing to me.
/**
 * Create a new table from the contents of the data frame.
 *
 * The new table's schema, partition layout, properties, and other configuration will be
The schema will be assumed to be nullable though
  this
}

override def tableProperty(property: String, value: String): DataFrameWriterV2[T] = {
should return a CreateTableWriter?
+1
|
Test build #109591 has finished for PR 25354 at commit
|
eff659a to e424c2c
|
@brkyvz, I've removed the |
|
Test build #109963 has finished for PR 25354 at commit
|
|
Test build #109967 has finished for PR 25354 at commit
|
|
Test build #109982 has finished for PR 25354 at commit
|
|
LGTM. Merging to master. Thanks @rdblue |
|
|
// turn off style check that object names must start with a capital letter
// scalastyle:off
object partitioning {
This seems to break the Maven build:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/6803/console
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7-ubuntu-testing/1712/consoleFull
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2/289/consoleFull
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-jdk-11/357/consoleFull
[ERROR] [Error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-testing/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:32: cannot access org.apache.spark.sql.functions.1
class file for org.apache.spark.sql.functions$1 not found
[ERROR] [Error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-testing/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:70: cannot find symbol
symbol: method avg(java.lang.String)
location: class org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] [Error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-testing/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:81: cannot find symbol
symbol: method col(java.lang.String)
location: class org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] [Error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-testing/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:89: cannot find symbol
symbol: method col(java.lang.String)
location: class org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] [Error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-testing/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:90: cannot find symbol
symbol: method col(java.lang.String)
location: class org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] [Error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-testing/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:91: cannot find symbol
symbol: method col(java.lang.String)
location: class org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] [Error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-testing/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:92: cannot find symbol
symbol: method col(java.lang.String)
location: class org.apache.spark.sql.hive.JavaDataFrameSuite
After this commit, the GitHub workflow also seems to be causing build failures:
|
|
// turn off style check that object names must start with a capital letter
// scalastyle:off
object partitioning {
Does this have a corresponding Java test? This file should be compatible with the Java side as well. I doubt a nested object works in the Java API.
This is a good point. We also have nested objects in SQLConf and it works fine. Maybe Java compatibility is the problem here.
SQLConf is internal. I don't think it's supposed to be exposed.
|
Uh oh, can you please revert?
|
|
Sure, thanks @brkyvz for confirming quickly! |
|
Also, locally verified that the Maven build now works fine after reverting this one. |
|
@rdblue sorry I had to revert this. Can you open a PR again with some fixes? |
 * @group partition_transforms
 * @since 3.0.0
 */
def bucket(numBuckets: Column, e: Column): Column = withExpr {
why do we need this overload if only literal is allowed as numBuckets?
Did we revert this? I am seeing a compilation error in my local env while compiling the hive module, like the following: |
|
Yes, @dilipbiswal. We reverted this 15 hours ago. Did you try to update your local environment?
|
|
@dongjoon-hyun My bad, sorry. Yeah, I see the revert now that I've updated the local env. |

What changes were proposed in this pull request?
This adds a new write API as proposed in the SPIP to standardize logical plans. This new API:
- Adds append, overwrite, create, and replace methods that correspond to the new logical plans.
- partitionedBy can only be called when the writer executes create or replace.

Here are a few example uses of the new API:
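As a hedged sketch (catalog, table, and column names are made up, exact signatures may differ, and it assumes the partition transform functions from this PR plus functions col/lit are in scope), calls of this shape are what the new API supports:

// create (CTAS) with a partition transform
df.writeTo("catalog.db.events").partitionedBy(days(col("ts"))).create()

// append to an existing table, validated by name
df.writeTo("catalog.db.events").append()

// overwrite rows matching a condition
df.writeTo("catalog.db.events").overwrite(col("ts") < lit("2019-09-01"))

// replace the table using the data frame's contents, carrying a table property
df.writeTo("catalog.db.events").tableProperty("location", "/tmp/events").replace()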
How was this patch tested?
Added DataFrameWriterV2Suite that tests the new write API. Existing tests for v2 plans.