Conversation

@rdblue (Contributor) commented Aug 4, 2019

What changes were proposed in this pull request?

This adds a new write API as proposed in the SPIP to standardize logical plans. This new API:

  • Uses clear verbs to execute writes (append, overwrite, create, and replace) that correspond to the new logical plans.
  • Only creates v2 logical plans so the behavior is always consistent.
  • Does not allow table configuration options for operations that cannot change table configuration. For example, partitionedBy can only be called when the writer executes create or replace.

Here are a few example uses of the new API:

df.writeTo("catalog.db.table").append()
df.writeTo("catalog.db.table").overwrite($"date" === "2019-06-01")
df.writeTo("catalog.db.table").overwritePartitions()
df.writeTo("catalog.db.table").asParquet.create()
df.writeTo("catalog.db.table").partitionedBy(days($"ts")).createOrReplace()
df.writeTo("catalog.db.table").using("abc").replace()

How was this patch tested?

Added DataFrameWriterV2Suite, which tests the new write API. Existing tests cover the v2 plans.

@rdblue (Author) commented Aug 4, 2019

@cloud-fan, please take a look at this PR with the new DSv2 write API, as discussed on the InsertInto thread.

@brkyvz, @mccheah, @jzhuge, and @dongjoon-hyun, you may also be interested in reviewing. Thanks!

@SparkQA commented Aug 4, 2019

Test build #108630 has finished for PR 25354 at commit 10c7b8f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class PartitionTransformExpression extends Expression with Unevaluable
  • case class Years(child: Expression) extends PartitionTransformExpression
  • case class Months(child: Expression) extends PartitionTransformExpression
  • case class Days(child: Expression) extends PartitionTransformExpression
  • case class Hours(child: Expression) extends PartitionTransformExpression
  • case class Bucket(numBuckets: Literal, child: Expression) extends PartitionTransformExpression
  • implicit class OptionsHelper(options: Map[String, String])
  • trait WriteConfigMethods[R]
  • trait CreateTableWriter[T] extends WriteConfigMethods[CreateTableWriter[T]]

@SparkQA commented Aug 4, 2019

Test build #108632 has finished for PR 25354 at commit 23a1188.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented:

Thank you for pinging me, @rdblue. Could you fix the doc generation issue so that the Jenkins build passes?

@SparkQA commented Aug 5, 2019

Test build #108674 has finished for PR 25354 at commit 33dde6d.

  • This patch fails to generate documentation.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

}

@scala.annotation.varargs
override def partitionedBy(column: Column, columns: Column*): CreateTableWriter[T] = {
Member

Is this name intended to be different from DataFrameWriter.partitionBy?

Contributor Author

The intent is to match CREATE TABLE SQL, which uses PARTITIONED BY.
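
For illustration, a minimal sketch of that correspondence (catalog, table, and column names are hypothetical, and this assumes the transform functions land in org.apache.spark.sql.functions as this PR proposes):

import org.apache.spark.sql.functions.days
import spark.implicits._  // provides the $"col" column syntax

// SQL counterpart: CREATE TABLE catalog.db.table ... PARTITIONED BY (days(ts))
df.writeTo("catalog.db.table")
  .partitionedBy(days($"ts"))
  .create()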

@jzhuge (Member) commented Aug 5, 2019

How do we convert DSW.jdbc(url: String, table: String, connectionProperties: Properties) to DSWv2?
Maybe writeTo(table).asJdbc(url, connectionProperties).<DSWv2_Action>?
Or set url and connectionProperties in options with a simple asJdbc()?

BTW, a new JdbcCatalog CatalogPlugin type could store url and connectionProperties in catalog properties, making it easier for users to access JDBC tables.

@rdblue (Author) commented Aug 5, 2019

@jzhuge, options like connection properties and JDBC URL belong at the catalog level, not at the table level. Those are table-level configuration in v1 because there is no catalog that holds common options like the database connection URL.

rdblue force-pushed the SPARK-28612-add-data-frame-writer-v2 branch from 33dde6d to 409a0bc on August 5, 2019 17:33
@jzhuge (Member) commented Aug 5, 2019

@rdblue Catalog level properties sounds good to me.

Also, I realized that a JdbcCatalog plugin type is not needed if all we want are the common provider options. Each provider could pick up common options from catalog properties automatically, e.g.:

spark.sql.catalog.<name>.provider.<provider_name>.<provider_common_options>
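
As a concrete sketch of that idea (the catalog name, implementation class, and option keys below are hypothetical and only illustrate the proposal, not an existing API):

// Hypothetical catalog configuration; only the spark.sql.catalog.<name> plugin pattern exists today.
spark.conf.set("spark.sql.catalog.jdbc_cat", "com.example.JdbcCatalog")
spark.conf.set("spark.sql.catalog.jdbc_cat.url", "jdbc:mysql://host:3306/db")
spark.conf.set("spark.sql.catalog.jdbc_cat.connection.user", "spark")

// Tables in that catalog could then be written without per-call JDBC options:
df.writeTo("jdbc_cat.db.table").append()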

@SparkQA commented Aug 5, 2019

Test build #108676 has finished for PR 25354 at commit 409a0bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/**
* Configuration methods common to create/replace operations and insert/overwrite operations.
Member

Reading this, it sounds like there is another writer for insert/overwrite that extends WriteConfigMethods, like CreateTableWriter?

Contributor Author

Yes. DataFrameWriterV2 and CreateTableWriter implement these methods. When a CreateTableWriter method is called, like partitionedBy, the result is always a CreateTableWriter and not a DataFrameWriterV2 so that append can't be called with unsupported options.
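
A minimal sketch of how that plays out at a call site (table and column names are hypothetical; the methods are the ones added in this PR):

// partitionedBy narrows the result to CreateTableWriter[T], so only the
// create/replace verbs are reachable afterwards; append() is not:
df.writeTo("catalog.db.events")
  .partitionedBy(days($"ts"))
  .createOrReplace()

// append() is only available before any create-only configuration is applied:
df.writeTo("catalog.db.events").append()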

@brkyvz (Contributor) left a comment

Is it possible to use this API with paths or URIs instead of table names?
Something like:

writeTo.table("catalog.db.tbl")
writeTo.path("file:/tmp/abc") // or
writeTo.uri("jdbc:mysql://...")

* @group partition_transforms
* @since 3.0.0
*/
def years(e: Column): Column = withExpr { Years(e.expr) }
Contributor

I'm worried that these are a single letter away from existing function names. With one small typo, your function either becomes unevaluable or throws a runtime exception.

Contributor Author

Can you think of a solution to this besides supporting year(ts) as though it were years? I'm concerned that would cause confusion for functions like hour that have different meanings.

Maybe we should catch these and throw more helpful exceptions instead?

Contributor

can we hide them by a namespace?

object partitionBy {
  def years: ...
  def months: ...
}

Contributor Author

So this would be partitionBy.years? I find partitionBy a little awkward. Is there a better name we could use? partitioner.years maybe?

Contributor

I'm not too hung up on the name. What do you think of the idea in general? Other names I can think of:

  • partitioner
  • logical
  • transformer
  • partitioning
  • partition
    could be like partition.byYears, partition.byDays...

Contributor Author

I think the names should match the SQL names, which are years, months, etc.

I think I like "partitioning" the best, since it qualifies the function. partitioning.years. Should I make this change?

Contributor

I'm fine with that. Let's discuss at the sync in 30 mins and gather feedback. Wouldn't want you to waste work.
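
For reference, a rough sketch of what the namespacing under discussion might look like from user code (assuming the transforms end up nested under a partitioning object in functions; this is the option being debated, not a settled API):

import org.apache.spark.sql.functions.partitioning

df.writeTo("catalog.db.table")
  .partitionedBy(partitioning.years($"ts"))
  .create()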

@scala.annotation.varargs
override def partitionedBy(column: Column, columns: Column*): CreateTableWriter[T] = {
val asTransforms = (column +: columns).map(_.expr).map {
case Years(attr: Attribute) =>
Contributor

Can we not use the existing functions:

year
month
dayofmonth
hour

The closeness of the function names worries me. I understand the separation of concerns, but it's something to consider. Maybe this was already discussed in the SPIP.

Contributor Author

No, we can't.

First, those are concrete functions that have different meanings. hour(ts) is hour of day, not hourly partitions, and day of month is not a function you would partition on either.

Second, using those functions would not correspond to the transform names that are supported in SQL, which are years, months, days, and hours.
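
For context, the SQL side uses those transform names directly; a sketch, assuming a v2 catalog that supports transforms (catalog, table, and column names are made up):

// DDL with a v2 transform in PARTITIONED BY:
spark.sql("""
  CREATE TABLE catalog.db.events (ts TIMESTAMP, data STRING)
  USING parquet
  PARTITIONED BY (days(ts))
""")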

identifier,
partitioning.getOrElse(Seq.empty),
logicalPlan,
properties = provider.map(p => properties + ("provider" -> p)).getOrElse(properties).toMap,
Contributor

What about location, if this is meant to be an external table?

Contributor Author

We can add a location(String) method to this that is translated to a location property. Does this need to be in the initial version?

Contributor

It doesn't need to be in the initial version. We should, however, standardize on whether:

  • option("path", ...) should show up as a "location" table property,
  • it should be set by tableProperty("location", ...),
  • or by tableProperty("path", ...)

Contributor Author

I think it should be tableProperty("location") because that's what we've standardized on elsewhere.
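
A small sketch of that convention with this API (the property value is illustrative, and it assumes tableProperty remains available on the create path):

df.writeTo("catalog.db.table")
  .tableProperty("location", "s3://bucket/path/to/table")
  .using("parquet")
  .create()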

Contributor

In the DFWriter we have a config:
df.sparkSession.sessionState.conf.defaultDataSourceName
Do you want to use that here, or do you think the catalog is free to create whatever data source it wants if the provider isn't specified?

Contributor Author

I don't think so. It is up to the catalog what to use for the provider when one isn't specified. This API doesn't need to do that -- it should be filled in by the v2 session catalog.

Contributor

Can this be documented somewhere?

Contributor Author

Yes, this will be in the v2 documentation because we have to explain how USING is passed.

@rdblue (Author) commented Aug 7, 2019

Is it possible to use this API with paths or URIs instead of table names?

@brkyvz, Yes, it is. Instead of passing names to writeTo, you would pass a path or URI:

df.writeTo("s3://bucket/path/to/table").append()

The v2 identifiers SPIP describes how to represent a path-based table: a PathIdentifier(path, name) that extends Identifier. The namespace of a PathIdentifier is a single level and contains a fully qualified path.

When we have a spec for how path-based tables behave, we can update writeTo to catch paths and pass PathIdentifier.

@SparkQA commented Aug 8, 2019

Test build #108788 has finished for PR 25354 at commit 4538721.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 12, 2019

Test build #108947 has finished for PR 25354 at commit 4538721.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Author) commented Aug 12, 2019

Retest this please.

@SparkQA commented Aug 12, 2019

Test build #108986 has finished for PR 25354 at commit 4538721.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 20, 2019

Test build #109429 has finished for PR 25354 at commit 17fe1fd.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class WorkerExecutorStateResponse(
  • case class WorkerDriverStateResponse(
  • case class WorkerSchedulerStateResponse(
  • case class LaunchDriver(
  • case class StandaloneResourceAllocation(pid: Int, allocations: Seq[ResourceAllocation])
  • abstract class FileCommitProtocol extends Logging
  • class ResourceAllocator(name: String, addresses: Seq[String]) extends Serializable
  • class RpcAbortException(message: String) extends Exception(message)
  • case class CatalystDataToAvro(
  • class BindingParquetOutputCommitter(
  • class PathOutputCommitProtocol(
  • class DecisionTreeParams(Params):
  • case class UnresolvedTable(v1Table: CatalogTable) extends Table
  • implicit class IdentifierHelper(identifier: TableIdentifier)
  • class CatalogManager(conf: SQLConf) extends Logging
  • sealed trait RewritableTransform extends Transform
  • case class Milliseconds(child: Expression, timeZoneId: Option[String] = None)
  • case class Microseconds(child: Expression, timeZoneId: Option[String] = None)
  • case class IsoYear(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
  • case class Millennium(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
  • case class Century(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
  • case class Decade(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
  • case class Epoch(child: Expression, timeZoneId: Option[String] = None)
  • case class ArrayForAll(
  • case class DescribeTable(table: NamedRelation, isExtended: Boolean) extends Command
  • case class DeleteFromTable(
  • trait V2CreateTablePlan extends LogicalPlan
  • case class DeleteFromStatement(
  • case class DescribeColumnStatement(
  • case class DescribeTableStatement(
  • case class ColumnarToRowExec(child: SparkPlan) extends UnaryExecNode with CodegenSupport
  • case class InputAdapter(child: SparkPlan) extends UnaryExecNode with InputRDDCodegen
  • case class InsertAdaptiveSparkPlan(
  • case class ReuseAdaptiveSubquery(
  • class DetectAmbiguousSelfJoin(conf: SQLConf) extends Rule[LogicalPlan]
  • case class DeleteFromTableExec(
  • case class DescribeTableExec(table: Table, isExtended: Boolean) extends LeafExecNode

rdblue force-pushed the SPARK-28612-add-data-frame-writer-v2 branch from 17fe1fd to fb0fafc on August 20, 2019 23:27
@SparkQA commented Aug 21, 2019

Test build #109446 has finished for PR 25354 at commit fb0fafc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DataFrameWriterV2Suite extends QueryTest with SharedSparkSession with BeforeAndAfter

/**
* Expression for the v2 partition transform hours.
*/
case class Hours(child: Expression) extends PartitionTransformExpression {
Contributor

Hi Ryan, maybe a naive question: why not support granularity down to minutes or seconds?

Contributor Author

I've not seen an example of a table that requires partitioning down to minutes or seconds. I'm not opposed to adding them, but it seems to me that they would not be very useful and would probably get people who use them into trouble by over-partitioning.

Contributor

Got it, thanks. Agreed that if there aren't many use cases for minutes or seconds, we can skip them.

case class OverwritePartitionsDynamic(
table: NamedRelation,
query: LogicalPlan,
writeOptions: Map[String, String],
Contributor

Just curious: is it coding style to have the boolean parameter as the last one, like isByName?

Here writeOptions is second from last, while in OverwritePartitionsDynamic writeOptions is the last parameter.

Contributor Author

Yes, it is for style. Boolean parameters should be passed by name, like isByName = false. Although you can pass positional parameters after a named parameter, the expectation is usually that named parameters are not necessarily in the correct position and can be omitted or reordered.
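
A tiny generic illustration of that style (the case class here is made up purely to show the convention, not an actual plan node):

// Hypothetical node, for illustration only:
case class ExampleWrite(
    table: String,
    options: Map[String, String],
    isByName: Boolean)

// The trailing boolean is passed by name, so the call site stays readable and
// is robust to the parameter being reordered or given a default later:
val write = ExampleWrite("catalog.db.t", Map.empty, isByName = false)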

Contributor

thanks Ryan 👍

* @group basic
* @since 3.0.0
*/
def writeTo(table: String): DataFrameWriterV2[T] = {
Contributor

I see there is write for v1:

  /**
   * Interface for saving the content of the non-streaming Dataset out into external storage.
   *
   * @group basic
   * @since 1.6.0
   */
  def write: DataFrameWriter[T] = {
    if (isStreaming) {
      logicalPlan.failAnalysis(
        "'write' can not be called on streaming Dataset/DataFrame")
    }
    new DataFrameWriter[T](this)
  }

Why not name it writeV2 to be self-explanatory? Or overload write with a different return type, DataFrameWriterV2[T]?

Contributor Author

We can't change the behavior of write because we don't want to break older jobs. And we need to pass the table name or path somewhere. I think this works, but if everyone prefers writeV2, we can rename it.

Contributor

I like writeTo, since we're explaining exactly what we're writing to at that point.
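
For comparison, a minimal side-by-side of the two entry points (table names are illustrative):

// v1: a configuration-style writer; the target is given to the terminal method
df.write.format("parquet").mode("overwrite").saveAsTable("db.table")

// v2: the target is named up front, and the verb selects the logical plan
df.writeTo("catalog.db.table").append()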

case lit @ Literal(_, IntegerType) =>
Bucket(lit, e.expr)
case _ =>
throw new AnalysisException(s"Invalid number of buckets: $numBuckets")
Contributor

Also add column information to the exception message for debugging, like s"Invalid number of buckets: $numBuckets, for column: $e"?

Contributor Author

Fixed.

Seq(Row(1L, "a"), Row(2L, "b"), Row(3L, "c"), Row(4L, "d"), Row(5L, "e"), Row(6L, "f")))
}

test("Append: by name not position") {
Contributor

Minor: would it be better to make the test case name state the expected result, like line 88 does? For example, fail if by name not position.

Contributor Author

This tests that the validation is by name and not by position, so a name like "fail if by name" would be incorrect. The test checks that a name violation (can't find "data") is raised, even though the number of columns and the column types match by position.
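
As a sketch of the behavior being described (table and column names here are illustrative, not copied from the suite):

import spark.implicits._

// Suppose the target table has columns (id BIGINT, data STRING).
val df = Seq((1L, "a"), (2L, "b")).toDF("id", "value")  // "value" instead of "data"

// The column count and types line up by position, but resolution is by name,
// so this append fails analysis because the table column "data" cannot be found.
df.writeTo("testcat.ns.table").append()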

Contributor

oh I see. no issues then. thanks!

Seq(Row(1L, "a"), Row(2L, "b"), Row(4L, "d"), Row(5L, "e"), Row(6L, "f")))
}

test("Overwrite: by name not position") {
Contributor

Same comment here: the naming suggestion above also applies to this by name not position test.

rdblue force-pushed the SPARK-28612-add-data-frame-writer-v2 branch from fb0fafc to 12097ec on August 21, 2019 17:00
@SparkQA commented Aug 21, 2019

Test build #109505 has finished for PR 25354 at commit 12097ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DataFrameWriterV2Suite extends QueryTest with SharedSparkSession with BeforeAndAfter

@SparkQA commented Aug 21, 2019

Test build #109506 has finished for PR 25354 at commit 6950650.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor) left a comment

LGTM, but I still have worries about the partitioning expressions. I'm also fine with throwing better error messages, but I'm curious what you think about namespacing.


// TODO: streaming could be adapted to use this interface
if (isStreaming) {
logicalPlan.failAnalysis(
"'writeTo' can not be called on streaming Dataset/DataFrame")
Contributor

It may be good to add: use 'writeStream' instead.

Contributor Author

I didn't include this because I'd rather not have the v2 API recommend using the v1 API. That seems confusing to me.


/**
* Create a new table from the contents of the data frame.
*
* The new table's schema, partition layout, properties, and other configuration will be
Contributor

The schema will be assumed to be nullable, though.

this
}

override def tableProperty(property: String, value: String): DataFrameWriterV2[T] = {
Contributor

Should this return a CreateTableWriter?

Contributor

+1


@SparkQA commented Aug 22, 2019

Test build #109591 has finished for PR 25354 at commit eff659a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

rdblue force-pushed the SPARK-28612-add-data-frame-writer-v2 branch from eff659a to e424c2c on August 30, 2019 21:26
@rdblue (Author) commented Aug 30, 2019

@brkyvz, I've removed the asParquet etc. methods and rebased on the current master.

@SparkQA commented Aug 30, 2019

Test build #109963 has finished for PR 25354 at commit e424c2c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 31, 2019

Test build #109967 has finished for PR 25354 at commit 57e6c5b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 31, 2019

Test build #109982 has finished for PR 25354 at commit 9864d42.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor) commented Sep 1, 2019

LGTM. Merging to master. Thanks @rdblue

asfgit closed this in 3821d75 on Sep 1, 2019

// turn off style check that object names must start with a capital letter
// scalastyle:off
object partitioning {
Member

This seems to break the Maven build:

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/6803/console
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7-ubuntu-testing/1712/consoleFull
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2/289/consoleFull
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-jdk-11/357/consoleFull

[ERROR] [Error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-testing/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:32: cannot access org.apache.spark.sql.functions.1
  class file for org.apache.spark.sql.functions$1 not found
[ERROR] [Error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-testing/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:70: cannot find symbol
  symbol:   method avg(java.lang.String)
  location: class org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] [Error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-testing/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:81: cannot find symbol
  symbol:   method col(java.lang.String)
  location: class org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] [Error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-testing/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:89: cannot find symbol
  symbol:   method col(java.lang.String)
  location: class org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] [Error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-testing/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:90: cannot find symbol
  symbol:   method col(java.lang.String)
  location: class org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] [Error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-testing/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:91: cannot find symbol
  symbol:   method col(java.lang.String)
  location: class org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] [Error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-testing/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:92: cannot find symbol
  symbol:   method col(java.lang.String)
  location: class org.apache.spark.sql.hive.JavaDataFrameSuite

After this commit, the GitHub workflow also seems to be failing:

[screenshot of the failed GitHub workflow run, Sep 2, 2019]


// turn off style check that object names must start with a capital letter
// scalastyle:off
object partitioning {
Member

Does this have a corresponding Java test? This file should be compatible with the Java side as well. I doubt that a nested object works in the Java API.

Contributor

This is a good point. We also have nested objects in SQLConf, and it works fine. Maybe Java compatibility is the problem here.

Member

SQLConf is internal. I don't think it's supposed to be exposed.

@brkyvz (Contributor) commented Sep 2, 2019 via email

@HyukjinKwon (Member) commented:

Sure, thanks @brkyvz for confirming quickly!

@HyukjinKwon (Member) commented:

Also, I verified locally that the Maven build works fine after reverting this one.

@HyukjinKwon (Member) commented:

@rdblue sorry I had to revert this. Can you open a PR again with some fixes?

* @group partition_transforms
* @since 3.0.0
*/
def bucket(numBuckets: Column, e: Column): Column = withExpr {
Contributor

Why do we need this overload if only a literal is allowed as numBuckets?

@dilipbiswal (Contributor) commented:

@HyukjinKwon

@rdblue sorry I had to revert this. Can you open a PR again with some fixes?

Did we revert this? I am seeing a compilation error like the following in my local env while compiling the hive module:

[ERROR] [Error] /Users/dbiswal/mygit/apache/spark/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:32: cannot access org.apache.spark.sql.functions.1
  class file for org.apache.spark.sql.functions$1 not found
[ERROR] [Error] /Users/dbiswal/mygit/apache/spark/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:70: cannot find symbol
  symbol:   method avg(java.lang.String)
  location: class org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] [Error] /Users/dbiswal/mygit/apache/spark/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:81: cannot find symbol
  symbol:   method col(java.lang.String)
  location: class org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] [Error] /Users/dbiswal/mygit/apache/spark/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:89: cannot find symbol
  symbol:   method col(java.lang.String)
  location: class org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] [Error] /Users/dbiswal/mygit/apache/spark/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:90: cannot find symbol
  symbol:   method col(java.lang.String)
  location: class org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] [Error] /Users/dbiswal/mygit/apache/spark/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:91: cannot find symbol
  symbol:   method col(java.lang.String)
  location: class org.apache.spark.sql.hive.JavaDataFrameSuite
[ERROR] [Error] /Users/dbiswal/mygit/apache/spark/sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java:92: cannot find symbol

@dongjoon-hyun (Member) commented Sep 2, 2019

Yes, @dilipbiswal. We reverted this 15 hours ago. Did you try updating your local environment?
You can see that the master branch is healthy in two places.

@dilipbiswal (Contributor) commented:

@dongjoon-hyun My bad, sorry. Yes, I see the revert now that I've updated the local env.
