[SPARK-30490][SQL] Eliminate compiler warnings in Avro datasource #27174

MaxGekk · 2020-01-11T12:20:56Z

What changes were proposed in this pull request?

Remove the @deprecated annotation for AvroOptions.ignoreExtension
Output a log warning if avro.mapred.ignore.inputs.without.extension is set to the non-default value true
Output a log warning if the Avro option ignoreExtension is set to the non-default value false
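
Note: the resolution order and the proposed warnings can be sketched without any Spark dependency. This is only an illustration under stated assumptions, not the code of this PR: the object name, the `resolve` signature, the message texts, and the default of `false` for the Hadoop setting are all hypothetical. The default was picked so that, with nothing set, file extensions are ignored.

```scala
// Hypothetical, Spark-free model of how ignoreExtension could be resolved.
object IgnoreExtensionSketch {
  val HadoopKey = "avro.mapred.ignore.inputs.without.extension"
  val OptionKey = "ignoreExtension"

  // Assumed default of the Hadoop setting (illustrative, see the note above).
  val IgnoreFilesWithoutExtensionByDefault: Boolean = false

  /**
   * @param option     value of the Avro option, if the user set it
   * @param hadoopConf value of the Hadoop setting, if the user set it
   * @param warn       callback standing in for logWarning
   */
  def resolve(
      option: Option[Boolean],
      hadoopConf: Option[Boolean],
      warn: String => Unit): Boolean = {
    val fromHadoop = hadoopConf.getOrElse(IgnoreFilesWithoutExtensionByDefault)
    // Warn only when the Hadoop setting differs from its default.
    if (fromHadoop != IgnoreFilesWithoutExtensionByDefault) {
      warn(s"The Hadoop conf '$HadoopKey' is deprecated.")
    }

    val defaultIgnoreExtension = !IgnoreFilesWithoutExtensionByDefault
    option match {
      case Some(value) =>
        // The explicit option wins; warn only for a non-default value.
        if (value != defaultIgnoreExtension) {
          warn(s"The Avro option '$OptionKey' is deprecated.")
        }
        value
      case None =>
        // Fall back to the (negated) Hadoop setting.
        !fromHadoop
    }
  }
}
```

With this model, leaving both settings unset produces no warning, so only users who actually rely on the deprecated settings see a message.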

Why are the changes needed?

Currently, the compiler outputs the deprecation warning only while compiling Spark's own source code, not user apps, because users do not use AvroOptions.ignoreExtension directly. As a result, users are not made aware that the Hadoop conf and the Avro option are deprecated.
Eliminating unnecessary warnings makes it easier to notice the remaining warnings, which can indicate real problems.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

By AvroSuite

SparkQA · 2020-01-11T12:49:56Z

Test build #116528 has finished for PR 27174 at commit e25f195.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2020-01-11T15:32:34Z

external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala

   * If the option is not set, the Hadoop's config `avro.mapred.ignore.inputs.without.extension`
   * is taken into account. If the former one is not set too, file extensions are ignored.
   */
-  @deprecated("Use the general data source option pathGlobFilter for filtering file names", "3.0")


Why remove this if it's really deprecated? I get that it will remove some compiler warnings, but, that's not super important, or can be worked around as you do elsewhere by deprecating the test methods too?

MaxGekk · 2020-01-11T16:05:40Z

Sean, deprecating the value doesn't make any sense because it is not used by users.

srowen · 2020-01-11T16:12:42Z

OK, the class appears public though, it's definitely not meant to be accessed for other reasons?

MaxGekk · 2020-01-11T16:27:37Z

AvroOptions (and other options like CSVOptions) shouldn’t be accessible to users. Deprecating any values inside of AvroOptions seems similar to deprecating config entries inside of SQLConf - the values are not visible to users, and they are not aware of compiler warnings.

MaxGekk · 2020-01-11T19:24:07Z

@gengliangwang @HyukjinKwon Could you take a look at the PR?

HyukjinKwon · 2020-01-12T10:13:36Z

external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala

   */
-  @deprecated("Use the general data source option pathGlobFilter for filtering file names", "3.0")
  val ignoreExtension: Boolean = {
+    def warn(s: String): Unit = logWarning(


Why do we define a separate method?

hmm, to reuse the same code in 2 places.

I don't feel strongly, but I think it's fine not to do it ...

HyukjinKwon · 2020-01-12T10:17:01Z

external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala

-      .getOrElse(!ignoreFilesWithoutExtension)
+      .map { ignoreExtensionOption =>
+        if (ignoreExtensionOption != !ignoreFilesWithoutExtensionByDefault) {
+          warn(s"The Avro option '${AvroOptions.ignoreExtensionKey}'")


@MaxGekk, from a cursory look, this warning can be shown for every file which I think is noisy:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReaderFactory.scala

Lines 24 to 30 in 053dd85

abstract class FilePartitionReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
    assert(partition.isInstanceOf[FilePartition])
    val filePartition = partition.asInstanceOf[FilePartition]
    val iter = filePartition.files.toIterator.map { file =>
      PartitionedFileReader(file, buildReader(file))
    }

spark/external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroPartitionReaderFactory.scala

Line 61 in 053dd85

val parsedOptions = new AvroOptions(options, conf)

Would you mind double-checking this?

@HyukjinKwon I will check that, but my general thoughts are:

The log warning is printed only if a user sets non-default config values.

I don't think AvroOptions should be created (initialized from scratch) for each file, if that is what the current implementation does. It is not necessary to initialize AvroOptions again and again. After all, AvroOptions should be the same for all files/partitions.

And the noise in the logs will push people to stop using the deprecated options ;-)
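
Note: as a rough illustration of why per-file initialization matters for log noise, the sketch below compares building an options object inside the per-file callback with building it once up front. Everything here is hypothetical (the `Options` class, `readPerFile`, and `readHoisted` are stand-ins, not Spark APIs); it only assumes that the warning is emitted from the options constructor.

```scala
// Hypothetical stand-ins; none of these names exist in Spark.
object PerFileOptionsSketch {
  // Models an options class whose constructor logs a deprecation warning
  // when a deprecated, non-default value is supplied.
  final class Options(val ignoreExtension: Boolean, warn: String => Unit) {
    if (!ignoreExtension) warn("ignoreExtension is deprecated")
  }

  // Options are rebuilt inside the per-file callback: one warning per file.
  def readPerFile(files: Seq[String], warn: String => Unit): Seq[String] =
    files.map { file =>
      val opts = new Options(ignoreExtension = false, warn)
      s"$file (ignoreExtension=${opts.ignoreExtension})"
    }

  // Options are built once and shared by all files: one warning in total.
  def readHoisted(files: Seq[String], warn: String => Unit): Seq[String] = {
    val opts = new Options(ignoreExtension = false, warn)
    files.map(file => s"$file (ignoreExtension=${opts.ignoreExtension})")
  }
}
```

Both variants return the same results; they differ only in how many times the constructor, and therefore the warning, runs.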

@HyukjinKwon you are right, it prints the warning once per partition. I have confirmed that with this test:

test("count deprecation log events") {
  val partitionNum = 3
  val logAppender = new AppenderSkeleton {
    val loggingEvents = new ArrayBuffer[LoggingEvent]()
    override def append(loggingEvent: LoggingEvent): Unit = loggingEvents.append(loggingEvent)
    override def close(): Unit = {}
    override def requiresLayout(): Boolean = false
  }
  withTempPath { dir =>
    Seq(("a", 1, 2), ("b", 1, 2), ("c", 2, 1), ("d", 2, 1))
      .toDF("value", "p1", "p2")
      .repartition(partitionNum)
      .write
      .format("avro")
      .option("header", true)
      .save(dir.getCanonicalPath)
    withLogAppender(logAppender) {
      val df = spark
        .read
        .format("avro")
        .schema("value STRING, p1 INTEGER, p2 INTEGER")
        .option(AvroOptions.ignoreExtensionKey, false)
        .option("header", true)
        .load(dir.getCanonicalPath)
      df.count()
    }
    val deprecatedEvents = logAppender.loggingEvents
      .map(_.getRenderedMessage)
      .filter(_.contains(AvroOptions.ignoreExtensionKey))
    assert(deprecatedEvents.size === partitionNum)
  }
}

When I moved the instantiation of AvroOptions out of buildReader(), the warning was always printed 2 times. This means AvroPartitionReaderFactory is constructed twice, both times from

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala

Lines 60 to 66 in 053dd85

override def supportsColumnar: Boolean = {
  require(partitions.forall(readerFactory.supportColumnarReads) ||
    !partitions.exists(readerFactory.supportColumnarReads),
    "Cannot mix row-based and columnar input partitions.")

  partitions.exists(readerFactory.supportColumnarReads)
}

which is called:
First time from

The second time from:

It is interesting that rewriting supportsColumnar as:

override val supportsColumnar: Boolean = {
  val factory = readerFactory
  require(partitions.forall(factory.supportColumnarReads) ||
    !partitions.exists(factory.supportColumnarReads),
    "Cannot mix row-based and columnar input partitions.")
  partitions.exists(factory.supportColumnarReads)
}

does not help either, because DataSourceV2ScanExecBase is initialized twice from:
First time:

Second time in TreeNode.makeCopy:

Making supportsColumnar a lazy val doesn't help either, because supportsColumnar is invoked on two different objects.
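
Note: the lazy val behavior can be reproduced in isolation. A lazy val is cached per instance, so a copy of the object evaluates the initializer again. The sketch below is a hypothetical model (the `ScanNode` class and the counter are made up for illustration, and `copy()` merely stands in for a tree transformation that creates a new node); it is not Spark code.

```scala
// Hypothetical model: a lazy val is evaluated once per instance, not once overall.
object LazyValPerInstanceSketch {
  // Counts how often the initializer runs across all instances.
  var initCount = 0

  // Stand-in for a plan node.
  final case class ScanNode(name: String) {
    lazy val supportsColumnar: Boolean = {
      initCount += 1 // side effect standing in for the repeated work
      false
    }
  }

  def demo(): Int = {
    initCount = 0
    val first = ScanNode("scan")
    val second = first.copy() // a new object with its own, unevaluated lazy val
    // Two accesses on `first` evaluate once; the access on `second` evaluates again.
    Seq(first, first, second).foreach(_.supportsColumnar)
    initCount // 2: one evaluation per distinct instance
  }
}
```

Caching inside the instance therefore cannot deduplicate work when the instance itself is created more than once.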

I think it is not nice that we construct some classes twice when it is not necessary. WDYT? /cc @cloud-fan @dongjoon-hyun

Yeah, we shouldn't instantiate twice, but it's not a big problem. I'm more worried that we instantiate it for every partition.

@MaxGekk, even if we fix this, it will still show the warning twice, for schema inference and the reading path, at the very least. It's okay as long as we show the warning and document it. Let's just go simple in this PR. This warning will be removed very soon, too.

HyukjinKwon · 2020-01-13T01:57:18Z

external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala

      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] = {
    val conf = spark.sessionState.newHadoopConf()
-    if (options.contains("ignoreExtension")) {


@MaxGekk, let's just remove this option after branch-3.0 is cut out.

Shouldn't it be deprecated explicitly for users before removing? It should be mentioned in docs at least if we don't want to output a warning like in the PR.

I think it still shows the warning properly although it only shows during schema inference. Yeah, can you simply fix the doc and say it's deprecated at docs/sql-data-sources-avro.md?

+1 with @HyukjinKwon
Let's remove the option and document it in the future, instead of creating such changes. If we merge this one, then there might be some other options we have to do the same thing.

Here is the PR #27194 for docs

MaxGekk added 3 commits January 11, 2020 14:13

Output log warning

4e1065a

Remove @deprecated

78077eb

Remove logWarning from inferSchema()

e25f195

srowen reviewed Jan 11, 2020

HyukjinKwon reviewed Jan 12, 2020

HyukjinKwon reviewed Jan 13, 2020

This was referenced Jan 13, 2020

[SPARK-30505][DOCS] Deprecate Avro option ignoreExtension in sql-data-sources-avro.md #27194

Closed

[SPARK-30509][SQL] Fix deprecation log warning in Avro schema inferring #27200

Closed

HyukjinKwon closed this in 51d2917 Jan 15, 2020

MaxGekk mentioned this pull request Jan 18, 2020

[SPARK-30558][SQL] Avoid rebuilding AvroOptions per each partition #27272

Closed

dongjoon-hyun added the SQL label Feb 5, 2020

MaxGekk deleted the avro-deprecation-warning branch June 5, 2020 19:42
