Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Jan 11, 2020

What changes were proposed in this pull request?

  • Remove the @deprecated annotation for AvroOptions. ignoreExtension
  • Output log warning if avro.mapred.ignore.inputs.without.extension is set to non-default value - true
  • Output log warning if avro option ignoreExtension is set to non-default value - true

Why are the changes needed?

  1. Currently, the compiler output deprecation warning only during compilation of Spark source code but not user apps because AvroOptions.ignoreExtension is not used by users directly. In this ways, users are not aware of deprecated Hadoop's conf and avro option.
  2. Elimination of unnecessary warnings highlights other warnings that can indicate about real problems.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

By AvroSuite

@SparkQA
Copy link

SparkQA commented Jan 11, 2020

Test build #116528 has finished for PR 27174 at commit e25f195.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* If the option is not set, the Hadoop's config `avro.mapred.ignore.inputs.without.extension`
* is taken into account. If the former one is not set too, file extensions are ignored.
*/
@deprecated("Use the general data source option pathGlobFilter for filtering file names", "3.0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this if it's really deprecated? I get that it will remove some compiler warnings, but, that's not super important, or can be worked around as you do elsewhere by deprecating the test methods too?

@MaxGekk
Copy link
Member Author

MaxGekk commented Jan 11, 2020 via email

@srowen
Copy link
Member

srowen commented Jan 11, 2020 via email

@MaxGekk
Copy link
Member Author

MaxGekk commented Jan 11, 2020 via email

@MaxGekk
Copy link
Member Author

MaxGekk commented Jan 11, 2020

@gengliangwang @HyukjinKwon Could you take a look at the PR.

*/
@deprecated("Use the general data source option pathGlobFilter for filtering file names", "3.0")
val ignoreExtension: Boolean = {
def warn(s: String): Unit = logWarning(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we define a separate method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, to reuse the same code in 2 places.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly but I think it's fine to don't do it ...

.getOrElse(!ignoreFilesWithoutExtension)
.map { ignoreExtensionOption =>
if (ignoreExtensionOption != !ignoreFilesWithoutExtensionByDefault) {
warn(s"The Avro option '${AvroOptions.ignoreExtensionKey}'")
Copy link
Member

@HyukjinKwon HyukjinKwon Jan 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGekk, from a cursory look, this warning can be shown for every file which I think is noisy:

abstract class FilePartitionReaderFactory extends PartitionReaderFactory {
override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
assert(partition.isInstanceOf[FilePartition])
val filePartition = partition.asInstanceOf[FilePartition]
val iter = filePartition.files.toIterator.map { file =>
PartitionedFileReader(file, buildReader(file))
}

Do you mind if I ask double check this?

Copy link
Member Author

@MaxGekk MaxGekk Jan 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon I will check that but general thoughts are:

  1. The log warning is printed only if an user sets non-default config values
  2. I don't think AvroOptions should be created (initialized from scratch) per-each file if it is created in current implementation. I would say it is not necessary to initialize AvroOptions again and again. After all, AvroOptions should be the same for all files/partitions.
  3. And the noise in logs will force people to avoid using of the deprecated options ;-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon you are right, it prints warnings per each partition. I have confirmed that by the test:

  test("count deprecation log events") {
    val partitionNum = 3
    val logAppender = new AppenderSkeleton {
      val loggingEvents = new ArrayBuffer[LoggingEvent]()

      override def append(loggingEvent: LoggingEvent): Unit = loggingEvents.append(loggingEvent)
      override def close(): Unit = {}
      override def requiresLayout(): Boolean = false
    }
    withTempPath { dir =>
      Seq(("a", 1, 2), ("b", 1, 2), ("c", 2, 1), ("d", 2, 1))
        .toDF("value", "p1", "p2")
        .repartition(partitionNum)
        .write
        .format("avro")
        .option("header", true)
        .save(dir.getCanonicalPath)
      withLogAppender(logAppender) {
        val df = spark
          .read
          .format("avro")
          .schema("value STRING, p1 INTEGER, p2 INTEGER")
          .option(AvroOptions.ignoreExtensionKey, false)
          .option("header", true)
          .load(dir.getCanonicalPath)
        df.count()
      }
      val deprecatedEvents = logAppender.loggingEvents
        .map(_.getRenderedMessage)
        .filter(_.contains(AvroOptions.ignoreExtensionKey))
      assert(deprecatedEvents.size === partitionNum)
    }
  }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I moved instantiation of AvroOptions out of buildReader():
Screen Shot 2020-01-12 at 20 14 06
The warning is printed always 2 times. It means AvroPartitionReaderFactory is constructed twice.
And both times from

override def supportsColumnar: Boolean = {
require(partitions.forall(readerFactory.supportColumnarReads) ||
!partitions.exists(readerFactory.supportColumnarReads),
"Cannot mix row-based and columnar input partitions.")
partitions.exists(readerFactory.supportColumnarReads)
}
which is called:
First time from
Screen Shot 2020-01-12 at 20 32 02
The second time from:
Screen Shot 2020-01-12 at 20 33 36

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is interesting that rewriting supportsColumnar as:

  override val supportsColumnar: Boolean = {
    val factory = readerFactory
    require(partitions.forall(factory.supportColumnarReads) ||
      !partitions.exists(factory.supportColumnarReads),
      "Cannot mix row-based and columnar input partitions.")

    partitions.exists(factory.supportColumnarReads)
  }

does not help too because DataSourceV2ScanExecBase is initialized twice from:
First time:
Screen Shot 2020-01-12 at 20 52 49
Second time in TreeNode.makeCopy:
Screen Shot 2020-01-12 at 21 06 40
Making supportsColumnar as lazy val doesn't help as well because supportsColumnar is invoked twice for different objects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is not nice that we construct some classes twice when it is not necessary. WDYT? /cc @cloud-fan @dongjoon-hyun

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea we shouldn't instantiate twice, but not a big problem. I'm more worried about we instantiate it for every partition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGekk, even if we fix this, it will still show the warning twice for schema inference and reading path at the very least. It's okay as long as we show the warning and document. Let's just go simple in this PR. This warning will be removed very soon, too.

options: Map[String, String],
files: Seq[FileStatus]): Option[StructType] = {
val conf = spark.sessionState.newHadoopConf()
if (options.contains("ignoreExtension")) {
Copy link
Member

@HyukjinKwon HyukjinKwon Jan 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGekk, let's just remove this option after branch-3.0 is cut out.

Copy link
Member Author

@MaxGekk MaxGekk Jan 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be deprecated explicitly for users before removing? It should be mentioned in docs at least if we don't want to output a warning like in the PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it still shows the warning properly although it only shows during schema inference. Yeah, can you simply fix the doc and say it's deprecated at docs/sql-data-sources-avro.md?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 with @HyukjinKwon
Let's remove the option and document it in the future, instead of creating such changes. If we merge this one, then there might be some other options we have to do the same thing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the PR #27194 for docs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants