
Conversation

@shardulm94 (Contributor) commented Feb 19, 2021

What changes were proposed in this pull request?

In YARN, ship the spark.jars.ivySettings file to the driver when using cluster deploy mode so that addJar is able to find it in order to resolve ivy paths.

Why are the changes needed?

SPARK-33084 introduced support for Ivy paths in sc.addJar and Spark SQL ADD JAR. If a custom ivySettings file is specified via spark.jars.ivySettings, it is loaded via new File(settingsFile). However, this file is only accessible on the client machine. In YARN cluster mode, the file is not available on the driver, so addJar fails to find it.
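For illustration, a minimal sketch of the scenario that breaks; the settings path and artifact coordinates below are hypothetical:

```scala
// Submitted with something like:
//   spark-submit --deploy-mode cluster --conf spark.jars.ivySettings=/etc/spark/ivysettings.xml ...
// The driver runs in a YARN container where /etc/spark/ivysettings.xml does not exist,
// so the Ivy resolution below fails when it tries to load the settings file.
import org.apache.spark.sql.SparkSession

object AddJarIvyExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ivy-settings-example").getOrCreate()
    // Either of these triggers Ivy resolution on the driver (added by SPARK-33084):
    spark.sparkContext.addJar("ivy://org.example:some-artifact:1.0.0")
    spark.sql("ADD JAR ivy://org.example:some-artifact:1.0.0")
    spark.stop()
  }
}
```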

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests to verify that the ivySettings file is localized by the YARN client and that a YARN cluster mode application is able to find and load the ivySettings file.

@xkrogen (Contributor) left a comment

Thanks for submitting this important PR @shardulm94 !

I'm very glad to see you were able to get a good test case together, I know these things can be tricky to test.

I didn't leave comments on the SparkSubmit-related code for now because I'd like to invite feedback on the approach. Leveraging --files is a very simple approach from a code perspective and it's great to see how few changes were necessary to make this possible. But I see two drawbacks for this:
(1) There will be a conflict if the user specifies any file to --files with the same name as their spark.jars.ivySettings. This seems like an odd case, so I am not worried about it.
(2) We have to separately upload/localize this settings file to/from the YARN distributed cache, whereas if we used an approach where the settings file was treated like a Spark conf file, it would be included in the Spark conf zip (saving some HDFS RPCs in a typical deployment).

Shardul, I know we discussed the possibility of using an approach where the settings file is in the Spark conf zip, did you try this approach and discover any issues / complexity? I think that route might be a little cleaner / more robust from a design perspective, but I think the --files approach makes sense if going through the Spark conf zip adds unreasonable complexity.

cc also @viirya @maropu @wangyum @HyukjinKwon @dongjoon-hyun @AngersZhuuuu who worked on PR 29966

@shardulm94 (Contributor, Author) commented Feb 19, 2021

@xkrogen

I know we discussed the possibility of using an approach where the settings file is in the Spark conf zip, did you try this approach and discover any issues / complexity?

I did try and there are some tradeoffs. I was planning to implement this similar to how we ship log4j.properties as part of Spark conf.

for { prop <- Seq("log4j.properties", "metrics.properties")
Then we can use classloader.getResource to read the ivy file from the driver classpath. However, this approach for shipping the file would be specific to YARN and won't work with other resource managers in cluster mode unless we make a similar change in each of them. There is also a possibility that classloader.getResource may pick up resources from other jars/zips (not __spark_conf__.zip) depending on classpath ordering. One alternative is to search for the unzipped file under the Spark conf directory, i.e. __spark_conf__/ivysettings.xml, but again that is specific to YARN's Spark conf naming.
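A minimal sketch of the classpath-based lookup being discussed (not what the PR ends up doing; the helper and default resource name are assumptions):

```scala
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Look the settings file up on the driver classpath (e.g. inside the extracted
// __spark_conf__ directory) and copy it to a temp file so callers that expect a
// plain File can still use it.
def ivySettingsFromClasspath(resourceName: String = "ivysettings.xml"): Option[File] = {
  Option(Thread.currentThread().getContextClassLoader.getResource(resourceName)).map { url =>
    val tmp = Files.createTempFile("ivysettings", ".xml")
    val in = url.openStream()
    try Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING) finally in.close()
    tmp.toFile
  }
}
```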

I am very new to Spark, so there might be obvious ways of resolving the above concerns that I am missing. Please let me know, I would be happy to try them out.

(1) There will be a conflict if the user specifies any file to --files with the same name as their spark.jars.ivySettings. This seems like an odd case, so I am not worried about it.

Yes, that is possible, however even in the alternative approaches I thought of (e.g. the one above), there was always a possibility of this happening.

@xkrogen (Contributor) commented Feb 19, 2021

Good point about other resource managers. I'm not familiar enough with Kubernetes or Mesos to have any opinion there or understand if the same problem exists. Hopefully someone more knowledgeable can chime in.

One other idea I've been toying with is to treat this similar to spark.sql.hive.metastore.jars:

  val HIVE_METASTORE_JARS = buildStaticConf("spark.sql.hive.metastore.jars")
    .doc(s"""
      | Location of the jars that should be used to instantiate the HiveMetastoreClient.
      | This property can be one of four options:
...
      | 3. "path"
      |   Use Hive jars configured by `spark.sql.hive.metastore.jars.path`
      |   in comma separated format. Support both local or remote paths.The provided jars
      |   should be the same version as ${HIVE_METASTORE_VERSION}.
      | 4. A classpath in the standard format for both Hive and Hadoop. The provided jars
      |   should be the same version as ${HIVE_METASTORE_VERSION}.
      """.stripMargin)

With this config (besides options 1 and 2 which aren't relevant here), you can either supply classpath entries like dep1.jar:dep2.jar or, if you use the path option you use a file path with no special handling. If you specify an absolute local path, the responsibility is yours to make sure that local path exists on all worker nodes. This also reminds me that an absolute local path which is expected to be present on all nodes is a valid use case, which may be a reason not to go with the current proposal where we intercept the spark.jars.ivySettings and always put it into the distributed cache.

Similar to this Hive JARs conf, we could enhance spark.jars.ivySettings to accept a URI instead of just a local file path, allowing for either local or remote paths. We can set up some custom scheme like classpath://ivysettings.xml to indicate that the URI points to a classpath entry. Users have the flexibility to use some local path deployed to all nodes, point to a single remote path, or put the file onto the classpath using a mechanism like --jars, placing it into SPARK_CONF_DIR, etc. URIs without a scheme are treated as local paths for backwards-compatibility. WDYT?
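A rough sketch of the URI-based idea, under the assumption that classpath:// would be a new, custom scheme (not something Spark supports today) and that unknown schemes are treated as remote paths:

```scala
import java.net.URI

sealed trait IvySettingsSource
case class LocalFile(path: String) extends IvySettingsSource       // on the submitting machine
case class NodeLocalFile(path: String) extends IvySettingsSource   // "local:" scheme, present on every node
case class ClasspathResource(name: String) extends IvySettingsSource
case class RemoteFile(uri: URI) extends IvySettingsSource          // e.g. hdfs://, would need downloading

def classifyIvySettings(setting: String): IvySettingsSource = {
  val uri = new URI(setting)
  Option(uri.getScheme) match {
    case None | Some("file") => LocalFile(uri.getPath)
    case Some("local")       => NodeLocalFile(uri.getPath)
    case Some("classpath")   => ClasspathResource(uri.getSchemeSpecificPart.stripPrefix("//"))
    case Some(_)             => RemoteFile(uri)
  }
}
```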

@shardulm94 (Contributor, Author)

I thought about this a bit more. I think the main concern here is where/how to find the ivySettings file. I believe this becomes tricky because the same code and properties are currently being used to load the ivySettings file both in SparkSubmit (where it makes more sense to read the file directly, since it will be local and it's too early to put it on the classpath) and inside the driver (where it would be better to read it off the classpath, since the user-provided local path can be invalid).

We could set an internal config like spark.yarn.xxx, but this property would need to be used inside core, and I don't see any precedent for YARN-specific properties being used in such a way.

@xkrogen We could enhance the path to support custom schemes like classpath://, but since the code also needs to work inside SparkSubmit, where the classpath is still being constructed, this would not work for both SparkSubmit and the driver together. I guess one thing we can do is modify spark.jars.ivySettings to convert it from a file entry to a classpath entry before passing it to the driver? Also, I'm not sure if we should worry about backward compatibility too much here. Being able to use addJar in the Spark context is a fairly new feature and hasn't been released yet; it's slated for 3.2.0.

Another option is to have separate code for loading the ivySettings file in SparkSubmit vs. the driver and call the appropriate method, or pass an explicit parameter to the loadIvySettings function from the driver which instructs it to only search the classpath.

ivyPath: Option[String]): IvySettings = {
val file = new File(settingsFile)
require(file.exists(), s"Ivy settings file $file does not exist")
require(file.isFile(), s"Ivy settings file $file is not a normal file")
Member

@shardulm94, does it work if you set spark.jars.ivySettings to ./ivysettings.xml and pass ivysettings.xml to spark.yarn.files?

Member

I know you might have to copy ivysettings.xml to the current working directory when spark-submit runs, but I think it might work for your use case.

@shardulm94 (Contributor, Author) commented Feb 23, 2021

Yeah, as you pointed out it only works if I have the copy in the current working directory. However we cannot control which directory our users launch spark-submit from. So ideally we would want something which works without user intervention.

What do you think of this? In the YARN Client, we can add the ivySettings file to spark.yarn.dist.files or __spark_conf__.zip, and then modify the spark.jars.ivySettings property to point at ./ivysettings.xml or __spark_conf__/ivysettings.xml within the YARN Client.
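A rough sketch of the rewrite being proposed here (simplified: it only shows the config change, not the code that actually ships the file via spark.yarn.dist.files or the conf archive):

```scala
import java.io.File

// In cluster mode the file is localized into the container's working directory, so the
// driver should look it up by its bare name there instead of the client-side path.
def rewriteIvySettingsForCluster(conf: Map[String, String]): Map[String, String] =
  conf.get("spark.jars.ivySettings") match {
    case Some(clientSidePath) =>
      val localizedName = "./" + new File(clientSidePath).getName
      conf + ("spark.jars.ivySettings" -> localizedName)
    case None => conf
  }
```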

Member

Adding it into __spark_conf__ sounds okay to me ... but I would prefer to have second opinions from YARN experts such as @tgravescs, @mridulm or @jerryshao.

Contributor Author

I tried this out, and it looks like it can be handled pretty cleanly this way targeting just YARN. shardulm94@12709f0

Contributor Author

I updated the PR with this approach.

@xkrogen (Contributor) commented Feb 23, 2021

We could enhance the path to support custom schemes like classpath://, but since the code also needs to work inside SparkSubmit where the classpath is still being constructed, this would not work for both SparkSubmit and the driver together.

There are mechanisms to adjust the classpath that the client sees, but the onus would be on the user/administrator to configure this properly, so it does push some extra work onto them. That being said, I think your recent proposal about using __spark_conf__ looks cleaner.

Also not sure if we should worry about backward compatibility too much here. Being able to use addJar in Spark context is a fairly new feature and hasn't been released yet, its set for 3.2.0.

Totally agreed that for reading ivysettings.xml from the driver, it doesn't need backwards-compatibility. But for client-side cases it should be backwards-compatible.

@shardulm94 shardulm94 changed the title [SPARK-34472][CORE] Ship ivySettings file to driver in cluster mode [SPARK-34472][YARN] Ship ivySettings file to driver in cluster mode Feb 23, 2021
@shardulm94 (Contributor, Author)

@xkrogen @HyukjinKwon This is ready for a second round of reviews

cc: @mridulm @jerryshao @tgravescs since this deals with YARN

@HyukjinKwon (Member)

ok to test

@SparkQA commented Feb 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40064/

@SparkQA commented Feb 26, 2021

Test build #135484 has finished for PR 31591 at commit 3f49d07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 26, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40064/

@xkrogen (Contributor) left a comment

My only concern with the current approach is that the behavior might be unexpected if some user/admin configures spark.jars.ivySettings to point to /path/to/ivy.xml, deploys a file /path/to/ivy.xml onto all nodes of the cluster, and is then surprised to find that the value of spark.jars.ivySettings has been overwritten to some YARN-localized path instead of the path they specified.

I don't see this being a realistic use case that should cause us to change our approach, but I welcome comments from others. At minimum, we should update the documentation either in configuration.md or running-on-yarn.md to explain this behavior.

// with any other conf file.
val amIvySettingsFileName = ivySettingsFile.getName() + "-" + UUID.randomUUID().toString
confStream.putNextEntry(new ZipEntry(amIvySettingsFileName))
Files.copy(ivySettingsFile, confStream)
Contributor

Can we use java.nio.file.Files here instead of Guava's Files? I know Guava is already used in this file but I think it's better to leverage the built-in functionality moving forward. You can do a rename import like import java.nio.file.{Files => NioFiles} (or update the other references as well if it's a small change?)
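A small sketch of the rename-import suggestion (the helper name is made up for illustration):

```scala
import java.io.File
import java.nio.file.{Files => NioFiles}
import java.util.UUID
import java.util.zip.{ZipEntry, ZipOutputStream}

def addIvySettingsToConfArchive(ivySettingsFile: File, confStream: ZipOutputStream): String = {
  // Unique entry name so it cannot clash with any other conf file in the archive.
  val entryName = ivySettingsFile.getName + "-" + UUID.randomUUID().toString
  confStream.putNextEntry(new ZipEntry(entryName))
  NioFiles.copy(ivySettingsFile.toPath, confStream) // java.nio.file.Files via the rename import
  confStream.closeEntry()
  entryName
}
```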

Contributor Author

Done

val ivySettings = sparkConf.getOption("spark.jars.ivySettings")
if (isClusterMode && ivySettings.isDefined) {
val ivySettingsFile = new File(ivySettings.get)
require(ivySettingsFile.exists(), s"Ivy settings file $ivySettingsFile not found")
Contributor

This would be a little more Scala-idiomatic for unpacking the Option:

      ivySettings match {
        case Some(ivySettingsPath) if isClusterMode =>
          // ... logic here
        case _ => // do nothing
      }

Contributor

also check to see if Utils.isLocalUri

Contributor Author

Done


val emptyIvySettings = File.createTempFile("ivy", ".xml")
FileUtils.write(emptyIvySettings, "<ivysettings />", StandardCharsets.UTF_8)

Contributor

Can we use NIO for these? No need for commons-io now that NIO supports this kind of stuff built-in.

val emptyIvySettings = Files.createTempFile(...)
Files.write(emptyIvySettings, Seq(...))

Contributor Author

I changed this to use Guava's Files which is used in many places within this file. Can I create a followup PR to replace these with NIO?

Contributor

Just saw this comment, sounds fine to me.

}
if (caught.getMessage.contains("unresolved dependency: org.fake-project.test#test")) {
// "unresolved dependency" is expected as the dependency does not exist
// but exception like "Ivy settings file <file> does not exist should result in failure"
Contributor

Minor nit: end quote is in the wrong place

Contributor Author

Fixed

@tgravescs (Contributor)

So I guess this case is just if the user does sc.addJar("ivy:.."), so you wouldn't know at submission time and can't necessarily download the file on the client before submission. One issue with YARN clusters, and likely Kubernetes clusters, is that the network might be more restricted on the cluster itself than on the client where you submit. But I guess either way that would fail.

With Kubernetes, if everything isn't in the Docker image you would have to configure a separate path to upload things to (like S3). It looks like this change is only for YARN, so I'm assuming this doesn't work with Kubernetes either, then? Or with standalone cluster mode? I'm fine with fixing just YARN with this, but we should document what it works for.

Note that Spark specifically supports the "local:" prefix to mean node-local files. This means we shouldn't upload them, as they are local to each machine. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L103

Is there a reason we don't just call distribute() like we do for the keytab file or the app jar, for instance? If we are worried about it conflicting with something else in --files, then just give it a unique name.
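A minimal sketch of honoring the "local:" prefix and of the unique-name idea (helper names are assumptions; in the real code the check is Utils.isLocalUri and the upload would go through distribute()):

```scala
import java.io.File
import java.net.URI
import java.util.UUID

// "local:" URIs are expected to already exist on every node, so they should never be uploaded.
def needsUpload(ivySettingsPath: String): Boolean =
  new URI(ivySettingsPath).getScheme != "local"

// If it does need uploading, give the localized copy a unique name so it cannot
// collide with anything the user passed via --files.
def uniqueLocalizedName(ivySettingsPath: String): String =
  new File(new URI(ivySettingsPath).getPath).getName + "-" + UUID.randomUUID().toString
```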

amKeytabFileName.foreach { kt => props.setProperty(KEYTAB.key, kt) }

// Upload user provided ivysettings.xml file to the distributed cache
val ivySettings = sparkConf.getOption("spark.jars.ivySettings")
Contributor

it would be nice to move this into the configuration package.scala and use ConfigBuilder. Even if we just reference it by the .key option in the SparkSubmitArguments file.

Contributor Author

Would it be okay to handle this as a followup immediately after this PR? There are about 7-8 other places where this string is hardcoded and we can also refactor them out into config package.scala

@tgravescs (Contributor) commented Mar 22, 2021

that's fine, please file an issue


// Generate a file name that can be used for the ivySettings file, that does not conflict
// with any other conf file.
val amIvySettingsFileName = ivySettingsFile.getName() + "-" + UUID.randomUUID().toString
confStream.putNextEntry(new ZipEntry(amIvySettingsFileName))
Contributor

any reason not to just use distribute() rather than putting it into the conf files?

Contributor Author

This would help save some RPCs in an HDFS environment; there is more context in #31591 (review), see drawback (2).

Contributor

I think also, semantically, this is more similar to a Spark configuration file than a user-provided file and thus makes sense to have within the conf zip as opposed to living with the user JARs/files. This is similar to the approach used for log4j.properties and metrics.properties above.

@tgravescs (Contributor) commented Mar 22, 2021

I get that it's kind of like a conf file, but it's not: there is no template file in the conf directory for this, and it's a user-specified config that is not automatically picked up, so I would rather keep things consistent and have it treated like all the other configs similar to it.
Also, don't we want to support this being in HDFS? I think that comes for free if you just use distribute; I don't see a few RPC calls as being a big deal here, especially since it adds functionality. I guess if you allow that you might need more code on the client side to download it. It's a bit unfortunate that we are not consistent with these things.

Personally I would prefer us to stay consistent and have this act just like other files users can specify. It should allow being in HDFS and it should be distributed via distribute() just like all the other files users can specify. If you disagree please let me know why.

Contributor

I get that it's kind of like a conf file, but it's not: there is no template file in the conf directory for this, and it's a user-specified config that is not automatically picked up ... It's a bit unfortunate that we are not consistent with these things.

The same is (mostly) true for examples like metrics.properties, which is an external/non-Spark file, the only difference being that it is automatically picked up vs. being requested by a user. I don't have much opinion on whether or not this is how it should be -- I agree with you that more clear guidelines/consistency in this area would be nice.

I don't have a strong opinion between storing it in the conf archive and leveraging distribute; I do agree that we should support remote file systems (e.g. HDFS), so this is a good point. @HyukjinKwon do you have any opinion here since you participated in some of the earlier conversations?

Contributor Author

I have updated the PR using distribute() to localize the ivySettings file instead of putting the file in Spark conf zip. I also had to slightly modify the ivySettings loading code in SparkSubmit.scala since it did not support local URIs in the first place. I also added tests for local and non-local URIs for ivySettings.

@shardulm94 (Contributor, Author)

@tgravescs Sorry for the delay, time flies!

This PR started off trying to address this issue in the general cluster case, using --files to distribute the ivySettings file for all resource managers. However, one of the concerns was that we would have to look for the ivy file differently based on whether the driver was running locally vs. on the cluster, since the code path is shared between SparkSubmit and sc.addJar.

// When running driver in cluster mode, the settingsFile is localized and so needs to be
There is more context at #31591 (comment) and #31591 (comment)

So I decided to make this specific to YARN since I am not familiar with the other resource managers.

@github-actions github-actions bot added the DOCS label Mar 16, 2021
@SparkQA commented Mar 24, 2021

Test build #136444 has started for PR 31591 at commit a4d76d7.

@SparkQA commented Mar 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41028/

@SparkQA commented Mar 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41028/

val file = if (Utils.isLocalUri(settingsFile)) {
new File(new URI(settingsFile).getPath)
} else {
new File(settingsFile)
Contributor

this won't work with HDFS files, right? I think we would need to download the file.

private val ivySettingsLocalizedFileName: Option[String] = ivySettings match {
case Some(ivySettingsPath) if isClusterMode && !Utils.isLocalUri(ivySettingsPath) =>
val ivySettingsFile = new File(ivySettingsPath)
require(ivySettingsFile.exists(), s"Ivy settings file $ivySettingsFile not found")
Contributor

this isn't going to work if the file is in HDFS, the distribute call would handle that.
Can you please test that this works pulling from HDFS all the way through for both cluster and client mode?

Contributor

+1
It would be much nicer if we could rely on distribute instead of trying to figure out for ourselves here if it needs to be uploaded, e.g. distribute will handle the local URI for us.

I think the reason we have to do this here is because createConfArchive doesn't give us a way to pass in the ivySettingsLocalizedFileName information. Can we just update createConfArchive to accept a parameter like configOverrides: Map[String, String] which provides a way to override configs that need to be different for the YARN container environment? Then we can move this within prepareLocalResources and rely more on the logic of distribute.
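A minimal sketch of the configOverrides idea (the parameter and helper names are assumptions, not the final signature): merge container-side overrides into the properties written into __spark_conf__.zip, so that e.g. spark.jars.ivySettings can be pointed at the localized file name.

```scala
import java.util.Properties

def confWithOverrides(base: Properties, overrides: Map[String, String]): Properties = {
  val merged = new Properties()
  merged.putAll(base)  // start from the existing Spark conf properties
  overrides.foreach { case (k, v) => merged.setProperty(k, v) }
  merged
}
```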

@shardulm94 (Contributor, Author) commented Apr 2, 2021

@tgravescs @xkrogen spark.jars.ivySettings today only supports files present on the client. It does not support local:// or hdfs:// files. Can the scope of this PR be restricted to making the current file support work in cluster mode?

Trying to add local:// and hdfs:// support would be nice, but there doesn't seem to be a requirement yet, and it is making things unnecessarily complicated. I think most of the complexity stems from the fact that the ivySettings loading code in SparkSubmit is shared between the Spark client and the Spark driver, and the way to access the files in these two contexts is very different.

Since the restriction that the ivySettings file be client-local already existed, I am guessing most users who need a custom ivySettings file have it configured locally, and we can add support for other schemes later if the need arises.

@xkrogen (Contributor) commented Apr 12, 2021

Interesting point @shardulm94. My take is that we should definitely support the full range of URIs (remote, file, local) eventually for the sake of consistency, so we shouldn't make any changes which would hinder this in the future. But an incremental step forward which just brings support for file to start makes sense to me, as long as it is extensible for future local/remote support.

cc also @mridulm in case you have any new input

@tgravescs (Contributor)

Sorry, I somehow missed the response. I would be fine with this PR covering a subset of the scope, as long as we file a JIRA to add the others, document it appropriately, and make sure nothing weird happens if the other schemes are tried.

@mridulm (Contributor) commented Apr 12, 2021

+1 to @tgravescs's suggestion.

@SparkQA commented Apr 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41892/

@SparkQA commented Apr 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41892/

@SparkQA commented Apr 14, 2021

Test build #137312 has finished for PR 31591 at commit d99cc86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shardulm94 (Contributor, Author)

@tgravescs @xkrogen I have updated the PR to restrict the scope to the file:// scheme. A path without a scheme is assumed to be file://; other schemes will throw an IllegalArgumentException. Please take a look.

I have also filed SPARK-35072 and SPARK-35073 for future improvements and linked it to SPARK-34472.
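A minimal sketch of the scheme handling described above (the helper name is hypothetical; the PR's actual code lives in SparkSubmit and the YARN Client):

```scala
import java.io.File
import java.net.URI

def ivySettingsToLocalFile(ivySettingsPath: String): File = {
  val uri = new URI(ivySettingsPath)
  // No scheme is treated as file:// for backwards-compatibility.
  Option(uri.getScheme).getOrElse("file") match {
    case "file" =>
      val file = new File(uri.getPath)
      require(file.isFile, s"Ivy settings file $file is not a normal file")
      file
    case other =>
      throw new IllegalArgumentException(
        s"Scheme $other is not yet supported for spark.jars.ivySettings")
  }
}
```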

@xkrogen (Contributor) left a comment

New set of changes LGTM, just some minor comments. For tests it would be nice to see a little more coverage, especially since I think it should be easy to add now that you have the YarnAddJarTest class defined.

require(ivySettingsFile.exists(), s"Ivy settings file $ivySettingsFile not found")
require(ivySettingsFile.isFile(), s"Ivy settings file $ivySettingsFile is not a" +
"normal file")
// Generate a file name that can be used for the ivySettings file, that does not
Contributor

This block of code (and the IllegalArgumentException below) -- convert to URI, check scheme, throw if bad scheme, assert on file existence -- is duplicated between SparkSubmit and Client. Can we put a method like getIvySettingsFile into SparkSubmit which is leveraged in both places?

Not too big of a duplication so if it can't be done cleanly no worries from my end.

@shardulm94 (Contributor, Author) commented Apr 14, 2021

The assertions of file existence are slightly different in the two places. In the YARN Client we want to make this assertion only for the file scheme, before we process the file; in SparkSubmit we want to make the assertion for all schemes, after the appropriate download steps. E.g. for the local scheme too, we would want to validate that the final file that loadIvySettings will process does exist.
Once we add handling for the local and hdfs schemes, the functions in the two classes will differ even more, so I think it is okay to have some duplication here.

Contributor

SGTM thanks!

@SparkQA commented Apr 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41931/

@SparkQA commented Apr 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41931/

@github-actions

Test build #592944647 for PR 31591 at commit fab9f9e.

@SparkQA commented Apr 14, 2021

Test build #137352 has finished for PR 31591 at commit fab9f9e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xkrogen (Contributor) commented Apr 14, 2021

Last diff LGTM!

@SparkQA commented Apr 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41948/

@SparkQA commented Apr 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41948/

@github-actions

Test build #592944647 for PR 31591 at commit fd3ddb2.

@SparkQA commented Apr 14, 2021

Test build #137370 has finished for PR 31591 at commit fd3ddb2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shardulm94 (Contributor, Author)

Hey @tgravescs, it's been a few days since the approval. Are we waiting for more input before merging this?

@asfgit asfgit closed this in 83f753e Apr 20, 2021
@tgravescs (Contributor)

merged to master, thanks @shardulm94

@shardulm94 (Contributor, Author)

Thanks @tgravescs and @xkrogen for the reviews on this!

JeffInChrist added a commit to JeffABC/spark that referenced this pull request May 15, 2021
* [SPARK-34225][CORE][FOLLOWUP] Replace Hadoop's Path with Utils.resolveURI to make the way to get URI simple

### What changes were proposed in this pull request?

This PR proposes to replace Hadoop's `Path` with `Utils.resolveURI` to make the way to get URI simple in `SparkContext`.

### Why are the changes needed?

Keep the code simple.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32164 from sarutak/followup-SPARK-34225.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-35070][SQL] TRANSFORM not support alias in inputs

### What changes were proposed in this pull request?
Normal function parameters should not support alias, hive not support too
![image](https://user-images.githubusercontent.com/46485123/114645556-4a7ff400-9d0c-11eb-91eb-bc679ea0039a.png)
In this pr we forbid use alias in `TRANSFORM`'s inputs

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32165 from AngersZhuuuu/SPARK-35070.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [MINOR][CORE] Correct the number of started fetch requests in log

### What changes were proposed in this pull request?

When counting the number of started fetch requests, we should exclude the deferred requests.

### Why are the changes needed?

Fix the wrong number in the log.

### Does this PR introduce _any_ user-facing change?

Yes, users see the correct number of started requests in logs.

### How was this patch tested?

Manually tested.

Closes #32180 from Ngone51/count-deferred-request.

Lead-authored-by: yi.wu <[email protected]>
Co-authored-by: wuyi <[email protected]>
Signed-off-by: attilapiros <[email protected]>

* [SPARK-34995] Port/integrate Koalas remaining codes into PySpark

### What changes were proposed in this pull request?

There are some more changes in Koalas such as [databricks/koalas#2141](https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47), [databricks/koalas#2143](https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e) after the main code porting, this PR is to synchronize those changes with the `pyspark.pandas`.

### Why are the changes needed?

We should port the whole Koalas codes into PySpark and synchronize them.

### Does this PR introduce _any_ user-facing change?

Fixed some incompatible behavior with pandas 1.2.0 and added more to the `to_markdown` docstring.

### How was this patch tested?

Manually tested in local.

Closes #32154 from itholic/SPARK-34995.

Authored-by: itholic <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

* Revert "[SPARK-34995] Port/integrate Koalas remaining codes into PySpark"

This reverts commit 9689c44b602781c1d6b31a322162c488ed17a29b.

* [SPARK-34843][SQL][FOLLOWUP] Fix a test failure in OracleIntegrationSuite

### What changes were proposed in this pull request?

This PR fixes a test failure in `OracleIntegrationSuite`.
After SPARK-34843 (#31965), the way to divide partitions is changed and `OracleIntegrationSuites` is affected.
```
[info] - SPARK-22814 support date/timestamp types in partitionColumn *** FAILED *** (230 milliseconds)
[info]   Set(""D" < '2018-07-11' or "D" is null", ""D" >= '2018-07-11' AND "D" < '2018-07-15'", ""D" >= '2018-07-15'") did not equal Set(""D" < '2018-07-10' or "D" is null", ""D" >= '2018-07-10' AND "D" < '2018-07-14'", ""D" >= '2018-07-14'") (OracleIntegrationSuite.scala:448)
[info]   Analysis:
[info]   Set(missingInLeft: ["D" < '2018-07-10' or "D" is null, "D" >= '2018-07-10' AND "D" < '2018-07-14', "D" >= '2018-07-14'], missingInRight: ["D" < '2018-07-11' or "D" is null, "D" >= '2018-07-11' AND "D" < '2018-07-15', "D" >= '2018-07-15'])
```

### Why are the changes needed?

To follow the previous change.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The modified test.

Closes #32186 from sarutak/fix-oracle-date-error.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-35032][PYTHON] Port Koalas Index unit tests into PySpark

### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas Index unit tests to PySpark.

### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable the Index unit tests.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Enable Index unit tests.

Closes #32139 from xinrong-databricks/port.indexes_tests.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

* [SPARK-35099][SQL] Convert ANSI interval literals to SQL string in ANSI style

### What changes were proposed in this pull request?
Handle `YearMonthIntervalType` and `DayTimeIntervalType` in the `sql()` and `toString()` method of `Literal`, and format the ANSI interval in the ANSI style.

### Why are the changes needed?
To improve readability and UX with Spark SQL. For example, a test output before the changes:
```
-- !query
select timestamp'2011-11-11 11:11:11' - interval '2' day
-- !query schema
struct<TIMESTAMP '2011-11-11 11:11:11' - 172800000000:timestamp>
-- !query output
2011-11-09 11:11:11
```

### Does this PR introduce _any_ user-facing change?
Should not since the new intervals haven't been released yet.

### How was this patch tested?
By running new tests:
```
$ ./build/sbt "test:testOnly *LiteralExpressionSuite"
```

Closes #32196 from MaxGekk/literal-ansi-interval-sql.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>

* [SPARK-35083][CORE] Support remote scheduler pool files

### What changes were proposed in this pull request?

Use hadoop FileSystem instead of FileInputStream.

### Why are the changes needed?

Make `spark.scheduler.allocation.file` suport remote file. When using Spark as a server (e.g. SparkThriftServer), it's hard for user to specify a local path as the scheduler pool.

### Does this PR introduce _any_ user-facing change?

Yes, a minor feature.

### How was this patch tested?

Pass `core/src/test/scala/org/apache/spark/scheduler/PoolSuite.scala` and manul test
After add config `spark.scheduler.allocation.file=hdfs:///tmp/fairscheduler.xml`. We intrudoce the configed pool.
![pool1](https://user-images.githubusercontent.com/12025282/114810037-df065700-9ddd-11eb-8d7a-54b59a07ee7b.jpg)

Closes #32184 from ulysses-you/SPARK-35083.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-35104][SQL] Fix ugly indentation of multiple JSON records in a single split file generated by JacksonGenerator when pretty option is true

### What changes were proposed in this pull request?

This issue fixes an issue that indentation of multiple output JSON records in a single split file are broken except for the first record in the split when `pretty` option is `true`.
```
// Run in the Spark Shell.
// Set spark.sql.leafNodeDefaultParallelism to 1 for the current master.
// Or set spark.default.parallelism for the previous releases.
spark.conf.set("spark.sql.leafNodeDefaultParallelism", 1)
val df = Seq("a", "b", "c").toDF
df.write.option("pretty", "true").json("/path/to/output")

# Run in a Shell
$ cat /path/to/output/*.json
{
  "value" : "a"
}
 {
  "value" : "b"
}
 {
  "value" : "c"
}
```

### Why are the changes needed?

It's not pretty even though `pretty` option is true.

### Does this PR introduce _any_ user-facing change?

I think "No". Indentation style is changed but JSON format is not changed.

### How was this patch tested?

New test.

Closes #32203 from sarutak/fix-ugly-indentation.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Max Gekk <[email protected]>

* [SPARK-34995] Port/integrate Koalas remaining codes into PySpark

### What changes were proposed in this pull request?

There are some more changes in Koalas such as [databricks/koalas#2141](https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47), [databricks/koalas#2143](https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e) after the main code porting, this PR is to synchronize those changes with the `pyspark.pandas`.

### Why are the changes needed?

We should port the whole Koalas codes into PySpark and synchronize them.

### Does this PR introduce _any_ user-facing change?

Fixed some incompatible behavior with pandas 1.2.0 and added more to the `to_markdown` docstring.

### How was this patch tested?

Manually tested in local.

Closes #32197 from itholic/SPARK-34995-fix.

Authored-by: itholic <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

* [MINOR][DOCS] Soften security warning and keep it in cluster management docs only

### What changes were proposed in this pull request?

Soften security warning and keep it in cluster management docs only, not in the main doc page, where it's not necessarily relevant.

### Why are the changes needed?

The statement is perhaps unnecessarily 'frightening' as the first section in the main docs page. It applies to clusters not local mode, anyhow.

### Does this PR introduce _any_ user-facing change?

Just a docs change.

### How was this patch tested?

N/A

Closes #32206 from srowen/SecurityStatement.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Sean Owen <[email protected]>

* [SPARK-34787][CORE] Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX)

### What changes were proposed in this pull request?
Make the attemptId in the log of historyServer to be more easily to read.

### Why are the changes needed?
Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX)

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
manual test

Closes #32189 from kyoty/history-server-print-option-variable.

Authored-by: kyoty <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-35101][INFRA] Add GitHub status check in PR instead of a comment

### What changes were proposed in this pull request?

TL;DR: now it shows green yellow read status of tests instead of relying on a comment in a PR, **see https://github.com/HyukjinKwon/spark/pull/41 for an example**.

This PR proposes the GitHub status checks instead of a comment that link to the build (from forked repository) in PRs.

This is how it works:

1. **forked repo**: "Build and test" workflow is triggered when you create a branch to create a PR which uses your resources in GitHub Actions.
1. **main repo**: "Notify test workflow" (previously created a comment) now creates a in-progress status (yellow status) as a GitHub Actions check to your current PR.
1.  **main repo**: "Update build status workflow" regularly (every 15 mins) checks open PRs, and updates the status of GitHub Actions checks at PRs according to the status of workflows in the forked repositories (status sync).

**NOTE** that creating/updating statuses in the PRs is only allowed from the main repo. That's why the flow is as above.

### Why are the changes needed?

The GitHub status shows a green although the tests are running, which is confusing.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested at:
- https://github.com/HyukjinKwon/spark/pull/41
- HyukjinKwon#42
- HyukjinKwon#43
- https://github.com/HyukjinKwon/spark/pull/37

**queued**:
<img width="861" alt="Screen Shot 2021-04-16 at 10 56 03 AM" src="https://user-images.githubusercontent.com/6477701/114960831-c9a73080-9ea2-11eb-8442-ddf3f6008a45.png">

**in progress**:
<img width="871" alt="Screen Shot 2021-04-16 at 12 14 39 PM" src="https://user-images.githubusercontent.com/6477701/114966359-59ea7300-9ead-11eb-98cb-1e63323980ad.png">

**passed**:
![Screen Shot 2021-04-16 at 2 04 07 PM](https://user-images.githubusercontent.com/6477701/114974045-a12c3000-9ebc-11eb-9be5-653393a863e6.png)

**failure**:
![Screen Shot 2021-04-16 at 10 46 10 PM](https://user-images.githubusercontent.com/6477701/115033584-90ec7300-9f05-11eb-8f2e-0fc2ef986a70.png)

Closes #32193 from HyukjinKwon/update-checks-pr-poc.

Lead-authored-by: HyukjinKwon <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: Yikun Jiang <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

* [MINOR][INFRA] Upgrade Jira client to 2.0.0

### What changes were proposed in this pull request?

SPARK-10498 added the initial Jira client requirement with 1.0.3 five year ago (2016 January). As of today, it causes `dev/merge_spark_pr.py` failure with `Python 3.9.4` due to this old dependency. This PR aims to upgrade it to the latest version, 2.0.0. The latest version is also a little old (2018 July).
- https://pypi.org/project/jira/#history

### Why are the changes needed?

`Jira==2.0.0` works well with both Python 3.8/3.9 while `Jira==1.0.3` fails with Python 3.9.

**BEFORE**
```
$ pyenv global 3.9.4
$ pip freeze | grep jira
jira==1.0.3
$ dev/merge_spark_pr.py
Traceback (most recent call last):
  File "/Users/dongjoon/APACHE/spark-merge/dev/merge_spark_pr.py", line 39, in <module>
    import jira.client
  File "/Users/dongjoon/.pyenv/versions/3.9.4/lib/python3.9/site-packages/jira/__init__.py", line 5, in <module>
    from .config import get_jira
  File "/Users/dongjoon/.pyenv/versions/3.9.4/lib/python3.9/site-packages/jira/config.py", line 17, in <module>
    from .client import JIRA
  File "/Users/dongjoon/.pyenv/versions/3.9.4/lib/python3.9/site-packages/jira/client.py", line 165
    validate=False, get_server_info=True, async=False, logging=True, max_retries=3):
                                          ^
SyntaxError: invalid syntax
```

**AFTER**
```
$ pip install jira==2.0.0
$ dev/merge_spark_pr.py
git rev-parse --abbrev-ref HEAD
Which pull request would you like to merge? (e.g. 34):
```

### Does this PR introduce _any_ user-facing change?

No. This is a committer-only script.

### How was this patch tested?

Manually.

Closes #32215 from dongjoon-hyun/jira.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

* [SPARK-35116][SQL][TESTS] The generated data fits the precision of DayTimeIntervalType in spark

### What changes were proposed in this pull request?
The precision of `java.time.Duration` is nanosecond, but when it is used as `DayTimeIntervalType` in Spark, it is microsecond.
At present, the `DayTimeIntervalType` data generated in the implementation of `RandomDataGenerator` is accurate to nanosecond, which will cause the `DayTimeIntervalType` to be converted to long, and then back to `DayTimeIntervalType` to lose the accuracy, which will cause the test to fail. For example: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137390/testReport/org.apache.spark.sql.hive.execution/HashAggregationQueryWithControlledFallbackSuite/udaf_with_all_data_types/

### Why are the changes needed?
Improve `RandomDataGenerator` so that the generated data fits the precision of DayTimeIntervalType in spark.

### Does this PR introduce _any_ user-facing change?
'No'. Just change the test class.

### How was this patch tested?
Jenkins test.

Closes #32212 from beliefer/SPARK-35116.

Authored-by: beliefer <[email protected]>
Signed-off-by: Max Gekk <[email protected]>

* [SPARK-35114][SQL][TESTS] Add checks for ANSI intervals to `LiteralExpressionSuite`

### What changes were proposed in this pull request?
In the PR, I propose to add additional checks for ANSI interval types `YearMonthIntervalType` and `DayTimeIntervalType` to `LiteralExpressionSuite`.

Also, I replaced some long literal values by `CalendarInterval` to check `CalendarIntervalType` that the tests were supposed to check.

### Why are the changes needed?
To improve test coverage and have the same checks for ANSI types as for `CalendarIntervalType`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suite:
```
$ build/sbt "test:testOnly *LiteralExpressionSuite"
```

Closes #32213 from MaxGekk/interval-literal-tests.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>

* [SPARK-34716][SQL] Support ANSI SQL intervals by the aggregate function `sum`

### What changes were proposed in this pull request?
Extend the `Sum` expression to  to support `DayTimeIntervalType` and `YearMonthIntervalType` added by #31614.

Note: the expressions can throw the overflow exception independently from the SQL config `spark.sql.ansi.enabled`. In this way, the modified expressions always behave in the ANSI mode for the intervals.

### Why are the changes needed?
Extend `org.apache.spark.sql.catalyst.expressions.aggregate.Sum` to support `DayTimeIntervalType` and `YearMonthIntervalType`.

### Does this PR introduce _any_ user-facing change?
'No'.
Should not since new types have not been released yet.

### How was this patch tested?
Jenkins test

Closes #32107 from beliefer/SPARK-34716.

Lead-authored-by: gengjiaan <[email protected]>
Co-authored-by: beliefer <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Max Gekk <[email protected]>

* [SPARK-35115][SQL][TESTS] Check ANSI intervals in `MutableProjectionSuite`

### What changes were proposed in this pull request?
Add checks for `YearMonthIntervalType` and `DayTimeIntervalType` to `MutableProjectionSuite`.

### Why are the changes needed?
To improve test coverage, and the same checks as for `CalendarIntervalType`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suite:
```
$ build/sbt "test:testOnly *MutableProjectionSuite"
```

Closes #32225 from MaxGekk/test-ansi-intervals-in-MutableProjectionSuite.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>

* [SPARK-35092][UI] the auto-generated rdd's name in the storage tab should be truncated if it is too long

### What changes were proposed in this pull request?
the auto-generated rdd's name in the storage tab should be truncated  as a single line if it is too long.

### Why are the changes needed?
to make the ui shows more friendly.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
just a simple modifition in css, manual test works well like below:

before modified:
![the rdd title in storage page shows too long](https://user-images.githubusercontent.com/52202080/115009655-17da2500-9edf-11eb-86a7-088bed7ef8f7.png)

after modified:
Tht titile  needs just one line:

![storage标题过长修改后](https://user-images.githubusercontent.com/52202080/114872091-8c07c080-9e2c-11eb-81a8-0c097b1a77bf.png)

Closes #32191 from kyoty/storage-rdd-titile-display-improve.

Authored-by: kyoty <[email protected]>
Signed-off-by: Kousuke Saruta <[email protected]>

* [SPARK-35109][SQL] Fix minor exception messages of HashedRelation and HashJoin

### What changes were proposed in this pull request?

It seems that we miss classifying one `SparkOutOfMemoryError` in `HashedRelation`. Add the error classification for it. In addition, clean up two errors definition of `HashJoin` as they are not used.

### Why are the changes needed?

Better error classification.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32211 from c21/error-message.

Authored-by: Cheng Su <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>

* [SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function

### What changes were proposed in this pull request?
This PR:
- Adds a new expression `GroupingExprRef` that can be used in aggregate expressions of `Aggregate` nodes to refer grouping expressions by index. These expressions capture the data type and nullability of the referred grouping expression.
- Adds a new rule `EnforceGroupingReferencesInAggregates` that inserts the references in the beginning of the optimization phase.
- Adds a new rule `UpdateGroupingExprRefNullability` to update nullability of `GroupingExprRef` expressions as nullability of referred grouping expression can change during optimization.

### Why are the changes needed?
If aggregate expressions (without aggregate functions) in an `Aggregate` node are complex then the `Optimizer` can optimize out grouping expressions from them and so making aggregate expressions invalid.

Here is a simple example:
```
SELECT not(t.id IS NULL) , count(*)
FROM t
GROUP BY t.id IS NULL
```
In this case the `BooleanSimplification` rule does this:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
!Aggregate [isnull(id#222)], [NOT isnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]   Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]
 +- Project [value#219 AS id#222]                                                                 +- Project [value#219 AS id#222]
    +- LocalRelation [value#219]                                                                     +- LocalRelation [value#219]
```
where `NOT isnull(id#222)` is optimized to `isnotnull(id#222)` and so it no longer refers to any grouping expression.

Before this PR:
```
== Optimized Logical Plan ==
Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#234, count(1) AS c#232L]
+- Project [value#219 AS id#222]
   +- LocalRelation [value#219]
```
and running the query throws an error:
```
Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
java.lang.IllegalStateException: Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
```

After this PR:
```
== Optimized Logical Plan ==
Aggregate [isnull(id#222)], [NOT groupingexprref(0) AS (NOT (id IS NULL))#234, count(1) AS c#232L]
+- Project [value#219 AS id#222]
   +- LocalRelation [value#219]
```
and the query works.

### Does this PR introduce _any_ user-facing change?
Yes, the query works.

### How was this patch tested?
Added new UT.

Closes #31913 from peter-toth/SPARK-34581-keep-grouping-expressions.

Authored-by: Peter Toth <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-35122][SQL] Migrate CACHE/UNCACHE TABLE to use AnalysisOnlyCommand

### What changes were proposed in this pull request?

Now that `AnalysisOnlyCommand` in introduced in #32032, `CacheTable` and `UncacheTable` can extend `AnalysisOnlyCommand` to simplify the code base. For example, the logic to handle these commands such that the tables are only analyzed is scattered across different places.

### Why are the changes needed?

To simplify the code base to handle these two commands.

### Does this PR introduce _any_ user-facing change?

No, just internal refactoring.

### How was this patch tested?

The existing tests (e.g., `CachedTableSuite`) cover the changes in this PR. For example, if I make `CacheTable`/`UncacheTable` extend `LeafCommand`, there are few failures in `CachedTableSuite`.

Closes #32220 from imback82/cache_cmd_analysis_only.

Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-31937][SQL] Support processing ArrayType/MapType/StructType data using no-serde mode script transform

### What changes were proposed in this pull request?
Support no-serde mode script transform use ArrayType/MapType/StructStpe data.

### Why are the changes needed?
Make user can process array/map/struct data

### Does this PR introduce _any_ user-facing change?
Yes, user can process array/map/struct data in script transform `no-serde` mode

### How was this patch tested?
Added UT

Closes #30957 from AngersZhuuuu/SPARK-31937.

Lead-authored-by: Angerszhuuuu <[email protected]>
Co-authored-by: angerszhu <[email protected]>
Co-authored-by: AngersZhuuuu <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

* [SPARK-35045][SQL][FOLLOW-UP] Add a configuration for CSV input buffer size

### What changes were proposed in this pull request?

This PR makes the input buffer configurable (as an internal configuration). This is mainly to work around the regression in uniVocity/univocity-parsers#449.

This is particularly useful for SQL workloads that would otherwise require rewriting `CREATE TABLE` statements with options.

### Why are the changes needed?

To work around uniVocity/univocity-parsers#449.

### Does this PR introduce _any_ user-facing change?

No, it's an internal option only.

### How was this patch tested?

Manually tested by modifying the unittest added in https://github.com/apache/spark/pull/31858 as below:

```diff
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index fd25a79619d..705f38dbfbd 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -2456,6 +2456,7 @@ abstract class CSVSuite
   test("SPARK-34768: counting a long record with ignoreTrailingWhiteSpace set to true") {
     val bufSize = 128
     val line = "X" * (bufSize - 1) + "| |"
+    spark.conf.set("spark.sql.csv.parser.inputBufferSize", 128)
     withTempPath { path =>
       Seq(line).toDF.write.text(path.getAbsolutePath)
       assert(spark.read.format("csv")
```

Closes #32231 from HyukjinKwon/SPARK-35045-followup.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

* [SPARK-34837][SQL] Support ANSI SQL intervals by the aggregate function `avg`

### What changes were proposed in this pull request?
Extend the `Average` expression to support `DayTimeIntervalType` and `YearMonthIntervalType` added by #31614.

Note: the expressions can throw the overflow exception independently from the SQL config `spark.sql.ansi.enabled`. In this way, the modified expressions always behave in the ANSI mode for the intervals.
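
As an illustration, a hedged sketch of averaging a year-month interval column built from `java.time.Period` values (which map to `YearMonthIntervalType`); the column name and data are made up:
```scala
// Hedged sketch: average over a YearMonthIntervalType column.
import java.time.Period
import org.apache.spark.sql.functions.avg
import spark.implicits._

Seq(Period.ofMonths(1), Period.ofMonths(3))
  .toDF("i")
  .agg(avg($"i"))   // expected to yield a 2-month interval
  .show()
```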

### Why are the changes needed?
Extend `org.apache.spark.sql.catalyst.expressions.aggregate.Average` to support `DayTimeIntervalType` and `YearMonthIntervalType`.

### Does this PR introduce _any_ user-facing change?
No.
It shouldn't, since the new types have not been released yet.

### How was this patch tested?
Jenkins test

Closes #32229 from beliefer/SPARK-34837.

Authored-by: gengjiaan <[email protected]>
Signed-off-by: Max Gekk <[email protected]>

* [SPARK-35107][SQL] Parse unit-to-unit interval literals to ANSI intervals

### What changes were proposed in this pull request?
Parse the year-month interval literals like `INTERVAL '1-1' YEAR TO MONTH` to values of `YearMonthIntervalType`, and day-time interval literals to `DayTimeIntervalType` values. Currently, Spark SQL supports:
- DAY TO HOUR
- DAY TO MINUTE
- DAY TO SECOND
- HOUR TO MINUTE
- HOUR TO SECOND
- MINUTE TO SECOND

All such interval literals are converted to `DayTimeIntervalType`, and `YEAR TO MONTH` to `YearMonthIntervalType`, while losing info about the `from` and `to` units.

**Note**: the new behavior is guarded by the SQL config `spark.sql.legacy.interval.enabled`, which is `false` by default. When the config is set to `true`, the interval literals are parsed to `CalendarIntervalType` values.

Closes #32176

### Why are the changes needed?
To conform to the ANSI SQL standard, which assumes conversion of interval literals to year-month or day-time intervals but not to a mixed interval type like Catalyst's `CalendarIntervalType`.

### Does this PR introduce _any_ user-facing change?
Yes.

Before:
```sql
spark-sql> SELECT INTERVAL '1 01:02:03.123' DAY TO SECOND;
1 days 1 hours 2 minutes 3.123 seconds
spark-sql> SELECT typeof(INTERVAL '1 01:02:03.123' DAY TO SECOND);
interval
```

After:
```sql
spark-sql> SELECT INTERVAL '1 01:02:03.123' DAY TO SECOND;
1 01:02:03.123000000
spark-sql> SELECT typeof(INTERVAL '1 01:02:03.123' DAY TO SECOND);
day-time interval
```

### How was this patch tested?
1. By running the affected test suites:
```
$ ./build/sbt "test:testOnly *.ExpressionParserSuite"
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite -- -z create_view.sql"
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite -- -z date.sql"
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite -- -z timestamp.sql"
```
2. PostgreSQL tests are executed with `spark.sql.legacy.interval.enabled` set to `true` to keep compatibility with PostgreSQL output:
```sql
> SELECT interval '999' second;
0 years 0 mons 0 days 0 hours 16 mins 39.00 secs
```

Closes #32209 from MaxGekk/parse-ansi-interval-literals.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>

* [SPARK-34715][SQL][TESTS] Add round trip tests for period <-> month and duration <-> micros

### What changes were proposed in this pull request?
Similarly to the test from the PR https://github.com/apache/spark/pull/31799, add tests:
1. Months -> Period -> Months
2. Period -> Months -> Period
3. Duration -> micros -> Duration

### Why are the changes needed?
Add round trip tests for period <-> months and duration <-> micros.
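
A hedged sketch of the round-trip property being tested, expressed with plain `java.time` types; the helper names below are illustrative, not Spark's internal converters:
```scala
// Hedged sketch: months <-> Period and micros <-> Duration round trips.
import java.time.{Duration, Period}

def periodToMonths(p: Period): Int = p.getYears * 12 + p.getMonths   // ignores days, like year-month intervals
def monthsToPeriod(m: Int): Period = Period.of(m / 12, m % 12, 0)

def durationToMicros(d: Duration): Long = d.toNanos / 1000L
def microsToDuration(us: Long): Duration = Duration.ofNanos(us * 1000L)

assert(periodToMonths(monthsToPeriod(25)) == 25)                                  // months -> Period -> months
assert(monthsToPeriod(periodToMonths(Period.of(2, 1, 0))) == Period.of(2, 1, 0))  // Period -> months -> Period
assert(microsToDuration(durationToMicros(Duration.ofSeconds(90))) == Duration.ofSeconds(90)) // Duration -> micros -> Duration
```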

### Does this PR introduce _any_ user-facing change?
No. Just test cases.

### How was this patch tested?
Jenkins test

Closes #32234 from beliefer/SPARK-34715.

Authored-by: gengjiaan <[email protected]>
Signed-off-by: Max Gekk <[email protected]>

* [SPARK-35125][K8S] Upgrade K8s client to 5.3.0 to support K8s 1.20

### What changes were proposed in this pull request?

Although the current master branch already works with K8s 1.20, this PR aims to upgrade the K8s client to 5.3.0 to support K8s 1.20 officially.
- https://github.com/fabric8io/kubernetes-client#compatibility-matrix

The following are the notable breaking API changes.

1. Remove Doneable (5.0+):
    - https://github.com/fabric8io/kubernetes-client/pull/2571
2. Change Watcher.onClose signature (5.0+):
    - https://github.com/fabric8io/kubernetes-client/pull/2616
3. Change Readiness (5.1+)
    - https://github.com/fabric8io/kubernetes-client/pull/2796

### Why are the changes needed?

According to the compatibility matrix, this makes Apache Spark and its external cluster manager extension support all K8s 1.20 features officially for Apache Spark 3.2.0.

### Does this PR introduce _any_ user-facing change?

Yes, this is a dev dependency change which affects K8s cluster extension users.

### How was this patch tested?

Pass the CIs.

This is manually tested with K8s IT.
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 17 minutes, 44 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #32221 from dongjoon-hyun/SPARK-K8S-530.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-35102][SQL] Make spark.sql.hive.version read-only, not deprecated and meaningful

### What changes were proposed in this pull request?

Firstly let's take a look at the definition and comment.

```
// A fake config which is only here for backward compatibility reasons. This config has no effect
// to Spark, just for reporting the builtin Hive version of Spark to existing applications that
// already rely on this config.
val FAKE_HIVE_VERSION = buildConf("spark.sql.hive.version")
  .doc(s"deprecated, please use ${HIVE_METASTORE_VERSION.key} to get the Hive version in Spark.")
  .version("1.1.1")
  .fallbackConf(HIVE_METASTORE_VERSION)
```
It is used for reporting the built-in Hive version, but the current status is unsatisfactory because it could be changed in many ways, e.g. via `--conf` or the `SET` syntax.

It has been marked as deprecated but kept around until now. It is probably hard for us to remove it, and not even necessary.

On second thought, it's actually good for us to keep it alongside `spark.sql.hive.metastore.version`. When `spark.sql.hive.metastore.version` is changed, this config can still report the compiled Hive version statically, which is useful when an error occurs in that case. So this parameter should be fixed to the compiled Hive version.
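
For illustration, a hedged sketch of how the config behaves after this change (the exact output and error are indicative only):
```scala
// Hedged sketch: the config now always reports the built-in (compiled) Hive version.
spark.sql("SET spark.sql.hive.version").show(truncate = false)
// Attempting to override it, e.g. spark.conf.set("spark.sql.hive.version", "1.2.1"),
// is now expected to be rejected because the config is read-only.
```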

### Why are the changes needed?

`spark.sql.hive.version` is useful in certain cases and should be read-only

### Does this PR introduce _any_ user-facing change?

`spark.sql.hive.version` is now read-only

### How was this patch tested?

new test cases

Closes #32200 from yaooqinn/SPARK-35102.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-35136] Remove initial null value of LiveStage.info

### What changes were proposed in this pull request?
To prevent potential NullPointerExceptions, this PR changes the `LiveStage` constructor to take `info` as a constructor parameter and adds a null check in `AppStatusListener.activeStages`.

### Why are the changes needed?
`AppStatusListener.getOrCreateStage` would create a `LiveStage` object with the `info` field set to null and only afterwards set it to a specific `StageInfo` object. This can lead to a race condition when the live stages are read in between those calls, which could then cause a null pointer exception in, for instance, `AppStatusListener.activeStages`.
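
A hedged, self-contained illustration of the pattern (these are stand-in classes, not Spark's real `LiveStage`/`StageInfo`):
```scala
// Hedged sketch: require `info` at construction time so there is no window where it is null.
object LiveStageSketch {
  final case class StageInfo(stageId: Int, name: String)

  // `info` is now a required constructor parameter instead of a field initialized to null.
  final class LiveStage(@volatile var info: StageInfo)

  // Readers can still guard defensively, mirroring the null check added to activeStages:
  def activeStageInfos(stages: Seq[LiveStage]): Seq[StageInfo] =
    stages.map(_.info).filter(_ != null)
}
```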

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Regular CI/CD tests

Closes #32233 from sander-goos/SPARK-35136-livestage.

Authored-by: Sander Goos <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-35138][SQL] Remove Antlr4 workaround

### What changes were proposed in this pull request?

Remove Antlr 4.7 workaround.

### Why are the changes needed?

The issue has been fixed upstream in https://github.com/antlr/antlr4/commit/ac9f7530, so the workaround can be removed to simplify the code.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UTs.

Closes #32238 from pan3793/antlr-minor.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-35120][INFRA] Guide users to sync branch and enable GitHub Actions in their forked repository

### What changes were proposed in this pull request?

This PR proposes to add messages when the workflow fails to find the workflow run in a forked repository, for example as below:

**Before**

![Screen Shot 2021-04-19 at 9 41 52 PM](https://user-images.githubusercontent.com/6477701/115238011-28e19b00-a158-11eb-8c5c-6374ca1e9790.png)

![Screen Shot 2021-04-19 at 9 42 00 PM](https://user-images.githubusercontent.com/6477701/115237984-22ebba00-a158-11eb-9b0f-11fe11072830.png)

**After**

![Screen Shot 2021-04-19 at 9 25 32 PM](https://user-images.githubusercontent.com/6477701/115237507-9c36dd00-a157-11eb-8ba7-f5f88caa1058.png)

![Screen Shot 2021-04-19 at 9 23 13 PM](https://user-images.githubusercontent.com/6477701/115236793-c2a84880-a156-11eb-98fc-1bb7d4bc31dd.png)
(typo `foce` in the image was fixed)

See this example: https://github.com/HyukjinKwon/spark/runs/2380644793

### Why are the changes needed?

To guide users to enable GitHub Actions in their forked repositories (and sync their branch to the latest `master` in Apache Spark).

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in:
- https://github.com/HyukjinKwon/spark/pull/47
- https://github.com/HyukjinKwon/spark/pull/46

Closes #32235 from HyukjinKwon/test-test-test.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-35131][K8S] Support early driver service clean-up during app termination

### What changes were proposed in this pull request?

This PR aims to support a new configuration, `spark.kubernetes.driver.service.deleteOnTermination`, to clean up `Driver Service` resource during app termination.

### Why are the changes needed?

The K8s service is one of the important resources and sometimes it's controlled by quota.
```
$ k describe quota
Name:       service
Namespace:  default
Resource    Used  Hard
--------    ----  ----
services    1     3
```

Apache Spark creates a service for the driver whose lifecycle is the same as the driver pod's.
It means a new Spark job submission fails if the number of completed Spark jobs equals the service quota limit.

**BEFORE**
```
$ k get pod
NAME                                                        READY   STATUS      RESTARTS   AGE
org-apache-spark-examples-sparkpi-a32c9278e7061b4d-driver   0/1     Completed   0          31m
org-apache-spark-examples-sparkpi-a9f1f578e721ef62-driver   0/1     Completed   0          78s

$ k get svc
NAME                                                            TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
kubernetes                                                      ClusterIP   10.96.0.1    <none>        443/TCP                      80m
org-apache-spark-examples-sparkpi-a32c9278e7061b4d-driver-svc   ClusterIP   None         <none>        7078/TCP,7079/TCP,4040/TCP   31m
org-apache-spark-examples-sparkpi-a9f1f578e721ef62-driver-svc   ClusterIP   None         <none>        7078/TCP,7079/TCP,4040/TCP   80s

$ k describe quota
Name:       service
Namespace:  default
Resource    Used  Hard
--------    ----  ----
services    3     3

$ bin/spark-submit...
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException:
Failure executing: POST at: https://192.168.64.50:8443/api/v1/namespaces/default/services.
Message: Forbidden! User minikube doesn't have permission.
services "org-apache-spark-examples-sparkpi-843f6978e722819c-driver-svc" is forbidden:
exceeded quota: service, requested: services=1, used: services=3, limited: services=3.
```

**AFTER**
```
$ k get pod
NAME                                                        READY   STATUS      RESTARTS   AGE
org-apache-spark-examples-sparkpi-23d5f278e77731a7-driver   0/1     Completed   0          26s
org-apache-spark-examples-sparkpi-d1292278e7768ed4-driver   0/1     Completed   0          67s
org-apache-spark-examples-sparkpi-e5bedf78e776ea9d-driver   0/1     Completed   0          44s

$ k get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   172m

$ k describe quota
Name:       service
Namespace:  default
Resource    Used  Hard
--------    ----  ----
services    1     3
```

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds a new configuration, `spark.kubernetes.driver.service.deleteOnTermination`, and enables it by default.
The change is documented in the migration guide.
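
For example, a hedged sketch of opting out of the new default via `SparkConf` (the key is the one introduced by this PR; everything else is illustrative):
```scala
// Hedged sketch: keep the driver service after app termination (the old behavior).
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("example")
  .set("spark.kubernetes.driver.service.deleteOnTermination", "false")
```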

### How was this patch tested?

Pass the CIs.

This is tested with K8s IT manually.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 19 minutes, 9 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #32226 from dongjoon-hyun/SPARK-35131.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-35103][SQL] Make TypeCoercion rules more efficient

## What changes were proposed in this pull request?
This PR fixes a couple of things in TypeCoercion rules:
- Only run the propagate types step if the children of a node have output attributes with changed dataTypes and/or nullability. This is implemented as a custom tree transformation. The TypeCoercion rules now only implement a partial function.
- Combine multiple type coercion rules into a single rule. Multiple rules are applied in a single tree traversal.
- Reduce calls to conf.get in DecimalPrecision. This now happens once per tree traversal, instead of once per matched expression.
- Reduce the use of withNewChildren.

This brings down the number of CPU cycles spent in analysis by ~28% (benchmark: 10 iterations of all TPC-DS queries on SF10).

## How was this patch tested?
Existing tests.

Closes #32208 from sigmod/coercion.

Authored-by: Yingyi Bu <[email protected]>
Signed-off-by: herman <[email protected]>

* [SPARK-35117][UI] Change progress bar back to highlight ratio of tasks in progress

### What changes were proposed in this pull request?
Small UI update to highlight the number of tasks in progress in a stage/job instead of highlighting the whole in-progress stage/job. This was the behavior before Spark 3.1 and the Bootstrap 4 upgrade.

### Why are the changes needed?

To add back functionality lost between 3.0 and 3.1. This provides a great visual cue of how much of a stage/job is currently running.

### Does this PR introduce _any_ user-facing change?

Small UI change.

Before:
![image](https://user-images.githubusercontent.com/3536454/115216189-3fddaa00-a0d2-11eb-88e0-e3be925c92f0.png)

After (and pre Spark 3.1):
![image](https://user-images.githubusercontent.com/3536454/115216216-48ce7b80-a0d2-11eb-9953-2adb3b377133.png)

### How was this patch tested?

Updated existing UT.

Closes #32214 from Kimahriman/progress-bar-started.

Authored-by: Adam Binford <[email protected]>
Signed-off-by: Kousuke Saruta <[email protected]>

* [SPARK-35080][SQL] Only allow a subset of correlated equality predicates when a subquery is aggregated

### What changes were proposed in this pull request?
This PR updated the `foundNonEqualCorrelatedPred` logic for correlated subqueries in `CheckAnalysis` to only allow correlated equality predicates that guarantee one-to-one mapping between inner and outer attributes, instead of all equality predicates.

### Why are the changes needed?
To fix correctness bugs. Before this fix Spark can give wrong results for certain correlated subqueries that pass CheckAnalysis:
Example 1:
```sql
create or replace view t1(c) as values ('a'), ('b')
create or replace view t2(c) as values ('ab'), ('abc'), ('bc')

select c, (select count(*) from t2 where t1.c = substring(t2.c, 1, 1)) from t1
```
Correct results: [(a, 2), (b, 1)]
Spark results:
```
+---+-----------------+
|c  |scalarsubquery(c)|
+---+-----------------+
|a  |1                |
|a  |1                |
|b  |1                |
+---+-----------------+
```
Example 2:
```sql
create or replace view t1(a, b) as values (0, 6), (1, 5), (2, 4), (3, 3);
create or replace view t2(c) as values (6);

select c, (select count(*) from t1 where a + b = c) from t2;
```
Correct results: [(6, 4)]
Spark results:
```
+---+-----------------+
|c  |scalarsubquery(c)|
+---+-----------------+
|6  |1                |
|6  |1                |
|6  |1                |
|6  |1                |
+---+-----------------+
```
### Does this PR introduce _any_ user-facing change?
Yes. Users will not be able to run queries that contain unsupported correlated equality predicates.

### How was this patch tested?
Added unit tests.

Closes #32179 from allisonwang-db/spark-35080-subquery-bug.

Lead-authored-by: allisonwang-db <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-35052][SQL] Use static bits for AttributeReference and Literal

### What changes were proposed in this pull request?

- Share a static ImmutableBitSet for `treePatternBits` in all object instances of AttributeReference.
- Share three static ImmutableBitSets for  `treePatternBits` in three kinds of Literals.
- Add an ImmutableBitSet as a subclass of BitSet.

### Why are the changes needed?

Reduce the additional memory usage caused by `treePatternBits`.
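
A hedged sketch of the sharing idea using `scala.collection.immutable.BitSet` (Spark has its own `BitSet`/`ImmutableBitSet` classes, and the pattern ids below are made up):
```scala
// Hedged sketch: one shared, immutable bit set per leaf-node kind instead of a
// freshly allocated mutable BitSet per tree-node instance.
import scala.collection.immutable.BitSet

object SharedPatternBits {
  val attributeReferenceBits: BitSet = BitSet(7)   // hypothetical pattern id
  val literalBits: BitSet            = BitSet(11)  // hypothetical pattern id
}

final class MyAttributeReference(val name: String) {
  // every instance returns the same shared set: no per-instance allocation
  def treePatternBits: BitSet = SharedPatternBits.attributeReferenceBits
}
```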

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32157 from sigmod/leaf.

Authored-by: Yingyi Bu <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>

* [SPARK-35134][BUILD][TESTS] Manually exclude redundant netty jars in SparkBuild.scala to avoid version conflicts in test

### What changes were proposed in this pull request?
The following logs are printed when Jenkins executes the [PySpark pip packaging tests](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137500/console):

```
copying deps/jars/netty-all-4.1.51.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-buffer-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-codec-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-common-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-handler-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-resolver-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-transport-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-transport-native-epoll-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
```

Two different versions of Netty 4 jars are copied to the jars directory, even though the `netty-xxx-4.1.50.Final.jar` artifacts do not appear in the Maven `dependency:tree` output and Spark only needs to rely on `netty-all-xxx.jar`.

So this PR adds new `ExclusionRule`s to `SparkBuild.scala` to exclude the unnecessary Netty 4 dependencies.
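
A hedged sbt sketch of the kind of exclusion rules added (the dependency and module names are illustrative; the real change lives in `SparkBuild.scala`):
```scala
// Hedged sketch: exclude the split netty-* artifacts pulled in transitively,
// keeping only netty-all on the test classpath.
libraryDependencies += ("org.apache.zookeeper" % "zookeeper" % "3.6.2")
  .excludeAll(
    ExclusionRule("io.netty", "netty-buffer"),
    ExclusionRule("io.netty", "netty-codec"),
    ExclusionRule("io.netty", "netty-common"),
    ExclusionRule("io.netty", "netty-handler"),
    ExclusionRule("io.netty", "netty-resolver"),
    ExclusionRule("io.netty", "netty-transport")
  )
```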

### Why are the changes needed?
Make sure that only `netty-all-xxx.jar` is used in the test to avoid possible jar conflicts.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- Pass the Jenkins or GitHub Action
- Check Jenkins log manually, there should be only

`copying deps/jars/netty-all-4.1.51.Final.jar -> pyspark-3.2.0.dev0/deps/jars`

and there should be no such logs as

```
copying deps/jars/netty-buffer-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-codec-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-common-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-handler-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-resolver-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-transport-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
copying deps/jars/netty-transport-native-epoll-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
```

Closes #32230 from LuciferYang/SPARK-35134.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

* [SPARK-35018][SQL][TESTS] Check transferring of year-month intervals via Hive Thrift server

### What changes were proposed in this pull request?
1. Add a test to check that Thrift server is able to collect year-month intervals and transfer them via thrift protocol.
2. Improve the similar test for day-time intervals. After the changes, the test doesn't depend on the result of date subtractions. In the future, the result type of date subtraction may change, so the current PR makes the test tolerant to such changes.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suite:
```
$ ./build/sbt -Phive -Phive-thriftserver "test:testOnly *SparkThriftServerProtocolVersionsSuite"
```

Closes #32240 from MaxGekk/year-month-interval-thrift-protocol.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>

* [SPARK-34974][SQL] Improve subquery decorrelation framework

### What changes were proposed in this pull request?
This PR implements the decorrelation technique in the paper "Unnesting Arbitrary Queries" by T. Neumann; A. Kemper
(http://www.btw-2015.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf). It currently supports Filter, Project, Aggregate, Join, and any UnaryNode that passes CheckAnalysis.

This feature can be controlled by the config `spark.sql.optimizer.decorrelateInnerQuery.enabled` (default: true).

A few notes:
1. This PR does not relax any constraints in CheckAnalysis for correlated subqueries, even though some cases can be supported by this new framework, such as aggregate with correlated non-equality predicates. This PR focuses on adding the new framework and making sure all existing cases can be supported. Constraints can be relaxed gradually in the future via separate PRs.
2. The new framework is only enabled for correlated scalar subqueries, as the first step. EXISTS/IN subqueries can be supported in the future.

### Why are the changes needed?
Currently, Spark has limited support for correlated subqueries. It only allows `Filter` to reference outer query columns and does not support non-equality predicates when the subquery is aggregated. This new framework will allow more operators to host outer column references and support correlated non-equality predicates and more types of operators in correlated subqueries.
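
For context, a hedged sketch of the class of queries this framework targets and the config that controls it (table and column names are made up, and the tables are assumed to exist; which exact shapes are accepted still depends on CheckAnalysis):
```scala
// Hedged sketch: a correlated scalar subquery; the new framework decorrelates the
// inner query when spark.sql.optimizer.decorrelateInnerQuery.enabled is true (the default).
spark.conf.set("spark.sql.optimizer.decorrelateInnerQuery.enabled", "true")

spark.sql("""
  SELECT c1,
         (SELECT count(*) FROM t2 WHERE t2.c2 = t1.c1) AS cnt
  FROM t1
""").explain(true)
```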

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit and SQL query tests and new optimizer plan tests.

Closes #32072 from allisonwang-db/spark-34974-decorrelation.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-35068][SQL] Add tests for ANSI intervals to HiveThriftBinaryServerSuite

### What changes were proposed in this pull request?
After the PR https://github.com/apache/spark/pull/32209, this should be possible now.
We can add test cases for ANSI intervals to HiveThriftBinaryServerSuite.

### Why are the changes needed?
Add more test cases.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32250 from AngersZhuuuu/SPARK-35068.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Max Gekk <[email protected]>

* [SPARK-33976][SQL][DOCS] Add a SQL doc page for a TRANSFORM clause

### What changes were proposed in this pull request?
Add documentation about `TRANSFORM` and related functions.

![image](https://user-images.githubusercontent.com/46485123/114332579-1627fe80-9b79-11eb-8fa7-131f0a20f72f.png)

### Why are the changes needed?
To document the `TRANSFORM` clause and related functions for SQL users.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed; this is a doc-only change.

Closes #31010 from AngersZhuuuu/SPARK-33976.

Lead-authored-by: Angerszhuuuu <[email protected]>
Co-authored-by: angerszhu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-34877][CORE][YARN] Add the code change for adding the Spark AM log link in spark UI

### What changes were proposed in this pull request?
When running a Spark job on YARN with client deploy mode, the Spark driver and the Spark Application Master (AM) launch in two separate containers. In various scenarios there is a need to see the Spark AM logs to check resource allocation, decommissioning status, and other information shared between the YARN RM and the Spark AM.

In cluster mode the Spark driver and the Spark AM run in the same container, so the driver log link is already available in the Spark UI.

This PR adds the Spark AM log link for Spark jobs running in client mode on YARN. Instead of searching for the container id and then finding the logs, we can check them directly in the Spark UI.

This change only shows the AM log links in client mode when the resource manager is YARN.

### Why are the changes needed?
Till now the only way to check this was by finding the container id of the AM and checking the logs either using the YARN CLI or the YARN RM Application History Server.

This PR adds the Spark AM log link for Spark jobs running in client mode on YARN. Instead of searching for the container id and then finding the logs, we can check them directly in the Spark UI.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a unit test and also checked the Spark UI.
**In Yarn Client mode**
Before Change

![image](https://user-images.githubusercontent.com/34540906/112644861-e1733200-8e6b-11eb-939b-c76ca9902a4e.png)

After the Change - The AM info is there

![image](https://user-images.githubusercontent.com/34540906/115264198-b7075280-a153-11eb-98f3-2aed66ffad2a.png)

AM Log

![image](https://user-images.githubusercontent.com/34540906/112645680-c0f7a780-8e6c-11eb-8b82-4ccc0aee927b.png)

**In Yarn Cluster Mode**  - The AM log link will not be there

![image](https://user-images.githubusercontent.com/34540906/112649512-86900980-8e70-11eb-9b37-69d5c4b53ffa.png)

Closes #31974 from SaurabhChawla100/SPARK-34877.

Authored-by: SaurabhChawla <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>

* [SPARK-34035][SQL] Refactor ScriptTransformation to remove input parameter and replace it by child.output

### What changes were proposed in this pull request?
Refactor ScriptTransformation to remove input parameter and replace it by child.output

### Why are the changes needed?
refactor code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #32228 from AngersZhuuuu/SPARK-34035.

Lead-authored-by: Angerszhuuuu <[email protected]>
Co-authored-by: AngersZhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-34338][SQL] Report metrics from Datasource v2 scan

### What changes were proposed in this pull request?

This patch proposes to leverage the `CustomMetric` and `CustomTaskMetric` APIs to report custom metrics from a DS v2 scan to Spark.

### Why are the changes needed?

This is related to #31398. In SPARK-34297, we want to add a couple of metrics when reading from Kafka in SS. We need some public API changes in DS v2 to make it possible. This extracts only the DS v2 change and makes it general for DS v2 instead of the micro-batch DS v2 API.
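
A hedged sketch of what a connector-side metric could look like with these APIs (assuming the `CustomMetric`/`CustomTaskMetric` interfaces under `org.apache.spark.sql.connector.metric`; the metric name and aggregation are illustrative):
```scala
// Hedged sketch: a driver-side metric definition plus the per-task value a
// DS v2 reader could report back to Spark.
import org.apache.spark.sql.connector.metric.{CustomMetric, CustomTaskMetric}

class BytesReadMetric extends CustomMetric {
  override def name(): String = "bytesRead"
  override def description(): String = "total bytes read by this scan"
  // Called on the driver to fold the per-task values into a display string.
  override def aggregateTaskMetrics(taskMetrics: Array[Long]): String =
    s"${taskMetrics.sum} bytes"
}

class BytesReadTaskMetric(bytes: Long) extends CustomTaskMetric {
  override def name(): String = "bytesRead"
  override def value(): Long = bytes
}
```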

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Implement a simple test DS v2 class locally and run it:

```scala
scala> import org.apache.spark.sql.execution.datasources.v2._
import org.apache.spark.sql.execution.datasources.v2._

scala> classOf[CustomMetricDataSourceV2].getName
res0: String = org.apache.spark.sql.execution.datasources.v2.CustomMetricDataSourceV2

scala> val df = spark.read.format(res0).load()
df: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> df.collect
```

<img width="703" alt="Screen Shot 2021-03-30 at 11 07 13 PM" src="https://user-images.githubusercontent.com/68855/113098080-d8a49800-91ac-11eb-8681-be408a0f2e69.png">

Closes #31451 from viirya/dsv2-metrics.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-35145][SQL] CurrentOrigin should support nested invoking

### What changes were proposed in this pull request?

`CurrentOrigin` is a thread-local variable to track the original SQL line position in plan/expression. Usually, we set `CurrentOrigin`, create `TreeNode` instances, and reset `CurrentOrigin`.

This PR updates the last step to set `CurrentOrigin` to its previous value, instead of resetting it. This is necessary when we invoke `CurrentOrigin` in a nested way, like with subqueries.
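
A hedged, generic illustration of the set/restore pattern (using a stand-in thread-local holder rather than Spark's `CurrentOrigin`):
```scala
// Hedged sketch: restore the previous value instead of resetting, so nested calls compose.
object OriginHolder {
  private val holder = new ThreadLocal[Option[String]] {
    override def initialValue(): Option[String] = None
  }

  def withOrigin[T](origin: String)(body: => T): T = {
    val previous = holder.get()      // remember the enclosing origin
    holder.set(Some(origin))
    try body
    finally holder.set(previous)     // restore, don't reset to None
  }

  def current: Option[String] = holder.get()
}
```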

### Why are the changes needed?

To keep the original SQL line position in the error message in more cases.

### Does this PR introduce _any_ user-facing change?

No, only minor error message changes.

### How was this patch tested?

existing tests

Closes #32249 from cloud-fan/origin.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-34472][YARN] Ship ivySettings file to driver in cluster mode

### What changes were proposed in this pull request?

In YARN, ship the `spark.jars.ivySettings` file to the driver when using `cluster` deploy mode so that `addJar` is able to find it in order to resolve ivy paths.

### Why are the changes needed?

SPARK-33084 introduced support for Ivy paths in `sc.addJar` or Spark SQL `ADD JAR`. If we use a custom ivySettings file using `spark.jars.ivySettings`, it is loaded at https://github.com/apache/spark/blob/b26e7b510bbaee63c4095ab47e75ff2a70e377d7/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L1280. However, this file is only accessible on the client machine. In YARN cluster mode, this file is not available on the driver and so `addJar` fails to find it.
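
For illustration, a hedged sketch of the end-to-end flow this fixes in YARN cluster mode (the settings path and Ivy coordinates are made up):
```scala
// Hedged sketch: submit with a custom ivySettings file, e.g.
//   spark-submit --master yarn --deploy-mode cluster \
//     --conf spark.jars.ivySettings=/etc/spark/ivysettings.xml ...
// then, inside the driver, resolve an Ivy path; with this PR the shipped
// settings file can be found on the driver as well.
spark.sql("ADD JAR ivy://org.postgresql:postgresql:42.2.18")
```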

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests to verify that the `ivySettings` file is localized by the YARN client and that a YARN cluster mode application is able to find and load the `ivySettings` file.

Closes #31591 from shardulm94/SPARK-34472.

Authored-by: Shardul Mahadik <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>

* [SPARK-35153][SQL] Make textual representation of ANSI interval operators more readable

### What changes were proposed in this pull request?
In the PR, I propose to override the `sql` and `toString` methods of the expressions that implement operators over ANSI intervals (`YearMonthIntervalType`/`DayTimeIntervalType`), and replace internal expression class names by operators like `*`, `/` and `-`.

### Why are the changes needed?
The proposed methods should make the textual representation of such operators more readable, and potentially parsable by the Spark SQL parser.

### Does this PR introduce _any_ user-facing change?
Yes. This can influence column names.

### How was this patch tested?
By running existing test suites for interval and datetime expressions, and re-generating the `*.sql` tests:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z datetime.sql"
```

Closes #32262 from MaxGekk/interval-operator-sql.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>

* [SPARK-35132][BUILD][CORE] Upgrade netty-all to 4.1.63.Final

### What changes were proposed in this pull request?
Three CVE problems have been found after Netty 4.1.51.Final, as follows:

- [CVE-2021-21409](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-21409)
- [CVE-2021-21295](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-21295)
- [CVE-2021-21290](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-21290)

So the main change of this PR is to upgrade netty-all to 4.1.63.Final to avoid these potential risks.

Another change is to clean up deprecated API usage: [tiny caches have been merged into small caches](https://github.com/netty/netty/blob/4.1/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java#L447-L455) (after [netty#10267](https://github.com/netty/netty/pull/10267)), and the [PooledByteBufAllocator(boolean, int, int, int, int, int, int, boolean, int)](https://github.com/netty/netty/blob/4.1/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java#L227-L239) API should be used to create `PooledByteBufAllocator`.

### Why are the changes needed?
Upgrade netty-all to 4.1.63.Final to avoid the CVE problems.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #32227 from LuciferYang/SPARK-35132.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Sean Owen <[email protected]>

* [SPARK-35044][SQL][FOLLOWUP][TEST-HADOOP2.7] Fix hadoop 2.7 test due to diff between hadoop 2.7 and hadoop 3

### What changes were proposed in this pull request?

`dfs.replication` is inconsistent between Hadoop 2.x and 3.x, so in this PR we use `dfs.hosts` to verify, per https://github.com/apache/spark/pull/32144#discussion_r616833099

```
== Results ==
!== Correct Answer - 1 ==        == Spark Answer - 1 ==
!struct<>                        struct<key:string,value:string>
![dfs.replication,<undefined>]   [dfs.replication,3]
```

### Why are the changes needed?

fix Jenkins job with Hadoop 2.7

### Does this PR introduce _any_ user-facing change?

test only change
### How was this patch tested?

test only change

Closes #32263 from yaooqinn/SPARK-35044-F.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

* [SPARK-35113][SQL] Support ANSI intervals in the Hash expression

### What changes were proposed in this pull request?
Support ANSI intervals in `HashExpression` and add UTs.

### Why are the changes needed?
Support ANSI intervals in `HashExpression`.

### Does this PR introduce _any_ user-facing change?
Yes, users can pass ANSI interval values to the hash functions.
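
For example, a hedged sketch hashing a day-time interval column built from `java.time.Duration` values (which map to `DayTimeIntervalType`); the data is illustrative:
```scala
// Hedged sketch: hashing an ANSI day-time interval value.
import java.time.Duration
import org.apache.spark.sql.functions.hash
import spark.implicits._

Seq(Duration.ofHours(1), Duration.ofMinutes(90))
  .toDF("i")
  .select(hash($"i"))
  .show()
```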

### How was this patch tested?
Added UT

Closes #32259 from AngersZhuuuu/SPARK-35113.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Max Gekk <[email protected]>

* [SPARK-35120][INFRA][FOLLOW-UP] Try catch an error to show the correct guidance

### What changes were proposed in this pull request?

This PR proposes to handle 404 not found, see https://github.com/apache/spark/pull/32255/checks?check_run_id=2390446579 as an example.

If a fork does not have any previous workflow runs, it …