Skip to content

Conversation

@gaborgsomogyi
Copy link
Contributor

What changes were proposed in this pull request?

When loading DataFrames from JDBC datasource with Kerberos authentication, remote executors (yarn-client/cluster etc. modes) fail to establish a connection due to lack of Kerberos ticket or ability to generate it.

This is a real issue when trying to ingest data from kerberized data sources (SQL Server, Oracle) in enterprise environment where exposing simple authentication access is not an option due to IT policy issues.

In this PR I've added MS SQL support.

What this PR contains:

  • Added MSSQLConnectionProvider
  • Added MSSQLConnectionProviderSuite
  • Changed MS SQL JDBC driver to use the latest (test scope only)
  • Changed MsSqlServerIntegrationSuite docker image to use the latest
  • Added a version comment to MariaDBConnectionProvider to increase trackability

Why are the changes needed?

Missing JDBC kerberos support.

Does this PR introduce any user-facing change?

Yes, now user is able to connect to MS SQL using kerberos.

How was this patch tested?

  • Additional + existing unit tests
  • Existing integration tests
  • Test on cluster manually

@gaborgsomogyi
Copy link
Contributor Author

gaborgsomogyi commented May 25, 2020

Apart from manual testing I've tried to add docker integration test (failed) and tried out the following:

  • Set up an active directory docker => not yet supported, please see this link for details.
  • Use Hadoop MiniKdc => simply never authenticated, furthermore MS never claimed it is working

If there will be a working docker image with an Active Directory instance we can try it again. In the meantime if somebody has an idea how to overcome this feel free to add.

@SparkQA
Copy link

SparkQA commented May 25, 2020

Test build #123081 has finished for PR 28635 at commit 3c30c4a.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor Author

@gaborgsomogyi gaborgsomogyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of pointers to help reviewers.

There is a driver wrapper here which achieves more or less the same behaviour but with an external jar. The blogpost can be found here.

class MsSqlServerIntegrationSuite extends DockerJDBCIntegrationSuite {
override val db = new DatabaseOnDocker {
override val imageName = "mcr.microsoft.com/mssql/server:2017-GA-ubuntu"
override val imageName = "mcr.microsoft.com/mssql/server:2019-GA-ubuntu-16.04"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not absolutely necessary, if we think we can extract it into a new PR. Thought it would be overkill.

val configName = "jaasConfigurationName"
val appEntryDefault = "SQLJDBCDriver"

val parseURL = try {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are basically 2 approaches to parse the URL to get jaasConfigurationName:

Both way has been tested in MSSQLConnectionProviderSuite.


import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions

private[sql] class MSSQLConnectionProvider(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation is based on this.

Comment on lines +72 to +73
result.put("integratedSecurity", "true")
result.put("authenticationScheme", "JavaKerberos")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to add the background information on comment, either the code side or the class doc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an inline comment.

}

test("setAuthenticationConfigIfNeeded custom parser must set authentication if not set") {
val parserMethod = "IntentionallyNotExistingMethod"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This simulates a driver where the private parser method doesn't exist. Such case manual parsing takes place and the code goes forward without issue.

@gaborgsomogyi
Copy link
Contributor Author

cc @HeartSaVioR

@SparkQA
Copy link

SparkQA commented May 25, 2020

Test build #123082 has finished for PR 28635 at commit 88306f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 25, 2020

Test build #123090 has finished for PR 28635 at commit 16687d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 27, 2020

Test build #123177 has finished for PR 28635 at commit 587a502.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ClusterStats(featureSum: Vector, squaredNormSum: Double, weightSum: Double)
  • class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol, HasWeightCol,
  • case class Log(child: Expression) extends UnaryLogExpression(StrictMath.log, \"LOG\")
  • case class CurrentCatalog() extends LeafExpression with Unevaluable
  • case class GetCurrentDatabaseAndCatalog(catalogManager: CatalogManager) extends Rule[LogicalPlan]

@gaborgsomogyi
Copy link
Contributor Author

gaborgsomogyi commented May 27, 2020

retest this, please

@SparkQA
Copy link

SparkQA commented May 27, 2020

Test build #123184 has finished for PR 28635 at commit 587a502.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ClusterStats(featureSum: Vector, squaredNormSum: Double, weightSum: Double)
  • class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol, HasWeightCol,
  • case class Log(child: Expression) extends UnaryLogExpression(StrictMath.log, \"LOG\")
  • case class CurrentCatalog() extends LeafExpression with Unevaluable
  • case class GetCurrentDatabaseAndCatalog(catalogManager: CatalogManager) extends Rule[LogicalPlan]

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone through the general review but I'm afraid I'm not qualified to review in details.

Comment on lines +72 to +73
result.put("integratedSecurity", "true")
result.put("authenticationScheme", "JavaKerberos")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to add the background information on comment, either the code side or the class doc.

@gaborgsomogyi
Copy link
Contributor Author

cc @vanzin

@SparkQA
Copy link

SparkQA commented Jun 4, 2020

Test build #123530 has finished for PR 28635 at commit 0a71849.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor

vanzin commented Jun 16, 2020

retest this please

@SparkQA
Copy link

SparkQA commented Jun 16, 2020

Test build #124083 has finished for PR 28635 at commit 0a71849.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor

vanzin commented Jun 17, 2020

Merging to master.

@vanzin vanzin closed this in eeb8120 Jun 17, 2020
@thereverand
Copy link

I could be wrong, but it doesn't look like this fixes this issue completely.
It doesn't support username/password and Kerberos, which the driver supports, and if the proper krb5.conf and JAAS config are in place does work on the driver node of a Databricks cluster, but then fails on the remote executors in the cluster with the same error being described here. So it looks to me to be the same issue.

@gaborgsomogyi
Copy link
Contributor Author

@thereverand just seen your comment. It has never stated that username/password is working.
Propagating u/p through the whole cluster will enlarge the attack surface.
This supports keytab which can be better protected.

As a resolution we're planning to add JDBC connection provider API in SPARK-32001 where any custom provider can be added.
This allows each provider to add custom authentication possibilities.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants