[SPARK-5158] Access kerberized HDFS from Spark standalone #17530
Conversation
… renewer = yarn.resourcemanager.principal
Can one of the admins verify this patch?
How does this patch handle the security issues raised in other attempts at this feature, such as #4106? Specifically this comment, if you don't want to go through the whole discussion: #4106 (comment). Basically, how do you prevent user A from reading user B's keytab here?
Hi @vanzin, Spark standalone isn't really multi-user in any sense, since the executors for all jobs run as whatever user the worker daemon was started as. That shouldn't preclude standalone clusters from communicating with secured resources. Happy to add some additional documentation on this very point to the PR. Any other thoughts? Thanks,
Of course it should. You're inserting a service into your cluster that allows people to steal each other's credentials, basically making security in all the other services pointless. Explain to me how that makes sense. If all you want is to allow Spark to access secure HDFS, create a user for the "Spark Standalone" service and log in on every Worker node as that user. All Spark applications will use that user, and administrators then have control over how much damage that user can do to other services. But as is, you're basically breaking the security of the whole datacenter by introducing an insecure service.
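The shared-service-user setup described here can be done entirely outside Spark. A minimal sketch, assuming an illustrative service principal and keytab path (neither comes from this PR):

```shell
# On every Worker host, log the shared service user in from its keytab
# before starting the Spark daemons (principal and keytab path are placeholders):
kinit -kt /etc/security/keytabs/spark.service.keytab "spark/$(hostname -f)@EXAMPLE.COM"

# Then start the worker under that login; every application scheduled on this
# worker will access HDFS as the shared "spark" service user:
"$SPARK_HOME/sbin/start-slave.sh" spark://master:7077
```

The point of the design is that administrators grant HDFS permissions to one well-known service principal instead of shipping per-user keytabs through the cluster manager.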
In our setup, each user gets their own standalone cluster. Users cannot submit jobs to each other's clusters. By providing a keytab on cluster creation and having Spark manage renewal on behalf of the user, we can support long-running jobs with less headache.
Then in your setup you can configure things so that the cluster already has the user's keytab; Spark doesn't need to distribute it for you.
That's right, but you still need a separate out-of-band process refreshing with the KDC. My thinking is: why not have Spark do that on your behalf?
That is not what your change does, though. If you want to change the master/worker scripts to refresh Kerberos credentials, that would be a lot more acceptable. This change is just not acceptable, because outside of your use case (which I don't know all the details of), it's a big security hole.
To me it's basically the same as users including S3 credentials when submitting to Spark standalone; Kerberos just requires more machinery. It might be a little harder to get at the Spark conf entries of another user's job, but it's still possible, since everything runs as the same Unix user and shares the cluster secret.
Said another way: people need another layer to use Spark standalone in secured environments anyway.
I'm sorry, but you won't convince me that it's a useful feature to have Spark be a big security hole when inserted into a Kerberos environment. As I said, if you want to change your approach to have the Master/Worker daemons manage a Kerberos login that is shared among all users, that would be more acceptable, and would also cover your use case as far as I can see. But your current approach is just not going to happen, at least from my point of view. If you can find someone else to shepherd your PR, I'll let them figure it out, but I'm not for adding this particular implementation to Spark. And people can use YARN if they really care about security. Standalone was never meant to be secure, nor used in secure environments. (There's also ongoing work to get Kerberos working with Mesos, which does have the necessary security features as far as I know.)
And BTW, if you really want to pursue this, please write a detailed spec explaining everything that is being done, and describe all the security issues people need to be aware of. It might even be prudent to make it an SPIP.
That would work for cluster mode, but in client mode the driver on the submitting node still needs the keytab, unfortunately. Standalone clusters are best viewed as distributed single-user programs, so I think the real mistake is not bringing them into a secure environment, but bringing them into a secure environment and trying to use them in a multi-tenant/multi-user fashion. I can see the concern that this feature might give someone who brings standalone clusters into a kerberized environment a false sense of security. What about disabling unless something like
BTW, not trying to give you the hard sell; I appreciate the help rounding out the requirements from the core committers' POV.
I'm just not sold on the idea that this is necessary in the first place. Personally I don't use standalone nor do I play with it at all, so my concerns are purely from a security standpoint. As in, if I were managing a secure cluster, why would I ever allow this code to run on it? If you can write a spec and convince people who maintain the standalone manager that it's a good idea, then go for it. I just want to make sure you're not setting up people to completely break security on their clusters. |
You're setting up a special cluster for a single user. I'm pretty sure that user can log in to the KDC on the driver host.
@themodernlife I'm trying to add Kerberos support for Mesos, and creating HadoopRDDs fails for me because YARN isn't configured: https://issues.apache.org/jira/browse/SPARK-20328. Did you run into this?
There is a spot in HadoopFSCredentialProvider where it looks for a Hadoop config key related to YARN to set the token renewer: getTokenRenewer calls Master.getMasterPrincipal(conf), which needs some YARN configuration set for things to succeed. Right now the PR doesn't set that, so it has to be set under the user's HADOOP_CONF even though it has no real effect there; that should probably be changed. I didn't have a chance to dig into the ticket you linked, but I'll try to have a look and compare notes. If anything comes to mind I'll comment there.
Yep, same problem I'm seeing. Thanks. |
Is this addressing a similar issue to #17387?
Gentle ping @themodernlife on the question above.
## What changes were proposed in this pull request?
This PR proposes to close stale PRs, mostly the same instances as apache#18017. Among the PRs closed is this one:
Closes apache#17530 - [SPARK-5158] Access kerberized HDFS from Spark standalone
## How was this patch tested?
N/A
Author: hyukjinkwon <[email protected]>
Closes apache#18780 from HyukjinKwon/close-prs.
What changes were proposed in this pull request?
Refactors `ConfigurableCredentialManager` and the related `CredentialProvider`s so that they are no longer tied to YARN, and uses them from `StandaloneSchedulerBackend`. The implementation does basically the same thing as the YARN backend: the keytab is copied to the driver/executors through an environment variable in the `ApplicationDescription`.
How was this patch tested?
https://github.com/themodernlife/spark-standalone-kerberos contains a docker-compose environment with a KDC and a kerberized HDFS mini-cluster. The README contains instructions for running the integration test script to see credential refresh/updating occur. Credentials are set to update every 2 minutes or so.
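Assuming the standalone support mirrors the existing YARN-mode options (an assumption; the PR may wire this differently), submitting a long-running job with a keytab would look something like this, with the principal, keytab path, and application names all placeholders:

```shell
spark-submit \
  --master spark://master:7077 \
  --deploy-mode cluster \
  --principal alice@EXAMPLE.COM \
  --keytab /path/to/alice.keytab \
  --class com.example.LongRunningApp app.jar
```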