[SPARK-5158] Access kerberized HDFS from Spark standalone #17530
Conversation
… renewer = yarn.resourcemanager.principal
Can one of the admins verify this patch?
How does this patch handle the security issues raised in other attempts at this feature, such as #4106? Specifically this comment, if you don't want to go through the whole discussion: #4106 (comment). Basically, how do you prevent user A from reading user B's keytab here?
Hi @vanzin, Spark standalone isn't really multi-user in any sense, since the executors for all jobs run as whatever user the worker daemon was started as. That shouldn't preclude standalone clusters from communicating with secured resources. Happy to add some additional documentation on this very point to the PR. Any other thoughts? Thanks,
Of course it should. You're inserting a service into your cluster that allows people to steal each other's credentials, basically making security in all the other services pointless. Explain to me how that makes sense. If all you want is to allow Spark to access secure HDFS, create a user for the "Spark Standalone" service and log in on every Worker node as that user. All Spark applications will use that user, and administrators then have control over how much damage that user can do to other services. But as is, you're basically breaking the security of the whole datacenter by introducing an insecure service.
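The shared-service-user setup described here can be done entirely outside Spark. A minimal sketch, assuming an illustrative service principal and keytab path (neither comes from this PR):

```shell
# On every Worker host, log the shared service user in from its keytab
# before starting the Spark daemons (principal and keytab path are placeholders):
kinit -kt /etc/security/keytabs/spark.service.keytab "spark/$(hostname -f)@EXAMPLE.COM"

# Then start the worker under that login; every application scheduled on this
# worker will access HDFS as the shared "spark" service user:
"$SPARK_HOME/sbin/start-slave.sh" spark://master:7077
```

The point of the design is that administrators grant HDFS permissions to one well-known service principal instead of shipping per-user keytabs through the cluster manager.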
In our setup, each user gets their own standalone cluster. Users cannot submit jobs to each other's clusters. By providing a keytab on cluster creation and having Spark manage renewal on behalf of the user, we can support long-running jobs with less headache.
Then in your setup you can configure things so that the cluster already has the user's keytab; Spark doesn't need to distribute it for you.
That's right, but you still need a separate out-of-band process refreshing with the KDC. My thinking is: why not have Spark do that on your behalf?
That is not what your change does, though. If you want to change the master/worker scripts to refresh Kerberos credentials, that would be a lot more acceptable. This change is just not acceptable, because outside of your use case (which I don't know all the details of), it's a big security hole.
To me it's basically the same as users including S3 credentials when submitting to Spark standalone; Kerberos just requires more machinery. It might be a little harder to get at the Spark conf entries of another user's job, but it's still possible, since everything runs as the same Unix user and shares the cluster secret.
Said another way: people need another layer to use Spark standalone in secured environments anyway.
I'm sorry, but you won't convince me that it's a useful feature to have Spark be a big security hole when inserted into a Kerberos environment. As I said, if you want to change your approach to have the Master/Worker daemons manage a Kerberos login that is shared among all users, that would be more acceptable, and would also cover your use case as far as I can see. But your current approach is just not going to happen, at least from my point of view. If you can find someone else to shepherd your PR, I'll let them figure it out, but I'm not for adding this particular implementation to Spark. And people can use YARN if they really care about security. Standalone was never meant to be secure, nor used in secure environments. (There's also ongoing work to get Kerberos working with Mesos, which does have the necessary security features as far as I know.)
And BTW, if you really want to pursue this, please write a detailed spec explaining everything that is being done, and describe all the security issues people need to be aware of. It might even be prudent to make it an SPIP.
That would work for cluster mode, but in client mode the driver on the submitting node still needs the keytab, unfortunately. Standalone clusters are best viewed as distributed single-user programs, so I think the real mistake is not bringing them into a secure environment, but bringing them into a secure environment and trying to use them in a multi-tenant/multi-user fashion. I can see the concern that this feature might give someone who brings standalone clusters into a kerberized environment a false sense of security. What about disabling unless something like
BTW, not trying to give you the hard sell; I appreciate the help rounding out the requirements from the core committers' POV.
I'm just not sold on the idea that this is necessary in the first place. Personally I don't use standalone nor do I play with it at all, so my concerns are purely from a security standpoint. As in, if I were managing a secure cluster, why would I ever allow this code to run on it? If you can write a spec and convince people who maintain the standalone manager that it's a good idea, then go for it. I just want to make sure you're not setting up people to completely break security on their clusters. |
You're setting up a special cluster for a single user. I'm pretty sure that user can log in to the KDC on the driver host.
@themodernlife I'm trying to add Kerberos support for Mesos, and creating HadoopRDDs fails for me because YARN isn't configured: https://issues.apache.org/jira/browse/SPARK-20328. Did you run into this?
There is a spot in HadoopFSCredentialProvider where it looks for a Hadoop config key related to YARN to set the token renewer: getTokenRenewer calls Master.getMasterPrincipal(conf), which needs some YARN configuration set for things to succeed. Right now the PR doesn't set that, so it has to be set under the user's HADOOP_CONF even though it has no real effect there; that should probably be changed. I didn't have a chance to dig into the ticket you linked, but I'll try to have a look and compare notes. If anything comes to mind I'll comment there.
Yep, same problem I'm seeing. Thanks. |
Is this addressing a similar issue to #17387?
Gentle ping @themodernlife on the question above.
## What changes were proposed in this pull request?
This PR proposes to close stale PRs, mostly the same instances as apache#18017. Among the PRs closed is this one:
Closes apache#17530 - [SPARK-5158] Access kerberized HDFS from Spark standalone
## How was this patch tested?
N/A
Author: hyukjinkwon <[email protected]>
Closes apache#18780 from HyukjinKwon/close-prs.
What changes were proposed in this pull request?
Refactors `ConfigurableCredentialManager` and the related `CredentialProvider`s so that they are no longer tied to YARN, and uses them from `StandaloneSchedulerBackend`. The implementation does basically the same thing as the YARN backend: the keytab is copied to the driver/executors through an environment variable in the `ApplicationDescription`.
How was this patch tested?
https://github.com/themodernlife/spark-standalone-kerberos contains a docker-compose environment with a KDC and a kerberized HDFS mini-cluster. The README contains instructions for running the integration test script to see credential refresh/updating occur. Credentials are set to update every 2 minutes or so.
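Assuming the standalone support mirrors the existing YARN-mode options (an assumption; the PR may wire this differently), submitting a long-running job with a keytab would look something like this, with the principal, keytab path, and application names all placeholders:

```shell
spark-submit \
  --master spark://master:7077 \
  --deploy-mode cluster \
  --principal alice@EXAMPLE.COM \
  --keytab /path/to/alice.keytab \
  --class com.example.LongRunningApp app.jar
```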