Conversation

@davies
Contributor

@davies davies commented Nov 14, 2014

When the JVM is started from a Python process, it should exit once its stdin is closed.

Test: add spark.driver.memory to conf/spark-defaults.conf, then run:

```
davies@dm:~/work/spark$ cat conf/spark-defaults.conf
spark.driver.memory       8g
davies@dm:~/work/spark$ bin/pyspark
>>> quit
davies@dm:~/work/spark$ jps
4931 Jps
286
davies@dm:~/work/spark$ python wc.py
943738
0.719928026199
davies@dm:~/work/spark$ jps
286
4990 Jps
```
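For readers skimming the thread, the mechanism being discussed can be sketched as follows. This is only an illustrative sketch under assumptions, not the actual patch: the object name StdinExitMonitor is invented, and the real change wires this behavior into Spark's Python launch path. The JVM inherits a pipe from the Python parent as its stdin, so reading EOF there is a reliable signal that the parent has exited.

```scala
// Illustrative sketch (not the actual patch): a daemon thread that blocks on
// the JVM's stdin and terminates the JVM once the stream is closed, i.e. once
// the Python parent process has exited or closed its end of the pipe.
object StdinExitMonitor {  // hypothetical name, for illustration only
  def start(): Unit = {
    val monitor = new Thread("stdin-exit-monitor") {
      override def run(): Unit = {
        // read() blocks until a byte arrives or EOF (-1) is reached.
        while (System.in.read() != -1) {}
        System.exit(0)
      }
    }
    monitor.setDaemon(true)  // never keeps the JVM alive on its own
    monitor.start()
  }
}
```

Because the watcher is a daemon thread it cannot keep the JVM alive by itself; it only forces an exit once stdin reaches EOF, which matches the jps output above (no stray SparkSubmit process left behind after the Python script finishes).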

@davies
Contributor Author

davies commented Nov 14, 2014

cc @andrewor14

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23397 has started for PR 3274 at commit 050651f.

  • This patch merges cleanly.

@vanzin
Contributor

vanzin commented Nov 14, 2014

So, if I understand correctly, this handles the case where pyspark apps are not executed using the pyspark script, but with python directly?

It feels a little bit sketchy to support that, but the change looks good.

Contributor

I would just call this PYSPARK, and rename the variable isPySpark

Contributor

(before you do that, can you search the codebase to see if we already use the PYSPARK environment variable? It would be good to avoid clobbering it)

@andrewor14
Contributor

@vanzin Yes, your understanding is correct. I think this is safe to support in case the user wants to use different versions of python. Otherwise this silently does not kill the outer process, which is unintuitive.

@vanzin
Copy link
Contributor

vanzin commented Nov 15, 2014

in case the user wants to use different versions of python

Is there a way to define the python executable to use for the executors? Otherwise this will end up in tears, since pickle is not compatible across python versions...

@SparkQA

SparkQA commented Nov 15, 2014

Test build #23399 has started for PR 3274 at commit ce8599c.

  • This patch merges cleanly.

@davies
Contributor Author

davies commented Nov 15, 2014

@vanzin The python used in the executors can be defined by PYSPARK_PYTHON, so it's easy to run pyspark with a different python, such as:

$ PYSPARK_PYTHON=pypy pypy wc.py

Or run python with any options:

$ python -u -s -B wc.py
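To make the PYSPARK_PYTHON point concrete, here is a rough sketch of the general pattern (an assumption about the mechanism, not a quote of Spark's worker-launch code): the interpreter used for executor-side workers is read from the environment, so driver and executors can agree on the Python version.

```scala
// Rough sketch of the pattern described above (not Spark's actual code):
// pick the Python interpreter for worker processes from PYSPARK_PYTHON,
// falling back to plain "python" when the variable is unset.
val pythonExec: String = sys.env.getOrElse("PYSPARK_PYTHON", "python")

// Hypothetical worker launch using that interpreter; "worker.py" is a
// placeholder, not the real entry point.
val worker = new ProcessBuilder(pythonExec, "worker.py").start()
```

This is also the answer to the pickle concern above: as long as PYSPARK_PYTHON points at an interpreter compatible with the one driving the job, serialized data stays readable on both sides.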

@vanzin
Contributor

vanzin commented Nov 15, 2014

Ah, cool. Thanks for clarifying.

@SparkQA

SparkQA commented Nov 15, 2014

Test build #23397 has finished for PR 3274 at commit 050651f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InSet(value: Expression, hset: Set[Any])
    • case class In(attribute: String, values: Array[Any]) extends Filter

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23397/
Test PASSed.

Contributor

Sorry, I missed this case. Actually, not all pyspark applications should go through this path. On second thought, I think we should rename this variable to IS_PYTHON_SUBPROCESS:

env["IS_PYTHON_SUBPROCESS"] = "1" # Tell JVM to exit after python exits

@andrewor14
Contributor

Hey @davies, sorry I missed the case in which the python application is run through spark-submit, which doesn't actually go through this code path. I have provided suggestions for renaming the variables and rephrasing certain comments.

@SparkQA

SparkQA commented Nov 15, 2014

Test build #23399 timed out for PR 3274 at commit ce8599c after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23399/
Test FAILed.

@SparkQA

SparkQA commented Nov 15, 2014

Test build #23406 has started for PR 3274 at commit df0e524.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 15, 2014

Test build #23406 has finished for PR 3274 at commit df0e524.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23406/
Test PASSed.

Contributor

can run. I'll fix this when I merge it

@andrewor14
Contributor

Ok, merging this into master and 1.2. Thanks @davies.

@asfgit asfgit closed this in 7fe08b4 Nov 15, 2014
asfgit pushed a commit that referenced this pull request Nov 15, 2014
When JVM is started in a Python process, it should exit once the stdin is closed.

test: add spark.driver.memory in conf/spark-defaults.conf

```
daviesdm:~/work/spark$ cat conf/spark-defaults.conf
spark.driver.memory       8g
daviesdm:~/work/spark$ bin/pyspark
>>> quit
daviesdm:~/work/spark$ jps
4931 Jps
286
daviesdm:~/work/spark$ python wc.py
943738
0.719928026199
daviesdm:~/work/spark$ jps
286
4990 Jps
```

Author: Davies Liu <[email protected]>

Closes #3274 from davies/exit and squashes the following commits:

df0e524 [Davies Liu] address comments
ce8599c [Davies Liu] address comments
050651f [Davies Liu] JVM should exit after Python exit

(cherry picked from commit 7fe08b4)
Signed-off-by: Andrew Or <[email protected]>