-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py file. #20089
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Btw, should we add |
|
Test build #85424 has finished for PR 20089 at commit
|
|
Yea, I think we could. I added the support and tested it before - SPARK-19019. I think it's okay to add it they are just metadata AFAIK. |
|
@HyukjinKwon Thanks! I'll add it soon. |
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Not a big deal but I know one more place we might also update - https://github.com/apache/spark/blob/master/python/README.md#python-requirements
|
@HyukjinKwon I'll update it as well. |
|
Test build #85425 has finished for PR 20089 at commit
|
|
Test build #85426 has finished for PR 20089 at commit
|
python/README.md
Outdated
| ## Python Requirements | ||
|
|
||
| At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy and pandas). | ||
| At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy, pandas, and pyarrow). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds like mandatory, but I think pyarrow is still an optional choice. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, Pandas and PyArrow are optional. Maybe, it's nicer if we have some more details here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some more details. WDYT?
| 'ml': ['numpy>=1.7'], | ||
| 'mllib': ['numpy>=1.7'], | ||
| 'sql': ['pandas>=0.19.2'] | ||
| 'sql': ['pandas>=0.19.2', 'pyarrow>=0.8.0'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If no pyarrow is installed, will setup force users to install it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, extras_require does not do anything in normal cases but they can be installed together with a dev option via pip IIRC.
|
Test build #85432 has finished for PR 20089 at commit
|
python/README.md
Outdated
| ## Python Requirements | ||
|
|
||
| At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy and pandas). | ||
| At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages might have their own requirements declared as "Extras" (including numpy, pandas, and pyarrow). You can install the requirements by specifying their extra names. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. How about simply like ... :
At its core PySpark depends on Py4J (currently version 0.10.6), but some additional sub-packages have their own
extra requirements for some features (including numpy, pandas, and pyarrow).
for now? I just noticed we are a bit unclear on this (e.g., actually I have been under impression that NumPy is required for ML/MLlib so far) but I think this roughly describes it correctly and is good enough.
Will maybe try to make a PR to fully describe the dependencies and related features later. This PR targets PyArrow anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big deal anyway. I am actually fine as is too if you prefer @ueshin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use the simple one you suggested and leave the detailed description for the future prs.
|
Still LGTM |
|
Test build #85434 has finished for PR 20089 at commit
|
|
Merged to master. |
What changes were proposed in this pull request?
This is a follow-up pr of #19884 updating setup.py file to add pyarrow dependency.
How was this patch tested?
Existing tests.