
Conversation

@ueshin (Member) commented Dec 27, 2017

What changes were proposed in this pull request?

This is a follow-up PR of #19884, updating the setup.py file to add the pyarrow dependency.

How was this patch tested?

Existing tests.

@ueshin (Member, Author) commented Dec 27, 2017

Btw, should we add 'Programming Language :: Python :: 3.6' to classifiers?

@SparkQA commented Dec 27, 2017

Test build #85424 has finished for PR 20089 at commit 36614af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Yeah, I think we could. I added the support and tested it before in SPARK-19019. I think it's okay to add it; they are just metadata AFAIK.
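
For reference, a minimal sketch of what adding that classifier could look like in setup.py; the surrounding entries and version string below are illustrative placeholders, not the exact contents of Spark's python/setup.py:

```python
# Illustrative excerpt only; the real python/setup.py has many more fields.
# Trove classifiers are pure PyPI metadata and do not change install behavior.
from setuptools import setup

setup(
    name='pyspark',
    version='0.0.0',  # placeholder, not the real version
    classifiers=[
        'Programming Language :: Python :: 2.7',  # assumed existing entries
        'Programming Language :: Python :: 3.4',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',  # the classifier proposed above
    ],
)
```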

@ueshin (Member, Author) commented Dec 27, 2017

@HyukjinKwon Thanks! I'll add it soon.

@HyukjinKwon (Member) left a comment

LGTM

Not a big deal, but I know one more place we might also update: https://github.com/apache/spark/blob/master/python/README.md#python-requirements

@ueshin (Member, Author) commented Dec 27, 2017

@HyukjinKwon I'll update it as well.

@SparkQA commented Dec 27, 2017

Test build #85425 has finished for PR 20089 at commit bee3c69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 27, 2017

Test build #85426 has finished for PR 20089 at commit 896f752.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

python/README.md (outdated)

## Python Requirements

- At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy and pandas).
+ At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy, pandas, and pyarrow).
(Member)

This sounds mandatory, but I think pyarrow is still an optional choice. Right?

(Member)

Yeah, Pandas and PyArrow are optional. Maybe it's nicer if we have some more details here too.

(Member, Author)

I added some more details. WDYT?

python/setup.py

      'ml': ['numpy>=1.7'],
      'mllib': ['numpy>=1.7'],
-     'sql': ['pandas>=0.19.2']
+     'sql': ['pandas>=0.19.2', 'pyarrow>=0.8.0']
(Member)

If no pyarrow is installed, will setup force users to install it?

(Member)

Nope, extras_require does not do anything in normal cases, but they can be installed together with a dev option via pip IIRC.
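
To make that behavior concrete: packages listed under extras_require are ignored by a plain install and only pulled in when the extra's name is requested explicitly. A minimal sketch, assuming the extras from the diff above (name and version are placeholders):

```python
# Sketch of extras_require semantics; not the full python/setup.py.
from setuptools import setup

setup(
    name='pyspark',
    version='0.0.0',  # placeholder
    install_requires=['py4j==0.10.6'],  # always installed
    extras_require={
        'ml': ['numpy>=1.7'],
        'mllib': ['numpy>=1.7'],
        'sql': ['pandas>=0.19.2', 'pyarrow>=0.8.0'],
    },
)

# `pip install pyspark` ignores extras_require entirely, so pyarrow is never
# forced on users; `pip install pyspark[sql]` additionally pulls in pandas
# and pyarrow.
```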

@SparkQA commented Dec 27, 2017

Test build #85432 has finished for PR 20089 at commit e142e69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

python/README.md (outdated)

## Python Requirements

- At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy and pandas).
+ At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages might have their own requirements declared as "Extras" (including numpy, pandas, and pyarrow). You can install the requirements by specifying their extra names.
@HyukjinKwon (Member) commented Dec 27, 2017

Ah, I see. How about simply like ... :

At its core PySpark depends on Py4J (currently version 0.10.6), but some additional sub-packages have their own extra requirements for some features (including numpy, pandas, and pyarrow).

for now? I just noticed we are a bit unclear on this (e.g., I have actually been under the impression that NumPy is required for ML/MLlib so far), but I think this roughly describes it correctly and is good enough.

I will maybe try to make a PR to fully describe the dependencies and related features later. This PR targets PyArrow anyway.

(Member)

Not a big deal anyway. I am actually fine as is too if you prefer @ueshin.

(Member, Author)

Let's use the simple one you suggested and leave the detailed description for future PRs.
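
As background on what "optional" means here in practice: PySpark only needs pandas/pyarrow when the corresponding features are used, so the usual pattern is a runtime check at the point of use. A hypothetical sketch of that pattern (_require_pyarrow is an illustrative helper, not Spark's actual code):

```python
# Hypothetical helper illustrating the optional-dependency pattern the
# README text describes; PySpark's real checks may differ.
from distutils.version import LooseVersion

def _require_pyarrow(minimum='0.8.0'):
    """Fail with a helpful message if pyarrow is missing or too old."""
    try:
        import pyarrow
    except ImportError:
        raise ImportError(
            "pyarrow >= %s must be installed for Arrow-based features; "
            "try `pip install pyspark[sql]`." % minimum)
    if LooseVersion(pyarrow.__version__) < LooseVersion(minimum):
        raise ImportError("pyarrow >= %s is required; found %s"
                          % (minimum, pyarrow.__version__))
```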

@HyukjinKwon (Member)

Still LGTM

@SparkQA commented Dec 27, 2017

Test build #85434 has finished for PR 20089 at commit d8d9564.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Merged to master.

@asfgit closed this in b8bfce5 on Dec 27, 2017.