[SPARK-21779][PYTHON] Simpler DataFrame.sample API in Python #18999
Conversation
>>> df.sample(False, 0.5, 42).count()
2
"""
assert fraction >= 0.0, "Negative fraction value: %s" % fraction
I removed this as it looks like it is checked on the Scala / Java side:
>>> df.sample(fraction=-0.1).count()
...
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Sampling fraction (-0.1) must be on interval [0, 1] without replacement'
I'd do the check in Python, so the error message is clearer. Best if the error messages match.
Hm.. wouldn't we better avoid duplicating the expression requirement? It looks like I would have to replicate:
Lines 714 to 722 in 5ad1796
if (withReplacement) {
  require(
    fraction >= 0.0 - eps,
    s"Sampling fraction ($fraction) must be nonnegative with replacement")
} else {
  require(
    fraction >= 0.0 - eps && fraction <= 1.0 + eps,
    s"Sampling fraction ($fraction) must be on interval [0, 1] without replacement")
}
within the Python side. I have been thinking of avoiding the duplication as long as the error message makes sense to Python users (which is not the case when it exposes non-Pythonic details, for example Java types such as java.lang.Long), although I understand it is good to throw an exception up front before going to the JVM.
Yea, it'd be better to have Python handle the simpler error checking.
2
"""
assert fraction >= 0.0, "Negative fraction value: %s" % fraction
seed = seed if seed is not None else random.randint(0, sys.maxsize)
I also removed random.randint(0, sys.maxsize) and tried to directly call the Scala / Java side one.
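For contrast, the removed Python-side behaviour can be sketched as a tiny helper (the name is hypothetical; the PR deletes this logic so the JVM side resolves a missing seed instead):

```python
import random
import sys

def resolve_seed(seed=None):
    # Sketch of the pre-PR behaviour: if no seed is supplied, the Python
    # side itself picked a random one before calling into the JVM.
    # After this PR, a None seed is simply forwarded and the Scala / Java
    # side generates the default.
    return seed if seed is not None else random.randint(0, sys.maxsize)
```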
Test build #80870 has finished for PR 18999 at commit
cc @rxin. Does this make sense to you?
python/pyspark/sql/dataframe.py
Outdated
raise TypeError(
    "withReplacement (optional), fraction (required) and seed (optional)"
    " should be a bool, float and number; however, "
    "got %s." % ", ".join(argtypes))
By this change, all three parameters can be None by default; argtypes seems to be an empty list here?
Yea, it looks so. Let me try to improve this message.
Test build #80898 has finished for PR 18999 at commit
python/pyspark/sql/dataframe.py
Outdated
>>> df.sample(seed="abc").count()
Traceback (most recent call last):
    ...
TypeError:...
Maybe we shouldn't do the error cases here in the doctest, but move them to a unit test instead?
Also, these cases aren't really that meaningfully different to me as a user...?
>>> df.sample(0.5, 3).count()
4
>>> df.sample(fraction=0.5, seed=3).count()
4
>>> df.sample(1.0).count()
10
>>> df.sample(fraction=1.0).count()
10
>>> df.sample(False, fraction=1.0).count()
10
That makes sense! Doctests are examples users can follow.
#18999 (comment) looks hidden. I addressed the other comment for now.
Test build #80912 has finished for PR 18999 at commit
Test build #80911 has finished for PR 18999 at commit
@rxin, would you maybe have some opinion on #18999 (comment) (avoiding fraction checking within the Python side)?
LGTM. |
retest this please |
Test build #81303 has finished for PR 18999 at commit
Merged to master. |
Thank you @viirya, @felixcheung, @rxin and @ueshin. |
What changes were proposed in this pull request?
This PR makes DataFrame.sample(...) able to omit withReplacement, defaulting to False, consistently with the equivalent Scala / Java API. In short, the following examples are allowed:
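The accepted call forms shown in the doctests earlier in the thread boil down to a type-based dispatch on the first positional argument. A minimal sketch of that idea (not the actual PySpark implementation) is:

```python
def sample(withReplacement=None, fraction=None, seed=None):
    # Sketch of the argument handling this PR introduces: if the first
    # positional argument is a float, treat it as the fraction and
    # default withReplacement to False. Illustrative only; the real
    # method lives on DataFrame and also validates argument types.
    if fraction is None and isinstance(withReplacement, float):
        withReplacement, fraction = False, withReplacement
    if withReplacement is None:
        withReplacement = False
    return (withReplacement, fraction, seed)
```

With this shape, both the old `sample(False, 0.5, 42)` call and the new shorthand `sample(0.5)` or `sample(fraction=0.5, seed=3)` resolve to the same triple.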
In addition, this PR also adds some type checking logic as below:
How was this patch tested?
Manually tested; unit tests were added as doctests, and the built Python documentation was checked manually.