
Conversation

@HyukjinKwon HyukjinKwon commented Aug 19, 2017

What changes were proposed in this pull request?

This PR makes DataFrame.sample(...) able to omit withReplacement, defaulting to False, consistent with the equivalent Scala / Java API.

In short, the following examples are allowed:

>>> df = spark.range(10)
>>> df.sample(0.5).count()
7
>>> df.sample(fraction=0.5).count()
3
>>> df.sample(0.5, seed=42).count()
5
>>> df.sample(fraction=0.5, seed=42).count()
5

In addition, this PR also adds some type-checking logic, as below:

>>> df = spark.range(10)
>>> df.sample().count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [].
>>> df.sample(True).count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'bool'>].
>>> df.sample(42).count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'int'>].
>>> df.sample(fraction=False, seed="a").count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'bool'>, <type 'str'>].
>>> df.sample(seed=[1]).count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'list'>].
>>> df.sample(withReplacement="a", fraction=0.5, seed=1)
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'str'>, <type 'float'>, <type 'int'>].

How was this patch tested?

Manually tested; unit tests were added as doctests, and the built documentation for Python was checked manually.

>>> df.sample(False, 0.5, 42).count()
2
"""
assert fraction >= 0.0, "Negative fraction value: %s" % fraction
Member Author

I removed this as it looks like it is already checked on the Scala / Java side:

>>> df.sample(fraction=-0.1).count()
...
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Sampling fraction (-0.1) must be on interval [0, 1] without replacement'

Contributor

@rxin rxin Aug 20, 2017

I'd do the check in Python, so the error message is clearer. Best if the error messages match.

Member Author

@HyukjinKwon HyukjinKwon Aug 21, 2017

Hm.. wouldn't it be better to avoid duplicating the requirement expression? It looks like I should do:

if (withReplacement) {
  require(
    fraction >= 0.0 - eps,
    s"Sampling fraction ($fraction) must be nonnegative with replacement")
} else {
  require(
    fraction >= 0.0 - eps && fraction <= 1.0 + eps,
    s"Sampling fraction ($fraction) must be on interval [0, 1] without replacement")
}

on the Python side. I have been thinking of avoiding this when the error message already makes sense to Python users (which is not the case when non-Pythonic error messages are exposed, for example, Java types such as java.lang.Long in the error message), although I understand it is good to throw an exception early, before going to the JVM.

Contributor

Yea, it'd be better to have Python handle the simpler error checking.

2
"""
assert fraction >= 0.0, "Negative fraction value: %s" % fraction
seed = seed if seed is not None else random.randint(0, sys.maxsize)
Member Author

I also removed random.randint(0, sys.maxsize) and tried to directly call the Scala / Java side one.

@SparkQA

SparkQA commented Aug 19, 2017

Test build #80870 has finished for PR 18999 at commit 5de97d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

cc @rxin. Does this make sense to you?

@HyukjinKwon HyukjinKwon changed the title [SPARK-21779][PYTHON] Simpler Dataset.sample API in Python [SPARK-21779][PYTHON] Simpler DataFrame.sample API in Python Aug 19, 2017
raise TypeError(
    "withReplacement (optional), fraction (required) and seed (optional)"
    " should be a bool, float and number; however, "
    "got %s." % ", ".join(argtypes))
Member

With this change, all three parameters can be None by default; argtypes seems to be an empty list here?

Member Author

Yea, it looks so. Let me try to improve this message.

@SparkQA

SparkQA commented Aug 20, 2017

Test build #80898 has finished for PR 18999 at commit 0328446.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

>>> df.sample(seed="abc").count()
Traceback (most recent call last):
...
TypeError:...
Member

Maybe we shouldn't do the error cases here in the doctests, but move them to unit tests instead? Also, these cases aren't really that meaningfully different to me as a user...?

>>> df.sample(0.5, 3).count()
4
>>> df.sample(fraction=0.5, seed=3).count()
4
>>> df.sample(1.0).count()
10
>>> df.sample(fraction=1.0).count()
10
>>> df.sample(False, fraction=1.0).count()
10

Contributor

That makes sense! Doctests are examples users can follow.

@HyukjinKwon
Member Author

HyukjinKwon commented Aug 21, 2017

#18999 (comment) looks hidden. I addressed the other comment for now.

@SparkQA

SparkQA commented Aug 21, 2017

Test build #80912 has finished for PR 18999 at commit f2608ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 21, 2017

Test build #80911 has finished for PR 18999 at commit 24525bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

@rxin, would you maybe have some opinion on #18999 (comment) (avoiding fraction checking on the Python side)?

@HyukjinKwon
Member Author

cc @holdenk and @ueshin, could you maybe take a look when you have some time?

@ueshin
Member

ueshin commented Aug 31, 2017

LGTM.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Sep 1, 2017

Test build #81303 has finished for PR 18999 at commit f2608ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Merged to master.

@HyukjinKwon
Member Author

Thank you @viirya, @felixcheung, @rxin and @ueshin.

@asfgit asfgit closed this in 5cd8ea9 Sep 1, 2017
@HyukjinKwon HyukjinKwon deleted the SPARK-21779 branch January 2, 2018 03:37