
Conversation

@HyukjinKwon HyukjinKwon commented Aug 19, 2017

What changes were proposed in this pull request?

This PR makes DataFrame.sample(...) able to omit withReplacement, defaulting to False, consistent with the equivalent Scala / Java API.

In short, the following examples are allowed:

>>> df = spark.range(10)
>>> df.sample(0.5).count()
7
>>> df.sample(fraction=0.5).count()
3
>>> df.sample(0.5, seed=42).count()
5
>>> df.sample(fraction=0.5, seed=42).count()
5

In addition, this PR also adds some type-checking logic, as below:

>>> df = spark.range(10)
>>> df.sample().count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [].
>>> df.sample(True).count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'bool'>].
>>> df.sample(42).count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'int'>].
>>> df.sample(fraction=False, seed="a").count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'bool'>, <type 'str'>].
>>> df.sample(seed=[1]).count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'list'>].
>>> df.sample(withReplacement="a", fraction=0.5, seed=1)
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'str'>, <type 'float'>, <type 'int'>].

How was this patch tested?

Manually tested; unit tests were added as doctests, and the built documentation for Python was checked manually.

>>> df.sample(False, 0.5, 42).count()
2
"""
assert fraction >= 0.0, "Negative fraction value: %s" % fraction
Member Author

I removed this as it looks like it is already checked on the Scala / Java side:

>>> df.sample(fraction=-0.1).count()
...
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Sampling fraction (-0.1) must be on interval [0, 1] without replacement'

Contributor

@rxin rxin Aug 20, 2017

I'd do the check in Python, so the error message is clearer. Best if the error messages match.

Member Author

@HyukjinKwon HyukjinKwon Aug 21, 2017

Hm.. wouldn't it be better to avoid duplicating the requirement expression? It looks like I should do:

if (withReplacement) {
  require(
    fraction >= 0.0 - eps,
    s"Sampling fraction ($fraction) must be nonnegative with replacement")
} else {
  require(
    fraction >= 0.0 - eps && fraction <= 1.0 + eps,
    s"Sampling fraction ($fraction) must be on interval [0, 1] without replacement")
}

on the Python side. I have been thinking of avoiding this when the error message already makes sense to Python users (which is not the case when non-Pythonic error messages are exposed, for example, Java types such as java.lang.Long in the error message), although I understand it is good to throw an exception early, before going to the JVM.

Contributor

Yea, it'd be better to have Python handle the simpler error checking.

2
"""
assert fraction >= 0.0, "Negative fraction value: %s" % fraction
seed = seed if seed is not None else random.randint(0, sys.maxsize)
Member Author

I also removed random.randint(0, sys.maxsize) and tried to directly call the Scala / Java side one.

@SparkQA

SparkQA commented Aug 19, 2017

Test build #80870 has finished for PR 18999 at commit 5de97d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

cc @rxin. Does this make sense to you?

@HyukjinKwon HyukjinKwon changed the title [SPARK-21779][PYTHON] Simpler Dataset.sample API in Python [SPARK-21779][PYTHON] Simpler DataFrame.sample API in Python Aug 19, 2017
raise TypeError(
    "withReplacement (optional), fraction (required) and seed (optional)"
    " should be a bool, float and number; however, "
    "got %s." % ", ".join(argtypes))
Member

With this change, all three parameters can be None by default; argtypes seems to be an empty list here?

Member Author

Yea, it looks so. Let me try to improve this message.

@SparkQA

SparkQA commented Aug 20, 2017

Test build #80898 has finished for PR 18999 at commit 0328446.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

>>> df.sample(seed="abc").count()
Traceback (most recent call last):
...
TypeError:...
Member

Maybe we shouldn't do the error cases here in the doctests, but move them to unit tests instead? Also, these cases aren't really that meaningfully different to me as a user...?

>>> df.sample(0.5, 3).count()
4
>>> df.sample(fraction=0.5, seed=3).count()
4
>>> df.sample(1.0).count()
10
>>> df.sample(fraction=1.0).count()
10
>>> df.sample(False, fraction=1.0).count()
10

Contributor

That makes sense! Doctests are examples users can follow.

@HyukjinKwon
Member Author

HyukjinKwon commented Aug 21, 2017

#18999 (comment) looks hidden. I addressed the other comment for now.

@SparkQA

SparkQA commented Aug 21, 2017

Test build #80912 has finished for PR 18999 at commit f2608ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 21, 2017

Test build #80911 has finished for PR 18999 at commit 24525bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

@rxin, would you maybe have some opinion on #18999 (comment) (avoiding fraction checking on the Python side)?

@HyukjinKwon
Member Author

cc @holdenk and @ueshin, could you maybe take a look when you have some time?

@ueshin
Member

ueshin commented Aug 31, 2017

LGTM.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Sep 1, 2017

Test build #81303 has finished for PR 18999 at commit f2608ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Merged to master.

@HyukjinKwon
Member Author

Thank you @viirya, @felixcheung, @rxin and @ueshin.

@asfgit asfgit closed this in 5cd8ea9 Sep 1, 2017
@HyukjinKwon HyukjinKwon deleted the SPARK-21779 branch January 2, 2018 03:37