[SPARK-15243][ML][SQL][PYSPARK] Param methods should use basestring for type checking #13036

sethah · 2016-05-10T22:58:17Z

What changes were proposed in this pull request?

The following methods used isinstance(value, str) for checking string types in Python ML params:

_resolveParam(param)
hasParam(param)

This causes a TypeError in Python 2.x when param is a unicode string:

>>> from pyspark.ml.classification import LogisticRegression
>>> lr = LogisticRegression()
>>> lr.hasParam("threshold")
True
>>> lr.hasParam(u"threshold")
Traceback (most recent call last):
 ...
    raise TypeError("hasParam(): paramName must be a string")
TypeError: hasParam(): paramName must be a string
>>>
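The failure above can be reproduced outside Spark with a minimal sketch. The names `string_types`, `is_string`, and `has_param` are hypothetical and are not taken from this patch; the point is only that a `str`-only check rejects `unicode` in Python 2, while a `basestring` check accepts both. The fallback to `str` is an assumption that lets the same snippet run on Python 3, where `basestring` does not exist.

```python
# Python 3 has no `basestring`; fall back to `str` so the sketch runs on both.
try:
    string_types = basestring  # Python 2: common base class of str and unicode
except NameError:
    string_types = str  # Python 3: all text is str


def is_string(value):
    """Return True for any text type on the running interpreter (hypothetical helper)."""
    return isinstance(value, string_types)


def has_param(known_names, param_name):
    """Toy stand-in for a hasParam-style check; not the PySpark implementation."""
    if not is_string(param_name):
        raise TypeError("hasParam(): paramName must be a string")
    return param_name in known_names


# Both the plain and the u-prefixed literal pass the type check.
print(has_param({"threshold"}, "threshold"))
print(has_param({"threshold"}, u"threshold"))
```

With `isinstance(param_name, str)` in place of `is_string(param_name)`, the second call would raise the `TypeError` shown in the traceback when run under Python 2.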

How was this patch tested?

Unit tests added to python/pyspark/ml/tests.py

SparkQA · 2016-05-10T23:12:24Z

Test build #58289 has finished for PR 13036 at commit 878fc5f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-05-11T15:18:34Z

Currently investigating some other usages of isinstance(obj, str) for this PR. Will update soon.

sethah · 2016-05-11T17:27:48Z

I updated instances of similar checks in the sql library as noted on the Jira. I searched and this type of check now only exists here and here. They don't cause problems with unicode though, so I did not change them, but I can do that if needed.

cc @viirya I'm not as familiar with the sql library, could you check those changes?

Also cc @holdenk @davies

SparkQA · 2016-05-11T17:39:19Z

Test build #58386 has finished for PR 13036 at commit b04ac41.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-05-12T04:08:08Z

python/pyspark/sql/dataframe.py


        """
-        if not isinstance(col, str):
+        if not isinstance(col, basestring):


I am not sure this change is needed. I think a SQL column name may contain only letters, digits, and underscores, so I wonder why users would pass a unicode string as a column name in particular.

According to f958f27, it seems to be possible to use Non-ascii characters in column name.
I think there are use cases which want to use non-ascii character in column name.

ah, got it. I just mean from SQL parser.

Similarly, because the unicode column name is encoded by name.encode('utf-8'), it becomes a str instance. In other words, the schema still stores column names as str. However, this change allows unicode input as col, so I think there will be a mismatch.

So I think we don't need to do this.

Thank you for answering. I understood why isinstance(col, basestring) is not needed here.

Although column names are generally stored as str, they are stored as unicode in certain cases.
See SPARK-15244 for details.

Is there some harm in allowing unicode here though? If my column is 'a' and I call sampleBy(u'a') it will work after this change, otherwise it will throw an error. I think it's better to treat 'a' and u'a' as equivalent...

I agree with you. There is no problem caused by allowing unicode here.
As you mentioned, it's better to handle both 'a' and u'a', because there are a few cases in which unicode is passed (e.g. when __future__.unicode_literals is imported in Python 2).
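To make the point about encoding concrete, here is a small sketch of treating 'a' and u'a' as the same column name. The helpers `to_native_str` and `lookup_column` are hypothetical and do not come from PySpark; the sketch assumes column names are kept as the interpreter's native `str`, as described earlier in this thread, and converts `unicode` input with `encode('utf-8')` on Python 2 only.

```python
import sys

PY2 = sys.version_info[0] == 2


def to_native_str(name):
    """Return `name` as the interpreter's native `str` (hypothetical helper)."""
    # `unicode` is only evaluated on Python 2 because of the short-circuit.
    if PY2 and isinstance(name, unicode):  # noqa: F821
        # Python 2: unicode -> UTF-8 encoded str, matching the stored names.
        return name.encode("utf-8")
    return name


def lookup_column(schema_names, col):
    """Find `col` among native-str column names, accepting str or unicode input."""
    key = to_native_str(col)
    if key not in schema_names:
        raise ValueError("unknown column: %r" % (key,))
    return key


# The same column is found whether or not the literal carries a u prefix.
print(lookup_column(["a", "b"], "a"))
print(lookup_column(["a", "b"], u"a"))
```

Normalizing at the boundary like this keeps the stored names in one type, so a caller using `unicode_literals` does not need to know how the schema stores them.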

MechCoder · 2016-08-18T00:06:55Z

lgtm

jkbradley · 2016-09-06T21:07:29Z

Checking old PRs---is this active still? The ML parts look good to me. I haven't checked the SQL ones carefully.

holdenk · 2016-10-07T19:53:54Z

Just a quick ping @sethah - I know you're pretty busy, but I'm assuming this is still active. One minor note: it seems there is another new addition to types.py which maybe should also be changed.


SparkQA · 2016-10-07T23:54:34Z

Test build #66542 has finished for PR 13036 at commit 48f0557.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-16T08:35:18Z

Test build #67032 has finished for PR 13036 at commit c6a8828.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-10-17T22:41:10Z

ping @holdenk @viirya

I think this is ready now :)

SparkQA · 2016-10-17T23:12:43Z

Test build #67093 has finished for PR 13036 at commit 976d682.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-10-21T13:20:29Z

python/pyspark/sql/tests.py

    def test_approxQuantile(self):
        df = self.sc.parallelize([Row(a=i) for i in range(10)]).toDF()
-        aq = df.stat.approxQuantile("a", [0.1, 0.5, 0.9], 0.1)
+        aq = df.stat.approxQuantile(u"a", [0.1, 0.5, 0.9], 0.1)


Basically, in these tests the field names are all ASCII characters. Is it possible to add tests using non-ASCII characters so we can make sure it works?

holdenk · 2017-02-15T00:06:18Z

Gentle ping, what's the status of this PR?

sethah · 2017-02-28T05:39:27Z

@holdenk please feel free to take this over. Can't find time to work on it

holdenk · 2017-02-28T05:41:13Z

Ok, let's see if maybe @zero323 or @HyukjinKwon are interested in taking this over. Otherwise I'll add this to my backlog.

HyukjinKwon · 2017-02-28T07:09:33Z

I am happy to do so. I assume it is already almost done, except for #13036 (comment)?

…am methods & functions in dataframe

## What changes were proposed in this pull request?

This PR proposes to support unicode strings in Param methods in ML and in other functions in DataFrame that were missed.

For example, this causes a `TypeError` in Python 2.x when param is a unicode string:

```python
>>> from pyspark.ml.classification import LogisticRegression
>>> lr = LogisticRegression()
>>> lr.hasParam("threshold")
True
>>> lr.hasParam(u"threshold")
Traceback (most recent call last):
 ...
    raise TypeError("hasParam(): paramName must be a string")
TypeError: hasParam(): paramName must be a string
```

This PR is based on apache#13036

## How was this patch tested?

Unit tests in `python/pyspark/ml/tests.py` and `python/pyspark/sql/tests.py`.

Author: hyukjinkwon <[email protected]>
Author: sethah <[email protected]>

Closes apache#17096 from HyukjinKwon/SPARK-15243.

check for basestring in param methods

878fc5f

replacing sql isinstance(obj, str)

b04ac41

sethah changed the title ~~[SPARK-15243][ML][PYSPARK] Param methods should use basestring for type checking~~ [SPARK-15243][ML][SQL][PYSPARK] Param methods should use basestring for type checking May 11, 2016

viirya reviewed May 12, 2016

sethah added 3 commits October 7, 2016 15:51

Merge branch 'master' of https://github.com/apache/spark into SPARK-1…

e6d9c19

…5243

merging master

3babfd3

revert

48f0557

test for sampleby

c6a8828

revert doc test

976d682

viirya reviewed Oct 21, 2016


sethah closed this Feb 28, 2017

HyukjinKwon mentioned this pull request Feb 28, 2017

[SPARK-15243][ML][SQL][PYTHON] Add missing support for unicode in Param methods & functions in dataframe #17096

Closed


8 participants