[SPARK-15243][ML][SQL][PYTHON] Add missing support for unicode in Param methods & functions in dataframe #17096

HyukjinKwon · 2017-02-28T08:59:07Z

What changes were proposed in this pull request?

This PR proposes to support unicode strings in Param methods in ML and in other DataFrame functions that missed them.

For example, this causes a TypeError in Python 2.x when param is a unicode string:

>>> from pyspark.ml.classification import LogisticRegression
>>> lr = LogisticRegression()
>>> lr.hasParam("threshold")
True
>>> lr.hasParam(u"threshold")
Traceback (most recent call last):
 ...
    raise TypeError("hasParam(): paramName must be a string")
TypeError: hasParam(): paramName must be a string

This PR is based on #13036
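As a rough illustration of the kind of check involved (a minimal sketch, not the actual PySpark source), the names `string_types`, `DemoParams`, and `has_param` below are hypothetical. The idea is to test against a string type that covers both `str` and `unicode` on Python 2 while remaining plain `str` on Python 3.

```python
import sys

# On Python 2, `basestring` is the common parent of `str` and `unicode`.
# On Python 3 it does not exist, and `str` is already a unicode string.
if sys.version_info[0] >= 3:
    string_types = (str,)
else:
    string_types = (basestring,)  # noqa: F821 (Python 2 only)


class DemoParams(object):
    """Hypothetical stand-in for a class that exposes named params."""

    threshold = 0.5

    def has_param(self, param_name):
        # Accept any text string type instead of only `str`.
        if not isinstance(param_name, string_types):
            raise TypeError("has_param(): param_name must be a string")
        return hasattr(self, param_name)


demo = DemoParams()
print(demo.has_param("threshold"))   # plain string literal
print(demo.has_param(u"threshold"))  # unicode literal
```

With a check like this, both calls take the same path; only non-string arguments (for example an integer) still raise `TypeError`.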

How was this patch tested?

Unit tests in python/pyspark/ml/tests.py and python/pyspark/sql/tests.py.

HyukjinKwon · 2017-02-28T09:00:32Z

cc @sethah, @holdenk, @viirya and @k-yokoshi

SparkQA · 2017-02-28T09:24:35Z

Test build #73579 has finished for PR 17096 at commit d421a82.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2017-02-28T19:16:34Z

Thank you for taking this over @HyukjinKwon :)

holdenk · 2017-02-28T19:26:36Z

Let's double check with @viirya to make sure his comment was addressed, but I really appreciate the improved test coverage :)

viirya · 2017-03-01T09:01:14Z

python/pyspark/sql/dataframe.py

nit: can directly use isStr here.

viirya · 2017-03-01T09:06:12Z

python/pyspark/sql/tests.py

nit: corr -> sampled

viirya · 2017-03-01T09:10:33Z

Few minor comments. Overall looks good.

HyukjinKwon · 2017-03-01T13:24:30Z

Thank you @viirya. I think it is ready now.

SparkQA · 2017-03-01T13:57:54Z

Test build #73688 has finished for PR 17096 at commit 9ac773c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-03-03T05:19:01Z

In StructType.add, if the given field name is a basestring, we will directly use it as key in names property.

If we pass in a StructField, we take StructField.name as key in names. But StructField will encode field name as utf-8 if it is not a str.

And StructType.__getitem__ will find matched field by comparing each StructField.name with given search key. So it is unable to find the field back with the unicode field name.

    from pyspark.sql.types import StructType, StringType, StructField

    struct1 = StructType().add(u"a", "string", True)
    struct2 = StructType([StructField(u"a", StringType(), True)])
    self.assertTrue(struct1 == struct2)  # pass
    self.assertTrue(struct1[u"a"] == struct2[u"a"]) #pass

    struct1 = StructType().add(u"測試", "string", True)
    struct2 = StructType([StructField(u"測試", StringType(), True)])
    self.assertTrue(struct1 == struct2)  # fail
    self.assertTrue(struct1[u"測試"] == struct2[u"測試"]) # fail, you can't find the field with key u"測試"
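The mismatch described above can be reproduced without Spark. The sketch below is an illustration only: `DemoField` and `find_field` are hypothetical names, and the encoding step imitates the behaviour described in this comment rather than copying the real `StructField` code. It is written for Python 3, where the encoded name is `bytes`. On Python 2, ASCII names still compare equal across `str` and `unicode`, which is why the first pair of assertions in the comment passes; Python 3 never treats `bytes` and `str` as equal.

```python
class DemoField(object):
    """Hypothetical field that stores its name encoded as UTF-8 bytes."""

    def __init__(self, name):
        # Imitates the described behaviour: text names get encoded.
        self.name = name.encode("utf-8") if isinstance(name, str) else name


def find_field(fields, key):
    """Return the first field whose stored name equals `key`."""
    for field in fields:
        if field.name == key:
            return field
    raise KeyError("No field named {!r}".format(key))


fields = [DemoField(u"測試")]

# The stored name is bytes, so a text key never matches it.
try:
    find_field(fields, u"測試")
    found_with_text_key = True
except KeyError:
    found_with_text_key = False

# Encoding the key the same way makes the lookup succeed.
found_with_encoded_key = find_field(fields, u"測試".encode("utf-8")) is fields[0]

print(found_with_text_key, found_with_encoded_key)
```

The lookup only works when the key and the stored name use the same representation, so either both sides keep the text form or both sides encode.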

HyukjinKwon · 2017-03-03T07:00:16Z

@viirya, thank you so much for taking a look and your time.

So, basically, in the second case it compares str to unicode, as below:

>>> u"測試" == u"測試".encode("utf-8")
False

It seems we could pass unicode as is? Let me raise another issue for this after testing and looking into it. Actually, the support in StructType.add does not seem to be the problem specified in the JIRA.

HyukjinKwon · 2017-03-03T07:03:05Z

Let me check if each is fine for others for sure.

HyukjinKwon · 2017-03-03T07:45:22Z

@holdenk and @viirya, I got rid of the changes in types.py and left only the ones I am pretty sure about.

There are two kinds of changes here, and both look like they are used only in a local scope.

One seems to be used for getattr, which I guess is fine, as below:

>>> getattr("a", u"__str__")
<method-wrapper '__str__' of str object at 0x10a24e580>
>>> getattr("a", "__str__")
<method-wrapper '__str__' of str object at 0x10a24e580>

and the other one seems to be used for setting a parameter on the JVM, which is already done in many more places in the code base.

SparkQA · 2017-03-03T07:54:00Z

Test build #73822 has finished for PR 17096 at commit cd235a7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-03-07T07:30:47Z

Remaining changes LGTM. cc @holdenk

HyukjinKwon · 2017-03-07T07:56:30Z

Thank you @viirya for your sign-off.

holdenk · 2017-03-07T15:37:04Z

Thanks for taking the time to review this @viirya :)

holdenk · 2017-03-07T15:42:15Z

python/pyspark/ml/tests.py

Would it make sense to also add a param with a unicode name that would not survive being converted down to ASCII behind the scenes, and test it?

HyukjinKwon · 2017-03-08

Yes, I think that makes sense. It seems that it requires a closer look and more tests, though. I believe the current state resolves the specific JIRA. Could we maybe merge this as is, if you think either way is fine? I feel it has been dragged out by changes unrelated (or only loosely related) to the specific JIRA, and I hope it can be merged if that is okay with you. If that does not sound good to you, then let me try to take a look.

HyukjinKwon · 2017-05-11T14:40:53Z

Hey @holdenk. I am willing to close this for now if you are not confident enough to merge it yet. I can re-open it later.

ueshin · 2017-06-26T23:50:31Z

@HyukjinKwon @holdenk Hi, are you still working on this?

HyukjinKwon · 2017-06-26T23:55:26Z

Yeah, I was waiting for the feedback. It is also about unicode vs. byte strings.

holdenk · 2017-07-02T02:17:43Z

I'd really like to see that further test I was talking about, @HyukjinKwon -- it shouldn't be too hard to do in this PR, right? Just add a unicode string which doesn't make sense in down-converted ASCII.

HyukjinKwon · 2017-07-02T02:53:33Z

Do you mean a test case such as self.assertEqual(testParams._resolveParam(u"아"), testParams.아) ?

holdenk · 2017-07-02T02:57:51Z

@HyukjinKwon Pretty much, yes.
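A minimal sketch of what such a test could look like, assuming Python 3. `DemoParams` and `resolve_param` are hypothetical stand-ins, not PySpark's `_resolveParam`; the point is only that a name such as `아` has no ASCII equivalent, so any lossy down-conversion would make the lookup fail.

```python
import unittest


class DemoParams(object):
    """Hypothetical holder with one ASCII and one non-ASCII param name."""

    def __init__(self):
        self.maxIter = "maxIter param"
        setattr(self, u"아", "non-ASCII param")

    def resolve_param(self, name):
        # Look the attribute up by its text name, with no re-encoding.
        if not isinstance(name, str):
            raise TypeError("name must be a string")
        return getattr(self, name)


class ResolveParamTest(unittest.TestCase):
    def test_ascii_name(self):
        params = DemoParams()
        self.assertEqual(params.resolve_param(u"maxIter"), params.maxIter)

    def test_non_ascii_name(self):
        params = DemoParams()
        self.assertEqual(params.resolve_param(u"아"), getattr(params, u"아"))
        # An ASCII down-conversion would lose the name entirely.
        with self.assertRaises(UnicodeEncodeError):
            u"아".encode("ascii")
```

Saved in a module, this can be run with `python -m unittest`.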

HyukjinKwon · 2017-07-02T04:11:54Z

python/pyspark/ml/tests.py

+        self.assertEqual(testParams._resolveParam("maxIter"), testParams.maxIter)
+
+        self.assertEqual(testParams._resolveParam(u"maxIter"), testParams.maxIter)
+        if sys.version_info[0] >= 3:


@holdenk, would this test address your concern enough?

SparkQA · 2017-07-02T04:36:51Z

Test build #79038 has finished for PR 17096 at commit 830b4fe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-07-24T02:34:36Z

gentle ping ...

holdenk · 2017-09-06T19:39:08Z

Sorry for the delay. Let's get Jenkins to retest this and make sure everything is OK, but it looks like a good change :)

LGTM pending jenkins/merge issues (if they show up during jenkins). Jenkins retest this please.

holdenk · 2017-09-07T00:19:57Z

Jenkins retest this please.

SparkQA · 2017-09-07T00:50:43Z

Test build #81488 has finished for PR 17096 at commit 830b4fe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2017-09-08T18:59:34Z

Thanks, merged to master!

HyukjinKwon · 2017-09-08T19:04:23Z

Thank you @holdenk, @viirya and @ueshin.

HyukjinKwon changed the title ~~[SPARK-15243][ML][SQL][PYSPARK] Add missing support for unicode in Param methods/functions in dataframe/types~~ [SPARK-15243][ML][SQL][PYTHON] Add missing support for unicode in Param methods/functions in dataframe/types Feb 28, 2017

HyukjinKwon changed the title ~~[SPARK-15243][ML][SQL][PYTHON] Add missing support for unicode in Param methods/functions in dataframe/types~~ [SPARK-15243][ML][SQL][PYTHON] Add missing support for unicode in Param methods & functions in dataframe Mar 3, 2017

Add missing support for unicode in Param methods & functions in dataframe

feeec46

HyukjinKwon force-pushed the SPARK-15243 branch 2 times, most recently from be6e483 to 485040b Compare July 2, 2017 04:10

Address comments

830b4fe

HyukjinKwon force-pushed the SPARK-15243 branch from 485040b to 830b4fe Compare July 2, 2017 04:10

HyukjinKwon mentioned this pull request Aug 3, 2017

[SPARK-21612] Allow unicode strings in __getitem__ of StructType #18817

Closed

asfgit closed this in 8598d03 Sep 8, 2017

HyukjinKwon deleted the SPARK-15243 branch January 2, 2018 03:41
