Skip to content

Conversation

@HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Feb 28, 2017

What changes were proposed in this pull request?

This PR proposes to support unicodes in Param methods in ML, other missed functions in DataFrame.

For example, this causes a ValueError in Python 2.x when param is a unicode string:

>>> from pyspark.ml.classification import LogisticRegression
>>> lr = LogisticRegression()
>>> lr.hasParam("threshold")
True
>>> lr.hasParam(u"threshold")
Traceback (most recent call last):
 ...
    raise TypeError("hasParam(): paramName must be a string")
TypeError: hasParam(): paramName must be a string

This PR is based on #13036

How was this patch tested?

Unit tests in python/pyspark/ml/tests.py and python/pyspark/sql/tests.py.

@HyukjinKwon
Copy link
Member Author

cc @sethah, @holdenk, @viirya and @k-yokoshi

@HyukjinKwon HyukjinKwon changed the title [SPARK-15243][ML][SQL][PYSPARK] Add missing support for unicode in Param methods/functions in dataframe/types [SPARK-15243][ML][SQL][PYTHON] Add missing support for unicode in Param methods/functions in dataframe/types Feb 28, 2017
@SparkQA
Copy link

SparkQA commented Feb 28, 2017

Test build #73579 has finished for PR 17096 at commit d421a82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Feb 28, 2017

Thank you for taking this over @HyukjinKwon :)

@holdenk
Copy link
Contributor

holdenk commented Feb 28, 2017

Let's double check with @viirya to make sure his comment was addressed, but I really appreciate the improved test coverage :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can directly useisStr here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: corr -> sampled

@viirya
Copy link
Member

viirya commented Mar 1, 2017

Few minor comments. Overall looks good.

@HyukjinKwon
Copy link
Member Author

Thank you @viirya. I think it is ready now.

@SparkQA
Copy link

SparkQA commented Mar 1, 2017

Test build #73688 has finished for PR 17096 at commit 9ac773c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Mar 3, 2017

In StructType.add, if the given field name is a basestring, we will directly use it as key in names property.

If we pass in a StructField, we take StructField.name as key in names. But StructField will encode field name as utf-8 if it is not a str.

And StructType.__getitem__ will find matched field by comparing each StructField.name with given search key. So it is unable to find the field back with the unicode field name.

    from pyspark.sql.types import StructType, StringType, StructField

    struct1 = StructType().add(u"a", "string", True)
    struct2 = StructType([StructField(u"a", StringType(), True)])
    self.assertTrue(struct1 == struct2)  # pass
    self.assertTrue(struct1[u"a"] == struct2[u"a"]) #pass

    struct1 = StructType().add(u"測試", "string", True)
    struct2 = StructType([StructField(u"測試", StringType(), True)])
    self.assertTrue(struct1 == struct2)  # fail
    self.assertTrue(struct1[u"測試"] == struct2[u"測試"]) # fail, you can't find the field with key u"測試"

@HyukjinKwon
Copy link
Member Author

@viirya, thank you so much for taking a look and your time.

So, basically, the second case it compares str to unicode as below:

>>> u"測試" == u"測試".encode("utf-8")
False

Apparently, it seems we could pass unicode as is? Let me raise another issue for this after testing and looking into this. Actually, the support in StructType.add seems not the problem specified in the JIRA.

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Mar 3, 2017

Let me check if each is fine for others for sure.

@HyukjinKwon HyukjinKwon changed the title [SPARK-15243][ML][SQL][PYTHON] Add missing support for unicode in Param methods/functions in dataframe/types [SPARK-15243][ML][SQL][PYTHON] Add missing support for unicode in Param methods & functions in dataframe Mar 3, 2017
@HyukjinKwon
Copy link
Member Author

@holdenk and @viirya, I got rid of the changes in types.py and only left that I am pretty sure.

There are two kind of changes here that look used in the only local scope.

One seems for used getattr I guess it is fine as below:

>>> getattr("a", u"__str__")
<method-wrapper '__str__' of str object at 0x10a24e580>
>>> getattr("a", "__str__")
<method-wrapper '__str__' of str object at 0x10a24e580>

and other one seems used for setting an parameter to JVM which seems already used in the code base much more.

@SparkQA
Copy link

SparkQA commented Mar 3, 2017

Test build #73822 has finished for PR 17096 at commit cd235a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Mar 7, 2017

Remaining changes LGTM. cc @holdenk

@HyukjinKwon
Copy link
Member Author

Thank you @viirya for your sign-off.

@holdenk
Copy link
Contributor

holdenk commented Mar 7, 2017

Thanks for taking the time to review this @viirya :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to also add a param with a unicode name that if it was converted down to ASCII behind the scenes and test it?

Copy link
Member Author

@HyukjinKwon HyukjinKwon Mar 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that makes sense, though. It seems that requires more look and tests. I believe the current state resolves the specific JIRA. Maybe, could we merge this as is if you are think either way is fine? I feel It has been dragged by changes unrelated (or loosely related) with the specific JIRA and hope it could be merged if it is okay to you. If it dose not sound good to you, then, let me try to take a look.

@HyukjinKwon
Copy link
Member Author

Hey @holdenk. I am willing to close this for now if you are not confident enough for merging it for now. I can re-open this later.

@ueshin
Copy link
Member

ueshin commented Jun 26, 2017

@HyukjinKwon @holdenk Hi, are you still working on this?

@HyukjinKwon
Copy link
Member Author

Yea, I was waiting for the feedback. It is also about the unicode vs byte string.

@holdenk
Copy link
Contributor

holdenk commented Jul 2, 2017

I'd really like to see that further test I was talking about, @HyukjinKwon -- it shouldn't be too hard to do in this pr right? Just add a unicode string which doesn't make sense in down converted ascii.

@HyukjinKwon
Copy link
Member Author

Do you mean a test case such as self.assertEqual(testParams._resolveParam(u"아"), testParams.아) ?

@holdenk
Copy link
Contributor

holdenk commented Jul 2, 2017

@HyukjinKwon Pretty much, yes.

@HyukjinKwon HyukjinKwon force-pushed the SPARK-15243 branch 2 times, most recently from be6e483 to 485040b Compare July 2, 2017 04:10
self.assertEqual(testParams._resolveParam("maxIter"), testParams.maxIter)

self.assertEqual(testParams._resolveParam(u"maxIter"), testParams.maxIter)
if sys.version_info[0] >= 3:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@holdenk, would this test address your concern enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! :)

@SparkQA
Copy link

SparkQA commented Jul 2, 2017

Test build #79038 has finished for PR 17096 at commit 830b4fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

gentle ping ...

@holdenk
Copy link
Contributor

holdenk commented Sep 6, 2017

Sorry for the delay. Lets get Jenkins to retest this and make sure everything is ok but it looks like a good change :)

LGTM pending jenkins/merge issues (if they show up during jenkins). Jenkins retest this please.

@holdenk
Copy link
Contributor

holdenk commented Sep 7, 2017

Jenkins retest this please.

@SparkQA
Copy link

SparkQA commented Sep 7, 2017

Test build #81488 has finished for PR 17096 at commit 830b4fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Sep 8, 2017

Thanks, merged to master!

@asfgit asfgit closed this in 8598d03 Sep 8, 2017
@HyukjinKwon
Copy link
Member Author

Thank you @holdenk, @viirya and @ueshin.

@HyukjinKwon HyukjinKwon deleted the SPARK-15243 branch January 2, 2018 03:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants