[SPARK-15243][ML][SQL][PYTHON] Add missing support for unicode in Param methods & functions in dataframe #17096
|
Test build #73579 has finished for PR 17096 at commit
|
|
Thank you for taking this over @HyukjinKwon :) |
|
Let's double check with @viirya to make sure his comment was addressed, but I really appreciate the improved test coverage :) |
python/pyspark/sql/dataframe.py
Outdated
nit: can directly use isStr here.
python/pyspark/sql/tests.py
Outdated
nit: corr -> sampled
|
Few minor comments. Overall looks good. |
|
Thank you @viirya. I think it is ready now. |
|
Test build #73688 has finished for PR 17096 at commit
|
|
In If we pass in a And |
|
@viirya, thank you so much for taking a look and for your time. So, basically, in the second case it compares str to unicode as below:
>>> u"測試" == u"測試".encode("utf-8")
False
Apparently, it seems we could pass unicode as is? Let me raise another issue for this after testing and looking into it. Actually, the support in |
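A minimal sketch of the comparison above, written so it also runs on Python 3, where the encoded value is `bytes` rather than `str`. The variable names are illustrative only and do not come from the patch.

```python
# Text and its UTF-8 encoding are different types and never compare equal.
text = u"測試"
encoded = text.encode("utf-8")

print(text == encoded)                  # False
print(text == encoded.decode("utf-8"))  # True
print(len(text), len(encoded))          # 2 6
```

Decoding back to text before comparing is what makes the two sides match, which is why passing the unicode value through unchanged avoids the mismatch.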
|
Let me check if each is fine for others for sure. |
|
@holdenk and @viirya, I got rid of the changes in There are two kinds of changes here that look to be used only in the local scope. One seems for used
>>> getattr("a", u"__str__")
<method-wrapper '__str__' of str object at 0x10a24e580>
>>> getattr("a", "__str__")
<method-wrapper '__str__' of str object at 0x10a24e580>
and the other one seems to be used for setting a parameter to JVM, which seems already used much more in the code base. |
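The `getattr` observation above can be checked without relying on the printed object address. This is only a sketch with illustrative variable names; it covers ASCII-only attribute names, which is the case shown in the session.

```python
# An ASCII-only name resolves to the same method whether it is written
# as a plain literal or as a unicode literal.
value = "a"
by_plain_name = getattr(value, "__str__")
by_unicode_name = getattr(value, u"__str__")

same_result = by_plain_name() == by_unicode_name() == "a"
print(same_result)  # True
```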
|
Test build #73822 has finished for PR 17096 at commit
|
|
Remaining changes LGTM. cc @holdenk |
|
Thank you @viirya for your sign-off. |
|
Thanks for taking the time to review this @viirya :) |
python/pyspark/ml/tests.py
Outdated
Would it make sense to also add a param with a unicode name that would break if it were converted down to ASCII behind the scenes, and test it?
Yes, I think that makes sense, though it seems that requires a closer look and more tests. I believe the current state resolves the specific JIRA. Could we maybe merge this as is if you think either way is fine? I feel it has been dragged out by changes unrelated (or only loosely related) to the specific JIRA, and I hope it could be merged if that is okay with you. If it does not sound good to you, then let me try to take a look.
|
Hey @holdenk. I am willing to close this for now if you are not confident enough to merge it yet. I can re-open this later. |
|
@HyukjinKwon @holdenk Hi, are you still working on this? |
|
Yea, I was waiting for the feedback. It is also about the unicode vs byte string. |
|
I'd really like to see that further test I was talking about, @HyukjinKwon -- it shouldn't be too hard to do in this pr right? Just add a unicode string which doesn't make sense in down converted ascii. |
|
Do you mean a test case such as |
|
@HyukjinKwon Pretty much, yes. |
be6e483 to 485040b
    self.assertEqual(testParams._resolveParam("maxIter"), testParams.maxIter)

    self.assertEqual(testParams._resolveParam(u"maxIter"), testParams.maxIter)
    if sys.version_info[0] >= 3:
@holdenk, would this test address your concern enough?
Yes! :)
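For readers following along, the kind of check discussed here (a name that cannot be down-converted to ASCII) might look roughly like the sketch below. `ToyParams` is a hypothetical stand-in rather than the class from the test suite, and the non-ASCII name simply reuses the string from earlier in this thread.

```python
import sys


class ToyParams(object):
    """Hypothetical object with a single ASCII-named attribute."""
    maxIter = 10


# Python 3 accepts non-ASCII attribute names, so a missing one is an
# AttributeError. Python 2 first tries to encode the unicode name as ASCII
# and fails with UnicodeEncodeError before any lookup happens.
expected_error = AttributeError if sys.version_info[0] >= 3 else UnicodeEncodeError

try:
    getattr(ToyParams(), u"測試")
    raised = None
except (AttributeError, UnicodeEncodeError) as exc:
    raised = type(exc)

print(raised is expected_error)
```

Branching on `sys.version_info` keeps a single assertion valid on both interpreters, at the cost of encoding interpreter behavior in the test.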
|
Test build #79038 has finished for PR 17096 at commit
|
|
gentle ping ... |
|
Sorry for the delay. Let's get Jenkins to retest this and make sure everything is ok, but it looks like a good change :) LGTM pending jenkins/merge issues (if they show up during jenkins). Jenkins retest this please. |
|
Jenkins retest this please. |
|
Test build #81488 has finished for PR 17096 at commit
|
|
Thanks, merged to master! |
What changes were proposed in this pull request?
This PR proposes to support unicode strings in Param methods in ML and in other functions in DataFrame where that support was missed.
For example, this causes a ValueError in Python 2.x when param is a unicode string:
This PR is based on #13036
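The failure mode described above can be sketched with a toy resolver. This is an assumption-laden illustration, not the pyspark implementation: `ToyParams`, `Param`, `resolve_param`, and `string_types` are hypothetical names. The point is that a check against `str` alone rejects `unicode` on Python 2, while a check against `basestring` accepts both.

```python
import sys

# On Python 2, u"maxIter" is `unicode`, not `str`; `basestring` covers both.
# On Python 3 there is only `str`, and `basestring` is never evaluated here.
string_types = (str,) if sys.version_info[0] >= 3 else (basestring,)  # noqa: F821


class Param(object):
    """Hypothetical stand-in for a parameter object."""

    def __init__(self, name):
        self.name = name


class ToyParams(object):
    """Hypothetical holder with one parameter; not the pyspark class."""

    def __init__(self):
        self.maxIter = Param("maxIter")

    def resolve_param(self, param):
        # Accept a Param instance or any text name; reject everything else.
        if isinstance(param, Param):
            return param
        if isinstance(param, string_types):
            return getattr(self, param)
        raise ValueError("Cannot resolve %r as a param." % (param,))


params = ToyParams()
print(params.resolve_param(u"maxIter") is params.maxIter)  # True
```

Replacing `string_types` with plain `str` in this sketch would make the unicode call fall through to the `ValueError` branch on Python 2, which is the symptom the description mentions.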
How was this patch tested?
Unit tests in python/pyspark/ml/tests.py and python/pyspark/sql/tests.py.