[SPARK-15243][ML][SQL][PYSPARK] Param methods should use basestring for type checking #13036
Conversation
Test build #58289 has finished for PR 13036 at commit

Currently investigating some other usages of

I updated instances of similar checks in the sql library as noted on the Jira. I searched and this type of check now only exists here and here. They don't cause problems with unicode though, so I did not change them, but I can do that if needed. cc @viirya I'm not as familiar with the sql library, could you check those changes?

Test build #58386 has finished for PR 13036 at commit
| """ | ||
| if not isinstance(col, str): | ||
| if not isinstance(col, basestring): |
I am not sure this change is needed. I think SQL column names only allow alphabetic characters, digits, and underscores, so it is a question why users would use a unicode string as a column name in particular.
According to f958f27, it seems possible to use non-ASCII characters in column names.
I think there are use cases that want non-ASCII characters in column names.
Ah, got it. I just meant from the SQL parser's perspective.
Similarly, since a unicode column name will be encoded by name.encode('utf-8'), it becomes a str instance. In other words, the schema still stores column names as str. However, this change allows unicode input as col, so I think there will be a mismatch.
So I think we don't need to do this.
Thank you for answering. I understand why isinstance(col, basestring) is not needed here.
Although column names are basically stored as str, they are stored as unicode in a certain case.
See SPARK-15244 for details.
Is there some harm in allowing unicode here though? If my column is 'a' and I call sampleBy(u'a') it will work after this change, otherwise it will throw an error. I think it's better to treat 'a' and u'a' as equivalent...
I agree with you. There is no problem caused by allowing unicode here.
As you mentioned, it's better to handle both 'a' and u'a', because there are a few cases where unicode is passed (e.g. when __future__.unicode_literals is imported in Python 2).
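To make this concrete, here is a minimal pure-Python sketch of the relaxed check (no Spark required; `check_col` is a hypothetical helper, not actual PySpark API), using the common alias trick so the same `isinstance` test also runs on Python 3:

```python
import sys

# On Python 3, `basestring` no longer exists; alias it to `str` so the
# isinstance check below works on both major versions.
if sys.version_info[0] >= 3:
    basestring = str

def check_col(col):
    # Accept both str and unicode (e.g. the u'a' that
    # `from __future__ import unicode_literals` produces in Python 2).
    if not isinstance(col, basestring):
        raise TypeError("col should be a string, got %r" % type(col))
    return col

print(check_col("a"))   # plain literal passes
print(check_col(u"a"))  # unicode literal passes too
```

With the original `isinstance(col, str)` check, the `u"a"` call would fail under Python 2.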
lgtm

Checking old PRs: is this still active? The ML parts look good to me. I haven't checked the SQL ones carefully.

Just a quick ping @sethah - I know you're pretty busy but I'm assuming this is still active. One minor note: it seems there is another new addition to types.py which should maybe also be changed.

Test build #66542 has finished for PR 13036 at commit

Test build #67032 has finished for PR 13036 at commit

Test build #67093 has finished for PR 13036 at commit
```
    def test_approxQuantile(self):
        df = self.sc.parallelize([Row(a=i) for i in range(10)]).toDF()
-       aq = df.stat.approxQuantile("a", [0.1, 0.5, 0.9], 0.1)
+       aq = df.stat.approxQuantile(u"a", [0.1, 0.5, 0.9], 0.1)
```
Basically, the field names in these tests are all ASCII characters. Is it possible to add tests using non-ASCII characters so we can make sure it works?
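As a pure-Python illustration of the reviewer's point (no Spark session here; a real test would go through `df.stat.approxQuantile`, and the column name below is only an example), a non-ASCII unicode column name is still a string and passes the relaxed check:

```python
import sys

# Alias for Python 3, where `basestring` was removed.
if sys.version_info[0] >= 3:
    basestring = str

col = u"列"  # hypothetical non-ASCII column name
print(isinstance(col, basestring))  # passes the relaxed type check
print(isinstance("a", basestring))  # plain ASCII names still pass
```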
Gentle ping, what's the status of this PR?

@holdenk please feel free to take this over. Can't find time to work on it.

Ok, let's see if maybe @zero323 or @HyukjinKwon are interested in taking this over. Otherwise I'll add this to my backlog.

I am happy to do so. It seems almost done except for #13036 (comment)?
…am methods & functions in dataframe
## What changes were proposed in this pull request?
This PR proposes to support unicode strings in Param methods in ML and in other missed functions in DataFrame.
For example, this causes a `TypeError` in Python 2.x when the param is a unicode string:
```python
>>> from pyspark.ml.classification import LogisticRegression
>>> lr = LogisticRegression()
>>> lr.hasParam("threshold")
True
>>> lr.hasParam(u"threshold")
Traceback (most recent call last):
...
raise TypeError("hasParam(): paramName must be a string")
TypeError: hasParam(): paramName must be a string
```
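A hedged sketch of the shape of the fix (the class and param names below are illustrative stand-ins, not the actual `pyspark.ml.param.Params` implementation):

```python
import sys

# `basestring` was removed in Python 3; alias it so the same check
# covers str on Python 3 and both str and unicode on Python 2.
if sys.version_info[0] >= 3:
    basestring = str

class Params(object):
    """Illustrative stand-in for pyspark.ml.param.Params."""
    def __init__(self):
        self._paramNames = {"threshold"}

    def hasParam(self, paramName):
        # The fix: check against basestring instead of str, so that
        # u"threshold" is accepted on Python 2 as well.
        if isinstance(paramName, basestring):
            return paramName in self._paramNames
        raise TypeError("hasParam(): paramName must be a string")

lr = Params()
print(lr.hasParam("threshold"))   # True
print(lr.hasParam(u"threshold"))  # True after the fix
```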
This PR is based on apache#13036
## How was this patch tested?
Unit tests in `python/pyspark/ml/tests.py` and `python/pyspark/sql/tests.py`.
Author: hyukjinkwon <[email protected]>
Author: sethah <[email protected]>
Closes apache#17096 from HyukjinKwon/SPARK-15243.
## What changes were proposed in this pull request?

The following methods used `isinstance(value, str)` for checking string types in Python ML params:

- `_resolveParam(param)`
- `hasParam(param)`

This causes a `TypeError` in Python 2.x when `param` is a unicode string.

## How was this patch tested?

Unit tests added to `python/ml/tests.py`.