@sethah sethah commented May 10, 2016

What changes were proposed in this pull request?

The following methods used isinstance(value, str) for checking string types in Python ML params:

  • _resolveParam(param)
  • hasParam(param)

This causes a TypeError in Python 2.x when param is a unicode string:

>>> from pyspark.ml.classification import LogisticRegression
>>> lr = LogisticRegression()
>>> lr.hasParam("threshold")
True
>>> lr.hasParam(u"threshold")
Traceback (most recent call last):
 ...
    raise TypeError("hasParam(): paramName must be a string")
TypeError: hasParam(): paramName must be a string
>>> 
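
For illustration, here is a minimal sketch of a string check that accepts both `str` and `unicode` on Python 2 and still runs on Python 3. The names `string_types` and `is_param_name` are made up for this example and are not taken from the patch; the actual change may be structured differently.

```python
import sys

# On Python 2, `basestring` is the common base class of `str` and `unicode`.
# The name does not exist on Python 3, where every text string is a `str`.
if sys.version_info[0] >= 3:
    string_types = (str,)
else:
    string_types = (basestring,)  # noqa: F821 (only evaluated on Python 2)


def is_param_name(value):
    """Hypothetical helper: True if `value` is usable as a param name."""
    return isinstance(value, string_types)


# Both spellings of the same name pass the check; non-strings do not.
plain_ok = is_param_name("threshold")
unicode_ok = is_param_name(u"threshold")
number_ok = is_param_name(42)
```

With a check like this, `hasParam("threshold")` and `hasParam(u"threshold")` would take the same code path instead of the second one raising.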

How was this patch tested?

Unit tests added to python/pyspark/ml/tests.py

SparkQA commented May 10, 2016

Test build #58289 has finished for PR 13036 at commit 878fc5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sethah commented May 11, 2016

Currently investigating some other usages of isinstance(obj, str) for this PR. Will update soon.

sethah changed the title from "[SPARK-15243][ML][PYSPARK] Param methods should use basestring for type checking" to "[SPARK-15243][ML][SQL][PYSPARK] Param methods should use basestring for type checking" on May 11, 2016

sethah commented May 11, 2016

I updated instances of similar checks in the sql library as noted on the Jira. I searched and this type of check now only exists here and here. They don't cause problems with unicode though, so I did not change them, but I can do that if needed.

cc @viirya I'm not as familiar with the sql library, could you check those changes?

Also cc @holdenk @davies

SparkQA commented May 11, 2016

Test build #58386 has finished for PR 13036 at commit b04ac41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

     """
-    if not isinstance(col, str):
+    if not isinstance(col, basestring):
Member

I am not sure this change is needed. As far as I know, SQL column names may contain only letters, digits, and underscores, so it is unclear why users would pass a unicode string as a column name.

According to f958f27, it seems possible to use non-ASCII characters in column names.
I think there are use cases that call for non-ASCII characters in a column name.

Member

Ah, got it. I was only talking about the SQL parser.

Still, because a unicode column name is encoded with name.encode('utf-8'), it becomes a str instance. In other words, the schema still stores column names as str. This change, however, allows unicode input as col, so I think the two can end up mismatched.
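
The mismatch described above can be sketched without Spark. The helper `to_text` below is hypothetical (it is not part of PySpark); it only shows how an encoded name and a unicode name can be brought to one type before comparing them. It assumes UTF-8 as the encoding, as in the `name.encode('utf-8')` call mentioned above.

```python
def to_text(name, encoding="utf-8"):
    """Hypothetical helper: return `name` as text, decoding byte strings."""
    # On Python 2 `bytes` is an alias of `str`, so an encoded column name
    # is decoded here; unicode input is returned unchanged.
    if isinstance(name, bytes):
        return name.decode(encoding)
    return name


# A non-ASCII column name as the schema would hold it after encoding,
# and the same name as a caller might pass it in.
stored = u"\u6570\u91cf".encode("utf-8")
given = u"\u6570\u91cf"

# Comparing after normalization makes the two forms agree.
names_match = to_text(stored) == to_text(given)
```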

Member

So I think we don't need to do this.

Thank you for answering. I understand why isinstance(col, basestring) would not be needed here.

However, although column names are normally stored as str, they are stored as unicode in certain cases.
See SPARK-15244 for details.

Contributor Author

Is there some harm in allowing unicode here though? If my column is 'a' and I call sampleBy(u'a') it will work after this change, otherwise it will throw an error. I think it's better to treat 'a' and u'a' as equivalent...

I agree with you. Allowing unicode here causes no problems.
As you mentioned, it's better to handle both 'a' and u'a', because there are a few cases where unicode is passed (e.g. when __future__.unicode_literals is imported in Python 2).

@MechCoder
Contributor

lgtm

@jkbradley
Member

Checking old PRs---is this active still? The ML parts look good to me. I haven't checked the SQL ones carefully.

holdenk commented Oct 7, 2016

Just a quick ping @sethah - I know you're pretty busy, but I'm assuming this is still active. One minor note: it seems there is another new addition to types.py that should maybe also be changed.

SparkQA commented Oct 7, 2016

Test build #66542 has finished for PR 13036 at commit 48f0557.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 16, 2016

Test build #67032 has finished for PR 13036 at commit c6a8828.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sethah commented Oct 17, 2016

ping @holdenk @viirya

I think this is ready now :)

SparkQA commented Oct 17, 2016

Test build #67093 has finished for PR 13036 at commit 976d682.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

     def test_approxQuantile(self):
         df = self.sc.parallelize([Row(a=i) for i in range(10)]).toDF()
-        aq = df.stat.approxQuantile("a", [0.1, 0.5, 0.9], 0.1)
+        aq = df.stat.approxQuantile(u"a", [0.1, 0.5, 0.9], 0.1)
Member

In these tests the field names all consist of ASCII characters. Is it possible to add tests that use non-ASCII characters, so we can make sure those work too?
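
As a rough idea of what such a test could look like, here is a self-contained sketch that runs without Spark. `check_column_name` is a hypothetical stand-in for the type check a DataFrame method performs on `col`, and the error message is invented; a real test would call the DataFrame API on a frame whose columns carry non-ASCII names.

```python
import unittest

try:
    string_types = (basestring,)  # Python 2: covers both str and unicode
except NameError:
    string_types = (str,)         # Python 3: str is already unicode text


def check_column_name(col):
    """Hypothetical stand-in for the `col` type check in a DataFrame method."""
    if not isinstance(col, string_types):
        raise ValueError("col should be a string, but got %r" % type(col))
    return col


class NonAsciiColumnNameTest(unittest.TestCase):

    def test_accepts_non_ascii_names(self):
        # Escapes keep the source ASCII-only; the names themselves are not.
        for name in [u"a", u"\u6570\u91cf", u"pr\u00e9nom"]:
            self.assertEqual(check_column_name(name), name)

    def test_rejects_non_strings(self):
        with self.assertRaises(ValueError):
            check_column_name(10)
```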

holdenk commented Feb 15, 2017

Gentle ping, what's the status of this PR?

@sethah sethah closed this Feb 28, 2017

sethah commented Feb 28, 2017

@holdenk please feel free to take this over. Can't find time to work on it

holdenk commented Feb 28, 2017

OK, let's see if maybe @zero323 or @HyukjinKwon are interested in taking this over. Otherwise I'll add this to my backlog.

@HyukjinKwon
Member

I am happy to do so. I assume it is already almost done except for #13036 (comment)?

ghost pushed a commit to dbtsai/spark that referenced this pull request Sep 8, 2017
…am methods & functions in dataframe

## What changes were proposed in this pull request?

This PR proposes to support unicode strings in Param methods in ML and in other missed functions in DataFrame.

For example, this causes a `TypeError` in Python 2.x when param is a unicode string:

```python
>>> from pyspark.ml.classification import LogisticRegression
>>> lr = LogisticRegression()
>>> lr.hasParam("threshold")
True
>>> lr.hasParam(u"threshold")
Traceback (most recent call last):
 ...
    raise TypeError("hasParam(): paramName must be a string")
TypeError: hasParam(): paramName must be a string
```

This PR is based on apache#13036

## How was this patch tested?

Unit tests in `python/pyspark/ml/tests.py` and `python/pyspark/sql/tests.py`.

Author: hyukjinkwon <[email protected]>
Author: sethah <[email protected]>

Closes apache#17096 from HyukjinKwon/SPARK-15243.