Skip to content
Closed
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
add fallback to positional assignment
  • Loading branch information
BryanCutler committed Jun 14, 2018
commit d4b5da17452d6278435b011ef3ef5e83f360aaec
14 changes: 13 additions & 1 deletion python/pyspark/worker.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,9 @@
from pyspark.util import _get_argspec, fail_on_stopiteration
from pyspark import shuffle

if sys.version >= '3':
basestring = str

pickleSer = PickleSerializer()
utf8_deserializer = UTF8Deserializer()

Expand Down Expand Up @@ -110,7 +113,16 @@ def wrapped(key_series, value_series):
"Number of columns of the returned pandas.DataFrame "
"doesn't match specified schema. "
"Expected: {} Actual: {}".format(len(return_type), len(result.columns)))
return [(result[field.name], to_arrow_type(field.dataType)) for field in return_type]
try:
# Assign result columns by schema name
return [(result[field.name], to_arrow_type(field.dataType)) for field in return_type]
except KeyError:
if all(not isinstance(name, basestring) for name in result.columns):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just do isinstance(name, str) here to deal with python2/3?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe he's trying to deal with unicode case too just in python 2. isinstance(name, basestring) should be safer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we still need to check for the possibility that python 2 uses unicode.

# Assign result columns by position if they are not named with strings
return [(result[result.columns[i]], to_arrow_type(field.dataType))
for i, field in enumerate(return_type)]
else:
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we limit to just result columns not named with strings?

In the case we return a pd.DataFrame with matching field types, but not matching field names, we don't like to allow it?

If returned pd.DataFrame doesn't match return_type's column names, shouldn't we follow current behavior?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think when user specify column names explicitly on the returned pd.DataFrame but it doesn't match the schema, then it's most likely to be a bug / typo, so throw exception makes sense to me.

Copy link
Member Author

@BryanCutler BryanCutler May 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya I think that it's just very common for users to create a DataFrame with a dict using names as keys and not know that this can change the order of columns. So even if the field types all match (in the case of this JIRA they were all StringTypes), there could be a mix up between the data and column names. This is really weird and hard to figure out what is going on from the user perspective.

When defining the pandas_udf, the return type requires the field names, so if the returned DataFrame has columns indexed by strings, I think it's fair to assume that if they do not match it was a mistake. If the user wants to use positional columns, they can index by integers - and I'll add this to the docs.

That being said, I do suppose that this slightly changes the behavior if by chance the user had gone out of the way to make a pandas_udf by specifying columns with different names than the return type schema, but still with the same field type order. That seems pretty unlikely to me though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I saw you add document for this behavior. Looks good.


return wrapped

Expand Down