[SPARK-26979][PYTHON] Add missing string column name support for some SQL functions #23882
Changes from all commits
4e22433
935bf4c
cf8ede4
23d0222
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -37,8 +37,8 @@ | |
| from pyspark.sql.udf import UserDefinedFunction, _create_udf | ||
|
|
||
|
|
||
| def _create_function(name, doc=""): | ||
| """ Create a function for aggregator by name""" | ||
| def _create_name_function(name, doc=""): | ||
| """ Create a function that takes a column name argument, by name""" | ||
| def _(col): | ||
| sc = SparkContext._active_spark_context | ||
| jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col) | ||
|
|
@@ -48,6 +48,17 @@ def _(col): | |
| return _ | ||
|
|
||
|
|
||
| def _create_function(name, doc=""): | ||
| """ Create a function that takes a Column object, by name""" | ||
| def _(col): | ||
| sc = SparkContext._active_spark_context | ||
| jc = getattr(sc._jvm.functions, name)(_to_java_column(col)) | ||
| return Column(jc) | ||
| _.__name__ = name | ||
| _.__doc__ = doc | ||
| return _ | ||
|
|
||
|
|
||
| def _wrap_deprecated_function(func, message): | ||
| """ Wrap the deprecated function to print out deprecation warnings""" | ||
| def _(col): | ||
|
|
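The difference between the two factories in the hunks above can be sketched without a running Spark session. This is only an illustration: `Column`, `FakeJvmFunctions`, and `_to_java_column` below are simplified stand-ins (assumptions, not the real PySpark objects), and the tail of `_create_name_function` is assumed to mirror `_create_function`, since the diff elides it.

```python
class Column:
    """Stand-in for pyspark.sql.Column; wraps a fake JVM column handle."""
    def __init__(self, jc):
        self._jc = jc


class FakeJvmFunctions:
    """Hypothetical stand-in for sc._jvm.functions: records the call it receives."""
    def __getattr__(self, name):
        def call(arg):
            return (name, arg)
        return call


_jvm_functions = FakeJvmFunctions()


def _to_java_column(col):
    """Stand-in: turn a Column or a column-name string into a fake JVM column."""
    if isinstance(col, Column):
        return col._jc
    if isinstance(col, str):
        return ("col", col)
    raise TypeError("Invalid argument, not a string or column: %r" % (col,))


def _create_name_function(name, doc=""):
    """Factory for functions whose JVM side accepts the raw argument (e.g. a name)."""
    def _(col):
        # Non-Column arguments are handed to the JVM function unchanged.
        jc = getattr(_jvm_functions, name)(col._jc if isinstance(col, Column) else col)
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _


def _create_function(name, doc=""):
    """Factory for functions whose JVM side expects a Column."""
    def _(col):
        # Strings are converted to a column before the JVM function is called.
        jc = getattr(_jvm_functions, name)(_to_java_column(col))
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _


asc = _create_name_function("asc", "Sort ascending by column name.")
upper = _create_function("upper", "Upper-case a string column.")

print(asc("age")._jc)     # the raw string is passed through
print(upper("name")._jc)  # the string is converted to a column first
```

In this sketch the only behavioral difference is what reaches the JVM-side call: the first factory forwards a non-`Column` argument untouched (which is also why a literal passed to `lit` works there), while the second always converts it to a column first.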
@@ -85,13 +96,16 @@ def _(): | |
| >>> df.select(lit(5).alias('height')).withColumn('spark_user', lit(True)).take(1) | ||
| [Row(height=5, spark_user=True)] | ||
| """ | ||
| _functions = { | ||
| _name_functions = { | ||
| # name functions take a column name as their argument | ||
| 'lit': _lit_doc, | ||
|
Member
Does
Member
and .. what's really "name function" ... ?
Author
No. The name "name function" is something I came up with just to distinguish these functions from the ones that take columns as input. They are defined by that distinction - they are "functions that take a column name as their argument", exclusively.
Author
To be fair,
Member
One (lit) of five (col, column, asc, desc, lit) doesn't sound like a special case tho. There should be a better category if 20% of the items don't fit the category. |
||
| 'col': 'Returns a :class:`Column` based on the given column name.', | ||
| 'column': 'Returns a :class:`Column` based on the given column name.', | ||
| 'asc': 'Returns a sort expression based on the ascending order of the given column name.', | ||
| 'desc': 'Returns a sort expression based on the descending order of the given column name.', | ||
| } | ||
|
|
||
| _functions = { | ||
| 'upper': 'Converts a string expression to upper case.', | ||
| 'lower': 'Converts a string expression to upper case.', | ||
| 'sqrt': 'Computes the square root of the specified float value.', | ||
|
|
@@ -141,7 +155,7 @@ def _(): | |
| 'bitwiseNOT': 'Computes bitwise not.', | ||
| } | ||
|
|
||
| _functions_2_4 = { | ||
| _name_functions_2_4 = { | ||
| 'asc_nulls_first': 'Returns a sort expression based on the ascending order of the given' + | ||
| ' column name, and null values return before non-null values.', | ||
| 'asc_nulls_last': 'Returns a sort expression based on the ascending order of the given' + | ||
|
|
@@ -254,6 +268,8 @@ def _(): | |
| _functions_deprecated = { | ||
| } | ||
|
|
||
| for _name, _doc in _name_functions.items(): | ||
| globals()[_name] = since(1.3)(_create_name_function(_name, _doc)) | ||
| for _name, _doc in _functions.items(): | ||
| globals()[_name] = since(1.3)(_create_function(_name, _doc)) | ||
| for _name, _doc in _functions_1_4.items(): | ||
|
|
@@ -268,8 +284,8 @@ def _(): | |
| globals()[_name] = since(2.1)(_create_function(_name, _doc)) | ||
| for _name, _message in _functions_deprecated.items(): | ||
| globals()[_name] = _wrap_deprecated_function(globals()[_name], _message) | ||
| for _name, _doc in _functions_2_4.items(): | ||
| globals()[_name] = since(2.4)(_create_function(_name, _doc)) | ||
| for _name, _doc in _name_functions_2_4.items(): | ||
| globals()[_name] = since(2.4)(_create_name_function(_name, _doc)) | ||
| del _name, _doc | ||
|
|
||
|
|
||
|
|
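The registration loops in this hunk follow a simple pattern: each dict maps a function name to its docstring, and each loop publishes a generated wrapper into the module namespace. A minimal sketch of that pattern is below. The `since` decorator and `_create_function` here are simplified stand-ins, and `_string_functions` with its version number is a hypothetical second dict added only to show what happens when a name is registered twice.

```python
def since(version):
    """Hypothetical stand-in for pyspark's `since`: notes the version in the docstring."""
    def deco(f):
        f.__doc__ = (f.__doc__ or "").rstrip() + "\n\n.. versionadded:: %s" % version
        return f
    return deco


def _create_function(name, doc=""):
    """Stand-in factory; the real one calls into the JVM instead of building a string."""
    def _(col):
        return "%s(%s)" % (name, col)
    _.__name__ = name
    _.__doc__ = doc
    return _


_functions = {
    "upper": "Converts a string expression to upper case.",
    "sqrt": "Computes the square root of the specified float value.",
}
# Hypothetical later dict that repeats a name already registered above.
_string_functions = {
    "upper": "Converts a string column to upper case.",
}

for _name, _doc in _functions.items():
    globals()[_name] = since(1.3)(_create_function(_name, _doc))
for _name, _doc in _string_functions.items():
    # The version here is illustrative only.
    globals()[_name] = since(1.5)(_create_function(_name, _doc))
del _name, _doc  # keep the loop variables out of the module namespace

print(upper("name"))
print(upper.__doc__)
```

Because each loop assigns into `globals()`, a name that appears in more than one dict silently ends up bound to whichever wrapper was registered last, along with that wrapper's docstring and version note.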
@@ -1437,10 +1453,6 @@ def hash(*cols): | |
| 'ascii': 'Computes the numeric value of the first character of the string column.', | ||
| 'base64': 'Computes the BASE64 encoding of a binary column and returns it as a string column.', | ||
| 'unbase64': 'Decodes a BASE64 encoded string column and returns it as a binary column.', | ||
| 'initcap': 'Returns a new string column by converting the first letter of each word to ' + | ||
| 'uppercase. Words are delimited by whitespace.', | ||
| 'lower': 'Converts a string column to lower case.', | ||
| 'upper': 'Converts a string column to upper case.', | ||
|
Member
This had to stay in string functions!
Member
I was saying Also, strictly this should not have been removed in this PR, as it doesn't aim to remove overwritten functions. As you said, we should avoid this way of defining functions later.
Member
Also, if it was an exception, it should have been described specifically. This doesn't seem to make sense if you read the code from scratch.
Author
The java/scala API documentation says it was added in 1.3, but I just tracked down the JIRA/PR and it seems it actually was 1.0. https://issues.apache.org/jira/browse/SPARK-1995 As for removing overwritten functions, maybe it would've been better to make a separate PR, but the first fix did require removing them. When I changed the approach it seemed reasonable to keep the change, since the problem was obvious and easy to fix. |
||
| 'ltrim': 'Trim the spaces from left end for the specified string value.', | ||
| 'rtrim': 'Trim the spaces from right end for the specified string value.', | ||
| 'trim': 'Trim the spaces from both ends for the specified string column.', | ||
|
Member
Did you not add the column name support because the Scala side has the signatures below?
def trim(e: Column): Column
def trim(e: Column, trimString: String): Column
That's not allowed in Python:
>>> from pyspark.sql.functions import trim, lit
>>> spark.range(1).select(lit('a').alias("value")).select(trim("value", "a"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: _() takes exactly 1 argument (2 given)
Member
This is what I initially expected when I asked to whitelist them, @asmello
Author
I don't understand what you're saying here.
Member
I was saying this because I didn't get why you excluded string functions. |
||
|
|
||
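The `TypeError` quoted in the review thread above comes from the generated wrapper accepting exactly one positional argument, regardless of how many overloads exist on the Scala side. The sketch below reproduces that in plain Python. `_create_function` is a stand-in that returns a tuple instead of calling the JVM, and `trim_with_optional` is a hypothetical hand-written wrapper, not something this PR or PySpark is claimed to provide.

```python
def _create_function(name, doc=""):
    """Stand-in factory: the generated wrapper takes exactly one argument."""
    def _(col):
        return (name, col)
    _.__name__ = name
    _.__doc__ = doc
    return _


trim = _create_function("trim")

try:
    trim("value", "a")  # a second argument is rejected by the one-argument wrapper
except TypeError as exc:
    print("TypeError:", exc)


def trim_with_optional(col, trim_string=None):
    """Hypothetical hand-written wrapper covering both Scala overloads."""
    if trim_string is None:
        return ("trim", col)
    return ("trim", col, trim_string)


print(trim_with_optional("value"))
print(trim_with_optional("value", "a"))
```

Supporting a second overload therefore means writing the function by hand rather than generating it from a name-to-docstring dict.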