-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string. #37631
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ling empty string.
| if (limit == 0 && numBytes() != 0 && pattern.numBytes() == 0) { | ||
| byte[] input = getBytes(); | ||
| UTF8String[] result = new UTF8String[numBytes]; | ||
| for (int i = 0; i < numBytes; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we sure it's one byte per character?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch, fixed!
| override def third: Expression = limit | ||
|
|
||
| def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(-1)); | ||
| def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally 0 and -1 should be no difference according to the function doc. Do you use -1 as a legacy flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are correct. This should not change. Reverted.
|
Can one of the admins verify this patch? |
|
What should be the behavior of |
Co-authored-by: Wenchen Fan <[email protected]>
|
@cloud-fan other databases don't have
|
|
This seems a bit inconsistent. According to the function doc, I'd consider trailing empty string as a bug, can we fix it in all the cases? |
|
@cloud-fan trailing empty string is not actually a bug, it has consistent behavior in all systems, e.g. |
|
ah, now I understand where the trailing empty string comes from, thanks! I agree to go with the "split and drop trailing empty string" behavior, but can we do it regardless of the limit parameter? I'd expect something like |
|
@cloud-fan thank you, that is a great suggestion. Done. |
| public UTF8String[] split(UTF8String pattern, int limit) { | ||
| // This is a special case for converting string into array of symbols without a trailing empty | ||
| // string. E.g. `"hello".split("", 0) => ["h", "e", "l", "l", "o"]. | ||
| // Note that negative limit will preserve a trailing empty string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to update this comment now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thank you for reviewing!
cloud-fan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except for one comment.
|
thanks, merging to master! |
… an empty regex and a limit ### What changes were proposed in this pull request? After [SPARK-40194](#37631), the current behavior of the split function is as follows: ``` select split('hello', 'h', 1) // result is ["hello"] select split('hello', '-', 1) // result is ["hello"] select split('hello', '', 1) // result is ["h"] select split('1A2A3A4', 'A', 3) // result is ["1","2","3A4"] select split('1A2A3A4', '', 3) // result is ["1","A","2"] ``` However, according to the function's description, when the limit is greater than zero, the last element of the split result should contain the remaining part of the input string. ``` Arguments: * str - a string expression to split. * regex - a string representing a regular expression. The regex string should be a Java regular expression. * limit - an integer expression which controls the number of times the regex is applied. * limit > 0: The resulting array's length will not be more than `limit`, and the resulting array's last entry will contain all input beyond the last matched regex. * limit <= 0: `regex` will be applied as many times as possible, and the resulting array can be of any size. ``` So, the split function produces incorrect results with an empty regex and a limit. The correct result should be: ``` select split('hello', '', 1) // result is ["hello"] select split('1A2A3A4', '', 3) // result is ["1","A","2A3A4"] ``` ### Why are the changes needed? Fix correctness issue. ### Does this PR introduce _any_ user-facing change? Yes. When the empty regex parameter is provided along with a limit parameter greater than 0, the output of the split function changes. Before this patch ``` select split('hello', '', 1) // result is ["h"] select split('1A2A3A4', '', 3) // result is ["1","A","2"] ``` After this patch ``` select split('hello', '', 1) // result is ["hello"] select split('1A2A3A4', '', 3) // result is ["1","A","2A3A4"] ``` ### How was this patch tested? Unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #48470 from DenineLu/fix_split_on_empty_regex. Authored-by: DenineLu <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
Special case for
splitfunction whenregexparameter is empty. The result will split input string but avoid empty remainder/tail string. I.e.split('hello', '')will produce['h', 'e', 'l', 'l', 'o']instead of['h', 'e', 'l', 'l', 'o', '']. Old behavior is preserved whenlimitparameter is explicitly set to negative value -split('hello', '', -1)Why are the changes needed?
This is nice to have and matches intuitively expected behavior.
Does this PR introduce any user-facing change?
Yes.
splitfunction output changes when emptyregexparameter is provided andlimitparameter is not specified or set to 0. In this case a trailing empty string is ignored.How was this patch tested?
Unit tests.