[SPARK-40194][SQL] SPLIT function on empty regex should truncate trailing empty string. #37631

vitaliili-db · 2022-08-23T22:22:36Z

What changes were proposed in this pull request?

Adds a special case to the split function for when the regex parameter is empty. The input string is split into characters, but the empty remainder/tail string is dropped. I.e. split('hello', '') will produce ['h', 'e', 'l', 'l', 'o'] instead of ['h', 'e', 'l', 'l', 'o', '']. The old behavior is preserved when the limit parameter is explicitly set to a negative value, e.g. split('hello', '', -1).

Why are the changes needed?

This is nice to have and matches the intuitively expected behavior.

Does this PR introduce any user-facing change?

Yes.

The output of the split function changes when an empty regex parameter is provided and the limit parameter is either not specified or set to 0. In this case the trailing empty string is dropped.

How was this patch tested?

Unit tests.

vitaliili-db · 2022-08-23T23:42:13Z

@cloud-fan

cloud-fan · 2022-08-24T06:33:03Z

common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

+    if (limit == 0 && numBytes() != 0 && pattern.numBytes() == 0) {
+      byte[] input = getBytes();
+      UTF8String[] result = new UTF8String[numBytes];
+      for (int i = 0; i < numBytes; i++) {


Are we sure it's one byte per character?

Great catch, fixed!
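To illustrate the concern raised above: in UTF-8 a character can occupy more than one byte, so sizing the result array by byte count over-allocates for non-ASCII input. The following is a minimal sketch using plain `java.lang.String` rather than Spark's `UTF8String`; the class and method names are hypothetical and this is not the code from the PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of Spark.
final class CodePointSplit {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // One list element per Unicode code point (not per byte, not per UTF-16 unit).
    static List<String> splitIntoCodePoints(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int width = Character.charCount(s.codePointAt(i));
            out.add(s.substring(i, i + width));
            i += width;
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo"; // "héllo": the second character needs 2 bytes in UTF-8
        System.out.println(utf8Length(s));                 // 6
        System.out.println(splitIntoCodePoints(s).size()); // 5
    }
}
```

Iterating by code point keeps multi-byte characters intact, which is the property the byte-indexed loop in the reviewed diff did not guarantee.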

cloud-fan · 2022-08-24T06:34:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

  override def third: Expression = limit

-  def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(-1));
+  def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(0))


Ideally there should be no difference between 0 and -1 according to the function doc. Do you use -1 as a legacy flag?

Yes, you are correct. This should not change. Reverted.

AmplabJenkins · 2022-08-24T15:58:49Z

Can one of the admins verify this patch?

docs/sql-migration-guide.md

cloud-fan · 2022-08-24T16:10:53Z

What should be the behavior of split('hello', '', 100)? Do other databases return a trailing empty string?

Co-authored-by: Wenchen Fan <[email protected]>

vitaliili-db · 2022-08-24T18:39:48Z

@cloud-fan other databases don't have a limit parameter in their split functions and don't return trailing empty strings for an empty delimiter. Their behavior differs, though: e.g. BigQuery splits into characters (like this PR) while Snowflake returns the original string (like Spark's StringSplitSQL).

split('hello', '', 100) will return an array with a trailing empty string => ['h', 'e', 'l', 'l', 'o', '']

cloud-fan · 2022-08-25T02:50:15Z

This seems a bit inconsistent. According to the function doc, -1 means no limit, and it's confusing why no limit is different from a large enough limit (which means no limit as well).

I'd consider the trailing empty string a bug; can we fix it in all cases?

vitaliili-db · 2022-08-26T21:07:52Z

@cloud-fan the trailing empty string is not actually a bug; it behaves consistently in all systems, e.g. split("aaAbbAccA", "A") gives the same result in all systems => ['aa', 'bb', 'cc', '']. So we are pretty consistent here. The only difference in the absence of a limit parameter (e.g. for migration purposes) is how an empty delimiter/regex behaves. As mentioned, we might choose to either ignore the regex (return the original string), split and drop the trailing empty string (this PR), or do nothing and return an array with a trailing empty string.
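The origin of the trailing empty string can be reproduced with plain `java.lang.String.split`, which uses the same Java regex semantics the function doc refers to. This is only an illustrative sketch; the class and helper names are hypothetical, and it does not go through Spark's `UTF8String`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of Spark.
final class JavaSplitDemo {
    // Thin wrapper so the results are easy to compare and print.
    static List<String> split(String s, String regex, int limit) {
        return Arrays.asList(s.split(regex, limit));
    }

    public static void main(String[] args) {
        // A delimiter at the end of the input leaves an empty tail when limit is negative.
        System.out.println(split("aaAbbAccA", "A", -1).size()); // 4, last element is ""
        // An empty regex also matches at the end of the input, producing the same empty tail.
        System.out.println(split("hello", "", -1).size());      // 6, last element is ""
        // In Java itself, limit 0 discards trailing empty strings.
        System.out.println(split("hello", "", 0).size());       // 5
    }
}
```

The empty tail for an empty regex comes from the zero-width match at the end of the input, which is why it appears even though the string does not visibly end with a delimiter.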

cloud-fan · 2022-08-29T02:39:09Z

Ah, now I understand where the trailing empty string comes from, thanks! I agree to go with the "split and drop trailing empty string" behavior, but can we do it regardless of the limit parameter? I'd expect something like

if (split == "") {
  str.toSeq.take(limit)
}
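A runnable reading of the pseudocode above, as a minimal sketch: it assumes that a non-positive limit means "no limit" (as the function doc states) and works on `java.lang.String` instead of `UTF8String`. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of Spark.
final class EmptyPatternTake {
    // Splits into code points and keeps at most `limit` of them when limit > 0.
    // Characters beyond the limit are dropped, mirroring `str.toSeq.take(limit)`.
    static List<String> splitOnEmpty(String s, int limit) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length() && (limit <= 0 || out.size() < limit)) {
            int width = Character.charCount(s.codePointAt(i));
            out.add(s.substring(i, i + width));
            i += width;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(splitOnEmpty("hello", -1));  // [h, e, l, l, o]
        System.out.println(splitOnEmpty("hello", 1));   // [h]
        System.out.println(splitOnEmpty("1A2A3A4", 3)); // [1, A, 2]
    }
}
```

Under this reading no trailing empty string is ever produced, whatever the limit, and input beyond the limit is truncated rather than kept in the last element.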

vitaliili-db · 2022-08-29T23:47:22Z

@cloud-fan thank you, that is a great suggestion. Done.

cloud-fan · 2022-08-30T02:13:27Z

common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

  public UTF8String[] split(UTF8String pattern, int limit) {
+    // This is a special case for converting string into array of symbols without a trailing empty
+    // string. E.g. `"hello".split("", 0) => ["h", "e", "l", "l", "o"].
+    // Note that negative limit will preserve a trailing empty string.


We need to update this comment now.

Done, thank you for reviewing!

cloud-fan

LGTM except for one comment.

cloud-fan · 2022-08-31T08:51:52Z

thanks, merging to master!

… an empty regex and a limit

### What changes were proposed in this pull request?

After [SPARK-40194](#37631), the current behavior of the split function is as follows:

```
select split('hello', 'h', 1) // result is ["hello"]
select split('hello', '-', 1) // result is ["hello"]
select split('hello', '', 1) // result is ["h"]
select split('1A2A3A4', 'A', 3) // result is ["1","2","3A4"]
select split('1A2A3A4', '', 3) // result is ["1","A","2"]
```

However, according to the function's description, when the limit is greater than zero, the last element of the split result should contain the remaining part of the input string.

```
Arguments:
  * str - a string expression to split.
  * regex - a string representing a regular expression. The regex string should be a Java regular expression.
  * limit - an integer expression which controls the number of times the regex is applied.
      * limit > 0: The resulting array's length will not be more than `limit`, and the resulting array's last entry will contain all input beyond the last matched regex.
      * limit <= 0: `regex` will be applied as many times as possible, and the resulting array can be of any size.
```

So, the split function produces incorrect results with an empty regex and a limit. The correct result should be:

```
select split('hello', '', 1) // result is ["hello"]
select split('1A2A3A4', '', 3) // result is ["1","A","2A3A4"]
```

### Why are the changes needed?

Fix correctness issue.

### Does this PR introduce _any_ user-facing change?

Yes. When the empty regex parameter is provided along with a limit parameter greater than 0, the output of the split function changes.

Before this patch

```
select split('hello', '', 1) // result is ["h"]
select split('1A2A3A4', '', 3) // result is ["1","A","2"]
```

After this patch

```
select split('hello', '', 1) // result is ["hello"]
select split('1A2A3A4', '', 3) // result is ["1","A","2A3A4"]
```

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48470 from DenineLu/fix_split_on_empty_regex.

Authored-by: DenineLu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
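The corrected semantics described in this commit message can be sketched as follows. This is a hedged illustration on `java.lang.String`, not the code that was merged into `UTF8String`; the class and method names are hypothetical, and the behavior for an empty input string is an assumption of the sketch rather than something the message specifies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of Spark.
final class EmptyPatternRemainder {
    // Splits into code points; when limit > 0, the last of the `limit` elements
    // holds all remaining input instead of a single character.
    static List<String> splitOnEmpty(String s, int limit) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            if (limit > 0 && out.size() == limit - 1) {
                out.add(s.substring(i)); // remainder goes into the final slot
                break;
            }
            int width = Character.charCount(s.codePointAt(i));
            out.add(s.substring(i, i + width));
            i += width;
        }
        return out; // assumption: an empty input yields an empty list
    }

    public static void main(String[] args) {
        System.out.println(splitOnEmpty("hello", 1));   // [hello]
        System.out.println(splitOnEmpty("1A2A3A4", 3)); // [1, A, 2A3A4]
        System.out.println(splitOnEmpty("hello", 0));   // [h, e, l, l, o]
    }
}
```

Compared with a plain "take the first `limit` characters" approach, the only change is that the final slot receives the rest of the input, which is what the `limit > 0` clause of the function doc requires.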

[SPARK-40194][SQL] SPLIT function on empty regex should truncate trai…

2e5d698

…ling empty string.

github-actions bot added the SQL label Aug 23, 2022

Updated sql-migration-guide.md

885106c

github-actions bot added the DOCS label Aug 23, 2022

Some test fixes

7a67a41

cloud-fan reviewed Aug 24, 2022

vitaliili-db added 2 commits August 24, 2022 09:01

Revert default limit and fix UTF8 char point size

34c6b47

Merge branch 'master' of https://github.com/apache/spark into SC-103756

ff17ee4

vitaliili-db requested a review from cloud-fan August 24, 2022 16:03

cloud-fan reviewed Aug 24, 2022

Update docs/sql-migration-guide.md

29344fc

Co-authored-by: Wenchen Fan <[email protected]>

vitaliili-db requested a review from cloud-fan August 24, 2022 18:42

vitaliili-db added 2 commits August 29, 2022 16:42

Always ignore empty string for empty pattern regardless of limit

74b3f0b

Merge remote-tracking branch 'origin/SC-103756' into SC-103756

e05fd28

cloud-fan reviewed Aug 30, 2022

cloud-fan approved these changes Aug 30, 2022

Update comment for split function with empty pattern

7f6c0dd

cloud-fan approved these changes Aug 31, 2022

cloud-fan closed this in 247306c Aug 31, 2022

DenineLu mentioned this pull request Oct 15, 2024

[SPARK-49968][SQL] The split function produces incorrect results with an empty regex and a limit #48470

Closed

DenineLu referenced this pull request in DenineLu/spark Oct 21, 2024

Add some tests

a8327ec
