@vitaliili-db
Contributor

What changes were proposed in this pull request?

Adds a special case to the split function for an empty regex parameter: the input string is split into its characters without a trailing empty string. For example, split('hello', '') will produce ['h', 'e', 'l', 'l', 'o'] instead of ['h', 'e', 'l', 'l', 'o', '']. The old behavior is preserved when the limit parameter is explicitly set to a negative value, e.g. split('hello', '', -1).
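For reference, the two outputs above mirror what plain Java's `String.split` returns for an empty regex under a negative limit and under limit 0. The following is only a sketch in plain Java, not Spark's `UTF8String` code; the class and method names are illustrative, and it assumes Java 8 or later, where a zero-width match at the start of the input produces no leading empty string.

```java
import java.util.Arrays;

class EmptyRegexSplitDemo {
    // Negative limit: the regex is applied as often as possible and
    // trailing empty strings are kept.
    static String[] keepTrailing(String s) {
        return s.split("", -1);
    }

    // Limit 0: same matching, but Java discards trailing empty strings.
    static String[] dropTrailing(String s) {
        return s.split("", 0);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(keepTrailing("hello"))); // [h, e, l, l, o, ]
        System.out.println(Arrays.toString(dropTrailing("hello"))); // [h, e, l, l, o]
    }
}
```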

Why are the changes needed?

This is a nice-to-have that matches the intuitively expected behavior.

Does this PR introduce any user-facing change?

Yes.

  • The output of the split function changes when an empty regex parameter is provided and the limit parameter is either not specified or set to 0. In this case the trailing empty string is dropped.

How was this patch tested?

Unit tests.

@github-actions github-actions bot added the SQL label Aug 23, 2022
@github-actions github-actions bot added the DOCS label Aug 23, 2022
@vitaliili-db
Contributor Author

@cloud-fan

if (limit == 0 && numBytes() != 0 && pattern.numBytes() == 0) {
  byte[] input = getBytes();
  UTF8String[] result = new UTF8String[numBytes];
  for (int i = 0; i < numBytes; i++) {
Contributor

are we sure it's one byte per character?

Contributor Author

Great catch, fixed!
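The concern raised here is that UTF-8 uses more than one byte for non-ASCII characters, so an array sized by byte count and filled one byte at a time would cut such characters apart. A minimal sketch in plain Java that splits by Unicode code point instead; the class and method names are illustrative, and this is not the fix that was actually committed.

```java
import java.nio.charset.StandardCharsets;

class CodePointSplitDemo {
    // Splits a string into one element per Unicode code point.
    static String[] splitIntoCodePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo"; // "hello" with an e-acute as the second character
        // The e-acute takes two bytes in UTF-8, so byte count and
        // character count differ.
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 6
        System.out.println(splitIntoCodePoints(s).length);             // 5
    }
}
```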

override def third: Expression = limit

- def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(-1));
+ def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(0))
Contributor

Ideally there should be no difference between 0 and -1, according to the function doc. Do you use -1 as a legacy flag?

Contributor Author

Yes, you are correct. This should not change. Reverted.

@AmplabJenkins

Can one of the admins verify this patch?

@vitaliili-db vitaliili-db requested a review from cloud-fan August 24, 2022 16:03
@cloud-fan
Contributor

What should be the behavior of split('hello', '', 100)? Do other databases return a trailing empty string?

@vitaliili-db
Contributor Author

vitaliili-db commented Aug 24, 2022

@cloud-fan other databases don't have a limit parameter in their split functions and don't return trailing empty strings for an empty delimiter. Their behavior differs, though: e.g. BigQuery splits into characters (like this PR), while Snowflake returns the original string (like Spark's StringSplitSQL).

split('hello', '', 100) will return an array with a trailing empty string => ['h', 'e', 'l', 'l', 'o', '']

@vitaliili-db vitaliili-db requested a review from cloud-fan August 24, 2022 18:42
@cloud-fan
Contributor

This seems a bit inconsistent. According to the function doc, -1 means no limit, and it's confusing why no limit is different from a large enough limit (which means no limit as well).

I'd consider trailing empty string as a bug, can we fix it in all the cases?

@vitaliili-db
Contributor Author

@cloud-fan the trailing empty string is not actually a bug; it is consistent behavior across all systems, e.g. split("aaAbbAccA", "A") gives the same result in all systems => ['aa', 'bb', 'cc', '']. So we are pretty consistent here. The only difference in the absence of a limit parameter (e.g. for migration purposes) is how an empty delimiter/regex behaves. As mentioned, we can either ignore the regex (return the original string), split and drop the trailing empty string (this PR), or do nothing and return an array with a trailing empty string.
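The trailing-delimiter case can be reproduced with plain Java's `String.split`, shown here only as an illustration of the general convention (it says nothing about any particular database; the class and method names are made up for the sketch). A negative limit and a limit larger than the number of pieces give the same array, including the empty last element.

```java
import java.util.Arrays;

class TrailingDelimiterDemo {
    // Thin wrapper so the two limits can be compared side by side.
    static String[] split(String s, String regex, int limit) {
        return s.split(regex, limit);
    }

    public static void main(String[] args) {
        // The input ends with the delimiter, so the last piece is empty.
        System.out.println(Arrays.toString(split("aaAbbAccA", "A", -1)));  // [aa, bb, cc, ]
        System.out.println(Arrays.toString(split("aaAbbAccA", "A", 100))); // [aa, bb, cc, ]
    }
}
```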

@cloud-fan
Contributor

ah, now I understand where the trailing empty string comes from, thanks! I agree to go with the "split and drop trailing empty string" behavior, but can we do it regardless of the limit parameter? I'd expect something like

if (split == "") {
  str.toSeq.take(limit)
}

@vitaliili-db
Contributor Author

@cloud-fan thank you, that is a great suggestion. Done.
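The suggested `str.toSeq.take(limit)` behavior could look roughly like the following in plain Java. This is a hypothetical sketch, not the code that was merged: `splitChars` is an illustrative helper, and treating `limit <= 0` as "no limit" is an assumption taken from the function doc. Note that truncating this way discards everything after the first `limit` characters.

```java
import java.util.Arrays;

class SplitCharsWithLimitDemo {
    // Hypothetical helper: one element per code point, truncated to the
    // first `limit` elements when limit > 0; limit <= 0 means no limit.
    static String[] splitChars(String s, int limit) {
        String[] chars = s.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .toArray(String[]::new);
        if (limit > 0 && limit < chars.length) {
            return Arrays.copyOf(chars, limit);
        }
        return chars;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitChars("hello", -1)));  // [h, e, l, l, o]
        System.out.println(Arrays.toString(splitChars("hello", 100))); // [h, e, l, l, o]
        System.out.println(Arrays.toString(splitChars("hello", 1)));   // [h]
    }
}
```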

public UTF8String[] split(UTF8String pattern, int limit) {
  // This is a special case for converting string into array of symbols without a trailing empty
  // string. E.g. `"hello".split("", 0) => ["h", "e", "l", "l", "o"]`.
  // Note that negative limit will preserve a trailing empty string.
Contributor

We need to update this comment now.

Contributor Author

Done, thank you for reviewing!

Contributor

@cloud-fan cloud-fan left a comment

LGTM except for one comment.

@cloud-fan
Contributor

thanks, merging to master!

DenineLu referenced this pull request in DenineLu/spark Oct 21, 2024
cloud-fan pushed a commit that referenced this pull request Jul 25, 2025
… an empty regex and a limit

### What changes were proposed in this pull request?
After [SPARK-40194](#37631), the current behavior of the split function is as follows:
```
select split('hello', 'h', 1) // result is ["hello"]
select split('hello', '-', 1) // result is ["hello"]
select split('hello', '', 1)  // result is ["h"]

select split('1A2A3A4', 'A', 3) // result is ["1","2","3A4"]
select split('1A2A3A4', '', 3)  // result is ["1","A","2"]
```

However, according to the function's description, when the limit is greater than zero, the last element of the split result should contain the remaining part of the input string.
```
Arguments:
      * str - a string expression to split.
      * regex - a string representing a regular expression. The regex string should be a Java regular expression.
      * limit - an integer expression which controls the number of times the regex is applied.
          * limit > 0: The resulting array's length will not be more than `limit`, and the resulting array's last entry will contain all input beyond the last matched regex.
          * limit <= 0: `regex` will be applied as many times as possible, and the resulting array can be of any size.
```

So, the split function produces incorrect results with an empty regex and a limit. The correct result should be:
```
select split('hello', '', 1)    // result is ["hello"]

select split('1A2A3A4', '', 3)  // result is ["1","A","2A3A4"]
```
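The intended semantics, where the last entry keeps the rest of the input once the limit is reached, can be sketched in plain Java as follows. This is an illustration under stated assumptions rather than the patch itself: `splitChars` is a hypothetical helper, and it splits on Unicode code points.

```java
import java.util.ArrayList;
import java.util.List;

class SplitCharsKeepRemainderDemo {
    // Hypothetical helper: one element per code point, except that when
    // limit > 0 the element at index limit - 1 takes all remaining input.
    static String[] splitChars(String s, int limit) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            if (limit > 0 && out.size() == limit - 1) {
                out.add(s.substring(i)); // last entry keeps the remainder
                break;
            }
            int next = s.offsetByCodePoints(i, 1); // start of the next code point
            out.add(s.substring(i, next));
            i = next;
        }
        return out.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", splitChars("hello", 1)));   // hello
        System.out.println(String.join(",", splitChars("1A2A3A4", 3))); // 1,A,2A3A4
    }
}
```

Plain Java's `"1A2A3A4".split("", 3)` yields the same three elements, which is one way to cross-check the expected results listed above.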

### Why are the changes needed?
Fix correctness issue.

### Does this PR introduce _any_ user-facing change?
Yes.
When an empty regex is provided together with a limit greater than 0, the output of the split function changes.
Before this patch
```
select split('hello', '', 1)    // result is ["h"]
select split('1A2A3A4', '', 3)  // result is ["1","A","2"]
```
After this patch
```
select split('hello', '', 1)    // result is ["hello"]
select split('1A2A3A4', '', 3)  // result is ["1","A","2A3A4"]
```

### How was this patch tested?
Unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48470 from DenineLu/fix_split_on_empty_regex.

Authored-by: DenineLu
Signed-off-by: Wenchen Fan