[SPARK-49968][SQL] The split function produces incorrect results with an empty regex and a limit #48470
Conversation
uros-db left a comment:
thanks for making this change - however, please add collation-related tests as well, see:
test("StringSplit expression with collated strings")
in CollationRegexpExpressionsSuite.scala
Thank you for your guidance. The relevant tests have been added.
...c/test/scala/org/apache/spark/sql/catalyst/expressions/CollationRegexpExpressionsSuite.scala
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
for (int i = 0; i < newLimit - 1; i++) {
  int currCharNumBytes = numBytesForFirstByte(input[byteIndex]);
-  result[charIndex++] = UTF8String.fromBytes(input, byteIndex, currCharNumBytes);
+  result[i] = UTF8String.fromBytes(input, byteIndex, currCharNumBytes);
Suggested change:
-  result[i] = UTF8String.fromBytes(input, byteIndex, currCharNumBytes);
+  result[charIndex] = UTF8String.fromBytes(input, byteIndex, currCharNumBytes);
  result[i] = UTF8String.fromBytes(input, byteIndex, currCharNumBytes);
  byteIndex += currCharNumBytes;
}
result[newLimit - 1] = UTF8String.fromBytes(input, byteIndex, numBytes() - byteIndex);
is ArrayIndexOutOfBoundsException possible here?
what if newLimit=0 (i.e. numChars()=0, limit=-1)
is ArrayIndexOutOfBoundsException possible here? what if newLimit=0 (i.e. numChars()=0, limit=-1)
No, this code block is only entered when the following condition is met:
if (numBytes() != 0 && pattern.numBytes() == 0)
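For illustration, here is a minimal standalone sketch (plain JDK, hypothetical method name, not the Spark code path) of what the corrected loop does once that guard is satisfied: it emits one character per slot and lets the last slot keep the entire remaining suffix of the input.
static String[] splitByCharacterWithLimit(String input, int limit) {
  int numChars = input.codePointCount(0, input.length());
  if (numChars == 0) {
    // Mirrors the guard above: an empty input never reaches the per-character loop.
    return new String[] { "" };
  }
  int newLimit = (limit > 0 && limit <= numChars) ? limit : numChars;
  String[] result = new String[newLimit];
  int charIndex = 0;
  for (int i = 0; i < newLimit - 1; i++) {
    int next = input.offsetByCodePoints(charIndex, 1);
    result[i] = input.substring(charIndex, next);
    charIndex = next;
  }
  // The last element keeps the remainder, e.g. "llo" for splitByCharacterWithLimit("hello", 3).
  result[newLimit - 1] = input.substring(charIndex);
  return result;
}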
SELECT split('hello', '');
SELECT split('hello', '', 1);
SELECT split('hello', '', 3);
SELECT split('', '');
I would also prefer to see:
SELECT split('', '', -1);
SELECT split('', '', 0);
SELECT split('', '', 1);
here, for more complete testing
Thanks, already added.
StringSplitTestCase("1A2B3C", "", "UTF8_BINARY", Seq("1", "A", "2", "B", "3", "C"), -1),
StringSplitTestCase("1A2B3C", "", "UTF8_BINARY", Seq("1", "A", "2", "B", "3", "C"), 0),
StringSplitTestCase("1A2B3C", "", "UTF8_BINARY", Seq("1A2B3C"), 1),
StringSplitTestCase("1A2B3C", "", "UTF8_BINARY", Seq("1", "A", "2B3C"), 3),
StringSplitTestCase("1A2B3C", "", "UTF8_BINARY", Seq("1", "A", "2", "B", "3C"), 5),
StringSplitTestCase("1A2B3C", "", "UTF8_BINARY", Seq("1", "A", "2", "B", "3", "C"), 100),
StringSplitTestCase("1A2B3C", "", "UTF8_LCASE", Seq("1", "A", "2", "B", "3", "C"), -1),
StringSplitTestCase("1A2B3C", "", "UTF8_LCASE", Seq("1", "A", "2", "B", "3", "C"), 0),
StringSplitTestCase("1A2B3C", "", "UTF8_LCASE", Seq("1A2B3C"), 1),
StringSplitTestCase("1A2B3C", "", "UTF8_LCASE", Seq("1", "A", "2B3C"), 3),
StringSplitTestCase("1A2B3C", "", "UTF8_LCASE", Seq("1", "A", "2", "B", "3C"), 5),
StringSplitTestCase("1A2B3C", "", "UTF8_LCASE", Seq("1", "A", "2", "B", "3", "C"), 100),
sorry, I meant to request using collation (other than UTF8_BINARY) here
In the current situation, if UTF8_LCASE is applied with an empty pattern, the condition here will not be met, because after collationAwareRegex collates the pattern its value becomes (?ui). This means that #37631 does not support truncating the trailing empty string when the pattern is (?ui).
public UTF8String[] split(UTF8String pattern, int limit) {
  // For the empty `pattern` a `split` function ignores trailing empty strings unless original
  // string is empty.
  if (numBytes() != 0 && pattern.numBytes() == 0) {
Therefore, it seems that the result is not what we want when limit <= 0.
select split('1A2B3C', '(?ui)', -1); // result is ["1", "A", "2", "B", "3", "C", ""]
select split('1A2B3C', '(?ui)', 0); // result is ["1", "A", "2", "B", "3", "C", ""]
select split('1A2B3C', '(?ui)', 1); // result is ["1A2B3C"]
select split('1A2B3C', '(?ui)', 3); // result is ["1", "A", "2B3C"]
select split('1A2B3C', '(?ui)', 6); // result is ["1", "A", "2", "B", "3", "C"]
select split('1A2B3C', '(?ui)', 100); // result is ["1", "A", "2", "B", "3", "C"]
When the pattern is "(?ui)", a simple and direct approach can be taken to correct the result.
public UTF8String[] split(UTF8String pattern, int limit) {
  // For the empty `pattern` a `split` function ignores trailing empty strings unless original
  // string is empty.
  if (numBytes() != 0 && (pattern.numBytes() == 0 || lowercaseRegexPrefix.equals(pattern))) {
However, when the pattern is "(?ui)(?ui)" or "(?ui)(?ui)(?ui)", the result still contains a trailing empty string, and I haven't thought of an efficient way to match and resolve it. Should we consider this a reasonable situation?
Additionally, do you think it is necessary to check in the CollationSupport.lowercaseRegex method whether the regex already has a (?ui) prefix?
uros-db left a comment:
left just one more comment, otherwise lgtm (mostly focusing on collation behaviour)
adding @vitaliili-db @cloud-fan to carefully review, extending on #37631
how about: instead of checking whether the pattern equals the (?ui) prefix, we modify the collation implementation (the prefixing logic) to avoid appending the prefix at all when the pattern is an empty string
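A hedged sketch of that suggestion, assuming CollationSupport exposes the lowercaseRegexPrefix constant and a lowercaseRegex helper roughly shaped like this (names are taken from the discussion above, not verified against the actual source):
// Sketch only: skip the (?ui) prefix for an empty pattern, so UTF8String.split
// still sees an empty pattern and keeps its special handling of trailing empty strings.
public static UTF8String lowercaseRegex(final UTF8String regex) {
  if (regex.numBytes() == 0) {
    return regex;
  }
  return UTF8String.concat(lowercaseRegexPrefix, regex);
}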
I agree with what you're saying, but should we consider that the user's pattern itself might be (?ui) and is unrelated to prefixing logic?
that is an interesting observation, although in that case I don't see why the user's pattern can't be any other flag modifier combination, such as: (?m), (?s), (?x), (?a)
taking this into consideration, there is really nothing special about lowercaseRegexPrefix. instead, you should look for a library method that can discern whether a pattern is "functionally" empty, instead of doing a manual check against lowercaseRegexPrefix
Thank you for your explanation. It looks like there’s no way to validate this "weird" situation without losing performance. I made changes according to your advice. Thanks again.
uros-db left a comment:
passing on to @vitaliili-db @cloud-fan @MaxGekk for further review
What changes were proposed in this pull request?
After SPARK-40194, when an empty regex is used together with a limit greater than zero, the split function does not keep the remaining part of the input string in the last element. However, according to the function's description, when the limit is greater than zero the last element of the split result should contain the remaining part of the input string. So the split function produces incorrect results with an empty regex and a limit; the correct result keeps the remainder in the last element, for example:
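As a hedged illustration of the corrected behaviour, the expected values below mirror the UTF8_BINARY test cases added in this PR; calling UTF8String.split directly is just for the example.
import org.apache.spark.unsafe.types.UTF8String;

public class SplitEmptyRegexExample {
  public static void main(String[] args) {
    UTF8String input = UTF8String.fromString("1A2B3C");
    UTF8String empty = UTF8String.fromString("");
    // With a positive limit, the last element keeps the remaining part of the input.
    System.out.println(java.util.Arrays.toString(input.split(empty, 1)));   // [1A2B3C]
    System.out.println(java.util.Arrays.toString(input.split(empty, 3)));   // [1, A, 2B3C]
    System.out.println(java.util.Arrays.toString(input.split(empty, 5)));   // [1, A, 2, B, 3C]
    // With a non-positive limit, every character becomes its own element.
    System.out.println(java.util.Arrays.toString(input.split(empty, -1)));  // [1, A, 2, B, 3, C]
  }
}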
Why are the changes needed?
Fix correctness issue.
Does this PR introduce any user-facing change?
Yes.
When an empty regex parameter is provided along with a limit parameter greater than 0, the output of the split function changes: before this patch the remaining part of the input string was dropped from the last element; after this patch the last element contains it.
How was this patch tested?
Unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.