[SPARK-40697][SQL] Add read-side char padding to cover external data files #38151
Conversation
some of the string functions may be affected, such as …

@yaooqinn any thoughts on the problems? I thought the only problem is perf, but char type is rarely used anyway.
} else if (numChars < limit) {
  return inputStr.rpad(limit, SPACE);
} else {
  return inputStr;
Shall we throw an exception if numChars exceeds the limit?
We need a test case for this branch, no matter what the expected behavior is.
We have tests for it, and I didn't add a length check, to match VARCHAR. Note: we can add a read-side length check for both CHAR and VARCHAR, but that's bad for perf as VARCHAR is common.
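For readers following along, here is a minimal standalone Scala sketch of the branching discussed in this thread. It is illustrative only: the real implementation lives in Spark's Java codegen utilities and operates on UTF8String, and the names below are made up for the example.

object ReadSidePaddingSketch {
  // Pads a stored value up to the declared CHAR length on read.
  def readSidePadding(inputStr: String, limit: Int): String = {
    val numChars = inputStr.length
    if (numChars == limit) {
      inputStr // already exactly CHAR(limit) wide
    } else if (numChars < limit) {
      inputStr + " " * (limit - numChars) // right-pad with spaces, like rpad
    } else {
      inputStr // over-limit: no read-side length check, value is returned as-is
    }
  }

  def main(args: Array[String]): Unit = {
    assert(readSidePadding("12", 3) == "12 ")
    assert(readSidePadding("123", 3) == "123")
    assert(readSidePadding("1234", 3) == "1234") // no exception thrown for over-length data
  }
}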
 * When comparing char type column/field with string literal or char type column/field,
 * right-pad the shorter one to the longer length.
 */
object ApplyCharTypePadding extends Rule[LogicalPlan] {
Why do we move the rule under the datasources package?
It was there in the first place: 5cfbddd#diff-da01d4c9147810ef330a7e70ad197fda5e768ad17558176e686d3e63139172b5
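Stepping back from the package-location question, here is a tiny Scala sketch (not the actual rule, which rewrites logical plans) of the comparison semantics that the ApplyCharTypePadding rule quoted above is there to preserve; all names are illustrative.

object CharComparisonSketch {
  // Right-pad the shorter operand so CHAR comparisons are insensitive to trailing spaces.
  private def rpad(s: String, len: Int): String =
    if (s.length >= len) s else s + " " * (len - s.length)

  def charEquals(left: String, right: String): Boolean = {
    val len = math.max(left.length, right.length)
    rpad(left, len) == rpad(right, len)
  }

  def main(args: Array[String]): Unit = {
    // A CHAR(5) column stores "abc  "; the literal 'abc' should still match.
    assert(charEquals("abc  ", "abc"))
    assert(!charEquals("abc", "abd"))
  }
}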
Would the result of concat('abc', 'defg') be changed from 'abcdefg' to 'abc defg'?

@yaooqinn It will not, as read-side padding only applies to char type.
Can we add some unit tests for …?

@yaooqinn you are probably confused by the 2 kinds of char paddings: …
withTempPath { dir =>
  withTable("t") {
    sql("SELECT '12' as col1, '12' as col2").write.format(format).save(dir.toString)
    sql(s"CREATE TABLE t (col1 char(3), col2 varchar(3)) using $format")
    sql(s"ALTER TABLE t SET LOCATION '$dir'")
    checkAnswer(spark.sql("select concat(col1, col2) from t"), Row("12 12"))
  }
}

I mean a test case like the above. W/ this PR, the 1st case below is a behavior change that needs a doc or unit test.
| sql("SELECT '12' as col1, '12' as col2").write.format(format).save(dir.toString) | ||
| sql(s"CREATE TABLE t (col1 char(3), col2 varchar(3)) using $format") | ||
| sql(s"ALTER TABLE t SET LOCATION '$dir'") | ||
| checkAnswer(spark.table("t"), Row("12 ", "12")) |
@yaooqinn I think this test covers the behavior change you mentioned? And I'd treat it as a bug fix, as the previous result didn't follow the CHAR type semantics.
Makes sense to me.
gengliangwang
left a comment
+1
Thanks for the review, merging to master!
dongjoon-hyun
left a comment
Thank you, @cloud-fan , @gengliangwang , @yaooqinn .
…plied if necessary

### What changes were proposed in this pull request?
This is a followup of #38151, to fix a perf issue. When struct/array/map doesn't contain char type field, we should not recreate the struct/array/map for nothing.

### Why are the changes needed?
fix a perf issue

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
new test

Closes #38479 from cloud-fan/char.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
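To make the follow-up concrete, here is a hedged sketch of the kind of check it implies: only walk and rebuild nested values when the schema actually contains a CHAR field. The object and function names are illustrative, not Spark's internal API; note that after analysis Spark often carries char/varchar information in string-field metadata rather than as a raw CharType, so the real check differs.

import org.apache.spark.sql.types._

object CharPaddingSchemaCheck {
  // Returns true only if some (possibly nested) field is a CHAR type.
  def containsCharType(dt: DataType): Boolean = dt match {
    case CharType(_)        => true
    case s: StructType      => s.fields.exists(f => containsCharType(f.dataType))
    case ArrayType(et, _)   => containsCharType(et)
    case MapType(kt, vt, _) => containsCharType(kt) || containsCharType(vt)
    case _                  => false
  }
}

With such a check, a struct/array/map value whose type contains no CHAR field can be passed through untouched instead of being recreated.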
What changes were proposed in this pull request?
The current char/varchar feature relies on the data source to take care of all the write paths and enforce the char/varchar semantics (length check, string padding). This is good for read performance, but it breaks down if some write paths did not follow the char/varchar semantics, e.g. a Parquet table written by old Spark versions that do not have the char/varchar types, or by other systems that do not recognize Spark's char/varchar types.
This PR adds read-side string padding for the char type, so that we can still guarantee the char type semantics as long as the underlying data is valid (not over length). The char type is rarely used and mostly exists for legacy reasons, so its read performance matters less than correctness here. People who are sure the data was written properly (e.g. in benchmarks) can still disable read-side padding via a config.
Note, we don't add a read-side length check, as the varchar type is widely used and we don't want to introduce a perf regression for the common case. Another reason is that it's better to reject invalid data on the write side; a read-side check won't help much.
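A hedged end-to-end sketch of the behavior described above. It assumes the read-side padding flag added by this PR is spark.sql.readSideCharPadding (verify the exact key in SQLConf for your Spark version); spark is an existing SparkSession and the path is hypothetical.

val dir = "/tmp/char_padding_demo" // hypothetical location

// Data written by a writer that knows nothing about CHAR, so no padding on disk.
spark.sql("SELECT '12' AS col1").write.mode("overwrite").parquet(dir)
spark.sql(s"CREATE TABLE t (col1 CHAR(3)) USING parquet LOCATION '$dir'")

// With read-side padding (the default after this PR), the value comes back
// padded to the declared length: "12 ".
spark.table("t").show()

// If the data is known to be written properly (e.g. for benchmarks), the
// padding can be disabled to avoid the extra work on read.
spark.conf.set("spark.sql.readSideCharPadding", "false")
spark.table("t").show() // returns "12" as stored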
Why are the changes needed?
To better enforce the char type semantics.
Does this PR introduce any user-facing change?
Yes. Now Spark returns correctly padded char type values even if the data source writer wrote the char type values without padding.
How was this patch tested?
Updated tests.