Conversation

@cloud-fan (Contributor) commented on Oct 7, 2022:

What changes were proposed in this pull request?

The current char/varchar feature relies on the data source to take care of all the write paths and enforce the char/varchar semantic (length check, string padding). This is good for read performance, but breaks down if some write path did not follow the char/varchar semantic: e.g. a Parquet table can be written by old Spark versions that do not have the char/varchar type, or by other systems that do not recognize Spark's char/varchar type.

This PR adds read-side string padding for the char type, so that we can still guarantee the char type semantic as long as the underlying data is valid (not over length). The char type is rarely used, for legacy reasons, and its perf doesn't matter that much; correctness is more important here. Users who are sure the data was written properly (e.g. in benchmarks) can still disable read-side padding via a config.
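
A minimal sketch of the opt-out, assuming the config key introduced by this PR is `spark.sql.readSideCharPadding` (the key name, table name, and path below are illustrative, not taken from the PR text):

```scala
// Assumption: the opt-out switch added by this PR is named
// "spark.sql.readSideCharPadding" (default true); verify in SQLConf.
// Simulate a file written without CHAR padding by an external writer:
spark.sql("SELECT '12' AS c").write.parquet("/tmp/char_demo")
spark.sql("CREATE TABLE t (c CHAR(3)) USING parquet LOCATION '/tmp/char_demo'")

spark.table("t").show()   // read-side padding on (default): c = '12 '

spark.conf.set("spark.sql.readSideCharPadding", "false")
spark.table("t").show()   // padding off: c = '12' (the raw stored value)
```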

Note that we don't add a read-side length check, as the varchar type is widely used and we don't want to introduce a perf regression for the common case. Another reason is that it's better to reject invalid data on the write side; a read-side check wouldn't help much.
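
For context, a sketch of the existing write-side enforcement that motivates skipping the read-side check (illustrative only; the exact error message varies across versions):

```scala
// Write-side length check, unchanged by this PR: Spark's own write path
// rejects over-length values, which is why a read-side length check would
// add cost without catching much in practice.
spark.sql("CREATE TABLE v (c VARCHAR(3)) USING parquet")
spark.sql("INSERT INTO v VALUES ('abcd')")
// => fails with an error like "exceeds varchar(3) length limitation"
```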

Why are the changes needed?

To better enforce the char type semantic.

Does this PR introduce any user-facing change?

Yes. Spark can now return correctly padded char type values even if the data source writer wrote them without padding.

How was this patch tested?

Updated existing tests.

@cloud-fan (Contributor, Author):

cc @yaooqinn @gengliangwang

@yaooqinn (Member):

Some of the string functions may be affected, such as length, rpad, concat, etc.

@cloud-fan (Contributor, Author):

@yaooqinn any thoughts on the problems? I thought the only problem is perf, but the char type is rarely used anyway.

} else if (numChars < limit) {
  // pad short values up to the declared CHAR length
  return inputStr.rpad(limit, SPACE);
} else {
  // over-length values pass through unchanged: no read-side length check
  return inputStr;
}
Review comment (Member):

Shall we throw an exception if numChars exceeds the limit?

Review comment (Member):

We need a test case for this branch, no matter what the expected behavior is.

@cloud-fan (Contributor, Author) replied on Oct 11, 2022:

We have tests for it, and I didn't add a length check, to match VARCHAR. Note: we could add a read-side length check for both CHAR and VARCHAR, but that's bad for perf, as VARCHAR is common.

* When comparing char type column/field with string literal or char type column/field,
* right-pad the shorter one to the longer length.
*/
object ApplyCharTypePadding extends Rule[LogicalPlan] {
Review comment (Member):

Why do we move the rule under the datasources package?


@yaooqinn (Member):

Would the result of concat('abc', 'defg') be changed from 'abcdefg' to 'abc defg'?

@cloud-fan (Contributor, Author):

@yaooqinn It will not, as read-side padding only applies to the char type.

@yaooqinn (Member):

Can we add some unit tests for char + string funcs?

@cloud-fan (Contributor, Author) commented on Oct 11, 2022:

@yaooqinn you are probably confused by the 2 kinds of char padding; see the sketch below:

  1. Column value padding. This is done at the write side, and this PR adds read-side padding as well. It doesn't matter how/where the column is used; the semantic is the same as if the column value had been padded while writing.
  2. Char comparison padding. This is done at the read side, and this PR doesn't touch it. When comparing a char column with a string literal, we pad the shorter one to the longer length.
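
To make the distinction concrete, a sketch with a hypothetical CHAR(5) table (names are illustrative):

```scala
// Hypothetical CHAR(5) table to contrast the two paddings.
spark.sql("CREATE TABLE c5 (c CHAR(5)) USING parquet")
spark.sql("INSERT INTO c5 VALUES ('abc')")

// 1. Column value padding: reading the column always yields length-5
//    strings, whether the writer padded or (after this PR) the reader does.
spark.sql("SELECT length(c) FROM c5").show()          // 5, i.e. 'abc  '

// 2. Comparison padding (pre-existing, not touched by this PR): the
//    shorter side of a CHAR-vs-literal comparison is padded first.
spark.sql("SELECT * FROM c5 WHERE c = 'abc'").show()  // still matches
```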

@yaooqinn (Member):

withTempPath { dir =>
  withTable("t") {
    sql("SELECT '12' as col1, '12' as col2").write.format(format).save(dir.toString)
    sql(s"CREATE TABLE t (col1 char(3), col2 varchar(3)) using $format")
    sql(s"ALTER TABLE t SET LOCATION '$dir'")
    checkAnswer(spark.sql("select concat(col1, col2) from t"), Row("12 12"))
  }
}

I mean a test case like the one above. With this PR, the 1st case below is a behavior change that needs a doc or unit test.

// length fits
12 + 12 = 12<space>12
// length overflow w/ significant characters
1234 + 12 = 1234
// length overflow w/ spaces
12<space><space> + 12 = 12<space><space>34

sql("SELECT '12' as col1, '12' as col2").write.format(format).save(dir.toString)
sql(s"CREATE TABLE t (col1 char(3), col2 varchar(3)) using $format")
sql(s"ALTER TABLE t SET LOCATION '$dir'")
checkAnswer(spark.table("t"), Row("12 ", "12"))
@cloud-fan (Contributor, Author) replied:

@yaooqinn I think this test covers the behavior change you mentioned? And I'd treat it as a bug fix, as the previous result didn't follow the CHAR type semantic.

Review reply (Member):

Makes sense to me.

@gengliangwang (Member) left a review:

+1

@cloud-fan (Contributor, Author):

Thanks for the review, merging to master!

cloud-fan closed this in b9998cf on Oct 12, 2022.

gengliangwang pushed a commit that referenced this pull request on Nov 4, 2022: …plied if necessary

### What changes were proposed in this pull request?

This is a followup of #38151, to fix a perf issue: when a struct/array/map doesn't contain a char type field, we should not recreate the struct/array/map for nothing.

### Why are the changes needed?

fix a perf issue

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test

Closes #38479 from cloud-fan/char.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022