[SPARK-40697][SQL] Add read-side char padding to cover external data files #38151
Conversation
some of the string functions may be affected, such as …

@yaooqinn any thoughts on the problems? I thought the only problem is perf, but char type is rarely used anyway.
} else if (numChars < limit) {
  return inputStr.rpad(limit, SPACE);
} else {
  return inputStr;
Shall we throw an exception if numChars exceeds the limit?
We need a test case for this branch, no matter what the expected behavior is.
We have tests for it, and I didn't add a length check, to match VARCHAR. Note: we can add a read-side length check for both CHAR and VARCHAR, but that's bad for perf as VARCHAR is common.
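For readers following along, here is a minimal standalone Scala sketch of the branching discussed in this thread. It is illustrative only: the real implementation lives in Spark's Java codegen utilities and operates on UTF8String, and the names below are made up for the example.

object ReadSidePaddingSketch {
  // Pads a stored value up to the declared CHAR length on read.
  def readSidePadding(inputStr: String, limit: Int): String = {
    val numChars = inputStr.length
    if (numChars == limit) {
      inputStr // already exactly CHAR(limit) wide
    } else if (numChars < limit) {
      inputStr + " " * (limit - numChars) // right-pad with spaces, like rpad
    } else {
      inputStr // over-limit: no read-side length check, value is returned as-is
    }
  }

  def main(args: Array[String]): Unit = {
    assert(readSidePadding("12", 3) == "12 ")
    assert(readSidePadding("123", 3) == "123")
    assert(readSidePadding("1234", 3) == "1234") // no exception thrown for over-length data
  }
}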
 * When comparing char type column/field with string literal or char type column/field,
 * right-pad the shorter one to the longer length.
 */
object ApplyCharTypePadding extends Rule[LogicalPlan] {
Why do we move the rule under the datasources package?
It was there in the first place: 5cfbddd#diff-da01d4c9147810ef330a7e70ad197fda5e768ad17558176e686d3e63139172b5
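Stepping back from the package-location question, here is a tiny Scala sketch (not the actual rule, which rewrites logical plans) of the comparison semantics that the ApplyCharTypePadding rule quoted above is there to preserve; all names are illustrative.

object CharComparisonSketch {
  // Right-pad the shorter operand so CHAR comparisons are insensitive to trailing spaces.
  private def rpad(s: String, len: Int): String =
    if (s.length >= len) s else s + " " * (len - s.length)

  def charEquals(left: String, right: String): Boolean = {
    val len = math.max(left.length, right.length)
    rpad(left, len) == rpad(right, len)
  }

  def main(args: Array[String]): Unit = {
    // A CHAR(5) column stores "abc  "; the literal 'abc' should still match.
    assert(charEquals("abc  ", "abc"))
    assert(!charEquals("abc", "abd"))
  }
}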
Would the result of concat('abc', 'defg') be changed from 'abcdefg' to 'abc defg'?

@yaooqinn It will not, as read-side padding only applies to char type.
Can we add some unit tests for …?

@yaooqinn you are probably confused by the 2 kinds of char paddings: …
withTempPath { dir =>
  withTable("t") {
    sql("SELECT '12' as col1, '12' as col2").write.format(format).save(dir.toString)
    sql(s"CREATE TABLE t (col1 char(3), col2 varchar(3)) using $format")
    sql(s"ALTER TABLE t SET LOCATION '$dir'")
    checkAnswer(spark.sql("select concat(col1, col2) from t"), Row("12 12"))
  }
}

I mean a test case like the above. W/ this PR, the 1st case below is a behavior change that needs a doc or unit test.
| sql("SELECT '12' as col1, '12' as col2").write.format(format).save(dir.toString) | ||
| sql(s"CREATE TABLE t (col1 char(3), col2 varchar(3)) using $format") | ||
| sql(s"ALTER TABLE t SET LOCATION '$dir'") | ||
| checkAnswer(spark.table("t"), Row("12 ", "12")) |
@yaooqinn I think this test covers the behavior change you mentioned? And I'd treat it as a bug fix, as the previous result didn't follow the CHAR type semantics.
Makes sense to me.
gengliangwang
left a comment
+1
Thanks for the review, merging to master!
dongjoon-hyun
left a comment
Thank you, @cloud-fan , @gengliangwang , @yaooqinn .
…plied if necessary

### What changes were proposed in this pull request?
This is a followup of #38151, to fix a perf issue. When struct/array/map doesn't contain char type field, we should not recreate the struct/array/map for nothing.

### Why are the changes needed?
fix a perf issue

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
new test

Closes #38479 from cloud-fan/char.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
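To make the follow-up concrete, here is a hedged sketch of the kind of check it implies: only walk and rebuild nested values when the schema actually contains a CHAR field. The object and function names are illustrative, not Spark's internal API; note that after analysis Spark often carries char/varchar information in string-field metadata rather than as a raw CharType, so the real check differs.

import org.apache.spark.sql.types._

object CharPaddingSchemaCheck {
  // Returns true only if some (possibly nested) field is a CHAR type.
  def containsCharType(dt: DataType): Boolean = dt match {
    case CharType(_)        => true
    case s: StructType      => s.fields.exists(f => containsCharType(f.dataType))
    case ArrayType(et, _)   => containsCharType(et)
    case MapType(kt, vt, _) => containsCharType(kt) || containsCharType(vt)
    case _                  => false
  }
}

With such a check, a struct/array/map value whose type contains no CHAR field can be passed through untouched instead of being recreated.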
What changes were proposed in this pull request?
The current char/varchar feature relies on the data source to take care of all the write paths and enforce the char/varchar semantics (length check, string padding). This is good for read performance, but it breaks down if some write paths did not follow the char/varchar semantics, e.g. a Parquet table written by old Spark versions that do not have the char/varchar types, or by other systems that do not recognize Spark's char/varchar types.
This PR adds read-side string padding for the char type, so that we can still guarantee the char type semantics as long as the underlying data is valid (not over length). The char type is rarely used and mostly exists for legacy reasons, so its read performance matters less than correctness here. People who are sure the data was written properly (e.g. in benchmarks) can still disable read-side padding via a config.
Note, we don't add a read-side length check, as the varchar type is widely used and we don't want to introduce a perf regression for the common case. Another reason is that it's better to reject invalid data on the write side; a read-side check won't help much.
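A hedged end-to-end sketch of the behavior described above. It assumes the read-side padding flag added by this PR is spark.sql.readSideCharPadding (verify the exact key in SQLConf for your Spark version); spark is an existing SparkSession and the path is hypothetical.

val dir = "/tmp/char_padding_demo" // hypothetical location

// Data written by a writer that knows nothing about CHAR, so no padding on disk.
spark.sql("SELECT '12' AS col1").write.mode("overwrite").parquet(dir)
spark.sql(s"CREATE TABLE t (col1 CHAR(3)) USING parquet LOCATION '$dir'")

// With read-side padding (the default after this PR), the value comes back
// padded to the declared length: "12 ".
spark.table("t").show()

// If the data is known to be written properly (e.g. for benchmarks), the
// padding can be disabled to avoid the extra work on read.
spark.conf.set("spark.sql.readSideCharPadding", "false")
spark.table("t").show() // returns "12" as stored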
Why are the changes needed?
To better enforce the char type semantics.
Does this PR introduce any user-facing change?
Yes. Now Spark returns correctly padded char type values even if the data source writer wrote the char type values without padding.
How was this patch tested?
Updated tests.