-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-35111][SQL] Support Cast string to year-month interval #32266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
FYI @MaxGekk |
|
Test build #137721 has finished for PR 32266 at commit
|
| } | ||
| } | ||
|
|
||
| test("SPARK-35111: Cast string to year-month interval") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add round trip tests: string -> year-month interval -> string, and year-month interval -> string -> year-month interval
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you test corner cases when arithmetic overflow happens. Also test upper and lower cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, check the input in lower case. For example:
checkEvaluation(cast(Literal.create(" interval '1-0' YEAR TO MONTH "),
YearMonthIntervalType), 12)| if (input == null || input.toString == null) { | ||
| throw new IllegalArgumentException("Interval year-month string must be not null") | ||
| } else { | ||
| val regex = "INTERVAL '([-|+]?[0-9]+-[-|+]?[0-9]+)' YEAR TO MONTH".r |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The keywords INTERVAL and YEAR TO MONTH are not mandatory by SQL standard. We should be able to parse only payload too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also could you move the regexp out of the method. So, we can avoid its "compilation" per every call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding to the pattern '([-|+]?[0-9]+-[-|+]?[0-9]+)', should we handle the second sign before month? Take a look at the val above yearMonthPattern
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding to the pattern
'([-|+]?[0-9]+-[-|+]?[0-9]+)', should we handle the second sign before month? Take a look at the val aboveyearMonthPattern
How about current
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaxGekk Can you help to list the case we need to support here,
INTERVAL [-|+]'[+|-]0-0' YEAR TO MONTH
[+|-]0-0
Any more case need to support here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, we must support our own output (in case insensitive manner by default but ideally we should respect the SQL config spark.sql.caseSensitive):
- ANSI_STYLE, like
INTERVAL -'-10-1' YEAR TO MONTH - HIVE_STYLE like
10-1 or -10-1
Rules from the SQL standard:
<interval literal> ::=
INTERVAL [ <sign> ] <interval string> <interval qualifier>
<interval string> ::=
<quote> <unquoted interval string> <quote>
<unquoted interval string> ::=
[ <sign> ] { <year-month literal> | <day-time literal> }
<year-month literal> ::=
<years value> [ <minus sign> <months value> ]
| <months value>
<years value> ::=
<datetime value>
<months value> ::=
<datetime value>
<datetime value> ::=
<unsigned integer>
<unsigned integer> ::= <digit>...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but ideally we should respect the SQL config
spark.sql.caseSensitive):
Spark Catalyst parser not respect spark.sql.caseSensitive too, should we respect it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For parsing I think it's OK to be always case insensitive. The Spark SQL parser is case insensitive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, how about current approach
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #137725 has finished for PR 32266 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
Outdated
Show resolved
Hide resolved
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #137791 has finished for PR 32266 at commit
|
|
Test build #137803 has finished for PR 32266 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
Outdated
Show resolved
Hide resolved
|
|
||
| private val yearMonthPattern = "^([+|-])?(\\d+)-(\\d+)$".r | ||
|
|
||
| private val yearMonthFuzzyPattern = "[^-|+]*?([+|-]?[\\d]+-[\\d]+).*".r |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need a new regex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need a new regex?
Since yearMonthPattern only support [+|-]yy-mmmm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do we want to support here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am discussing this here #32266 (comment) with @MaxGekk ,
Do you have any other supplementary suggestions?
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala
Outdated
Show resolved
Hide resolved
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Kubernetes integration test status failure |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #138072 has finished for PR 32266 at commit
|
|
Test build #138073 has finished for PR 32266 at commit
|
|
Test build #138075 has finished for PR 32266 at commit
|
|
Test build #138076 has finished for PR 32266 at commit
|
|
Test build #138079 has finished for PR 32266 at commit
|
| "^(INTERVAL\\s+)([+|-])?(')([+|-])?(\\d+)-(\\d+)(')(\\s+YEAR\\s+TO\\s+MONTH)$".r | ||
|
|
||
| def castStringToYMInterval(input: UTF8String): Int = { | ||
| input.trimAll().toString.toUpperCase(Locale.ROOT) match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can specify that your pattern is case insensitive, and remove .toUpperCase(Locale.ROOT)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't see any tests that check proper spaces handling - I mean trimAll(). Could you add such checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wonder why don't you trim whitespaces by regular expression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regular
Match case -1-0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't see any tests that check proper spaces handling - I mean
trimAll(). Could you add such checks.
done
|
|
||
| private val yearMonthPattern = "^([+|-])?(\\d+)-(\\d+)$".r | ||
| private val yearMonthStringPattern = | ||
| "^(INTERVAL\\s+)([+|-])?(')([+|-])?(\\d+)-(\\d+)(')(\\s+YEAR\\s+TO\\s+MONTH)$".r |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| "^(INTERVAL\\s+)([+|-])?(')([+|-])?(\\d+)-(\\d+)(')(\\s+YEAR\\s+TO\\s+MONTH)$".r | |
| "(?i)^(INTERVAL\\s+)([+|-])?(')([+|-])?(\\d+)-(\\d+)(')(\\s+YEAR\\s+TO\\s+MONTH)$".r |
The (?i) is placed at the beginning of the pattern to enable case-insensitivity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for you share
| private[this] def castToYearMonthIntervalCode(from: DataType): CastFunction = from match { | ||
| case StringType => | ||
| val util = IntervalUtils.getClass.getCanonicalName.stripSuffix("$") | ||
| (c, evPrim, evNull) => code"$evPrim = $util.castStringToYMInterval($c);" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: evNull is not used
| (c, evPrim, evNull) => code"$evPrim = $util.castStringToYMInterval($c);" | |
| (c, evPrim, _) => code"$evPrim = $util.castStringToYMInterval($c);" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea
MaxGekk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM in general except of a few comments.
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
+1, LGTM. GA passed. Merging to master. |
|
Test build #138096 has finished for PR 32266 at commit
|
What changes were proposed in this pull request?
Support Cast string to year-month interval
Supported format as below
Why are the changes needed?
Support Cast string to year-month interval
Does this PR introduce any user-facing change?
User can cast year month interval string to YearMonthIntervalType
How was this patch tested?
Added UT