[CALCITE-5707] Add ARRAY_CONTAINS function (enabled in Spark library) #3207
Conversation
/** Support the ARRAY_CONTAINS function. */
public static boolean contains(List list, Object element) {
  final Set set = new HashSet(list);
Is this for faster search?
Yep, it's no different from a for loop, and it may cost some space. If you think it's needed, I will change it to a for loop.
In fact, the set is not reused, and I don't think it can improve speed, because constructing the set may take longer than searching the list.
Yep, fixed. @JiajunBernoulli
I don't think you need this method at all. The code generator can just call java.util.List.contains. You will need to add it to BuiltInMethod.
@julianhyde I have considered this, but it can't work directly, because array_contains has 2 args while java.util.List.contains takes 1.
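For context, the Set-based helper from the diff above can be reduced to a thin two-arg wrapper over java.util.List#contains. This is a sketch of the idea under discussion, not the final Calcite code, and the class name is mine:

```java
import java.util.Arrays;
import java.util.List;

public class ArrayContainsSketch {
    /** A two-arg helper can simply delegate to List#contains,
     * so no intermediate HashSet is needed. */
    public static boolean contains(List<?> list, Object element) {
        return list.contains(element);
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(1, null, 2);
        System.out.println(contains(a, 2));    // true
        System.out.println(contains(a, 3));    // false
        System.out.println(contains(a, null)); // true: List.contains matches null elements
    }
}
```

Note that List.contains uses Objects.equals semantics internally, so a null element in the list is found by a null argument; the SQL-level NULL semantics discussed later in this thread are a separate question.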
core/src/main/java/org/apache/calcite/sql/type/ArrayElementOperandTypeChecker.java
/**
 * Parameter type-checking strategy where types must be Array and Array element type.
 */
public class ArrayElementOperandTypeChecker implements SqlOperandTypeChecker {
Much of this code is the same as MultisetOperandTypeChecker. Can we extract the common parts?
Yes, I referred to it, but I didn't find a good way to abstract the common parts:
MultisetOperandTypeChecker checks two Multisets, while ArrayElementOperandTypeChecker checks an Array and an element type.
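The division of labor described above can be sketched without Calcite's API. The rule the new checker enforces is that the second operand must be comparable with the array's element type; all names here are illustrative stand-ins (Calcite actually uses RelDataType and SqlCallBinding), not the real implementation:

```java
import java.util.Objects;

public class ArrayElementCheckSketch {
    // Hypothetical stand-in for SQL types; Calcite uses RelDataType instead.
    public enum SqlType { INTEGER, BOOLEAN, DATE }

    /** Core rule of an Array-plus-element checker: the needle's type must be
     * comparable with the array's element type. In Calcite, "comparable" is
     * more nuanced (numeric type families, etc.); here exact equality stands
     * in for that check. */
    public static boolean checkOperands(SqlType arrayElementType, SqlType needleType) {
        return Objects.equals(arrayElementType, needleType);
    }

    public static void main(String[] args) {
        // array_contains(<INTEGER ARRAY>, <INTEGER>) -> allowed
        System.out.println(checkOperands(SqlType.INTEGER, SqlType.INTEGER)); // true
        // array_contains(<INTEGER ARRAY>, <BOOLEAN>) -> validation error
        System.out.println(checkOperands(SqlType.INTEGER, SqlType.BOOLEAN)); // false
    }
}
```

By contrast, a Multiset-Multiset checker compares two collection types against each other, which is why the two checkers don't share an obvious common superclass.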
core/src/main/java/org/apache/calcite/sql/fun/SqlLibraryOperators.java
Hi @JiajunBernoulli, I have fixed the conflict and addressed all your reviews. Thanks so much for your review!

Hi @tanclary, will you also help review this? Other functions depend on it (type ARRAY_ELEMENT_ARG), such as array_position/array_remove/array_append/array_prepend, and I will support them in one PR.
core/src/main/java/org/apache/calcite/sql/type/ArrayElementOperandTypeChecker.java
core/src/main/java/org/apache/calcite/sql/type/SqlTypeTransforms.java
"No match found for function signature "
    + "ARRAY_CONTAINS\\(<INTEGER ARRAY>, <NUMERIC>\\)", false);

final SqlOperatorFixture f = f0.withLibrary(SqlLibrary.SPARK);
Again, curious what the behavior is/should be if you search an array of type X for a value of type Y. Obviously it would return false, but should it be allowed in the first place?
Good point, @tanclary. The validator should give an error if you, say, search for a BOOLEAN in a DATE ARRAY. We should add a test case to this test method.
Hi @julianhyde @tanclary, there is a unit test at the end:
f.checkFails("^array_contains(array[1, 2], true)^",
"INTEGER is not comparable to BOOLEAN", false);
Hi @tanclary @JiajunBernoulli @julianhyde, I have addressed all your reviews. Do you have time to look again?
f.checkScalar("array_contains(array[1, null], cast(null as integer))", true,
    "BOOLEAN NOT NULL");
Do we want to type check exactly as Apache Spark does?
spark.sql("select array_contains(array(1, null), null)").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(array(1, CAST(NULL AS INT)), NULL)' due to data type mismatch: Null typed values cannot be used as arguments; line 1 pos 7;
You should use array_contains(array[1, null], cast(null as integer)).
For my part, Spark's behavior is not good: when the second arg is null, it also returns null.
So I followed the Flink way. In Flink, array_contains(array[1, null], cast(null as integer)) returns true, while in Spark it returns null.
public static final BuiltInFunctionDefinition ARRAY_CONTAINS =
BuiltInFunctionDefinition.newBuilder()
.name("ARRAY_CONTAINS")
.kind(SCALAR)
.inputTypeStrategy(
sequence(
Arrays.asList("haystack", "needle"),
Arrays.asList(
logical(LogicalTypeRoot.ARRAY), ARRAY_ELEMENT_ARG)))
.outputTypeStrategy(
nullableIfArgs(
ConstantArgumentCount.of(0), explicit(DataTypes.BOOLEAN())))
.runtimeClass(
"org.apache.flink.table.runtime.functions.scalar.ArrayContainsFunction")
.build();
Indeed, this works: spark.sql("""select array_contains(array(1), cast(null as integer))""").show()
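To summarize the two behaviors being debated, here is a sketch (my own reading of this thread, not Calcite's actual implementation) of Spark-style three-valued semantics. A NULL needle yields NULL, and a non-matching search over an array that contains NULL also yields NULL ("the NULL element might have matched"); Flink-style semantics would instead treat NULL as an ordinary element and return true for the first case:

```java
import java.util.Arrays;
import java.util.List;

public class SparkStyleContains {
    /** Spark-style ARRAY_CONTAINS semantics, as described in this thread:
     * NULL needle => NULL; no match but a NULL element present => NULL;
     * otherwise plain true/false. Returns a boxed Boolean so that NULL
     * (SQL UNKNOWN) can be represented as Java null. */
    public static Boolean arrayContains(List<?> array, Object needle) {
        if (array == null || needle == null) {
            return null; // NULL input => UNKNOWN
        }
        boolean sawNull = false;
        for (Object e : array) {
            if (e == null) {
                sawNull = true;
            } else if (e.equals(needle)) {
                return Boolean.TRUE;
            }
        }
        return sawNull ? null : Boolean.FALSE;
    }

    public static void main(String[] args) {
        System.out.println(arrayContains(Arrays.asList(1, null), null)); // null (Flink-style would say true)
        System.out.println(arrayContains(Arrays.asList(1, null), 1));    // true
        System.out.println(arrayContains(Arrays.asList(1, null), 2));    // null: the NULL element might match
        System.out.println(arrayContains(Arrays.asList(1, 2), 3));       // false
    }
}
```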
f.checkScalar("array_contains(array[map[1, 'a'], map[2, 'b']], map[1, 'a'])", true,
    "BOOLEAN NOT NULL");
spark.sql("""select array_contains(array(map(1, "1"), map(2, "2")), map(2, "2"))""").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(array(map(1, '1'), map(2, '2')), map(2, '2'))' due to data type mismatch: function array_contains does not support ordering on type map<int,string>; line 1 pos 7;
Due to an implementation limitation, Spark currently can't compare or do equality checks between map types. As a result, map values can't appear in EQUAL or comparison expressions, can't be grouping keys, etc.
Calcite's Map runtime implementation uses the java.util.Map collection, which supports equality checks, while Spark's does not:
/**
* This is an internal data representation for map type in Spark SQL. This should not implement
* `equals` and `hashCode` because the type cannot be used as join keys, grouping keys, or
* in equality tests. See SPARK-9415 and PR#13847 for the discussions.
*/
abstract class MapData extends Serializable {
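The point above can be demonstrated in plain Java: java.util.Map defines value-based equals/hashCode, so List#contains can find a map element, which is exactly what Spark's MapData deliberately forbids (SPARK-9415). A small self-contained illustration:

```java
import java.util.List;
import java.util.Map;

public class MapEqualitySketch {
    public static void main(String[] args) {
        // java.util.Map compares by contents, so the runtime can answer
        // array_contains over an array of maps via List#contains.
        List<Map<Integer, String>> arr = List.of(Map.of(1, "a"), Map.of(2, "b"));
        System.out.println(arr.contains(Map.of(1, "a"))); // true
        System.out.println(arr.contains(Map.of(3, "c"))); // false
    }
}
```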
@liuyongvs Please resolve the conflicting files.

Hi @JiajunBernoulli @julianhyde, I have fixed the conflicts and aligned with Spark's behavior instead of Flink's more reasonable behavior.
(force-pushed 8a5cf83 to cf7f71b)

julianhyde left a comment:
@liuyongvs @tanclary @JiajunBernoulli @MasseGuillaume Thanks to all who reviewed. I think this is in good shape. Do you all agree? If so I'll merge.
I would change only two things:
- Remove SqlFunctions.arrayContains and use java.util.List.contains directly.
- Add some comments to the test about Spark vs Flink behavior.
Flink has a similar function, but has slightly different behavior from Spark. array_contains(array[1, null], cast(null as integer)) returns TRUE in Flink, UNKNOWN in Spark. This change implements the Spark behavior. Replace SqlFunctions.arrayContains with List.contains (Julian Hyde). Tweak Util.distinctList, and add note that SqlFunctions.distinct could use a similar algorithm (Julian Hyde). Close apache#3207








ARRAY_CONTAINS - Returns true if the array contains the value.
For more details, see https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html