Conversation

@liuyongvs
Contributor

ARRAY_CONTAINS - Returns true if the array contains the value

For more details
https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html

@liuyongvs liuyongvs changed the title [CALCITE-5707] Add ARRAY_CONTAINS function (enabled in Spark library). [CALCITE-5707] Add ARRAY_CONTAINS function (enabled in Spark library) May 18, 2023

/** Support the ARRAY_CONTAINS function. */
public static boolean contains(List list, Object element) {
  final Set set = new HashSet(list);
  return set.contains(element);
}
Contributor

Is this for faster search?

Contributor Author

Yes, it is no different from a for loop in correctness, but it may cost some extra space. If you think it is needed, I will change it to a for loop.

Contributor

In fact, the set is not reused, and I don't think it can improve speed, because constructing the set may take longer than a linear search of the list.
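The trade-off being discussed can be illustrated with a standalone sketch (plain Java, no Calcite dependencies; the class and method names are invented for illustration): building a `HashSet` is itself a full O(n) pass over the list, so for a single membership test it cannot beat a plain `List.contains` scan.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class ArrayContainsSketch {
  /** Single lookup via List.contains: one O(n) scan, no extra space. */
  public static boolean containsViaList(List<Integer> list, Integer element) {
    return list.contains(element);
  }

  /** Single lookup via HashSet: an O(n) build followed by an O(1) probe.
   * The build cost dominates, so this cannot be faster for one lookup. */
  public static boolean containsViaSet(List<Integer> list, Integer element) {
    return new HashSet<>(list).contains(element);
  }

  public static void main(String[] args) {
    List<Integer> list = Arrays.asList(1, 2, 3);
    System.out.println(containsViaList(list, 2)); // true
    System.out.println(containsViaSet(list, 4));  // false
  }
}
```

The set-based variant only pays off when many lookups are made against the same list, which is not the case for a scalar function invoked once per row.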

Contributor Author

Yes, fixed. @JiajunBernoulli

Contributor

I don't think you need this method at all. The code generator can just call java.util.List.contains. You will need to add it to BuiltInMethod.

Contributor Author

@julianhyde I have considered this, but it can't be done directly, because array_contains takes 2 arguments while java.util.List.contains takes 1.
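As a side note on the arity question, the mismatch is only apparent: the first SQL operand can become the method receiver, leaving exactly one argument for java.util.List.contains. A minimal reflective sketch of that receiver-plus-argument mapping (the class name is invented for illustration; Calcite's code generator emits the call directly rather than via reflection):

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;

public class ReceiverCallSketch {
  /** Calls List.contains reflectively: the first SQL operand becomes the
   * receiver, the second becomes the single method argument. */
  public static boolean call(List<?> array, Object element) {
    try {
      Method contains = List.class.getMethod("contains", Object.class);
      return (Boolean) contains.invoke(array, element);
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    // ARRAY_CONTAINS(array[1, 2, 3], 2)  ->  array.contains(2)
    System.out.println(call(Arrays.asList(1, 2, 3), 2)); // true
  }
}
```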

/**
 * Parameter type-checking strategy where the operand types must be an array and the array's element type.
*/
public class ArrayElementOperandTypeChecker implements SqlOperandTypeChecker {
Contributor

Much of this code is the same as MultisetOperandTypeChecker. Can we extract the common parts?

Contributor Author

Yes, I referred to it, but I did not find a good way to abstract the common code:
MultisetOperandTypeChecker checks two Multiset operands;
ArrayElementOperandTypeChecker checks an Array operand and an element type.

@liuyongvs
Contributor Author

Hi @JiajunBernoulli, I have fixed the conflict and addressed all of your review comments. Thanks very much for the review.

@liuyongvs liuyongvs requested a review from julianhyde May 22, 2023 04:56
@liuyongvs
Contributor Author

Hi @tanclary, will you also help review this? Other functions depend on it (the ARRAY_ELEMENT_ARG type), such as array_position/array_remove/array_append/array_prepend and so on; I will support them in a follow-up PR.

"No match found for function signature "
+ "ARRAY_CONTAINS\\(<INTEGER ARRAY>, <NUMERIC>\\)", false);

final SqlOperatorFixture f = f0.withLibrary(SqlLibrary.SPARK);
Contributor

Again, curious what the behavior is/should be if you search an array of type X for a value of type Y. Obviously it would return false, but should it be allowed in the first place?

Contributor

Good point, @tanclary. The validator should give an error if you - say - search for a BOOLEAN in a DATE ARRAY. We should add a test case to this test method.

Contributor Author

Hi @julianhyde @tanclary, there is a unit test for this at the end:
f.checkFails("^array_contains(array[1, 2], true)^",
"INTEGER is not comparable to BOOLEAN", false);

@liuyongvs liuyongvs requested a review from tanclary June 7, 2023 01:53
@liuyongvs
Contributor Author

Hi @tanclary @JiajunBernoulli @julianhyde @MasseGuillaume, I have addressed all of your review comments. Do you have time to take another look?

Comment on lines 5378 to 5379
f.checkScalar("array_contains(array[1, null], cast(null as integer))", true,
"BOOLEAN NOT NULL");
Contributor

Do we want to type check exactly as Apache Spark does?

spark.sql("select array_contains(array(1, null), null)").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(array(1, CAST(NULL AS INT)), NULL)' due to data type mismatch: Null typed values cannot be used as arguments; line 1 pos 7;

Contributor Author

@liuyongvs liuyongvs Jun 7, 2023

You should use array_contains(array[1, null], cast(null as integer)).
For my part, I think Spark's behavior is not good: when the second argument is null, it also returns null.
So I followed the Flink way: in Flink, array_contains(array[1, null], cast(null as integer)) returns true, while in Spark it returns null.

public static final BuiltInFunctionDefinition ARRAY_CONTAINS =
            BuiltInFunctionDefinition.newBuilder()
                    .name("ARRAY_CONTAINS")
                    .kind(SCALAR)
                    .inputTypeStrategy(
                            sequence(
                                    Arrays.asList("haystack", "needle"),
                                    Arrays.asList(
                                            logical(LogicalTypeRoot.ARRAY), ARRAY_ELEMENT_ARG)))
                    .outputTypeStrategy(
                            nullableIfArgs(
                                    ConstantArgumentCount.of(0), explicit(DataTypes.BOOLEAN())))
                    .runtimeClass(
                            "org.apache.flink.table.runtime.functions.scalar.ArrayContainsFunction")
                    .build();
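The behavioral difference being discussed can be modeled in plain Java (a standalone sketch with invented names, not Calcite or Flink code): java.util.List.contains uses null-safe equality, which matches the Flink result, while the Spark result for a NULL needle can be modeled as a nullable Boolean.

```java
import java.util.Arrays;
import java.util.List;

public class NullNeedleSketch {
  /** Flink-style: java.util.List.contains uses null-safe equality,
   * so searching for null succeeds if the array holds a null. */
  public static Boolean flinkStyle(List<Integer> array, Integer element) {
    return array.contains(element);
  }

  /** Spark-style for the NULL-needle case discussed above: a NULL
   * needle makes the whole expression UNKNOWN (a null Boolean here). */
  public static Boolean sparkStyle(List<Integer> array, Integer element) {
    if (element == null) {
      return null;
    }
    return array.contains(element);
  }

  public static void main(String[] args) {
    List<Integer> array = Arrays.asList(1, null);
    System.out.println(flinkStyle(array, null)); // true
    System.out.println(sparkStyle(array, null)); // null
  }
}
```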

Contributor

Indeed, spark.sql("""select array_contains(array(1), cast(null as integer))""").show() works.

Comment on lines +5384 to +5385
f.checkScalar("array_contains(array[map[1, 'a'], map[2, 'b']], map[1, 'a'])", true,
"BOOLEAN NOT NULL");
Contributor

spark.sql("""select array_contains(array(map(1, "1"), map(2, "2")), map(2, "2"))""").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(array(map(1, '1'), map(2, '2')), map(2, '2'))' due to data type mismatch: function array_contains does not support ordering on type map<int,string>; line 1 pos 7;

Contributor Author

Due to an implementation limitation, Spark currently can't compare or do equality checks between map types. As a result, map values can't appear in EQUAL or comparison expressions, can't be grouping keys, etc.
Calcite's MAP runtime implementation uses a Java collection Map, which supports equality checks, while Spark's does not:

/**
 * This is an internal data representation for map type in Spark SQL. This should not implement
 * `equals` and `hashCode` because the type cannot be used as join keys, grouping keys, or
 * in equality tests. See SPARK-9415 and PR#13847 for the discussions.
 */
abstract class MapData extends Serializable {

apache/spark#23045
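The JVM side of this asymmetry is easy to demonstrate (a standalone sketch with invented names): java.util.Map implementations define equals() and hashCode() over their entries, so a structural lookup of a map inside a list of maps works directly.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapEqualitySketch {
  /** Structural lookup of a map inside a list of maps. This works on the
   * JVM because java.util.Map defines equals()/hashCode() over its
   * entries; Spark's MapData deliberately does not, which is why Spark
   * rejects the equivalent query. */
  public static boolean containsMap(List<Map<Integer, String>> haystack,
      Map<Integer, String> needle) {
    return haystack.contains(needle);
  }

  public static void main(String[] args) {
    Map<Integer, String> m1 = new HashMap<>();
    m1.put(1, "a");
    Map<Integer, String> m2 = new HashMap<>();
    m2.put(2, "b");
    Map<Integer, String> probe = new HashMap<>();
    probe.put(1, "a");
    System.out.println(containsMap(Arrays.asList(m1, m2), probe)); // true
  }
}
```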

@liuyongvs liuyongvs requested a review from MasseGuillaume June 7, 2023 10:05
@JiajunBernoulli
Contributor

@liuyongvs Please resolve conflict files.

@liuyongvs
Contributor Author

Hi @JiajunBernoulli @julianhyde, I have fixed the conflicts and aligned the behavior with Spark instead of Flink (whose behavior I find more reasonable).

@sonarqubecloud

sonarqubecloud bot commented Jun 8, 2023

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
3 Code Smells

79.3% Coverage
0.0% Duplication

@julianhyde julianhyde force-pushed the main branch 2 times, most recently from 8a5cf83 to cf7f71b Compare June 8, 2023 21:21
Contributor

@julianhyde left a comment

@liuyongvs @tanclary @JiajunBernoulli @MasseGuillaume Thanks to all who reviewed. I think this is in good shape. Do you all agree? If so I'll merge.

I would change only two things:

  • Remove SqlFunctions.arrayContains and use java.util.List.contains directly.
  • Add some comments to the test about Spark vs Flink behavior.

julianhyde added a commit to julianhyde/calcite that referenced this pull request Jun 10, 2023
Replace SqlFunctions.arrayContains with List.contains

Tweak Util.distinctList, and add note that SqlFunctions.distinct could
use a similar algorithm.

Close apache#3207
@asfgit asfgit closed this in 3dfefd1 Jun 10, 2023
jhugomoore pushed a commit to jhugomoore/calcite-jhugomoore that referenced this pull request Jun 21, 2023
Flink has a similar function, but has slightly different
behavior from Spark.
  array_contains(array[1, null], cast(null as integer))
returns TRUE in Flink, UNKNOWN in Spark. This change
implements the Spark behavior.

Replace SqlFunctions.arrayContains with List.contains (Julian Hyde).

Tweak Util.distinctList, and add note that SqlFunctions.distinct could
use a similar algorithm (Julian Hyde).

Close apache#3207
jhugomoore pushed a commit to jhugomoore/calcite-jhugomoore that referenced this pull request Jun 22, 2023