Add anchors/refs, subsumption examples, and typo fixes
alancai98 committed Dec 4, 2023
commit 10eaba142b3bc1c7f3f49f03df526ab247db253c
64 changes: 47 additions & 17 deletions RFCs/0051-exclude-operator.adoc
Member

A discussion from the original issue revolves around replacing items rather than just excluding them. A major use-case of PartiQL is using PartiQL as a means of performing transformations on semi-structured, open-schema data. Mentioned in the issue are also customers who have 1000+ columns in their source tables.

From how I've been reading this RFC, we might be able to provide a useful work-around -- at least for top-level values. We can take advantage of the fact that LET evaluates before EXCLUDE. See below:

SELECT t.*, someItemThatHasBeenReplaced
EXCLUDE t.b
FROM t
LET t.b + 1 AS someItemThatHasBeenReplaced

For nested attributes, however, I couldn't immediately find an intuitive solution.

With this RFC, do you expect any future necessary RFC's to add support for REPLACE? If so, in your opinion, does this RFC impede or allow for the addition of REPLACE?
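
The top-level workaround above can be sketched in Python to make the evaluation order concrete. This is only an illustration that models binding tuples as plain dicts; the function name `select_star_with_replacement` and the sample rows are hypothetical and not part of the RFC or any PartiQL implementation.

```python
def select_star_with_replacement(rows):
    """Mimic: SELECT t.*, r EXCLUDE t.b FROM t LET t.b + 1 AS r."""
    out = []
    for t in rows:
        # LET runs first, while t.b is still present in the binding tuple.
        replaced = t["b"] + 1
        # EXCLUDE t.b then drops the original attribute.
        kept = {k: v for k, v in t.items() if k != "b"}
        # SELECT t.*, someItemThatHasBeenReplaced
        out.append({**kept, "someItemThatHasBeenReplaced": replaced})
    return out


rows = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]
result = select_star_with_replacement(rows)
```

Note that the new value surfaces under the `LET` variable's name rather than under `b`, so this is a replacement only in a loose sense.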

Contributor Author

> With this RFC, do you expect any future necessary RFC's to add support for REPLACE?

That was my assumption, and I intended to leave REPLACE out of scope for this PR. REPLACE is included in the "Future possibilities" section of the RFC.

> If so, in your opinion, does this RFC impede or allow for the addition of REPLACE?

I need to think more about the relationship between EXCLUDE and REPLACE. I think the syntactic rewrite included in the RFC could be adapted to support REPLACE, so I don't believe this RFC impedes adding REPLACE. After I get back from the Thanksgiving holiday, I'll look further into whether the syntactic rewrite approach could be applied to nested attributes of REPLACE.

Contributor Author

Playing around a bit with the rewrite rules from the RFC, we could do something similar in the nested case branches for REPLACE of nested attributes. For example, using the query from example-tuple-attribute-as-final-step, if we had added the REPLACE clause: REPLACE t.b.field_x AS t.b.field_x * 42, the rewrite could add a WHEN branch like

WHEN LOWER(attr_1) = LOWER('b') THEN
    CASE 
        WHEN v_1 IS STRUCT THEN (
            PIVOT (
                CASE 
                    WHEN LOWER(attr_2) = LOWER('field_x') THEN v_2 * 42
                    ELSE v_2
                END
            ) AT attr_2
            FROM UNPIVOT v_1 AS v_2 AT attr_2
        )
        ELSE v_1
    END
ELSE v_1

Contributor Author

The full query could look something like:

-- EXCLUDE t.a.field_x
-- REPLACE t.b.field_x AS t.b.field_x * 42
SELECT t.*
FROM (
    SELECT VALUE {
        't':
            CASE
                WHEN t IS STRUCT THEN (
                    PIVOT (
                        CASE
                            WHEN LOWER(attr_1) = LOWER('a') THEN
                                CASE
                                    WHEN v_1 IS STRUCT THEN (
                                        PIVOT v_2 AT attr_2
                                        FROM UNPIVOT v_1 AS v_2 AT attr_2
                                        WHERE LOWER(attr_2) NOT IN [LOWER('field_x')]
                                    )
                                    ELSE v_1
                                END
                            WHEN LOWER(attr_1) = LOWER('b') THEN
                                CASE 
                                    WHEN v_1 IS STRUCT THEN (
                                        PIVOT (
                                            CASE 
                                                WHEN LOWER(attr_2) = LOWER('field_x') THEN v_2 * 42
                                                ELSE v_2
                                            END
                                        ) AT attr_2
                                        FROM UNPIVOT v_1 AS v_2 AT attr_2
                                    )
                                    ELSE v_1
                                END
                            ELSE v_1
                        END
                    ) AT attr_1 FROM UNPIVOT t AS v_1 AT attr_1
                )
                ELSE t
            END
    }
    FROM <<
    {
        'a': { 'field_x': 0, 'field_y': 'zero' },  -- `field_x` excluded
        'b': { 'field_x': 1, 'field_y': 'one' },   -- `field_x` replaced with `field_x` * 42
        'c': { 'field_x': 2, 'field_y': 'two' }
    }
    >> AS t
)

The Kotlin implementation outputs the following for this query:

<<
  {
    'a': {
      'field_y': 'zero'
    },
    'b': {
      'field_x': 42,
      'field_y': 'one'
    },
    'c': {
      'field_x': 2,
      'field_y': 'two'
    }
  }
>>
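
As a cross-check of the expected result, the same transformation can be modelled in Python. This is a rough sketch under the assumption that structs map to dicts; `rewrite_row` is a hypothetical helper that mirrors the nested `CASE`/`PIVOT`/`UNPIVOT` branches by hand and is not generated by any PartiQL tooling.

```python
def rewrite_row(t):
    """Model EXCLUDE t.a.field_x plus the hypothetical
    REPLACE t.b.field_x AS t.b.field_x * 42 on one binding of t."""
    if not isinstance(t, dict):          # WHEN t IS STRUCT ... ELSE t
        return t
    out = {}
    for attr_1, v_1 in t.items():        # UNPIVOT t AS v_1 AT attr_1
        if attr_1.lower() == "a" and isinstance(v_1, dict):
            # EXCLUDE branch: drop field_x from the nested struct.
            out[attr_1] = {
                attr_2: v_2
                for attr_2, v_2 in v_1.items()
                if attr_2.lower() != "field_x"
            }
        elif attr_1.lower() == "b" and isinstance(v_1, dict):
            # REPLACE branch: keep every attribute, rewriting field_x only.
            out[attr_1] = {
                attr_2: (v_2 * 42 if attr_2.lower() == "field_x" else v_2)
                for attr_2, v_2 in v_1.items()
            }
        else:
            out[attr_1] = v_1            # ELSE v_1
    return out


row = {
    "a": {"field_x": 0, "field_y": "zero"},
    "b": {"field_x": 1, "field_y": "one"},
    "c": {"field_x": 2, "field_y": "two"},
}
result = rewrite_row(row)
```

The only difference between the two branches is that the `EXCLUDE` branch filters attributes while the `REPLACE` branch maps over them, which is why the same rewrite skeleton appears to accommodate both.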

Contributor

Top-level, let's format these lines to be like 80 or 120 characters wide.

@@ -13,7 +13,7 @@ This doc defines the `EXCLUDE` binding tuple operator used to omit nested values

== Motivation

SQL users often use `SELECT *` to project all of the columns a table. There's frequently a use case in which a user would like to project all the columns from a table other than a subset of the columns (see https://stackoverflow.com/q/729197[slack overflow question]). There are some workarounds in some database systems that are somewhat inefficient (e.g. creating a new table and dropping a select column), but it can be helpful to have a dedicated syntax to filter out certain columns. Prior art lists out a few databases that provide some version of this column filtering.
SQL users often use `SELECT *` to project all of the columns of a table. There is frequently a use case in which a user would like to project all the columns from a table other than a subset of the columns (see https://stackoverflow.com/q/729197[slack overflow question]). There are workarounds in some database systems that are somewhat inefficient (e.g. creating a new table and dropping a specific column), but it can be helpful to have a dedicated syntax to filter out certain columns. <<Prior art>> lists out a few databases that provide some version of this column filtering.

There is a similar need among PartiQL users to exclude certain nested fields from semi-structured data. PartiQL supports `SELECT *` to project all of the field of a binding tuple. Without `EXCLUDE`, if a user wanted to omit one field from this projection, they would need to list out all of the projection fields or perform some intricate combination of `PIVOT` and ``UNPIVOT``s.

@@ -64,7 +64,7 @@ FROM
<right bracket> ::= "]"
----

NOTE: Despite their similar syntax and naming, ``<exclude path>``s are different from PartiQL path expressions
NOTE: Despite their similar syntax and naming, ``<exclude path>``s are different from PartiQL path expressions.

=== Terminology
* For an `<exclude path>`, we refer to the leftmost identifier as the 'root' and the other exclude path components as 'steps'.
@@ -85,7 +85,7 @@ e.g. tableFoo.a[1].*[*].b['c']

=== Out of scope / assumptions

* We restrict `<exclude path>` non-wildcard steps to be identifiers as well as int and string literals. Thus these paths are statically known. We can decide in the future whether to add other exclude paths (e.g. expressions) if a use case arises.
* We restrict tuple attribute exclude steps to use string literals and collection index exclude steps to use int literals. Thus `<exclude paths>` are statically known. We can decide in the future whether to add other exclude paths (e.g. expressions) if a use case arises.
* If sufficient schema is present and the path can be resolved, we assume the root of an `EXCLUDE` path can be omitted. The variable resolution rules follow what is already included in the PartiQL specification.
Contributor

I think we might want to have an example of attribute as a variable.

* We require that every fully-qualified `<exclude path>` contain a root and at least one step. If a use case arises to exclude a binding tuple variable, then this functionality can be added.
Contributor

What is the rationale for this limitation? We should put that here.

* S-expressions are part of the Ion type system. footnote:[https://amazon-ion.github.io/ion-docs/docs/spec.html#sexp] Since PartiQL's type system is a superset over the Ion types, PartiQL should support s-expression types and values. Since the current PartiQL specification does not formally define s-expressions operations, we consider the definition of collection index and wildcard steps on s-expressions as out-of-scope for this RFC.
@@ -99,20 +99,37 @@ For each `<exclude path>` `p=root~p~s~1~...s~m~`, we compare it with all other `
NOTE: The following rules assume `root~p~=root~q~`.

.Subsumption rules
Contributor

I know we have the table 1. for the examples, but adding an example for each of the rules would also enhance readability.

Rule 1.a::
[[anchor-1a]] Rule 1.a::
If `m ≥ n` and `s~1~...s~n~=t~1~...t~n~`, `q` subsumes `p`. Put another way, if `p` has at least as many steps as `q` and the steps up to ``q``'s length are equivalent, `q` subsumes `p`.

Otherwise, there must be some step at which `p` and `q` diverge. Let's call this index `i`.

Rule 1.b::
[[anchor-1b]] Rule 1.b::
If `s~i~` is a tuple attribute and `t~i~` is a tuple wildcard and `t~i+1~...t~n~` subsumes `s~i+1~...s~m~` (i.e. the steps following `t~i~` subsumes the steps following `s~i~`), then `q` subsumes `p`.
Rule 1.c::
[[anchor-1c]] Rule 1.c::
If `s~i~` is a collection index and `t~i~` is a collection wildcard and `t~i+1~...t~n~` subsumes `s~i+1~...s~m~` (i.e. the steps following `t~i~` subsumes the steps following `s~i~`), then `q` subsumes `p`.
Rule 1.d::
[[anchor-1d]] Rule 1.d::
If `s~i~` is a case-sensitive tuple attribute and `t~i~` is a case-insensitive tuple attribute and `t~i+1~...t~n~` subsumes `s~i+1~...s~m~` (i.e. the steps following `t~i~` subsumes the steps following `s~i~`), then `q` subsumes `p`.

===== Subsumption Examples:
TODO: put in table or list form and link rules
.Subsumption Examples
[options="header,footer"]
|=======================
|Exclude Path `p`|Exclude Path `q`|Notes
|`s.a` |`t.a` |No subsumption rules apply (roots differ)
|`t.a` |`t.b` |No subsumption rules apply
|`t.a.b.c` |`t.a.*.d` |No subsumption rules apply
|`t.a.b.c` |`t.a.b.c` |`q` subsumes `p` (by <<anchor-1a, 1.a>>)
|`t.a.b.c` |`t.a.b` |`q` subsumes `p` (by <<anchor-1a, 1.a>>)
|`t.a.b.c` |`t.a.b.*` |`q` subsumes `p` (by <<anchor-1b, 1.b>> then <<anchor-1a, 1.a>>)
|`t.a.b.c` |`t.a.*.c` |`q` subsumes `p` (by <<anchor-1b, 1.b>> then <<anchor-1a, 1.a>>)
|`t.a.b[1]` |`t.a.b` |`q` subsumes `p` (by <<anchor-1a, 1.a>>)
|`t.a.b[1]` |`t.a.b[*]` |`q` subsumes `p` (by <<anchor-1c, 1.c>> then <<anchor-1a, 1.a>>)
|`t.a.b[1].c` |`t.a.b[1]` |`q` subsumes `p` (by <<anchor-1a, 1.a>>)
|`t.a.b[1].c` |`t.a.b[*].c`|`q` subsumes `p` (by <<anchor-1c, 1.c>> then <<anchor-1a, 1.a>>)
|`t.a.b[1].c` |`t.a.b[*]` |`q` subsumes `p` (by <<anchor-1c, 1.c>> then <<anchor-1a, 1.a>>)
|`t.a."b"` |`t.a.b` |`q` subsumes `p` (by <<anchor-1d, 1.d>> then <<anchor-1a, 1.a>>)
|`t.a."b".c` |`t.a.b.c` |`q` subsumes `p` (by <<anchor-1d, 1.d>> then <<anchor-1a, 1.a>>)
|=======================
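
The subsumption rules can be read as a step-by-step comparison, sketched below in Python. This is one possible reading rather than the reference implementation: the tiny path parser, the step encoding, and the names `parse`, `step_subsumes`, and `subsumes` are all assumptions made for illustration. It also assumes that roots are compared exactly and that rule 1.d requires the attribute names to match ignoring case.

```python
import re

# Hypothetical step grammar covering only the forms used in the table above.
_STEP = re.compile(
    r"""
    \.\*                         # tuple wildcard, e.g. .*
  | \."(?P<quoted>[^"]+)"        # case-sensitive attribute, e.g. ."b"
  | \.(?P<ident>[A-Za-z_]\w*)    # case-insensitive attribute, e.g. .b
  | \[\*\]                       # collection wildcard, e.g. [*]
  | \[(?P<index>\d+)\]           # collection index, e.g. [1]
    """,
    re.VERBOSE,
)


def parse(path):
    """Split an exclude path into (root, steps)."""
    root = re.match(r"[A-Za-z_]\w*", path).group(0)
    steps, pos = [], len(root)
    while pos < len(path):
        m = _STEP.match(path, pos)
        if m is None:
            raise ValueError(f"unsupported step at offset {pos} in {path!r}")
        if m.group("quoted") is not None:
            steps.append(("attr", m.group("quoted"), True))
        elif m.group("ident") is not None:
            steps.append(("attr", m.group("ident"), False))
        elif m.group("index") is not None:
            steps.append(("index", int(m.group("index"))))
        elif m.group(0) == ".*":
            steps.append(("tuple_wildcard",))
        else:
            steps.append(("collection_wildcard",))
        pos = m.end()
    return root, steps


def step_subsumes(t, s):
    """True if step t (from q) covers step s (from p)."""
    if t == s:                                                   # equal steps
        return True
    if t[0] == "tuple_wildcard" and s[0] == "attr":              # rule 1.b
        return True
    if t[0] == "collection_wildcard" and s[0] == "index":        # rule 1.c
        return True
    if t[0] == "attr" and s[0] == "attr" and not t[2]:           # rule 1.d
        return t[1].lower() == s[1].lower()
    return False


def subsumes(q, p):
    """True if exclude path q subsumes exclude path p."""
    root_q, t = parse(q)
    root_p, s = parse(p)
    if root_q != root_p or len(s) < len(t):
        return False
    # Rule 1.a is the base case: once q's steps run out, any remaining
    # steps of p are covered.
    return all(step_subsumes(ti, si) for ti, si in zip(t, s))
```

For example, `subsumes("t.a.b[*]", "t.a.b[1].c")` is true, while `subsumes('t.a."b"', "t.a.b")` is false because a case-sensitive step does not cover a case-insensitive one.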

---
We first illustrate the rewrite rule for a single `EXCLUDE` path and then explain the syntax rewrite for multiple exclude paths.
@@ -121,6 +138,8 @@ We first illustrate the rewrite rule for a single `EXCLUDE` path and then explai

To rewrite a single `EXCLUDE` path with `n` steps, `p=r.s~1~...s~n~`, we move the clauses other than the `SELECT`/`PIVOT` into a subquery, which will `EXCLUDE` the binding tuple values at the path `p`. This subquery essentially reconstructs the binding tuple of the other clauses using a `SELECT VALUE` struct to project back the binding tuple variables. All of the variables created from the other clauses not matching the `EXCLUDE` root `r` will use the identity function (e.g. binding tuple variable `foo` will have attribute `'foo'` and value `foo` in the `SELECT VALUE` struct). For the variable matching the `EXCLUDE` path root `r`, we apply the following rewrite rules to define ``r``'s value within the `SELECT VALUE` struct. If there is no such variable matching `EXCLUDE` path root `r`, the `EXCLUDE` path will not alter any of the binding tuple values. Hence, no rewrite rule is applied.

If the other clauses includes an `ORDER BY`, we convert the top-level query back into a list by adding a position variable (i.e. `AT` clause) along with an `ORDER BY` over that position variable.

[source,partiql,subs="+{markup-in-source}"]
----
<select clause>
@@ -137,6 +156,11 @@ FROM (
<from clause>
<other clauses>
)
[ -- Include conversion back to list if `ORDER BY` present in `<other clauses>`
-- Assume `topLevelTbl` and `idx` are fresh variables
AS topLevelTbl AT idx
ORDER BY idx
]
----
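
The reason for the extra `AT idx` / `ORDER BY idx` step can be illustrated with a small Python model. It is a sketch under several assumptions: the ordered subquery result is a list, the outer `FROM` is modelled as losing that order unless a position variable is captured (simulated here with a seeded shuffle), and the sample rows are hypothetical.

```python
import random

rows = [
    {"a": 3, "b": 33, "c": 333},
    {"a": 2, "b": 22, "c": 222},
    {"a": 4, "b": 44, "c": 444},
    {"a": 5, "b": 55, "c": 555},
    {"a": 1, "b": 11, "c": 111},
]

# Inner subquery: ORDER BY a, OFFSET 2, LIMIT 2, with t.a excluded while
# rebuilding the binding tuple in the SELECT VALUE struct.
ordered = sorted(rows, key=lambda t: t["a"])[2:4]
inner = [{"t": {k: v for k, v in t.items() if k != "a"}} for t in ordered]

# Outer FROM ... AS topLevelTbl AT idx: capture each element's position.
indexed = list(enumerate(inner))

# Model the loss of ordering that the outer query is allowed to introduce.
random.Random(0).shuffle(indexed)

# ORDER BY idx restores the inner ordering; SELECT t.* then projects t.
result = [top_level_tbl["t"] for idx, top_level_tbl in sorted(indexed, key=lambda pair: pair[0])]
```

Without the position variable there would be nothing left to sort on, because the `ORDER BY` key `a` has already been excluded by the time the outer query runs.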


@@ -316,6 +340,11 @@ FROM (
<from clause>
<other clauses>
)
[ -- Include conversion back to list if `ORDER BY` present in `<other clauses>`
-- Assume `topLevelTbl` and `idx` are fresh variables
AS topLevelTbl AT idx
ORDER BY idx
]
----
Like single path rewriting, we create a nested `CASE` expression for each step. However, for multiple paths, we look at all the paths in parallel and process the steps at the same level. For the following, let `i=1,...,z` where `z` is the length of the longest exclude path. The nested `CASE` expressions for all `i` are created as before:

@@ -1249,11 +1278,11 @@ Output:
SELECT *
EXCLUDE t.a
FROM <<
{ 'a': 1, 'b': 11, 'c': 111 },
{ 'a': 2, 'b': 22, 'c': 222 },
{ 'a': 3, 'b': 33, 'c': 333 }, -- kept
{ 'a': 2, 'b': 22, 'c': 222 },
{ 'a': 4, 'b': 44, 'c': 444 }, -- kept
{ 'a': 5, 'b': 55, 'c': 555 }
{ 'a': 5, 'b': 55, 'c': 555 },
{ 'a': 1, 'b': 11, 'c': 111 }
>> AS t
ORDER BY a
LIMIT 2
@@ -1263,7 +1292,7 @@ OFFSET 2
Rewritten query:
[source,partiql,subs="+{markup-in-source}"]
----
SELECT *
SELECT t.*
FROM (
SELECT VALUE {
't':
@@ -1275,16 +1304,17 @@ FROM (
END
}
FROM <<
{ 'a': 1, 'b': 11, 'c': 111 },
{ 'a': 2, 'b': 22, 'c': 222 },
{ 'a': 3, 'b': 33, 'c': 333 }, -- kept
{ 'a': 2, 'b': 22, 'c': 222 },
{ 'a': 4, 'b': 44, 'c': 444 }, -- kept
{ 'a': 5, 'b': 55, 'c': 555 }
{ 'a': 5, 'b': 55, 'c': 555 },
{ 'a': 1, 'b': 11, 'c': 111 }
>> AS t
ORDER BY a
LIMIT 2
OFFSET 2
)
) AS topLevelTbl AT idx
ORDER BY idx
----

Output: