-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-5278][SQL] Introduce UnresolvedGetField and complete the check of ambiguous reference to fields #4068
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Can one of the admins verify this patch? |
|
ping @marmbrus @liancheng |
|
ok to test |
|
Test build #25681 has started for PR 4068 at commit
|
|
Test build #25681 has finished for PR 4068 at commit
|
|
Test FAILed. |
|
The problem is that: Currently the So we need a way to indicate whether a For SPARK-3698, we can do this by searching required field with case sensitive, if success, we are done. if not, we still have chance if the resolver is case insensitive, so we can do the check in For SPARK-5278 here, it's more complicated. it seems to me that the only way is adding a flag to What do you think? @marmbrus @liancheng |
|
Test build #25747 has started for PR 4068 at commit
|
|
Test build #25747 has finished for PR 4068 at commit
|
|
Test FAILed. |
|
Test build #25750 has started for PR 4068 at commit
|
|
Test build #25750 has finished for PR 4068 at commit
|
|
Test FAILed. |
|
Test build #25754 has started for PR 4068 at commit
|
|
I had a new idea: Don't try to figure out whether a |
|
Test build #25754 has finished for PR 4068 at commit
|
|
Test PASSed. |
|
ping @marmbrus @liancheng @rxin |
|
@yhuai can you review this one first? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think CheckResolution should catch it. If we cannot resolve it, just leave it unchanged. Can you see if there is a unit test for this? If not, can you add one? Maybe we can also log it like what LogicalPlan.resolve does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we leave it unchanged, CheckResolution can't catch it. The reason is that, we need Resolver to check if a GetField is resolved, but we can't get Resolver inside GetField.
Fortunately, we can catch it at runtime, as GetField will report error if it can't find the required field.
Which way should we prefer? Leaving it unchanged or reporting error right away?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I see. With your change, the resolved field will always be true as long as its child's resolved is true (the existence of the desired field is not considered). Actually, we are breaking the semantic of resolved at here. I do think checking ambiguity is necessary, but I think it is also necessary to follow the definition of resolved.
I am sorry if I missed it. Can you explain why you prefer not to put this check in GetField?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said before, we need Resolver to check the existence of desired field. However, we can't get it inside GetField, that's why we also did this check in LogicalPlan.resolveNesting.
And we build GetField in SqlParser, so it's hard to put Resolver into GetField during building.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have a well defined resolved field. Actually, I feel UnresolvedGetField is a better idea. I think it is reasonable to have a UnresolvedXXX whenever we need to do lookup or name matching to resolve it. @marmbrus What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping @marmbrus
|
Test build #26290 has started for PR 4068 at commit
|
|
Test build #26290 has finished for PR 4068 at commit
|
|
Test PASSed. |
|
Hi @yhuai , I have updated this PR introducing |
|
Test build #26638 has started for PR 4068 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Import order
|
Okay, you have convinced me that this is cleaner. Two nits about formatting and then this LGTM. Can you also fix the description of the PR, as that becomes the commit message. Would be awesome to include this in 1.2 if you can update soon. Thanks for working on this! |
|
Test build #26815 has started for PR 4068 at commit
|
|
Test build #26815 has finished for PR 4068 at commit
|
|
Test PASSed. |
|
Hi @marmbrus , I have updated it, is it ready to go? Thanks. |
|
Test build #26833 has started for PR 4068 at commit
|
|
Test build #26833 has finished for PR 4068 at commit
|
|
Test FAILed. |
|
Test build #26837 has started for PR 4068 at commit
|
|
Test build #26837 has finished for PR 4068 at commit
|
|
Test PASSed. |
… of ambiguous reference to fields
When the `GetField` chain(`a.b.c.d.....`) is interrupted by `GetItem` like `a.b[0].c.d....`, then the check of ambiguous reference to fields is broken.
The reason is that: for something like `a.b[0].c.d`, we first parse it to `GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")`. Then in `LogicalPlan#resolve`, we resolve `"a.b"` and build a `GetField` chain from bottom(the relation). But for the 2 outer `GetFiled`, we have to resolve them in `Analyzer` or do it in `GetField` lazily, check data type of child, search needed field, etc. which is similar to what we have done in `LogicalPlan#resolve`.
So in this PR, the fix is just copy the same logic in `LogicalPlan#resolve` to `Analyzer`, which is simple and quick, but I do suggest introduce `UnresolvedGetFiled` like I explained in #2405.
Author: Wenchen Fan <[email protected]>
Closes #4068 from cloud-fan/simple and squashes the following commits:
a6857b5 [Wenchen Fan] fix import order
8411c40 [Wenchen Fan] use UnresolvedGetField
(cherry picked from commit 4793c84)
Signed-off-by: Michael Armbrust <[email protected]>
|
Thanks! merged to master and 1.3 |
|
This PR introduced two subtle regressions in the way we handle nested fields in ORDER BY queries: |
|
Hi @marmbrus , I'm looking into it, will send a fix ASAP. |
|
Here is a partial solution: #4892 |
When the
GetFieldchain(a.b.c.d.....) is interrupted byGetItemlikea.b[0].c.d...., then the check of ambiguous reference to fields is broken.The reason is that: for something like
a.b[0].c.d, we first parse it toGetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d"). Then inLogicalPlan#resolve, we resolve"a.b"and build aGetFieldchain from bottom(the relation). But for the 2 outerGetFiled, we have to resolve them inAnalyzeror do it inGetFieldlazily, check data type of child, search needed field, etc. which is similar to what we have done inLogicalPlan#resolve.