Conversation

@imback82
Contributor

What changes were proposed in this pull request?

The handling of the catalog across plans should be as follows (SPARK-29014):

  • The current catalog should be used when no catalog is specified
  • The default catalog is the catalog that the current catalog is initialized to
  • If the default catalog is not set, the current catalog is the built-in Spark session catalog.
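The rules above can be sketched as follows. This is an illustrative sketch, not Spark's actual CatalogManager: the config key spark.sql.defaultCatalog and the name spark_catalog mirror Spark, but the class and everything else here are simplified stand-ins.

```java
import java.util.Map;

// Illustrative sketch of the catalog resolution rules; not Spark's
// actual implementation.
class CatalogManagerSketch {
    static final String SESSION_CATALOG_NAME = "spark_catalog";

    private final String defaultCatalog;
    private String currentCatalog;

    CatalogManagerSketch(Map<String, String> conf) {
        // The default catalog is whatever the current catalog is
        // initialized to; if unset, it is the built-in session catalog.
        this.defaultCatalog =
            conf.getOrDefault("spark.sql.defaultCatalog", SESSION_CATALOG_NAME);
        this.currentCatalog = defaultCatalog;
    }

    // Used whenever a statement does not name a catalog explicitly.
    String currentCatalog() { return currentCatalog; }

    // e.g. changed by USE CATALOG; the default catalog never changes.
    void setCurrentCatalog(String name) { this.currentCatalog = name; }
}
```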

This PR fixes places where the current catalog is not used as described above.

Why are the changes needed?

It is a bug as described in the previous section.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests added.

@imback82
Contributor Author

cc: @cloud-fan / @rdblue

@rdblue
Contributor

rdblue commented Oct 15, 2019

I think this should also make the default catalog and the session catalog private in CatalogManager, to ensure that the only catalog accessed from rules is the current catalog. If the default catalog or session catalog is the current catalog, they can (and should) be accessed by getting the current catalog. The default and session catalogs should be internal to CatalogManager.
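The encapsulation suggested here can be sketched as below. This is a hedged illustration under the assumptions of the comment, not Spark's real CatalogManager; the interface name CatalogPlugin matches Spark's v2 API, but the manager class and its members are simplified stand-ins.

```java
// Sketch: the default and session catalogs are private details of the
// manager, so analyzer rules can only reach whichever of them is
// current via currentCatalog(). Not Spark's actual implementation.
interface CatalogPlugin { String name(); }

class EncapsulatedCatalogManager {
    // Internal only: rules must not access these directly.
    private final CatalogPlugin sessionCatalog;
    private final CatalogPlugin defaultCatalog;
    private CatalogPlugin current;

    EncapsulatedCatalogManager(CatalogPlugin session, CatalogPlugin dflt) {
        this.sessionCatalog = session;
        // If no default catalog is configured, fall back to the
        // built-in session catalog.
        this.defaultCatalog = (dflt != null) ? dflt : session;
        this.current = this.defaultCatalog;
    }

    // The single public entry point for resolution rules.
    CatalogPlugin currentCatalog() { return current; }

    void setCurrentCatalog(CatalogPlugin c) { current = c; }

    // Rules can still ask whether a catalog is the session catalog
    // without holding a direct reference to it.
    boolean isSessionCatalog(CatalogPlugin c) { return c == sessionCatalog; }
}
```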

@imback82
Contributor Author

I think this should also make the default catalog and the session catalog private in CatalogManager, to ensure that the only catalog accessed from rules is the current catalog. If the default catalog or session catalog is the current catalog, they can (and should) be accessed by getting the current catalog. The default and session catalogs should be internal to CatalogManager.

OK. There are a few places where the default catalog / session catalog is accessed directly, for example, CatalogObjectIdentifier, which is used in many places. I checked a few of them, and they can be updated to use just the current catalog (along with an isSessionCatalog check). I will try your suggestion. Thanks @rdblue.

@SparkQA

SparkQA commented Oct 15, 2019

Test build #112078 has finished for PR 26120 at commit 2fa927a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 15, 2019

Test build #112102 has finished for PR 26120 at commit 8317d31.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
class Analyzer(
override val catalogManager: CatalogManager,
v1SessionCatalog: SessionCatalog,
Contributor Author


I renamed catalog to v1SessionCatalog in Analyzer to be explicit. Please let me know if this is not desired.

saveAsTable(sessionCatalog.asTableCatalog, ident)
case CatalogObjectIdentifier(catalog, ident)
if isSessionCatalog(catalog) && canUseV2 && ident.namespace().length <= 1 =>
saveAsTable(catalog.asTableCatalog, ident)
Contributor Author


This may not be correct if the current catalog is a v2 session catalog that doesn't delegate to the v1 session catalog? If you look at the previous behavior, it always used the v1 session catalog.

Contributor


It's a known problem that if the v2 session catalog doesn't delegate to the v1 session catalog, many things can be broken.

I think the previous version was wrong. It always used the default v2 session catalog even if users set a custom v2 session catalog.
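The delegation assumption discussed in this thread can be illustrated as follows. This is a hedged sketch, not Spark's actual V2SessionCatalog: the interface and class names here are stand-ins chosen for illustration.

```java
// Sketch of the delegation expectation: a v2 session catalog should
// forward operations to the built-in v1 session catalog so existing
// behavior is preserved. A catalog that skips this delegation is what
// "can be broken" refers to above. Names are illustrative stand-ins.
interface TableLookup { String loadTable(String ident); }

class V1SessionCatalogSketch implements TableLookup {
    public String loadTable(String ident) { return "v1:" + ident; }
}

class DelegatingV2SessionCatalog implements TableLookup {
    private final TableLookup v1Delegate;

    DelegatingV2SessionCatalog(TableLookup v1) { this.v1Delegate = v1; }

    // Forward to the v1 session catalog rather than reimplementing it.
    public String loadTable(String ident) { return v1Delegate.loadTable(ident); }
}
```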

@SparkQA

SparkQA commented Oct 16, 2019

Test build #112184 has finished for PR 26120 at commit 46d2fa3.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

*/
class Analyzer(
override val catalogManager: CatalogManager,
v1SessionCatalog: SessionCatalog,
Contributor Author

@imback82 imback82 Oct 16, 2019


One downside of this approach (passing SessionCatalog as a separate parameter) is that the SessionCatalog instance passed in can be different from the one stored in CatalogManager. Since CatalogManager updates the current database of the session catalog, the two can get out of sync. Please let me know if this approach is fine.

Contributor


hmm, what's the upside of doing this?

Contributor Author


v1SessionCatalog is now private in CatalogManager. Since Analyzer uses the v1SessionCatalog, we need to pass it separately.

Contributor


We don't need to make v1SessionCatalog private in CatalogManager. We only need to make sessionCatalog and defaultCatalog private. cc @rdblue

@imback82
Contributor Author

@rdblue / @cloud-fan I applied @rdblue's suggestion to make the default and session catalogs private in CatalogManager. I left a few questions regarding the changes, so please have a look. Thanks!

@SparkQA

SparkQA commented Oct 16, 2019

Test build #112188 has finished for PR 26120 at commit d12d4ce.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 17, 2019

Test build #112227 has finished for PR 26120 at commit 4131274.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Oct 18, 2019

+1

Thanks for fixing this! @cloud-fan, any other issues?

@SparkQA

SparkQA commented Oct 18, 2019

Test build #112236 has finished for PR 26120 at commit 9f1e574.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 18, 2019

Test build #112243 has finished for PR 26120 at commit 6d49017.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM if Jenkins passes

@SparkQA

SparkQA commented Oct 18, 2019

Test build #112269 has finished for PR 26120 at commit b93c5e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 39af51d Oct 18, 2019