Skip to content

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

In DataSource, if the table is not analyzed, we will use 0 as the default value for table size. This is dangerous, we may broadcast a large table and cause OOM. We should use defaultSizeInBytes instead.

How was this patch tested?

new regression test

@cloud-fan
Copy link
Contributor Author

cc @rxin @ericl

className = table.provider.get,
options = table.storage.properties ++ pathOption)
options = table.storage.properties ++ pathOption,
catalogTable = Some(simpleCatalogRelation.metadata))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously we will always use InMemoryFileIndex because we don't pass the catalog table to DataSource...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any way to catch this in a test? Presumably this would have caused all the files to be scanned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i know why. The InMemoryExternalCatalog hasn't implemented all the interfaces(e.g. some partition related ones), so we did it intentionally.

@SparkQA
Copy link

SparkQA commented Dec 14, 2016

Test build #70130 has finished for PR 16280 at commit 4939d6d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 15, 2016

Test build #70167 has finished for PR 16280 at commit 1628b29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Dec 15, 2016

Merging in master/branch-2.1.

@asfgit asfgit closed this in d6f11a1 Dec 15, 2016
asfgit pushed a commit that referenced this pull request Dec 15, 2016
… size

## What changes were proposed in this pull request?

In `DataSource`, if the table is not analyzed, we will use 0 as the default value for table size. This is dangerous, we may broadcast a large table and cause OOM. We should use `defaultSizeInBytes` instead.

## How was this patch tested?

new regression test

Author: Wenchen Fan <[email protected]>

Closes #16280 from cloud-fan/bug.

(cherry picked from commit d6f11a1)
Signed-off-by: Reynold Xin <[email protected]>
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
… size

## What changes were proposed in this pull request?

In `DataSource`, if the table is not analyzed, we will use 0 as the default value for table size. This is dangerous, we may broadcast a large table and cause OOM. We should use `defaultSizeInBytes` instead.

## How was this patch tested?

new regression test

Author: Wenchen Fan <[email protected]>

Closes apache#16280 from cloud-fan/bug.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
… size

## What changes were proposed in this pull request?

In `DataSource`, if the table is not analyzed, we will use 0 as the default value for table size. This is dangerous, we may broadcast a large table and cause OOM. We should use `defaultSizeInBytes` instead.

## How was this patch tested?

new regression test

Author: Wenchen Fan <[email protected]>

Closes apache#16280 from cloud-fan/bug.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants