Conversation

@onursatici
Contributor

What changes were proposed in this pull request?

As of #21018, InMemoryRelation includes its cacheBuilder when logging query plans. This PR changes the string representation of CachedRDDBuilder so that it no longer includes the cached Spark plan.
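
For reference, a minimal sketch of the idea using a simplified stand-in class (field names follow the diff quoted later in this thread; the real class also carries a tableName and its cached field is a SparkPlan, not AnyRef):

    // The default case-class toString renders every constructor field,
    // including the whole cached plan, so override it to print only the
    // scalar configuration.
    case class CachedRDDBuilder(
        useCompression: Boolean,
        batchSize: Int,
        storageLevel: String,   // simplified; the real field is a StorageLevel
        cachedPlan: AnyRef) {   // simplified; the real field is the cached SparkPlan
      override def toString: String =
        s"CachedRDDBuilder($useCompression, $batchSize, $storageLevel)"
    }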

How was this patch tested?

In spark-shell, with a query that stacks cached joins to produce nested InMemoryRelations:

var df_cached = spark.read.format("csv").option("header", "true").load("test.csv").cache()
0 to 1 foreach { _ =>
df_cached = df_cached.join(spark.read.format("csv").option("header", "true").load("test.csv"), "A").cache()
}
df_cached.explain

As of master, this results in:

== Physical Plan ==
InMemoryTableScan [A#10, B#11, B#35, B#87]
+- InMemoryRelation [A#10, B#11, B#35, B#87], CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),*(2) Project [A#10, B#11, B#35, B#87]
+- *(2) BroadcastHashJoin [A#10], [A#86], Inner, BuildRight
:- *(2) Filter isnotnull(A#10)
: +- InMemoryTableScan [A#10, B#11, B#35], [isnotnull(A#10)]
: +- InMemoryRelation [A#10, B#11, B#35], CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),*(2) Project [A#10, B#11, B#35]
+- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight
:- *(2) Filter isnotnull(A#10)
: +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)]
: +- InMemoryRelation [A#10, B#11], CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
,None)
: +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false]))
+- *(1) Filter isnotnull(A#34)
+- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)]
+- InMemoryRelation [A#34, B#35], CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
,None)
+- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
,None)
: +- *(2) Project [A#10, B#11, B#35]
: +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight
: :- *(2) Filter isnotnull(A#10)
: : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)]
: : +- InMemoryRelation [A#10, B#11], CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
,None)
: : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false]))
: +- *(1) Filter isnotnull(A#34)
: +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)]
: +- InMemoryRelation [A#34, B#35], CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
,None)
: +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false]))
+- *(1) Filter isnotnull(A#86)
+- InMemoryTableScan [A#86, B#87], [isnotnull(A#86)]
+- InMemoryRelation [A#86, B#87], CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
,None)
+- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
,None)
+- *(2) Project [A#10, B#11, B#35, B#87]
+- *(2) BroadcastHashJoin [A#10], [A#86], Inner, BuildRight
:- *(2) Filter isnotnull(A#10)
: +- InMemoryTableScan [A#10, B#11, B#35], [isnotnull(A#10)]
: +- InMemoryRelation [A#10, B#11, B#35], CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),*(2) Project [A#10, B#11, B#35]
+- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight
:- *(2) Filter isnotnull(A#10)
: +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)]
: +- InMemoryRelation [A#10, B#11], CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
,None)
: +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false]))
+- *(1) Filter isnotnull(A#34)
+- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)]
+- InMemoryRelation [A#34, B#35], CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
,None)
+- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
,None)
: +- *(2) Project [A#10, B#11, B#35]
: +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight
: :- *(2) Filter isnotnull(A#10)
: : +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)]
: : +- InMemoryRelation [A#10, B#11], CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
,None)
: : +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false]))
: +- *(1) Filter isnotnull(A#34)
: +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)]
: +- InMemoryRelation [A#34, B#35], CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
,None)
: +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false]))
+- *(1) Filter isnotnull(A#86)
+- InMemoryTableScan [A#86, B#87], [isnotnull(A#86)]
+- InMemoryRelation [A#86, B#87], CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
,None)
+- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>

With this patch, it results in:

== Physical Plan ==
InMemoryTableScan [A#10, B#11, B#35, B#87]
   +- InMemoryRelation [A#10, B#11, B#35, B#87], CachedRDDBuilder(true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas))
         +- *(2) Project [A#10, B#11, B#35, B#87]
            +- *(2) BroadcastHashJoin [A#10], [A#86], Inner, BuildRight
               :- *(2) Filter isnotnull(A#10)
               :  +- InMemoryTableScan [A#10, B#11, B#35], [isnotnull(A#10)]
               :        +- InMemoryRelation [A#10, B#11, B#35], CachedRDDBuilder(true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas))
               :              +- *(2) Project [A#10, B#11, B#35]
               :                 +- *(2) BroadcastHashJoin [A#10], [A#34], Inner, BuildRight
               :                    :- *(2) Filter isnotnull(A#10)
               :                    :  +- InMemoryTableScan [A#10, B#11], [isnotnull(A#10)]
               :                    :        +- InMemoryRelation [A#10, B#11], CachedRDDBuilder(true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas))
               :                    :              +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
               :                    +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false]))
               :                       +- *(1) Filter isnotnull(A#34)
               :                          +- InMemoryTableScan [A#34, B#35], [isnotnull(A#34)]
               :                                +- InMemoryRelation [A#34, B#35], CachedRDDBuilder(true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas))
               :                                      +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>
               +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false]))
                  +- *(1) Filter isnotnull(A#86)
                     +- InMemoryTableScan [A#86, B#87], [isnotnull(A#86)]
                           +- InMemoryRelation [A#86, B#87], CachedRDDBuilder(true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas))
                                 +- *(1) FileScan csv [A#10,B#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<A:string,B:string>

@maropu
Member

maropu commented Jul 19, 2018

Can you add tests in DatasetCacheSuite or somewhere? cc: @gatorsmile

@gatorsmile
Member

ok to test


assert(!inMemoryRelation.simpleString.contains(dummyQueryExecution.sparkPlan.toString))
assert(inMemoryRelation.simpleString.contains(
  "CachedRDDBuilder(true, 1000, StorageLevel(memory, deserialized, 1 replicas))"))
Member

true and 1000 look confusing to end users. Can we improve it?

Member

Or we might not need the batch size in the plan.

Member

How about just comparing explain output, using a query like the one in this PR description?

Contributor Author

@gatorsmile I tried to keep this close to its default value; maybe we could do something like CachedRDDBuilder(useCompression = true, batchSize = 1000, ...)? But that would break consistency with how other case classes are logged.

Contributor Author

@maropu wouldn't that be testing the same thing, since explain calls plan.treeString, which calls elem.simpleString for every child? I think testing InMemoryRelation.simpleString also covers the other places where a plan's treeString is logged. Happy to change if you have concerns.
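
(A toy illustration of that relationship, not Spark's actual TreeNode code: treeString is assembled from each node's simpleString, so an assertion on simpleString also constrains what explain can print.)

    // Hypothetical minimal tree: treeString concatenates every node's
    // simpleString, which is why checking simpleString covers explain output.
    case class Node(simpleString: String, children: Seq[Node] = Nil) {
      def treeString: String = {
        val childLines = children.flatMap(_.treeString.split("\n")).map("   " + _)
        (simpleString +: childLines).mkString("\n")
      }
    }

    val plan = Node("InMemoryTableScan [A, B]",
      Seq(Node("InMemoryRelation [A, B], CachedRDDBuilder(...)")))
    println(plan.treeString)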

Member

maropu commented Jul 20, 2018

yea, but you don't need to fill in useCompression and batchSize in the test case.

@SparkQA

SparkQA commented Jul 20, 2018

Test build #93321 has finished for PR 21805 at commit 9ccfc4e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 20, 2018

Test build #93341 has finished for PR 21805 at commit cf2eae2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AvroOptions(@transient val parameters: CaseInsensitiveMap[String])
  • case class ExprId(id: Long, jvmId: UUID)
  • class InputProcessor(store: StateStore)
  • case class StateData(
  • sealed trait StateManager extends Serializable

tableName: Option[String])(
@transient private var _cachedColumnBuffers: RDD[CachedBatch] = null) {

override def toString: String = s"CachedRDDBuilder($useCompression, $batchSize, $storageLevel)"
Member

My major point is whether we need to output $useCompression, $batchSize. How useful are they? Our explain output is already pretty long. Maybe we can skip them?

Member

yea, I think the output should be the same as the one in v2.3;

scala> val df = Seq((1, 2), (3, 4)).toDF("a", "b")
scala> val testDf = df.join(df, "a").join(df, "a").cache
scala> testDf.groupBy("a").count().explain
== Physical Plan ==
*(2) HashAggregate(keys=[a#309], functions=[count(1)])
+- Exchange hashpartitioning(a#309, 200)
   +- *(1) HashAggregate(keys=[a#309], functions=[partial_count(1)])
      +- *(1) InMemoryTableScan [a#309]
            +- InMemoryRelation [a#309, b#310, b#314, b#319], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
                  +- *(3) Project [a#60, b#61, b#212, b#217]
                     +- *(3) BroadcastHashJoin [a#60], [a#216], Inner, BuildRight
                        :- *(3) Project [a#60, b#61, b#212]
                        :  +- *(3) BroadcastHashJoin [a#60], [a#211], Inner, BuildRight
                        :     :- *(3) InMemoryTableScan [a#60, b#61]
                        :     :     +- InMemoryRelation [a#60, b#61], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
                        :     :           +- LocalTableScan [a#15, b#16]
                        :     +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
                        :        +- *(1) InMemoryTableScan [a#211, b#212]
                        :              +- InMemoryRelation [a#211, b#212], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
                        :                    +- LocalTableScan [a#15, b#16]
                        +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
                           +- *(2) InMemoryTableScan [a#216, b#217]
                                 +- InMemoryRelation [a#216, b#217], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
                                       +- LocalTableScan [a#15, b#16]

The output of the current PR is still different, so can you fix it that way? @onursatici

@onursatici
Contributor Author

@gatorsmile @maropu
I have removed batchSize and useCompression; other than that, the string representation is now the same as what we had in 2.3.


override protected def otherCopyArgs: Seq[AnyRef] = Seq(statsOfPlanToCache)

override def simpleString: String = s"InMemoryRelation(${output}, ${cacheBuilder.storageLevel})"
Member

How about s"InMemoryRelation [${Utils.truncatedString(output, ", ")}], ${cacheBuilder.storageLevel}"?

assert(inMemoryRelation.simpleString ==
  s"InMemoryRelation(${inMemoryRelation.output},"
  + " StorageLevel(memory, deserialized, 1 replicas))")
}
Member

How about just comparing explain results?

    val df = Seq((1, 2)).toDF("a", "b").cache
    val outputStream = new java.io.ByteArrayOutputStream()
    Console.withOut(outputStream) {
      df.explain(false)
    }
    assert(outputStream.toString.replaceAll("#\\d+", "#x").contains(
      "InMemoryRelation [a#x, b#x], StorageLevel(disk, memory, deserialized, 1 replicas)"))

override protected def otherCopyArgs: Seq[AnyRef] = Seq(statsOfPlanToCache)

override def simpleString: String = s"InMemoryRelation(${output}, ${cacheBuilder.storageLevel})"

Member

nit: remove the blank line

@maropu
Member

maropu commented Jul 23, 2018

LGTM

@SparkQA

SparkQA commented Jul 23, 2018

Test build #93443 has finished for PR 21805 at commit de3f63e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 23, 2018

Test build #93444 has finished for PR 21805 at commit 2a21c80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

Thanks! Merged to master

asfgit closed this in 2edf17e Jul 23, 2018