-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-24850][SQL] fix str representation of CachedRDDBuilder #21805
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Can you add tests in |
|
ok to test |
|
|
||
| assert(!inMemoryRelation.simpleString.contains(dummyQueryExecution.sparkPlan.toString)) | ||
| assert(inMemoryRelation.simpleString.contains( | ||
| "CachedRDDBuilder(true, 1000, StorageLevel(memory, deserialized, 1 replicas))")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true and 1000 look confusing to end users. Can we improve it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we might not need the batch size in the plan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about just comparing explain output results like the query in this pr description?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile tried to keep this close to its default value, maybe we can do something like CachedRDDBuilder(useCompression = true, batchSize = 1000, ...)? But that will break the consistency across logging case classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu wouldn't that be testing the same thing, as explain calls plan.treeString which calls elem.simpleString for every child? I think testing for InMemoryRelation.simpleString covers other possible places where a plan.treeString is logged. Happy to change if you have concerns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, but you don't need to fill useCompression and batchSize in the test case.
|
Test build #93321 has finished for PR 21805 at commit
|
|
Test build #93341 has finished for PR 21805 at commit
|
| tableName: Option[String])( | ||
| @transient private var _cachedColumnBuffers: RDD[CachedBatch] = null) { | ||
|
|
||
| override def toString: String = s"CachedRDDBuilder($useCompression, $batchSize, $storageLevel)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My major point is whether we need to output $useCompression, $batchSize. How useful are they? Our explain output is already pretty long. Maybe we can skip them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, I think the output should be the same with one in v2.3;
scala> val df = Seq((1, 2), (3, 4)).toDF("a", "b")
scala> val testDf = df.join(df, "a").join(df, "a").cache
scala> testDf.groupBy("a").count().explain
== Physical Plan ==
*(2) HashAggregate(keys=[a#309], functions=[count(1)])
+- Exchange hashpartitioning(a#309, 200)
+- *(1) HashAggregate(keys=[a#309], functions=[partial_count(1)])
+- *(1) InMemoryTableScan [a#309]
+- InMemoryRelation [a#309, b#310, b#314, b#319], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(3) Project [a#60, b#61, b#212, b#217]
+- *(3) BroadcastHashJoin [a#60], [a#216], Inner, BuildRight
:- *(3) Project [a#60, b#61, b#212]
: +- *(3) BroadcastHashJoin [a#60], [a#211], Inner, BuildRight
: :- *(3) InMemoryTableScan [a#60, b#61]
: : +- InMemoryRelation [a#60, b#61], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
: : +- LocalTableScan [a#15, b#16]
: +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
: +- *(1) InMemoryTableScan [a#211, b#212]
: +- InMemoryRelation [a#211, b#212], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
: +- LocalTableScan [a#15, b#16]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
+- *(2) InMemoryTableScan [a#216, b#217]
+- InMemoryRelation [a#216, b#217], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- LocalTableScan [a#15, b#16]
The output of this current pr is still different, so can you fix that way? @onursatici
|
@gatorsmile @maropu |
|
|
||
| override protected def otherCopyArgs: Seq[AnyRef] = Seq(statsOfPlanToCache) | ||
|
|
||
| override def simpleString: String = s"InMemoryRelation(${output}, ${cacheBuilder.storageLevel})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about s"InMemoryRelation [${Utils.truncatedString(output, ", ")}], ${cacheBuilder.storageLevel}"?
| assert(inMemoryRelation.simpleString == | ||
| s"InMemoryRelation(${inMemoryRelation.output}," | ||
| + " StorageLevel(memory, deserialized, 1 replicas))") | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about just comparing explain results?
val df = Seq((1, 2)).toDF("a", "b").cache
val outputStream = new java.io.ByteArrayOutputStream()
Console.withOut(outputStream) {
df.explain(false)
}
assert(outputStream.toString.replaceAll("#\\d+", "#x").contains(
"InMemoryRelation [a#x, b#x], StorageLevel(disk, memory, deserialized, 1 replicas)"))
| override protected def otherCopyArgs: Seq[AnyRef] = Seq(statsOfPlanToCache) | ||
|
|
||
| override def simpleString: String = s"InMemoryRelation(${output}, ${cacheBuilder.storageLevel})" | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove the blank line
|
LGTM |
|
Test build #93443 has finished for PR 21805 at commit
|
|
Test build #93444 has finished for PR 21805 at commit
|
|
LGTM Thanks! Merged to master |
What changes were proposed in this pull request?
As of #21018, InMemoryRelation includes its cacheBuilder when logging query plans. This PR changes the string representation of the CachedRDDBuilder to not include the cached spark plan.
How was this patch tested?
spark-shell, query:
as of master results in:
with this patch results in: