[SPARK-48638][CONNECT] Add ExecutionInfo support for DataFrame #46996
Conversation
python/pyspark/sql/dataframe.py (outdated)

```python
@property
def queryExecution(self) -> Optional["QueryExecution"]:
```
Should probably have a docstring here, with the added version.
And I wouldn't make it Optional.
I added the version. For the optional part: in Scala, `QueryExecution` is always present, but then you have to check for `executedPlan`. The thing I'm worried about is bringing this complexity to the client. The `QueryExecution` object allows too much direct manipulation of the query, which is not ideal.
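For reference, a minimal sketch of what the documented property could look like on a toy stand-in class; the `versionadded` value and the docstring wording are assumptions, not the merged code:

```python
from typing import Optional


class DataFrame:
    """Toy stand-in for pyspark's DataFrame, for illustration only."""

    def __init__(self) -> None:
        # Assumed backing field, filled by a completion callback after the
        # query runs (see the writer callback discussed further down).
        self._query_execution: Optional["QueryExecution"] = None

    @property
    def queryExecution(self) -> Optional["QueryExecution"]:
        """Returns the query execution of this DataFrame, or ``None``
        if the query has not been executed yet.

        .. versionadded:: 4.0.0
        """
        return self._query_execution
```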
I see, so it has to be set after execution. Should we have a dedicated Spark Connect API? I think it'd be confusing if it has the same name as the Scala-side `df.queryExecution`.
@HyukjinKwon thanks for the review, I'll look into it.
Co-authored-by: allisonwang-db <[email protected]>
@HyukjinKwon @allisonwang-db @zhengruifeng Can you please have another look?
```python
        An instance of the graphviz.Digraph object.
        """
        try:
            import graphviz
```
According to the error message, we expect the minimum version to be 0.20? If so, should we check the version here to avoid unexpected results? Should this also be documented?
The dot interface to graphviz is really stable. I did a quick check, but we don't seem to assert the versions of packages beyond providing the error message. For the extra packages in setup.py we do provide specific versions, but if the user installs them manually, there is no enforcement.
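If enforcement were desired, here is a hedged sketch of guarding the import. The 0.20 minimum is taken from the error message mentioned above; the message text and the strictness of the check are illustrative, not the PR's actual behavior:

```python
MIN_GRAPHVIZ_VERSION = "0.20"  # assumed minimum, per the error message

try:
    import graphviz
except ImportError as e:
    raise ImportError(
        f"toDot() requires the 'graphviz' package (>= {MIN_GRAPHVIZ_VERSION}). "
        "Install it with: pip install graphviz"
    ) from e

# Optional stricter check; graphviz exposes __version__ as a plain string
# such as "0.20.1" (pre-release suffixes would need extra handling).
if tuple(int(p) for p in graphviz.__version__.split(".")[:2]) < (0, 20):
    raise ImportError(f"graphviz >= {MIN_GRAPHVIZ_VERSION} is required.")
```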
```python
def cb(qe: "ExecutionInfo") -> None:
    # Completion callback: stash the execution info of the finished
    # query on the DataFrame so it can be inspected afterwards.
    self._execution_info = qe

return DataFrameWriter(self._plan, self._session, cb)
```
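For illustration, a self-contained sketch of this completion-callback pattern with simplified, hypothetical class names (not the PR's actual classes): the writer invokes the callback once the write finishes, and the frame caches the result.

```python
from typing import Callable, Optional


class ExecutionInfo:
    def __init__(self, metrics: dict) -> None:
        self.metrics = metrics


class Writer:
    def __init__(self, callback: Callable[[ExecutionInfo], None]) -> None:
        self._callback = callback

    def save(self) -> None:
        # ... execute the write, then report metrics back to the frame.
        self._callback(ExecutionInfo({"rows": 42}))


class Frame:
    def __init__(self) -> None:
        self._execution_info: Optional[ExecutionInfo] = None

    @property
    def write(self) -> Writer:
        def cb(qe: ExecutionInfo) -> None:
            self._execution_info = qe

        return Writer(cb)


f = Frame()
f.write.save()
print(f._execution_info.metrics)  # {'rows': 42}
```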
Looks like writeStream is not overridden here, so I imagine streaming queries are not supported yet.

In streaming, a query could have multiple DataFrames. What we do in Scala is access it with query.explain(), which uses this lastExecution:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala (line 192 in 9476343):

```scala
def lastExecution: IncrementalExecution = getLatestExecutionContext().executionPlan
```

That is, as its name suggests, the QueryExecution (an IncrementalExecution) of the last execution.

We could also add a similar mechanism to the StreamingQuery object. This sounds like an interesting follow-up that I'm interested in.
Yes, we should look at streaming as a follow-up.
Also, would this added "QueryExecution" object make implementing a …
Interestingly, the Spark Connect ML / PyTorch Distributor tests are crashing for me locally in both Spark Classic and Spark Connect mode.
cloud-fan left a comment:
LGTM. One minor comment: shall we follow the EXPLAIN format to render the plan tree string in the text mode? e.g.

```
Aggregate ...
+- Project ...
   +- Relation ...
```
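For illustration, a minimal sketch (not the PR's code) of rendering a nested plan as an EXPLAIN-style text tree. It handles only linear single-child chains; real EXPLAIN output also uses `:-` connectors when a node has multiple children.

```python
def render(node: dict, depth: int = 0) -> list:
    # Root has no connector; each deeper level is indented three spaces
    # per ancestor and prefixed with the "+- " connector.
    prefix = "" if depth == 0 else "   " * (depth - 1) + "+- "
    lines = [prefix + node["name"]]
    for child in node.get("children", []):
        lines.extend(render(child, depth + 1))
    return lines


plan = {
    "name": "Aggregate ...",
    "children": [{"name": "Project ...", "children": [{"name": "Relation ..."}]}],
}
print("\n".join(render(plan)))
# Aggregate ...
# +- Project ...
#    +- Relation ...
```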
@cloud-fan adjusted the format to …
HyukjinKwon left a comment:
Would be good to document this at `python/docs/source/reference/pyspark.sql`.
Co-authored-by: Hyukjin Kwon <[email protected]>
Merged to master.
…ed tests

### What changes were proposed in this pull request?
This PR is a followup of #46996 that installs the `graphviz` dependency so it runs the tests.

### Why are the changes needed?
To run the tests added in #46996.

### Does this PR introduce _any_ user-facing change?
No, test-only.

### How was this patch tested?
CI in this PR should validate it.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47155 from HyukjinKwon/SPARK-48638-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
What changes were proposed in this pull request?
One of the interesting shortcomings of Spark Connect is that the query execution metrics are not easily accessible directly. In Spark Classic, the query execution is only accessible via the `_jdf` private variable, and this is not available in Spark Connect. However, since the first release of Spark Connect, the response messages have already contained the metrics from the executed plan.

This patch makes them accessible directly and provides a way to visualize them.
The `toDot()` method returns an instance of the `graphviz.Digraph` object that can be either directly displayed in a notebook or further manipulated.

The purpose of the `executionInfo` property and the associated `ExecutionInfo` class is not to provide equivalence to the `QueryExecution` class used internally by Spark (and, for example, access to the analyzed, optimized, and executed plan), but rather to provide a convenient way of accessing execution-related information.
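A hedged usage sketch based on this description; the exact attribute layout (e.g. whether `toDot()` hangs off a `metrics` object) is an assumption, and the connection URL is a placeholder:

```python
from pyspark.sql import SparkSession

# Spark Connect session; "sc://localhost" is a placeholder endpoint.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

df = spark.range(100).groupBy().count()
df.collect()  # metrics are attached once the query has executed

info = df.executionInfo      # ExecutionInfo for the last execution
dot = info.metrics.toDot()   # assumed location of toDot(); returns graphviz.Digraph
dot.render("query_metrics", format="png")  # or display `dot` directly in a notebook
```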
Why are the changes needed?
User experience.
Does this PR introduce any user-facing change?
Yes, it adds a new API for accessing the query execution information of a Spark SQL execution.
How was this patch tested?
Added new unit tests.
Was this patch authored or co-authored using generative AI tooling?
No