Conversation

@grundprinzip (Contributor) commented Jun 16, 2024

### What changes were proposed in this pull request?

One of the interesting shortcomings of Spark Connect is that query execution metrics are not easily accessible. In Spark Classic, the query execution is only reachable via the private `_jdf` variable, which is not available in Spark Connect.

However, since the first release of Spark Connect, the response messages have already contained the metrics of the executed plan.

This patch makes them directly accessible and provides a way to visualize them:

```python
df = spark.range(100)
df.collect()
metrics = df.executionInfo.metrics
metrics.toDot()
```

The `toDot()` method returns an instance of `graphviz.Digraph`, which can either be displayed directly in a notebook or manipulated further.


The purpose of the `executionInfo` property and the associated `ExecutionInfo` class is not to provide an equivalent of the `QueryExecution` class used internally by Spark (with access to the analyzed, optimized, and executed plans, for example), but rather to provide a convenient way of accessing execution-related information.
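For readers without graphviz installed, here is a hedged sketch of what `toDot()` conceptually produces: a DOT digraph with one node per plan operator, annotated with that operator's metrics. The operator names, metric labels, and the `metrics_to_dot` helper below are invented for illustration; they are not the actual PySpark implementation.

```python
def metrics_to_dot(nodes, edges):
    """Build a minimal DOT digraph string, roughly what graphviz.Digraph.source emits."""
    lines = ["digraph metrics {"]
    for node_id, label in nodes:
        lines.append('  %s [label="%s"];' % (node_id, label))
    for src, dst in edges:
        lines.append("  %s -> %s;" % (src, dst))
    lines.append("}")
    return "\n".join(lines)

# One node per plan operator, labeled with its metrics (illustrative values).
nodes = [("n0", "Range numOutputRows: 100"), ("n1", "Project")]
edges = [("n1", "n0")]
print(metrics_to_dot(nodes, edges))
```

The resulting string can be fed to any DOT renderer; `graphviz.Digraph` simply wraps this kind of output in a Python object.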

### Why are the changes needed?

User Experience

### Does this PR introduce _any_ user-facing change?

Yes. It adds a new API for accessing the query execution information of a Spark SQL execution.

### How was this patch tested?

Added new unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No

@grundprinzip (Contributor, Author)

cc @SemyonSinchenko

...

```python
@property
def queryExecution(self) -> Optional["QueryExecution"]:
```
Member:

This should probably have a docstring here, including the version it was added in.

Member:

And I wouldn't make it Optional.

Contributor (Author):

I added the version. For the optional part:

  • In Scala, `QueryExecution` is always present, but then you have to check for `executedPlan`. What I'm worried about is bringing this complexity to the client. The `QueryExecution` object allows too much direct manipulation of the query, which is not ideal.

Member:

I see, so it has to be set after execution. Should we perhaps have a Spark Connect-dedicated API? I think it would be confusing if it had the same name as the Scala-side `df.queryExecution`.

@grundprinzip (Contributor, Author)

@HyukjinKwon thanks for the review, I'll look into it.

@grundprinzip grundprinzip changed the title [WIP][SPARK-48638][CONNECT] Add QueryExecution support for DataFrame [SPARK-48638][CONNECT] Add QueryExecution support for DataFrame Jun 18, 2024
@grundprinzip (Contributor, Author)

@HyukjinKwon @allisonwang-db @zhengruifeng Can you please have another look?

```python
An instance of the graphviz.Digraph object.
"""
try:
    import graphviz
```
@ueshin (Member) commented Jun 21, 2024:

According to the error message, we expect the minimum version to be 0.20?
If so, we should check the version here to avoid unexpected results. Also, should it be documented?

Contributor (Author):

The DOT interface of graphviz is really stable. I did a quick check, and we don't seem to assert package versions beyond providing an error message. For the extra packages in setup.py we do pin specific versions, but if the user installs them manually, there is no enforcement.
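As an illustration of the version-check idea discussed above (a sketch only — the patch itself does not enforce a version, it just raises an informative ImportError), a simple numeric comparison of dotted version strings could look like this. The helper name is invented, and pre-release suffixes are deliberately not handled:

```python
def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare dotted version strings numerically, e.g. '0.20.1' >= '0.20'.
    Simplified: assumes purely numeric dot-separated components."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(minimum)

print(meets_minimum("0.20.1", "0.20"))  # True
print(meets_minimum("0.19", "0.20"))    # False
```

Tuple comparison handles different component counts naturally: `(0, 20, 1) >= (0, 20)` is true because the shared prefix compares equal and the longer tuple wins.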

@grundprinzip grundprinzip changed the title [SPARK-48638][CONNECT] Add QueryExecution support for DataFrame [SPARK-48638][CONNECT] Add ExecutionInfo support for DataFrame Jun 21, 2024
```python
def cb(qe: "ExecutionInfo") -> None:
    self._execution_info = qe

return DataFrameWriter(self._plan, self._session, cb)
```
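The snippet above hands the writer a callback so execution info can flow back to the owning DataFrame after the write runs. A self-contained sketch of that pattern — all class names below are simplified stand-ins for illustration, not the actual PySpark classes:

```python
from typing import Callable, Optional

class ExecutionInfo:
    def __init__(self, metrics: dict) -> None:
        self.metrics = metrics

class Writer:
    def __init__(self, cb: Callable[[ExecutionInfo], None]) -> None:
        self._cb = cb

    def save(self) -> None:
        # After the (pretend) execution finishes, report metrics back to the owner.
        self._cb(ExecutionInfo({"numOutputRows": 100}))

class Frame:
    def __init__(self) -> None:
        self._execution_info: Optional[ExecutionInfo] = None

    @property
    def write(self) -> Writer:
        # The closure captures `self`, so the writer can update this frame.
        def cb(qe: ExecutionInfo) -> None:
            self._execution_info = qe
        return Writer(cb)

df = Frame()
df.write.save()
print(df._execution_info.metrics)  # {'numOutputRows': 100}
```

The design keeps the writer decoupled from the DataFrame: it only knows about a callable, not the frame that owns it.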
@WweiL (Contributor) commented Jun 21, 2024:

It looks like writeStream is not overridden here, so I imagine streaming queries are not supported yet.

In streaming, a query can have multiple DataFrames. What we do in Scala is access it with query.explain(), which uses this lastExecution:

```scala
def lastExecution: IncrementalExecution = getLatestExecutionContext().executionPlan
```

That is, as its name suggests, the QueryExecution (IncrementalExecution) of the last execution.

We could also add a similar mechanism to the StreamingQuery object. This sounds like an interesting follow-up that I'm interested in.

Contributor (Author):

Yes, we should look at streaming as a follow up.

@WweiL (Contributor) commented Jun 21, 2024

Also, would this added "QueryExecution" object make implementing a QueryExecutionListener in Python Connect possible? There is no QueryExecutionListener in classic PySpark anyway.

@grundprinzip (Contributor, Author)

Interestingly, the Spark Connect ML / PyTorch Distributor tests are crashing for me locally, both in Spark Classic and Spark Connect mode.

@cloud-fan (Contributor) left a comment:

LGTM. One minor comment: shall we follow the EXPLAIN format to render the plan tree string in text mode? e.g.

```
Aggregate ...
+- Project ...
   +- Relation ...
```
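A hedged sketch of how this `+-` EXPLAIN-style rendering might be produced from a nested plan. The `(name, children)` tuple representation and the `render` helper are invented for illustration; this simplified version handles only single-child chains, whereas Spark's real EXPLAIN also uses `:-` prefixes for non-last children:

```python
def render(node):
    """Render a (name, children) plan tree in EXPLAIN's '+-' style."""
    name, children = node
    lines = [name]
    for child in children:
        child_lines = render(child)
        lines.append("+- " + child_lines[0])          # child root gets the branch marker
        lines.extend("   " + l for l in child_lines[1:])  # deeper lines shift right
    return lines

plan = ("Aggregate ...", [("Project ...", [("Relation ...", [])])])
print("\n".join(render(plan)))
```

Running this prints the same three-line tree shown in the comment above.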

@grundprinzip (Contributor, Author)

@cloud-fan adjusted the format to `+-`.

@HyukjinKwon (Member) left a comment:

Would be good to document this at `python/docs/source/reference/pyspark.sql`.

@github-actions github-actions bot added the DOCS label Jun 25, 2024
@HyukjinKwon (Member)

Merged to master.

yaooqinn pushed a commit that referenced this pull request Jul 1, 2024
…ed tests

### What changes were proposed in this pull request?

This PR is a followup of #46996 that installs `graphviz` dependency so it runs the tests.

### Why are the changes needed?

To run the tests added in #46996.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

CI in this PR should validate it.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47155 from HyukjinKwon/SPARK-48638-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
