
Conversation

grundprinzip (Contributor) commented Feb 17, 2024

What changes were proposed in this pull request?

This patch adds a new mechanism to push query execution progress for batch queries. We add a new response message type and periodically push query progress to the client. The client can consume this data to, for example, display a progress bar.

This patch adds support for displaying a progress bar in the PySpark shell when started with Spark Connect.

The proto message is defined as follows:

// This message is used to communicate the progress of the query during execution.
message ExecutionProgress {
  // Captures the progress of each individual stage.
  repeated StageInfo stages = 1;

  // Captures the number of currently in-flight tasks.
  int64 num_inflight_tasks = 2;

  message StageInfo {
    int64 stage_id = 1;
    int64 num_tasks = 2;
    int64 num_completed_tasks = 3;
    int64 input_bytes_read = 4;
    bool done = 5;
  }
}

Clients can consume these messages or simply ignore them. On top of that, this patch adds the ability to register a callback for progress tracking on the SparkSession.

handler = lambda **kwargs: print(kwargs)
spark.registerProgressHandler(handler)
spark.range(100).collect()
spark.removeProgressHandler(handler)
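
For illustration, a handler could fold the per-stage counters into a single completion percentage. This is only a sketch: it assumes the handler is invoked with keyword arguments stages (a list of StageInfo), inflight_tasks, operation_id, and done, matching the example output shown later in this thread.

# Sketch only: aggregate StageInfo counters into an overall percentage.
def progress_handler(stages, inflight_tasks, operation_id, done, **kwargs):
    total = sum(s.num_tasks for s in stages)
    completed = sum(s.num_completed_tasks for s in stages)
    pct = 100.0 * completed / total if total > 0 else 0.0
    print(f"[{operation_id}] {completed}/{total} tasks ({pct:.0f}%), "
          f"{inflight_tasks} in flight, done={done}")

spark.registerProgressHandler(progress_handler)
spark.range(100).collect()
spark.removeProgressHandler(progress_handler)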

Example 1 (video attachment: progress_medium_query_multi_stage.mp4)

Example 2 (video attachment: progress_bar.mp4)

Why are the changes needed?

Usability and Experience

Does this PR introduce any user-facing change?

When the user starts the PySpark shell in Spark Connect mode, a progress bar is shown by default.

How was this patch tested?

Added new tests.

Was this patch authored or co-authored using generative AI tooling?

No

grundprinzip (Contributor Author)

Any chance to get some more feedback here, @HyukjinKwon or @hvanhovell?

dtenedor (Contributor)

cc @ueshin @cloud-fan we need help 🙏

// This message is used to communicate progress about the query progress during the execution.
message ExecutionProgress {
  int64 num_tasks = 1;
  int64 num_completed_tasks = 2;
Contributor

Is this for the currently running stage, or across all stages?

Contributor Author

Across all stages. It can always be extended later.

Contributor

I'm wondering how this can be accurate. With AQE we never know the number of partitions for the next stage, as re-optimization can happen.

Contributor Author

The goal of the progress metrics is not to be accurate into the future but only to represent a snapshot of the current state. This means that the number of tasks can be updated when new stages are added or AQE kicks in.

The point is that the number of remaining tasks will converge over time and become stable.
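
As a purely hypothetical illustration of that convergence (the numbers below are invented), a client might observe snapshots like this:

# Hypothetical snapshots for one query (illustrative numbers only): the task
# total grows when AQE adds a stage, then stabilizes once the plan is final.
snapshots = [
    {"num_tasks": 100, "num_completed_tasks": 40},   # initial plan
    {"num_tasks": 220, "num_completed_tasks": 150},  # AQE added a stage
    {"num_tasks": 220, "num_completed_tasks": 220},  # converged, query done
]
for s in snapshots:
    print(f"{s['num_completed_tasks']}/{s['num_tasks']} tasks completed")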

Contributor Author

progress

Contributor

(Just my 2c: I think having any progress bar is much better than none. The standard Spark progress bar has some ups and downs; having new progress bars appear isn't the most intuitive either, though it's probably net better than one progress bar that keeps getting longer. Either way, I would much prefer having some progress bar now that we can extend later, perhaps as we get a better sense of how to incorporate AQE and future stages into the UX.)

Contributor

On second thought, it's better to hide Spark internals (stages) from end users, and eventually we should only have one progress bar for the query. So the current PR is a good starting point.

However, this server-client protocol needs to be stable and we don't want to change the client frequently to improve the progress reporting. Can we define a minimum set of information we need to send to the client side to display the progress bar? I feel it's better to calculate the percentage at the server side.

Contributor Author

So I refactored the code to avoid closing any doors. I did not change the way the progress bar is displayed. However, I extended the progress message to capture the stage-wise information so other clients can decide independently how to present the information to the end user.

Contributor

+1 @cloud-fan what do you think about that? Capture stage-level info in the proto, but keep the display simple for now?

Contributor

yea this is more flexible. The proto message contains all the information and clients can do whatever they want.

github-actions bot added the DOCS label Apr 1, 2024
@grundprinzip
Copy link
Contributor Author

@zhengruifeng @HyukjinKwon addressed your comments, please take another look.

override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
  trackedTags.foreach({ case (_, tracker) =>
    if (tracker.jobs.contains(jobEnd.jobId)) {
      tracker.dirty.set(true)
Contributor

why do we set the dirty flag when nothing is updated?

Contributor Author

This is mostly to make sure that all progress is reported and an update is sent to the client. If you're tracking time between progress messages, every message itself is progress.

grundprinzip (Contributor Author)

@HyukjinKwon @zhengruifeng @cloud-fan I addressed the comments; is there additional feedback?

HyukjinKwon (Member) left a comment

The logic seems fine. For the output shape and information, it would be great if someone like @cloud-fan or @hvanhovell reviews it.

cloud-fan (Contributor)

thanks, merging to master!

cloud-fan closed this in f6999df Apr 4, 2024
HyukjinKwon added a commit that referenced this pull request Apr 5, 2024
…ckage

### What changes were proposed in this pull request?

This PR is a followup of #45150 that adds the new `shell` module to the PyPI package.

### Why are the changes needed?

So the PyPI package contains the `shell` module.

### Does this PR introduce _any_ user-facing change?

No, the main change has not been released yet.

### How was this patch tested?

The test case will be added in #45870. It was found while working on that PR.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45882 from HyukjinKwon/SPARK-47081-followup.

Lead-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
>>> spark.registerProgressHandler(progress_handler)
>>> res = spark.range(10).repartition(1).collect()
3 Stages known, Done: False
3 Stages known, Done: True
Member

This test is flaky:

File "/__w/spark/spark/python/pyspark/sql/connect/session.py", line 346, in pyspark.sql.connect.session.SparkSession.registerProgressHandler
Failed example:
    res = spark.range(10).repartition(1).collect()
Expected:
    3 Stages known, Done: False
    3 Stages known, Done: True
Got:
    0 Stages known, Done: True

https://github.com/apache/spark/actions/runs/8564043093/job/23470007059.

Let me skip it for now.

Contributor Author

I'll unflake it. Thanks!

HyukjinKwon added a commit that referenced this pull request Apr 5, 2024
### What changes were proposed in this pull request?

This PR is a followup of #45150 that skips flaky doctests.

### Why are the changes needed?

In order to make the build stable.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

CI in this PR should verify it.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45889 from HyukjinKwon/SPARK-47081-followup2.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Comment on lines -204 to +248
-            val timeout = Math.max(1, deadlineTimeMillis - System.currentTimeMillis())
+            // Wake up more frequently to send the progress updates.
+            val progressTimeout =
+              executeHolder.sessionHolder.session.conf.get(CONNECT_PROGRESS_REPORT_INTERVAL)
+            // If the progress feature is disabled, wait for the deadline.
+            val timeout = if (progressTimeout > 0) {
+              progressTimeout
+            } else {
+              Math.max(1, deadlineTimeMillis - System.currentTimeMillis())
+            }
Contributor

nit:

            var timeout = Math.max(1, deadlineTimeMillis - System.currentTimeMillis())
            // Wake up more frequently to send the progress updates.
            val progressTimeout =
              executeHolder.sessionHolder.session.conf.get(CONNECT_PROGRESS_REPORT_INTERVAL)
            if (progressTimeout > 0) {
              timeout = Math.min(progressTimeout, timeout)
            }

otherwise, progressTimeout may make us wait beyond the deadline.

HyukjinKwon added a commit that referenced this pull request Jan 14, 2025
…ortInterval` over timeout

### What changes were proposed in this pull request?

This PR is a followup that addresses #45150 (comment)

### Why are the changes needed?

To respect `spark.connect.progress.reportInterval`

### Does this PR introduce _any_ user-facing change?

Virtually no. In a corner case, the progress update might take longer than `spark.connect.progress.reportInterval`.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49474 from HyukjinKwon/SPARK-47081-followup3.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
virrrat commented Apr 14, 2025

Is there a plan to backport this feature to Spark 3.5? Not sure if that depends on other features that aren't in Spark 3.5.

grundprinzip (Contributor Author)

@virrrat this is not planned and, given the upcoming Spark 4 release, not practical.

Comment on lines +2008 to +2010
Register a progress handler to be called when a progress update is received from the server.
.. versionadded:: 4.0
Contributor

Potentially silly question, but: When you look at the docs for this, it's not obvious that Spark Connect supports this method. Should this be explicitly noted in the docstring somehow? Or are users supposed to assume that everything supports Spark Connect unless explicitly noted otherwise?

Contributor

Another related question: Should there be narrative documentation of ProgressHandler on the monitoring page, or are we happy with it just being tucked away in the API docs?

Contributor Author

Creating a PR with documentation updates would be very much appreciated!

Comment on lines +39 to +45
@dataclass
class StageInfo:
    stage_id: int
    num_tasks: int
    num_completed_tasks: int
    num_bytes_read: int
    done: bool
Contributor

I'm just learning about this new feature and am potentially interested in expanding on the documentation for it, as it's very useful for people building applications on top of Spark.

One thing I noticed is that a job can be marked as "done" even though the number of completed tasks is less than the number of tasks for one or more stages. I assume this is because the stage was skipped or something else, but this information is not captured in this class, so the resulting progress communicated to the user ends up being a bit misleading and/or noisy.

Is it possible to enhance this somehow with that information (in which case I'm happy to file a ticket), or have I misunderstood this data?

Here's an example from some testing I did. This is the last update I got from my progress handler:

{'stages': [StageInfo(stage_id=37,
                      num_tasks=1,
                      num_completed_tasks=1,
                      num_bytes_read=0,
                      done=True),
            StageInfo(stage_id=29,
                      num_tasks=1,
                      num_completed_tasks=1,
                      num_bytes_read=0,
                      done=True),
            StageInfo(stage_id=33,
                      num_tasks=183,
                      num_completed_tasks=183,
                      num_bytes_read=0,
                      done=True),
            StageInfo(stage_id=35,
                      num_tasks=120,
                      num_completed_tasks=0,
                      num_bytes_read=0,
                      done=False),
            StageInfo(stage_id=31,
                      num_tasks=1,
                      num_completed_tasks=0,
                      num_bytes_read=0,
                      done=False),
            StageInfo(stage_id=32,
                      num_tasks=120,
                      num_completed_tasks=0,
                      num_bytes_read=0,
                      done=False),
            StageInfo(stage_id=34,
                      num_tasks=1,
                      num_completed_tasks=0,
                      num_bytes_read=0,
                      done=False),
            StageInfo(stage_id=36,
                      num_tasks=183,
                      num_completed_tasks=0,
                      num_bytes_read=0,
                      done=False),
            StageInfo(stage_id=30,
                      num_tasks=120,
                      num_completed_tasks=120,
                      num_bytes_read=0,
                      done=True)],
 'inflight_tasks': 0,
 'operation_id': '1a9fbf1d-4a38-4c6b-b730-6c8b49179694',
 'done': True}

Note that the overall status is "done", even though many stages are not themselves done.

Contributor Author

The progress reporting is meant to provide current information about the query, with progress messages continuing until the query is observed as done from the client side.

Due to the way that the Spark event listeners work, it is not guaranteed that all events have been processed until the query is marked as done by the Spark Connect query execution.

This means there are two potential ways the data can be "off":

  1. Skipped / Canceled stages
  2. Events not yet processed.

The goal for the progress report is not to be 100% accurate, but to indicate to the user what kind of progress the operation is making. For that reason, the completed and current task counts might not be a perfect measure, but they provide an approximation that converges to a reasonable progress report.
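
In practice that means a consumer should treat the query-level done flag as authoritative rather than waiting for every stage's counters to fill up. A minimal sketch, assuming the StageInfo shape shown above:

# Sketch only: trust the query-level "done" flag over per-stage counters,
# which may lag behind (skipped stages, unprocessed listener events).
def fraction_complete(stages, done):
    if done:
        return 1.0
    total = sum(s.num_tasks for s in stages)
    completed = sum(min(s.num_completed_tasks, s.num_tasks) for s in stages)
    return completed / total if total > 0 else 0.0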

Comment

Would it make sense to add a stage_status field to StageInfo to better handle the skipped/cancelled stages scenario? Users could consume the data based on status.

Contributor

The Spark UI does document skipped stages, but I'm not sure if that information is available before the job is done. I think that's what @grundprinzip is saying.
