Skip to content

Conversation

@turboFei
Copy link
Member

@turboFei turboFei commented Aug 7, 2025

Why are the changes needed?

To close #7163, in this PR, it checks whether engine context stopped in engine terminating checker.

  1. Spark context stooped dut to OOM in spark-listener-group-shared, and call tryOrStopSparkContext.
25/08/03 19:08:06 ERROR Utils: uncaught error in thread spark-listener-group-shared, stopping SparkContext
java.lang.OutOfMemoryError: GC overhead limit exceeded
25/08/03 19:08:06 INFO OperationAuditLogger: operation=a7f134b9-373b-402d-a82b-2d42df568807 opType=ExecuteStatement state=INITIALIZED   user=b_hrvst    session=6a90d01c-7627-4ae6-a506-7ba826355489
...
25/08/03 19:08:23 INFO SparkSQLSessionManager: Opening session for [email protected]
25/08/03 19:08:23 ERROR SparkTBinaryFrontendService: Error opening session: 
org.apache.kyuubi.KyuubiSQLException: Cannot call methods on a stopped SparkContext.
This stopped SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:951)
org.apache.kyuubi.engine.spark.SparkSQLEngine$.createSpark(SparkSQLEngine.scala:337)
org.apache.kyuubi.engine.spark.SparkSQLEngine$.main(SparkSQLEngine.scala:415)
org.apache.kyuubi.engine.spark.SparkSQLEngine.main(SparkSQLEngine.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:732)
The currently active SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:951)
org.apache.kyuubi.engine.spark.SparkSQLEngine$.createSpark(SparkSQLEngine.scala:337)
org.apache.kyuubi.engine.spark.SparkSQLEngine$.main(SparkSQLEngine.scala:415)
org.apache.kyuubi.engine.spark.SparkSQLEngine.main(SparkSQLEngine.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:732)
         
    at org.apache.kyuubi.KyuubiSQLException$.apply(KyuubiSQLException.scala:69)
    at org.apache.kyuubi.KyuubiSQLException$.apply(KyuubiSQLException.scala:73)
  1. The kyuubi engine stop after 12 hours.
25/08/04 07:13:25 ERROR ZookeeperDiscoveryClient: Zookeeper client connection state changed to: LOST, but failed to reconnect in 3 seconds. Give up retry and stop gracefully . 
25/08/04 07:13:25 INFO ClientCnxn: Session establishment complete on server zeus-slc-zk-3.vip.hadoop.ebay.com/10.147.141.240:2181, sessionid = 0x3939e22c983032e, negotiated timeout = 40000
25/08/04 07:13:25 INFO ConnectionStateManager: State change: RECONNECTED
25/08/04 07:13:25 INFO ZookeeperDiscoveryClient: Zookeeper client connection state changed to: RECONNECTED
25/08/04 07:13:25 INFO SparkSQLEngine: Service: [SparkTBinaryFrontend] is stopping.
25/08/04 07:13:25 INFO SparkTBinaryFrontendService: Service: [EngineServiceDiscovery] is stopping.
25/08/04 07:13:25 WARN EngineServiceDiscovery: The Zookeeper ensemble is LOST
25/08/04 07:13:25 INFO EngineServiceDiscovery: Service[EngineServiceDiscovery] is stopped.
25/08/04 07:13:25 INFO SparkTBinaryFrontendService: Service[SparkTBinaryFrontend] is stopped.
25/08/04 07:13:25 INFO SparkTBinaryFrontendService: SparkTBinaryFrontend has stopped
25/08/04 07:13:25 INFO SparkSQLEngine: Service: [SparkSQLBackendService] is stopping.
25/08/04 07:13:25 INFO SparkSQLBackendService: Service: [SparkSQLSessionManager] is stopping.
25/08/04 07:13:25 INFO SparkSQLSessionManager: Service: [SparkSQLOperationManager] is stopping.
25/08/04 07:13:45 INFO SparkSQLOperationManager: Service[SparkSQLOperationManager] is stopped.
25/08/04 07:13:45 INFO SparkSQLSessionManager: Service[SparkSQLSessionManager] is stopped.
  1. seem the shutdown hook does not work in such case

    // Stop engine before SparkContext stopped to avoid calling a stopped SparkContext
    addShutdownHook(() => engine.stop(), SPARK_CONTEXT_SHUTDOWN_PRIORITY + 2)

  2. and SparkSQLEngineListener did not receive ApplicationEnd message, maybe due to spark-listener-group-shared OOM? I do not have jstack for that, and can not check whether the thread alive.

    override def onApplicationEnd(event: SparkListenerApplicationEnd): Unit = {
    server.getServiceState match {
    case ServiceState.STOPPED =>
    debug("Received ApplicationEnd Message form Spark after the engine has stopped")
    case state =>
    info(s"Received ApplicationEnd Message from Spark at $state, stopping")
    server.stop()
    }
    }

How was this patch tested?

Existing GA.

Was this patch authored or co-authored using generative AI tooling?

No.

@turboFei turboFei force-pushed the check_spark_stopped branch from 03c3bfc to ca551c2 Compare August 7, 2025 06:12
@turboFei turboFei changed the title [KYUUBI #7163] Check whether engine context stopped in engine terminating checker [KYUUBI #7163][SPARK Check whether engine context stopped in engine terminating checker Aug 7, 2025
@turboFei turboFei changed the title [KYUUBI #7163][SPARK Check whether engine context stopped in engine terminating checker [KYUUBI #7163][SPARK] Check whether engine context stopped in engine terminating checker Aug 7, 2025
@codecov-commenter
Copy link

codecov-commenter commented Aug 7, 2025

Codecov Report

❌ Patch coverage is 0% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (dc9c75b) to head (835cb3d).
⚠️ Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
...ala/org/apache/kyuubi/session/SessionManager.scala 0.00% 9 Missing ⚠️
.../engine/spark/session/SparkSQLSessionManager.scala 0.00% 1 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##           master   #7167   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         701     701           
  Lines       43515   43521    +6     
  Branches     5897    5899    +2     
======================================
- Misses      43515   43521    +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@turboFei turboFei requested a review from pan3793 August 7, 2025 08:09
@turboFei turboFei closed this in b31663f Aug 7, 2025
turboFei added a commit that referenced this pull request Aug 7, 2025
…terminating checker

### Why are the changes needed?

To close #7163, in this PR, it checks whether engine context stopped in engine terminating checker.
1. Spark context stooped dut to OOM in `spark-listener-group-shared`, and call `tryOrStopSparkContext`.

```
25/08/03 19:08:06 ERROR Utils: uncaught error in thread spark-listener-group-shared, stopping SparkContext
java.lang.OutOfMemoryError: GC overhead limit exceeded
25/08/03 19:08:06 INFO OperationAuditLogger: operation=a7f134b9-373b-402d-a82b-2d42df568807 opType=ExecuteStatement state=INITIALIZED   user=b_hrvst    session=6a90d01c-7627-4ae6-a506-7ba826355489
...
25/08/03 19:08:23 INFO SparkSQLSessionManager: Opening session for b_hrvst10.147.254.115
25/08/03 19:08:23 ERROR SparkTBinaryFrontendService: Error opening session:
org.apache.kyuubi.KyuubiSQLException: Cannot call methods on a stopped SparkContext.
This stopped SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:951)
org.apache.kyuubi.engine.spark.SparkSQLEngine$.createSpark(SparkSQLEngine.scala:337)
org.apache.kyuubi.engine.spark.SparkSQLEngine$.main(SparkSQLEngine.scala:415)
org.apache.kyuubi.engine.spark.SparkSQLEngine.main(SparkSQLEngine.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:732)
The currently active SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:951)
org.apache.kyuubi.engine.spark.SparkSQLEngine$.createSpark(SparkSQLEngine.scala:337)
org.apache.kyuubi.engine.spark.SparkSQLEngine$.main(SparkSQLEngine.scala:415)
org.apache.kyuubi.engine.spark.SparkSQLEngine.main(SparkSQLEngine.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:732)

    at org.apache.kyuubi.KyuubiSQLException$.apply(KyuubiSQLException.scala:69)
    at org.apache.kyuubi.KyuubiSQLException$.apply(KyuubiSQLException.scala:73)
```

2. The kyuubi engine stop after 12 hours.
```
25/08/04 07:13:25 ERROR ZookeeperDiscoveryClient: Zookeeper client connection state changed to: LOST, but failed to reconnect in 3 seconds. Give up retry and stop gracefully .
25/08/04 07:13:25 INFO ClientCnxn: Session establishment complete on server zeus-slc-zk-3.vip.hadoop.ebay.com/10.147.141.240:2181, sessionid = 0x3939e22c983032e, negotiated timeout = 40000
25/08/04 07:13:25 INFO ConnectionStateManager: State change: RECONNECTED
25/08/04 07:13:25 INFO ZookeeperDiscoveryClient: Zookeeper client connection state changed to: RECONNECTED
25/08/04 07:13:25 INFO SparkSQLEngine: Service: [SparkTBinaryFrontend] is stopping.
25/08/04 07:13:25 INFO SparkTBinaryFrontendService: Service: [EngineServiceDiscovery] is stopping.
25/08/04 07:13:25 WARN EngineServiceDiscovery: The Zookeeper ensemble is LOST
25/08/04 07:13:25 INFO EngineServiceDiscovery: Service[EngineServiceDiscovery] is stopped.
25/08/04 07:13:25 INFO SparkTBinaryFrontendService: Service[SparkTBinaryFrontend] is stopped.
25/08/04 07:13:25 INFO SparkTBinaryFrontendService: SparkTBinaryFrontend has stopped
25/08/04 07:13:25 INFO SparkSQLEngine: Service: [SparkSQLBackendService] is stopping.
25/08/04 07:13:25 INFO SparkSQLBackendService: Service: [SparkSQLSessionManager] is stopping.
25/08/04 07:13:25 INFO SparkSQLSessionManager: Service: [SparkSQLOperationManager] is stopping.
25/08/04 07:13:45 INFO SparkSQLOperationManager: Service[SparkSQLOperationManager] is stopped.
25/08/04 07:13:45 INFO SparkSQLSessionManager: Service[SparkSQLSessionManager] is stopped.
```

3. seem the shutdown hook does not work in such case
https://github.com/apache/kyuubi/blob/9a0c49e79135cd90368986176591a80d29634231/externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/SparkSQLEngine.scala#L375-L376

4. and `SparkSQLEngineListener` did not receive `ApplicationEnd` message, maybe due to `spark-listener-group-shared` OOM? I do not have jstack for that, and can not check whether the thread alive.
https://github.com/apache/kyuubi/blob/9a0c49e79135cd90368986176591a80d29634231/externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/spark/kyuubi/SparkSQLEngineListener.scala#L55-L63

### How was this patch tested?

Existing GA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #7167 from turboFei/check_spark_stopped.

Closes #7163

835cb3d [Wang, Fei] SparkContext
cd542de [Wang, Fei] Revert "no hard code"
cf9e40e [Wang, Fei] no hard code
ca551c2 [Wang, Fei] check engine context stopped

Authored-by: Wang, Fei <[email protected]>
Signed-off-by: Wang, Fei <[email protected]>
(cherry picked from commit b31663f)
Signed-off-by: Wang, Fei <[email protected]>
@turboFei turboFei self-assigned this Aug 7, 2025
@turboFei turboFei added this to the v1.10.3 milestone Aug 7, 2025
@turboFei
Copy link
Member Author

turboFei commented Aug 7, 2025

thanks, merged to 1.11.0 and 1.10.3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Kyuubi engine keep alive after spark context stopped

3 participants