[SPARK-48586][SS] Remove lock acquisition in doMaintenance() by making a deep copy of file mappings in RocksDBFileManager in load() #46942
Conversation
sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala
Force-pushed from 8d5814b to aa56251
Can you fill in the testing description in the PR description?
Can we be more specific and name it "background snapshot upload doesn't acquire rocksdb instance lock"?
sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala
Are we trying to synchronize updates between the task thread and the maintenance thread here? If so, we need a lock; plain synchronized would not work. This code can only be executed by one thread anyway (the task thread, which holds the acquire lock), so synchronized does not do anything, in my opinion.
What is the race condition you are thinking about? Is it calling close() on a snapshot that is being uploaded? In that case, we can defer the close to the maintenance thread.
Sorry I missed your comment @chaoqin-li1123. The race condition is basically assigning to the latestSnapshot variable. We assign the variable to the newly created snapshot here, and set it to None in the maintenance thread. Both of these need to be synchronized, I think.
It seems to have been fixed in the latest revision.
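The synchronization being discussed can be illustrated with a minimal sketch. This is not the actual RocksDB.scala code; the class and method names here are hypothetical, and it only models the two competing writers of latestSnapshot (the task thread in commit(), and the maintenance thread) going through the same object lock:

```scala
case class Snapshot(version: Long)

class SnapshotHolder {
  // Guarded by `this`: written by both the task thread and the
  // maintenance thread, so every access goes through synchronized.
  private var latestSnapshot: Option[Snapshot] = None

  // Task thread: publish a freshly created snapshot.
  def publish(s: Snapshot): Unit = synchronized {
    latestSnapshot = Some(s)
  }

  // Maintenance thread: atomically take the snapshot and reset the
  // variable to None, so the two assignments can never interleave.
  def take(): Option[Snapshot] = synchronized {
    val s = latestSnapshot
    latestSnapshot = None
    s
  }
}
```

Because both assignments happen under the same monitor, a publish() can never be lost between the read and the reset inside take().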
[nit] Probably better to name it snapshotFileManager.
Force-pushed from bb7bbfd to 9db51f3
Race condition for oldSnapshots here, between maintenance operations and commit's addOne().
Force-pushed from 9db51f3 to c57f958
siying left a comment
Thanks for working on it. I don't have any other concerns.
    fileMappings = RocksDBFileMappings(newVersionToRocksDBFiles, newLocalFilesToDfsFiles)
  }

  def captureFileMapReference(): RocksDBFileMappings = {
I didn't quite get the purpose of this function.
It functions as a getter for the private RocksDBFileMappings var. I named it captureFileMapReference for code readability in RocksDB, where it is used.
I believe Scala has an accessor style guide: https://docs.scala-lang.org/style/naming-conventions.html#accessorsmutators
Even if we follow Java's or most other languages' conventions, the getter would be better named getFileMappings().
Since the PR is already closed, you don't have to fix it; it's not a big deal. I mentioned it just to close the discussion loop.
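The accessor convention in the linked style guide can be shown with a hypothetical simplified manager (not the real RocksDBFileManager API): the private mutable field gets an underscore name, and the accessor drops the "get"/"capture" prefix so it reads like a property.

```scala
class FileManagerSketch {
  // Private backing field; underscore prefix distinguishes it from the accessor.
  private var _fileMappings: Map[String, String] = Map.empty

  // Accessor per the Scala style guide: no "get" prefix, no parentheses.
  def fileMappings: Map[String, String] = _fileMappings

  // Mutation stays behind an explicitly named method.
  def record(local: String, dfs: String): Unit =
    _fileMappings += (local -> dfs)
}
```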
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala
  oldSnapshots += latestSnapshot
  latestSnapshot = Some(
-   RocksDBSnapshot(checkpointDir, newVersion, numKeysOnWritingVersion))
+   RocksDBSnapshot(checkpointDir,
+     newVersion,
+     numKeysOnWritingVersion,
+     fileManager.captureFileMapReference()))
I think we need to synchronize this part to prevent a race with the maintenance thread when mutating latestSnapshot.
  // If changelog checkpointing is enabled, snapshot will be uploaded asynchronously
  // during state store maintenance.
  latestSnapshot.foreach(_.close())
  oldSnapshots += latestSnapshot
Do we need to check if latestSnapshot is None before we append it to the list?
This is a List[Option[Snapshot]], so the check is not needed. But I feel changing it to a List[Snapshot] and doing the None check makes the code much cleaner.
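The suggestion can be sketched as follows; the names are illustrative, not the real RocksDB.scala fields. Keeping oldSnapshots as a List[Snapshot] means the None check happens once, at append time, and later consumers never have to unwrap Options:

```scala
import scala.collection.mutable.ListBuffer

case class Snapshot(version: Long)

// Concrete snapshots only; empty Options are filtered out before queuing.
val oldSnapshots = ListBuffer.empty[Snapshot]

def retire(latestSnapshot: Option[Snapshot]): Unit =
  // foreach on Option appends only when a snapshot is actually present.
  latestSnapshot.foreach(oldSnapshots += _)
```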
Force-pushed from e0bf8b6 to bcb2307
sahnib left a comment
LGTM, thanks for making these changes. It's super helpful to avoid contention between the maintenance and task threads for performance, and to ensure snapshots keep uploading at a regular cadence.
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala
  val checkpoint = latestSnapshot
  latestSnapshot = None
[unrelated to this PR, just a note here] I think if we fail to write to DFS (let's say due to some timeout/network error), we lose the snapshot, as we assign the variable to None here. It would probably be better to assign this variable to None after a successful upload.
However, I think it's inherently complex to do what I mentioned in the above paragraph, because the task thread can change latestSnapshot to a new snapshot while the upload is happening. I guess it may be okay to skip a snapshot on a transient error (for simplicity) and upload the next snapshot.
Curious if you have any further thoughts on this @riyaverm-db @chaoqin-li1123.
Snapshot uploading is a best-effort attempt anyway. It should be fine if we skip some uploads due to transient errors.
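The best-effort policy discussed above can be sketched like this. Everything here is a simplified assumption, not Spark's API: take() stands in for grabbing latestSnapshot under the lock, and uploadToDfs for the real DFS write. A transient failure drops the current snapshot instead of retrying, on the assumption that the next commit produces a fresh one:

```scala
case class Snapshot(version: Long)

// Returns true if a snapshot was uploaded, false if there was nothing to
// upload or the upload failed transiently.
def uploadLatest(take: () => Option[Snapshot],
                 uploadToDfs: Snapshot => Unit): Boolean =
  take() match {
    case Some(snapshot) =>
      try {
        uploadToDfs(snapshot)
        true
      } catch {
        // Transient failure: skip this snapshot; the next commit's
        // snapshot will be uploaded on the following maintenance pass.
        case _: java.io.IOException => false
      }
    case None => false
  }
```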
HeartSaVioR left a comment
Given there are three approvals and the PR has been up for 2 weeks, I wouldn't require another round of review to fix nits. Please file a new FOLLOWUP PR to address the comments.
+1
- case class RocksDBSnapshot(checkpointDir: File, version: Long, numKeys: Long) {
+ case class RocksDBSnapshot(
+     checkpointDir: File,
nit: 2 more spaces (4 spaces for params in multi-lines)
  /** Save all the files in given local checkpoint directory as a committed version in DFS */
- def saveCheckpointToDfs(checkpointDir: File, version: Long, numKeys: Long): Unit = {
+ def saveCheckpointToDfs(
+     checkpointDir: File,
nit: same here, 2 more spaces
   */

  case class RocksDBFileMappings(
      versionToRocksDBFiles: ConcurrentHashMap[Long, Seq[RocksDBImmutableFile]],
nit: 2 more spaces
Thanks! Merging to master/3.5/3.4.
@riyaverm-db Looks like there is a merge conflict in 3.5 (and probably 3.4 as well). Could you please help craft a PR for 3.5 and 3.4? Thanks in advance!
… changes

### What changes were proposed in this pull request?

This is a follow up PR to #46942 addressing the style changes that were requested.

### Why are the changes needed?

Style changes added.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Not applicable.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47136 from riyaverm-db/rocks-db-style.

Authored-by: Riya Verma <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…making a deep copy of file mappings in RocksDBFileManager in load()

Backports #46942 to 3.5

### What changes were proposed in this pull request?

When change log checkpointing is enabled, the lock of the **RocksDB** state store is acquired when uploading the snapshot inside maintenance tasks, which causes lock contention between query processing tasks and the state maintenance thread. This PR fixes the lock contention issue introduced by #45724. The changes include:

1. Removing lock acquisition in `doMaintenance()`
2. Adding a `copyFileMappings()` method to **RocksDBFileManager**, and using this method to deep copy the file manager state, specifically the file mappings `versionToRocksDBFiles` and `localFilesToDfsFiles`, in `load()`
3. Capture the reference to the file mappings in `commit()`.

### Why are the changes needed?

We want to eliminate lock contention to decrease latency of streaming queries, so lock acquisition inside maintenance tasks should be avoided. This can introduce race conditions between task and maintenance threads. By making a deep copy of `versionToRocksDBFiles` and `localFilesToDfsFiles` in **RocksDBFileManager**, we can ensure that the file manager state is not updated by the task thread when background snapshot uploading tasks attempt to upload a snapshot.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit test cases.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47130 from riyaverm-db/remove-lock-contention-between-maintenance-and-task-3.5.

Lead-authored-by: Riya Verma <[email protected]>
Co-authored-by: Riya Verma <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
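The deep-copy idea described above can be sketched as follows. The field and method names mirror the PR text, but the shapes are simplified assumptions, not the real RocksDBFileManager types. Copying into fresh ConcurrentHashMap instances means later task-thread mutations of the live maps are invisible to an in-flight snapshot upload:

```scala
import java.util.concurrent.ConcurrentHashMap

// Simplified stand-in for the file manager state named in the PR.
case class FileMappings(
    versionToRocksDBFiles: ConcurrentHashMap[Long, Seq[String]],
    localFilesToDfsFiles: ConcurrentHashMap[String, String])

def copyFileMappings(m: FileMappings): FileMappings =
  // ConcurrentHashMap's copy constructor snapshots the entries into new
  // map instances, decoupling the copy from the live maps.
  FileMappings(
    new ConcurrentHashMap(m.versionToRocksDBFiles),
    new ConcurrentHashMap(m.localFilesToDfsFiles))
```

A load()-time copy like this is what lets doMaintenance() run without the instance lock: the upload works off its own immutable-in-practice view of the mappings.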
What changes were proposed in this pull request?
When change log checkpointing is enabled, the lock of the RocksDB state store is acquired when uploading the snapshot inside maintenance tasks, which causes lock contention between query processing tasks and state maintenance thread. This PR fixes lock contention issue introduced by #45724.
The changes include:
1. Removing lock acquisition in doMaintenance()
2. Adding a copyFileMappings() method to RocksDBFileManager, and using this method to deep copy the file manager state, specifically the file mappings versionToRocksDBFiles and localFilesToDfsFiles, in load()
3. Capture the reference to the file mappings in commit().

Why are the changes needed?
We want to eliminate lock contention to decrease latency of streaming queries, so lock acquisition inside maintenance tasks should be avoided. This can introduce race conditions between task and maintenance threads. By making a deep copy of versionToRocksDBFiles and localFilesToDfsFiles in RocksDBFileManager, we can ensure that the file manager state is not updated by the task thread when background snapshot uploading tasks attempt to upload a snapshot.

Does this PR introduce any user-facing change?
No
How was this patch tested?
Added unit test cases.
Was this patch authored or co-authored using generative AI tooling?
No