[SPARK-50572][SQL] Fix race condition in CachedRDDBuilder.cachedColumnBuffers #49179

eejbyfeldt · 2024-12-13T13:55:47Z

What changes were proposed in this pull request?

Fix a race condition in the CachedRDDBuilder class that could make cachedColumnBuffers return null while clearCache runs concurrently.

Why are the changes needed?

The previous code had a race condition that meant that cachedColumnBuffers could return null if another thread was concurrently calling clearCache.

The bug is caused by checking _cachedColumnBuffers and returning it as two separate operations outside a synchronized block. So it is possible for another thread to set it to null after the check but before the return.

java.lang.NullPointerException: null
	at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.filteredCachedBatches(InMemoryTableScanExec.scala:156)
	at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.inputRDD$lzycompute(InMemoryTableScanExec.scala:98)
	at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.inputRDD(InMemoryTableScanExec.scala:84)
	at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.doExecute(InMemoryTableScanExec.scala:163)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:191)
	at org.apache.spark.sql.execution.InputAdapter.inputRDD(WholeStageCodegenExec.scala:527)
	at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs(WholeStageCodegenExec.scala:455)
	at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs$(WholeStageCodegenExec.scala:454)
	at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:498)
	at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:51)
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:751)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:191)
	at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:364)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:445)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4218)
	at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:3459)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4208)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:526)
	... 23 more
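
To make the check-then-return window described above concrete, here is a minimal Java sketch of the pattern. It is not Spark's Scala code and not necessarily the change made in this PR: the class `LazyCache` and the methods `getRacy`, `getSafe`, and `clear` are hypothetical stand-ins for `CachedRDDBuilder`, `cachedColumnBuffers`, and `clearCache`.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a lazily built, clearable cache (not Spark's code).
final class LazyCache<T> {
    private final Supplier<T> builder;
    private volatile T cached; // null means "not built yet" or "cleared"

    LazyCache(Supplier<T> builder) {
        this.builder = builder;
    }

    // Racy: the null check and the final read are two separate reads of the
    // field, and the final read happens outside the synchronized block.
    T getRacy() {
        if (cached == null) {
            synchronized (this) {
                if (cached == null) {
                    cached = builder.get();
                }
            }
        }
        // A concurrent clear() can run here, after the check above.
        return cached; // may be null
    }

    // One possible fix: read the field once into a local variable and only
    // ever return that local, so a later clear() cannot change the result.
    T getSafe() {
        T local = cached;
        if (local == null) {
            synchronized (this) {
                local = cached;
                if (local == null) {
                    local = builder.get();
                    cached = local;
                }
            }
        }
        return local; // non-null as long as the builder never returns null
    }

    void clear() {
        synchronized (this) {
            cached = null;
        }
    }
}
```

With `getRacy`, a caller can observe null whenever `clear` lands between the check and the return, which is the situation that would surface as the NullPointerException above. `getSafe` may hand back a value that has just been cleared, but it never hands back null.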

Does this PR introduce any user-facing change?

Yes, it fixes a race condition that can cause queries to fail with a NullPointerException.

How was this patch tested?

Adds a test that reliably fails on the old code. Not sure if we want to merge that style of test, as it is very specific to this bug and takes 3 seconds to run.

Was this patch authored or co-authored using generative AI tooling?

No.

eejbyfeldt · 2024-12-13T14:14:09Z

@cloud-fan since you were involved in #21018 where the problematic code was introduced.

eejbyfeldt · 2024-12-13T14:15:17Z

An alternative fix could be to mark the entire method as synchronized if we do not think these are on a critical path.
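
As a rough illustration of that alternative, here is a minimal Java sketch in which both the getter and the clear operation hold the same lock for their whole body. The class `SynchronizedLazyCache` and its methods are hypothetical names chosen for this example; they are not Spark's API, and the real methods are Scala.

```java
import java.util.function.Supplier;

// Hypothetical stand-in (not Spark's code): every access to the field
// happens while holding the object's monitor.
final class SynchronizedLazyCache<T> {
    private final Supplier<T> builder;
    private T cached; // guarded by "this"; null means "not built" or "cleared"

    SynchronizedLazyCache(Supplier<T> builder) {
        this.builder = builder;
    }

    // The check, the build, and the return all happen under one lock,
    // so clear() cannot run between the check and the return.
    synchronized T get() {
        if (cached == null) {
            cached = builder.get();
        }
        return cached;
    }

    synchronized void clear() {
        cached = null;
    }
}
```

The trade-off is that every call to `get` takes the lock, even when the value is already built, which is why this only makes sense if the method is not on a critical path.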

github-actions · 2025-03-24T00:27:24Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

SauronShepherd · 2025-03-24T04:53:33Z

I'd like to have a look at it when I have time.
@eejbyfeldt, is there any reason why you didn't continue working on this PR?

eejbyfeldt · 2025-03-25T13:25:58Z

> I'd like to have a look at it when I have time. @eejbyfeldt, is there any reason why you didn't continue working on this PR?

It was just waiting for someone to review it. As far as I am aware, it is good to go as it is, and it fixes a real race condition.

SauronShepherd · 2025-03-25T14:04:51Z

I find this PR quite interesting indeed. It's a pity this hasn't gone ahead. Did you send a message to the dev mailing list?

eejbyfeldt · 2025-04-16T08:59:58Z

Sorry about the late response, but no I don't think I did.

github-actions bot added the SQL label Dec 13, 2024

github-actions bot added the Stale label Mar 24, 2025

github-actions bot closed this Mar 25, 2025
