Skip to content

Conversation

@abellina
Copy link
Contributor

This PR follows #45930 and #46302 (which is open as of the creation of this PR) to enable shuffle file cleanup for all SQL executions, not just Spark Connect.

The prior PR #45930 introduces two new configs: spark.sql.shuffleDependency.skipMigration.enabled and spark.sql.shuffleDependency.fileCleanup.enabled. These two configs are not specifically namespaced to Spark Connect and I'd like to make sure we can use them from all QueryExecutions. Before this PR, only Spark Connect could enable it.

My change is to move the check for shuffleCleanupMode inside of QueryExecution, instead of having that be passed to this class in the constructor. I also am explicitly turning on these features in the tests, rather than using Utils.isTesting.

I would love to hear any concerns on why we shouldn't do this or what testing you want to see. I have run Standalone tests (note I needed #46302) and can run other tests if required or can code them.

@abellina abellina force-pushed the remove_shuffle_files branch from 76917d4 to cd86e54 Compare July 15, 2024 20:37
@abellina
Copy link
Contributor Author

@bozhang2820 @cloud-fan fyi

@bozhang2820
Copy link
Contributor

Hi @abellina, we tried this before but found that queries would fail, especially when DataFrame objects are reused. You can find test failures when you change the default value of the config back to Utils.isTesting.

@cloud-fan
Copy link
Contributor

I agree to expose this feature to all query executions, so that users can benefit from it if they know they won't execute a DataFrame twice. However, there are certain internal executions that we should not enable shuffle cleanup, such as cached DataFrame execution. It's hard to identify all these places, and I think it's better to limit the entry points to enable shuffle cleanup. Spark Connect is one of them, thriftserver can be another.

@tgravescs
Copy link
Contributor

I agree to expose this feature to all query executions

I agree, I think there are a bunch of cases cleaning the shuffle is very useful. We can go investigate options more.

Perhaps there is a way for to only be enabled for external calls but anything internally calling like caching wouldn't clean up.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 27, 2024
@github-actions github-actions bot closed this Oct 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants