-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Add PVC-to-Pod cache to improve volume policy performance #9441
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add PVC-to-Pod cache to improve volume policy performance #9441
Conversation
Signed-off-by: Shubham Pampattiwar <[email protected]>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #9441 +/- ##
==========================================
+ Coverage 60.23% 60.29% +0.05%
==========================================
Files 386 386
Lines 35937 36023 +86
==========================================
+ Hits 21648 21720 +72
- Misses 12715 12723 +8
- Partials 1574 1580 +6 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Signed-off-by: Shubham Pampattiwar <[email protected]>
c58a94e to
8a9827e
Compare
|
Review Request: @Lyndon-Li @blackpiglet @sseago @kaovilai |
Signed-off-by: Shubham Pampattiwar <[email protected]>
7d0f19b to
cd1d4d4
Compare
|
@Lyndon-Li, thanks for the review feedback. I've addressed your comments:
The fallback logic still exists in |
The GetPodsUsingPVC function had O(N*M) complexity - for each PVC, it listed ALL pods in the namespace and iterated through each pod. With many PVCs and pods, this caused significant performance degradation (2+ seconds per PV in some cases). This change introduces a PVC-to-Pod cache that is built once per backup and reused for all PVC lookups, reducing complexity from O(N*M) to O(N+M). Changes: - Add PVCPodCache struct with thread-safe caching in podvolume pkg - Add NewVolumeHelperImplWithCache constructor for cache support - Build cache before backup item processing in backup.go - Add comprehensive unit tests for cache functionality - Graceful fallback to direct lookups if cache fails Fixes vmware-tanzu#9179 Signed-off-by: Shubham Pampattiwar <[email protected]>
Signed-off-by: Shubham Pampattiwar <[email protected]>
Add TestVolumeHelperImplWithCache_ShouldPerformSnapshot to verify: - Volume policy match with cache returns correct snapshot decision - fs-backup via opt-out with cache properly skips snapshot - Fallback to direct lookup when cache is not built These tests verify the cache-enabled code path added in the previous commit for improved volume policy performance. Signed-off-by: Shubham Pampattiwar <[email protected]>
Add TestVolumeHelperImplWithCache_ShouldPerformFSBackup to verify: - Volume policy match with cache returns correct fs-backup decision - Volume policy match with snapshot action skips fs-backup - Fallback to direct lookup when cache is not built Signed-off-by: Shubham Pampattiwar <[email protected]>
Add test case to verify that the PVC-to-Pod cache is used even when no volume policy is configured. When defaultVolumesToFSBackup is true, the cache is used to find pods using the PVC to determine if fs-backup should be used instead of snapshot. Signed-off-by: Shubham Pampattiwar <[email protected]>
- Use ResolveNamespaceList() instead of GetIncludes() for more accurate namespace resolution when building the PVC-to-Pod cache - Refactor NewVolumeHelperImpl to call NewVolumeHelperImplWithCache with nil cache parameter to avoid code duplication Signed-off-by: Shubham Pampattiwar <[email protected]>
- Rename NewVolumeHelperImplWithCache to NewVolumeHelperImplWithNamespaces - Move cache building logic from backup.go into volumehelper - Return error from NewVolumeHelperImplWithNamespaces if cache build fails - Remove fallback in main backup path - backup fails if cache build fails - Update NewVolumeHelperImpl to call NewVolumeHelperImplWithNamespaces - Add comments clarifying fallback is only used by plugins - Update tests for new error return signature This addresses review comments from @Lyndon-Li and @kaovilai: - Cache building is now encapsulated in volumehelper - No fallback in main backup path ensures predictable performance - Code reuse between constructors Fixes vmware-tanzu#9179 Signed-off-by: Shubham Pampattiwar <[email protected]>
22c05cf to
5e120a7
Compare
| namespaces, | ||
| ) | ||
| if err != nil { | ||
| log.WithError(err).Error("Failed to build PVC-to-Pod cache for volume policy lookups") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Failed to build volume helper"
|
I realized we cannot leave plugins to call the original
We need a further consideration. With the current PR, I am afraid we cannot solve the original problem. |
|
@Lyndon-Li You are right that the CSI PVC BIA (pvc_action.go) still calls Looking at the code paths:
To fix the BIA path, I see a couple of options: Option A: Backup-level cache registry Option B: Modify CSI PVC action directly What approach would you prefer? Or should we scope this PR to the main backup path fix and address the BIA issue in a follow-up? Let me know what you think ? |
|
Option A: Backup-level cache registry Option B: Modify CSI PVC action directly |
|
@Lyndon-Li Let me clarify a few things: Option A: Backup-level cache registry
Option B: Per-backup cache tracking in BIA
|
For Option A: Just notice that BIAs and Velero backupper are running in different PROCESSES, so there is no way to share the cache data without the inter-process data sharing methods, e.g., protocolBuffer, share memory, etc. |
|
@Lyndon-Li while it is possible for the plugin process to exit and restart between BIA calls for a given backup, that is not the norm. Velero doesn't explicitly shut down a plugin process until the end of backjup processing, so in general, if a backup has 1000 items that all invoke the same BIA, all of those BIA calls will occur within the same process. We have other cases where we are caching information in the BIA to avoid having to make an APIserver call to grab the same information many times in a row. |
|
@sseago Thanks for the clarification, That's helpful to know that the plugin process typically stays alive for the duration of So to summarize the path forward:
@Lyndon-Li Does this approach work for you? Should we proceed with the current PR as-is and address the BIA caching in a follow-up? or I can update this PR too if you are on board. |
Thank you for contributing to Velero!
Please add a summary of your change
The
GetPodsUsingPVCfunction had O(N*M) complexity - for each PVC (N), it listed ALL pods in the namespace and iterated through each pod (M). With many PVCs and pods in a cluster, this caused significant performance degradation during backup operations.As reported in issue #9169, one N*M iteration even took 2 seconds:
This change introduces a PVC-to-Pod cache that is built once per backup and reused for all PVC lookups, reducing complexity from O(N*M) to O(N+M).
Changes:
PVCPodCachestruct with thread-safe caching inpkg/util/podvolumeNewVolumeHelperImplWithCacheconstructor for cache supportbackup.goPerformance Impact:
Does your change fix a particular issue?
Fixes #9179
Please indicate you've done the following:
make new-changelog) or comment/kind changelog-not-requiredon this PR.site/content/docs/main.