Conversation

@shubham-pampattiwar (Collaborator) commented Dec 2, 2025

Thank you for contributing to Velero!

**Please add a summary of your change**

The GetPodsUsingPVC function had O(N*M) complexity: for each of the N PVCs, it listed ALL pods in the namespace and iterated through each of the M pods. With many PVCs and pods in a cluster, this caused significant performance degradation during backup operations.

As reported in issue #9169, a single PVC lookup could take about 2 seconds:

time="2025-08-08T04:03:22Z" level=info msg="Executing takePVSnapshot" ...
time="2025-08-08T04:03:24Z" level=info msg="skipping snapshot action for pv pvc-e74b7300..."

This change introduces a PVC-to-Pod cache that is built once per backup and reused for all PVC lookups, reducing complexity from O(N*M) to O(N+M).
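
For illustration, a minimal sketch of the approach (not the PR's actual code: the PVCPodCache name comes from the change list below, while the fields and method signatures here are assumptions) builds the index in one O(M) pass over the pods and then answers each of the N lookups in O(1):

```go
// Sketch only: the real implementation lives in pkg/util/podvolume;
// fields and signatures here are illustrative assumptions.
package podvolume

import (
	"sync"

	corev1 "k8s.io/api/core/v1"
)

// PVCPodCache maps "namespace/pvcName" to the pods mounting that PVC.
type PVCPodCache struct {
	mu    sync.RWMutex
	cache map[string][]corev1.Pod
}

// Build walks every pod exactly once (O(M)) and indexes it under each
// PVC it mounts, so each of the N later lookups is O(1) -- O(N+M) total.
func (c *PVCPodCache) Build(pods []corev1.Pod) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cache = make(map[string][]corev1.Pod)
	for _, pod := range pods {
		for _, vol := range pod.Spec.Volumes {
			if vol.PersistentVolumeClaim == nil {
				continue
			}
			key := pod.Namespace + "/" + vol.PersistentVolumeClaim.ClaimName
			c.cache[key] = append(c.cache[key], pod)
		}
	}
}

// GetPodsUsingPVC returns the cached pods for a PVC; ok is false when the
// cache was never built, letting callers fall back to a direct lookup.
func (c *PVCPodCache) GetPodsUsingPVC(namespace, pvcName string) (pods []corev1.Pod, ok bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if c.cache == nil {
		return nil, false
	}
	pods, ok = c.cache[namespace+"/"+pvcName]
	return pods, ok
}
```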

Changes:

  • Add PVCPodCache struct with thread-safe caching in pkg/util/podvolume
  • Add NewVolumeHelperImplWithCache constructor for cache support
  • Build cache before backup item processing in backup.go
  • Add comprehensive unit tests for cache functionality
  • Graceful fallback to direct lookups if cache building fails

Performance Impact:

| Scenario | Before | After |
| --- | --- | --- |
| 100 PVCs, 1000 pods | ~100,000 iterations | ~1,100 iterations |
| Cache lookup | O(N*M) per PVC | O(1) per PVC |

**Does your change fix a particular issue?**

Fixes #9179


shubham-pampattiwar added a commit to shubham-pampattiwar/velero that referenced this pull request on Dec 2, 2025
codecov bot commented Dec 2, 2025

Codecov Report

❌ Patch coverage is 87.50000% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.29%. Comparing base (554b04e) to head (5e120a7).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/backup/backup.go | 66.66% | 4 Missing and 2 partials ⚠️ |
| pkg/util/podvolume/pod_volume.go | 92.98% | 2 Missing and 2 partials ⚠️ |
| internal/volumehelper/volume_policy_helper.go | 90.47% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #9441      +/-   ##
==========================================
+ Coverage   60.23%   60.29%   +0.05%     
==========================================
  Files         386      386              
  Lines       35937    36023      +86     
==========================================
+ Hits        21648    21720      +72     
- Misses      12715    12723       +8     
- Partials     1574     1580       +6     
```


shubham-pampattiwar added a commit to shubham-pampattiwar/velero that referenced this pull request on Dec 5, 2025
@shubham-pampattiwar force-pushed the fix-volume-policy-performance-9179 branch from c58a94e to 8a9827e on December 5, 2025 20:26
@shubham-pampattiwar marked this pull request as ready for review on December 5, 2025 20:26
@shubham-pampattiwar (Collaborator, Author)

Review Request: @Lyndon-Li @blackpiglet @sseago @kaovilai

shubham-pampattiwar added a commit to shubham-pampattiwar/velero that referenced this pull request on Dec 10, 2025
@shubham-pampattiwar force-pushed the fix-volume-policy-performance-9179 branch from 7d0f19b to cd1d4d4 on December 10, 2025 02:01
@shubham-pampattiwar (Collaborator, Author)

@Lyndon-Li, thanks for the review feedback. I've addressed your comments:

  1. Moved cache building into volumehelper - Renamed NewVolumeHelperImplWithCache to NewVolumeHelperImplWithNamespaces. It now takes a list of namespaces and builds the cache internally, keeping all the logic encapsulated in volumehelper.
  2. Removed fallback in main backup path - NewVolumeHelperImplWithNamespaces now returns (VolumeHelper, error). If cache build fails, backup fails with an error instead of silently falling back to direct lookups. This ensures predictable performance.
  3. Code reuse - NewVolumeHelperImpl (used by plugins) now calls NewVolumeHelperImplWithNamespaces with nil namespaces.

The fallback logic still exists in GetPodsUsingPVCWithCache, but it is only triggered for plugins that don't use the cache. In the main backup path, the cache is always built or the backup fails; a compressed sketch of the constructor relationship follows.
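
To make that contract concrete, here is a self-contained sketch of the two constructors; every signature is stripped down and hypothetical, and only the (VolumeHelper, error) return shape, the nil-namespaces reuse, and the fail-fast behavior come from the description above:

```go
// Hypothetical sketch, not the PR's actual code: real constructors take
// policies, a client, a logger, etc.; those arguments are elided here.
package volumehelper

import "fmt"

// VolumeHelper is a stand-in for the real interface.
type VolumeHelper interface{}

type volumeHelperImpl struct {
	cache *pvcPodCache // nil when no cache was pre-built
}

type pvcPodCache struct{ /* fields omitted */ }

// buildPVCPodCache is a hypothetical stand-in for the internal cache build.
func buildPVCPodCache(namespaces []string) (*pvcPodCache, error) {
	return &pvcPodCache{}, nil
}

// NewVolumeHelperImplWithNamespaces builds the PVC-to-Pod cache up front
// and returns an error instead of silently degrading to direct lookups.
func NewVolumeHelperImplWithNamespaces(namespaces []string) (VolumeHelper, error) {
	var cache *pvcPodCache
	if namespaces != nil {
		c, err := buildPVCPodCache(namespaces)
		if err != nil {
			return nil, fmt.Errorf("building PVC-to-Pod cache: %w", err)
		}
		cache = c
	}
	return &volumeHelperImpl{cache: cache}, nil
}

// NewVolumeHelperImpl keeps the old entry point (used by plugins) and
// reuses the new constructor with nil namespaces, i.e. no pre-built cache.
func NewVolumeHelperImpl() VolumeHelper {
	// Error ignored only because nil namespaces never triggers a cache
	// build in this sketch.
	vh, _ := NewVolumeHelperImplWithNamespaces(nil)
	return vh
}
```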

---

Commit messages included in this pull request:

The GetPodsUsingPVC function had O(N*M) complexity: for each PVC, it listed ALL pods in the namespace and iterated through each pod. With many PVCs and pods, this caused significant performance degradation (2+ seconds per PV in some cases).

This change introduces a PVC-to-Pod cache that is built once per
backup and reused for all PVC lookups, reducing complexity from
O(N*M) to O(N+M).

Changes:
- Add PVCPodCache struct with thread-safe caching in podvolume pkg
- Add NewVolumeHelperImplWithCache constructor for cache support
- Build cache before backup item processing in backup.go
- Add comprehensive unit tests for cache functionality
- Graceful fallback to direct lookups if cache fails

Fixes vmware-tanzu#9179

Signed-off-by: Shubham Pampattiwar <[email protected]>
Add TestVolumeHelperImplWithCache_ShouldPerformSnapshot to verify:
- Volume policy match with cache returns correct snapshot decision
- fs-backup via opt-out with cache properly skips snapshot
- Fallback to direct lookup when cache is not built

These tests verify the cache-enabled code path added in the previous
commit for improved volume policy performance.

Signed-off-by: Shubham Pampattiwar <[email protected]>
Add TestVolumeHelperImplWithCache_ShouldPerformFSBackup to verify:
- Volume policy match with cache returns correct fs-backup decision
- Volume policy match with snapshot action skips fs-backup
- Fallback to direct lookup when cache is not built

Signed-off-by: Shubham Pampattiwar <[email protected]>
Add test case to verify that the PVC-to-Pod cache is used even when
no volume policy is configured. When defaultVolumesToFSBackup is true,
the cache is used to find pods using the PVC to determine if fs-backup
should be used instead of snapshot.

Signed-off-by: Shubham Pampattiwar <[email protected]>
- Use ResolveNamespaceList() instead of GetIncludes() for more accurate
  namespace resolution when building the PVC-to-Pod cache
- Refactor NewVolumeHelperImpl to call NewVolumeHelperImplWithCache with
  nil cache parameter to avoid code duplication

Signed-off-by: Shubham Pampattiwar <[email protected]>
- Rename NewVolumeHelperImplWithCache to NewVolumeHelperImplWithNamespaces
- Move cache building logic from backup.go into volumehelper
- Return error from NewVolumeHelperImplWithNamespaces if cache build fails
- Remove fallback in main backup path - backup fails if cache build fails
- Update NewVolumeHelperImpl to call NewVolumeHelperImplWithNamespaces
- Add comments clarifying fallback is only used by plugins
- Update tests for new error return signature

This addresses review comments from @Lyndon-Li and @kaovilai:
- Cache building is now encapsulated in volumehelper
- No fallback in main backup path ensures predictable performance
- Code reuse between constructors

Fixes vmware-tanzu#9179

Signed-off-by: Shubham Pampattiwar <[email protected]>
@shubham-pampattiwar force-pushed the fix-volume-policy-performance-9179 branch from 22c05cf to 5e120a7 on December 11, 2025 08:04
Review comment from a Contributor on the new error log in backup.go:

```go
	namespaces,
)
if err != nil {
	log.WithError(err).Error("Failed to build PVC-to-Pod cache for volume policy lookups")
```

Suggested message: "Failed to build volume helper"
@Lyndon-Li (Contributor)

@shubham-pampattiwar

I realized we cannot leave it to plugins to call the original NewVolumeHelperImpl, whether based on my suggestion or on your original change:

  1. NewVolumeHelperImpl is not only called in the backupper, but also by plugins, e.g., by the BIA of each PVC.
  2. With my suggestion, NewVolumeHelperImpl is called by each BIA, so the cache will be built multiple times.
  3. With your original change, NewVolumeHelperImpl is called by each BIA, and since it falls back to the original behavior, each call creates the pod-PVC list once.

We need to consider this further. With the current PR, I am afraid we cannot solve the original problem.

@shubham-pampattiwar (Collaborator, Author) commented Dec 11, 2025

@Lyndon-Li You are right that the CSI PVC BIA (pvc_action.go) still calls ShouldPerformSnapshotWithBackup which creates a new VolumeHelper each time.

Looking at the code paths:

  1. Main backup path (item_backupper.go) - Uses ib.volumeHelperImpl.ShouldPerformSnapshot which now has the cache. This should be fixed with the current PR.
  2. CSI PVC BIA (pvc_action.go) - Uses ShouldPerformSnapshotWithBackup which creates a new VolumeHelper per call. This is still O(N*M).

To fix the BIA path, I see a couple of options:

Option A: Backup-level cache registry
Store the cache in a package-level map keyed by backup name. The backupper registers the cache at backup start, and BIAs look it up by backup name. Cleanup happens at backup end.

Option B: Modify CSI PVC action directly
Since it's a built-in BIA, we could have it build/hold its own cache. This depends on whether the BIA instance is reused across all PVCs in a backup or recreated each time.

What approach would you prefer? Or should we scope this PR to the main backup path fix and address the BIA issue in a follow-up? Let me know what you think.

github-actions bot added the hold label on Dec 11, 2025
@Lyndon-Li (Contributor)

> Option A: Backup-level cache registry

But how would the plugin be able to access the cache data? Velero and plugins are different processes.

> Option B: Modify CSI PVC action directly

A plugin is a short-running process; even if we let the plugin build the cache globally, we cannot guarantee the effect, since we don't know how many BIAs the current instance handles.

@shubham-pampattiwar (Collaborator, Author)

@Lyndon-Li Let me clarify a few things:

Option A: Backup-level cache registry (see the sketch after this comment)

  • Create a new package (e.g., pkg/backup/cache) with a global registry
  • Registry uses sync.RWMutex protected map: map[string]*PVCPodCache
  • Main backup path builds cache once and calls cache.RegisterCacheForBackup(backup.UID, pvcPodCache)
  • BIAs import the package and call cache.GetCacheForBackup(backup.UID) to retrieve the cache
  • ShouldPerformSnapshotWithBackup uses the retrieved cache for lookups
  • Cleanup via cache.UnregisterCacheForBackup(backup.UID) when backup finishes

Option B: Per-backup cache tracking in BIA
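
For illustration only, the Option A registry might look roughly like the sketch below; the function names and the backup-UID key come from the bullets above, everything else is an assumption. Note @Lyndon-Li's objection in the next comment: BIAs run in separate processes, so an in-process registry like this is visible only to the backupper's own process.

```go
// Hypothetical pkg/backup/cache sketch of Option A; not actual Velero code.
// It assumes the PR's PVCPodCache type is exported from pkg/util/podvolume.
package cache

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"

	"github.com/vmware-tanzu/velero/pkg/util/podvolume"
)

var (
	mu       sync.RWMutex
	registry = map[types.UID]*podvolume.PVCPodCache{}
)

// RegisterCacheForBackup stores the cache built once by the backupper,
// keyed by the backup's UID.
func RegisterCacheForBackup(backupUID types.UID, c *podvolume.PVCPodCache) {
	mu.Lock()
	defer mu.Unlock()
	registry[backupUID] = c
}

// GetCacheForBackup lets a caller retrieve the cache for a running backup.
func GetCacheForBackup(backupUID types.UID) (*podvolume.PVCPodCache, bool) {
	mu.RLock()
	defer mu.RUnlock()
	c, ok := registry[backupUID]
	return c, ok
}

// UnregisterCacheForBackup removes the entry when the backup finishes.
func UnregisterCacheForBackup(backupUID types.UID) {
	mu.Lock()
	defer mu.Unlock()
	delete(registry, backupUID)
}
```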

@Lyndon-Li (Contributor)

> @Lyndon-Li Let me clarify a few things:
>
> Option A: Backup-level cache registry
> […]
> Option B: Per-backup cache tracking in BIA

For Option A: Just notice that BIAs and the Velero backupper run in different PROCESSES, so there is no way to share the cache data without inter-process data-sharing methods, e.g., protocol buffers, shared memory, etc.
For Option B: Notice that BIAs are hosted by plugin processes, which are not long-running. The plugin framework may stop the process after ONLY ONE BIA call and then start another process later, so a cache in the plugin process won't work as expected.

@sseago (Collaborator) commented Dec 12, 2025

@Lyndon-Li While it is possible for the plugin process to exit and restart between BIA calls for a given backup, that is not the norm. Velero doesn't explicitly shut down a plugin process until the end of backup processing, so in general, if a backup has 1000 items that all invoke the same BIA, all of those BIA calls will occur within the same process. We have other cases where we cache information in the BIA to avoid making the same API server call many times in a row.

@shubham-pampattiwar (Collaborator, Author)

@sseago Thanks for the clarification; it's helpful to know that the plugin process typically stays alive for the duration of backup processing.

So to summarize the path forward:

  1. Current PR: fixes the main backup path by building a PVC-to-Pod cache in volumehelper and using it in item_backupper.go (the check happens BEFORE calling the CSI BIA).
  2. Future enhancement (separate PR): add per-backup cache tracking in the CSI BIA itself for third-party plugin callers (sketched below), using a pattern similar to https://github.com/openshift/openshift-velero-plugin/blob/oadp-dev/velero-plugins/serviceaccount/backup.go#L26

@Lyndon-Li Does this approach work for you? Should we proceed with the current PR as-is and address the BIA caching in a follow-up? Or I can update this PR too if you are on board.
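
For illustration, a hedged sketch of item 2 above (plugin-side, per-backup caching in the CSI PVC BIA, in the spirit of the linked openshift-velero-plugin pattern); all type, field, and helper names here, including buildPVCPodCache, are hypothetical, and per @sseago's note this only helps while the plugin process survives the whole backup:

```go
// Hypothetical plugin-side cache; not the PR's code.
package csipvc

import (
	"sync"

	"github.com/sirupsen/logrus"
	"k8s.io/apimachinery/pkg/types"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
	"github.com/vmware-tanzu/velero/pkg/util/podvolume"
)

// PVCBackupItemAction is a stand-in for the CSI PVC BIA.
type PVCBackupItemAction struct {
	log logrus.FieldLogger

	mu              sync.Mutex
	cachedBackupUID types.UID
	pvcPodCache     *podvolume.PVCPodCache
}

// cacheFor returns a per-backup cache, building it once per backup and
// reusing it for every subsequent BIA call in the same plugin process.
func (a *PVCBackupItemAction) cacheFor(backup *velerov1.Backup) (*podvolume.PVCPodCache, error) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.pvcPodCache != nil && a.cachedBackupUID == backup.UID {
		return a.pvcPodCache, nil // same backup: reuse the cache
	}
	c, err := buildPVCPodCache(backup)
	if err != nil {
		return nil, err
	}
	a.cachedBackupUID = backup.UID
	a.pvcPodCache = c
	return c, nil
}

// buildPVCPodCache is a hypothetical helper that would list pods in the
// backup's namespaces and index them by PVC.
func buildPVCPodCache(backup *velerov1.Backup) (*podvolume.PVCPodCache, error) {
	return &podvolume.PVCPodCache{}, nil
}
```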

Merging this pull request may close: Volume policy is in low performance when there are lots of pods and PVCs in the cluster