
Conversation

Lyndon-Li
Contributor

Fix issue #7725, add design for backup repo cache configuration

@github-actions github-actions bot added the Area/Design Design Documents label Aug 5, 2025
@Lyndon-Li Lyndon-Li force-pushed the backup-repo-cache-design branch from 3f93316 to f10e021 Compare August 5, 2025 11:02
@Lyndon-Li Lyndon-Li force-pushed the backup-repo-cache-design branch 2 times, most recently from fec1dc8 to 349f35b Compare August 5, 2025 11:08

codecov bot commented Aug 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.66%. Comparing base (1a52994) to head (2de5a5c).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9148   +/-   ##
=======================================
  Coverage   59.66%   59.66%           
=======================================
  Files         382      382           
  Lines       43900    43900           
=======================================
  Hits        26194    26194           
  Misses      16161    16161           
  Partials     1545     1545           


@shubham-pampattiwar
Collaborator

/cc @mpryc @msfrucht

@msfrucht
Contributor

msfrucht commented Aug 5, 2025

Overall, I think this design will be effective.

Another reason to add under Background:

On nodes where ephemeral-storage is shared with the OS volume, a full OS volume causes the node to start undergoing behavioral issues such as Pod removal and creation failures. Kubelet operation degrades significantly, if it does not outright fail completely.

One area of this design I would like to see addressed is handling of default storage class labeling. But this also conflicts with Go's conversion of omitempty fields to the empty string. We need some way to tell the difference between cacheStorageClass: "" and the key missing altogether.

Will CacheVolumeInfo.Limit be allowed to be set through the maintenance configmap as well? This seems to be a valuable addition, given the algorithmic size calculation, to prevent excessively sized volumes. I didn't see much explaining its purpose in the design. But it could also be that this already exists in Velero and I just can't find a reference to it in the code.

With no mention of AccessModes, I imagine one of the requirements will be for the StorageClass to support ReadWriteOnce? Though I don't know of any StorageClass with ReadWriteMany support that doesn't also support ReadWriteOnce.

Finally, the `cacheVolumeSize` will be rounded up to GB considering the UX friendliness, storage friendliness and management friendliness.

I know this is a nitpick, but it should be clarified: this should be in GiB, not GB. A number of CSI drivers round up to the nearest Gi when the suffix 'G' is used in a resource request. (Even if this results in the silliness of 1G -> 954Mi.) That would result in the allocated size being larger than requested.

@Lyndon-Li
Contributor Author

Will CacheVolumeInfo.Limit be allowed to be set through the maintenance configmap as well? This seems to be a valuable addition, given the algorithmic size calculation, to prevent excessively sized volumes. I didn't see much explaining its purpose in the design. But it could also be that this already exists in Velero and I just can't find a reference to it in the code.

Yes, this one is the existing cacheLimitMB in the backup repo configmap. See this statement in the doc:
**Limit**: The limit of the cache data, which is represented by `cacheLimitMB`

@Lyndon-Li
Contributor Author

Lyndon-Li commented Aug 6, 2025

One area of this design I would like to see addressed is handling of default storage class labeling. But this also conflicts with Go's conversion of omitempty fields to the empty string. We need some way to tell the difference between cacheStorageClass: "" and the key missing altogether.

The configurations in the backup repo configmap are converted to a Go map, so we cannot and don't need to set the omitempty tag. That is, cacheStorageClass: "" is the same as cacheStorageClass being absent.
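
For illustration, a minimal sketch of that behavior, assuming the configmap values land in a plain map[string]string (the names here are illustrative, not Velero's actual code):

```go
package main

import "fmt"

func main() {
	// Parsed backup repo configmap entries; illustrative only.
	data := map[string]string{"cacheStorageClass": ""}

	// A plain index expression yields the zero value ("") for a missing key,
	// so cacheStorageClass: "" and an absent key are indistinguishable here.
	sc := data["cacheStorageClass"]
	if sc == "" {
		fmt.Println("no storage class configured: skip creating a cache volume")
	}
}
```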

As per the current design, if the storage class is not configured, a cache volume will NOT be created:
**repository-type-3**: This is not a valid configuration because the storage class is absent. Velero gives up creating cache volumes for all restores of this backup repository.
I'm inclined to keep the current design: we need users to explicitly set a storage class, since the default one may not be the expected place for the cache volumes. But let me know if you have other strong considerations.

@Lyndon-Li Lyndon-Li force-pushed the backup-repo-cache-design branch from 349f35b to 0171e1e Compare August 6, 2025 03:32
@Lyndon-Li
Contributor Author

On nodes where ephemeral-storage is shared with the OS volume, a full OS volume causes the node to start undergoing behavioral issues such as Pod removal and creation failures. Kubelet operation degrades significantly, if it does not outright fail completely.

Yes, I have added the details into the background section. The cache volume is exactly meant to solve the problem of running out of the node's root file system; the pod's root file system problem is primarily solved by the existing cacheLimitMB.

@Lyndon-Li
Contributor Author

With no mention of AccessModes, I imagine one of the requirements will be for the StorageClass to support ReadWriteOnce? Though I don't know of any StorageClass with ReadWriteMany support that doesn't also support ReadWriteOnce.

Yes, the cache PVC will be ReadWriteOnce and in Filesystem mode, so the storage class should support that. Added the same into the design doc.
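
As a hedged sketch, this is roughly what such a claim could look like, assuming a recent k8s.io/api; the helper and its values are illustrative, not the design's actual code:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// cachePVC builds a ReadWriteOnce, Filesystem-mode claim for the cache volume.
func cachePVC(name, namespace, storageClass string, sizeGiB int64) *corev1.PersistentVolumeClaim {
	fsMode := corev1.PersistentVolumeFilesystem
	return &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			VolumeMode:       &fsMode,
			StorageClassName: &storageClass,
			Resources: corev1.VolumeResourceRequirements{
				Requests: corev1.ResourceList{
					// Binary SI so the request renders as "<n>Gi", not "<n>G".
					corev1.ResourceStorage: *resource.NewQuantity(sizeGiB<<30, resource.BinarySI),
				},
			},
		},
	}
}
```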

@Lyndon-Li
Contributor Author

This should be in GiB, not GB. A number of CSI drivers round up to the nearest Gi when the suffix 'G' is used in a resource request. (Even if this results in the silliness of 1G -> 954Mi.) That would result in the allocated size being larger than requested.

Yes, it should be in GiB. Modified the design.
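
A minimal sketch of the GiB round-up, assuming sizes are computed in bytes and expressed via k8s.io/apimachinery's resource.Quantity in binary SI (the helper name is hypothetical):

```go
package example

import "k8s.io/apimachinery/pkg/api/resource"

const gib = int64(1) << 30

// roundUpToGiB rounds a byte count up to a whole number of GiB and returns
// it as a binary-SI quantity, e.g. 1500000000 bytes -> "2Gi".
func roundUpToGiB(bytes int64) *resource.Quantity {
	n := (bytes + gib - 1) / gib // ceiling division
	return resource.NewQuantity(n*gib, resource.BinarySI)
}
```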

@Lyndon-Li Lyndon-Li force-pushed the backup-repo-cache-design branch from 0171e1e to e12db73 Compare August 6, 2025 03:42
@Lyndon-Li Lyndon-Li marked this pull request as ready for review August 6, 2025 03:45
@Lyndon-Li Lyndon-Li force-pushed the backup-repo-cache-design branch from e12db73 to c8b9ec0 Compare August 6, 2025 07:10
@msfrucht
Contributor

msfrucht commented Aug 6, 2025

One area of this design I would like to see addressed is handling of default storage class labeling. But this also conflicts with Go's conversion of omitempty fields to the empty string. We need some way to tell the difference between cacheStorageClass: "" and the key missing altogether.

The configurations in the backup repo configmap are converted to a Go map, so we cannot and don't need to set the omitempty tag. That is, cacheStorageClass: "" is the same as cacheStorageClass being absent.

As per the current design, if the storage class is not configured, a cache volume will NOT be created: **repository-type-3**: This is not a valid configuration because the storage class is absent. Velero gives up creating cache volumes for all restores of this backup repository. I'm inclined to keep the current design: we need users to explicitly set a storage class, since the default one may not be the expected place for the cache volumes. But let me know if you have other strong considerations.

It was not a strong consideration. We use scripting to configure Velero, so our needs don't correspond to the average user's. The suggestion was made in consideration of general ease of use.

If you prefer explicit storage class that is not a problem for us.

@Lyndon-Li Lyndon-Li requested a review from kaovilai August 7, 2025 08:45
kaovilai
kaovilai previously approved these changes Aug 7, 2025
@msfrucht
Contributor

msfrucht commented Aug 7, 2025

@Lyndon-Li Went back over our applications and found one I had missed that requires a design change to work with, because the application version that causes this problem is older than the current one.

One of the applications we support shows significant variation in behavior depending on workload: sometimes generating hundreds of PVCs (our max seen was 386 in a single namespace), and, in the case that impacts this design, over a hundred million files on a single CephFS ReadWriteMany volume, which is similar in behavior to NFS.

The vast majority of these files were small, below the kopia deduplication threshold. Any system that needed to access or iterate over all the files, rather than the minority the application touches at a time, became strained, including backup and restore, OpenShift SELinux relabeling, and Ceph. Newer versions corrected this problem by shoving the files into a database.

This causes the kopia index cache to grow enormously in size, up to ~25-35GB, which is not covered by the kopia cache limit options. As we found out, the cache limit options affect the size of the data and metadata caches, but not the index cache.

The small file size meant the total data backed up was relatively small. And the snapshot size was small as well, relative to what would normally be expected of such a high file count.

The consequence is that the formula in the design breaks down in this unusual situation:
cacheVolumeSize = (min(backupSize, cacheLimit) >= residentThreshold ? min(backupSize, cacheLimit) : 0) * inflationPercentage
Specifically, this part: min(backupSize, cacheLimit)

The design allows for forcing a PVC cache, which is good for this situation, but not for sizing the PVC to deal with the unusually large kopia index cache, as min(backupSize, cacheLimit) will start from an unusually small size compared to most applications.

We believe this can be addressed by exposing the InflationPercentage as a configurable value, though the default is OK for most situations. Some quick math and a look at existing application maintenance Jobs showed that only this one situation falls outside the formula. I am completely open to other options, and I understand there may be no interest in addressing such an unusual situation.
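
To make the breakdown concrete, here is a minimal sketch of the formula as quoted above, with assumed example numbers (identifiers mirror the design's pseudocode, not actual Velero code):

```go
package main

import (
	"fmt"
	"math"
)

// cacheVolumeSizeGiB mirrors the design's pseudocode; all sizes are in GiB.
func cacheVolumeSizeGiB(backupSize, cacheLimit, residentThreshold, inflationPercentage float64) float64 {
	base := math.Min(backupSize, cacheLimit)
	if base < residentThreshold {
		return 0 // below the threshold, no cache volume is created
	}
	return base * inflationPercentage
}

func main() {
	// Typical case: 100 GiB backup, 5 GiB cacheLimit, 1 GiB threshold, 110% inflation.
	fmt.Println(cacheVolumeSizeGiB(100, 5, 1, 1.1)) // 5.5
	// The case described above: a small backup of many tiny files gets no cache
	// volume at all, regardless of a ~25-35GB index cache, because only
	// backupSize is measured.
	fmt.Println(cacheVolumeSizeGiB(0.5, 5, 1, 1.1)) // 0
}
```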

@Lyndon-Li
Contributor Author

Lyndon-Li commented Aug 8, 2025

As we found out, the cache limit options affect the size of the data and metadata caches, but not the index cache

I think the index cache is also constrained by the cacheLimit, see https://github.com/kopia/kopia/blob/3273e2f69ca94495a6231dbf1c085d36d5ccdbb7/repo/content/committed_read_manager.go#L440.

For Velero, the default cacheLimit is 5GB, so the cache size should not exceed that size by much, even in the large-index case you mentioned.

Previously there was a bug on the Velero side: the cache's hard limit was not appropriately set, and as a result the cache size could far exceed the configured cacheLimitMB or the default value of 5GB.

@Lyndon-Li
Contributor Author

Lyndon-Li commented Aug 8, 2025

@msfrucht

The vast majority of these files were small, below the kopia deduplication threshold. Any system that needed to access or iterate over all the files, rather than the minority the application touches at a time, became strained, including backup and restore
The design allows for forcing a PVC cache, which is good for this situation, but not for sizing the PVC to deal with the unusually large kopia index

Yes, Kopia always keeps a full version of the indexes in memory and refreshes them periodically, e.g., every 15 min, from the cache.

Previously, I intended to avoid elaborating on the index cache data in this design, because: 1, keeping a full index cache on the client side is not normal for backup repositories; 2, even for Kopia, when this case happens (a huge number of small files results in a huge number of indexes), users probably don't have to worry about the disk occupation of the cache, but about the memory usage: they need to provision huge memory for the data mover pod, and they most probably cannot run concurrent data mover pods because they don't have enough memory, so the precondition/requirement for this design is gone.
For 2, take what you mentioned as an example: if the index blobs are 25-35GB, much more memory is required, because the indexes persisted in the index blobs are in a compact binary format, but in memory they are represented by complex Go structs; what is worse, the indexes may be duplicated in memory during execution.

But since we've come this far, let me add the things below into the design doc:

  1. Explain that backup repository metadata (a general name for non-payload data, including Kopia's index data) may take the majority of the cache data, so calculating the cache size from the backup size is not always accurate.
  2. State that the current design doesn't consider the case where repository metadata takes the majority of the cache data, leaving it for a future enhancement when the requirement is clear, for which we need to support cache volumes for both backup and restore.

@Lyndon-Li Lyndon-Li force-pushed the backup-repo-cache-design branch 3 times, most recently from 73d04a2 to 65e7182 Compare August 8, 2025 10:42
@Lyndon-Li
Contributor Author

The design allows for forcing a PVC cache, which is good for this situation, but not for sizing the PVC to deal with the unusually large kopia index cache, as min(backupSize, cacheLimit) will start from an unusually small size

It may not be a good choice to calculate the cacheVolumeSize by cacheVolumeSize = (min(backupSize, cacheLimit) >= residentThreshold ? min(backupSize, cacheLimit) : 0) * inflationPercentage, even in normal cases.
Let me refactor the design and get back to you.

@Lyndon-Li Lyndon-Li force-pushed the backup-repo-cache-design branch 3 times, most recently from b481725 to b36b86d Compare August 12, 2025 07:24
@Lyndon-Li
Contributor Author

@msfrucht
I have updated the doc, please continue the review.

@Lyndon-Li Lyndon-Li requested a review from kaovilai August 12, 2025 07:30
@Lyndon-Li Lyndon-Li force-pushed the backup-repo-cache-design branch from 1b1d966 to 5d2dad2 Compare August 13, 2025 03:20
@Lyndon-Li Lyndon-Li requested a review from kaovilai August 13, 2025 03:21
kaovilai
kaovilai previously approved these changes Aug 13, 2025
@msfrucht
Contributor

@msfrucht I have updated the doc, please continue the review.

Thanks for the additional effort and revision.

This formula is more flexible, without loss of expressive power, in my opinion.

cacheVolumeSize = (backupSize > residentThreshold ? limit : 0) * inflationPercentage

Even if we had some issue with the unusual nature of the backup size, it could be dealt with by adjusting the cacheLimit either globally or per affected namespace. The cacheLimit might be over-reserved for data and metadata in the extreme case I found in our history, but that wasn't the limiting factor there, so it becomes irrelevant. As far as I am aware, this particular case was the only one like that.

We have previously recommended, in situations like that, backing up Ceph rather than the application itself, as CephFS is not eligible for Block volumeMode data transfer, the preferred method for excessive file count volumes.

Said user had the memory to spare, didn't mind, and was quite adamant on the matter. Their focus was on reliability, execution time, and management flexibility, with resource use way down the list of priorities.
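
For comparison, a hedged sketch of the revised formula quoted above (identifiers are illustrative, not Velero's actual code); because the size is driven by the configured limit rather than the backup size, the extreme case can be accommodated by raising cacheLimit:

```go
package example

// revisedCacheVolumeSizeGiB mirrors the revised pseudocode; sizes are in GiB.
func revisedCacheVolumeSizeGiB(backupSize, residentThreshold, limit, inflationPercentage float64) float64 {
	if backupSize <= residentThreshold {
		return 0 // too small to benefit from a cache volume
	}
	// Sized from the configured limit, so a small backup with a huge index
	// cache can still receive an adequate volume by raising cacheLimit.
	return limit * inflationPercentage
}
```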

kaovilai
kaovilai previously approved these changes Aug 14, 2025
Signed-off-by: Lyndon-Li <[email protected]>
@Lyndon-Li Lyndon-Li force-pushed the backup-repo-cache-design branch from 7f2d5f2 to 133db85 Compare August 15, 2025 09:55