Skip to content

Conversation

@kaovilai
Copy link
Collaborator

@kaovilai kaovilai commented Apr 25, 2025

Add priority class support for Velero components

Summary

This PR implements priority class support for Velero components, allowing users to configure Kubernetes priority classes for the Velero server, node-agent, data mover pods, and maintenance jobs. This ensures proper pod scheduling and preemption behavior in resource-constrained environments.

Changes

Core Implementation

  • Added --server-priority-class-name and --node-agent-priority-class-name flags to velero install command
  • Modified node-agent server to read priority class from ConfigMap once at startup (minimizing API calls)
  • Updated all data mover controllers to accept priority class as a string parameter
  • Added priority class validation with fallback behavior to prevent pod creation failures
  • Removed ConfigProvider infrastructure in favor of static configuration reading

Key Features

  • Validation with Fallback: ValidatePriorityClass now returns a boolean, allowing callers to handle missing priority classes gracefully
  • Single ConfigMap Read: Node-agent reads ConfigMap once at startup and passes configuration to controllers
  • Automatic Fallback: If a priority class doesn't exist, the system logs a warning and uses default scheduling

Configuration Flow

graph TB
    subgraph "Installation Time"
        CLI[velero install] --> |"--server-priority-class-name"| Server[Velero Server Deployment]
        CLI --> |"--node-agent-priority-class-name"| NodeAgent[Node Agent DaemonSet]
    end
    
    subgraph "ConfigMap Configuration"
        NAConfig[node-agent-configmap] --> |priorityClassName| DataMovers[Data Mover Pods]
        MJConfig[repo-maintenance-job-configmap] --> |global.priorityClassName| MaintJobs[Maintenance Jobs]
    end
    
    subgraph "Runtime Flow"
        NodeAgentServer[Node Agent Server Startup] --> |Read Once| NAConfig
        NodeAgentServer --> |Validate| ValidatePC{Priority Class Exists?}
        ValidatePC -->|Yes| PassToControllers[Pass to Controllers]
        ValidatePC -->|No| ClearPC[Clear Priority Class<br/>Log Warning]
        ClearPC --> PassToControllers
        PassToControllers --> PVBController[PVB Controller]
        PassToControllers --> PVRController[PVR Controller]
        PassToControllers --> DUController[DataUpload Controller]
        PassToControllers --> DDController[DataDownload Controller]
        
        PVBController --> PVBPods[PVB Pods]
        PVRController --> PVRPods[PVR Pods]
        DUController --> DUPods[DataUpload Pods]
        DDController --> DDPods[DataDownload Pods]
    end
    
    subgraph "Maintenance Jobs"
        MaintController[Maintenance Controller] --> |Read Fresh| MJConfig
        MaintController --> MaintJobs
    end
    
    style ValidatePC fill:#f9f,stroke:#333,stroke-width:2px
    style ClearPC fill:#ff9,stroke:#333,stroke-width:2px
Loading

Priority Class Hierarchy

Component Priority Configuration Method Purpose
Velero Server High (e.g., 100) CLI flag Must remain running to coordinate operations
Node Agent Medium (e.g., 50) CLI flag Needs to be available on nodes
Data Mover Pods Low (e.g., 10) ConfigMap Temporary workloads, can be delayed
Maintenance Jobs Low (e.g., 10) ConfigMap Background tasks, can be preempted

Example Usage

# Step 1: Create priority classes
kubectl apply -f - <<EOF
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: velero-critical
value: 100
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: velero-standard
value: 50
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: velero-low
value: 10
EOF

# Step 2: Create namespace
kubectl create namespace velero

# Step 3: Create ConfigMaps
kubectl create configmap node-agent-config -n velero --from-file=config.json=/dev/stdin <<EOF
{
    "priorityClassName": "velero-low"
}
EOF

kubectl create configmap repo-maintenance-job-config -n velero --from-file=config.json=/dev/stdin <<EOF
{
    "global": {
        "priorityClassName": "velero-low"
    }
}
EOF

# Step 4: Install Velero with priority classes
velero install \
    --provider aws \
    --server-priority-class-name velero-critical \
    --node-agent-priority-class-name velero-standard \
    --node-agent-configmap node-agent-config \
    --repo-maintenance-job-configmap repo-maintenance-job-config \
    --use-node-agent

Testing

  • Unit tests for ValidatePriorityClass and GetDataMoverPriorityClassName
  • Updated controller tests to work with new signatures
  • Verified fallback behavior when priority class doesn't exist

Documentation

  • Updated design document at design/Implemented/priority-class-name-support_design.md
  • Added examples and troubleshooting guide in design doc

Fixes #8869

Checklist

@codecov
Copy link

codecov bot commented Apr 25, 2025

Codecov Report

❌ Patch coverage is 75.67568% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.72%. Comparing base (30ea894) to head (ae29030).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
pkg/cmd/cli/nodeagent/server.go 0.00% 17 Missing ⚠️
pkg/repository/maintenance/maintenance.go 61.90% 6 Missing and 2 partials ⚠️
pkg/util/kube/priority_class.go 66.66% 7 Missing ⚠️
pkg/exposer/pod_volume.go 50.00% 2 Missing and 1 partial ⚠️
pkg/controller/data_upload_controller.go 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8883      +/-   ##
==========================================
+ Coverage   59.30%   59.72%   +0.42%     
==========================================
  Files         380      381       +1     
  Lines       43784    43889     +105     
==========================================
+ Hits        25965    26213     +248     
+ Misses      16280    16130     -150     
- Partials     1539     1546       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@kaovilai kaovilai force-pushed the implement8869 branch 10 times, most recently from 29af40a to 2f72032 Compare June 4, 2025 07:14
@reasonerjt reasonerjt requested a review from ywk253100 July 18, 2025 06:10
@kaovilai kaovilai force-pushed the implement8869 branch 4 times, most recently from 303ad7e to 5c9c01a Compare July 21, 2025 22:52
@kaovilai kaovilai force-pushed the implement8869 branch 6 times, most recently from 373fe4a to 9727a9e Compare July 23, 2025 13:08
@github-actions github-actions bot added the Dependencies Pull requests that update a dependency file label Jul 23, 2025
@kaovilai kaovilai force-pushed the implement8869 branch 2 times, most recently from 42ec9d1 to a67f679 Compare July 30, 2025 02:24
@Lyndon-Li
Copy link
Contributor

@kaovilai
Sorry, I forgot to publish my comments yesterday. They are all about docs, please have another look.

kaovilai added a commit to kaovilai/velero that referenced this pull request Jul 30, 2025
- Add ValidatePriorityClassWithClient function to validate priority class existence
- Integrate validation in maintenance.go when creating maintenance jobs
- Update tests to cover the new validation functionality
- Return boolean from ValidatePriorityClass to allow fallback behavior

This ensures maintenance jobs don't fail due to non-existent priority classes,
following the same pattern used for data mover pods.

Addresses feedback from:
vmware-tanzu#8883 (comment)

Refs vmware-tanzu#8869

Signed-off-by: Tiger Kaovilai <[email protected]>
kaovilai added a commit to kaovilai/velero that referenced this pull request Jul 30, 2025
- Add --server-priority-class-name and --node-agent-priority-class-name flags to velero install command
- Configure data mover pods (PVB/PVR/DataUpload/DataDownload) to use priority class from node-agent-configmap
- Configure maintenance jobs to use priority class from repo-maintenance-job-configmap (global config only)
- Add priority class validation with ValidatePriorityClass and GetDataMoverPriorityClassName utilities
- Update e2e tests to include PriorityClass testing utilities
- Move priority class design document to Implemented folder
- Add comprehensive unit tests for all priority class implementations
- Update documentation for priority class configuration
- Add changelog entry for vmware-tanzu#8883

Signed-off-by: Tiger Kaovilai <[email protected]>

remove unused test utils

Signed-off-by: Tiger Kaovilai <[email protected]>

feat: add unit test for getting priority class name in maintenance jobs

Signed-off-by: Tiger Kaovilai <[email protected]>

doc update

Signed-off-by: Tiger Kaovilai <[email protected]>

feat: add priority class validation for repository maintenance jobs

- Add ValidatePriorityClassWithClient function to validate priority class existence
- Integrate validation in maintenance.go when creating maintenance jobs
- Update tests to cover the new validation functionality
- Return boolean from ValidatePriorityClass to allow fallback behavior

This ensures maintenance jobs don't fail due to non-existent priority classes,
following the same pattern used for data mover pods.

Addresses feedback from:
vmware-tanzu#8883 (comment)

Refs vmware-tanzu#8869

Signed-off-by: Tiger Kaovilai <[email protected]>

refactor: clean up priority class handling for data mover pods

- Fix comment in node_agent.go to clarify PriorityClassName is only for data mover pods
- Simplify server.go to use dataPathConfigs.PriorityClassName directly
- Remove redundant priority class logging from controllers as it's already logged during server startup
- Keep logging centralized in the node-agent server initialization

This reduces code duplication and clarifies the scope of priority class configuration.

🤖 Generated with [Claude Code](https://claude.ai/code)

Signed-off-by: Tiger Kaovilai <[email protected]>

refactor: remove GetDataMoverPriorityClassName from kube utilities

Remove GetDataMoverPriorityClassName function and its tests as priority
class is now read directly from dataPathConfigs instead of parsing from
ConfigMap. This simplifies the codebase by eliminating the need for
indirect ConfigMap parsing.

Refs vmware-tanzu#8869

🤖 Generated with [Claude Code](https://claude.ai/code)

Signed-off-by: Tiger Kaovilai <[email protected]>

refactor: remove priority class validation from install command

Remove priority class validation during install as it's redundant
since validation already occurs during server startup. Users cannot
see console logs during install, making the validation warnings
ineffective at this stage.

The validation remains in place during server and node-agent startup
where it's more appropriate and visible to users.

Signed-off-by: Tiger Kaovilai <[email protected]>
Co-Authored-By: Claude <[email protected]>
kaovilai added a commit to kaovilai/velero that referenced this pull request Jul 30, 2025
- Add --server-priority-class-name and --node-agent-priority-class-name flags to velero install command
- Configure data mover pods (PVB/PVR/DataUpload/DataDownload) to use priority class from node-agent-configmap
- Configure maintenance jobs to use priority class from repo-maintenance-job-configmap (global config only)
- Add priority class validation with ValidatePriorityClass and GetDataMoverPriorityClassName utilities
- Update e2e tests to include PriorityClass testing utilities
- Move priority class design document to Implemented folder
- Add comprehensive unit tests for all priority class implementations
- Update documentation for priority class configuration
- Add changelog entry for vmware-tanzu#8883

Signed-off-by: Tiger Kaovilai <[email protected]>

remove unused test utils

Signed-off-by: Tiger Kaovilai <[email protected]>

feat: add unit test for getting priority class name in maintenance jobs

Signed-off-by: Tiger Kaovilai <[email protected]>

doc update

Signed-off-by: Tiger Kaovilai <[email protected]>

feat: add priority class validation for repository maintenance jobs

- Add ValidatePriorityClassWithClient function to validate priority class existence
- Integrate validation in maintenance.go when creating maintenance jobs
- Update tests to cover the new validation functionality
- Return boolean from ValidatePriorityClass to allow fallback behavior

This ensures maintenance jobs don't fail due to non-existent priority classes,
following the same pattern used for data mover pods.

Addresses feedback from:
vmware-tanzu#8883 (comment)

Refs vmware-tanzu#8869

Signed-off-by: Tiger Kaovilai <[email protected]>

refactor: clean up priority class handling for data mover pods

- Fix comment in node_agent.go to clarify PriorityClassName is only for data mover pods
- Simplify server.go to use dataPathConfigs.PriorityClassName directly
- Remove redundant priority class logging from controllers as it's already logged during server startup
- Keep logging centralized in the node-agent server initialization

This reduces code duplication and clarifies the scope of priority class configuration.

🤖 Generated with [Claude Code](https://claude.ai/code)

Signed-off-by: Tiger Kaovilai <[email protected]>

refactor: remove GetDataMoverPriorityClassName from kube utilities

Remove GetDataMoverPriorityClassName function and its tests as priority
class is now read directly from dataPathConfigs instead of parsing from
ConfigMap. This simplifies the codebase by eliminating the need for
indirect ConfigMap parsing.

Refs vmware-tanzu#8869

🤖 Generated with [Claude Code](https://claude.ai/code)

Signed-off-by: Tiger Kaovilai <[email protected]>

refactor: remove priority class validation from install command

Remove priority class validation during install as it's redundant
since validation already occurs during server startup. Users cannot
see console logs during install, making the validation warnings
ineffective at this stage.

The validation remains in place during server and node-agent startup
where it's more appropriate and visible to users.

Signed-off-by: Tiger Kaovilai <[email protected]>
Co-Authored-By: Claude <[email protected]>
kaovilai added a commit to kaovilai/velero that referenced this pull request Jul 30, 2025
- Add --server-priority-class-name and --node-agent-priority-class-name flags to velero install command
- Configure data mover pods (PVB/PVR/DataUpload/DataDownload) to use priority class from node-agent-configmap
- Configure maintenance jobs to use priority class from repo-maintenance-job-configmap (global config only)
- Add priority class validation with ValidatePriorityClass and GetDataMoverPriorityClassName utilities
- Update e2e tests to include PriorityClass testing utilities
- Move priority class design document to Implemented folder
- Add comprehensive unit tests for all priority class implementations
- Update documentation for priority class configuration
- Add changelog entry for vmware-tanzu#8883

Signed-off-by: Tiger Kaovilai <[email protected]>

remove unused test utils

Signed-off-by: Tiger Kaovilai <[email protected]>

feat: add unit test for getting priority class name in maintenance jobs

Signed-off-by: Tiger Kaovilai <[email protected]>

doc update

Signed-off-by: Tiger Kaovilai <[email protected]>

feat: add priority class validation for repository maintenance jobs

- Add ValidatePriorityClassWithClient function to validate priority class existence
- Integrate validation in maintenance.go when creating maintenance jobs
- Update tests to cover the new validation functionality
- Return boolean from ValidatePriorityClass to allow fallback behavior

This ensures maintenance jobs don't fail due to non-existent priority classes,
following the same pattern used for data mover pods.

Addresses feedback from:
vmware-tanzu#8883 (comment)

Refs vmware-tanzu#8869

Signed-off-by: Tiger Kaovilai <[email protected]>

refactor: clean up priority class handling for data mover pods

- Fix comment in node_agent.go to clarify PriorityClassName is only for data mover pods
- Simplify server.go to use dataPathConfigs.PriorityClassName directly
- Remove redundant priority class logging from controllers as it's already logged during server startup
- Keep logging centralized in the node-agent server initialization

This reduces code duplication and clarifies the scope of priority class configuration.

🤖 Generated with [Claude Code](https://claude.ai/code)

Signed-off-by: Tiger Kaovilai <[email protected]>

refactor: remove GetDataMoverPriorityClassName from kube utilities

Remove GetDataMoverPriorityClassName function and its tests as priority
class is now read directly from dataPathConfigs instead of parsing from
ConfigMap. This simplifies the codebase by eliminating the need for
indirect ConfigMap parsing.

Refs vmware-tanzu#8869

🤖 Generated with [Claude Code](https://claude.ai/code)

Signed-off-by: Tiger Kaovilai <[email protected]>

refactor: remove priority class validation from install command

Remove priority class validation during install as it's redundant
since validation already occurs during server startup. Users cannot
see console logs during install, making the validation warnings
ineffective at this stage.

The validation remains in place during server and node-agent startup
where it's more appropriate and visible to users.

Signed-off-by: Tiger Kaovilai <[email protected]>
Co-Authored-By: Claude <[email protected]>
Lyndon-Li
Lyndon-Li previously approved these changes Jul 30, 2025
ywk253100
ywk253100 previously approved these changes Aug 1, 2025
@reasonerjt
Copy link
Contributor

@kaovilai Please resolve the conflicts.

@kaovilai
Copy link
Collaborator Author

kaovilai commented Aug 4, 2025

rebasing

kaovilai added a commit to kaovilai/velero that referenced this pull request Aug 4, 2025
- Add --server-priority-class-name and --node-agent-priority-class-name flags to velero install command
- Configure data mover pods (PVB/PVR/DataUpload/DataDownload) to use priority class from node-agent-configmap
- Configure maintenance jobs to use priority class from repo-maintenance-job-configmap (global config only)
- Add priority class validation with ValidatePriorityClass and GetDataMoverPriorityClassName utilities
- Update e2e tests to include PriorityClass testing utilities
- Move priority class design document to Implemented folder
- Add comprehensive unit tests for all priority class implementations
- Update documentation for priority class configuration
- Add changelog entry for vmware-tanzu#8883

Signed-off-by: Tiger Kaovilai <[email protected]>

remove unused test utils

Signed-off-by: Tiger Kaovilai <[email protected]>

feat: add unit test for getting priority class name in maintenance jobs

Signed-off-by: Tiger Kaovilai <[email protected]>

doc update

Signed-off-by: Tiger Kaovilai <[email protected]>

feat: add priority class validation for repository maintenance jobs

- Add ValidatePriorityClassWithClient function to validate priority class existence
- Integrate validation in maintenance.go when creating maintenance jobs
- Update tests to cover the new validation functionality
- Return boolean from ValidatePriorityClass to allow fallback behavior

This ensures maintenance jobs don't fail due to non-existent priority classes,
following the same pattern used for data mover pods.

Addresses feedback from:
vmware-tanzu#8883 (comment)

Refs vmware-tanzu#8869

Signed-off-by: Tiger Kaovilai <[email protected]>

refactor: clean up priority class handling for data mover pods

- Fix comment in node_agent.go to clarify PriorityClassName is only for data mover pods
- Simplify server.go to use dataPathConfigs.PriorityClassName directly
- Remove redundant priority class logging from controllers as it's already logged during server startup
- Keep logging centralized in the node-agent server initialization

This reduces code duplication and clarifies the scope of priority class configuration.

🤖 Generated with [Claude Code](https://claude.ai/code)

Signed-off-by: Tiger Kaovilai <[email protected]>

refactor: remove GetDataMoverPriorityClassName from kube utilities

Remove GetDataMoverPriorityClassName function and its tests as priority
class is now read directly from dataPathConfigs instead of parsing from
ConfigMap. This simplifies the codebase by eliminating the need for
indirect ConfigMap parsing.

Refs vmware-tanzu#8869

🤖 Generated with [Claude Code](https://claude.ai/code)

Signed-off-by: Tiger Kaovilai <[email protected]>

refactor: remove priority class validation from install command

Remove priority class validation during install as it's redundant
since validation already occurs during server startup. Users cannot
see console logs during install, making the validation warnings
ineffective at this stage.

The validation remains in place during server and node-agent startup
where it's more appropriate and visible to users.

Signed-off-by: Tiger Kaovilai <[email protected]>
Co-Authored-By: Claude <[email protected]>
@kaovilai kaovilai dismissed stale reviews from ywk253100 and Lyndon-Li via e179da2 August 4, 2025 12:41
@Lyndon-Li
Copy link
Contributor

@kaovilai
There are conflicts, could you solve them?

@kaovilai
Copy link
Collaborator Author

kaovilai commented Aug 6, 2025

yes

- Add --server-priority-class-name and --node-agent-priority-class-name flags to velero install command
- Configure data mover pods (PVB/PVR/DataUpload/DataDownload) to use priority class from node-agent-configmap
- Configure maintenance jobs to use priority class from repo-maintenance-job-configmap (global config only)
- Add priority class validation with ValidatePriorityClass and GetDataMoverPriorityClassName utilities
- Update e2e tests to include PriorityClass testing utilities
- Move priority class design document to Implemented folder
- Add comprehensive unit tests for all priority class implementations
- Update documentation for priority class configuration
- Add changelog entry for vmware-tanzu#8883

Signed-off-by: Tiger Kaovilai <[email protected]>

remove unused test utils

Signed-off-by: Tiger Kaovilai <[email protected]>

feat: add unit test for getting priority class name in maintenance jobs

Signed-off-by: Tiger Kaovilai <[email protected]>

doc update

Signed-off-by: Tiger Kaovilai <[email protected]>

feat: add priority class validation for repository maintenance jobs

- Add ValidatePriorityClassWithClient function to validate priority class existence
- Integrate validation in maintenance.go when creating maintenance jobs
- Update tests to cover the new validation functionality
- Return boolean from ValidatePriorityClass to allow fallback behavior

This ensures maintenance jobs don't fail due to non-existent priority classes,
following the same pattern used for data mover pods.

Addresses feedback from:
vmware-tanzu#8883 (comment)

Refs vmware-tanzu#8869

Signed-off-by: Tiger Kaovilai <[email protected]>

refactor: clean up priority class handling for data mover pods

- Fix comment in node_agent.go to clarify PriorityClassName is only for data mover pods
- Simplify server.go to use dataPathConfigs.PriorityClassName directly
- Remove redundant priority class logging from controllers as it's already logged during server startup
- Keep logging centralized in the node-agent server initialization

This reduces code duplication and clarifies the scope of priority class configuration.

🤖 Generated with [Claude Code](https://claude.ai/code)

Signed-off-by: Tiger Kaovilai <[email protected]>

refactor: remove GetDataMoverPriorityClassName from kube utilities

Remove GetDataMoverPriorityClassName function and its tests as priority
class is now read directly from dataPathConfigs instead of parsing from
ConfigMap. This simplifies the codebase by eliminating the need for
indirect ConfigMap parsing.

Refs vmware-tanzu#8869

🤖 Generated with [Claude Code](https://claude.ai/code)

Signed-off-by: Tiger Kaovilai <[email protected]>

refactor: remove priority class validation from install command

Remove priority class validation during install as it's redundant
since validation already occurs during server startup. Users cannot
see console logs during install, making the validation warnings
ineffective at this stage.

The validation remains in place during server and node-agent startup
where it's more appropriate and visible to users.

Signed-off-by: Tiger Kaovilai <[email protected]>
Co-Authored-By: Claude <[email protected]>
@blackpiglet blackpiglet merged commit 0c4055c into vmware-tanzu:main Aug 6, 2025
46 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support configuring PriorityClass for Velero pods(server/node-agent/maintenence/data mover/etc.)

5 participants