-
Notifications
You must be signed in to change notification settings - Fork 736
ci: validate dgd in CI and pre-commit #4366
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: alec-flowers <[email protected]>
Signed-off-by: alec-flowers <[email protected]>
Signed-off-by: alec-flowers <[email protected]>
WalkthroughThe changes introduce a comprehensive deployment validation framework for DynamoGraphDeployment YAML files, including a Python validation utility integrated into CI/CD workflows and pre-commit hooks. Additional updates include sample CRD configurations and refactored example manifests using shared PVC volumes. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes
Poem
Pre-merge checks❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Signed-off-by: alec-flowers <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (5)
deploy/utils/README_VALIDATION.md (1)
45-57: Tighten markdown: add fenced code languages and fix “GitHub” capitalizationTo keep markdown tooling quiet and improve readability:
- Add explicit languages to the non-bash fenced blocks (e.g.,
```textfor the “Valid files” / “Validation errors” output and the CRD path snippet).- Ensure “GitHub” is capitalized consistently where mentioned in prose.
These are small cleanups but will clear the current markdownlint/LanguageTool warnings.
Also applies to: 61-65, 81-95
.github/workflows/validate-deployments.yml (1)
18-37: Consider aligning CI path filters with the pre-commit hookThe workflow currently triggers only for YAML changes under
recipes/**andexamples/**(plus the validator script and CRD). The pre-commit hook, however, covers(recipes|examples|tests|deploy)/.*\.(yaml|yml)$.If someone changes a
DynamoGraphDeploymentYAML only underdeploy/**(ortests/**), CI won’t run this validator, even though the pre-commit hook would. To keep merge-time validation consistent, consider expanding thepathsfilters to include at leastdeploy/**/*.yml?(and optionallytests/**/*.yml?), e.g.:pull_request: paths: - 'recipes/**/*.yaml' - 'recipes/**/*.yml' - 'examples/**/*.yaml' - 'examples/**/*.yml' + - 'deploy/**/*.yaml' + - 'deploy/**/*.yml' - 'deploy/utils/validate_deployments.py' - 'deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml' push: @@ paths: - 'recipes/**/*.yaml' - 'recipes/**/*.yml' - 'examples/**/*.yaml' - 'examples/**/*.yml' + - 'deploy/**/*.yaml' + - 'deploy/**/*.yml' - 'deploy/utils/validate_deployments.py' - 'deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml'.pre-commit-config.yaml (1)
88-97: Leverage the pre-commit Python environment instead of hard-codingpython3The new hook looks good and matches the directories you care about. One small portability improvement:
Since
language: pythonalready runs the hook inside a managed virtualenv, you can rely on that interpreter instead of hard-codingpython3in the entry. For example:- - id: validate-deploy-yaml - name: Validate DynamoGraphDeployment YAMLs - entry: python3 deploy/utils/validate_deployments.py - language: python + - id: validate-deploy-yaml + name: Validate DynamoGraphDeployment YAMLs + entry: deploy/utils/validate_deployments.py + language: python files: '(recipes|examples|tests|deploy)/.*\.(yaml|yml)$' pass_filenames: true additional_dependencies: [pyyaml, jsonschema]This keeps the hook from depending on a
python3binary being onPATHand lets pre-commit fully manage the interpreter.Please double-check this against the current pre-commit docs to ensure it matches their recommended pattern.
deploy/utils/validate_deployments.py (2)
341-349:--strictflag is currently unusedThe
--strictflag is parsed but never consulted when validating or reporting errors, which can confuse users and linters.Either wire it into behavior (e.g., to distinguish warnings vs. errors if you later add warnings) or remove the flag for now to keep the CLI surface minimal.
41-58: Broadexcept Exceptionblocks are user-friendly but may mask unexpected failuresBoth
_load_crd_schemaandvalidate_fileuseexcept Exceptionand then print a generic error (or return a generic message). This is convenient for CLI UX but can hide programming or environment errors and makes debugging harder.If you want to tighten this up without regressing UX, consider:
- Narrowing the exceptions to explicit types (
FileNotFoundError,yaml.YAMLError,ValueError,OSError) where possible.- For truly unexpected exceptions, letting them propagate so CI fails loudly with a traceback.
This is non-blocking, but will make the validator more maintainable over time.
Also applies to: 71-75
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
.github/workflows/validate-deployments.yml(1 hunks).pre-commit-config.yaml(1 hunks)deploy/cloud/operator/config/samples/nvidia.com_v1alpha1_dynamographdeployment.yaml(1 hunks)deploy/utils/README_VALIDATION.md(1 hunks)deploy/utils/validate_deployments.py(1 hunks)examples/basics/kubernetes/shared_frontend/shared_frontend.yaml(4 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
Learnt from: nv-hwoo
Repo: ai-dynamo/dynamo PR: 4112
File: examples/backends/sglang/deploy/agg.yaml:76-78
Timestamp: 2025-11-05T20:23:29.539Z
Learning: In DynamoGraphDeployment (DGD) YAML manifests, the volumeMounts field uses `mountPoint` (not the standard Kubernetes `mountPath`) to specify where to mount volumes. This is defined in the DGD custom resource definition API.
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 3100
File: deploy/cloud/operator/cmd/main.go:186-190
Timestamp: 2025-09-17T22:35:40.674Z
Learning: The mpiRunSecretName validation in deploy/cloud/operator/cmd/main.go is safe for upgrades because the Helm chart automatically populates dynamo-operator.dynamo.mpiRun.secretName with a default value of "mpi-run-ssh-secret" and includes SSH key generation functionality via sshKeygen.enabled: true.
📚 Learning: 2025-11-05T20:23:29.539Z
Learnt from: nv-hwoo
Repo: ai-dynamo/dynamo PR: 4112
File: examples/backends/sglang/deploy/agg.yaml:76-78
Timestamp: 2025-11-05T20:23:29.539Z
Learning: In DynamoGraphDeployment (DGD) YAML manifests, the volumeMounts field uses `mountPoint` (not the standard Kubernetes `mountPath`) to specify where to mount volumes. This is defined in the DGD custom resource definition API.
Applied to files:
examples/basics/kubernetes/shared_frontend/shared_frontend.yaml
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.
Applied to files:
deploy/cloud/operator/config/samples/nvidia.com_v1alpha1_dynamographdeployment.yaml
📚 Learning: 2025-07-18T16:04:31.771Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.
Applied to files:
deploy/utils/README_VALIDATION.md
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4366/merge) by alec-flowers.
deploy/utils/validate_deployments.py
[error] 422-422: Validation failed: AttributeError: 'list' object has no attribute 'get' (during DynamoGraphDeployment validation)
[error] 398-398: Validation failed while validating YAMLs: is_valid, errors = validator.validate_file(yaml_file)
[error] 80-80: Validation failed: DynamoGraphDeployment doc is a list, not a mapping
🪛 LanguageTool
deploy/utils/README_VALIDATION.md
[uncategorized] ~41-~41: The official name of this software platform is spelled with a capital “H”.
Context: ... ### CI Workflow Runs on every PR via .github/workflows/validate-deployments.yml. #...
(GITHUB)
🪛 markdownlint-cli2 (0.18.1)
deploy/utils/README_VALIDATION.md
46-46: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
52-52: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
62-62: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.4)
deploy/utils/validate_deployments.py
52-52: Abstract raise to an inner function
(TRY301)
52-52: Avoid specifying long messages outside the exception class
(TRY003)
55-55: Abstract raise to an inner function
(TRY301)
55-55: Avoid specifying long messages outside the exception class
(TRY003)
56-56: Do not catch blind exception: Exception
(BLE001)
73-73: Do not catch blind exception: Exception
(BLE001)
88-88: Loop control variable doc_idx not used within loop body
Rename unused doc_idx to _doc_idx
(B007)
167-167: Consider [*data_path, key] instead of concatenation
Replace with [*data_path, key]
(RUF005)
178-178: Consider [*schema_path, key] instead of concatenation
Replace with [*schema_path, key]
(RUF005)
178-178: Consider [*data_path, key] instead of concatenation
Replace with [*data_path, key]
(RUF005)
187-187: Consider [*schema_path, key, idx] instead of concatenation
Replace with [*schema_path, key, idx]
(RUF005)
188-188: Consider [*data_path, key, idx] instead of concatenation
Replace with [*data_path, key, idx]
(RUF005)
267-267: Consider moving this statement to an else block
(TRY300)
268-268: Do not catch blind exception: Exception
(BLE001)
272-272: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
299-299: Starting a process with a partial executable path
(S607)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: sglang (amd64)
- GitHub Check: vllm (amd64)
- GitHub Check: trtllm (amd64)
- GitHub Check: operator (amd64)
- GitHub Check: Build and Test - dynamo
- GitHub Check: Mirror Repository to GitLab
🔇 Additional comments (2)
deploy/cloud/operator/config/samples/nvidia.com_v1alpha1_dynamographdeployment.yaml (1)
24-29: Ensure sample manifest satisfies CRD required fieldsWith the new validator in place, this sample
DynamoGraphDeploymentwill now be schema-validated. Right now, the two services only setdynamoNamespace. If the CRD marks additional fields as required for a service (e.g.,componentType, maybereplicas), this sample could start failing validation and break CI.Please run the new validator against this file and, if needed, flesh out the service specs so the sample is fully CRD-compliant.
examples/basics/kubernetes/shared_frontend/shared_frontend.yaml (1)
33-41: Shared PVC +mountPointusage looks consistent with the DGD APIThe switch to a top-level
pvcsdeclaration and per-servicevolumeMountsreferencingdynamo-model-cachelooks good, and the use ofmountPoint(rather than KubernetesmountPath) matches the DynamoGraphDeployment CRD semantics for volume mounts. This should play nicely with the new validator.Based on learnings
Also applies to: 64-71, 89-91, 109-111
| # Find DynamoGraphDeployment documents | ||
| dgd_docs = [ | ||
| (i, doc) | ||
| for i, doc in enumerate(docs) | ||
| if doc and doc.get("kind") == "DynamoGraphDeployment" | ||
| ] | ||
|
|
||
| if not dgd_docs: | ||
| return True, [] # No DGD resources, skip validation | ||
|
|
||
| # Validate each DynamoGraphDeployment | ||
| errors = [] | ||
| for doc_idx, doc in dgd_docs: | ||
| self._file_content = content | ||
| errors.extend(self._validate_document(doc)) | ||
|
|
||
| return len(errors) == 0, errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix crash on non-mapping YAML documents in validate_file (root cause of CI failure)
The pre-merge pipeline error AttributeError: 'list' object has no attribute 'get' matches this list comprehension: it assumes every YAML document is a mapping and calls .get("kind") unconditionally. For YAML files whose top-level document is a sequence (e.g., a list of resources), doc is a list and this blows up, breaking CI.
Guarding on dict here makes the validator robust while still validating all proper DynamoGraphDeployment objects.
Suggested fix:
- # Find DynamoGraphDeployment documents
- dgd_docs = [
- (i, doc)
- for i, doc in enumerate(docs)
- if doc and doc.get("kind") == "DynamoGraphDeployment"
- ]
+ # Find DynamoGraphDeployment documents; skip non-mapping docs safely
+ dgd_docs = [
+ (i, doc)
+ for i, doc in enumerate(docs)
+ if isinstance(doc, dict)
+ and doc.get("kind") == "DynamoGraphDeployment"
+ ]
@@
- # Validate each DynamoGraphDeployment
- errors = []
- for doc_idx, doc in dgd_docs:
- self._file_content = content
- errors.extend(self._validate_document(doc))
+ # Validate each DynamoGraphDeployment
+ errors: List[str] = []
+ for _doc_idx, doc in dgd_docs:
+ # Store full file content so we can map schema errors back to line numbers
+ self._file_content = content
+ errors.extend(self._validate_document(doc))This should eliminate the list/.get crash while keeping behavior unchanged for valid DGD documents.
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4366/merge) by alec-flowers.
[error] 80-80: Validation failed: DynamoGraphDeployment doc is a list, not a mapping
🪛 Ruff (0.14.4)
88-88: Loop control variable doc_idx not used within loop body
Rename unused doc_idx to _doc_idx
(B007)
🤖 Prompt for AI Agents
In deploy/utils/validate_deployments.py around lines 76 to 92, the list
comprehension assumes every YAML document is a mapping and calls
doc.get("kind"), which crashes on non-mapping documents (e.g. lists); change the
filter to only consider dicts before accessing .get (e.g. check isinstance(doc,
dict) and then doc.get("kind") == "DynamoGraphDeployment") so non-mapping YAML
documents are ignored and validation proceeds normally.
| def has_dgd_resource(yaml_file: Path) -> bool: | ||
| """Check if a YAML file contains DynamoGraphDeployment resources.""" | ||
| try: | ||
| with open(yaml_file) as f: | ||
| for doc in yaml.safe_load_all(f): | ||
| if doc and doc.get("kind") == "DynamoGraphDeployment": | ||
| return True | ||
| return False | ||
| except Exception: | ||
| return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid the same list.get crash and silent skipping in has_dgd_resource
has_dgd_resource has the same doc.get("kind") assumption and will also throw AttributeError when a YAML document is a list. Because find_deploy_files relies on this helper to discover candidate files, a crash here can prevent the validator from even starting.
Additionally, catching a blind Exception and returning False means any parse error silently hides a file from validation.
Suggested change to address both:
def has_dgd_resource(yaml_file: Path) -> bool:
"""Check if a YAML file contains DynamoGraphDeployment resources."""
try:
with open(yaml_file) as f:
- for doc in yaml.safe_load_all(f):
- if doc and doc.get("kind") == "DynamoGraphDeployment":
- return True
- return False
- except Exception:
- return False
+ for doc in yaml.safe_load_all(f):
+ if isinstance(doc, dict) and doc.get("kind") == "DynamoGraphDeployment":
+ return True
+ return False
+ except yaml.YAMLError as e:
+ # Surface YAML issues instead of silently skipping potentially relevant files
+ print(f"YAML parsing error while scanning {yaml_file}: {e}")
+ return FalseThis mirrors the validate_file fix (no .get on lists) and at least logs parse errors instead of silently ignoring them. If you’d prefer parse errors to fail the entire run, we can instead have this re-raise and let main() handle it.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def has_dgd_resource(yaml_file: Path) -> bool: | |
| """Check if a YAML file contains DynamoGraphDeployment resources.""" | |
| try: | |
| with open(yaml_file) as f: | |
| for doc in yaml.safe_load_all(f): | |
| if doc and doc.get("kind") == "DynamoGraphDeployment": | |
| return True | |
| return False | |
| except Exception: | |
| return False | |
| def has_dgd_resource(yaml_file: Path) -> bool: | |
| """Check if a YAML file contains DynamoGraphDeployment resources.""" | |
| try: | |
| with open(yaml_file) as f: | |
| for doc in yaml.safe_load_all(f): | |
| if isinstance(doc, dict) and doc.get("kind") == "DynamoGraphDeployment": | |
| return True | |
| return False | |
| except yaml.YAMLError as e: | |
| # Surface YAML issues instead of silently skipping potentially relevant files | |
| print(f"YAML parsing error while scanning {yaml_file}: {e}") | |
| return False |
🧰 Tools
🪛 Ruff (0.14.4)
267-267: Consider moving this statement to an else block
(TRY300)
268-268: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In deploy/utils/validate_deployments.py around lines 260 to 269,
has_dgd_resource currently assumes each YAML doc is a dict and calls
doc.get("kind"), which will raise AttributeError for list documents, and it also
swallows all exceptions returning False; change the logic to check the document
type before accessing "kind" (e.g., ensure doc is a dict) and only examine dicts
for kind == "DynamoGraphDeployment", and replace the broad except that silently
hides parse errors with explicit handling that logs the parse/error (or
re-raises if you prefer to fail the run) so parse errors aren’t silently ignored
and candidate files aren’t dropped without notice.
| # Find CRD schema | ||
| crd_path = args.crd or ( | ||
| workspace_root | ||
| / "deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we also have a bunch of DGDR if I'm not mistaken
|
That's great, thanks @alec-flowers . |
tmonty12
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pointed out some pre-existing tooling but overall this is great - I like the pre-commit hook and workflow for catching validation errors
| """Initialize validator with CRD schema.""" | ||
| self.crd_path = crd_path | ||
| self.schema = self._load_crd_schema() | ||
| self.validator = Draft7Validator(self.schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Draft7Validator is for validation of JSON Schema Draft 7 which OAI v3 is an extension of - so it might miss edge cases (although this definitely suffices for the use case)
There's actually a kubernetes SIG tool built for local validation of CRs against CRDs called kubectl-validate.
An example:
(dynamo) tom:~/Dynamo/repos/dynamo (aflowers/validate-dgd) $ kubectl-validate deploy/discovery/dgd.yaml --local-crds deploy/cloud/helm/crds/templates
deploy/discovery/dgd.yaml...OKMaking the dgd.yaml invalid:
(dynamo) tom:~/Dynamo/repos/dynamo (aflowers/validate-dgd) $ kubectl-validate deploy/discovery/dgd.yaml --local-crds deploy/cloud/helm/crds/templates
deploy/discovery/dgd.yaml...ERROR
DynamoGraphDeployment.nvidia.com "dynamo" is invalid: [spec.backendFramework: Unsupported value: "test": supported values: "sglang", "vllm", "trtllm", <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Error: validation failed
(dynamo) tom:~/Dynamo/repos/dynamo (aflowers/validate-dgd) $ I think the logic for finding the manifests would still be needed but just wanted to point this tooling out
Overview:
To avoid DGD's being committed to the repo that will not pass validation, added a pre-commit and CI/CD check.
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Documentation
Chores