Skip to content

Conversation

@alec-flowers
Copy link
Contributor

@alec-flowers alec-flowers commented Nov 14, 2025

Overview:

To avoid DGD's being committed to the repo that will not pass validation, added a pre-commit and CI/CD check.

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Added automated validation for deployment YAML files to catch configuration errors early in the development process.
    • Validation now runs in CI workflows and as a pre-commit hook.
  • Documentation

    • Added comprehensive guide for the deployment YAML validation tool, including quickstart instructions and troubleshooting steps.
  • Chores

    • Updated deployment examples to use shared resource volumes instead of per-service configurations.

@alec-flowers alec-flowers requested review from a team as code owners November 14, 2025 21:48
@github-actions github-actions bot added the ci Issues/PRs that reference CI build/test label Nov 14, 2025
Signed-off-by: alec-flowers <[email protected]>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 14, 2025

Walkthrough

The changes introduce a comprehensive deployment validation framework for DynamoGraphDeployment YAML files, including a Python validation utility integrated into CI/CD workflows and pre-commit hooks. Additional updates include sample CRD configurations and refactored example manifests using shared PVC volumes.

Changes

Cohort / File(s) Summary
Validation Framework & CI/CD Integration
.github/workflows/validate-deployments.yml, .pre-commit-config.yaml
Adds GitHub Actions workflow and pre-commit hook to validate DynamoGraphDeployment YAML files against CRD OpenAPI v3 schema on repository changes, with triggers for recipes, examples, tests, and deploy paths.
Validation Tool Implementation
deploy/utils/validate_deployments.py, deploy/utils/README_VALIDATION.md
Introduces Python validation utility with CRD schema loading, YAML parsing, error reporting with line mapping, structural linting, and file discovery logic; includes comprehensive documentation covering features, usage, integration, examples, and troubleshooting.
Sample & Example Updates
deploy/cloud/operator/config/samples/nvidia.com_v1alpha1_dynamographdeployment.yaml, examples/basics/kubernetes/shared_frontend/shared_frontend.yaml
Updates sample CRD to include services block with namespace configuration; refactors example manifest to replace per-service PVC blocks with shared PVC volume references and centralized PVC declarations.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • deploy/utils/validate_deployments.py: Requires careful review of schema extraction logic, JSON Schema validation, line number mapping heuristics, error formatting, and file discovery mechanisms with git integration fallback.
  • Error handling paths: Verify robust handling of malformed YAML, missing CRD schemas, and PyYAML/jsonschema import failures.
  • Public API surface: Confirm DeploymentValidator class, has_dgd_resource(), and find_deploy_files() function signatures and behaviors are well-defined.
  • YAML example refactoring: Verify volumeMount structural consistency across all service definitions in the shared frontend example.

Poem

🐰 A hop and a skip, validation's afoot!
Schema checks dance in the CI boot,
Pre-commit hooks catch what should not,
Shared volumes simplify, tight as they're ought,
DGD deployments now validated for the route! 🎉

Pre-merge checks

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Description check ❓ Inconclusive The description is incomplete; it provides an overview but lacks detailed explanation of changes, reviewer guidance, and a valid related issue reference. Expand the Details section with specific changes (GitHub Actions workflow, pre-commit hook, validation script), and update Related Issues with an actual issue number or remove the placeholder.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding pre-commit and CI validation for DGD (DynamoGraphDeployment) YAML files.
Docstring Coverage ✅ Passed Docstring coverage is 92.31% which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: alec-flowers <[email protected]>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (5)
deploy/utils/README_VALIDATION.md (1)

45-57: Tighten markdown: add fenced code languages and fix “GitHub” capitalization

To keep markdown tooling quiet and improve readability:

  • Add explicit languages to the non-bash fenced blocks (e.g., ```text for the “Valid files” / “Validation errors” output and the CRD path snippet).
  • Ensure “GitHub” is capitalized consistently where mentioned in prose.

These are small cleanups but will clear the current markdownlint/LanguageTool warnings.

Also applies to: 61-65, 81-95

.github/workflows/validate-deployments.yml (1)

18-37: Consider aligning CI path filters with the pre-commit hook

The workflow currently triggers only for YAML changes under recipes/** and examples/** (plus the validator script and CRD). The pre-commit hook, however, covers (recipes|examples|tests|deploy)/.*\.(yaml|yml)$.

If someone changes a DynamoGraphDeployment YAML only under deploy/** (or tests/**), CI won’t run this validator, even though the pre-commit hook would. To keep merge-time validation consistent, consider expanding the paths filters to include at least deploy/**/*.yml? (and optionally tests/**/*.yml?), e.g.:

   pull_request:
     paths:
       - 'recipes/**/*.yaml'
       - 'recipes/**/*.yml'
       - 'examples/**/*.yaml'
       - 'examples/**/*.yml'
+      - 'deploy/**/*.yaml'
+      - 'deploy/**/*.yml'
       - 'deploy/utils/validate_deployments.py'
       - 'deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml'
   push:
@@
     paths:
       - 'recipes/**/*.yaml'
       - 'recipes/**/*.yml'
       - 'examples/**/*.yaml'
       - 'examples/**/*.yml'
+      - 'deploy/**/*.yaml'
+      - 'deploy/**/*.yml'
       - 'deploy/utils/validate_deployments.py'
       - 'deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml'
.pre-commit-config.yaml (1)

88-97: Leverage the pre-commit Python environment instead of hard-coding python3

The new hook looks good and matches the directories you care about. One small portability improvement:

Since language: python already runs the hook inside a managed virtualenv, you can rely on that interpreter instead of hard-coding python3 in the entry. For example:

-  - id: validate-deploy-yaml
-    name: Validate DynamoGraphDeployment YAMLs
-    entry: python3 deploy/utils/validate_deployments.py
-    language: python
+  - id: validate-deploy-yaml
+    name: Validate DynamoGraphDeployment YAMLs
+    entry: deploy/utils/validate_deployments.py
+    language: python
     files: '(recipes|examples|tests|deploy)/.*\.(yaml|yml)$'
     pass_filenames: true
     additional_dependencies: [pyyaml, jsonschema]

This keeps the hook from depending on a python3 binary being on PATH and lets pre-commit fully manage the interpreter.

Please double-check this against the current pre-commit docs to ensure it matches their recommended pattern.

deploy/utils/validate_deployments.py (2)

341-349: --strict flag is currently unused

The --strict flag is parsed but never consulted when validating or reporting errors, which can confuse users and linters.

Either wire it into behavior (e.g., to distinguish warnings vs. errors if you later add warnings) or remove the flag for now to keep the CLI surface minimal.


41-58: Broad except Exception blocks are user-friendly but may mask unexpected failures

Both _load_crd_schema and validate_file use except Exception and then print a generic error (or return a generic message). This is convenient for CLI UX but can hide programming or environment errors and makes debugging harder.

If you want to tighten this up without regressing UX, consider:

  • Narrowing the exceptions to explicit types (FileNotFoundError, yaml.YAMLError, ValueError, OSError) where possible.
  • For truly unexpected exceptions, letting them propagate so CI fails loudly with a traceback.

This is non-blocking, but will make the validator more maintainable over time.

Also applies to: 71-75

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7702f49 and 7f85a7a.

📒 Files selected for processing (6)
  • .github/workflows/validate-deployments.yml (1 hunks)
  • .pre-commit-config.yaml (1 hunks)
  • deploy/cloud/operator/config/samples/nvidia.com_v1alpha1_dynamographdeployment.yaml (1 hunks)
  • deploy/utils/README_VALIDATION.md (1 hunks)
  • deploy/utils/validate_deployments.py (1 hunks)
  • examples/basics/kubernetes/shared_frontend/shared_frontend.yaml (4 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
Learnt from: nv-hwoo
Repo: ai-dynamo/dynamo PR: 4112
File: examples/backends/sglang/deploy/agg.yaml:76-78
Timestamp: 2025-11-05T20:23:29.539Z
Learning: In DynamoGraphDeployment (DGD) YAML manifests, the volumeMounts field uses `mountPoint` (not the standard Kubernetes `mountPath`) to specify where to mount volumes. This is defined in the DGD custom resource definition API.
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 3100
File: deploy/cloud/operator/cmd/main.go:186-190
Timestamp: 2025-09-17T22:35:40.674Z
Learning: The mpiRunSecretName validation in deploy/cloud/operator/cmd/main.go is safe for upgrades because the Helm chart automatically populates dynamo-operator.dynamo.mpiRun.secretName with a default value of "mpi-run-ssh-secret" and includes SSH key generation functionality via sshKeygen.enabled: true.
📚 Learning: 2025-11-05T20:23:29.539Z
Learnt from: nv-hwoo
Repo: ai-dynamo/dynamo PR: 4112
File: examples/backends/sglang/deploy/agg.yaml:76-78
Timestamp: 2025-11-05T20:23:29.539Z
Learning: In DynamoGraphDeployment (DGD) YAML manifests, the volumeMounts field uses `mountPoint` (not the standard Kubernetes `mountPath`) to specify where to mount volumes. This is defined in the DGD custom resource definition API.

Applied to files:

  • examples/basics/kubernetes/shared_frontend/shared_frontend.yaml
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.

Applied to files:

  • deploy/cloud/operator/config/samples/nvidia.com_v1alpha1_dynamographdeployment.yaml
📚 Learning: 2025-07-18T16:04:31.771Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

Applied to files:

  • deploy/utils/README_VALIDATION.md
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4366/merge) by alec-flowers.
deploy/utils/validate_deployments.py

[error] 422-422: Validation failed: AttributeError: 'list' object has no attribute 'get' (during DynamoGraphDeployment validation)


[error] 398-398: Validation failed while validating YAMLs: is_valid, errors = validator.validate_file(yaml_file)


[error] 80-80: Validation failed: DynamoGraphDeployment doc is a list, not a mapping

🪛 LanguageTool
deploy/utils/README_VALIDATION.md

[uncategorized] ~41-~41: The official name of this software platform is spelled with a capital “H”.
Context: ... ### CI Workflow Runs on every PR via .github/workflows/validate-deployments.yml. #...

(GITHUB)

🪛 markdownlint-cli2 (0.18.1)
deploy/utils/README_VALIDATION.md

46-46: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


52-52: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


62-62: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.4)
deploy/utils/validate_deployments.py

52-52: Abstract raise to an inner function

(TRY301)


52-52: Avoid specifying long messages outside the exception class

(TRY003)


55-55: Abstract raise to an inner function

(TRY301)


55-55: Avoid specifying long messages outside the exception class

(TRY003)


56-56: Do not catch blind exception: Exception

(BLE001)


73-73: Do not catch blind exception: Exception

(BLE001)


88-88: Loop control variable doc_idx not used within loop body

Rename unused doc_idx to _doc_idx

(B007)


167-167: Consider [*data_path, key] instead of concatenation

Replace with [*data_path, key]

(RUF005)


178-178: Consider [*schema_path, key] instead of concatenation

Replace with [*schema_path, key]

(RUF005)


178-178: Consider [*data_path, key] instead of concatenation

Replace with [*data_path, key]

(RUF005)


187-187: Consider [*schema_path, key, idx] instead of concatenation

Replace with [*schema_path, key, idx]

(RUF005)


188-188: Consider [*data_path, key, idx] instead of concatenation

Replace with [*data_path, key, idx]

(RUF005)


267-267: Consider moving this statement to an else block

(TRY300)


268-268: Do not catch blind exception: Exception

(BLE001)


272-272: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


299-299: Starting a process with a partial executable path

(S607)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: operator (amd64)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: Mirror Repository to GitLab
🔇 Additional comments (2)
deploy/cloud/operator/config/samples/nvidia.com_v1alpha1_dynamographdeployment.yaml (1)

24-29: Ensure sample manifest satisfies CRD required fields

With the new validator in place, this sample DynamoGraphDeployment will now be schema-validated. Right now, the two services only set dynamoNamespace. If the CRD marks additional fields as required for a service (e.g., componentType, maybe replicas), this sample could start failing validation and break CI.

Please run the new validator against this file and, if needed, flesh out the service specs so the sample is fully CRD-compliant.

examples/basics/kubernetes/shared_frontend/shared_frontend.yaml (1)

33-41: Shared PVC + mountPoint usage looks consistent with the DGD API

The switch to a top-level pvcs declaration and per-service volumeMounts referencing dynamo-model-cache looks good, and the use of mountPoint (rather than Kubernetes mountPath) matches the DynamoGraphDeployment CRD semantics for volume mounts. This should play nicely with the new validator.

Based on learnings

Also applies to: 64-71, 89-91, 109-111

Comment on lines 76 to 92
# Find DynamoGraphDeployment documents
dgd_docs = [
(i, doc)
for i, doc in enumerate(docs)
if doc and doc.get("kind") == "DynamoGraphDeployment"
]

if not dgd_docs:
return True, [] # No DGD resources, skip validation

# Validate each DynamoGraphDeployment
errors = []
for doc_idx, doc in dgd_docs:
self._file_content = content
errors.extend(self._validate_document(doc))

return len(errors) == 0, errors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix crash on non-mapping YAML documents in validate_file (root cause of CI failure)

The pre-merge pipeline error AttributeError: 'list' object has no attribute 'get' matches this list comprehension: it assumes every YAML document is a mapping and calls .get("kind") unconditionally. For YAML files whose top-level document is a sequence (e.g., a list of resources), doc is a list and this blows up, breaking CI.

Guarding on dict here makes the validator robust while still validating all proper DynamoGraphDeployment objects.

Suggested fix:

-        # Find DynamoGraphDeployment documents
-        dgd_docs = [
-            (i, doc)
-            for i, doc in enumerate(docs)
-            if doc and doc.get("kind") == "DynamoGraphDeployment"
-        ]
+        # Find DynamoGraphDeployment documents; skip non-mapping docs safely
+        dgd_docs = [
+            (i, doc)
+            for i, doc in enumerate(docs)
+            if isinstance(doc, dict)
+            and doc.get("kind") == "DynamoGraphDeployment"
+        ]
@@
-        # Validate each DynamoGraphDeployment
-        errors = []
-        for doc_idx, doc in dgd_docs:
-            self._file_content = content
-            errors.extend(self._validate_document(doc))
+        # Validate each DynamoGraphDeployment
+        errors: List[str] = []
+        for _doc_idx, doc in dgd_docs:
+            # Store full file content so we can map schema errors back to line numbers
+            self._file_content = content
+            errors.extend(self._validate_document(doc))

This should eliminate the list/.get crash while keeping behavior unchanged for valid DGD documents.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4366/merge) by alec-flowers.

[error] 80-80: Validation failed: DynamoGraphDeployment doc is a list, not a mapping

🪛 Ruff (0.14.4)

88-88: Loop control variable doc_idx not used within loop body

Rename unused doc_idx to _doc_idx

(B007)

🤖 Prompt for AI Agents
In deploy/utils/validate_deployments.py around lines 76 to 92, the list
comprehension assumes every YAML document is a mapping and calls
doc.get("kind"), which crashes on non-mapping documents (e.g. lists); change the
filter to only consider dicts before accessing .get (e.g. check isinstance(doc,
dict) and then doc.get("kind") == "DynamoGraphDeployment") so non-mapping YAML
documents are ignored and validation proceeds normally.

Comment on lines 260 to 269
def has_dgd_resource(yaml_file: Path) -> bool:
"""Check if a YAML file contains DynamoGraphDeployment resources."""
try:
with open(yaml_file) as f:
for doc in yaml.safe_load_all(f):
if doc and doc.get("kind") == "DynamoGraphDeployment":
return True
return False
except Exception:
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Avoid the same list.get crash and silent skipping in has_dgd_resource

has_dgd_resource has the same doc.get("kind") assumption and will also throw AttributeError when a YAML document is a list. Because find_deploy_files relies on this helper to discover candidate files, a crash here can prevent the validator from even starting.

Additionally, catching a blind Exception and returning False means any parse error silently hides a file from validation.

Suggested change to address both:

 def has_dgd_resource(yaml_file: Path) -> bool:
     """Check if a YAML file contains DynamoGraphDeployment resources."""
     try:
         with open(yaml_file) as f:
-            for doc in yaml.safe_load_all(f):
-                if doc and doc.get("kind") == "DynamoGraphDeployment":
-                    return True
-        return False
-    except Exception:
-        return False
+            for doc in yaml.safe_load_all(f):
+                if isinstance(doc, dict) and doc.get("kind") == "DynamoGraphDeployment":
+                    return True
+        return False
+    except yaml.YAMLError as e:
+        # Surface YAML issues instead of silently skipping potentially relevant files
+        print(f"YAML parsing error while scanning {yaml_file}: {e}")
+        return False

This mirrors the validate_file fix (no .get on lists) and at least logs parse errors instead of silently ignoring them. If you’d prefer parse errors to fail the entire run, we can instead have this re-raise and let main() handle it.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def has_dgd_resource(yaml_file: Path) -> bool:
"""Check if a YAML file contains DynamoGraphDeployment resources."""
try:
with open(yaml_file) as f:
for doc in yaml.safe_load_all(f):
if doc and doc.get("kind") == "DynamoGraphDeployment":
return True
return False
except Exception:
return False
def has_dgd_resource(yaml_file: Path) -> bool:
"""Check if a YAML file contains DynamoGraphDeployment resources."""
try:
with open(yaml_file) as f:
for doc in yaml.safe_load_all(f):
if isinstance(doc, dict) and doc.get("kind") == "DynamoGraphDeployment":
return True
return False
except yaml.YAMLError as e:
# Surface YAML issues instead of silently skipping potentially relevant files
print(f"YAML parsing error while scanning {yaml_file}: {e}")
return False
🧰 Tools
🪛 Ruff (0.14.4)

267-267: Consider moving this statement to an else block

(TRY300)


268-268: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In deploy/utils/validate_deployments.py around lines 260 to 269,
has_dgd_resource currently assumes each YAML doc is a dict and calls
doc.get("kind"), which will raise AttributeError for list documents, and it also
swallows all exceptions returning False; change the logic to check the document
type before accessing "kind" (e.g., ensure doc is a dict) and only examine dicts
for kind == "DynamoGraphDeployment", and replace the broad except that silently
hides parse errors with explicit handling that logs the parse/error (or
re-raises if you prefer to fail the run) so parse errors aren’t silently ignored
and candidate files aren’t dropped without notice.

# Find CRD schema
crd_path = args.crd or (
workspace_root
/ "deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also have a bunch of DGDR if I'm not mistaken

@julienmancuso
Copy link
Contributor

That's great, thanks @alec-flowers .
Personnally, I would have reused existing tools instead of creating a brand new script to maintain ;-)
kubeconform, kube-linter, ...

Copy link
Contributor

@tmonty12 tmonty12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pointed out some pre-existing tooling but overall this is great - I like the pre-commit hook and workflow for catching validation errors

"""Initialize validator with CRD schema."""
self.crd_path = crd_path
self.schema = self._load_crd_schema()
self.validator = Draft7Validator(self.schema)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Draft7Validator is for validation of JSON Schema Draft 7 which OAI v3 is an extension of - so it might miss edge cases (although this definitely suffices for the use case)

There's actually a kubernetes SIG tool built for local validation of CRs against CRDs called kubectl-validate.

An example:

(dynamo) tom:~/Dynamo/repos/dynamo (aflowers/validate-dgd) $ kubectl-validate deploy/discovery/dgd.yaml --local-crds deploy/cloud/helm/crds/templates

deploy/discovery/dgd.yaml...OK

Making the dgd.yaml invalid:

(dynamo) tom:~/Dynamo/repos/dynamo (aflowers/validate-dgd) $ kubectl-validate deploy/discovery/dgd.yaml --local-crds deploy/cloud/helm/crds/templates

deploy/discovery/dgd.yaml...ERROR
DynamoGraphDeployment.nvidia.com "dynamo" is invalid: [spec.backendFramework: Unsupported value: "test": supported values: "sglang", "vllm", "trtllm", <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Error: validation failed
(dynamo) tom:~/Dynamo/repos/dynamo (aflowers/validate-dgd) $ 

I think the logic for finding the manifests would still be needed but just wanted to point this tooling out

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci Issues/PRs that reference CI build/test size/XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants