feat(nodewright-customizations): add gb200 eks support #699
📝 Walkthrough
This change updates the Nodewright tuning package definitions to reference new package images and versions, upgrading …
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 4 passed
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
recipes/overlays/gb200-eks-training.yaml (1)
67-70: ⚠️ Potential issue | 🟡 Minor
Refresh obsolete no-op explanation in the training overlay.
Lines 67–70 say tuning is unavailable, but line 74 now consumes `tuning.yaml`. Please update the comment so it reflects the new GB200 EKS behavior.
🛠️ Proposed comment fix
- # GB200 uses nodewright no-op: the H100 tuning packages (nvidia-setup,
- # nvidia-tuned) are not compatible with GB200's ARM64 host CPU and
- # Blackwell GPU architecture. Replace with tuning.yaml once GB200-specific
- # packages are available.
+ # GB200 on EKS uses nodewright tuning.
+ # This overlay points to tuning.yaml to apply GB200-compatible setup/tuning
+ # packages for training workloads.
Also applies to: 74-74
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/overlays/gb200-eks-training.yaml` around lines 67 - 70, Update the explanatory comment around the GB200 EKS training overlay to reflect that GB200 now consumes tuning.yaml instead of using a nodewright no-op; replace the current lines mentioning "nodewright no-op" and that tuning packages are incompatible with GB200 with a concise note that GB200 uses a GB200-specific tuning.yaml (or placeholder until GB200-specific packages are available) and remove the outdated statement about H100/ARM64 incompatibility, referencing "tuning.yaml" and "GB200" so readers know where the actual tuning is now applied.
recipes/overlays/gb200-eks-inference.yaml (1)
61-64: ⚠️ Potential issue | 🟡 Minor
Update stale no-op commentary to match actual tuning configuration.
Lines 61–64 still say GB200 uses no-op and tuning is unavailable, but line 68 now points to `tuning.yaml`. Please update this comment to avoid misleading future changes.
🛠️ Proposed comment fix
- # GB200 uses nodewright no-op: the H100 tuning packages (nvidia-setup,
- # nvidia-tuned) are not compatible with GB200's ARM64 host CPU and
- # Blackwell GPU architecture. Replace with tuning.yaml once GB200-specific
- # packages are available.
+ # GB200 on EKS uses nodewright tuning.
+ # This overlay points to tuning.yaml to apply GB200-compatible setup/tuning
+ # packages for inference workloads.
Also applies to: 68-68
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/overlays/gb200-eks-inference.yaml` around lines 61 - 64, Update the stale comment in recipes/overlays/gb200-eks-inference.yaml: replace the existing "GB200 uses nodewright no-op..." text with a concise note that GB200 now references tuning.yaml for GB200-specific tuning (instead of claiming tuning is unavailable), mention that H100 packages are incompatible with GB200's ARM64/Blackwell but a GB200-specific tuning.yaml is provided, and update the comment near the reference to tuning.yaml so it accurately describes the current configuration.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 6ae5d415-69f6-4bac-be1c-5fe8bfd27a96
📒 Files selected for processing (3)
recipes/components/nodewright-customizations/manifests/tuning.yaml
recipes/overlays/gb200-eks-inference.yaml
recipes/overlays/gb200-eks-training.yaml
Coverage Report ✅
No Go source files changed in this PR.
Co-authored-by: Mark Chmarny <mchmarny@users.noreply.github.com>
Summary
Update nvidia-tuned and nvidia-setup to pick up bug fixes for GB200 on EKS.
Fixes: #656
Type of Change
Component(s) Affected
- cmd/aicr, pkg/cli
- cmd/aicrd, pkg/api, pkg/server
- pkg/recipe
- pkg/bundler, pkg/component/*
- pkg/collector, pkg/snapshotter
- pkg/validator
- pkg/errors, pkg/k8s
- docs/, examples/
Testing
Ran CUJ1 on a 2-node EKS GB200 cluster.
< Evidence and final testing to come >
Risk Assessment
Checklist
- make test with -race
- make lint
- git commit -S (GPG signing info)