feat(nodewright-customizations): add gb200 eks support #699

Merged
mchmarny merged 2 commits into main from feat/gb200_eks
Apr 27, 2026

Conversation

@ayuskauskas
Contributor

Summary

Update nvidia-tuned and nvidia-setup to pick up bug fixes for GB200 on EKS.

Fixes: #656

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • [X] Other: recipes/components/nodewright and eks-gb200

Testing

Ran CUJ1 on a 2-node EKS GB200 cluster.

< Evidence and final testing to come >

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@coderabbitai

coderabbitai Bot commented Apr 27, 2026

Walkthrough

This change updates the Nodewright tuning package definitions to reference new package images and versions, upgrading nvidia-setup from version 0.2.1 to 0.2.2 and nvidia-tuned from 0.2.3 to 0.3.0. Additionally, two GB200 EKS recipe overlays are modified to use the tuning configuration instead of the no-op configuration, activating Nodewright tuning customizations for both training and inference workloads on GB200 hardware.
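To make the size of the two version changes concrete, here is a minimal sketch that classifies each bump named above. The `parse` and `classify_bump` helpers are hypothetical and not part of this repository. The sketch assumes plain `MAJOR.MINOR.PATCH` strings, and it is also an assumption that these packages follow semantic-versioning conventions.

```python
def parse(version: str) -> tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into integers (hypothetical helper)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def classify_bump(old: str, new: str) -> str:
    """Name the most significant component that increased between two versions."""
    o, n = parse(old), parse(new)
    if n <= o:
        return "none-or-downgrade"
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"


# Version pairs taken from the walkthrough above.
bumps = {
    "nvidia-setup": classify_bump("0.2.1", "0.2.2"),
    "nvidia-tuned": classify_bump("0.2.3", "0.3.0"),
}
print(bumps)
```

Under those assumptions, nvidia-setup is a patch-level change and nvidia-tuned is a minor-level change, which may be worth keeping in mind when reviewing what the new images do.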

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: adding GB200 EKS support by enabling nodewright customizations (tuning) for this configuration. |
| Description check | ✅ Passed | The description clearly relates to the changeset, explaining bug fixes for GB200 on EKS and referencing the linked issue #656 that justifies the changes. |
| Linked Issues check | ✅ Passed | The PR fulfills the requirements from issue #656: it updates package versions/images to support GB200 tuning, switches from the no-op to the tuning overlay for EKS GB200 recipes, and includes validation on a real EKS GB200 cluster. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the GB200 EKS support objective: updating tuning manifest versions and enabling tuning for GB200 EKS inference and training recipes. |

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
recipes/overlays/gb200-eks-training.yaml (1)

67-70: ⚠️ Potential issue | 🟡 Minor

Refresh obsolete no-op explanation in the training overlay.

Lines 67–70 say tuning is unavailable, but line 74 now consumes tuning.yaml. Please update the comment so it reflects the new GB200 EKS behavior.

🛠️ Proposed comment fix
-    # GB200 uses nodewright no-op: the H100 tuning packages (nvidia-setup,
-    # nvidia-tuned) are not compatible with GB200's ARM64 host CPU and
-    # Blackwell GPU architecture. Replace with tuning.yaml once GB200-specific
-    # packages are available.
+    # GB200 on EKS uses nodewright tuning.
+    # This overlay points to tuning.yaml to apply GB200-compatible setup/tuning
+    # packages for training workloads.

Also applies to: 74-74

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/overlays/gb200-eks-training.yaml` around lines 67 - 70, Update the
explanatory comment around the GB200 EKS training overlay to reflect that GB200
now consumes tuning.yaml instead of using a nodewright no-op; replace the
current lines mentioning "nodewright no-op" and that tuning packages are
incompatible with GB200 with a concise note that GB200 uses a GB200-specific
tuning.yaml (or placeholder until GB200-specific packages are available) and
remove the outdated statement about H100/ARM64 incompatibility, referencing
"tuning.yaml" and "GB200" so readers know where the actual tuning is now
applied.
recipes/overlays/gb200-eks-inference.yaml (1)

61-64: ⚠️ Potential issue | 🟡 Minor

Update stale no-op commentary to match actual tuning configuration.

Lines 61–64 still say GB200 uses no-op and tuning is unavailable, but line 68 now points to tuning.yaml. Please update this comment so it does not mislead future changes.

🛠️ Proposed comment fix
-    # GB200 uses nodewright no-op: the H100 tuning packages (nvidia-setup,
-    # nvidia-tuned) are not compatible with GB200's ARM64 host CPU and
-    # Blackwell GPU architecture. Replace with tuning.yaml once GB200-specific
-    # packages are available.
+    # GB200 on EKS uses nodewright tuning.
+    # This overlay points to tuning.yaml to apply GB200-compatible setup/tuning
+    # packages for inference workloads.

Also applies to: 68-68

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/overlays/gb200-eks-inference.yaml` around lines 61 - 64, Update the
stale comment in recipes/overlays/gb200-eks-inference.yaml: replace the existing
"GB200 uses nodewright no-op..." text with a concise note that GB200 now
references tuning.yaml for GB200-specific tuning (instead of claiming tuning is
unavailable), mention that H100 packages are incompatible with GB200's
ARM64/Blackwell but a GB200-specific tuning.yaml is provided, and update the
comment near the reference to tuning.yaml so it accurately describes the current
configuration.
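The kind of mismatch flagged in both overlays (a comment that still describes the no-op while the configuration references tuning.yaml) could be caught mechanically. The following is a minimal sketch, not a check that exists in this repository: `find_stale_noop_comments` is a hypothetical helper, and the `manifest:` key in the sample is made up for illustration because the real overlay schema is not shown here.

```python
def find_stale_noop_comments(overlay_text: str) -> list[int]:
    """Return 1-based line numbers of comments that mention 'no-op'
    in a file whose non-comment lines reference tuning.yaml."""
    lines = overlay_text.splitlines()
    # Only a non-comment line counts as actually consuming tuning.yaml.
    uses_tuning = any(
        "tuning.yaml" in line and not line.lstrip().startswith("#")
        for line in lines
    )
    if not uses_tuning:
        return []
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.lstrip().startswith("#") and "no-op" in line.lower()
    ]


# Comment text copied from the review above; the last line uses a
# hypothetical key, since the real overlay schema is not shown here.
SAMPLE = """\
    # GB200 uses nodewright no-op: the H100 tuning packages (nvidia-setup,
    # nvidia-tuned) are not compatible with GB200's ARM64 host CPU and
    # Blackwell GPU architecture. Replace with tuning.yaml once GB200-specific
    # packages are available.
    manifest: tuning.yaml
"""
print(find_stale_noop_comments(SAMPLE))
```

A plain substring match like this is deliberately crude. It only suggests where a human should look and would need tuning before being used as a lint rule.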
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 6ae5d415-69f6-4bac-be1c-5fe8bfd27a96

📥 Commits

Reviewing files that changed from the base of the PR and between 8d0168e and 19128fc.

📒 Files selected for processing (3)
  • recipes/components/nodewright-customizations/manifests/tuning.yaml
  • recipes/overlays/gb200-eks-inference.yaml
  • recipes/overlays/gb200-eks-training.yaml

@github-actions
Contributor

github-actions Bot commented Apr 27, 2026

Coverage Report ✅

| Metric | Value |
| --- | --- |
| Coverage | 75.1% |
| Threshold | 70% |
| Status | Pass |

No Go source files changed in this PR.

@mchmarny mchmarny enabled auto-merge (squash) April 27, 2026 23:50
@mchmarny mchmarny merged commit 0a8d6e1 into main Apr 27, 2026
54 of 55 checks passed
@mchmarny mchmarny deleted the feat/gb200_eks branch April 27, 2026 23:56
lockwobr pushed a commit that referenced this pull request Apr 28, 2026
Co-authored-by: Mark Chmarny <mchmarny@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

[Feature]: EKS GB200 should use the nodewright tuning not no-op

2 participants