WIP: feat(validator): add AKS H100 NCCL all-reduce performance runtime #676

Draft
xdu31 wants to merge 2 commits into NVIDIA:main from xdu31:feat/aks-h100-nccl-runtime

Conversation

@xdu31 (Contributor) commented Apr 24, 2026

Summary

Add AKS H100 NCCL all-reduce bandwidth performance validator with InfiniBand topology file and Mellanox NIC discovery, matching excalibur and nccl-doctor patterns.

Motivation / Context

The h100-aks-ubuntu-training overlay references nccl-all-reduce-bw in its validation performance checks, but no AKS TrainingRuntime template existed. The validator failed with "unsupported service/accelerator combination" for AKS + H100.
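The failure mode described above can be sketched as a lookup that rejects unknown pairs before resolving a template path. This is a minimal illustration, not the validator's actual code: `comboKey`, `supportedCombos`, and `templatePath` are hypothetical names, the EKS entry is assumed, and the path layout is inferred from testdata/h100/aks/runtime.yaml.

```go
package main

import (
	"fmt"
	"path"
)

// comboKey identifies a service/accelerator pair (hypothetical type).
type comboKey struct {
	Service     string
	Accelerator string
}

// supportedCombos is an illustrative stand-in for the validator's map.
// The AKS+H100 entry is what this PR adds; the EKS entry is assumed.
var supportedCombos = map[comboKey]bool{
	{Service: "eks", Accelerator: "h100"}: true,
	{Service: "aks", Accelerator: "h100"}: true,
}

// templatePath returns the runtime template for a supported pair,
// or an error for a pair that has no template.
func templatePath(service, accelerator string) (string, error) {
	key := comboKey{Service: service, Accelerator: accelerator}
	if !supportedCombos[key] {
		return "", fmt.Errorf("unsupported service/accelerator combination: %s/%s",
			service, accelerator)
	}
	return path.Join("testdata", accelerator, service, "runtime.yaml"), nil
}

func main() {
	p, err := templatePath("aks", "h100")
	fmt.Println(p, err)

	_, err = templatePath("aks", "a100")
	fmt.Println(err)
}
```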

Fixes: #442
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: validators/performance/

Implementation Notes

Runtime template (testdata/h100/aks/runtime.yaml): Kubeflow TrainJob + MPI over InfiniBand. Key tuning from excalibur/nccl-doctor:

  • NCCL_IB_PCI_RELAXED_ORDERING=1 — required for IB perf on Azure ND-series
  • NCCL_SOCKET_IFNAME=eth0 — keeps NCCL OOB off IB fabric
  • NCCL_TOPO_FILE=/etc/nccl/topo.xml — explicit ndv5 PCIe topology (2-NUMA, 8-GPU, 8-NIC)
  • No custom MCA btl/oob settings — Azure IB uses default OpenMPI transport (unlike EKS EFA)
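For reference, the three NCCL settings listed above can be expressed as plain data and rendered deterministically. This is only a sketch: `aksNCCLEnv` and `renderEnv` are hypothetical helpers, and how the real template passes these variables to the workers is not shown in this PR description.

```go
package main

import (
	"fmt"
	"sort"
)

// aksNCCLEnv mirrors the three environment settings named in the notes above.
func aksNCCLEnv() map[string]string {
	return map[string]string{
		"NCCL_IB_PCI_RELAXED_ORDERING": "1",
		"NCCL_SOCKET_IFNAME":           "eth0",
		"NCCL_TOPO_FILE":               "/etc/nccl/topo.xml",
	}
}

// renderEnv returns KEY=VALUE strings sorted by key so output is stable.
func renderEnv(env map[string]string) []string {
	out := make([]string, 0, len(env))
	for k, v := range env {
		out = append(out, k+"="+v)
	}
	sort.Strings(out)
	return out
}

func main() {
	for _, line := range renderEnv(aksNCCLEnv()) {
		fmt.Println(line)
	}
}
```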

mlnxnics discovery (nccl_aks_utils.go): Reads nvidia.com/mlnxnics from node.Status.Allocatable, following the same pattern as EKS EFA discovery. A count of 0 omits the resource line instead of returning an error.
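A rough sketch of that discovery logic follows, using only the standard library. The `allocatable` map is a simplified stand-in for node.Status.Allocatable, the function names echo the test names in this PR but their signatures are assumptions, and the exact format of the emitted resource line is a guess.

```go
package main

import (
	"fmt"
	"strconv"
)

const mlnxResourceName = "nvidia.com/mlnxnics"

// discoverMLNXCount reads the NIC count from a node's allocatable resources.
// A missing key (or nil map) is treated as zero NICs, not as an error.
func discoverMLNXCount(allocatable map[string]string) (int, error) {
	raw, ok := allocatable[mlnxResourceName]
	if !ok {
		return 0, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("parse %s=%q: %w", mlnxResourceName, raw, err)
	}
	if n < 0 {
		return 0, fmt.Errorf("negative %s count: %d", mlnxResourceName, n)
	}
	return n, nil
}

// buildMLNXResourceLine returns a resource line for the template,
// or an empty string when there are no NICs to request.
// The "name: count" format is an assumption for illustration.
func buildMLNXResourceLine(count int) string {
	if count <= 0 {
		return ""
	}
	return fmt.Sprintf("%s: %d", mlnxResourceName, count)
}

func main() {
	n, err := discoverMLNXCount(map[string]string{mlnxResourceName: "8"})
	fmt.Println(n, err, buildMLNXResourceLine(n))
}
```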

Topo ConfigMap: ndv5-topo.xml is embedded as a Go constant, created as a ConfigMap, and mounted into worker pods. Creation uses create-or-update semantics, and the ConfigMap is deleted in cleanupNCCLResources().
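The create-or-update and cleanup behavior can be sketched against a small in-memory store. Everything here is hypothetical scaffolding: `configMapStore`, `memStore`, the sentinel errors, and the name "nccl-topo" stand in for the Kubernetes client and its AlreadyExists/NotFound errors, which the real code would use instead.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errAlreadyExists = errors.New("already exists")
	errNotFound      = errors.New("not found")
)

// configMapStore is a minimal stand-in for a ConfigMap client.
type configMapStore interface {
	Create(name, data string) error
	Update(name, data string) error
	Delete(name string) error
}

// memStore is an in-memory implementation used only for this sketch.
type memStore map[string]string

func (m memStore) Create(name, data string) error {
	if _, ok := m[name]; ok {
		return errAlreadyExists
	}
	m[name] = data
	return nil
}

func (m memStore) Update(name, data string) error {
	if _, ok := m[name]; !ok {
		return errNotFound
	}
	m[name] = data
	return nil
}

func (m memStore) Delete(name string) error {
	if _, ok := m[name]; !ok {
		return errNotFound
	}
	delete(m, name)
	return nil
}

// ensureTopoConfigMap creates the ConfigMap and falls back to update
// when it already exists.
func ensureTopoConfigMap(s configMapStore, name, xml string) error {
	err := s.Create(name, xml)
	if errors.Is(err, errAlreadyExists) {
		return s.Update(name, xml)
	}
	return err
}

// deleteTopoConfigMap deletes the ConfigMap and treats "not found" as success,
// so cleanup can run more than once.
func deleteTopoConfigMap(s configMapStore, name string) error {
	err := s.Delete(name)
	if errors.Is(err, errNotFound) {
		return nil
	}
	return err
}

func main() {
	store := memStore{}
	fmt.Println(ensureTopoConfigMap(store, "nccl-topo", "<system/>"))
	fmt.Println(ensureTopoConfigMap(store, "nccl-topo", "<system version=\"1\"/>"))
	fmt.Println(deleteTopoConfigMap(store, "nccl-topo"))
	fmt.Println(deleteTopoConfigMap(store, "nccl-topo"))
}
```

Making both operations idempotent means a rerun after a partially failed validation neither fails on a leftover ConfigMap nor errors during cleanup.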

Signature change: cleanupNCCLResources() now accepts kubernetes.Interface to delete the topo ConfigMap.

Testing

GOFLAGS="-mod=vendor" go test -race -v ./validators/performance/...
golangci-lint run -c .golangci.yaml ./validators/performance/...

All tests pass, 0 lint issues. New tests:

  • TestDiscoverAKSNodeConfig — 8 NICs, 0 NICs, empty allocatable
  • TestBuildMLNXResourceLine — 8, 4, 0 counts
  • TestNdv5TopoXML — validates XML structure (2 NUMA, 8 GPUs, 8 NICs)
  • TestSupportedNCCLCombinations_Variants — AKS+H100 in map
  • TestTemplatePath — AKS path resolves correctly
  • TestPlatformWorkerScheduling — AKS returns nil/nil
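A structural check like the one TestNdv5TopoXML describes could look roughly like the sketch below. It assumes the topology file marks NUMA domains with `cpu` elements and identifies devices by PCI class code (0x030200 for a 3D controller, 0x020700 for an InfiniBand controller); the sample document and its bus IDs are invented and much shorter than a real 8-GPU, 8-NIC file.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

const (
	pciClassGPU = "0x030200" // PCI class: display controller, 3D
	pciClassIB  = "0x020700" // PCI class: network controller, InfiniBand
)

// sampleTopo is a shortened, made-up topology: 2 NUMA domains,
// each with one GPU and one NIC behind a bridge.
const sampleTopo = `<system version="1">
  <cpu numaid="0">
    <pci busid="ffff:ff:01.0" class="0x060400">
      <pci busid="0001:00:00.0" class="0x030200"/>
      <pci busid="0101:00:00.0" class="0x020700"/>
    </pci>
  </cpu>
  <cpu numaid="1">
    <pci busid="ffff:ff:02.0" class="0x060400">
      <pci busid="0002:00:00.0" class="0x030200"/>
      <pci busid="0102:00:00.0" class="0x020700"/>
    </pci>
  </cpu>
</system>`

type topoCounts struct {
	NUMA, GPUs, NICs int
}

// countTopo walks the XML token stream and tallies NUMA domains,
// GPUs, and NICs. Malformed XML is reported as an error.
func countTopo(doc string) (topoCounts, error) {
	var c topoCounts
	dec := xml.NewDecoder(strings.NewReader(doc))
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return c, nil
		}
		if err != nil {
			return c, err
		}
		start, ok := tok.(xml.StartElement)
		if !ok {
			continue
		}
		switch start.Name.Local {
		case "cpu":
			c.NUMA++
		case "pci":
			for _, a := range start.Attr {
				if a.Name.Local != "class" {
					continue
				}
				switch a.Value {
				case pciClassGPU:
					c.GPUs++
				case pciClassIB:
					c.NICs++
				}
			}
		}
	}
}

func main() {
	c, err := countTopo(sampleTopo)
	fmt.Printf("%+v %v\n", c, err)
}
```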

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: New platform support only — no changes to existing EKS/GKE/any paths.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@xdu31 xdu31 requested a review from a team as a code owner April 24, 2026 20:21
@xdu31 xdu31 changed the title from "feat(validator): add AKS H100 NCCL all-reduce performance runtime" to "WIP: feat(validator): add AKS H100 NCCL all-reduce performance runtime" Apr 24, 2026
@coderabbitai

coderabbitai Bot commented Apr 24, 2026

📝 Walkthrough

This change adds Azure Kubernetes Service (AKS) support for NCCL all-reduce bandwidth performance validation on H100 GPUs. It introduces a new utility module that discovers Mellanox NIC counts from Kubernetes Node allocatable resources, manages NCCL topology ConfigMap lifecycle (create with fallback to update, delete with NotFound handling), and embeds an ND H100 v5/ND H200 v5 NCCL topology XML structure. The constraint validator is updated to conditionally invoke AKS discovery, inject Mellanox-related template variables, and pass a Kubernetes clientset to the cleanup function for ConfigMap deletion. New test coverage validates topology discovery, resource line formatting, XML structure, and template resolution for AKS. A TrainingRuntime manifest and topology XML file are provided for AKS H100 NCCL benchmarking.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Notes

The changes introduce heterogeneous additions across utility code, tests, and Kubernetes manifests with moderate logic density focused on resource discovery and lifecycle management. The integration point into the existing constraint validator requires understanding of the conditional branching and parameter threading patterns. Test coverage is comprehensive with edge case handling (empty resources, missing keys, XML schema validation).

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The description comprehensively explains the AKS runtime implementation, Mellanox NIC discovery, topology ConfigMap handling, and testing coverage, all directly related to the changeset. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #442 requirements: provides AKS runtime template with IB-specific NCCL tuning, implements mlnxnics discovery, includes topology XML, updates cleanup logic, and adds comprehensive tests. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to AKS H100 NCCL all-reduce support: new utils/tests for AKS discovery, runtime template, topology XML, and integration into constraint validation; no unrelated modifications detected. |
| Title check | ✅ Passed | The PR title accurately summarizes the main change: adding AKS H100 NCCL all-reduce performance runtime. It is clear, specific, and directly reflects the changeset's primary objective. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@validators/performance/testdata/h100/aks/runtime.yaml`:
- Around line 146-148: The runtime.yaml currently runs apt-get update && apt-get
install -y openssh-server at container startup (seen in the initContainer and
container blocks), causing runtime latency and external mirror dependency;
replace this with a pre-built image that already has SSH installed (or document
that this runtime requires network package installs) and update the
initContainer/container image references to that image name, or alternatively
implement a build-stage that bakes openssh-server into the image so startup no
longer runs apt-get commands.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 67a5aa17-7fd7-47f0-8d01-5c8b5f10c7d2

📥 Commits

Reviewing files that changed from the base of the PR and between 7466275 and b0b76d2.

📒 Files selected for processing (6)
  • validators/performance/nccl_aks_utils.go
  • validators/performance/nccl_aks_utils_test.go
  • validators/performance/nccl_all_reduce_bw_constraint.go
  • validators/performance/nccl_test.go
  • validators/performance/testdata/h100/aks/ndv5-topo.xml
  • validators/performance/testdata/h100/aks/runtime.yaml

Comment on lines +146 to +148
apt-get update &&
apt-get install -y --no-install-recommends openssh-server &&
mkdir -p /var/run/sshd &&
🧹 Nitpick | 🔵 Trivial

Consider caching apt-get results or using a pre-built image.

Both initContainer and container run apt-get update && apt-get install openssh-server at runtime, which adds latency and introduces external dependency on package mirrors. This is a pattern seen in other runtimes, but for production reliability, consider using a pre-built image with SSH pre-installed or documenting this as expected behavior.

@mchmarny mchmarny marked this pull request as draft April 25, 2026 10:30
@github-actions
Contributor

@xdu31 this PR has been inactive for 14 days. Do you need help finishing it, or should we close it for now? Feel free to reopen anytime.

Development

Successfully merging this pull request may close these issues.

feat: Add AKS H100 NCCL All-Reduce performance runtime

2 participants