WIP: feat(validator): add AKS H100 NCCL all-reduce performance runtime#676
WIP: feat(validator): add AKS H100 NCCL all-reduce performance runtime#676xdu31 wants to merge 2 commits into
Conversation
…h IB topo and mlnxnics discovery
📝 WalkthroughWalkthroughThis change adds Azure Kubernetes Service (AKS) support for NCCL all-reduce bandwidth performance validation on H100 GPUs. It introduces a new utility module that discovers Mellanox NIC counts from Kubernetes Node allocatable resources, manages NCCL topology ConfigMap lifecycle (create with fallback to update, delete with NotFound handling), and embeds an ND H100 v5/ND H200 v5 NCCL topology XML structure. The constraint validator is updated to conditionally invoke AKS discovery, inject Mellanox-related template variables, and pass a Kubernetes clientset to the cleanup function for ConfigMap deletion. New test coverage validates topology discovery, resource line formatting, XML structure, and template resolution for AKS. A TrainingRuntime manifest and topology XML file are provided for AKS H100 NCCL benchmarking. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes NotesThe changes introduce heterogeneous additions across utility code, tests, and Kubernetes manifests with moderate logic density focused on resource discovery and lifecycle management. The integration point into the existing constraint validator requires understanding of the conditional branching and parameter threading patterns. Test coverage is comprehensive with edge case handling (empty resources, missing keys, XML schema validation). 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@validators/performance/testdata/h100/aks/runtime.yaml`:
- Around line 146-148: The runtime.yaml currently runs apt-get update && apt-get
install -y openssh-server at container startup (seen in the initContainer and
container blocks), causing runtime latency and external mirror dependency;
replace this with a pre-built image that already has SSH installed (or document
that this runtime requires network package installs) and update the
initContainer/container image references to that image name, or alternatively
implement a build-stage that bakes openssh-server into the image so startup no
longer runs apt-get commands.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 67a5aa17-7fd7-47f0-8d01-5c8b5f10c7d2
📒 Files selected for processing (6)
validators/performance/nccl_aks_utils.govalidators/performance/nccl_aks_utils_test.govalidators/performance/nccl_all_reduce_bw_constraint.govalidators/performance/nccl_test.govalidators/performance/testdata/h100/aks/ndv5-topo.xmlvalidators/performance/testdata/h100/aks/runtime.yaml
| apt-get update && | ||
| apt-get install -y --no-install-recommends openssh-server && | ||
| mkdir -p /var/run/sshd && |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider caching apt-get results or using a pre-built image.
Both initContainer and container run apt-get update && apt-get install openssh-server at runtime, which adds latency and introduces external dependency on package mirrors. This is a pattern seen in other runtimes, but for production reliability, consider using a pre-built image with SSH pre-installed or documenting this as expected behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@validators/performance/testdata/h100/aks/runtime.yaml` around lines 146 -
148, The runtime.yaml currently runs apt-get update && apt-get install -y
openssh-server at container startup (seen in the initContainer and container
blocks), causing runtime latency and external mirror dependency; replace this
with a pre-built image that already has SSH installed (or document that this
runtime requires network package installs) and update the
initContainer/container image references to that image name, or alternatively
implement a build-stage that bakes openssh-server into the image so startup no
longer runs apt-get commands.
|
@xdu31 this PR has been inactive for 14 days. Do you need help finishing it, or should we close it for now? Feel free to reopen anytime. |
Summary
Add AKS H100 NCCL all-reduce bandwidth performance validator with InfiniBand topology file and Mellanox NIC discovery, matching excalibur and nccl-doctor patterns.
Motivation / Context
The
h100-aks-ubuntu-trainingoverlay referencesnccl-all-reduce-bwin its validation performance checks, but no AKS TrainingRuntime template existed. The validator failed with "unsupported service/accelerator combination" for AKS + H100.Fixes: #442
Related: N/A
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/api,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)validators/performance/Implementation Notes
Runtime template (
testdata/h100/aks/runtime.yaml): Kubeflow TrainJob + MPI over InfiniBand. Key tuning from excalibur/nccl-doctor:NCCL_IB_PCI_RELAXED_ORDERING=1— required for IB perf on Azure ND-seriesNCCL_SOCKET_IFNAME=eth0— keeps NCCL OOB off IB fabricNCCL_TOPO_FILE=/etc/nccl/topo.xml— explicit ndv5 PCIe topology (2-NUMA, 8-GPU, 8-NIC)mlnxnics discovery (
nccl_aks_utils.go): Readsnvidia.com/mlnxnicsfromnode.Status.Allocatable, same pattern as EKS EFA discovery. Count of 0 gracefully omits the resource line.Topo ConfigMap: ndv5-topo.xml embedded as Go constant, created as ConfigMap and mounted into worker pods. Create-or-update semantics, cleaned up in
cleanupNCCLResources().Signature change:
cleanupNCCLResources()now acceptskubernetes.Interfaceto delete the topo ConfigMap.Testing
All tests pass, 0 lint issues. New tests:
TestDiscoverAKSNodeConfig— 8 NICs, 0 NICs, empty allocatableTestBuildMLNXResourceLine— 8, 4, 0 countsTestNdv5TopoXML— validates XML structure (2 NUMA, 8 GPUs, 8 NICs)TestSupportedNCCLCombinations_Variants— AKS+H100 in mapTestTemplatePath— AKS path resolves correctlyTestPlatformWorkerScheduling— AKS returns nil/nilRisk Assessment
Rollout notes: New platform support only — no changes to existing EKS/GKE/any paths.
Checklist
make testwith-race)make lint)git commit -S)