feat: two-phase GPU collection with hardware detection support #495

Merged
mchmarny merged 2 commits into NVIDIA:main from ArangoGutierrez:feat/nfd-two-phase-collector on Apr 7, 2026

Conversation

@ArangoGutierrez
Contributor

Summary

Refactor GPU collector to support two-phase collection model:

  • Phase 1 (hardware): NFD-based PCI detection when HardwareDetector is set
  • Phase 2 (smi): Existing nvidia-smi collection (always runs)

When HardwareDetector is nil, Phase 1 is skipped — preserving identical pre-NFD behavior. On Phase 1 failure, collector logs a warning and proceeds with Phase 2.

Part of: NFD Snapshot Enrichment (Track A) — Task 2 of 3
Depends on: #482 (merged)
Parallel with: #494 (Task 1 — NFD detector implementation)

Changes

pkg/collector/gpu/gpu.go

  • Add HardwareDetector field to Collector struct (nil = skip Phase 1)
  • Refactor Collect() into two-phase flow with graceful degradation
  • Extract collectSMI() method from existing nvidia-smi logic
  • Add hardwareSubtype() helper for HardwareInfo → measurement.Subtype conversion
  • Add subtypeHardware and subtypeSMI named constants
  • Add nil guard for HardwareInfo return from Detect()
  • Remove noGPUMeasurement() (replaced by inline construction in collectSMI())

pkg/collector/gpu/gpu_test.go

  • Remove TestNoGPUMeasurement (function removed)
  • Update TestCollector_GracefulDegradation_WhenNvidiaSmiMissing for new structure
  • Add TestHardwareSubtype for conversion helper
  • Add 4 two-phase tests:
    • Day0_GPUPresentNoDriver — GPU hardware found, no driver, both subtypes present
    • NoGPUHardware — no GPU detected by hardware, both subtypes present
    • HardwareDetectorFails — detector error → graceful fallback to smi-only
    • NilHardwareDetector — nil detector → pre-NFD behavior (smi only)

Testing

  • All 16 tests pass with -race flag (4 removed, 5 added from baseline)
  • golangci-lint reports 0 issues
  • Full backward compatibility: nil HardwareDetector = identical output to pre-refactor

Design Notes

  • Reuses mockHardwareDetector from hardware_test.go (same package) — no duplicate test doubles
  • Zero file overlap with Task 1 (feat: implement NFD-based GPU hardware detection #494): this PR only touches gpu.go + gpu_test.go, Task 1 only touches hardware.go + hardware_test.go + go.mod
  • Named constants prevent string literal drift between production code and tests

@ArangoGutierrez
Contributor Author

cc @mchmarny

lockwobr previously approved these changes Apr 6, 2026

@mchmarny (Member) left a comment

Two-phase design is solid and the graceful degradation works well. Main asks: use functional options for the detector field (project pattern), and DRY up the repeated zero-GPU subtype. See inline comments.

Review threads:
  • pkg/collector/gpu/gpu.go (outdated)
  • pkg/collector/gpu/gpu.go (outdated)
  • pkg/collector/gpu/gpu_test.go
@mchmarny enabled auto-merge (squash) April 7, 2026 11:44
Refactor GPU collector to support two-phase collection:
- Phase 1 (hardware): NFD-based PCI detection (when HardwareDetector set)
- Phase 2 (smi): existing nvidia-smi collection (always runs)

When HardwareDetector is nil, collector preserves pre-NFD behavior.
On Phase 1 failure, collector logs warning and proceeds with Phase 2.

- Add NewCollector with functional options (WithHardwareDetector, WithCommandRunner)
- Extract collectSMI() with injectable command runner for testability
- Add noGPUSMISubtype() helper to DRY zero-GPU construction
- Add hardwareSubtype() helper for measurement conversion
- Add subtypeHardware and subtypeSMI named constants
- Remove noGPUMeasurement() (replaced by noGPUSMISubtype)
- Add nil guard for HardwareInfo from Detect()
- Tests run on all CI environments (no nvidia-smi skip)

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
auto-merge was automatically disabled April 7, 2026 14:07

Head branch was pushed to by a user without write access

@ArangoGutierrez force-pushed the feat/nfd-two-phase-collector branch from 500d91e to c435431 April 7, 2026 14:07
@mchmarny enabled auto-merge (squash) April 7, 2026 14:23
@mchmarny merged commit 7283e9c into NVIDIA:main Apr 7, 2026
15 checks passed
@ArangoGutierrez deleted the feat/nfd-two-phase-collector branch April 8, 2026 20:02
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 28, 2026
Adds an nfd componentRefs entry to the 12 production GPU leaf overlays
(H100, GB200, RTX Pro 6000 on EKS / AKS / GKE / OKE / LKE), flipping
overrides.topologyUpdater.enable to true. Each cluster running these
recipes now publishes per-node NodeResourceTopology (NRT) CRDs
describing NUMA zones, GPU-to-NUMA affinity, and NIC-to-NUMA affinity
— the fact base downstream NUMA-aware schedulers and recipe
auto-resolution flows can consume.

Overlay change: each leaf gains a fresh componentRefs entry with
name=nfd, type=Helm, and overrides.topologyUpdater.enable=true.
mergeComponentRef (pkg/recipe/metadata.go:532) deep-merges the
overrides onto the base nfd entry from recipes/overlays/base.yaml
without replacing inherited source/version/valuesFile.
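Concretely, such a leaf overlay entry would be shaped roughly like this (a hypothetical fragment; the exact key layout lives in the leaf overlay files):

```yaml
# Hypothetical leaf overlay fragment (e.g. an H100-on-EKS recipe).
componentRefs:
  - name: nfd
    type: Helm
    overrides:
      topologyUpdater:
        enable: true
# source/version/valuesFile are inherited from recipes/overlays/base.yaml
# via mergeComponentRef's deep-merge.
```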

Chart configuration (recipes/components/nfd/values.yaml):

* topologyUpdater.createCRDs: true ensures the noderesourcetopologies
  CRD is installed by the chart on overlays that flip enable=true.
  Managed K8s control planes (EKS / AKS / GKE / OKE / LKE) do not
  preinstall it. The chart guards the CRD template on
  (enable && createCRDs), so this is a no-op when TU is off.
* topologyUpdater.kubeletStateDir: "" disables the broad
  /var/lib/kubelet hostPath mount. TU only needs the pod-resources
  gRPC socket, which is mounted via a dedicated hostPath; the empty
  state-dir stops the chart from exposing kubelet state files (TLS
  material, pod manifests, checkpoint data) into the TU container.
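The two keys described above, as a hedged values.yaml sketch (comments are mine, not from the chart):

```yaml
# Hypothetical recipes/components/nfd/values.yaml excerpt.
topologyUpdater:
  createCRDs: true      # install the noderesourcetopologies CRD when enable=true
  kubeletStateDir: ""   # drop the broad /var/lib/kubelet hostPath mount;
                        # TU only needs the pod-resources gRPC socket
```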

Scheduling (recipes/registry.yaml): nfd.nodeScheduling.{system,
accelerated}.tolerationPaths now include topologyUpdater.tolerations
so the bundler injects --accelerated-node-toleration values into TU
pods. Without this, on every targeted GPU cluster (which carry
nvidia.com/gpu=present:NoSchedule), the TU DaemonSet would have been
unschedulable. nodeSelector deliberately not added — TU runs on all
nodes (same rationale as worker), and topology data on system nodes
is needed for cross-NUMA scheduling decisions.
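The registry wiring might look roughly like this (hypothetical fragment; the real structure is in recipes/registry.yaml):

```yaml
# Hypothetical recipes/registry.yaml excerpt.
nfd:
  nodeScheduling:
    system:
      tolerationPaths:
        - topologyUpdater.tolerations
    accelerated:
      tolerationPaths:
        - topologyUpdater.tolerations   # lets the bundler inject GPU-taint tolerations
```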

Kind-chain overlays (h100-kind-*, kind-*) are intentionally excluded:
KWOK and kind clusters lack a real kubelet pod-resources gRPC socket,
so TU would CrashLoopBackOff. The KubeletPodResources feature gate
has been Beta-default since Kubernetes 1.15 and reached GA in 1.28
(KEP-606); AICR's affected leaves require K8s >= 1.30, so the
prerequisite is satisfied in practice.

Coverage:

* New chainsaw step assert-bundle-topology-nfd in CUJ1-training
  asserts the rendered NFD bundle contains topologyUpdater.enable=
  true, createCRDs=true, the GPU-taint toleration, master.enable=
  true, gc.enable=true, and enableNodeFeatureApi=true.
* New TestNFDTopologyUpdater_OverlayCoverage in pkg/recipe/
  metadata_test.go covers all 12 GPU platform+intent overlays
  (expected ON) and both kind-chain leaves (expected OFF).
  Type-assertion failures on ON cases promote to t.Fatal so a
  malformed override produces a loud regression rather than a
  silent false-negative.
* docs/user/component-catalog.md gains the missing NFD row and a
  Topology Updater section documenting which recipes run TU and
  the kubelet pod-resources prerequisite.

Closes the local NFD adoption initiative (Tracks A and B previously
shipped via PRs NVIDIA#482, NVIDIA#494, NVIDIA#495, NVIDIA#502, NVIDIA#511, NVIDIA#518, NVIDIA#688). Track C
(recipe auto-resolution from NRT data) and scheduler-integration
work remain open for future PRs.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>