feat: two-phase GPU collection with hardware detection support #495

Merged
mchmarny merged 2 commits into NVIDIA:main from ArangoGutierrez:feat/nfd-two-phase-collector on Apr 7, 2026

Conversation

@ArangoGutierrez
Contributor

Summary

Refactor GPU collector to support two-phase collection model:

  • Phase 1 (hardware): NFD-based PCI detection when HardwareDetector is set
  • Phase 2 (smi): Existing nvidia-smi collection (always runs)

When HardwareDetector is nil, Phase 1 is skipped — preserving identical pre-NFD behavior. On Phase 1 failure, collector logs a warning and proceeds with Phase 2.

Part of: NFD Snapshot Enrichment (Track A) — Task 2 of 3
Depends on: #482 (merged)
Parallel with: #494 (Task 1 — NFD detector implementation)

Changes

pkg/collector/gpu/gpu.go

  • Add HardwareDetector field to Collector struct (nil = skip Phase 1)
  • Refactor Collect() into two-phase flow with graceful degradation
  • Extract collectSMI() method from existing nvidia-smi logic
  • Add hardwareSubtype() helper for HardwareInfo → measurement.Subtype conversion
  • Add subtypeHardware and subtypeSMI named constants
  • Add nil guard for HardwareInfo return from Detect()
  • Remove noGPUMeasurement() (replaced by inline construction in collectSMI())

pkg/collector/gpu/gpu_test.go

  • Remove TestNoGPUMeasurement (function removed)
  • Update TestCollector_GracefulDegradation_WhenNvidiaSmiMissing for new structure
  • Add TestHardwareSubtype for conversion helper
  • Add 4 two-phase tests:
    • Day0_GPUPresentNoDriver — GPU hardware found, no driver, both subtypes present
    • NoGPUHardware — no GPU detected by hardware, both subtypes present
    • HardwareDetectorFails — detector error → graceful fallback to smi-only
    • NilHardwareDetector — nil detector → pre-NFD behavior (smi only)

Testing

  • All 16 tests pass with -race flag (4 removed, 5 added from baseline)
  • golangci-lint reports 0 issues
  • Full backward compatibility: nil HardwareDetector = identical output to pre-refactor

Design Notes

  • Reuses mockHardwareDetector from hardware_test.go (same package) — no duplicate test doubles
  • Zero file overlap with Task 1 (feat: implement NFD-based GPU hardware detection #494): this PR only touches gpu.go + gpu_test.go, Task 1 only touches hardware.go + hardware_test.go + go.mod
  • Named constants prevent string literal drift between production code and tests

@ArangoGutierrez
Contributor Author

cc @mchmarny

lockwobr previously approved these changes Apr 6, 2026

@mchmarny (Member) left a comment

Two-phase design is solid and the graceful degradation works well. Main asks: use functional options for the detector field (project pattern), and DRY up the repeated zero-GPU subtype. See inline comments.

Review threads:
  • pkg/collector/gpu/gpu.go (outdated)
  • pkg/collector/gpu/gpu.go (outdated)
  • pkg/collector/gpu/gpu_test.go
@mchmarny enabled auto-merge (squash) April 7, 2026 11:44
Refactor GPU collector to support two-phase collection:
- Phase 1 (hardware): NFD-based PCI detection (when HardwareDetector set)
- Phase 2 (smi): existing nvidia-smi collection (always runs)

When HardwareDetector is nil, collector preserves pre-NFD behavior.
On Phase 1 failure, collector logs warning and proceeds with Phase 2.

- Add NewCollector with functional options (WithHardwareDetector, WithCommandRunner)
- Extract collectSMI() with injectable command runner for testability
- Add noGPUSMISubtype() helper to DRY zero-GPU construction
- Add hardwareSubtype() helper for measurement conversion
- Add subtypeHardware and subtypeSMI named constants
- Remove noGPUMeasurement() (replaced by noGPUSMISubtype)
- Add nil guard for HardwareInfo from Detect()
- Tests run on all CI environments (no nvidia-smi skip)

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
auto-merge was automatically disabled April 7, 2026 14:07

Head branch was pushed to by a user without write access

@ArangoGutierrez force-pushed the feat/nfd-two-phase-collector branch from 500d91e to c435431 April 7, 2026 14:07
@mchmarny enabled auto-merge (squash) April 7, 2026 14:23
@mchmarny merged commit 7283e9c into NVIDIA:main Apr 7, 2026
15 checks passed
@ArangoGutierrez deleted the feat/nfd-two-phase-collector branch April 8, 2026 20:02
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 28, 2026
Adds an nfd componentRefs entry to the 12 production GPU leaf overlays
(H100, GB200, RTX Pro 6000 on EKS / AKS / GKE / OKE / LKE), flipping
overrides.topologyUpdater.enable to true. Each cluster running these
recipes now publishes per-node NodeResourceTopology (NRT) CRDs
describing NUMA zones, GPU-to-NUMA affinity, and NIC-to-NUMA affinity
— the fact base downstream NUMA-aware schedulers and recipe
auto-resolution flows can consume.

Overlay change: each leaf gains a fresh componentRefs entry with
name=nfd, type=Helm, and overrides.topologyUpdater.enable=true.
mergeComponentRef (pkg/recipe/metadata.go:532) deep-merges the
overrides onto the base nfd entry from recipes/overlays/base.yaml
without replacing inherited source/version/valuesFile.
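Concretely, such a leaf overlay entry would be shaped roughly like this (a hypothetical fragment; the exact key layout lives in the leaf overlay files):

```yaml
# Hypothetical leaf overlay fragment (e.g. an H100-on-EKS recipe).
componentRefs:
  - name: nfd
    type: Helm
    overrides:
      topologyUpdater:
        enable: true
# source/version/valuesFile are inherited from recipes/overlays/base.yaml
# via mergeComponentRef's deep-merge.
```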

Chart configuration (recipes/components/nfd/values.yaml):

* topologyUpdater.createCRDs: true ensures the noderesourcetopologies
  CRD is installed by the chart on overlays that flip enable=true.
  Managed K8s control planes (EKS / AKS / GKE / OKE / LKE) do not
  preinstall it. The chart guards the CRD template on
  (enable && createCRDs), so this is a no-op when TU is off.
* topologyUpdater.kubeletStateDir: "" disables the broad
  /var/lib/kubelet hostPath mount. TU only needs the pod-resources
  gRPC socket, which is mounted via a dedicated hostPath; the empty
  state-dir stops the chart from exposing kubelet state files (TLS
  material, pod manifests, checkpoint data) into the TU container.
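The two keys described above, as a hedged values.yaml sketch (comments are mine, not from the chart):

```yaml
# Hypothetical recipes/components/nfd/values.yaml excerpt.
topologyUpdater:
  createCRDs: true      # install the noderesourcetopologies CRD when enable=true
  kubeletStateDir: ""   # drop the broad /var/lib/kubelet hostPath mount;
                        # TU only needs the pod-resources gRPC socket
```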

Scheduling (recipes/registry.yaml): nfd.nodeScheduling.{system,
accelerated}.tolerationPaths now include topologyUpdater.tolerations
so the bundler injects --accelerated-node-toleration values into TU
pods. Without this, on every targeted GPU cluster (which carry
nvidia.com/gpu=present:NoSchedule), the TU DaemonSet would have been
unschedulable. nodeSelector deliberately not added — TU runs on all
nodes (same rationale as worker), and topology data on system nodes
is needed for cross-NUMA scheduling decisions.
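The registry wiring might look roughly like this (hypothetical fragment; the real structure is in recipes/registry.yaml):

```yaml
# Hypothetical recipes/registry.yaml excerpt.
nfd:
  nodeScheduling:
    system:
      tolerationPaths:
        - topologyUpdater.tolerations
    accelerated:
      tolerationPaths:
        - topologyUpdater.tolerations   # lets the bundler inject GPU-taint tolerations
```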

Kind-chain overlays (h100-kind-*, kind-*) are intentionally excluded:
KWOK and kind clusters lack a real kubelet pod-resources gRPC socket,
so TU would CrashLoopBackOff. The KubeletPodResources feature gate
has been Beta-default since Kubernetes 1.15 and reached GA in 1.28
(KEP-606); AICR's affected leaves require K8s >= 1.30, so the
prerequisite is satisfied in practice.

Coverage:

* New chainsaw step assert-bundle-topology-nfd in CUJ1-training
  asserts the rendered NFD bundle contains topologyUpdater.enable=
  true, createCRDs=true, the GPU-taint toleration, master.enable=
  true, gc.enable=true, and enableNodeFeatureApi=true.
* New TestNFDTopologyUpdater_OverlayCoverage in pkg/recipe/
  metadata_test.go covers all 12 GPU platform+intent overlays
  (expected ON) and both kind-chain leaves (expected OFF).
  Type-assertion failures on ON cases promote to t.Fatal so a
  malformed override produces a loud regression rather than a
  silent false-negative.
* docs/user/component-catalog.md gains the missing NFD row and a
  Topology Updater section documenting which recipes run TU and
  the kubelet pod-resources prerequisite.

Closes the local NFD adoption initiative (Tracks A and B previously
shipped via PRs NVIDIA#482, NVIDIA#494, NVIDIA#495, NVIDIA#502, NVIDIA#511, NVIDIA#518, NVIDIA#688). Track C
(recipe auto-resolution from NRT data) and scheduler-integration
work remain open for future PRs.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>