add puntstats otel receiver #423

Open
terickson-nvidia wants to merge 1 commit into NVIDIA:main from terickson-nvidia:tom/puntstats

@terickson-nvidia

Description

This PR adds punt stat metrics to the site controller DPU and the managed host DPU.

Currently, the site controller DPU does not export any metrics, so there's a new Makefile target

    cargo make build-otel-dpu-deb-local

that builds a package to install OpenTelemetry on the site controller DPU. Since the site controller DPU also does not have forge-dpu-agent, the package depends on initial mTLS certs at /var/lib/otelcol-contrib/mtls-certs.tar and installs an agent to update those certs.
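
A preflight check along these lines could confirm that the initial cert bundle is in place before the package is installed. This is only a sketch and not part of the PR: the function name `check_certs_tar` and the `CERTS_TAR` override are hypothetical, and only the default path comes from the description above.

```shell
# Default path is the one named in the description; CERTS_TAR is a
# hypothetical override so the check can be exercised elsewhere.
CERTS_TAR="${CERTS_TAR:-/var/lib/otelcol-contrib/mtls-certs.tar}"

# Return 0 if the initial cert bundle exists and tar can list it.
check_certs_tar() {
    if [ ! -f "$CERTS_TAR" ]; then
        echo "missing: $CERTS_TAR" >&2
        return 1
    fi
    # tar -tf lists the archive members without extracting anything
    if ! tar -tf "$CERTS_TAR" >/dev/null 2>&1; then
        echo "not a readable tar archive: $CERTS_TAR" >&2
        return 1
    fi
    return 0
}
```

Running `check_certs_tar && echo "cert bundle present"` before installing the package would surface a missing bundle early.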

Scripts to generate mtls-certs.tar and other dependencies are in review at https://gitlab-master.nvidia.com/nvmetal/stardrive/-/merge_requests/230

Since the site controller DPU does not have the glibc version expected by the mTLS renewal agent, I containerized it in Docker to work around that. However, at least on dev8 where I tested, docker can't reach docker hub to satisfy a dependency on a ca-certificates module needed to build the container on demand, so on site controller DPUs that have this docker networking issue, a pre-built container at /usr/lib/otel-agent/docker/otel-agent-image.tar.gz can bypass this issue.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

https://nvbugspro.nvidia.com/bug/5743357

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Punt stats verified on dev8 site controller DPU and managed host DPU.

Additional Notes

So that the site controller DPU and the managed host DPU can use the same mTLS renewal agent, I moved the logic that copies existing forge-dpu-agent certs out of the renewal agent and into a separate script. This separation makes the renewal agent simpler.

@terickson-nvidia terickson-nvidia self-assigned this Mar 2, 2026
@terickson-nvidia terickson-nvidia requested a review from a team as a code owner March 2, 2026 23:57
@copy-pr-bot

copy-pr-bot bot commented Mar 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
