Description
This PR adds punt stat metrics to site controller DPU and managed host DPU.
Currently, the site controller DPU does not export any metrics, so there's a new Makefile target that builds a package to install OpenTelemetry on the site controller DPU. Since the site controller DPU also does not have forge-dpu-agent, the package depends on initial mTLS certs at /var/lib/otelcol-contrib/mtls-certs.tar and installs an agent to update those certs.
Scripts to generate mtls-certs.tar and other dependencies are in review at https://gitlab-master.nvidia.com/nvmetal/stardrive/-/merge_requests/230
Since the site controller DPU does not have the glibc version expected by the mTLS renewal agent, I containerized the agent in Docker to work around that. However, at least on dev8 where I tested, Docker can't reach Docker Hub to satisfy a dependency on a ca-certificates module needed to build the container on demand. On site controller DPUs that have this Docker networking issue, a pre-built container at /usr/lib/otel-agent/docker/otel-agent-image.tar.gz can bypass the problem.
Type of Change
Related Issues (Optional)
https://nvbugspro.nvidia.com/bug/5743357
Breaking Changes
Testing
Punt stats verified on dev8 site controller DPU and managed host DPU.
Additional Notes
So that site controller DPU and managed host DPU can use the same mTLS renewal agent, I removed the logic to copy existing forge-dpu-agent certs and moved that to a separate script. This separation makes the renewal agent simpler.
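For reviewers, here is a minimal sketch of the pre-built container fallback mentioned in the description. It is not the code in this PR: the image tag and build directory are hypothetical placeholders, and only the tarball path comes from the description. The idea is to load the pre-built image when the tarball is present and otherwise build the container on demand.

```python
import os
import subprocess

# Path taken from the description above.
PREBUILT_IMAGE = "/usr/lib/otel-agent/docker/otel-agent-image.tar.gz"
# Hypothetical values, used only for illustration.
IMAGE_TAG = "otel-agent:local"
BUILD_DIR = "/usr/lib/otel-agent/docker"


def image_command(prebuilt=PREBUILT_IMAGE, tag=IMAGE_TAG,
                  build_dir=BUILD_DIR, exists=os.path.exists):
    """Return the docker command used to obtain the agent image."""
    if exists(prebuilt):
        # `docker load` reads gzip-compressed tarballs directly,
        # so no registry access is needed on this path.
        return ["docker", "load", "-i", prebuilt]
    # No tarball present: build on demand (needs network access
    # for the ca-certificates dependency).
    return ["docker", "build", "-t", tag, build_dir]


def ensure_image():
    """Run the selected command and fail loudly if docker reports an error."""
    subprocess.run(image_command(), check=True)
```

Keeping the command selection separate from the subprocess call makes the decision easy to check without a Docker daemon, for example by passing a stub for `exists`.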