
FORGE-7445 noDpuLogsWarning alert with SuppressExternalAlerting fires even though there are alerts. #515

Open
terickson-nvidia wants to merge 3 commits into NVIDIA:main from terickson-nvidia:tom/otel-host-machine-id

Conversation

@terickson-nvidia

Description

Fixes the missing host_machine_id label in DPU logs and in the telemetry_stats_log_records_count metric. forge-dpu-agent now fetches the ID through the carbide API using the FindInterfaces request. The label is needed for SuppressExternalAlerting to work with the noDpuLogsWarning alert.

The request is retried if the ID isn't immediately available, using the backon crate to back off the retry interval up to a maximum of 5 minutes. This PR also adds support for pending file contents to the duppet crate.
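
For reviewers who want a mental model of the retry behavior, here is a minimal std-only sketch of capped exponential backoff. It does not use the backon crate and is not the code in this PR: `backoff_delay`, `retry_with_backoff`, the 1-second initial delay, the doubling factor, and the attempt limit are all assumptions chosen for illustration. Only the 5-minute cap comes from the description above.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): doubles each time, capped at `max`.
/// Hypothetical helper; the PR delegates this to the backon crate.
fn backoff_delay(attempt: u32, initial: Duration, max: Duration) -> Duration {
    // 2^attempt, saturating so very large attempt counts cannot overflow.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    initial.saturating_mul(factor).min(max)
}

/// Calls `op` until it succeeds or `max_attempts` calls have failed.
/// `sleep` is injected so the example runs without actually waiting.
fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
    initial: Duration,
    max: Duration,
    max_attempts: u32,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                sleep(backoff_delay(attempt, initial, max));
                attempt += 1;
            }
        }
    }
}

fn main() {
    let mut calls = 0;
    let mut delays = Vec::new();
    // Simulated lookup: the ID is "not available" for the first three calls.
    let result: Result<&str, &str> = retry_with_backoff(
        || {
            calls += 1;
            if calls <= 3 { Err("id not available yet") } else { Ok("host-id") }
        },
        |d| delays.push(d), // record the delay instead of sleeping
        Duration::from_secs(1),
        Duration::from_secs(300), // 5-minute cap, as in the description
        10,
    );
    println!("{result:?} after {calls} calls, delays: {delays:?}");
}
```

Injecting the `sleep` function keeps the sketch testable; a real agent would presumably await an async timer instead.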

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

https://nvbugspro.nvidia.com/bug/5668278

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Unit tests cover pending file contents in the duppet crate.
Manual testing in local dev verified:

  • retry on failure
  • /run/otelcol-contrib/host-machine-id is created/updated/unchanged as expected.

Additional Notes

@terickson-nvidia terickson-nvidia self-assigned this Mar 11, 2026
@terickson-nvidia terickson-nvidia requested a review from a team as a code owner March 11, 2026 00:56
@copy-pr-bot

copy-pr-bot bot commented Mar 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@terickson-nvidia
Author

Testing

$ sudo rm /run/otelcol-contrib/host-machine-id
$ sudo systemctl restart forge-dpu-agent
$ cat /run/otelcol-contrib/host-machine-id
host.machine.id=fm100hthn93o41u6eq8b9ijnjtpce73m8uuh7hd462gtj9p0cvl08oo5r0g
$ sudo journalctl -u forge-dpu-agent

I looked at the forge-dpu-agent logs output by journalctl:

Mar 09 23:00:01 10-217-170-243.local.forge forge-dpu-agent[215308]: level=INFO msg="Duppet Sync Summary:" location="crates/agent/src/duppet/sync.rs:78"
Mar 09 23:00:01 10-217-170-243.local.forge forge-dpu-agent[215308]: level=INFO msg="  Unchanged  /etc/rc.local" location="crates/agent/src/duppet/sync.rs:41"
Mar 09 23:00:01 10-217-170-243.local.forge forge-dpu-agent[215308]: level=INFO msg="  Unchanged  /opt/forge/update-ovs-pipe-size.sh" location="crates/agent/src/duppet/sync.rs:41"
Mar 09 23:00:01 10-217-170-243.local.forge forge-dpu-agent[215308]: level=INFO msg="  Unchanged  /etc/cron.daily/apt-clean" location="crates/agent/src/duppet/sync.rs:41"
Mar 09 23:00:01 10-217-170-243.local.forge forge-dpu-agent[215308]: level=INFO msg="  Unchanged  /etc/dhcp/dhclient-exit-hooks.d/ntpsec" location="crates/agent/src/duppet/sync.rs:41"
Mar 09 23:00:01 10-217-170-243.local.forge forge-dpu-agent[215308]: level=INFO msg="  Unchanged  /run/otelcol-contrib/machine-id" location="crates/agent/src/duppet/sync.rs:41"
Mar 09 23:00:01 10-217-170-243.local.forge forge-dpu-agent[215308]: level=INFO msg="  Unchanged  /lib/systemd/system/update-ovs-pipe-size.service" location="crates/agent/src/duppet/sync.rs:41"
Mar 09 23:00:01 10-217-170-243.local.forge forge-dpu-agent[215308]: level=INFO msg="  Pending    /run/otelcol-contrib/host-machine-id" location="crates/agent/src/duppet/sync.rs:84"
...
Mar 09 23:00:01 10-217-170-243.local.forge forge-dpu-agent[215308]: level=INFO msg="Creating new file: /run/otelcol-contrib/host-machine-id (sha256: adef64d72d336d6e6f3265243717ce76597b2ad4aaa5826b1e73d2b660d7ff2c)" location="crates/agent/src/duppet/sync.rs:287"
Mar 09 23:00:01 10-217-170-243.local.forge forge-dpu-agent[215308]: level=INFO msg="  Created    /run/otelcol-contrib/host-machine-id" location="crates/agent/src/duppet/sync.rs:41"

I tested the duppet sub-command directly:

$ sudo forge-dpu-agent duppet
...
level=INFO msg="Duppet Sync Summary:" location="crates/agent/src/duppet/sync.rs:78"
level=INFO msg="  \u{1b}[34mUnchanged\u{1b}[0m /etc/cron.daily/apt-clean" location="crates/agent/src/duppet/sync.rs:41"
level=INFO msg="  \u{1b}[34mUnchanged\u{1b}[0m /etc/rc.local" location="crates/agent/src/duppet/sync.rs:41"
level=INFO msg="  \u{1b}[34mUnchanged\u{1b}[0m /opt/forge/update-ovs-pipe-size.sh" location="crates/agent/src/duppet/sync.rs:41"
level=INFO msg="  \u{1b}[34mUnchanged\u{1b}[0m /etc/dhcp/dhclient-exit-hooks.d/ntpsec" location="crates/agent/src/duppet/sync.rs:41"
level=INFO msg="  \u{1b}[34mUnchanged\u{1b}[0m /run/otelcol-contrib/machine-id" location="crates/agent/src/duppet/sync.rs:41"
level=INFO msg="  \u{1b}[34mUnchanged\u{1b}[0m /lib/systemd/system/update-ovs-pipe-size.service" location="crates/agent/src/duppet/sync.rs:41"
level=INFO msg="  \u{1b}[1;35mPending\u{1b}[0m /run/otelcol-contrib/host-machine-id" location="crates/agent/src/duppet/sync.rs:84"
level=INFO msg="\u{1b}[1;34mDestination file unchanged\u{1b}[0m: /run/otelcol-contrib/host-machine-id (sha256: adef64d72d336d6e6f3265243717ce76597b2ad4aaa5826b1e73d2b660d7ff2c)" location="crates/agent/src/duppet/sync.rs:266"
level=INFO msg="  \u{1b}[34mUnchanged\u{1b}[0m /run/otelcol-contrib/host-machine-id" location="crates/agent/src/duppet/sync.rs:41"

This verified that the contents were unchanged. By first removing or modifying /run/otelcol-contrib/host-machine-id I also verified the expected "Created" and "Updated" outcomes:

$ sudo rm /run/otelcol-contrib/host-machine-id
$ sudo forge-dpu-agent duppet
...
level=INFO msg="\u{1b}[1;32mCreating new file\u{1b}[0m: /run/otelcol-contrib/host-machine-id (sha256: adef64d72d336d6e6f3265243717ce76597b2ad4aaa5826b1e73d2b660d7ff2c)" location="crates/agent/src/duppet/sync.rs:287"
level=INFO msg="  \u{1b}[1;32mCreated\u{1b}[0m /run/otelcol-contrib/host-machine-id" location="crates/agent/src/duppet/sync.rs:41"
$ sudo vi /run/otelcol-contrib/host-machine-id
$ cat /run/otelcol-contrib/host-machine-id
host.machine.id=xx100hthn93o41u6eq8b9ijnjtpce73m8uuh7hd462gtj9p0cvl08oo5r0g
$ sudo forge-dpu-agent duppet
...
level=INFO msg="\u{1b}[1;33mUpdating existing file\u{1b}[0m: /run/otelcol-contrib/host-machine-id (expected sha256: adef64d72d336d6e6f3265243717ce76597b2ad4aaa5826b1e73d2b660d7ff2c, observed sha256: 2876ece9bb4c50b3449b0e04501f412dcb1aa29a9883088ed2f6d53ff6085062), diff:\n-host.machine.id=xx100hthn93o41u6eq8b9ijnjtpce73m8uuh7hd462gtj9p0cvl08oo5r0g\n+host.machine.id=fm100hthn93o41u6eq8b9ijnjtpce73m8uuh7hd462gtj9p0cvl08oo5r0g" location="crates/agent/src/duppet/sync.rs:327"
level=INFO msg="  \u{1b}[1;33mUpdated\u{1b}[0m /run/otelcol-contrib/host-machine-id" location="crates/agent/src/duppet/sync.rs:41"
$ cat /run/otelcol-contrib/host-machine-id
host.machine.id=fm100hthn93o41u6eq8b9ijnjtpce73m8uuh7hd462gtj9p0cvl08oo5r0g
$
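
The Created, Updated, Unchanged, and Pending outcomes in the transcripts above can be pictured with a small sketch. This is not the duppet implementation: `Outcome`, `sync_file`, and the use of `Option<&str>` for contents that are not yet available are hypothetical, and the sketch compares file contents directly where the real logs report sha256 digests. How duppet actually resolves pending contents within a run is not modeled here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Result of syncing one file (names mirror the log output; the type is hypothetical).
#[derive(Debug, PartialEq)]
enum Outcome {
    Created,
    Updated,
    Unchanged,
    Pending,
}

/// Brings `path` in line with `desired`.
/// `None` stands for contents that are not available yet (assumption).
fn sync_file(path: &Path, desired: Option<&str>) -> io::Result<Outcome> {
    let wanted = match desired {
        None => return Ok(Outcome::Pending), // nothing to write yet
        Some(contents) => contents,
    };
    match fs::read_to_string(path) {
        // Same contents on disk: leave the file alone.
        Ok(current) if current == wanted => Ok(Outcome::Unchanged),
        // File exists but differs: overwrite it.
        Ok(_) => {
            fs::write(path, wanted)?;
            Ok(Outcome::Updated)
        }
        // File is missing: create it.
        Err(e) if e.kind() == io::ErrorKind::NotFound => {
            fs::write(path, wanted)?;
            Ok(Outcome::Created)
        }
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    // A scratch file in the temp directory, not the real /run path.
    let path = std::env::temp_dir().join(format!("host-machine-id-{}", std::process::id()));
    let _ = fs::remove_file(&path);
    let contents = "host.machine.id=example\n"; // placeholder value

    println!("{:?}", sync_file(&path, None)?);           // contents not known yet
    println!("{:?}", sync_file(&path, Some(contents))?); // file was missing
    println!("{:?}", sync_file(&path, Some(contents))?); // already up to date
    fs::write(&path, "host.machine.id=edited\n")?;       // simulate a manual edit
    println!("{:?}", sync_file(&path, Some(contents))?); // drift is corrected

    fs::remove_file(&path)
}
```

Comparing digests instead of whole strings, as the logged sha256 values suggest, would let the tool report what changed without holding both versions in the log line.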

I temporarily modified the code to inject an error for the first N attempts and verified the expected retry behavior.

@terickson-nvidia
Author

Testing

I repeated all the testing mentioned previously. Then I temporarily modified the code to inject an error into the API request and verified that the error is logged and that the service exits and restarts:

Mar 12 00:52:56 10-217-170-243.local.forge forge-dpu-agent[3250132]: level=WARN msg="get_host_machine_id() failed: injected\n\nLocation:\n    /carbide/crates/agent/src/host_machine_id.rs:40:18" location="crates/agent/src/host_machine_id.rs:93"
Mar 12 00:52:56 10-217-170-243.local.forge forge-dpu-agent[3250132]: level=ERROR msg="get_host_machine_id_retry() failed: injected\n\nLocation:\n    /carbide/crates/agent/src/host_machine_id.rs:40:18" location="crates/agent/src/main_loop.rs:199"
Mar 12 00:52:56 10-217-170-243.local.forge forge-dpu-agent[3250132]: Error: main_loop error exit
Mar 12 00:52:56 10-217-170-243.local.forge forge-dpu-agent[3250132]: Caused by:
Mar 12 00:52:56 10-217-170-243.local.forge forge-dpu-agent[3250132]:     injected
Mar 12 00:52:56 10-217-170-243.local.forge forge-dpu-agent[3250132]: Location:
Mar 12 00:52:56 10-217-170-243.local.forge forge-dpu-agent[3250132]:     /carbide/crates/agent/src/host_machine_id.rs:40:18
Mar 12 00:52:56 10-217-170-243.local.forge systemd[1]: forge-dpu-agent.service: Main process exited, code=exited, status=1/FAILURE
Mar 12 00:52:56 10-217-170-243.local.forge systemd[1]: forge-dpu-agent.service: Failed with result 'exit-code'.
Mar 12 00:52:56 10-217-170-243.local.forge systemd[1]: forge-dpu-agent.service: Consumed 8.382s CPU time.
Mar 12 00:53:26 10-217-170-243.local.forge systemd[1]: forge-dpu-agent.service: Scheduled restart job, restart counter is at 3.
Mar 12 00:53:26 10-217-170-243.local.forge systemd[1]: Stopped Forge DPU agent service.
Mar 12 00:53:26 10-217-170-243.local.forge systemd[1]: forge-dpu-agent.service: Consumed 8.382s CPU time.
Mar 12 00:53:26 10-217-170-243.local.forge systemd[1]: Starting Forge DPU agent service...


2 participants