feat(fault-injection): Add GPU fault injector agent #4043

nv-oviya · 2025-11-02T00:48:20Z

Overview:

This PR adds a DaemonSet agent that runs on every GPU node and performs actual XID injection into kernel logs. The API service (PR #4042) communicates with these agents to trigger faults on specific nodes.

Details:

Add agents/gpu-fault-injector/gpu_xid_injector.py: XID injection implementation
- inject_xid_error(): Writes XID error to kernel log via nsenter
- Supports XID types: 79, 48, 94, 95, 43, 74
- Uses nsenter to access host's kernel log from container
- Formats messages to match real NVIDIA driver XID format
Add agents/gpu-fault-injector/agent.py: Flask API server
- /health: Health check
- /inject-xid: Endpoint to trigger XID injection (called by API service)
- Validates XID types and GPU IDs
- Returns injection status
Add agents/gpu-fault-injector/requirements.txt: Flask dependencies
Add agents/gpu-fault-injector/Dockerfile: Agent container image
- Includes nsenter utility for host access
- Runs Flask server on port 5000
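
The validation step listed for /inject-xid can be sketched independently of the web framework. This is a minimal sketch, not the code in agent.py: the payload field names `xid_type` and `gpu_id` and the `gpu_count` parameter are assumptions made for illustration, while the supported XID set is the one listed above.

```python
from typing import Any

# XID types listed in the PR description as supported.
SUPPORTED_XIDS = frozenset({79, 48, 94, 95, 43, 74})


def validate_injection_request(payload: dict[str, Any], gpu_count: int) -> list[str]:
    """Return a list of validation errors; an empty list means the request is valid.

    The field names `xid_type` and `gpu_id` are hypothetical.
    """
    errors: list[str] = []

    xid = payload.get("xid_type")
    # bool is a subclass of int, so reject it explicitly.
    if not isinstance(xid, int) or isinstance(xid, bool):
        errors.append("xid_type must be an integer")
    elif xid not in SUPPORTED_XIDS:
        errors.append(f"unsupported XID {xid}; expected one of {sorted(SUPPORTED_XIDS)}")

    gpu_id = payload.get("gpu_id")
    if not isinstance(gpu_id, int) or isinstance(gpu_id, bool):
        errors.append("gpu_id must be an integer")
    elif not 0 <= gpu_id < gpu_count:
        errors.append(f"gpu_id {gpu_id} out of range for {gpu_count} GPU(s)")

    return errors
```

Returning a list of errors rather than raising on the first one lets the endpoint report every problem in a single response.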

Where should the reviewer start?

gpu_xid_injector.py - Core XID injection:
- XID format definitions (matches real NVIDIA driver logs)
- inject_xid_error() function:
  - Uses nsenter --target 1 --mount --uts --ipc --net --pid to access host
  - Writes to /dev/kmsg (kernel log) with proper format
  - Validates XID types (79, 48, 94, 95, 43, 74)
- Helper functions for GPU ID validation
agent.py - Flask API server:
- Flask app setup, request validation
- /inject-xid endpoint implementation
- Health check endpoint
Dockerfile:
- Note: Requires nsenter package for host access
- Runs with host PID namespace (configured in deployment manifest)
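
The nsenter and /dev/kmsg steps above can be sketched as follows. This is an illustration under stated assumptions, not the contents of gpu_xid_injector.py: the helper names are hypothetical, the message layout is an assumed approximation of the driver's XID line, and only the nsenter flags are taken from the description.

```python
import subprocess

# Namespace flags quoted in the PR description; --target 1 enters the host's init process.
NSENTER_PREFIX = ["nsenter", "--target", "1", "--mount", "--uts", "--ipc", "--net", "--pid", "--"]


def format_xid_message(pci_addr: str, xid: int, detail: str) -> str:
    """Format a line resembling the NVIDIA driver's XID log output (assumed layout)."""
    return f"NVRM: Xid (PCI:{pci_addr}): {xid}, {detail}"


def build_kmsg_command(message: str) -> list[str]:
    """Build an argv that writes `message` to the host's /dev/kmsg.

    The message is passed as a positional argument ($1) rather than being
    interpolated into the shell script, so it is never parsed as shell code.
    The "<3>" prefix asks the kernel to record the line at KERN_ERR level.
    """
    script = 'printf "<3>%s\\n" "$1" > /dev/kmsg'
    return [*NSENTER_PREFIX, "sh", "-c", script, "sh", message]


def inject(message: str, timeout: float = 10.0) -> None:
    """Run the injection; needs root and access to the host PID namespace."""
    subprocess.run(build_kmsg_command(message), check=True, timeout=timeout)
```

Keeping command construction separate from execution means the argv can be unit-tested without root or a GPU node.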

Security considerations:

- Agent requires privileged access to write to /dev/kmsg
- Must run with hostPID: true to use nsenter
- DaemonSet ensures one pod per node (no cross-node injection)
- Should only be deployed in test clusters (not production)

Testing note:

- XID logs appear in dmesg and journalctl -k
- NVSentinel's syslog-health-monitor detects these XIDs
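
To confirm that an injected line is visible, a test can scan kernel log output for XID entries. This is a minimal sketch: the regular expression assumes the `NVRM: Xid (PCI:<address>): <number>,` layout, and `parse_xid_line` is a hypothetical helper, not part of this PR or of NVSentinel.

```python
import re
from typing import Optional

# Assumed layout of an XID line as it appears in dmesg / journalctl -k output.
XID_LINE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),")


def parse_xid_line(line: str) -> Optional[tuple[str, int]]:
    """Return (pci_address, xid) if the line is an XID report, else None."""
    match = XID_LINE.search(line)
    if match is None:
        return None
    return match.group("pci").lower(), int(match.group("xid"))
```

Lines from `dmesg` or `journalctl -k` can be passed through this function one at a time; timestamps and other prefixes are ignored because the pattern is searched for anywhere in the line.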

Related Issues:

Depends on: PR feat(fault-injection): Add fault injection API service #4042 (API service that calls this agent)
For now, this PR enables only XID 79; a future PR will enable all XIDs and add E2E tests for them
Relates to: NVSentinel fault detection testing

Summary by CodeRabbit

New Features
- Added GPU fault injection agent with health check and fault monitoring capabilities
- Enabled containerized deployment of the fault injection service
Chores
- Added Python dependencies for the fault injection service

copy-pr-bot · 2025-11-02T00:48:24Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2025-11-04T18:35:44Z

Walkthrough

Introduces a new GPU Fault Injector Agent service for testing GPU fault scenarios. Includes a FastAPI-based agent with endpoints for health checks, XID 79 fault injection via kernel methods, and fault listing; kernel-level injection module; Docker container setup; and Python dependencies.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| Service Implementation `agent.py`, `gpu_xid_injector.py` | Adds FastAPI-based GPU Fault Injector agent with GPUFaultInjector class tracking active faults and GPU metadata; provides /health, /inject-xid, and /faults endpoints. Introduces GPUXIDInjectorKernel class for kernel-level XID 79 injection via nsenter and kmsg, with privilege checks, PCI address normalization, and nvidia-smi integration. |
| Container & Execution `Dockerfile` | New Dockerfile using NVIDIA CUDA 12.3.0 base image; installs system dependencies (python3, pip, curl, util-linux, systemd, kmod, pciutils); configures /app working directory, logging, port 8083, and health check. |
| Dependencies `requirements.txt` | Declares Python packages: fastapi 0.109.0, httpx 0.26.0, kubernetes 28.1.0, pydantic 2.5.3, python-multipart 0.0.6, pyyaml 6.0.1, uvicorn[standard] 0.27.0. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Areas requiring extra attention:

- Privilege escalation and root-level kernel injection logic in gpu_xid_injector.py; verify nsenter and kmsg write safety and error handling
- PCI address parsing and normalization logic; validate nvidia-smi output parsing and sysfs format correctness
- Shell command execution with timeout in agent.py; review subprocess handling and injection points
- DCGM availability checks and GPU enumeration via system commands; verify fallback behavior and error cases
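
For the PCI address item above, the kind of normalization worth checking can be sketched as follows. This assumes nvidia-smi reports bus IDs with an eight-digit, upper-case domain (for example `00000000:3B:00.0`) while sysfs uses a four-digit, lower-case one; `normalize_pci_address` is a hypothetical name and not necessarily what the PR implements.

```python
import re

# domain:bus:device.function, with a 4- to 8-hex-digit domain.
_PCI_RE = re.compile(
    r"^(?P<domain>[0-9a-fA-F]{4,8}):(?P<bus>[0-9a-fA-F]{2}):"
    r"(?P<dev>[0-9a-fA-F]{2})\.(?P<fn>[0-7])$"
)


def normalize_pci_address(bus_id: str) -> str:
    """Convert e.g. '00000000:3B:00.0' to the sysfs-style '0000:3b:00.0'."""
    match = _PCI_RE.match(bus_id.strip())
    if match is None:
        raise ValueError(f"unrecognized PCI bus ID: {bus_id!r}")
    # Re-render the domain with at least four lower-case hex digits.
    domain = int(match.group("domain"), 16)
    bus = match.group("bus").lower()
    dev = match.group("dev").lower()
    return f"{domain:04x}:{bus}:{dev}.{match.group('fn')}"
```

Rejecting malformed input with `ValueError` instead of passing it through keeps unvalidated strings out of any later sysfs path or log message.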

Poem

🐰 A fuzzy injector hops into the fold,
XID faults now tested, GPU stories told,
CUDA kernels tremble at each kernel call,
Fastened by endpoints, we'll catch them all! 🚀

Pre-merge checks

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a GPU fault injector agent. It is concise, specific, and directly related to the changeset. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, covering overview, detailed changes, reviewer guidance, and security considerations. All template sections are present and substantive. |


coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d76583 and 33c1990.

📒 Files selected for processing (4)

tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/Dockerfile (1 hunks)
tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/agent.py (1 hunks)
tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/gpu_xid_injector.py (1 hunks)
tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/requirements.txt (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/agent.py (1)

tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/gpu_xid_injector.py (2)

GPUXIDInjectorKernel (23-153)

inject_xid_79_gpu_fell_off_bus (56-79)

🪛 OSV Scanner (2.2.4)

tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/requirements.txt

[HIGH] 1-1: fastapi 0.109.0: undefined

(PYSEC-2024-38)

[HIGH] 1-1: python-multipart 0.0.6: python-multipart vulnerable to Content-Type Header ReDoS

(GHSA-2jv5-9r88-3w3p)

[HIGH] 1-1: python-multipart 0.0.6: Denial of service (DoS) via deformation multipart/form-data boundary

(GHSA-59g5-xgcq-4qw3)

[HIGH] 1-1: starlette 0.35.1: Starlette has possible denial-of-service vector when parsing large files in multipart forms

(GHSA-2c2j-9gv5-cj73)

[HIGH] 1-1: starlette 0.35.1: Starlette Denial of service (DoS) via multipart/form-data

(GHSA-f96h-pmfr-66vw)

🪛 Ruff (0.14.3)

tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/agent.py

82-82: Do not catch blind exception: Exception

(BLE001)

94-94: Starting a process with a partial executable path

(S607)

96-96: Consider moving this statement to an else block

(TRY300)

97-97: Do not catch blind exception: Exception

(BLE001)

105-105: Starting a process with a partial executable path

(S607)

112-112: Consider moving this statement to an else block

(TRY300)

113-113: Do not catch blind exception: Exception

(BLE001)

114-114: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

120-120: subprocess call: check for execution of untrusted input

(S603)

128-128: Do not catch blind exception: Exception

(BLE001)

210-210: Possible binding to all interfaces

(S104)

tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/gpu_xid_injector.py

93-93: subprocess call: check for execution of untrusted input

(S603)

94-100: Starting a process with a partial executable path

(S607)

136-136: subprocess call: check for execution of untrusted input

(S603)

146-149: Consider moving this statement to an else block

(TRY300)

151-151: Do not catch blind exception: Exception

(BLE001)

152-152: Use logging.exception instead of logging.error

Replace with exception

(TRY400)
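
Several of the Ruff findings above (S607, BLE001, TRY300, TRY400) follow one pattern and can be addressed together. This is a hedged sketch of a subprocess wrapper, not the code in agent.py or gpu_xid_injector.py; `run_tool` is a hypothetical helper name.

```python
import logging
import shutil
import subprocess
from collections.abc import Sequence
from typing import Optional

logger = logging.getLogger(__name__)


def run_tool(name: str, args: Sequence[str], timeout: float = 10.0) -> Optional[str]:
    """Run an external tool and return its stdout, or None on any failure."""
    # S607: resolve the executable to a full path instead of relying on PATH at exec time.
    path = shutil.which(name)
    if path is None:
        logger.error("%s not found on PATH", name)
        return None
    try:
        result = subprocess.run(
            [path, *args], capture_output=True, text=True, timeout=timeout, check=True
        )
    # BLE001: catch only the failures subprocess.run is expected to raise.
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        # TRY400: logging.exception records the traceback along with the message.
        logger.exception("%s failed", name)
        return None
    else:
        # TRY300: the success path lives in the else block.
        return result.stdout
```

The S603 findings are not addressed by this wrapper; they call for checking that every argument passed to the subprocess comes from validated input.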

.../fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/requirements.txt

- Agent runs as DaemonSet on GPU nodes
- gpu_xid_injector.py: Injects XID errors into kernel logs
- agent.py: HTTP server that receives injection requests
- Dockerfile and requirements for deployment

Enables API service to trigger XID injection on specific nodes.

Signed-off-by: Oviya Seeniraj <[email protected]>

tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/agent.py

Ava-A4098 · 2025-11-19T19:09:57Z

suggest adding xid-31 (memory page fault) to the list.


Signed-off-by: Oviya Seeniraj <[email protected]> Signed-off-by: Harrison Saturley-Hall <[email protected]> Signed-off-by: Harrison King Saturley-Hall <[email protected]> Co-authored-by: Harrison Saturley-Hall <[email protected]>

pull-request-size bot added the size/L label Nov 2, 2025

github-actions bot added the feat label Nov 2, 2025

This was referenced Nov 2, 2025

feat(fault-injection): Add Kubernetes deployment manifests #4044

Merged

test(fault-injection): Add XID 79 NVSentinel E2E test #4046

Open

nv-oviya marked this pull request as ready for review November 4, 2025 18:32

nv-oviya requested review from a team as code owners November 4, 2025 18:32

coderabbitai bot reviewed Nov 4, 2025

.../fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/requirements.txt (outdated, resolved)

.../fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/requirements.txt (outdated, resolved)

nv-oviya marked this pull request as draft November 4, 2025 18:45

nv-oviya added 3 commits November 4, 2025 10:45

fixed copyright and mypy issues

9d823bb

Signed-off-by: Oviya Seeniraj <[email protected]>

Updated requirements.txt to prevent security vulnerabilities

01aa768

Signed-off-by: Oviya Seeniraj <[email protected]>

nv-oviya force-pushed the oviya/fault-injection/gpu-agent branch from 01e98d9 to 01aa768 Compare November 4, 2025 18:47

nv-oviya marked this pull request as ready for review November 4, 2025 20:19

nnshah1 reviewed Nov 19, 2025

tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/agent.py (resolved)

nnshah1 reviewed Nov 19, 2025

tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/agent.py (resolved)

tzulingk reviewed Nov 19, 2025

...ult_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/gpu_xid_injector.py (resolved)

tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/agent.py (outdated, resolved)

support all XIDs in kmsg injector not just 79 + add predefined messag…

e9885e8

…es for all the DCGM/NVSentinel monitored XIDs Signed-off-by: Oviya Seeniraj <[email protected]>

pull-request-size bot added size/XL and removed size/L labels Nov 25, 2025

saturley-hall approved these changes Nov 26, 2025

tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/Dockerfile (outdated, resolved)

saturley-hall added 2 commits November 26, 2025 16:30

Update tests/fault_tolerance/hardware/fault-injection-service/agents/…

9af2142

…gpu-fault-injector/Dockerfile Signed-off-by: Harrison Saturley-Hall <[email protected]>

fix: precommit formatting

97afb3f

Signed-off-by: Harrison King Saturley-Hall <[email protected]>

saturley-hall merged commit e10319f into main Nov 26, 2025
9 of 10 checks passed

saturley-hall deleted the oviya/fault-injection/gpu-agent branch November 26, 2025 21:33

nv-oviya mentioned this pull request Nov 26, 2025

fix: duplicate dictionary key 31 in XID_MESSAGES #4649

Merged
