@nv-oviya nv-oviya commented Nov 2, 2025

Overview:

This PR adds a DaemonSet agent that runs on every GPU node and performs actual XID injection into kernel logs. The API service (PR #4042) communicates with these agents to trigger faults on specific nodes.

Details:

  • Add agents/gpu-fault-injector/gpu_xid_injector.py: XID injection implementation
    • inject_xid_error(): Writes XID error to kernel log via nsenter
    • Supports XID types: 79, 48, 94, 95, 43, 74
    • Uses nsenter to access host's kernel log from container
    • Formats messages to match real NVIDIA driver XID format
  • Add agents/gpu-fault-injector/agent.py: Flask API server
    • /health: Health check
    • /inject-xid: Endpoint to trigger XID injection (called by API service)
    • Validates XID types and GPU IDs
    • Returns injection status
  • Add agents/gpu-fault-injector/requirements.txt: Flask dependencies
  • Add agents/gpu-fault-injector/Dockerfile: Agent container image
    • Includes nsenter utility for host access
    • Runs Flask server on port 5000
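
The validation and message-formatting steps listed above could look roughly like the sketch below. The set of XID types is taken from this description; the helper names (`validate_request`, `format_xid_message`) and the `gpu_count` parameter are hypothetical, and the message layout is only modelled on the `NVRM: Xid (PCI:...): <id>, ...` shape commonly seen in driver logs, not copied from the PR's code.

```python
# Hypothetical helpers; only the XID set below comes from the PR description.
SUPPORTED_XIDS = frozenset({43, 48, 74, 79, 94, 95})


def validate_request(xid: int, gpu_id: int, gpu_count: int) -> None:
    """Raise ValueError if the XID type or GPU index is not acceptable."""
    if xid not in SUPPORTED_XIDS:
        raise ValueError(
            f"unsupported XID {xid}; expected one of {sorted(SUPPORTED_XIDS)}"
        )
    if not 0 <= gpu_id < gpu_count:
        raise ValueError(f"GPU id {gpu_id} out of range (node has {gpu_count} GPUs)")


def format_xid_message(xid: int, pci_address: str, detail: str) -> str:
    """Build a log line shaped like a driver XID report (assumed layout)."""
    return f"NVRM: Xid (PCI:{pci_address}): {xid}, {detail}"
```

Keeping validation separate from formatting means the HTTP layer can reject a bad request before anything touches the kernel log.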

Where should the reviewer start?

  1. gpu_xid_injector.py - Core XID injection:

    • XID format definitions (matches real NVIDIA driver logs)
    • inject_xid_error() function:
      • Uses nsenter --target 1 --mount --uts --ipc --net --pid to access host
      • Writes to /dev/kmsg (kernel log) with proper format
      • Validates XID types (79, 48, 94, 95, 43, 74)
    • Helper functions for GPU ID validation
  2. agent.py - Flask API server:

    • Flask app setup, request validation
    • /inject-xid endpoint implementation
    • Health check endpoint
  3. Dockerfile:

    • Note: Requires nsenter package for host access
    • Runs with host PID namespace (configured in deployment manifest)
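
As a rough illustration of the nsenter step described in item 1, the sketch below only builds the command line; it does not run it. The function name `build_kmsg_command` is hypothetical, and the use of `sh -c` with `echo` is an assumption about how the write to /dev/kmsg might be done, not the PR's confirmed approach.

```python
from __future__ import annotations

import shlex


def build_kmsg_command(message: str) -> list[str]:
    """Return an argv that would write `message` to the host's /dev/kmsg.

    The nsenter flags mirror the ones named in the PR description.
    """
    # Quote the message so shell metacharacters in it are not interpreted.
    shell_line = f"echo {shlex.quote(message)} > /dev/kmsg"
    return [
        "nsenter", "--target", "1",
        "--mount", "--uts", "--ipc", "--net", "--pid",
        "--", "sh", "-c", shell_line,
    ]
```

Actually executing the result (for example with `subprocess.run(cmd, check=True, timeout=10)`) only makes sense inside a privileged pod that shares the host PID namespace, which is why the builder is kept separate from execution here.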

Security considerations:

  • Agent requires privileged access to write to /dev/kmsg
  • Must run with hostPID: true to use nsenter
  • DaemonSet ensures one pod per node (no cross-node injection)
  • Should only be deployed in test clusters (not production)
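
A preflight check along the following lines could make the privilege requirements above fail loudly at startup rather than at injection time. This is a minimal sketch; the function name and its parameters are hypothetical, and it does not verify `hostPID: true` (that setting lives in the deployment manifest).

```python
from __future__ import annotations

import os


def preflight_problems(euid: int | None = None, kmsg_path: str = "/dev/kmsg") -> list[str]:
    """Return human-readable reasons why injection is unlikely to work."""
    problems = []
    # Allow the effective UID to be passed in so the check is easy to test.
    effective_uid = os.geteuid() if euid is None else euid
    if effective_uid != 0:
        problems.append("agent is not running as root")
    if not os.access(kmsg_path, os.W_OK):
        problems.append(f"{kmsg_path} is not writable from this container")
    return problems
```

An empty list means neither check failed; it does not by itself prove that the pod is privileged enough for nsenter to succeed.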

Testing note:

  • XID logs appear in dmesg and journalctl -k
  • NVSentinel's syslog-health-monitor detects these XIDs
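
To confirm an injection landed, the output of `dmesg` or `journalctl -k` can be scanned for XID lines. The sketch below is one way to do that; the regular expression assumes the `NVRM: Xid (PCI:...): <id>,` layout and is not NVSentinel's actual matching logic.

```python
from __future__ import annotations

import re

# Assumed layout of an XID log line; adjust if the driver output differs.
XID_LINE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),")


def find_xids(log_text: str) -> list[tuple[str, int]]:
    """Return (pci_address, xid) pairs for every XID line in the text."""
    return [
        (match.group("pci"), int(match.group("xid")))
        for match in XID_LINE.finditer(log_text)
    ]
```

Feeding it captured kernel log text after calling /inject-xid gives a quick end-to-end check that does not depend on the monitor being deployed.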

Related Issues:

Summary by CodeRabbit

  • New Features

    • Added GPU fault injection agent with health check and fault monitoring capabilities
    • Enabled containerized deployment of the fault injection service
  • Chores

    • Added Python dependencies for the fault injection service

copy-pr-bot bot commented Nov 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot commented Nov 4, 2025

Walkthrough

Introduces a new GPU Fault Injector Agent service for testing GPU fault scenarios. Includes a FastAPI-based agent with endpoints for health checks, XID 79 fault injection via kernel methods, and fault listing; kernel-level injection module; Docker container setup; and Python dependencies.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| Service Implementation: agent.py, gpu_xid_injector.py | Adds FastAPI-based GPU Fault Injector agent with GPUFaultInjector class tracking active faults and GPU metadata; provides /health, /inject-xid, and /faults endpoints. Introduces GPUXIDInjectorKernel class for kernel-level XID 79 injection via nsenter and kmsg, with privilege checks, PCI address normalization, and nvidia-smi integration. |
| Container & Execution: Dockerfile | New Dockerfile using NVIDIA CUDA 12.3.0 base image; installs system dependencies (python3, pip, curl, util-linux, systemd, kmod, pciutils); configures /app working directory, logging, port 8083, and health check. |
| Dependencies: requirements.txt | Declares Python packages: fastapi 0.109.0, httpx 0.26.0, kubernetes 28.1.0, pydantic 2.5.3, python-multipart 0.0.6, pyyaml 6.0.1, uvicorn[standard] 0.27.0. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Areas requiring extra attention:

  • Privilege escalation and root-level kernel injection logic in gpu_xid_injector.py; verify nsenter and kmsg write safety and error handling
  • PCI address parsing and normalization logic; validate nvidia-smi output parsing and sysfs format correctness
  • Shell command execution with timeout in agent.py; review subprocess handling and injection points
  • DCGM availability checks and GPU enumeration via system commands; verify fallback behavior and error cases
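
On the PCI address point: nvidia-smi typically reports bus IDs with an 8-digit domain in upper case (for example `00000000:3B:00.0`), while sysfs paths use a 4-digit lower-case domain. A minimal normalization sketch, with a hypothetical function name and no claim to match the PR's implementation:

```python
def normalize_pci_address(bus_id: str) -> str:
    """Convert an nvidia-smi style bus ID to the sysfs style.

    Example input shape: '00000000:3B:00.0' (domain:bus:device.function).
    """
    parts = bus_id.strip().lower().split(":")
    if len(parts) != 3:
        raise ValueError(f"unexpected PCI bus id: {bus_id!r}")
    domain, bus, device_function = parts
    # Re-render the domain with at least four hex digits, as sysfs does.
    return f"{int(domain, 16):04x}:{bus}:{device_function}"
```

Parsing the domain as a hexadecimal integer, rather than slicing the string, keeps domains wider than four digits intact.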

Poem

🐰 A fuzzy injector hops into the fold,
XID faults now tested, GPU stories told,
CUDA kernels tremble at each kernel call,
Fastened by endpoints, we'll catch them all! 🚀

Pre-merge checks

✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a GPU fault injector agent. It is concise, specific, and directly related to the changeset. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, covering overview, detailed changes, reviewer guidance, and security considerations. All template sections are present and substantive. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d76583 and 33c1990.

📒 Files selected for processing (4)
  • tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/Dockerfile (1 hunks)
  • tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/agent.py (1 hunks)
  • tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/gpu_xid_injector.py (1 hunks)
  • tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/requirements.txt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/agent.py (1)
tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/gpu_xid_injector.py (2)
  • GPUXIDInjectorKernel (23-153)
  • inject_xid_79_gpu_fell_off_bus (56-79)
🪛 OSV Scanner (2.2.4)
tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/requirements.txt

[HIGH] 1-1: fastapi 0.109.0: undefined

(PYSEC-2024-38)


[HIGH] 1-1: python-multipart 0.0.6: python-multipart vulnerable to Content-Type Header ReDoS

(GHSA-2jv5-9r88-3w3p)


[HIGH] 1-1: python-multipart 0.0.6: Denial of service (DoS) via deformation multipart/form-data boundary

(GHSA-59g5-xgcq-4qw3)


[HIGH] 1-1: starlette 0.35.1: Starlette has possible denial-of-service vector when parsing large files in multipart forms

(GHSA-2c2j-9gv5-cj73)


[HIGH] 1-1: starlette 0.35.1: Starlette Denial of service (DoS) via multipart/form-data

(GHSA-f96h-pmfr-66vw)

🪛 Ruff (0.14.3)
tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/agent.py

82-82: Do not catch blind exception: Exception

(BLE001)


94-94: Starting a process with a partial executable path

(S607)


96-96: Consider moving this statement to an else block

(TRY300)


97-97: Do not catch blind exception: Exception

(BLE001)


105-105: Starting a process with a partial executable path

(S607)


112-112: Consider moving this statement to an else block

(TRY300)


113-113: Do not catch blind exception: Exception

(BLE001)


114-114: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


120-120: subprocess call: check for execution of untrusted input

(S603)


128-128: Do not catch blind exception: Exception

(BLE001)


210-210: Possible binding to all interfaces

(S104)

tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/gpu_xid_injector.py

93-93: subprocess call: check for execution of untrusted input

(S603)


94-100: Starting a process with a partial executable path

(S607)


136-136: subprocess call: check for execution of untrusted input

(S603)


146-149: Consider moving this statement to an else block

(TRY300)


151-151: Do not catch blind exception: Exception

(BLE001)


152-152: Use logging.exception instead of logging.error

Replace with exception

(TRY400)
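
Several of the Ruff findings above (BLE001, S607, TRY300, TRY400) point at the same subprocess-handling pattern. The sketch below shows one way such a call could be restructured; `run_tool` is a hypothetical helper, not code from agent.py or gpu_xid_injector.py.

```python
from __future__ import annotations

import logging
import shutil
import subprocess

logger = logging.getLogger(__name__)


def run_tool(name: str, args: list[str], timeout: float = 10.0) -> str | None:
    """Run an external tool and return its stdout, or None on failure."""
    # Resolve to an absolute path instead of relying on a partial name (S607).
    executable = shutil.which(name)
    if executable is None:
        logger.error("%s not found on PATH", name)
        return None
    try:
        result = subprocess.run(
            [executable, *args],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
    # Catch only the failures we expect rather than bare Exception (BLE001).
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        # logging.exception records the traceback (TRY400).
        logger.exception("%s failed", name)
        return None
    else:
        # The success path lives in the else block (TRY300).
        return result.stdout
```

The S603 and S104 findings are about untrusted input and binding to all interfaces; this sketch does not address either.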

@nv-oviya nv-oviya marked this pull request as draft November 4, 2025 18:45
- Agent runs as DaemonSet on GPU nodes
- gpu_xid_injector.py: Injects XID errors into kernel logs
- agent.py: HTTP server that receives injection requests
- Dockerfile and requirements for deployment

Enables API service to trigger XID injection on specific nodes.

Signed-off-by: Oviya Seeniraj <[email protected]>
Signed-off-by: Oviya Seeniraj <[email protected]>
@nv-oviya nv-oviya force-pushed the oviya/fault-injection/gpu-agent branch from 01e98d9 to 01aa768 Compare November 4, 2025 18:47
@nv-oviya nv-oviya marked this pull request as ready for review November 4, 2025 20:19
@Ava-A4098

Suggest adding XID 31 (memory page fault) to the list.

…es for all the DCGM/NVSentinel monitored XIDs

Signed-off-by: Oviya Seeniraj <[email protected]>
…gpu-fault-injector/Dockerfile

Signed-off-by: Harrison Saturley-Hall <[email protected]>
Signed-off-by: Harrison King Saturley-Hall <[email protected]>
@saturley-hall saturley-hall merged commit e10319f into main Nov 26, 2025
9 of 10 checks passed
@saturley-hall saturley-hall deleted the oviya/fault-injection/gpu-agent branch November 26, 2025 21:33
zxue2 pushed a commit to zxue2/dynamo that referenced this pull request Dec 11, 2025
Signed-off-by: Oviya Seeniraj <[email protected]>
Signed-off-by: Harrison Saturley-Hall <[email protected]>
Signed-off-by: Harrison King Saturley-Hall <[email protected]>
Co-authored-by: Harrison Saturley-Hall <[email protected]>