
Conversation


@nv-oviya nv-oviya commented Dec 2, 2025

Overview:

Implements runtime fault injection control via filesystem toggles and hostPath volumes, eliminating the need for pod restarts when enabling/disabling CUDA faults (except for initial library deployment).

Previously, enabling or disabling CUDA fault injection required modifying environment variables and restarting pods, which is unrealistic for testing fault-tolerance recovery scenarios. This PR introduces a 3-tier toggling system that allows instant fault activation and deactivation in running pods.


Details:

  1. cuda_intercept.c - Runtime Toggle Detection
    • Modified get_fault_config() to check fault injection status on every CUDA call rather than once at initialization
    • Implements 3-tier fallback: /host-fault/cuda_fault_enabled (persistent) → /tmp/cuda_fault_enabled (ephemeral) → CUDA_FAULT_INJECTION_ENABLED env var
    • The hostPath check ensures faults persist across pod crashes/restarts on the same node, accurately simulating persistent hardware failure
    • When pod reschedules to a different node → no fault marker file → automatic recovery
  2. inject_into_pods.py - Persistent volume infrastructure (patch shape sketched after this list)
    • Adds node-fault-marker hostPath volume mounting /var/lib/cuda-fault-test (host) to /host-fault (pod)
    • Uses DirectoryOrCreate to ensure directory exists on node
    • Introduces passthrough_mode parameter: deploys library with CUDA_FAULT_INJECTION_ENABLED=0, allowing baseline testing before fault injection
    • Sets aggressive update strategy (maxUnavailable=100%, maxSurge=0) to ensure all pods update simultaneously when enabling
    • Force-deletes pods when enabling to apply changes immediately (one-time operation)
  3. cuda_fault_injection.py - New helper methods
    • check_if_cuda_library_deployed(): Detects if CUDA library is already injected (checks LD_PRELOAD, init containers, volumes)
    • enable_cuda_faults_via_toggle(): Writes "1" to /host-fault/cuda_fault_enabled in running pods via kubectl exec (no restart needed)
    • disable_cuda_faults_via_toggle(): Writes "0" to toggle file for instant recovery
    • cleanup_node_fault_markers(): Removes /host-fault/cuda_fault_enabled from nodes (cleanup after tests)
    • verify_env_var_set(): Validates environment variable propagation to deployment spec
  4. README.md - Updated documentation
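
To make the moving parts concrete, here is a minimal sketch of the patch shape described above. The volume name, host/mount paths, DirectoryOrCreate type, LD_PRELOAD value, and env var semantics come from this PR; the function name and exact dict layout are illustrative, not the actual inject_into_pods.py code:

```python
# Sketch only: field values are from the PR description; the function name
# and dict layout are illustrative.
def build_fault_injection_patch(passthrough_mode: bool) -> dict:
    return {
        "volumes": [{
            "name": "node-fault-marker",
            "hostPath": {
                "path": "/var/lib/cuda-fault-test",  # directory on the node
                "type": "DirectoryOrCreate",         # created if missing
            },
        }],
        "volumeMounts": [{
            "name": "node-fault-marker",
            "mountPath": "/host-fault",  # path checked by cuda_intercept.c
        }],
        "env": [
            {"name": "LD_PRELOAD", "value": "/tmp/cuda_intercept.so"},
            # Passthrough mode loads the library but leaves faults off,
            # enabling baseline runs before any fault is injected.
            {"name": "CUDA_FAULT_INJECTION_ENABLED",
             "value": "0" if passthrough_mode else "1"},
        ],
    }
```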

Architecture Context:

  • Phase 0: Deploy library in passthrough mode (baseline testing)
  • Phase 1: Toggle faults ON via filesystem → pods crash naturally
  • Phase 2: Toggle faults OFF via filesystem → pods recover on restart
  • No forced deletions or restarts needed after initial setup (toggle flow sketched below)
  • hostPath persistence simulates real hardware failure that persists until pod reschedules
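
The toggle flow itself reduces to writing a single byte into the marker file from inside a running pod. A minimal sketch, assuming kubectl is on PATH and using the mkdir/write/read-back sequence and 10-second timeout described in the review below; the pod name and namespace are placeholders:

```python
import subprocess

TOGGLE_PATH = "/host-fault/cuda_fault_enabled"

def set_fault_toggle(pod: str, namespace: str, enabled: bool) -> None:
    """Flip fault injection in a running pod without restarting it."""
    value = "1" if enabled else "0"
    # Ensure the mount dir exists, write the toggle, then read it back to verify.
    shell_cmd = (
        f"mkdir -p /host-fault && "
        f"echo -n {value} > {TOGGLE_PATH} && cat {TOGGLE_PATH}"
    )
    result = subprocess.run(
        ["kubectl", "exec", "-n", namespace, pod, "--", "sh", "-c", shell_cmd],
        capture_output=True, text=True, timeout=10, check=True,
    )
    if result.stdout.strip() != value:
        raise RuntimeError(f"toggle verification failed in pod {pod}")

# Phase 1: set_fault_toggle("vllm-worker-abc123", "dynamo", True)   # faults ON
# Phase 2: set_fault_toggle("vllm-worker-abc123", "dynamo", False)  # recovery
```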

Where should the reviewer start?

  1. Start with cuda_intercept.c changes (lines 62-130):
    • Review the modified get_fault_config() function
    • Understand the 3-tier fallback mechanism (hostPath → /tmp → env var)
    • Ensure this is checked on every CUDA call for instant toggling
  2. Next, review inject_into_pods.py (lines 203-215, 268-278):
    • Examine the hostPath volume definition (/var/lib/cuda-fault-test)
    • Check how it is mounted to /host-fault with write permissions
    • Review passthrough_mode parameter introduction (line 348)
  3. Then check cuda_fault_injection.py new methods:
    • enable_cuda_faults_via_toggle() - See how it uses kubectl exec to write the toggle file
    • disable_cuda_faults_via_toggle() - Same mechanism for recovery
    • cleanup_node_fault_markers() - Cleanup logic for test teardown

Related Issues:

Builds on PR #4038

Summary by CodeRabbit

  • New Features

    • Added runtime fault injection toggling without pod restarts
    • Introduced persistent fault state across node restarts via hostPath volumes
    • Added passthrough mode for baseline testing with library loaded but disabled
    • Enhanced deployment verification and status checking capabilities
  • Documentation

    • Updated guide describing persistent fault markers, runtime control model, and node-specific fault isolation


… cuda fault injections w/o requiring restarts or force deletion (besides initial setup)

Signed-off-by: Oviya Seeniraj <[email protected]>

copy-pr-bot bot commented Dec 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nv-oviya nv-oviya marked this pull request as ready for review December 3, 2025 17:47
@nv-oviya nv-oviya requested review from a team as code owners December 3, 2025 17:47

coderabbitai bot commented Dec 3, 2025

Walkthrough

This PR enhances CUDA fault injection for Kubernetes deployments by introducing persistent fault toggling via hostPath volumes, enabling runtime control without pod restarts. It adds new orchestration helpers, implements aggressive pod restart strategies, and includes passthrough mode for baseline testing with the library loaded but disabled.

Changes

• Documentation: tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/README.md
  Rewrites the high-level description to cover LD_PRELOAD interception and persistent fault state via hostPath. Adds a "Key Features" section detailing persistent faults, runtime toggles, and node-specific isolation. Introduces a "How It Works" section covering deployment patching, LD_PRELOAD injection, runtime control, and node persistence steps. Updates the "Files in This Directory" table to reflect the new behaviors and expands the "Scope" section with in-scope/out-of-scope details.

• Runtime Fault Toggling: tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/cuda_intercept.c
  Introduces per-call runtime fault injection toggling by reading /host-fault/cuda_fault_enabled (falling back to /tmp/cuda_fault_enabled). Replaces the single cached flag with an environment-based default plus a per-call runtime check. Updates the diagnostic log message from "ENABLED - Simulating XID" to "Library loaded - XID" and removes the default logging when CUDA_XID_TYPE is not provided.

• Pod Patching & Deployment Control: tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/inject_into_pods.py
  Adds a persistent node-fault-marker hostPath volume mounted at /host-fault. Introduces a passthrough_mode parameter that controls the CUDA_FAULT_INJECTION_ENABLED environment variable ("0" in passthrough mode, "1" otherwise). Applies an aggressive update strategy (100% maxUnavailable, 0 maxSurge) when enabling injection, forces an immediate restart by deleting worker pods and waiting for new instances, and extends cleanup paths to remove the new volume and markers during teardown.

• Orchestration Helpers: tests/fault_tolerance/hardware/fault_injection_service/helpers/cuda_fault_injection.py
  Adds five new methods to CUDAFaultInjector: check_if_cuda_library_deployed() to detect the CUDA fault library, enable_cuda_faults_via_toggle() and disable_cuda_faults_via_toggle() for runtime toggling, cleanup_node_fault_markers() for state cleanup, and verify_env_var_set() for environment variable validation across worker services. Also updates patch_deployment_for_cuda_fault() to accept a passthrough_mode parameter and extends the patching and cleanup flows to support persistent markers and verification steps.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • cuda_intercept.c: Runtime fault injection logic with per-call file I/O requires careful review of the fallback mechanism and thread-safety implications
  • inject_into_pods.py: Pod restart strategy and aggressive update approach may have operational impact; hostPath volume lifecycle and cleanup paths need verification
  • cuda_fault_injection.py: Six new public methods with varying complexity; passthrough mode integration and verification polling logic require scrutiny
  • Cross-file consistency: Ensure persistent marker file paths, environment variable names, and toggle semantics align across C code, Python orchestration, and helper layer

Poem

🐰 With hostPath persistent and toggles so keen,
Faults dance at runtime, a controller's dream!
No restarts needed, just markers that glow,
Through /host-fault the injections now flow.
Aggressive updates chase pods to the sky,
As helpers orchestrate—watch the tests fly! 🚀

Pre-merge checks

✅ Passed checks (3 passed)
• Title check: ✅ Passed. Title clearly summarizes the main change: enabling runtime CUDA fault injection toggling without pod restarts, which is the primary objective of this PR.
• Description check: ✅ Passed. Description follows the template structure with Overview, Details, and Where to start sections; provides a comprehensive explanation of changes, architecture, and implementation details.
• Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/inject_into_pods.py (1)

591-601: Standard Deployment path doesn't support passthrough_mode.

The standard Kubernetes Deployment path always sets CUDA_FAULT_INJECTION_ENABLED=1, ignoring the passthrough_mode parameter. This is inconsistent with the DynamoGraphDeployment path.

         if enable:
             # Add new env vars
+            fault_enabled_value = "0" if passthrough_mode else "1"
             container.env.append(
                 client.V1EnvVar(name="LD_PRELOAD", value="/tmp/cuda_intercept.so")
             )
             container.env.append(
-                client.V1EnvVar(name="CUDA_FAULT_INJECTION_ENABLED", value="1")
+                client.V1EnvVar(name="CUDA_FAULT_INJECTION_ENABLED", value=fault_enabled_value)
             )
tests/fault_tolerance/hardware/fault_injection_service/helpers/cuda_fault_injection.py (1)

167-175: Duplicate Args: block in docstring.

The docstring has two Args: sections. The first (lines 167-170) describes passthrough_mode, and the second (lines 171-175) describes the other parameters.

-        Args:
-            passthrough_mode: If True, set CUDA_FAULT_INJECTION_ENABLED=0
-                            (library loaded but faults disabled for baseline)
-
         Args:
             deployment_name: Name of the deployment
             namespace: Kubernetes namespace
             target_node: Node to pin pods to (simulates real XID behavior)
             xid_type: XID error type to simulate (79, 48, 94, 95, 43, 74). Default: 79
+            passthrough_mode: If True, set CUDA_FAULT_INJECTION_ENABLED=0
+                            (library loaded but faults disabled for baseline)
🧹 Nitpick comments (7)
tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/README.md (1)

11-13: Add language specifier to fenced code block.

The code block should have a language specifier for proper syntax highlighting and linting compliance.

-```
+```text
 Pod calls cudaMalloc() → LD_PRELOAD intercepts → Checks /host-fault/cuda_fault_enabled → Returns error → Pod crashes

tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/cuda_intercept.c (2)

105-128: Per-call file I/O may impact CUDA-heavy workloads.

Reading the toggle file on every CUDA call introduces filesystem overhead. While this enables instant runtime control, it could affect performance for applications making thousands of CUDA calls per second. Consider:

1. **Current approach is acceptable** for fault injection testing scenarios where the overhead is negligible compared to the intentional fault behavior.
2. **Optional optimization**: Add a time-based cache (e.g., check file every 100ms) if performance becomes a concern in passthrough mode.

Additionally, the file reading logic is duplicated. A helper function would reduce code duplication.


+// Helper to read toggle value from file
+static int read_toggle_file(const char* path) {
+  FILE* f = fopen(path, "r");
+  if (!f) return -1;  // File not found
+  
+  char buf[4] = {0};
+  int result = -1;
+  if (fgets(buf, sizeof(buf), f)) {
+    result = (buf[0] == '1') ? 1 : 0;
+  }
+  fclose(f);
+  return result;
+}

   // Runtime toggle: Check node-persistent fault marker on EVERY call
   // Use hostPath (/host-fault) so fault persists across pod restarts on same node
   // Pod reschedules to different node → no file there → automatic recovery!
   int runtime_inject = env_inject;  // Default to env var

-  // Check hostPath first (persistent across restarts on same node)
-  FILE* toggle_file = fopen("/host-fault/cuda_fault_enabled", "r");
-  if (toggle_file) {
-    char toggle_value[4] = {0};
-    if (fgets(toggle_value, sizeof(toggle_value), toggle_file)) {
-      runtime_inject = (toggle_value[0] == '1');
-    }
-    fclose(toggle_file);
-  } else {
-    // Fallback to ephemeral /tmp for backwards compatibility
-    toggle_file = fopen("/tmp/cuda_fault_enabled", "r");
-    if (toggle_file) {
-      char toggle_value[4] = {0};
-      if (fgets(toggle_value, sizeof(toggle_value), toggle_file)) {
-        runtime_inject = (toggle_value[0] == '1');
-      }
-      fclose(toggle_file);
-    }
-  }
+  // Check hostPath first (persistent across restarts on same node)
+  int toggle_result = read_toggle_file("/host-fault/cuda_fault_enabled");
+  if (toggle_result >= 0) {
+    runtime_inject = toggle_result;
+  } else {
+    // Fallback to ephemeral /tmp for backwards compatibility
+    toggle_result = read_toggle_file("/tmp/cuda_fault_enabled");
+    if (toggle_result >= 0) {
+      runtime_inject = toggle_result;
+    }
+  }

155-165: Redundant get_fault_config() call in log_intercept().

log_intercept() first calls should_inject_fault() (which calls get_fault_config()), then calls get_fault_config() again to get the XID. Since get_fault_config() now reads files on every call, this doubles the I/O for logged operations.

 // Log helper
 static void
-log_intercept(const char* func_name, cudaError_t error_code)
+log_intercept(const char* func_name, int xid, cudaError_t error_code)
 {
-  if (should_inject_fault()) {
-    int inject, xid;
-    cudaError_t err;
-    get_fault_config(&inject, &xid, &err);
-    fprintf(stderr, "[XID %d SIM] %s() intercepted -> error %d\n", xid, func_name, error_code);
-  }
+  fprintf(stderr, "[XID %d SIM] %s() intercepted -> error %d\n", xid, func_name, error_code);
 }

Then update call sites to pass the XID obtained from the initial get_fault_config() check:

// Example for cudaGetDeviceCount:
int inject, xid;
cudaError_t error;
get_fault_config(&inject, &xid, &error);
if (inject) {
  log_intercept("cudaGetDeviceCount", xid, error);
  // ...
}
tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/inject_into_pods.py (1)

434-454: Update strategy is intentionally aggressive - document the trade-off.

Setting maxUnavailable=100% ensures all pods receive the fault injection simultaneously, which is correct for testing scenarios. However, this will cause complete service unavailability during the update.

Consider adding a comment or log message making this explicit to avoid confusion:

             # Allow all pods to be unavailable during update
             spec["updateStrategy"]["rollingUpdate"]["maxUnavailable"] = "100%"
             # Don't create surge pods
             spec["updateStrategy"]["rollingUpdate"]["maxSurge"] = 0
             print("    → Set update strategy: maxUnavailable=100%, maxSurge=0")
-            print("       (All pods will update simultaneously)")
+            print("       (All pods will update simultaneously - service will be unavailable during update)")
tests/fault_tolerance/hardware/fault_injection_service/helpers/cuda_fault_injection.py (3)

129-146: Container name check could be fragile.

The check looks for containers named "vllm-worker" or "worker", but the actual path checks workerSpec.podSpec.containers which may differ from the extraPodSpec used elsewhere. Also, silently returning False on any exception could hide legitimate configuration issues.

Consider:

  1. Checking all containers rather than specific names, or
  2. Looking for LD_PRELOAD in the extraPodSpec.mainContainer.env path (consistent with patching logic)
-            for container in containers:
-                if container.get("name") in ["vllm-worker", "worker"]:
-                    env = container.get("env", [])
-                    for env_var in env:
-                        if env_var.get("name") == "LD_PRELOAD":
-                            return True
-
-            return False
+            # Check services (consistent with patching logic)
+            services = spec.get("services", {})
+            for service_name in ["VllmDecodeWorker", "VllmPrefillWorker"]:
+                service = services.get(service_name, {})
+                env_vars = (
+                    service.get("extraPodSpec", {})
+                    .get("mainContainer", {})
+                    .get("env", [])
+                )
+                for env_var in env_vars:
+                    if env_var.get("name") == "LD_PRELOAD":
+                        return True
+            return False

609-633: Complex control flow is correct but hard to follow.

The nested for-else-break pattern works but is difficult to reason about. Consider restructuring for clarity:

-                for service_name in ["VllmDecodeWorker", "VllmPrefillWorker"]:
-                    if service_name in dgd["spec"]["services"]:
-                        service = dgd["spec"]["services"][service_name]
-                        env_vars = (
-                            service.get("extraPodSpec", {})
-                            .get("mainContainer", {})
-                            .get("env", [])
-                        )
-
-                        for env_var in env_vars:
-                            if env_var.get("name") == "CUDA_FAULT_INJECTION_ENABLED":
-                                if env_var.get("value") != expected_value:
-                                    time.sleep(1)
-                                    break  # Try again
-                        else:
-                            continue  # This service is good
-                        break  # Inner loop broke, try again
-                else:
-                    # All services verified
-                    return True
+                all_match = True
+                for service_name in ["VllmDecodeWorker", "VllmPrefillWorker"]:
+                    if service_name not in dgd["spec"]["services"]:
+                        continue
+                    service = dgd["spec"]["services"][service_name]
+                    env_vars = (
+                        service.get("extraPodSpec", {})
+                        .get("mainContainer", {})
+                        .get("env", [])
+                    )
+                    for env_var in env_vars:
+                        if env_var.get("name") == "CUDA_FAULT_INJECTION_ENABLED":
+                            if env_var.get("value") != expected_value:
+                                all_match = False
+                            break
+                
+                if all_match:
+                    return True
+                time.sleep(1)

393-395: Inconsistent type hint.

enable_cuda_faults_via_toggle uses pods: List while disable_cuda_faults_via_toggle uses pods: List[client.V1Pod]. Use consistent typing.

     def enable_cuda_faults_via_toggle(
-        self, pods: List, namespace: str, enable: bool = True
+        self, pods: List[client.V1Pod], namespace: str, enable: bool = True
     ) -> bool:
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 135bce4 and 3a7feb1.

📒 Files selected for processing (4)
  • tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/README.md (2 hunks)
  • tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/cuda_intercept.c (3 hunks)
  • tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/inject_into_pods.py (9 hunks)
  • tests/fault_tolerance/hardware/fault_injection_service/helpers/cuda_fault_injection.py (5 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/README.md

11-11: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.7)
tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/inject_into_pods.py

542-542: Do not catch blind exception: Exception

(BLE001)


549-549: Do not catch blind exception: Exception

(BLE001)

tests/fault_tolerance/hardware/fault_injection_service/helpers/cuda_fault_injection.py

142-142: Consider moving this statement to an else block

(TRY300)


144-144: Do not catch blind exception: Exception

(BLE001)


441-441: subprocess call: check for execution of untrusted input

(S603)


442-452: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)


477-477: Do not catch blind exception: Exception

(BLE001)


548-548: subprocess call: check for execution of untrusted input

(S603)


549-559: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)


570-571: try-except-continue detected, consider logging the exception

(S112)


570-570: Do not catch blind exception: Exception

(BLE001)


630-630: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (7)
tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/README.md (1)

9-54: LGTM!

The documentation updates accurately describe the new runtime toggling mechanism, hostPath-based persistence, and the phased workflow. The "Key Features" and "How It Works" sections provide clear guidance on the architecture.

tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/cuda_intercept.c (1)

167-312: LGTM!

The CUDA function interception implementations follow a consistent pattern: check for fault injection, return error if enabled, otherwise delegate to the real function via dlsym(RTLD_NEXT, ...). This is a clean and maintainable approach.

tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/inject_into_pods.py (3)

414-422: LGTM!

The passthrough_mode implementation cleanly sets CUDA_FAULT_INJECTION_ENABLED=0 for baseline testing with the library loaded but inactive. This enables the phased workflow: deploy → toggle ON → toggle OFF.


534-551: Force deletion is aggressive but appropriate for fault injection.

Using grace_period_seconds=0 immediately terminates pods without graceful shutdown. This is acceptable for fault injection testing where you're simulating catastrophic failures.

The exception handling logs errors and continues, which is the right approach for batch operations. The static analysis warnings about blind exception catches (BLE001) are acceptable here since we're logging and intentionally continuing.
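
For reference, a minimal sketch of that force deletion with the official Kubernetes Python client; the pod name and namespace are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# grace_period_seconds=0 skips graceful shutdown entirely, which is the point
# when simulating a sudden hardware failure.
core.delete_namespaced_pod(
    name="vllm-worker-abc123",  # placeholder pod name
    namespace="dynamo",         # placeholder namespace
    grace_period_seconds=0,
)
```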


204-214: LGTM - hostPath volume for persistent fault marker is properly implemented.

Using DirectoryOrCreate for the hostPath is appropriate. The path /var/lib/cuda-fault-test (mounted as /host-fault in containers) follows the intended pattern documented in the fault injection README and is properly cleaned up by cleanup_node_fault_markers() to prevent conflicts between test runs.

tests/fault_tolerance/hardware/fault_injection_service/helpers/cuda_fault_injection.py (2)

433-456: LGTM - Toggle implementation with verification.

The toggle implementation correctly:

  1. Creates the directory if needed (mkdir -p)
  2. Writes the toggle value
  3. Reads back to verify the write succeeded

The pod_name comes from the Kubernetes API, so the S603 static analysis warning is a false positive. The timeout of 10 seconds is appropriate.


504-573: LGTM - Robust node cleanup with deduplication.

The cleanup logic correctly:

  1. Tracks cleaned nodes to avoid duplicate operations
  2. Uses rm -f to silently handle missing files
  3. Continues on failure to clean up as many nodes as possible

The silent exception catch (S112/BLE001) is acceptable here since we want resilient cleanup behavior.
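
A minimal sketch of that pattern (deduplicate by node, rm -f the marker, continue past failures); the method name exists in the PR, but the body below is a sketch of the described behavior, not the actual implementation:

```python
import subprocess

def cleanup_node_fault_markers(pods, namespace: str) -> None:
    """Remove the fault marker once per node, tolerating per-pod failures."""
    cleaned_nodes = set()
    for pod in pods:  # pods: list of kubernetes V1Pod objects
        node = pod.spec.node_name
        if node in cleaned_nodes:
            continue  # one pod per node is enough to clear the hostPath marker
        try:
            subprocess.run(
                ["kubectl", "exec", "-n", namespace, pod.metadata.name, "--",
                 "rm", "-f", "/host-fault/cuda_fault_enabled"],
                check=True, timeout=10,
            )
            cleaned_nodes.add(node)
        except (subprocess.SubprocessError, OSError):
            continue  # keep cleaning the remaining nodes
```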

…gs section and fixed inconsistent type hint

Signed-off-by: Oviya Seeniraj <[email protected]>
Signed-off-by: Oviya Seeniraj <[email protected]>

@saturley-hall saturley-hall left a comment


This seems fine; however, I am still not wild about adding functionality that is not exercised in this PR (cleanup_node_fault_markers(), disable_cuda_faults_via_toggle(), verify_env_var_set()). In the future, please try to keep the functionality as atomic as possible rather than spreading it across a chain of PRs.

@saturley-hall
Member

/ok to test 2a9d973

@nv-oviya nv-oviya merged commit d2c23e4 into main Dec 8, 2025
28 of 29 checks passed
@nv-oviya nv-oviya deleted the oviya/fault-injection/cuda-hostpath-method branch December 8, 2025 20:45
esoba pushed a commit to esoba/dynamo that referenced this pull request Dec 9, 2025
zxue2 pushed a commit to zxue2/dynamo that referenced this pull request Dec 11, 2025