
Conversation


@nv-oviya nv-oviya commented Nov 2, 2025

Overview:

This PR adds a high-level wrapper (CUDAFaultInjector) that simplifies CUDA fault injection for test scripts. It abstracts away the complexity of building libraries, creating ConfigMaps, and patching deployments into a clean Python API.

Details:

  • Add helpers/cuda_fault_injection.py: High-level CUDA fault injection API (usage sketch after this list)
    • CUDAFaultInjector class: Manages entire CUDA fault injection lifecycle
    • build_library(): Build CUDA fault library using Makefile
    • create_configmap_with_library(): Wrapper for ConfigMap creation
    • patch_deployment_for_cuda_fault(): Deploy fault library to specific deployment
    • cleanup_cuda_fault_injection(): Remove all artifacts with optional force-delete
    • trigger_pod_restart(): Delete pods to activate new environment variables
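A minimal usage sketch of the API above, under the assumption that the helpers package is importable from the test script; the argument names (namespace, deployment_name, node_name, xid_type) and the example values are illustrative, not the module's actual signatures:

# Hypothetical usage sketch; names and values are illustrative.
from helpers.cuda_fault_injection import CUDAFaultInjector  # import path depends on package layout

injector = CUDAFaultInjector()

# Build (or reuse) the fault library and publish it to the cluster as a ConfigMap.
if injector.build_library() and injector.create_configmap_with_library(namespace="dynamo"):
    # Patch the target deployment so worker pods preload the fault library,
    # optionally pinning them to a node and selecting an XID type.
    injector.patch_deployment_for_cuda_fault(
        deployment_name="vllm-disagg",   # hypothetical deployment
        namespace="dynamo",
        node_name="gpu-node-1",          # assumed keyword for node pinning
        xid_type=79,                     # assumed keyword for XID selection
    )
    # Delete pods so the new LD_PRELOAD environment variables take effect.
    injector.trigger_pod_restart(namespace="dynamo")

# ... run the fault-tolerance test ...

# Remove deployment patches and ConfigMaps; optionally force-delete stuck pods.
injector.cleanup_cuda_fault_injection(
    deployment_name="vllm-disagg", namespace="dynamo", force_delete_pods=True
)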

Where should the reviewer start?

  1. cuda_fault_injection.py:
    • CUDAFaultInjector.__init__() and build_library() - Library build orchestration
    • create_configmap_with_library() - ConfigMap wrapper (imports from inject_into_pods)
    • patch_deployment_for_cuda_fault() - Main injection workflow (sketched after this list):
      • Imports patch_deployment_env() from inject_into_pods.py
      • Handles node pinning for realistic fault scenarios
      • Passes XID type configuration
    • cleanup_cuda_fault_injection() - Comprehensive cleanup:
      • Removes deployment patches
      • Deletes ConfigMaps
      • Optional force-delete of pods (for stuck pods)
      • Wait for new pods to become ready
    • trigger_pod_restart() - Pod deletion helper
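To make the injection workflow above concrete, here is a rough sketch of the flow it describes; the import of patch_deployment_env from inject_into_pods.py is taken from this description, while the method signature and keyword names are assumptions:

# Sketch of the described flow, not the actual implementation.
import sys

def patch_deployment_for_cuda_fault(self, deployment_name, namespace,
                                    node_name=None, xid_type=None):
    # inject_into_pods.py lives in the cuda-fault-injection directory,
    # so temporarily make it importable.
    sys.path.insert(0, str(self.lib_dir))
    try:
        from inject_into_pods import patch_deployment_env
    finally:
        sys.path.remove(str(self.lib_dir))

    # Node pinning simulates "XID on a specific node"; the XID type is
    # forwarded so the preloaded library raises the intended error.
    return patch_deployment_env(
        deployment_name=deployment_name,
        namespace=namespace,
        node_name=node_name,   # assumed keyword
        xid_type=xid_type,     # assumed keyword
    )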

Key features:

  • Dynamically imports from cuda-fault-injection/inject_into_pods.py (relative path handling)
  • Supports node pinning to simulate "XID on specific node" scenarios (see the affinity sketch below)
  • Comprehensive cleanup handles both graceful and force deletion
  • Integrates with k8s_operations.py for pod readiness waiting
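As a concrete illustration of the node pinning mentioned above, this is the kind of nodeAffinity block such a patch typically injects into the pod spec; the kubernetes.io/hostname label key is standard Kubernetes, while the helper itself is hypothetical:

# Hypothetical helper showing the standard nodeAffinity shape used for pinning.
def node_affinity_for(node_name: str) -> dict:
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "kubernetes.io/hostname",
                                "operator": "In",
                                "values": [node_name],
                            }
                        ]
                    }
                ]
            }
        }
    }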

Related Issues:

Summary by CodeRabbit

Release Notes

  • New Features
    • Added a fault injection utility for testing CUDA workloads in Kubernetes environments, enabling library management, deployment configuration patching, cleanup operations, and pod crash monitoring.

- CUDAFaultInjector class: Wraps lower-level injection tools
- Build library, create ConfigMaps, patch deployments
- Cleanup and verification utilities

Provides simplified API for test scripts to inject CUDA faults.

copy-pr-bot bot commented Nov 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nv-oviya nv-oviya marked this pull request as ready for review November 4, 2025 17:55
@nv-oviya nv-oviya requested review from a team as code owners November 4, 2025 17:55

coderabbitai bot commented Nov 4, 2025

Walkthrough

A new CUDAFaultInjector utility class is introduced to manage CUDA fault injection for Kubernetes workloads. The class provides methods for building libraries, creating ConfigMaps, patching deployments, performing cleanup, managing pod restarts, and monitoring pod crash states.

Changes

Cohort: CUDA Fault Injection Utility
File(s): tests/fault_tolerance/hardware/fault-injection-service/helpers/cuda_fault_injection.py
Change Summary: New class CUDAFaultInjector with seven public methods: library building, ConfigMap creation, deployment patching, fault injection cleanup with verification and optional force pod deletion, pod restart triggering, and pod crash monitoring with timeout support. Includes error handling and external inject_into_pods module integration.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Deployment cleanup verification logic — The cleanup_cuda_fault_injection method performs multi-step verification (querying for LD_PRELOAD, nodeAffinity, volumes) and optional force-deletion; requires careful review of cleanup semantics and retry logic.
  • Pod crash monitoring — The wait_for_pods_to_crash method implements polling with timeout; verify edge cases and timing accuracy (the general pattern is sketched after this list).
  • Dynamic imports and external dependencies — Methods rely on dynamic imports from inject_into_pods module; confirm integration points and error propagation are properly handled.
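For reference, the polling-with-timeout pattern called out above usually looks something like the following sketch, written against the official kubernetes Python client; the function name matches wait_for_pods_to_crash from the walkthrough, but the parameters and structure are assumptions rather than the reviewed code:

# Sketch of a crash-polling loop with timeout; not the module's implementation.
import time
from kubernetes import client, config

def wait_for_pods_to_crash(namespace: str, label_selector: str,
                           timeout: int = 300, interval: int = 5) -> bool:
    config.load_kube_config()
    core = client.CoreV1Api()
    deadline = time.time() + timeout
    while time.time() < deadline:
        pods = core.list_namespaced_pod(namespace, label_selector=label_selector)
        crashed = 0
        for pod in pods.items:
            for cs in pod.status.container_statuses or []:
                waiting = cs.state.waiting
                if (waiting and waiting.reason in ("CrashLoopBackOff", "Error")) \
                        or cs.state.terminated is not None:
                    crashed += 1
                    break
        if pods.items and crashed == len(pods.items):
            return True
        time.sleep(interval)
    return False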

Poem

🐰 A sniffer's tale, with CUDA's might,
Building libraries, patching deployments bright,
Pods crash and restart, monitored with care,
Fault injection flows through Kubernetes air,
Cleanup and chaos, all working as one! ✨

Pre-merge checks

✅ Passed checks (2 passed)
Title check: ✅ Passed. The title clearly and concisely summarizes the main addition: a high-level CUDA fault injection helper class, matching the primary change in the changeset.
Description check: ✅ Passed. The description is well-structured, following the template with Overview, Details, and Where should the reviewer start sections, providing comprehensive context about the new CUDAFaultInjector class and its methods.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (4)
tests/fault_tolerance/hardware/fault-injection-service/helpers/cuda_fault_injection.py (4)

60-62: Consider validating the make executable path.

While subprocess.run(["make"], ...) is safe in this controlled test environment, using shutil.which("make") to validate the executable exists before invocation would improve robustness and provide clearer error messages if make is not available.

Apply this diff to add validation:

+import shutil
+
 def build_library(self) -> bool:
     """
     Build the CUDA fault injection library.
 
     Returns:
         True if build succeeded or library already exists
     """
     print("\n[→] Building CUDA fault injection library...")
 
     if not self.lib_dir.exists():
         print(f"    ✗ Directory not found: {self.lib_dir}")
         return False
 
     if self.lib_path.exists():
         print(f"    ✓ Library already exists: {self.lib_path}")
         self.lib_built = True
         return True
 
+    # Verify make is available
+    if not shutil.which("make"):
+        print("    ✗ 'make' command not found in PATH")
+        return False
+
     # Build using make
     result = subprocess.run(
         ["make"], cwd=self.lib_dir, capture_output=True, text=True
     )

89-100: Consider using a context manager to avoid polluting sys.path.

The sys.path.insert(0, ...) modification persists for the entire process lifetime. If this method is called multiple times or from different contexts, it could lead to unexpected import behavior.

Apply this approach using a context manager:

import contextlib

@contextlib.contextmanager
def temp_syspath(path: str):
    """Temporarily add path to sys.path."""
    sys.path.insert(0, path)
    try:
        yield
    finally:
        sys.path.remove(path)

def create_configmap_with_library(self, namespace: str) -> bool:
    """Create ConfigMap with CUDA fault injection library source."""
    with temp_syspath(str(self.lib_dir)):
        try:
            from inject_into_pods import create_cuda_fault_configmap
            return create_cuda_fault_configmap(namespace)
        except Exception as e:
            print(f"    ✗ Failed to create ConfigMap: {e}")
            import traceback
            traceback.print_exc()
            return False

196-274: Consider extracting the verification logic.

The deployment spec verification logic (lines 196-274) is quite complex and deeply nested. Extracting it into a separate method would improve readability and testability.

Example refactoring:

def _check_deployment_artifacts(
    self, 
    k8s_custom: client.CustomObjectsApi,
    deployment_name: str,
    namespace: str
) -> tuple[bool, list[str]]:
    """
    Check if CUDA fault artifacts exist in deployment.
    
    Returns:
        (has_artifacts, artifact_details)
    """
    dgd = k8s_custom.get_namespaced_custom_object(
        group="nvidia.com",
        version="v1alpha1",
        namespace=namespace,
        plural="dynamographdeployments",
        name=deployment_name,
    )
    
    has_artifacts = False
    artifact_details = []
    
    for service_name in ["VllmDecodeWorker", "VllmPrefillWorker"]:
        service = (
            dgd.get("spec", {})
            .get("services", {})
            .get(service_name, {})
        )
        
        # Check for LD_PRELOAD
        env_vars = (
            service.get("extraPodSpec", {})
            .get("mainContainer", {})
            .get("env", [])
        )
        for env in env_vars:
            if isinstance(env, dict) and env.get("name") == "LD_PRELOAD":
                has_artifacts = True
                artifact_details.append(f"{service_name}: LD_PRELOAD")
                break
        
        # Check for node affinity
        affinity = service.get("extraPodSpec", {}).get("affinity")
        if affinity and isinstance(affinity, dict) and "nodeAffinity" in affinity:
            has_artifacts = True
            artifact_details.append(f"{service_name}: nodeAffinity")
        
        # Check for CUDA fault volumes
        volumes = service.get("extraPodSpec", {}).get("volumes", [])
        for vol in volumes:
            if vol.get("name") in ["cuda-fault-lib", "cuda-fault-lib-source"]:
                has_artifacts = True
                artifact_details.append(f"{service_name}: cuda-fault volume")
                break
    
    return has_artifacts, artifact_details

# Then in cleanup_cuda_fault_injection:
for attempt in range(6):
    time.sleep(5)
    try:
        has_artifacts, artifact_details = self._check_deployment_artifacts(
            k8s_custom, deployment_name, namespace
        )
        
        if not has_artifacts:
            print(f"    ✓ Deployment spec verified clean after {(attempt+1)*5}s")
            spec_cleaned = True
            break
        else:
            print(f"    ... {(attempt+1)*5}s: Artifacts: {', '.join(artifact_details)}")
    except Exception as e:
        print(f"    ... {(attempt+1)*5}s: Error checking spec: {e}")

386-392: Use explicit None comparison for clarity.

The Kubernetes Python client's V1ContainerState.terminated field holds a V1ContainerStateTerminated object when a container has terminated and is None otherwise, so the existing truthy check cs.state.terminated is functionally correct. Changing it to cs.state.terminated is not None is clearer, though, and aligns with PEP 8 guidance for checks against optional fields; this is a style improvement rather than a correctness fix.

                     if (
                         cs.state.waiting
                         and cs.state.waiting.reason in ["CrashLoopBackOff", "Error"]
-                    ) or cs.state.terminated:
+                    ) or cs.state.terminated is not None:
                         crashed_count += 1
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d76583 and c5eacf8.

📒 Files selected for processing (1)
  • tests/fault_tolerance/hardware/fault-injection-service/helpers/cuda_fault_injection.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.3)
tests/fault_tolerance/hardware/fault-injection-service/helpers/cuda_fault_injection.py

61-61: Starting a process with a partial executable path

(S607)


95-95: Do not catch blind exception: Exception

(BLE001)


146-146: Do not catch blind exception: Exception

(BLE001)


270-270: Do not catch blind exception: Exception

(BLE001)


281-281: Do not catch blind exception: Exception

(BLE001)


312-312: String contains ambiguous ℹ (INFORMATION SOURCE). Did you mean i (LATIN SMALL LETTER I)?

(RUF001)


314-314: Do not catch blind exception: Exception

(BLE001)


318-318: Consider moving this statement to an else block

(TRY300)


320-320: Do not catch blind exception: Exception

(BLE001)


401-401: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (1)
tests/fault_tolerance/hardware/fault-injection-service/helpers/cuda_fault_injection.py (1)

26-39: LGTM! Clean initialization.

The default path resolution and attribute initialization are well-structured.
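For context, the default path resolution the note refers to is typically along these lines; this sketch assumes the fault library lives in a sibling cuda-fault-injection directory (as the PR description suggests), and the library file name is illustrative:

# Illustrative sketch of default path resolution, not the reviewed code.
from pathlib import Path
from typing import Optional

class CUDAFaultInjector:
    def __init__(self, lib_dir: Optional[Path] = None):
        # Resolve the cuda-fault-injection directory relative to this helper file.
        default_dir = Path(__file__).resolve().parent.parent / "cuda-fault-injection"
        self.lib_dir = Path(lib_dir) if lib_dir else default_dir
        self.lib_path = self.lib_dir / "libcuda_fault.so"  # assumed library file name
        self.lib_built = False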

@nv-oviya nv-oviya marked this pull request as draft November 4, 2025 18:26

@nnshah1 nnshah1 left a comment


I think this one overlaps with the others, and we'll need to discuss the patch/deploy-time flow.

@nv-oviya nv-oviya marked this pull request as ready for review November 26, 2025 18:09
)
return False

print(
Member commented:

All print() output should go through logging.

@saturley-hall saturley-hall merged commit 26eb14c into main Nov 26, 2025
11 of 12 checks passed
@saturley-hall saturley-hall deleted the oviya/fault-injection/cuda-integration-helper branch November 26, 2025 18:11