
Conversation


@nv-oviya nv-oviya commented Nov 2, 2025

Overview:

This PR adds a high-level wrapper (CUDAFaultInjector) that simplifies CUDA fault injection for test scripts. It abstracts away the complexity of building libraries, creating ConfigMaps, and patching deployments into a clean Python API.

Details:

  • Add helpers/cuda_fault_injection.py: High-level CUDA fault injection API (usage sketch after this list)
    • CUDAFaultInjector class: Manages entire CUDA fault injection lifecycle
    • build_library(): Build CUDA fault library using Makefile
    • create_configmap_with_library(): Wrapper for ConfigMap creation
    • patch_deployment_for_cuda_fault(): Deploy fault library to specific deployment
    • cleanup_cuda_fault_injection(): Remove all artifacts with optional force-delete
    • trigger_pod_restart(): Delete pods to activate new environment variables
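A minimal usage sketch of the API above, under the assumption that the helpers package is importable from the test script; the argument names (namespace, deployment_name, node_name, xid_type) and the example values are illustrative, not the module's actual signatures:

# Hypothetical usage sketch; names and values are illustrative.
from helpers.cuda_fault_injection import CUDAFaultInjector  # import path depends on package layout

injector = CUDAFaultInjector()

# Build (or reuse) the fault library and publish it to the cluster as a ConfigMap.
if injector.build_library() and injector.create_configmap_with_library(namespace="dynamo"):
    # Patch the target deployment so worker pods preload the fault library,
    # optionally pinning them to a node and selecting an XID type.
    injector.patch_deployment_for_cuda_fault(
        deployment_name="vllm-disagg",   # hypothetical deployment
        namespace="dynamo",
        node_name="gpu-node-1",          # assumed keyword for node pinning
        xid_type=79,                     # assumed keyword for XID selection
    )
    # Delete pods so the new LD_PRELOAD environment variables take effect.
    injector.trigger_pod_restart(namespace="dynamo")

# ... run the fault-tolerance test ...

# Remove deployment patches and ConfigMaps; optionally force-delete stuck pods.
injector.cleanup_cuda_fault_injection(
    deployment_name="vllm-disagg", namespace="dynamo", force_delete_pods=True
)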

Where should the reviewer start?

  1. cuda_fault_injection.py:
    • CUDAFaultInjector.__init__() and build_library() - Library build orchestration
    • create_configmap_with_library() - ConfigMap wrapper (imports from inject_into_pods)
    • patch_deployment_for_cuda_fault() - Main injection workflow (sketched after this list):
      • Imports patch_deployment_env() from inject_into_pods.py
      • Handles node pinning for realistic fault scenarios
      • Passes XID type configuration
    • cleanup_cuda_fault_injection() - Comprehensive cleanup:
      • Removes deployment patches
      • Deletes ConfigMaps
      • Optional force-delete of pods (for stuck pods)
      • Wait for new pods to become ready
    • trigger_pod_restart() - Pod deletion helper
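To make the injection workflow above concrete, here is a rough sketch of the flow it describes; the import of patch_deployment_env from inject_into_pods.py is taken from this description, while the method signature and keyword names are assumptions:

# Sketch of the described flow, not the actual implementation.
import sys

def patch_deployment_for_cuda_fault(self, deployment_name, namespace,
                                    node_name=None, xid_type=None):
    # inject_into_pods.py lives in the cuda-fault-injection directory,
    # so temporarily make it importable.
    sys.path.insert(0, str(self.lib_dir))
    try:
        from inject_into_pods import patch_deployment_env
    finally:
        sys.path.remove(str(self.lib_dir))

    # Node pinning simulates "XID on a specific node"; the XID type is
    # forwarded so the preloaded library raises the intended error.
    return patch_deployment_env(
        deployment_name=deployment_name,
        namespace=namespace,
        node_name=node_name,   # assumed keyword
        xid_type=xid_type,     # assumed keyword
    )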

Key features:

  • Dynamically imports from cuda-fault-injection/inject_into_pods.py (relative path handling)
  • Supports node pinning to simulate "XID on specific node" scenarios (see the affinity sketch below)
  • Comprehensive cleanup handles both graceful and force deletion
  • Integrates with k8s_operations.py for pod readiness waiting
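As a concrete illustration of the node pinning mentioned above, this is the kind of nodeAffinity block such a patch typically injects into the pod spec; the kubernetes.io/hostname label key is standard Kubernetes, while the helper itself is hypothetical:

# Hypothetical helper showing the standard nodeAffinity shape used for pinning.
def node_affinity_for(node_name: str) -> dict:
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "kubernetes.io/hostname",
                                "operator": "In",
                                "values": [node_name],
                            }
                        ]
                    }
                ]
            }
        }
    }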

Related Issues:

Summary by CodeRabbit

Release Notes

  • New Features
    • Added a fault injection utility for testing CUDA workloads in Kubernetes environments, enabling library management, deployment configuration patching, cleanup operations, and pod crash monitoring.

- CUDAFaultInjector class: Wraps lower-level injection tools
- Build library, create ConfigMaps, patch deployments
- Cleanup and verification utilities

Provides simplified API for test scripts to inject CUDA faults.

copy-pr-bot bot commented Nov 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nv-oviya nv-oviya marked this pull request as ready for review November 4, 2025 17:55
@nv-oviya nv-oviya requested review from a team as code owners November 4, 2025 17:55

coderabbitai bot commented Nov 4, 2025

Walkthrough

A new CUDAFaultInjector utility class is introduced to manage CUDA fault injection for Kubernetes workloads. The class provides methods for building libraries, creating ConfigMaps, patching deployments, performing cleanup, managing pod restarts, and monitoring pod crash states.

Changes

Cohort: CUDA Fault Injection Utility
File(s): tests/fault_tolerance/hardware/fault-injection-service/helpers/cuda_fault_injection.py
Change Summary: New class CUDAFaultInjector with seven public methods: library building, ConfigMap creation, deployment patching, fault injection cleanup with verification and optional force pod deletion, pod restart triggering, and pod crash monitoring with timeout support. Includes error handling and external inject_into_pods module integration.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Deployment cleanup verification logic — The cleanup_cuda_fault_injection method performs multi-step verification (querying for LD_PRELOAD, nodeAffinity, volumes) and optional force-deletion; requires careful review of cleanup semantics and retry logic.
  • Pod crash monitoring — The wait_for_pods_to_crash method implements polling with timeout; verify edge cases and timing accuracy (the general pattern is sketched after this list).
  • Dynamic imports and external dependencies — Methods rely on dynamic imports from inject_into_pods module; confirm integration points and error propagation are properly handled.
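For reference, the polling-with-timeout pattern called out above usually looks something like the following sketch, written against the official kubernetes Python client; the function name matches wait_for_pods_to_crash from the walkthrough, but the parameters and structure are assumptions rather than the reviewed code:

# Sketch of a crash-polling loop with timeout; not the module's implementation.
import time
from kubernetes import client, config

def wait_for_pods_to_crash(namespace: str, label_selector: str,
                           timeout: int = 300, interval: int = 5) -> bool:
    config.load_kube_config()
    core = client.CoreV1Api()
    deadline = time.time() + timeout
    while time.time() < deadline:
        pods = core.list_namespaced_pod(namespace, label_selector=label_selector)
        crashed = 0
        for pod in pods.items:
            for cs in pod.status.container_statuses or []:
                waiting = cs.state.waiting
                if (waiting and waiting.reason in ("CrashLoopBackOff", "Error")) \
                        or cs.state.terminated is not None:
                    crashed += 1
                    break
        if pods.items and crashed == len(pods.items):
            return True
        time.sleep(interval)
    return False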

Poem

🐰 A sniffer's tale, with CUDA's might,
Building libraries, patching deployments bright,
Pods crash and restart, monitored with care,
Fault injection flows through Kubernetes air,
Cleanup and chaos, all working as one! ✨

Pre-merge checks

✅ Passed checks (2 passed)
Title check: ✅ Passed. The title clearly and concisely summarizes the main addition: a high-level CUDA fault injection helper class, matching the primary change in the changeset.
Description check: ✅ Passed. The description is well-structured, following the template with Overview, Details, and Where should the reviewer start sections, providing comprehensive context about the new CUDAFaultInjector class and its methods.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (4)
tests/fault_tolerance/hardware/fault-injection-service/helpers/cuda_fault_injection.py (4)

60-62: Consider validating the make executable path.

While subprocess.run(["make"], ...) is safe in this controlled test environment, using shutil.which("make") to validate the executable exists before invocation would improve robustness and provide clearer error messages if make is not available.

Apply this diff to add validation:

+import shutil
+
 def build_library(self) -> bool:
     """
     Build the CUDA fault injection library.
 
     Returns:
         True if build succeeded or library already exists
     """
     print("\n[→] Building CUDA fault injection library...")
 
     if not self.lib_dir.exists():
         print(f"    ✗ Directory not found: {self.lib_dir}")
         return False
 
     if self.lib_path.exists():
         print(f"    ✓ Library already exists: {self.lib_path}")
         self.lib_built = True
         return True
 
+    # Verify make is available
+    if not shutil.which("make"):
+        print("    ✗ 'make' command not found in PATH")
+        return False
+
     # Build using make
     result = subprocess.run(
         ["make"], cwd=self.lib_dir, capture_output=True, text=True
     )

89-100: Consider using a context manager to avoid polluting sys.path.

The sys.path.insert(0, ...) modification persists for the entire process lifetime. If this method is called multiple times or from different contexts, it could lead to unexpected import behavior.

Apply this approach using a context manager:

import contextlib

@contextlib.contextmanager
def temp_syspath(path: str):
    """Temporarily add path to sys.path."""
    sys.path.insert(0, path)
    try:
        yield
    finally:
        sys.path.remove(path)

def create_configmap_with_library(self, namespace: str) -> bool:
    """Create ConfigMap with CUDA fault injection library source."""
    with temp_syspath(str(self.lib_dir)):
        try:
            from inject_into_pods import create_cuda_fault_configmap
            return create_cuda_fault_configmap(namespace)
        except Exception as e:
            print(f"    ✗ Failed to create ConfigMap: {e}")
            import traceback
            traceback.print_exc()
            return False

196-274: Consider extracting the verification logic.

The deployment spec verification logic (lines 196-274) is quite complex and deeply nested. Extracting it into a separate method would improve readability and testability.

Example refactoring:

def _check_deployment_artifacts(
    self, 
    k8s_custom: client.CustomObjectsApi,
    deployment_name: str,
    namespace: str
) -> tuple[bool, list[str]]:
    """
    Check if CUDA fault artifacts exist in deployment.
    
    Returns:
        (has_artifacts, artifact_details)
    """
    dgd = k8s_custom.get_namespaced_custom_object(
        group="nvidia.com",
        version="v1alpha1",
        namespace=namespace,
        plural="dynamographdeployments",
        name=deployment_name,
    )
    
    has_artifacts = False
    artifact_details = []
    
    for service_name in ["VllmDecodeWorker", "VllmPrefillWorker"]:
        service = (
            dgd.get("spec", {})
            .get("services", {})
            .get(service_name, {})
        )
        
        # Check for LD_PRELOAD
        env_vars = (
            service.get("extraPodSpec", {})
            .get("mainContainer", {})
            .get("env", [])
        )
        for env in env_vars:
            if isinstance(env, dict) and env.get("name") == "LD_PRELOAD":
                has_artifacts = True
                artifact_details.append(f"{service_name}: LD_PRELOAD")
                break
        
        # Check for node affinity
        affinity = service.get("extraPodSpec", {}).get("affinity")
        if affinity and isinstance(affinity, dict) and "nodeAffinity" in affinity:
            has_artifacts = True
            artifact_details.append(f"{service_name}: nodeAffinity")
        
        # Check for CUDA fault volumes
        volumes = service.get("extraPodSpec", {}).get("volumes", [])
        for vol in volumes:
            if vol.get("name") in ["cuda-fault-lib", "cuda-fault-lib-source"]:
                has_artifacts = True
                artifact_details.append(f"{service_name}: cuda-fault volume")
                break
    
    return has_artifacts, artifact_details

# Then in cleanup_cuda_fault_injection:
for attempt in range(6):
    time.sleep(5)
    try:
        has_artifacts, artifact_details = self._check_deployment_artifacts(
            k8s_custom, deployment_name, namespace
        )
        
        if not has_artifacts:
            print(f"    ✓ Deployment spec verified clean after {(attempt+1)*5}s")
            spec_cleaned = True
            break
        else:
            print(f"    ... {(attempt+1)*5}s: Artifacts: {', '.join(artifact_details)}")
    except Exception as e:
        print(f"    ... {(attempt+1)*5}s: Error checking spec: {e}")

386-392: Use explicit None comparison for clarity.

The Kubernetes Python client's V1ContainerState.terminated field holds a V1ContainerStateTerminated object when a container has terminated and is None otherwise, so the existing truthy check cs.state.terminated is functionally correct. Changing it to cs.state.terminated is not None is clearer, though, and aligns with PEP 8 guidance for checks against optional fields; this is a style improvement rather than a correctness fix.

                     if (
                         cs.state.waiting
                         and cs.state.waiting.reason in ["CrashLoopBackOff", "Error"]
-                    ) or cs.state.terminated:
+                    ) or cs.state.terminated is not None:
                         crashed_count += 1
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d76583 and c5eacf8.

📒 Files selected for processing (1)
  • tests/fault_tolerance/hardware/fault-injection-service/helpers/cuda_fault_injection.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.3)
tests/fault_tolerance/hardware/fault-injection-service/helpers/cuda_fault_injection.py

61-61: Starting a process with a partial executable path

(S607)


95-95: Do not catch blind exception: Exception

(BLE001)


146-146: Do not catch blind exception: Exception

(BLE001)


270-270: Do not catch blind exception: Exception

(BLE001)


281-281: Do not catch blind exception: Exception

(BLE001)


312-312: String contains ambiguous ℹ (INFORMATION SOURCE). Did you mean i (LATIN SMALL LETTER I)?

(RUF001)


314-314: Do not catch blind exception: Exception

(BLE001)


318-318: Consider moving this statement to an else block

(TRY300)


320-320: Do not catch blind exception: Exception

(BLE001)


401-401: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (1)
tests/fault_tolerance/hardware/fault-injection-service/helpers/cuda_fault_injection.py (1)

26-39: LGTM! Clean initialization.

The default path resolution and attribute initialization are well-structured.
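For context, the default path resolution the note refers to is typically along these lines; this sketch assumes the fault library lives in a sibling cuda-fault-injection directory (as the PR description suggests), and the library file name is illustrative:

# Illustrative sketch of default path resolution, not the reviewed code.
from pathlib import Path
from typing import Optional

class CUDAFaultInjector:
    def __init__(self, lib_dir: Optional[Path] = None):
        # Resolve the cuda-fault-injection directory relative to this helper file.
        default_dir = Path(__file__).resolve().parent.parent / "cuda-fault-injection"
        self.lib_dir = Path(lib_dir) if lib_dir else default_dir
        self.lib_path = self.lib_dir / "libcuda_fault.so"  # assumed library file name
        self.lib_built = False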

@nv-oviya nv-oviya marked this pull request as draft November 4, 2025 18:26

@nnshah1 nnshah1 left a comment


I think this one overlaps with the others, and we'll need to discuss the patch/deploy-time flow.

@nv-oviya nv-oviya marked this pull request as ready for review November 26, 2025 18:09
)
return False

print(
Member commented:

All print() output should go through logging.

@saturley-hall saturley-hall merged commit 26eb14c into main Nov 26, 2025
11 of 12 checks passed
@saturley-hall saturley-hall deleted the oviya/fault-injection/cuda-integration-helper branch November 26, 2025 18:11