feat(fault-injection): Add high-level CUDA fault injection helper #4041
Conversation
- `CUDAFaultInjector` class: wraps lower-level injection tools
- Build library, create ConfigMaps, patch deployments
- Cleanup and verification utilities

Provides a simplified API for test scripts to inject CUDA faults.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ Passed checks (2 passed)
Actionable comments posted: 4
🧹 Nitpick comments (4)
tests/fault_tolerance/hardware/fault-injection-service/helpers/cuda_fault_injection.py (4)
**60-62**: Consider validating the `make` executable path.

While `subprocess.run(["make"], ...)` is safe in this controlled test environment, using `shutil.which("make")` to validate that the executable exists before invocation would improve robustness and provide clearer error messages if `make` is not available.

Apply this diff to add validation:
```diff
+import shutil
+
 def build_library(self) -> bool:
     """
     Build the CUDA fault injection library.

     Returns:
         True if build succeeded or library already exists
     """
     print("\n[→] Building CUDA fault injection library...")

     if not self.lib_dir.exists():
         print(f"  ✗ Directory not found: {self.lib_dir}")
         return False

     if self.lib_path.exists():
         print(f"  ✓ Library already exists: {self.lib_path}")
         self.lib_built = True
         return True

+    # Verify make is available
+    if not shutil.which("make"):
+        print("  ✗ 'make' command not found in PATH")
+        return False
+
     # Build using make
     result = subprocess.run(
         ["make"], cwd=self.lib_dir, capture_output=True, text=True
     )
```
**89-100**: Consider using a context manager to avoid polluting `sys.path`.

The `sys.path.insert(0, ...)` modification persists for the entire process lifetime. If this method is called multiple times or from different contexts, it could lead to unexpected import behavior.

Apply this approach using a context manager:
```python
import contextlib

@contextlib.contextmanager
def temp_syspath(path: str):
    """Temporarily add path to sys.path."""
    sys.path.insert(0, path)
    try:
        yield
    finally:
        sys.path.remove(path)

def create_configmap_with_library(self, namespace: str) -> bool:
    """Create ConfigMap with CUDA fault injection library source."""
    with temp_syspath(str(self.lib_dir)):
        try:
            from inject_into_pods import create_cuda_fault_configmap
            return create_cuda_fault_configmap(namespace)
        except Exception as e:
            print(f"  ✗ Failed to create ConfigMap: {e}")
            import traceback
            traceback.print_exc()
            return False
```
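As a quick sanity check of this pattern, a standalone sketch shows that such a context manager restores `sys.path` on both normal exit and exception (the directory path here is made up for illustration):

```python
import contextlib
import sys

@contextlib.contextmanager
def temp_syspath(path: str):
    """Temporarily add path to sys.path."""
    sys.path.insert(0, path)
    try:
        yield
    finally:
        sys.path.remove(path)

before = list(sys.path)

with temp_syspath("/tmp/fake-lib-dir"):
    # Inside the block, the directory is first on the import path
    assert sys.path[0] == "/tmp/fake-lib-dir"
assert sys.path == before  # restored on normal exit

try:
    with temp_syspath("/tmp/fake-lib-dir"):
        raise ImportError("boom")
except ImportError:
    pass
assert sys.path == before  # restored even when the body raises
```

The `finally` clause is what guarantees cleanup; a bare `sys.path.insert` followed by `sys.path.remove` after the import would skip the removal if the import raised.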
**196-274**: Consider extracting the verification logic.

The deployment spec verification logic (lines 196-274) is quite complex and deeply nested. Extracting it into a separate method would improve readability and testability.

Example refactoring:
```python
def _check_deployment_artifacts(
    self,
    k8s_custom: client.CustomObjectsApi,
    deployment_name: str,
    namespace: str,
) -> tuple[bool, list[str]]:
    """
    Check if CUDA fault artifacts exist in deployment.

    Returns:
        (has_artifacts, artifact_details)
    """
    dgd = k8s_custom.get_namespaced_custom_object(
        group="nvidia.com",
        version="v1alpha1",
        namespace=namespace,
        plural="dynamographdeployments",
        name=deployment_name,
    )

    has_artifacts = False
    artifact_details = []

    for service_name in ["VllmDecodeWorker", "VllmPrefillWorker"]:
        service = (
            dgd.get("spec", {})
            .get("services", {})
            .get(service_name, {})
        )

        # Check for LD_PRELOAD
        env_vars = (
            service.get("extraPodSpec", {})
            .get("mainContainer", {})
            .get("env", [])
        )
        for env in env_vars:
            if isinstance(env, dict) and env.get("name") == "LD_PRELOAD":
                has_artifacts = True
                artifact_details.append(f"{service_name}: LD_PRELOAD")
                break

        # Check for node affinity
        affinity = service.get("extraPodSpec", {}).get("affinity")
        if affinity and isinstance(affinity, dict) and "nodeAffinity" in affinity:
            has_artifacts = True
            artifact_details.append(f"{service_name}: nodeAffinity")

        # Check for CUDA fault volumes
        volumes = service.get("extraPodSpec", {}).get("volumes", [])
        for vol in volumes:
            if vol.get("name") in ["cuda-fault-lib", "cuda-fault-lib-source"]:
                has_artifacts = True
                artifact_details.append(f"{service_name}: cuda-fault volume")
                break

    return has_artifacts, artifact_details

# Then in cleanup_cuda_fault_injection:
for attempt in range(6):
    time.sleep(5)
    try:
        has_artifacts, artifact_details = self._check_deployment_artifacts(
            k8s_custom, deployment_name, namespace
        )
        if not has_artifacts:
            print(f"  ✓ Deployment spec verified clean after {(attempt+1)*5}s")
            spec_cleaned = True
            break
        else:
            print(f"  ... {(attempt+1)*5}s: Artifacts: {', '.join(artifact_details)}")
    except Exception as e:
        print(f"  ... {(attempt+1)*5}s: Error checking spec: {e}")
```
**386-392**: Use explicit `None` comparison for clarity.

The Kubernetes Python client's `V1ContainerState.terminated` field holds a `V1ContainerStateTerminated` object when the container has terminated and `None` otherwise, so the existing truthy check `cs.state.terminated` is functionally correct. Still, writing `cs.state.terminated is not None` is more explicit and aligns with PEP 8 guidance for optional-field checks; this is a style improvement rather than a correctness fix.

```diff
 if (
     cs.state.waiting
     and cs.state.waiting.reason in ["CrashLoopBackOff", "Error"]
-) or cs.state.terminated:
+) or cs.state.terminated is not None:
     crashed_count += 1
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
`tests/fault_tolerance/hardware/fault-injection-service/helpers/cuda_fault_injection.py` (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.3)
tests/fault_tolerance/hardware/fault-injection-service/helpers/cuda_fault_injection.py
61-61: Starting a process with a partial executable path
(S607)
95-95: Do not catch blind exception: Exception
(BLE001)
146-146: Do not catch blind exception: Exception
(BLE001)
270-270: Do not catch blind exception: Exception
(BLE001)
281-281: Do not catch blind exception: Exception
(BLE001)
312-312: String contains ambiguous ℹ (INFORMATION SOURCE). Did you mean i (LATIN SMALL LETTER I)?
(RUF001)
314-314: Do not catch blind exception: Exception
(BLE001)
318-318: Consider moving this statement to an else block
(TRY300)
320-320: Do not catch blind exception: Exception
(BLE001)
401-401: Do not catch blind exception: Exception
(BLE001)
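Several of the findings above are BLE001 (blind `except Exception`). A narrowed handler catches only the errors a call can actually raise and lets anything unexpected propagate; a standalone sketch (the path is made up for illustration):

```python
def read_config(path: str):
    """Return file contents, or None if the file is unavailable."""
    try:
        with open(path) as f:
            return f.read()
    except (FileNotFoundError, PermissionError) as e:
        # Only the expected I/O failures are handled; genuine bugs
        # (TypeError, AttributeError, ...) still surface loudly.
        print(f"config unavailable: {e}")
        return None

result = read_config("/nonexistent/config.yaml")
print(result)  # None
```

Where a broad catch is genuinely intended (e.g. best-effort cleanup in a test helper), logging the exception and adding a `# noqa: BLE001` with a comment makes the intent explicit to both Ruff and readers.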
🔇 Additional comments (1)
tests/fault_tolerance/hardware/fault-injection-service/helpers/cuda_fault_injection.py (1)
**26-39**: LGTM! Clean initialization.

The default path resolution and attribute initialization are well-structured.
nnshah1 left a comment
I think this one overlaps with the others, and we'll need to discuss the patch / deploy time flow.
All `print()` calls should go through `logging`.
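One way to act on this comment: a module-level logger in place of the bare `print()` calls. This is a sketch, not the PR's code; the logger name and messages are illustrative, and the `StringIO` capture only exists to keep the example self-contained (real code would rely on the application's logging configuration):

```python
import io
import logging

# Self-contained capture target; applications normally configure
# handlers once at startup instead.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

logger = logging.getLogger("cuda_fault_injection")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# Replacements for the print("[→] ...") / print("  ✗ ...") style calls:
logger.info("Building CUDA fault injection library...")
logger.error("Directory not found: %s", "/opt/cuda-fault-lib")

print(stream.getvalue())
```

Besides level filtering, `%`-style lazy formatting (`"found: %s", path`) avoids building strings for suppressed levels, and callers can redirect or silence the helper's output without touching its code.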
Overview:

This PR adds a high-level wrapper (`CUDAFaultInjector`) that simplifies CUDA fault injection for test scripts. It abstracts away the complexity of building libraries, creating ConfigMaps, and patching deployments into a clean Python API.

Details:
- `helpers/cuda_fault_injection.py`: High-level CUDA fault injection API
  - `CUDAFaultInjector` class: Manages the entire CUDA fault injection lifecycle
  - `build_library()`: Build the CUDA fault library using the Makefile
  - `create_configmap_with_library()`: Wrapper for ConfigMap creation
  - `patch_deployment_for_cuda_fault()`: Deploy the fault library to a specific deployment
  - `cleanup_cuda_fault_injection()`: Remove all artifacts with optional force-delete
  - `trigger_pod_restart()`: Delete pods to activate new environment variables

Where should the reviewer start?
- `cuda_fault_injection.py`:
  - `CUDAFaultInjector.__init__()` and `build_library()`: library build orchestration
  - `create_configmap_with_library()`: ConfigMap wrapper (imports from `inject_into_pods`)
  - `patch_deployment_for_cuda_fault()`: main injection workflow, built on `patch_deployment_env()` from `inject_into_pods.py`
  - `cleanup_cuda_fault_injection()`: comprehensive cleanup
  - `trigger_pod_restart()`: pod deletion helper
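The call sequence a test script would drive, per the method list above, can be sketched with a no-op stand-in (hypothetical: the real `CUDAFaultInjector` talks to a Kubernetes cluster, so a stub class is used here, and the deployment/namespace names are made up):

```python
class FakeCUDAFaultInjector:
    """No-op stand-in mirroring the API surface described in this PR."""

    def build_library(self) -> bool:
        print("build_library")
        return True

    def create_configmap_with_library(self, namespace: str) -> bool:
        print(f"create_configmap_with_library({namespace})")
        return True

    def patch_deployment_for_cuda_fault(self, deployment: str, namespace: str) -> bool:
        print(f"patch_deployment_for_cuda_fault({deployment})")
        return True

    def trigger_pod_restart(self, deployment: str, namespace: str) -> bool:
        print(f"trigger_pod_restart({deployment})")
        return True

    def cleanup_cuda_fault_injection(self, deployment: str, namespace: str) -> bool:
        print(f"cleanup_cuda_fault_injection({deployment})")
        return True

injector = FakeCUDAFaultInjector()
assert injector.build_library()
assert injector.create_configmap_with_library("test-ns")
assert injector.patch_deployment_for_cuda_fault("vllm-demo", "test-ns")
assert injector.trigger_pod_restart("vllm-demo", "test-ns")
# ... run the fault-tolerance scenario under injected CUDA faults ...
assert injector.cleanup_cuda_fault_injection("vllm-demo", "test-ns")
```

Each step returning `bool` lets a test bail out early (e.g. skip the scenario when the library cannot be built) rather than failing deep inside the cluster interaction.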
cuda-fault-injection/inject_into_pods.py(relative path handling)Related Issues:
Summary by CodeRabbit
Release Notes