feat(fault-injection): Enable runtime CUDA fault injection toggling without pod restarts #4679
Conversation
… cuda fault injections w/o requiring restarts or force deletion (besides initial setup) Signed-off-by: Oviya Seeniraj <[email protected]>
…/cuda-hostpath-method Signed-off-by: Oviya Seeniraj <[email protected]>
Walkthrough

This PR enhances CUDA fault injection for Kubernetes deployments by introducing persistent fault toggling via hostPath volumes, enabling runtime control without pod restarts. It adds new orchestration helpers, implements aggressive pod restart strategies, and includes a passthrough mode for baseline testing with the library loaded but disabled.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/inject_into_pods.py (1)
591-601: Standard Deployment path doesn't support `passthrough_mode`.

The standard Kubernetes Deployment path always sets `CUDA_FAULT_INJECTION_ENABLED=1`, ignoring the `passthrough_mode` parameter. This is inconsistent with the DynamoGraphDeployment path.

```diff
 if enable:
     # Add new env vars
+    fault_enabled_value = "0" if passthrough_mode else "1"
     container.env.append(
         client.V1EnvVar(name="LD_PRELOAD", value="/tmp/cuda_intercept.so")
     )
     container.env.append(
-        client.V1EnvVar(name="CUDA_FAULT_INJECTION_ENABLED", value="1")
+        client.V1EnvVar(name="CUDA_FAULT_INJECTION_ENABLED", value=fault_enabled_value)
     )
```

tests/fault_tolerance/hardware/fault_injection_service/helpers/cuda_fault_injection.py (1)
167-175: Duplicate `Args:` block in docstring.

The docstring has two `Args:` sections. The first (lines 167-170) describes `passthrough_mode`, and the second (lines 171-175) describes the other parameters.

```diff
-    Args:
-        passthrough_mode: If True, set CUDA_FAULT_INJECTION_ENABLED=0
-            (library loaded but faults disabled for baseline)
-
     Args:
         deployment_name: Name of the deployment
         namespace: Kubernetes namespace
         target_node: Node to pin pods to (simulates real XID behavior)
         xid_type: XID error type to simulate (79, 48, 94, 95, 43, 74). Default: 79
+        passthrough_mode: If True, set CUDA_FAULT_INJECTION_ENABLED=0
+            (library loaded but faults disabled for baseline)
```
🧹 Nitpick comments (7)
tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/README.md (1)
11-13: Add language specifier to fenced code block.

The code block should have a language specifier for proper syntax highlighting and linting compliance.

````diff
-```
+```text
 Pod calls cudaMalloc() → LD_PRELOAD intercepts → Checks /host-fault/cuda_fault_enabled → Returns error → Pod crashes
 ```
````

tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/cuda_intercept.c (2)

105-128: Per-call file I/O may impact CUDA-heavy workloads.

Reading the toggle file on every CUDA call introduces filesystem overhead. While this enables instant runtime control, it could affect performance for applications making thousands of CUDA calls per second. Consider:

1. Current approach is acceptable for fault injection testing scenarios, where the overhead is negligible compared to the intentional fault behavior.
2. Optional optimization: add a time-based cache (e.g., check the file every 100 ms) if performance becomes a concern in passthrough mode.

Additionally, the file reading logic is duplicated. A helper function would reduce code duplication.

```diff
+// Helper to read toggle value from file
+static int read_toggle_file(const char* path) {
+    FILE* f = fopen(path, "r");
+    if (!f) return -1;  // File not found
+
+    char buf[4] = {0};
+    int result = -1;
+    if (fgets(buf, sizeof(buf), f)) {
+        result = (buf[0] == '1') ? 1 : 0;
+    }
+    fclose(f);
+    return result;
+}

 // Runtime toggle: Check node-persistent fault marker on EVERY call
 // Use hostPath (/host-fault) so fault persists across pod restarts on same node
 // Pod reschedules to different node → no file there → automatic recovery!
 int runtime_inject = env_inject;  // Default to env var

-// Check hostPath first (persistent across restarts on same node)
-FILE* toggle_file = fopen("/host-fault/cuda_fault_enabled", "r");
-if (toggle_file) {
-    char toggle_value[4] = {0};
-    if (fgets(toggle_value, sizeof(toggle_value), toggle_file)) {
-        runtime_inject = (toggle_value[0] == '1');
-    }
-    fclose(toggle_file);
-} else {
-    // Fallback to ephemeral /tmp for backwards compatibility
-    toggle_file = fopen("/tmp/cuda_fault_enabled", "r");
-    if (toggle_file) {
-        char toggle_value[4] = {0};
-        if (fgets(toggle_value, sizeof(toggle_value), toggle_file)) {
-            runtime_inject = (toggle_value[0] == '1');
-        }
-        fclose(toggle_file);
-    }
-}
+// Check hostPath first (persistent across restarts on same node)
+int toggle_result = read_toggle_file("/host-fault/cuda_fault_enabled");
+if (toggle_result >= 0) {
+    runtime_inject = toggle_result;
+} else {
+    // Fallback to ephemeral /tmp for backwards compatibility
+    toggle_result = read_toggle_file("/tmp/cuda_fault_enabled");
+    if (toggle_result >= 0) {
+        runtime_inject = toggle_result;
+    }
+}
```
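For readers who want to reason about the lookup order outside the C library, the same precedence (hostPath marker, then the /tmp fallback, then the env var) can be mirrored in Python. This is a sketch for illustration, not code from the PR; the file paths and env var name come from the PR:

```python
import os

def resolve_fault_enabled(env=None):
    """Return True if faults should fire: the hostPath marker wins,
    then the ephemeral /tmp marker, then the environment variable."""
    env = dict(os.environ) if env is None else env
    for path in ("/host-fault/cuda_fault_enabled", "/tmp/cuda_fault_enabled"):
        try:
            with open(path) as f:
                # First character decides, matching the C library's check
                return f.read(1) == "1"
        except OSError:
            continue  # Marker file absent at this tier; fall through
    return env.get("CUDA_FAULT_INJECTION_ENABLED", "0") == "1"
```

Note that a pod rescheduled to a node without the marker file naturally falls through to the env var, which is the "automatic recovery" behavior described in the comments.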
155-165: Redundant `get_fault_config()` call in `log_intercept()`.

`log_intercept()` first calls `should_inject_fault()` (which calls `get_fault_config()`), then calls `get_fault_config()` again to get the XID. Since `get_fault_config()` now reads files on every call, this doubles the I/O for logged operations.

```diff
 // Log helper
 static void
-log_intercept(const char* func_name, cudaError_t error_code)
+log_intercept(const char* func_name, int xid, cudaError_t error_code)
 {
-    if (should_inject_fault()) {
-        int inject, xid;
-        cudaError_t err;
-        get_fault_config(&inject, &xid, &err);
-        fprintf(stderr, "[XID %d SIM] %s() intercepted -> error %d\n", xid, func_name, error_code);
-    }
+    fprintf(stderr, "[XID %d SIM] %s() intercepted -> error %d\n", xid, func_name, error_code);
 }
```

Then update call sites to pass the XID obtained from the initial `get_fault_config()` check:

```c
// Example for cudaGetDeviceCount:
int inject, xid;
cudaError_t error;
get_fault_config(&inject, &xid, &error);
if (inject) {
    log_intercept("cudaGetDeviceCount", xid, error);
    // ...
}
```

tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/inject_into_pods.py (1)
434-454: Update strategy is intentionally aggressive - document the trade-off.

Setting `maxUnavailable=100%` ensures all pods receive the fault injection simultaneously, which is correct for testing scenarios. However, this will cause complete service unavailability during the update.

Consider adding a comment or log message making this explicit to avoid confusion:

```diff
 # Allow all pods to be unavailable during update
 spec["updateStrategy"]["rollingUpdate"]["maxUnavailable"] = "100%"
 # Don't create surge pods
 spec["updateStrategy"]["rollingUpdate"]["maxSurge"] = 0
 print("   → Set update strategy: maxUnavailable=100%, maxSurge=0")
-print("      (All pods will update simultaneously)")
+print("      (All pods will update simultaneously - service will be unavailable during update)")
```

tests/fault_tolerance/hardware/fault_injection_service/helpers/cuda_fault_injection.py (3)
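The spec mutation discussed here can be sketched as a standalone dict patch; `apply_aggressive_update_strategy` is a hypothetical name for illustration, and the keys follow the Kubernetes rolling-update schema:

```python
def apply_aggressive_update_strategy(spec: dict) -> dict:
    """Sketch of the PR's approach: patch a workload spec so every pod
    restarts at once. Test-only behavior - this takes the service down
    for the duration of the rollout."""
    rolling = spec.setdefault("updateStrategy", {}).setdefault("rollingUpdate", {})
    # Allow all pods to be unavailable during update
    rolling["maxUnavailable"] = "100%"
    # Don't create surge pods
    rolling["maxSurge"] = 0
    return spec
```

The resulting dict can be passed to a strategic-merge patch call; the trade-off is exactly the one the comment above asks to document.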
129-146: Container name check could be fragile.

The check looks for containers named `"vllm-worker"` or `"worker"`, but the actual path checks `workerSpec.podSpec.containers`, which may differ from the extraPodSpec used elsewhere. Also, silently returning `False` on any exception could hide legitimate configuration issues.

Consider:
- Checking all containers rather than specific names, or
- Looking for LD_PRELOAD in the `extraPodSpec.mainContainer.env` path (consistent with the patching logic)

```diff
-        for container in containers:
-            if container.get("name") in ["vllm-worker", "worker"]:
-                env = container.get("env", [])
-                for env_var in env:
-                    if env_var.get("name") == "LD_PRELOAD":
-                        return True
-
-        return False
+        # Check services (consistent with patching logic)
+        services = spec.get("services", {})
+        for service_name in ["VllmDecodeWorker", "VllmPrefillWorker"]:
+            service = services.get(service_name, {})
+            env_vars = (
+                service.get("extraPodSpec", {})
+                .get("mainContainer", {})
+                .get("env", [])
+            )
+            for env_var in env_vars:
+                if env_var.get("name") == "LD_PRELOAD":
+                    return True
+        return False
```
609-633: Complex control flow is correct but hard to follow.

The nested `for`-`else`-`break` pattern works but is difficult to reason about. Consider restructuring for clarity:

```diff
-        for service_name in ["VllmDecodeWorker", "VllmPrefillWorker"]:
-            if service_name in dgd["spec"]["services"]:
-                service = dgd["spec"]["services"][service_name]
-                env_vars = (
-                    service.get("extraPodSpec", {})
-                    .get("mainContainer", {})
-                    .get("env", [])
-                )
-
-                for env_var in env_vars:
-                    if env_var.get("name") == "CUDA_FAULT_INJECTION_ENABLED":
-                        if env_var.get("value") != expected_value:
-                            time.sleep(1)
-                            break  # Try again
-                else:
-                    continue  # This service is good
-                break  # Inner loop broke, try again
-        else:
-            # All services verified
-            return True
+        all_match = True
+        for service_name in ["VllmDecodeWorker", "VllmPrefillWorker"]:
+            if service_name not in dgd["spec"]["services"]:
+                continue
+            service = dgd["spec"]["services"][service_name]
+            env_vars = (
+                service.get("extraPodSpec", {})
+                .get("mainContainer", {})
+                .get("env", [])
+            )
+            for env_var in env_vars:
+                if env_var.get("name") == "CUDA_FAULT_INJECTION_ENABLED":
+                    if env_var.get("value") != expected_value:
+                        all_match = False
+                        break
+
+        if all_match:
+            return True
+        time.sleep(1)
```
393-395: Inconsistent type hint.

`enable_cuda_faults_via_toggle` uses `pods: List` while `disable_cuda_faults_via_toggle` uses `pods: List[client.V1Pod]`. Use consistent typing.

```diff
 def enable_cuda_faults_via_toggle(
-    self, pods: List, namespace: str, enable: bool = True
+    self, pods: List[client.V1Pod], namespace: str, enable: bool = True
 ) -> bool:
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/README.md (2 hunks)
- tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/cuda_intercept.c (3 hunks)
- tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/inject_into_pods.py (9 hunks)
- tests/fault_tolerance/hardware/fault_injection_service/helpers/cuda_fault_injection.py (5 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/README.md
11-11: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.7)
tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/inject_into_pods.py
542-542: Do not catch blind exception: Exception
(BLE001)
549-549: Do not catch blind exception: Exception
(BLE001)
tests/fault_tolerance/hardware/fault_injection_service/helpers/cuda_fault_injection.py
142-142: Consider moving this statement to an else block
(TRY300)
144-144: Do not catch blind exception: Exception
(BLE001)
441-441: subprocess call: check for execution of untrusted input
(S603)
442-452: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
477-477: Do not catch blind exception: Exception
(BLE001)
548-548: subprocess call: check for execution of untrusted input
(S603)
549-559: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
570-571: try-except-continue detected, consider logging the exception
(S112)
570-570: Do not catch blind exception: Exception
(BLE001)
630-630: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (7)
tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/README.md (1)
9-54: LGTM!

The documentation updates accurately describe the new runtime toggling mechanism, hostPath-based persistence, and the phased workflow. The "Key Features" and "How It Works" sections provide clear guidance on the architecture.
tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/cuda_intercept.c (1)
167-312: LGTM!

The CUDA function interception implementations follow a consistent pattern: check for fault injection, return an error if enabled, otherwise delegate to the real function via `dlsym(RTLD_NEXT, ...)`. This is a clean and maintainable approach.

tests/fault_tolerance/hardware/fault_injection_service/cuda_fault_injection/inject_into_pods.py (3)
414-422: LGTM!

The `passthrough_mode` implementation cleanly sets `CUDA_FAULT_INJECTION_ENABLED=0` for baseline testing with the library loaded but inactive. This enables the phased workflow: deploy → toggle ON → toggle OFF.
534-551: Force deletion is aggressive but appropriate for fault injection.

Using `grace_period_seconds=0` immediately terminates pods without graceful shutdown. This is acceptable for fault injection testing where you're simulating catastrophic failures.

The exception handling logs errors and continues, which is the right approach for batch operations. The static analysis warnings about blind exception catches (BLE001) are acceptable here since we're logging and intentionally continuing.
204-214: LGTM - hostPath volume for persistent fault marker is properly implemented.

Using `DirectoryOrCreate` for the hostPath is appropriate. The path `/var/lib/cuda-fault-test` (mounted as `/host-fault` in containers) follows the intended pattern documented in the fault injection README and is properly cleaned up by `cleanup_node_fault_markers()` to prevent conflicts between test runs.

tests/fault_tolerance/hardware/fault_injection_service/helpers/cuda_fault_injection.py (2)
433-456: LGTM - Toggle implementation with verification.

The toggle implementation correctly:
- Creates the directory if needed (`mkdir -p`)
- Writes the toggle value
- Reads back to verify the write succeeded

The `pod_name` comes from the Kubernetes API, so the S603 static analysis warning is a false positive. The timeout of 10 seconds is appropriate.
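The exec-based toggle amounts to building a `kubectl exec` argv; `build_toggle_command` below is a hypothetical helper sketching what this comment describes (the toggle path is the PR's; the exact shell line is an assumption):

```python
def build_toggle_command(pod_name: str, namespace: str, enable: bool) -> list:
    """Build the kubectl exec argv that creates the marker directory,
    writes the toggle value, and reads it back for verification."""
    value = "1" if enable else "0"
    shell = (
        f"mkdir -p /host-fault && "
        f"echo {value} > /host-fault/cuda_fault_enabled && "
        f"cat /host-fault/cuda_fault_enabled"
    )
    return ["kubectl", "exec", "-n", namespace, pod_name, "--", "sh", "-c", shell]
```

Running this with `subprocess.run(cmd, capture_output=True, timeout=10)` and comparing stdout to the written value gives the read-back verification described above.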
504-573: LGTM - Robust node cleanup with deduplication.

The cleanup logic correctly:
- Tracks cleaned nodes to avoid duplicate operations
- Uses `rm -f` to silently handle missing files
- Continues on failure to clean up as many nodes as possible

The silent exception catch (S112/BLE001) is acceptable here since we want resilient cleanup behavior.
…gs section and fixed inconsistent type hint Signed-off-by: Oviya Seeniraj <[email protected]>
saturley-hall
left a comment
This seems fine, however I am still not wild about adding functionality that is not exercised in the PR (cleanup_node_fault_markers(), disable_cuda_faults_via_toggle(), verify_env_var_set()). In the future please try and keep the functionality as atomic as possible and not as a chain of PRs.
/ok to test 2a9d973
…ithout pod restarts (ai-dynamo#4679) Signed-off-by: Elijah Soba <[email protected]>
…ithout pod restarts (ai-dynamo#4679)
Overview:
Implements runtime fault injection control via filesystem toggles and hostPath volumes, eliminating the need for pod restarts when enabling/disabling CUDA faults (except for initial library deployment).
Previously, enabling/disabling CUDA fault injection required modifying environment variables and restarting pods, which is unrealistic for testing fault tolerance recovery scenarios. This PR introduces a 3-tier toggling system that allows instant fault activation/deactivation in running pods.
Details:
`cuda_intercept.c` - Runtime toggle detection
- `get_fault_config()` now checks fault injection status on every CUDA call rather than once at initialization
- Toggle precedence: `/host-fault/cuda_fault_enabled` (persistent) → `/tmp/cuda_fault_enabled` (ephemeral) → `CUDA_FAULT_INJECTION_ENABLED` env var

`inject_into_pods.py` - Persistent volume infrastructure
- Mounts hostPath `/var/lib/cuda-fault-test` (host) to `/host-fault` (pod)
- Passthrough mode sets `CUDA_FAULT_INJECTION_ENABLED=0`, allowing baseline testing before fault injection

`cuda_fault_injection.py` - New helper methods
- `check_if_cuda_library_deployed()`: Detects if the CUDA library is already injected (checks LD_PRELOAD, init containers, volumes)
- `enable_cuda_faults_via_toggle()`: Writes "1" to /host-fault/cuda_fault_enabled in running pods via kubectl exec (no restart needed)
- `disable_cuda_faults_via_toggle()`: Writes "0" to the toggle file for instant recovery
- `cleanup_node_fault_markers()`: Removes /host-fault/cuda_fault_enabled from nodes (cleanup after tests)
- `verify_env_var_set()`: Validates environment variable propagation to the deployment spec

Architecture Context:
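Assuming plain-dict pod-spec fragments, the hostPath wiring listed under `inject_into_pods.py` looks roughly like this (the volume name is an invented placeholder; both paths come from the PR description):

```python
HOST_PATH = "/var/lib/cuda-fault-test"   # on the node (from the PR)
MOUNT_PATH = "/host-fault"               # inside the container (from the PR)
VOLUME_NAME = "cuda-fault-marker"        # placeholder name, not from the PR

def fault_marker_patch() -> dict:
    """Pod-spec fragment adding the hostPath volume and its container mount.
    DirectoryOrCreate makes kubelet create the node directory on first use."""
    return {
        "volumes": [{
            "name": VOLUME_NAME,
            "hostPath": {"path": HOST_PATH, "type": "DirectoryOrCreate"},
        }],
        "volumeMounts": [{"name": VOLUME_NAME, "mountPath": MOUNT_PATH}],
    }
```

Because the marker lives on the node rather than in the pod, a restarted pod on the same node still sees the fault, while a pod rescheduled to a clean node recovers automatically.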
Where should the reviewer start?
Related Issues:
Builds on PR #4038