-
Notifications
You must be signed in to change notification settings - Fork 738
test(fault-injection): Add XID 79 NVSentinel E2E test #4046
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Validates complete automated fault tolerance workflow: 1. XID 79 injection → syslog-health-monitor detects 2. CUDA faults → pods crash naturally (realistic GPU failure) 3. fault-quarantine-module cordons node automatically 4. node-drainer-module drains pods automatically 5. fault-remediation-module restarts GPU (optional) 6. Node uncordons automatically 7. Inference recovers
…for helper module is from previous PR that this builds on
pvijayakrish
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess my main concern is that the test assumes namespace dynamo-oviya exists. Lets maybe use pytest fixtures to deploy isolated test workload.
WalkthroughA new end-to-end test file is added that validates NVSentinel fault tolerance workflows. It introduces a helper class for monitoring Kubernetes state changes and a comprehensive test that executes fault injection, observes quarantine and drain operations, triggers remediation, and verifies recovery. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Poem
Pre-merge checks✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/fault_tolerance/hardware/fault-injection-service/examples/test_xid79_nvsentinel_automated.py (3)
79-95: Verify NVSentinel quarantine annotation keys used inget_node_quarantine_status
get_node_quarantine_status()looks for raw keys:quarantine_key = "quarantineHealthEvent" is_cordoned_key = "quarantineHealthEventIsCordoned" ... "has_quarantine_annotation": quarantine_key in annotations, "is_cordoned": annotations.get(is_cordoned_key) == "True",The comment says “Actual annotation keys (without nvidia.com prefix)”, but Kubernetes annotations are often stored with vendor prefixes (e.g.,
nvidia.com/quarantineHealthEvent). If the real keys include a prefix,wait_for_quarantine()will never seeis_cordoned=Trueand will always time out.Please double‑check the actual annotation names on the nodes (e.g., via
kubectl get node -oyaml) and adjust the keys accordingly or normalize by stripping prefixes before comparison.
224-333: Tighten cleanup logging and address Ruff warnings incleanup_on_exitThe cleanup fixture is thorough and well-structured. A few small tweaks will align it with tooling and reduce noise:
- Remove f-strings without placeholders (Ruff F541):
- print(f" ✓ Node uncordoned") + print(" ✓ Node uncordoned") ... - print(f" ✓ Node already schedulable") + print(" ✓ Node already schedulable") ... - print(f" ✓ NVSentinel annotations removed") + print(" ✓ NVSentinel annotations removed") ... - print(f" ✓ No NVSentinel annotations to clean") + print(" ✓ No NVSentinel annotations to clean")
- Avoid ambiguous
ℹcharacter in logs (RUF001) while keeping the dynamic part:- print(f"[ℹ] Final pod status: {len(ready_pods)}/{len(pods.items)} ready") + print(f"[INFO] Final pod status: {len(ready_pods)}/{len(pods.items)} ready")
- Broad exception handlers (
except Exception) here are understandable for best‑effort cleanup, but if Ruff still flags BLE001 for this file, you may want to either narrow them (e.g., to specific Kubernetes/requests exceptions where applicable) or explicitly# noqa: BLE001where you intentionally want a catch‑all.Also note that
load_tester.stop()is called in both the main test and here; ensureInferenceLoadTester.stop()is idempotent, or clearcleanup_state["load_tester"] = Noneafter stopping in the test to avoid a double stop.
335-661: Main test flow looks solid; adjust minor style issues and unused piecesThe overall phase structure and assertions are clear and match the PR description. A few minor points:
- Remove f-strings with no interpolation (Ruff F541) in the main test:
- print(f"[✓] CUDA fault library active - pods will crash naturally") + print("[✓] CUDA fault library active - pods will crash naturally") ... - print(f"\n[→] Checking if node-drainer has started drain process...") + print("\n[→] Checking if node-drainer has started drain process...") ... - print(f"[✓] node-drainer is draining the node (AllowCompletion mode)") + print("[✓] node-drainer is draining the node (AllowCompletion mode)") - print(f" Config: deleteAfterTimeoutMinutes=60 (would take 60 minutes)") + print(" Config: deleteAfterTimeoutMinutes=60 (would take 60 minutes)") - print(f" Test optimization: We'll accelerate this for testing") + print(" Test optimization: We'll accelerate this for testing") ... - print(f" Pods may already be gone or drain hasn't started") + print(" Pods may already be gone or drain hasn't started")
Unused helpers / constants:
NVSentinelMonitor.wait_for_drain(),NVSentinelMonitor.wait_for_uncordon(), and theDRAIN_TIMEOUTconstant are currently unused, while the test explicitly says “Auto-uncordon not tested”. Either wire these into the flow (e.g., if you later want to assert auto‑uncordon) or drop them to keep the test focused.Status logging loop: the periodic log condition
if elapsed - (last_status_time - start_time) >= 30:is equivalent to
time.time() - last_status_time >= 30but less direct. Consider simplifying for readability if you touch this section again.Functionally the test logic and timeouts look reasonable for a long‑running E2E test; the main blocker is still the helper imports/CI setup mentioned earlier.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/fault_tolerance/hardware/fault-injection-service/examples/test_xid79_nvsentinel_automated.py(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3193
File: tests/fault_tolerance/cancellation/test_trtllm.py:4-204
Timestamp: 2025-09-25T00:54:01.369Z
Learning: The fault tolerance tests in tests/fault_tolerance/cancellation/ run in a controlled container environment where files written to /workspace are automatically cleaned up after test completion, and tests execute sequentially without concurrency concerns, so temporary file management for config files is not necessary.
🪛 GitHub Actions: NVIDIA Dynamo Github Validation
tests/fault_tolerance/hardware/fault-injection-service/examples/test_xid79_nvsentinel_automated.py
[error] 34-34: Pytest collection failed: ModuleNotFoundError: No module named 'cuda_fault_injection'. Command: 'pytest --basetemp=/tmp --junitxml=pytest_test_report.xml -m "pre_merge or mypy"'
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4046/merge) by nv-oviya.
tests/fault_tolerance/hardware/fault-injection-service/examples/test_xid79_nvsentinel_automated.py
[error] 1-1: pre-commit run --show-diff-on-failure --color=always --all-files: Black formatting hook reformatted the file (test_xid79_nvsentinel_automated.py).
[error] 1-1: pre-commit run --show-diff-on-failure --color=always --all-files: Ruff hook fixed 10 issues in the repository (including this file).
🪛 Ruff (0.14.4)
tests/fault_tolerance/hardware/fault-injection-service/examples/test_xid79_nvsentinel_automated.py
96-96: Do not catch blind exception: Exception
(BLE001)
218-218: Consider moving this statement to an else block
(TRY300)
219-219: Do not catch blind exception: Exception
(BLE001)
257-257: Do not catch blind exception: Exception
(BLE001)
271-271: Do not catch blind exception: Exception
(BLE001)
284-284: f-string without any placeholders
Remove extraneous f prefix
(F541)
286-286: f-string without any placeholders
Remove extraneous f prefix
(F541)
304-304: f-string without any placeholders
Remove extraneous f prefix
(F541)
306-306: f-string without any placeholders
Remove extraneous f prefix
(F541)
308-308: Do not catch blind exception: Exception
(BLE001)
323-323: String contains ambiguous ℹ (INFORMATION SOURCE). Did you mean i (LATIN SMALL LETTER I)?
(RUF001)
324-324: Do not catch blind exception: Exception
(BLE001)
329-329: Do not catch blind exception: Exception
(BLE001)
489-489: f-string without any placeholders
Remove extraneous f prefix
(F541)
519-519: f-string without any placeholders
Remove extraneous f prefix
(F541)
525-525: f-string without any placeholders
Remove extraneous f prefix
(F541)
526-526: f-string without any placeholders
Remove extraneous f prefix
(F541)
527-527: f-string without any placeholders
Remove extraneous f prefix
(F541)
530-530: f-string without any placeholders
Remove extraneous f prefix
(F541)
| import os | ||
| import sys | ||
| import time | ||
| from pathlib import Path | ||
|
|
||
| import pytest | ||
| import requests | ||
| from kubernetes import client, config | ||
|
|
||
| # Add helpers to path | ||
| sys.path.insert(0, str(Path(__file__).parent.parent / "helpers")) | ||
|
|
||
| from cuda_fault_injection import CUDAFaultInjector | ||
| from inference_testing import InferenceLoadTester | ||
| from k8s_operations import NodeOperations | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Fix helper module imports to avoid ModuleNotFoundError in CI
Pytest collection is currently failing with ModuleNotFoundError: No module named 'cuda_fault_injection', even before markers are applied. Since this import happens at module import time, it breaks pytest -m "pre_merge or mypy" runs.
Either ensure the helper modules are present in the repo on this branch (preferred if the dependent PRs are already merged), or guard the imports so missing helpers cause a skip instead of a hard failure. For example:
-import os
-import sys
-import time
-from pathlib import Path
-
-import pytest
-import requests
-from kubernetes import client, config
-
-# Add helpers to path
-sys.path.insert(0, str(Path(__file__).parent.parent / "helpers"))
-
-from cuda_fault_injection import CUDAFaultInjector
-from inference_testing import InferenceLoadTester
-from k8s_operations import NodeOperations
+import os
+import sys
+import time
+from pathlib import Path
+
+import pytest
+import requests
+from kubernetes import client, config
+
+# Add helpers to path
+sys.path.insert(0, str(Path(__file__).parent.parent / "helpers"))
+
+# Import helper modules lazily; if they are missing, skip this E2E test module
+cuda_fault_injection = pytest.importorskip(
+ "cuda_fault_injection",
+ reason="NVSentinel XID-79 E2E test requires cuda_fault_injection helper module.",
+)
+inference_testing = pytest.importorskip(
+ "inference_testing",
+ reason="NVSentinel XID-79 E2E test requires inference_testing helper module.",
+)
+k8s_operations = pytest.importorskip(
+ "k8s_operations",
+ reason="NVSentinel XID-79 E2E test requires k8s_operations helper module.",
+)
+
+CUDAFaultInjector = cuda_fault_injection.CUDAFaultInjector
+InferenceLoadTester = inference_testing.InferenceLoadTester
+NodeOperations = k8s_operations.NodeOperationsAlso remember to re-run pre-commit run --all-files so Black/Ruff changes are committed; current CI shows the hooks are reformatting this file.
🏁 Script executed:
#!/bin/bash
# Check if helpers directory exists and list its contents
echo "=== Checking helpers directory ==="
find . -type d -name "helpers" | head -20
echo ""
echo "=== Contents of helpers directories ==="
find . -type d -name "helpers" -exec ls -la {} \;
echo ""
echo "=== Searching for cuda_fault_injection, inference_testing, k8s_operations files ==="
fd -e py | xargs grep -l "cuda_fault_injection\|inference_testing\|k8s_operations" | head -30
echo ""
echo "=== Checking if helpers are defined anywhere ==="
fd -e py | xargs grep -l "class CUDAFaultInjector\|class InferenceLoadTester\|class NodeOperations" | head -30Length of output: 309
🏁 Script executed:
#!/bin/bash
# Examine the test file structure
echo "=== Test file location and structure ==="
cat tests/fault_tolerance/hardware/fault-injection-service/examples/test_xid79_nvsentinel_automated.py | head -50
echo ""
echo "=== Check for pytest.importorskip usage in repo ==="
rg "pytest\.importorskip|importorskip" --type py -A 2 -B 2 | head -40Length of output: 1566
🏁 Script executed:
#!/bin/bash
# Verify if the __init__.py files exist in the helpers path
echo "=== Checking __init__.py files ==="
find tests/fault_tolerance/hardware/fault-injection-service -name "__init__.py" -exec ls -la {} \;
echo ""
echo "=== Full directory structure of fault-injection-service ==="
find tests/fault_tolerance/hardware/fault-injection-service -type f -name "*.py" | sortLength of output: 194
Based on the shell script results, I can now confirm the findings:
The review comment is factually accurate. The verification reveals:
- No helpers directory exists in the repository structure
- None of the three imported modules are defined anywhere in the codebase (cuda_fault_injection, inference_testing, k8s_operations)
- The classes are not defined (CUDAFaultInjector, InferenceLoadTester, NodeOperations)
- The test file is isolated with no supporting infrastructure present
- pytest.importorskip() is not currently used elsewhere in the repo, but it is the standard pytest pattern for optional test dependencies
The ModuleNotFoundError will indeed block pytest collection, preventing pytest -m "pre_merge or mypy" from running. The proposed solution using pytest.importorskip() is the correct approach for handling missing optional test dependencies—it will gracefully skip the entire test module rather than fail hard during collection.
Fix helper module imports to avoid ModuleNotFoundError in CI
Pytest collection is currently failing with ModuleNotFoundError: No module named 'cuda_fault_injection', blocking marker-based test runs. Since the helper modules do not exist in this branch (at tests/fault_tolerance/hardware/fault-injection-service/helpers/), guard the imports with pytest.importorskip() to skip the test module instead of failing collection:
-import os
-import sys
-import time
-from pathlib import Path
-
-import pytest
-import requests
-from kubernetes import client, config
-
-# Add helpers to path
-sys.path.insert(0, str(Path(__file__).parent.parent / "helpers"))
-
-from cuda_fault_injection import CUDAFaultInjector
-from inference_testing import InferenceLoadTester
-from k8s_operations import NodeOperations
+import os
+import sys
+import time
+from pathlib import Path
+
+import pytest
+import requests
+from kubernetes import client, config
+
+# Add helpers to path
+sys.path.insert(0, str(Path(__file__).parent.parent / "helpers"))
+
+# Import helper modules lazily; if missing, skip this E2E test module
+cuda_fault_injection = pytest.importorskip(
+ "cuda_fault_injection",
+ reason="NVSentinel XID-79 E2E test requires cuda_fault_injection helper module.",
+)
+inference_testing = pytest.importorskip(
+ "inference_testing",
+ reason="NVSentinel XID-79 E2E test requires inference_testing helper module.",
+)
+k8s_operations = pytest.importorskip(
+ "k8s_operations",
+ reason="NVSentinel XID-79 E2E test requires k8s_operations helper module.",
+)
+
+CUDAFaultInjector = cuda_fault_injection.CUDAFaultInjector
+InferenceLoadTester = inference_testing.InferenceLoadTester
+NodeOperations = k8s_operations.NodeOperationsAlso run pre-commit run --all-files to resolve remaining Black/Ruff formatting violations.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| import os | |
| import sys | |
| import time | |
| from pathlib import Path | |
| import pytest | |
| import requests | |
| from kubernetes import client, config | |
| # Add helpers to path | |
| sys.path.insert(0, str(Path(__file__).parent.parent / "helpers")) | |
| from cuda_fault_injection import CUDAFaultInjector | |
| from inference_testing import InferenceLoadTester | |
| from k8s_operations import NodeOperations | |
| import os | |
| import sys | |
| import time | |
| from pathlib import Path | |
| import pytest | |
| import requests | |
| from kubernetes import client, config | |
| # Add helpers to path | |
| sys.path.insert(0, str(Path(__file__).parent.parent / "helpers")) | |
| # Import helper modules lazily; if missing, skip this E2E test module | |
| cuda_fault_injection = pytest.importorskip( | |
| "cuda_fault_injection", | |
| reason="NVSentinel XID-79 E2E test requires cuda_fault_injection helper module.", | |
| ) | |
| inference_testing = pytest.importorskip( | |
| "inference_testing", | |
| reason="NVSentinel XID-79 E2E test requires inference_testing helper module.", | |
| ) | |
| k8s_operations = pytest.importorskip( | |
| "k8s_operations", | |
| reason="NVSentinel XID-79 E2E test requires k8s_operations helper module.", | |
| ) | |
| CUDAFaultInjector = cuda_fault_injection.CUDAFaultInjector | |
| InferenceLoadTester = inference_testing.InferenceLoadTester | |
| NodeOperations = k8s_operations.NodeOperations |
🧰 Tools
🪛 GitHub Actions: NVIDIA Dynamo Github Validation
[error] 34-34: Pytest collection failed: ModuleNotFoundError: No module named 'cuda_fault_injection'. Command: 'pytest --basetemp=/tmp --junitxml=pytest_test_report.xml -m "pre_merge or mypy"'
🤖 Prompt for AI Agents
In
tests/fault_tolerance/hardware/fault-injection-service/examples/test_xid79_nvsentinel_automated.py
around lines 22-37, imports from a non-existent helpers package
(cuda_fault_injection, inference_testing, k8s_operations) cause
ModuleNotFoundError during pytest collection; replace direct imports with guards
using pytest.importorskip() for each optional helper module (e.g., call
pytest.importorskip("cuda_fault_injection") etc.) so the whole test module is
skipped when helpers are missing, and then run pre-commit (pre-commit run
--all-files) to fix any Black/Ruff style issues.
|
|
||
| # Test configuration | ||
| TARGET_DEPLOYMENT = os.getenv("TARGET_DEPLOYMENT", "vllm-v1-disagg-router") | ||
| NAMESPACE = "dynamo-oviya" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| NAMESPACE = "dynamo-oviya" | |
| NAMESPACE = "dynamo-test" |
nnshah1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing - can we change the name of the test file so it has to be run explicitly and not auto collected by pytest?
…collect it Signed-off-by: Oviya Seeniraj <[email protected]>
Signed-off-by: Oviya Seeniraj <[email protected]>
Overview:
This PR adds a comprehensive E2E test that validates the complete NVSentinel automated fault tolerance pipeline using realistic GPU failure simulation. This test validates that NVSentinel components work together end-to-end with XID injection, auto-detection, and CUDA failures.
Details:
examples/test_xid79_nvsentinel_automated.py: Fully automated NVSentinel testWhere should the reviewer start?
test_xid79_nvsentinel_automated.py- Complete E2E test:Setup and configuration (Lines 1-66):
NVSentinelMonitor class (Lines 68-221):
get_node_quarantine_status()- Check NVSentinel annotationswait_for_quarantine()- Wait for fault-quarantine to cordonwait_for_drain()- Wait for node-drainer to drainwait_for_remediation()- Wait for GPU restart (optional)wait_for_uncordon()- Wait for automatic uncordoncheck_nvsentinel_health()- Verify components runningCleanup fixture (Lines 223-332):
Main test (Lines 334-641):
Key technical details:
Why this test is significant:
Test optimizations:
NVSentinel components validated:
Running the test:
Prerequisites:
Run locally (with port-forward):
Run in-cluster (recommended):
Related Issues:
Summary by CodeRabbit
Tests