test(fault-injection): Add XID 79 NVSentinel E2E test #4046

nv-oviya · 2025-11-02T01:05:44Z

Overview:

This PR adds a comprehensive E2E test that validates the complete NVSentinel automated fault tolerance pipeline using realistic GPU failure simulation. This test validates that NVSentinel components work together end-to-end with XID injection, auto-detection, and CUDA failures.

Details:

Add examples/test_xid79_nvsentinel_automated.py: Fully automated NVSentinel test
- Phase 0: Prerequisites check (API health, NVSentinel components, baseline inference)
- Phase 1: XID 79 injection via API --> syslog-health-monitor detects
- Phase 1.5: CUDA fault injection --> pods crash naturally (simulates real GPU failure)
- Phase 2: Wait for fault-quarantine-module to cordon node automatically
- Phase 3: Wait for node-drainer-module to drain pods automatically
- Phase 4: Clean CUDA artifacts (simulates GPU remediation)
- Phase 5: Wait for automatic uncordon
- Phase 6: Validate inference recovery
- Cleanup fixture: Ensures cleanup runs even on test failure/Ctrl+C

Where should the reviewer start?

test_xid79_nvsentinel_automated.py - Complete E2E test:

Setup and configuration (Lines 1-66):
- Imports all helpers from previous PRs
- Configurable timeouts for each phase
- Environment detection (in-cluster vs local)
NVSentinelMonitor class (Lines 68-221):
- Lines 75-93: get_node_quarantine_status() - Check NVSentinel annotations
- Lines 95-112: wait_for_quarantine() - Wait for fault-quarantine to cordon
- Lines 114-150: wait_for_drain() - Wait for node-drainer to drain
- Lines 152-173: wait_for_remediation() - Wait for GPU restart (optional)
- Lines 175-191: wait_for_uncordon() - Wait for automatic uncordon
- Lines 193-220: check_nvsentinel_health() - Verify components running
Cleanup fixture (Lines 223-332):
- Lines 243-260: Stop load tester and clean CUDA faults
- Lines 262-271: Clean fault API
- Lines 274-308: Ensure node is uncordoned and clean
- Note: Runs even on Ctrl+C or test failure (pytest fixture)
Main test (Lines 334-641):
- Lines 362-443: Phase 0 - Prerequisites (API, NVSentinel, baseline)
- Lines 445-462: Phase 1 - XID 79 injection via API
- Lines 464-493: Phase 1.5 - CUDA fault injection (realistic failure)
- Lines 496-508: Phase 2 - Automatic quarantine validation
- Lines 510-529: Phase 3 - Drain detection
- Lines 531-556: Phase 4 - Accelerated cleanup (test optimization)
- Lines 558-566: Phase 5 - Automatic uncordon validation
- Lines 568-613: Phase 6 - Recovery validation (pods + inference)
- Lines 615-637: Final summary with complete workflow validation
Key technical details:

Why this test is significant:
- Uses XID injection (writes to kernel logs like real GPU failure)
- Uses CUDA failures (pods crash with CUDA_ERROR_NO_DEVICE)
- Validates real NVSentinel automation
- Tests complete workflow (detection --> quarantine --> drain --> remediation (module unavailable in current deployment) --> recovery)
Test optimizations:
- Phase 4 accelerates drain (normally 60 min timeout, we force-delete for testing)
- Cleanup fixture ensures no test artifacts left behind
- Continuous load tester validates inference availability throughout
NVSentinel components validated:
- syslog-health-monitor: Detects XID from kernel logs
- fault-quarantine-module: Cordons faulty node
- node-drainer-module: Drains pods with AllowCompletion policy
- fault-remediation-module: Restarts GPU driver (optional as the module is currently undeployed in cluster)
Running the test:

Prerequisites:
- NVSentinel deployed in cluster (namespace: nvsentinel)
- Fault injection API deployed (from PR feat(fault-injection): Add fault injection API service #4042)
- GPU fault injector agents deployed (from PR feat(fault-injection): Add GPU fault injector agent #4043)
- vLLM deployment running (e.g., vllm-v1-disagg-router)
Run locally (with port-forward):
```
# Terminal 1: Port-forward API
kubectl port-forward -n fault-injection-system svc/fault-injection-api 8080:8080

# Terminal 2: Port-forward inference
kubectl port-forward -n dynamo-oviya svc/vllm-v1-disagg-router 8000:80

# Terminal 3: Run test
pytest examples/test_xid79_nvsentinel_automated.py -v -s
```
Run in-cluster (recommended):
```
python3 scripts/run_test_incluster.py examples/test_xid79_nvsentinel_automated.py
```

Related Issues:

Depends on: PR feat(fault-injection): Add CUDA fault injection library foundation #4038 (CUDA fault injection library)
Depends on: PR feat(fault-injection): Add CUDA library injection tools #4039 (inject_into_pods.py)
Depends on: PR feat(fault-injection): Add core testing helper utilities #4040 (Helper utilities)
Depends on: PR feat(fault-injection): Add high-level CUDA fault injection helper #4041 (CUDA integration helper)
Depends on: PR feat(fault-injection): Add fault injection API service #4042(API service)
Depends on: PR feat(fault-injection): Add GPU fault injector agent #4043 (GPU fault injector agent)
Depends on: PR feat(fault-injection): Add Kubernetes deployment manifests #4044 (Deployment manifests)
Depends on: PR test(fault-injection): Add test framework configuration #4045 (Test infrastructure)
Validates: NVSentinel automated fault tolerance pipeline
Relates to ongoing fault tolerance initiative

Summary by CodeRabbit

Tests

Added comprehensive automated end-to-end test suite for NVSentinel fault tolerance workflows. Validates fault injection mechanisms, node quarantine and drain procedures, remediation processes, infrastructure recovery, and system stability with continuous health monitoring, detailed progress reporting, and robust error handling and cleanup procedures.

Validates complete automated fault tolerance workflow: 1. XID 79 injection → syslog-health-monitor detects 2. CUDA faults → pods crash naturally (realistic GPU failure) 3. fault-quarantine-module cordons node automatically 4. node-drainer-module drains pods automatically 5. fault-remediation-module restarts GPU (optional) 6. Node uncordons automatically 7. Inference recovers

copy-pr-bot · 2025-11-02T01:05:48Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…for helper module is from previous PR that this builds on

pvijayakrish

I guess my main concern is that the test assumes namespace dynamo-oviya exists. Lets maybe use pytest fixtures to deploy isolated test workload.

coderabbitai · 2025-11-14T23:43:30Z

Walkthrough

A new end-to-end test file is added that validates NVSentinel fault tolerance workflows. It introduces a helper class for monitoring Kubernetes state changes and a comprehensive test that executes fault injection, observes quarantine and drain operations, triggers remediation, and verifies recovery.

Changes

Cohort / File(s)	Summary
New NVSentinel End-to-End Test `tests/fault_tolerance/hardware/fault-injection-service/examples/test_xid79_nvsentinel_automated.py`	Adds `NVSentinelMonitor` helper class with methods to query and wait for Kubernetes node states (quarantine, drain, remediation, uncordon), health-check mechanism for NVSentinel components, `cleanup_on_exit` pytest fixture for robust teardown, and comprehensive `test_xid79_nvsentinel_automated` test orchestrating fault injection, state monitoring, and recovery validation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Dense integration logic across Kubernetes API, fault injection API, and CUDA components requires verification of correct state transitions and timeout handling
Multiple asynchronous wait loops and conditional remediation paths need careful validation for race conditions and edge cases
Teardown fixture complexity and fault lifecycle orchestration warrant attention to ensure proper resource cleanup and test isolation

Poem

🐰 A sentinel stands guard o'er nodes of cloud,
Detecting faults with whiskers proud—
When CUDA fails and nodes turn gray,
This test ensures they heal and stay!
Hops ✨

Pre-merge checks

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'test(fault-injection): Add XID 79 NVSentinel E2E test' clearly and specifically describes the main change: adding an end-to-end test for the NVSentinel fault tolerance workflow using XID 79 injection.
Description check	✅ Passed	The description follows the template structure with Overview, Details, Where to start, and Related Issues sections. It provides comprehensive context including test phases, NVSentinel components, technical details, and execution instructions.
Docstring Coverage	✅ Passed	Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

tests/fault_tolerance/hardware/fault-injection-service/examples/test_xid79_nvsentinel_automated.py (3)
79-95: Verify NVSentinel quarantine annotation keys used in get_node_quarantine_status

get_node_quarantine_status() looks for raw keys:
quarantine_key = "quarantineHealthEvent"
is_cordoned_key = "quarantineHealthEventIsCordoned"
...
"has_quarantine_annotation": quarantine_key in annotations,
"is_cordoned": annotations.get(is_cordoned_key) == "True",
The comment says “Actual annotation keys (without nvidia.com prefix)”, but Kubernetes annotations are often stored with vendor prefixes (e.g., nvidia.com/quarantineHealthEvent). If the real keys include a prefix, wait_for_quarantine() will never see is_cordoned=True and will always time out.

Please double‑check the actual annotation names on the nodes (e.g., via kubectl get node -oyaml) and adjust the keys accordingly or normalize by stripping prefixes before comparison.

224-333: Tighten cleanup logging and address Ruff warnings in cleanup_on_exit

The cleanup fixture is thorough and well-structured. A few small tweaks will align it with tooling and reduce noise:

Remove f-strings without placeholders (Ruff F541):
-                    print(f"    ✓ Node uncordoned")
+                    print("    ✓ Node uncordoned")
...
-                    print(f"    ✓ Node already schedulable")
+                    print("    ✓ Node already schedulable")
...
-                    print(f"    ✓ NVSentinel annotations removed")
+                    print("    ✓ NVSentinel annotations removed")
...
-                    print(f"    ✓ No NVSentinel annotations to clean")
+                    print("    ✓ No NVSentinel annotations to clean")
Avoid ambiguous ℹ character in logs (RUF001) while keeping the dynamic part:
-            print(f"[ℹ] Final pod status: {len(ready_pods)}/{len(pods.items)} ready")
+            print(f"[INFO] Final pod status: {len(ready_pods)}/{len(pods.items)} ready")
Broad exception handlers (except Exception) here are understandable for best‑effort cleanup, but if Ruff still flags BLE001 for this file, you may want to either narrow them (e.g., to specific Kubernetes/requests exceptions where applicable) or explicitly # noqa: BLE001 where you intentionally want a catch‑all.

Also note that load_tester.stop() is called in both the main test and here; ensure InferenceLoadTester.stop() is idempotent, or clear cleanup_state["load_tester"] = None after stopping in the test to avoid a double stop.

335-661: Main test flow looks solid; adjust minor style issues and unused pieces

The overall phase structure and assertions are clear and match the PR description. A few minor points:

Remove f-strings with no interpolation (Ruff F541) in the main test:
-        print(f"[✓] CUDA fault library active - pods will crash naturally")
+        print("[✓] CUDA fault library active - pods will crash naturally")
...
-        print(f"\n[→] Checking if node-drainer has started drain process...")
+        print("\n[→] Checking if node-drainer has started drain process...")
...
-            print(f"[✓] node-drainer is draining the node (AllowCompletion mode)")
+            print("[✓] node-drainer is draining the node (AllowCompletion mode)")
-            print(f"    Config: deleteAfterTimeoutMinutes=60 (would take 60 minutes)")
+            print("    Config: deleteAfterTimeoutMinutes=60 (would take 60 minutes)")
-            print(f"    Test optimization: We'll accelerate this for testing")
+            print("    Test optimization: We'll accelerate this for testing")
...
-            print(f"    Pods may already be gone or drain hasn't started")
+            print("    Pods may already be gone or drain hasn't started")
Unused helpers / constants: NVSentinelMonitor.wait_for_drain(), NVSentinelMonitor.wait_for_uncordon(), and the DRAIN_TIMEOUT constant are currently unused, while the test explicitly says “Auto-uncordon not tested”. Either wire these into the flow (e.g., if you later want to assert auto‑uncordon) or drop them to keep the test focused.

Status logging loop: the periodic log condition
if elapsed - (last_status_time - start_time) >= 30:
is equivalent to time.time() - last_status_time >= 30 but less direct. Consider simplifying for readability if you touch this section again.

Functionally the test logic and timeouts look reasonable for a long‑running E2E test; the main blocker is still the helper imports/CI setup mentioned earlier.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d76583 and 77d1450.

📒 Files selected for processing (1)

tests/fault_tolerance/hardware/fault-injection-service/examples/test_xid79_nvsentinel_automated.py (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3193
File: tests/fault_tolerance/cancellation/test_trtllm.py:4-204
Timestamp: 2025-09-25T00:54:01.369Z
Learning: The fault tolerance tests in tests/fault_tolerance/cancellation/ run in a controlled container environment where files written to /workspace are automatically cleaned up after test completion, and tests execute sequentially without concurrency concerns, so temporary file management for config files is not necessary.

🪛 GitHub Actions: NVIDIA Dynamo Github Validation

tests/fault_tolerance/hardware/fault-injection-service/examples/test_xid79_nvsentinel_automated.py

[error] 34-34: Pytest collection failed: ModuleNotFoundError: No module named 'cuda_fault_injection'. Command: 'pytest --basetemp=/tmp --junitxml=pytest_test_report.xml -m "pre_merge or mypy"'

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4046/merge) by nv-oviya.

tests/fault_tolerance/hardware/fault-injection-service/examples/test_xid79_nvsentinel_automated.py

[error] 1-1: pre-commit run --show-diff-on-failure --color=always --all-files: Black formatting hook reformatted the file (test_xid79_nvsentinel_automated.py).

[error] 1-1: pre-commit run --show-diff-on-failure --color=always --all-files: Ruff hook fixed 10 issues in the repository (including this file).

🪛 Ruff (0.14.4)

tests/fault_tolerance/hardware/fault-injection-service/examples/test_xid79_nvsentinel_automated.py

96-96: Do not catch blind exception: Exception

(BLE001)

218-218: Consider moving this statement to an else block

(TRY300)

219-219: Do not catch blind exception: Exception

(BLE001)

257-257: Do not catch blind exception: Exception

(BLE001)

271-271: Do not catch blind exception: Exception

(BLE001)

284-284: f-string without any placeholders

Remove extraneous f prefix

(F541)

286-286: f-string without any placeholders

Remove extraneous f prefix

(F541)

304-304: f-string without any placeholders

Remove extraneous f prefix

(F541)

306-306: f-string without any placeholders

Remove extraneous f prefix

(F541)

308-308: Do not catch blind exception: Exception

(BLE001)

323-323: String contains ambiguous ℹ (INFORMATION SOURCE). Did you mean i (LATIN SMALL LETTER I)?

(RUF001)

324-324: Do not catch blind exception: Exception

(BLE001)

329-329: Do not catch blind exception: Exception

(BLE001)

489-489: f-string without any placeholders

Remove extraneous f prefix

(F541)

519-519: f-string without any placeholders

Remove extraneous f prefix

(F541)

525-525: f-string without any placeholders

Remove extraneous f prefix

(F541)

526-526: f-string without any placeholders

Remove extraneous f prefix

(F541)

527-527: f-string without any placeholders

Remove extraneous f prefix

(F541)

530-530: f-string without any placeholders

Remove extraneous f prefix

(F541)

coderabbitai · 2025-11-14T23:43:34Z

...fault_tolerance/hardware/fault-injection-service/examples/test_xid79_nvsentinel_automated.py

+import os
+import sys
+import time
+from pathlib import Path
+
+import pytest
+import requests
+from kubernetes import client, config
+
+# Add helpers to path
+sys.path.insert(0, str(Path(__file__).parent.parent / "helpers"))
+
+from cuda_fault_injection import CUDAFaultInjector
+from inference_testing import InferenceLoadTester
+from k8s_operations import NodeOperations
+


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Fix helper module imports to avoid ModuleNotFoundError in CI

Pytest collection is currently failing with ModuleNotFoundError: No module named 'cuda_fault_injection', even before markers are applied. Since this import happens at module import time, it breaks pytest -m "pre_merge or mypy" runs.

Either ensure the helper modules are present in the repo on this branch (preferred if the dependent PRs are already merged), or guard the imports so missing helpers cause a skip instead of a hard failure. For example:

-import os -import sys -import time -from pathlib import Path - -import pytest -import requests -from kubernetes import client, config - -# Add helpers to path -sys.path.insert(0, str(Path(__file__).parent.parent / "helpers")) - -from cuda_fault_injection import CUDAFaultInjector -from inference_testing import InferenceLoadTester -from k8s_operations import NodeOperations +import os +import sys +import time +from pathlib import Path + +import pytest +import requests +from kubernetes import client, config + +# Add helpers to path +sys.path.insert(0, str(Path(__file__).parent.parent / "helpers")) + +# Import helper modules lazily; if they are missing, skip this E2E test module +cuda_fault_injection = pytest.importorskip( + "cuda_fault_injection", + reason="NVSentinel XID-79 E2E test requires cuda_fault_injection helper module.", +) +inference_testing = pytest.importorskip( + "inference_testing", + reason="NVSentinel XID-79 E2E test requires inference_testing helper module.", +) +k8s_operations = pytest.importorskip( + "k8s_operations", + reason="NVSentinel XID-79 E2E test requires k8s_operations helper module.", +) + +CUDAFaultInjector = cuda_fault_injection.CUDAFaultInjector +InferenceLoadTester = inference_testing.InferenceLoadTester +NodeOperations = k8s_operations.NodeOperations

Also remember to re-run pre-commit run --all-files so Black/Ruff changes are committed; current CI shows the hooks are reformatting this file.

🏁 Script executed:

#!/bin/bash # Check if helpers directory exists and list its contents echo "=== Checking helpers directory ===" find . -type d -name "helpers" | head -20 echo "" echo "=== Contents of helpers directories ===" find . -type d -name "helpers" -exec ls -la {} \; echo "" echo "=== Searching for cuda_fault_injection, inference_testing, k8s_operations files ===" fd -e py | xargs grep -l "cuda_fault_injection\|inference_testing\|k8s_operations" | head -30 echo "" echo "=== Checking if helpers are defined anywhere ===" fd -e py | xargs grep -l "class CUDAFaultInjector\|class InferenceLoadTester\|class NodeOperations" | head -30

Length of output: 309

🏁 Script executed:

#!/bin/bash # Examine the test file structure echo "=== Test file location and structure ===" cat tests/fault_tolerance/hardware/fault-injection-service/examples/test_xid79_nvsentinel_automated.py | head -50 echo "" echo "=== Check for pytest.importorskip usage in repo ===" rg "pytest\.importorskip|importorskip" --type py -A 2 -B 2 | head -40

Length of output: 1566

🏁 Script executed:

#!/bin/bash # Verify if the __init__.py files exist in the helpers path echo "=== Checking __init__.py files ===" find tests/fault_tolerance/hardware/fault-injection-service -name "__init__.py" -exec ls -la {} \; echo "" echo "=== Full directory structure of fault-injection-service ===" find tests/fault_tolerance/hardware/fault-injection-service -type f -name "*.py" | sort

Length of output: 194

Based on the shell script results, I can now confirm the findings:

The review comment is factually accurate. The verification reveals:

No helpers directory exists in the repository structure

None of the three imported modules are defined anywhere in the codebase (cuda_fault_injection, inference_testing, k8s_operations)

The classes are not defined (CUDAFaultInjector, InferenceLoadTester, NodeOperations)

The test file is isolated with no supporting infrastructure present

pytest.importorskip() is not currently used elsewhere in the repo, but it is the standard pytest pattern for optional test dependencies

The ModuleNotFoundError will indeed block pytest collection, preventing pytest -m "pre_merge or mypy" from running. The proposed solution using pytest.importorskip() is the correct approach for handling missing optional test dependencies—it will gracefully skip the entire test module rather than fail hard during collection.

Fix helper module imports to avoid ModuleNotFoundError in CI

Pytest collection is currently failing with ModuleNotFoundError: No module named 'cuda_fault_injection', blocking marker-based test runs. Since the helper modules do not exist in this branch (at tests/fault_tolerance/hardware/fault-injection-service/helpers/), guard the imports with pytest.importorskip() to skip the test module instead of failing collection:

-import os -import sys -import time -from pathlib import Path - -import pytest -import requests -from kubernetes import client, config - -# Add helpers to path -sys.path.insert(0, str(Path(__file__).parent.parent / "helpers")) - -from cuda_fault_injection import CUDAFaultInjector -from inference_testing import InferenceLoadTester -from k8s_operations import NodeOperations +import os +import sys +import time +from pathlib import Path + +import pytest +import requests +from kubernetes import client, config + +# Add helpers to path +sys.path.insert(0, str(Path(__file__).parent.parent / "helpers")) + +# Import helper modules lazily; if missing, skip this E2E test module +cuda_fault_injection = pytest.importorskip( + "cuda_fault_injection", + reason="NVSentinel XID-79 E2E test requires cuda_fault_injection helper module.", +) +inference_testing = pytest.importorskip( + "inference_testing", + reason="NVSentinel XID-79 E2E test requires inference_testing helper module.", +) +k8s_operations = pytest.importorskip( + "k8s_operations", + reason="NVSentinel XID-79 E2E test requires k8s_operations helper module.", +) + +CUDAFaultInjector = cuda_fault_injection.CUDAFaultInjector +InferenceLoadTester = inference_testing.InferenceLoadTester +NodeOperations = k8s_operations.NodeOperations

Also run pre-commit run --all-files to resolve remaining Black/Ruff formatting violations.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

import os

import sys

import time

from pathlib import Path

import pytest

import requests

from kubernetes import client, config

# Add helpers to path

sys.path.insert(0, str(Path(__file__).parent.parent / "helpers"))

from cuda_fault_injection import CUDAFaultInjector

from inference_testing import InferenceLoadTester

from k8s_operations import NodeOperations

import os

import sys

import time

from pathlib import Path

import pytest

import requests

from kubernetes import client, config

# Add helpers to path

sys.path.insert(0, str(Path(__file__).parent.parent / "helpers"))

# Import helper modules lazily; if missing, skip this E2E test module

cuda_fault_injection = pytest.importorskip(

"cuda_fault_injection",

reason="NVSentinel XID-79 E2E test requires cuda_fault_injection helper module.",

)

inference_testing = pytest.importorskip(

"inference_testing",

reason="NVSentinel XID-79 E2E test requires inference_testing helper module.",

)

k8s_operations = pytest.importorskip(

"k8s_operations",

reason="NVSentinel XID-79 E2E test requires k8s_operations helper module.",

)

CUDAFaultInjector = cuda_fault_injection.CUDAFaultInjector

InferenceLoadTester = inference_testing.InferenceLoadTester

NodeOperations = k8s_operations.NodeOperations

🧰 Tools

🪛 GitHub Actions: NVIDIA Dynamo Github Validation

[error] 34-34: Pytest collection failed: ModuleNotFoundError: No module named 'cuda_fault_injection'. Command: 'pytest --basetemp=/tmp --junitxml=pytest_test_report.xml -m "pre_merge or mypy"'

🤖 Prompt for AI Agents

In tests/fault_tolerance/hardware/fault-injection-service/examples/test_xid79_nvsentinel_automated.py around lines 22-37, imports from a non-existent helpers package (cuda_fault_injection, inference_testing, k8s_operations) cause ModuleNotFoundError during pytest collection; replace direct imports with guards using pytest.importorskip() for each optional helper module (e.g., call pytest.importorskip("cuda_fault_injection") etc.) so the whole test module is skipped when helpers are missing, and then run pre-commit (pre-commit run --all-files) to fix any Black/Ruff style issues.

nnshah1 · 2025-11-24T23:57:09Z

...fault_tolerance/hardware/fault-injection-service/examples/test_xid79_nvsentinel_automated.py

+
+# Test configuration
+TARGET_DEPLOYMENT = os.getenv("TARGET_DEPLOYMENT", "vllm-v1-disagg-router")
+NAMESPACE = "dynamo-oviya"


Suggested change

NAMESPACE = "dynamo-oviya"

NAMESPACE = "dynamo-test"

nnshah1

One thing - can we change the name of the test file so it has to be run explicitly and not auto collected by pytest?

…collect it Signed-off-by: Oviya Seeniraj <[email protected]>

Signed-off-by: Oviya Seeniraj <[email protected]>

pull-request-size bot added the size/XL label Nov 2, 2025

github-actions bot added the test label Nov 2, 2025

Fixed copyright header, unused variable, unused import. Import error …

77d1450

…for helper module is from previous PR that this builds on

pvijayakrish reviewed Nov 14, 2025

View reviewed changes

nv-oviya marked this pull request as ready for review November 14, 2025 23:36

nv-oviya requested review from a team as code owners November 14, 2025 23:36

coderabbitai bot reviewed Nov 14, 2025

View reviewed changes

nnshah1 reviewed Nov 24, 2025

View reviewed changes

nnshah1 approved these changes Nov 24, 2025

View reviewed changes

renamed from test_ to manual_ prefix so pytest doesn't automatically …

a762ca7

…collect it Signed-off-by: Oviya Seeniraj <[email protected]>

pull-request-size bot added size/XS and removed size/XL labels Nov 25, 2025

changed namespace

a3d180a

Signed-off-by: Oviya Seeniraj <[email protected]>

pull-request-size bot added size/XL and removed size/XS labels Nov 25, 2025

This was referenced Dec 2, 2025

refactor(feat-injection): create PyTest fixtures to replace boilerplate in E2E HW FT tests #4690

Draft

test(feat-injection): add 200x minimized XID 79 PyTest harnessing fixtures #4694

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test(fault-injection): Add XID 79 NVSentinel E2E test #4046

test(fault-injection): Add XID 79 NVSentinel E2E test #4046

Uh oh!

nv-oviya commented Nov 2, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

copy-pr-bot bot commented Nov 2, 2025

Uh oh!

pvijayakrish left a comment

Uh oh!

coderabbitai bot commented Nov 14, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Nov 14, 2025

Uh oh!

nnshah1 Nov 24, 2025

Uh oh!

nnshah1 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

test(fault-injection): Add XID 79 NVSentinel E2E test #4046

Are you sure you want to change the base?

test(fault-injection): Add XID 79 NVSentinel E2E test #4046

Uh oh!

Conversation

nv-oviya commented Nov 2, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues:

Summary by CodeRabbit

Tests

Uh oh!

copy-pr-bot bot commented Nov 2, 2025

Uh oh!

pvijayakrish left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot commented Nov 14, 2025

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 14, 2025

Choose a reason for hiding this comment

Uh oh!

nnshah1 Nov 24, 2025

Choose a reason for hiding this comment

Uh oh!

nnshah1 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

nv-oviya commented Nov 2, 2025 •

edited by coderabbitai bot

Loading