Fixed copyright header, unused variable, unused import. Import error for helper module is from previous PR that this builds on
nv-oviya committed Nov 3, 2025
commit 77d14501b470a85cc5e14e5e707980455243f78f
@@ -1,11 +1,16 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# SPDX-License-Identifier: Apache-2.0
#

"""
XID 79 E2E Test - Fully Automated NVSentinel Workflow

This test validates the complete NVSentinel automated fault tolerance pipeline:
1. Inject XID 79 via API → syslog-health-monitor detects it
2. Inject CUDA faults → pods crash naturally (simulates real GPU failure)
3. fault-quarantine-module cordons the node automatically
4. node-drainer-module drains pods automatically
5. fault-remediation-module restarts GPU driver automatically (optional)
6. Node is uncordoned automatically
7. Pods reschedule and inference recovers
@@ -18,7 +23,6 @@
import sys
import time
from pathlib import Path
from typing import Optional

import pytest
import requests
@@ -126,9 +130,6 @@ def wait_for_drain(self, node_name: str, timeout: int) -> bool:
drain_annotations = {k: v for k, v in annotations.items()
if "drain" in k.lower() or "evict" in k.lower()}

# Check node status
status = self.get_node_quarantine_status(node_name)

if drain_annotations or any("NoExecute" in str(t.effect) for t in taints):
elapsed = time.time() - start_time
print(f"[✓] Node drain initiated by NVSentinel after {elapsed:.1f}s")
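For reference, the drain check in this hunk can be sketched as a standalone helper. This is a minimal illustration, not the code in this PR: it assumes annotations arrive as a plain dict and taint effects as plain strings rather than Kubernetes client objects, and the names `find_drain_annotations`, `has_no_execute_taint`, `drain_initiated`, and `poll_until` are hypothetical.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def find_drain_annotations(annotations: Optional[Dict[str, str]]) -> Dict[str, str]:
    """Return annotations whose key mentions 'drain' or 'evict' (case-insensitive)."""
    annotations = annotations or {}
    return {k: v for k, v in annotations.items()
            if "drain" in k.lower() or "evict" in k.lower()}


def has_no_execute_taint(taint_effects: Optional[Iterable[str]]) -> bool:
    """True if any taint effect contains 'NoExecute'.

    Assumption: effects are passed as plain strings, not client objects.
    """
    return any("NoExecute" in str(effect) for effect in (taint_effects or []))


def drain_initiated(annotations: Optional[Dict[str, str]],
                    taint_effects: Optional[Iterable[str]]) -> bool:
    """Same condition as the hunk: a drain annotation OR a NoExecute taint."""
    return bool(find_drain_annotations(annotations)) or has_no_execute_taint(taint_effects)


def poll_until(check: Callable[[], bool], timeout: float, interval: float = 1.0) -> bool:
    """Call check() until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller could wrap the node lookup in a closure and pass it to `poll_until`. Keeping the predicate separate from the polling loop makes the detection rule testable without a cluster.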