@nv-oviya nv-oviya commented Dec 2, 2025

Overview:

Introduces a comprehensive pytest fixture system that reduces fault tolerance E2E tests from 600+ lines to 2-3 lines, automating test orchestration, cleanup, and phase management.

My previous E2E test, which exercised the fault injection service, still contained hundreds of lines of boilerplate: deploying CUDA libraries, starting inference load, injecting faults, monitoring NVSentinel responses, verifying recovery, and cleaning up. This PR abstracts all of that orchestration into reusable fixtures, so a complete test becomes, e.g.:

def test_xid79_cordon_drain_only(xid79_test, expect_cordon_and_drain):
    """XID 79 with cordon + drain (no auto-remediation)."""
    xid79_test(gpu_id=0, expect=expect_cordon_and_drain)
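
To illustrate what a declarative expectation such as expect_cordon_and_drain might carry, here is a minimal sketch. The `Expectation` class, its field names, and the `mismatches` helper are hypothetical stand-ins for illustration, not the actual ResponseExpectation API in fault_test_fixtures.py.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Expectation:
    """Hypothetical stand-in: which NVSentinel actions a test expects to see."""

    cordon: bool = False
    drain: bool = False
    remediate: bool = False
    uncordon: bool = False

    def mismatches(self, observed: dict[str, bool]) -> list[str]:
        """Return the actions whose observed state differs from the expectation.

        Actions missing from `observed` are treated as "did not happen".
        """
        return [
            f.name
            for f in fields(self)
            if observed.get(f.name, False) != getattr(self, f.name)
        ]


# Assumed equivalent of the expect_cordon_and_drain fixture value.
EXPECT_CORDON_AND_DRAIN = Expectation(cordon=True, drain=True)

# Illustrative observation: the node was also remediated, which this
# expectation does not allow.
observed = {"cordon": True, "drain": True, "remediate": True}
print(EXPECT_CORDON_AND_DRAIN.mismatches(observed))  # ['remediate']
```

Keeping the expectation as plain data means the test body only states the desired outcome, while the comparison logic stays in one place.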


Details:

  1. fault_test_fixtures.py - Test orchestration framework
  • Core orchestration methods:
    • _phase_prerequisites(): Validates environment, selects target node, deploys CUDA library in passthrough mode
    • _phase_inject_fault(): Injects fault and enables CUDA interception via toggle files
    • _phase_monitor_response(): Monitors NVSentinel cordon/drain/remediate/uncordon actions
    • _phase_verify_recovery(): Validates pod and inference recovery
    • _cleanup(): Ensures cleanup runs even on test failure or Ctrl+C (removes faults, uncordons nodes, removes annotations)
    • _print_latency_comparison_table(): Displays phase-by-phase latency impact analysis
      • Compares: Baseline (healthy) → During fault (degraded) → After recovery (restored)
      • Shows: Success rate + latency percentiles (p50, p95, p99)
      • Critical for disaggregated deployments where healthy pods continue serving (high success rate but increased latency)
  • Helper classes:
    • TestConfig: Environment configuration with auto-detection from env vars
    • FaultSpec: Abstract base for fault types (XID79Fault, XID74Fault, NetworkPartitionFault)
    • ResponseExpectation: Declarative expectations for NVSentinel behavior (cordon, drain, remediate, uncordon)
    • NodeOperations: Kubernetes node operations (cordon status, drain monitoring, uncordon)
  • Integrates with existing helpers:
    • CUDAFaultInjector (cuda_fault_injection.py)
    • InferenceLoadTester (inference_testing.py)
    • Kubernetes utilities (k8s_operations.py)
  2. conftest.py - Pytest configuration
  • Purpose: Import and expose fixtures to all tests in the examples/ directory
  • Fixtures provided:
    • Core: test_config, default_deployment, default_namespace, fault_test
    • XID Tests: xid79_test, xid74_test, xid79_with_custom_validation
    • Network Tests: network_partition_test
    • Expectations: expect_full_automation, expect_cordon_and_drain, expect_cordon_only
    • Environment: ensure_clean_test_environment, skip_if_no_nvsentinel, skip_if_insufficient_gpus
  • Path management: Adds helpers/ to sys.path to ensure imports work
  3. pytest.ini - Pytest settings
  • Configuration:
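    • norecursedirs: Excludes directories from test discovery; together with the rootdir that this pytest.ini establishes, this keeps parent-directory conftest.py files from being picked up (isolates these tests)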
    • Markers: xid79, xid74, xid48, xid94, xid95, xid43, nvsentinel, slow
    • Logging: CLI logging enabled at INFO level with clean format
    • Warnings: Filters deprecation warnings for cleaner output
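
The phase-by-phase latency comparison described for _print_latency_comparison_table() can be sketched as follows. This is a minimal illustration rather than the PR's implementation: the function names, the nearest-rank percentile method, and the sample numbers are all assumptions.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(latencies_ms, total_requests):
    """Success rate plus p50/p95/p99 over the successful requests."""
    return {
        "success": len(latencies_ms) / total_requests,
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


def format_table(phases):
    """Render one row per phase; `phases` maps a phase name to its summary."""
    lines = [f"{'Phase':<16}{'Success':>9}{'p50':>8}{'p95':>8}{'p99':>8}"]
    for name, s in phases.items():
        lines.append(
            f"{name:<16}{s['success']:>9.1%}{s['p50']:>8}{s['p95']:>8}{s['p99']:>8}"
        )
    return "\n".join(lines)


# Made-up latencies (ms) of successful requests, for illustration only.
phases = {
    "Baseline": summarize([100, 105, 110, 120, 130], total_requests=5),
    "During fault": summarize([180, 220, 260, 400], total_requests=5),
    "After recovery": summarize([102, 108, 112, 125, 131], total_requests=5),
}
print(format_table(phases))
```

Reporting percentiles alongside the success rate is what surfaces the disaggregated case mentioned above, where requests keep succeeding on healthy pods but take longer.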

Where should the reviewer start?

  • Start with conftest.py and pytest.ini:
    • These are simple boilerplate files that import fixtures and configure pytest
    • They show which fixtures are available and how they're organized
  • Review fault_test_fixtures.py top-down:
    • Lines 1-80: Docstring, imports, TestConfig (environment auto-detection)
    • Lines 86-150: FaultSpec abstract class and implementations (XIDFaultSpec, NetworkPartitionFaultSpec)
    • Lines 152-290: ResponseExpectation class (declarative expectations for NVSentinel behavior)
    • Lines 297-900: FaultToleranceTest orchestrator class:
      • __init__(): Setup
      • run(): Main entry point (calls phases in sequence)
      • _phase_prerequisites(): Phase 0
      • _phase_inject_fault(): Phase 1
      • _phase_monitor_response(): Phase 2
      • _phase_verify_recovery(): Phase 3
      • _cleanup(): Always-run cleanup
    • Lines 900+: Pytest fixtures that wrap the orchestrator
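
The overall shape of the orchestrator (phases run in sequence, cleanup always runs) can be sketched roughly as below. `PhasedTest` and the phase callables are hypothetical and much simpler than FaultToleranceTest; the sketch only shows the try/finally control flow, assuming each phase raises on failure.

```python
from __future__ import annotations

from typing import Callable


class PhasedTest:
    """Hypothetical orchestrator: run named phases in order, always clean up."""

    def __init__(
        self,
        phases: list[tuple[str, Callable[[], None]]],
        cleanup: Callable[[], None],
    ) -> None:
        self.phases = phases
        self.cleanup = cleanup
        self.completed: list[str] = []

    def run(self) -> None:
        try:
            for name, phase in self.phases:
                phase()  # a phase signals failure by raising
                self.completed.append(name)
        finally:
            # Runs on success, on assertion failure, and on
            # KeyboardInterrupt (Ctrl+C) alike.
            self.cleanup()


events: list[str] = []


def failing_monitor() -> None:
    raise AssertionError("node was not cordoned")


test = PhasedTest(
    phases=[
        ("prerequisites", lambda: events.append("prerequisites")),
        ("inject_fault", lambda: events.append("inject_fault")),
        ("monitor_response", failing_monitor),
        ("verify_recovery", lambda: events.append("verify_recovery")),
    ],
    cleanup=lambda: events.append("cleanup"),
)

try:
    test.run()
except AssertionError:
    pass  # the failure still propagates to the caller after cleanup

print(events)  # ['prerequisites', 'inject_fault', 'cleanup']
```

A pytest fixture could then build such an object and return a callable that invokes run(), which is one plausible way a test body shrinks to a single call.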

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

copy-pr-bot bot commented Dec 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nv-oviya nv-oviya changed the title add pytest fixtures for fault tolerance test orchestration with phase… refactor(feat-injection): create PyTest fixtures to replace boilerplate in E2E HW FT tests Dec 2, 2025
@nv-oviya nv-oviya force-pushed the oviya/fault-injection/test-fixtures branch from 7394380 to 63e829c Compare December 2, 2025 03:46
Signed-off-by: Oviya Seeniraj <[email protected]>
@nv-oviya nv-oviya force-pushed the oviya/fault-injection/test-fixtures branch from 36249b1 to 676dfb6 Compare December 3, 2025 22:29