refactor(feat-injection): create PyTest fixtures to replace boilerplate in E2E HW FT tests #4690

nv-oviya · 2025-12-02T03:19:42Z

Overview:

Introduces a comprehensive pytest fixture system that reduces fault tolerance E2E tests from 600+ lines to 2-3 lines, automating test orchestration, cleanup, and phase management.

My previous E2E test for the fault injection service still contained hundreds of lines of boilerplate: deploying CUDA libraries, starting inference load, injecting faults, monitoring NVSentinel responses, verifying recovery, and cleaning up. This PR abstracts all of that orchestration into reusable fixtures, so a complete test can be written as, e.g.:

    def test_xid79_cordon_drain_only(xid79_test, expect_cordon_and_drain):
        """XID 79 with cordon + drain (no auto-remediation)."""
        xid79_test(gpu_id=0, expect=expect_cordon_and_drain)

Details:

fault_test_fixtures.py - Test orchestration framework

Core orchestration methods:
- _phase_prerequisites(): Validates environment, selects target node, deploys CUDA library in passthrough mode
- _phase_inject_fault(): Injects fault and enables CUDA interception via toggle files
- _phase_monitor_response(): Monitors NVSentinel cordon/drain/remediate/uncordon actions
- _phase_verify_recovery(): Validates pod and inference recovery
- _cleanup(): Ensures cleanup runs even on test failure or Ctrl+C (removes faults, uncordons nodes, removes annotations)
- _print_latency_comparison_table(): Displays phase-by-phase latency impact analysis
  - Compares: Baseline (healthy) → During fault (degraded) → After recovery (restored)
  - Shows: Success rate + latency percentiles (p50, p95, p99)
  - Critical for disaggregated deployments where healthy pods continue serving (high success rate but increased latency)
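To illustrate the kind of comparison described above, here is a minimal sketch of a per-phase summary, assuming each phase yields a list of successful request latencies in milliseconds plus a failure count. The function names `summarize_phase` and `format_latency_table` are hypothetical and are not taken from this PR; the actual `_print_latency_comparison_table()` may compute and lay out its numbers differently.

```python
from statistics import quantiles


def summarize_phase(latencies_ms, failures=0):
    """Return success rate (%) and p50/p95/p99 for one phase (hypothetical helper)."""
    total = len(latencies_ms) + failures
    stats = {"success_rate": 0.0, "p50": None, "p95": None, "p99": None}
    if total == 0:
        return stats
    stats["success_rate"] = 100.0 * len(latencies_ms) / total
    if len(latencies_ms) >= 2:
        # n=100 yields 99 cut points; index k-1 is the k-th percentile.
        cuts = quantiles(latencies_ms, n=100, method="inclusive")
        stats.update(p50=cuts[49], p95=cuts[94], p99=cuts[98])
    elif len(latencies_ms) == 1:
        # A single sample is its own percentile at every level.
        only = float(latencies_ms[0])
        stats.update(p50=only, p95=only, p99=only)
    return stats


def format_latency_table(phases):
    """Render phase name -> stats as a fixed-width text table."""
    lines = [f"{'Phase':<16}{'Success %':>10}{'p50 ms':>10}{'p95 ms':>10}{'p99 ms':>10}"]
    for name, s in phases.items():
        cells = [
            "n/a" if s[key] is None else f"{s[key]:.1f}"
            for key in ("success_rate", "p50", "p95", "p99")
        ]
        lines.append(f"{name:<16}" + "".join(f"{c:>10}" for c in cells))
    return "\n".join(lines)
```

With data like this, a disaggregated deployment would typically show a success rate that stays high during the fault phase while the p95 and p99 columns rise, which is the signal the table is meant to surface.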
Helper classes:
- TestConfig: Environment configuration with auto-detection from env vars
- FaultSpec: Abstract base for fault types (XID79Fault, XID74Fault, NetworkPartitionFault)
- ResponseExpectation: Declarative expectations for NVSentinel behavior (cordon, drain, remediate, uncordon)
- NodeOperations: Kubernetes node operations (cordon status, drain monitoring, uncordon)
Integrates with existing helpers:
- CUDAFaultInjector (cuda_fault_injection.py)
- InferenceLoadTester (inference_testing.py)
- Kubernetes utilities (k8s_operations.py)
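The phase structure above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: the class and method names come from the description, but the phase bodies are placeholders that only record their name, and the `events` list and constructor signature are hypothetical. The point is the `try`/`finally` shape, which is one common way to guarantee that cleanup runs on test failure and on Ctrl+C (`KeyboardInterrupt`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseExpectation:
    """Declarative flags for the NVSentinel actions a test expects."""
    cordon: bool = True
    drain: bool = False
    remediate: bool = False
    uncordon: bool = False


class FaultToleranceTest:
    """Runs the phases in order; cleanup always runs last."""

    def __init__(self, expectation):
        self.expectation = expectation
        self.events = []  # hypothetical: records which phases ran

    # Placeholder phase bodies; the real ones talk to Kubernetes and NVSentinel.
    def _phase_prerequisites(self):
        self.events.append("prerequisites")

    def _phase_inject_fault(self):
        self.events.append("inject_fault")

    def _phase_monitor_response(self):
        self.events.append("monitor_response")

    def _phase_verify_recovery(self):
        self.events.append("verify_recovery")

    def _cleanup(self):
        self.events.append("cleanup")

    def run(self):
        try:
            self._phase_prerequisites()      # Phase 0
            self._phase_inject_fault()       # Phase 1
            self._phase_monitor_response()   # Phase 2
            self._phase_verify_recovery()    # Phase 3
        finally:
            # Runs after success, after any exception, and after KeyboardInterrupt.
            self._cleanup()
```

Keeping the expectation as a frozen dataclass means a test can pass a named preset such as `ResponseExpectation(drain=True)` without risk of one test mutating another test's expectations.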

conftest.py - Pytest configuration

Purpose: Import and expose fixtures to all tests in the examples/ directory
Fixtures provided:
- Core: test_config, default_deployment, default_namespace, fault_test
- XID Tests: xid79_test, xid74_test, xid79_with_custom_validation
- Network Tests: network_partition_test
- Expectations: expect_full_automation, expect_cordon_and_drain, expect_cordon_only
- Environment: ensure_clean_test_environment, skip_if_no_nvsentinel, skip_if_insufficient_gpus
Path management: Adds helpers/ to sys.path to ensure imports work
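As a rough sketch of how such a conftest.py could be laid out, the example below shows the path handling and a factory-style fixture. It assumes `fault_test` is a fixture that yields a callable accepting `fault` and `expect` keyword arguments; that signature, the dictionary shapes, and the helper names `ensure_on_sys_path` and `make_xid79_runner` are hypothetical and chosen only for illustration.

```python
import sys
from pathlib import Path

import pytest


def ensure_on_sys_path(directory: Path) -> str:
    """Prepend a directory to sys.path once so helper modules can be imported."""
    resolved = str(directory.resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# In a real conftest.py this would be called at import time, for example:
#   ensure_on_sys_path(Path(__file__).parent / "helpers")


def make_xid79_runner(fault_test):
    """Build the callable that a test invokes as xid79_test(gpu_id=..., expect=...)."""
    def run(gpu_id=0, expect=None):
        # Assumption: fault_test accepts a fault description and an expectation.
        return fault_test(fault={"xid": 79, "gpu_id": gpu_id}, expect=expect)
    return run


@pytest.fixture
def expect_cordon_and_drain():
    """Expectation preset: cordon and drain, no remediation or uncordon."""
    return {"cordon": True, "drain": True, "remediate": False, "uncordon": False}


@pytest.fixture
def xid79_test(fault_test):
    """Factory fixture; fault_test is assumed to be provided elsewhere."""
    return make_xid79_runner(fault_test)
```

Returning a callable from the fixture, rather than running the test inside it, is what lets each test body stay at one line while still choosing its own GPU and expectation.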

pytest.ini - Pytest settings

Configuration:
- norecursedirs: Prevents searching parent directories for conftest.py (isolates these tests)
- Markers: xid79, xid74, xid48, xid94, xid95, xid43, nvsentinel, slow
- Logging: CLI logging enabled at INFO level with clean format
- Warnings: Filters deprecation warnings for cleaner output

Where should the reviewer start?

Start with conftest.py and pytest.ini:
- Simple boilerplate files that import fixtures and configure pytest
- Shows what fixtures are available and how they're organized
Review fault_test_fixtures.py top-down:
- Lines 1-80: Docstring, imports, TestConfig (environment auto-detection)
- Lines 86-150: FaultSpec abstract class and implementations (XIDFaultSpec, NetworkPartitionFaultSpec)
- Lines 152-290: ResponseExpectation class (declarative expectations for NVSentinel behavior)
- Lines 297-900: FaultToleranceTest orchestrator class:
  - __init__(): Setup
  - run(): Main entry point (calls phases in sequence)
  - _phase_prerequisites(): Phase 0
  - _phase_inject_fault(): Phase 1
  - _phase_monitor_response(): Phase 2
  - _phase_verify_recovery(): Phase 3
  - _cleanup(): Always-run cleanup
- Lines 900+: Pytest fixtures that wrap the orchestrator

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Depends on feat(fault-injection): Enable runtime CUDA fault injection toggling without pod restarts #4679 and feat(fault-injection): Add latency percentile metrics and per-phase statistics tracking #4692. Relates to the ongoing HW FT initiative.
Simplifies test(fault-injection): Add XID 79 NVSentinel E2E test #4046.
Existing helper modules: cuda_fault_injection.py, inference_testing.py, k8s_operations.py

copy-pr-bot · 2025-12-02T03:19:45Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

pull-request-size bot added the size/XXL label Dec 2, 2025

nv-oviya changed the title ~~add pytest fixtures for fault tolerance test orchestration with phase…~~ refactor(feat-injection): create PyTest fixtures to replace boilerplate in E2E HW FT tests Dec 2, 2025

github-actions bot added the refactor label Dec 2, 2025

nv-oviya force-pushed the oviya/fault-injection/test-fixtures branch from 7394380 to 63e829c Compare December 2, 2025 03:46

This was referenced Dec 2, 2025

feat(fault-injection): Add latency percentile metrics and per-phase statistics tracking #4692

Draft

test(feat-injection): add 200x minimized XID 79 PyTest harnessing fixtures #4694

Draft

moved files to _ dir from old -

676dfb6

Signed-off-by: Oviya Seeniraj <[email protected]>

nv-oviya force-pushed the oviya/fault-injection/test-fixtures branch from 36249b1 to 676dfb6 Compare December 3, 2025 22:29

kubeconfig unavailable, wrapping in try/except for graceful failure t…

7b3a6b9

…o pass ci Signed-off-by: Oviya Seeniraj <[email protected]>
