refactor(feat-injection): create PyTest fixtures to replace boilerplate in E2E HW FT tests #4690
Overview:
Introduces a pytest fixture system that reduces fault-tolerance E2E tests from 600+ lines to 2-3 lines by automating test orchestration, cleanup, and phase management.
My previous E2E test, which exercised the fault injection service, still contained hundreds of lines of boilerplate: deploying CUDA libraries, starting inference load, injecting faults, monitoring NVSentinel responses, verifying recovery, and cleaning up. This PR moves all of that orchestration into reusable fixtures, which makes tests easier to write, e.g.:
    def test_xid79_cordon_drain_only(xid79_test, expect_cordon_and_drain):
        """XID 79 with cordon + drain (no auto-remediation)."""
        xid79_test(gpu_id=0, expect=expect_cordon_and_drain)

Details:
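As a rough illustration of the pattern (not the implementation in this PR), a fixture such as `xid79_test` could be a factory that yields a callable and performs cleanup in its teardown. Only the names `xid79_test`, `expect_cordon_and_drain`, and the `gpu_id`/`expect` arguments come from the example above. `Expectation`, `FaultTestHarness`, and the phase names are hypothetical, and the harness only records phases instead of touching a cluster.

```python
from dataclasses import dataclass, field
from typing import List

import pytest


@dataclass
class Expectation:
    """Hypothetical description of the remediation a test expects."""
    cordon: bool = True
    drain: bool = True
    auto_remediate: bool = False


@dataclass
class FaultTestHarness:
    """Hypothetical harness: records phases rather than driving real hardware."""
    phases: List[str] = field(default_factory=list)

    def run(self, xid: int, gpu_id: int, expect: Expectation) -> None:
        # Each append stands in for a real orchestration step.
        self.phases.append("deploy")
        self.phases.append("start_load")
        self.phases.append(f"inject_xid{xid}_gpu{gpu_id}")
        if expect.cordon:
            self.phases.append("verify_cordon")
        if expect.drain:
            self.phases.append("verify_drain")
        if expect.auto_remediate:
            self.phases.append("verify_remediation")

    def cleanup(self) -> None:
        self.phases.append("cleanup")


@pytest.fixture
def expect_cordon_and_drain() -> Expectation:
    # Cordon and drain are expected; auto-remediation is not.
    return Expectation(cordon=True, drain=True, auto_remediate=False)


@pytest.fixture
def xid79_test():
    harness = FaultTestHarness()

    def _run(gpu_id: int, expect: Expectation) -> FaultTestHarness:
        harness.run(xid=79, gpu_id=gpu_id, expect=expect)
        return harness

    # The test body receives the callable and invokes it once.
    yield _run
    # Code after the yield is teardown; pytest runs it even if the test fails.
    harness.cleanup()
```

Keeping setup and teardown inside the fixture is what would let a test body shrink to a single call, since pytest runs the code after `yield` whether the test passes or fails.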
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)