Conversation

@nv-oviya (Contributor) commented Nov 2, 2025

Overview:

This PR adds a Python client library providing a high-level API for fault injection. It handles GPU faults, network partitions, Kubernetes integration, and automatic cleanup via context managers.


Details:

  • Add client/fault_injection_client.py (798 lines): Main client library

Core components:

  • Enums: GPUFault, NetworkPartition, NetworkMode, FaultSeverity
  • Data classes: FaultInfo, Metrics
  • FaultInjectionClient: Main client with comprehensive API
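A hypothetical sketch of the enum/dataclass layer, to show the shape implied above. The type names (`NetworkMode`, `FaultSeverity`, `FaultInfo`) come from this PR description; the specific members and fields are illustrative guesses, not the actual implementation.

```python
from dataclasses import dataclass
from enum import Enum


class NetworkMode(Enum):
    NETWORKPOLICY = "networkpolicy"  # complete blocking
    CHAOS_MESH = "chaos_mesh"        # packet loss / delay


class FaultSeverity(Enum):
    LOW = "low"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class FaultInfo:
    fault_id: str
    fault_type: str
    status: str = "active"


info = FaultInfo(fault_id="f-123", fault_type="xid_error")
```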

GPU fault injection:

  • inject_gpu_fault(): Generic GPU fault injection
  • inject_xid_error(): XID-specific injection (79, 48, 94, 95, 43, 74)
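A minimal sketch of how XID-specific injection might validate its input. The supported XID codes are the ones listed above; the function body, signature, and return shape are assumptions rather than the real client code.

```python
# XID codes this client supports, per the PR description.
SUPPORTED_XIDS = {79, 48, 94, 95, 43, 74}


def inject_xid_error(xid: int, node: str) -> dict:
    """Validate the XID code, then build a request payload for the fault."""
    if xid not in SUPPORTED_XIDS:
        raise ValueError(f"unsupported XID code: {xid}")
    return {"fault_type": "xid_error", "xid": xid, "node": node}


payload = inject_xid_error(79, "gpu-node-0")
```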

Network fault injection:

  • inject_network_partition(): Main network fault method
  • Two modes: NETWORKPOLICY (complete blocking), CHAOS_MESH (packet loss, delay)
  • Predefined types: FRONTEND_WORKER, WORKER_NATS, WORKER_WORKER, CUSTOM
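The two modes above map naturally to different backend specs: a NetworkPolicy for full blocking, a Chaos Mesh NetworkChaos resource for partial failures. A sketch of that dispatch, with illustrative field names that are assumptions, not the client's real output:

```python
def build_partition_spec(mode: str, partition: str,
                         loss_percent=None) -> dict:
    """Return a backend-specific spec for the requested partition."""
    if mode == "networkpolicy":
        # NetworkPolicy: completely block traffic between the pod groups.
        return {"kind": "NetworkPolicy", "partition": partition}
    # Chaos Mesh: realistic partial failures (packet loss, delay).
    return {"kind": "NetworkChaos", "partition": partition,
            "loss_percent": loss_percent}
```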

Fault management:

  • get_fault_status(): Query current status
  • delete_fault(): Remove fault and cleanup
  • wait_for_recovery(): Wait for healthy state
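A `wait_for_recovery()`-style method is usually a polling loop: call a health check until it succeeds or a timeout expires. The real client's signature and health criteria are not shown in this PR, so the sketch below only illustrates the pattern, with a caller-supplied `check_healthy` callable standing in for the client's internal check.

```python
import time


def wait_for_recovery(check_healthy, timeout: float = 60.0,
                      interval: float = 1.0) -> bool:
    """Poll check_healthy() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check_healthy():
            return True
        time.sleep(interval)
    return False
```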

Context managers:

  • gpu_fault(): Auto-cleanup GPU faults
  • network_partition(): Auto-cleanup network faults
  • Ensures cleanup even on test failure/interruption
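The cleanup guarantee comes from wrapping the inject/delete pair in a `try/finally` inside a context manager, so the delete runs on normal exit, test failure, and `KeyboardInterrupt` alike. A sketch of that pattern, with `FakeClient` as a stand-in since the real client's API may differ:

```python
from contextlib import contextmanager


@contextmanager
def gpu_fault(client, xid: int):
    fault = client.inject_xid_error(xid)
    try:
        yield fault
    finally:
        # Runs even if the body raises or the test run is interrupted.
        client.delete_fault(fault["fault_id"])


class FakeClient:
    """Stand-in client that records which faults were deleted."""
    def __init__(self):
        self.deleted = []

    def inject_xid_error(self, xid):
        return {"fault_id": f"xid-{xid}"}

    def delete_fault(self, fault_id):
        self.deleted.append(fault_id)


client = FakeClient()
try:
    with gpu_fault(client, 79):
        raise RuntimeError("simulated test failure")
except RuntimeError:
    pass
print(client.deleted)  # ['xid-79'] -- cleanup ran despite the failure
```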

Where should the reviewer start?

  1. inject_network_partition() - Main entry point
  2. Context managers - Automatic cleanup
  3. GPU injection methods - Core functionality

Key patterns:

  • Context managers ensure cleanup runs even on Ctrl+C
  • Auto-detects in-cluster vs local Kubernetes config
  • NetworkPolicy for complete blocking, ChaosMesh for realistic partial failures

Related Issues:

- FaultInjectionClient class with high-level API
- GPU fault injection and XID error methods
- Network partition injection (NetworkPolicy and ChaosMesh)
- Context managers for automatic cleanup
- Kubernetes integration for pod discovery

Provides a clean interface for fault injection in test scripts.
copy-pr-bot bot commented Nov 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

# DCGM Infrastructure Management
# ========================================================================

def deploy_dcgm(
can we turn this into a check instead of a deploy? IIUC dcgm should already be in the cluster / namespace


@julienmancuso for viz

# Fault Recovery
# ========================================================================

def recover_fault(self, fault_id: str) -> dict[str, Any]:
not sure what "recover" means here

@nnshah1 (Contributor) left a comment:

basic question: what can we do from the client without an API server and agent? The agent is needed to write into the kmsg log, is that right? Do we need the API server, or could we run all the operations from the test runner? (just thinking)

We could define fault types and move the logic to the common fault types folder ...
