Conversation

@nv-oviya (Contributor) commented Nov 2, 2025

Overview:

This PR adds a Python client library providing a high-level API for fault injection. It handles GPU faults, network partitions, Kubernetes integration, and automatic cleanup via context managers.


Details:

  • Add client/fault_injection_client.py (798 lines): Main client library

Core components:

  • Enums: GPUFault, NetworkPartition, NetworkMode, FaultSeverity
  • Data classes: FaultInfo, Metrics
  • FaultInjectionClient: Main client with comprehensive API
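A hypothetical sketch of the enum/dataclass layer, to show the shape implied above. The type names (`NetworkMode`, `FaultSeverity`, `FaultInfo`) come from this PR description; the specific members and fields are illustrative guesses, not the actual implementation.

```python
from dataclasses import dataclass
from enum import Enum


class NetworkMode(Enum):
    NETWORKPOLICY = "networkpolicy"  # complete blocking
    CHAOS_MESH = "chaos_mesh"        # packet loss / delay


class FaultSeverity(Enum):
    LOW = "low"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class FaultInfo:
    fault_id: str
    fault_type: str
    status: str = "active"


info = FaultInfo(fault_id="f-123", fault_type="xid_error")
```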

GPU fault injection:

  • inject_gpu_fault(): Generic GPU fault injection
  • inject_xid_error(): XID-specific injection (79, 48, 94, 95, 43, 74)
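A minimal sketch of how XID-specific injection might validate its input. The supported XID codes are the ones listed above; the function body, signature, and return shape are assumptions rather than the real client code.

```python
# XID codes this client supports, per the PR description.
SUPPORTED_XIDS = {79, 48, 94, 95, 43, 74}


def inject_xid_error(xid: int, node: str) -> dict:
    """Validate the XID code, then build a request payload for the fault."""
    if xid not in SUPPORTED_XIDS:
        raise ValueError(f"unsupported XID code: {xid}")
    return {"fault_type": "xid_error", "xid": xid, "node": node}


payload = inject_xid_error(79, "gpu-node-0")
```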

Network fault injection:

  • inject_network_partition(): Main network fault method
  • Two modes: NETWORKPOLICY (complete blocking), CHAOS_MESH (packet loss, delay)
  • Predefined types: FRONTEND_WORKER, WORKER_NATS, WORKER_WORKER, CUSTOM
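The two modes above map naturally to different backend specs: a NetworkPolicy for full blocking, a Chaos Mesh NetworkChaos resource for partial failures. A sketch of that dispatch, with illustrative field names that are assumptions, not the client's real output:

```python
def build_partition_spec(mode: str, partition: str,
                         loss_percent=None) -> dict:
    """Return a backend-specific spec for the requested partition."""
    if mode == "networkpolicy":
        # NetworkPolicy: completely block traffic between the pod groups.
        return {"kind": "NetworkPolicy", "partition": partition}
    # Chaos Mesh: realistic partial failures (packet loss, delay).
    return {"kind": "NetworkChaos", "partition": partition,
            "loss_percent": loss_percent}
```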

Fault management:

  • get_fault_status(): Query current status
  • delete_fault(): Remove fault and cleanup
  • wait_for_recovery(): Wait for healthy state
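A `wait_for_recovery()`-style method is usually a polling loop: call a health check until it succeeds or a timeout expires. The real client's signature and health criteria are not shown in this PR, so the sketch below only illustrates the pattern, with a caller-supplied `check_healthy` callable standing in for the client's internal check.

```python
import time


def wait_for_recovery(check_healthy, timeout: float = 60.0,
                      interval: float = 1.0) -> bool:
    """Poll check_healthy() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check_healthy():
            return True
        time.sleep(interval)
    return False
```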

Context managers:

  • gpu_fault(): Auto-cleanup GPU faults
  • network_partition(): Auto-cleanup network faults
  • Ensures cleanup even on test failure/interruption
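The cleanup guarantee comes from wrapping the inject/delete pair in a `try/finally` inside a context manager, so the delete runs on normal exit, test failure, and `KeyboardInterrupt` alike. A sketch of that pattern, with `FakeClient` as a stand-in since the real client's API may differ:

```python
from contextlib import contextmanager


@contextmanager
def gpu_fault(client, xid: int):
    fault = client.inject_xid_error(xid)
    try:
        yield fault
    finally:
        # Runs even if the body raises or the test run is interrupted.
        client.delete_fault(fault["fault_id"])


class FakeClient:
    """Stand-in client that records which faults were deleted."""
    def __init__(self):
        self.deleted = []

    def inject_xid_error(self, xid):
        return {"fault_id": f"xid-{xid}"}

    def delete_fault(self, fault_id):
        self.deleted.append(fault_id)


client = FakeClient()
try:
    with gpu_fault(client, 79):
        raise RuntimeError("simulated test failure")
except RuntimeError:
    pass
print(client.deleted)  # ['xid-79'] -- cleanup ran despite the failure
```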

Where should the reviewer start?

  1. inject_network_partition() - Main entry point
  2. Context managers - Automatic cleanup
  3. GPU injection methods - Core functionality

Key patterns:

  • Context managers ensure cleanup runs even on Ctrl+C
  • Auto-detects in-cluster vs local Kubernetes config
  • NetworkPolicy for complete blocking, ChaosMesh for realistic partial failures

Related Issues:

- FaultInjectionClient class with high-level API
- GPU fault injection and XID error methods
- Network partition injection (NetworkPolicy and ChaosMesh)
- Context managers for automatic cleanup
- Kubernetes integration for pod discovery

Provides a clean interface for fault injection in test scripts.
copy-pr-bot bot commented Nov 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

# DCGM Infrastructure Management
# ========================================================================

def deploy_dcgm(
can we turn this into a check instead of a deploy? IIUC dcgm should already be in the cluster / namespace


@julienmancuso for viz

# Fault Recovery
# ========================================================================

def recover_fault(self, fault_id: str) -> dict[str, Any]:
not sure what "recover" means here

@nnshah1 (Contributor) left a comment:

basic question: what can we do from the client without an API server and agent? The agent is needed to write into the kmsg log, is that right? Do we need the API server, or could we run all the operations from the test runner? (just thinking)

We could define fault types and move the logic to the common fault types folder ...
