feat(fault-injection): Add Python client library for fault injection API #4048
base: main
- FaultInjectionClient class with high-level API
- GPU fault injection and XID error methods
- Network partition injection (NetworkPolicy and ChaosMesh)
- Context managers for automatic cleanup
- Kubernetes integration for pod discovery

Provides clean interface for fault injection in test scripts.
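The "context managers for automatic cleanup" item can be illustrated with a minimal sketch. This is not the PR's code: `FakeFaultClient`, its in-memory `active` dict, the `node` parameter, and the `fault-N` id format are all hypothetical stand-ins, and only the method names `inject_gpu_fault`, `delete_fault`, and `gpu_fault` are taken from the PR description.

```python
from contextlib import contextmanager
from typing import Iterator


class FakeFaultClient:
    """Hypothetical in-memory stand-in for the real client (assumption, not the PR's code)."""

    def __init__(self) -> None:
        self.active: dict[str, str] = {}  # fault_id -> target node
        self._next_id = 0

    def inject_gpu_fault(self, node: str) -> str:
        # The real method presumably calls the fault injection API; here we only record it.
        self._next_id += 1
        fault_id = f"fault-{self._next_id}"
        self.active[fault_id] = node
        return fault_id

    def delete_fault(self, fault_id: str) -> None:
        # Idempotent removal, so cleaning up twice is harmless.
        self.active.pop(fault_id, None)

    @contextmanager
    def gpu_fault(self, node: str) -> Iterator[str]:
        fault_id = self.inject_gpu_fault(node)
        try:
            yield fault_id
        finally:
            # Runs on normal exit and on exceptions, so the fault does not leak.
            self.delete_fault(fault_id)


client = FakeFaultClient()
with client.gpu_fault("node-a") as fault_id:
    assert fault_id in client.active  # fault is live inside the block
assert client.active == {}  # and removed once the block exits
```

Putting the cleanup in `finally` means a failing assertion inside a test still removes the fault, which is presumably the point of offering context managers alongside the plain inject/delete calls.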
# DCGM Infrastructure Management
# ========================================================================

def deploy_dcgm(
Can we turn this into a check instead of a deploy? IIUC, DCGM should already be in the cluster / namespace.
@julienmancuso for viz
# Fault Recovery
# ========================================================================

def recover_fault(self, fault_id: str) -> dict[str, Any]:
not sure what recover means here -
nnshah1 left a comment
Basic question: what can we do from the client without an API server and agent? The agent is needed to write into the kmesg log, is that right? Do we need the API server, or could we run all the operations from the test runner (just thinking)?
We could define fault types and move the logic to the common fault types folder ...
Overview:
This PR adds a Python client library providing a high-level API for fault injection. It handles GPU faults, network partitions, Kubernetes integration, and automatic cleanup via context managers.

Details:
client/fault_injection_client.py (798 lines): Main client library

Core components:
- GPUFault, NetworkPartition, NetworkMode, FaultSeverity
- FaultInfo, Metrics
- FaultInjectionClient: Main client with comprehensive API

GPU fault injection:
- inject_gpu_fault(): Generic GPU fault injection
- inject_xid_error(): XID-specific injection (79, 48, 94, 95, 43, 74)

Network fault injection:
- inject_network_partition(): Main network fault method
- NETWORKPOLICY (complete blocking), CHAOS_MESH (packet loss, delay)
- FRONTEND_WORKER, WORKER_NATS, WORKER_WORKER, CUSTOM

Fault management:
- get_fault_status(): Query current status
- delete_fault(): Remove fault and cleanup
- wait_for_recovery(): Wait for healthy state

Context managers:
- gpu_fault(): Auto-cleanup GPU faults
- network_partition(): Auto-cleanup network faults

Where should the reviewer start?
- inject_network_partition() - Main entry point

Key patterns:

Related Issues:
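As an illustration of the "wait for healthy state" behaviour listed under Fault management, a polling loop with a deadline could look like the sketch below. It is a standalone function, not the client's actual `wait_for_recovery()` method: the `is_healthy` probe, the `timeout` and `interval` parameters, and the injectable `sleep` are assumptions made for the example.

```python
import time
from typing import Callable


def wait_for_recovery(
    is_healthy: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_healthy() until it returns True or `timeout` seconds elapse.

    Hypothetical signature; the probe is always called at least once.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_healthy():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Usage with a fake probe that reports healthy on the third call.
probe_results = iter([False, False, True])
assert wait_for_recovery(lambda: next(probe_results), timeout=5.0, interval=0.0)
```

Using `time.monotonic()` rather than wall-clock time keeps the deadline stable if the system clock is adjusted during a long test run.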