feat(fault-injection): Add fault injection API service #4042

nv-oviya · 2025-11-02T00:46:50Z

This PR adds a FastAPI-based service that provides HTTP endpoints for triggering fault injection remotely. This enables tests to inject faults without direct cluster access and provides a consistent API for both GPU and network fault injection.

Details:

Add api-service/main.py: FastAPI service with fault injection endpoints
- /health: Health check endpoint
- /api/v1/faults/gpu/inject/xid-79: Inject XID 79 on specified node
- /api/v1/faults/network/inject: Inject network partitions using NetworkPolicy or ChaosMesh
- /api/v1/faults/{fault_id}/recover: Recover from GPU or network faults
- /api/v1/faults/network/cleanup: Clean up orphaned NetworkPolicies
- Calls GPU fault injector agent DaemonSet pods on target nodes
- Creates NetworkPolicies directly for network partitions (no agent needed)
- Fault tracking with in-memory storage (fault ID --> status mapping)
- Automatic cleanup of active faults during shutdown
Add api-service/requirements.txt: Dependencies (FastAPI, httpx, kubernetes)
Add api-service/Dockerfile: Container image definition
- Python 3.11-slim base image
- Installs Python dependencies
- Runs uvicorn server on port 8080
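
For orientation, here is a minimal client-side sketch of how a test might address the routes listed above. Only the paths come from this description. The HTTP methods, the host name, and the `node_name` payload field are assumptions for illustration, and `build_request` is a hypothetical helper that is not part of the PR.

```python
import json
import urllib.request

# Route templates copied from the PR description.
ROUTES = {
    "health": "/health",
    "gpu_xid_79": "/api/v1/faults/gpu/inject/xid-79",
    "network_inject": "/api/v1/faults/network/inject",
    "recover": "/api/v1/faults/{fault_id}/recover",
    "network_cleanup": "/api/v1/faults/network/cleanup",
}


def build_request(base_url, route, method="POST", payload=None, **path_params):
    """Build (but do not send) a request for one of the routes above.

    The default method of POST is an assumption, not taken from the PR.
    """
    url = base_url.rstrip("/") + ROUTES[route].format(**path_params)
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# Hypothetical host and payload field, for illustration only.
req = build_request(
    "http://fault-injection.example:8080",
    "gpu_xid_79",
    payload={"node_name": "gpu-node-0"},
)
print(req.get_method(), req.full_url)
```

Passing the result to `urllib.request.urlopen` would send it. The sketch stops short of that so it can be read without a running service.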

Where should the reviewer start?

main.py - Main API implementation:
- FastAPI setup, in-cluster K8s config, fault storage
- /api/v1/faults/gpu/inject/xid-79 endpoint:
  - Finds GPU fault injector agent pod on target node
  - Forwards request to agent's /inject-xid endpoint
  - Generates fault ID for tracking
- /api/v1/faults/network/inject endpoint:
  - Creates NetworkPolicies directly (egress/ingress blocking)
  - Supports ChaosMesh NetworkChaos for advanced faults (packet loss, delay)
  - Tracks active network policies for cleanup
- Fault recovery endpoints for both GPU and network faults
- Shutdown handler that automatically recovers all active faults
- Health check endpoint
requirements.txt - Dependencies:
- FastAPI for API framework
- httpx for async HTTP calls to agents
- kubernetes client for pod discovery and NetworkPolicy management
Dockerfile - Standard Python container setup
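
To make the "creates NetworkPolicies directly" step concrete, here is a sketch of the kind of manifest such a call could submit. It assumes the partition is expressed as a deny-all policy on the target pod's labels. The function name, its parameters, and the `managed-by` label are hypothetical and not taken from `main.py`.

```python
def build_partition_policy(name, namespace, target_labels,
                           block_ingress=True, block_egress=True):
    """Return a NetworkPolicy manifest (plain dict) that isolates matching pods."""
    policy_types = []
    if block_ingress:
        policy_types.append("Ingress")
    if block_egress:
        policy_types.append("Egress")
    if not policy_types:
        raise ValueError("at least one direction must be blocked")
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Hypothetical label so orphaned policies can be found later.
            "labels": {"managed-by": "fault-injection-api"},
        },
        "spec": {
            "podSelector": {"matchLabels": dict(target_labels)},
            # Listing a direction without any allow rules denies that traffic.
            "policyTypes": policy_types,
        },
    }


policy = build_partition_policy(
    "fault-partition-demo", "default", {"app": "frontend"}, block_ingress=False
)
print(policy["spec"]["policyTypes"])
```

A dict of this shape could presumably be passed as the body to the kubernetes client's `NetworkingV1Api.create_namespaced_network_policy`. Recovery would then amount to deleting the policy by name.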

Architecture note:

API service acts as orchestrator
GPU fault injection delegated to DaemonSet agent pods
Network fault injection handled directly via Kubernetes API (NetworkPolicy/ChaosMesh)
In-memory storage (not persistent across restarts - acceptable for tests)
Uses NodePort or LoadBalancer service for external access
Graceful shutdown with automatic fault recovery to prevent cluster degradation
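
A sketch of the shutdown behaviour described in the last point, under the assumption that each active fault can be paired with an async recovery callable. The `recover_all` name and the shape of the `faults` mapping are hypothetical. The point is that one failing recovery should not prevent the others from running.

```python
import asyncio


async def recover_all(faults):
    """Try to recover every fault; return the ones that failed.

    `faults` maps fault_id -> async callable that undoes the fault.
    Recovered entries are removed from the mapping in place.
    """
    ids = list(faults)
    results = await asyncio.gather(
        *(faults[fid]() for fid in ids), return_exceptions=True
    )
    failed = {}
    for fid, result in zip(ids, results):
        if isinstance(result, Exception):
            failed[fid] = result
        else:
            faults.pop(fid)
    return failed


async def _demo():
    async def ok():
        return None

    async def broken():
        raise RuntimeError("agent unreachable")

    faults = {"gpu-1": ok, "net-1": broken}
    failed = await recover_all(faults)
    return faults, failed


remaining, failed = asyncio.run(_demo())
print(sorted(remaining), sorted(failed))
```

In a FastAPI service, a coroutine like this would typically be awaited on the exit side of the lifespan context manager. Anything left in `failed` would need manual cleanup, since the in-memory record is lost once the process exits.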

Related Issues:

Relates to: GPU fault injection agent infrastructure (next PR)
Relates to: GPU fault tolerance testing
Relates to: Network partition testing and fault tolerance validation

Summary by CodeRabbit

New Features
- Introduced Fault Injection API Service for orchestrating hardware fault testing in Kubernetes environments.
- Added capabilities for GPU fault injection, network fault injection, and fault recovery operations.
- Enabled metrics collection and aggregation across monitoring agents.
- Deployed as containerized service with built-in health monitoring.

- FastAPI service for remote fault injection
- Endpoints for GPU XID injection, network faults
- Dockerfile for containerized deployment
- Requirements with FastAPI, kubernetes client

Provides HTTP API for triggering fault injection from tests.

copy-pr-bot · 2025-11-02T00:46:53Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2025-11-04T00:28:57Z

Walkthrough

This PR introduces a new Fault Injection API Service for orchestrating hardware fault testing in Kubernetes environments. It includes a containerized FastAPI application with Kubernetes integration, GPU and network fault injection capabilities, metrics collection, and fault lifecycle management.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Container Setup `tests/fault_tolerance/hardware/fault-injection-service/api-service/Dockerfile` | Adds Python 3.12-slim-based Dockerfile with apt dependencies (curl), requirements installation, healthcheck endpoint, non-root user (faultinjection), and uvicorn entrypoint on port 8080 |
| Service Implementation `tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py` | Implements comprehensive Fault Injection API Service with: KubernetesHelper for K8s cluster operations; GPUFaultInjectorClient for GPU fault orchestration with XID error injection; NetworkFaultInjectorClient supporting NetworkPolicy and Chaos Mesh network faults; MonitoringAgentClient for metrics aggregation; FaultTracker for lifecycle management; FastAPI endpoints for GPU/network fault injection, recovery, metrics collection, and status queries; lifespan context manager for initialization/cleanup |
| Dependencies `tests/fault_tolerance/hardware/fault-injection-service/api-service/requirements.txt` | Specifies Python package dependencies including fastapi, httpx, kubernetes, pydantic, python-multipart, pyyaml, and uvicorn with standard license headers |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Areas requiring extra attention:

main.py — Largest change; review the multi-layered orchestration logic, error handling patterns, and interactions between KubernetesHelper, GPU/NetworkFaultInjectorClient, and MonitoringAgentClient
Kubernetes integration — Validate in-cluster/local config loading, pod/daemonset queries, exec logic, and namespace/label handling correctness
Fault lifecycle & state management — Verify FaultTracker thread-safety, fault status transitions, and cleanup of orphaned resources (NetworkPolicies, Chaos Mesh)
API endpoint consistency — Confirm all 18+ endpoints follow consistent error handling, response schemas, and parameter validation
GPU and network fault injection flows — Review XID error injection specifics and NetworkPolicy vs. Chaos Mesh branching logic
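
As a reference point for the thread-safety and status-transition items above, here is a minimal sketch of a lock-guarded tracker. It is not the PR's `FaultTracker`. The status names and the transition table are assumptions chosen for illustration.

```python
import threading
import uuid
from enum import Enum


class FaultStatus(str, Enum):
    ACTIVE = "active"
    RECOVERED = "recovered"
    FAILED = "failed"


# Hypothetical transition table; the real service may allow others.
ALLOWED = {
    FaultStatus.ACTIVE: {FaultStatus.RECOVERED, FaultStatus.FAILED},
    FaultStatus.RECOVERED: set(),
    FaultStatus.FAILED: {FaultStatus.RECOVERED},
}


class FaultTracker:
    """In-memory fault ID -> status map; every access holds the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._faults = {}

    def create(self, kind):
        fault_id = f"{kind}-{uuid.uuid4().hex[:8]}"
        with self._lock:
            self._faults[fault_id] = FaultStatus.ACTIVE
        return fault_id

    def transition(self, fault_id, new_status):
        # Read-check-write happens under one lock so it cannot interleave.
        with self._lock:
            current = self._faults[fault_id]
            if new_status not in ALLOWED[current]:
                raise ValueError(f"{current.value} -> {new_status.value} not allowed")
            self._faults[fault_id] = new_status

    def active(self):
        with self._lock:
            return [f for f, s in self._faults.items() if s is FaultStatus.ACTIVE]


tracker = FaultTracker()
fault_id = tracker.create("gpu")
tracker.transition(fault_id, FaultStatus.RECOVERED)
print(tracker.active())
```

If the tracker is only ever touched from coroutines on a single event loop, an `asyncio.Lock` or no lock at all may be sufficient. A `threading.Lock` matters once handlers or background work run in threads.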

Poem

🐰 A fault injector hops into the cluster with glee,
GPU chaos and network storms it shall set free,
With FastAPI fangs and Kubernetes might,
Hardware faults dance through the Kubernetes night! 🌙⚡

Pre-merge checks

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a fault injection API service with FastAPI endpoints. It is concise, specific, and directly reflects the changeset content. |
| Description check | ✅ Passed | PR description is comprehensive with clear overview, detailed implementation notes, reviewer guidance, and architecture rationale. |

coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d76583 and 1affaa8.

📒 Files selected for processing (3)

tests/fault_tolerance/hardware/fault-injection-service/api-service/Dockerfile (1 hunks)
tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py (1 hunks)
tests/fault_tolerance/hardware/fault-injection-service/api-service/requirements.txt (1 hunks)

🧰 Additional context used

🪛 Ruff (0.14.3)

tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py

174-174: Avoid specifying long messages outside the exception class

(TRY003)

213-213: Do not catch blind exception: Exception

(BLE001)

254-256: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

261-261: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

274-274: Consider moving this statement to an else block

(TRY300)

276-276: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

296-296: Consider moving this statement to an else block

(TRY300)

298-298: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

311-311: Consider moving this statement to an else block

(TRY300)

313-313: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

324-324: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

331-331: Consider moving this statement to an else block

(TRY300)

333-333: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

387-387: Do not catch blind exception: Exception

(BLE001)

388-388: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

440-440: Do not catch blind exception: Exception

(BLE001)

441-443: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

468-468: Do not catch blind exception: Exception

(BLE001)

469-469: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

487-487: Do not catch blind exception: Exception

(BLE001)

521-521: Unused method argument: target

(ARG002)

707-707: Consider moving this statement to an else block

(TRY300)

709-709: Do not catch blind exception: Exception

(BLE001)

710-710: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

717-717: Unused method argument: target

(ARG002)

930-930: Consider moving this statement to an else block

(TRY300)

932-932: Do not catch blind exception: Exception

(BLE001)

933-933: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

937-937: Use explicit conversion flag

Replace with conversion flag

(RUF010)

972-972: Consider moving this statement to an else block

(TRY300)

974-974: Do not catch blind exception: Exception

(BLE001)

975-975: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

1003-1003: Consider moving this statement to an else block

(TRY300)

1005-1005: Do not catch blind exception: Exception

(BLE001)

1006-1006: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

1063-1063: Do not catch blind exception: Exception

(BLE001)

1113-1113: Unused function argument: k8s

(ARG001)

1172-1172: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

1223-1223: Do not catch blind exception: Exception

(BLE001)

1224-1226: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

1607-1607: Do not catch blind exception: Exception

(BLE001)

1608-1608: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

1609-1609: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

1621-1621: Possible binding to all interfaces

(S104)
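
Most of the findings above fall into four patterns (BLE001, TRY400, TRY300, B904). The following sketch shows one function written to satisfy all four. It is an illustration rather than code from `main.py`, and `AgentError` and `parse_agent_reply` are hypothetical names.

```python
import logging

logger = logging.getLogger("fault-injection")


class AgentError(RuntimeError):
    """Hypothetical error type for a failed agent call."""


def parse_agent_reply(raw):
    """Return True if the agent reported success."""
    try:
        status = raw["status"]
    except KeyError as err:  # BLE001: catch the specific error, not Exception.
        # TRY400: logging.exception records the traceback, logging.error does not.
        logger.exception("Agent reply has no 'status' field")
        # B904: chain the original error so the cause stays visible.
        raise AgentError("malformed agent reply") from err
    else:
        # TRY300: the success path lives in else, outside the guarded block.
        return status == "ok"


print(parse_agent_reply({"status": "ok"}))
```

The S104 finding (binding to all interfaces) is a different kind of issue. For a service that runs inside a container and is reached through a Kubernetes Service, binding to `0.0.0.0` is usually intentional.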

tzulingk · 2025-11-19T19:22:40Z

tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py

+            logger.error(f"Failed to create NetworkPolicy: {e}")
+            return False, str(e)
+
+    async def _create_chaos_mesh_network_fault(


Can we extract common patterns across functions? For example:

    async def _get_target_pod_details(
        self, namespace: str, target_pod_prefix: str
    ) -> tuple[bool, str | dict[str, str], str]:
        """
        Looks up target Pod name and labels.
        Returns (success: bool, labels: dict[str, str] | error_msg: str, pod_name: str)
        """
        if not target_pod_prefix:
            return False, "target_pod_prefix parameter is required", ""

        target_pod_name = await self.k8s.get_pod_by_prefix(namespace, target_pod_prefix)
        if not target_pod_name:
            return (
                False,
                f"Could not find pod with prefix '{target_pod_prefix}' in namespace '{namespace}'",
                "",
            )

        target_labels = await self.k8s.get_pod_labels(namespace, target_pod_name)
        if not target_labels:
            return False, f"Could not get labels for pod '{target_pod_name}'", ""

        logger.info(f"Found target pod: {target_pod_name} with labels: {target_labels}")
        return True, target_labels, target_pod_name
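
A rough sketch of how two creation paths might share the helper suggested above. The helper body is condensed from the suggestion. `FakeK8s`, `NetworkFaults`, `create_policy_fault`, and `create_chaos_fault` are hypothetical stand-ins so the example runs without a cluster, and the pod data is made up.

```python
import asyncio


class FakeK8s:
    """Stand-in for the service's Kubernetes helper (hypothetical data)."""

    pods = {"frontend-abc12": {"app": "frontend"}}

    async def get_pod_by_prefix(self, namespace, prefix):
        return next((name for name in self.pods if name.startswith(prefix)), None)

    async def get_pod_labels(self, namespace, pod_name):
        return self.pods.get(pod_name)


class NetworkFaults:
    def __init__(self, k8s):
        self.k8s = k8s

    async def _get_target_pod_details(self, namespace, target_pod_prefix):
        # Condensed form of the helper suggested in the review comment.
        if not target_pod_prefix:
            return False, "target_pod_prefix parameter is required", ""
        pod = await self.k8s.get_pod_by_prefix(namespace, target_pod_prefix)
        if not pod:
            return False, f"Could not find pod with prefix '{target_pod_prefix}'", ""
        labels = await self.k8s.get_pod_labels(namespace, pod)
        if not labels:
            return False, f"Could not get labels for pod '{pod}'", ""
        return True, labels, pod

    async def create_policy_fault(self, namespace, prefix):
        ok, labels_or_error, pod = await self._get_target_pod_details(namespace, prefix)
        if not ok:
            return False, labels_or_error
        return True, f"would isolate {pod} via labels {labels_or_error}"

    async def create_chaos_fault(self, namespace, prefix):
        ok, labels_or_error, pod = await self._get_target_pod_details(namespace, prefix)
        if not ok:
            return False, labels_or_error
        return True, f"would apply NetworkChaos to {pod}"


faults = NetworkFaults(FakeK8s())
found = asyncio.run(faults.create_policy_fault("default", "frontend"))
missing = asyncio.run(faults.create_chaos_fault("default", "missing"))
print(found)
print(missing)
```

Both callers reduce to the same three-line preamble, which is the duplication the comment is pointing at. The mixed `str | dict` second element does force each caller to check the success flag before using it.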

tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py

tzulingk · 2025-11-19T19:41:03Z

tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py

+# ============================================================================
+
+
+class GPUFaultInjectorClient:


We already have `class GPUFaultInjector` in tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/agent.py.
Should we move this class to a separate file instead, e.g. tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/client.py?

…laining enums, fields, and KV structure, added warnings for multi-pod findings + added comment about single pod assumption, optimized imports, refactored code based on CR Signed-off-by: Oviya Seeniraj <[email protected]>

saturley-hall

To be refactored with sharing of request types for test infrastructure and separation of API endpoints from models/utilities.
Additionally we need to add endpoints for validating the available nodes/pods/GPUs so that it can effectively be black box tested.

Signed-off-by: Harrison King Saturley-Hall <[email protected]>

Signed-off-by: Harrison King Saturley-Hall <[email protected]> Co-authored-by: Harrison King Saturley-Hall <[email protected]>

pull-request-size bot added the size/XXL label Nov 2, 2025

github-actions bot added the feat label Nov 2, 2025

nv-oviya added 3 commits November 3, 2025 12:27

Fixed copyright and mypy type checks

38a8cc2

fixed mypy type errors

c6fa834

formatting change

1affaa8

nv-oviya marked this pull request as ready for review November 4, 2025 00:22

nv-oviya requested review from a team as code owners November 4, 2025 00:22

coderabbitai bot reviewed Nov 4, 2025

nv-oviya marked this pull request as draft November 4, 2025 00:53

nv-oviya added 2 commits November 3, 2025 17:26

Applied suggested fixes -- fixed FastAPI shutdown to properly recover…

c2b9645

… active GPU and network faults

Ruff fix

8eff27a

nv-oviya marked this pull request as ready for review November 4, 2025 17:59

nnshah1 reviewed Nov 19, 2025

tzulingk reviewed Nov 19, 2025

nv-oviya requested a review from tzulingk November 26, 2025 20:42

saturley-hall approved these changes Nov 26, 2025

fix: formatting

d910418

Signed-off-by: Harrison King Saturley-Hall <[email protected]>

saturley-hall merged commit 39a9d0b into main Nov 26, 2025
10 checks passed

saturley-hall deleted the oviya/fault-injection/api-service branch November 26, 2025 21:11

zxue2 pushed a commit to zxue2/dynamo that referenced this pull request Dec 11, 2025

feat(fault-injection): Add fault injection API service (ai-dynamo#4042)

0996205

Signed-off-by: Harrison King Saturley-Hall <[email protected]> Co-authored-by: Harrison King Saturley-Hall <[email protected]>
