@nv-oviya nv-oviya commented Nov 2, 2025

This PR adds a FastAPI-based service that provides HTTP endpoints for triggering fault injection remotely. This enables tests to inject faults without direct cluster access and provides a consistent API for both GPU and network fault injection.

Details:

  • Add api-service/main.py: FastAPI service with fault injection endpoints

    • /health: Health check endpoint
    • /api/v1/faults/gpu/inject/xid-79: Inject XID 79 on specified node
    • /api/v1/faults/network/inject: Inject network partitions using NetworkPolicy or ChaosMesh
    • /api/v1/faults/{fault_id}/recover: Recover from GPU or network faults
    • /api/v1/faults/network/cleanup: Clean up orphaned NetworkPolicies
    • Calls GPU fault injector agent DaemonSet pods on target nodes
    • Creates NetworkPolicies directly for network partitions (no agent needed)
    • Fault tracking with in-memory storage (fault ID --> status mapping)
    • Automatic cleanup of active faults during shutdown
  • Add api-service/requirements.txt: Dependencies (FastAPI, httpx, kubernetes)

  • Add api-service/Dockerfile: Container image definition

    • Python 3.11-slim base image
    • Installs Python dependencies
    • Runs uvicorn server on port 8080
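
As a rough illustration of how a test might address the endpoints listed above, here is a minimal standard-library sketch that only builds the requests and does not send them. The service address, the `node_name` payload field, and the use of POST for recovery are assumptions for illustration; the real request schema lives in main.py.

```python
import json
import urllib.request

# Endpoint paths as listed in the PR description.
GPU_XID79_PATH = "/api/v1/faults/gpu/inject/xid-79"
RECOVER_PATH = "/api/v1/faults/{fault_id}/recover"


def build_post(base_url: str, path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the fault injection API."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values: the host name, the "node_name" field, and the fault ID
# are assumptions, not taken from the PR.
inject = build_post("http://fault-api:8080", GPU_XID79_PATH, {"node_name": "gpu-node-0"})
recover = build_post("http://fault-api:8080", RECOVER_PATH.format(fault_id="abc123"), {})
```

A test would pass these objects to `urllib.request.urlopen` (or use httpx, which the service itself depends on) once the service is reachable through its NodePort or LoadBalancer address.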

Where should the reviewer start?

  1. main.py - Main API implementation:

    • FastAPI setup, in-cluster K8s config, fault storage
    • /api/v1/faults/gpu/inject/xid-79 endpoint:
      • Finds GPU fault injector agent pod on target node
      • Forwards request to agent's /inject-xid endpoint
      • Generates fault ID for tracking
    • /api/v1/faults/network/inject endpoint:
      • Creates NetworkPolicies directly (egress/ingress blocking)
      • Supports ChaosMesh NetworkChaos for advanced faults (packet loss, delay)
      • Tracks active network policies for cleanup
    • Fault recovery endpoints for both GPU and network faults
    • Shutdown handler that automatically recovers all active faults
    • Health check endpoint
  2. requirements.txt - Dependencies:

    • FastAPI for API framework
    • httpx for async HTTP calls to agents
    • kubernetes client for pod discovery and NetworkPolicy management
  3. Dockerfile - Standard Python container setup
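
To make the fault tracking and shutdown recovery described above easier to picture, here is a simplified, synchronous sketch of an in-memory tracker. It is not the `FaultTracker` from main.py: the class name, the status strings, and the callback-based recovery are all assumptions chosen to keep the example self-contained.

```python
import threading
import uuid
from typing import Callable, Dict, List


class InMemoryFaultTracker:
    """Hypothetical fault ID -> status map with a recover-everything hook."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._faults: Dict[str, dict] = {}

    def register(self, fault_type: str, recover: Callable[[], None]) -> str:
        """Record an active fault and return its generated ID."""
        fault_id = uuid.uuid4().hex
        with self._lock:
            self._faults[fault_id] = {
                "type": fault_type,
                "status": "active",
                "recover": recover,
            }
        return fault_id

    def status(self, fault_id: str) -> str:
        with self._lock:
            return self._faults[fault_id]["status"]

    def recover(self, fault_id: str) -> None:
        """Run the recovery callback once; later calls are no-ops."""
        with self._lock:
            entry = self._faults[fault_id]
            if entry["status"] != "active":
                return
            entry["status"] = "recovering"
        entry["recover"]()  # run outside the lock so callbacks cannot deadlock
        with self._lock:
            entry["status"] = "recovered"

    def recover_all(self) -> List[str]:
        """Recover every still-active fault, e.g. from a shutdown handler."""
        with self._lock:
            active = [fid for fid, e in self._faults.items() if e["status"] == "active"]
        for fid in active:
            self.recover(fid)
        return active
```

In the service, the equivalent of `recover_all()` would be invoked from the shutdown path so that no injected fault outlives the process. Because the map lives only in memory, a crash (as opposed to a graceful shutdown) would still leave faults behind, which is what the orphaned-NetworkPolicy cleanup endpoint is for.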

Architecture note:

  • API service acts as orchestrator
  • GPU fault injection delegated to DaemonSet agent pods
  • Network fault injection handled directly via Kubernetes API (NetworkPolicy/ChaosMesh)
  • In-memory storage (not persistent across restarts - acceptable for tests)
  • Uses NodePort or LoadBalancer service for external access
  • Graceful shutdown with automatic fault recovery to prevent cluster degradation
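
For the network path, a partition can be expressed as a NetworkPolicy that selects the target pods and lists a direction in `policyTypes` without any allow rules for it, which denies all traffic in that direction. The sketch below builds such a manifest as a plain dict; the function name, the `managed-by` label, and its value are assumptions, and main.py may construct its policies differently.

```python
def build_partition_policy(
    name: str,
    namespace: str,
    pod_labels: dict,
    block_ingress: bool = True,
    block_egress: bool = True,
) -> dict:
    """Return a NetworkPolicy manifest that isolates pods matching pod_labels."""
    policy_types = []
    if block_ingress:
        policy_types.append("Ingress")
    if block_egress:
        policy_types.append("Egress")
    if not policy_types:
        raise ValueError("at least one direction must be blocked")

    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Hypothetical label so orphaned policies can be found and cleaned up.
            "labels": {"managed-by": "fault-injection-api"},
        },
        "spec": {
            "podSelector": {"matchLabels": dict(pod_labels)},
            # A listed direction with no ingress/egress rules denies all
            # traffic in that direction for the selected pods.
            "policyTypes": policy_types,
        },
    }
```

Labelling every policy the service creates is one way to support a cleanup endpoint: the service can list policies by label selector and delete whatever a previous run left behind.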

Related Issues:

  • Relates to: GPU fault injection agent infrastructure (next PR)
  • Relates to: GPU fault tolerance testing
  • Relates to: Network partition testing and fault tolerance validation

Summary by CodeRabbit

  • New Features
    • Introduced Fault Injection API Service for orchestrating hardware fault testing in Kubernetes environments.
    • Added capabilities for GPU fault injection, network fault injection, and fault recovery operations.
    • Enabled metrics collection and aggregation across monitoring agents.
    • Deployed as containerized service with built-in health monitoring.

- FastAPI service for remote fault injection
- Endpoints for GPU XID injection, network faults
- Dockerfile for containerized deployment
- Requirements with FastAPI, kubernetes client

Provides HTTP API for triggering fault injection from tests.

copy-pr-bot bot commented Nov 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nv-oviya nv-oviya marked this pull request as ready for review November 4, 2025 00:22
@nv-oviya nv-oviya requested review from a team as code owners November 4, 2025 00:22

coderabbitai bot commented Nov 4, 2025

Walkthrough

This PR introduces a new Fault Injection API Service for orchestrating hardware fault testing in Kubernetes environments. It includes a containerized FastAPI application with Kubernetes integration, GPU and network fault injection capabilities, metrics collection, and fault lifecycle management.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Container Setup: tests/fault_tolerance/hardware/fault-injection-service/api-service/Dockerfile | Adds Python 3.12-slim-based Dockerfile with apt dependencies (curl), requirements installation, healthcheck endpoint, non-root user (faultinjection), and uvicorn entrypoint on port 8080 |
| Service Implementation: tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py | Implements comprehensive Fault Injection API Service with: KubernetesHelper for K8s cluster operations; GPUFaultInjectorClient for GPU fault orchestration with XID error injection; NetworkFaultInjectorClient supporting NetworkPolicy and Chaos Mesh network faults; MonitoringAgentClient for metrics aggregation; FaultTracker for lifecycle management; FastAPI endpoints for GPU/network fault injection, recovery, metrics collection, and status queries; lifespan context manager for initialization/cleanup |
| Dependencies: tests/fault_tolerance/hardware/fault-injection-service/api-service/requirements.txt | Specifies Python package dependencies including fastapi, httpx, kubernetes, pydantic, python-multipart, pyyaml, and uvicorn with standard license headers |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Areas requiring extra attention:

  • main.py — Largest change; review the multi-layered orchestration logic, error handling patterns, and interactions between KubernetesHelper, GPU/NetworkFaultInjectorClient, and MonitoringAgentClient
  • Kubernetes integration — Validate in-cluster/local config loading, pod/daemonset queries, exec logic, and namespace/label handling correctness
  • Fault lifecycle & state management — Verify FaultTracker thread-safety, fault status transitions, and cleanup of orphaned resources (NetworkPolicies, Chaos Mesh)
  • API endpoint consistency — Confirm all 18+ endpoints follow consistent error handling, response schemas, and parameter validation
  • GPU and network fault injection flows — Review XID error injection specifics and NetworkPolicy vs. Chaos Mesh branching logic

Poem

🐰 A fault injector hops into the cluster with glee,
GPU chaos and network storms it shall set free,
With FastAPI fangs and Kubernetes might,
Hardware faults dance through the Kubernetes night! 🌙⚡

Pre-merge checks

✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a fault injection API service with FastAPI endpoints. It is concise, specific, and directly reflects the changeset content. |
| Description check | ✅ Passed | PR description is comprehensive with clear overview, detailed implementation notes, reviewer guidance, and architecture rationale. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d76583 and 1affaa8.

📒 Files selected for processing (3)
  • tests/fault_tolerance/hardware/fault-injection-service/api-service/Dockerfile (1 hunks)
  • tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py (1 hunks)
  • tests/fault_tolerance/hardware/fault-injection-service/api-service/requirements.txt (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.3)
tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py

174-174: Avoid specifying long messages outside the exception class (TRY003)
213-213: Do not catch blind exception: Exception (BLE001)
254-256: Use logging.exception instead of logging.error; replace with exception (TRY400)
261-261: Use logging.exception instead of logging.error; replace with exception (TRY400)
274-274: Consider moving this statement to an else block (TRY300)
276-276: Use logging.exception instead of logging.error; replace with exception (TRY400)
296-296: Consider moving this statement to an else block (TRY300)
298-298: Use logging.exception instead of logging.error; replace with exception (TRY400)
311-311: Consider moving this statement to an else block (TRY300)
313-313: Use logging.exception instead of logging.error; replace with exception (TRY400)
324-324: Use logging.exception instead of logging.error; replace with exception (TRY400)
331-331: Consider moving this statement to an else block (TRY300)
333-333: Use logging.exception instead of logging.error; replace with exception (TRY400)
387-387: Do not catch blind exception: Exception (BLE001)
388-388: Use logging.exception instead of logging.error; replace with exception (TRY400)
440-440: Do not catch blind exception: Exception (BLE001)
441-443: Use logging.exception instead of logging.error; replace with exception (TRY400)
468-468: Do not catch blind exception: Exception (BLE001)
469-469: Use logging.exception instead of logging.error; replace with exception (TRY400)
487-487: Do not catch blind exception: Exception (BLE001)
521-521: Unused method argument: target (ARG002)
707-707: Consider moving this statement to an else block (TRY300)
709-709: Do not catch blind exception: Exception (BLE001)
710-710: Use logging.exception instead of logging.error; replace with exception (TRY400)
717-717: Unused method argument: target (ARG002)
930-930: Consider moving this statement to an else block (TRY300)
932-932: Do not catch blind exception: Exception (BLE001)
933-933: Use logging.exception instead of logging.error; replace with exception (TRY400)
937-937: Use explicit conversion flag; replace with conversion flag (RUF010)
972-972: Consider moving this statement to an else block (TRY300)
974-974: Do not catch blind exception: Exception (BLE001)
975-975: Use logging.exception instead of logging.error; replace with exception (TRY400)
1003-1003: Consider moving this statement to an else block (TRY300)
1005-1005: Do not catch blind exception: Exception (BLE001)
1006-1006: Use logging.exception instead of logging.error; replace with exception (TRY400)
1063-1063: Do not catch blind exception: Exception (BLE001)
1113-1113: Unused function argument: k8s (ARG001)
1172-1172: Use logging.exception instead of logging.error; replace with exception (TRY400)
1223-1223: Do not catch blind exception: Exception (BLE001)
1224-1226: Use logging.exception instead of logging.error; replace with exception (TRY400)
1607-1607: Do not catch blind exception: Exception (BLE001)
1608-1608: Use logging.exception instead of logging.error; replace with exception (TRY400)
1609-1609: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling (B904)
1621-1621: Possible binding to all interfaces (S104)
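
Most of the findings above fall into four recurring patterns (BLE001, TRY400, TRY300, B904). The following sketch shows one way a single handler could address all four at once. It is not code from main.py: `FaultInjectionError`, `create_policy`, and the choice of `ValueError` as the narrowed exception type are assumptions for illustration.

```python
import logging

logger = logging.getLogger("fault_injection_sketch")


class FaultInjectionError(RuntimeError):
    """Hypothetical domain error raised when an injection step fails."""


def create_policy(apply) -> bool:
    """Call apply() and translate its failure into a domain error."""
    try:
        apply()
    except ValueError as err:  # BLE001: catch a specific type, not Exception
        # TRY400: logger.exception logs at ERROR level and appends the traceback,
        # so the exception does not need to be formatted into the message.
        logger.exception("Failed to create NetworkPolicy")
        # B904: chain the original error so the root cause is preserved.
        raise FaultInjectionError("network fault injection failed") from err
    else:
        # TRY300: the success return lives in the else block, outside the try body.
        return True
```

In the real service the narrowed type would more likely be the Kubernetes client's API exception or an httpx error, depending on the call being wrapped.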

@nv-oviya nv-oviya marked this pull request as draft November 4, 2025 00:53
@nv-oviya nv-oviya marked this pull request as ready for review November 4, 2025 17:59
logger.error(f"Failed to create NetworkPolicy: {e}")
return False, str(e)

async def _create_chaos_mesh_network_fault(

Can we extract common patterns across functions? For example:

async def _get_target_pod_details(
    self, namespace: str, target_pod_prefix: str
) -> tuple[bool, str | dict[str, str], str]:
    """
    Looks up target Pod name and labels.
    Returns (success: bool, labels: dict[str, str] | error_msg: str, pod_name: str)
    """
    if not target_pod_prefix:
        return False, "target_pod_prefix parameter is required", ""

    target_pod_name = await self.k8s.get_pod_by_prefix(namespace, target_pod_prefix)
    if not target_pod_name:
        return (
            False,
            f"Could not find pod with prefix '{target_pod_prefix}' in namespace '{namespace}'",
            "",
        )

    target_labels = await self.k8s.get_pod_labels(namespace, target_pod_name)
    if not target_labels:
        return False, f"Could not get labels for pod '{target_pod_name}'", ""

    logger.info(f"Found target pod: {target_pod_name} with labels: {target_labels}")
    return True, target_labels, target_pod_name
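
A possible variation on the helper above, offered only as a sketch: returning a small result object instead of a tuple avoids the `str | dict` union in the second slot. The `PodLookup` dataclass and the `FakeK8s` stub are hypothetical; only the `get_pod_by_prefix` and `get_pod_labels` method names come from the snippet above.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PodLookup:
    """Hypothetical result type: either a pod name plus labels, or an error."""
    ok: bool
    pod_name: str = ""
    labels: Dict[str, str] = field(default_factory=dict)
    error: str = ""


class FakeK8s:
    """Stand-in for the service's Kubernetes helper, for demonstration only."""

    def __init__(self, pods: Dict[str, Dict[str, str]]) -> None:
        self._pods = pods

    async def get_pod_by_prefix(self, namespace: str, prefix: str) -> Optional[str]:
        return next((name for name in self._pods if name.startswith(prefix)), None)

    async def get_pod_labels(self, namespace: str, pod_name: str) -> Optional[Dict[str, str]]:
        return self._pods.get(pod_name)


async def lookup_target_pod(k8s, namespace: str, prefix: str) -> PodLookup:
    """Shared lookup that both NetworkPolicy and Chaos Mesh paths could call."""
    if not prefix:
        return PodLookup(ok=False, error="target_pod_prefix parameter is required")

    pod_name = await k8s.get_pod_by_prefix(namespace, prefix)
    if not pod_name:
        return PodLookup(
            ok=False,
            error=f"Could not find pod with prefix '{prefix}' in namespace '{namespace}'",
        )

    labels = await k8s.get_pod_labels(namespace, pod_name)
    if not labels:
        return PodLookup(ok=False, error=f"Could not get labels for pod '{pod_name}'")

    return PodLookup(ok=True, pod_name=pod_name, labels=labels)
```

Callers then branch on `result.ok` and read `result.labels` or `result.error` without checking the type of a tuple element.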

# ============================================================================


class GPUFaultInjectorClient:

We already have class GPUFaultInjector: in tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/agent.py.
Should we move this client into its own file, e.g. tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/client.py, instead?

…laining enums, fields, and KV structure, added warnings for multi-pod findings + added comment about single pod assumption, optimized imports, refactored code based on CR

Signed-off-by: Oviya Seeniraj <[email protected]>
@nv-oviya nv-oviya requested a review from tzulingk November 26, 2025 20:42

@saturley-hall saturley-hall left a comment


To be refactored to share request types with the test infrastructure and to separate the API endpoints from models/utilities.
Additionally, we need to add endpoints for validating the available nodes/pods/GPUs so that the service can be black-box tested effectively.

Signed-off-by: Harrison King Saturley-Hall <[email protected]>
@saturley-hall saturley-hall merged commit 39a9d0b into main Nov 26, 2025
10 checks passed
@saturley-hall saturley-hall deleted the oviya/fault-injection/api-service branch November 26, 2025 21:11
zxue2 pushed a commit to zxue2/dynamo that referenced this pull request Dec 11, 2025
Signed-off-by: Harrison King Saturley-Hall <[email protected]>
Co-authored-by: Harrison King Saturley-Hall <[email protected]>