-
Notifications
You must be signed in to change notification settings - Fork 739
feat(fault-injection): Add fault injection API service #4042
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- FastAPI service for remote fault injection - Endpoints for GPU XID injection, network faults - Dockerfile for containerized deployment - Requirements with FastAPI, kubernetes client Provides HTTP API for triggering fault injection from tests.
WalkthroughThis PR introduces a new Fault Injection API Service for orchestrating hardware fault testing in Kubernetes environments. It includes a containerized FastAPI application with Kubernetes integration, GPU and network fault injection capabilities, metrics collection, and fault lifecycle management. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60–90 minutes Areas requiring extra attention:
Poem
Pre-merge checks✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tests/fault_tolerance/hardware/fault-injection-service/api-service/Dockerfile(1 hunks)tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py(1 hunks)tests/fault_tolerance/hardware/fault-injection-service/api-service/requirements.txt(1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.3)
tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py
174-174: Avoid specifying long messages outside the exception class
(TRY003)
213-213: Do not catch blind exception: Exception
(BLE001)
254-256: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
261-261: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
274-274: Consider moving this statement to an else block
(TRY300)
276-276: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
296-296: Consider moving this statement to an else block
(TRY300)
298-298: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
311-311: Consider moving this statement to an else block
(TRY300)
313-313: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
324-324: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
331-331: Consider moving this statement to an else block
(TRY300)
333-333: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
387-387: Do not catch blind exception: Exception
(BLE001)
388-388: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
440-440: Do not catch blind exception: Exception
(BLE001)
441-443: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
468-468: Do not catch blind exception: Exception
(BLE001)
469-469: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
487-487: Do not catch blind exception: Exception
(BLE001)
521-521: Unused method argument: target
(ARG002)
707-707: Consider moving this statement to an else block
(TRY300)
709-709: Do not catch blind exception: Exception
(BLE001)
710-710: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
717-717: Unused method argument: target
(ARG002)
930-930: Consider moving this statement to an else block
(TRY300)
932-932: Do not catch blind exception: Exception
(BLE001)
933-933: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
937-937: Use explicit conversion flag
Replace with conversion flag
(RUF010)
972-972: Consider moving this statement to an else block
(TRY300)
974-974: Do not catch blind exception: Exception
(BLE001)
975-975: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
1003-1003: Consider moving this statement to an else block
(TRY300)
1005-1005: Do not catch blind exception: Exception
(BLE001)
1006-1006: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
1063-1063: Do not catch blind exception: Exception
(BLE001)
1113-1113: Unused function argument: k8s
(ARG001)
1172-1172: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
1223-1223: Do not catch blind exception: Exception
(BLE001)
1224-1226: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
1607-1607: Do not catch blind exception: Exception
(BLE001)
1608-1608: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
1609-1609: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
1621-1621: Possible binding to all interfaces
(S104)
tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py
Show resolved
Hide resolved
tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py
Outdated
Show resolved
Hide resolved
… active GPU and network faults
tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py
Outdated
Show resolved
Hide resolved
tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py
Outdated
Show resolved
Hide resolved
tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py
Outdated
Show resolved
Hide resolved
tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py
Outdated
Show resolved
Hide resolved
tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py
Show resolved
Hide resolved
tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py
Outdated
Show resolved
Hide resolved
tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py
Show resolved
Hide resolved
tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py
Outdated
Show resolved
Hide resolved
tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py
Outdated
Show resolved
Hide resolved
| logger.error(f"Failed to create NetworkPolicy: {e}") | ||
| return False, str(e) | ||
|
|
||
| async def _create_chaos_mesh_network_fault( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we extract common patterns across functions?
ex
async def _get_target_pod_details(
self, namespace: str, target_pod_prefix: str
) -> tuple[bool, str | dict[str, str], str]:
"""
Looks up target Pod name and labels.
Returns (success: bool, labels: dict[str, str] | error_msg: str, pod_name: str)
"""
if not target_pod_prefix:
return False, "target_pod_prefix parameter is required", ""
target_pod_name = await self.k8s.get_pod_by_prefix(namespace, target_pod_prefix)
if not target_pod_name:
return (
False,
f"Could not find pod with prefix '{target_pod_prefix}' in namespace '{namespace}'",
"",
)
target_labels = await self.k8s.get_pod_labels(namespace, target_pod_name)
if not target_labels:
return False, f"Could not get labels for pod '{target_pod_name}'", ""
logger.info(f"Found target pod: {target_pod_name} with labels: {target_labels}")
return True, target_labels, target_pod_name
tests/fault_tolerance/hardware/fault-injection-service/api-service/main.py
Outdated
Show resolved
Hide resolved
| # ============================================================================ | ||
|
|
||
|
|
||
| class GPUFaultInjectorClient: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have class GPUFaultInjector: in tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/agent.py.
should we move this to file ex tests/fault_tolerance/hardware/fault-injection-service/agents/gpu-fault-injector/client.py instead?
…laining enums, fields, and KV structure, added warnings for multi-pod findings + added comment about single pod assumption, optimized imports, refactored code based on CR Signed-off-by: Oviya Seeniraj <[email protected]>
saturley-hall
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be refactored with sharing of request types for test infrastructure and separation of API endpoints from models/utilities.
Additionally we need to add endpoints for validating the available nodes/pods/GPUs so that it can effectively be black box tested.
Signed-off-by: Harrison King Saturley-Hall <[email protected]>
Signed-off-by: Harrison King Saturley-Hall <[email protected]> Co-authored-by: Harrison King Saturley-Hall <[email protected]>
This PR adds a FastAPI-based service that provides HTTP endpoints for triggering fault injection remotely. This enables tests to inject faults without direct cluster access and provides a consistent API for both GPU and network fault injection.
Details:
Add
api-service/main.py: FastAPI service with fault injection endpoints/health: Health check endpoint/api/v1/faults/gpu/inject/xid-79: Inject XID 79 on specified node/api/v1/faults/network/inject: Inject network partitions using NetworkPolicy or ChaosMesh/api/v1/faults/{fault_id}/recover: Recover from GPU or network faults/api/v1/faults/network/cleanup: Clean up orphaned NetworkPoliciesAdd
api-service/requirements.txt: Dependencies (FastAPI, httpx, kubernetes)Add
api-service/Dockerfile: Container image definitionWhere should the reviewer start?
main.py- Main API implementation:/api/v1/faults/gpu/inject/xid-79endpoint:/inject-xidendpoint/api/v1/faults/network/injectendpoint:requirements.txt- Dependencies:Dockerfile- Standard Python container setupArchitecture note:
NodePortorLoadBalancerservice for external accessRelated Issues:
Summary by CodeRabbit