feat(fault-injection): Add core testing helper utilities #4040

Conversation
- k8s_operations.py: Node cordoning, pod draining, status checks
- inference_testing.py: Continuous load generation and metrics

Independent utilities for fault tolerance testing workflows.
Walkthrough

This PR introduces a new test helper package for fault-tolerance testing, adding infrastructure to support fault-injection scenarios. Three files are created: a package initializer, an inference load testing module with endpoint selection and load generation, and a Kubernetes operations module providing node and pod management utilities.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ Passed checks (2 passed)
Actionable comments posted: 4
🧹 Nitpick comments (3)
tests/fault_tolerance/hardware/fault-injection-service/helpers/inference_testing.py (1)
107-114: Refactor latency calculation in the exception handler.

The `if "start_time" in locals()` check is a code smell. If an exception occurs before `start_time` is assigned (line 79), the fallback to 0 is triggered, but this approach is fragile. Apply this diff to initialize `start_time` before the try block:

```diff
 def send_inference_request(self, prompt: str = "Hello, world!") -> Dict:
     """
     Send a single inference request and return result.

     Args:
         prompt: Text prompt for inference

     Returns:
         Dict with keys: success, status_code, latency, timestamp, error
     """
+    start_time = time.time()
     try:
-        start_time = time.time()
         response = requests.post(
             self.endpoint,
             json={
                 "model": self.model_name,
                 "prompt": prompt,
                 "max_tokens": 50,
                 "temperature": 0.7,
             },
             timeout=self.timeout,
         )
         latency = time.time() - start_time
         return {
             "success": response.status_code == 200,
             "status_code": response.status_code,
             "latency": latency,
             "timestamp": time.time(),
             "error": None if response.status_code == 200 else response.text[:200],
         }
     except requests.exceptions.Timeout:
         return {
             "success": False,
             "status_code": None,
             "latency": self.timeout,
             "timestamp": time.time(),
             "error": "Request timeout",
         }
     except Exception as e:
         return {
             "success": False,
             "status_code": None,
-            "latency": time.time() - start_time if "start_time" in locals() else 0,
+            "latency": time.time() - start_time,
             "timestamp": time.time(),
             "error": str(e)[:200],
         }
```

tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py (1)
11-16: Consider sorting `__all__` for consistency.

While not critical, maintaining alphabetical order in `__all__` improves readability and aligns with Python style conventions. Apply this diff to sort the list:

```diff
 __all__ = [
+    "get_inference_endpoint",
     "InferenceLoadTester",
-    "get_inference_endpoint",
     "NodeOperations",
     "PodOperations",
 ]
```

tests/fault_tolerance/hardware/fault-injection-service/helpers/k8s_operations.py (1)
347-394: Consider reporting status for all containers.

Line 376 only examines the first container's status. For pods with multiple containers, this provides an incomplete view. Consider either documenting this limitation or iterating through all containers to provide complete status information.
If comprehensive status is desired, apply this diff:
```diff
         details = []
         for pod in pods.items:
             pod_name = pod.metadata.name
             node = pod.spec.node_name

             if pod.status.container_statuses:
-                cs = pod.status.container_statuses[0]
-                if cs.state.waiting:
-                    state = cs.state.waiting.reason
-                elif cs.state.terminated:
-                    state = f"Terminated ({cs.state.terminated.reason})"
-                elif cs.state.running:
-                    state = "Running"
-                else:
-                    state = "Unknown"
+                # Report status of all containers
+                states = []
+                for cs in pod.status.container_statuses:
+                    if cs.state.waiting:
+                        states.append(f"{cs.name}: {cs.state.waiting.reason}")
+                    elif cs.state.terminated:
+                        states.append(f"{cs.name}: Terminated ({cs.state.terminated.reason})")
+                    elif cs.state.running:
+                        states.append(f"{cs.name}: Running")
+                    else:
+                        states.append(f"{cs.name}: Unknown")
+                state = ", ".join(states)
             else:
                 state = f"{pod.status.phase} (no container status)"

             details.append({"name": pod_name, "node": node, "state": state})

         return details
```

Alternatively, document that only the first container is reported:

```diff
     def get_pod_status_details(
         self, namespace: str, label_selector: str, node_name: Optional[str] = None
     ) -> List[Dict]:
         """
         Get detailed status for each pod.

         Args:
             namespace: Kubernetes namespace
             label_selector: Label selector for pods
             node_name: If provided, only get pods on this node

         Returns:
-            List of dicts with pod name, state, and reason
+            List of dicts with pod name, node, and state (first container only for multi-container pods)
         """
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py (1 hunks)
- tests/fault_tolerance/hardware/fault-injection-service/helpers/inference_testing.py (1 hunks)
- tests/fault_tolerance/hardware/fault-injection-service/helpers/k8s_operations.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3193
File: tests/fault_tolerance/cancellation/test_trtllm.py:4-204
Timestamp: 2025-09-25T00:54:01.369Z
Learning: The fault tolerance tests in tests/fault_tolerance/cancellation/ run in a controlled container environment where files written to /workspace are automatically cleaned up after test completion, and tests execute sequentially without concurrency concerns, so temporary file management for config files is not necessary.
Learnt from: nnshah1
Repo: ai-dynamo/dynamo PR: 1444
File: tests/fault_tolerance/scenarios.py:57-57
Timestamp: 2025-07-01T15:39:56.789Z
Learning: The fault tolerance tests in tests/fault_tolerance/ are designed to run only in the mounted container environment, so hardcoded absolute paths with `/workspace/` prefix are intentional and should not be changed to relative paths.
🧬 Code graph analysis (1)
tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py (2)
tests/fault_tolerance/hardware/fault-injection-service/helpers/inference_testing.py (2)
- InferenceLoadTester (48-180)
- get_inference_endpoint (22-45)

tests/fault_tolerance/hardware/fault-injection-service/helpers/k8s_operations.py (2)

- NodeOperations (19-197)
- PodOperations (200-394)
🪛 Ruff (0.14.3)
tests/fault_tolerance/hardware/fault-injection-service/helpers/inference_testing.py
107-107: Do not catch blind exception: Exception
(BLE001)
tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py
11-16: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
tests/fault_tolerance/hardware/fault-injection-service/helpers/k8s_operations.py
58-58: Consider moving this statement to an else block
(TRY300)
60-60: Do not catch blind exception: Exception
(BLE001)
90-90: Consider moving this statement to an else block
(TRY300)
92-92: Do not catch blind exception: Exception
(BLE001)
100-100: Consider moving this statement to an else block
(TRY300)
101-101: Do not catch blind exception: Exception
(BLE001)
187-188: try-except-pass detected, consider logging the exception
(S110)
187-187: Do not catch blind exception: Exception
(BLE001)
193-193: Consider moving this statement to an else block
(TRY300)
195-195: Do not catch blind exception: Exception
(BLE001)
254-254: Consider moving this statement to an else block
(TRY300)
256-256: Do not catch blind exception: Exception
(BLE001)
284-284: Consider moving this statement to an else block
(TRY300)
286-286: Do not catch blind exception: Exception
(BLE001)
340-340: Do not catch blind exception: Exception
(BLE001)
390-390: Consider moving this statement to an else block
(TRY300)
392-392: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (9)
tests/fault_tolerance/hardware/fault-injection-service/helpers/inference_testing.py (5)
22-45: LGTM! Environment detection logic is solid.

The use of `KUBERNETES_SERVICE_HOST` to distinguish in-cluster from local execution is the standard approach, and the endpoint construction is appropriate for both scenarios.
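For reference, this detection pattern usually reduces to a single environment check. A minimal sketch, assuming an illustrative service name, namespace, port, and path (none of these values are taken from the PR):

```python
import os


def get_inference_endpoint_sketch(
    service: str = "dynamo-frontend",  # assumed service name, not from the PR
    namespace: str = "default",        # assumed namespace
    port: int = 8000,                  # assumed port
) -> str:
    """Return an inference URL, preferring the in-cluster DNS name when available."""
    if os.environ.get("KUBERNETES_SERVICE_HOST"):
        # Running inside a pod: use the cluster-internal service DNS name.
        return f"http://{service}.{namespace}.svc.cluster.local:{port}/v1/completions"
    # Running locally (e.g. via port-forward): fall back to localhost.
    return f"http://localhost:{port}/v1/completions"
```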
51-66: LGTM! Thread-safe initialization.

The initialization properly sets up thread-safe result collection with a lock and appropriate default values.

116-122: LGTM! Proper thread-safe background loop.

The loop correctly uses the lock when appending results and implements the expected interval-based request pattern.

124-139: LGTM! Background thread setup is appropriate.

The daemon thread configuration is suitable for fault-tolerance testing where the load tester should not prevent process exit. The early return guards against multiple start calls.
141-153: Verify the thread join timeout is sufficient.

The thread join uses a 5-second timeout but doesn't verify the thread actually stopped. If the join times out, the daemon thread continues running and could append to `results` after `stop()` returns the copy. For the typical 2-second interval, 5 seconds should be adequate. However, if a long-running inference request is in progress, the thread might not stop within the timeout. Consider whether this edge case needs handling, such as logging a warning if the join times out.
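One way to surface that edge case is to warn when the join times out. A minimal sketch, assuming illustrative attribute names (`_thread`, `_stop_event`, `_results`, `_lock`) rather than the PR's actual ones:

```python
import logging
import threading
from typing import Dict, List, Optional

logger = logging.getLogger(__name__)


class LoadTesterStopSketch:
    """Illustrative stop() handling; attribute and method names are assumptions."""

    def __init__(self) -> None:
        self._stop_event = threading.Event()
        self._lock = threading.Lock()
        self._results: List[Dict] = []
        self._thread: Optional[threading.Thread] = None

    def stop(self, join_timeout: float = 5.0) -> List[Dict]:
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join(timeout=join_timeout)
            if self._thread.is_alive():
                # A long-running request may still append one more result
                # after the copy below is taken.
                logger.warning(
                    "Load thread did not stop within %.1fs; results may still grow",
                    join_timeout,
                )
        with self._lock:
            return list(self._results)
```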
tests/fault_tolerance/hardware/fault-injection-service/helpers/k8s_operations.py (4)
31-62: LGTM! Proper cordon implementation with verification.

The method correctly cordons the node and verifies the operation succeeded. The `test.fault-injection/*` label prefix is good practice for tracking test-initiated changes.
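For context, cordoning with the official Kubernetes Python client usually reduces to a `patch_node` call. A minimal sketch that reuses the `test.fault-injection/cordoned` label mentioned in the PR but otherwise makes no claim about the reviewed code:

```python
from kubernetes import client, config


def cordon_node_sketch(node_name: str) -> bool:
    """Mark a node unschedulable and label it as a test-initiated change."""
    config.load_kube_config()  # use config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    body = {
        "spec": {"unschedulable": True},
        "metadata": {"labels": {"test.fault-injection/cordoned": "true"}},
    }
    v1.patch_node(node_name, body)
    # Verify the change took effect, mirroring the verification praised above.
    return bool(v1.read_node(node_name).spec.unschedulable)
```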
96-102: LGTM! Simple and correct cordon status check.

The implementation properly checks the `unschedulable` field and handles both exceptions and None values safely.
212-258: LGTM! Proper pod draining implementation.

The method correctly filters pods by node and label, deletes them with zero grace period (appropriate for fault injection testing), and handles the 404 case for already-deleted pods.
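A minimal sketch of the drain pattern described here (list, filter by node, force-delete, tolerate 404); the helper name is hypothetical and the PR's actual method may differ:

```python
from typing import List

from kubernetes import client
from kubernetes.client.rest import ApiException


def drain_pods_on_node_sketch(
    v1: client.CoreV1Api, node_name: str, namespace: str, label_selector: str
) -> List[str]:
    """Force-delete matching pods on one node (grace period 0, as described)."""
    deleted: List[str] = []
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    for pod in pods.items:
        if pod.spec.node_name != node_name:
            continue
        try:
            v1.delete_namespaced_pod(
                pod.metadata.name, namespace, grace_period_seconds=0
            )
            deleted.append(pod.metadata.name)
        except ApiException as exc:
            if exc.status != 404:  # already deleted is fine during fault injection
                raise
    return deleted
```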
260-288: LGTM! Clean pod distribution calculation.

The method correctly counts only Running pods and builds a proper node-to-count mapping.
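The distribution calculation described here can be sketched with a `Counter`; the helper name is hypothetical:

```python
from collections import Counter
from typing import Dict

from kubernetes import client


def pods_per_node_sketch(
    v1: client.CoreV1Api, namespace: str, label_selector: str
) -> Dict[str, int]:
    """Count Running pods per node for a given label selector."""
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    return dict(
        Counter(
            pod.spec.node_name
            for pod in pods.items
            if pod.status.phase == "Running" and pod.spec.node_name
        )
    )
```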
Resolved review comments (4): inference_testing.py (1), k8s_operations.py (3)
- Track and restore original node state in cordon/uncordon operations
- Detect DaemonSet replacement pods by listing instead of reading by name
- Validate all container statuses for multi-container pod readiness (see the sketch below)
- Return consistent dict schema from get_stats() when empty
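A minimal sketch of the multi-container readiness check referenced above; the helper name is hypothetical and the PR's actual implementation may differ:

```python
from kubernetes import client


def pod_is_ready_sketch(pod: client.V1Pod) -> bool:
    """A pod counts as ready only if every container reports ready."""
    statuses = pod.status.container_statuses or []
    return bool(statuses) and all(cs.ready for cs in statuses)
```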
saturley-hall left a comment
We should choose logging whenever possible so the test verbosity can be scaled up/down.
```python
    namespace=namespace,
    grace_period_seconds=0,
)
print(f"  ✓ Evicted: {pod.metadata.name}")
```
These should be using logging rather than print
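A minimal sketch of the suggested switch to `logging`; the helper name and level handling are illustrative assumptions:

```python
import logging

logger = logging.getLogger(__name__)


def log_eviction(pod_name: str) -> None:
    # logger.info replaces the print call; the harness can then scale verbosity
    # via the logger level, e.g. logging.basicConfig(level=logging.DEBUG).
    logger.info("Evicted: %s", pod_name)
```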
```python
            node_name: If provided, only get pods on this node

        Returns:
            List of dicts with pod name, state, and reason
```
This also includes the node that the pod is on. The reason that it is in that state is less certain to me.
This one overlaps with our testing support in the existing tests - we can consolidate now or later.
nnshah1 left a comment
The k8s_operations are a good set - we'd want to converge them into faults within the test folder.
…g, .debug, etc) + fixed docstring

Signed-off-by: Oviya Seeniraj <[email protected]>
Overview:
This PR adds core Kubernetes and inference testing utilities that support fault tolerance E2E tests. These are independent, reusable helpers for node operations, pod management, and continuous load generation.
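A hypothetical end-to-end usage sketch of these helpers; the constructor arguments and method names below are assumptions drawn from the description, not the PR's exact API:

```python
# Hypothetical test flow; signatures and node name are illustrative assumptions.
from helpers import InferenceLoadTester, NodeOperations, get_inference_endpoint

nodes = NodeOperations()
tester = InferenceLoadTester(endpoint=get_inference_endpoint())

tester.start()                  # begin continuous background inference load
nodes.cordon_node("worker-1")   # inject the fault on an assumed node name
# ... wait for pods to reschedule, then assert on behaviour ...
nodes.uncordon_node("worker-1")
results = tester.stop()
assert any(r["success"] for r in results)
```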
Details:
- helpers/k8s_operations.py: Kubernetes node and pod operations
  - `NodeOperations` class: Cordon/uncordon nodes, check readiness, wait for scheduling
  - `PodOperations` class: Drain pods, delete with grace periods, monitor status
- helpers/inference_testing.py: Continuous load generation and metrics
  - `InferenceLoadTester` class: Background thread for continuous inference requests

Where should the reviewer start?
- k8s_operations.py:
  - `NodeOperations` class - Node cordoning, uncordoning, readiness checks
  - `PodOperations` class - Pod draining, deletion, status monitoring
  - `test.fault-injection/cordoned` label to track test-initiated changes
- inference_testing.py:
  - `get_inference_endpoint()` - Auto-detects in-cluster vs local environment
  - `InferenceLoadTester` class - Background load generation with threading

Key design decisions:
- Environment detection (`KUBERNETES_SERVICE_HOST`) for portability

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
Release Notes