
Conversation

@nv-oviya nv-oviya commented Nov 2, 2025

Overview:

This PR adds core Kubernetes and inference testing utilities that support fault tolerance E2E tests. These are independent, reusable helpers for node operations, pod management, and continuous load generation.

Details:

  • Add helpers/k8s_operations.py: Kubernetes node and pod operations
    • NodeOperations class: Cordon/uncordon nodes, check readiness, wait for scheduling
    • PodOperations class: Drain pods, delete with grace periods, monitor status
    • Label and taint management for fault injection scenarios
    • Wait utilities with timeout handling
  • Add helpers/inference_testing.py: Continuous load generation and metrics
    • InferenceLoadTester class: Background thread for continuous inference requests
    • Request statistics tracking (success/failure counts, latencies)
    • Supports both in-cluster (Service DNS) and local (port-forward) execution
    • Configurable request intervals and timeouts
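The statistics-tracking side described above can be sketched roughly as follows. This is an illustrative model only: the class name, field names, and `get_stats()` schema here are assumptions, not the PR's actual `InferenceLoadTester` API.

```python
import threading


class RequestStats:
    """Thread-safe accumulator for request results (illustrative sketch)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._results: list[dict] = []

    def record(self, success: bool, latency: float) -> None:
        # Called from the background load thread, hence the lock.
        with self._lock:
            self._results.append({"success": success, "latency": latency})

    def get_stats(self) -> dict:
        with self._lock:
            results = list(self._results)
        if not results:
            # Keep a consistent schema even before any request completes.
            return {"total": 0, "successes": 0, "failures": 0,
                    "success_rate": 0.0, "avg_latency": 0.0}
        successes = sum(1 for r in results if r["success"])
        return {
            "total": len(results),
            "successes": successes,
            "failures": len(results) - successes,
            "success_rate": successes / len(results),
            "avg_latency": sum(r["latency"] for r in results) / len(results),
        }
```

Returning the same dict keys in the empty case keeps downstream assertions simple, which matters for fault-tolerance tests that may query stats before the first request lands.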

Where should the reviewer start?

  1. k8s_operations.py:

    • NodeOperations class - Node cordoning, uncordoning, readiness checks
    • PodOperations class - Pod draining, deletion, status monitoring
    • Note: Uses labels like test.fault-injection/cordoned to track test-initiated changes
  2. inference_testing.py:

    • get_inference_endpoint() - Auto-detects in-cluster vs local environment
    • InferenceLoadTester class - Background load generation with threading
    • Request sending with retry logic and error tracking
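The cordon-tracking convention from item 1 can be sketched as a patch-body builder. The helper name below is hypothetical; only the `test.fault-injection/cordoned` label key comes from the PR description, and the real `NodeOperations` implementation may construct its request differently.

```python
def build_cordon_patch(test_initiated: bool = True) -> dict:
    """Build a strategic-merge patch that cordons a node.

    When test_initiated is True, a marker label is attached so that
    uncordon logic can skip nodes that were cordoned by operators
    rather than by the test suite.
    """
    patch: dict = {"spec": {"unschedulable": True}}
    if test_initiated:
        patch["metadata"] = {
            "labels": {"test.fault-injection/cordoned": "true"}
        }
    return patch


# With the kubernetes Python client this would be applied roughly as:
#   client.CoreV1Api().patch_node(node_name, build_cordon_patch())
```

Keeping the label in the same patch as `spec.unschedulable` makes the cordon-plus-marker update a single API call rather than two, avoiding a window where the node is cordoned but unlabeled.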

Key design decisions:

  • Node operations use labels to distinguish test cordons from production cordons
  • Load tester runs in a background thread so it does not block test execution
  • Environment auto-detection (checks KUBERNETES_SERVICE_HOST) for portability
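The auto-detection decision can be sketched as follows. The service name, namespace, port, and path defaults here are illustrative assumptions; the PR's actual `get_inference_endpoint()` may use different values.

```python
import os


def get_inference_endpoint(
    service: str = "inference-svc",   # assumed service name
    namespace: str = "default",
    port: int = 8000,
    path: str = "/v1/completions",
) -> str:
    """Pick an endpoint based on where the test is running.

    KUBERNETES_SERVICE_HOST is injected into every pod by the kubelet,
    so its presence means the Service DNS name is reachable directly;
    otherwise assume a local `kubectl port-forward` on localhost.
    """
    if os.environ.get("KUBERNETES_SERVICE_HOST"):
        host = f"{service}.{namespace}.svc.cluster.local"
    else:
        host = "localhost"
    return f"http://{host}:{port}{path}"
```

Checking the environment variable rather than attempting a connection keeps the detection instant and side-effect free, which suits a helper called once at test setup.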

Related Issues:

  • Relates to: Fault tolerance testing infrastructure

Summary by CodeRabbit

Release Notes

  • Tests
    • Added fault-tolerance testing framework with inference load testing capabilities including continuous request simulation, latency tracking, and performance statistics.
    • Added Kubernetes operations utilities for node management (cordoning, GPU driver restarts) and pod operations (draining, distribution analysis, readiness monitoring).

- k8s_operations.py: Node cordoning, pod draining, status checks
- inference_testing.py: Continuous load generation and metrics

Independent utilities for fault tolerance testing workflows.

copy-pr-bot bot commented Nov 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nv-oviya nv-oviya marked this pull request as ready for review November 4, 2025 00:21
@nv-oviya nv-oviya requested review from a team as code owners November 4, 2025 00:21
Contributor

coderabbitai bot commented Nov 4, 2025

Walkthrough

This PR introduces a new test helper package for fault-tolerance testing, adding infrastructure to support fault-injection scenarios. Three files are created: a package initializer, an inference load testing module with endpoint selection and load generation, and a Kubernetes operations module providing node and pod management utilities.

Changes

Cohort / File(s) Summary
Package Initialization
tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py
Exposes public utilities via __all__: InferenceLoadTester, get_inference_endpoint, NodeOperations, and PodOperations by re-exporting from submodules.
Inference Testing Utilities
tests/fault_tolerance/hardware/fault-injection-service/helpers/inference_testing.py
Adds get_inference_endpoint() for selecting endpoints based on Kubernetes environment context, and InferenceLoadTester class providing continuous inference request generation with result collection, latency tracking, statistics computation, and thread-based background loop control.
Kubernetes Operations
tests/fault_tolerance/hardware/fault-injection-service/helpers/k8s_operations.py
Introduces NodeOperations class for node cordoning, uncordoning, cordon state checking, and GPU driver restart; and PodOperations class for pod draining, distribution analysis, readiness waiting with optional node exclusion, and status detail retrieval.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Thread-safety implementation in InferenceLoadTester._load_loop() and result accumulation
  • Kubernetes API error handling and state management in node/pod operations
  • Timeout and polling logic in PodOperations.wait_for_pods_ready() with exclusion handling
  • Edge cases in result aggregation and statistics computation (get_stats())

Poem

🐰 A rabbit hops through fault-injection dreams,
With load testers and k8s schemes,
Node cordoning, pods that wait,
Testing faults at rapid rate,
New helpers make the chaos supreme!

Pre-merge checks

✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change—adding core testing helper utilities for fault-injection scenarios, directly matching the PR's primary objective.
Description check ✅ Passed The description comprehensively covers all template sections with clear overview, detailed changes across both helper modules, specific reviewer guidance, and related issue context.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (3)
tests/fault_tolerance/hardware/fault-injection-service/helpers/inference_testing.py (1)

107-114: Refactor latency calculation in exception handler.

The if "start_time" in locals() check is a code smell. If an exception occurs before start_time is assigned (line 79), the fallback to 0 is triggered, but this approach is fragile.

Apply this diff to initialize start_time before the try block:

     def send_inference_request(self, prompt: str = "Hello, world!") -> Dict:
         """
         Send a single inference request and return result.

         Args:
             prompt: Text prompt for inference

         Returns:
             Dict with keys: success, status_code, latency, timestamp, error
         """
+        start_time = time.time()
         try:
-            start_time = time.time()
             response = requests.post(
                 self.endpoint,
                 json={
                     "model": self.model_name,
                     "prompt": prompt,
                     "max_tokens": 50,
                     "temperature": 0.7,
                 },
                 timeout=self.timeout,
             )
             latency = time.time() - start_time

             return {
                 "success": response.status_code == 200,
                 "status_code": response.status_code,
                 "latency": latency,
                 "timestamp": time.time(),
                 "error": None if response.status_code == 200 else response.text[:200],
             }
         except requests.exceptions.Timeout:
             return {
                 "success": False,
                 "status_code": None,
                 "latency": self.timeout,
                 "timestamp": time.time(),
                 "error": "Request timeout",
             }
         except Exception as e:
             return {
                 "success": False,
                 "status_code": None,
-                "latency": time.time() - start_time if "start_time" in locals() else 0,
+                "latency": time.time() - start_time,
                 "timestamp": time.time(),
                 "error": str(e)[:200],
             }
tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py (1)

11-16: Consider sorting __all__ for consistency.

While not critical, maintaining alphabetical order in __all__ improves readability and aligns with Python style conventions.

Apply this diff to sort the list:

 __all__ = [
+    "get_inference_endpoint",
     "InferenceLoadTester",
-    "get_inference_endpoint",
     "NodeOperations",
     "PodOperations",
 ]
tests/fault_tolerance/hardware/fault-injection-service/helpers/k8s_operations.py (1)

347-394: Consider reporting status for all containers.

Line 376 only examines the first container's status. For pods with multiple containers, this provides an incomplete view. Consider either documenting this limitation or iterating through all containers to provide complete status information.

If comprehensive status is desired, apply this diff:

             details = []
             for pod in pods.items:
                 pod_name = pod.metadata.name
                 node = pod.spec.node_name

                 if pod.status.container_statuses:
-                    cs = pod.status.container_statuses[0]
-                    if cs.state.waiting:
-                        state = cs.state.waiting.reason
-                    elif cs.state.terminated:
-                        state = f"Terminated ({cs.state.terminated.reason})"
-                    elif cs.state.running:
-                        state = "Running"
-                    else:
-                        state = "Unknown"
+                    # Report status of all containers
+                    states = []
+                    for cs in pod.status.container_statuses:
+                        if cs.state.waiting:
+                            states.append(f"{cs.name}: {cs.state.waiting.reason}")
+                        elif cs.state.terminated:
+                            states.append(f"{cs.name}: Terminated ({cs.state.terminated.reason})")
+                        elif cs.state.running:
+                            states.append(f"{cs.name}: Running")
+                        else:
+                            states.append(f"{cs.name}: Unknown")
+                    state = ", ".join(states)
                 else:
                     state = f"{pod.status.phase} (no container status)"

                 details.append({"name": pod_name, "node": node, "state": state})

             return details

Alternatively, document that only the first container is reported:

     def get_pod_status_details(
         self, namespace: str, label_selector: str, node_name: Optional[str] = None
     ) -> List[Dict]:
         """
         Get detailed status for each pod.

         Args:
             namespace: Kubernetes namespace
             label_selector: Label selector for pods
             node_name: If provided, only get pods on this node

         Returns:
-            List of dicts with pod name, state, and reason
+            List of dicts with pod name, node, and state (first container only for multi-container pods)
         """
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d76583 and 8375ed6.

📒 Files selected for processing (3)
  • tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py (1 hunks)
  • tests/fault_tolerance/hardware/fault-injection-service/helpers/inference_testing.py (1 hunks)
  • tests/fault_tolerance/hardware/fault-injection-service/helpers/k8s_operations.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3193
File: tests/fault_tolerance/cancellation/test_trtllm.py:4-204
Timestamp: 2025-09-25T00:54:01.369Z
Learning: The fault tolerance tests in tests/fault_tolerance/cancellation/ run in a controlled container environment where files written to /workspace are automatically cleaned up after test completion, and tests execute sequentially without concurrency concerns, so temporary file management for config files is not necessary.
Learnt from: nnshah1
Repo: ai-dynamo/dynamo PR: 1444
File: tests/fault_tolerance/scenarios.py:57-57
Timestamp: 2025-07-01T15:39:56.789Z
Learning: The fault tolerance tests in tests/fault_tolerance/ are designed to run only in the mounted container environment, so hardcoded absolute paths with `/workspace/` prefix are intentional and should not be changed to relative paths.
🧬 Code graph analysis (1)
tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py (2)
tests/fault_tolerance/hardware/fault-injection-service/helpers/inference_testing.py (2)
  • InferenceLoadTester (48-180)
  • get_inference_endpoint (22-45)
tests/fault_tolerance/hardware/fault-injection-service/helpers/k8s_operations.py (2)
  • NodeOperations (19-197)
  • PodOperations (200-394)
🪛 Ruff (0.14.3)
tests/fault_tolerance/hardware/fault-injection-service/helpers/inference_testing.py

107-107: Do not catch blind exception: Exception

(BLE001)

tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py

11-16: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

tests/fault_tolerance/hardware/fault-injection-service/helpers/k8s_operations.py

58-58: Consider moving this statement to an else block

(TRY300)


60-60: Do not catch blind exception: Exception

(BLE001)


90-90: Consider moving this statement to an else block

(TRY300)


92-92: Do not catch blind exception: Exception

(BLE001)


100-100: Consider moving this statement to an else block

(TRY300)


101-101: Do not catch blind exception: Exception

(BLE001)


187-188: try-except-pass detected, consider logging the exception

(S110)


187-187: Do not catch blind exception: Exception

(BLE001)


193-193: Consider moving this statement to an else block

(TRY300)


195-195: Do not catch blind exception: Exception

(BLE001)


254-254: Consider moving this statement to an else block

(TRY300)


256-256: Do not catch blind exception: Exception

(BLE001)


284-284: Consider moving this statement to an else block

(TRY300)


286-286: Do not catch blind exception: Exception

(BLE001)


340-340: Do not catch blind exception: Exception

(BLE001)


390-390: Consider moving this statement to an else block

(TRY300)


392-392: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (9)
tests/fault_tolerance/hardware/fault-injection-service/helpers/inference_testing.py (5)

22-45: LGTM! Environment detection logic is solid.

The use of KUBERNETES_SERVICE_HOST to distinguish in-cluster from local execution is the standard approach, and the endpoint construction is appropriate for both scenarios.


51-66: LGTM! Thread-safe initialization.

The initialization properly sets up thread-safe result collection with a lock and appropriate default values.


116-122: LGTM! Proper thread-safe background loop.

The loop correctly uses the lock when appending results and implements the expected interval-based request pattern.


124-139: LGTM! Background thread setup is appropriate.

The daemon thread configuration is suitable for fault-tolerance testing where the load tester should not prevent process exit. The early return guards against multiple start calls.


141-153: Verify thread join timeout is sufficient.

The thread join uses a 5-second timeout but doesn't verify the thread actually stopped. If the join times out, the daemon thread continues running and could append to results after stop() returns the copy.

For the typical 2-second interval, 5 seconds should be adequate. However, if a long-running inference request is in progress, the thread might not stop within the timeout. Consider whether this edge case needs handling, such as logging a warning if the join times out.
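One way to handle this edge case, sketched under the assumption of an Event-driven loop (not necessarily how the PR implements its load loop), is to have stop() report whether the join succeeded and log a warning otherwise:

```python
import logging
import threading
import time

logger = logging.getLogger(__name__)


class StoppableLoop:
    """Illustrative background loop whose stop() warns on join timeout."""

    def __init__(self, interval: float = 2.0) -> None:
        self._interval = interval
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        while not self._stop_event.is_set():
            # A real load loop would send an inference request here.
            # Event.wait() doubles as an interruptible sleep, so stop()
            # does not have to wait out a full interval.
            self._stop_event.wait(self._interval)

    def start(self) -> None:
        self._thread.start()

    def stop(self, join_timeout: float = 5.0) -> bool:
        """Signal the loop to exit; return False if it did not stop in time."""
        self._stop_event.set()
        self._thread.join(timeout=join_timeout)
        if self._thread.is_alive():
            logger.warning("load loop did not stop within %.1fs", join_timeout)
            return False
        return True
```

Returning a boolean lets the caller decide whether lingering results after stop() are tolerable for a given test, while the warning preserves a trace when they are not.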

tests/fault_tolerance/hardware/fault-injection-service/helpers/k8s_operations.py (4)

31-62: LGTM! Proper cordon implementation with verification.

The method correctly cordons the node and verifies the operation succeeded. The test.fault-injection/* label prefix is good practice for tracking test-initiated changes.


96-102: LGTM! Simple and correct cordon status check.

The implementation properly checks the unschedulable field and handles both exceptions and None values safely.


212-258: LGTM! Proper pod draining implementation.

The method correctly filters pods by node and label, deletes them with zero grace period (appropriate for fault injection testing), and handles the 404 case for already-deleted pods.


260-288: LGTM! Clean pod distribution calculation.

The method correctly counts only Running pods and builds a proper node-to-count mapping.

- Track and restore original node state in cordon/uncordon operations
- Detect DaemonSet replacement pods by listing instead of reading by name
- Validate all container statuses for multi-container pod readiness
- Return consistent dict schema from get_stats() when empty
@nv-oviya nv-oviya marked this pull request as draft November 4, 2025 00:36
@nv-oviya nv-oviya marked this pull request as ready for review November 4, 2025 01:27
@saturley-hall saturley-hall (Member) left a comment

We should choose logging whenever possible so the test verbosity can be scaled up/down.

namespace=namespace,
grace_period_seconds=0,
)
print(f" ✓ Evicted: {pod.metadata.name}")
(Member)
These should be using logging rather than print

node_name: If provided, only get pods on this node

Returns:
List of dicts with pod name, state, and reason
(Member)

This also includes the node that it is on. The reason that it is in the state is less certain to me.

(Contributor)

this one overlaps with our testing support in the existing tests - we can consolidate now or later ...

@nnshah1 nnshah1 (Contributor) left a comment

the k8s_operations are a good set - we'd want to converge them into faults within the test folder

…g, .debug, etc) + fixed docstring

Signed-off-by: Oviya Seeniraj <[email protected]>
@saturley-hall saturley-hall merged commit 011a200 into main Nov 26, 2025
10 checks passed
@saturley-hall saturley-hall deleted the oviya/fault-injection/helper-utilities branch November 26, 2025 18:09
zxue2 pushed a commit to zxue2/dynamo that referenced this pull request Dec 11, 2025