
Conversation

@nv-oviya nv-oviya commented Nov 2, 2025

Overview:

This PR adds core Kubernetes and inference testing utilities that support fault tolerance E2E tests. These are independent, reusable helpers for node operations, pod management, and continuous load generation.

Details:

  • Add helpers/k8s_operations.py: Kubernetes node and pod operations
    • NodeOperations class: Cordon/uncordon nodes, check readiness, wait for scheduling
    • PodOperations class: Drain pods, delete with grace periods, monitor status
    • Label and taint management for fault injection scenarios
    • Wait utilities with timeout handling
  • Add helpers/inference_testing.py: Continuous load generation and metrics
    • InferenceLoadTester class: Background thread for continuous inference requests
    • Request statistics tracking (success/failure counts, latencies)
    • Supports both in-cluster (Service DNS) and local (port-forward) execution
    • Configurable request intervals and timeouts
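The statistics-tracking side described above can be sketched roughly as follows. This is an illustrative model only: the class name, field names, and `get_stats()` schema here are assumptions, not the PR's actual `InferenceLoadTester` API.

```python
import threading


class RequestStats:
    """Thread-safe accumulator for request results (illustrative sketch)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._results: list[dict] = []

    def record(self, success: bool, latency: float) -> None:
        # Called from the background load thread, hence the lock.
        with self._lock:
            self._results.append({"success": success, "latency": latency})

    def get_stats(self) -> dict:
        with self._lock:
            results = list(self._results)
        if not results:
            # Keep a consistent schema even before any request completes.
            return {"total": 0, "successes": 0, "failures": 0,
                    "success_rate": 0.0, "avg_latency": 0.0}
        successes = sum(1 for r in results if r["success"])
        return {
            "total": len(results),
            "successes": successes,
            "failures": len(results) - successes,
            "success_rate": successes / len(results),
            "avg_latency": sum(r["latency"] for r in results) / len(results),
        }
```

Returning the same dict keys in the empty case keeps downstream assertions simple, which matters for fault-tolerance tests that may query stats before the first request lands.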

Where should the reviewer start?

  1. k8s_operations.py:

    • NodeOperations class - Node cordoning, uncordoning, readiness checks
    • PodOperations class - Pod draining, deletion, status monitoring
    • Note: Uses labels like test.fault-injection/cordoned to track test-initiated changes
  2. inference_testing.py:

    • get_inference_endpoint() - Auto-detects in-cluster vs local environment
    • InferenceLoadTester class - Background load generation with threading
    • Request sending with retry logic and error tracking
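The cordon-tracking convention from item 1 can be sketched as a patch-body builder. The helper name below is hypothetical; only the `test.fault-injection/cordoned` label key comes from the PR description, and the real `NodeOperations` implementation may construct its request differently.

```python
def build_cordon_patch(test_initiated: bool = True) -> dict:
    """Build a strategic-merge patch that cordons a node.

    When test_initiated is True, a marker label is attached so that
    uncordon logic can skip nodes that were cordoned by operators
    rather than by the test suite.
    """
    patch: dict = {"spec": {"unschedulable": True}}
    if test_initiated:
        patch["metadata"] = {
            "labels": {"test.fault-injection/cordoned": "true"}
        }
    return patch


# With the kubernetes Python client this would be applied roughly as:
#   client.CoreV1Api().patch_node(node_name, build_cordon_patch())
```

Keeping the label in the same patch as `spec.unschedulable` makes the cordon-plus-marker update a single API call rather than two, avoiding a window where the node is cordoned but unlabeled.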

Key design decisions:

  • Node operations use labels to distinguish test cordons from production cordons
  • Load tester runs in a background thread so it does not block test execution
  • Environment auto-detection (checks KUBERNETES_SERVICE_HOST) for portability
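The auto-detection decision can be sketched as follows. The service name, namespace, port, and path defaults here are illustrative assumptions; the PR's actual `get_inference_endpoint()` may use different values.

```python
import os


def get_inference_endpoint(
    service: str = "inference-svc",   # assumed service name
    namespace: str = "default",
    port: int = 8000,
    path: str = "/v1/completions",
) -> str:
    """Pick an endpoint based on where the test is running.

    KUBERNETES_SERVICE_HOST is injected into every pod by the kubelet,
    so its presence means the Service DNS name is reachable directly;
    otherwise assume a local `kubectl port-forward` on localhost.
    """
    if os.environ.get("KUBERNETES_SERVICE_HOST"):
        host = f"{service}.{namespace}.svc.cluster.local"
    else:
        host = "localhost"
    return f"http://{host}:{port}{path}"
```

Checking the environment variable rather than attempting a connection keeps the detection instant and side-effect free, which suits a helper called once at test setup.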

Related Issues:

  • Relates to: Fault tolerance testing infrastructure

Summary by CodeRabbit

Release Notes

  • Tests
    • Added fault-tolerance testing framework with inference load testing capabilities including continuous request simulation, latency tracking, and performance statistics.
    • Added Kubernetes operations utilities for node management (cordoning, GPU driver restarts) and pod operations (draining, distribution analysis, readiness monitoring).

- k8s_operations.py: Node cordoning, pod draining, status checks
- inference_testing.py: Continuous load generation and metrics

Independent utilities for fault tolerance testing workflows.

copy-pr-bot bot commented Nov 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nv-oviya nv-oviya marked this pull request as ready for review November 4, 2025 00:21
@nv-oviya nv-oviya requested review from a team as code owners November 4, 2025 00:21
Contributor

coderabbitai bot commented Nov 4, 2025

Walkthrough

This PR introduces a new test helper package for fault-tolerance testing, adding infrastructure to support fault-injection scenarios. Three files are created: a package initializer, an inference load testing module with endpoint selection and load generation, and a Kubernetes operations module providing node and pod management utilities.

Changes

Cohort / File(s) Summary
Package Initialization
tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py
Exposes public utilities via __all__: InferenceLoadTester, get_inference_endpoint, NodeOperations, and PodOperations by re-exporting from submodules.
Inference Testing Utilities
tests/fault_tolerance/hardware/fault-injection-service/helpers/inference_testing.py
Adds get_inference_endpoint() for selecting endpoints based on Kubernetes environment context, and InferenceLoadTester class providing continuous inference request generation with result collection, latency tracking, statistics computation, and thread-based background loop control.
Kubernetes Operations
tests/fault_tolerance/hardware/fault-injection-service/helpers/k8s_operations.py
Introduces NodeOperations class for node cordoning, uncordoning, cordon state checking, and GPU driver restart; and PodOperations class for pod draining, distribution analysis, readiness waiting with optional node exclusion, and status detail retrieval.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Thread-safety implementation in InferenceLoadTester._load_loop() and result accumulation
  • Kubernetes API error handling and state management in node/pod operations
  • Timeout and polling logic in PodOperations.wait_for_pods_ready() with exclusion handling
  • Edge cases in result aggregation and statistics computation (get_stats())

Poem

🐰 A rabbit hops through fault-injection dreams,
With load testers and k8s schemes,
Node cordoning, pods that wait,
Testing faults at rapid rate,
New helpers make the chaos supreme!

Pre-merge checks

✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change—adding core testing helper utilities for fault-injection scenarios, directly matching the PR's primary objective.
Description check ✅ Passed The description comprehensively covers all template sections with clear overview, detailed changes across both helper modules, specific reviewer guidance, and related issue context.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (3)
tests/fault_tolerance/hardware/fault-injection-service/helpers/inference_testing.py (1)

107-114: Refactor latency calculation in exception handler.

The if "start_time" in locals() check is a code smell. If an exception occurs before start_time is assigned (line 79), the fallback to 0 is triggered, but this approach is fragile.

Apply this diff to initialize start_time before the try block:

     def send_inference_request(self, prompt: str = "Hello, world!") -> Dict:
         """
         Send a single inference request and return result.

         Args:
             prompt: Text prompt for inference

         Returns:
             Dict with keys: success, status_code, latency, timestamp, error
         """
+        start_time = time.time()
         try:
-            start_time = time.time()
             response = requests.post(
                 self.endpoint,
                 json={
                     "model": self.model_name,
                     "prompt": prompt,
                     "max_tokens": 50,
                     "temperature": 0.7,
                 },
                 timeout=self.timeout,
             )
             latency = time.time() - start_time

             return {
                 "success": response.status_code == 200,
                 "status_code": response.status_code,
                 "latency": latency,
                 "timestamp": time.time(),
                 "error": None if response.status_code == 200 else response.text[:200],
             }
         except requests.exceptions.Timeout:
             return {
                 "success": False,
                 "status_code": None,
                 "latency": self.timeout,
                 "timestamp": time.time(),
                 "error": "Request timeout",
             }
         except Exception as e:
             return {
                 "success": False,
                 "status_code": None,
-                "latency": time.time() - start_time if "start_time" in locals() else 0,
+                "latency": time.time() - start_time,
                 "timestamp": time.time(),
                 "error": str(e)[:200],
             }
tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py (1)

11-16: Consider sorting __all__ for consistency.

While not critical, maintaining alphabetical order in __all__ improves readability and aligns with Python style conventions.

Apply this diff to sort the list:

 __all__ = [
+    "get_inference_endpoint",
     "InferenceLoadTester",
-    "get_inference_endpoint",
     "NodeOperations",
     "PodOperations",
 ]
tests/fault_tolerance/hardware/fault-injection-service/helpers/k8s_operations.py (1)

347-394: Consider reporting status for all containers.

Line 376 only examines the first container's status. For pods with multiple containers, this provides an incomplete view. Consider either documenting this limitation or iterating through all containers to provide complete status information.

If comprehensive status is desired, apply this diff:

             details = []
             for pod in pods.items:
                 pod_name = pod.metadata.name
                 node = pod.spec.node_name

                 if pod.status.container_statuses:
-                    cs = pod.status.container_statuses[0]
-                    if cs.state.waiting:
-                        state = cs.state.waiting.reason
-                    elif cs.state.terminated:
-                        state = f"Terminated ({cs.state.terminated.reason})"
-                    elif cs.state.running:
-                        state = "Running"
-                    else:
-                        state = "Unknown"
+                    # Report status of all containers
+                    states = []
+                    for cs in pod.status.container_statuses:
+                        if cs.state.waiting:
+                            states.append(f"{cs.name}: {cs.state.waiting.reason}")
+                        elif cs.state.terminated:
+                            states.append(f"{cs.name}: Terminated ({cs.state.terminated.reason})")
+                        elif cs.state.running:
+                            states.append(f"{cs.name}: Running")
+                        else:
+                            states.append(f"{cs.name}: Unknown")
+                    state = ", ".join(states)
                 else:
                     state = f"{pod.status.phase} (no container status)"

                 details.append({"name": pod_name, "node": node, "state": state})

             return details

Alternatively, document that only the first container is reported:

     def get_pod_status_details(
         self, namespace: str, label_selector: str, node_name: Optional[str] = None
     ) -> List[Dict]:
         """
         Get detailed status for each pod.

         Args:
             namespace: Kubernetes namespace
             label_selector: Label selector for pods
             node_name: If provided, only get pods on this node

         Returns:
-            List of dicts with pod name, state, and reason
+            List of dicts with pod name, node, and state (first container only for multi-container pods)
         """
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d76583 and 8375ed6.

📒 Files selected for processing (3)
  • tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py (1 hunks)
  • tests/fault_tolerance/hardware/fault-injection-service/helpers/inference_testing.py (1 hunks)
  • tests/fault_tolerance/hardware/fault-injection-service/helpers/k8s_operations.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3193
File: tests/fault_tolerance/cancellation/test_trtllm.py:4-204
Timestamp: 2025-09-25T00:54:01.369Z
Learning: The fault tolerance tests in tests/fault_tolerance/cancellation/ run in a controlled container environment where files written to /workspace are automatically cleaned up after test completion, and tests execute sequentially without concurrency concerns, so temporary file management for config files is not necessary.
Learnt from: nnshah1
Repo: ai-dynamo/dynamo PR: 1444
File: tests/fault_tolerance/scenarios.py:57-57
Timestamp: 2025-07-01T15:39:56.789Z
Learning: The fault tolerance tests in tests/fault_tolerance/ are designed to run only in the mounted container environment, so hardcoded absolute paths with `/workspace/` prefix are intentional and should not be changed to relative paths.
🧬 Code graph analysis (1)
tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py (2)
tests/fault_tolerance/hardware/fault-injection-service/helpers/inference_testing.py (2)
  • InferenceLoadTester (48-180)
  • get_inference_endpoint (22-45)
tests/fault_tolerance/hardware/fault-injection-service/helpers/k8s_operations.py (2)
  • NodeOperations (19-197)
  • PodOperations (200-394)
🪛 Ruff (0.14.3)
tests/fault_tolerance/hardware/fault-injection-service/helpers/inference_testing.py

107-107: Do not catch blind exception: Exception

(BLE001)

tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py

11-16: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

tests/fault_tolerance/hardware/fault-injection-service/helpers/k8s_operations.py

58-58: Consider moving this statement to an else block

(TRY300)


60-60: Do not catch blind exception: Exception

(BLE001)


90-90: Consider moving this statement to an else block

(TRY300)


92-92: Do not catch blind exception: Exception

(BLE001)


100-100: Consider moving this statement to an else block

(TRY300)


101-101: Do not catch blind exception: Exception

(BLE001)


187-188: try-except-pass detected, consider logging the exception

(S110)


187-187: Do not catch blind exception: Exception

(BLE001)


193-193: Consider moving this statement to an else block

(TRY300)


195-195: Do not catch blind exception: Exception

(BLE001)


254-254: Consider moving this statement to an else block

(TRY300)


256-256: Do not catch blind exception: Exception

(BLE001)


284-284: Consider moving this statement to an else block

(TRY300)


286-286: Do not catch blind exception: Exception

(BLE001)


340-340: Do not catch blind exception: Exception

(BLE001)


390-390: Consider moving this statement to an else block

(TRY300)


392-392: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (9)
tests/fault_tolerance/hardware/fault-injection-service/helpers/inference_testing.py (5)

22-45: LGTM! Environment detection logic is solid.

The use of KUBERNETES_SERVICE_HOST to distinguish in-cluster from local execution is the standard approach, and the endpoint construction is appropriate for both scenarios.


51-66: LGTM! Thread-safe initialization.

The initialization properly sets up thread-safe result collection with a lock and appropriate default values.


116-122: LGTM! Proper thread-safe background loop.

The loop correctly uses the lock when appending results and implements the expected interval-based request pattern.


124-139: LGTM! Background thread setup is appropriate.

The daemon thread configuration is suitable for fault-tolerance testing where the load tester should not prevent process exit. The early return guards against multiple start calls.


141-153: Verify thread join timeout is sufficient.

The thread join uses a 5-second timeout but doesn't verify the thread actually stopped. If the join times out, the daemon thread continues running and could append to results after stop() returns the copy.

For the typical 2-second interval, 5 seconds should be adequate. However, if a long-running inference request is in progress, the thread might not stop within the timeout. Consider whether this edge case needs handling, such as logging a warning if the join times out.
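One way to handle this edge case, sketched under the assumption of an Event-driven loop (not necessarily how the PR implements its load loop), is to have stop() report whether the join succeeded and log a warning otherwise:

```python
import logging
import threading
import time

logger = logging.getLogger(__name__)


class StoppableLoop:
    """Illustrative background loop whose stop() warns on join timeout."""

    def __init__(self, interval: float = 2.0) -> None:
        self._interval = interval
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        while not self._stop_event.is_set():
            # A real load loop would send an inference request here.
            # Event.wait() doubles as an interruptible sleep, so stop()
            # does not have to wait out a full interval.
            self._stop_event.wait(self._interval)

    def start(self) -> None:
        self._thread.start()

    def stop(self, join_timeout: float = 5.0) -> bool:
        """Signal the loop to exit; return False if it did not stop in time."""
        self._stop_event.set()
        self._thread.join(timeout=join_timeout)
        if self._thread.is_alive():
            logger.warning("load loop did not stop within %.1fs", join_timeout)
            return False
        return True
```

Returning a boolean lets the caller decide whether lingering results after stop() are tolerable for a given test, while the warning preserves a trace when they are not.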

tests/fault_tolerance/hardware/fault-injection-service/helpers/k8s_operations.py (4)

31-62: LGTM! Proper cordon implementation with verification.

The method correctly cordons the node and verifies the operation succeeded. The test.fault-injection/* label prefix is good practice for tracking test-initiated changes.


96-102: LGTM! Simple and correct cordon status check.

The implementation properly checks the unschedulable field and handles both exceptions and None values safely.


212-258: LGTM! Proper pod draining implementation.

The method correctly filters pods by node and label, deletes them with zero grace period (appropriate for fault injection testing), and handles the 404 case for already-deleted pods.


260-288: LGTM! Clean pod distribution calculation.

The method correctly counts only Running pods and builds a proper node-to-count mapping.

- Track and restore original node state in cordon/uncordon operations
- Detect DaemonSet replacement pods by listing instead of reading by name
- Validate all container statuses for multi-container pod readiness
- Return consistent dict schema from get_stats() when empty
@nv-oviya nv-oviya marked this pull request as draft November 4, 2025 00:36
@nv-oviya nv-oviya marked this pull request as ready for review November 4, 2025 01:27
@saturley-hall saturley-hall (Member) left a comment

We should choose logging whenever possible so the test verbosity can be scaled up/down.

namespace=namespace,
grace_period_seconds=0,
)
print(f" ✓ Evicted: {pod.metadata.name}")
(Member)
These should be using logging rather than print

node_name: If provided, only get pods on this node

Returns:
List of dicts with pod name, state, and reason
(Member)

This also includes the node that it is on. The reason that it is in the state is less certain to me.

(Contributor)

this one overlaps with our testing support in the existing tests - we can consolidate now or later ...

@nnshah1 nnshah1 (Contributor) left a comment

the k8s_operations are a good set - we'd want to converge them into faults within the test folder

…g, .debug, etc) + fixed docstring

Signed-off-by: Oviya Seeniraj <[email protected]>
@saturley-hall saturley-hall merged commit 011a200 into main Nov 26, 2025
10 checks passed
@saturley-hall saturley-hall deleted the oviya/fault-injection/helper-utilities branch November 26, 2025 18:09
zxue2 pushed a commit to zxue2/dynamo that referenced this pull request Dec 11, 2025