
Conversation


@raayandhar raayandhar commented Aug 11, 2025

Summary by CodeRabbit

  • New Features

    • Configurable per-transfer KV cache timeout (milliseconds); executor enforces timeouts for context and generation transfers and auto-cleans timed-out requests.
  • Improvements

    • Python API exposes kv_transfer_timeout_ms property and setter; pickle/state handling persists the timeout.
    • Public model forwards kv_transfer_timeout_ms to bindings.
    • Backend enum adds DEFAULT and a from_string helper.
  • Documentation

    • Example config updated to show kv_transfer_timeout_ms usage.
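The new enum member and helper can be sketched in Python as follows. This is a minimal illustration: DEFAULT and from_string come from this PR, while the other member names are illustrative placeholders, not necessarily the real backend list.

```python
from enum import Enum


class CacheTransceiverBackendType(Enum):
    """Illustrative Python mirror of the C++ enum (DEFAULT is the new member)."""
    DEFAULT = 0
    UCX = 1      # placeholder member names for illustration
    NIXL = 2
    MPI = 3

    @classmethod
    def from_string(cls, name: str) -> "CacheTransceiverBackendType":
        # Case-insensitive lookup by member name, the usual from_string shape.
        try:
            return cls[name.upper()]
        except KeyError:
            raise ValueError(f"Unknown backend: {name!r}") from None
```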

Description

An issue was raised on GitHub where the context server occasionally hangs because requests that fail to transfer pile up and hold resources. This PR adds the ability for users to configure kv_transfer_timeout_ms under cache_transceiver_config, which frees resources for requests that fail to transfer their KV cache within the timeout (configurable separately for context and generation). It must be configured carefully, or healthy requests may be terminated prematurely.
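For illustration, the option might be set in a disaggregated serving config like this. This is a hedged sketch: the key placement under cache_transceiver_config follows the PR description, but the sibling keys and values shown are illustrative.

```yaml
# Context/generation server config fragment (values illustrative)
cache_transceiver_config:
  backend: UCX                  # existing transceiver backend setting
  kv_transfer_timeout_ms: 5000  # clean up requests whose KV transfer exceeds 5 s
```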

Test Coverage

Tested locally by modifying the proxy server to intentionally inject transfer failures, then verifying that resources and requests were cleaned up after the timeout(s) elapsed.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping without care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing a pipeline without care and validation can break the top of tree.

@raayandhar raayandhar requested review from a team as code owners August 11, 2025 19:49

coderabbitai bot commented Aug 11, 2025

📝 Walkthrough

Adds an optional per-transfer KV-cache timeout through CacheTransceiverConfig (C++ and Python bindings), persists it in pickle/state, surfaces it in the LLM API and BindKvCacheTransceiver, and implements per-request timeout tracking and cleanup inside PyExecutor.

Changes

  • Core Config (C++) (cpp/include/tensorrt_llm/executor/executor.h, cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp): Added an optional constructor parameter kvTransferTimeoutMs, private storage mKvTransferTimeoutMs, a setter setKvTransferTimeoutMs(...), a getter getKvTransferTimeoutMs(), and included the timeout in equality comparison and initialization.
  • Bindings, nanobind (cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp): Extended the CacheTransceiverConfig binding to accept and serialize kv_transfer_timeout_ms (3-arg init, getstate/setstate), added a kv_transfer_timeout_ms property (getter/setter), and added CacheTransceiverBackendType::DEFAULT plus a from_string helper.
  • Bindings, pybind (cpp/tensorrt_llm/pybind/executor/executorConfig.cpp): Mirrored the nanobind changes: 3-arg construction path, pickle state updated to include the timeout, a kv_transfer_timeout_ms property with setKvTransferTimeoutMs, and the enum from_string helper.
  • PyExecutor timeout logic (tensorrt_llm/_torch/pyexecutor/py_executor.py): Added _check_kv_transfer_timeout() and integrated timeout checks into the executor loops and scheduling points; initializes and clears the per-request py_kv_transfer_start_time; marks timed-out requests with py_to_cleanup to drive termination and cleanup.
  • Per-request fields (tensorrt_llm/_torch/pyexecutor/llm_request.py): Added py_kv_transfer_start_time (None) and py_to_cleanup (False) attributes on request objects.
  • KV transceiver wrapper (tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py): Added a kv_transfer_timeout_ms attribute on BindKvCacheTransceiver, initialized from cache_transceiver_config.kv_transfer_timeout_ms.
  • LLM API model and examples (tensorrt_llm/llmapi/llm_args.py, examples/disaggregated/README.md): Added kv_transfer_timeout_ms: Optional[int] to the CacheTransceiverConfig model and forwarded it to pybind; documented the option in the example README.
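The per-request timeout tracking summarized above can be sketched with plain Python. This is a simplified illustration: Request is a stand-in for the real LlmRequest, the field names follow the PR, and the real executor integrates this check into its scheduling loops.

```python
import time


class Request:
    """Minimal stand-in for the PR's LlmRequest (illustrative, not the real class)."""

    def __init__(self, request_id):
        self.py_request_id = request_id
        self.py_kv_transfer_start_time = None  # set when a KV transfer begins
        self.py_to_cleanup = False  # set once the transfer has timed out


def check_kv_transfer_timeout(requests, timeout_ms, now=None):
    """Mark requests whose in-flight KV transfer has exceeded timeout_ms."""
    if timeout_ms is None:  # feature disabled: nothing to check
        return
    now = time.time() if now is None else now
    for req in requests:
        if req.py_kv_transfer_start_time is None:  # no transfer in flight
            continue
        elapsed_ms = (now - req.py_kv_transfer_start_time) * 1000
        if elapsed_ms > timeout_ms:
            req.py_to_cleanup = True  # executor terminates it during cleanup
```

Passing `now` explicitly keeps the check deterministic for testing; the real code would use the current wall-clock time.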

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant PyExecutor
  participant KvCacheTransceiver
  participant Request

  User->>PyExecutor: submit request(s)
  PyExecutor->>KvCacheTransceiver: initiate KV transfer (ctx/gen)
  PyExecutor->>Request: set py_kv_transfer_start_time

  loop executor cycles
    PyExecutor->>KvCacheTransceiver: poll transfer status
    PyExecutor->>PyExecutor: _check_kv_transfer_timeout()
    alt elapsed > kv_transfer_timeout_ms
      PyExecutor->>Request: set py_to_cleanup = True
      PyExecutor->>PyExecutor: terminate/cleanup request
    else transfer completes
      PyExecutor->>Request: clear py_kv_transfer_start_time
    end
  end

  PyExecutor-->>User: respond / terminate

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Suggested labels

Documentation

Suggested reviewers

  • Tabrizian
  • Shunkangz
  • chuangz0
  • pcastonguay
  • nv-guomingz
  • Superjomn


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these settings in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9c7ef25 and 0f0256d.

📒 Files selected for processing (2)
  • cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
  • cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/include/tensorrt_llm/executor/executor.h
  • cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@raayandhar raayandhar changed the title [nvbugs/5429636][fix] DRAFT: fix KVC mem leakage by adding kv_transfer_timeout_ms [https://nvbugs/5429636][fix] DRAFT: fix KVC mem leakage by adding kv_transfer_timeout_ms Aug 11, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between c9fe07e and 938297f.

📒 Files selected for processing (7)
  • cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
  • cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2 hunks)
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (9 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • cpp/include/tensorrt_llm/executor/executor.h
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp
  • cpp/include/tensorrt_llm/executor/executor.h
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/include/tensorrt_llm/executor/executor.h
🔇 Additional comments (14)
cpp/include/tensorrt_llm/executor/executor.h (2)

1446-1448: LGTM! Well-designed API addition.

The constructor signature properly extends the existing parameters with an optional timeout parameter, maintaining backward compatibility.


1453-1455: Good API design with proper type safety.

The setter/getter pair uses appropriate types - MillisecondsType for input (likely an integer alias) and std::chrono::milliseconds for output, providing type safety for duration values.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

332-332: LGTM! Proper initialization of transfer timestamp.

The new attribute follows the established naming convention and is appropriately initialized to None.

tensorrt_llm/llmapi/llm_args.py (2)

1067-1071: LGTM! Well-documented configuration field.

The field is properly defined with a clear description and appropriate default value, maintaining backward compatibility.


1072-1077: Correct forwarding to pybind layer.

The timeout parameter is properly passed to the underlying C++ binding.
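As a rough illustration of that forwarding pattern, the sketch below mirrors the shape of the change with simplified stand-ins; neither class is the real llm_args model or pybind type, and the field names follow the PR.

```python
from dataclasses import dataclass
from typing import Optional


class _BindingConfig:
    """Hypothetical stand-in for the pybind CacheTransceiverConfig."""

    def __init__(self, backend=None, max_tokens_in_buffer=None,
                 kv_transfer_timeout_ms=None):
        self.backend = backend
        self.max_tokens_in_buffer = max_tokens_in_buffer
        self.kv_transfer_timeout_ms = kv_transfer_timeout_ms


@dataclass
class CacheTransceiverConfig:
    """Simplified mirror of the Python-side config model."""
    backend: Optional[str] = None
    max_tokens_in_buffer: Optional[int] = None
    kv_transfer_timeout_ms: Optional[int] = None  # new optional field

    def _to_pybind(self):
        # Forward every field, including the new timeout, to the binding layer.
        return _BindingConfig(self.backend, self.max_tokens_in_buffer,
                              self.kv_transfer_timeout_ms)
```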

cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp (3)

24-30: LGTM! Proper constructor initialization.

The constructor correctly initializes the new timeout member using the initializer list.


32-36: Equality operator correctly updated.

The comparison properly includes the new timeout field.


58-67: Clean getter/setter implementation.

The accessor methods are correctly implemented with appropriate const-correctness.

tensorrt_llm/_torch/pyexecutor/py_executor.py (4)

1102-1104: Clarify why timeout check is disabled in overlap executor.

The timeout check is commented out in the overlap executor loop. Please either enable it or add a comment explaining why it's intentionally disabled in this execution path.


1384-1388: Good practice: Setting start time for new transfers.

The implementation correctly sets the transfer start time when initiating generation cache transfers, but only when timeout is configured.


1422-1426: Consistent timeout tracking for context transfers.

The start time is properly set for context cache transfers, maintaining consistency with generation transfers.


1360-1360: Proper cleanup of transfer timestamps.

The start times are correctly cleared when transfers complete or requests are terminated.

Also applies to: 1686-1686

cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)

414-495: LGTM! CacheTransceiverConfig timeout implementation is well-structured.

The pybind11 bindings for the KV transfer timeout feature are correctly implemented with proper:

  • Type conversions between Python int64_t and C++ std::chrono::milliseconds
  • Optional value handling
  • Pickle support with backward compatibility (3-tuple state)
  • Property and method bindings
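The backward-compatible 3-tuple pickle state described above can be illustrated in plain Python. The class is a hypothetical stand-in: the real bindings implement this in C++ via the pickle getstate/setstate hooks, and the 2-tuple fallback shown here is one way older pickles could remain loadable.

```python
import pickle


class CacheTransceiverConfigSketch:
    """Illustrative stand-in for the bound config object (not the real binding)."""

    def __init__(self, backend=None, max_tokens_in_buffer=None,
                 kv_transfer_timeout_ms=None):
        self.backend = backend
        self.max_tokens_in_buffer = max_tokens_in_buffer
        self.kv_transfer_timeout_ms = kv_transfer_timeout_ms

    def __getstate__(self):
        # New 3-tuple state includes the timeout.
        return (self.backend, self.max_tokens_in_buffer,
                self.kv_transfer_timeout_ms)

    def __setstate__(self, state):
        # Accept older 2-tuple pickles by defaulting the timeout to None.
        if len(state) == 2:
            state = (*state, None)
        (self.backend, self.max_tokens_in_buffer,
         self.kv_transfer_timeout_ms) = state


cfg = CacheTransceiverConfigSketch("UCX", 4096, kv_transfer_timeout_ms=5000)
restored = pickle.loads(pickle.dumps(cfg))
```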
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)

431-512: LGTM! Nanobind implementation mirrors pybind11 correctly.

The nanobind bindings for CacheTransceiverConfig are properly implemented with:

  • Consistent API with the pybind11 version
  • Correct use of nanobind-specific patterns (placement new in setstate)
  • Proper type conversions and optional handling
  • Complete pickle support

@raayandhar raayandhar marked this pull request as draft August 11, 2025 21:08

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1287-1313: Fix critical issues in the timeout checking implementation

The _check_kv_transfer_timeout method has several issues that need to be addressed:

  1. Missing null check for kv_cache_transceiver before accessing its properties
  2. The condition at line 1304 incorrectly checks is_disagg_generation_transmission_in_progress instead of checking if the request is in progress
  3. The method doesn't call _terminate_request as expected but only sets the py_to_cleanup flag

Apply this fix to resolve the issues:

 def _check_kv_transfer_timeout(self):
+    if not self.kv_cache_transceiver:
+        return
     current_time = time.time()
     timeout_ms = self.kv_cache_transceiver.kv_transfer_timeout_ms
     if timeout_ms is None:
         return
 
     for req in self.ctx_in_transmission_requests[:]:
         if req.py_kv_transfer_start_time is None:
             continue
         elapsed_time = (current_time - req.py_kv_transfer_start_time) * 1000
         if elapsed_time > timeout_ms:
             logger.debug(
                 f"Terminating context request {req.py_request_id} due to KV cache transfer timeout"
             )
             req.py_to_cleanup = True
 
     for req in self.active_requests[:]:
         if req.is_disagg_generation_transmission_in_progress and req.py_kv_transfer_start_time is not None:
             elapsed_time = (current_time -
                             req.py_kv_transfer_start_time) * 1000
             if elapsed_time > timeout_ms:
                 logger.debug(
                     f"Terminating generation request {req.py_request_id} due to KV cache transfer timeout"
                 )
                 req.py_to_cleanup = True
-
-    return
🧹 Nitpick comments (4)
tensorrt_llm/_torch/pyexecutor/py_executor.py (4)

907-907: Consider reducing log verbosity for production

While the debug logging is helpful during development, it might be too verbose for production. Consider using a more structured logging approach or controlling verbosity through configuration.

-        self._log_ctx_transmission_state("Starting _executor_loop")
+        if logger.isEnabledFor(logging.DEBUG):
+            self._log_ctx_transmission_state("Starting _executor_loop")

990-1005: Improve logging helper method for better debugging

The logging helper method could be enhanced with additional context and better formatting for production use.

 def _log_ctx_transmission_state(self, context: str):
     """Helper method to log the current state of ctx_in_transmission_requests"""
+    if not logger.isEnabledFor(logging.DEBUG):
+        return
+        
     if not self.ctx_in_transmission_requests:
         logger.debug(
             f"[CTX_TRANSMISSION] {context}: ctx_in_transmission_requests is EMPTY"
         )
     else:
-        request_ids = [
-            req.py_request_id for req in self.ctx_in_transmission_requests
-        ]
-        client_ids = [
-            req.py_client_id for req in self.ctx_in_transmission_requests
-        ]
+        request_info = [
+            {
+                "request_id": req.py_request_id,
+                "client_id": req.py_client_id,
+                "state": req.state.name if hasattr(req.state, 'name') else str(req.state),
+                "has_timeout": req.py_kv_transfer_start_time is not None
+            }
+            for req in self.ctx_in_transmission_requests
+        ]
         logger.debug(
-            f"[CTX_TRANSMISSION] {context}: ctx_in_transmission_requests has {len(self.ctx_in_transmission_requests)} requests - request_ids: {request_ids}, client_ids: {client_ids}"
+            f"[CTX_TRANSMISSION] {context}: {len(self.ctx_in_transmission_requests)} requests in transmission",
+            extra={"requests": request_info}
         )

1120-1121: Clarify the commented-out timeout check in overlap executor

The comment indicates uncertainty about placing the timeout check here. This should be resolved before finalizing the PR.

The timeout check is commented out with uncertainty. This needs clarification to ensure consistent timeout handling across all executor paths. Would you like me to analyze the overlap executor flow to determine if the timeout check should be enabled here?


1658-1716: Consider reducing logging frequency in hot paths

Multiple debug log calls in the response handling path could impact performance. Consider consolidating or conditionally enabling these logs.

 def _handle_responses(self):
-    self._log_ctx_transmission_state("Start of _handle_responses")
+    if logger.isEnabledFor(logging.DEBUG):
+        self._log_ctx_transmission_state("Start of _handle_responses")
     # ... rest of the method ...
-    self._log_ctx_transmission_state("End of _handle_responses")
+    if logger.isEnabledFor(logging.DEBUG):
+        self._log_ctx_transmission_state("End of _handle_responses")
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 8b116bf and 296c7ae.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (14 hunks)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py

1004-1004: Line too long (186 > 120)

(E501)

🔇 Additional comments (7)
tensorrt_llm/_torch/pyexecutor/py_executor.py (7)

858-859: Good placement of timeout check after generation transfer status check

The timeout check is appropriately placed after checking disaggregated generation transfer status. This ensures that transfers are monitored and timed out if they exceed the configured threshold.


978-978: Consistent timeout check placement in the main executor loop

The timeout check is consistently placed after terminating context-finished requests, matching the pattern used in the PP executor loop.


1387-1387: Correctly reset timeout tracking when generation transfer completes

The timeout tracking is properly reset when a generation request completes its KV cache transfer and transitions to the generation-in-progress state.


1411-1414: Properly initialize timeout tracking for new generation requests

The start time is correctly initialized when new generation requests begin their KV cache transfer.


1449-1452: Consistently initialize timeout tracking for context transmission

The timeout tracking is properly initialized for context requests that enter the transmission state.


1672-1676: Effective cleanup mechanism for timed-out requests

The cleanup flag is properly checked during response handling to terminate requests that have exceeded the timeout threshold.


1724-1727: Proper cleanup of timeout tracking in termination handler

The timeout tracking is correctly reset when requests are terminated, whether due to completion or timeout.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

840-843: Swap timeout check before termination (still pending here).

Call _check_kv_transfer_timeout() before _terminate_ctx_finished_requests() to ensure timed-out ctx transfers are cleaned up immediately. This mirrors the updated ordering elsewhere in this file.

Apply this diff:

-                if self.kv_cache_transceiver and self.ctx_in_transmission_requests:
-                    self._terminate_ctx_finished_requests()
-                    self._check_kv_transfer_timeout()
+                if self.kv_cache_transceiver and self.ctx_in_transmission_requests:
+                    self._check_kv_transfer_timeout()
+                    self._terminate_ctx_finished_requests()
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 296c7ae and 76bb547.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/py_executor.py (6)

857-860: LGTM: Added timeout check in prep/schedule path.

Running _check_kv_transfer_timeout() early in the scheduling path helps bound resource usage proactively.


975-978: LGTM: Correct ordering (check timeout before termination).

This prevents a one-iteration delay in cleaning up timed-out ctx transfers.


1101-1103: LGTM: Correct ordering (overlap loop).

Timeout check precedes termination, consistent with the intended behavior.


1368-1369: LGTM: Reset start time on gen transfer completion.

Clearing py_kv_transfer_start_time on completion prevents stale timeouts.


1699-1701: LGTM: Cleanup on ctx completion or timeout.

Clearing start time and removing from in-flight list ensures bounded resource usage.


1652-1657: I couldn’t find a FinishReason.TIMEOUT or a finish_by_reason method in the codebase, and the FinishReason enum only defines CANCELLED, STOP, and LENGTH. Likewise, there’s no terminal-response helper on LlmRequest. Emitting a final response will require:

• Defining a timeout finish reason
• Implementing or exposing a finish_by_reason(...) API on LlmRequest
• Ensuring create_response(...) returns a valid final chunk

Until those exist, the suggested diff won’t compile.

Action items:

  • Add TIMEOUT to FinishReason (e.g. in finish_reason.py)

  • Implement LlmRequest.finish_by_reason(reason: FinishReason)

  • In py_executor.py around line 1652, call:

    request.finish_by_reason(FinishReason.TIMEOUT)
    response = request.create_response(final_chunk=True, worker_rank=self.dist.rank)
    if response:
        new_responses.append((req_id, response))
  • Then proceed with requests_to_terminate.append(request) as before.

Please confirm before landing these refactors.
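
The action items above can be sketched in isolation. Every name here (FinishReason.TIMEOUT, finish_by_reason, LlmRequestSketch) is hypothetical — per the comment above, none of it exists in the codebase yet; this only illustrates the shape the refactor could take:

```python
from enum import Enum


class FinishReason(Enum):
    """Sketch of the enum with the proposed TIMEOUT value added."""
    STOP = 0
    LENGTH = 1
    CANCELLED = 2
    TIMEOUT = 3  # proposed: request abandoned due to KV-transfer timeout


class LlmRequestSketch:
    """Stand-in for LlmRequest, showing a possible finish_by_reason API."""

    def __init__(self, request_id):
        self.py_request_id = request_id
        self.finish_reason = None
        self.is_finished = False

    def finish_by_reason(self, reason: FinishReason):
        """Mark the request finished with the given terminal reason."""
        self.finish_reason = reason
        self.is_finished = True
```

A caller in the executor loop would then mark the request finished before building the final response chunk, so the client sees a terminal state instead of hanging.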

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

842-843: Swap the order: run timeout checks before terminating ctx-finished requests (consistency and correctness).

To promptly flag and clean up timed-out transfers before normal termination, call _check_kv_transfer_timeout() before _terminate_ctx_finished_requests() here (matches other loops below).

-                    self._terminate_ctx_finished_requests()
-                    self._check_kv_transfer_timeout()
+                    self._check_kv_transfer_timeout()
+                    self._terminate_ctx_finished_requests()
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 76bb547 and 6936a3b.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)


Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/_torch/pyexecutor/py_executor.py (5)

857-860: LGTM: proactive timeout check during batch prep.

Running _check_kv_transfer_timeout() during batch preparation reduces time-to-detection for stalled transfers.


975-978: LGTM: correct ordering — check timeouts before terminating ctx-finished.

This matches the intended flow and previous guidance.


1101-1103: LGTM: correct ordering in overlap loop as well.

Consistent with the non-PP path; timeout check precedes ctx-finished termination.


1370-1370: LGTM: clear the KV transfer timing anchor on gen-transfer completion.

Resetting py_kv_transfer_start_time avoids stale elapsed-time checks after completion.


1701-1703: LGTM: terminate ctx transfers on completion or timeout and clear timing anchor.

This ensures resource cleanup for context-only transmissions and avoids stale timers.

@raayandhar raayandhar marked this pull request as ready for review August 12, 2025 23:13
@raayandhar raayandhar requested review from a team as code owners August 12, 2025 23:13
@raayandhar raayandhar changed the title [https://nvbugs/5429636][fix] DRAFT: fix KVC mem leakage by adding kv_transfer_timeout_ms [https://nvbugs/5429636][fix] fix KVC mem leakage by adding kv_transfer_timeout_ms Aug 12, 2025
@raayandhar
Contributor Author

cc: @Tabrizian
... I'm not sure why it looks like I requested a ton of reviews. I did not do that...

@raayandhar
Contributor Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15028 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15028 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11350 completed with status: 'FAILURE'

@raayandhar
Contributor Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1-1: Add the required NVIDIA copyright header

This source lacks the required NVIDIA header per repository guidelines. Please prepend the standard header at the top of this file.

Example header:

# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
# NVIDIA and its licensors retain all intellectual property and proprietary
# rights in and to this software, related documentation and any modifications
# thereto.  Any use, reproduction, disclosure or distribution of this software
# and related documentation without an express license agreement from NVIDIA is
# strictly prohibited.
♻️ Duplicate comments (5)
tensorrt_llm/_torch/pyexecutor/py_executor.py (5)

841-844: Call timeout checks before terminating finished ctx transfers

Run _check_kv_transfer_timeout() before _terminate_ctx_finished_requests() to ensure timeouts are evaluated and flagged prior to cleanup.

-    self._terminate_ctx_finished_requests()
-    self._check_kv_transfer_timeout()
+    self._check_kv_transfer_timeout()
+    self._terminate_ctx_finished_requests()

1269-1298: Use a monotonic clock and treat non-positive timeouts as disabled

Wall-clock jumps can skew elapsed-time checks. Use time.monotonic() and keep semantics: None or <= 0 disables the timeout. Also prefer elapsed_ms for clarity.

 def _check_kv_transfer_timeout(self):
     if not self.kv_cache_transceiver:
         return
     timeout_ms = self.kv_cache_transceiver.kv_transfer_timeout_ms
     if timeout_ms is None or timeout_ms <= 0:
         return
-    current_time = time.time()
+    current_time = time.monotonic()
 
     for req in self.ctx_in_transmission_requests[:]:
         if req.py_kv_transfer_start_time is None:
             continue
-        elapsed_time = (current_time - req.py_kv_transfer_start_time) * 1000
-        if elapsed_time > timeout_ms:
+        elapsed_ms = (current_time - req.py_kv_transfer_start_time) * 1000
+        if elapsed_ms > timeout_ms:
             logger.warning(
                 f"Terminating context request {req.py_request_id} due to KV cache transfer timeout"
             )
             req.py_to_cleanup = True
 
     for req in self.active_requests[:]:
         if req.is_disagg_generation_transmission_in_progress and req.py_kv_transfer_start_time is not None:
-            elapsed_time = (current_time -
-                            req.py_kv_transfer_start_time) * 1000
-            if elapsed_time > timeout_ms:
+            elapsed_ms = (current_time - req.py_kv_transfer_start_time) * 1000
+            if elapsed_ms > timeout_ms:
                 logger.warning(
                     f"Terminating generation request {req.py_request_id} due to KV cache transfer timeout"
                 )
                 req.py_to_cleanup = True

1395-1399: Initialize gen-transfer timers only when timeout is enabled, and use a monotonic clock

Honor disabled/non-positive timeouts and switch to time.monotonic() for robustness.

-        if self.kv_cache_transceiver.kv_transfer_timeout_ms is not None:
+        if self.kv_cache_transceiver.kv_transfer_timeout_ms is not None and \
+           self.kv_cache_transceiver.kv_transfer_timeout_ms > 0:
             for req in new_gen_reqs:
                 if req.state == LlmRequestState.DISAGG_GENERATION_TRANS_IN_PROGRESS:
-                    req.py_kv_transfer_start_time = time.time()
+                    req.py_kv_transfer_start_time = time.monotonic()

1433-1437: Initialize ctx-transfer timers only when timeout is enabled, and use a monotonic clock

Keep clock sources consistent and ignore non-positive timeouts.

-        if self.kv_cache_transceiver.kv_transfer_timeout_ms is not None:
+        if self.kv_cache_transceiver.kv_transfer_timeout_ms is not None and \
+           self.kv_cache_transceiver.kv_transfer_timeout_ms > 0:
             for req in ctx_transmission_reqs:
                 if req.state == LlmRequestState.DISAGG_CONTEXT_TRANS_IN_PROGRESS:
-                    req.py_kv_transfer_start_time = time.time()
+                    req.py_kv_transfer_start_time = time.monotonic()

1655-1660: Emit a final response on timeout to avoid client hangs

Requests flagged for cleanup due to KV-transfer timeout are terminated without sending a final response. Emit a final response (use CANCELLED as a proxy for TIMEOUT) before cleanup so clients don't hang waiting.

             # Check if generation request needs cleanup due to KV cache transfer timeout
             if request.py_to_cleanup:
-                request.state = LlmRequestState.GENERATION_COMPLETE
-                requests_to_terminate.append(request)
-                continue
+                # Mark as cancelled due to timeout and emit a final response
+                request.finish_by_reason(FinishReason.CANCELLED)
+                request.state = LlmRequestState.GENERATION_COMPLETE
+                request.decoding_iter = request.py_decoding_iter
+                response = request.create_response(False, self.dist.rank)
+                if response:
+                    new_responses.append((req_id, response))
+                requests_to_terminate.append(request)
+                continue
🧹 Nitpick comments (1)
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)

474-495: Validate non-negative timeout in the Python setter

Guard against negative values to prevent undefined semantics; raise a Python ValueError when negative.

         .def_property(
             "kv_transfer_timeout_ms",
             [](tle::CacheTransceiverConfig const& self) -> std::optional<int64_t>
             {
                 auto timeout = self.getKvTransferTimeoutMs();
                 return timeout ? std::optional<int64_t>(timeout->count()) : std::nullopt;
             },
             [](tle::CacheTransceiverConfig& self, std::optional<int64_t> timeoutMs)
             {
                 if (timeoutMs)
                 {
-                    self.setKvTransferTimeoutMs(std::chrono::milliseconds(*timeoutMs));
+                    if (*timeoutMs < 0)
+                    {
+                        throw py::value_error("kv_transfer_timeout_ms must be non-negative");
+                    }
+                    self.setKvTransferTimeoutMs(std::chrono::milliseconds(*timeoutMs));
                 }
                 else
                 {
                     self.setKvTransferTimeoutMs(std::nullopt);
                 }
             })
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31634aa and 5924e85.

📒 Files selected for processing (9)
  • cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
  • cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2 hunks)
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (2 hunks)
  • examples/disaggregated/README.md (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (10 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • cpp/include/tensorrt_llm/executor/executor.h
  • examples/disaggregated/README.md
  • tensorrt_llm/llmapi/llm_args.py
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
  • cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)


Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/py_executor.py (4)

858-861: Good placement: timeout check is invoked early in the iteration

Calling _check_kv_transfer_timeout() right after _check_disagg_gen_transfer_status() ensures overdue transfers are pruned promptly.


976-979: Correct order: timeout check before ctx-finished termination

This ordering prevents missing timeouts on still-in-transit ctx requests. No change needed here.


1371-1371: LGTM: Resetting start-time on gen transmission completion

Clearing py_kv_transfer_start_time on successful generation transmission completion avoids stale timers.


1701-1703: LGTM: Terminating ctx-transmission requests on completion or timeout

Including request.py_to_cleanup here promptly frees resources and clears the timeout start marker.

cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (2)

414-419: Add kv_transfer_timeout_ms to pickle state: looks good

Returning a 3-tuple with timeoutMs preserves state across pickling. The conversion to optional<int64_t> is correct.


456-470: 3-arg constructor for CacheTransceiverConfig: good

The binding cleanly maps optional timeoutMs to std::chrono::milliseconds and threads it through the C++ constructor.

@tensorrt-cicd
Collaborator

PR_Github #15033 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15033 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11354 completed with status: 'FAILURE'

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (4)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)

841-844: Run timeout checks before terminating finished ctx transfers (ordering)

Call _check_kv_transfer_timeout() before _terminate_ctx_finished_requests() so timed-out ctx transfers are flagged and cleaned in the same cycle. This mirrors the ordering you already apply in other loops.

-                if self.kv_cache_transceiver and self.ctx_in_transmission_requests:
-                    self._terminate_ctx_finished_requests()
-                    self._check_kv_transfer_timeout()
+                if self.kv_cache_transceiver and self.ctx_in_transmission_requests:
+                    self._check_kv_transfer_timeout()
+                    self._terminate_ctx_finished_requests()

1269-1298: Use a monotonic clock and unify elapsed naming

time.time() is not monotonic and can be affected by system clock changes. Use time.monotonic() and keep variable naming consistent (elapsed_ms). Also keep the existing guards (None or <= 0 => disabled).

 def _check_kv_transfer_timeout(self):
     if not self.kv_cache_transceiver:
         return
     timeout_ms = self.kv_cache_transceiver.kv_transfer_timeout_ms
     if timeout_ms is None or timeout_ms <= 0:
         return
-    current_time = time.time()
+    current_time = time.monotonic()

     for req in self.ctx_in_transmission_requests[:]:
         if req.py_kv_transfer_start_time is None:
             continue
-        elapsed_time = (current_time - req.py_kv_transfer_start_time) * 1000
-        if elapsed_time > timeout_ms:
+        elapsed_ms = (current_time - req.py_kv_transfer_start_time) * 1000
+        if elapsed_ms > timeout_ms:
             logger.warning(
                 f"Terminating context request {req.py_request_id} due to KV cache transfer timeout"
             )
             req.py_to_cleanup = True

     for req in self.active_requests[:]:
         if req.is_disagg_generation_transmission_in_progress and req.py_kv_transfer_start_time is not None:
-            elapsed_time = (current_time -
-                            req.py_kv_transfer_start_time) * 1000
-            if elapsed_time > timeout_ms:
+            elapsed_ms = (current_time - req.py_kv_transfer_start_time) * 1000
+            if elapsed_ms > timeout_ms:
                 logger.warning(
                     f"Terminating generation request {req.py_request_id} due to KV cache transfer timeout"
                 )
                 req.py_to_cleanup = True

     return
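
The monotonic-clock check suggested in this diff can be expressed as a standalone predicate. The function name is_transfer_timed_out is invented for this sketch (the real logic lives inline in _check_kv_transfer_timeout), but the semantics match the review comment: None or a non-positive timeout disables the check, and elapsed time is measured against a time.monotonic() anchor.

```python
import time


def is_transfer_timed_out(start_time, timeout_ms, now=None):
    """Return True if a KV transfer begun at `start_time` (a
    time.monotonic() reading) has exceeded `timeout_ms`.

    A timeout of None or <= 0 disables the check; an unset start
    time means the transfer never began tracking, so it cannot
    time out.
    """
    if timeout_ms is None or timeout_ms <= 0:
        return False
    if start_time is None:
        return False
    if now is None:
        now = time.monotonic()
    elapsed_ms = (now - start_time) * 1000
    return elapsed_ms > timeout_ms
```

Because time.monotonic() never moves backwards, NTP adjustments or manual clock changes cannot cause spurious timeouts or mask real ones, which is the point of the suggested change.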

1433-1437: Initialize ctx-transfer timer with monotonic clock and don’t reset it repeatedly

Mirror the gen path: respect disabled/non-positive timeouts, use a monotonic clock, and only set once.

-        if self.kv_cache_transceiver.kv_transfer_timeout_ms is not None:
+        if (self.kv_cache_transceiver.kv_transfer_timeout_ms is not None and
+                self.kv_cache_transceiver.kv_transfer_timeout_ms > 0):
             for req in ctx_transmission_reqs:
                 if req.state == LlmRequestState.DISAGG_CONTEXT_TRANS_IN_PROGRESS:
-                    req.py_kv_transfer_start_time = time.time()
+                    if req.py_kv_transfer_start_time is None:
+                        req.py_kv_transfer_start_time = time.monotonic()
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)

422-435: Backward compatibility: accept 2- or 3-element pickle state in setstate

Strictly requiring 3 elements breaks older pickles. Accept both sizes and apply timeout only when present.

-        if (state.size() != 3)
-        {
-            throw std::runtime_error("Invalid CacheTransceiverConfig state!");
-        }
-        auto config
-            = tle::CacheTransceiverConfig(state[0].cast<std::optional<tle::CacheTransceiverConfig::BackendType>>(),
-                state[1].cast<std::optional<size_t>>());
-        auto timeoutMs = state[2].cast<std::optional<int64_t>>();
-        if (timeoutMs)
-        {
-            config.setKvTransferTimeoutMs(std::chrono::milliseconds(*timeoutMs));
-        }
-        return config;
+        if (state.size() < 2 || state.size() > 3)
+        {
+            throw std::runtime_error("Invalid CacheTransceiverConfig state!");
+        }
+        auto config
+            = tle::CacheTransceiverConfig(state[0].cast<std::optional<tle::CacheTransceiverConfig::BackendType>>(),
+                state[1].cast<std::optional<size_t>>());
+        if (state.size() == 3)
+        {
+            auto timeoutMs = state[2].cast<std::optional<int64_t>>();
+            if (timeoutMs)
+            {
+                config.setKvTransferTimeoutMs(std::chrono::milliseconds(*timeoutMs));
+            }
+        }
+        return config;
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1-10: Missing NVIDIA copyright header

Per repository guidelines, Python sources should include the NVIDIA header. Please prepend the standard header.

Example (outside this diff hunk):

# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
# NVIDIA and its licensors retain all intellectual property and proprietary
# rights in and to this software, related documentation and any modifications
# thereto.  Any use, reproduction, disclosure or distribution of this software
# and related documentation without an express license agreement from NVIDIA is
# strictly prohibited.
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)

1-16: Header years

The SPDX header ends at 2024, while other updated files use 2025. If the project policy is to bump the year on touched files, consider updating.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5924e85 and e671749.

📒 Files selected for processing (8)
  • cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
  • cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2 hunks)
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (10 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
  • cpp/include/tensorrt_llm/executor/executor.h
  • cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)


Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)


Files:

  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (10)
tensorrt_llm/_torch/pyexecutor/py_executor.py (4)

858-861: Good: early timeout check after gen-transfer status

Placing _check_kv_transfer_timeout() right after _check_disagg_gen_transfer_status() ensures in-flight timeouts are handled before scheduling.
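The shape of such a periodic timeout check can be sketched as below. This is a minimal illustration, not the actual PyExecutor code; the attribute name `py_kv_transfer_start_time` follows the discussion in this PR, while `check_kv_transfer_timeout` and the `terminate` callback are hypothetical stand-ins:

```python
import time

def check_kv_transfer_timeout(in_transmission_requests, timeout_ms, terminate):
    """Terminate any request whose KV-cache transfer has exceeded timeout_ms."""
    if timeout_ms is None:
        return  # timeout feature disabled
    now = time.monotonic()
    for req in list(in_transmission_requests):
        start = getattr(req, "py_kv_transfer_start_time", None)
        if start is not None and (now - start) * 1000.0 > timeout_ms:
            terminate(req)  # release KV cache blocks and drop the request
            req.py_kv_transfer_start_time = None  # avoid stale timestamps on reuse
```

Clearing the start time after termination is what the review note above about reused request objects refers to: without it, a recycled request object could appear to have been "in transmission" since a previous life.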


976-979: Good: correct ordering (timeout before ctx-finished termination)

This ordering avoids missing a timeout in the same iteration.


1101-1104: Good: consistent ordering in overlap loop

Timeout is checked before ctx-finished termination here as well.


1701-1706: Good: clear start-time on ctx completion or timeout

Resetting py_kv_transfer_start_time on completion or timeout prevents stale timestamps on reused request objects.

cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (3)

414-419: LGTM: pickle getstate includes timeout

Returning a 3-tuple with optional timeout preserves the new field across serialization.
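The round-trip behavior being reviewed can be mimicked in plain Python: `__getstate__` returns a tuple whose last slot carries the optional timeout, and `__setstate__` restores it. This is only an illustrative stand-in for the bound C++ config class; the field names here mirror the PR discussion and are not the real binding:

```python
import pickle

class CacheTransceiverConfigSketch:
    """Illustrative stand-in for the bound config; not the real class."""

    def __init__(self, backend=None, max_tokens_in_buffer=None,
                 kv_transfer_timeout_ms=None):
        self.backend = backend
        self.max_tokens_in_buffer = max_tokens_in_buffer
        self.kv_transfer_timeout_ms = kv_transfer_timeout_ms

    def __getstate__(self):
        # 3-tuple: the optional timeout rides in the last slot.
        return (self.backend, self.max_tokens_in_buffer,
                self.kv_transfer_timeout_ms)

    def __setstate__(self, state):
        (self.backend, self.max_tokens_in_buffer,
         self.kv_transfer_timeout_ms) = state

cfg = CacheTransceiverConfigSketch("DEFAULT", 4096, kv_transfer_timeout_ms=500)
restored = pickle.loads(pickle.dumps(cfg))
```

The point of the 3-tuple is exactly what the note says: an older 2-tuple state would silently drop the new field across pickling.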


457-470: LGTM: 3-arg constructor wiring

The lambda converting optional int64_t to optional milliseconds is correct and keeps the Python API ergonomic.


474-491: LGTM: kv_transfer_timeout_ms property binding

Getter/setter semantics and nullopt handling look correct.

cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (3)

431-436: LGTM: pickle getstate includes timeout

3-tuple state with optional timeout is correct.


473-486: LGTM: 3-arg constructor wiring

Optional ms conversion is correct.


491-507: LGTM: kv_transfer_timeout_ms property binding

Correct optional handling and ms conversion.

@raayandhar
Copy link
Contributor Author

raayandhar commented Aug 14, 2025

I have two questions here.

  1. Currently, we are using the C++ KV cache transceiver. It seems that we only terminate the requests at the Python level, and the timed-out requests are still in the queue of the C++ transceiver. Is there any problem, given that the KV cache has already been released?
  2. The context worker might time out first while the generation workers still try to acquire the KV cache. In this case, the C++ transceiver is not aware of this and might try to send requests whose KV cache has been released.

For 1, could you elaborate on why the timed-out requests are still in the queue of the C++ transceiver? In my testing I recreate conditions where context requests pile up and fail to send their KV cache, which is a use case this feature targets. That's why I make sure to clean up these requests the same way I would clean up requests that successfully transmitted their KV cache, i.e. the normal cleanup flow. I think they might still be in the queue, but once checkContextTransferStatus is called later they will also get removed.
For 2, are you saying that there is a race condition where requests in ctx_in_transmission_requests can have their generation workers acquire their KV cache before _terminate_ctx_finished_requests? I was not aware of this possible case. My thinking is that as long as I clean up in the regular flow we should avoid issues like this, and I did not face any issue in testing. But if you could elaborate on this case, that would be very helpful.

@Tabrizian
Copy link
Member

Re https://github.com/NVIDIA/TensorRT-LLM/pull/6801#issuecomment-3185781656
I meant: if we send a context request but never send any request to the gen server or context server afterwards, would we still cancel the KV cache send request from the context worker side?

@raayandhar
Copy link
Contributor Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15332 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15332 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11573 completed with status: 'FAILURE'

@raayandhar
Copy link
Contributor Author

raayandhar commented Aug 15, 2025

Re https://github.com/NVIDIA/TensorRT-LLM/pull/6801#issuecomment-3185781656: I meant if we send a context request but never send any request to the gen server or context server afterwards, would we still cancel the KV cache send request from the context worker side?

If we set the timeout low enough, yes. The check completes in under 1 ms in my testing, so I had to artificially introduce delay via time.sleep.
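The delay-injection idea mentioned above can be sketched as a self-contained toy; `transfer_with_delay` is a hypothetical stand-in for a KV-cache send, not the actual test harness:

```python
import time

def transfer_with_delay(delay_s):
    """Stand-in for a KV-cache send that is artificially slowed down."""
    start = time.monotonic()
    time.sleep(delay_s)  # simulated stall; the real transfer is network I/O
    return (time.monotonic() - start) * 1000.0  # elapsed milliseconds

timeout_ms = 50
elapsed_ms = transfer_with_delay(0.1)  # a 100 ms stall exceeds the 50 ms budget
timed_out = elapsed_ms > timeout_ms
```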

@raayandhar
Copy link
Contributor Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15468 [ run ] triggered by Bot

@Shunkangz
Copy link
Collaborator

  1. You only cancel the requests from the PyExecutor, not from the C++ cache transceiver. https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp#L254
  2. Timeout behavior cannot ensure that the context workers and generation workers share the same view. We should exchange acknowledgment information one more time. I have a draft PR here that does a similar thing. Instead of using a timeout, we believe we should provide a cancelRequest interface to the upper-level scheduler and throw an exception with useful information. That PR is waiting on the exception refactor for disagg to be ready. cc @Shixiaowei02 @chuangz0
  3. There are two cases where we can encounter a timeout. (1) The generation workers do not acquire the KV cache in time, perhaps because of a resource limitation. In this case, we can simply remove the request from the ready queue; however, the generation workers might still try to acquire the KV cache, so we should let them know the latest status. In a distributed system, it may not be proper to rely on a "tricky timeout". (2) The current send/recv hangs. In this case, you cannot even cancel the cache transceiver, because UCX and RDMA do not support this. The only way is to destroy the endpoint and let both sender and receiver know.

I hope that I am clear. If you need more discussion, we can schedule a meeting.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15468 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11651 completed with status: 'FAILURE'

@raayandhar
Copy link
Contributor Author


I see, thanks Shunkang for the detailed response. I agree that the cancelRequest interface makes more sense. What's not clear to me is whether it will directly solve the problem in this issue, since no exceptions are thrown here. My thinking is that the timeout behavior could instead use the cancelRequest interface you've provided to deal with this. But I may not have understood something; let me know your thoughts.

@Shunkangz
Copy link
Collaborator


You are right. The PR is still in progress because I am waiting for the previous one to be merged. I hope it can be merged within a couple of days.

Copy link
Member

@Tabrizian Tabrizian left a comment

Please add testing for this feature.

@raayandhar
Copy link
Contributor Author

Please add testing for this feature.

Will do, having discussion with @Shunkangz on how to move forward.

@pcastonguay
Copy link
Collaborator

@Shunkangz it seems to me that the changes in this MR are still needed for the case where for some reason the gen request never makes it to the generation worker. In that case we would need to cancel the request on the context server. And I agree, we should also call the C++ dataTransceiver cancelRequest to make sure the transceiver removes that request as well.
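The direction sketched above, cleaning up on the executor side and also cancelling on the transceiver side so the two stay in sync, can be illustrated with a toy model. Here `cancel_request` is a stand-in for the proposed cancelRequest interface, not an existing API, and `ToyTransceiver` is entirely hypothetical:

```python
class ToyTransceiver:
    """Toy model of a cache transceiver holding pending send requests."""

    def __init__(self):
        self.queue = set()

    def submit(self, req_id):
        self.queue.add(req_id)

    def cancel_request(self, req_id):
        # In a real transceiver, a hung UCX/RDMA send cannot be cancelled
        # in place; the endpoint would have to be torn down instead.
        self.queue.discard(req_id)

def timeout_cleanup(executor_requests, transceiver, req_id):
    """Drop a timed-out request from both cleanup levels."""
    executor_requests.discard(req_id)    # executor-side cleanup
    transceiver.cancel_request(req_id)   # keep the transceiver side in sync

trx = ToyTransceiver()
pending = {42}
trx.submit(42)
timeout_cleanup(pending, trx, 42)
```

The point of the two-level cleanup is that neither side retains a stale entry: without the second call, the transceiver's queue would still hold a request whose KV cache has already been released.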

@raayandhar
Copy link
Contributor Author

Closing since #7575 is more up to date.
