
Conversation


@raayandhar raayandhar commented Aug 11, 2025

Summary by CodeRabbit

  • New Features

    • Configurable per-transfer KV cache timeout (milliseconds); executor enforces timeouts for context and generation transfers and auto-cleans timed-out requests.
  • Improvements

    • Python API exposes kv_transfer_timeout_ms property and setter; pickle/state handling persists the timeout.
    • Public model forwards kv_transfer_timeout_ms to bindings.
    • Backend enum adds DEFAULT and a from_string helper.
  • Documentation

    • Example config updated to show kv_transfer_timeout_ms usage.
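The new enum member and helper can be sketched in Python as follows. This is a minimal illustration: DEFAULT and from_string come from this PR, while the other member names are illustrative placeholders, not necessarily the real backend list.

```python
from enum import Enum


class CacheTransceiverBackendType(Enum):
    """Illustrative Python mirror of the C++ enum (DEFAULT is the new member)."""
    DEFAULT = 0
    UCX = 1      # placeholder member names for illustration
    NIXL = 2
    MPI = 3

    @classmethod
    def from_string(cls, name: str) -> "CacheTransceiverBackendType":
        # Case-insensitive lookup by member name, the usual from_string shape.
        try:
            return cls[name.upper()]
        except KeyError:
            raise ValueError(f"Unknown backend: {name!r}") from None
```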

Description

An issue was raised on GitHub where the context server occasionally hangs because requests that fail to transfer pile up and hold resources. This PR adds the ability for users to configure kv_transfer_timeout_ms under cache_transceiver_config, which frees resources for requests that fail to transfer their KV cache within the timeout (configurable separately for context and generation). It must be configured carefully, or healthy requests may be terminated prematurely.
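For illustration, the option might be set in a disaggregated serving config like this. This is a hedged sketch: the key placement under cache_transceiver_config follows the PR description, but the sibling keys and values shown are illustrative.

```yaml
# Context/generation server config fragment (values illustrative)
cache_transceiver_config:
  backend: UCX                  # existing transceiver backend setting
  kv_transfer_timeout_ms: 5000  # clean up requests whose KV transfer exceeds 5 s
```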

Test Coverage

Tested locally by modifying the proxy server to intentionally inject transfer failures, then verifying that resources and requests were cleaned up after the timeout(s) elapsed.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping without care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing a pipeline without care and validation can break the top of tree.

@raayandhar raayandhar requested review from a team as code owners August 11, 2025 19:49

coderabbitai bot commented Aug 11, 2025

📝 Walkthrough

Adds an optional per-transfer KV-cache timeout through CacheTransceiverConfig (C++ and Python bindings), persists it in pickle/state, surfaces it in the LLM API and BindKvCacheTransceiver, and implements per-request timeout tracking and cleanup inside PyExecutor.

Changes

  • Core Config (C++) (cpp/include/tensorrt_llm/executor/executor.h, cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp): Added an optional constructor parameter kvTransferTimeoutMs, private storage mKvTransferTimeoutMs, a setter setKvTransferTimeoutMs(...), a getter getKvTransferTimeoutMs(), and included the timeout in equality comparison and initialization.
  • Bindings, nanobind (cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp): Extended the CacheTransceiverConfig binding to accept and serialize kv_transfer_timeout_ms (3-arg init, getstate/setstate), added a kv_transfer_timeout_ms property (getter/setter), and added CacheTransceiverBackendType::DEFAULT plus a from_string helper.
  • Bindings, pybind (cpp/tensorrt_llm/pybind/executor/executorConfig.cpp): Mirrored the nanobind changes: 3-arg construction path, pickle state updated to include the timeout, a kv_transfer_timeout_ms property with setKvTransferTimeoutMs, and the enum from_string helper.
  • PyExecutor timeout logic (tensorrt_llm/_torch/pyexecutor/py_executor.py): Added _check_kv_transfer_timeout() and integrated timeout checks into the executor loops and scheduling points; initializes and clears the per-request py_kv_transfer_start_time; marks timed-out requests with py_to_cleanup to drive termination and cleanup.
  • Per-request fields (tensorrt_llm/_torch/pyexecutor/llm_request.py): Added py_kv_transfer_start_time (None) and py_to_cleanup (False) attributes on request objects.
  • KV transceiver wrapper (tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py): Added a kv_transfer_timeout_ms attribute on BindKvCacheTransceiver, initialized from cache_transceiver_config.kv_transfer_timeout_ms.
  • LLM API model and examples (tensorrt_llm/llmapi/llm_args.py, examples/disaggregated/README.md): Added kv_transfer_timeout_ms: Optional[int] to the CacheTransceiverConfig model and forwarded it to pybind; documented the option in the example README.
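The per-request timeout tracking summarized above can be sketched with plain Python. This is a simplified illustration: Request is a stand-in for the real LlmRequest, the field names follow the PR, and the real executor integrates this check into its scheduling loops.

```python
import time


class Request:
    """Minimal stand-in for the PR's LlmRequest (illustrative, not the real class)."""

    def __init__(self, request_id):
        self.py_request_id = request_id
        self.py_kv_transfer_start_time = None  # set when a KV transfer begins
        self.py_to_cleanup = False  # set once the transfer has timed out


def check_kv_transfer_timeout(requests, timeout_ms, now=None):
    """Mark requests whose in-flight KV transfer has exceeded timeout_ms."""
    if timeout_ms is None:  # feature disabled: nothing to check
        return
    now = time.time() if now is None else now
    for req in requests:
        if req.py_kv_transfer_start_time is None:  # no transfer in flight
            continue
        elapsed_ms = (now - req.py_kv_transfer_start_time) * 1000
        if elapsed_ms > timeout_ms:
            req.py_to_cleanup = True  # executor terminates it during cleanup
```

Passing `now` explicitly keeps the check deterministic for testing; the real code would use the current wall-clock time.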

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant PyExecutor
  participant KvCacheTransceiver
  participant Request

  User->>PyExecutor: submit request(s)
  PyExecutor->>KvCacheTransceiver: initiate KV transfer (ctx/gen)
  PyExecutor->>Request: set py_kv_transfer_start_time

  loop executor cycles
    PyExecutor->>KvCacheTransceiver: poll transfer status
    PyExecutor->>PyExecutor: _check_kv_transfer_timeout()
    alt elapsed > kv_transfer_timeout_ms
      PyExecutor->>Request: set py_to_cleanup = True
      PyExecutor->>PyExecutor: terminate/cleanup request
    else transfer completes
      PyExecutor->>Request: clear py_kv_transfer_start_time
    end
  end

  PyExecutor-->>User: respond / terminate

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Suggested labels

Documentation

Suggested reviewers

  • Tabrizian
  • Shunkangz
  • chuangz0
  • pcastonguay
  • nv-guomingz
  • Superjomn


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these settings in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9c7ef25 and 0f0256d.

📒 Files selected for processing (2)
  • cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
  • cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/include/tensorrt_llm/executor/executor.h
  • cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@raayandhar raayandhar changed the title [nvbugs/5429636][fix] DRAFT: fix KVC mem leakage by adding kv_transfer_timeout_ms [https://nvbugs/5429636][fix] DRAFT: fix KVC mem leakage by adding kv_transfer_timeout_ms Aug 11, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between c9fe07e and 938297f.

📒 Files selected for processing (7)
  • cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
  • cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2 hunks)
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (9 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • cpp/include/tensorrt_llm/executor/executor.h
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp
  • cpp/include/tensorrt_llm/executor/executor.h
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/include/tensorrt_llm/executor/executor.h
🔇 Additional comments (14)
cpp/include/tensorrt_llm/executor/executor.h (2)

1446-1448: LGTM! Well-designed API addition.

The constructor signature properly extends the existing parameters with an optional timeout parameter, maintaining backward compatibility.


1453-1455: Good API design with proper type safety.

The setter/getter pair uses appropriate types - MillisecondsType for input (likely an integer alias) and std::chrono::milliseconds for output, providing type safety for duration values.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

332-332: LGTM! Proper initialization of transfer timestamp.

The new attribute follows the established naming convention and is appropriately initialized to None.

tensorrt_llm/llmapi/llm_args.py (2)

1067-1071: LGTM! Well-documented configuration field.

The field is properly defined with a clear description and appropriate default value, maintaining backward compatibility.


1072-1077: Correct forwarding to pybind layer.

The timeout parameter is properly passed to the underlying C++ binding.
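As a rough illustration of that forwarding pattern, the sketch below mirrors the shape of the change with simplified stand-ins; neither class is the real llm_args model or pybind type, and the field names follow the PR.

```python
from dataclasses import dataclass
from typing import Optional


class _BindingConfig:
    """Hypothetical stand-in for the pybind CacheTransceiverConfig."""

    def __init__(self, backend=None, max_tokens_in_buffer=None,
                 kv_transfer_timeout_ms=None):
        self.backend = backend
        self.max_tokens_in_buffer = max_tokens_in_buffer
        self.kv_transfer_timeout_ms = kv_transfer_timeout_ms


@dataclass
class CacheTransceiverConfig:
    """Simplified mirror of the Python-side config model."""
    backend: Optional[str] = None
    max_tokens_in_buffer: Optional[int] = None
    kv_transfer_timeout_ms: Optional[int] = None  # new optional field

    def _to_pybind(self):
        # Forward every field, including the new timeout, to the binding layer.
        return _BindingConfig(self.backend, self.max_tokens_in_buffer,
                              self.kv_transfer_timeout_ms)
```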

cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp (3)

24-30: LGTM! Proper constructor initialization.

The constructor correctly initializes the new timeout member using the initializer list.


32-36: Equality operator correctly updated.

The comparison properly includes the new timeout field.


58-67: Clean getter/setter implementation.

The accessor methods are correctly implemented with appropriate const-correctness.

tensorrt_llm/_torch/pyexecutor/py_executor.py (4)

1102-1104: Clarify why timeout check is disabled in overlap executor.

The timeout check is commented out in the overlap executor loop. Please either enable it or add a comment explaining why it's intentionally disabled in this execution path.


1384-1388: Good practice: Setting start time for new transfers.

The implementation correctly sets the transfer start time when initiating generation cache transfers, but only when timeout is configured.


1422-1426: Consistent timeout tracking for context transfers.

The start time is properly set for context cache transfers, maintaining consistency with generation transfers.


1360-1360: Proper cleanup of transfer timestamps.

The start times are correctly cleared when transfers complete or requests are terminated.

Also applies to: 1686-1686

cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)

414-495: LGTM! CacheTransceiverConfig timeout implementation is well-structured.

The pybind11 bindings for the KV transfer timeout feature are correctly implemented with proper:

  • Type conversions between Python int64_t and C++ std::chrono::milliseconds
  • Optional value handling
  • Pickle support with backward compatibility (3-tuple state)
  • Property and method bindings
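The backward-compatible 3-tuple pickle state described above can be illustrated in plain Python. The class is a hypothetical stand-in: the real bindings implement this in C++ via the pickle getstate/setstate hooks, and the 2-tuple fallback shown here is one way older pickles could remain loadable.

```python
import pickle


class CacheTransceiverConfigSketch:
    """Illustrative stand-in for the bound config object (not the real binding)."""

    def __init__(self, backend=None, max_tokens_in_buffer=None,
                 kv_transfer_timeout_ms=None):
        self.backend = backend
        self.max_tokens_in_buffer = max_tokens_in_buffer
        self.kv_transfer_timeout_ms = kv_transfer_timeout_ms

    def __getstate__(self):
        # New 3-tuple state includes the timeout.
        return (self.backend, self.max_tokens_in_buffer,
                self.kv_transfer_timeout_ms)

    def __setstate__(self, state):
        # Accept older 2-tuple pickles by defaulting the timeout to None.
        if len(state) == 2:
            state = (*state, None)
        (self.backend, self.max_tokens_in_buffer,
         self.kv_transfer_timeout_ms) = state


cfg = CacheTransceiverConfigSketch("UCX", 4096, kv_transfer_timeout_ms=5000)
restored = pickle.loads(pickle.dumps(cfg))
```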
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)

431-512: LGTM! Nanobind implementation mirrors pybind11 correctly.

The nanobind bindings for CacheTransceiverConfig are properly implemented with:

  • Consistent API with the pybind11 version
  • Correct use of nanobind-specific patterns (placement new in setstate)
  • Proper type conversions and optional handling
  • Complete pickle support

@raayandhar raayandhar marked this pull request as draft August 11, 2025 21:08

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1287-1313: Fix critical issues in the timeout checking implementation

The _check_kv_transfer_timeout method has several issues that need to be addressed:

  1. Missing null check for kv_cache_transceiver before accessing its properties
  2. The condition at line 1304 incorrectly checks is_disagg_generation_transmission_in_progress instead of checking if the request is in progress
  3. The method doesn't call _terminate_request as expected but only sets the py_to_cleanup flag

Apply this fix to resolve the issues:

 def _check_kv_transfer_timeout(self):
+    if not self.kv_cache_transceiver:
+        return
     current_time = time.time()
     timeout_ms = self.kv_cache_transceiver.kv_transfer_timeout_ms
     if timeout_ms is None:
         return
 
     for req in self.ctx_in_transmission_requests[:]:
         if req.py_kv_transfer_start_time is None:
             continue
         elapsed_time = (current_time - req.py_kv_transfer_start_time) * 1000
         if elapsed_time > timeout_ms:
             logger.debug(
                 f"Terminating context request {req.py_request_id} due to KV cache transfer timeout"
             )
             req.py_to_cleanup = True
 
     for req in self.active_requests[:]:
         if req.is_disagg_generation_transmission_in_progress and req.py_kv_transfer_start_time is not None:
             elapsed_time = (current_time -
                             req.py_kv_transfer_start_time) * 1000
             if elapsed_time > timeout_ms:
                 logger.debug(
                     f"Terminating generation request {req.py_request_id} due to KV cache transfer timeout"
                 )
                 req.py_to_cleanup = True
-
-    return
🧹 Nitpick comments (4)
tensorrt_llm/_torch/pyexecutor/py_executor.py (4)

907-907: Consider reducing log verbosity for production

While the debug logging is helpful during development, it might be too verbose for production. Consider using a more structured logging approach or controlling verbosity through configuration.

-        self._log_ctx_transmission_state("Starting _executor_loop")
+        if logger.isEnabledFor(logging.DEBUG):
+            self._log_ctx_transmission_state("Starting _executor_loop")

990-1005: Improve logging helper method for better debugging

The logging helper method could be enhanced with additional context and better formatting for production use.

 def _log_ctx_transmission_state(self, context: str):
     """Helper method to log the current state of ctx_in_transmission_requests"""
+    if not logger.isEnabledFor(logging.DEBUG):
+        return
+        
     if not self.ctx_in_transmission_requests:
         logger.debug(
             f"[CTX_TRANSMISSION] {context}: ctx_in_transmission_requests is EMPTY"
         )
     else:
-        request_ids = [
-            req.py_request_id for req in self.ctx_in_transmission_requests
-        ]
-        client_ids = [
-            req.py_client_id for req in self.ctx_in_transmission_requests
-        ]
+        request_info = [
+            {
+                "request_id": req.py_request_id,
+                "client_id": req.py_client_id,
+                "state": req.state.name if hasattr(req.state, 'name') else str(req.state),
+                "has_timeout": req.py_kv_transfer_start_time is not None
+            }
+            for req in self.ctx_in_transmission_requests
+        ]
         logger.debug(
-            f"[CTX_TRANSMISSION] {context}: ctx_in_transmission_requests has {len(self.ctx_in_transmission_requests)} requests - request_ids: {request_ids}, client_ids: {client_ids}"
+            f"[CTX_TRANSMISSION] {context}: {len(self.ctx_in_transmission_requests)} requests in transmission",
+            extra={"requests": request_info}
         )

1120-1121: Clarify the commented-out timeout check in overlap executor

The comment indicates uncertainty about placing the timeout check here. This should be resolved before finalizing the PR.

The timeout check is commented out with uncertainty. This needs clarification to ensure consistent timeout handling across all executor paths. Would you like me to analyze the overlap executor flow to determine if the timeout check should be enabled here?


1658-1716: Consider reducing logging frequency in hot paths

Multiple debug log calls in the response handling path could impact performance. Consider consolidating or conditionally enabling these logs.

 def _handle_responses(self):
-    self._log_ctx_transmission_state("Start of _handle_responses")
+    if logger.isEnabledFor(logging.DEBUG):
+        self._log_ctx_transmission_state("Start of _handle_responses")
     # ... rest of the method ...
-    self._log_ctx_transmission_state("End of _handle_responses")
+    if logger.isEnabledFor(logging.DEBUG):
+        self._log_ctx_transmission_state("End of _handle_responses")
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 8b116bf and 296c7ae.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (14 hunks)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py

1004-1004: Line too long (186 > 120)

(E501)

🔇 Additional comments (7)
tensorrt_llm/_torch/pyexecutor/py_executor.py (7)

858-859: Good placement of timeout check after generation transfer status check

The timeout check is appropriately placed after checking disaggregated generation transfer status. This ensures that transfers are monitored and timed out if they exceed the configured threshold.


978-978: Consistent timeout check placement in the main executor loop

The timeout check is consistently placed after terminating context-finished requests, matching the pattern used in the PP executor loop.


1387-1387: Correctly reset timeout tracking when generation transfer completes

The timeout tracking is properly reset when a generation request completes its KV cache transfer and transitions to the generation-in-progress state.


1411-1414: Properly initialize timeout tracking for new generation requests

The start time is correctly initialized when new generation requests begin their KV cache transfer.


1449-1452: Consistently initialize timeout tracking for context transmission

The timeout tracking is properly initialized for context requests that enter the transmission state.


1672-1676: Effective cleanup mechanism for timed-out requests

The cleanup flag is properly checked during response handling to terminate requests that have exceeded the timeout threshold.


1724-1727: Proper cleanup of timeout tracking in termination handler

The timeout tracking is correctly reset when requests are terminated, whether due to completion or timeout.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

840-843: Swap timeout check before termination (still pending here).

Call _check_kv_transfer_timeout() before _terminate_ctx_finished_requests() to ensure timed-out ctx transfers are cleaned up immediately. This mirrors the updated ordering elsewhere in this file.

Apply this diff:

-                if self.kv_cache_transceiver and self.ctx_in_transmission_requests:
-                    self._terminate_ctx_finished_requests()
-                    self._check_kv_transfer_timeout()
+                if self.kv_cache_transceiver and self.ctx_in_transmission_requests:
+                    self._check_kv_transfer_timeout()
+                    self._terminate_ctx_finished_requests()
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 296c7ae and 76bb547.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/py_executor.py (6)

857-860: LGTM: Added timeout check in prep/schedule path.

Running _check_kv_transfer_timeout() early in the scheduling path helps bound resource usage proactively.


975-978: LGTM: Correct ordering (check timeout before termination).

This prevents a one-iteration delay in cleaning up timed-out ctx transfers.


1101-1103: LGTM: Correct ordering (overlap loop).

Timeout check precedes termination, consistent with the intended behavior.


1368-1369: LGTM: Reset start time on gen transfer completion.

Clearing py_kv_transfer_start_time on completion prevents stale timeouts.


1699-1701: LGTM: Cleanup on ctx completion or timeout.

Clearing start time and removing from in-flight list ensures bounded resource usage.


1652-1657: I couldn’t find a FinishReason.TIMEOUT or a finish_by_reason method in the codebase, and the FinishReason enum only defines CANCELLED, STOP, and LENGTH. Likewise, there’s no terminal-response helper on LlmRequest. Emitting a final response will require:

• Defining a timeout finish reason
• Implementing or exposing a finish_by_reason(...) API on LlmRequest
• Ensuring create_response(...) returns a valid final chunk

Until those exist, the suggested diff won’t compile.

Action items:

  • Add TIMEOUT to FinishReason (e.g. in finish_reason.py)

  • Implement LlmRequest.finish_by_reason(reason: FinishReason)

  • In py_executor.py around line 1652, call:

    request.finish_by_reason(FinishReason.TIMEOUT)
    response = request.create_response(final_chunk=True, worker_rank=self.dist.rank)
    if response:
        new_responses.append((req_id, response))
  • Then proceed with requests_to_terminate.append(request) as before.

Please confirm before landing these refactors.
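
The action items above can be sketched in isolation. Every name here (FinishReason.TIMEOUT, finish_by_reason, LlmRequestSketch) is hypothetical — per the comment above, none of it exists in the codebase yet; this only illustrates the shape the refactor could take:

```python
from enum import Enum


class FinishReason(Enum):
    """Sketch of the enum with the proposed TIMEOUT value added."""
    STOP = 0
    LENGTH = 1
    CANCELLED = 2
    TIMEOUT = 3  # proposed: request abandoned due to KV-transfer timeout


class LlmRequestSketch:
    """Stand-in for LlmRequest, showing a possible finish_by_reason API."""

    def __init__(self, request_id):
        self.py_request_id = request_id
        self.finish_reason = None
        self.is_finished = False

    def finish_by_reason(self, reason: FinishReason):
        """Mark the request finished with the given terminal reason."""
        self.finish_reason = reason
        self.is_finished = True
```

A caller in the executor loop would then mark the request finished before building the final response chunk, so the client sees a terminal state instead of hanging.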

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

842-843: Swap the order: run timeout checks before terminating ctx-finished requests (consistency and correctness).

To promptly flag and clean up timed-out transfers before normal termination, call _check_kv_transfer_timeout() before _terminate_ctx_finished_requests() here (matches other loops below).

-                    self._terminate_ctx_finished_requests()
-                    self._check_kv_transfer_timeout()
+                    self._check_kv_transfer_timeout()
+                    self._terminate_ctx_finished_requests()
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 76bb547 and 6936a3b.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)


Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/_torch/pyexecutor/py_executor.py (5)

857-860: LGTM: proactive timeout check during batch prep.

Running _check_kv_transfer_timeout() during batch preparation reduces time-to-detection for stalled transfers.


975-978: LGTM: correct ordering — check timeouts before terminating ctx-finished.

This matches the intended flow and previous guidance.


1101-1103: LGTM: correct ordering in overlap loop as well.

Consistent with the non-PP path; timeout check precedes ctx-finished termination.


1370-1370: LGTM: clear the KV transfer timing anchor on gen-transfer completion.

Resetting py_kv_transfer_start_time avoids stale elapsed-time checks after completion.


1701-1703: LGTM: terminate ctx transfers on completion or timeout and clear timing anchor.

This ensures resource cleanup for context-only transmissions and avoids stale timers.

@raayandhar raayandhar marked this pull request as ready for review August 12, 2025 23:13
@raayandhar raayandhar requested review from a team as code owners August 12, 2025 23:13
@raayandhar raayandhar changed the title [https://nvbugs/5429636][fix] DRAFT: fix KVC mem leakage by adding kv_transfer_timeout_ms [https://nvbugs/5429636][fix] fix KVC mem leakage by adding kv_transfer_timeout_ms Aug 12, 2025
@raayandhar
Contributor Author

cc: @Tabrizian
... I'm not sure why it looks like I requested a ton of reviews. I did not do that...

@raayandhar
Contributor Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15028 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15028 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11350 completed with status: 'FAILURE'

@raayandhar
Contributor Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1-1: Add the required NVIDIA copyright header

This source lacks the required NVIDIA header per repository guidelines. Please prepend the standard header at the top of this file.

Example header:

# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
# NVIDIA and its licensors retain all intellectual property and proprietary
# rights in and to this software, related documentation and any modifications
# thereto.  Any use, reproduction, disclosure or distribution of this software
# and related documentation without an express license agreement from NVIDIA is
# strictly prohibited.
♻️ Duplicate comments (5)
tensorrt_llm/_torch/pyexecutor/py_executor.py (5)

841-844: Call timeout checks before terminating finished ctx transfers

Run _check_kv_transfer_timeout() before _terminate_ctx_finished_requests() to ensure timeouts are evaluated and flagged prior to cleanup.

-    self._terminate_ctx_finished_requests()
-    self._check_kv_transfer_timeout()
+    self._check_kv_transfer_timeout()
+    self._terminate_ctx_finished_requests()

1269-1298: Use a monotonic clock and treat non-positive timeouts as disabled

Wall-clock jumps can skew elapsed-time checks. Use time.monotonic() and keep semantics: None or <= 0 disables the timeout. Also prefer elapsed_ms for clarity.

 def _check_kv_transfer_timeout(self):
     if not self.kv_cache_transceiver:
         return
     timeout_ms = self.kv_cache_transceiver.kv_transfer_timeout_ms
     if timeout_ms is None or timeout_ms <= 0:
         return
-    current_time = time.time()
+    current_time = time.monotonic()
 
     for req in self.ctx_in_transmission_requests[:]:
         if req.py_kv_transfer_start_time is None:
             continue
-        elapsed_time = (current_time - req.py_kv_transfer_start_time) * 1000
-        if elapsed_time > timeout_ms:
+        elapsed_ms = (current_time - req.py_kv_transfer_start_time) * 1000
+        if elapsed_ms > timeout_ms:
             logger.warning(
                 f"Terminating context request {req.py_request_id} due to KV cache transfer timeout"
             )
             req.py_to_cleanup = True
 
     for req in self.active_requests[:]:
         if req.is_disagg_generation_transmission_in_progress and req.py_kv_transfer_start_time is not None:
-            elapsed_time = (current_time -
-                            req.py_kv_transfer_start_time) * 1000
-            if elapsed_time > timeout_ms:
+            elapsed_ms = (current_time - req.py_kv_transfer_start_time) * 1000
+            if elapsed_ms > timeout_ms:
                 logger.warning(
                     f"Terminating generation request {req.py_request_id} due to KV cache transfer timeout"
                 )
                 req.py_to_cleanup = True

1395-1399: Initialize gen-transfer timers only when timeout is enabled, and use a monotonic clock

Honor disabled/non-positive timeouts and switch to time.monotonic() for robustness.

-        if self.kv_cache_transceiver.kv_transfer_timeout_ms is not None:
+        if self.kv_cache_transceiver.kv_transfer_timeout_ms is not None and \
+           self.kv_cache_transceiver.kv_transfer_timeout_ms > 0:
             for req in new_gen_reqs:
                 if req.state == LlmRequestState.DISAGG_GENERATION_TRANS_IN_PROGRESS:
-                    req.py_kv_transfer_start_time = time.time()
+                    req.py_kv_transfer_start_time = time.monotonic()

1433-1437: Initialize ctx-transfer timers only when timeout is enabled, and use a monotonic clock

Keep clock sources consistent and ignore non-positive timeouts.

-        if self.kv_cache_transceiver.kv_transfer_timeout_ms is not None:
+        if self.kv_cache_transceiver.kv_transfer_timeout_ms is not None and \
+           self.kv_cache_transceiver.kv_transfer_timeout_ms > 0:
             for req in ctx_transmission_reqs:
                 if req.state == LlmRequestState.DISAGG_CONTEXT_TRANS_IN_PROGRESS:
-                    req.py_kv_transfer_start_time = time.time()
+                    req.py_kv_transfer_start_time = time.monotonic()

1655-1660: Emit a final response on timeout to avoid client hangs

Requests flagged for cleanup due to KV-transfer timeout are terminated without sending a final response. Emit a final response (use CANCELLED as a proxy for TIMEOUT) before cleanup so clients don't hang waiting.

             # Check if generation request needs cleanup due to KV cache transfer timeout
             if request.py_to_cleanup:
-                request.state = LlmRequestState.GENERATION_COMPLETE
-                requests_to_terminate.append(request)
-                continue
+                # Mark as cancelled due to timeout and emit a final response
+                request.finish_by_reason(FinishReason.CANCELLED)
+                request.state = LlmRequestState.GENERATION_COMPLETE
+                request.decoding_iter = request.py_decoding_iter
+                response = request.create_response(False, self.dist.rank)
+                if response:
+                    new_responses.append((req_id, response))
+                requests_to_terminate.append(request)
+                continue
🧹 Nitpick comments (1)
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)

474-495: Validate non-negative timeout in the Python setter

Guard against negative values to prevent undefined semantics; raise a Python ValueError when negative.

         .def_property(
             "kv_transfer_timeout_ms",
             [](tle::CacheTransceiverConfig const& self) -> std::optional<int64_t>
             {
                 auto timeout = self.getKvTransferTimeoutMs();
                 return timeout ? std::optional<int64_t>(timeout->count()) : std::nullopt;
             },
             [](tle::CacheTransceiverConfig& self, std::optional<int64_t> timeoutMs)
             {
                 if (timeoutMs)
                 {
-                    self.setKvTransferTimeoutMs(std::chrono::milliseconds(*timeoutMs));
+                    if (*timeoutMs < 0)
+                    {
+                        throw py::value_error("kv_transfer_timeout_ms must be non-negative");
+                    }
+                    self.setKvTransferTimeoutMs(std::chrono::milliseconds(*timeoutMs));
                 }
                 else
                 {
                     self.setKvTransferTimeoutMs(std::nullopt);
                 }
             })
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31634aa and 5924e85.

📒 Files selected for processing (9)
  • cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
  • cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2 hunks)
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (2 hunks)
  • examples/disaggregated/README.md (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (10 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • cpp/include/tensorrt_llm/executor/executor.h
  • examples/disaggregated/README.md
  • tensorrt_llm/llmapi/llm_args.py
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
  • cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)


Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/py_executor.py (4)

858-861: Good placement: timeout check is invoked early in the iteration

Calling _check_kv_transfer_timeout() right after _check_disagg_gen_transfer_status() ensures overdue transfers are pruned promptly.


976-979: Correct order: timeout check before ctx-finished termination

This ordering prevents missing timeouts on still-in-transit ctx requests. No change needed here.


1371-1371: LGTM: Resetting start-time on gen transmission completion

Clearing py_kv_transfer_start_time on successful generation transmission completion avoids stale timers.


1701-1703: LGTM: Terminating ctx-transmission requests on completion or timeout

Including request.py_to_cleanup here promptly frees resources and clears the timeout start marker.

cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (2)

414-419: Add kv_transfer_timeout_ms to pickle state: looks good

Returning a 3-tuple with timeoutMs preserves state across pickling. The conversion to optional<int64_t> is correct.


456-470: 3-arg constructor for CacheTransceiverConfig: good

The binding cleanly maps optional timeoutMs to std::chrono::milliseconds and threads it through the C++ constructor.

@tensorrt-cicd
Collaborator

PR_Github #15033 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15033 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11354 completed with status: 'FAILURE'

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (4)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)

841-844: Run timeout checks before terminating finished ctx transfers (ordering)

Call _check_kv_transfer_timeout() before _terminate_ctx_finished_requests() so timed-out ctx transfers are flagged and cleaned in the same cycle. This mirrors the ordering you already apply in other loops.

-                if self.kv_cache_transceiver and self.ctx_in_transmission_requests:
-                    self._terminate_ctx_finished_requests()
-                    self._check_kv_transfer_timeout()
+                if self.kv_cache_transceiver and self.ctx_in_transmission_requests:
+                    self._check_kv_transfer_timeout()
+                    self._terminate_ctx_finished_requests()

1269-1298: Use a monotonic clock and unify elapsed naming

time.time() is not monotonic and can be affected by system clock changes. Use time.monotonic() and keep variable naming consistent (elapsed_ms). Also keep the existing guards (None or <= 0 => disabled).

 def _check_kv_transfer_timeout(self):
     if not self.kv_cache_transceiver:
         return
     timeout_ms = self.kv_cache_transceiver.kv_transfer_timeout_ms
     if timeout_ms is None or timeout_ms <= 0:
         return
-    current_time = time.time()
+    current_time = time.monotonic()

     for req in self.ctx_in_transmission_requests[:]:
         if req.py_kv_transfer_start_time is None:
             continue
-        elapsed_time = (current_time - req.py_kv_transfer_start_time) * 1000
-        if elapsed_time > timeout_ms:
+        elapsed_ms = (current_time - req.py_kv_transfer_start_time) * 1000
+        if elapsed_ms > timeout_ms:
             logger.warning(
                 f"Terminating context request {req.py_request_id} due to KV cache transfer timeout"
             )
             req.py_to_cleanup = True

     for req in self.active_requests[:]:
         if req.is_disagg_generation_transmission_in_progress and req.py_kv_transfer_start_time is not None:
-            elapsed_time = (current_time -
-                            req.py_kv_transfer_start_time) * 1000
-            if elapsed_time > timeout_ms:
+            elapsed_ms = (current_time - req.py_kv_transfer_start_time) * 1000
+            if elapsed_ms > timeout_ms:
                 logger.warning(
                     f"Terminating generation request {req.py_request_id} due to KV cache transfer timeout"
                 )
                 req.py_to_cleanup = True

     return
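
The monotonic-clock check suggested in this diff can be expressed as a standalone predicate. The function name is_transfer_timed_out is invented for this sketch (the real logic lives inline in _check_kv_transfer_timeout), but the semantics match the review comment: None or a non-positive timeout disables the check, and elapsed time is measured against a time.monotonic() anchor.

```python
import time


def is_transfer_timed_out(start_time, timeout_ms, now=None):
    """Return True if a KV transfer begun at `start_time` (a
    time.monotonic() reading) has exceeded `timeout_ms`.

    A timeout of None or <= 0 disables the check; an unset start
    time means the transfer never began tracking, so it cannot
    time out.
    """
    if timeout_ms is None or timeout_ms <= 0:
        return False
    if start_time is None:
        return False
    if now is None:
        now = time.monotonic()
    elapsed_ms = (now - start_time) * 1000
    return elapsed_ms > timeout_ms
```

Because time.monotonic() never moves backwards, NTP adjustments or manual clock changes cannot cause spurious timeouts or mask real ones, which is the point of the suggested change.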

1433-1437: Initialize ctx-transfer timer with monotonic clock and don’t reset it repeatedly

Mirror the gen path: respect disabled/non-positive timeouts, use a monotonic clock, and only set once.

-        if self.kv_cache_transceiver.kv_transfer_timeout_ms is not None:
+        if (self.kv_cache_transceiver.kv_transfer_timeout_ms is not None and
+                self.kv_cache_transceiver.kv_transfer_timeout_ms > 0):
             for req in ctx_transmission_reqs:
                 if req.state == LlmRequestState.DISAGG_CONTEXT_TRANS_IN_PROGRESS:
-                    req.py_kv_transfer_start_time = time.time()
+                    if req.py_kv_transfer_start_time is None:
+                        req.py_kv_transfer_start_time = time.monotonic()
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)

422-435: Backward compatibility: accept 2- or 3-element pickle state in setstate

Strictly requiring 3 elements breaks older pickles. Accept both sizes and apply timeout only when present.

-        if (state.size() != 3)
-        {
-            throw std::runtime_error("Invalid CacheTransceiverConfig state!");
-        }
-        auto config
-            = tle::CacheTransceiverConfig(state[0].cast<std::optional<tle::CacheTransceiverConfig::BackendType>>(),
-                state[1].cast<std::optional<size_t>>());
-        auto timeoutMs = state[2].cast<std::optional<int64_t>>();
-        if (timeoutMs)
-        {
-            config.setKvTransferTimeoutMs(std::chrono::milliseconds(*timeoutMs));
-        }
-        return config;
+        if (state.size() < 2 || state.size() > 3)
+        {
+            throw std::runtime_error("Invalid CacheTransceiverConfig state!");
+        }
+        auto config
+            = tle::CacheTransceiverConfig(state[0].cast<std::optional<tle::CacheTransceiverConfig::BackendType>>(),
+                state[1].cast<std::optional<size_t>>());
+        if (state.size() == 3)
+        {
+            auto timeoutMs = state[2].cast<std::optional<int64_t>>();
+            if (timeoutMs)
+            {
+                config.setKvTransferTimeoutMs(std::chrono::milliseconds(*timeoutMs));
+            }
+        }
+        return config;
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1-10: Missing NVIDIA copyright header

Per repository guidelines, Python sources should include the NVIDIA header. Please prepend the standard header.

Example (outside this diff hunk):

# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
# NVIDIA and its licensors retain all intellectual property and proprietary
# rights in and to this software, related documentation and any modifications
# thereto.  Any use, reproduction, disclosure or distribution of this software
# and related documentation without an express license agreement from NVIDIA is
# strictly prohibited.
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)

1-16: Header years

The SPDX header ends at 2024, while other updated files use 2025. If the project policy is to bump the year on touched files, consider updating.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5924e85 and e671749.

📒 Files selected for processing (8)
  • cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
  • cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2 hunks)
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (10 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
  • cpp/include/tensorrt_llm/executor/executor.h
  • cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)


Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)


Files:

  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (10)
tensorrt_llm/_torch/pyexecutor/py_executor.py (4)

858-861: Good: early timeout check after gen-transfer status

Placing _check_kv_transfer_timeout() right after _check_disagg_gen_transfer_status() ensures in-flight timeouts are handled before scheduling.
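The shape of such a periodic timeout check can be sketched as below. This is a minimal illustration, not the actual PyExecutor code; the attribute name `py_kv_transfer_start_time` follows the discussion in this PR, while `check_kv_transfer_timeout` and the `terminate` callback are hypothetical stand-ins:

```python
import time

def check_kv_transfer_timeout(in_transmission_requests, timeout_ms, terminate):
    """Terminate any request whose KV-cache transfer has exceeded timeout_ms."""
    if timeout_ms is None:
        return  # timeout feature disabled
    now = time.monotonic()
    for req in list(in_transmission_requests):
        start = getattr(req, "py_kv_transfer_start_time", None)
        if start is not None and (now - start) * 1000.0 > timeout_ms:
            terminate(req)  # release KV cache blocks and drop the request
            req.py_kv_transfer_start_time = None  # avoid stale timestamps on reuse
```

Clearing the start time after termination is what the review note above about reused request objects refers to: without it, a recycled request object could appear to have been "in transmission" since a previous life.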


976-979: Good: correct ordering (timeout before ctx-finished termination)

This ordering avoids missing a timeout in the same iteration.


1101-1104: Good: consistent ordering in overlap loop

Timeout is checked before ctx-finished termination here as well.


1701-1706: Good: clear start-time on ctx completion or timeout

Resetting py_kv_transfer_start_time on completion or timeout prevents stale timestamps on reused request objects.

cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (3)

414-419: LGTM: pickle getstate includes timeout

Returning a 3-tuple with optional timeout preserves the new field across serialization.
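The round-trip behavior being reviewed can be mimicked in plain Python: `__getstate__` returns a tuple whose last slot carries the optional timeout, and `__setstate__` restores it. This is only an illustrative stand-in for the bound C++ config class; the field names here mirror the PR discussion and are not the real binding:

```python
import pickle

class CacheTransceiverConfigSketch:
    """Illustrative stand-in for the bound config; not the real class."""

    def __init__(self, backend=None, max_tokens_in_buffer=None,
                 kv_transfer_timeout_ms=None):
        self.backend = backend
        self.max_tokens_in_buffer = max_tokens_in_buffer
        self.kv_transfer_timeout_ms = kv_transfer_timeout_ms

    def __getstate__(self):
        # 3-tuple: the optional timeout rides in the last slot.
        return (self.backend, self.max_tokens_in_buffer,
                self.kv_transfer_timeout_ms)

    def __setstate__(self, state):
        (self.backend, self.max_tokens_in_buffer,
         self.kv_transfer_timeout_ms) = state

cfg = CacheTransceiverConfigSketch("DEFAULT", 4096, kv_transfer_timeout_ms=500)
restored = pickle.loads(pickle.dumps(cfg))
```

The point of the 3-tuple is exactly what the note says: an older 2-tuple state would silently drop the new field across pickling.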


457-470: LGTM: 3-arg constructor wiring

The lambda converting optional int64_t to optional milliseconds is correct and keeps the Python API ergonomic.


474-491: LGTM: kv_transfer_timeout_ms property binding

Getter/setter semantics and nullopt handling look correct.

cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (3)

431-436: LGTM: pickle getstate includes timeout

3-tuple state with optional timeout is correct.


473-486: LGTM: 3-arg constructor wiring

Optional ms conversion is correct.


491-507: LGTM: kv_transfer_timeout_ms property binding

Correct optional handling and ms conversion.

@raayandhar
Copy link
Contributor Author

raayandhar commented Aug 14, 2025

I have two questions here.

  1. Currently, we are using the C++ KV cache transceiver. It seems that we only terminate the requests at the Python level, and the timed-out requests are still in the queue of the C++ transceiver. Is there any problem, given that the KV cache has already been released?
  2. The context worker might time out first while the generation workers still try to acquire the KV cache. In this case, the C++ transceiver is not aware of this and might try to send requests whose KV cache has been released.

For 1, could you elaborate on why the timed-out requests are still in the queue of the C++ transceiver? In my testing I recreate conditions where context requests pile up and fail to send their KV cache, which is a use case this feature targets. That's why I make sure to clean up these requests the same way I would clean up requests that successfully transmitted their KV cache, i.e. the normal cleanup flow. I think they might still be in the queue, but once checkContextTransferStatus is called later they will also get removed.
For 2, are you saying that there is a race condition where requests in ctx_in_transmission_requests can have their generation workers acquire their KV cache before _terminate_ctx_finished_requests? I was not aware of this possible case. My thinking is that as long as I clean up in the regular flow we should avoid issues like this, and I did not face any issue in testing. But if you could elaborate on this case, that would be very helpful.

@Tabrizian
Copy link
Member

Re https://github.com/NVIDIA/TensorRT-LLM/pull/6801#issuecomment-3185781656
I meant: if we send a context request but never send any request to the gen server or context server afterwards, would we still cancel the KV cache send request from the context worker side?

@raayandhar
Copy link
Contributor Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15332 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15332 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11573 completed with status: 'FAILURE'

@raayandhar
Copy link
Contributor Author

raayandhar commented Aug 15, 2025

Re https://github.com/NVIDIA/TensorRT-LLM/pull/6801#issuecomment-3185781656: I meant if we send a context request but never send any request to the gen server or context server afterwards, would we still cancel the KV cache send request from the context worker side?

If we set the timeout low enough, yes. The check completes in under 1 ms in my testing, so I had to artificially introduce delay via time.sleep.
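The delay-injection idea mentioned above can be sketched as a self-contained toy; `transfer_with_delay` is a hypothetical stand-in for a KV-cache send, not the actual test harness:

```python
import time

def transfer_with_delay(delay_s):
    """Stand-in for a KV-cache send that is artificially slowed down."""
    start = time.monotonic()
    time.sleep(delay_s)  # simulated stall; the real transfer is network I/O
    return (time.monotonic() - start) * 1000.0  # elapsed milliseconds

timeout_ms = 50
elapsed_ms = transfer_with_delay(0.1)  # a 100 ms stall exceeds the 50 ms budget
timed_out = elapsed_ms > timeout_ms
```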

@raayandhar
Copy link
Contributor Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15468 [ run ] triggered by Bot

@Shunkangz
Copy link
Collaborator

  1. You only cancel the requests from the PyExecutor, not from the C++ cache transceiver. https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp#L254
  2. Timeout behavior cannot ensure that the context workers and generation workers share the same view. We should exchange acknowledgment information one more time. I have a draft PR here that does a similar thing. Instead of using a timeout, we believe we should provide a cancelRequest interface to the upper-level scheduler and throw an exception with useful information. That PR is waiting on the exception refactor for disagg to be ready. cc @Shixiaowei02 @chuangz0
  3. There are two cases where we can encounter a timeout. (1) The generation workers do not acquire the KV cache in time, perhaps because of a resource limitation. In this case, we can simply remove the request from the ready queue; however, the generation workers might still try to acquire the KV cache, so we should let them know the latest status. In a distributed system, it may not be proper to rely on a "tricky timeout". (2) The current send/recv hangs. In this case, you cannot even cancel the cache transceiver, because UCX and RDMA do not support this. The only way is to destroy the endpoint and let both sender and receiver know.

I hope that I am clear. If you need more discussion, we can schedule a meeting.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15468 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11651 completed with status: 'FAILURE'

@raayandhar
Copy link
Contributor Author


I see, thanks Shunkang for the detailed response. I agree that the cancelRequest interface makes more sense. What's not clear to me is whether it will directly solve the problem in this issue, since no exceptions are thrown here. My thinking is that the timeout behavior could instead use the cancelRequest interface you've provided to deal with this. But I may not have understood something; let me know your thoughts.

@Shunkangz
Copy link
Collaborator


You are right. The PR is still in progress because I am waiting for the previous one to be merged. I hope it can be merged within a couple of days.

Copy link
Member

@Tabrizian Tabrizian left a comment

Please add testing for this feature.

@raayandhar
Copy link
Contributor Author

Please add testing for this feature.

Will do, having discussion with @Shunkangz on how to move forward.

@pcastonguay
Copy link
Collaborator

@Shunkangz it seems to me that the changes in this MR are still needed for the case where for some reason the gen request never makes it to the generation worker. In that case we would need to cancel the request on the context server. And I agree, we should also call the C++ dataTransceiver cancelRequest to make sure the transceiver removes that request as well.
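The direction sketched above, cleaning up on the executor side and also cancelling on the transceiver side so the two stay in sync, can be illustrated with a toy model. Here `cancel_request` is a stand-in for the proposed cancelRequest interface, not an existing API, and `ToyTransceiver` is entirely hypothetical:

```python
class ToyTransceiver:
    """Toy model of a cache transceiver holding pending send requests."""

    def __init__(self):
        self.queue = set()

    def submit(self, req_id):
        self.queue.add(req_id)

    def cancel_request(self, req_id):
        # In a real transceiver, a hung UCX/RDMA send cannot be cancelled
        # in place; the endpoint would have to be torn down instead.
        self.queue.discard(req_id)

def timeout_cleanup(executor_requests, transceiver, req_id):
    """Drop a timed-out request from both cleanup levels."""
    executor_requests.discard(req_id)    # executor-side cleanup
    transceiver.cancel_request(req_id)   # keep the transceiver side in sync

trx = ToyTransceiver()
pending = {42}
trx.submit(42)
timeout_cleanup(pending, trx, 42)
```

The point of the two-level cleanup is that neither side retains a stale entry: without the second call, the transceiver's queue would still hold a request whose KV cache has already been released.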

@raayandhar
Copy link
Contributor Author

Closing since #7575 is more up to date.
