[https://nvbugs/5429636][fix] fix KVC mem leakage by adding kv_transfer_timeout_ms #6801
Conversation
📝 Walkthrough

Adds an optional per-transfer KV-cache timeout through CacheTransceiverConfig (C++ and Python bindings), persists it in pickle/state, surfaces it in the LLM API and BindKvCacheTransceiver, and implements per-request timeout tracking and cleanup inside PyExecutor.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant PyExecutor
    participant KvCacheTransceiver
    participant Request
    User->>PyExecutor: submit request(s)
    PyExecutor->>KvCacheTransceiver: initiate KV transfer (ctx/gen)
    PyExecutor->>Request: set py_kv_transfer_start_time
    loop executor cycles
        PyExecutor->>KvCacheTransceiver: poll transfer status
        PyExecutor->>PyExecutor: _check_kv_transfer_timeout()
        alt elapsed > kv_transfer_timeout_ms
            PyExecutor->>Request: set py_to_cleanup = True
            PyExecutor->>PyExecutor: terminate/cleanup request
        else transfer completes
            PyExecutor->>Request: clear py_kv_transfer_start_time
        end
    end
    PyExecutor-->>User: respond / terminate
```
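As a rough illustration of the elapsed-time check the diagram describes, here is a minimal, self-contained Python sketch. The class and attribute names (`kv_transfer_start_time`, `to_cleanup`) mirror the PR's `py_`-prefixed attributes, but the class itself is a hypothetical stand-in, not the actual `PyExecutor` or `LlmRequest` types:

```python
import time

class _FakeRequest:
    """Illustrative stand-in for an in-flight KV-transfer request."""

    def __init__(self, request_id):
        self.request_id = request_id
        self.kv_transfer_start_time = None  # set when the transfer begins
        self.to_cleanup = False             # set when the transfer times out

def check_kv_transfer_timeout(requests, timeout_ms, now=None):
    """Flag requests whose KV transfer exceeded timeout_ms (None/<=0 disables)."""
    if timeout_ms is None or timeout_ms <= 0:
        return []
    now = time.monotonic() if now is None else now
    timed_out = []
    for req in requests:
        if req.kv_transfer_start_time is None:
            continue
        elapsed_ms = (now - req.kv_transfer_start_time) * 1000
        if elapsed_ms > timeout_ms:
            req.to_cleanup = True
            timed_out.append(req.request_id)
    return timed_out

# Usage: one request started 2 seconds "ago", one just started.
stale = _FakeRequest("ctx-1")
stale.kv_transfer_start_time = time.monotonic() - 2.0
fresh = _FakeRequest("ctx-2")
fresh.kv_transfer_start_time = time.monotonic()
print(check_kv_transfer_timeout([stale, fresh], timeout_ms=500))  # ['ctx-1']
```

Note the flag-then-cleanup split: the check only marks `to_cleanup`, and the executor loop performs the actual termination later, matching the flow in the diagram.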
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)

- cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
- cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp (2 hunks)
- cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (2 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (9 hunks)
- tensorrt_llm/llmapi/llm_args.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- cpp/include/tensorrt_llm/executor/executor.h
- cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
- cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...
Files:
- cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp
- cpp/include/tensorrt_llm/executor/executor.h
- cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
- cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/include/tensorrt_llm/executor/executor.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (14)
cpp/include/tensorrt_llm/executor/executor.h (2)
1446-1448: LGTM! Well-designed API addition. The constructor signature properly extends the existing parameters with an optional timeout parameter, maintaining backward compatibility.
1453-1455: Good API design with proper type safety. The setter/getter pair uses appropriate types - `MillisecondsType` for input (likely an integer alias) and `std::chrono::milliseconds` for output, providing type safety for duration values.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
332-332: LGTM! Proper initialization of transfer timestamp. The new attribute follows the established naming convention and is appropriately initialized to None.
tensorrt_llm/llmapi/llm_args.py (2)
1067-1071: LGTM! Well-documented configuration field. The field is properly defined with a clear description and appropriate default value, maintaining backward compatibility.
1072-1077: Correct forwarding to pybind layer. The timeout parameter is properly passed to the underlying C++ binding.
cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp (3)
24-30: LGTM! Proper constructor initialization. The constructor correctly initializes the new timeout member using the initializer list.
32-36: Equality operator correctly updated. The comparison properly includes the new timeout field.
58-67: Clean getter/setter implementation. The accessor methods are correctly implemented with appropriate const-correctness.
tensorrt_llm/_torch/pyexecutor/py_executor.py (4)
1102-1104: Clarify why timeout check is disabled in overlap executor. The timeout check is commented out in the overlap executor loop. Please either enable it or add a comment explaining why it's intentionally disabled in this execution path.
1384-1388: Good practice: Setting start time for new transfers. The implementation correctly sets the transfer start time when initiating generation cache transfers, but only when timeout is configured.
1422-1426: Consistent timeout tracking for context transfers. The start time is properly set for context cache transfers, maintaining consistency with generation transfers.
1360-1360: Proper cleanup of transfer timestamps. The start times are correctly cleared when transfers complete or requests are terminated.
Also applies to: 1686-1686
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)
414-495: LGTM! CacheTransceiverConfig timeout implementation is well-structured. The pybind11 bindings for the KV transfer timeout feature are correctly implemented with proper:
- Type conversions between Python int64_t and C++ std::chrono::milliseconds
- Optional value handling
- Pickle support with backward compatibility (3-tuple state)
- Property and method bindings
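The backward-compatible pickle state the reviewer praises (a 3-tuple that still accepts the older 2-tuple) can be illustrated with a small pure-Python sketch; the `CacheConfigSketch` class and its field names here are hypothetical, not the real binding:

```python
import pickle

class CacheConfigSketch:
    """Hypothetical stand-in for a config whose pickle state grew from 2 to 3 fields."""

    def __init__(self, backend=None, max_tokens=None, kv_transfer_timeout_ms=None):
        self.backend = backend
        self.max_tokens = max_tokens
        self.kv_transfer_timeout_ms = kv_transfer_timeout_ms

    def __getstate__(self):
        # New code always writes the 3-tuple.
        return (self.backend, self.max_tokens, self.kv_transfer_timeout_ms)

    def __setstate__(self, state):
        # Accept the legacy 2-tuple as well, defaulting the new field.
        if len(state) == 2:
            state = (*state, None)
        self.backend, self.max_tokens, self.kv_transfer_timeout_ms = state

cfg = CacheConfigSketch("ucx", 4096, 30000)
restored = pickle.loads(pickle.dumps(cfg))
print(restored.kv_transfer_timeout_ms)  # 30000
```

The same idea applies at the binding level: the `__getstate__`/`__setstate__` pair pads an old 2-element state with a default so older pickles still load.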
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)
431-512: LGTM! Nanobind implementation mirrors pybind11 correctly. The nanobind bindings for CacheTransceiverConfig are properly implemented with:
- Consistent API with the pybind11 version
- Correct use of nanobind-specific patterns (placement new in setstate)
- Proper type conversions and optional handling
- Complete pickle support
Actionable comments posted: 1
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1287-1313: Fix critical issues in the timeout checking implementation

The `_check_kv_transfer_timeout` method has several issues that need to be addressed:

- Missing null check for `kv_cache_transceiver` before accessing its properties
- The condition at line 1304 incorrectly checks `is_disagg_generation_transmission_in_progress` instead of checking if the request is in progress
- The method doesn't call `_terminate_request` as expected but only sets the `py_to_cleanup` flag

Apply this fix to resolve the issues:
```diff
 def _check_kv_transfer_timeout(self):
+    if not self.kv_cache_transceiver:
+        return
     current_time = time.time()
     timeout_ms = self.kv_cache_transceiver.kv_transfer_timeout_ms
     if timeout_ms is None:
         return
     for req in self.ctx_in_transmission_requests[:]:
         if req.py_kv_transfer_start_time is None:
             continue
         elapsed_time = (current_time - req.py_kv_transfer_start_time) * 1000
         if elapsed_time > timeout_ms:
             logger.debug(
                 f"Terminating context request {req.py_request_id} due to KV cache transfer timeout"
             )
             req.py_to_cleanup = True
     for req in self.active_requests[:]:
         if req.is_disagg_generation_transmission_in_progress and req.py_kv_transfer_start_time is not None:
             elapsed_time = (current_time - req.py_kv_transfer_start_time) * 1000
             if elapsed_time > timeout_ms:
                 logger.debug(
                     f"Terminating generation request {req.py_request_id} due to KV cache transfer timeout"
                 )
                 req.py_to_cleanup = True
-
-    return
```
🧹 Nitpick comments (4)
tensorrt_llm/_torch/pyexecutor/py_executor.py (4)
907-907: Consider reducing log verbosity for production

While the debug logging is helpful during development, it might be too verbose for production. Consider using a more structured logging approach or controlling verbosity through configuration.
```diff
-        self._log_ctx_transmission_state("Starting _executor_loop")
+        if logger.isEnabledFor(logging.DEBUG):
+            self._log_ctx_transmission_state("Starting _executor_loop")
```
990-1005: Improve logging helper method for better debugging

The logging helper method could be enhanced with additional context and better formatting for production use.
```diff
 def _log_ctx_transmission_state(self, context: str):
     """Helper method to log the current state of ctx_in_transmission_requests"""
+    if not logger.isEnabledFor(logging.DEBUG):
+        return
+
     if not self.ctx_in_transmission_requests:
         logger.debug(
             f"[CTX_TRANSMISSION] {context}: ctx_in_transmission_requests is EMPTY"
         )
     else:
-        request_ids = [
-            req.py_request_id for req in self.ctx_in_transmission_requests
-        ]
-        client_ids = [
-            req.py_client_id for req in self.ctx_in_transmission_requests
-        ]
+        request_info = [
+            {
+                "request_id": req.py_request_id,
+                "client_id": req.py_client_id,
+                "state": req.state.name if hasattr(req.state, 'name') else str(req.state),
+                "has_timeout": req.py_kv_transfer_start_time is not None
+            }
+            for req in self.ctx_in_transmission_requests
+        ]
         logger.debug(
-            f"[CTX_TRANSMISSION] {context}: ctx_in_transmission_requests has {len(self.ctx_in_transmission_requests)} requests - request_ids: {request_ids}, client_ids: {client_ids}"
+            f"[CTX_TRANSMISSION] {context}: {len(self.ctx_in_transmission_requests)} requests in transmission",
+            extra={"requests": request_info}
         )
```
1120-1121: Clarify the commented-out timeout check in overlap executor

The comment indicates uncertainty about placing the timeout check here; this should be resolved before finalizing the PR to ensure consistent timeout handling across all executor paths. Would you like me to analyze the overlap executor flow to determine if the timeout check should be enabled here?
1658-1716: Consider reducing logging frequency in hot paths

Multiple debug log calls in the response handling path could impact performance. Consider consolidating or conditionally enabling these logs.
```diff
 def _handle_responses(self):
-    self._log_ctx_transmission_state("Start of _handle_responses")
+    if logger.isEnabledFor(logging.DEBUG):
+        self._log_ctx_transmission_state("Start of _handle_responses")
     # ... rest of the method ...
-    self._log_ctx_transmission_state("End of _handle_responses")
+    if logger.isEnabledFor(logging.DEBUG):
+        self._log_ctx_transmission_state("End of _handle_responses")
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (14 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py
1004-1004: Line too long (186 > 120)
(E501)
🔇 Additional comments (7)
tensorrt_llm/_torch/pyexecutor/py_executor.py (7)
858-859: Good placement of timeout check after generation transfer status check. The timeout check is appropriately placed after checking disaggregated generation transfer status. This ensures that transfers are monitored and timed out if they exceed the configured threshold.
978-978: Consistent timeout check placement in the main executor loop. The timeout check is consistently placed after terminating context-finished requests, matching the pattern used in the PP executor loop.
1387-1387: Correctly reset timeout tracking when generation transfer completes. The timeout tracking is properly reset when a generation request completes its KV cache transfer and transitions to the generation-in-progress state.
1411-1414: Properly initialize timeout tracking for new generation requests. The start time is correctly initialized when new generation requests begin their KV cache transfer.
1449-1452: Consistently initialize timeout tracking for context transmission. The timeout tracking is properly initialized for context requests that enter the transmission state.
1672-1676: Effective cleanup mechanism for timed-out requests. The cleanup flag is properly checked during response handling to terminate requests that have exceeded the timeout threshold.
1724-1727: Proper cleanup of timeout tracking in termination handler. The timeout tracking is correctly reset when requests are terminated, whether due to completion or timeout.
Actionable comments posted: 3
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
840-843: Swap timeout check before termination (still pending here). Call `_check_kv_transfer_timeout()` before `_terminate_ctx_finished_requests()` to ensure timed-out ctx transfers are cleaned up immediately. This mirrors the updated ordering elsewhere in this file.
Apply this diff:
```diff
-            if self.kv_cache_transceiver and self.ctx_in_transmission_requests:
-                self._terminate_ctx_finished_requests()
-                self._check_kv_transfer_timeout()
+            if self.kv_cache_transceiver and self.ctx_in_transmission_requests:
+                self._check_kv_transfer_timeout()
+                self._terminate_ctx_finished_requests()
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/py_executor.py (6)
857-860: LGTM: Added timeout check in prep/schedule path. Running `_check_kv_transfer_timeout()` early in the scheduling path helps bound resource usage proactively.
975-978: LGTM: Correct ordering (check timeout before termination). This prevents a one-iteration delay in cleaning up timed-out ctx transfers.
1101-1103: LGTM: Correct ordering (overlap loop). Timeout check precedes termination, consistent with the intended behavior.
1368-1369: LGTM: Reset start time on gen transfer completion. Clearing `py_kv_transfer_start_time` on completion prevents stale timeouts.
1699-1701: LGTM: Cleanup on ctx completion or timeout. Clearing start time and removing from in-flight list ensures bounded resource usage.
1652-1657: I couldn't find a `FinishReason.TIMEOUT` or a `finish_by_reason` method in the codebase, and the `FinishReason` enum only defines CANCELLED, STOP, and LENGTH. Likewise, there's no terminal-response helper on `LlmRequest`. Emitting a final response will require:

- Defining a timeout finish reason
- Implementing or exposing a `finish_by_reason(...)` API on `LlmRequest`
- Ensuring `create_response(...)` returns a valid final chunk

Until those exist, the suggested diff won't compile.

Action items:

- Add `TIMEOUT` to `FinishReason` (e.g. in finish_reason.py)
- Implement `LlmRequest.finish_by_reason(reason: FinishReason)`
- In py_executor.py around line 1652, call:

```python
request.finish_by_reason(FinishReason.TIMEOUT)
response = request.create_response(final_chunk=True, worker_rank=self.dist.rank)
if response:
    new_responses.append((req_id, response))
```

Then proceed with `requests_to_terminate.append(request)` as before. Please confirm before landing these refactors.
Actionable comments posted: 4
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
842-843: Swap the order: run timeout checks before terminating ctx-finished requests (consistency and correctness). To promptly flag and clean up timed-out transfers before normal termination, call `_check_kv_transfer_timeout()` before `_terminate_ctx_finished_requests()` here (matches other loops below).

```diff
-                self._terminate_ctx_finished_requests()
-                self._check_kv_transfer_timeout()
+                self._check_kv_transfer_timeout()
+                self._terminate_ctx_finished_requests()
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/_torch/pyexecutor/py_executor.py (5)
857-860: LGTM: proactive timeout check during batch prep. Running `_check_kv_transfer_timeout()` during batch preparation reduces time-to-detection for stalled transfers.
975-978: LGTM: correct ordering (check timeouts before terminating ctx-finished). This matches the intended flow and previous guidance.
1101-1103: LGTM: correct ordering in overlap loop as well. Consistent with the non-PP path; timeout check precedes ctx-finished termination.
1370-1370: LGTM: clear the KV transfer timing anchor on gen-transfer completion. Resetting `py_kv_transfer_start_time` avoids stale elapsed-time checks after completion.
1701-1703: LGTM: terminate ctx transfers on completion or timeout and clear timing anchor. This ensures resource cleanup for context-only transmissions and avoids stale timers.
cc: @Tabrizian

/bot run

PR_Github #15028 [ run ] triggered by Bot

PR_Github #15028 [ run ] completed with state
31634aa to 5924e85
/bot run
Actionable comments posted: 1
🔭 Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1-1: Add the required NVIDIA copyright header

This source lacks the required NVIDIA header per repository guidelines. Please prepend the standard header at the top of this file.
Example header:
```python
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
# NVIDIA and its licensors retain all intellectual property and proprietary
# rights in and to this software, related documentation and any modifications
# thereto. Any use, reproduction, disclosure or distribution of this software
# and related documentation without an express license agreement from NVIDIA is
# strictly prohibited.
```
♻️ Duplicate comments (5)
tensorrt_llm/_torch/pyexecutor/py_executor.py (5)
841-844: Call timeout checks before terminating finished ctx transfers

Run `_check_kv_transfer_timeout()` before `_terminate_ctx_finished_requests()` to ensure timeouts are evaluated and flagged prior to cleanup.

```diff
-                self._terminate_ctx_finished_requests()
-                self._check_kv_transfer_timeout()
+                self._check_kv_transfer_timeout()
+                self._terminate_ctx_finished_requests()
```
1269-1298: Use a monotonic clock and treat non-positive timeouts as disabled

Wall-clock jumps can skew elapsed-time checks. Use `time.monotonic()` and keep the semantics: None or <= 0 disables the timeout. Also prefer `elapsed_ms` for clarity.

```diff
 def _check_kv_transfer_timeout(self):
     if not self.kv_cache_transceiver:
         return
     timeout_ms = self.kv_cache_transceiver.kv_transfer_timeout_ms
-    if timeout_ms is None or timeout_ms <= 0:
-        return
-    current_time = time.time()
+    if timeout_ms is None or timeout_ms <= 0:
+        return
+    current_time = time.monotonic()
     for req in self.ctx_in_transmission_requests[:]:
         if req.py_kv_transfer_start_time is None:
             continue
-        elapsed_time = (current_time - req.py_kv_transfer_start_time) * 1000
-        if elapsed_time > timeout_ms:
+        elapsed_ms = (current_time - req.py_kv_transfer_start_time) * 1000
+        if elapsed_ms > timeout_ms:
             logger.warning(
                 f"Terminating context request {req.py_request_id} due to KV cache transfer timeout"
             )
             req.py_to_cleanup = True
     for req in self.active_requests[:]:
         if req.is_disagg_generation_transmission_in_progress and req.py_kv_transfer_start_time is not None:
-            elapsed_time = (current_time -
-                            req.py_kv_transfer_start_time) * 1000
-            if elapsed_time > timeout_ms:
+            elapsed_ms = (current_time - req.py_kv_transfer_start_time) * 1000
+            if elapsed_ms > timeout_ms:
                 logger.warning(
                     f"Terminating generation request {req.py_request_id} due to KV cache transfer timeout"
                 )
                 req.py_to_cleanup = True
```
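To make elapsed-ms logic like the above unit-testable without sleeping, the clock can be injected; `time.monotonic` (the default here) is unaffected by wall-clock adjustments. This is an illustrative sketch under that assumption, not code from the PR:

```python
import time

def make_elapsed_ms(clock=time.monotonic):
    """Return a zero-arg function measuring milliseconds since creation, using `clock`."""
    start = clock()

    def elapsed_ms():
        # Differences of a monotonic clock never go negative, even if the
        # system wall clock is stepped backwards by NTP or an admin.
        return (clock() - start) * 1000

    return elapsed_ms

# Deterministic usage with a fake clock that ticks 100.0 s -> 100.25 s:
ticks = iter([100.0, 100.25])
elapsed = make_elapsed_ms(clock=lambda: next(ticks))
ms = elapsed()
print(ms)  # 250.0
```

Injecting the clock also lets tests exercise the timeout branch with exact values instead of relying on real time passing.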
1395-1399: Initialize gen-transfer timers only when timeout is enabled, and use a monotonic clock

Honor disabled/non-positive timeouts and switch to `time.monotonic()` for robustness.

```diff
-        if self.kv_cache_transceiver.kv_transfer_timeout_ms is not None:
+        if self.kv_cache_transceiver.kv_transfer_timeout_ms is not None and \
+                self.kv_cache_transceiver.kv_transfer_timeout_ms > 0:
             for req in new_gen_reqs:
                 if req.state == LlmRequestState.DISAGG_GENERATION_TRANS_IN_PROGRESS:
-                    req.py_kv_transfer_start_time = time.time()
+                    req.py_kv_transfer_start_time = time.monotonic()
```
1433-1437: Initialize ctx-transfer timers only when timeout is enabled, and use a monotonic clock

Keep clock sources consistent and ignore non-positive timeouts.

```diff
-        if self.kv_cache_transceiver.kv_transfer_timeout_ms is not None:
+        if self.kv_cache_transceiver.kv_transfer_timeout_ms is not None and \
+                self.kv_cache_transceiver.kv_transfer_timeout_ms > 0:
             for req in ctx_transmission_reqs:
                 if req.state == LlmRequestState.DISAGG_CONTEXT_TRANS_IN_PROGRESS:
-                    req.py_kv_transfer_start_time = time.time()
+                    req.py_kv_transfer_start_time = time.monotonic()
```
1655-1660: Emit a final response on timeout to avoid client hangs

Requests flagged for cleanup due to KV-transfer timeout are terminated without sending a final response. Emit a final response (use CANCELLED as a proxy for TIMEOUT) before cleanup so clients don't hang waiting.

```diff
         # Check if generation request needs cleanup due to KV cache transfer timeout
         if request.py_to_cleanup:
-            request.state = LlmRequestState.GENERATION_COMPLETE
-            requests_to_terminate.append(request)
-            continue
+            # Mark as cancelled due to timeout and emit a final response
+            request.finish_by_reason(FinishReason.CANCELLED)
+            request.state = LlmRequestState.GENERATION_COMPLETE
+            request.decoding_iter = request.py_decoding_iter
+            response = request.create_response(False, self.dist.rank)
+            if response:
+                new_responses.append((req_id, response))
+            requests_to_terminate.append(request)
+            continue
```
🧹 Nitpick comments (1)
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)
474-495: Validate non-negative timeout in the Python setter

Guard against negative values to prevent undefined semantics; raise a Python ValueError when negative.

```diff
         .def_property(
             "kv_transfer_timeout_ms",
             [](tle::CacheTransceiverConfig const& self) -> std::optional<int64_t>
             {
                 auto timeout = self.getKvTransferTimeoutMs();
                 return timeout ? std::optional<int64_t>(timeout->count()) : std::nullopt;
             },
             [](tle::CacheTransceiverConfig& self, std::optional<int64_t> timeoutMs)
             {
                 if (timeoutMs)
                 {
-                    self.setKvTransferTimeoutMs(std::chrono::milliseconds(*timeoutMs));
+                    if (*timeoutMs < 0)
+                    {
+                        throw py::value_error("kv_transfer_timeout_ms must be non-negative");
+                    }
+                    self.setKvTransferTimeoutMs(std::chrono::milliseconds(*timeoutMs));
                 }
                 else
                 {
                     self.setKvTransferTimeoutMs(std::nullopt);
                 }
             })
```
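The setter guard proposed above has a straightforward Python analogue. This sketch only illustrates the validation rule; `CacheTransceiverConfigSketch` is a hypothetical stand-in, not the real binding class:

```python
class CacheTransceiverConfigSketch:
    """Illustrative config holder mirroring the validated binding setter."""

    def __init__(self, kv_transfer_timeout_ms=None):
        self._kv_transfer_timeout_ms = None
        self.kv_transfer_timeout_ms = kv_transfer_timeout_ms  # runs the check

    @property
    def kv_transfer_timeout_ms(self):
        return self._kv_transfer_timeout_ms

    @kv_transfer_timeout_ms.setter
    def kv_transfer_timeout_ms(self, timeout_ms):
        # None means "disabled"; negative values have no defined semantics.
        if timeout_ms is not None and timeout_ms < 0:
            raise ValueError("kv_transfer_timeout_ms must be non-negative")
        self._kv_transfer_timeout_ms = timeout_ms
```

Rejecting negatives at assignment time keeps every later consumer free to treat "not None and > 0" as the only enabled case.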
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
- cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp (2 hunks)
- cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (2 hunks)
- examples/disaggregated/README.md (1 hunks)
- tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (10 hunks)
- tensorrt_llm/llmapi/llm_args.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- cpp/include/tensorrt_llm/executor/executor.h
- examples/disaggregated/README.md
- tensorrt_llm/llmapi/llm_args.py
- cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
- cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/py_executor.py (4)
858-861: Good placement: timeout check is invoked early in the iteration

Calling `_check_kv_transfer_timeout()` right after `_check_disagg_gen_transfer_status()` ensures overdue transfers are pruned promptly.

976-979: Correct order: timeout check before ctx-finished termination

This ordering prevents missing timeouts on still-in-transit ctx requests. No change needed here.

1371-1371: LGTM: Resetting start-time on gen transmission completion

Clearing `py_kv_transfer_start_time` on successful generation transmission completion avoids stale timers.

1701-1703: LGTM: Terminating ctx-transmission requests on completion or timeout

Including `request.py_to_cleanup` here promptly frees resources and clears the timeout start marker.

cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (2)

414-419: Add kv_transfer_timeout_ms to pickle state: looks good

Returning a 3-tuple with `timeoutMs` preserves state across pickling. The conversion to `optional<int64_t>` is correct.

456-470: 3-arg constructor for CacheTransceiverConfig: good

The binding cleanly maps optional timeoutMs to `std::chrono::milliseconds` and threads it through the C++ constructor.
|
PR_Github #15033 [ run ] triggered by Bot |
|
PR_Github #15033 [ run ] completed with state |
5924e85 to
e671749
Compare
Actionable comments posted: 4
♻️ Duplicate comments (4)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
841-844: Run timeout checks before terminating finished ctx transfers (ordering)

Call `_check_kv_transfer_timeout()` before `_terminate_ctx_finished_requests()` so timed-out ctx transfers are flagged and cleaned in the same cycle. This mirrors the ordering you already apply in other loops.

```diff
-        if self.kv_cache_transceiver and self.ctx_in_transmission_requests:
-            self._terminate_ctx_finished_requests()
-            self._check_kv_transfer_timeout()
+        if self.kv_cache_transceiver and self.ctx_in_transmission_requests:
+            self._check_kv_transfer_timeout()
+            self._terminate_ctx_finished_requests()
```
1269-1298: Use a monotonic clock and unify elapsed naming

`time.time()` is not monotonic and can be affected by system clock changes. Use `time.monotonic()` and keep variable naming consistent (`elapsed_ms`). Also keep the existing guards (None or <= 0 => disabled).

```diff
     def _check_kv_transfer_timeout(self):
         if not self.kv_cache_transceiver:
             return
         timeout_ms = self.kv_cache_transceiver.kv_transfer_timeout_ms
         if timeout_ms is None or timeout_ms <= 0:
             return
-        current_time = time.time()
+        current_time = time.monotonic()
         for req in self.ctx_in_transmission_requests[:]:
             if req.py_kv_transfer_start_time is None:
                 continue
-            elapsed_time = (current_time - req.py_kv_transfer_start_time) * 1000
-            if elapsed_time > timeout_ms:
+            elapsed_ms = (current_time - req.py_kv_transfer_start_time) * 1000
+            if elapsed_ms > timeout_ms:
                 logger.warning(
                     f"Terminating context request {req.py_request_id} due to KV cache transfer timeout"
                 )
                 req.py_to_cleanup = True
         for req in self.active_requests[:]:
             if req.is_disagg_generation_transmission_in_progress and req.py_kv_transfer_start_time is not None:
-                elapsed_time = (current_time -
-                                req.py_kv_transfer_start_time) * 1000
-                if elapsed_time > timeout_ms:
+                elapsed_ms = (current_time - req.py_kv_transfer_start_time) * 1000
+                if elapsed_ms > timeout_ms:
                     logger.warning(
                         f"Terminating generation request {req.py_request_id} due to KV cache transfer timeout"
                     )
                     req.py_to_cleanup = True
         return
```
1433-1437: Initialize ctx-transfer timer with monotonic clock and don't reset it repeatedly

Mirror the gen path: respect disabled/non-positive timeouts, use a monotonic clock, and only set once.

```diff
-        if self.kv_cache_transceiver.kv_transfer_timeout_ms is not None:
+        if (self.kv_cache_transceiver.kv_transfer_timeout_ms is not None and
+                self.kv_cache_transceiver.kv_transfer_timeout_ms > 0):
             for req in ctx_transmission_reqs:
                 if req.state == LlmRequestState.DISAGG_CONTEXT_TRANS_IN_PROGRESS:
-                    req.py_kv_transfer_start_time = time.time()
+                    if req.py_kv_transfer_start_time is None:
+                        req.py_kv_transfer_start_time = time.monotonic()
```

cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)
422-435: Backward compatibility: accept 2- or 3-element pickle state in setstate

Strictly requiring 3 elements breaks older pickles. Accept both sizes and apply timeout only when present.

```diff
-    if (state.size() != 3)
-    {
-        throw std::runtime_error("Invalid CacheTransceiverConfig state!");
-    }
-    auto config
-        = tle::CacheTransceiverConfig(state[0].cast<std::optional<tle::CacheTransceiverConfig::BackendType>>(),
-            state[1].cast<std::optional<size_t>>());
-    auto timeoutMs = state[2].cast<std::optional<int64_t>>();
-    if (timeoutMs)
-    {
-        config.setKvTransferTimeoutMs(std::chrono::milliseconds(*timeoutMs));
-    }
-    return config;
+    if (state.size() < 2 || state.size() > 3)
+    {
+        throw std::runtime_error("Invalid CacheTransceiverConfig state!");
+    }
+    auto config
+        = tle::CacheTransceiverConfig(state[0].cast<std::optional<tle::CacheTransceiverConfig::BackendType>>(),
+            state[1].cast<std::optional<size_t>>());
+    if (state.size() == 3)
+    {
+        auto timeoutMs = state[2].cast<std::optional<int64_t>>();
+        if (timeoutMs)
+        {
+            config.setKvTransferTimeoutMs(std::chrono::milliseconds(*timeoutMs));
+        }
+    }
+    return config;
```
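The same tolerant-state idea can be sketched in plain Python for illustration (the class and field names here are hypothetical, not the real binding): a pickle produced before the field existed carries two elements, a new one carries three, and `__setstate__` accepts both.

```python
import pickle


class ConfigWithOptionalTimeout:
    """Sketch of a config whose pickle state grew from 2 to 3 elements."""

    def __init__(self, backend=None, max_tokens=None, timeout_ms=None):
        self.backend = backend
        self.max_tokens = max_tokens
        self.timeout_ms = timeout_ms

    def __getstate__(self):
        # New pickles always carry the 3-element state.
        return (self.backend, self.max_tokens, self.timeout_ms)

    def __setstate__(self, state):
        if not 2 <= len(state) <= 3:
            raise ValueError("Invalid config state!")
        self.backend, self.max_tokens = state[0], state[1]
        # Old 2-element pickles lack the timeout; default to disabled.
        self.timeout_ms = state[2] if len(state) == 3 else None


roundtripped = pickle.loads(pickle.dumps(ConfigWithOptionalTimeout("UCX", 256, 5000)))
print(roundtripped.timeout_ms)  # 5000
```

Writers always emit the newest state shape; only readers stay permissive, so the compatibility window closes naturally once old pickles age out.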
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1-10: Missing NVIDIA copyright header

Per repository guidelines, Python sources should include the NVIDIA header. Please prepend the standard header.
Example (outside this diff hunk):
```python
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
# NVIDIA and its licensors retain all intellectual property and proprietary
# rights in and to this software, related documentation and any modifications
# thereto. Any use, reproduction, disclosure or distribution of this software
# and related documentation without an express license agreement from NVIDIA is
# strictly prohibited.
```
1-16: Header years

The SPDX header ends at 2024, while other updated files use 2025. If the project policy is to bump the year on touched files, consider updating.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
- cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp (2 hunks)
- cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (2 hunks)
- tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (10 hunks)
- tensorrt_llm/llmapi/llm_args.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py
- cpp/include/tensorrt_llm/executor/executor.h
- cpp/tensorrt_llm/executor/cacheTransceiverConfig.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (10)
tensorrt_llm/_torch/pyexecutor/py_executor.py (4)
858-861: Good: early timeout check after gen-transfer status

Placing `_check_kv_transfer_timeout()` right after `_check_disagg_gen_transfer_status()` ensures in-flight timeouts are handled before scheduling.

976-979: Good: correct ordering (timeout before ctx-finished termination)

This ordering avoids missing a timeout in the same iteration.

1101-1104: Good: consistent ordering in overlap loop

Timeout is checked before ctx-finished termination here as well.

1701-1706: Good: clear start-time on ctx completion or timeout

Resetting `py_kv_transfer_start_time` on completion or timeout prevents stale timestamps on reused request objects.
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (3)
414-419: LGTM: pickle getstate includes timeout

Returning a 3-tuple with optional timeout preserves the new field across serialization.

457-470: LGTM: 3-arg constructor wiring

The lambda converting optional `int64_t` to optional milliseconds is correct and keeps the Python API ergonomic.

474-491: LGTM: kv_transfer_timeout_ms property binding

Getter/setter semantics and nullopt handling look correct.
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (3)
431-436: LGTM: pickle getstate includes timeout

3-tuple state with optional timeout is correct.

473-486: LGTM: 3-arg constructor wiring

Optional ms conversion is correct.

491-507: LGTM: kv_transfer_timeout_ms property binding

Correct optional handling and ms conversion.
For 1, could you elaborate on why the timed-out requests are still in the queue of the C++ transceiver? In my testing I recreate conditions where context requests pile up and fail to send their KV cache, which is a use case this feature targets. That's why I make sure to clean up these requests the same way I clean up requests that successfully transmitted their KV cache, i.e. the normal cleanup flow. I think they might still be in the queue but once |
Signed-off-by: raayandhar <[email protected]>
9c7ef25 to
0f0256d
Compare
|
Re https://github.com/NVIDIA/TensorRT-LLM/pull/6801#issuecomment-3185781656> |
|
/bot run |
|
PR_Github #15332 [ run ] triggered by Bot |
|
PR_Github #15332 [ run ] completed with state |
If we set the timeout low enough, yes. It will check in under 1ms in my testing so I had to artificially introduce delay via |
|
/bot run --disable-fail-fast |
|
PR_Github #15468 [ run ] triggered by Bot |
I hope that I am clear. If you need more discussion, we can schedule a meeting. |
|
PR_Github #15468 [ run ] completed with state |
I see, thanks Shunkang for the detailed response. I agree that the |
You are right. The PR is still in progress. Because I am waiting for the previous one merged. I hope that it can be merged within a couple of days. |
Tabrizian
left a comment
Please add testing for this feature.
Will do, having discussion with @Shunkangz on how to move forward. |
|
@Shunkangz it seems to me that the changes in this MR are still needed for the case where for some reason the gen request never makes it to the generation worker. In that case we would need to cancel the request on the context server. And I agree, we should also call the C++ dataTransceiver |
|
Closing since #7575 is more up to date. |
Summary by CodeRabbit
New Features
Improvements
Documentation
Description
An issue was raised on GitHub where the context server occasionally hangs because requests that fail to transfer pile up and accumulate resources. This PR adds the ability for users to configure `kv_transfer_timeout_ms` under `cache_transceiver_config`, which releases resources for requests that fail to transfer their KV cache in time (it can be configured separately for ctx and gen). It must be configured carefully, or you may discard perfectly good requests.
Tested locally by changing the proxy server to intentionally generate failures and then verifying that resources and requests were cleaned up after the timeout(s).