Draft: Nanobind integration tests #6203
Conversation
Walkthrough

The changes update exception handling in nanobind bindings, revise constructor argument handling for OutputConfig, switch the build script's default binding type to nanobind, ensure backend string conversion to enum in cache transceiver config, and enforce boolean conversion for logprobs in OpenAI protocol sampling parameter construction.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Python
    participant OutputConfig
    Python->>OutputConfig: __init__(optional<bool>..., optional<vector>)
    OutputConfig-->>OutputConfig: Convert each optional<bool> to bool using value_or(false)
    OutputConfig-->>Python: Constructed OutputConfig instance
```
```mermaid
sequenceDiagram
    participant CacheTransceiverConfig
    participant _CacheTransceiverBackendType
    CacheTransceiverConfig->>_CacheTransceiverBackendType: from_string(self.backend)
    _CacheTransceiverBackendType-->>CacheTransceiverConfig: Enum value
    CacheTransceiverConfig-->>pybind: Pass enum value (not string)
```
```mermaid
sequenceDiagram
    participant CompletionRequest
    participant SamplingParams
    CompletionRequest->>CompletionRequest: to_sampling_params()
    CompletionRequest-->>SamplingParams: Set _return_log_probs = bool(self.logprobs)
    SamplingParams-->>CompletionRequest: SamplingParams instance
```
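To make the first diagram concrete, here is a minimal C++ sketch of the `value_or(false)` conversion it describes; the type name and the two-field parameter list are illustrative assumptions, and the real `OutputConfig` constructor takes more arguments.

```cpp
#include <optional>

struct OutputConfigSketch
{
    explicit OutputConfigSketch(std::optional<bool> returnLogProbs = std::nullopt,
        std::optional<bool> returnContextLogits = std::nullopt)
        // A missing (None) argument collapses to false, so downstream code only sees plain bools.
        : returnLogProbs{returnLogProbs.value_or(false)}
        , returnContextLogits{returnContextLogits.value_or(false)}
    {
    }

    bool returnLogProbs;
    bool returnContextLogits;
};
```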
Estimated code review effort: 2 (~15 minutes)
📜 Recent review details

Configuration used: .coderabbit.yaml

📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (2)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
/bot run

PR_Github #12383 [ run ] triggered by Bot
Actionable comments posted: 15
🧹 Nitpick comments (14)
cpp/tensorrt_llm/nanobind/executor/request.h (1)
20-27: Consider forward-declaring nanobind::module_ instead of including the full header
`#include <nanobind/nanobind.h>` pulls in a fairly large header and propagates it to every TU that includes `request.h`, which can noticeably impact build times.
A lightweight alternative is to forward-declare `nanobind::module_` and use the fully-qualified name in the declaration; the implementation file can keep the heavy include.

```diff
-#include <nanobind/nanobind.h>
-namespace nb = nanobind;
-
-namespace tensorrt_llm::nanobind::executor
-{
-// Register bindings for executor API.
-void initRequestBindings(nb::module_& m);
+namespace nanobind { class module_; }
+
+namespace tensorrt_llm::nanobind::executor
+{
+// Register bindings for executor API.
+void initRequestBindings(nanobind::module_& m);
 }
```

This avoids polluting including translation units with the `nb` alias and trims compile dependencies.

cpp/tensorrt_llm/nanobind/userbuffers/bindings.h (1)
20-29: Same compile-time concern: replace heavy include with forward declaration

For consistency and faster incremental builds, forward-declare `nanobind::module_` and drop the header/alias here too.

```diff
-#include <nanobind/nanobind.h>
-namespace nb = nanobind;
+namespace nanobind { class module_; }
```

Then update the signature:

```diff
-static void initBindings(nb::module_& m);
+static void initBindings(nanobind::module_& m);
```

cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h (1)
18-28: Reduce header fan-out by forward-declaring nanobind::module_

As with the other binding headers, swap the full include for a forward declaration to keep compile units slim.

```diff
-#include <nanobind/nanobind.h>
-namespace nb = nanobind;
+namespace nanobind { class module_; }
```

And adjust the method parameter accordingly.
cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h (1)
20-28: Lightweight forward declaration recommended

Follow the same pattern to avoid repeatedly including Nanobind’s heavy header from public headers.

```diff
-#include <nanobind/nanobind.h>
-namespace nb = nanobind;
+namespace nanobind { class module_; }
```

This keeps compile times predictable when the testing headers are transitively included.
cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h (1)
18-28: Apply the forward-declaration pattern here as well

Consistently forward-declaring `nanobind::module_` across all binding headers will maximise the compile-time benefit and avoid leaking the `nb` alias into every translation unit.

```diff
-#include <nanobind/nanobind.h>
-namespace nb = nanobind;
+namespace nanobind { class module_; }
```

Then use `nanobind::module_` in `initBindings`.

cpp/tensorrt_llm/nanobind/common/bindTypes.h (1)
44-65: Solid pickling implementation with minor suggestion.

The serialization logic correctly preserves set contents through vector conversion and reconstruction. The validation and placement new usage are appropriate.
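For reference, the pickling idiom being praised looks roughly like the sketch below in nanobind; the bound set type (`std::set<int>`) and the helper name are illustrative assumptions rather than the PR's actual `bindTypes.h` code.

```cpp
#include <nanobind/nanobind.h>
#include <nanobind/stl/tuple.h>
#include <nanobind/stl/vector.h>

#include <new>
#include <set>
#include <tuple>
#include <vector>

namespace nb = nanobind;

using IntSet = std::set<int>;

void bindIntSet(nb::module_& m)
{
    nb::class_<IntSet>(m, "IntSet")
        .def(nb::init<>())
        // Serialize by flattening the set into a vector wrapped in a one-element tuple.
        .def("__getstate__",
            [](IntSet const& s) { return std::make_tuple(std::vector<int>(s.begin(), s.end())); })
        // __setstate__ receives uninitialized storage, hence the placement new.
        .def("__setstate__",
            [](IntSet& self, std::tuple<std::vector<int>> const& state)
            {
                auto const& items = std::get<0>(state);
                new (&self) IntSet(items.begin(), items.end());
            });
}
```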
Consider using a more descriptive error message:
```diff
- throw std::runtime_error("Invalid state!");
+ throw std::runtime_error("Invalid state: expected tuple with 1 element for set deserialization");
```

tests/unittest/bindings/test_executor_bindings.py (2)
1204-1211: Remove commented debug code

The commented debugging code for checking picklability should be removed to keep the test clean.
```diff
-# for prop in ['is_final', 'output_token_ids', 'cum_log_probs', 'log_probs', 'context_logits', 'generation_logits', 'encoder_output', 'finish_reasons', 'sequence_index', 'is_sequence_final', 'decoding_iter', 'context_phase_params', 'request_perf_metrics']:
-#     try:
-#         value = getattr(result, prop)
-#         pickle.dumps(value)
-#         print(f"✓ {prop} pickles successfully")
-#     except Exception as e:
-#         print(f"✗ {prop} fails: {e}")
-
```
1923-1929: Consider removing or making debug prints conditional

Debug print statements have been added to the test. While useful during development, they may clutter test output in CI/CD pipelines.
Consider either removing the debug prints or making them conditional based on an environment variable:
```diff
-for i, (req_id, client_id, logits) in enumerate(
-        zip(req_id_batch, client_id_batch, logits_batch)):
-    print(f"DEBUG: Request {i}: req_id={req_id}, client_id={client_id}")
-    print(
-        f"DEBUG: Request {i}: logits dtype={logits.dtype}, shape={logits.shape}, device={logits.device}"
-    )
+for i, (req_id, client_id, logits) in enumerate(
+        zip(req_id_batch, client_id_batch, logits_batch)):
+    if os.environ.get('DEBUG_TESTS'):
+        print(f"DEBUG: Request {i}: req_id={req_id}, client_id={client_id}")
+        print(
+            f"DEBUG: Request {i}: logits dtype={logits.dtype}, shape={logits.shape}, device={logits.device}"
+        )
```

cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1)
248-253: Consider handling fractional milliseconds or documenting the precision loss.

The conversion from `double` to `int64_t` in line 252 will truncate fractional milliseconds. If sub-millisecond precision is needed, consider using `std::chrono::duration<double, std::milli>` instead.

```diff
-    if (timeout_ms)
-    {
-        return self.getLatestEvents(std::chrono::milliseconds(static_cast<int64_t>(*timeout_ms)));
-    }
+    if (timeout_ms)
+    {
+        // Note: Fractional milliseconds are truncated
+        return self.getLatestEvents(std::chrono::milliseconds(static_cast<int64_t>(*timeout_ms)));
+    }
```

cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp (1)
44-44: Complete the enum value documentation.

The documentation for `CUM_LOG_PROBS` is incomplete.

```diff
- .value("CUM_LOG_PROBS", OutputContentType::kCUM_LOG_PROBS, "Cumulative Log");
+ .value("CUM_LOG_PROBS", OutputContentType::kCUM_LOG_PROBS, "Cumulative Log Probs");
```

cpp/tensorrt_llm/nanobind/common/customCasters.h (2)
44-47: Update comment to reference nanobind instead of pybind.

The comment references "Pybind" but this is a nanobind implementation. Update the comment to avoid confusion.

```diff
-// Pybind requires to have a central include in order for type casters to work.
+// Nanobind requires to have a central include in order for type casters to work.
 // Opaque bindings add a type caster, so they have the same requirement.
-// See the warning in https://pybind11.readthedocs.io/en/stable/advanced/cast/custom.html
+// See the nanobind documentation on custom type casters.
```
280-291: Document that this caster copies referenced objects.

The `from_cpp` method creates copies of all referenced objects rather than maintaining references. This changes the semantics but may be necessary for Python interop.

Add a comment explaining this behavior:

```diff
 static handle from_cpp(VectorType const& src, rv_policy policy, cleanup_list* cleanup) noexcept
 {
-
+    // Note: We copy the referenced objects to avoid lifetime issues when passing to Python
     std::vector<T> result;
```

cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)
254-331: Consider refactoring the 49-parameter constructor for better maintainability.

The constructor has 49 parameters, which makes it difficult to maintain and use correctly.

Consider using a builder pattern or configuration struct to group related parameters. This would improve readability and reduce the chance of parameter ordering errors. For example (a sketch follows the list below):
- Group tensor-related parameters into a TensorConfig struct
- Group LoRA-related parameters into a LoraConfig struct
- Group multimodal parameters into a MultimodalConfig struct
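A minimal sketch of that grouping idea; every struct, field, and class name below is hypothetical and only illustrates the pattern, not the PR's `LlmRequest` binding.

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical grouping of tensor-related arguments.
struct TensorConfig
{
    std::optional<std::vector<int32_t>> embeddingBias;
    std::optional<std::vector<int32_t>> badWords;
};

// Hypothetical grouping of LoRA-related arguments.
struct LoraConfig
{
    std::optional<uint64_t> taskId;
    std::optional<std::vector<float>> weights;
};

class RequestSketch
{
public:
    explicit RequestSketch(uint64_t requestId, TensorConfig tensors = {}, LoraConfig lora = {})
        : mRequestId{requestId}
        , mTensors{std::move(tensors)}
        , mLora{std::move(lora)}
    {
    }

private:
    uint64_t mRequestId;
    TensorConfig mTensors;
    LoraConfig mLora;
};
```

With small structs like these, callers name the group they care about instead of counting positions in a 49-argument list.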
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)
584-584: Naming inconsistency between parameter and property methods.

The parameter `mm_embedding_offloading` (line 584) maps to the `getPromptTableOffloading`/`setPromptTableOffloading` methods (lines 631-632). Consider using a consistent name to avoid confusion.

Either rename the parameter to match the methods:

```diff
- nb::arg("gather_generation_logits") = false, nb::arg("mm_embedding_offloading") = false,
+ nb::arg("gather_generation_logits") = false, nb::arg("prompt_table_offloading") = false,
```

Or use an alias in the property definition to maintain backward compatibility while clarifying the mapping.
Also applies to: 631-632
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (47)
- cpp/CMakeLists.txt (2 hunks)
- cpp/tensorrt_llm/nanobind/CMakeLists.txt (2 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (1 hunks)
- cpp/tensorrt_llm/nanobind/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/nanobind/common/bindTypes.h (1 hunks)
- cpp/tensorrt_llm/nanobind/common/customCasters.h (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/bindings.h (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/executor.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/executor.h (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/executorConfig.h (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/request.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/request.h (1 hunks)
- cpp/tensorrt_llm/nanobind/runtime/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/runtime/bindings.h (1 hunks)
- cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/runtime/moeBindings.h (1 hunks)
- cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h (1 hunks)
- cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/userbuffers/bindings.h (1 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/executor/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1 hunks)
- examples/models/core/llama/summarize_long.py (1 hunks)
- examples/models/core/qwen2audio/run.py (1 hunks)
- examples/models/core/qwenvl/run.py (1 hunks)
- jenkins/Build.groovy (4 hunks)
- jenkins/L0_Test.groovy (3 hunks)
- scripts/build_wheel.py (2 hunks)
- tensorrt_llm/builder.py (1 hunks)
- tensorrt_llm/commands/build.py (2 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/runtime/model_runner.py (1 hunks)
- tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
- tests/unittest/bindings/test_bindings_ut.py (0 hunks)
- tests/unittest/bindings/test_executor_bindings.py (7 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/bindings/test_bindings_ut.py
🧰 Additional context used
🧠 Learnings (5)
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/integration/test_lists/test-db/l0_a10.yml (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm_args.py (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🧬 Code Graph Analysis (7)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h (3)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.h (1): `tensorrt_llm` (23-28)
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h (1): `tensorrt_llm` (22-29)
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (4): `initBindings` (293-460), `initBindings` (293-293), `initBindings` (462-490), `initBindings` (462-462)
cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h (11)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.h (1): `tensorrt_llm` (23-28)
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h (1): `tensorrt_llm` (22-29)
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h (2): `tensorrt_llm` (23-30), `tensorrt_llm` (32-39)
- cpp/tensorrt_llm/nanobind/executor/bindings.h (1): `tensorrt_llm` (23-29)
- cpp/tensorrt_llm/nanobind/runtime/bindings.h (1): `tensorrt_llm` (24-30)
- cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h (1): `tensorrt_llm` (24-29)
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (4): `initBindings` (293-460), `initBindings` (293-293), `initBindings` (462-490), `initBindings` (462-462)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (2): `initBindings` (55-523), `initBindings` (55-55)
- cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp (2): `initBindings` (55-178), `initBindings` (55-55)
- cpp/tensorrt_llm/nanobind/executor/bindings.cpp (2): `initBindings` (48-261), `initBindings` (48-48)
- cpp/tensorrt_llm/nanobind/runtime/bindings.cpp (2): `initBindings` (102-362), `initBindings` (102-102)
cpp/tensorrt_llm/pybind/bindings.cpp (1)
- cpp/include/tensorrt_llm/runtime/modelConfig.h (1): `KVCacheTypeFromString` (76-94)
cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp (2)
- cpp/include/tensorrt_llm/runtime/modelConfig.h (3): `useGptAttentionPlugin` (293-296), `setKVCacheType` (586-589), `useLoraPlugin` (540-543)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1): `setDraftTokens` (1093-1096)
cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (1)
- cpp/tensorrt_llm/nanobind/executor/executor.cpp (2): `shutdown` (179-188), `shutdown` (179-179)
cpp/tensorrt_llm/nanobind/runtime/bindings.cpp (4)
- cpp/include/tensorrt_llm/runtime/request.h (1): `generatedTokensPerEngineStep` (50-50)
- cpp/tensorrt_llm/batch_manager/runtimeBuffers.cpp (2): `create` (74-212), `create` (74-78)
- cpp/include/tensorrt_llm/runtime/cudaStream.h (1): `synchronize` (81-96)
- cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (2): `initMoeBindings` (49-122), `initMoeBindings` (49-49)
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2)
- cpp/include/tensorrt_llm/runtime/decodingInput.h (1): `sinkTokenLength` (47-47)
- tensorrt_llm/llmapi/llm_args.py (3): `SchedulerConfig` (606-624), `PeftCacheConfig` (628-690), `ExtendedRuntimePerfKnobConfig` (836-862)
🪛 Ruff (0.12.2)
tests/unittest/bindings/test_executor_bindings.py
97-97: Local variable executor is assigned to but never used
Remove assignment to unused variable executor
(F841)
1203-1203: Line too long (260 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (52)
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)
339-339: LGTM! Correct type casting for pickle deserialization.

The fix ensures `cuda_graph_cache_size` is properly cast to `SizeType32` during deserialization, which aligns with the getter method and property definition.

examples/models/core/qwen2audio/run.py (1)
125-126: LGTM! Good refactoring for enum parsing consistency.

The change from direct `KVCacheType` construction to using `KVCacheType.from_string()` improves API consistency and provides better error handling while maintaining the same fallback behavior.

cpp/CMakeLists.txt (2)
202-202: LGTM! Proper support for dual binding architecture.

The condition extension to include `BUILD_DEEP_EP` ensures pybind11 is available for deep execution path features regardless of the primary binding type, supporting the nanobind integration strategy.
221-221: LGTM! Maintains consistency with subdirectory inclusion logic.

This change ensures pybind11 headers are available when `BUILD_DEEP_EP` is enabled, maintaining consistency with the subdirectory addition on line 202.

cpp/tensorrt_llm/nanobind/executor/bindings.h (1)
1-30: LGTM! Well-designed header following best practices.

The header demonstrates good design with proper licensing, clean namespace organization, minimal includes, and a focused API declaration. Fits well with the modular nanobind binding architecture.
cpp/tensorrt_llm/nanobind/runtime/moeBindings.h (1)
1-30: LGTM! Consistent design with good modular separation.

The header follows the same excellent patterns as other nanobind headers with proper licensing, clean namespace organization, and focused API. The separation of MOE bindings maintains good modularity in the runtime bindings architecture.
examples/models/core/qwenvl/run.py (1)
121-122: LGTM! Proper enum parsing standardization.

The change to use `KVCacheType.from_string()` instead of the direct constructor aligns with the codebase-wide standardization effort for enum parsing and provides better validation.

scripts/build_wheel.py (2)
301-301: LGTM! Strategic shift to nanobind as default.

The change to default to nanobind instead of pybind aligns with the PR's nanobind integration focus and represents a strategic shift toward nanobind as the preferred Python binding mechanism.

864-864: LGTM! Consistent default value alignment.

The argument parser default correctly matches the function parameter default, ensuring consistency in the command-line interface.
tensorrt_llm/commands/build.py (2)
41-55: Excellent enum parsing helper implementation.

The `enum_type` helper function is well-designed with:
- Support for both enum instances and string values
- Proper use of the `from_string` method for validation
- Clear error messages listing valid values
- Reusable pattern for other enum types
This improves the robustness of command-line argument parsing.
151-151: LGTM! Proper integration of enum helper.

The use of `enum_type(KVCacheType)` provides better validation and error handling for the `--kv_cache_type` argument compared to direct enum usage.

tests/integration/test_lists/test-db/l0_a10.yml (1)
193-207: LGTM! Appropriate nanobind test configuration.

The new `l0_a10_nanobind` test entry is well-configured:
- Targets appropriate hardware (A10 GPUs with single GPU setup)
- Uses correct platform constraints (Ubuntu Linux)
- Focuses on pre_merge testing for rapid CI feedback
- Tests the relevant `unittest/bindings` suite for nanobind validation

This supports the PR's nanobind integration testing objectives effectively.
examples/models/core/llama/summarize_long.py (1)
100-100: LGTM! Consistent enum parsing standardization.

The change to use `KVCacheType.from_string()` instead of the direct constructor follows the same standardization pattern applied across the codebase and provides better validation for string-to-enum conversion.

tensorrt_llm/builder.py (1)
596-596: LGTM: Standardized enum parsing approach

The change from direct `KVCacheType()` construction to `KVCacheType.from_string()` improves enum handling by using an explicit parsing method. This aligns with the broader codebase effort to standardize enum string parsing and supports the nanobind integration.

cpp/tensorrt_llm/nanobind/executor/executorConfig.h (1)
1-31: LGTM: Well-structured header file

The header file follows proper C++ conventions with:
- Correct copyright header
- Proper include guard using `#pragma once`
- Clean namespace organization
- Clear function declaration for nanobind binding initialization
cpp/tensorrt_llm/nanobind/runtime/bindings.h (1)
1-31: LGTM: Properly structured header with dual initialization pattern

The header file follows excellent C++ conventions, and the dual function approach (`initBindings` and `initBindingsEarly`) suggests a thoughtful two-phase initialization pattern for complex runtime bindings.

cpp/tensorrt_llm/nanobind/batch_manager/bindings.h (1)
1-29: LGTM: Consistent and well-structured binding header

The header file maintains consistency with other nanobind binding headers in the codebase, following proper C++ conventions with correct copyright, include guards, and clear namespace organization.
tensorrt_llm/llmapi/llm_args.py (2)
33-33: LGTM! Import addition follows established patterns.

The import of `CacheTransceiverBackendType` is properly aliased and follows the naming convention used by other executor binding imports in this file.

895-895: LGTM! Enum conversion follows established patterns.

The conversion from string to enum using the `from_string()` method is consistent with the pattern used throughout the codebase for type-safe enum handling in the pybind layer. This change improves robustness by ensuring proper enum conversion rather than passing raw strings.

tensorrt_llm/runtime/model_runner.py (1)
89-89: LGTM! Standardizes KVCacheType parsing with the `from_string` method.

This change aligns with the codebase-wide standardization effort to use `KVCacheType.from_string()` for string-to-enum conversion, ensuring consistent and robust parsing throughout the TensorRT LLM codebase.

cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h (1)
1-40: LGTM! Well-structured nanobind binding header.

The header follows standard patterns for nanobind bindings with:
- Proper copyright and license headers
- Correct include guard with `#pragma once`
- Appropriate namespace organization
- Consistent class structure with static `initBindings` methods
- Matches the pattern established in other nanobind binding headers
cpp/tensorrt_llm/pybind/executor/bindings.cpp (1)
247-257: LGTM! Improves API clarity and type safety.

The lambda wrapper enhances the binding in the following ways (a sketch of the timeout conversion follows this list):
- Making the timeout unit explicit (`timeout_ms` vs `timeout`)
- Providing type safety with proper conversion from `double` to `int64_t` for `std::chrono::milliseconds`
- Correctly handling both timeout provided and nullopt cases
- Aligning with the nanobind binding patterns used elsewhere in the codebase
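For illustration, here is a standalone sketch of the conversion such a wrapper performs; the helper name and exact signature are assumptions, not the PR's code.

```cpp
#include <chrono>
#include <cstdint>
#include <optional>

// Convert an optional timeout given in (possibly fractional) milliseconds into the
// integral std::chrono::milliseconds the C++ API expects; fractions are truncated.
std::optional<std::chrono::milliseconds> toMilliseconds(std::optional<double> timeout_ms)
{
    if (!timeout_ms)
    {
        return std::nullopt;
    }
    return std::chrono::milliseconds(static_cast<int64_t>(*timeout_ms));
}
```

Using `std::chrono::duration<double, std::milli>` instead of the integral type would preserve the fractional part, which is the trade-off the earlier nitpick on the nanobind side mentions.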
cpp/tensorrt_llm/nanobind/CMakeLists.txt (3)
6-21: LGTM! Comprehensive expansion of nanobind source files.

The expanded source list properly includes all the new nanobind binding files across:
- Batch manager components (algorithms, cache management, etc.)
- Executor bindings
- Runtime bindings including MOE support
- Testing and user buffer bindings
This aligns with the PR's goal of providing comprehensive Python bindings via nanobind.
32-35: LGTM! Appropriate conditional NVSHMEM linking.

The conditional linking of NVSHMEM libraries when `ENABLE_NVSHMEM` is set is the correct approach for handling this optional dependency.
37-55: LGTM! Updated linking configuration supports expanded functionality.

The updates appropriately support the expanded nanobind bindings:
- Addition of `${CUDA_NVML_LIB}` for GPU memory management
- CUDA toolkit stubs rpath for proper runtime linking on Linux
- Standardized error message compilation flag
cpp/tensorrt_llm/pybind/bindings.cpp (2)
173-173: LGTM! Standardizes KVCacheType binding with the `from_string` method.

This change aligns with the codebase-wide effort to standardize `KVCacheType` string-to-enum conversion using the dedicated `from_string` static method, ensuring consistency across both pybind and nanobind bindings.

358-361: LGTM! Improves error handling robustness.

Replacing the debug-only assertion with a runtime exception improves the robustness of the `SamplingConfig` deserialization by ensuring proper error handling in both debug and release builds.

cpp/tensorrt_llm/nanobind/common/bindTypes.h (3)
1-22: LGTM - proper header structure.

The copyright header and includes are well-structured and include only necessary dependencies.

24-31: Good template design.

The namespace organization and function signature follow nanobind conventions appropriately.

32-43: Excellent container binding implementation.

The binding correctly exposes standard set operations and Python special methods. The use of `overload_cast` for `erase` properly disambiguates method overloads, and the `keep_alive<0, 1>()` annotation on the iterator ensures memory safety.

jenkins/L0_Test.groovy (4)
67-69: Good configuration addition.

The `NANOBIND_CONFIG` field follows the established pattern for build configuration constants.

77-77: Consistent configuration entry.

The nanobind entry in `BUILD_CONFIGS` follows the established pattern with an appropriate tarball filename.

1749-1749: Well-structured test configuration.

The "A10-Nanobind" test stage configuration follows the established format and uses the appropriate test list `l0_a10_nanobind`.

1826-1828: Consistent configuration logic.

The nanobind detection logic follows the established pattern for other build configurations and is properly integrated into the existing conditional structure.
cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp (4)
1-26: Well-structured header and includes.

The copyright header, includes, and namespace aliases are appropriate and follow good practices.

32-36: Safe and well-designed class binding.

The UBBuffer binding correctly exposes read-only properties and safely converts the address pointer to `intptr_t` for Python interaction. This prevents unsafe pointer manipulation from Python while still providing access to the address value.

38-44: Consistent and safe function bindings.

The function bindings properly handle address conversions between `intptr_t` and `void*`, maintaining consistency with the UBBuffer class binding. The parameter and return type handling is appropriate.

45-47: Complete and proper binding implementation.

The final function binding and namespace closure are correctly implemented.
jenkins/Build.groovy (3)
50-54: LGTM: Configuration constants are properly defined.

The new nanobind configuration constants follow the established naming convention and are correctly annotated with `@Field` for pipeline shared variables.

65-69: LGTM: Nanobind build configurations are well-structured.

The new nanobind configurations correctly:

- Add the `--binding_type nanobind` parameter for nanobind-specific builds
- Use distinct tarball names with a "nanobind-" prefix to avoid conflicts
- Inherit appropriate cmake variables and wheel architectures from their vanilla counterparts
- Maintain consistency between x86_64 and aarch64 variants
- Inherit appropriate cmake variables and wheel architectures from their vanilla counterparts
- Maintain consistency between x86_64 and aarch64 variants
Also applies to: 85-89
542-543: LGTM: Nanobind build stage is properly integrated.The new "Build TRT-LLM Nanobind" stage correctly:
- Uses conditional logic to select appropriate architecture-specific nanobind configuration
- Follows the established pattern of existing parallel build stages
- Integrates seamlessly with the existing
prepareLLMBuildfunctiontests/unittest/bindings/test_executor_bindings.py (2)
1276-1276: API update looks correctThe change from
datetime.timedelta(seconds=1)to1000milliseconds aligns with the updated nanobind binding API.
1518-1520: Good use of enum typeUsing
trtllm.CacheTransceiverBackendType.UCXinstead of the string"UCX"improves type safety and API consistency.cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp (1)
55-178: Well-structured nanobind implementationThe nanobind bindings for batch manager algorithms are well-implemented with:
- Consistent binding patterns across all algorithm classes
- Proper handling of optional parameters with std::nullopt defaults
- Appropriate tensor conversions between PyTorch and TensorRT-LLM formats
- Clean structured return types for Python consumption
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (1)
40-131: Comprehensive and well-designed conversion implementationThe LlmRequest conversion implementation is excellent:
- Clean helper functions for optional tensor conversion
- Proper CUDA stream unwrapping in callback adapter
- Deep copies of token vectors with appropriate shared_ptr usage
- All 49 parameters properly handled with clear comments
- Consistent use of from_torch() for tensor conversions
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (1)
35-38: Good documentation of the nanobind limitation workaround.The comment clearly explains why
c10::Streamis used instead ofc10::cuda::CUDAStreamand provides a reference to the test example.cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp (1)
46-85: Well-structured ModelSpec bindings.The bindings correctly use
reference_internalpolicy for method chaining and provide a proper copy implementation.cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1)
77-104: Well-implemented cache transceiver bindings.

The trampoline class correctly implements all pure virtual methods, and the bindings properly expose the C++ API to Python with appropriate naming conventions.
cpp/tensorrt_llm/nanobind/executor/executor.h (1)
30-127: Well-designed executor wrapper with proper GIL handling.

The Executor class provides a comprehensive Python binding interface with:
- Multiple constructor overloads for different initialization scenarios
- Proper GIL release in blocking operations (`awaitResponses` methods)
- Clean RAII with `unique_ptr` for the underlying executor
- Appropriate context manager protocol support
cpp/tensorrt_llm/nanobind/bindings.cpp (1)
79-79: Consider the implications of disabling leak warnings.

Disabling nanobind leak warnings might hide memory management issues. Consider enabling them during development/testing and only disabling in production if necessary.
Is there a specific reason for disabling leak warnings? If it's due to known false positives, consider documenting the rationale.
cpp/tensorrt_llm/nanobind/executor/executor.cpp (2)
179-188: Good practice: GIL release during shutdown

The implementation correctly releases the Python GIL before calling the executor shutdown, allowing spawned threads to make forward progress. The logging statements help with debugging.
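A minimal sketch of the GIL-release pattern being praised, assuming an executor-like object whose `shutdown()` can block while Python-side threads must keep running; the wrapper name is hypothetical.

```cpp
#include <nanobind/nanobind.h>

namespace nb = nanobind;

template <typename ExecutorT>
void shutdownReleasingGil(ExecutorT& executor)
{
    // Drop the GIL for the duration of the blocking call; it is re-acquired
    // automatically when `release` goes out of scope.
    nb::gil_scoped_release release;
    executor.shutdown();
}
```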
158-164: Fix incorrect data pointer assignment for encoder buffer

The encoder data pointer is incorrectly assigned from the decoder buffer.
Apply this fix:
```diff
 uint8_t const* encoderData = reinterpret_cast<uint8_t const*>(encoderEngineBuffer.data());
 size_t encoderSize = encoderEngineBuffer.size();
-uint8_t const* decoderData = reinterpret_cast<uint8_t const*>(decoderEngineBuffer.data());
+uint8_t const* decoderData = reinterpret_cast<uint8_t const*>(decoderEngineBuffer.data());
 size_t decoderSize = decoderEngineBuffer.size();
```

Likely an incorrect or invalid review comment.
PR_Github #12383 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #12390 [ run ] triggered by Bot

PR_Github #12390 [ run ] completed with state

7ede3a9 to 58bb0a4
Actionable comments posted: 0
🔭 Outside diff range comments (1)
cpp/tensorrt_llm/nanobind/executor/request.cpp (1)
902-908: Make parameter const reference for better const-correctness.

The function doesn't modify the input bytes, so the parameter should be a const reference.
```diff
 m.def("deserialize_result",
-    [](nb::bytes& x)
+    [](nb::bytes const& x)
     {
         std::string str(x.c_str(), x.size());
         std::istringstream is(str);
         return tle::serialize_utils::deserialize<tle::Result>(is);
     });
```
♻️ Duplicate comments (4)
cpp/tensorrt_llm/nanobind/executor/executor.cpp (1)
98-112: Add validation for numpy array memory layout and null pointer check

The previous review comment about validating numpy array contiguity and adding null pointer checks is still applicable to this refactored code.
tests/unittest/bindings/test_executor_bindings.py (1)
70-101: Add assertions to validate the executor creation

The test creates an executor with managed weights but doesn't verify that it was created successfully. The `executor` variable is assigned but never used.

cpp/tensorrt_llm/nanobind/executor/request.cpp (2)
87-87: Fix undefined behavior in setstate methods.

All setstate methods use placement new without first calling the destructor, causing undefined behavior that could lead to resource leaks.
Also applies to: 187-188, 208-210, 270-271, 289-290, 309-310, 329-329, 347-348, 377-379, 393-396, 454-457, 461-463, 516-518, 552-553, 583-603, 749-755, 780-782, 803-804, 827-832, 858-872, 919-920
608-683: Constructor has too many parameters.

The Request constructor with 38 parameters is difficult to use and maintain. Consider using a builder pattern or configuration object.
🧹 Nitpick comments (3)
tests/unittest/bindings/test_executor_bindings.py (3)
1204-1211: Remove commented debug code

This commented debug code should be removed as it's not needed in the test.
```diff
-# for prop in ['is_final', 'output_token_ids', 'cum_log_probs', 'log_probs', 'context_logits', 'generation_logits', 'encoder_output', 'finish_reasons', 'sequence_index', 'is_sequence_final', 'decoding_iter', 'context_phase_params', 'request_perf_metrics']:
-#     try:
-#         value = getattr(result, prop)
-#         pickle.dumps(value)
-#         print(f"✓ {prop} pickles successfully")
-#     except Exception as e:
-#         print(f"✗ {prop} fails: {e}")
-
```
1923-1930: Consider removing debug print statements

Debug print statements should be removed or converted to proper logging if they need to be retained for diagnostic purposes.
```diff
 def logits_post_processor_batched(
         req_id_batch: tp.List[int], logits_batch: tp.List[torch.Tensor],
         ids_batch: tp.List[tp.List[tp.List[int]]], stream_ptr: int,
         client_id_batch: tp.List[tp.Optional[int]]):
     for i, (req_id, client_id, logits) in enumerate(
             zip(req_id_batch, client_id_batch, logits_batch)):
-        print(f"DEBUG: Request {i}: req_id={req_id}, client_id={client_id}")
-        print(
-            f"DEBUG: Request {i}: logits dtype={logits.dtype}, shape={logits.shape}, device={logits.device}"
-        )
         assert client_id == 123
```
1199-1200: Format long line for better readability

Line 1200 exceeds the maximum line length limit.
```diff
-    result.context_phase_params = trtllm.ContextPhaseParams([1, 2], 123, None,
-                                                            None)
+    result.context_phase_params = trtllm.ContextPhaseParams(
+        [1, 2], 123, None, None)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (2 hunks)
- cpp/tensorrt_llm/nanobind/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/nanobind/common/bindTypes.h (2 hunks)
- cpp/tensorrt_llm/nanobind/common/customCasters.h (3 hunks)
- cpp/tensorrt_llm/nanobind/executor/executor.cpp (2 hunks)
- cpp/tensorrt_llm/nanobind/executor/request.cpp (8 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (1 hunks)
- scripts/build_wheel.py (2 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tests/unittest/bindings/test_bindings_ut.py (0 hunks)
- tests/unittest/bindings/test_executor_bindings.py (5 hunks)
🧠 Learnings (2)
cpp/tensorrt_llm/nanobind/executor/executor.cpp (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/bindings/test_executor_bindings.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tests/unittest/bindings/test_executor_bindings.py
97-97: Local variable executor is assigned to but never used
Remove assignment to unused variable executor
(F841)
1203-1203: Line too long (260 > 120)
(E501)
💤 Files with no reviewable changes (1)
- tests/unittest/bindings/test_bindings_ut.py
✅ Files skipped from review due to trivial changes (3)
- scripts/build_wheel.py
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
- cpp/tensorrt_llm/pybind/bindings.cpp
🚧 Files skipped from review as they are similar to previous changes (5)
- cpp/tensorrt_llm/nanobind/bindings.cpp
- cpp/tensorrt_llm/nanobind/common/bindTypes.h
- cpp/tensorrt_llm/nanobind/common/customCasters.h
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
- tensorrt_llm/llmapi/llm_args.py
🧰 Additional context used
🧠 Learnings (2)
cpp/tensorrt_llm/nanobind/executor/executor.cpp (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/bindings/test_executor_bindings.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tests/unittest/bindings/test_executor_bindings.py
97-97: Local variable executor is assigned to but never used
Remove assignment to unused variable executor
(F841)
1203-1203: Line too long (260 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
cpp/tensorrt_llm/nanobind/executor/executor.cpp (1)
146-147: LGTM!

The explicit cast to `nb::object` correctly aligns with the refactored `numpyToTensor` function signature.

cpp/tensorrt_llm/nanobind/executor/request.cpp (2)
213-228: Constructor implementation follows correct nanobind pattern.

The use of placement new on `self` in the `__init__` lambda is the proper way to construct objects in nanobind Python bindings.
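For readers unfamiliar with the idiom, here is a self-contained sketch of nanobind's custom `__init__` pattern; the `Greeting` class and module name are made up for illustration and are not part of the PR.

```cpp
#include <nanobind/nanobind.h>
#include <nanobind/stl/string.h>

#include <new>
#include <string>
#include <utility>

namespace nb = nanobind;

struct Greeting
{
    explicit Greeting(std::string who)
        : mWho(std::move(who))
    {
    }

    std::string mWho;
};

NB_MODULE(greeting_ext, m)
{
    nb::class_<Greeting>(m, "Greeting")
        .def("__init__",
            [](Greeting* self, std::string const& who)
            {
                // nanobind passes a pointer to uninitialized storage,
                // so the object is constructed in place with placement new.
                new (self) Greeting(who);
            },
            nb::arg("who"));
}
```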
468-487: Well-structured handling of optional opaque_state parameter.

The constructor correctly differentiates between cases with and without opaque_state, properly converting Python bytes to a C++ vector when needed.
/bot run --disable-fail-fast

PR_Github #12570 [ run ] triggered by Bot

PR_Github #12570 [ run ] completed with state

/bot run
Signed-off-by: Linda-Stadter <[email protected]>
Signed-off-by: Linda-Stadter <[email protected]>
fda758a to 28ff348
/bot run --disable-fail-fast

PR_Github #12721 [ run ] triggered by Bot

/bot run --disable-fail-fast

PR_Github #12735 [ run ] triggered by Bot

PR_Github #12721 [ run ] completed with state

/bot run

PR_Github #12754 [ run ] triggered by Bot

PR_Github #12735 [ run ] completed with state

PR_Github #12754 [ run ] completed with state
Do not merge!
This draft sets nanobind as the default binding library and executes all tests in the CI pipeline with nanobind.
Summary by CodeRabbit
New Features

- Constructor arguments for OutputConfig may now be left as None for more flexible configuration.

Bug Fixes

- Enforced boolean conversion of logprobs parameters to ensure consistent boolean handling in sampling requests.

Chores
Description
Test Coverage
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill
`kill` : Kill all running builds associated with pull request.
skip
`skip --comment COMMENT` : Skip testing for latest commit on pull request.
`--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
`reuse-pipeline` : Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.