Feat/add lora cuda graph support #7336
Conversation
Signed-off-by: Shahar Mor <[email protected]>
📝 Walkthrough

Adds LoRA support across the PyTorch executor stack and CUDA Graph path: propagates per-request LoRA params, introduces LoRA prefetch via a manager, updates the dummy-request flow, and wires initialization from PyExecutor/worker. Also tweaks devcontainer config and comments out a Docker volume mount. Adds tests covering LoRA with CUDA Graph.

Changes
Sequence Diagram(s)

sequenceDiagram
autonumber
participant Client
participant Worker
participant PyExecutor
participant ModelEngine
participant ResourceMgr as ResourceManager/KVCacheMgr
participant GraphRunner as DecodingCUDAGraphRunner
participant Model
rect rgb(245,248,255)
note over PyExecutor,ResourceMgr: Initialization and LoRA prefetch
Client->>Worker: Start engine
Worker->>PyExecutor: Initialize backend
PyExecutor->>ModelEngine: set_lora_manager(resource_manager)
PyExecutor->>ModelEngine: prefetch_lora_dirs()
ModelEngine->>ResourceMgr: Access PEFT cache / LoRA manager
ResourceMgr-->>ModelEngine: LoRA adapters info
end
rect rgb(245,255,245)
note over Client,Model: Inference with CUDA Graph
Client->>Worker: Submit requests (may include LoRA)
Worker->>PyExecutor: Generate(batch with LoRA)
PyExecutor->>ModelEngine: forward(batch, resource_manager)
alt Graph not captured
ModelEngine->>ResourceMgr: add_dummy_requests(lora_request=...)
ResourceMgr-->>ModelEngine: Dummy LlmRequests
ModelEngine->>GraphRunner: Construct(..., lora_params=from requests)
ModelEngine->>GraphRunner: Capture forward(inputs+lora_params)
else Graph captured
ModelEngine->>GraphRunner: Run with updated lora_params
end
GraphRunner->>Model: forward(..., lora_params)
Model-->>GraphRunner: logits/tokens
GraphRunner-->>ModelEngine: outputs
ModelEngine-->>PyExecutor: responses
PyExecutor-->>Worker: responses
Worker-->>Client: responses
end
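To make the diagram concrete, below is a minimal, generic sketch of the capture-then-replay pattern the graph runner relies on, using plain torch.cuda.CUDAGraph. The buffer names and the lora_params stand-in are illustrative assumptions, not the actual DecodingCUDAGraphRunner API.

```python
import torch

# Hypothetical static buffers; the real runner keeps its own preallocated inputs.
static_input = torch.zeros(8, 4096, device="cuda")
static_lora_weights = torch.zeros(8, 64, device="cuda")  # stand-in for lora_params tensors


def forward(x, lora_w):
    # Placeholder for a model forward that consumes LoRA weights.
    return x @ x.t() + lora_w.sum()


# Warm up on a side stream before capture (as recommended by the PyTorch docs).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        forward(static_input, static_lora_weights)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = forward(static_input, static_lora_weights)

# Replay: copy new per-request data into the captured buffers, then replay.
static_input.copy_(torch.randn(8, 4096, device="cuda"))
static_lora_weights.copy_(torch.randn(8, 64, device="cuda"))
graph.replay()
print(static_output.shape)
```

The key point, which the review comments below return to repeatedly, is that only in-place updates to the captured buffers are visible on replay.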
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
tensorrt_llm/executor/worker.py (1)
166-170: Wire C++ PeftCacheManager into LoraManager.
- In worker.py, immediately after `self._lora_manager = self.engine.get_lora_manager()`, invoke `self._lora_manager.set_cpp_peft_cache_manager(self.engine.resource_manager.get_cpp_peft_cache_manager())`.
- In `LoraManager.__init__`, change `self._cpp_peft_cache_manager: Optional[...] = None` to `self._cpp_peft_cache_manager = cpp_peft_cache_manager` so the constructor-passed cache manager is retained.
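A minimal sketch of the intended wiring, assuming the attribute and method names quoted above; the surrounding class structure here is illustrative, not the actual files:

```python
from typing import Any, Optional


class LoraManager:
    def __init__(self, cpp_peft_cache_manager: Optional[Any] = None):
        # Retain the constructor-passed cache manager instead of resetting it to None.
        self._cpp_peft_cache_manager = cpp_peft_cache_manager

    def set_cpp_peft_cache_manager(self, cpp_peft_cache_manager: Any) -> None:
        """Inject the C++ PEFT cache manager used for CPU-cache lookups."""
        self._cpp_peft_cache_manager = cpp_peft_cache_manager


# In the worker, after obtaining the engine-owned manager (names as quoted in the comment above):
# self._lora_manager = self.engine.get_lora_manager()
# self._lora_manager.set_cpp_peft_cache_manager(
#     self.engine.resource_manager.get_cpp_peft_cache_manager())
```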
tests/unittest/llmapi/test_llm_pytorch.py (1)

883-923: Batch test: add guards, determinism, and resource management.

-def test_lora_graph_multiple_requests():
+@force_ampere
+@skip_gpu_memory_less_than_40gb
+def test_lora_graph_multiple_requests():
@@
-    llm = LLM(
-        model=f"{llm_models_root()}/llama-models/llama-7b-hf",
-        lora_config=lora_config,
-        # cuda_graph_config=None)
-        cuda_graph_config=CudaGraphConfig(max_batch_size=2))
+    with LLM(
+            model=f"{llm_models_root()}/llama-models/llama-7b-hf",
+            lora_config=lora_config,
+            cuda_graph_config=CudaGraphConfig(max_batch_size=2)) as llm:
@@
-    sampling_params = SamplingParams(max_tokens=20)
+    sampling_params = SamplingParams(max_tokens=20, temperature=0.0)
@@
-    outputs = llm.generate(prompts, sampling_params, lora_request=lora_requests)
+    outputs = llm.generate(prompts, sampling_params, lora_request=lora_requests)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

430-486: Handle None entries and tighten typing for lora_request.
- Current code assumes lora_request[i] is non-None; guard against None to avoid AttributeError.
- Use a broader Sequence[Any] type to clarify expectations and avoid TODO.

-    def add_dummy_requests(
+    def add_dummy_requests(
         self,
         request_ids: List[int],
@@
-        max_beam_width: int = 1,
-        lora_request: Optional[List] = None,  # TODO smor fill type hint
+        max_beam_width: int = 1,
+        lora_request: Optional[Sequence[Any]] = None,
     ):
@@
-            if lora_request is not None and i < len(lora_request):
-                lora_task_id = lora_request[i].task_id
-                lora_weights = lora_request[i].weights
-                lora_config = lora_request[i].config
+            if lora_request is not None and i < len(lora_request):
+                req_i = lora_request[i]
+                if req_i is not None:
+                    lora_task_id = req_i.task_id
+                    lora_weights = req_i.weights
+                    lora_config = req_i.config

Add imports at file top if not present:

from typing import Any, Sequence
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

576-885: Call LoRA setup/prefetch in warmup to guarantee availability.

Warmup relies on has_lora_prefetched but nothing here triggers set_lora_manager/prefetch. Consider invoking them if a prefetch list is present.

def get_cuda_graph_warmup_request(batch_size, draft_len):

Upfront in warmup (after kv_cache_manager/spec_resource_manager retrieval):

+        # Ensure LoRA manager and prefetch are ready before graph capture.
+        try:
+            self.set_lora_manager(resource_manager)
+            if not self.has_lora_prefetched and self.lora_prefetch_requests_list:
+                self.prefetch_lora_dirs()
+        except Exception:
+            logger.exception("LoRA initialization/prefetch failed; proceeding without LoRA in warmup.")
1685-1713: Ensure lora_params shape stability under CUDA graph
CUDA graphs require fixed tensor shapes across replays; `_get_lora_params_from_requests` currently builds a dynamic layout per batch (varying (layer_id, module_id) keys), which will break graph captures when the union of adapters changes. Capture the full superset of adapters once (during warmup) and pad missing entries with zeros at runtime.
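A minimal sketch of the fixed-layout idea: freeze the (layer_id, module_id) key set at warmup and zero-pad slots whose adapter is absent in the current batch. The dictionary layout and tensor shapes below are illustrative assumptions, not the actual `_get_lora_params_from_requests` output.

```python
import torch


def build_fixed_lora_layout(all_keys, max_rank, hidden, device="cuda"):
    # Preallocate one weight slot per (layer_id, module_id) seen at warmup.
    return {key: torch.zeros(max_rank, hidden, device=device) for key in all_keys}


def fill_lora_params(layout, batch_adapters):
    # Zero everything, then copy in place only the adapters active in this batch,
    # so tensor identities and shapes stay stable across CUDA graph replays.
    for key, buf in layout.items():
        buf.zero_()
        weights = batch_adapters.get(key)
        if weights is not None:
            buf[: weights.shape[0]].copy_(weights)
    return layout


# Hypothetical usage: layout captured once at warmup, refilled per batch (CPU here for illustration).
layout = build_fixed_lora_layout([(0, 0), (0, 1), (1, 0)], max_rank=8, hidden=16, device="cpu")
fill_lora_params(layout, {(0, 1): torch.ones(4, 16)})
```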
🧹 Nitpick comments (13)
tensorrt_llm/lora_helper.py (1)
17-17: Tighten typing and avoid importing Any when a precise type can be used.
`Any` is only needed for `lora_request`. Prefer a forward-declared protocol/type to document the contract (e.g., "LoRARequest") and use a sequence type. Apply:

-from typing import Any, Dict, List, Optional
+from typing import Dict, List, Optional, Sequence

tensorrt_llm/lora_manager.py (2)
705-711: Add a concise docstring and validate setter input.
`set_cpp_peft_cache_manager` is public; document purpose and add a lightweight type check to fail early on incorrect injection. Apply:

 def set_cpp_peft_cache_manager(
     self, cpp_peft_cache_manager: tb_internal.batch_manager.PeftCacheManager
 ):
-    self._cpp_peft_cache_manager = cpp_peft_cache_manager
+    """Inject CPP PEFT cache manager used by is_adapter_in_cpu_cache()."""
+    if not hasattr(cpp_peft_cache_manager, "is_task_cached"):
+        raise TypeError("cpp_peft_cache_manager must expose is_task_cached(adapter_uid: int) -> bool")
+    self._cpp_peft_cache_manager = cpp_peft_cache_manager
712-722: Document known race and make return explicitly boolean.

Return a strict bool and reference the known CPU-cache race so call sites treat it as a best-effort optimization. Apply:

-    return (
-        self._cpp_peft_cache_manager.is_task_cached(adapter_uid)
-        if self._cpp_peft_cache_manager
-        else False
-    )
+    # Best-effort: known race in CPU cache presence checks; callers must handle false positives/negatives.
+    return bool(self._cpp_peft_cache_manager and self._cpp_peft_cache_manager.is_task_cached(adapter_uid))

Also consider adding a short comment referencing the limitation so it isn't regressed.
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
21-22: Add NVIDIA copyright header for 2025. This Python file lacks the required header per repo guidelines.
Add at file top (outside these hunks):
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Also applies to: 58-58, 79-80, 113-115
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
220-222: Guard LoRA prefetch on manager availability. Call prefetch only when a LoRA manager is set to avoid surprises if the PEFT cache is not present.

Apply:

-        self.model_engine.set_lora_manager(self.resource_manager)
-        self.model_engine.prefetch_lora_dirs()
+        self.model_engine.set_lora_manager(self.resource_manager)
+        if getattr(self.model_engine, "lora_manager", None) is not None:
+            self.model_engine.prefetch_lora_dirs()
1-1: Add NVIDIA copyright header. All source files must prepend the NVIDIA copyright header for 2025.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
30-30: Prefer module-level import to preserve namespace. Guideline says “preserve module namespaces.” Import the module and qualify symbols.

-from tensorrt_llm.lora_helper import LoraConfig, LoraManager
+import tensorrt_llm.lora_helper as lora_helper

You’ll also need to update annotations/usages:
- lora_config: Optional[lora_helper.LoraConfig]
- self.lora_manager: Optional[lora_helper.LoraManager]
852-852: Remove debug comment. “SMOR suppose to fail here, verify” should not be in production.

-    torch.cuda.synchronize()  # SMOR suppose to fail here, verify
+    torch.cuda.synchronize()
1324-1326: New parameters lack doc coverage. Add brief doc entries for cache_indirection_buffer and lora_params to explain expectations and shape constraints during CUDA graph replay.
2220-2241: _prepare_inputs signature extended; propagate doc and ensure None-safe defaults. Add a docstring for lora_params and confirm callers always pass None or a stable-layout dict; keep cache_indirection_buffer optional.
2101-2217: DoRA flag handling may be misleading. You retain is_dora based on scaling_vec_pointer == 0. Prior flow removed DoRA usage; keep it but don’t convert to tensors (good), and ensure downstream ignores it to avoid overhead.
If any consumer uses is_dora in PyTorch flow, please point to it; otherwise we should drop it from the packed payload for clarity.
275-277: Type annotations and Python 3.8 compatibility. If you adopt module-level import for lora_helper, also update the init arg annotation to lora_helper.LoraConfig. Additionally, builtin generics like list[str] require Python 3.9+ unless using `from __future__ import annotations`. If this file must run on 3.8, consider typing.List or enabling future annotations at the top.

+from __future__ import annotations

or migrate to typing.List/typing.Dict.
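For illustration, a small self-contained example of the two options mentioned; the class and field names are placeholders, not the actual project code:

```python
from __future__ import annotations  # PEP 563: annotations stay unevaluated, so list[str] is accepted on 3.8

from typing import Dict, List, Optional


class LoraModelConfigExample:  # placeholder name for illustration
    # Option A: builtin generics are fine in annotations once __future__.annotations is active.
    lora_target_modules: list[str]

    # Option B: typing generics work on 3.8 with or without the future import.
    trtllm_modules_to_hf_modules: Optional[Dict[str, str]] = None

    def __init__(self, modules: List[str]) -> None:
        self.lora_target_modules = modules
        self.trtllm_modules_to_hf_modules = {}
```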
488-506: Add basic unit tests for LoRA+CUDA graph paths.
- Warmup with prefetch on and off
- Mixed context+generation batch
- Replay after changing adapter set (ensure fixed layout)
I can draft pytest cases targeting tests/unittest/llmapi/test_llm_pytorch.py to cover these scenarios.
Also applies to: 592-609, 1082-1122, 2220-2241, 2284-2294
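Following up on the list above, a rough pytest sketch for the warmup-with-and-without-CUDA-graph case. Import paths and constructor arguments are assumptions based on the symbols cited elsewhere in this review (LLM, SamplingParams, LoRARequest, CudaGraphConfig, LoraConfig), and the model/adapter paths are placeholders:

```python
import pytest

from tensorrt_llm.executor.request import LoRARequest
from tensorrt_llm.llmapi import LLM, SamplingParams
from tensorrt_llm.llmapi.llm_args import CudaGraphConfig
from tensorrt_llm.lora_helper import LoraConfig

LLAMA_DIR = "<path to llama-7b-hf>"    # placeholder; the real tests use llm_models_root()
LORA_DIR = "<path to a LoRA adapter>"  # placeholder


@pytest.mark.parametrize("use_cuda_graph", [True, False])
def test_lora_warmup_with_and_without_cuda_graph(use_cuda_graph):
    lora_config = LoraConfig(lora_dir=[LORA_DIR], max_lora_rank=8)
    cuda_graph_config = CudaGraphConfig(max_batch_size=2) if use_cuda_graph else None

    with LLM(model=LLAMA_DIR,
             lora_config=lora_config,
             cuda_graph_config=cuda_graph_config) as llm:
        outputs = llm.generate(
            ["What is the capital city of France?"],
            SamplingParams(max_tokens=20, temperature=0.0),
            lora_request=[LoRARequest("task-0", 0, LORA_DIR)])

    # Both configurations should produce one completion per prompt.
    assert len(outputs) == 1
```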
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (10)
- .devcontainer/devcontainer.json (1 hunks)
- .devcontainer/docker-compose.yml (1 hunks)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (4 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (14 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (3 hunks)
- tensorrt_llm/executor/worker.py (1 hunks)
- tensorrt_llm/lora_helper.py (2 hunks)
- tensorrt_llm/lora_manager.py (2 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports: import the subpackage/module, not the symbol (from package.subpackage import foo; foo.SomeClass())
Naming: files snake_case; classes PascalCase; functions/methods snake_case; local variables snake_case (k_ prefix if starting with a number); globals G_ + UPPER_SNAKE_CASE; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; reserve comments for function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; inline docstrings for attributes/variables are allowed
Avoid reflection when straightforward code suffices (e.g., prefer explicit parameters over dict(**locals()))
Use narrow except clauses (e.g., catch FileNotFoundError instead of bare except)
For duck-typing try/except, keep try body minimal and use else for the main logic
Files:
tensorrt_llm/executor/worker.py, tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/lora_manager.py, tensorrt_llm/lora_helper.py, tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py, tests/unittest/llmapi/test_llm_pytorch.py, tensorrt_llm/_torch/pyexecutor/resource_manager.py, tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header with current year to all source files
Files:
tensorrt_llm/executor/worker.py, tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/lora_manager.py, tensorrt_llm/lora_helper.py, tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py, tests/unittest/llmapi/test_llm_pytorch.py, tensorrt_llm/_torch/pyexecutor/resource_manager.py, tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (9)
📓 Common learnings
Learnt from: shaharmor98
PR: NVIDIA/TensorRT-LLM#7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
PR: NVIDIA/TensorRT-LLM#7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
Applied to files:
tensorrt_llm/executor/worker.py, tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/executor/worker.py, tensorrt_llm/lora_manager.py, tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/executor/worker.py, tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/llmapi/test_llm_pytorch.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.
Applied to files:
tests/unittest/llmapi/test_llm_pytorch.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
tests/unittest/llmapi/test_llm_pytorch.py
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Applied to files:
tests/unittest/llmapi/test_llm_pytorch.py
📚 Learning: 2025-08-19T12:45:35.429Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:2086-2092
Timestamp: 2025-08-19T12:45:35.429Z
Learning: DoRA (Delta Orthogonal Rank Adaptation) functionality has been removed from the PyTorch flow in tensorrt_llm/_torch/pyexecutor/model_engine.py. The is_dora field is computed but not used downstream in the PyTorch flow, so converting it to a tensor would be wasteful overhead.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
🧬 Code graph analysis (6)
tensorrt_llm/executor/worker.py (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
get_lora_manager(301-302)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
get_lora_manager(1091-1093)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
set_lora_manager(488-492), prefetch_lora_dirs(494-505)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
get_lora_manager(1091-1093)
tensorrt_llm/lora_manager.py (2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
PeftCacheManager(1018-1126)
cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (1)
PeftCacheManager(231-255)
tensorrt_llm/lora_helper.py (1)
tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
lora_request(635-656)
tests/unittest/llmapi/test_llm_pytorch.py (4)
tensorrt_llm/llmapi/llm_args.py (1)
CudaGraphConfig(106-163)
tensorrt_llm/executor/request.py (1)
LoRARequest(24-53)
tensorrt_llm/lora_helper.py (1)
LoraConfig(83-102)
tests/unittest/utils/util.py (1)
similar(371-373)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)
tensorrt_llm/_torch/models/modeling_phi4mm.py (2)
lora_request(635-656), lora_config(612-632)
tensorrt_llm/lora_manager.py (1)
lora_weights(1157-1158)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
LlmRequest(271-411)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
get_lora_manager(301-302)
🪛 Biome (2.1.2)
.devcontainer/devcontainer.json
[error] 16-16: unexpected character #
(parse)
[error] 16-16: String values must be double quoted.
(parse)
[error] 16-16: Minus must be followed by a digit
(parse)
[error] 16-16: String values must be double quoted.
(parse)
[error] 16-16: String values must be double quoted.
(parse)
🔇 Additional comments (8)
.devcontainer/docker-compose.yml (1)
26-26: Confirm impact of removing the HF cache volume. Commenting out the HuggingFace cache mount can increase cold-start latency and container disk usage (HF_HOME still points to /huggingface). If this was intentional for CI isolation, consider documenting it; otherwise, restore the mount.
tensorrt_llm/lora_manager.py (1)
11-11: Use Python 3.8-compatible typing in tensorrt_llm/lora_manager.py.
- LoraModelConfig: change

  lora_target_modules: list[str]
  trtllm_modules_to_hf_modules: dict[str, str]

  to

  from typing import List, Dict
  lora_target_modules: List[str]
  trtllm_modules_to_hf_modules: Dict[str, str]

- Constructor signature: replace

  cpp_peft_cache_manager: PeftCacheManager | None = None

  with

  from typing import Optional
  cpp_peft_cache_manager: Optional[PeftCacheManager] = None

- Ensure List, Dict, and Optional are imported from typing at the top.

⛔ Skipped due to learnings
Learnt from: shaharmor98
PR: NVIDIA/TensorRT-LLM#7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.545Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
58-58: Storing lora_params on the instance is fine, but only if they are preallocated tensors. Ensure callers pass preallocated device tensors; otherwise, changing LoRA per run won’t affect the captured graph.
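A minimal sketch of the preallocated-buffer pattern being asked for here; the names are illustrative, as the real runner stores its own input and lora_params buffers:

```python
import torch


class GraphLoraBuffers:
    """Holds device tensors captured by the graph; updated in place before each replay."""

    def __init__(self, num_slots: int, rank: int, hidden: int):
        # Allocated once, before capture; the CUDA graph records these exact storages.
        self.lora_weights = torch.zeros(num_slots, rank, hidden, device="cuda")

    def update(self, new_weights: torch.Tensor) -> None:
        # copy_ keeps the same storage, so the captured graph sees the new values on replay.
        self.lora_weights.copy_(new_weights)


# Rebinding (self.lora_weights = new_tensor) instead of copy_ would leave the graph
# replaying against the old, now-orphaned buffer.
```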
tensorrt_llm/executor/worker.py (1)
166-170: Good switch to Engine-owned LoRA manager. Using self.engine.get_lora_manager() under the PyTorch backend keeps ownership localized to the engine and removes CPP coupling. Looks correct given the new PyExecutor API.
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
301-303: Accessor is fine. The get_lora_manager passthrough is straightforward and matches the worker’s usage.
tests/unittest/llmapi/test_llm_pytorch.py (1)
8-8: Import change LGTM. Importing CudaGraphConfig (and PeftCacheConfig from the same module) aligns with where these types live.
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
1120-1122: DecodingCUDAGraphRunner supports lora_params in init and run/capture
The constructor signature includes lora_params: Optional[dict]; it is assigned to self.lora_params, included in the metadata dict, and updated in the capture path. No further changes required.
619-633: No changes needed: add_dummy_requests already supports lora_request parameter.
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
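For example, a typical invocation combining several of the documented flags (values illustrative) would be posted as a PR comment:

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1" --test-backend "pytorch"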
kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.